TRANSCRIPT
PERI Tiger Teams FY07 Report
Performance Engineering Research Institute, October 30, 2007
Contact: Bronis R. de Supinski, [email protected]
Tiger Team Process and Milestones
Process from Section 4.3 of Proposal:
Select one or two applications per year
Consult with Office of Science Program Managers
Consist of three to four PERI researchers
Milestones:
Q1: Identify applications and teams for current year
Q1: Report on prior year's teams
Q3: Report progress; reassign as per DOE needs
FY07 Selection Process
Delayed to allow completion of application survey
Received guidance to focus on 3 JOULE metric codes: S3D, GTC, and Chimera
Initial discussions w/ JOULE metric coordinator K. Roche at the SciDAC PI meeting in Atlanta in January 2007
Strong interest from both S3D and GTC
Chimera expressed concerns over their staffing and needs
Narrowed to focus on S3D and GTC in early March
FY07 Tiger Team Formation
Solicited interest in participating on teams
Assignments made by PERI management based on:
Perceived code team needs
Prior engagement activities
Balance of expertise
Participants from six of nine PERI institutions
Also strong participation in both teams by Univ. of Oregon
Coordination: team-specific mailing lists and regular telecons
S3D Tiger Team
Team Lead: Bronis de Supinski (LLNL)
PERI Team Members: John Mellor-Crummey, Mike Fagan (Rice); Nick Wright, Allan Snavely (SDSC); David Bailey (LBNL); Rich Vuduc (LLNL)
Affiliate Team Members: Sameer Shende, Alan Morris, Allen Malony, Kevin Huck (Oregon); Jeff Larkin (Cray/ORNL)
Application Team Participants: Jackie Chen, David Lignell (SNL)
Facilitators: Kenny Roche, Pat Worley (ORNL)
S3D: Direct numerical simulation (DNS) of turbulent combustion
State-of-the-art code developed at CRF/Sandia
2007 INCITE award: 6M hours on XT3/4 at NCCS
Tier 1 pioneering application for the 250TF system
Why DNS? Study micro-physics of turbulent reacting flows
Full access to time-resolved fields
Physical insight into chemistry-turbulence interactions
Develop & validate reduced model descriptions used in macro-scale simulations of engineering-level systems
[Figure: DNS informs physical models, which feed engineering CFD codes (RANS, LES)]
Text and figures courtesy of S3D PI, Jacqueline H. Chen, SNL
S3D - DNS Solver
Solves compressible reacting Navier-Stokes equations
High-fidelity numerical methods:
8th-order finite difference
4th-order explicit RK integrator
Hierarchy of molecular transport models
Detailed chemistry
Multiphysics (sprays, radiation & soot) from SciDAC-TSTC (Terascale Simulation of Combustion)
S3D Parallelization
Fortran90 + MPI
3D domain decomposition: each MPI process manages part of the domain
All processes have the same number of grid points and the same computational load
Inter-processor communication only between nearest neighbors in the 3D mesh: large messages; non-blocking sends & receives (sketch below)
All-to-all communication only required for monitoring & synchronization ahead of I/O
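For illustration, a minimal sketch of this nearest-neighbor exchange pattern, assuming a 3D Cartesian communicator; the buffer names, face packing, and layout are hypothetical and not taken from S3D:

! Sketch only: 3D nearest-neighbor halo exchange with non-blocking
! sends/receives. cart_comm is assumed to be a 3D Cartesian communicator;
! each column of sendbuf/recvbuf holds one packed face.
subroutine halo_exchange(cart_comm, sendbuf, recvbuf, nwords)
  use mpi
  implicit none
  integer, intent(in)  :: cart_comm, nwords
  real(8), intent(in)  :: sendbuf(nwords, 6)
  real(8), intent(out) :: recvbuf(nwords, 6)
  integer :: dim, lo, hi, ierr, nreq
  integer :: reqs(12), stats(MPI_STATUS_SIZE, 12)

  nreq = 0
  do dim = 0, 2
     call MPI_Cart_shift(cart_comm, dim, 1, lo, hi, ierr)
     ! post receives from both neighbors in this dimension
     nreq = nreq + 1
     call MPI_Irecv(recvbuf(:, 2*dim+1), nwords, MPI_DOUBLE_PRECISION, &
                    lo, dim, cart_comm, reqs(nreq), ierr)
     nreq = nreq + 1
     call MPI_Irecv(recvbuf(:, 2*dim+2), nwords, MPI_DOUBLE_PRECISION, &
                    hi, dim, cart_comm, reqs(nreq), ierr)
  end do
  do dim = 0, 2
     call MPI_Cart_shift(cart_comm, dim, 1, lo, hi, ierr)
     ! non-blocking sends of the two packed faces in this dimension
     nreq = nreq + 1
     call MPI_Isend(sendbuf(:, 2*dim+1), nwords, MPI_DOUBLE_PRECISION, &
                    hi, dim, cart_comm, reqs(nreq), ierr)
     nreq = nreq + 1
     call MPI_Isend(sendbuf(:, 2*dim+2), nwords, MPI_DOUBLE_PRECISION, &
                    lo, dim, cart_comm, reqs(nreq), ierr)
  end do
  call MPI_Waitall(nreq, reqs, stats, ierr)
end subroutine halo_exchange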
[Figure: S3D logical topology; per-process communication scales as kN² (subdomain surface) while computation scales as kN³ (volume), so the communication-to-computation ratio falls as 1/N]
Text courtesy of S3D PI, Jacqueline H. Chen, SNL
A Performance Mystery in S3D on PWR4 (SDSC)
The following line of code (and many similar others) has an ~70% L1 hit rate:
diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
Total L2 data cache accesses: 9784.594 M
% accesses from L2 per cycle: 5.112 %
L2 traffic: 1194408.401 MBytes
L2 bandwidth per processor: 9183.869 MBytes/sec
Total load and store operations: 33073.374 M
Number of loads per load miss: 30.527
Number of stores per store miss: 1.014
Number of load/stores per D1 miss: 3.380
L1 cache hit rate: 70.415 %
Performance model provides expectation of 90%...
Discrepancy Understood, Performance Optimized
diffFlux is defined as a pointer: “diffFlux => grad_Ys”
Compiler unrolls the loop suboptimally: it loops over the 2nd index instead of the 1st, i.e., it accesses memory in "nx-size" strides
Alias analysis is not sufficient to allow the "obvious" optimization
Simple fix on IBM systems: use the "-qalias=noaryovrlp" compiler flag
Runtime on 8 PWR4+ 1.5 GHz CPUs, 200 timesteps: 2949 s (before), 2728 s (after)
7.5% improvement, and L1 hit rates are what they should be
Same loops show the expected ~93% L1 hit rate on XT3/4
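For illustration, a sketch of the aliasing issue with simplified shapes and hypothetical sizes (not the actual S3D declarations):

! Sketch of the aliasing issue only; shapes and names simplified.
module flux_alias_demo
  implicit none
  real(8), allocatable, target :: grad_Ys(:,:,:,:,:)
  real(8), pointer             :: diffFlux(:,:,:,:,:)
contains
  subroutine setup(nx, ny, nz, nspec)
    integer, intent(in) :: nx, ny, nz, nspec
    allocate(grad_Ys(nx, ny, nz, nspec, 3))
    ! diffFlux is not separate storage; it aliases grad_Ys, so the
    ! compiler must assume the left- and right-hand sides of the flux
    ! assignment may overlap and generates conservative, strided code.
    ! -qalias=noaryovrlp asserts array assignments have no such overlap.
    diffFlux => grad_Ys
  end subroutine setup
end module flux_alias_demo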
Power4+ profiles showed substantial time in exp in the getrates routine; code examination revealed the calls were not vectorizable
A Perl script transformed getrates from 0% to 50% vectorized, yielding a substantial Power4+ performance improvement:
30% for the getrates routine; approximately 10% overall
Smaller performance improvement on the XT4: approximately 10% for the getrates routine, approximately 1.5% overall; subject of a continuing tuning effort (D. Bailey)
Vectorizing exp for S3D (SDSC)
CPU cycles per call    PWR4+ / 1.5 GHz    Opteron / 2.6 GHz
exp                    160                49
Vectorized exp         8                  31
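For illustration, a sketch of the kind of rewrite the script performs; the variable names are hypothetical, not the actual getrates source:

! Sketch: replace per-iteration scalar exp calls with one array-valued
! exp, which compilers can map onto a vectorized math library.
subroutine rate_constants(nreact, a, e, t, ru, rf)
  implicit none
  integer, intent(in)  :: nreact
  real(8), intent(in)  :: a(nreact), e(nreact), t, ru
  real(8), intent(out) :: rf(nreact)
  real(8) :: arg(nreact)

  ! Before (not vectorized as written):
  !   do i = 1, nreact
  !      rf(i) = a(i) * exp(-e(i) / (ru * t))
  !   end do

  ! After: build the argument array once, then apply exp to the array.
  arg = -e / (ru * t)
  rf  = a * exp(arg)
end subroutine rate_constants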
S3D Performance at the Loop Level (Rice)
Wasted opportunity = ((maximum FLOP rate * cycles) - actual FLOPs) / total waste
The highlighted loop accounts for 11.4% of total program waste
Overall performance (15% of peak): 2.05 x 10^11 FLOPs / 6.73 x 10^11 cycles = 0.305 FLOPs/cycle
S3D: What Opportunities Exist?
[Figure: a 5D loop nest (2D explicit loops over 3D F90 vector syntax) that initializes, updates, and reuses data; the performance problem is that data streams in and out of memory between uses, limiting reuse]
Apply LoopTool to S3D Diffusive Flux Loop
!dir$ uj 3
do m = 1, 3                              ! DIRECTION
!dir$ uj 2
   do n = 1, n_spec-1                    ! SPECIES
!dir$ unswitch 2
      if (baro_switch) then
         ! driving force includes gradient in mole fraction and baro-diffusion:
!dir$ fuse 1 1 1
         diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)        &
                               + Ys(:,:,:,n) * ( grad_mixMW(:,:,:,m)              &
                               + (1 - molwt(n)*avmolwt) * grad_P(:,:,:,m)/Press ) )
      else
         ! driving force is just the gradient in mole fraction:
!dir$ fuse 1 1 1
         diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)        &
                               + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
      endif

      ! Add thermal diffusion:
!dir$ unswitch 2
      if (thermDiff_switch) then
!dir$ fuse 1 1 1
         diffFlux(:,:,:,n,m) = diffFlux(:,:,:,n,m) - Ds_mixavg(:,:,:,n)           &
                               * Rs_therm_diff(:,:,:,n) * molwt(n) * avmolwt      &
                               * grad_T(:,:,:,m) / Temp
      endif

      ! compute contribution to nth species diffusive flux;
      ! this will ensure that the sum of the diffusive fluxes is zero.
!dir$ fuse 1 1 1
      diffFlux(:,:,:,n_spec,m) = diffFlux(:,:,:,n_spec,m) - diffFlux(:,:,:,n,m)

   enddo    ! SPECIES
enddo       ! DIRECTION
[Figure: original loop nest (m = 1,3 over n = 1,n_spec-1, with branches on baro_switch and thermDiff_switch) annotated with unswitching, controlled-fusion, and unroll-and-jam directives and fed to LoopTool]
Optimization of S3D Diffusive Flux Loop
Transformation Log:
Scalarization (4 stmts)
Loop unswitching (2 conditions)
Fusion (loops within 4 outer nests)
Unroll-and-jam (2 loops)
Peeling excess iterations (4 nests)
2.94x faster than original (6.7% total savings)
[Figure: transformed code structure; the baro_switch/thermDiff_switch tests are unswitched into four branches, each containing fused loops unrolled by two (n = 1, nspec-2, 2); the source grows from 35 lines to 445 lines]
S3D: An Unexpected Bottleneck
An implicit loop that copies a non-contiguous 4D slice of 5D data to contiguous storage accounted for 5.4% of total time
Adjusting routine interfaces to avoid the copy made this code 100% faster
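For illustration, a sketch of the kind of interface change involved; the array names and shapes are hypothetical, not the S3D source:

! Sketch only.  Passing a non-contiguous section such as q(:,:,:,n,:) to
! a routine with an explicit-shape dummy argument forces the compiler to
! copy the slice to contiguous scratch storage (and back):
!   call old_interface( q(:,:,:,n,:) )     ! implicit copy in and out
! Passing the whole array plus the index avoids the copy:
subroutine new_interface(q, nx, ny, nz, nspec, n)
  implicit none
  integer, intent(in)    :: nx, ny, nz, nspec, n
  real(8), intent(inout) :: q(nx, ny, nz, nspec, 3)
  integer :: m
  do m = 1, 3
     ! operate directly on q(:,:,:,n,m) in place; no temporary copy
  end do
end subroutine new_interface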
S3D Node Performance Tuning Summary
More opportunities remain:
Register reuse and tiling of stencil computations
Inlining + fusion + array contraction of temporary variables
Further improvements require more changes; lots of potential smaller improvements
Enabling technologies contributions:
HPCToolkit enabled identifying and assessing bottlenecks
LoopTool helped automate tedious code transformations
Achieved ~12.7% overall improvement:
Node performance increased from 15% of peak to 17.4%
Estimated savings for a 2M CPU-hour run: 254K CPU hours
S3D Scaling Performance (App Team)
[Figure: S3D performance on the XT at NCCS; cost per grid point per time step (in microseconds) vs. number of cores (1 to 100,000); series: XT3, XT4, XT7]
S3D Scaling Study (Oregon)
Harness test case
Platform: Jaguar, the combined Cray XT3/XT4 at ORNL
Several runs to identify scaling trends; focus on 6400p
Evaluate impact of combined XT3/XT4 nodes
Performance evaluation of MPI_Wait
Study mapping of MPI ranks to nodes
[Figure: total runtime breakdown by event (time); MPI_Wait and WRITE_SAVEFILE* dominate at scale. *Recent analysis indicates WRITE_SAVEFILE is not a scaling issue]
TAU ParaProf profile: MPI_Wait times exhibit two equivalence classes
The same equivalence classes are also seen in memory-bandwidth-intensive computation routines
S3D Scaling Study Conclusion
Determined that XT3 nodes slowed certain S3D routines:
Consistent across all XT3 nodes
Memory-bandwidth-limited routines
Suggested load-balancing optimization: reduce grid size in one dimension for XT3 nodes
Not yet implemented due to concerns over long-term relevance
Provided estimate of benefit for combined XT3/XT4 runs
Many scaling and single-node results appear in the S3D IOP paper
Projected Heterogeneous Scaling (LLNL)
[Figure: projected cost per grid point per time step (in microseconds) vs. proportion of XT4 nodes (0 to 1)]
S3D Modeling Results & Future Directions
PMaC predictions for S3D on XT3 and XT4:
Currently within 15% for an 8-CPU run
Extending to larger CPU counts
Working on improving accuracy
What is the expected performance of S3D on ORNL’s 250 TFLOP machine?
Will our optimizations benefit a quad-core system? It has a different cache structure:
L2: 1 MB → 512 KB
L3: none → 2 MB shared
What architecture will S3D perform best on?
GTC Tiger Team
Team Lead: Shirley Moore (UTK)
PERI Team Members: Haihang You (UTK); John Mellor-Crummey, Gabriel Marin, Guohua Jin (Rice); Hongzhang Shan (LBNL)
Affiliate Team Members: Kevin Huck (UOregon); Ed D'Azevedo (ORNL); Lenny Oliker (LBNL)
Application Team Participants: Stephane Ethier, Weixing Wang, Wei-li Lee (PPPL); Scott Klasky (ORNL)
Facilitators: Kenny Roche, Pat Worley (ORNL); Bronis de Supinski (LLNL)
GTC: Gyrokinetic Toroidal Code from PPPL
Particle in Cell (PIC) code with gyrokinetic simulation
GTC-S: "shaped" code that more realistically represents experimentally relevant geometry
GTC-P: a new "petascale" version that partitions the poloidal plane into radial shells
Fortran 90 and MPI; PETSc used for Poisson solves
Currently no OpenMP in GTC-S or GTC-P; OpenMP may be considered for multicore
Code team science goals:
Impact of turbulent transport in burning plasma fusion devices
Integrated simulations of ITER plasmas for a range of temporal and spatial scales
The Gyrokinetic Toroidal Code
3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas
Solves the gyro-averaged Vlasov equation
Gyrokinetic Poisson equation solved in real space
Low-noise δf method
Global code (full torus as opposed to only a flux tube)
Massively parallel: typical runs use 1024+ processors
Electrostatic (for now…)
Nonlinear and fully self-consistent
Written in Fortran 90/95
Originally optimized for superscalar processors
Particle-in-Cell (PIC) Method
Particles sample distribution function.
The particles interact via a grid, on which the potential is calculated from deposited charges.
The PIC steps (a trivialized sketch follows below):
• "SCATTER", or deposit, charges on the grid (nearest neighbors)
• Solve the Poisson equation
• "GATHER" forces on each particle from the potential
• Move particles (PUSH)
• Repeat…
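For illustration, a deliberately trivialized, self-contained sketch of this cycle (1D, nearest-grid-point deposition, placeholder field solve); it is not GTC's algorithm or data layout:

! Schematic PIC cycle: scatter, solve, gather, push.  Everything here is
! simplified for illustration only.
program pic_sketch
  implicit none
  integer, parameter :: ng = 64, np = 1024, nsteps = 10
  real(8), parameter :: dt = 0.1d0, dx = 1.0d0
  real(8) :: x(np), v(np), rho(0:ng-1), e(0:ng-1)
  integer :: istep, ip, ig

  call random_number(x)
  x = x * ng * dx
  v = 0.0d0
  do istep = 1, nsteps
     ! SCATTER: deposit charge on the nearest grid point
     rho = 0.0d0
     do ip = 1, np
        ig = modulo(int(x(ip)/dx), ng)
        rho(ig) = rho(ig) + 1.0d0
     end do
     ! SOLVE: placeholder only; a real code solves the Poisson equation here
     e = rho - sum(rho)/ng
     ! GATHER + PUSH: interpolate field to particles and advance them
     do ip = 1, np
        ig = modulo(int(x(ip)/dx), ng)
        v(ip) = v(ip) + dt * e(ig)
        x(ip) = modulo(x(ip) + dt * v(ip), ng*dx)
     end do
  end do
end program pic_sketch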
Charge Deposition for Charged Rings: 4-Point Average Method
[Figure: the charge deposition step (SCATTER operation) in classic PIC vs. the 4-point average gyrokinetic method (W.W. Lee) used in GTC]
Point-charge particles are replaced by charged rings due to gyro-averaging
Application Team’s Flagship Code: The Gyrokinetic Toroidal Code (GTC)
Fully global 3D particle-in-cell (PIC) code in toroidal geometry
Developed by Prof. Zhihong Lin (now at UC Irvine)
Used for nonlinear gyrokinetic simulations of plasma microturbulence
Fully self-consistent
Uses magnetic field-line-following coordinates [Boozer, 1981]
Guiding center Hamiltonian [White and Chance, 1984]
Non-spectral Poisson solver [Lin and Lee, 1995]
Low numerical noise algorithm (δf method)
Full torus (global) simulation
Scales to a very large number of processors
Excellent theoretical tool!
GTC Mesh and Geometry
Field-line-following coordinates (α = θ − ζ/q) save a factor of about 100 in CPU time
[Figure: poloidal plane (cross-section) unstructured mesh, partitioned among Processors 0-3]
New GTC Codes Use a New Parallel Model: Domain Decomposition + Particle Splitting
1D domain decomposition: several MPI processes can now share a section of the torus
Particle splitting method: the particles in a toroidal section are equally divided between several MPI processes (see the sketch below)
Particles randomly distributed between processors within a toroidal domain
Pure MPI version, but OpenMP is still there… for multicore?
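For illustration, a sketch of how such a decomposition might be set up with MPI_Comm_split; the domain count and rank-to-domain mapping are hypothetical, not GTC's actual scheme:

! Sketch only: split MPI_COMM_WORLD so that several processes share one
! toroidal domain, with particles divided among them.
program split_sketch
  use mpi
  implicit none
  integer :: ierr, world_rank, world_size
  integer :: ntoroidal, my_domain
  integer :: domain_comm, domain_rank, domain_size

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  ntoroidal = 64                           ! assumed example value
  my_domain = mod(world_rank, ntoroidal)   ! hypothetical 1D toroidal mapping

  ! All processes with the same color share one section of the torus;
  ! within that communicator, each rank owns an equal share of particles.
  call MPI_Comm_split(MPI_COMM_WORLD, my_domain, world_rank, domain_comm, ierr)
  call MPI_Comm_rank(domain_comm, domain_rank, ierr)
  call MPI_Comm_size(domain_comm, domain_size, ierr)
  ! e.g. particles_per_rank = particles_in_domain / domain_size

  call MPI_Finalize(ierr)
end program split_sketch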
New Version (GTC-S) Inputs: Experimental Equilibrium and Profiles
Original GTC has flat temperature and density to set the scale for the gyroradius and the grid, and an analytical gradient for the turbulence drive
GTC-S uses experimental profiles and plasma boundary extracted from the experimental database by using the widely-used TRANSP tool (http://w3.pppl.gov/transp/)
The magnetic equilibrium is calculated from the profiles and boundary by using ESC or JSOLVER
Spline coefficients are calculated for the equilibrium and profiles to allow interpolations at the particle positions
New Grid Follows Change in Gyro-radius with Temperature Profile
Local gyro-radius proportional to temperature
Evenly spaced radial grid in a new coordinate χ, with dχ ∝ dr / ρ_i(T(r)), so that the mesh spacing follows the local gyro-radius
[Figure: original GTC circular grid with flat temperature vs. new GTC-S grid that follows T(r)]
Poloidal Component of B Field Taken into Account for Gyro-orbit
For large-aspect ratio circular concentric cross-section, the difference between a poloidal plane and a gyro-plane is neglected.
A more accurate treatment is used here for general geometry.
Projection of gyro-plane on poloidal plane results in elliptic orbit.
4-point average method uses ellipse
GTC Performance Issues
Three basic operation types govern PIC performance:
Grid work (i.e., the Poisson solve)
Particle processing (e.g., position and velocity updates)
Interpolation between the two (i.e., charge deposition and field calculation in particle pushing)
Main GTC performance bottleneck is the charge deposition, or scatter, operation
True of most PIC codes
More complex in GTC due to fast gyrating particles: motion described by charged rings tracked by their guiding center
More GTC Performance Issues
Some scaling issues with GTC-P relative to expectations: time doubles when it should stay flat
Load imbalance in particle push routine, apparently due to variation in TLB misses
179% speedup going from single- to dual-core mode; main computational kernels are not memory-bandwidth bound
Warning: as the number of cores increases, other routines that show slowdown on dual core may start to dominate
Status of GTC Tiger Team Effort
PERI Application Survey completed
Several conference calls w/application team participants
GTC and GTC-S code versions released to Tiger Team and Performance Database WG members on request
Awaiting release of GTC-P code to investigate poor scaling and load imbalance issues
Profiling of GTC-S carried out on Jaguar using TAU; data accessible in password-protected PerfDMF database
Optimization of charge deposition by UTK
Detailed modeling, analysis, and optimization of GTC-S by Rice; brief summary follows, details in a submitted paper
TAU Profile Showing Weak Scaling of GTC-S on Jaguar
Hand optimization of Charge Deposition (UTK)
Hand-tuning techniques: common subexpression elimination, code movement, loop unrolling, cache blocking (illustrative sketch below)
Improved performance of chargei by ~10%
Changes incorporated into GTC-S code
Written up as success story for Fred Johnson
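For illustration, a generic sketch of the first two techniques (common subexpression elimination plus a 2-way unroll) on a deposition-style loop; the names are hypothetical and this is not the actual chargei source:

! Generic sketch of the hand-tuning techniques above; assumes n is even.
subroutine deposit_sketch(n, w, psi, dpsi, rho)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: w(n), psi(n), dpsi
  real(8), intent(inout) :: rho(0:*)
  integer :: i, j
  real(8) :: inv_dpsi, frac

  inv_dpsi = 1.0d0 / dpsi          ! common subexpression hoisted out of the loop
  do i = 1, n - 1, 2               ! loop unrolled by two
     j    = int(psi(i)   * inv_dpsi)
     frac = psi(i)*inv_dpsi - j
     rho(j)   = rho(j)   + w(i)*(1.0d0 - frac)
     rho(j+1) = rho(j+1) + w(i)*frac
     j    = int(psi(i+1) * inv_dpsi)
     frac = psi(i+1)*inv_dpsi - j
     rho(j)   = rho(j)   + w(i+1)*(1.0d0 - frac)
     rho(j+1) = rho(j+1) + w(i+1)*frac
  end do
end subroutine deposit_sketch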
Modeling, Analysis, and Optimization of GTC at Rice
Detailed modeling of computation and memory hierarchy performance of GTC-S using Rice modeling toolkit
Identified opportunities for data and loop transformations
Transformations improved program node performance by 33% on Itanium2 and 13% on Opteron 275
Changes sent to Stephane Ethier; awaiting response
GTC-S Memory Hierarchy Performance - I
• Total L3 miss count
• L3 cache misses due to fragmentation of data in cache lines: 14.4% of total
GTC-S suffers from poor spatial locality due to data layout
Model L3 cache miss counts for individual arrays at the loop level
particle_array is an alias to array zion used in gcmotion
Fragmentation of arrays zion (AKA particle_array) and zion0 accounts for:
– 95% of all L3 fragmentation misses
– 48% of all misses to the zion arrays
– 13.7% of total L3 cache misses
Solution: transpose particle arrays zion and zion0, transforming arrays of structures into structures of arrays (see the sketch below)
(values predicted for 64 radial grid points and 15 particles/cell)
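For illustration, a sketch of the layout change; the attribute count and the loop body are hypothetical, not GTC's actual zion usage:

! Sketch of the array-of-structures -> structure-of-arrays transpose.
! In Fortran's column-major layout the first index varies fastest, so
!   zion(nattr, mi)  keeps the nattr attributes of one particle adjacent
!   zion(mi, nattr)  keeps one attribute contiguous across all particles
subroutine push_soa(mi, zion, dt)
  implicit none
  integer, intent(in)    :: mi
  real(8), intent(inout) :: zion(mi, 7)   ! transposed layout; 7 attributes assumed
  real(8), intent(in)    :: dt
  integer :: i
  ! Sweeping all particles for a few attributes is now unit stride;
  ! with zion(7, mi) the same loop strides through memory by 7.
  do i = 1, mi
     zion(i, 1) = zion(i, 1) + dt * zion(i, 4)
  end do
end subroutine push_soa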
GTC-S Memory Hierarchy Performance - II
Understanding spatial and temporal data reuse patterns in GTC-S
Figure below: program scopes carrying > 2% of L3 cache misses
Carried misses are non-compulsory misses (capacity + conflict misses)
The carrying scope is the innermost dynamic scope in which the data is reused
Two loops in main carry 40% of all L3 carried misses; misses cannot be removed.
21.4% of misses are carried by the iterative loop of the Poisson solver. A recurrence in the solver prevents transformations.
Focus on routines chargei and pushi:
Fuse the two main loops in chargei
Apply tiling and fusion over several loop nests in pushi
(values predicted for 64 radial grid points and 15 particles/cell)
GTC-S Memory Hierarchy Performance - III
Pinpointing and reducing TLB misses
do kz = 1, mzbig
   wz = real(kz) / real(mzbig)
   zdum = zetamin + deltaz * (real(k-1) + wz)
   do i = idiag1, idiag2
      ii = igrid(i)
      do j = 1, mtdiag
         phiflux(kz + (k-1)*mzbig, j, i) = ...
      enddo
   enddo
enddo
(values predicted for 64 radial grid points and 15 particles/cell)
The outer loop kz iterates over the inner dimension of phiflux
Fix: interchange loop kz to the innermost position
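For illustration, the same fragment with kz moved innermost (the elided right-hand side is left elided); consecutive inner iterations now touch consecutive elements of phiflux's first dimension:

do i = idiag1, idiag2
   ii = igrid(i)
   do j = 1, mtdiag
      do kz = 1, mzbig
         wz   = real(kz) / real(mzbig)
         zdum = zetamin + deltaz * (real(k-1) + wz)
         phiflux(kz + (k-1)*mzbig, j, i) = ...   ! unit stride in kz
      enddo
   enddo
enddo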
Additional transformations:
Apply unroll & jam to increase ILP in routine spcpft
Transform arrays used in the Poisson solver to improve spatial locality
GTC-S Performance Improvements on Itanium2
Percentages represent incremental improvements for each transformation
Results for 10 and 100 particles/cell
Transformation       L2 misses (%)    L3 misses (%)    TLB misses (%)    Execution time (%)
+zion transpose      -27 / -37.4      -30.9 / -39.1    -10.6 / -81.6     -6.9 / -11.7
+chargei fusion      -3.9 / -6.2      -4.3 / -6.8      -1.6 / -3.1       -11.4 / -18.2
+spcpft U&J          0 / 0            +0.1 / +0.4      -0.1 / -0.4       -11.3 / -1.9
+poisson transf.     -6.6 / -1        -6.4 / -1.3      -1.4 / +2         -7.4 / -1.4
+smooth LI           -3 / -0.4        -2.4 / -0        -63.9 / -3.6      -0.7 / 0
+pushi tile/fuse     -8.9 / -13.3     -10.9 / -16      -3.4 / -9.4       +0.6 / +0.8
Total                -49.4 / -58.3    -54.8 / -63      -81.1 / -96.2     -37.3 / -32.4
(values shown for 10 / 100 particles per cell)
Itanium2 has 16KB dedicated instruction cache. Improvements in data locality negated by increase in instruction cache misses. Bigger impact expected with larger instruction cache, e.g. Montecito.
Side effect: big reduction in unnecessary data prefetches inserted by Intel compiler
GTC-S Performance Improvements on Opteron
Issues:
Hardware prefetcher crucial for performance on Opteron; prefetcher tracks up to 20 parallel data streams
Zion transpose increases the number of parallel streams in key loops, reducing the effectiveness of the hardware prefetcher
Data reuse improvements are negated by a higher number of non-prefetched memory accesses
Approach:
Reorganize five arrays in pushi as one array (see the sketch below)
Reorganize fourteen arrays in gcmotion as four arrays
Result: improves execution time on Opteron by 13%; reduces cache and TLB misses by > 50%
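For illustration, a sketch of the pushi reorganization; the names and the loop body are hypothetical:

! Sketch: five separate per-particle arrays (wp1..wp5) become one
! interleaved array, so the loop presents one sequential stream to the
! hardware prefetcher instead of five.
subroutine pushi_streams(mi, wp)
  implicit none
  integer, intent(in)    :: mi
  real(8), intent(inout) :: wp(5, mi)   ! wp(1:5, i): five values of particle i
  integer :: i
  do i = 1, mi
     ! all five values for particle i are adjacent in memory
     wp(5, i) = wp(1, i) + wp(2, i) + wp(3, i) + wp(4, i)
  end do
end subroutine pushi_streams

This interleaved layout trades per-attribute unit stride for fewer streams, the balance between prefetcher friendliness and the zion transpose described above.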
Exploring Run-time Data Reordering at Rice
Issue: performance degrades during GTC execution as particles become disordered w.r.t. the underlying tokamak grid
Preliminary study: particle reordering improves temporal locality during charge deposition and particle pushing
Currently developing on-line feedback and control mechanism for particle reordering
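For illustration, a sketch of one way to reorder particles by grid cell (a stable counting sort); the data layout is hypothetical and only a single attribute is shown:

! Sketch only: counting-sort particles by grid cell index so consecutive
! particles touch nearby grid points during deposition and pushing.
subroutine reorder_particles(mi, ncell, cell, attr, attr_sorted)
  implicit none
  integer, intent(in)  :: mi, ncell
  integer, intent(in)  :: cell(mi)        ! grid cell of each particle (1..ncell)
  real(8), intent(in)  :: attr(mi)        ! one particle attribute, for brevity
  real(8), intent(out) :: attr_sorted(mi)
  integer :: offset(ncell+1), pos(ncell), i, c

  ! histogram of particles per cell
  offset = 0
  do i = 1, mi
     offset(cell(i)+1) = offset(cell(i)+1) + 1
  end do
  ! prefix sum: offset(c) = number of particles in cells below c
  do c = 2, ncell+1
     offset(c) = offset(c) + offset(c-1)
  end do
  ! scatter particles into cell order
  pos = offset(1:ncell)
  do i = 1, mi
     c = cell(i)
     pos(c) = pos(c) + 1
     attr_sorted(pos(c)) = attr(i)
  end do
end subroutine reorder_particles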
What Worked
Close interactions with multiple members of app teams
Tiger Team-specific mailing list for S3D:
Generated team-wide comments, tapping more expertise
Not used very much for GTC
Large distributed teams worked (somewhat surprising):
Avoided duplication of measurement effort
University of Oregon participation as affiliate was exemplary
Rice participation was also exemplary
Publications and publicity:
S3D science-focused IOP paper
SciDAC Review paper & SciDAC conf. presentation (Mellor-Crummey)
GTC success story for Fred Johnson
What Didn’t
Timing of application selection:
Not finalized until halfway through the fiscal year; delayed by survey
OK for first year; future implications?
Long Jaguar down time soon after teams formed
Initial understanding of code distribution:
Provided through the JOULE process, NOT directly from the application teams
An appropriate distribution mechanism, but unsettling to application teams
Frequent, ongoing concern of application teams
Will always start with the application team in future, regardless of reason for selection or appropriateness of distribution
Mechanism for providing improvements back to application team:
Slow and cumbersome; no CVS access
May not be solvable due to application team need for internal control
Addressed by repeated direct interactions
FY08 Tiger Team Issues and Proposed Solutions
Which applications will be the focus of FY08 Tiger Teams?
Guidance from HQ requested; recommend one XT4-focused team, one BG/P-focused team
Is the JOULE precedent to continue? Expect timing to be similar to FY07 (January/March), if maybe a little sooner
Plan to continue work with S3D and GTC during the FY08 selection process
Solves the late Q2 decision, one of FY07's biggest issues
Suggest elimination of the Q3 reassessment milestone in light of timing
What happens to teams from the previous year?
Application tuning does not respect fiscal year boundaries
Good relationships established; don't want to lose them
Are liaison activities sufficient to maintain them?
Plan to slowly devolve FY07 teams into very active liaison activities
Use different participants for FY08 teams in order to balance staffing requirements
How do we ensure that the results are publicized?
Initial S3D paper is good; potential for more
GTC success story is good; plan a similar one for S3D
Continued interactions will support solving this question