A comparison of climate applications on accelerated and conventional architectures
DESCRIPTION
A comparison of climate applications on accelerated and conventional architectures. Srinath Vadlamani, Youngsung Kim, and John Dennis, ASAP-TDD-CISL, NCAR. The presentation has two parts: the overall picture of the acceleration effort and its different techniques [Srinath V.], and performance tuning techniques for GPU and MIC [Youngsung K.].
![Page 1: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/1.jpg)
Srinath Vadlamani, Youngsung Kim, and John Dennis, ASAP-TDD-CISL
NCAR
A comparison of climate applications on accelerated and conventional architectures.
![Page 2: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/2.jpg)
The overall picture of the acceleration effort and the different techniques should be understood. [Srinath V.] We use small investigative kernels to help teach us, and instrumentation tools to help us work with the larger code set.
The investigative DG_KERNEL shows what is possible if everything were simple. [Youngsung K.] DG_KERNEL helped us understand the hardware, along with the coding practices and software instructions needed to achieve superb performance.
Presentation has two parts
![Page 3: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/3.jpg)
ASAP Personnel: Srinath Vadlamani, John Dennis, Youngsung Kim, Michael Arndt, Ben Jamroz, and Rich Loft
Active collaborators:
Intel: Michael Greenfield, Ruchira Sasanka, Sergey Egorov, Karthik Raman, and Mark Lubin
NREL: Ilene Carpenter
The Application and Scalability Performance (ASAP) team researches modern micro-architectures for climate codes.
![Page 4: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/4.jpg)
Climate simulations cover hundreds to thousands of years of activity.
The current high-resolution climate simulation rate is 2 to 3 simulated years per day (SYPD) [~40k PEs].
GPUs and coprocessors can help to increase SYPD.
Having many collaborators mandates the use of many architectures.
We must use these architectures efficiently for a successful SYPD speedup, which requires knowing the hardware!
Climate codes ALWAYS NEED A FASTER SYSTEM.
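To make the SYPD figure concrete, here is a quick sketch of the wallclock arithmetic (the function name and the century-scale example are illustrative, not from the slides):

```c
/* Wallclock days needed to run `years` simulated years at a
   throughput of `sypd` simulated years per day (SYPD). */
double wallclock_days(double years, double sypd)
{
    return years / sypd;
}
```

At 2 SYPD, a 100-year experiment takes 50 wallclock days and a 1000-year experiment takes 500, which is why climate codes always need a faster system.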
![Page 5: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/5.jpg)
Conventional CPU based:
NCAR Yellowstone (Xeon: SNB) - CESM, HOMME
ORNL TITAN (AMD: Interlagos) - benchmark kernel
Xeon Phi based:
TACC Stampede - CESM
NCAR test system (SE10x changing to 7120) - HOMME
GPU based:
NCAR Caldera cluster (M2070Q) - HOMME
ORNL Titan (K20x) - HOMME
TACC Stampede (K20) - benchmark kernels only.
We have started the acceleration effort on specific platforms.
![Page 6: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/6.jpg)
CESM is a large application, so we need to create benchmark kernels to understand the hardware. Smaller examples are easier to understand and manipulate.
The first two kernels we have created are DG_KERNEL from HOMME [detailed by Youngsung] and a standalone driver for WETDEPA_V2.
We can learn how to use accelerated hardware for climate codes by creating representative examples.
![Page 7: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/7.jpg)
We created DG_KERNEL knowing it could be a well-vectorized code (with help).
What if we want to start cherry-picking subroutines and loops to try the learned techniques?
Instrumentation tools are available, with teams that are willing to support your efforts. Trace-based tools offer great detail; profile tools present summaries up front. A previous NCAR-SEA conference highlighted such tools.
Knowing what can be accelerated is half the battle.
![Page 8: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/8.jpg)
The Extrae tracing tool was developed at the Barcelona Supercomputing Center (H. Servat, J. Labarta, J. Gimenez). The automatic performance identification process is a BSC research project. Extrae produces a time series of communication and hardware counter events.
Paraver is the visualizer, which also performs statistical analysis. Clustering techniques use a folding concept plus the research identification process to create "synthetic" traces with fewer samples.
Extrae tracing can pick out problematic regions of a large code.
![Page 9: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/9.jpg)
Clustering groups with similar bad computational characteristics is a good guide.
• Result of an Extrae trace of CESM on Yellowstone.
• Similar to exclusive execution time.
![Page 10: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/10.jpg)
Extrae tracing exposed possible waste of cycles.
• Red: instruction count.
• Blue: d(INS)/dt
![Page 11: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/11.jpg)
Paraver identified code regions.
• The trace identifies what code is active when.
• We now examine code regions for characteristics amenable to acceleration.
![Page 12: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/12.jpg)
Automatic Performance Identification highlighted these groups' subroutines.

| Group | Subroutine | % of overall execution time |
| --- | --- | --- |
| A | conden | 2.7 |
| A | compute_usschu | 3.3 |
| A | rtrnmc | 1.75 |
| B | micro_gm_tend | 1.36 |
| B | wetdepa_v2 | 2.5 |
| C | reftra_sw | 1.71 |
| C | spcvmc_sw | 1.21 |
| C | vrtqdr_sw | 1.43 |

• small number of lines of code
• ready to be vectorized
![Page 13: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/13.jpg)
The subroutine has sections of doubly nested loops. These loops are very long and contain branches, and compilers have trouble vectorizing loops containing branches.
The restructuring started with breaking up loops. We collected scalars into arrays for vector operations, and we broke very long expressions into smaller pieces.
wetdepa_v2 can be vectorized with recoding.
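The recoding pattern just described (break up long branchy loops, collect scalars into arrays, split long expressions) can be illustrated with a toy routine; the code below is a hypothetical sketch, not the actual wetdepa_v2 source:

```c
/* Before: one loop whose per-iteration branch can inhibit
   vectorization, especially in long, complex loop bodies. */
void scavenge_branchy(const double *q, double thresh, double *out, int n)
{
    for (int i = 0; i < n; i++) {
        if (q[i] > thresh)
            out[i] = q[i] - thresh;
        else
            out[i] = 0.0;
    }
}

/* After: the branch is hoisted into a mask array (a scalar decision
   collected into an array), and each remaining loop is short and
   branch-free, so it vectorizes cleanly. */
void scavenge_masked(const double *q, double thresh, double *out, int n)
{
    double mask[1024];               /* sketch assumes n <= 1024 */
    for (int i = 0; i < n; i++)      /* branch-free compare */
        mask[i] = (q[i] > thresh) ? 1.0 : 0.0;
    for (int i = 0; i < n; i++)      /* pure arithmetic */
        out[i] = mask[i] * (q[i] - thresh);
}
```

Both versions produce the same results; the second simply presents the compiler with loops it can vectorize without control flow in the body.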
![Page 14: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/14.jpg)
Modification of the code does compare well with compiler optimization.
• Vectorizing? -vec-report=3,6
• Code optimized
• The modification touched only a small number of lines.
• -O3 fast for the original gave incorrect results.
![Page 15: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/15.jpg)
The modified wetdepa_v2, placed back into CESM on SNB, shows better use of resources.
• 2.5% --> 0.7% of overall execution in CESM on Yellowstone.
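Treating the quoted percentages as fractions of the original runtime, the whole-application effect of this one fix can be estimated with a back-of-envelope Amdahl-style calculation (an illustrative sketch, not a measured number):

```c
/* If one routine shrinks from fraction f0 to f1 of the original
   runtime, total time becomes (1 - f0 + f1) of the original, so
   the whole-application speedup is 1 / (1 - f0 + f1). */
double overall_speedup(double f0, double f1)
{
    return 1.0 / (1.0 - f0 + f1);
}
```

overall_speedup(0.025, 0.007) is about 1.018, i.e. roughly a 1.8% whole-application gain from recoding a single routine.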
![Page 16: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/16.jpg)
The CAM-SE configuration was profiled on Stampede at TACC using TAU, which provides different levels of introspection into subroutine and loop efficiency.
This process taught us more about hardware counter metrics.
The initial investigation fits in the core-count to core-count comparison.
Profilers are also useful tools for understanding code efficiency in the BIG code.
![Page 17: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/17.jpg)
Hot spots can be associated with the largest exclusive execution time.
A long time may indicate a branchy section of code.
Long exclusive time on both devices is a good place to start looking.
![Page 18: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/18.jpg)
Low VI makes a routine a candidate for acceleration techniques; high VI could be misleading.
Note: the VI metric is defined differently on Sandy Bridge and Xeon Phi. http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
Possible speedup can be achieved with a gain in vectorization intensity (VI).
![Page 19: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/19.jpg)
CESM on KNC is not competitive today.

| Device | Avg. time step [s] |
| --- | --- |
| Sandy Bridge -O2 | 30.88 |
| KNC -O0 | 1566.7 |
| KNC -O2, "derivative_mod.F90" -O1 | 660.48 |
| KNC -O2, "derivative_mod.F90" -O1 -align array64byte | 516.83 |

• FC5 ne16g37
• 16 MPI ranks/node
• 1 rank/core
• 8 nodes
• Single thread
![Page 20: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/20.jpg)
Hybrid parallelism is promising for CESM on KNC.

| Device | MPI ranks/device | Threads/MPI rank | Avg. 1-day dt [s] |
| --- | --- | --- | --- |
| Dual SNB | 16 | 1 | 4.07 |
| Dual SNB | 1 | 16 | 6.74 |
| Xeon Phi SE10 | 60 | 2 | 611.60 |
| Xeon Phi SE10 | 1 | 120 | 31.66 |
| Xeon Phi SE10 | 1 | 192 | 32.36 |
| Xeon Phi SE10 | 2 | 96 | 24.07 |
| Xeon Phi SE10 | 8 | 24 | 18.60 |
| Xeon Phi SE10 | 16 | 12 | 19.74 |

• FCIDEAL ne16ne16
• Stampede: 8 nodes
• F03 use of allocatable derived-type components to overcome a threading issue [all -O2]
• Intel compiler and IMPI
• KNC 4.6x slower
• Will get better with Xeon Phi tuning techniques
![Page 21: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/21.jpg)
CESM is running on the TACC Stampede KNC cluster. We are more familiar with the possibilities on GPUs and KNCs through using climate-code benchmark kernels.
Kernels are useful for discovering acceleration strategies and for hardware investigations. Results are promising.
We now have tracing and profiling tool knowledge to help identify acceleration possibilities within the large code base.
We have strategies for symmetric operation as a very attractive mode of execution.
Though CESM is not competitive on a KNC cluster today, the kernel experience shows what is possible.
Part 1 conclusion: we are hopeful to see speedup on accelerated hardware.
![Page 22: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/22.jpg)
ASAP/TDD/CISL/NCAR
Youngsung Kim
Performance Tuning Techniques
for GPU and MIC
![Page 23: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/23.jpg)
Introduction
Kernel-based approach
Micro-architectures
MIC performance evolutions
CUDA-C performance evolutions
CPU performance evolutions along with MIC evolutions
GPU programming: OpenACC, CUDA Fortran, and F2C-ACC
One-source consideration
Summary
Contents
![Page 24: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/24.jpg)
What is a kernel? A small, computation-intensive part of an existing large code that represents the characteristics of its computations.
Benefits of the kernel-based approach:
Easy to manipulate and understand (CESM: >1.5M LOC)
Easy to convert to various programming technologies (CUDA-C, CUDA-Fortran, OpenACC, and F2C-ACC)
Easy to isolate issues for analysis (simplifies hardware counter analysis)
Motivation of the kernel-based approach
![Page 25: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/25.jpg)
Origin*: a kernel derived from the computational part of the gradient calculation in the Discontinuous Galerkin formulation of the shallow water equations from HOMME.
Implementation from HOMME: similar to the "dg3d_gradient_mass" function in "dg3d_core_mod.F90".
It calculates the gradient of the flux vectors and updates the flux vectors using the calculated gradient.
DG kernel
*: R. D. Nair, Stephen J. Thomas, and Richard D. Loft: A discontinuous Galerkin global shallow water model, Monthly Weather Review, Vol. 133, pp. 876-888
![Page 26: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/26.jpg)
Floating-point operations:
• No dependency between elements
• The FLOP count can be calculated from the source code analytically; e.g., when nit=1000, nelem=1024, nx=4 (npts=nx*nx), it is ≈ 2 GFLOP
OpenMP: two OpenMP parallel regions for DO loops over the element index (ie)
DG KERNEL - source code
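The kernel's loop structure can be sketched in C (a simplified, hypothetical rendering of the HOMME Fortran, not the actual source; array shapes follow nx=4, npts=nx*nx, and the OpenMP pragma stands in for the parallel regions over the element index):

```c
#define NX 4
#define NPTS (NX * NX)

/* One element's work: gradient of the flux vectors, then a flux
   update using that gradient. */
void dg_element(double flx[NPTS], double fly[NPTS], double grad[NPTS],
                double der[NX][NX], double gw[NX],
                double delta[NX][NX], double dt)
{
    for (int l = 0; l < NX; l++)
        for (int k = 0; k < NX; k++) {
            double s2 = 0.0;
            for (int j = 0; j < NX; j++) {
                double s1 = 0.0;
                for (int i = 0; i < NX; i++)   /* 7 flops per iteration */
                    s1 += (delta[l][j] * flx[i + j*NX] * der[i][k]
                         + delta[i][k] * fly[i + j*NX] * der[j][l]) * gw[i];
                s2 += s1 * gw[j];              /* +2 flops per j */
            }
            grad[k + l*NX] = s2;
        }
    for (int ii = 0; ii < NPTS; ii++) {        /* +4 flops per point */
        flx[ii] += dt * grad[ii];
        fly[ii] += dt * grad[ii];
    }
}

/* Elements are independent, so a driver can parallelize over ie,
   mirroring the OpenMP parallel regions in the Fortran. */
void dg_step(int nelem, double (*flx)[NPTS], double (*fly)[NPTS],
             double (*grad)[NPTS], double der[NX][NX], double gw[NX],
             double delta[NX][NX], double dt)
{
    #pragma omp parallel for
    for (int ie = 0; ie < nelem; ie++)
        dg_element(flx[ie], fly[ie], grad[ie], der, gw, delta, dt);
}

/* Analytic FLOP count under the per-line tallies above:
   NX*(7*NX + 2) = 120 flops per point, and
   NPTS*120 + NPTS*4 = 1984 flops per element per iteration. */
long dg_flops(long nit, long nelem)
{
    return nit * nelem * (NPTS * (NX * (7*NX + 2)) + NPTS * 4);
}
```

With nit=1000 and nelem=1024 this counting gives 1984 * 1024 * 1000 ≈ 2.03e9 flops, consistent with the ≈ 2 GFLOP figure on the slide.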
![Page 27: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/27.jpg)
CPU: conventional multi-core: 1 to 16+ cores, ~256-bit vector registers. Many programming languages: Fortran, C/C++, etc. Intel Sandy Bridge E5-2670: peak performance (2 sockets): 332.8 DP GFLOPS (estimated by presenter).
MIC: based on Intel Pentium cores with extensions, including wider vector registers. Many cores and wider vectors: 60+ cores, 512-bit vector registers. Limited programming languages (extensions only from Intel): C/C++, Fortran. Intel KNC (a.k.a. MIC): peak performance (7120): 1.208 DP TFLOPS.
GPU: many lightweight threads: ~2680+ threads (threading & vectorization). Limited programming languages (extensions): CUDA-C, CUDA-Fortran, OpenCL, OpenACC, F2C-ACC, etc. Peak performances: Nvidia K20x: 1.308 DP TFLOPS; Nvidia K20: 1.173 DP TFLOPS; Nvidia M2070Q: 515.2 GFLOPS.
Micro-architectures
![Page 28: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/28.jpg)
The best performance results from CPU, GPU, and MIC.
(Chart: GPU 5.4x and MIC 6.6x speedup over the best single-socket Sandy Bridge result.)
![Page 29: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/29.jpg)
Compiler options: -mmic
Environment variables: OMP_NUM_THREADS=240, KMP_AFFINITY='granularity=fine,compact'
Native mode only: no cost of memory copy between CPU and MIC.
Support from Intel: R. Sasanka
MIC evolution
(Chart: 15.6x speedup from the initial to the best version.)
![Page 30: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/30.jpg)
Source modification: NONE
Compiler options: -mmic -openmp -O3
MIC ver. 1
![Page 31: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/31.jpg)
Source code:

```fortran
i = 1
s1 = (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
```

Compiler options: -mmic -openmp -O3
Performance considerations: complete unroll of the three nested loops; vectorized, but not efficiently enough.
MIC ver. 2
![Page 32: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/32.jpg)
MIC ver. 3
![Page 33: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/33.jpg)
MIC ver. 4
![Page 34: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/34.jpg)
MIC ver. 5
![Page 35: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/35.jpg)
CPU Evolutions with MIC evolutions
Generally, performance tuning on one micro-architecture also helps to improve performance on another, though this is not always true.
GPU
![Page 36: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/36.jpg)
Compiler options: -O3 -arch=sm_35, the same for all versions.
"Offload mode" only. However, the time cost of data copy between CPU and GPU is not included, for comparison with MIC native mode.
CUDA-C evolutions
(Chart: 14.2x speedup from the initial to the best version.)
![Page 37: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/37.jpg)
CUDA-C ver. 1
![Page 38: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/38.jpg)
CUDA-C ver. 2
![Page 39: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/39.jpg)
CUDA-C ver. 3
![Page 40: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/40.jpg)
CUDA-C ver. 4
![Page 41: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/41.jpg)
Source code:

```fortran
ie = (blockidx%x - 1)*NDIV + (threadidx%x - 1)/(NX*NX) + 1
ii = MODULO(threadidx%x - 1, NX*NX) + 1
IF (ie > SET_NELEM) RETURN
k = MODULO(ii-1, NX) + 1
l = (ii - 1)/NX + 1
s2 = 0.0_8
DO j = 1, NX
  s1 = 0.0_8
  DO i = 1, NX
    s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
               delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
  END DO ! i loop
  s2 = s2 + s1*gw(j)
END DO ! j loop
grad(ii,ie) = s2
flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
```

Performance considerations:
• Maintains the source structure of the original Fortran.
• Needs an understanding of the CUDA threading model, especially for debugging and performance tuning.
• Supports implicit memory copy, which is convenient but can negatively impact performance if over-used.
CUDA-Fortran
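The index arithmetic at the top of that kernel (flat thread id to element ie, point ii, and quadrature indices k, l) can be checked on the host. Below is a plain-C mirror of those Fortran lines, with NDIV as a hypothetical elements-per-block value; Fortran MODULO and C % agree here because all operands are non-negative:

```c
#define NX 4
#define NDIV 16   /* hypothetical number of elements per thread block */

typedef struct { int ie, ii, k, l; } dg_index;

/* blk and tid are 1-based, as blockidx%x and threadidx%x are in
   CUDA Fortran; each element owns NX*NX consecutive threads. */
dg_index map_thread(int blk, int tid)
{
    dg_index m;
    m.ie = (blk - 1) * NDIV + (tid - 1) / (NX * NX) + 1;
    m.ii = (tid - 1) % (NX * NX) + 1;
    m.k  = (m.ii - 1) % NX + 1;
    m.l  = (m.ii - 1) / NX + 1;
    return m;
}
```

For example, thread 17 of block 1 lands on the second element's first point, and thread 6 maps to k=2, l=2 of the first element.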
![Page 42: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/42.jpg)
OpenACC
![Page 43: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/43.jpg)
F2C-ACC
![Page 44: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/44.jpg)
One source is highly desirable: it is hard to manage versions for multiple micro-architectures and multiple programming technologies, and a performance enhancement can be applied to multiple versions simultaneously.
Conditional compilation: macros insert and delete code for a particular technology; the user controls compilation using compiler macros.
It is hard to get one source for CUDA-C: many scientific codes are written in Fortran, while CUDA-C has a quite different code structure and must be written in C.
Performance impact: the highest-performing tuning techniques hardly allow one source.
One source
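The conditional-compilation approach mentioned above can be sketched as follows; the macro names and the wrapper are illustrative, not taken from CESM or HOMME:

```c
/* One shared source; the target-specific variant is selected
   at compile time by a user-supplied macro. */
#if defined(USE_GPU)
#  define TARGET_NAME "gpu"
   /* a CUDA launch wrapper would be compiled here */
#elif defined(USE_MIC)
#  define TARGET_NAME "mic"
   /* a MIC native/offload variant would be compiled here */
#else
#  define TARGET_NAME "cpu"
   /* plain CPU fallback */
#endif

const char *target_name(void) { return TARGET_NAME; }
```

Building with -DUSE_GPU or -DUSE_MIC swaps in the accelerator variant while the rest of the source stays shared, which is exactly what makes aggressive per-target tuning hard to keep in one file.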
![Page 45: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/45.jpg)
Faster hardware provides the potential for performance; however, we can exploit that potential only through better software.
Better software on accelerators generally means software that utilizes many cores and wide vectors simultaneously and efficiently.
In practice, this massive parallelism can be exploited effectively by, among other things, 1) re-using data that have been loaded into faster memory and 2) accessing successive array elements in an aligned, unit-stride manner.
Conclusions
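Point 2 above (unit-stride access) can be illustrated with a toy pair of loops. Both hypothetical functions compute the same sum; they differ only in memory-access order, which is what the hardware cares about:

```c
#define ROWS 64
#define COLS 64

/* Unit stride: the inner loop walks memory contiguously, so each
   cache line is fully used and vector loads are straightforward. */
double sum_unit_stride(double a[ROWS][COLS])
{
    double s = 0.0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)   /* stride of 1 double */
            s += a[r][c];
    return s;
}

/* Large stride: same arithmetic, but each inner-loop access jumps
   COLS doubles, touching a new cache line almost every time. */
double sum_strided(double a[ROWS][COLS])
{
    double s = 0.0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)   /* stride of COLS doubles */
            s += a[r][c];
    return s;
}
```

The results are identical; only the first form gives the aligned, unit-stride pattern the conclusion recommends.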
![Page 46: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/46.jpg)
Using those techniques, we have achieved considerable speedups for DG KERNEL.
Speedups compared to the best one-socket Sandy Bridge performance: MIC: 6.6x; GPU: 5.4x.
Speedups from the initial version to the best-performing version: MIC: 15.6x; GPU: 14.2x.
Our next challenge is to apply the techniques we have learned from kernel experiments to the real software packages.
Conclusions - continued
![Page 47: A comparison of climate applications on accelerated and conventional architectures](https://reader033.vdocuments.site/reader033/viewer/2022051416/5681376f550346895d9f0a39/html5/thumbnails/47.jpg)
Contacts: [email protected], [email protected]
ASAP: http://www2.cisl.ucar.edu/org/cisl/tdd/asap
CESM: http://www2.cesm.ucar.edu
HOMME: http://www.homme.ucar.edu
Extrae: http://www.bsc.es/es/computer-sciences/performance-tools/trace-generation
TAU: http://www.cs.uoregon.edu/research/tau/home.php
Thank you for your attention.