S8241: Versioning GPU-Accelerated WRF to 3.7.1
TRANSCRIPT
Jeff Adie, 26 March, 2018
(Presented by Stan Posey, NVIDIA)
S8241 – VERSIONING GPU-ACCELERATED WRF TO 3.7.1
ACKNOWLEDGEMENT
The work presented here today would not have been possible without the efforts of NVIDIA applications engineers, particularly Carl Ponder and Alexey Romanenko.
Their work on the original WRF 3.6.1 GPU port provided the groundwork for what follows.
INTRODUCTION Motivation for this Work
• WRF is an important top 10 application in the HPC community
• As CPU performance plateaus, GPU accelerated computing becomes increasingly important for both overall scaling and performance/watt
• WRF is a challenging code to optimize, and a large effort was invested in the original OpenACC port of 3.6.1 by NVIDIA
• However, the release that followed, 3.7.1, included new features and performance improvements, and NVIDIA received several requests for a GPU port of this release
• This talk presents the challenges of updating a large community code with local optimizations, provides detailed results, and considers future work
WHAT IS WRF? Weather Research and Forecasting model
• Numerical Weather Prediction (NWP) for operational and research forecasting
• Open source, community code, mostly developed by NCAR, NOAA, and AFWA
• Original version released in 1999
• Around 1 million lines of Fortran, using various revisions from F77 to F2003
• More than 50,000 users, including a substantial number of operational weather centers
• Modular, with 2 dycores (only ARW in this work), and around 80 physics modules
• Add-on packages for forest fires, hydrology, atmospheric chemistry, data assimilation
WRF VERSIONING
WRF generally has two releases each year:
• Major (feature) version release in April
• Minor (bugfix) version release in August
Version:   3.6     3.6.1   3.7     3.7.1   3.8     3.8.1   3.9     3.9.1
Released:  Apr 18  Aug 14  Apr 17  Aug 14  Apr 8   Aug 12  Apr 17  Aug 17
Year:      2014    2014    2015    2015    2016    2016    2017    2017
GPU ACCELERATED WRF Previous Efforts
Several early efforts targeting specific (physics) modules
• 2008: Michalakes & Vachharajani's early efforts with WSM5 (8x speedup)
• 2009: Linford et al. accelerated the KPP part of WRF-Chem (5.5x speedup)
• 2011: Mielikainen et al. accelerated the RRTM module
• 2012: Mielikainen et al. improved WSM5 (200x speedup)
• 2013: SSEC showed speedups for multiple physics modules at GTC 2013
• 2015: Michalakes and Iacono/AER: accelerated versions of RRTMG for both SW and LW radiation, included in WRF 3.7
GPU ACCELERATED WRF Current Efforts – TempoQuest (www.tempoquest.com)
• PSG cluster nodes: 2 CPUs with 16 cores each, plus 4x P100 GPUs
• CPU-only runs use the WRF 3.8.1 trunk, one MPI task per core
• CPU+GPU runs use one MPI task per GPU
• The EM LES case is mostly dycore computations
[Chart: WRF 3.8.1 results for the EM LES case on PSG, 4 nodes; higher is better. Series for CPU, GPU, and ideal GPU; annotations show 9.0x and a 6.2x speedup at 4 nodes / 8 CPUs / 16 GPUs. Source: TQI – Abdi, 27 Dec 17]
GPU ACCELERATED WRF Current Efforts – NVIDIA OpenACC Approach
• OpenACC, not CUDA
• Directive-based
• Permits a single source tree for both CPU and GPU (illustrated below)
• Easier to maintain and support
• Initial efforts focused on WRF 3.6.1
• However, demand is rising for 3.7.1 support
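To illustrate the single-source point: OpenACC directives are plain comments to a compiler run without acceleration, so one file serves both targets. A minimal sketch, not from the WRF source (the routine and flags are illustrative):

   ! Built with OpenACC enabled (e.g. -acc on the PGI/NVIDIA compilers),
   ! the loop below runs on the GPU; built without it, the !$acc line is
   ! an ordinary comment and the same code runs on the CPU.
   subroutine saxpy(n, a, x, y)
     implicit none
     integer, intent(in)    :: n
     real,    intent(in)    :: a, x(n)
     real,    intent(inout) :: y(n)
     integer :: i
     !$acc parallel loop copyin(x) copy(y)
     do i = 1, n
        y(i) = a*x(i) + y(i)
     end do
   end subroutine saxpy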
THE CHALLENGE
• The main WRF source tree is ~1 million lines of code in ~1,200 source files
• Multiple contributors, multiple styles, multiple versions of Fortran (plus a little C)
• Moving from 3.6.1 to 3.7.1 changed 237 files and 143k LoC (about 15% of the total codebase)
• All of the NVIDIA OpenACC work had to be merged in and integrated with the community's 3.6.1-to-3.7.1 changes
THE CHALLENGE Merging NVIDIA code and Community code
• Many WRF kernels contain insufficient work for good speedups on GPUs
• NVIDIA devtechs often modify these kernels to improve performance
• Those modifications must be reapplied (and sometimes adapted) when porting to a new version
[Diagram: four-way merge -- Community Routine V3.6.1, NVIDIA Modified V3.6.1, Community Routine V3.7.1, NVIDIA Modified V3.7.1; the NVIDIA 3.7.1 code must combine the community's 3.7.1 changes with NVIDIA's 3.6.1 modifications]
THE CHALLENGE
Example: phys/module_bl_ysu.F, in ysu2d()

WRF V3.6.1 (NCAR):

   real, dimension( its:ite, kts:kte ) :: thx
   ...
   do k = kts,kte
     do i = its,ite
       thx(i,k) = tx(i,k)/pi2d(i,k)
     enddo
   enddo

WRF V3.6.1 (NV):

   real, dimension( its:ite, kts:kte, jts:jte ) :: thx
   ...
   !$acc kernels
   do j = jts,jte
     do k = kts,kte
       do i = its,ite
         thx(i,k,j) = tx(i,k,j)/pi2d(i,k,j)
       enddo
     enddo
   enddo
   !$acc end kernels

WRF V3.7.1 (NCAR):

   real, dimension( its:ite, kts:kte ) :: thx, thlix
   ...
   do k = kts,kte
     do i = its,ite
       thx(i,k) = tx(i,k)/pi2d(i,k)
       thlix(i,k) = (tx(i,k) - xlv*qx(i,ktrace2+k)/cp - &
                     2.834E6*qx(i,ktrace3+k)/cp)/pi2d(i,k)
     enddo
   enddo

WRF V3.7.1 (NV):

   real, dimension( its:ite, kts:kte, jts:jte ) :: thx, thlix
   ...
   !$acc kernels
   do j = jts,jte
     do k = kts,kte
       do i = its,ite
         thx(i,k,j) = tx(i,k,j)/pi2d(i,k,j)
         thlix(i,k,j) = (tx(i,k,j) - xlv*qx(i,ktrace2+k,j)/cp - &
                         2.834E6*qx(i,ktrace3+k,j)/cp)/pi2d(i,k,j)
       enddo
     enddo
   enddo
   !$acc end kernels

The NV versions promote the arrays from 2D to 3D and bring the j loop inside, so a single !$acc kernels region covers the whole 3D nest (enough work for the GPU); the community's 3.7.1 thlix computation then has to be merged into that restructured loop.
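For loop nests like this, an explicit parallel loop with a collapse clause is a common alternative to the kernels construct; a minimal sketch of the idea, not taken from the actual NVIDIA port:

   ! Illustrative alternative, not from the WRF port: collapse the three
   ! loops into a single parallel iteration space so the GPU gets enough work.
   !$acc parallel loop collapse(3)
   do j = jts,jte
     do k = kts,kte
       do i = its,ite
         thx(i,k,j) = tx(i,k,j)/pi2d(i,k,j)
       enddo
     enddo
   enddo

With kernels the compiler decides how to parallelize; with parallel loop the programmer asserts loop independence, which can matter when the compiler cannot prove it.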
METHODOLOGY Separate the dycore from the physics; port one, then the other
• The interface to the physics packages is through the main solver
• These interfaces changed from 3.6.1 to 3.7.1
• Track the additional (new) variables, determine whether each needs to be GPU resident, and add it to the OpenACC data pragmas if so (a sketch follows below)
• Keep track of everything, package by package
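A minimal sketch of that last step, with hypothetical names (drive_physics and new_var are illustrative, not from the actual port): when 3.7.1 introduces a new physics argument that GPU code reads and writes, it joins the existing arrays in the OpenACC data clauses.

   ! Hypothetical sketch -- new_var stands in for a variable added in 3.7.1
   ! that must now be GPU resident alongside the existing arrays.
   subroutine drive_physics(n, tx, pi2d, thx, new_var)
     implicit none
     integer, intent(in)    :: n
     real,    intent(in)    :: tx(n), pi2d(n)
     real,    intent(inout) :: thx(n), new_var(n)
     integer :: i
     !$acc data copyin(tx, pi2d) copy(thx, new_var)
     !$acc parallel loop
     do i = 1, n
        thx(i)     = tx(i)/pi2d(i)
        new_var(i) = new_var(i) + thx(i)   ! placeholder physics update
     end do
     !$acc end data
   end subroutine drive_physics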
METHODOLOGY You need a lot of pixels!
[Screenshot: side-by-side windows for the 3.6.1 NCAR, 3.7.1 NCAR, 3.7.1 NV, and 3.6.1 NV sources, plus compile/debug/run output and a diff display]
RESULTS GPU WRF V3.7.1 vs GPU WRF V3.6.1: Scaling over different model sizes
[Chart: relative elapsed time (0 to 1.2) for Small, Medium, and Large models, comparing 3.6.1 GPU and 3.7.1 GPU; smaller is better]
RESULTS Dynamics – ConUS 2.5 km model, WRF V3.7.1
[Chart: elapsed time (μs), 0 to 350,000, CPU vs GPU; smaller is better]
RESULTS Overall Run – ConUS 2.5 km model, WRF V3.7.1
[Chart: elapsed time (s), 0 to 2,500, CPU vs GPU; smaller is better]
RESULTS Overall Run – ConUS 2.5 km model, WRF V3.7.1 Scaling
[Chart: elapsed time (0 to 2,000) versus node count (1, 2, 4); smaller is better]
LESSONS LEARNED Yes, the effort is worth it
• Porting is far, far simpler than starting over
• Even with the pain of two diverging trees (community + NVIDIA)
• An order of magnitude less effort
• Changes need to go into the main tree as much as possible
• NVIDIA is unlikely to port all physics packages
• WRF is moving to physics "suites": approved combinations of packages
• Two initial suites: Tropical & ConUS
THE FUTURE Common Community Physics Package (CCPP) – NOAA funded
• Caps for the dycore, physics, etc.
• Fixed routines: init, run, etc.
• Commented args: type, dim
• Can we push acceleration into these caps? Or into the CCPP driver? (see the sketch below)
• Need to work closely with the community
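As a rough sketch of the cap idea described above: a hypothetical scheme with fixed entry points and argument metadata in comments. The scheme name and comment layout are illustrative only; the real metadata format is defined by the CCPP framework itself.

   ! Hypothetical CCPP-style cap -- illustrative, not the CCPP specification.
   module example_scheme
     implicit none
   contains
     subroutine example_scheme_init()
       ! fixed entry point: one-time setup (lookup tables, constants)
     end subroutine example_scheme_init

     ! Argument metadata in structured comments (type, dim, intent) so a
     ! driver layer can generate the calling interface:
     !   tx  | real | (its:ite, kts:kte) | in
     !   thx | real | (its:ite, kts:kte) | out
     subroutine example_scheme_run(its, ite, kts, kte, tx, thx)
       integer, intent(in)  :: its, ite, kts, kte
       real,    intent(in)  :: tx(its:ite, kts:kte)
       real,    intent(out) :: thx(its:ite, kts:kte)
       ! the slide's open question: an !$acc region could live here in the
       ! cap rather than in each physics routine
       thx = tx   ! placeholder for the actual physics computation
     end subroutine example_scheme_run
   end module example_scheme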