S8241: Versioning GPU-Accelerated WRF to 3.7.1
TRANSCRIPT
Jeff Adie, 26 March, 2018
(Presented by Stan Posey, NVIDIA)
S8241 – VERSIONING GPU-ACCELERATED WRF TO 3.7.1
ACKNOWLEDGEMENT
The work presented here today would not have been possible without the efforts of NVIDIA applications engineers, particularly Carl Ponder and Alexey Romanenko.
Their work on the original WRF 3.6.1 GPU port provided the groundwork for what follows.
INTRODUCTION Motivation for this Work
• WRF is an important top 10 application in the HPC community
• As CPU performance plateaus, GPU accelerated computing becomes increasingly important for both overall scaling and performance/watt
• WRF is a challenging code to optimize, and a large effort was invested in the original OpenACC port of 3.6.1 by NVIDIA
• However, the release that followed, 3.7.1, included new features and performance improvements, and NVIDIA received several requests for a GPU port of this release
• This talk presents the challenges of updating a large community code with local optimizations, provides detailed results, and considers future work
WHAT IS WRF? Weather Research and Forecasting model
• Numerical Weather Prediction (NWP) for operational and research forecasting
• Open source, community code, mostly developed by NCAR, NOAA, and AFWA
• Original version released in 1999
• Around 1 million lines of Fortran, using various revisions from F77 to F2003
• More than 50,000 users, including a substantial number of operational weather centers
• Modular, with 2 dycores (only ARW in this work), and around 80 physics modules
• Add-on packages for forest fires, hydrology, atmospheric chemistry, data assimilation
WRF VERSIONING
WRF generally has two releases each year:
• Major (feature) version release in April
• Minor (bugfix) version release in August
Version:   3.6     3.6.1   3.7     3.7.1   3.8     3.8.1   3.9     3.9.1
Released:  Apr 18  Aug 14  Apr 17  Aug 14  Apr 8   Aug 12  Apr 17  Aug 17
Year:      2014    2014    2015    2015    2016    2016    2017    2017
GPU ACCELERATED WRF Previous Efforts
Several early efforts targeting specific (physics) modules
• 2008: Michalakes & Vachharajani's early efforts with WSM5 (8x speedup)
• 2009: Linford et al. accelerated the KPP part of WRF-Chem (5.5x speedup)
• 2011: Mielikainen et al. accelerated the RRTM module
• 2012: Mielikainen et al. improved WSM5 (200x speedup)
• 2013: SSEC showed speedups for multiple physics modules at GTC 2013
• 2015: Michalakes and Iacono/AER: accelerated versions of RRTMG for both SW and LW radiation, included in WRF 3.7
GPU ACCELERATED WRF Current Efforts – TempoQuest (www.tempoquest.com)
• PSG cluster nodes: 2 CPUs with 16 cores each, plus 4x P100 GPUs
• CPU-only runs use the WRF 3.8.1 trunk, one MPI task per core
• CPU+GPU runs use one MPI task per GPU
• The EM LES case is mostly dycore computations
[Chart: WRF 3.8.1 results for the EM LES case on PSG, 4 nodes; higher is better. Series for CPU, GPU, and ideal GPU; annotations show 9.0x and a 6.2x speedup at 4 nodes / 8 CPUs / 16 GPUs. Source: TQI – Abdi, 27 Dec 17]
GPU ACCELERATED WRF Current Efforts – NVIDIA OpenACC Approach
• OpenACC, not CUDA
• Directive-based
• Permits a single source tree for both CPU and GPU (illustrated below)
• Easier to maintain and support
• Initial efforts focused on WRF 3.6.1
• However, demand is rising for 3.7.1 support
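To illustrate the single-source point: OpenACC directives are plain comments to a compiler run without acceleration, so one file serves both targets. A minimal sketch, not from the WRF source (the routine and flags are illustrative):

   ! Built with OpenACC enabled (e.g. -acc on the PGI/NVIDIA compilers),
   ! the loop below runs on the GPU; built without it, the !$acc line is
   ! an ordinary comment and the same code runs on the CPU.
   subroutine saxpy(n, a, x, y)
     implicit none
     integer, intent(in)    :: n
     real,    intent(in)    :: a, x(n)
     real,    intent(inout) :: y(n)
     integer :: i
     !$acc parallel loop copyin(x) copy(y)
     do i = 1, n
        y(i) = a*x(i) + y(i)
     end do
   end subroutine saxpy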
THE CHALLENGE
• The main WRF source tree is ~1 million lines of code in ~1,200 source files
• Multiple contributors, multiple styles, multiple versions of Fortran (plus a little C)
• Moving from 3.6.1 to 3.7.1 changed 237 files and 143k LoC (about 15% of the total codebase)
• All of the NVIDIA OpenACC work had to be merged in and integrated with the community's 3.6.1-to-3.7.1 changes
THE CHALLENGE Merging NVIDIA code and Community code
• Many WRF kernels contain insufficient work for good speedups on GPUs
• NVIDIA devtechs often modify these kernels to improve performance
• Those modifications must be reapplied (and sometimes adapted) when porting to a new version
[Diagram: four-way merge -- Community Routine V3.6.1, NVIDIA Modified V3.6.1, Community Routine V3.7.1, NVIDIA Modified V3.7.1; the NVIDIA 3.7.1 code must combine the community's 3.7.1 changes with NVIDIA's 3.6.1 modifications]
THE CHALLENGE
Example: phys/module_bl_ysu.F, in ysu2d()

WRF V3.6.1 (NCAR):

   real, dimension( its:ite, kts:kte ) :: thx
   ...
   do k = kts,kte
     do i = its,ite
       thx(i,k) = tx(i,k)/pi2d(i,k)
     enddo
   enddo

WRF V3.6.1 (NV):

   real, dimension( its:ite, kts:kte, jts:jte ) :: thx
   ...
   !$acc kernels
   do j = jts,jte
     do k = kts,kte
       do i = its,ite
         thx(i,k,j) = tx(i,k,j)/pi2d(i,k,j)
       enddo
     enddo
   enddo
   !$acc end kernels

WRF V3.7.1 (NCAR):

   real, dimension( its:ite, kts:kte ) :: thx, thlix
   ...
   do k = kts,kte
     do i = its,ite
       thx(i,k) = tx(i,k)/pi2d(i,k)
       thlix(i,k) = (tx(i,k) - xlv*qx(i,ktrace2+k)/cp - &
                     2.834E6*qx(i,ktrace3+k)/cp)/pi2d(i,k)
     enddo
   enddo

WRF V3.7.1 (NV):

   real, dimension( its:ite, kts:kte, jts:jte ) :: thx, thlix
   ...
   !$acc kernels
   do j = jts,jte
     do k = kts,kte
       do i = its,ite
         thx(i,k,j) = tx(i,k,j)/pi2d(i,k,j)
         thlix(i,k,j) = (tx(i,k,j) - xlv*qx(i,ktrace2+k,j)/cp - &
                         2.834E6*qx(i,ktrace3+k,j)/cp)/pi2d(i,k,j)
       enddo
     enddo
   enddo
   !$acc end kernels

The NV versions promote the arrays from 2D to 3D and bring the j loop inside, so a single !$acc kernels region covers the whole 3D nest (enough work for the GPU); the community's 3.7.1 thlix computation then has to be merged into that restructured loop.
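For loop nests like this, an explicit parallel loop with a collapse clause is a common alternative to the kernels construct; a minimal sketch of the idea, not taken from the actual NVIDIA port:

   ! Illustrative alternative, not from the WRF port: collapse the three
   ! loops into a single parallel iteration space so the GPU gets enough work.
   !$acc parallel loop collapse(3)
   do j = jts,jte
     do k = kts,kte
       do i = its,ite
         thx(i,k,j) = tx(i,k,j)/pi2d(i,k,j)
       enddo
     enddo
   enddo

With kernels the compiler decides how to parallelize; with parallel loop the programmer asserts loop independence, which can matter when the compiler cannot prove it.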
METHODOLOGY Separate the dycore from the physics; port one, then the other
• The interface to the physics packages is through the main solver
• These interfaces changed from 3.6.1 to 3.7.1
• Track the additional (new) variables, determine whether each needs to be GPU resident, and add it to the OpenACC data pragmas if so (a sketch follows below)
• Keep track of everything, package by package
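A minimal sketch of that last step, with hypothetical names (drive_physics and new_var are illustrative, not from the actual port): when 3.7.1 introduces a new physics argument that GPU code reads and writes, it joins the existing arrays in the OpenACC data clauses.

   ! Hypothetical sketch -- new_var stands in for a variable added in 3.7.1
   ! that must now be GPU resident alongside the existing arrays.
   subroutine drive_physics(n, tx, pi2d, thx, new_var)
     implicit none
     integer, intent(in)    :: n
     real,    intent(in)    :: tx(n), pi2d(n)
     real,    intent(inout) :: thx(n), new_var(n)
     integer :: i
     !$acc data copyin(tx, pi2d) copy(thx, new_var)
     !$acc parallel loop
     do i = 1, n
        thx(i)     = tx(i)/pi2d(i)
        new_var(i) = new_var(i) + thx(i)   ! placeholder physics update
     end do
     !$acc end data
   end subroutine drive_physics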
METHODOLOGY You need a lot of pixels!
[Screenshot: side-by-side windows for the 3.6.1 NCAR, 3.7.1 NCAR, 3.7.1 NV, and 3.6.1 NV sources, plus compile/debug/run output and a diff display]
RESULTS GPU WRF V3.7.1 vs GPU WRF V3.6.1: Scaling over different model sizes
[Chart: relative elapsed time (0 to 1.2) for Small, Medium, and Large models, comparing 3.6.1 GPU and 3.7.1 GPU; smaller is better]
RESULTS Dynamics – ConUS 2.5 km model, WRF V3.7.1
[Chart: elapsed time (μs), 0 to 350,000, CPU vs GPU; smaller is better]
RESULTS Overall Run – ConUS 2.5 km model, WRF V3.7.1
[Chart: elapsed time (s), 0 to 2,500, CPU vs GPU; smaller is better]
RESULTS Overall Run – ConUS 2.5 km model, WRF V3.7.1 Scaling
[Chart: elapsed time (0 to 2,000) versus node count (1, 2, 4); smaller is better]
LESSONS LEARNED Yes, the effort is worth it
• Porting is far, far simpler than starting over
• Even with the pain of two diverging trees (community + NVIDIA)
• An order of magnitude less effort
• Changes need to go into the main tree as much as possible
• NVIDIA is unlikely to port all physics packages
• WRF is moving to physics "suites": approved combinations of packages
• Two initial suites: Tropical & ConUS
THE FUTURE Common Community Physics Package (CCPP) – NOAA funded
• Caps for the dycore, physics, etc.
• Fixed routines: init, run, etc.
• Commented args: type, dim
• Can we push acceleration into these caps? Or into the CCPP driver? (see the sketch below)
• Need to work closely with the community
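As a rough sketch of the cap idea described above: a hypothetical scheme with fixed entry points and argument metadata in comments. The scheme name and comment layout are illustrative only; the real metadata format is defined by the CCPP framework itself.

   ! Hypothetical CCPP-style cap -- illustrative, not the CCPP specification.
   module example_scheme
     implicit none
   contains
     subroutine example_scheme_init()
       ! fixed entry point: one-time setup (lookup tables, constants)
     end subroutine example_scheme_init

     ! Argument metadata in structured comments (type, dim, intent) so a
     ! driver layer can generate the calling interface:
     !   tx  | real | (its:ite, kts:kte) | in
     !   thx | real | (its:ite, kts:kte) | out
     subroutine example_scheme_run(its, ite, kts, kte, tx, thx)
       integer, intent(in)  :: its, ite, kts, kte
       real,    intent(in)  :: tx(its:ite, kts:kte)
       real,    intent(out) :: thx(its:ite, kts:kte)
       ! the slide's open question: an !$acc region could live here in the
       ! cap rather than in each physics routine
       thx = tx   ! placeholder for the actual physics computation
     end subroutine example_scheme_run
   end module example_scheme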