GPU acceleration of a non-hydrostatic ocean model with a multigrid Poisson/Helmholtz solver
TRANSCRIPT
--- Page 1 ---
GPU acceleration of a non-hydrostatic ocean model with a
multigrid Poisson/Helmholtz solver
Takateru Yamagishi1, Yoshimasa Matsumura2
1 Research Organization for Information Science and Technology
2 Institute of Low Temperature Science, Hokkaido University
6th International Workshop on Advances in High-Performance Computational Earth Sciences: Applications & Frameworks
--- Page 2 ---
Table of Contents
- Motivation
- Numerical ocean model 'kinaco'
- GPU implementation and optimization
- Evaluation and validation
- Summary
--- Page 3 ---
Motivation
- Significance of numerical ocean modelling: global climate, weather, marine resources, etc.
- GPUs offer high computational performance: explicit and detailed representation, long simulations, many experiment cases
- Previous studies (Bleichrodt et al., 2012; Milakov et al., 2013; Werkhoven et al., 2013; Xu et al., 2015) showed high performance, but were limited to experimental studies
- We aim at realistic and practical studies
--- Page 4 ---
Non-hydrostatic numerical ocean model 'kinaco'
Example application: formation of Antarctic Bottom Water in the southern Weddell Sea.
We try to accelerate this model on the GPU.
--- Page 5 ---
Basic equations of dynamics in kinaco
- Fluid dynamics: 3D Navier-Stokes equations
- Poisson equation $\Delta p = f$ and Helmholtz equation $(\Delta + \lambda)h = 0$
- Discretization: stencil access to the 6 adjacent grid cells
- Solving the resulting linear system $Ax = b$ amounts to sparse matrix-vector multiplications
- An efficient solver for $Ax = b$ is required
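The stencil-based matrix-vector product can be sketched as a matrix-free operator; a minimal C version (grid size, array names, and the zero-padding boundary treatment are illustrative assumptions, not taken from kinaco):

```c
#include <assert.h>

/* Illustrative 7-point stencil y = A*x on an N1 x N2 x N3 grid.
   a[c][p] holds the coefficient for stencil offset c = 0..6,
   corresponding to the neighbours k-1, j-1, i-1, center, i+1, j+1, k+1.
   Points outside the domain are treated as zero. */
#define N1 4
#define N2 4
#define N3 4
#define IDX(i, j, k) ((i) + N1 * ((j) + N2 * (k)))

static double get(const double *x, int i, int j, int k) {
    if (i < 0 || i >= N1 || j < 0 || j >= N2 || k < 0 || k >= N3) return 0.0;
    return x[IDX(i, j, k)];
}

static void stencil_apply(double a[7][N1 * N2 * N3], const double *x, double *y) {
    for (int k = 0; k < N3; k++)
        for (int j = 0; j < N2; j++)
            for (int i = 0; i < N1; i++) {
                int p = IDX(i, j, k);
                y[p] = a[0][p] * get(x, i, j, k - 1)
                     + a[1][p] * get(x, i, j - 1, k)
                     + a[2][p] * get(x, i - 1, j, k)
                     + a[3][p] * get(x, i, j, k)
                     + a[4][p] * get(x, i + 1, j, k)
                     + a[5][p] * get(x, i, j + 1, k)
                     + a[6][p] * get(x, i, j, k + 1);
            }
}
```

Only the six neighbours and the center are ever touched, which is why the system matrix never needs to be stored in a general sparse format.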
--- Page 6 ---
CG method with multigrid preconditioner (MGCG)
- A fast and scalable iterative method (Matsumura and Hasumi, 2008)
- Preconditioner: the multigrid method, which solves the equation on grids of various resolutions
--- Page 7 ---
Implementation on the GPU: CUDA Fortran
- kinaco is written in Fortran 90, so CUDA Fortran, which is almost the same as CUDA C, lets us use CUDA instructions directly
- The GPU code follows the original structure of the CPU code
- Good performance vs the CPU is achieved, but we aimed at further acceleration
--- Page 8 ---
Optimization of the MGCG solver
The MGCG solver accounts for 21% of the total simulation time and mainly consists of sparse matrix-vector multiplications.
Optimizations:
1. Memory access
2. Hiding latency by thread-/instruction-level parallelism
3. Mixed-precision preconditioner for MGCG
--- Page 9 ---
Memory access in CPU kernel
    DO k=1, n3
      DO j=1, n2
        DO i=1, n1
          out(i,j,k) = a(-3,i,j,k) * x(i,  j,  k-1) &
                     + a(-2,i,j,k) * x(i,  j-1,k  ) &
                     + a(-1,i,j,k) * x(i-1,j,  k  ) &
                     + a( 0,i,j,k) * x(i,  j,  k  ) &
                     + a( 1,i,j,k) * x(i+1,j,  k  ) &
                     + a( 2,i,j,k) * x(i,  j+1,k  ) &
                     + a( 3,i,j,k) * x(i,  j,  k+1)
        END DO
      END DO
    END DO
This is the sparse matrix-vector kernel in the CPU code. The matrix coefficients a(-3,i,j,k) … a(3,i,j,k) hold the stencil weights for the offsets labelled -3 … 3 (the neighbours k-1, j-1, i-1, the center, and i+1, j+1, k+1). A CPU thread loads the array 'a' along a cache line.
--- Page 10 ---
Memory access in GPU kernel
- With the CPU layout a(-3:3,i,j,k), the coefficient index varies fastest (Fortran is column-major), so thread(id), thread(id+1), thread(id+2), which handle consecutive i, read a(-3,i,j,k), a(-3,i+1,j,k), a(-3,i+2,j,k): each GPU thread accesses array "a" with a stride of 7
- Reordering the array to a(i,j,k,-3:3) makes i vary fastest, so consecutive threads read the adjacent elements a(i,j,k,-3), a(i+1,j,k,-3), a(i+2,j,k,-3): coalesced access to array "a"
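The effect of the index reordering can be checked directly from the column-major offset formulas; a small C sketch (dimensions and function names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Column-major (Fortran-style) linear offsets for the two layouts of
   the coefficient array. c is the stencil offset in -3..3; i, j, k are
   1-based grid indices. */
enum { N1 = 8, N2 = 8, N3 = 8 };

/* Layout a(-3:3, i, j, k): coefficient index varies fastest. */
static size_t off_coef_first(int c, int i, int j, int k) {
    return (size_t)(c + 3)
         + 7 * ((size_t)(i - 1) + N1 * ((size_t)(j - 1) + N2 * (size_t)(k - 1)));
}

/* Layout a(i, j, k, -3:3): grid index i varies fastest. */
static size_t off_coef_last(int c, int i, int j, int k) {
    return (size_t)(i - 1)
         + N1 * ((size_t)(j - 1) + N2 * ((size_t)(k - 1) + N3 * (size_t)(c + 3)));
}
```

Consecutive GPU threads handle consecutive i for a fixed coefficient c, so the second layout lets a warp touch one contiguous run of memory per coefficient.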
--- Page 11 ---
Hide latency by thread-/instruction-level parallelism
Hiding latency means doing other operations while waiting on a latency.
- Thread-level parallelism: switch threads to hide latency
- Instruction-level parallelism (Volkov, 2010): one thread issues several independent operations
Comparison of the two forms of parallelism:
--- Page 12 ---
Case 1: Thread-level parallelism

    i = threadidx%x + blockdim%x * (blockidx%x-1)
    j = threadidx%y + blockdim%y * (blockidx%y-1)
    k = threadidx%z + blockdim%z * (blockidx%z-1)
    out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
               + a(i,j,k,-2) * x(i,  j-1,k  ) &
               + a(i,j,k,-1) * x(i-1,j,  k  ) &
               + a(i,j,k, 0) * x(i,  j,  k  ) &
               + a(i,j,k, 1) * x(i+1,j,  k  ) &
               + a(i,j,k, 2) * x(i,  j+1,k  ) &
               + a(i,j,k, 3) * x(i,  j,  k+1)

- Set as many threads (i, j, k) as possible
- 3D (i, j, k) threads are launched; one thread per grid point
- Latency is hidden by switching among many threads
--- Page 13 ---
Case 2: Instruction-level parallelism
Independent operations are repeated within one thread:

    i = threadidx%x + blockdim%x * (blockidx%x-1)
    j = threadidx%y + blockdim%y * (blockidx%y-1)
    DO k=1, n3
      out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
                 + a(i,j,k,-2) * x(i,  j-1,k  ) &
                 + a(i,j,k,-1) * x(i-1,j,  k  ) &
                 + a(i,j,k, 0) * x(i,  j,  k  ) &
                 + a(i,j,k, 1) * x(i+1,j,  k  ) &
                 + a(i,j,k, 2) * x(i,  j+1,k  ) &
                 + a(i,j,k, 3) * x(i,  j,  k+1)
    END DO

- 2D (i, j) threads are launched; one thread per vertical column
- Latency is hidden by the independent instructions within the k loop
Case 2 is faster.
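The benefit of independent in-thread operations has a familiar CPU analogue: splitting a reduction over several independent accumulators so the hardware can overlap floating-point latencies. A minimal C sketch of that idea (purely illustrative, not from the slides):

```c
#include <assert.h>

/* Sum an array with four independent accumulators. Because the four
   partial sums have no data dependence on each other, consecutive
   additions can overlap in the FPU pipeline -- the same principle as
   one GPU thread issuing several independent stencil updates. */
static double sum_ilp(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i]; /* remainder */
    return (s0 + s1) + (s2 + s3);
}
```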
--- Page 14 ---
Mixed precision for multigrid preconditioning
- Low precision utilizes GPU resources efficiently
- For preconditioning, low precision is sufficient
- On the GPU, performance deteriorates on the coarse grids
- The number of CG iterations is unchanged with or without mixed precision
--- Page 15 ---
Evaluation: experimental setting
- CPU (Fujitsu SPARC64 VIIIfx) vs GPU (NVIDIA K20c), 1 CPU vs 1 GPU
- Study of baroclinic instability (Visbeck et al., 1996)
- Forcing: Coriolis force, temperature forcing
- Structured, isotropic domain of size (256, 256, 32)
- Time step: 2 min; simulation time: 5 hours (150 steps) and 5 days (3600 steps)
--- Page 16 ---
Performance

Elapsed time [s], CPU vs GPU, for the 5-hour (150-step) run:

| Component | CPU | GPU_1 | GPU_2 | GPU_3 | Speedup (GPU_3) |
|---|---|---|---|---|---|
| All components | 174.2 | 42.6 | 39.2 | 37.3 | 4.7 |
| Poisson/Helmholtz solver | 36.8 | 15.8 | 12.4 | 10.5 | 3.5 |
| Others | 137.4 | 26.9 | 26.8 | 26.8 | 5.1 |

- CPU: original CPU code
- GPU_1: basic, typical implementation on the GPU
- GPU_2: GPU_1 + memory optimization and latency hiding
- GPU_3: GPU_2 + mixed-precision preconditioning

The GPU achieved a 4.7x speedup vs the CPU.
--- Page 17 ---
Surface ocean current/velocity field (panels: CPU, GPU_2, GPU_3)
Good reproduction of growing meanders due to baroclinic instability.
--- Page 18 ---
Temperature at the cross section (panels: CPU, GPU_2)
Good reproduction of vertical convection of water.
--- Page 19 ---
Summary and future work
- Numerical ocean model on the GPU (K20c) vs the CPU (SPARC64 VIIIfx): 4.7x faster than the CPU
- The errors due to the GPU implementation are not significant for oceanic studies
- Further work: application of mixed precision to other kernels, MPI implementation, realistic experiments