Daino: A High-level Framework for Parallel and Efficient AMR on GPUs
Mohamed Wahib1, Naoya Maruyama1,2, Takayuki Aoki2
1 RIKEN Advanced Institute for Computational Science, Kobe, Japan
2 Tokyo Institute of Technology, GSIC, Tokyo, Japan
11th May 2017
GTC17
Motivation & Problem:
“AMR is one of the paths to multi-scale exascale applications”
Producing efficient AMR code is hard (especially for GPU)
Solution:
A framework for producing efficient AMR code (for GPUs)
Architecture-independent interface provided to the user
A speedup model for quantifying the efficiency of AMR code
Key results: We evaluate three AMR applications
Speedups & scalability comparable to hand-written code
(~3,642 K20x GPUs)
Summary
2
For meshes in some simulations based on PDEs:
We only require high resolution in areas of interest
The resolution changes dynamically during the simulation
Achieving efficient AMR is challenging
Managing an adaptive mesh can be complicated
Balancing compute load and communication costs
3
Adaptive Mesh Refinement (AMR)
Octree-based meshes: (a) Adaptive mesh (b) Tree representation
Structured Tree-based AMR
4
There are many ways to represent the mesh
We focus on the octree representation (quadtree in 2D)
The mesh is divided into blocks, which are refined or coarsened as required
Operations applied on the tree are distributed across processing elements (PE1, PE2, PE3)
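Block refinement on the tree can be sketched with a toy quadtree in C (the 2-D analogue of the octree); the types and layout here are assumptions for illustration, not Daino's actual data structures:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative quadtree node: a leaf corresponds to an actual mesh block. */
typedef struct Node {
    int level;              /* refinement level of this block    */
    struct Node *child[4];  /* all NULL for a leaf (a real block) */
} Node;

static Node *node_new(int level) {
    Node *n = calloc(1, sizeof(Node));
    n->level = level;
    return n;
}

/* Refine: a leaf block is replaced by four finer children. */
static void refine(Node *n) {
    for (int i = 0; i < 4; i++)
        n->child[i] = node_new(n->level + 1);
}

/* Count leaves, i.e. the blocks that actually hold mesh data. */
static int count_leaves(const Node *n) {
    if (!n->child[0]) return 1;
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += count_leaves(n->child[i]);
    return sum;
}
```

Refining the root yields four leaf blocks; refining one of those replaces it with four finer blocks, giving seven leaves in total.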
How AMR Works
5
Initialize the mesh
FOR simulation time DO
    Execute stencil operations for all blocks        <- computation: reduced (less data in mesh)
    Exchange ghost layers with neighbor nodes
    IF time to remesh
        Calculate remeshing criterion for all blocks <- remeshing: overhead
        Refine or consolidate blocks
        Balance the mesh
    ENDIF
    IF time to load balance
        Apply load balancing algorithm               <- load balancing: overhead
    ENDIF
ENDFOR
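The remeshing criterion in the loop above could, for example, be a gradient-style indicator; the one below is an illustrative choice, not Daino's actual criterion:

```c
#include <assert.h>
#include <math.h>

/* Illustrative remeshing criterion: flag a block for refinement when
 * the largest jump between neighboring cells exceeds a threshold.
 * u is one block's data (n cells); a smooth block stays unrefined. */
static int needs_refinement(const double *u, int n, double threshold) {
    double max_jump = 0.0;
    for (int i = 1; i < n; i++) {
        double jump = fabs(u[i] - u[i - 1]);
        if (jump > max_jump) max_jump = jump;
    }
    return max_jump > threshold;
}
```

A block with a sharp front would be refined, while a slowly varying one could instead be a candidate for consolidation.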
AMR on GPUs
6
Hard to achieve efficient AMR with GPUs
Few existing AMR frameworks support GPUs, and those that do have limitations:
User must provide code optimized for GPU
Scalability problems due to CPU-GPU data movement
No speedup-bound model
Contributions of our framework:
1. Architecture-neutral interface
2. Scalable AMR via a data-centric model
3. Speedup model
Framework for Efficient AMR
7
A compiler and runtime
Input:
Serial code applying stencil on a uniform grid
User adds directives to identify relevant data arrays
Architecture-neutral
Output:
Executable binary for target architecture
Code is parallel and optimized for GPU (MPI+CUDA)
#pragma daino kernel
void 3D_alloy(..) {
    #pragma daino data (Nx,Ny,Nz) {p, u, dpt, no, o;}
    ... kernel code ...
}
Architecture-neutral Interface (1 of 2)
8
[Figure: Existing AMR frameworks require the user to supply GPU-optimized code, e.g. a CUDA kernel (__global__ 3D_alloy) for a GPU AMR executable or an OpenMP version (#pragma omp for) for a CPU AMR executable. Our framework instead takes uniform-mesh serial C code and generates both GPU and CPU AMR executables.]
Two benefits:
- Productivity
- Ability to apply low-level GPU optimizations
#pragma dno kernel
void func(float ***a, float ***b, ..) {
    #pragma dno data domName(i, j, k)
        a, b;
    #pragma dno timeloop
    for (int t = 0; t < TIME_MAX; t++) {
        for (int i = 1; i < NX - 1; i++)       // interior cells only:
            for (int j = 1; j < NY - 1; j++) { // stencil reads i±1, j±1
                ... // computation not related to a and b
                for (int k = 0; k < NZ; k++) {
                    a[i][j][k] = c * (b[i-1][j][k]
                        + b[i+1][j][k] + b[i][j][k]
                        + b[i][j+1][k] + b[i][j-1][k]);
                }
            }
    }
}
Minimal example of using directives in our framework
Architecture-neutral Interface (2 of 2)
9
The directives identify: a target kernel, the data arrays and their iterators, and the target time loop.
Scalable AMR: Data-centric Model (1 of 2)
10
A data-centric approach:
Each computing element specializes in its own data
Blocks reside on the GPU; the octree data structure on the CPU
All operations touching block data migrate to the GPU
The CPU processes only the octree data structure
All kernels are data parallel (i.e. well-suited to GPU)
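The division of labor can be sketched in plain C, with host-side metadata standing in for the octree and a device-side array pool standing in for GPU memory; all names and layouts here are illustrative, not Daino's implementation:

```c
#include <assert.h>

#define NBLOCKS 2
#define BLOCK_N 4

/* Host side (stands in for the octree metadata on the CPU):
 * which blocks exist, at what level, and whether they are active. */
typedef struct { int level; int active; } BlockMeta;

/* Device side (stands in for per-GPU data arrays): the "CPU" below
 * never touches these values directly, it only launches work on them. */
static double dev_blocks[NBLOCKS][BLOCK_N];

/* "Kernel": operates purely on block data. */
static void fill_kernel(int b, double value) {
    for (int i = 0; i < BLOCK_N; i++)
        dev_blocks[b][i] = value;
}

/* CPU driver: walks the metadata and invokes kernels per active block. */
static void run_step(const BlockMeta *meta) {
    for (int b = 0; b < NBLOCKS; b++)
        if (meta[b].active)
            fill_kernel(b, (double)meta[b].level);
}
```

The point of the split is that every kernel body is data parallel over block cells, while the irregular tree logic stays on the host.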
11
Scalable AMR: Data-centric Model (2 of 2)
[Figure: Conceptual overview of data-centric GPU AMR (CPU + GPUs). The CPU memory holds only the octree (AMR metadata); each GPU memory (GPU0, GPU1, GPU2) holds its octants (data arrays). After initialization (copy initial arrays to the GPUs), each loop iteration invokes: the compute-stencil kernel; ghost-layer exchange (copy ghost layers); the post-stencil correction kernel; the error-estimation kernel, whose result is compared against the threshold δ; the refine kernel (error > δ) or the consolidate kernel (error < δ); then the CPU updates and balances the octree, moving blocks between GPUs. Finalization copies the final arrays back to the CPU.]
[1] Mohamed Wahib, Naoya Maruyama. Data-centric GPU-based Adaptive Mesh Refinement. IA^3'15: 5th Workshop on Irregular Applications: Architectures and Algorithms, co-located with SC'15.
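The refine kernel conceptually fills finer blocks from a coarse parent. A minimal 2-D sketch using piecewise-constant injection follows; real refine kernels typically interpolate at higher order, so this only illustrates the data movement:

```c
#include <assert.h>

#define NC 2           /* coarse block is NC x NC cells */
#define NF (2 * NC)    /* refined resolution doubles    */

/* Illustrative refine step: each coarse cell becomes a 2x2 patch of
 * fine cells holding the same value (piecewise-constant injection). */
static void refine_block(double coarse[NC][NC], double fine[NF][NF]) {
    for (int i = 0; i < NF; i++)
        for (int j = 0; j < NF; j++)
            fine[i][j] = coarse[i / 2][j / 2];
}
```

Consolidation is the inverse direction: fine cells are averaged back into their coarse parent cell.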
AMR promises reduced computation
Problem: the overhead of managing the hierarchical mesh
We project a speedup bound
Informs the framework designer of the efficiency of the AMR code
Compare achieved speedup vs. the projected upper-bound speedup
Takes AMR overhead into account
If the projected speedup is far from the achieved speedup:
Some AMR overhead(s) are not properly accounted for
Speedup Model
12
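The projection can be sketched as follows. The model here, a theoretical bound from the reduced cell count alone and a practical bound that additionally charges an overhead fraction, is an illustrative simplification, not the paper's exact formulas:

```c
#include <assert.h>

/* Theoretical AMR speedup over a uniform mesh: the computation shrinks
 * in proportion to the number of cells the adaptive mesh keeps. */
static double theoretical_speedup(double uniform_cells, double amr_cells) {
    return uniform_cells / amr_cells;
}

/* Practical projection: AMR time is the compute on fewer cells plus the
 * remeshing/load-balancing overhead, modeled as a fraction of compute. */
static double practical_speedup(double uniform_cells, double amr_cells,
                                double overhead_fraction) {
    return uniform_cells / (amr_cells * (1.0 + overhead_fraction));
}
```

If the measured speedup falls well below the practical projection, some AMR overhead is not being accounted for; if it approaches the projection, the generated code is near its efficiency limit.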
Framework Implementation (1 of 2)
13
Apply translations and optimizations as LLVM passes
[Figure: The Daino framework overview. Application C code is transformed into an optimized executable: annotated fixed-mesh code passes through the Clang front end into LLVM IR; Daino applies its translations and optimizations as passes over the IR, producing optimized LLVM-IR; the LLVM back end emits object files, which the linker combines with the Daino runtime (AMR library, communication library) into the adapted-mesh executable.]
Framework Implementation (2 of 2)
14
[Figure: Application C code is parsed into an AST. The stencil code is emitted as a GPU stencil kernel in NVVM IR; the translator also generates the refine, coarsen, and error kernels; NVPTX compiles the NVVM IR to PTX. An IR pass generates the application LLVM IR, including an AMR driver that invokes the kernels through the CUDA Driver API. The resulting object files are linked with the Daino runtime (AMR library, communication library) into the executable.]
Runtime Libraries
15
AMR management:
    Maintain the octree
    Orchestration of work
    Memory manager (especially important with GPUs)
Communication:
    MPI processes
    Halo data exchange
    Transparent access to blocks
    Moving blocks (load balancing)
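Halo data exchange conceptually copies each block's outermost interior cells into its neighbor's ghost cells. A serial 1-D sketch follows; the runtime does this between GPUs with MPI, and the names and layout here are illustrative:

```c
#include <assert.h>

#define N 4   /* interior cells per block */
#define G 1   /* ghost layer width        */

/* A 1-D block: G ghost cells, N interior cells, G ghost cells.
 * Layout: u[0..G-1] left ghosts, u[G..G+N-1] interior, u[G+N..] right ghosts. */
typedef struct { double u[N + 2 * G]; } Block;

/* Exchange ghost layers between two horizontally adjacent blocks. */
static void exchange_ghosts(Block *left, Block *right) {
    right->u[0]    = left->u[N];   /* left's last interior -> right's left ghost */
    left->u[N + G] = right->u[G];  /* right's first interior -> left's right ghost */
}
```

After the exchange, each block can apply its stencil to all interior cells without reaching into its neighbor's memory.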
Evaluation
16
Applications:
Hydrodynamics solver: 2nd-order directionally split hyperbolic schemes to solve the Euler equations [RTVD scheme modified from GAMER [1]]
Shallow-water solver: shallow water modeled by depth-averaging the Navier–Stokes equations [2nd-order Runge–Kutta method]
Phase-field simulation: 3D dendritic growth during binary alloy solidification [2] [time integration by the Allen–Cahn equation]
[1] H.-Y. Schive, U.-H. Zhang, and T. Chiueh. Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR. Int. J. High Perform. Comput. Appl., 26(4):367–377, Nov. 2012.
[2] T. Shimokawabe et al. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. SC'11.
Results (1 of 4)
17
Weak scaling of uniform mesh, hand-written, and automated AMR (GAMER-generated AMR included for hydrodynamics)
We use TSUBAME2.5 supercomputer (TokyoTech)
Up to 3,642 K20x GPUs
TSUBAME Grand Challenge Category A (full machine)
[Charts: runtime (seconds) vs. number of GPUs (16 to 3,600), comparing uniform mesh, auto AMR (Daino), hand-written AMR, and, for hydrodynamics, auto AMR (GAMER):
HYDRODYNAMICS (mesh size per GPU: 4,096³): annotated speedups of 9.4x and 8.5x over the uniform mesh
PHASE-FIELD (mesh size per 16 GPUs: 4,096x512x512): annotated speedups of 1.78x and 1.66x
SHALLOW-WATER (mesh size per GPU: 8,192³): annotated speedups of 3.8x and 2.9x]
Results (2 of 4)
18
Strong scaling of uniform mesh, hand-written, and automated AMR (GAMER-generated AMR included for hydrodynamics)
Notes:
Phase-field achieves a 1.7x speedup
The original implementation is the 2011 Gordon Bell winner
Daino is faster than the GAMER-generated AMR version
GAMER is a leading framework for AMR on GPUs
[Charts: runtime (seconds) vs. number of GPUs (16 to 3,600), comparing uniform mesh, auto AMR (Daino), hand-written AMR, and, for hydrodynamics, auto AMR (GAMER):
PHASE-FIELD (mesh size: 4,096³): annotated speedups of 1.7x and 1.3x
HYDRODYNAMICS (mesh size per GPU: 4,096³): annotated speedups of 9.6x and 7.4x
SHALLOW-WATER (mesh size per GPU: 8,192³): annotated speedups of 4.1x and 3.2x]
Results (3 of 4)
19
Overhead of the AMR framework (weak scaling):
AMR overhead grows from 12% on 16 GPUs to 16% on 3,600 GPUs
Remeshing kernels are well-suited to the GPU
Results (4 of 4)
Speedup: measured vs. projected. M is measured, P is the practical AMR speedup projection, and T is the theoretical AMR speedup projection.
20
Efficiency of transformation:
Achieved speedup > 86% of practical limit
[Charts: speedup vs. number of GPUs for HYDRODYNAMICS (up to ~10x), PHASE-FIELD (up to ~2x), and SHALLOW-WATER (up to ~4.5x), each plotting the measured speedup against the practical and theoretical projections.]
Problem:
AMR is one of the paths to multi-scale exascale applications
Producing efficient AMR code is hard (especially for GPU)
Solution:
A framework for producing efficient AMR code (for GPUs)
Architecture-independent interface provided to the user
A speedup model for quantifying the efficiency of AMR code
Key results: We evaluate three AMR applications
Speedups & scalability comparable to hand-written code
(3,642 K20x GPUs)
Summary
21
Future Work
Expand Daino
Incorporate Daino's GPU backend in other AMR frameworks
Work in progress on porting new applications (CFD)
Supporting user-specified boundary conditions, equations of state, and flux corrections
Extend support to Intel Xeon Phi (KNL)
We already introduced experimental OpenMP support (not fully optimized)
Leverage the speedup model analysis
Auto-tuning
22
Daino will be publicly released at:
http://github.com/wahibium/Daino
Thank you for listening.
Questions?
23