TRANSCRIPT
© 2011 ANSYS, Inc. March 27, 2014
Acceleration of Multi-Grid Linear Solver inside ANSYS FLUENT using AmgX
Sunil Sathe
March 27, 2014
Outline
• Motivation
• Usage and defaults
• Supported configurations
• Build/run-time architecture
• CPU/GPU solver computation
• GPU solver steps
• Parallel considerations
• Performance benefits
• Future directions
Motivation
• GPUs are getting more powerful
• GPU core counts are increasing
• GPU memory is getting big enough to fit a large CFD problem
• Crucial for FLUENT to be able to use the power of GPUs for simulation
• Problems do exist in CFD which can use large computing power
• The coupled solver spends 60-70% of its time solving the linear equation system
• Stiff-chemistry species problems can spend 90-95% of their time in the ODE solver
• Radiation models, depending on their complexity, can consume the majority of the processing time
• All good reasons to explore GPUs
Usage and Defaults
• Command line: “fluent 3d -t4 -gpgpu=1”
• The coupled solver uses the GPU by default for solving the linear equation system
• Scalar equations can be solved on the GPU via the TUI command: /solve/set/amg-options/amg-gpgpu-execution
Supported Hardware Configurations

[Figure: valid and invalid CPU/GPU node layouts. Invalid examples: some nodes with 16 processes and some with 12 processes; some nodes with 2 GPUs and some with 1 GPU; 15 processes not divisible by 2 GPUs.]

● Homogeneous process distribution
● Homogeneous GPU selection
● Number of processes must be an exact multiple of the number of GPUs
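The three constraints above can be expressed as a small validity check. This is a conceptual sketch only (the function and its inputs are hypothetical, not FLUENT code); it takes per-node process and GPU counts and mirrors the invalid examples in the figure:

```python
# Hypothetical validity check for the constraints above: every node must run
# the same number of processes, expose the same number of GPUs, and the
# process count must be an exact multiple of the GPU count.
def config_is_supported(procs_per_node, gpus_per_node):
    """procs_per_node, gpus_per_node: one entry per compute node."""
    if len(set(procs_per_node)) != 1:      # homogeneous process distribution
        return False
    if len(set(gpus_per_node)) != 1:       # homogeneous GPU selection
        return False
    procs, gpus = procs_per_node[0], gpus_per_node[0]
    return gpus > 0 and procs % gpus == 0  # exact multiple of GPU count

print(config_is_supported([16, 16], [2, 2]))   # True
print(config_is_supported([16, 12], [2, 2]))   # False: mixed process counts
print(config_is_supported([15, 15], [2, 2]))   # False: 15 not divisible by 2
```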
Build/Run-time Architecture

[Figure: FLUENT links MPI wrappers and an AmgX stub at compile time; the MPI library (PCMPI, Intel, or OpenMPI — pick one at run time), AmgX, and the CUDA libraries/drivers are loaded dynamically at run time.]

● The FLUENT package includes AmgX and the necessary CUDA libraries
● The latest drivers are required to be installed by the user
● AmgX is dynamically loaded when selected to run on GPU
● The user selects the MPI at run time
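The stub-plus-dynamic-load arrangement above means the real library is touched only when GPU execution is requested. A minimal sketch of that pattern (library name and function are assumptions for illustration, not FLUENT internals):

```python
import ctypes
import ctypes.util

def load_amgx_if_requested(use_gpu):
    # Mimics the stub/dynamic-load split: when GPU execution is not selected,
    # no shared library (and hence no CUDA/driver dependency) is touched.
    if not use_gpu:
        return None
    # "amgxsh" is a hypothetical library name used for illustration.
    path = ctypes.util.find_library("amgxsh")
    if path is None:
        raise RuntimeError("AmgX library not found; check installation/drivers")
    return ctypes.CDLL(path)   # resolve symbols only now, at run time

# CPU-only run: the stub path, nothing is loaded.
print(load_amgx_if_requested(False))   # None
```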
CPU/GPU Solver Computation

[Figure: two iteration loops, each alternating discretization and linear equation solution. Red indicates inactive/idling, green indicates active/computing. CPU-only computation: both phases are CPU-powered. CPU-GPU computation: discretization is CPU-powered, linear equation solution is GPU-powered.]
CPU/GPU Solver Computation

[Figure: agglomeration of per-process matrices onto the GPU-host processes, transfer to GPU, solution on GPU, transfer back to CPU, and distribution of the solution. Red indicates inactive/idling, green indicates active/computing.]

● All CPU processes consolidate their matrices onto the processes that will host GPUs
● The GPU-host processes remain active while the other processes idle
● The active processes upload their consolidated matrices to the GPUs
● GPUs solve the linear system
● Any communication needed between GPUs is facilitated by the host processes
● The solution obtained on the host processes is scattered to all the processes
● All the processes become active and continue with subsequent work
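The consolidate/solve/distribute flow above can be sketched in plain Python (no MPI; the gather and scatter here stand in for the actual inter-process transfers, and the ×10 step stands in for the GPU solve — all names are illustrative):

```python
# Conceptual sketch of the agglomeration flow: each process's local matrix
# rows are gathered onto one GPU-host process, solved there, then the
# solution is scattered back to the owning processes.
def consolidate(local_rows_per_proc):
    """Gather every process's rows onto the host, recording offsets."""
    consolidated, offsets = [], []
    for rows in local_rows_per_proc:
        offsets.append(len(consolidated))
        consolidated.extend(rows)
    return consolidated, offsets

def distribute(solution, offsets, counts):
    """Scatter the host-process solution back to each process's slice."""
    return [solution[o:o + c] for o, c in zip(offsets, counts)]

local = [[1.0, 2.0], [3.0], [4.0, 5.0]]        # 3 processes, 1 GPU host
merged, offs = consolidate(local)
sol = [x * 10 for x in merged]                 # stand-in for the GPU solve
parts = distribute(sol, offs, [len(r) for r in local])
print(parts)   # [[10.0, 20.0], [30.0], [40.0, 50.0]]
```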
GPU Solver Steps

[Figure: setup (create configuration, create resources, create solver, create matrix/vector), CPU→GPU transfer (upload matrix/RHS or update coefficients/RHS), solve on GPU, GPU→CPU transfer (download solution).]

● Setup is expensive but can be performed once if solving only 1 “type” of equation on GPU
● Matrix upload is expensive, but if solving only 1 “type” of equation on GPU it can be limited to updating the coefficients rather than re-uploading the matrix sparsity structure
● The GPU solver has several control parameters which can be adjusted to get the best performance
● The cost of data transfer between CPU and GPU is unavoidable
● Good performance can be expected only if the data transfer cost is minor compared to the solution cost
Parallel Considerations

[Figure: 4×4 mesh of cells 1-16 partitioned across four processes, with PE1's interior and exterior cells with respect to PE0 and PE3 highlighted.]

● Mesh, cell ids and partitions
  ● Yellow: PE0
  ● Blue: PE1
  ● Green: PE2
  ● Pink: PE3
● Neighbors of PE1
  ● 2 neighbors: PE0, PE3
  ● Interior cells with PE0: 5, 7
  ● Interior cells with PE3: 7, 8
  ● Exterior cells with PE0: 2, 4
  ● Exterior cells with PE3: 13, 14
Parallel Considerations

[Figure: 16×16 sparsity pattern of the assembled matrix, rows/columns 1-16 grouped by partition (PE0-PE3). For PE1: 2 neighbors (PE0, PE3); interior equations with PE0: 5, 7; interior equations with PE3: 7, 8; exterior equations with PE0: 2, 4; exterior equations with PE3: 13, 14.]

● Neighbor information is constructed on the host processes
● Neighbor information is uploaded to the GPUs
● Neighbor information changes when the mesh is updated/repartitioned
● The matrix sparsity structure and neighbor information need to be resent to the GPU when changed
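The interior/exterior bookkeeping above can be sketched from a partition map and cell adjacency (a minimal toy example, not the FLUENT data structures; interior cells are owned cells touching another partition, exterior cells are the neighbor-owned halo):

```python
# Conceptual sketch: for one partition `pe`, find interior cells (owned by
# pe, adjacent to another partition) and exterior/halo cells (owned by a
# neighbor partition, adjacent to pe), keyed by the neighboring PE.
def neighbor_lists(owner, adjacency, pe):
    interior, exterior = {}, {}
    for c, nbrs in adjacency.items():
        for n in nbrs:
            if owner[c] == pe and owner[n] != pe:
                interior.setdefault(owner[n], set()).add(c)
            if owner[c] != pe and owner[n] == pe:
                exterior.setdefault(owner[c], set()).add(c)
    return interior, exterior

owner = {1: 0, 2: 0, 3: 1, 4: 1}                   # cell -> owning process
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]} # 1-D chain of 4 cells
interior, exterior = neighbor_lists(owner, adjacency, pe=1)
print(interior)   # {0: {3}}  cell PE1 owns that touches PE0
print(exterior)   # {0: {2}}  PE0-owned halo cell that PE1 reads
```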
Performance Benefits

● 9.6M tet cells, pressure-based coupled, internal flow, double precision
● CPU: Sandy Bridge, E5-2667, 12 cores = 2 sockets × 6 cores, 2.9 GHz, 128 GB
● GPU: Kepler, Tesla K20Xm, 2688 CUDA cores, 0.73 GHz, 6.14 GB
Performance Benefits

[Figure: speedup comparisons at equal license counts — 4 licenses (most benefit), 8 licenses (some benefit), 16 licenses (less benefit).]

● Adding a GPU to a computation may or may not yield a speedup
● A slowdown is possible because while the GPU is being used the CPU processes are idling
● Adding a GPU to a high CPU process count leaves all those processes idling and may result in a slowdown
● It is much more beneficial to add a GPU to a low CPU process count
● For the same number of licenses, GPUs can be used beneficially in conjunction with low CPU process counts
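The trend above can be made concrete with a deliberately simplified timing model (all numbers are assumptions for illustration, not measurements from the slides): the non-solve work scales with the CPU process count, while the GPU solve time stays roughly fixed, so the relative benefit shrinks as processes are added.

```python
# Illustrative (assumed) model: the GPU accelerates only the linear solve,
# which takes a fixed time t_gpu, while the remaining CPU work scales
# ideally with the process count. solve_frac ~ 0.65 echoes the 60-70%
# coupled-solver figure from the Motivation slide; t_gpu = 0.10 is assumed.
def speedup(n_procs, solve_frac=0.65, t_gpu=0.10):
    cpu_total = 1.0 / n_procs                 # ideal CPU-only scaling
    other = cpu_total * (1 - solve_frac)      # discretization etc., on CPU
    gpu_total = other + t_gpu                 # GPU solve does not shrink
    return cpu_total / gpu_total

print(round(speedup(4), 2))    # > 1: gain at a low process count
print(round(speedup(16), 2))   # < 1: slowdown at a high process count
```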
Future Directions
• Accelerate radiation modeling with the discrete ordinates method by using AmgX
• Provide user control to pick and choose which equations to run on GPU
• Explore possibilities of further improvements via advanced AmgX features like direct GPU communication
• Explore possibilities of performance improvements for the segregated solver