TRANSCRIPT
© 2011 ANSYS, Inc. March 27, 2014
Acceleration of Multi-Grid Linear Solver inside ANSYS FLUENT using AmgX
Sunil Sathe
March 27, 2014
Outline
• Motivation
• Usage and defaults
• Supported configurations
• Build/run-time architecture
• CPU/GPU solver computation
• GPU solver steps
• Parallel considerations
• Performance benefits
• Future directions
Motivation
• GPUs are getting more powerful
• GPU core counts are increasing
• GPU memory is getting big enough to fit a large CFD problem
• Crucial for FLUENT to be able to use the power of GPUs for simulation
• Problems do exist in CFD which can use large computing power
• The coupled solver spends 60-70% of its time solving the linear equation system
• Stiff-chemistry species problems can spend 90-95% of their time in the ODE solver
• Radiation models, depending on their complexity, can consume the majority of the processing time
• All good reasons to explore GPUs
Usage and Defaults
• Command line: “fluent 3d -t4 -gpgpu=1”
• The coupled solver uses the GPU by default for solving the linear equation system
• Scalar equations can be solved on the GPU via the TUI command: /solve/set/amg-options/amg-gpgpu-execution
Supported Hardware Configurations

[Figure: valid and invalid CPU/GPU node layouts. Invalid examples: some nodes with 16 processes and some with 12 processes; some nodes with 2 GPUs and some with 1 GPU; 15 processes not divisible by 2 GPUs.]

● Homogeneous process distribution
● Homogeneous GPU selection
● Number of processes must be an exact multiple of the number of GPUs
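The three constraints above can be expressed as a small validity check. This is a conceptual sketch only (the function and its inputs are hypothetical, not FLUENT code); it takes per-node process and GPU counts and mirrors the invalid examples in the figure:

```python
# Hypothetical validity check for the constraints above: every node must run
# the same number of processes, expose the same number of GPUs, and the
# process count must be an exact multiple of the GPU count.
def config_is_supported(procs_per_node, gpus_per_node):
    """procs_per_node, gpus_per_node: one entry per compute node."""
    if len(set(procs_per_node)) != 1:      # homogeneous process distribution
        return False
    if len(set(gpus_per_node)) != 1:       # homogeneous GPU selection
        return False
    procs, gpus = procs_per_node[0], gpus_per_node[0]
    return gpus > 0 and procs % gpus == 0  # exact multiple of GPU count

print(config_is_supported([16, 16], [2, 2]))   # True
print(config_is_supported([16, 12], [2, 2]))   # False: mixed process counts
print(config_is_supported([15, 15], [2, 2]))   # False: 15 not divisible by 2
```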
Build/Run-time Architecture

[Figure: FLUENT links MPI wrappers and an AmgX stub at compile time; the MPI library (PCMPI, Intel, or OpenMPI — pick one at run time), AmgX, and the CUDA libraries/drivers are loaded dynamically at run time.]

● The FLUENT package includes AmgX and the necessary CUDA libraries
● The latest drivers are required to be installed by the user
● AmgX is dynamically loaded when selected to run on GPU
● The user selects the MPI at run time
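The stub-plus-dynamic-load arrangement above means the real library is touched only when GPU execution is requested. A minimal sketch of that pattern (library name and function are assumptions for illustration, not FLUENT internals):

```python
import ctypes
import ctypes.util

def load_amgx_if_requested(use_gpu):
    # Mimics the stub/dynamic-load split: when GPU execution is not selected,
    # no shared library (and hence no CUDA/driver dependency) is touched.
    if not use_gpu:
        return None
    # "amgxsh" is a hypothetical library name used for illustration.
    path = ctypes.util.find_library("amgxsh")
    if path is None:
        raise RuntimeError("AmgX library not found; check installation/drivers")
    return ctypes.CDLL(path)   # resolve symbols only now, at run time

# CPU-only run: the stub path, nothing is loaded.
print(load_amgx_if_requested(False))   # None
```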
CPU/GPU Solver Computation

[Figure: two iteration loops, each alternating discretization and linear equation solution. Red indicates inactive/idling, green indicates active/computing. CPU-only computation: both phases are CPU-powered. CPU-GPU computation: discretization is CPU-powered, linear equation solution is GPU-powered.]
CPU/GPU Solver Computation

[Figure: agglomeration of per-process matrices onto the GPU-host processes, transfer to GPU, solution on GPU, transfer back to CPU, and distribution of the solution. Red indicates inactive/idling, green indicates active/computing.]

● All CPU processes consolidate their matrices onto the processes that will host GPUs
● The GPU-host processes remain active while the other processes idle
● The active processes upload their consolidated matrices to the GPUs
● GPUs solve the linear system
● Any communication needed between GPUs is facilitated by the host processes
● The solution obtained on the host processes is scattered to all the processes
● All the processes become active and continue with subsequent work
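The consolidate/solve/distribute flow above can be sketched in plain Python (no MPI; the gather and scatter here stand in for the actual inter-process transfers, and the ×10 step stands in for the GPU solve — all names are illustrative):

```python
# Conceptual sketch of the agglomeration flow: each process's local matrix
# rows are gathered onto one GPU-host process, solved there, then the
# solution is scattered back to the owning processes.
def consolidate(local_rows_per_proc):
    """Gather every process's rows onto the host, recording offsets."""
    consolidated, offsets = [], []
    for rows in local_rows_per_proc:
        offsets.append(len(consolidated))
        consolidated.extend(rows)
    return consolidated, offsets

def distribute(solution, offsets, counts):
    """Scatter the host-process solution back to each process's slice."""
    return [solution[o:o + c] for o, c in zip(offsets, counts)]

local = [[1.0, 2.0], [3.0], [4.0, 5.0]]        # 3 processes, 1 GPU host
merged, offs = consolidate(local)
sol = [x * 10 for x in merged]                 # stand-in for the GPU solve
parts = distribute(sol, offs, [len(r) for r in local])
print(parts)   # [[10.0, 20.0], [30.0], [40.0, 50.0]]
```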
GPU Solver Steps

[Figure: setup (create configuration, create resources, create solver, create matrix/vector), CPU→GPU transfer (upload matrix/RHS or update coefficients/RHS), solve on GPU, GPU→CPU transfer (download solution).]

● Setup is expensive but can be performed once if solving only 1 “type” of equation on GPU
● Matrix upload is expensive, but if solving only 1 “type” of equation on GPU it can be limited to updating the coefficients rather than re-uploading the matrix sparsity structure
● The GPU solver has several control parameters which can be adjusted to get the best performance
● The cost of data transfer between CPU and GPU is unavoidable
● Good performance can be expected only if the data transfer cost is minor compared to the solution cost
Parallel Considerations

[Figure: 4×4 mesh of cells 1-16 partitioned across four processes, with PE1's interior and exterior cells with respect to PE0 and PE3 highlighted.]

● Mesh, cell ids and partitions
  ● Yellow: PE0
  ● Blue: PE1
  ● Green: PE2
  ● Pink: PE3
● Neighbors of PE1
  ● 2 neighbors: PE0, PE3
  ● Interior cells with PE0: 5, 7
  ● Interior cells with PE3: 7, 8
  ● Exterior cells with PE0: 2, 4
  ● Exterior cells with PE3: 13, 14
Parallel Considerations

[Figure: 16×16 sparsity pattern of the assembled matrix, rows/columns 1-16 grouped by partition (PE0-PE3). For PE1: 2 neighbors (PE0, PE3); interior equations with PE0: 5, 7; interior equations with PE3: 7, 8; exterior equations with PE0: 2, 4; exterior equations with PE3: 13, 14.]

● Neighbor information is constructed on the host processes
● Neighbor information is uploaded to the GPUs
● Neighbor information changes when the mesh is updated/repartitioned
● The matrix sparsity structure and neighbor information need to be resent to the GPU when changed
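The interior/exterior bookkeeping above can be sketched from a partition map and cell adjacency (a minimal toy example, not the FLUENT data structures; interior cells are owned cells touching another partition, exterior cells are the neighbor-owned halo):

```python
# Conceptual sketch: for one partition `pe`, find interior cells (owned by
# pe, adjacent to another partition) and exterior/halo cells (owned by a
# neighbor partition, adjacent to pe), keyed by the neighboring PE.
def neighbor_lists(owner, adjacency, pe):
    interior, exterior = {}, {}
    for c, nbrs in adjacency.items():
        for n in nbrs:
            if owner[c] == pe and owner[n] != pe:
                interior.setdefault(owner[n], set()).add(c)
            if owner[c] != pe and owner[n] == pe:
                exterior.setdefault(owner[c], set()).add(c)
    return interior, exterior

owner = {1: 0, 2: 0, 3: 1, 4: 1}                   # cell -> owning process
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]} # 1-D chain of 4 cells
interior, exterior = neighbor_lists(owner, adjacency, pe=1)
print(interior)   # {0: {3}}  cell PE1 owns that touches PE0
print(exterior)   # {0: {2}}  PE0-owned halo cell that PE1 reads
```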
Performance Benefits

● 9.6M tet cells, pressure-based coupled, internal flow, double precision
● CPU: Sandy Bridge, E5-2667, 12 cores = 2 sockets × 6 cores, 2.9 GHz, 128 GB
● GPU: Kepler, Tesla K20Xm, 2688 CUDA cores, 0.73 GHz, 6.14 GB
Performance Benefits

[Figure: speedup comparisons at equal license counts — 4 licenses (most benefit), 8 licenses (some benefit), 16 licenses (less benefit).]

● Adding a GPU to a computation may or may not yield a speedup
● A slowdown is possible because while the GPU is being used the CPU processes are idling
● Adding a GPU to a high CPU process count leaves all those processes idling and may result in a slowdown
● It is much more beneficial to add a GPU to a low CPU process count
● For the same number of licenses, GPUs can be used beneficially in conjunction with low CPU process counts
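The trend above can be made concrete with a deliberately simplified timing model (all numbers are assumptions for illustration, not measurements from the slides): the non-solve work scales with the CPU process count, while the GPU solve time stays roughly fixed, so the relative benefit shrinks as processes are added.

```python
# Illustrative (assumed) model: the GPU accelerates only the linear solve,
# which takes a fixed time t_gpu, while the remaining CPU work scales
# ideally with the process count. solve_frac ~ 0.65 echoes the 60-70%
# coupled-solver figure from the Motivation slide; t_gpu = 0.10 is assumed.
def speedup(n_procs, solve_frac=0.65, t_gpu=0.10):
    cpu_total = 1.0 / n_procs                 # ideal CPU-only scaling
    other = cpu_total * (1 - solve_frac)      # discretization etc., on CPU
    gpu_total = other + t_gpu                 # GPU solve does not shrink
    return cpu_total / gpu_total

print(round(speedup(4), 2))    # > 1: gain at a low process count
print(round(speedup(16), 2))   # < 1: slowdown at a high process count
```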
Future Directions
• Accelerate radiation modeling with the discrete ordinates method by using AmgX
• Provide user control to pick and choose which equations to run on GPU
• Explore possibilities of further improvements via advanced AmgX features like direct GPU communication
• Explore possibilities of performance improvements for the segregated solver