MSC Software Confidential

GPU Computing with MSC Nastran 2013

2013 Regional User Conference

Presented By: Srinivas Kodiyalam, NVIDIA

May 6, 2013

GPUs Accelerate Computing

MSC Nastran Uses Computing Power of GPUs for Faster Simulation

[Figure: CPU + GPU = Speed Up]

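Increasing Performance & Memory Bandwidth

[Figure: performance and memory bandwidth trend charts. Series: NVIDIA GPU (ECC off), x86 CPU, Double Precision: NVIDIA GPU, Double Precision: x86 CPU. The most recent GPU points are labeled Kepler.]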

NVIDIA GPU products relevant to MSC Nastran

Workstation GPUs (w/ fans)
Generation N-1, Fermi: Tesla C2075 (6 GB), Quadro 6000 (6 GB)
Generation N, Kepler: Tesla K20c (5 GB), Quadro K6000

Data Center GPUs – Server/Cluster
Generation N-1, Fermi: Tesla M2090 (6 GB), M2070 (6 GB)
Generation N, Kepler: Tesla K20X (6 GB), K20m (5 GB)

Costs of CAE static analysis

• The compute-intensive task is solving a large system of sparse linear equations
• Double precision computations
• The equation solver is an obvious place to employ a GPU to accelerate the solution

Equation Solver Cost for Engine Model Benchmarks

Model Size (DOF)   Solution Time (secs)   Time in Equation Solver (secs)   Fraction of total
0.7M               ~1200                  ~700                             54%
5M                 ~18000                 ~15500                           85%
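The solver fractions in the benchmark table bound how much a faster equation solver can help overall. A minimal sketch using Amdahl's law follows; the 5x solver speed-up is a hypothetical input chosen for illustration, not a figure from the slides.

```python
def overall_speedup(solver_fraction, solver_speedup):
    """Amdahl's law: only the equation-solver share of the run is accelerated."""
    return 1.0 / ((1.0 - solver_fraction) + solver_fraction / solver_speedup)


def speedup_limit(solver_fraction):
    """Upper bound on overall speed-up if the solver time dropped to zero."""
    return 1.0 / (1.0 - solver_fraction)


# Solver fractions taken from the "fraction of total" column of the table.
solver_fractions = {"0.7M DOF": 0.54, "5M DOF": 0.85}

# Hypothetical solver speed-up, assumed only for this illustration.
ASSUMED_SOLVER_SPEEDUP = 5.0

for model, fraction in solver_fractions.items():
    gain = overall_speedup(fraction, ASSUMED_SOLVER_SPEEDUP)
    limit = speedup_limit(fraction)
    print(f"{model}: overall {gain:.2f}x at 5x solver, limit {limit:.2f}x")
```

The larger model spends a larger share of its time in the solver, so it has more headroom for acceleration than the smaller one.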

Direct sparse solver workflow in MSC Nastran (MSCLDL, MSCLU)

In a proper order, do the following at each node:

Assembly from Global Stiffness & contribution blocks
Pivoting
Block factorization: diagonal decomposition, off-diagonal update, trailing matrix update

Most time-consuming matrix update operations on GPU

[Figure: tree diagram with nodes numbered 1 (bottom) to 11 (top)]
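The three block factorization steps named on this slide (diagonal decomposition, off-diagonal update, trailing matrix update) can be sketched for one dense front as below. This is an illustration under stated assumptions, not the MSCLDL or MSCLU implementation: it uses NumPy on the CPU, assumes a symmetric positive definite front so that a Cholesky factor can stand in for the LDL^T factorization, and skips pivoting. The function name `factor_front` is hypothetical.

```python
import numpy as np


def factor_front(front, k):
    """Eliminate the first k rows/columns of a dense SPD front.

    Returns the diagonal factor, the off-diagonal factor, and the Schur
    complement (the contribution block passed on to the parent node).
    """
    a11 = front[:k, :k]
    a21 = front[k:, :k]
    a22 = front[k:, k:]

    # Diagonal decomposition: A11 = L11 * L11^T
    l11 = np.linalg.cholesky(a11)

    # Off-diagonal update: L21 = A21 * L11^-T, computed as (L11^-1 * A21^T)^T
    l21 = np.linalg.solve(l11, a21.T).T

    # Trailing matrix update: S = A22 - L21 * L21^T
    # This matrix-matrix product dominates the cost for large fronts.
    schur = a22 - l21 @ l21.T
    return l11, l21, schur


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((6, 6))
    demo_front = a @ a.T + 6.0 * np.eye(6)  # small SPD test front
    _, _, contribution = factor_front(demo_front, 2)
    print(contribution.shape)
```

The trailing update is a dense matrix product, which is why this kind of operation maps well to a GPU once fronts are large.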

MSC Nastran 2013

Nastran direct equation solver is GPU accelerated

Sparse direct factorization (MSCLDL, MSCLU)

Real, Complex, Symmetric, Un-symmetric

Handles very large fronts with minimal use of pinned host memory

Lowest granularity GPU implementation of a sparse direct solver; solves unlimited sparse matrix sizes

Impacts several solution sequences:

High impact (SOL101, SOL108), Mid (SOL103), Low (SOL111, SOL400)

Supports multi-GPU, on both Linux and Windows

With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs; 1 GPU per matrix domain

NVIDIA GPUs: Tesla K20/K20X, Tesla M2090, Tesla C2075, Quadro 6000

Released with CUDA 5

Basics of GPU Computing with MSC Nastran

GPUs are an accelerator attached to an x86 CPU

GPUs cannot operate without an x86 CPU present

MSC Nastran GPU acceleration is user-transparent

Jobs launch and complete without additional user steps

CPU begins/ends job, GPU manages heavy computations

1. Nastran job launched on CPU
2. Solver operations sent to GPU
3. GPU sends results back to CPU
4. Nastran job completes on CPU

[Figure: schematic of an x86 CPU (cache, DDR memory) with a GPU accelerator (GDDR memory) attached through an I/O hub over PCI-Express; the markers 1 to 4 on the diagram correspond to the steps above]

MSC Nastran 2013
SMP + GPU acceleration of SOL101 and SOL103

[Figure: bar chart of speed-up relative to serial, axis 0 to 6; higher is better]

Configuration   SOL101, 2.4M rows, 42K front   SOL103, 2.6M rows, 18K front
serial          1X                             1X
4c              2.7X                           1.9X
4c+1g           6X                             2.8X

Server node: Sandy Bridge E5-2670 (2.6GHz), Tesla K20X GPU, 128 GB memory

Lanczos solver (SOL 103):
Sparse matrix factorization
Iterate on a block of vectors (solve)
Orthogonalization of vectors
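The three Lanczos phases listed for SOL 103 (factorization, solve, orthogonalization) can be seen in a small shift-invert Lanczos sketch. It simplifies heavily and is not the Nastran algorithm: dense NumPy matrices replace the sparse solver, a single vector replaces the block of vectors, the mass matrix is taken as the identity, and the shifted matrix is assumed positive definite. The function name `shift_invert_lanczos` is hypothetical.

```python
import numpy as np


def shift_invert_lanczos(k_mat, sigma, steps, seed=0):
    """Approximate eigenvalues of a symmetric matrix near the shift sigma."""
    n = k_mat.shape[0]

    # 1. Matrix factorization, done once: K - sigma*I = L * L^T
    chol = np.linalg.cholesky(k_mat - sigma * np.eye(n))

    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    basis = [q / np.linalg.norm(q)]
    alphas, betas = [], []

    for j in range(steps):
        # 2. Solve with the factor: w = (K - sigma*I)^-1 * q_j
        y = np.linalg.solve(chol, basis[j])
        w = np.linalg.solve(chol.T, y)
        alphas.append(basis[j] @ w)

        # 3. Orthogonalization against all previous vectors (two passes)
        for _ in range(2):
            for v in basis:
                w -= (v @ w) * v

        beta = np.linalg.norm(w)
        if j == steps - 1 or beta < 1e-12:
            break
        betas.append(beta)
        basis.append(w / beta)

    m = len(alphas)
    off = betas[: m - 1]
    t_mat = np.diag(alphas) + np.diag(off, 1) + np.diag(off, -1)

    # Eigenvalues theta of T approximate 1 / (lambda - sigma).
    theta = np.linalg.eigvalsh(t_mat)
    return np.sort(sigma + 1.0 / theta)
```

The factorization happens once per shift while the solve and orthogonalization repeat every iteration, which is consistent with the smaller GPU gain the chart shows for SOL103 compared with SOL101.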

NVH with MSC Nastran 2013
Coupled Structural-Acoustics simulation with SOL108

Europe Auto OEM: 710K nodes, 3.83M elements
100 frequency increments (FREQ1)
Direct Sparse solver

[Figure: bar chart of elapsed time (minutes), axis 0 to 1000; lower is better. Each bar is labeled with its speed-up over serial.]

Configuration      Speed-up
serial             1X
1c + 1g            4.8X
4c (smp)           2.7X
4c + 1g            5.2X
8c (dmp=2)         5.5X
8c + 2g (dmp=2)    11X
16c (dmp=4)        11X
16c + 4g (dmp=4)   23.5X

Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory
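SOL108 is a direct frequency response analysis, so each of the 100 FREQ1 increments needs its own complex factorization and solve. A minimal dense sketch of that loop follows. It assumes a purely structural system with stiffness, mass and viscous damping matrices; the coupled structural-acoustic case in the benchmark adds fluid terms that are not shown. The function name is hypothetical.

```python
import numpy as np


def direct_frequency_response(k, m, c, f, freqs_hz):
    """Solve (K - w^2 M + i w C) u = f independently at each frequency."""
    responses = []
    for freq in freqs_hz:
        omega = 2.0 * np.pi * freq
        # Complex dynamic stiffness for this frequency increment
        z = k - omega**2 * m + 1j * omega * c
        # One factorization and solve per frequency; no data is shared
        # between increments, so they can be distributed freely.
        responses.append(np.linalg.solve(z, f))
    return np.array(responses)


if __name__ == "__main__":
    k = np.array([[2.0, -1.0], [-1.0, 2.0]])
    m = np.eye(2)
    c = 0.05 * k
    f = np.array([1.0, 0.0])
    u = direct_frequency_response(k, m, c, f, np.linspace(0.05, 0.5, 10))
    print(np.abs(u).max(axis=0))
```

Because the increments are independent, this workload splits cleanly across DMP processes and their GPUs.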

Solution Price-Performance Gain

[Figure: bar chart of factors gain over base license results, axis 0.0 to 10.0. Series: CPU Speed-up, GPU Speed-up, Solution Cost.]

Configuration                                 Speed-up     Solution Cost
Nastran SMP License, 1 Core                   1.0 (CPU)    1.0
Nastran SMP, 4 Cores                          2.7 (CPU)    1.0
Nastran SMP + GPU License, 4 Core + 1 GPU     5.2 (GPU)    1.1
Nastran DMP, 8 Cores                          5.5 (CPU)    1.2
Nastran DMP + GPU License, 8 Cores + 2 GPUs   11.1 (GPU)   1.35

Extra 13% cost yields 200% performance (over 8 cores)

Solution Cost Basis
- Structures Package (Base SMP license)
- Exterior Acoustics Package
- Implicit HPC Package (DMP Network License)
- GPU License
- $10K for System cost
- $4K for 2x Tesla 20-series
* 1 year lease for SW pricing

Performance Basis
SOL108 Vehicle Model:
- NVH analysis
- Structural-Acoustics
- 100 FREQ1 increments

Results from PSG cluster node, 2x Sandy Bridge, 2.6GHz, 128GB memory, 2x Tesla K20X, Linux/RHEL 6.2

NOTE: Based on MSC Nastran 2013
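The "extra 13% cost yields 200% performance" statement can be checked from the speed-up and cost factors on this slide. The sketch below assumes the chart pairs the values as 1 core (1.0, 1.0), 4 cores (2.7, 1.0), 4 cores + 1 GPU (5.2, 1.1), 8 cores (5.5, 1.2) and 8 cores + 2 GPUs (11.1, 1.35). That pairing is inferred from the extracted numbers, not stated explicitly in the source.

```python
# (configuration, speed-up over 1 core, solution cost relative to 1 core)
# The pairing of values to configurations is an assumption (see lead-in).
CONFIGS = [
    ("SMP, 1 core", 1.0, 1.0),
    ("SMP, 4 cores", 2.7, 1.0),
    ("SMP + GPU, 4 cores + 1 GPU", 5.2, 1.1),
    ("DMP, 8 cores", 5.5, 1.2),
    ("DMP + GPU, 8 cores + 2 GPUs", 11.1, 1.35),
]


def price_performance(speedup, cost):
    """Speed-up obtained per unit of relative solution cost."""
    return speedup / cost


def gain_over(base, other):
    """Return (extra cost in percent, performance ratio) of other vs. base."""
    _, base_speedup, base_cost = base
    _, other_speedup, other_cost = other
    extra_cost_pct = (other_cost / base_cost - 1.0) * 100.0
    return extra_cost_pct, other_speedup / base_speedup


for name, speedup, cost in CONFIGS:
    print(f"{name}: {price_performance(speedup, cost):.2f} speed-up per unit cost")

extra_cost, perf_ratio = gain_over(CONFIGS[3], CONFIGS[4])
print(f"8 cores + 2 GPUs vs 8 cores: +{extra_cost:.1f}% cost, {perf_ratio:.2f}x performance")
```

Under that assumed pairing, the 8-core comparison gives about 12.5% extra cost for about twice the performance, which is consistent with the slide's rounded claim.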

NVH with MSC Nastran 2013
Trimmed Car Body Frequency Response with SOL108

USA Auto OEM: 1.2M nodes, 7.47M DOF
Shells (CQUAD4): 1.04M
Solids (CTETRA): 0.1M
100 frequency increments (FREQ1)

[Figure: bar chart of elapsed time in hours, axis 0 to 80; lower is better. Each bar is labeled with its speed-up over serial.]

Configuration           Speed-up
serial                  1X
smp 4c                  2.5X
smp 4c+1g (x1 node)     4.4X
dmp 4c+1g (x2 nodes)    6.8X
dmp 4c+1g (x3 nodes)    9X

NVH with MSC Nastran 2013
Engine Model Modal Frequency with SOL111

Japan Auto OEM (Source: MSC Software, Japan)
Nodes 1.4M, Elements 0.78M
Mainly TETRA10
Modes: 104 (2500 Hz)
Front size: 23,718

[Figure: stacked bar chart "CPU Time", Time (sec.), axis 0 to 10000. Bars: 1CPU (9052 sec.), 1CPU+1GPU (5116 sec.). Legend: FBS+Matrix-vector Multiply, Shift+Decomposition, LANCZOS RUN, Resvec. Annotation: Sparse Decomposition only. Segment values shown on the chart: 2848, 1000, 614, 586, 2807, 901, 2303, 2168.]

[Figure: stacked bar chart "Elapsed Time", Time (sec.), axis 0 to 12000. Bars: 1CPU (9702 sec.), 1CPU+1GPU (5647 sec.). Legend: Pre_Eigenvalue, Eigenvalue, Resvec, Post_Eigenvalue. Segment values shown on the chart: 335, 239, 2856, 1027, 6180, 4120, 291, 223.]

1.7x speedup

Recommendations for GPU Acceleration

• Key factors for model selection for GPU acceleration:
  • Enough work
    • FLOPs, Solid & Shell models with dense fronts
    • Estimated Max. front size > 10000 (real), > 5000 (complex)
  • Minimize IO
    • Sufficient system (host) memory
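The front-size guideline above can be written as a simple screening check. This is a sketch only: the thresholds are the ones quoted on the slide, while the function name is hypothetical and does not correspond to any MSC Nastran option. Sufficient host memory and low disk I/O are still required and are not tested here.

```python
# Thresholds quoted on the slide for the estimated maximum front size.
REAL_MIN_FRONT = 10_000
COMPLEX_MIN_FRONT = 5_000


def is_gpu_candidate(max_front_size, complex_data=False):
    """Return True when the estimated maximum front exceeds the guideline."""
    threshold = COMPLEX_MIN_FRONT if complex_data else REAL_MIN_FRONT
    return max_front_size > threshold


# Front sizes reported elsewhere in this deck.
for label, front in [("SOL101, 42K front", 42_000),
                     ("SOL103, 18K front", 18_000),
                     ("SOL111 engine model", 23_718)]:
    print(label, is_gpu_candidate(front))
```

All three benchmark fronts in the deck sit above the real-data threshold.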

Recommended Configurations with MSC Nastran

Workstation user (smaller models; single GPU, SMP)

Dual CPU

96 GB RAM*

1x Quadro 4000/6000 GPU

1x Tesla K20c GPU

* Memory requirements dictated by problem size to minimize disk I/O.

Server/Cluster user (multi-GPU, SMP+DMP)

Each node of IB cluster:

Dual CPU

128-256 GB RAM*

2x Tesla K20/K20X GPU

6x 600 GB SAS 15K disks (scratch; RAID0)

Marc 2013 – GPU Acceleration of US Auto OEM model

2.5 Million Elements
10 Million DOF
Nonlinear Bolt Tightening
12 increments, 48 cycles

[Figure: chart with two vertical axes (0 to 16000 and 0 to 3000), labeled "Matrix factorization time (s)" and "Elapsed time for 1 increment (s)"; lower is better. Series: Matrix factorization time (s), Total elapsed time (s). Configurations: Serial (1 core), 1c + 1GPU (SMP), 8 core (DDM = 2), 8c + 2GPU (DDM=2), 16 core (DDM = 4), 16c + 4GPU (DDM=4).]


Conclusions

GPUs provide significant performance acceleration for large, solver-intensive jobs

Max front > 10000 for real data and > 5000 for complex data

Multiple GPUs improve performance with DMP > 1, including for NVH with SOL108 (embarrassingly parallel).

NVIDIA and MSC continue to work together to tune BLAS and LAPACK kernels for MSCLDL and MSCLU.

A number of other MSC Nastran functional areas are candidates for GPU acceleration.

Thanks! Contact: [email protected]
