TRANSCRIPT
MSC Software Confidential
GPU Computing with MSC Nastran 2013
2013 Regional User Conference
Presented By: Srinivas Kodiyalam, NVIDIA
May 6, 2013
GPUs Accelerate Computing
MSC Nastran Uses Computing Power of GPUs for Faster Simulation
GPU + CPU = Speed Up
[Figure: Increasing Performance & Memory Bandwidth. Series: NVIDIA GPU (ECC off) vs. x86 CPU; Double Precision: NVIDIA GPU vs. x86 CPU. Most recent GPU points labeled Kepler.]
NVIDIA GPU products relevant to MSC Nastran

Generation            Workstation GPUs (w/ fans)                 Data Center GPUs (Server/Cluster)
N-1 (Fermi)           Tesla C2075 (6 GB), Quadro 6000 (6 GB)     Tesla M2090 (6 GB), M2070 (6 GB)
N (Kepler)            Tesla K20c (5 GB), Quadro K6000            Tesla K20X (6 GB), K20m (5 GB)
Costs of CAE static analysis
• The compute-intensive task is solving a large system of sparse linear equations
• Double precision computations
• The equation solver is an obvious place to employ the GPU to accelerate the solution
Equation Solver Cost for Engine Model Benchmarks

Model Size (DOF)   Solution Time (secs)   Time in Equation Solver (secs)   Fraction of total
0.7M               ~1200                  ~700                             54%
5M                 ~18000                 ~15500                           85%
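The solver fractions above bound how much a faster equation solver can help the whole job. A minimal sketch of that reasoning using Amdahl's law follows; the function names and the solver speed-up of 3.0 are illustrative assumptions, not values from the benchmarks.

```python
def overall_speedup(solver_fraction, solver_speedup):
    """Amdahl's law: only the equation-solver share of the run is accelerated."""
    return 1.0 / ((1.0 - solver_fraction) + solver_fraction / solver_speedup)


def speedup_limit(solver_fraction):
    """Upper bound on job speed-up if the solver took no time at all."""
    return 1.0 / (1.0 - solver_fraction)


# Solver fractions taken from the table; 3.0 is a hypothetical solver speed-up.
for dof, fraction in (("0.7M", 0.54), ("5M", 0.85)):
    print(dof, round(overall_speedup(fraction, 3.0), 2), round(speedup_limit(fraction), 2))
```

The larger model spends more of its run in the solver, so the same solver acceleration translates into a larger gain for the whole job.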
Direct sparse solver workflow in MSC Nastran (MSCLDL, MSCLU)

In a proper order, do the following at each node:
  Assembly: from global stiffness & contribution blocks
  Pivoting
  Block factorization:
    Diagonal decomposition
    Off-diagonal update
    Trailing matrix update
Most time-consuming matrix update operations run on the GPU
[Figure: tree of nodes numbered 1 (bottom) to 11 (top)]
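The three block-factorization steps named on this slide (diagonal decomposition, off-diagonal update, trailing matrix update) can be sketched for one dense front as below. This is a simplified illustration under stated assumptions: it uses a Cholesky factorization without pivoting on a symmetric positive definite front, runs on the CPU in plain Python, and the names `cholesky` and `factor_front` are hypothetical. It is not the MSCLDL or MSCLU implementation.

```python
import math


def cholesky(a):
    """Dense Cholesky of a small SPD block: returns lower-triangular L with a = L L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l


def factor_front(front, k):
    """Eliminate the first k variables of a symmetric front (no pivoting)."""
    n = len(front)
    f11 = [row[:k] for row in front[:k]]
    f21 = [row[:k] for row in front[k:]]
    f22 = [row[k:] for row in front[k:]]
    # 1. Diagonal decomposition: F11 = L11 L11^T
    l11 = cholesky(f11)
    # 2. Off-diagonal update: solve L21 L11^T = F21, one row at a time
    l21 = []
    for row in f21:
        x = [0.0] * k
        for j in range(k):
            x[j] = (row[j] - sum(x[m] * l11[j][m] for m in range(j))) / l11[j][j]
        l21.append(x)
    # 3. Trailing matrix update: contribution block S = F22 - L21 L21^T
    m = n - k
    schur = [[f22[i][j] - sum(l21[i][p] * l21[j][p] for p in range(k))
              for j in range(m)] for i in range(m)]
    return l11, l21, schur


front = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
l11, l21, contribution = factor_front(front, 1)
```

The trailing update is a dense matrix-matrix product, which is why it dominates the cost for large fronts and is the natural step to offload. The returned contribution block is what a parent node would assemble.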
MSC Nastran 2013
Nastran direct equation solver is GPU accelerated
Sparse direct factorization (MSCLDL, MSCLU)
Real, Complex, Symmetric, Un-symmetric
Handles very large fronts with minimal use of pinned host memory
Lowest granularity GPU implementation of a sparse direct solver; solves
unlimited sparse matrix sizes
Impacts several solution sequences:
High impact (SOL101, SOL108), Mid (SOL103), Low (SOL111, SOL400)
Supports multiple GPUs, on both Linux and Windows
With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs;
1 GPU per matrix domain
NVIDIA GPUs: Tesla K20/K20X, Tesla M2090, Tesla C2075, Quadro 6000
Released with CUDA 5
Basics of GPU Computing with MSC Nastran
GPUs are an accelerator attached to an x86 CPU
GPUs cannot operate without an x86 CPU present
MSC Nastran GPU acceleration is user-transparent
Jobs launch and complete without additional user steps
Schematic of an x86 CPU with a GPU accelerator
CPU begins/ends job, GPU manages heavy computations
1. Nastran job launched on CPU
2. Solver operations sent to GPU
3. GPU sends results back to CPU
4. Nastran job completes on CPU
[Figure: CPU (cache, DDR memory) connected through the I/O hub and PCI-Express to the GPU (GDDR memory); steps 1 to 4 marked on the diagram]
MSC Nastran 2013 SMP + GPU acceleration of SOL101 and SOL103
Speed-up over serial (higher is better)

Model                            serial   4c     4c+1g
SOL101, 2.4M rows, 42K front     1X       2.7X   6X
SOL103, 2.6M rows, 18K front     1X       1.9X   2.8X

Server node: Sandy Bridge E5-2670 (2.6GHz), Tesla K20X GPU, 128 GB memory

Lanczos solver (SOL 103):
  Sparse matrix factorization
  Iterate on a block of vectors (solve)
  Orthogonalization of vectors
NVH with MSC Nastran 2013
Coupled Structural-Acoustics simulation with SOL108
Europe Auto OEM: 710K nodes, 3.83M elements
100 frequency increments (FREQ1)
Direct sparse solver

Elapsed time in minutes (lower is better); speed-up over serial:

Configuration        Speed-up
serial               1X
1c + 1g              4.8X
4c (smp)             2.7X
4c + 1g              5.2X
8c (dmp=2)           5.5X
8c + 2g (dmp=2)      11X
16c (dmp=4)          11X
16c + 4g (dmp=4)     23.5X

Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory
Solution Price-Performance Gain
Factors gain over base license results (CPU speed-up, GPU speed-up, solution cost)

Configuration                                   Speed-up   Solution cost
Nastran SMP License, 1 Core                     1.0        1.0
Nastran SMP, 4 Cores                            2.7        1.0
Nastran SMP + GPU License, 4 Cores + 1 GPU      5.2        1.1
Nastran DMP, 8 Cores                            5.5        1.2
Nastran DMP + GPU License, 8 Cores + 2 GPUs     11.1       1.35

Extra 13% cost yields 200% performance (over 8 cores)

Solution Cost Basis (1 year lease for SW pricing)
- Structures Package (Base SMP license)
- Exterior Acoustics Package
- Implicit HPC Package (DMP Network License)
- GPU License
- $10K for System cost
- $4K for 2x Tesla 20-series

Performance Basis
SOL108 Vehicle Model:
- NVH analysis
- Structural-Acoustics
- 100 FREQ1 increments

Results from PSG cluster node, 2x Sandy Bridge, 2.6GHz, 128GB memory, 2x Tesla K20X, Linux/RHEL 6.2
NOTE: Based on MSC Nastran 2013
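The "extra 13% cost yields 200% performance" claim can be checked from the chart values. The sketch below assumes a particular reading of the flattened chart: that the 8-core DMP run has speed-up 5.5 at relative cost 1.2, and the 8 cores + 2 GPUs run has speed-up 11.1 at relative cost 1.35. That pairing and the helper name `relative_gain` are assumptions made for illustration.

```python
def relative_gain(base, new):
    """Ratio of a new value to its baseline."""
    return new / base


# Assumed pairing of chart values (speed-up and cost are relative to the base license).
dmp_8c = {"speedup": 5.5, "cost": 1.2}
dmp_8c_2g = {"speedup": 11.1, "cost": 1.35}

extra_cost = relative_gain(dmp_8c["cost"], dmp_8c_2g["cost"]) - 1.0
perf_ratio = relative_gain(dmp_8c["speedup"], dmp_8c_2g["speedup"])

# Speed-up delivered per unit of relative solution cost
value_8c = dmp_8c["speedup"] / dmp_8c["cost"]
value_8c_2g = dmp_8c_2g["speedup"] / dmp_8c_2g["cost"]

print(f"extra cost: {extra_cost:.1%}, performance ratio: {perf_ratio:.2f}x")
```

Under this reading the extra cost comes to about 12.5% and the performance ratio to about 2x, consistent with the slide's rounded figures.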
NVH with MSC Nastran 2013
Trimmed Car Body Frequency Response with SOL108
USA Auto OEM: 1.2M nodes, 7.47M DOF
Shells (CQUAD4): 1.04M
Solids (CTETRA): 0.1M
100 frequency increments (FREQ1)

Elapsed time in hours (lower is better); speed-up over serial:

Configuration             Speed-up
serial                    1X
smp 4c                    2.5X
smp 4c+1g (x1 node)       4.4X
dmp 4c+1g (x2 nodes)      6.8X
dmp 4c+1g (x3 nodes)      9X

Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory
NVH with MSC Nastran 2013
Engine Model Modal Frequency with SOL111
Japan Auto OEM (Source: MSC Software, Japan)
Nodes 1.4M, Elements 0.78M
Mainly TETRA10
Modes: 104 (2500 Hz)
Front size: 23,718

[Chart: CPU time (sec.), 1CPU (9052 sec.) vs. 1CPU+1GPU (5116 sec.). Components: FBS + matrix-vector multiply, Shift + decomposition, Lanczos run, Resvec. Bar labels: 2848, 1000, 614, 586, 2807, 901, 2303, 2168. Annotation: sparse decomposition only.]

[Chart: elapsed time (sec.), 1CPU (9702 sec.) vs. 1CPU+1GPU (5647 sec.). Components: Pre_Eigenvalue, Eigenvalue, Resvec, Post_Eigenvalue. Bar labels: 335, 239, 2856, 1027, 6180, 4120, 291, 223.]

1.7x speedup
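The 1.7x figure follows from the totals printed on the two charts. A short check, with `speedup` as an illustrative helper name:

```python
def speedup(baseline_sec, accelerated_sec):
    """Speed-up as the ratio of baseline time to accelerated time."""
    return baseline_sec / accelerated_sec


elapsed_gain = speedup(9702, 5647)  # elapsed time: 1CPU vs. 1CPU+1GPU
cpu_gain = speedup(9052, 5116)      # CPU time: 1CPU vs. 1CPU+1GPU

print(f"elapsed: {elapsed_gain:.2f}x, CPU time: {cpu_gain:.2f}x")
```

The elapsed-time ratio is the one that rounds to the quoted 1.7x; the CPU-time ratio is slightly higher.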
Recommendations for GPU Acceleration
• Key factors for model selection for GPU acceleration:
  • Enough work
    • FLOPs: solid & shell models with dense fronts
    • Estimated max. front size > 10000 (real), > 5000 (complex)
  • Minimize I/O
    • Sufficient system (host) memory
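The front-size guideline can be written as a simple screening check. This is only a sketch of the rule of thumb stated on the slide; the function and constant names are hypothetical and are not part of any MSC Nastran interface.

```python
# Thresholds from the slide: estimated maximum front size for real and complex data.
REAL_FRONT_MIN = 10_000
COMPLEX_FRONT_MIN = 5_000


def is_gpu_candidate(max_front_size, complex_data=False):
    """Return True if the estimated max front size exceeds the guideline threshold."""
    threshold = COMPLEX_FRONT_MIN if complex_data else REAL_FRONT_MIN
    return max_front_size > threshold


# Front sizes quoted elsewhere in this deck: 42K (SOL101), 18K (SOL103), 23,718 (SOL111).
for front in (42_000, 18_000, 23_718):
    print(front, is_gpu_candidate(front))
```

Front size is only one of the listed factors; host memory and I/O still need to be sized for the problem.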
Recommended Configurations with MSC Nastran

Workstation user (smaller models; single GPU, SMP)
  Dual CPU
  96 GB RAM*
  1x Quadro 4000/6000 GPU
  1x Tesla K20c GPU

Server/Cluster user (multi-GPU, SMP+DMP)
  Each node of IB cluster:
  Dual CPU
  128-256 GB RAM*
  2x Tesla K20/K20X GPU
  6x 600 GB SAS 15K disks (scratch; RAID0)

* Memory requirements dictated by problem size to minimize disk I/O.
Marc 2013 – GPU Acceleration of US Auto OEM model
2.5 Million Elements
10 Million DOF
Nonlinear Bolt Tightening
12 increments, 48 cycles
[Chart: matrix factorization time (s) and total elapsed time for 1 increment (s), lower is better. Configurations: Serial (1 core); 1c + 1GPU (SMP); 8 core (DDM=2); 8c + 2GPU (DDM=2); 16 core (DDM=4); 16c + 4GPU (DDM=4)]
Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory
Conclusions
GPUs provide significant performance acceleration for large, solver-intensive jobs
Max front > 10000 for real data and > 5000 for complex data
Multi-GPU performance gains with DMP > 1, including for NVH SOL108 (embarrassingly parallel)
NVIDIA and MSC continue to work together to tune BLAS and LAPACK kernels for MSCLDL and MSCLU
A number of other MSC Nastran functional areas are candidates for GPU acceleration
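The "embarrassingly parallel" remark about SOL108 can be illustrated by splitting independent frequency increments across DMP domains, each paired with one GPU. The sketch below assumes an even, contiguous split; the function name is hypothetical and the partitioning MSC Nastran actually uses is not described in this deck.

```python
def split_frequencies(n_freq, n_domains):
    """Assign frequency indices 0..n_freq-1 to domains in contiguous, near-equal chunks."""
    base, extra = divmod(n_freq, n_domains)
    chunks, start = [], 0
    for d in range(n_domains):
        # The first `extra` domains take one additional frequency.
        size = base + (1 if d < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# 100 FREQ1 increments, as in the SOL108 benchmarks, over 2 and 4 domains.
for domains in (2, 4):
    print(domains, [len(c) for c in split_frequencies(100, domains)])
```

Because each frequency increment needs its own factorization and the increments do not depend on one another, the work per domain shrinks roughly in proportion to the number of domains.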
Thanks! Contact: [email protected]