GPU Computing with MSC Nastran 2013 - 2013 Regional User Conference


  • MSC Software Confidential

    GPU Computing with MSC Nastran 2013
    2013 Regional User Conference

    Presented By: Srinivas Kodiyalam, NVIDIA

    May 6, 2013

  • GPUs Accelerate Computing

    MSC Nastran uses the computing power of GPUs for faster simulation.

    CPU + GPU = Speed Up

  • Increasing Performance & Memory Bandwidth

    [Chart: peak double-precision performance and memory bandwidth across product generations, NVIDIA GPU (ECC off) vs. x86 CPU, through the Kepler generation]

  • NVIDIA GPU products relevant to MSC Nastran

    Data Center GPUs (Server/Cluster):
    - Generation N (Kepler): Tesla K20X (6 GB), K20m (5 GB)
    - Generation N-1 (Fermi): Tesla M2090 (6 GB), M2070 (6 GB)

    Workstation GPUs (with fans):
    - Generation N (Kepler): Tesla K20c (5 GB), Quadro K6000
    - Generation N-1 (Fermi): Tesla C2075 (6 GB), Quadro 6000 (6 GB)

  • Costs of CAE static analysis

    The compute-intensive task is solving a large system of sparse linear equations, in double precision.

    The equation solver is therefore an obvious place to employ the GPU to accelerate the solution.

    Equation solver cost for engine model benchmarks:

    Model Size (DOF)   Solution Time (s)   Time in Equation Solver (s)   Fraction of Total
    0.7M               ~1200               ~700                          54%
    5M                 ~18000              ~15500                        85%

  • Direct sparse solver workflow in MSC Nastran (MSCLDL, MSCLU)

    In a proper order, do the following at each node of the elimination tree:

    - Assembly from the global stiffness and contribution blocks
    - Pivoting
    - Block factorization: diagonal decomposition, off-diagonal update, trailing matrix update

    The most time-consuming matrix update operations run on the GPU (an illustrative sketch follows below).

    [Figure: elimination tree with fronts numbered 1 through 11 in processing order]
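    The trailing matrix update is essentially a dense Schur-complement update and is where most of the floating-point work lives. The following minimal CUDA/cuBLAS sketch shows such an update offloaded to the GPU; it is an assumed illustration only (it ignores pivoting and the diagonal scaling of LDL^T, and the matrix names and sizes are made up), not MSC Nastran source code.

```cpp
// Illustrative sketch: double-precision trailing-matrix (Schur complement)
// update C := C - L21 * L21^T for one front, offloaded to the GPU.
// Names and sizes are hypothetical; NOT MSC Nastran source code.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int k = 256;   // width of the factorized panel
    const int n = 4096;  // size of the trailing submatrix

    // Column-major host data: panel L21 (n x k) and trailing block C (n x n).
    std::vector<double> L21(static_cast<size_t>(n) * k, 0.001);
    std::vector<double> C(static_cast<size_t>(n) * n, 1.0);

    double *dL21 = nullptr, *dC = nullptr;
    cudaMalloc((void**)&dL21, sizeof(double) * n * k);
    cudaMalloc((void**)&dC,   sizeof(double) * n * n);
    cudaMemcpy(dL21, L21.data(), sizeof(double) * n * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dC,   C.data(),   sizeof(double) * n * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Symmetric rank-k update of the lower triangle: C := -L21 * L21^T + C.
    const double alpha = -1.0, beta = 1.0;
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n, k, &alpha, dL21, n, &beta, dC, n);

    cudaMemcpy(C.data(), dC, sizeof(double) * n * n, cudaMemcpyDeviceToHost);
    printf("C(0,0) after update: %f\n", C[0]);

    cublasDestroy(handle);
    cudaFree(dL21);
    cudaFree(dC);
    return 0;
}
```

    A rank-k update like this is compute-bound when the front is large, which is what makes the GPU offload worthwhile.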

  • MSC Nastran 2013

    The Nastran direct equation solver is GPU accelerated:

    - Sparse direct factorization (MSCLDL, MSCLU): real, complex, symmetric, unsymmetric
    - Handles very large fronts with minimal use of pinned host memory
    - Lowest-granularity GPU implementation of a sparse direct solver; solves unlimited sparse matrix sizes
    - Impacts several solution sequences: high impact (SOL101, SOL108), mid (SOL103), low (SOL111, SOL400)
    - Support for multi-GPU, on Linux and Windows
    - With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs; 1 GPU per matrix domain (see the sketch after this list)
    - NVIDIA GPUs: Tesla K20/K20X, Tesla M2090, Tesla C2075, Quadro 6000
    - Released with CUDA 5
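    The one-GPU-per-matrix-domain pairing under DMP amounts to binding each distributed-memory task to its own device. The sketch below shows that generic CUDA/MPI idiom with a round-robin assignment; it is an assumption for illustration only, since MSC Nastran performs this binding internally.

```cpp
// Illustrative sketch: bind one GPU per DMP task (matrix domain).
// Generic CUDA/MPI idiom; NOT MSC Nastran source code.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    // Round-robin: task i factorizes its matrix domain on GPU (i mod ngpus).
    int device = (ngpus > 0) ? rank % ngpus : -1;
    if (device >= 0) {
        cudaSetDevice(device);
    }

    printf("DMP task %d using GPU %d of %d\n", rank, device, ngpus);

    // ... each task would assemble and factorize its own domain here ...

    MPI_Finalize();
    return 0;
}
```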

  • Basics of GPU Computing with MSC Nastran

    GPUs are an accelerator attached to an x86 CPU; GPUs cannot operate without an x86 CPU present.

    MSC Nastran GPU acceleration is user-transparent: jobs launch and complete without additional user steps.

    The CPU begins/ends the job, while the GPU manages the heavy computations:

    1. Nastran job launched on CPU
    2. Solver operations sent to GPU (see the transfer sketch below)
    3. GPU sends results back to CPU
    4. Nastran job completes on CPU

    [Schematic: x86 CPU with cache and DDR memory connected to a GPU with GDDR memory through the I/O hub over PCI-Express]
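    Steps 2 and 3 move matrix blocks across PCI-Express, which is why the solver works in large blocks and (as noted above) keeps pinned host memory use minimal. Below is a generic sketch of that transfer pattern with a pinned buffer, asynchronous copies, and a CUDA stream; the buffer size is a made-up example and this is not MSC Nastran code.

```cpp
// Illustrative sketch: host <-> GPU transfers over PCI-Express using a
// pinned (page-locked) host buffer, async copies, and a CUDA stream.
// Sizes are hypothetical; NOT MSC Nastran source code.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 24;                  // 16M doubles (~128 MB) per block
    const size_t bytes = n * sizeof(double);

    double* h_block = nullptr;                 // pinned memory enables async DMA
    cudaHostAlloc((void**)&h_block, bytes, cudaHostAllocDefault);
    for (size_t i = 0; i < n; ++i) h_block[i] = 1.0;

    double* d_block = nullptr;
    cudaMalloc((void**)&d_block, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Step 2: send the block to the GPU (host can prepare the next block meanwhile).
    cudaMemcpyAsync(d_block, h_block, bytes, cudaMemcpyHostToDevice, stream);

    // ... GPU factorization kernels for this block would run here ...

    // Step 3: return results to the host; step 4 continues on the CPU.
    cudaMemcpyAsync(h_block, d_block, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("first value after round trip: %f\n", h_block[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_block);
    cudaFreeHost(h_block);
    return 0;
}
```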

  • MSC Nastran 2013 SMP + GPU acceleration of SOL101 and SOL103

    [Chart: speed-up over serial (higher is better) for serial, 4c, and 4c+1g]

    SOL101, 2.4M rows, 42K front: 1X (serial), 2.7X (4c), 6X (4c+1g)
    SOL103, 2.6M rows, 18K front: 1X (serial), 1.9X (4c), 2.8X (4c+1g)

    Server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory

    Lanczos solver (SOL 103) steps: sparse matrix factorization; iterate on a block of vectors (solve); orthogonalization of vectors

  • NVH with MSC Nastran 2013: Coupled Structural-Acoustics simulation with SOL108

    Europe Auto OEM: 710K nodes, 3.83M elements
    100 frequency increments (FREQ1)
    Direct sparse solver

    [Chart: elapsed time in minutes (lower is better) for serial, 1c+1g, 4c (SMP), 4c+1g, 8c (DMP=2), 8c+2g (DMP=2), 16c (DMP=4), and 16c+4g (DMP=4)]

    Speed-ups over serial: 1X (serial), 4.8X (1c+1g), 2.7X (4c), 5.5X (4c+1g), 5.2X (8c, DMP=2), 11X (8c+2g), 11X (16c, DMP=4), 23.5X (16c+4g)

    Server node: Sandy Bridge 2.6 GHz, 2x 8 core, 2x Tesla K20X GPU, 128 GB memory

  • Solution Price-Performance Gain

    [Chart: factors gained over base-license results - CPU speed-up, GPU speed-up, and solution cost]

    Configurations: Nastran SMP license, 1 core; Nastran SMP, 4 cores; Nastran DMP, 8 cores; Nastran SMP + GPU license, 4 cores + 1 GPU; Nastran DMP + GPU license, 8 cores + 2 GPUs

    Chart values: CPU speed-ups 1.0 (1 core), 2.7 (4 cores), 5.2 (8 cores DMP); GPU speed-ups 5.5 (4 cores + 1 GPU), 11.1 (8 cores + 2 GPUs); solution cost factors 1.0, 1.0, 1.1, 1.2, 1.35

    Performance basis, SOL108 vehicle model:
    - NVH analysis
    - Structural-acoustics
    - 100 FREQ1 increments

    Solution cost basis (1-year lease for SW pricing*):
    - Structures Package (base SMP license)
    - Exterior Acoustics Package
    - Implicit HPC Package (DMP network license)
    - GPU license
    - $10K for system cost
    - $4K for 2x Tesla 20-series

    Extra 13% cost yields 200% performance (over 8 cores); the arithmetic is made explicit below.

    Results from PSG cluster node, 2x Sandy Bridge, 2.6 GHz, 128 GB memory, 2x Tesla K20X, Linux/RHEL 6.2

    NOTE: Based on MSC Nastran 2013
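    Making that comparison explicit with the chart values above: 11.1 / 5.2 ≈ 2.1, i.e. roughly twice (200% of) the 8-core DMP performance, while the solution cost factor rises from about 1.2 to 1.35, i.e. 1.35 / 1.2 ≈ 1.13, roughly 13% extra cost.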

  • NVH with MSC Nastran 2013: Trimmed Car Body Frequency Response with SOL108

    USA Auto OEM: 1.2M nodes, 7.47M DOF
    Shells (CQUAD4): 1.04M
    Solids (CTETRA): 0.1M
    100 frequency increments (FREQ1)

    [Chart: elapsed time in hours (lower is better)]

    Speed-ups over serial: 1X (serial), 2.5X (SMP 4c), 4.4X (SMP 4c+1g, 1 node), 6.8X (DMP 4c+1g, 2 nodes), 9X (DMP 4c+1g, 3 nodes)

    Server node: Sandy Bridge 2.6 GHz, 2x 8 core, 2x Tesla K20X GPU, 128 GB memory

  • NVH with MSC Nastran 2013: Engine Model Modal Frequency with SOL111

    Japan Auto OEM (Source: MSC Software, Japan)
    Nodes: 1.4M, Elements: 0.78M (mainly TETRA10)
    Modes: 104 (2500 Hz)
    Front size: 23,718

    [Chart: CPU time breakdown in seconds, 1 CPU (9052 sec.) vs. 1 CPU + 1 GPU (5116 sec.); components: FBS + matrix-vector multiply, shift + decomposition, Lanczos run, Resvec; sparse decomposition only]

    [Chart: elapsed time breakdown in seconds, 1 CPU (9702 sec.) vs. 1 CPU + 1 GPU (5647 sec.); components: pre-eigenvalue, eigenvalue, Resvec, post-eigenvalue]

    1.7x speedup

  • Recommendations for GPU Acceleration

    Key factors for model selection for GPU acceleration:

    - Enough work: FLOPs; solid and shell models with dense fronts; estimated max. front size > 10000 (real), > 5000 (complex) (see the rough work estimate below)
    - Minimize I/O
    - Sufficient system (host) memory
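    As a rough, standard back-of-envelope estimate (not a figure from the slides): factorizing a dense n x n front costs on the order of n^3/3 floating-point operations, while transferring it over PCI-Express moves on the order of n^2 values. For n = 10,000 that is roughly 3 x 10^11 double-precision operations against about 0.8 GB of data, so the computation dwarfs the transfer and the GPU stays busy; for small fronts, transfer and launch overheads dominate.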

  • Recommended Configurations with MSC Nastran

    Workstation user (smaller models; single GPU, SMP):
    - Dual CPU
    - 96 GB RAM*
    - 1x Quadro 4000/6000 GPU
    - 1x Tesla K20c GPU

    Server/cluster user (multi-GPU, SMP+DMP), each node of IB cluster:
    - Dual CPU
    - 128-256 GB RAM*
    - 2x Tesla K20/K20X GPU
    - 6x 600 GB SAS 15K disks (scratch; RAID0)

    * Memory requirements dictated by problem size, to minimize disk I/O.

  • Marc 2013 GPU Acceleration of US Auto OEM model

    2.5 million elements, 10 million DOF
    Nonlinear bolt tightening
    12 increments, 48 cycles

    [Chart: matrix factorization time (s) and elapsed time for 1 increment (s), lower is better, for Serial (1 core), 1c+1GPU (SMP), 8 cores (DDM=2), 8c+2GPU (DDM=2), 16 cores (DDM=4), and 16c+4GPU (DDM=4)]

    Server node: Sandy Bridge 2.6 GHz, 2x 8 core, 2x Tesla K20X GPU, 128 GB memory

  • Conclusions

    GPUs provide significant performance acceleration for solver-intensive large jobs: max front > 10000 for real data and > 5000 for complex data.

    Multiple-GPU performance with DMP > 1, including for NVH SOL108 (embarrassingly parallel).

    NVIDIA and MSC continue to work together to tune BLAS and LAPACK kernels for MSCLDL and MSCLU.

    A number of other MSC Nastran functional areas are candidates for GPU acceleration.

  • Thanks! Contact: skodiyalam@nvidia.com

