
© 2014 ANSYS, Inc. November 23, 2014 1

Reducing Simulation Time of Your Mechanical Applications by Leveraging Computing Power

Wim Slagter, PhD

ANSYS, Inc.


Most Users Constrained by Hardware

Source: HPC Usage survey with over 1,800 ANSYS respondents


Problem Statement

I am not achieving the performance and throughput I was expecting from my hardware & software

Image courtesy of Intel Corporation


Take Advantage of the Desktop Revolution

Recent advances have revolutionized the computational speed available on the desktop
• Multi-core processors
  o Every core is really an independent processor
• Large amounts of RAM
• SSDs
• GPUs
• FDR InfiniBand


HDD vs. SSD

Maximizing Performance – Putting it Together

The right combination of hardware and software leads to maximum efficiency

SMP vs. DMP

Interconnects?

Clusters?

GPUs? CPUs?


Maximizing Performance Tip #1 Avoid Waiting for I/O to Complete

Always check if job is I/O bound or compute bound

Check output file for CPU and Elapsed times
• When Elapsed time >> main thread CPU time → I/O bound
  – Consider adding more RAM or faster hard drive configuration
• When Elapsed time ≈ main thread CPU time → Compute bound
  – Consider moving simulation to a machine with newer, faster processors
  – Consider using Distributed ANSYS (DMP) instead of SMP
  – Consider running on more CPU cores or possibly using GPU(s)

Total CPU time for main thread : 159.8 seconds
. . . . . .
Elapsed Time (sec) = 398.000 Date = 03/21/2014
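The check described in this tip can be sketched in a few lines of Python. This is a minimal illustration, not an ANSYS utility: the function name `classify_job` and the 1.5x threshold standing in for ">>" are assumptions, and the regular expressions only match the two output lines quoted on the slide.

```python
import re

# Patterns for the two lines quoted from the solver output file
CPU_RE = re.compile(r"Total CPU time for main thread\s*:\s*([\d.]+)\s*seconds")
ELAPSED_RE = re.compile(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)")


def classify_job(output_text, io_bound_ratio=1.5):
    """Return (cpu_seconds, elapsed_seconds, label) for a solver output text.

    io_bound_ratio is an assumed cutoff for "elapsed >> CPU"; the slide
    gives no numeric threshold.
    """
    cpu_match = CPU_RE.search(output_text)
    elapsed_match = ELAPSED_RE.search(output_text)
    if cpu_match is None or elapsed_match is None:
        raise ValueError("could not find CPU and elapsed times in output")
    cpu = float(cpu_match.group(1))
    elapsed = float(elapsed_match.group(1))
    label = "I/O bound" if elapsed > io_bound_ratio * cpu else "compute bound"
    return cpu, elapsed, label


# The sample lines shown on the slide
sample = """Total CPU time for main thread : 159.8 seconds
. . . . . .
Elapsed Time (sec) = 398.000 Date = 03/21/2014"""

cpu, elapsed, label = classify_job(sample)
print(f"CPU {cpu:.1f} s, elapsed {elapsed:.1f} s, ratio {elapsed / cpu:.2f}: {label}")
```

For the sample above, elapsed time is roughly 2.5 times the main thread CPU time, so the job is flagged as I/O bound.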


How to Improve an I/O Bound Simulation

First consider adding more RAM
• Always the best option for optimal performance
• Allows the operating system to cache file data in memory

Next consider improving the I/O configuration
• Need fast hard drives to feed fast processors
• Consider SSDs
  – Higher bandwidths and extremely low seek times
• Consider RAID configurations
  – RAID 0: for speed
  – RAID 1,5: for redundancy
  – RAID 10: for speed and redundancy


Example of an I/O Bound Simulation

[Figure: Benefits of SSD and RAM. Relative speedup for three configurations (2 cores, HDD; 8 cores, HDD; 8 cores, SSD) with 16 GB RAM and 128 GB RAM. Data labels as extracted: 0.8x, 2.9x, 2.7x, 5.9x, 5.9x.]

• Adding RAM gives biggest gains & allows good scaling
• Lack of RAM and slow HDD ruin scaling
• Single SSD helps allow some scaling. Not as helpful as RAM, but cheaper

• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• One 10k rpm HDD, one SSD
• Windows 7


How to Improve a Compute Bound Simulation

First consider using newer, faster processors
• New CPU architecture and faster clock speeds always help

Next consider using parallel processing
• DMP virtually always recommended over SMP
  – More computations performed in parallel with DMP
  – Significantly faster speedups achieved using DMP
  – DMP can take advantage of all resources on a cluster
  – Whole new class of problems can be solved

Furthermore consider using GPU acceleration
• Can help accelerate critical, time-consuming computations


Example of a Compute Bound Simulation

[Figure: Benefits of DMP and GPU. Relative speedup for 2 cores, 8 cores, and 8 cores + GPU on Xeon X5675 and Xeon E5-2670. Data labels as extracted: 1.8x, 4.0x, 11.0x.]

• Using 8 cores gives faster performance
• Using newer Xeons gives big gain
• Maximum performance found by adding GPU

• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• 128 GB RAM
• 1 Tesla K20c
• Windows 7


Maximizing Performance Tip #2 Use Fast Interconnects to Feed Fast CPUs

GigE (Gigabit Ethernet)
• 1 Gbits/sec (100 MB/sec)

10 GigE
• 10 Gbits/sec (1000 MB/sec)

Myrinet (Myricom, Inc)
• 2 Gbits/sec (250 MB/sec)
• Myri 10G – 10 Gbits/sec (4th generation Myrinet)

InfiniBand (many vendors/speeds)
• SDR/DDR/QDR
• 1x, 4x, 12x
• http://en.wikipedia.org/wiki/List_of_device_bandwidths

Not recommended!!

Bare minimum!!

RECOMMENDATION
Over 1000 MB/s, especially when running on more than 4 nodes
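To get a feel for what these bandwidth figures mean, the sketch below estimates how long a given amount of inter-node traffic would take on each interconnect and checks it against the "over 1000 MB/s" recommendation. The bandwidths are the MB/sec values quoted on the slide; the 2000 MB payload and the function names are hypothetical, and latency is ignored entirely.

```python
# Bandwidths in MB/s as quoted on the slide
BANDWIDTH_MB_S = {"GigE": 100, "Myrinet": 250, "10 GigE": 1000}

# The slide recommends "over 1000 MB/s"
RECOMMENDED_MB_S = 1000


def transfer_seconds(megabytes, bandwidth_mb_s):
    """Idealized transfer time: payload size divided by bandwidth (no latency)."""
    return megabytes / bandwidth_mb_s


def exceeds_recommendation(bandwidth_mb_s):
    """True only when the bandwidth is strictly above the recommended value."""
    return bandwidth_mb_s > RECOMMENDED_MB_S


payload_mb = 2000  # hypothetical amount of data exchanged between nodes
for name, bandwidth in BANDWIDTH_MB_S.items():
    seconds = transfer_seconds(payload_mb, bandwidth)
    verdict = "above" if exceeds_recommendation(bandwidth) else "not above"
    print(f"{name}: {seconds:.1f} s for {payload_mb} MB, {verdict} the recommendation")
```

None of the three listed Ethernet or Myrinet figures is strictly above 1000 MB/s, which is consistent with the slide pointing toward faster interconnects for multi-node runs.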


Example of the Effect of a Fast Interconnect

V13sp-5 Model
• Turbine geometry
• 2,100 K DOF
• SOLID187 FEs
• Static, nonlinear
• One iteration
• Direct sparse
• Linux cluster (8 cores per node)

[Figure: Interconnect Performance. Rating (runs/day) at 8, 16, 32, 64, and 128 cores for Gigabit Ethernet and DDR InfiniBand.]
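The chart above reports a rating in runs/day. The slide does not define it, so the sketch below assumes the common benchmark convention that the rating is the number of runs that fit into 24 hours, and derives a parallel efficiency from it. The elapsed times are hypothetical values chosen for illustration, not data from the chart.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rating(elapsed_seconds):
    """Runs per day for a job taking elapsed_seconds of wall-clock time.

    Assumed definition: 86400 / elapsed time. The slide does not state it.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return SECONDS_PER_DAY / elapsed_seconds


def parallel_efficiency(base_cores, base_rating, cores, new_rating):
    """Fraction of ideal scaling achieved when going from base_cores to cores."""
    return (new_rating / base_rating) / (cores / base_cores)


# Hypothetical elapsed times in seconds per core count (illustration only)
elapsed_by_cores = {8: 7200.0, 16: 4000.0, 32: 2400.0}

base_cores = 8
base_rating = rating(elapsed_by_cores[base_cores])
for cores, elapsed in elapsed_by_cores.items():
    r = rating(elapsed)
    eff = parallel_efficiency(base_cores, base_rating, cores, r)
    print(f"{cores} cores: {r:.1f} runs/day, efficiency {eff:.2f}")
```

A slow interconnect typically shows up as efficiency dropping quickly as nodes are added, which is the effect the chart illustrates.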


Maximizing Performance Tip #3 Stay Current on ANSYS Releases

1980s ► Vector Processing on Mainframes

1990 ► Shared Memory Multiprocessing (SMP) available

1994 ► Iterative PCG Solver introduced for large analyses

1999 - 2000 ► 64-bit large memory addressing

2004 ► 1st company to solve 100M structural DOF!

2005 - 2007 ► Distributed ANSYS (DMP) released! ► Distributed PCG solver ► Distributed sparse solver ► Variational Technology ► Support for clusters using Windows HPC

2007 - 2009 ► Teraflop performance at 512 cores! ► Optimized for multicore processors

2010 ► GPU acceleration (single GPU; SMP)!

2012 ► GPU acceleration (multiple GPUs; DMP)!

2013 ► First commercial vendor to release Xeon Phi support!


ANSYS Mechanical 15.0 Latest Solver & HPC Improvements

• NEW Subspace Eigen solver supports Shared and Distributed Parallel technology
• NEW MSUP Harmonic method for unsymmetric systems, e.g. vibro-acoustics
• Improved Scalability of Distributed solver at higher core counts

[Figure captions: Coupled Acoustic, 1.2 M DOF, Full Harmonic Response; 2.09 MDOFs, first 20 modes]


ANSYS Mechanical 15.0 Improved Performance at Higher Core Counts

Improved Scaling at 8 cores by an enhanced domain decomposition method

[Figure: Speedup over R14.5 for four models: Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), Bracket (45 KDOF). Data labels as extracted: 1.3x, 1.7x, 2.7x, 2.4x.]

8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)


[Figure: Speedup over R14.5 by an enhanced domain decomposition method (second chart in the series). Data labels as extracted: 1.6x, 1.8x, 3.8x, 4.0x.]


[Figure: Speedup over R14.5 by an enhanced domain decomposition method (third chart in the series). Data labels as extracted: 1.8x, 2.2x, 3.9x, 5.0x.]


Continually improving Core Solver Rating to 80 cores

Courtesy of HP



R15.0 can be up to 30% faster than previous version

ANSYS Mechanical 15.0 Improved Performance on NVIDIA GPUs

[Figure: Relative speedup on the Turbine benchmark and the BGA benchmark for R14.5 (8 cores), R14.5 (8 cores + 1 GPU), and R15 (8 cores + 1 GPU). Data labels as extracted: 1.2x, 1.2x, 1.4x, 1.5x.]

Linux server with 32 Intel Xeon E5-4650 cores @ 2.7 GHz; 2 Tesla K20; 512 GB RAM


ANSYS Mechanical 15.0 Consider Adding GPUs for High Core Counts

• Lower core counts favor a single GPU
• Higher core counts favor multiple GPUs

Courtesy of HP


ANSYS Mechanical 15.0 GPU Acceleration through Intel Xeon Phi

Intel Xeon Phi coprocessors are now supported
• Use ‘-acc intel’ to activate this capability
• Xeon Phi models 7120, 5110, 3120 are supported
• Multiple cards

Note:
• Supported by sparse solver (symmetric matrices only)
• Linux only (no Windows support yet)
• SMP only supported


ANSYS Mechanical 15.0 GPU Acceleration through Intel Xeon Phi

Significant speedups can be achieved with Xeon Phi card
• Shared Memory Sparse Solver on Linux

[Figure: Xeon Phi Acceleration (SMP). Speedup at 1, 2, 4, 8, and 16 cores for CPU cores only and CPU cores + Xeon Phi. Data labels as extracted: 3.3x, 4.3x, 5.1x, 6.0x, 6.8x.]

V14sp-5 Model
• Turbine geometry
• 2.1 million DOF
• SOLID187 elements
• Static, nonlinear analysis
• One iteration
• Sparse direct solver

Linux workstation (16 Intel Xeon E5-2670 cores @ 2.6 GHz, 1 7120A Xeon Phi, 64 GB RAM).


Maximizing Performance Tip #4 Take Advantage of New HPC Licensing

Each ANSYS HPC license unlocks GPUs
• In terms of licensing, one GPU socket equates to one CPU core
• More value from your investment in HPC

ANSYS HPC Workgroup 16 / 32 / 64
• Ability to fully utilize the computing power of your high-end workstation(s)
• Also ideal for 2 – 4 users each with modest HPC needs on servers or entry-level clusters
• Filling the gap between HPC Packs and HPC Workgroup 128
• “Enterprise” options available to deploy and use anywhere in the world


Each HPC License Enables GPU Acceleration

[Figure: ANSYS Mechanical jobs/day; higher is better.]

Simulation productivity (with an HPC license)
• 2 CPU cores: 93 jobs/day
• 2 CPU cores + Tesla K20: 324 jobs/day (3.5X)

Simulation productivity (with an HPC Pack)
• 8 CPU cores: 275 jobs/day
• 7 CPU cores + Tesla K20: 576 jobs/day (2.1X)

V14sp-5 Model
• Turbine geometry
• 2.1 million DOF
• SOLID187 elements
• Static, nonlinear analysis
• One iteration
• Sparse direct solver

Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 GPU with boost clocks.
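The productivity figures on this slide can be reproduced with simple arithmetic. The sketch below recomputes the quoted speedups from the jobs/day values (93, 324, 275, 576) and counts licensed units under the slide's rule that one GPU equates to one CPU core. The function names are illustrative only.

```python
def speedup(baseline_jobs_per_day, accelerated_jobs_per_day):
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_jobs_per_day / baseline_jobs_per_day


def licensed_units(cpu_cores, gpus):
    """Count parallel units for licensing: each GPU counts like one CPU core."""
    return cpu_cores + gpus


# (cpu_cores, gpus, jobs_per_day) for each configuration quoted on the slide
cases = {
    "HPC license": {"baseline": (2, 0, 93), "accelerated": (2, 1, 324)},
    "HPC Pack": {"baseline": (8, 0, 275), "accelerated": (7, 1, 576)},
}

for name, case in cases.items():
    b_cores, b_gpus, b_jobs = case["baseline"]
    a_cores, a_gpus, a_jobs = case["accelerated"]
    print(
        f"{name}: {licensed_units(b_cores, b_gpus)} -> "
        f"{licensed_units(a_cores, a_gpus)} units, "
        f"{speedup(b_jobs, a_jobs):.1f}x jobs/day"
    )
```

In the HPC Pack case, swapping one CPU core for the GPU keeps the unit count at 8 while roughly doubling throughput.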


Maximizing Performance Tip #5 Check “HPC Suitability” of Your Model

Models:
• With >500,000 DOFs and that don’t require hundreds of iterations

The above models can possibly be further accelerated by GPU(s):
• If they do NOT have Joints in Workbench Mechanical
• Sparse solver:
  – Bulkier and/or higher-order FE models are good and will be accelerated
  – If model exceeds 5M DOF, then either add another GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB memory (Tesla K40 or Quadro K6000).
• PCG/JCG solver:
  – Memory saving (MSAVE) option should be turned off for enabling GPUs
  – Models with lower Level of Difficulty value (Lev_Diff) are better suited for GPUs

Some generic recommendations:
• Try to run in-core
• Avoid big and overlapping contact pairs
• Try to reduce the number of RBE3 nodes
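The suitability rules in this tip can be collected into a small pre-flight check. The sketch below is a hypothetical helper, not part of any ANSYS product: the `Model` fields and the function name are assumptions, and it encodes only the thresholds stated on the slide (500,000 DOF, 5M DOF, joints, MSAVE).

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical summary of a simulation model (field names are assumptions)."""
    dof: int
    solver: str  # "sparse", "pcg", or "jcg"
    has_joints: bool = False
    needs_hundreds_of_iterations: bool = False
    msave_on: bool = False


def gpu_suitability_notes(model):
    """Return a list of notes; an empty list means no rule from the slide was triggered."""
    notes = []
    if model.dof <= 500_000 or model.needs_hundreds_of_iterations:
        notes.append("outside the >500,000 DOF / few-iterations class of models")
    if model.has_joints:
        notes.append("model has Joints; GPU acceleration may not apply")
    if model.solver == "sparse" and model.dof > 5_000_000:
        notes.append("over 5M DOF: add a second 5-6 GB GPU or use one 12 GB GPU")
    if model.solver in ("pcg", "jcg") and model.msave_on:
        notes.append("turn the MSAVE option off to enable GPUs")
    return notes


# Example: the 2.1 million DOF sparse-solver benchmark size used in this deck
print(gpu_suitability_notes(Model(dof=2_100_000, solver="sparse")))
print(gpu_suitability_notes(Model(dof=6_000_000, solver="sparse")))
print(gpu_suitability_notes(Model(dof=800_000, solver="pcg", msave_on=True)))
```

The Lev_Diff guidance and the generic recommendations (in-core runs, contact pairs, RBE3 nodes) are left out because the slide gives no numeric thresholds for them.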


Maximizing Performance Tip #6 Check Out Additional Resources


Maximizing Performance Distributed ANSYS Performance

1 Billion DOF 64 Cores 13 Hours Later

Whole new class of problems can be SOLVED!

1 Billion DOF Solved!


• Connect with Me
  – [email protected]
• Connect with ANSYS, Inc.
  – LinkedIn ANSYSInc
  – Twitter @ANSYS_Inc
  – Facebook ANSYSInc
• Follow our Blog
  – ansys-blog.com

Thank You!
