
© 2014 ANSYS, Inc. November 23, 2014 1

Reducing Simulation Time of Your Mechanical Applications by Leveraging Computing Power

Wim Slagter, PhD

ANSYS, Inc.


Most Users Constrained by Hardware

Source: HPC Usage survey with over 1,800 ANSYS respondents


Problem Statement

I am not achieving the performance and throughput I was expecting from my hardware & software

Image courtesy of Intel Corporation


Take Advantage of the Desktop Revolution

Recent advances have revolutionized the computational speed available on the desktop
• Multi-core processors
  o Every core is really an independent processor
• Large amounts of RAM
• SSDs
• GPUs
• FDR InfiniBand


Maximizing Performance – Putting it Together

The right combination of hardware and software leads to maximum efficiency
• HDD vs. SSD
• SMP vs. DMP
• Interconnects?
• Clusters?
• GPUs? CPUs?


Maximizing Performance Tip #1: Avoid Waiting for I/O to Complete

Always check whether a job is I/O bound or compute bound

Check the output file for CPU and Elapsed times
• When Elapsed time >> main thread CPU time → I/O bound
  – Consider adding more RAM or a faster hard drive configuration
• When Elapsed time ≈ main thread CPU time → Compute bound
  – Consider moving the simulation to a machine with newer, faster processors
  – Consider using Distributed ANSYS (DMP) instead of SMP
  – Consider running on more CPU cores or possibly using GPU(s)

Example output file excerpt:

    Total CPU time for main thread : 159.8 seconds
    . . . . . .
    Elapsed Time (sec) = 398.000    Date = 03/21/2014
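The check described on this slide can be sketched in a few lines of Python. The two regular expressions follow the output-file lines quoted on the slide; the function name `classify_job` and the cutoff `io_bound_ratio=1.5` are assumptions made for illustration, since the slide only says "much greater than" and "approximately equal".

```python
import re

# Patterns modeled on the two output-file lines quoted on the slide.
CPU_RE = re.compile(r"Total CPU time for main thread\s*:\s*([\d.]+)\s*seconds")
ELAPSED_RE = re.compile(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)")


def classify_job(output_text, io_bound_ratio=1.5):
    """Return (cpu_seconds, elapsed_seconds, verdict) for a solver output file.

    io_bound_ratio is an assumed cutoff, not a documented ANSYS value:
    a job whose elapsed/CPU ratio exceeds it is reported as I/O bound.
    """
    cpu_match = CPU_RE.search(output_text)
    elapsed_match = ELAPSED_RE.search(output_text)
    if cpu_match is None or elapsed_match is None:
        raise ValueError("CPU or elapsed time line not found in output")
    cpu = float(cpu_match.group(1))
    elapsed = float(elapsed_match.group(1))
    if cpu <= 0.0:
        raise ValueError("CPU time must be positive")
    verdict = "I/O bound" if elapsed / cpu > io_bound_ratio else "compute bound"
    return cpu, elapsed, verdict


# The excerpt shown on the slide.
sample = """
Total CPU time for main thread : 159.8 seconds
Elapsed Time (sec) = 398.000 Date = 03/21/2014
"""
print(classify_job(sample))
```

For the slide's excerpt the elapsed time is roughly 2.5 times the main-thread CPU time, so the job is classified as I/O bound under the assumed cutoff.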


How to Improve an I/O Bound Simulation

First consider adding more RAM
• Always the best option for optimal performance
• Allows the operating system to cache file data in memory

Next consider improving the I/O configuration
• Need fast hard drives to feed fast processors
• Consider SSDs
  – Higher bandwidths and extremely low seek times
• Consider RAID configurations
  – RAID 0: for speed
  – RAID 1, 5: for redundancy
  – RAID 10: for speed and redundancy


Example of an I/O Bound Simulation

[Figure: Benefits of SSD and RAM. Bar chart of relative speedup for three configurations (2 cores, HDD; 8 cores, HDD; 8 cores, SSD), each with 16 GB RAM and 128 GB RAM. Bar labels: 0.8x, 2.9x, 2.7x, 5.9x, 5.9x.]

Chart annotations:
• Adding RAM gives biggest gains & allows good scaling
• Lack of RAM and slow HDD ruin scaling
• Single SSD helps allow some scaling. Not as helpful as RAM, but cheaper

Test configuration:
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• One 10k rpm HDD, one SSD
• Windows 7


How to Improve a Compute Bound Simulation

First consider using newer, faster processors
• New CPU architecture and faster clock speeds always help

Next consider using parallel processing
• DMP virtually always recommended over SMP
  – More computations performed in parallel with DMP
  – Significantly faster speedups achieved using DMP
  – DMP can take advantage of all resources on a cluster
  – Whole new class of problems can be solved

Furthermore consider using GPU acceleration
• Can help accelerate critical, time-consuming computations
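Why "more computations performed in parallel" translates into better speedups can be illustrated with Amdahl's law, a general model of parallel scaling that is not specific to ANSYS. The sketch below is only an illustration: the parallel fractions 0.80 and 0.95 are made-up values, not measurements of SMP or DMP.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only parallel_fraction of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical fractions: a solver that parallelizes less of its work
# versus one that parallelizes more of it.
for fraction in (0.80, 0.95):
    for cores in (2, 8, 16):
        speedup = amdahl_speedup(fraction, cores)
        print(f"parallel fraction {fraction:.2f}, {cores:2d} cores: {speedup:.2f}x")
```

The larger the share of work that runs in parallel, the closer the speedup stays to the core count as cores are added.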


Example of a Compute Bound Simulation

[Figure: Benefits of DMP and GPU. Bar chart of relative speedup for 2 cores, 8 cores, and 8 cores + GPU, comparing Xeon X5675 and Xeon E5-2670. Bar labels: 1.8x, 4.0x, 11.0x.]

Chart annotations:
• Using newer Xeons gives big gain
• Using 8 cores gives faster performance
• Maximum performance found by adding GPU

Test configuration:
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• 128 GB RAM
• 1 Tesla K20c
• Windows 7


Maximizing Performance Tip #2: Use Fast Interconnects to Feed Fast CPUs

GigE (Gigabit Ethernet)
• 1 Gbits/sec (100 MB/sec)

10 GigE
• 10 Gbits/sec (1000 MB/sec)

Myrinet (Myricom, Inc)
• 2 Gbits/sec (250 MB/sec)
• Myri 10G – 10 Gbits/sec (4th generation Myrinet)

InfiniBand (many vendors/speeds)
• SDR/DDR/QDR
• 1x, 4x, 12x
• http://en.wikipedia.org/wiki/List_of_device_bandwidths

Slide callouts: Not recommended!! Bare minimum!!

RECOMMENDATION: Over 1000 MB/s, especially when running on more than 4 nodes
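A quick way to compare an interconnect against the slide's recommendation is sketched below. The bandwidth table repeats the rounded figures from the slide; the helper names and the 40 Gbit/s example link are hypothetical. Note that the unit conversion gives the theoretical figure (1 Gbit/s = 125 MB/s), which is higher than the rounded 100 MB/s quoted on the slide for GigE.

```python
# Rounded bandwidths in MB/s, as quoted on the slide.
SLIDE_BANDWIDTH_MB_S = {"GigE": 100, "10 GigE": 1000, "Myrinet": 250}

# The slide recommends "over 1000 MB/s".
RECOMMENDED_MB_S = 1000


def gbits_to_mbytes(gbits_per_s):
    """Theoretical conversion in decimal units, ignoring protocol overhead."""
    return gbits_per_s * 1000 / 8


def exceeds_recommendation(bandwidth_mb_s):
    """True when the bandwidth is strictly above the recommended figure."""
    return bandwidth_mb_s > RECOMMENDED_MB_S


for name, bandwidth in SLIDE_BANDWIDTH_MB_S.items():
    print(f"{name}: {bandwidth} MB/s, above recommendation: {exceeds_recommendation(bandwidth)}")

# Hypothetical 40 Gbit/s link, used only to show the conversion.
print(exceeds_recommendation(gbits_to_mbytes(40)))
```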


Example of the Effect of a Fast Interconnect

[Figure: Interconnect Performance. Rating (runs/day) at 8, 16, 32, 64, and 128 cores, comparing Gigabit Ethernet and DDR InfiniBand.]

V13sp-5 Model
• Turbine geometry
• 2,100 K DOF
• SOLID187 FEs
• Static, nonlinear
• One iteration
• Direct sparse
• Linux cluster (8 cores per node)


Maximizing Performance Tip #3: Stay Current on ANSYS Releases

1980s
► Vector Processing on Mainframes

1990
► Shared Memory Multiprocessing (SMP) available

1994
► Iterative PCG Solver introduced for large analyses

1999 - 2000
► 64-bit large memory addressing

2004
► 1st company to solve 100M structural DOF!

2005 - 2007
► Distributed ANSYS (DMP) released!
► Distributed PCG solver
► Distributed sparse solver
► Variational Technology
► Support for clusters using Windows HPC

2007 - 2009
► Teraflop performance at 512 cores!
► Optimized for multicore processors

2010
► GPU acceleration (single GPU; SMP)!

2012
► GPU acceleration (multiple GPUs; DMP)!

2013
► First Commercial Vendor to Release Xeon Phi!


ANSYS Mechanical 15.0: Latest Solver & HPC Improvements

• NEW Subspace Eigen solver supports Shared and Distributed Parallel technology
• NEW MSUP Harmonic method for unsymmetric systems, e.g., vibro-acoustics
• Improved Scalability of Distributed solver at higher core counts

[Figure captions: Coupled Acoustic, 1.2 M DOF, Full Harmonic Response; 2.09 MDOFs, first 20 modes]


ANSYS Mechanical 15.0: Improved Performance at Higher Core Counts

Improved Scaling at 8 cores by an enhanced domain decomposition method

[Figure: Speedup over R14.5 for four models: Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), Bracket (45 KDOF). Bar labels: 1.3x, 1.7x, 2.7x, 2.4x.]

8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)


ANSYS Mechanical 15.0: Improved Performance at Higher Core Counts

Improved Scaling at 16 cores by an enhanced domain decomposition method

[Figure: Speedup over R14.5 for four models: Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), Bracket (45 KDOF). Bar labels: 1.6x, 1.8x, 3.8x, 4.0x.]

8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)


ANSYS Mechanical 15.0: Improved Performance at Higher Core Counts

Improved Scaling at 32 cores by an enhanced domain decomposition method

[Figure: Speedup over R14.5 for four models: Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), Bracket (45 KDOF). Bar labels: 1.8x, 2.2x, 3.9x, 5.0x.]

8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)


ANSYS Mechanical 15.0: Improved Performance at Higher Core Counts

Continually improving Core Solver Rating to 80 cores

Courtesy of HP


ANSYS Mechanical 15.0: Improved Performance on NVIDIA GPUs

R15.0 can be up to 30% faster than the previous version

[Figure: Relative speedup on the Turbine benchmark and the BGA benchmark for R14.5 (8 cores), R14.5 (8 cores + 1 GPU), and R15 (8 cores + 1 GPU). Bar labels: 1.2x, 1.2x, 1.4x, 1.5x.]

Linux server with 32 Intel Xeon E5-4650 cores @ 2.7 GHz; 2 Tesla K20; 512 GB RAM


ANSYS Mechanical 15.0: Consider Adding GPUs for High Core Counts

• Lower core counts favor a single GPU
• Higher core counts favor multiple GPUs

Courtesy of HP


ANSYS Mechanical 15.0: GPU Acceleration through Intel Xeon Phi

Intel Xeon Phi coprocessors are now supported
• Use ‘-acc intel’ to activate this capability
• Xeon Phi models 7120, 5110, 3120 are supported
• Multiple cards

Note:
• Supported by sparse solver (symmetric matrices only)
• Linux only (no Windows support yet)
• SMP only supported
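The restrictions listed for Xeon Phi (Linux only, SMP only) can be encoded as a small guard around the launch command. This is a minimal sketch under stated assumptions: only `-acc intel` comes from the slide. The executable name `ansys150`, the batch and file flags `-b`, `-i`, `-o`, the core-count flag `-np`, and the function name `xeon_phi_command` are assumptions for illustration and should be checked against the documentation of the installed release.

```python
def xeon_phi_command(input_file, output_file, cores,
                     platform="linux", distributed=False,
                     executable="ansys150"):
    """Build an argument list for a shared-memory run with Xeon Phi enabled.

    Only "-acc intel" is taken from the slide; the executable name and the
    remaining flags are assumptions made for this sketch.
    """
    if platform != "linux":
        raise ValueError("Xeon Phi acceleration is listed as Linux only")
    if distributed:
        raise ValueError("Xeon Phi acceleration is listed as SMP only")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return [executable, "-b", "-np", str(cores), "-acc", "intel",
            "-i", input_file, "-o", output_file]


# Hypothetical file names, shown only to illustrate the call.
print(" ".join(xeon_phi_command("model.dat", "model.out", cores=8)))
```

Returning a list rather than a single string keeps the arguments ready for `subprocess.run` without shell quoting concerns.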


ANSYS Mechanical 15.0: GPU Acceleration through Intel Xeon Phi

Significant speedups can be achieved with Xeon Phi card
• Shared Memory Sparse Solver on Linux

[Figure: Xeon Phi Acceleration (SMP). Speedup at 1, 2, 4, 8, and 16 cores, comparing CPU cores only with CPU cores + Xeon Phi. Bar labels: 3.3x, 4.3x, 5.1x, 6.0x, 6.8x.]

V14sp-5 Model
• Turbine geometry
• 2.1 million DOF
• SOLID187 elements
• Static, nonlinear analysis
• One iteration
• Sparse direct solver

Linux workstation (16 Intel Xeon E5-2670 cores @ 2.6 GHz, 1 7120A Xeon Phi, 64 GB RAM).


Maximizing Performance Tip #4: Take Advantage of New HPC Licensing

Each ANSYS HPC license unlocks GPUs
• In terms of licensing, one GPU socket equates to one CPU core
• More value from your investment in HPC

ANSYS HPC Workgroup 16 / 32 / 64
• Ability to fully utilize the computing power of your high-end workstation(s)
• Also ideal for 2 – 4 users each with modest HPC needs on servers or entry-level clusters
• Filling the gap between HPC Packs and HPC Workgroup 128
• “Enterprise” options available to deploy and use anywhere in the world


Each HPC License Enables GPU Acceleration

[Figure: ANSYS Mechanical jobs/day (higher is better).]

Simulation productivity (with an HPC license)
• 2 CPU cores: 93 jobs/day
• 2 CPU cores + Tesla K20: 324 jobs/day (3.5X)

Simulation productivity (with an HPC Pack)
• 8 CPU cores: 275 jobs/day
• 7 CPU cores + Tesla K20: 576 jobs/day (2.1X)

V14sp-5 Model
• Turbine geometry
• 2.1 million DOF
• SOLID187 elements
• Static, nonlinear analysis
• One iteration
• Sparse direct solver

Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 GPU with boost clocks.
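The speedup factors quoted on this slide follow directly from the jobs/day figures. The short sketch below reproduces that arithmetic; the dictionary layout and the function name `gpu_speedup` are illustrative choices, while the numbers are the ones shown on the slide.

```python
# Jobs/day figures from the slide: (CPU only, CPU + Tesla K20).
JOBS_PER_DAY = {
    "HPC license (2 CPU cores vs. 2 CPU cores + K20)": (93, 324),
    "HPC Pack (8 CPU cores vs. 7 CPU cores + K20)": (275, 576),
}


def gpu_speedup(cpu_only_jobs, with_gpu_jobs):
    """Throughput ratio of the GPU-assisted run over the CPU-only run."""
    return with_gpu_jobs / cpu_only_jobs


for label, (cpu_only, with_gpu) in JOBS_PER_DAY.items():
    print(f"{label}: {gpu_speedup(cpu_only, with_gpu):.1f}X")
```

Rounded to one decimal, the ratios match the 3.5X and 2.1X labels on the chart.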


Maximizing Performance Tip #5: Check “HPC Suitability” of Your Model

Models:
• With more than 500,000 DOF that do not require hundreds of iterations

The above models can possibly be further accelerated by GPU(s):
• If they do NOT have Joints in Workbench Mechanical
• Sparse solver:
  – Bulkier and/or higher-order FE models are good and will be accelerated
  – If the model exceeds 5M DOF, then either add another GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB of memory (Tesla K40 or Quadro K6000)
• PCG/JCG solver:
  – Memory saving (MSAVE) option should be turned off to enable GPUs
  – Models with a lower Level of Difficulty value (Lev_Diff) are better suited for GPUs

Some generic recommendations:
• Try to run in-core
• Avoid big and overlapping contact pairs
• Try to reduce the number of RBE3 nodes
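The GPU-related criteria on this slide can be turned into a simple pre-flight checklist. This is a sketch under stated assumptions: the `ModelInfo` container, its field names, and the function `gpu_suitability_notes` are hypothetical, and the thresholds are taken directly from the slide rather than from any solver API.

```python
from dataclasses import dataclass


@dataclass
class ModelInfo:
    """Hypothetical summary of a model, filled in by the analyst."""
    dof: int
    solver: str            # "sparse", "pcg", or "jcg"
    has_joints: bool = False
    msave_on: bool = False


def gpu_suitability_notes(model):
    """Return the slide's caveats that apply to this model (empty if none)."""
    notes = []
    if model.dof <= 500_000:
        notes.append("Not above 500,000 DOF: outside the size range named on the slide")
    if model.has_joints:
        notes.append("Model has joints: GPU acceleration is not expected to help")
    solver = model.solver.lower()
    if solver == "sparse" and model.dof > 5_000_000:
        notes.append("Over 5M DOF with sparse solver: add a second 5-6 GB GPU "
                     "or use a single 12 GB GPU")
    if solver in ("pcg", "jcg") and model.msave_on:
        notes.append("Turn the MSAVE option off to enable GPU use with PCG/JCG")
    return notes


# Hypothetical model resembling the 2.1 million DOF benchmark used in the deck.
print(gpu_suitability_notes(ModelInfo(dof=2_100_000, solver="sparse")))
```

An empty list means none of the slide's caveats were triggered; it does not guarantee a speedup.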


Maximizing Performance Tip #6: Check Out Additional Resources


Maximizing Performance: Distributed ANSYS Performance

1 Billion DOF, 64 Cores, 13 Hours Later

Whole new class of problems can be SOLVED!

1 Billion DOF Solved!


• Connect with Me
  – [email protected]
• Connect with ANSYS, Inc.
  – LinkedIn ANSYSInc
  – Twitter @ANSYS_Inc
  – Facebook ANSYSInc
• Follow our Blog
  – ansys-blog.com

Thank You!