© 2014 ANSYS, Inc. November 23, 2014 1
Reducing Simulation Time of Your Mechanical Applications by Leveraging Computing Power
Wim Slagter, PhD
ANSYS, Inc.
Most Users Constrained by Hardware
Source: HPC Usage survey with over 1,800 ANSYS respondents
Problem Statement
I am not achieving the performance and throughput I was
expecting from my hardware & software
Image courtesy of Intel Corporation
Take Advantage of the Desktop Revolution
Recent advances have revolutionized the computational speed available on the desktop
• Multi-core processors
  o Every core is really an independent processor
• Large amounts of RAM
• SSDs
• GPUs
• FDR InfiniBand
HDD vs. SSD
Maximizing Performance – Putting it Together
The right combination of hardware and software
leads to maximum efficiency
SMP vs. DMP
Interconnects?
Clusters?
GPUs? CPUs?
Maximizing Performance Tip #1 Avoid Waiting for I/O to Complete
Always check whether the job is I/O bound or compute bound
Check the output file for CPU and Elapsed times
• When Elapsed time >> main thread CPU time: I/O bound
  – Consider adding more RAM or a faster hard drive configuration
• When Elapsed time ≈ main thread CPU time: compute bound
  – Consider moving the simulation to a machine with newer, faster processors
  – Consider using Distributed ANSYS (DMP) instead of SMP
  – Consider running on more CPU cores or possibly using GPU(s)
Example output:
Total CPU time for main thread : 159.8 seconds
. . . . . .
Elapsed Time (sec) = 398.000 Date = 03/21/2014
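The check described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `classify_job` is hypothetical, the regular expressions assume the output file contains lines shaped like the two quoted on the slide, and the 2x ratio used to stand in for "Elapsed time >> CPU time" is an assumption, not an ANSYS rule.

```python
import re

# Patterns modeled on the two output-file lines quoted on the slide.
CPU_RE = re.compile(r"Total CPU time for main thread\s*:\s*([0-9.]+)\s*seconds")
ELAPSED_RE = re.compile(r"Elapsed Time \(sec\)\s*=\s*([0-9.]+)")

# Assumption: "elapsed >> CPU" is read as elapsed being at least 2x the CPU time.
IO_BOUND_RATIO = 2.0


def classify_job(output_text, ratio_threshold=IO_BOUND_RATIO):
    """Return (cpu_seconds, elapsed_seconds, verdict) parsed from output text."""
    cpu_match = CPU_RE.search(output_text)
    elapsed_match = ELAPSED_RE.search(output_text)
    if cpu_match is None or elapsed_match is None:
        raise ValueError("could not find the CPU and elapsed time lines")
    cpu = float(cpu_match.group(1))
    elapsed = float(elapsed_match.group(1))
    # Anything below the threshold is treated as compute bound in this sketch.
    verdict = "I/O bound" if elapsed >= ratio_threshold * cpu else "compute bound"
    return cpu, elapsed, verdict


# The two lines shown on the slide.
sample = """
Total CPU time for main thread : 159.8 seconds
Elapsed Time (sec) = 398.000 Date = 03/21/2014
"""
cpu, elapsed, verdict = classify_job(sample)
print(f"CPU {cpu:.1f} s, elapsed {elapsed:.1f} s -> {verdict}")
```

For the slide's numbers, elapsed time is roughly 2.5 times the main thread CPU time, so the sketch reports the job as I/O bound.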
How to Improve an I/O Bound Simulation
First consider adding more RAM
• Always the best option for optimal performance
• Allows the operating system to cache file data in memory
Next consider improving the I/O configuration
• Need fast hard drives to feed fast processors
• Consider SSDs
  – Higher bandwidths and extremely low seek times
• Consider RAID configurations
  – RAID 0 – for speed
  – RAID 1, 5 – for redundancy
  – RAID 10 – for speed and redundancy
Example of an I/O Bound Simulation
Chart: Benefits of SSD and RAM. Relative speedup for 2 cores with HDD, 8 cores with HDD, and 8 cores with SSD, comparing 16 GB RAM and 128 GB RAM. Data labels as extracted: 0.8x, 2.9x, 2.7x, 5.9x, 5.9x
• Adding RAM gives biggest gains & allows good scaling
• Lack of RAM and slow HDD ruin scaling
• Single SSD helps allow some scaling. Not as helpful as RAM, but cheaper
Model and hardware:
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• One 10k rpm HDD, one SSD
• Windows 7
How to Improve a Compute Bound Simulation
First consider using newer, faster processors
• New CPU architecture and faster clock speeds always help
Next consider using parallel processing
• DMP virtually always recommended over SMP
  – More computations performed in parallel with DMP
  – Significantly faster speedups achieved using DMP
  – DMP can take advantage of all resources on a cluster
  – Whole new class of problems can be solved
Furthermore consider using GPU acceleration
• Can help accelerate critical, time-consuming computations
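One way to see why "more computations performed in parallel" leads to better speedups is Amdahl's law, which is not mentioned on the slide and is used here only as a general illustration. The sketch below computes the ideal speedup for a given parallel fraction of the runtime. The two fractions (80% and 95%) are hypothetical values chosen for illustration, not measurements of SMP or DMP.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only `parallel_fraction` of the runtime scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical parallel fractions, for illustration only.
for label, fraction in (("smaller parallel share", 0.80),
                        ("larger parallel share", 0.95)):
    row = ", ".join(f"{n} cores: {amdahl_speedup(fraction, n):.2f}x"
                    for n in (2, 8, 32))
    print(f"{label} ({fraction:.0%}): {row}")
```

The gap between the two rows widens as the core count grows, which is consistent with the slide's advice to prefer the approach that parallelizes more of the computation.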
Example of a Compute Bound Simulation
Chart: Benefits of DMP and GPU. Relative speedup for 2 cores, 8 cores, and 8 cores with GPU, comparing Xeon X5675 and Xeon E5-2670. Data labels as extracted: 1.8x, 4.0x, 11.0x
• Maximum performance found by adding GPU
• Using newer Xeons gives big gain
• Using 8 cores gives faster performance
Model and hardware:
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• 128 GB RAM
• 1 Tesla K20c
• Windows 7
Maximizing Performance Tip #2 Use Fast Interconnects to Feed Fast CPUs
GigE (Gigabit Ethernet)
• 1 Gbit/sec (100 MB/sec)
10 GigE
• 10 Gbits/sec (1000 MB/sec)
Myrinet (Myricom, Inc)
• 2 Gbits/sec (250 MB/sec)
• Myri 10G – 10 Gbits/sec (4th generation Myrinet)
InfiniBand (many vendors/speeds)
• SDR/DDR/QDR
• 1x, 4x, 12x
• http://en.wikipedia.org/wiki/List_of_device_bandwidths
Slide callouts: Not recommended!! Bare minimum!!
RECOMMENDATION Over 1000 MB/s, especially when running on more than 4 nodes
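The bandwidth figures on this slide can be compared against the "over 1000 MB/s" recommendation with a small sketch. The helper names are hypothetical, and the table holds only the MB/sec values quoted on the slide. The raw conversion (8 bits per byte) is standard arithmetic. It matches the slide's Myrinet figure, while the slide's Ethernet figures are lower than the raw line rate.

```python
RECOMMENDED_MB_PER_S = 1000  # threshold quoted on the slide

# Bandwidth figures as quoted on the slide, in MB/s.
SLIDE_BANDWIDTH_MB_PER_S = {
    "GigE": 100,
    "10 GigE": 1000,
    "Myrinet": 250,
}


def raw_mb_per_s(gbits_per_s):
    """Convert a raw line rate in Gbit/s to MB/s (8 bits per byte, 1 GB = 1000 MB)."""
    return gbits_per_s * 1000 / 8


def exceeds_recommendation(mb_per_s, threshold=RECOMMENDED_MB_PER_S):
    """True only when the bandwidth is strictly over the threshold."""
    return mb_per_s > threshold


for name, bandwidth in SLIDE_BANDWIDTH_MB_PER_S.items():
    status = "over" if exceeds_recommendation(bandwidth) else "not over"
    print(f"{name}: {bandwidth} MB/s, {status} {RECOMMENDED_MB_PER_S} MB/s")
```

None of the three interconnects with quoted figures is strictly over the recommended bandwidth. The slide gives no MB/sec figure for InfiniBand, so it is left out of the table.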
Example of the Effect of a Fast Interconnect
Chart: Interconnect Performance. Rating (runs/day) at 8, 16, 32, 64, and 128 cores, comparing Gigabit Ethernet and DDR InfiniBand
V13sp-5 Model
• Turbine geometry
• 2,100 K DOF
• SOLID187 FEs
• Static, nonlinear
• One iteration
• Direct sparse
• Linux cluster (8 cores per node)
Maximizing Performance Tip #3 Stay Current on ANSYS Releases
1980s ► Vector Processing on Mainframes
1990 ► Shared Memory Multiprocessing (SMP) available
1994 ► Iterative PCG Solver introduced for large analyses
1999 - 2000 ► 64-bit large memory addressing
2004 ► 1st company to solve 100M structural DOF!
2005 - 2007 ► Distributed ANSYS (DMP) released! ► Distributed PCG solver ► Distributed sparse solver ► Variational Technology ► Support for clusters using Windows HPC
2007 - 2009 ► Teraflop performance at 512 cores! ► Optimized for multicore processors
2010 ► GPU acceleration (single GPU; SMP)!
2012 ► GPU acceleration (multiple GPUs; DMP)!
2013 ► First Commercial Vendor to Release Xeon Phi!
ANSYS Mechanical 15.0 Latest Solver & HPC Improvements
• NEW Subspace Eigen solver supports Shared and Distributed Parallel technology
• NEW MSUP Harmonic method for unsymmetric systems, e.g., vibro-acoustics
• Improved Scalability of Distributed solver at higher core counts
Figure captions: Coupled Acoustic, 1.2 M DOF, Full Harmonic Response; 2.09 MDOFs, first 20 modes
ANSYS Mechanical 15.0 Improved Performance at Higher Core Counts
Improved Scaling at 8 cores by an enhanced domain decomposition method
Chart: Speedup over R14.5 for Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), and Bracket (45 KDOF). Data labels as extracted: 1.3x, 1.7x, 2.7x, 2.4x
8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)
ANSYS Mechanical 15.0 Improved Performance at Higher Core Counts
Improved Scaling at 16 cores by an enhanced domain decomposition method
Chart: Speedup over R14.5 for Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), and Bracket (45 KDOF). Data labels as extracted: 1.6x, 1.8x, 3.8x, 4.0x
8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)
ANSYS Mechanical 15.0 Improved Performance at Higher Core Counts
Improved Scaling at 32 cores by an enhanced domain decomposition method
Chart: Speedup over R14.5 for Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), and Bracket (45 KDOF). Data labels as extracted: 1.8x, 2.2x, 3.9x, 5.0x
8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)
ANSYS Mechanical 15.0 Improved Performance at Higher Core Counts
Continually improving Core Solver Rating to 80 cores
Courtesy of HP
ANSYS Mechanical 15.0 Improved Performance on NVIDIA GPUs
R15.0 can be up to 30% faster than previous version
Chart: Relative speedup on the Turbine benchmark and BGA benchmark for R14.5 (8 cores), R14.5 (8 cores + 1 GPU), and R15 (8 cores + 1 GPU). Data labels as extracted: 1.2x, 1.2x, 1.4x, 1.5x
Linux server with 32 Intel Xeon E5-4650 cores @ 2.7 GHz; 2 Tesla K20; 512 GB RAM
ANSYS Mechanical 15.0 Consider Adding GPUs for High Core Counts
• Lower core counts favor a single GPU
• Higher core counts favor multiple GPUs
Courtesy of HP
ANSYS Mechanical 15.0 GPU Acceleration through Intel Xeon Phi
Intel Xeon Phi coprocessors are now supported
• Use ‘-acc intel’ to activate this capability
• Xeon Phi models 7120, 5110, 3120 are supported
• Multiple cards
Note:
• Supported by sparse solver (symmetric matrices only)
• Linux only (no Windows support yet)
• SMP only supported
ANSYS Mechanical 15.0 GPU Acceleration through Intel Xeon Phi
Significant speedups can be achieved with Xeon Phi card
• Shared Memory Sparse Solver on Linux
Chart: Xeon Phi Acceleration (SMP). Speedup at 1, 2, 4, 8, and 16 cores, comparing CPU cores only and CPU cores + Xeon Phi. Data labels as extracted: 3.3x, 4.3x, 5.1x, 6.0x, 6.8x
V14sp-5 Model
• Turbine geometry
• 2.1 million DOF
• SOLID187 elements
• Static, nonlinear analysis
• One iteration
• Sparse direct solver
Linux workstation (16 Intel Xeon E5-2670 cores @ 2.6 GHz, 1 7120A Xeon Phi, 64 GB RAM).
Maximizing Performance Tip #4 Take Advantage of New HPC Licensing
Each ANSYS HPC license unlocks GPUs
• In terms of licensing, one GPU socket equates to one CPU core
• More value from your investment in HPC
ANSYS HPC Workgroup 16 / 32 / 64
• Ability to fully utilize the computing power of your high-end workstation(s)
• Also ideal for 2 – 4 users each with modest HPC needs on servers or entry-level clusters
• Filling the gap between HPC Packs and HPC Workgroup 128
• “Enterprise” options available to deploy and use anywhere in the world
Each HPC License Enables GPU Acceleration
Chart: ANSYS Mechanical jobs/day (higher is better)
Simulation productivity (with an HPC license)
• 2 CPU cores: 93
• 2 CPU cores + Tesla K20: 324 (3.5X)
Simulation productivity (with an HPC Pack)
• 8 CPU cores: 275
• 7 CPU cores + Tesla K20: 576 (2.1X)
V14sp-5 Model
• Turbine geometry
• 2.1 million DOF
• SOLID187 elements
• Static, nonlinear analysis
• One iteration
• Sparse direct solver
Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 GPU with boost clocks.
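The 3.5X and 2.1X labels on this slide follow directly from the jobs/day figures (93 and 324 with an HPC license, 275 and 576 with an HPC Pack). The short sketch below reproduces that arithmetic and also converts jobs/day into average seconds per job, assuming jobs run back to back over a 24-hour day. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Jobs/day figures as read from the slide.
RESULTS = {
    "HPC license": {"2 CPU cores": 93, "2 CPU cores + Tesla K20": 324},
    "HPC Pack": {"8 CPU cores": 275, "7 CPU cores + Tesla K20": 576},
}


def speedup(baseline_jobs_per_day, accelerated_jobs_per_day):
    """Throughput ratio of the accelerated run over the baseline run."""
    return accelerated_jobs_per_day / baseline_jobs_per_day


def seconds_per_job(jobs_per_day):
    """Average wall time per job, assuming jobs run back to back all day."""
    return SECONDS_PER_DAY / jobs_per_day


for license_name, runs in RESULTS.items():
    # Each entry lists the CPU-only run first, then the CPU + GPU run.
    (base_name, base), (gpu_name, gpu) = runs.items()
    print(f"{license_name}: {base_name} -> {gpu_name}: "
          f"{speedup(base, gpu):.1f}X "
          f"({seconds_per_job(base):.0f} s -> {seconds_per_job(gpu):.0f} s per job)")
```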
Maximizing Performance Tip #5 Check “HPC Suitability” of Your Model
Models:
• With more than 500,000 DOFs that do not require hundreds of iterations
The above models can possibly be further accelerated by GPU(s):
• If they do NOT have Joints in Workbench Mechanical
• Sparse solver:
  – Bulkier and/or higher-order FE models are good and will be accelerated
  – If model exceeds 5M DOF, then either add another GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB memory (Tesla K40 or Quadro K6000).
• PCG/JCG solver:
  – Memory saving (MSAVE) option should be turned off for enabling GPUs
  – Models with lower Level of Difficulty value (Lev_Diff) are better suited for GPUs
Some generic recommendations:
• Try to run in-core
• Avoid big and overlapping contact pairs
• Try to reduce the number of RBE3 nodes
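The suitability rules on this slide can be written down as a simple checklist. The sketch below is an illustration only: `gpu_suitability_notes` and its parameters are hypothetical names, not part of any ANSYS API, and the thresholds are the ones stated on the slide (500,000 DOF, 5M DOF, 12 GB of GPU memory).

```python
def gpu_suitability_notes(dof, has_joints, solver, msave_on=False,
                          gpu_memory_gb=6, gpu_count=1):
    """Return a list of notes from the slide's rules; an empty list means none apply."""
    notes = []
    if dof <= 500_000:
        notes.append("Model has 500,000 DOF or fewer; the slide targets larger models.")
    if has_joints:
        notes.append("Model has Joints; the slide excludes these from GPU acceleration.")

    solver = solver.lower()
    if solver == "sparse":
        # Slide rule: above 5M DOF, use a second GPU or a single 12 GB GPU.
        if dof > 5_000_000 and gpu_count == 1 and gpu_memory_gb < 12:
            notes.append("Over 5M DOF: add another GPU or use a single 12 GB GPU.")
    elif solver in ("pcg", "jcg"):
        if msave_on:
            notes.append("Turn off the memory saving (MSAVE) option to enable GPUs.")
    else:
        raise ValueError(f"unrecognized solver: {solver!r}")
    return notes


# Example: a 6M DOF sparse-solver model on one GPU with 6 GB of memory.
for note in gpu_suitability_notes(6_000_000, has_joints=False, solver="sparse"):
    print(note)
```

The checklist deliberately leaves out the qualitative rules (bulkier models, Lev_Diff, contact pairs, RBE3 nodes) because the slide gives no numeric thresholds for them.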
Maximizing Performance Tip #6 Check Out Additional Resources
Maximizing Performance Distributed ANSYS Performance
1 Billion DOF 64 Cores 13 Hours Later
Whole new class of problems can be SOLVED!
1 Billion DOF Solved!
• Connect with Me – [email protected]
• Connect with ANSYS, Inc.
– LinkedIn ANSYSInc
– Twitter @ANSYS_Inc
– Facebook ANSYSInc
• Follow our Blog
– ansys-blog.com
Thank You!