Altair OptiStruct 13.0 Performance Benchmark and Profiling May 2015



Note

• The following research was performed under the HPC Advisory Council activities

– Participating vendors: Intel, Dell, Mellanox

– Compute resource - HPC Advisory Council Cluster Center

• The following was done to provide best practices

– OptiStruct performance overview

– Understanding OptiStruct communication patterns

– Ways to increase OptiStruct productivity

– MPI library comparisons

• For more info please refer to

– http://www.altair.com

– http://www.dell.com

– http://www.intel.com

– http://www.mellanox.com


Objectives

• The following was done to provide best practices

– OptiStruct performance benchmarking

– Interconnect performance comparisons

– MPI performance comparison

– Understanding OptiStruct communication patterns

• The presented results will demonstrate

– The scalability of the compute environment to provide nearly linear application scalability

– The capability of OptiStruct to achieve scalable productivity


OptiStruct by Altair

• Altair OptiStruct

– OptiStruct is an industry proven, modern structural analysis solver

• Solves linear and nonlinear structural problems under static and dynamic loading

• Market-leading solution for structural design and optimization

– Helps designers and engineers to analyze and optimize structures

• Optimizes for strength, durability and NVH (Noise, Vibration, Harshness) characteristics

• Helps to rapidly develop innovative, lightweight and structurally efficient designs

– Based on finite-element and multi-body dynamics technology


Test Cluster Configuration

• Dell™ PowerEdge™ R730 32-node (896-core) “Thor” cluster

– Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Turbo on, Early Snoop, Max Perf in BIOS)

– OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack

– Memory: 64GB memory, DDR4 2133 MHz

– Hard Drives: 1TB 7.2K RPM SATA 2.5”

• Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch

• Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch

• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters

• Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters

• MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0

• Application: Altair OptiStruct 13.0

• Benchmark datasets:

– Engine Assembly
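The 896-core figure follows directly from the node, socket and core counts listed above. A minimal sketch of that arithmetic (the constant names are illustrative, not part of any tool):

```python
# Core count of the test cluster, using only the figures listed on this slide.
NODES = 32             # Dell PowerEdge R730 nodes
SOCKETS_PER_NODE = 2   # dual-socket servers
CORES_PER_SOCKET = 14  # Intel E5-2697v3

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
total_cores = NODES * cores_per_node

print(f"{cores_per_node} cores per node, {total_cores} cores in total")
```

The per-node figure (28 cores) is the budget that the processes-per-node and threads-per-process settings on later slides have to fit into.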


PowerEdge R730

Massive flexibility for data intensive operations

• Performance and efficiency

– Intelligent hardware-driven systems management with extensive power management features

– Innovative tools including automation for parts replacement and lifecycle manageability

– Broad choice of networking technologies from GbE to IB

– Built in redundancy with hot plug and swappable PSU, HDDs and fans

• Benefits

– Designed for performance workloads

• From big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large scale hosting environments

• High performance scale-out compute and low cost dense storage in one package

• Hardware Capabilities

– Flexible compute platform with dense storage capacity

• 2S/2U server, 6 PCIe slots

– Large memory footprint (Up to 768GB / 24 DIMMs)

– High I/O performance and optional storage configurations

• HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server

• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch


OptiStruct Performance – CPU Cores

• Running more cores per node generally improves overall performance

– The “-nproc” parameter specifies the number of threads spawned per MPI process

• Guideline: 6 threads per MPI process yields the best performance

– 6 threads per MPI process appears to be ideal at both 2 and 4 PPN

– This setting performed best among all thread counts tested
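To see how the recommended layouts map onto a node, the sketch below multiplies processes per node (PPN) by threads per process and compares the result with the 28 cores available on each test node. The helper name `cores_used` is hypothetical; only the PPN and thread values come from the slide.

```python
CORES_PER_NODE = 28  # 2 sockets x 14 cores, from the test cluster configuration

def cores_used(ppn: int, threads_per_process: int) -> int:
    """Cores occupied on one node by a hybrid MPI + threads layout."""
    return ppn * threads_per_process

for ppn in (2, 4):
    used = cores_used(ppn, 6)
    idle = CORES_PER_NODE - used
    print(f"{ppn} PPN x 6 threads -> {used} cores busy, {idle} idle")
```

Under these assumptions 2 PPN occupies 12 cores and 4 PPN occupies 24, which is in line with the observation that running more cores per node generally improves performance.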

Chart: OptiStruct performance by threads per MPI process (higher is better)


OptiStruct Performance – Interconnect

• EDR InfiniBand scales better than Ethernet

– 11 times better performance than 1GbE at 24 nodes

– 90% better performance than 10GbE at 24 nodes

– Ethernet solutions do not scale beyond 4 nodes
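The two gains above are quoted in different units, a multiple and a percentage. The sketch below converts between them so they can be compared on one scale. It assumes that "11 times better" means 11 times the baseline performance, which the slide does not state explicitly; both function names are illustrative.

```python
def percent_gain_to_ratio(percent: float) -> float:
    """A gain of 90% over the baseline is 1.9 times the baseline."""
    return 1.0 + percent / 100.0

def ratio_to_percent_gain(ratio: float) -> float:
    """11 times the baseline is a 1000% gain over the baseline."""
    return (ratio - 1.0) * 100.0

print(f"90% better -> {percent_gain_to_ratio(90):.1f}x the 10GbE result")
print(f"11x        -> {ratio_to_percent_gain(11):.0f}% better than the 1GbE result")
```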

Chart: performance by interconnect, 2 PPN / 6 threads (higher is better); annotated gains: 11x and 90%


OptiStruct Profiling – Number of MPI Calls

• For 1GbE, communication time is mostly spent on point-to-point transfers

– MPI_Iprobe and MPI_Test are the calls that poll non-blocking transfers (MPI_Iprobe checks for an incoming message, MPI_Test checks whether a request has completed)

– Overall runtime is significantly longer than with faster interconnects

• For 10GbE, communication time is still consumed by data transfer

– Time spent on non-blocking transfers remains significant

– Overall runtime is shorter than with 1GbE

– As data-transfer time shrinks, collective operations take a larger share of the overall communication time

• For EDR InfiniBand, overall runtime drops significantly compared with Ethernet

– MPI_Allreduce consumes more time than data transfer
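The polling pattern behind the MPI_Iprobe and MPI_Test counts can be illustrated without MPI. The sketch below is only an analogy: a Python future stands in for a non-blocking MPI request, and `done()` plays the role that MPI_Test plays in an MPI program. The function names and the 50 ms delay are assumptions made for illustration, not measurements from the benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_transfer(payload: bytes) -> int:
    """Stand-in for a message in flight on a slow network."""
    time.sleep(0.05)
    return len(payload)

def receive_with_polling(payload: bytes):
    """Start a transfer, then poll for completion while doing other work."""
    polls = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        request = pool.submit(slow_transfer, payload)  # non-blocking start
        while not request.done():                      # analogous to MPI_Test
            polls += 1
            time.sleep(0.001)                          # placeholder for useful compute
        received = request.result()
    return received, polls

received, polls = receive_with_polling(b"x" * 1024)
print(f"received {received} bytes after {polls} polls")
```

In this analogy the number of polls grows with the transfer time, which gives one plausible reading of why completion tests figure so prominently in the profile on slower interconnects.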

Charts: 1GbE, 10GbE, EDR IB



OptiStruct Profiling – MPI Message Sizes

• The most time consuming MPI communications are:

– MPI_Allreduce: messages concentrated at 8B

– MPI_Iprobe and MPI_Test: a large volume of calls that test for completion of messages

Chart: MPI message sizes, 2 PPN / 6 threads


OptiStruct Performance – Interconnect

• EDR InfiniBand scales better than the previous InfiniBand generation

– EDR InfiniBand outperforms FDR InfiniBand by 40% at 24 nodes

– EDR InfiniBand outperforms FDR InfiniBand by 9% at 16 nodes

– The new EDR IB architecture surpasses the previous FDR IB generation in scalability

Chart: EDR vs FDR InfiniBand, 4 PPN / 6 threads (higher is better); annotated gains: 9% and 40%


OptiStruct Performance – Processes Per Node

• OptiStruct reduces communication by deploying hybrid MPI mode

– Each hybrid MPI process can spawn threads, which helps reduce communication on the network

– Enabling more MPI processes per node helps to unlock additional performance

• The following environment settings and tuned flags were used:

– I_MPI_PIN_DOMAIN auto, I_MPI_ADJUST_ALLREDUCE 2, I_MPI_ADJUST_BCAST 1, I_MPI_ADJUST_REDUCE 2

– ulimit -s unlimited
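One way to carry these settings into a job is to collect them in a dictionary and merge them into the environment passed to the launcher. The sketch below does only that: the variable names and values are the ones listed above, while `INTEL_MPI_TUNING` and `tuned_environment` are hypothetical helper names, and no launcher command is shown because the slides do not give one.

```python
import os

# Intel MPI settings exactly as listed on this slide.
INTEL_MPI_TUNING = {
    "I_MPI_PIN_DOMAIN": "auto",
    "I_MPI_ADJUST_ALLREDUCE": "2",
    "I_MPI_ADJUST_BCAST": "1",
    "I_MPI_ADJUST_REDUCE": "2",
}

def tuned_environment(base=None):
    """Return a copy of the environment with the tuning variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(INTEL_MPI_TUNING)
    return env

# Print the equivalent shell lines for a job script.
for name, value in INTEL_MPI_TUNING.items():
    print(f"export {name}={value}")
print("ulimit -s unlimited")
```

The returned dictionary can be handed to `subprocess.run(..., env=...)` when launching the solver from Python; the stack limit still has to be raised in the shell or job script that starts the run.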

Chart: performance by processes per node (higher is better); annotated gains: 4% and 10%


OptiStruct Performance – IMPI Tuning

• Tuning the Intel MPI collective algorithm can improve performance

– MPI profile shows ~30% of runtime spent on MPI_Allreduce communication over InfiniBand

– Default algorithm in Intel MPI is Recursive Doubling (I_MPI_ADJUST_ALLREDUCE=1)

– Rabenseifner's algorithm for Allreduce (I_MPI_ADJUST_ALLREDUCE=2) appears to be the best on 24 nodes
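For readers unfamiliar with the default algorithm, the sketch below simulates a recursive-doubling sum-allreduce in plain Python. It is a teaching model, not Intel MPI's implementation: ranks are list indices, an "exchange" is a list lookup, and it handles only power-of-two rank counts (the 24-node runs in this study do not have a power-of-two rank count, and real libraries add extra steps for that case).

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce across len(values) ranks (power of two only)."""
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch only handles a power-of-two rank count")
    current = list(values)
    steps = 0
    distance = 1
    while distance < p:
        # Every rank swaps its partial sum with the partner at rank XOR distance
        # and adds the partner's value to its own.
        current = [current[r] + current[r ^ distance] for r in range(p)]
        distance *= 2
        steps += 1
    return current, steps

result, steps = recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(result, steps)
```

Every rank ends with the full sum after log2(P) exchange steps. With messages as small as the 8-byte Allreduce payloads seen in the profile, per-step latency rather than bandwidth is likely the dominant cost, which is one reason the choice of algorithm can matter.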

Chart: Intel MPI Allreduce algorithm comparison, 4 PPN / 6 threads (higher is better)


OptiStruct Performance – CPU Frequency

• Increasing the CPU clock speed allows higher job efficiency

– Up to 11% higher productivity by increasing clock speed from 2300MHz to 2600MHz

• Turbo Mode boosts job efficiency more than the increase in clock speed

– Up to 31% higher performance by enabling Turbo Mode at 2600MHz

– Performance gain from Turbo Mode depends on environmental factors, e.g. temperature
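The clock-speed result can be put in context by comparing the reported gain with the relative increase in frequency. A small sketch of that arithmetic, using only the numbers quoted above (the helper name is illustrative):

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0

clock_gain = percent_increase(2300, 2600)  # frequency increase in percent
reported_gain = 11.0                       # "up to 11%" from the slide
efficiency = reported_gain / clock_gain    # share of the clock gain realised

print(f"clock: +{clock_gain:.1f}%, performance: +{reported_gain:.0f}%, "
      f"ratio: {efficiency:.2f}")
```

A 13% higher clock delivering up to 11% more performance suggests the workload is largely, though not entirely, bound by core speed at this scale.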

Chart: performance by CPU frequency and Turbo Mode, 4 PPN / 6 threads (higher is better); annotated gains: 4%, 8%, 10%, 11%, 17%, 31%


OptiStruct Profiling – Disk I/O

• OptiStruct makes use of distributed I/O on the local scratch of compute nodes

– Heavy disk I/O appears to take place throughout the run on each compute node

– The high I/O usage also causes system memory to be used for I/O caching

– Disk I/O is distributed across all compute nodes, which provides higher aggregate I/O performance

– The workload completes faster as more nodes take part in the distributed I/O

Chart: disk I/O profile, 4 PPN / 6 threads (higher is better)


OptiStruct Profiling – MPI Message Sizes

• The majority of data transfer takes place between rank 0 and the other ranks

– The non-blocking communications appear to be data transfers that hide network latency

– The collective operations appear to be much smaller in size

Charts: 16 nodes and 32 nodes, 2 PPN / 6 threads


OptiStruct – Summary

• OptiStruct is designed to perform structural analysis at large scale

– OptiStruct uses hybrid MPI mode to perform at scale

– EDR InfiniBand outperforms Ethernet in scalability

• 11 times better performance than 1GbE at 24 nodes

• 90% better performance than 10GbE at 24 nodes

• EDR InfiniBand outperforms FDR IB by 40% at 24 nodes

– Each hybrid MPI process can spawn threads, which helps reduce communication on the network

• Enabling more MPI processes per node helps to unlock additional performance

• Hybrid MPP version enhanced OptiStruct scalability

• Profiling and Tuning: CPU, I/O, Network

– MPI_Allreduce accounts for ~30% of runtime at scale

– Tuning for MPI_Allreduce should allow better performance at high core counts

– Guideline: 6 threads per MPI process yields the best performance

– Turbo Mode boosts job efficiency more than the increase in clock speed

– OptiStruct makes use of distributed I/O on the local scratch of compute nodes

– Heavy disk I/O appears to take place throughout the run on each compute node
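The ~30% MPI_Allreduce share puts a ceiling on what collective tuning alone can deliver. The sketch below applies an Amdahl-style bound to that figure. The component speedups in the loop are hypothetical inputs chosen for illustration, not measured results.

```python
ALLREDUCE_SHARE = 0.30  # ~30% of runtime, from the MPI profile

def overall_speedup(share: float, component_speedup: float) -> float:
    """Overall speedup when only `share` of the runtime gets faster."""
    return 1.0 / ((1.0 - share) + share / component_speedup)

# Hypothetical Allreduce speedups, chosen only to show the trend.
for s in (1.5, 2.0, 4.0):
    print(f"Allreduce {s:.1f}x faster -> job {overall_speedup(ALLREDUCE_SHARE, s):.2f}x faster")

# Limit if Allreduce time were eliminated entirely.
upper_bound = 1.0 / (1.0 - ALLREDUCE_SHARE)
print(f"upper bound: {upper_bound:.2f}x")
```

Under this simple model, even a free Allreduce would cap the improvement at about 1.43x, so collective tuning complements the other recommendations rather than replacing them.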


Thank You

HPC Advisory Council

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.