TRANSCRIPT
Scot Schultz, Director, HPC and Technical Marketing
HPC Advisory Council, European Conference, June 2014
Interconnect Your Future
© 2014 Mellanox Technologies 2
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: end-to-end portfolio of ICs, adapter cards, switches/gateways, cables/modules, and host/fabric software; Virtual Protocol Interconnect connects server/compute, switch/gateway, storage front/back-end, and Metro/WAN over 56Gb/s InfiniBand & FCoIB and 10/40/56GbE & FCoE]
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
© 2014 Mellanox Technologies 3
The Speed of Data Mandates the Use of Data
[Timeline: Mellanox data-speed roadmap – 10Gb/s, 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s, and 200Gb/s across 2000-2020]
Keeping You One Generation Ahead
Enabling the Use of Data
Mellanox Roadmap of Data Speed
© 2014 Mellanox Technologies 4
Architectural Foundation for Exascale Computing
Connect-IB
© 2014 Mellanox Technologies 5
Mellanox Connect-IB – The World’s Fastest Adapter
The 7th generation of Mellanox interconnect adapters
World’s first 100Gb/s interconnect adapter (dual-port FDR 56Gb/s InfiniBand)
Delivers 137 million messages per second – 4X higher than competition
Supports the new innovative InfiniBand scalable transport – Dynamically Connected Transport
© 2014 Mellanox Technologies 6
Connect-IB Provides Highest Interconnect Throughput
Source: Prof. DK Panda
[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; higher is better. Peak measured values: 12,810 MB/s unidirectional and 24,727 MB/s bidirectional with Connect-IB dual-port FDR.]
Gain Your Performance Leadership With Connect-IB Adapters
© 2014 Mellanox Technologies 7
Connect-IB Delivers Highest Application Performance
200% Higher Performance Versus Competition, with Only 32 Nodes
Performance Gap Increases with Cluster Size
© 2014 Mellanox Technologies 8
Mellanox PeerDirect™
© 2014 Mellanox Technologies 9
What is PeerDirect™
PeerDirect is natively supported by Mellanox OFED 2.1 or later distributions
Supports peer-to-peer communications between Mellanox adapters and third-party devices
No unnecessary system memory copies or CPU overhead
• No longer needs a host buffer for each device
• No longer needs to share a host buffer between devices
Provides native support for the Intel Xeon Phi MPSS communication stack
Supports NVIDIA® GPUDirect RDMA with a separate plug-in, available at Mellanox.com
Supports the RoCE protocol over Mellanox VPI
Supported with all Mellanox ConnectX-3 and Connect-IB adapters
[Diagram: CPU, chipset, and chipset-vendor device data paths]
© 2014 Mellanox Technologies 10
Mellanox PeerDirect™ Architecture
A generic infrastructure to allow PCIe peer devices to communicate directly
Enables direct RDMA transactions between a PCIe peer device and the Mellanox interconnect (see the sketch below)
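For illustration, a minimal sketch of what this enables at the verbs level: registering a GPU buffer with ibv_reg_mr so the HCA can issue RDMA to it directly. It assumes the NVIDIA GPUDirect RDMA plug-in (nv_peer_mem) is loaded; device selection and error handling are abbreviated.

```c
/* Sketch: register GPU device memory with the verbs API so the HCA can
 * target it directly. Assumes the GPUDirect RDMA plug-in is loaded. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Allocate the buffer in GPU memory instead of host memory. */
    void *gpu_buf = NULL;
    size_t len = 1 << 20;                 /* 1 MB, illustrative */
    cudaMalloc(&gpu_buf, len);

    /* With PeerDirect/GPUDirect RDMA, the same registration call used for
     * host memory also accepts a device pointer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr on GPU memory"); return 1; }

    printf("registered GPU buffer, lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```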
© 2014 Mellanox Technologies 11
Native support for peer-to-peer communications between Mellanox HCA adapters and NVIDIA GPU devices
GPUDirect RDMA
© 2014 Mellanox Technologies 12
Evolution of GPUDirect RDMA
Before GPUDirect
Network and third-party device drivers did not share buffers, and each needed to make a redundant copy in host memory.
With GPUDirect 1.0 – Shared Host Memory Pages
The network and GPU can share “pinned” (page-locked) buffers, eliminating the need to make a redundant copy in host memory (illustrated in the sketch below).
[Diagrams: Pre-GPUDirect model vs. GPUDirect shared host memory pages model]
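A minimal sketch of the GPUDirect 1.0 shared-host-memory model, assuming a CUDA runtime and any MPI library: the device buffer is staged through a single pinned host buffer that both CUDA and the network stack can use, avoiding the extra copy of the pre-GPUDirect path. Buffer names and sizes are illustrative.

```c
/* GPUDirect 1.0 model: GPU and network share one pinned host buffer, so
 * only a single device<->host copy is needed around each transfer. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t len = 1 << 20;
    void *d_buf, *h_pinned;
    cudaMalloc(&d_buf, len);
    cudaMallocHost(&h_pinned, len);   /* pinned buffer shared by CUDA and the NIC */

    if (rank == 0) {
        cudaMemcpy(h_pinned, d_buf, len, cudaMemcpyDeviceToHost);
        MPI_Send(h_pinned, (int)len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_pinned, (int)len, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_pinned, len, cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```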
© 2014 Mellanox Technologies 13
Eliminates CPU bandwidth and latency bottlenecks
Uses remote direct memory access (RDMA) transfers between GPUs
Resulting in significantly improved MPI Send/Recv efficiency between GPUs on remote nodes (see the sketch below)
[Diagram: GPUDirect™ RDMA data path, implemented using PeerDirect™]
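A minimal sketch of the usage model GPUDirect RDMA enables, assuming a CUDA-aware MPI (for example MVAPICH2-GDR or Open MPI built with CUDA support): MPI_Send/MPI_Recv are given GPU device pointers directly, with no staging through host memory.

```c
/* GPUDirect RDMA path with a CUDA-aware MPI: pass device pointers straight
 * to MPI point-to-point calls; data moves NIC-to-GPU without host staging. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 4096;
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```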
© 2014 Mellanox Technologies 14
Setup Considerations for GPUDirect RDMA
[Diagram: GPU and InfiniBand adapter attached under the same CPU, behind a common chipset or PCIe switch]
Note: A requirement for GPUDirect RDMA to work properly is that the NVIDIA GPU and the Mellanox InfiniBand adapter share the same root complex.
© 2014 Mellanox Technologies 15
[Charts: Performance of MVAPICH2 with GPUDirect RDMA. GPU-GPU internode MPI latency (us) vs. message size (bytes) – 67% lower latency, down to 5.49 usec; lower is better. GPU-GPU internode MPI bandwidth (MB/s) vs. message size (bytes) – 5X increase in throughput; higher is better. Source: Prof. DK Panda]
© 2014 Mellanox Technologies 16
Mellanox PeerDirect™ with NVIDIA GPUDirect RDMA
HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs
GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
• Unlocks performance between the GPU and InfiniBand
• Provides a significant decrease in GPU-GPU communication latency
• Provides complete CPU offload of all GPU communications across the network
Demonstrated up to a 102% performance improvement with a large number of particles
[Chart: HOOMD-blue performance gains of 21% and 102% with GPUDirect RDMA]
© 2014 Mellanox Technologies 17
HPC-X
© 2014 Mellanox Technologies 18
Mellanox Optimized Software Stack Components
[Stack diagram, bottom to top:]
• Interconnect hardware – InfiniBand®: ConnectX-3 and Connect-IB; Ethernet: ConnectX-3, ConnectX-3 Pro (10/40/56 Gb/s with RoCE support)
• Platforms – x86, Power8, ARM
• Linux / Windows Operating System
• Mellanox OFED® – PeerDirect™, Core-Direct™, GPUDirect® RDMA
• Mellanox HPC-X™ – MPI, SHMEM, UPC, MXM, FCA
• Communication Libraries – MPI, SHMEM, UPC
• Applications
Achieve Maximum Application Scalability and Performance
Future-proofed software compatibility: SDR-DDR-QDR-FDR-EDR
© 2014 Mellanox Technologies 19
Mellanox HPC-X™ Components and Accelerators
Mellanox ScalableMPI – Message Passing Interface based on Open MPI
Mellanox ScalableSHMEM – One-sided communications library
Mellanox ScalableUPC – Berkeley UPC parallel programming language library
Mellanox MXM – Messaging Accelerator optimized for the underlying hardware
Mellanox FCA – Fabric Collectives Accelerator supporting MPI-3
Mellanox OFED – Unlocks additional capabilities of the interconnect architecture
Profiling tools, benchmarking tools, and more
© 2014 Mellanox Technologies 20
HPC-X™
Complete MPI/OpenSHMEM/PGAS/UPC package for HPC environments
• Fully optimized for Mellanox InfiniBand and VPI interconnect solutions
• Supports 3rd party solutions
Components
• Communication libraries: ScalableMPI, ScalableSHMEM, ScalableUPC
• Acceleration libraries: MXM – Messaging Accelerator, FCA – Fabric Collectives Accelerator
• Tools: Integrated Performance Monitoring Tool (IPM), benchmarks, etc.
[Chart: 20%-70% performance improvement across CFD, CAE, and Bio applications]
© 2014 Mellanox Technologies 21
Mellanox ScalableHPC Accelerates Parallel Applications
InfiniBand Verbs API
• Reliable Messaging Optimized for Mellanox HCA
• Hybrid Transport Mechanism
• Efficient Memory Registration
• Receive Side Tag Matching
• Topology Aware Collective Optimization
• Hardware Multicast
• Separate Virtual Fabric for Collectives
• CORE-Direct Hardware Offload
[Diagram: programming-model memory layouts – MPI (private memory per process P1/P2/P3), SHMEM (per-process memory plus a logical shared memory), and PGAS (logical shared memory partitioned across processes)]
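A small OpenSHMEM sketch of the one-sided SHMEM model shown above, assuming an OpenSHMEM 1.2-style API such as the one ScalableSHMEM implements: each PE allocates the same symmetric buffer, and PE 0 writes into PE 1's copy with a put, without a matching receive.

```c
/* One-sided SHMEM model: symmetric allocation on every PE, remote write
 * with a put, synchronization with a barrier. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric (logically shared) allocation, same object on every PE. */
    long *data = shmem_malloc(8 * sizeof(long));
    for (int i = 0; i < 8; i++) data[i] = me;

    if (npes > 1 && me == 0) {
        long src[8] = {42, 42, 42, 42, 42, 42, 42, 42};
        shmem_long_put(data, src, 8, 1);   /* one-sided write into PE 1 */
    }
    shmem_barrier_all();

    if (me == 1) printf("PE 1 sees data[0] = %ld\n", data[0]);

    shmem_free(data);
    shmem_finalize();
    return 0;
}
```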
© 2014 Mellanox Technologies 22
Mellanox FCA Collective Scalability
Increases CPU efficiency for increased scalability
Supports blocking and nonblocking collectives (as sketched below)
Supports hierarchical communication algorithms (HCOLL)
Supports multiple optimizations within a single collective algorithm
Minimizes the negative effects of system noise and jitter
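For reference, the MPI-3 nonblocking-collective pattern that FCA can accelerate, sketched with MPI_Iallreduce: the collective is started, independent work can overlap, and the result is waited on only when needed.

```c
/* MPI-3 nonblocking collective: start the reduction, overlap computation,
 * then wait for completion when the result is actually required. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, global = 0.0;
    MPI_Request req;

    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... overlap: computation that does not depend on 'global' ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0) printf("sum of ranks = %g\n", global);

    MPI_Finalize();
    return 0;
}
```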
© 2014 Mellanox Technologies 23
Summary
© 2014 Mellanox Technologies 24
End-to-end Interconnect Solutions to Deliver Highest ROI for all Applications
Maximum Application Scalability and Performance
Increased Scalability with FCA
Dominant in Storage Interconnects
World’s first 100Gb/s interconnect adapter
Innovative scalable transport – Dynamically Connected Transport
Thank You