TRANSCRIPT

Page 1: Interconnect Your Future

Scot Schultz, Director, HPC and Technical Marketing

HPC Advisory Council, European Conference, June 2014

Interconnect Your Future

Page 2: Interconnect Your Future


Leading Supplier of End-to-End Interconnect Solutions

[Diagram: Virtual Protocol Interconnect spanning server/compute, switch/gateway, storage front/back-end and Metro/WAN, carrying 56G InfiniBand & FCoIB alongside 10/40/56GbE & FCoE; product portfolio of ICs, adapter cards, switches/gateways, host/fabric software and cables/modules.]

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

Page 3: Interconnect Your Future


The Speed of Data Mandates the Use of Data

[Roadmap timeline, 2000-2020: 10Gb/s, 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s and 200Gb/s.]

Keeping You One Generation Ahead

Enabling the Use of Data

Mellanox Roadmap of Data Speed

Page 4: Interconnect Your Future


Architectural Foundation for Exascale Computing

Connect-IB

Page 5: Interconnect Your Future


Mellanox Connect-IB: The World’s Fastest Adapter

The 7th generation of Mellanox interconnect adapters

World’s first 100Gb/s interconnect adapter (dual-port FDR 56Gb/s InfiniBand)

Delivers 137 million messages per second – 4X higher than competition

Supports the new innovative InfiniBand scalable transport – Dynamically Connected Transport

Page 6: Interconnect Your Future


Connect-IB Provides Highest Interconnect Throughput

Source: Prof. DK Panda

[Charts: Unidirectional and Bidirectional Bandwidth (MBytes/sec) vs. Message Size (bytes) for ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR; higher is better. Peak unidirectional bandwidth: 3385, 6343, 12485 and 12810 MBytes/sec; peak bidirectional bandwidth: 6521, 11643, 21025 and 24727 MBytes/sec.]

Gain Your Performance Leadership With Connect-IB Adapters

Page 7: Interconnect Your Future


Connect-IB Delivers Highest Application Performance

200% Higher Performance Versus Competition, with Only 32 Nodes

Performance Gap Increases with Cluster Size

Page 8: Interconnect Your Future


Mellanox PeerDirect™

Page 9: Interconnect Your Future


What is PeerDirect™

PeerDirect is natively supported in Mellanox OFED 2.1 and later distributions

Supports peer-to-peer communications between Mellanox adapters and third-party devices

Eliminates unnecessary system memory copies and CPU overhead

• No longer needs a host buffer for each device

• No longer needs a shared host buffer either

Provides native support for the Intel Xeon Phi MPSS communication stack

Supports NVIDIA® GPUDirect RDMA with a separate plug-in, available at Mellanox.com

Support for RoCE protocol over Mellanox VPI

Supported with all Mellanox ConnectX-3 and Connect-IB Adapters

[Diagram: two systems, each with CPU, chipset and a third-party vendor device, exchanging data directly across the interconnect.]

Page 10: Interconnect Your Future


Mellanox PeerDirect™ Architecture

A generic infrastructure to allow PCIe peer devices to communicate directly

Enables direct RDMA transactions between PCIe device and Mellanox interconnect
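To make this concrete at the API level, the sketch below registers a buffer that lives in a peer device's memory with the HCA using the standard verbs call ibv_reg_mr; once registered, the region can serve as an RDMA source or target like any host buffer. This is a minimal illustration, not Mellanox's implementation: it uses a CUDA allocation as the example peer device, assumes the GPUDirect RDMA kernel plug-in is loaded, and omits queue-pair setup and most error handling.

```c
#include <stdio.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Open the first available HCA (e.g. a ConnectX-3 or Connect-IB port). */
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no InfiniBand devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) return 1;

    /* Allocate a buffer in GPU memory, the "peer" PCIe device in this sketch. */
    void *gpu_buf = NULL;
    size_t len = 1 << 20;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) return 1;

    /* With the GPUDirect RDMA kernel plug-in loaded, the GPU pointer can be
     * registered with the HCA just like a host pointer; the resulting memory
     * region can then be the source or target of RDMA operations without
     * staging through host memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr on GPU memory failed\n");
        return 1;
    }
    printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n",
           (unsigned)mr->lkey, (unsigned)mr->rkey);

    /* Queue-pair setup and the actual RDMA post are omitted from this sketch. */
    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```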

Page 11: Interconnect Your Future


Native support for peer-to-peer communications between Mellanox HCA adapters and NVIDIA GPU devices

GPUDirect RDMA

Page 12: Interconnect Your Future


Evolution of GPUDirect RDMA

Before GPUDirect

Network and third-party device drivers did not share buffers and needed to make a redundant copy in host memory.

With GPUDirect 1.0 (Shared Host Memory Pages)

The network and GPU can share “pinned” (page-locked) buffers, eliminating the need to make a redundant copy in host memory.

Pre-GPUDirect

GPUDirect Shared Host Memory Pages Model
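To illustrate the shared pinned-buffer model, here is a minimal sketch (buffer names, sizes and ranks are illustrative): the GPU data is copied once into a page-locked host buffer allocated with cudaMallocHost, and that same buffer is handed to MPI, so the extra host-to-host staging copy of the pre-GPUDirect era disappears.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 1 << 20;
    float *d_buf, *h_pinned;
    cudaMalloc((void **)&d_buf, count * sizeof(float));
    /* Page-locked ("pinned") host buffer shared by CUDA and the network stack. */
    cudaMallocHost((void **)&h_pinned, count * sizeof(float));

    if (rank == 0) {
        /* One copy GPU -> pinned host; the network then reads the same buffer. */
        cudaMemcpy(h_pinned, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(h_pinned, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_pinned, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_pinned, count * sizeof(float), cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```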

Page 13: Interconnect Your Future


Eliminates CPU bandwidth and latency bottlenecks

Uses remote direct memory access (RDMA) transfers between GPUs

Resulting in significantly improved MPI Send-Recv efficiency between GPUs in remote nodes

GPUDirect™ RDMA

With GPUDirect™ RDMA

Using PeerDirect™
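With a CUDA-aware MPI library built for GPUDirect RDMA (MVAPICH2-GDR is one example), the application simply passes device pointers to MPI; the HCA then reads and writes GPU memory directly. A minimal sketch under that assumption, with arbitrary buffer size and tag:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, count * sizeof(float));

    /* The device pointer is handed directly to MPI; with GPUDirect RDMA the
     * HCA moves the data GPU-to-GPU across the fabric with no host copy and
     * no CPU involvement in the data path. */
    if (rank == 0)
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

If the MPI library is not CUDA-aware, the same transfer has to be staged through host memory as in the previous sketch.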

Page 14: Interconnect Your Future


Setup Considerations for GPUDirect RDMA

[Diagram: GPU and InfiniBand adapter attached to the same CPU through a common chipset or PCIe switch.]

Note: For GPUDirect RDMA to work properly, the NVIDIA GPU and the Mellanox InfiniBand adapter must share the same PCIe root complex.

Page 15: Interconnect Your Future


Performance of MVAPICH2 with GPUDirect RDMA

[Charts: GPU-GPU internode MPI latency (usec) and GPU-GPU internode MPI bandwidth (MB/s) vs. message size, 1 byte to 4 KB. Lower latency is better; higher bandwidth is better.]

67% Lower Latency (5.49 usec)

5X Increase in Throughput

Source: Prof. DK Panda
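Latency numbers like the 5.49 usec above are typically produced with an osu_latency-style ping-pong. The sketch below shows the measurement idea with host buffers; under a CUDA-aware MPI the same loop can be run on device buffers to measure GPU-GPU internode latency. Message size and iteration count are illustrative.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 4096;        /* message size in bytes */
    const int iters = 10000;
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* One-way latency = round-trip time / 2, averaged over the iterations. */
    if (rank == 0)
        printf("%d bytes: %.2f usec one-way\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```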

Page 16: Interconnect Your Future


Mellanox PeerDirect™ with NVIDIA GPUDirect RDMA

HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs

GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand

• Unlocks the performance between GPU and InfiniBand

• Provides a significant decrease in GPU-GPU communication latency

• Provides complete CPU offload for all GPU communication across the network

Demonstrated up to 102% performance improvement with a large number of particles (gains of 21% to 102% in the cases shown)

Page 17: Interconnect Your Future


HPC-X

Page 18: Interconnect Your Future


Mellanox Optimized Software Stack Components

Interconnect hardware: ConnectX-3 / ConnectX-3 Pro (10/40/56 Gb/s Ethernet with support for RoCE) and ConnectX-3 / Connect-IB (InfiniBand®)

Platforms: x86, Power8, ARM running a Linux or Windows operating system

Mellanox OFED®: PeerDirect™, Core-Direct™, GPUDirect® RDMA

Mellanox HPC-X™: MPI, SHMEM, UPC, MXM, FCA

Communication libraries (MPI, SHMEM, UPC) and applications

Achieve maximum application scalability and performance; future-proofed software compatibility: SDR-DDR-QDR-FDR-EDR

Page 19: Interconnect Your Future


Mellanox HPC-X™ Components and Accelerators

Mellanox ScalableMPI – Message Passing Interface based on Open MPI

Mellanox ScalableSHMEM – One-sided communications library

Mellanox ScalableUPC – Berkeley UPC parallel programming language library

Mellanox MXM – Messaging Accelerator optimized for the underlying hardware

Mellanox FCA – Fabric Collectives Accelerator supporting MPI-3

Mellanox OFED – Unlocks additional capabilities of the interconnect architecture

Profiling tools, benchmarking tools, and more

Page 20: Interconnect Your Future


HPC-X™

Complete MPI/OpenSHMEM/PGAS/UPC package for HPC environments

• Fully optimized for Mellanox InfiniBand and VPI interconnect solutions

• Supports 3rd party solutions

Components

• Communication libraries: ScalableMPI, ScalableSHMEM, ScalableUPC

• Acceleration libraries: MXM – Messaging Accelerator, FCA – Fabric Collectives Accelerator

• Tools: Integrated Performance Monitoring Tool (IPM), benchmarks etc.

[Chart: 20%-70% performance improvement across CFD, CAE and Bio applications.]

Page 21: Interconnect Your Future


Mellanox ScalableHPC Accelerates Parallel Applications

InfiniBand Verbs API

• Reliable Messaging Optimized for Mellanox HCA

• Hybrid Transport Mechanism

• Efficient Memory Registration

• Receive Side Tag Matching

• Topology Aware Collective Optimization

• Hardware Multicast

• Separate Virtual Fabric for Collectives

• CORE-Direct Hardware Offload

[Diagram: programming-model memory layouts. MPI: private memory per process (P1, P2, P3); SHMEM: private memory per process plus logically shared memory; PGAS: partitioned, logically shared memory across processes.]
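For the SHMEM model in particular, the sketch below shows a one-sided put into a symmetric variable, the kind of operation ScalableSHMEM accelerates over the fabric. It follows the portable OpenSHMEM API (older SHMEM libraries spell the setup calls start_pes/my_pe instead of shmem_init/shmem_my_pe); the variable names are illustrative.

```c
#include <stdio.h>
#include <shmem.h>

/* Symmetric variable: exists at the same address on every PE. */
static long target = 0;

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Each PE writes its rank into 'target' on the next PE, one-sided:
     * the remote PE's CPU is not involved in completing the transfer. */
    long value = me;
    shmem_long_put(&target, &value, 1, (me + 1) % npes);

    shmem_barrier_all();
    printf("PE %d of %d received %ld\n", me, npes, target);

    shmem_finalize();
    return 0;
}
```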

Page 22: Interconnect Your Future


Mellanox FCA Collective Scalability

Increases CPU efficiency for increased scalability

Support for blocking and nonblocking collectives (see the sketch below)

Supports hierarchical communication algorithms (HCOLL)

Supports multiple optimizations within a single collective algorithm

Minimizes the negative effect of system noise and jitter
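The nonblocking collectives mentioned above are the MPI-3 hook that lets a hardware-offloaded collective overlap with computation. A minimal, implementation-independent sketch of the pattern:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, global = 0.0;
    MPI_Request req;

    /* Start the reduction; with hardware collective offload (CORE-Direct /
     * FCA) the fabric can progress it while the CPU keeps computing. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... overlap: do useful local work here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("sum of ranks = %.0f\n", global);

    MPI_Finalize();
    return 0;
}
```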

Page 23: Interconnect Your Future


Summary

Page 24: Interconnect Your Future


End-to-end Interconnect Solutions to Deliver Highest ROI for all Applications

Maximum Application Scalability and Performance

Increased Scalability with FCA

Dominant in Storage Interconnects

World’s first 100Gb/s interconnect adapter

Innovative scalable transport – Dynamically Connected Transport

Page 25: Interconnect Your Future

Thank You