TRANSCRIPT

Page 1: Interconnect Your Future

Scot Schultz, Director, HPC and Technical Marketing

HPC Advisory Council, European Conference, June 2014

Interconnect Your Future

Page 2: Interconnect Your Future


Leading Supplier of End-to-End Interconnect Solutions

[Diagram: Virtual Protocol Interconnect spanning server/compute, switch/gateway, storage front/back-end and Metro/WAN, carrying 56G InfiniBand & FCoIB alongside 10/40/56GbE & FCoE; product portfolio of ICs, adapter cards, switches/gateways, host/fabric software and cables/modules.]

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

Page 3: Interconnect Your Future


The Speed of Data Mandates the Use of Data

[Roadmap timeline, 2000-2020: 10Gb/s, 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s and 200Gb/s.]

Keeping You One Generation Ahead

Enabling the Use of Data

Mellanox Roadmap of Data Speed

Page 4: Interconnect Your Future


Architectural Foundation for Exascale Computing

Connect-IB

Page 5: Interconnect Your Future


Mellanox Connect-IB: The World’s Fastest Adapter

The 7th generation of Mellanox interconnect adapters

World’s first 100Gb/s interconnect adapter (dual-port FDR 56Gb/s InfiniBand)

Delivers 137 million messages per second – 4X higher than competition

Supports the new innovative InfiniBand scalable transport – Dynamically Connected Transport

Page 6: Interconnect Your Future


Connect-IB Provides Highest Interconnect Throughput

Source: Prof. DK Panda

[Charts: Unidirectional and Bidirectional Bandwidth (MBytes/sec) vs. Message Size (bytes) for ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR; higher is better. Peak unidirectional bandwidth: 3385, 6343, 12485 and 12810 MBytes/sec; peak bidirectional bandwidth: 6521, 11643, 21025 and 24727 MBytes/sec.]

Gain Your Performance Leadership With Connect-IB Adapters

Page 7: Interconnect Your Future


Connect-IB Delivers Highest Application Performance

200% Higher Performance Versus Competition, with Only 32 Nodes

Performance Gap Increases with Cluster Size

Page 8: Interconnect Your Future


Mellanox PeerDirect™

Page 9: Interconnect Your Future


What is PeerDirect™

PeerDirect is natively supported in Mellanox OFED 2.1 and later distributions

Supports peer-to-peer communications between Mellanox adapters and third-party devices

Eliminates unnecessary system memory copies and CPU overhead

• No longer needs a host buffer for each device

• No longer needs a shared host buffer either

Provides native support for the Intel Xeon Phi MPSS communication stack

Supports NVIDIA® GPUDirect RDMA with a separate plug-in, available at Mellanox.com

Support for RoCE protocol over Mellanox VPI

Supported with all Mellanox ConnectX-3 and Connect-IB Adapters

[Diagram: two systems, each with CPU, chipset and a third-party vendor device, exchanging data directly across the interconnect.]

Page 10: Interconnect Your Future


Mellanox PeerDirect™ Architecture

A generic infrastructure to allow PCIe peer devices to communicate directly

Enables direct RDMA transactions between PCIe device and Mellanox interconnect
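To make this concrete at the API level, the sketch below registers a buffer that lives in a peer device's memory with the HCA using the standard verbs call ibv_reg_mr; once registered, the region can serve as an RDMA source or target like any host buffer. This is a minimal illustration, not Mellanox's implementation: it uses a CUDA allocation as the example peer device, assumes the GPUDirect RDMA kernel plug-in is loaded, and omits queue-pair setup and most error handling.

```c
#include <stdio.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Open the first available HCA (e.g. a ConnectX-3 or Connect-IB port). */
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no InfiniBand devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) return 1;

    /* Allocate a buffer in GPU memory, the "peer" PCIe device in this sketch. */
    void *gpu_buf = NULL;
    size_t len = 1 << 20;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) return 1;

    /* With the GPUDirect RDMA kernel plug-in loaded, the GPU pointer can be
     * registered with the HCA just like a host pointer; the resulting memory
     * region can then be the source or target of RDMA operations without
     * staging through host memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr on GPU memory failed\n");
        return 1;
    }
    printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n",
           (unsigned)mr->lkey, (unsigned)mr->rkey);

    /* Queue-pair setup and the actual RDMA post are omitted from this sketch. */
    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```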

Page 11: Interconnect Your Future


Native support for peer-to-peer communications between Mellanox HCA adapters and NVIDIA GPU devices

GPUDirect RDMA

Page 12: Interconnect Your Future


Evolution of GPUDirect RDMA

Before GPUDirect

Network and third-party device drivers did not share buffers and needed to make a redundant copy in host memory.

With GPUDirect 1.0 (Shared Host Memory Pages)

The network and GPU can share “pinned” (page-locked) buffers, eliminating the need to make a redundant copy in host memory.

Pre-GPUDirect

GPUDirect Shared Host Memory Pages Model
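To illustrate the shared pinned-buffer model, here is a minimal sketch (buffer names, sizes and ranks are illustrative): the GPU data is copied once into a page-locked host buffer allocated with cudaMallocHost, and that same buffer is handed to MPI, so the extra host-to-host staging copy of the pre-GPUDirect era disappears.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 1 << 20;
    float *d_buf, *h_pinned;
    cudaMalloc((void **)&d_buf, count * sizeof(float));
    /* Page-locked ("pinned") host buffer shared by CUDA and the network stack. */
    cudaMallocHost((void **)&h_pinned, count * sizeof(float));

    if (rank == 0) {
        /* One copy GPU -> pinned host; the network then reads the same buffer. */
        cudaMemcpy(h_pinned, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(h_pinned, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_pinned, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_pinned, count * sizeof(float), cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```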

Page 13: Interconnect Your Future


Eliminates CPU bandwidth and latency bottlenecks

Uses remote direct memory access (RDMA) transfers between GPUs

Resulting in significantly improved MPI Send-Recv efficiency between GPUs in remote nodes

GPUDirect™ RDMA

With GPUDirect™ RDMA

Using PeerDirect™
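With a CUDA-aware MPI library built for GPUDirect RDMA (MVAPICH2-GDR is one example), the application simply passes device pointers to MPI; the HCA then reads and writes GPU memory directly. A minimal sketch under that assumption, with arbitrary buffer size and tag:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, count * sizeof(float));

    /* The device pointer is handed directly to MPI; with GPUDirect RDMA the
     * HCA moves the data GPU-to-GPU across the fabric with no host copy and
     * no CPU involvement in the data path. */
    if (rank == 0)
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

If the MPI library is not CUDA-aware, the same transfer has to be staged through host memory as in the previous sketch.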

Page 14: Interconnect Your Future


Setup Considerations for GPUDirect RDMA

[Diagram: GPU and InfiniBand adapter attached to the same CPU through a common chipset or PCIe switch.]

Note: For GPUDirect RDMA to work properly, the NVIDIA GPU and the Mellanox InfiniBand adapter must share the same PCIe root complex.

Page 15: Interconnect Your Future


Performance of MVAPICH2 with GPUDirect RDMA

[Charts: GPU-GPU internode MPI latency (usec) and GPU-GPU internode MPI bandwidth (MB/s) vs. message size, 1 byte to 4 KB. Lower latency is better; higher bandwidth is better.]

67% Lower Latency (5.49 usec)

5X Increase in Throughput

Source: Prof. DK Panda
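Latency numbers like the 5.49 usec above are typically produced with an osu_latency-style ping-pong. The sketch below shows the measurement idea with host buffers; under a CUDA-aware MPI the same loop can be run on device buffers to measure GPU-GPU internode latency. Message size and iteration count are illustrative.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 4096;        /* message size in bytes */
    const int iters = 10000;
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* One-way latency = round-trip time / 2, averaged over the iterations. */
    if (rank == 0)
        printf("%d bytes: %.2f usec one-way\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```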

Page 16: Interconnect Your Future


Mellanox PeerDirect™ with NVIDIA GPUDirect RDMA

HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs

GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand

• Unlocks the performance between GPU and InfiniBand

• Provides a significant decrease in GPU-GPU communication latency

• Provides complete CPU offload for all GPU communication across the network

Demonstrated up to 102% performance improvement with a large number of particles (gains of 21% to 102% in the cases shown)

Page 17: Interconnect Your Future


HPC-X

Page 18: Interconnect Your Future


Mellanox Optimized Software Stack Components

Interconnect hardware: ConnectX-3 / ConnectX-3 Pro (10/40/56 Gb/s Ethernet with support for RoCE) and ConnectX-3 / Connect-IB (InfiniBand®)

Platforms: x86, Power8, ARM running a Linux or Windows operating system

Mellanox OFED®: PeerDirect™, Core-Direct™, GPUDirect® RDMA

Mellanox HPC-X™: MPI, SHMEM, UPC, MXM, FCA

Communication libraries (MPI, SHMEM, UPC) and applications

Achieve maximum application scalability and performance; future-proofed software compatibility: SDR-DDR-QDR-FDR-EDR

Page 19: Interconnect Your Future


Mellanox HPC-X™ Components and Accelerators

Mellanox ScalableMPI – Message Passing Interface based on Open MPI

Mellanox ScalableSHMEM – One-sided communications library

Mellanox ScalableUPC – Berkeley UPC parallel programming language library

Mellanox MXM – Messaging Accelerator optimized for the underlying hardware

Mellanox FCA – Fabric Collectives Accelerator supporting MPI-3

Mellanox OFED – Unlocks additional capabilities of the interconnect architecture

Profiling tools, benchmarking tools, and more

Page 20: Interconnect Your Future


HPC-X™

Complete MPI/OpenSHMEM/PGAS/UPC package for HPC environments

• Fully optimized for Mellanox InfiniBand and VPI interconnect solutions

• Supports 3rd party solutions

Components

• Communication libraries: ScalableMPI, ScalableSHMEM, ScalableUPC

• Acceleration libraries: MXM – Messaging Accelerator, FCA – Fabric Collectives Accelerator

• Tools: Integrated Performance Monitoring Tool (IPM), benchmarks etc.

[Chart: 20%-70% performance improvement across CFD, CAE and Bio applications.]

Page 21: Interconnect Your Future


Mellanox ScalableHPC Accelerates Parallel Applications

InfiniBand Verbs API

• Reliable Messaging Optimized for Mellanox HCA

• Hybrid Transport Mechanism

• Efficient Memory Registration

• Receive Side Tag Matching

• Topology Aware Collective Optimization

• Hardware Multicast

• Separate Virtual Fabric for Collectives

• CORE-Direct Hardware Offload

[Diagram: programming-model memory layouts. MPI: private memory per process (P1, P2, P3); SHMEM: private memory per process plus logically shared memory; PGAS: partitioned, logically shared memory across processes.]
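For the SHMEM model in particular, the sketch below shows a one-sided put into a symmetric variable, the kind of operation ScalableSHMEM accelerates over the fabric. It follows the portable OpenSHMEM API (older SHMEM libraries spell the setup calls start_pes/my_pe instead of shmem_init/shmem_my_pe); the variable names are illustrative.

```c
#include <stdio.h>
#include <shmem.h>

/* Symmetric variable: exists at the same address on every PE. */
static long target = 0;

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Each PE writes its rank into 'target' on the next PE, one-sided:
     * the remote PE's CPU is not involved in completing the transfer. */
    long value = me;
    shmem_long_put(&target, &value, 1, (me + 1) % npes);

    shmem_barrier_all();
    printf("PE %d of %d received %ld\n", me, npes, target);

    shmem_finalize();
    return 0;
}
```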

Page 22: Interconnect Your Future


Mellanox FCA Collective Scalability

Increases CPU efficiency for increased scalability

Support for blocking and nonblocking collectives (see the sketch below)

Supports hierarchical communication algorithms (HCOLL)

Supports multiple optimizations within a single collective algorithm

Minimizes the negative effect of system noise and jitter
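The nonblocking collectives mentioned above are the MPI-3 hook that lets a hardware-offloaded collective overlap with computation. A minimal, implementation-independent sketch of the pattern:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, global = 0.0;
    MPI_Request req;

    /* Start the reduction; with hardware collective offload (CORE-Direct /
     * FCA) the fabric can progress it while the CPU keeps computing. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... overlap: do useful local work here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("sum of ranks = %.0f\n", global);

    MPI_Finalize();
    return 0;
}
```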

Page 23: Interconnect Your Future


Summary

Page 24: Interconnect Your Future


End-to-end Interconnect Solutions to Deliver Highest ROI for all Applications

Maximum Application Scalability and Performance

Increased Scalability with FCA

Dominant in Storage Interconnects

World’s first 100Gb/s interconnect adapter

Innovative scalable transport – Dynamically Connected Transport

Page 25: Interconnect Your Future

Thank You