
Page 1: Interconnect Your Future

Paving the Road to Exascale

February 2016

Page 2: The Ever Growing Demand for Higher Performance

[Figure: Performance Development timeline, 2000-2020 – Terascale to Petascale to Exascale; "Roadrunner" marked as 1st; transitions from SMP to Clusters and from Single-Core to Many-Core]

[Diagram: Co-Design spanning Hardware (HW), Software (SW) and Application (APP)]

The Interconnect is the Enabling Technology

Page 3: Co-Design Architecture to Enable Exascale Performance

CPU-Centric: Limited to Main CPU Usage – Results in Performance Limitation

Co-Design: Creating Synergies – Enables Higher Performance and Scale

Software, In-CPU Computing, In-Network Computing, In-Storage Computing

Page 4: The Intelligence is Moving to the Interconnect

[Diagram: Past – intelligence in the CPU; Future – intelligence in the Interconnect]

Page 5: Intelligent Interconnect Delivers Higher Datacenter ROI

[Diagram: running network functions on the CPU leaves less computing for users and applications; an intelligent network offloads those functions, freeing the CPU for application computing and increasing datacenter value]

Page 6: Breaking the Application Latency Wall

Today: Network device latencies are on the order of 100 nanoseconds

Challenge: Enabling the next order of magnitude improvement in application performance

Solution: Creating synergies between software and hardware – intelligent interconnect

Intelligent Interconnect Paves the Road to Exascale Performance

Latency per operation (network vs. communication framework):

• 10 years ago: Network ~10 microseconds; Communication framework ~100 microseconds

• Today: Network ~0.1 microsecond; Communication framework ~10 microseconds

• Future (Co-Design): Network ~0.05 microsecond; Communication framework ~1 microsecond
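Summing the two components (a rough reading of the latency figures above, not numbers stated explicitly on the slide) shows the end-to-end communication overhead shrinking by roughly an order of magnitude at each step, which is the improvement the co-design approach is meant to continue:

$$
\begin{aligned}
\text{10 years ago: } &\sim 10\,\mu\text{s} + \sim 100\,\mu\text{s} \approx 110\,\mu\text{s} \\
\text{Today: } &\sim 0.1\,\mu\text{s} + \sim 10\,\mu\text{s} \approx 10\,\mu\text{s} \\
\text{Future: } &\sim 0.05\,\mu\text{s} + \sim 1\,\mu\text{s} \approx 1\,\mu\text{s}
\end{aligned}
$$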

Page 7: Introducing Switch-IB 2 – World's First Smart Switch

Page 8: Introducing Switch-IB 2 – World's First Smart Switch

The world's fastest switch, with <90 nanosecond latency

36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec

Adaptive routing, congestion control, support for multiple topologies

Built for Scalable Compute and Storage Infrastructures

10X Higher Performance with the New Switch SHArP Technology

Page 9: SHArP (Scalable Hierarchical Aggregation Protocol) Technology

Delivering 10X Performance Improvement for MPI and SHMEM/PGAS Communications

Switch-IB 2 Enables the Switch Network to Operate as a Co-Processor

SHArP Enables Switch-IB 2 to Manage and Execute MPI Operations in the Network
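To make concrete which operations this refers to, here is a minimal MPI program (illustrative, not from the slides): the MPI_Allreduce call below is exactly the kind of collective that a SHArP-capable fabric and MPI library can aggregate inside the switches instead of on the host CPUs.

```c
/* Minimal AllReduce example. With a SHArP-enabled fabric and MPI
 * library, the reduction below can be executed in the switch network
 * rather than on the compute-node CPUs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   /* each rank contributes one value */
    double global = 0.0;

    /* The collective that in-network aggregation accelerates */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum across %d ranks = %.0f\n", size, global);

    MPI_Finalize();
    return 0;
}
```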

Page 10: SHArP Performance Advantage

MiniFE is a finite element mini-application

• Implements kernels that represent implicit finite-element applications

10X to 25X Performance Improvement for the MPI AllReduce Collective

Page 11: Multi-Host Socket Direct™ – Low Latency Socket Communication

Each CPU with direct network access

QPI avoidance for I/O – improves performance

Enables GPU / peer direct on both sockets

Solution is transparent to software

[Diagram: dual-socket system – each CPU connected directly to the network adapter, avoiding the QPI link]

Multi-Host Socket Direct Performance

50% Lower CPU Utilization

20% Lower Latency

Multi-Host Evaluation Kit

Lower Application Latency, Free Up the CPU

Page 12: Introducing ConnectX-4 Lx Programmable Adapter

Scalable, Efficient, High-Performance and Flexible Solution

Security

Cloud/Virtualization

Storage

High Performance Computing

Precision Time Synchronization

Networking + FPGA

Mellanox Acceleration Engines and FPGA Programmability on One Adapter

Page 13: Mellanox InfiniBand – Proven and Most Scalable HPC Interconnect

“Summit” System “Sierra” System

Paving the Road to Exascale

Page 14: System Example – NASA Ames Research Center Pleiades

20K InfiniBand nodes

Mellanox end-to-end scalable EDR, FDR and QDR InfiniBand

Supports a variety of scientific and engineering projects

• Coupled atmosphere-ocean models

• Future space vehicle design

• Large-scale dark matter halos and galaxy evolution

Leveraging InfiniBand backward and future compatibility

High-Resolution Climate Simulations

Page 15: High-Performance Designed 100Gb/s Interconnect Solutions

InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency; throughput of 7.2Tb/s; 7.02 billion msg/sec (195M msg/sec/port)

Ethernet switch: 32 100GbE ports, 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE); throughput of 6.4Tb/s

Adapters: 100Gb/s, 0.7us latency; 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)

Cables and transceivers: active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s); VCSELs, silicon photonics and copper

Page 16: Leading Supplier of End-to-End Interconnect Solutions

ICs, Adapter Cards, Switches/Gateways, Metro / WAN, Cables/Modules, Software and Services

Store, Analyze – Enabling the Use of Data

At the Speeds of 10, 25, 40, 50, 56 and 100 Gigabit per Second

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

Page 17: The Performance Power of EDR 100G InfiniBand

28%-63% Increase in Overall System Performance

Page 18: End-to-End Interconnect Solutions for All Platforms

Highest Performance and Scalability for X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms

10, 20, 25, 40, 50, 56 and 100Gb/s Speeds

X86, OpenPOWER, GPU, ARM, FPGA

Smart Interconnect to Unleash The Power of All Compute Architectures

Page 19: Technology Roadmap – One-Generation Lead over the Competition

[Roadmap timeline, 2000-2020: 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s (2015), 200Gb/s; Terascale to Petascale to Exascale. Milestones marked as Mellanox Connected: TOP500 2003 Virginia Tech (Apple) ranked 3rd; "Roadrunner" ranked 1st]

Page 20: Maximize Performance via Accelerator and GPU Offloads

GPUDirect RDMA Technology

Page 21: GPUs are Everywhere!

GPUDirect RDMA / Sync

[Diagram: GPU, GPU memory, chipset, CPU and system memory – the GPUDirect RDMA / Sync data path]

Page 22: GPUDirect™ RDMA (GPUDirect 3.0)

Eliminates CPU bandwidth and latency bottlenecks

Uses remote direct memory access (RDMA) transfers between GPUs

Resulting in significantly improved MPI efficiency between GPUs in remote nodes

Based on PCIe PeerDirect technology

[Diagram: GPU-to-GPU data path with GPUDirect™ RDMA, using PeerDirect™]
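Below is a minimal sketch (not from the slides) of how an application exercises this path, assuming a CUDA-aware MPI library built with GPUDirect RDMA support: the send and receive buffers are GPU device pointers, and the HCA reads and writes GPU memory directly instead of staging through host memory. The buffer size and message tag are arbitrary illustrative values.

```c
/* Illustrative CUDA-aware MPI sketch: assumes an MPI library that
 * accepts GPU device pointers (GPUDirect RDMA support). Run with at
 * least 2 ranks. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaSetDevice(0);                          /* one GPU per rank assumed */

    const int n = 1 << 20;                     /* 1M floats, illustrative */
    float *d_buf;
    cudaMalloc((void **)&d_buf, (size_t)n * sizeof(float));
    cudaMemset(d_buf, 0, (size_t)n * sizeof(float));

    if (size >= 2) {
        if (rank == 0) {
            /* HCA can read the GPU buffer directly - no cudaMemcpy to host */
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into GPU memory\n", n);
        }
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```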

Page 24: Mellanox GPUDirect RDMA Performance Advantage

HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs

GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand

• Unlocks performance between GPU and InfiniBand

• Provides a significant decrease in GPU-GPU communication latency

• Provides complete CPU offload from all GPU communications across the network

2X Application Performance! (102% improvement)

Page 25: GPUDirect Sync (GPUDirect 4.0)

GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect

• Control path still uses the CPU

- CPU prepares and queues communication tasks on GPU

- GPU triggers communication on HCA

- Mellanox HCA directly accesses GPU memory

GPUDirect Sync (GPUDirect 4.0)

• Both data path and control path go directly between the GPU and the Mellanox interconnect

[Chart: 2D stencil benchmark – average time per iteration (us) vs. number of nodes/GPUs (2 and 4); RDMA+PeerSync is 27% faster at 2 nodes/GPUs and 23% faster at 4 nodes/GPUs than RDMA only]

Maximum Performance for GPU Clusters

Page 26: Remote GPU Access through rCUDA

rCUDA provides remote access from every node to any GPU in the system

[Diagram: Client side – CUDA Application → rCUDA library → Network Interface; Server side – Network Interface → rCUDA daemon → CUDA Driver + runtime → GPUs. Client nodes see virtual GPUs (vGPUs) backed by a pool of GPU servers – GPU as a Service]
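As a sketch of what this means for the application (not from the slides): the CUDA program itself stays unmodified, and the rCUDA client library intercepts CUDA runtime calls and forwards them over the network to a GPU server. The code below is an ordinary CUDA runtime sequence that such a setup could serve remotely; the server hostname in the comment is hypothetical, and the environment variable names follow rCUDA's documented client configuration but should be treated as illustrative.

```c
/* Plain CUDA runtime program - nothing rCUDA-specific in the code.
 * With rCUDA, the client-side CUDA runtime replacement forwards these
 * calls over the network to a GPU server, configured e.g. as
 * (illustrative, hypothetical hostname):
 *   export RCUDA_DEVICE_COUNT=1
 *   export RCUDA_DEVICE_0=gpuserver01:0
 */
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);     /* under rCUDA: number of remote GPUs */
    printf("visible GPUs: %d\n", ndev);

    const size_t n = 1024;
    float host[1024], back[1024];
    for (size_t i = 0; i < n; i++) host[i] = (float)i;

    float *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));            /* remote alloc */
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(back, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("round trip ok: %d\n", memcmp(host, back, sizeof(host)) == 0);
    cudaFree(dev);
    return 0;
}
```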

Page 27: All HPC Clouds to Use Mellanox Solutions

HPC Clouds

Page 28: Virtualization for HPC: Mellanox SRIOV

Page 29: HPC Clouds – Performance Demands Mellanox Solutions

San Diego Supercomputer Center “Comet” System (2015) to Leverage Mellanox InfiniBand to Build HPC Cloud

Page 30: Thank You