
The Future of Interconnect

Dror Goldenberg, VP Software Architecture

HPC Advisory Council, Switzerland, 2014

© 2014 Mellanox Technologies

HPC: The Challenges


Exascale-Class Computer Platforms – Communication Challenges

Challenge → Solution focus

Very large functional unit count (~10M) → Scalable communication capabilities: point-to-point & collectives

Large on-“node” functional unit count (~500) → Scalable HCA architecture

Deeper memory hierarchies → Cache-aware network access

Smaller amounts of memory per functional unit → Low-latency, high-bandwidth capabilities

May have functional unit heterogeneity → Support for data heterogeneity

Component failures part of “normal” operation → Resilient and redundant stack

Data movement is expensive → Optimize data movement


Challenges of Scalable HPC Interfaces and Applications

Support for asynchronous communication

• Non-blocking communication interfaces (see the MPI sketch after this list)

• Sparse and topology aware

System noise issues

• Interfaces that support communication delegation (e.g., offloading)

Minimize data motion

• Semantics that support data “aggregation”
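
The non-blocking interfaces mentioned above are what standard MPI already exposes. The following is a minimal sketch (plain MPI, nothing vendor-specific; peer rank and counts are illustrative) of posting communication, computing while it progresses, and completing it later:

/* Sketch: overlap computation with communication using non-blocking MPI.
 * Assumes an MPI library is available; peer rank and counts are illustrative. */
#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int n, int peer)
{
    MPI_Request reqs[2];

    /* Post the receive and the send without blocking the caller. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation proceeds here while the network moves data ... */

    /* Complete both operations before reusing the buffers. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}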


Standardize your Interconnect


Standard!

Standardized wire protocol

• Mix and match of vertical and horizontal components

• Ecosystem – build together

Open-source, extensible interfaces

• Extend and optimize per application's needs


Diagram: extensible open-source framework. Applications and protocols/accelerators (MPI / SHMEM / PGAS, storage, TCP/IP sockets, RPC, packet processing) sit on top of Extended Verbs, which maps onto multiple hardware generations (HW A–D) and vendor extensions.


Speed, Speed, Speed


Technology Roadmap – Always One-Generation Ahead

Timeline: 20Gb/s → 40Gb/s → 56Gb/s → 100Gb/s across 2000–2010s, with 200Gb/s on the roadmap around 2015 and beyond; terascale to petascale to exascale ("mega supercomputers"), all Mellanox connected. Milestones called out: "Roadrunner" (#1 on the TOP500, Mellanox connected) and the Virginia Tech (Apple) cluster (#3, TOP500 2003).


Connect-IB Performance

100Gb/s interconnect (dual-port FDR 56Gb/s InfiniBand)

137 million messages per second
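
As a quick check on that headline: a dual-port FDR adapter offers 2 × 56 Gb/s = 112 Gb/s of aggregate link speed, which is the basis for describing Connect-IB as a 100Gb/s-class interconnect.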


Connect-IB Interconnect Throughput

Source: Prof. DK Panda

Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (4 bytes to 1M). Peak unidirectional bandwidth: ConnectX-2 PCIe2 QDR ~3,385 MB/s; ConnectX-3 PCIe3 FDR ~6,343 MB/s; Connect-IB dual FDR ~12,485 MB/s (Sandy Bridge) and ~12,810 MB/s (Ivy Bridge). Peak bidirectional bandwidth: ~6,521, ~11,643, ~21,025 and ~24,727 MB/s respectively.


Scalability


Collective Operations

Topology Aware

Hardware Multicast

Separate Virtual Fabric

Offload

Charts: Barrier and Reduce collective latency (us) vs. number of processes (PPN=8, up to ~2,500 processes), with and without FCA.


Reliable Connection Transport Mode


Dynamically Connected Transport Mode


Dynamically Connected Transport – Exascale Scalability

Chart: host memory consumption (MB, log scale up to 1,000,000,000) for transport state as fabrics grow from 8 nodes to 2K, 10K and 100K nodes, across InfiniHost (RC, 2002), InfiniHost-III (SRQ, 2005), ConnectX (XRC, 2008) and Connect-IB (DCT, 2012).
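
To make the trend in that chart concrete, here is a back-of-the-envelope model; the per-connection sizes are illustrative assumptions, not Connect-IB figures. With RC every process keeps a QP per peer, so transport memory grows with the node count, while DCT reuses a small fixed pool of DC initiators/targets:

/* Back-of-the-envelope transport memory model (illustrative sizes only). */
#include <stdio.h>

int main(void)
{
    const double rc_qp_kb    = 1.0;   /* assumed footprint of one RC QP, KB     */
    const double dct_pool_kb = 64.0;  /* assumed fixed pool of DC resources, KB */
    const int nodes[] = {8, 2000, 10000, 100000};

    for (int i = 0; i < 4; i++) {
        int peers = nodes[i] - 1;                 /* one RC QP per remote node  */
        printf("%6d nodes: RC ~ %9.1f MB, DCT ~ %.3f MB per process\n",
               nodes[i], peers * rc_qp_kb / 1024.0, dct_pool_kb / 1024.0);
    }
    return 0;
}

The absolute numbers are made up; the point is that the RC column grows linearly with fabric size while the DCT column does not.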


Dynamically Connected Transport Advantages


SA Scalability

Subnet Administrator (SA) queried for every connection

Communication between all nodes creates an n^2 load on the SA

Solution: Scalable SA (SSA)

• Turns a centralized problem into a distributed one

Centralized SM/SA: at 40,000 nodes, ~1.6 billion path records (~500 MB); at ~50K queries per second, ~9 hours to serve them all.

Distributed SSA: 40K–8M path records per access node, ~1.5 hours of calculation.

More…

Diagram: additional fabric topics – fat-tree InfiniBand subnets (each with its subnet managers) connected by routers; gateways; administrator/management, compute and storage/filer servers; QoS; and distance (long-reach links).


Network Offload


CORE-Direct

Scalable collective communication

Asynchronous communication

Communication managed by the communication resources

Avoid system noise

Task list: each task names a target QP and an operation –

• Send

• Wait for completions

• Enable

• Calculate
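
CORE-Direct itself is driven through such hardware task lists, but applications typically reach the offload through MPI's non-blocking collectives. A minimal sketch (standard MPI-3 calls, nothing vendor-specific) of hiding a reduction behind computation:

/* Sketch: overlap a collective with computation via a non-blocking MPI collective.
 * With an offload-capable stack underneath, the collective can progress on the
 * HCA while the CPU keeps computing, reducing sensitivity to system noise. */
#include <mpi.h>

void reduce_async(double *local, double *global, int n)
{
    MPI_Request req;

    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... application work that does not yet need 'global' ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the reduced result is valid from here */
}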


Optimizing Non-Contiguous Memory Transfers


On Demand Paging

No memory pinning, no memory registration, no registration caches!

Advantages

• Greatly simplified programming

• Unlimited MR sizes

• Physical memory optimized to hold current working set

Diagram: pages of the process address space (0x1000–0x6000) map on demand to physical frames (PFN1–PFN6), and the HCA's IO virtual addresses (0x1000–0x6000) follow the same mapping.

ODP promise:

IO virtual address mapping == Process virtual address mapping
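
A hedged sketch of how ODP is requested through the verbs API: the region is registered with an on-demand access flag (this assumes an ODP-capable device and driver), so pages are faulted in by the HCA as they are touched rather than pinned up front.

/* Sketch: register an on-demand-paging memory region with libibverbs.
 * Assumes 'pd' is a protection domain on an ODP-capable device. */
#include <stddef.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_odp(struct ibv_pd *pd, void *buf, size_t len)
{
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_READ |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_ON_DEMAND;       /* pages are not pinned up front */

    return ibv_reg_mr(pd, buf, len, access); /* NULL on failure, check errno  */
}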


Connecting Compute Elements

x86 → +GPU → +ARM/POWER


GPUDirect RDMA

Diagrams: transmit and receive data paths (CPU, chipset, GPU, GPU memory, system memory, InfiniBand). With GPUDirect 1.0, data moving between GPU memory and the InfiniBand adapter is staged through system memory; with GPUDirect RDMA, the adapter accesses GPU memory directly.
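
On the application side the GPUDirect RDMA recipe is essentially: allocate device memory with CUDA and register that pointer as an RDMA memory region. The sketch below assumes a stack with GPU peer-memory support loaded; it is illustrative, not guaranteed to work on every configuration.

/* Sketch: expose GPU memory directly to the HCA (GPUDirect RDMA).
 * Assumes CUDA plus an RDMA stack with GPU peer-memory support. */
#include <stddef.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;

    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)   /* allocate on the GPU */
        return NULL;

    /* Register the device pointer itself; RDMA then targets GPU memory
     * without staging the data through system memory. */
    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}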


CPU and Network

Integrated vs. discrete attachment: POWER8, CAPI, NVLINK


Storage


InfiniBand RDMA Storage – Get the Best Performance!

Transport protocol implemented in hardware

Zero copy using RDMA

Charts: disk access latency at 4K IO (microseconds), iSCSI/TCP vs. iSCSI/RDMA; and KIOPs at 4K IO size – iSCSI (TCP/IP): 130, 1 x FC 8Gb port: 200, 4 x FC 8Gb ports: 800, iSER 1 x 40GbE/IB port: 1,100, iSER 2 x 40GbE/IB ports (+acceleration): 2,300. Callout: 5–10% of the latency under 20x the workload (131K IOPs vs. 2,300K IOPs).
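
Checking that callout against the chart's own numbers: 2,300K IOPs / 131K IOPs ≈ 17.6, i.e. roughly the "20x the workload" cited, delivered at a small fraction of the TCP latency.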


Data Protection Offloaded

Used to provide data block integrity check capabilities (CRC) for block storage (SCSI)

Proposed by the T10 committee

DIX extends the support to main memory
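
For context, T10 DIF appends an 8-byte protection-information tuple to each data block; a sketch of its layout (fields are big-endian on the wire, and their use varies by DIF type):

/* Sketch: the 8-byte T10 DIF protection information carried per data block. */
#include <stdint.h>

struct t10_dif_tuple {
    uint16_t guard_tag; /* CRC-16 over the block's data                 */
    uint16_t app_tag;   /* application/owner defined tag                */
    uint32_t ref_tag;   /* reference tag, typically low 32 bits of LBA  */
};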


Power


Motivation For Power Aware Design

Today networks work at maximum capacity with almost constant dissipation

Low power silicon devices may not suffice to meet future requirements for low-energy networks

The electricity consumption of datacenters is a significant contributor to the total cost of operation

• Cooling costs scale as 1.3X the total energy consumption

Lower power consumption lowers OPEX

• Annual OPEX for 1KW is ~$1000

* According to the Global e-Sustainability Initiative (GeSI), if green network technologies (GNTs) are not adopted
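
A quick sanity check of that OPEX figure, assuming roughly $0.10–0.12 per kWh (the electricity price is an assumption, not from the slide): 1 kW running continuously is 24 × 365 = 8,760 kWh per year, i.e. about $880–$1,050, on the order of $1,000 before the ~1.3x cooling overhead noted above.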


Increasing Energy Efficiency with Mellanox Switches

Low Energy COnsumption NETworks

ASIC level: voltage/frequency scaling, dynamic array power-save

Link level: width/speed reduction, Energy Efficient Ethernet

Switch level: port/module/system shutdown

System level: energy-aware cooling control

Aggregate: power-aware API in FW / MLNX-OS™ / UFM™


Prototype Results – Power Scaling @ Link Level

Speed Reduction Power Save [SRPS]

Width Reduction Power Save [WRPS]

Chart: power relative to the highest speed as the link drops from FDR (56Gb IB) to QDR (40Gb IB) and SDR (10Gb IB).


Prototype Results - Power Scaling @ System Level

Internal port shutdown (Director switch) - ~1%/port

Fan system analysis for improved power algorithm

Charts: power relative to full connectivity vs. number of closed internal ports (0–36) – the power decrease from internal port closing; and total system power [W] for the SX1036 vs. %BW per port (10–100%), WRPS enabled (auto mode) vs. disabled – load-based power scaling.


Monitoring and Diagnostics


Unified Fabric Manager

Automatic discovery, central device management, fabric dashboard, congestion analysis, health & performance monitoring, advanced alerting, fabric health reports, service-oriented provisioning.


Cloud Technologies

Enabling the Applications of Tomorrow

• Unbounded cloud performance

• Lower your IT cost by 50%

• On-demand compute, storage and networking

• OpenCloud Architecture: Open Platform / Open Source


CloudX OpenCloud Architecture Overview

Hardware

• Off-the-shelf components

- Servers, storage, interconnect, software

• Mellanox 40/56Gb/s interconnect

• Solid State Drives (SSD)

• Latest CPU technology

OpenStack components

• Nova compute

• Network controller node

• Storage node - Cinder & Ceph

• Mellanox OpenStack plug-ins

The Building Blocks for the Most Efficient Cloud


CloudX Applications Performance Examples

4X Faster Run Time! Benchmark: TestDFSIO (1TeraByte, 100 files)

4X Higher Performance! Benchmark: 1M Records Workload (4M Operations)

2X faster run time and 2X higher throughput

2X Faster Run Time! Benchmark: MemCacheD Operations

3X Faster Run Time! Benchmark: Redis Operations


Exascale!

Thank You


Comprehensive End-to-End InfiniBand and Ethernet Portfolio: ICs, adapter cards, switches/gateways, cables/modules, host/fabric software

The Only Provider of End-to-End 40/56Gb/s Solutions – From Data Center to Metro and WAN

X86, ARM and Power based Compute and Storage Platforms

The Interconnect Provider For 10Gb/s and Beyond