TRANSCRIPT
The Future of Interconnect
Dror Goldenberg, VP Software Architecture
HPC Advisory Council, Switzerland, 2014
Exascale-Class Computer Platforms – Communication Challenges
Challenge                                          Solution focus
Very large functional unit count (~10M)            Scalable communication capabilities: point-to-point & collectives
Large on-"node" functional unit count (~500)       Scalable HCA architecture
Deeper memory hierarchies                          Cache-aware network access
Smaller amounts of memory per functional unit      Low-latency, high-bandwidth capabilities
May have functional unit heterogeneity             Support for data heterogeneity
Component failures part of "normal" operation      Resilient and redundant stack
Data movement is expensive                         Optimize data movement
Challenges of Scalable HPC Interfaces and Applications
Support for asynchronous communication
• Non-blocking communication interfaces (see the sketch after this list)
• Sparse and topology aware
System noise issues
• Interfaces that support communication delegation (e.g., offloading)
Minimize data motion
• Semantics that support data “aggregation”
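A minimal C/MPI sketch of the non-blocking interfaces called for above; the buffer size and rank-pairing scheme are illustrative only, and an even number of ranks is assumed. The transfers are posted early, overlapped with independent work, and completed later:

#include <mpi.h>

/* Post a send and a receive up front, overlap them with
 * computation, and complete both later. */
int main(int argc, char **argv)
{
    double sendbuf[1024] = {0}, recvbuf[1024];
    MPI_Request reqs[2];
    int rank, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = rank ^ 1;  /* pair ranks 0-1, 2-3, ... */

    /* Neither call blocks; the fabric progresses the transfers. */
    MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch the buffers ... */

    /* Complete both requests before reusing the buffers. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Finalize();
    return 0;
}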
Standardize your Interconnect
Standard!
Standardized wire protocol
• Mix and match of vertical and horizontal components
• Ecosystem – build together
Open-source, extensible interfaces
• Extend and optimize per application needs
Extensible Open-Source Framework
[Diagram: applications on top; protocol/accelerator layers (MPI / SHMEM / PGAS, TCP/IP sockets, RPC, packet processing, storage) running over extended verbs plus vendor extensions; hardware from multiple vendors (HW A-D) underneath.]
Technology Roadmap – Always One-Generation Ahead
[Timeline, 2000-2020: Mellanox interconnect generations at 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s, and 200Gb/s (2015), carrying systems from terascale through petascale toward exascale-class "mega supercomputers". Milestones: Virginia Tech's Apple cluster (3rd on the TOP500, 2003) and the Mellanox-connected "Roadrunner" (1st on the TOP500).]
Connect-IB Performance
100Gb/s interconnect (dual-port FDR 56Gb/s InfiniBand)
137 million messages per second
Connect-IB Interconnect Throughput
Source: Prof. DK Panda
[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size, 4 bytes to 1 MB. Peak values:]

Adapter                                Unidirectional (MB/s)   Bidirectional (MB/s)
ConnectX-2, PCIe Gen2, QDR             3,385                   6,521
ConnectX-3, PCIe Gen3, FDR             6,343                   11,643
Connect-IB, dual FDR (Sandy Bridge)    12,485                  21,025
Connect-IB, dual FDR (Ivy Bridge)      12,810                  24,727
Collective Operations
Topology Aware
Hardware Multicast
Separate Virtual Fabric
Offload (see the sketch after the charts)
[Charts: barrier collective latency (0-100 µs) and reduce collective latency (0-3,000 µs) vs. number of processes (up to 2,500, PPN=8), with and without FCA.]
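The portable way for an application to reach this kind of offloaded collective is MPI-3's non-blocking collectives; a minimal sketch, assuming an MPI-3 library (whether the barrier actually runs on the HCA, e.g. via FCA, is up to the MPI implementation, not user code):

#include <mpi.h>

/* Start a barrier, keep computing while it progresses in the
 * background, then complete it at the real synchronization point. */
int main(int argc, char **argv)
{
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Ibarrier(MPI_COMM_WORLD, &req);   /* returns immediately */

    /* ... work that does not depend on the barrier ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}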
Reliable Connection Transport Mode
Dynamically Connected Transport Mode
Dynamically Connected Transport – Exascale Scalability
[Chart: host memory consumed by connection state (MB, log scale from 1 to 1,000,000,000) at cluster sizes of 8, 2K, 10K, and 100K nodes, across four transport generations: InfiniHost RC (2002), InfiniHost-III SRQ (2005), ConnectX XRC (2008), and Connect-IB DCT (2012). See the scaling model below.]
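A back-of-the-envelope model of why the curves diverge; every constant here (bytes of state per QP, processes per node, DCT pool size) is a hypothetical placeholder. Fully connected RC needs a queue pair per peer process, while DCT keeps a small fixed pool per process:

#include <stdio.h>

/* Hypothetical constants for illustration only. */
#define BYTES_PER_QP  (4 * 1024)   /* state per queue pair        */
#define PPN           16           /* processes per node          */
#define DCT_POOL      16           /* fixed per-process DC objects */

int main(void)
{
    long nodes[] = {8, 2000, 10000, 100000};

    for (int i = 0; i < 4; i++) {
        long peers = nodes[i] * PPN - 1;   /* RC: one QP per peer  */
        double rc_mb  = (double)peers * BYTES_PER_QP / 1e6;
        double dct_mb = (double)DCT_POOL * BYTES_PER_QP / 1e6;
        printf("%6ld nodes: RC %10.1f MB/process, DCT %.3f MB/process\n",
               nodes[i], rc_mb, dct_mb);
    }
    return 0;
}

Under these placeholder numbers, RC state grows linearly with the number of peer processes (gigabytes per process at 100K nodes), while DCT stays constant, which is the shape the chart shows.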
Dynamically Connected Transport Advantages
SA Scalability
Subnet Administrator (SA) queried for every connection
Communication between all nodes creates an n² load on the SA
Solution: Scalable SA (SSA)
• Turns a centralized problem into a distributed one
[Figure: centralized SM/SA at 40,000 nodes: 1.6 billion path records, 500 MB of state, ~9 hours to serve at 50K queries per second. Distributed SSA: 40K-8M path records per access node, ~1.5 hours of calculation.]
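The 9-hour figure is straightforward to check: 40,000 nodes yield 40,000² = 1.6 billion node-pair path records, and serving 1.6 billion queries at 50,000 per second takes 32,000 seconds, i.e. roughly 9 hours.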
More…
[Diagram: additional fabric elements: fat-tree topologies, multiple InfiniBand subnets each with subnet managers, routers between subnets, gateways, management servers, storage and filers, network/IPC/storage traffic with QoS, and long-distance links.]
CORE-Direct
Scalable collective communication
Asynchronous communication
Communication managed by the communication resources themselves
Avoid system noise
The HCA executes a pre-posted task list; each task names a target QP and an operation (see the sketch below):
• Send
• Wait for completions
• Enable
• Calculate
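A schematic C rendering of such a task list; the enum, struct, and QP numbers are purely illustrative (the real interface is exposed through Mellanox verbs extensions), meant only to show how one step of a collective can be pre-posted for the HCA to execute without CPU involvement:

#include <stdint.h>

/* Illustrative only; not the actual CORE-Direct API. */
enum task_op {
    TASK_SEND,      /* post a send on the target QP             */
    TASK_WAIT,      /* wait for N completions before advancing  */
    TASK_ENABLE,    /* enable a previously posted, gated task   */
    TASK_CALCULATE  /* perform a reduction step in the HCA      */
};

struct task {
    enum task_op op;
    uint32_t     target_qp;  /* QP the task operates on         */
    uint32_t     count;      /* e.g., completions to wait for   */
};

/* One recursive-doubling barrier round for a single rank,
 * expressed as tasks the HCA runs with no CPU involvement: */
static const struct task barrier_round[] = {
    { TASK_SEND,   7, 0 },   /* notify the partner rank         */
    { TASK_WAIT,   3, 1 },   /* wait for the partner's notice   */
    { TASK_ENABLE, 9, 0 },   /* trigger the next round's tasks  */
};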
Optimizing Non-Contiguous Memory Transfers
On Demand Paging
No memory pinning, no memory registration, no registration caches!
Advantages
• Greatly simplified programming
• Unlimited MR sizes
• Physical memory optimized to hold current working set
[Diagram: a process address space with pages 0x1000-0x6000 backed by physical frames PFN1-PFN6; the IO virtual address range 0x1000-0x6000 follows the same mapping.]
ODP promise:
IO virtual address mapping == Process virtual address mapping
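In libibverbs terms, ODP is requested per memory region with the IBV_ACCESS_ON_DEMAND access flag; a minimal sketch, with error handling elided and assuming the HCA and a sufficiently recent libibverbs support ODP:

#include <infiniband/verbs.h>
#include <stdlib.h>

/* Register a buffer as an on-demand-paging MR: the HCA faults
 * pages in as it touches them, so nothing is pinned up front. */
struct ibv_mr *reg_odp(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);   /* plain, unpinned memory */
    if (!buf)
        return NULL;

    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_ON_DEMAND |
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}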
Connecting Compute Elements
x86 → +GPU → +ARM/POWER
GPUDirect RDMA
[Diagram: transmit and receive paths between CPU, chipset, GPU, GPU memory, system memory, and InfiniBand. With GPUDirect 1.0 the data is staged through system memory; with GPUDirect RDMA the adapter reads and writes GPU memory directly.]
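Once the GPUDirect RDMA kernel module is loaded, GPU memory can be registered with the verbs stack much like host memory; a hedged sketch, assuming a CUDA device and omitting error handling:

#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register GPU memory for RDMA: with GPUDirect RDMA in place,
 * ibv_reg_mr accepts a device pointer, so transfers between the
 * HCA and GPU memory bypass system memory entirely. */
struct ibv_mr *reg_gpu(struct ibv_pd *pd, size_t len)
{
    void *dbuf = NULL;
    cudaMalloc(&dbuf, len);   /* GPU memory allocation */

    return ibv_reg_mr(pd, dbuf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}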
CPU and Network
[Diagram: integrated vs. discrete CPU-network attachment options, including POWER8 with CAPI and NVLINK.]
InfiniBand RDMA Storage – Get the Best Performance!
Transport protocol implemented in hardware
Zero copy using RDMA
[Chart: disk access latency @ 4K IO (µsec, 0-10,000 scale), iSCSI/TCP vs. iSCSI/RDMA.]

KIOPs @ 4K IO size:
iSCSI (TCP/IP)                              130
1 x FC 8Gb port                             200
4 x FC 8Gb ports                            800
iSER, 1 x 40GbE/IB port                     1,100
iSER, 2 x 40GbE/IB ports (+acceleration)    2,300

iSER delivers 5-10% of the latency under 20x the workload (131K IOPs for iSCSI/TCP vs. 2,300K IOPs for iSER).
Data Protection Offloaded
Provides data block integrity checking (CRC) for block storage (SCSI)
Proposed by the T10 committee (T10-DIF)
DIX extends the protection to main memory
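For reference, the T10 protection information is an 8-byte tuple carried with each 512-byte block; a C sketch of its layout (field names are mine, widths follow the standard):

#include <stdint.h>

/* T10-DIF protection information: 8 bytes per 512-byte block. */
struct t10_dif_tuple {
    uint16_t guard_tag;   /* CRC16 over the data block            */
    uint16_t app_tag;     /* application-defined tag              */
    uint32_t ref_tag;     /* typically lower LBA bits; catches    */
                          /* misdirected writes                   */
};

Offloading means the HCA computes and verifies these tuples in-line as data moves, so neither the host CPU nor an extra pass over memory is needed.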
Motivation For Power Aware Design
Today's networks run at maximum capacity, with almost constant power dissipation
Low-power silicon devices may not suffice to meet future requirements for low-energy networks*
The electricity consumption of datacenters is a significant contributor to the total cost of operation
• Cooling costs scale as 1.3x the total energy consumption
Lower power consumption lowers OPEX
• Annual OPEX for 1kW is ~$1,000 (see the arithmetic below)
* According to the Global e-Sustainability Initiative (GeSI), if green network technologies (GNTs) are not adopted
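As a sanity check on that figure: 1kW running continuously consumes 24 × 365 = 8,760 kWh per year, which at an assumed ~$0.11/kWh comes to roughly $1,000.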
Increasing Energy Efficiency with Mellanox Switches
Low Energy COnsumption NETworks:

Level    Techniques
ASIC     Voltage/frequency scaling; dynamic array power-save
Link     Width/speed reduction; Energy Efficient Ethernet
Switch   Port/module/system shutdown
System   Energy-aware cooling control; aggregate Power Aware API (FW / MLNX-OS™ / UFM™)
Prototype Results - Power Scaling @ Link Level
Speed Reduction Power Save [SRPS]
Width Reduction Power Save [WRPS]
[Chart: power relative to highest speed (60-100%) vs. link speed: FDR (56Gb IB), QDR (40Gb IB), SDR (10Gb IB).]
Prototype Results - Power Scaling @ System Level
Internal port shutdown (Director switch): ~1% power saved per port
Fan system analysis for improved power algorithm
[Chart: power relative to full connectivity (60-100%) vs. number of internally closed ports (0-36).]
[Chart: total system power (W) for the SX1036 vs. %BW per port (10-100%), WRPS enabled (auto mode) vs. disabled; load-based power scaling, with measured totals between roughly 62W and 93W.]
Unified Fabric Manager
• Automatic discovery
• Central device management
• Fabric dashboard
• Congestion analysis
• Health & performance monitoring
• Advanced alerting
• Fabric health reports
• Service-oriented provisioning
Cloud Technologies
Enabling the Applications of Tomorrow
• Unbounded cloud performance
• Lower your IT cost by 50%
• On-demand compute, storage and networking
• OpenCloud Architecture: Open Platform / Open Source
CloudX OpenCloud Architecture Overview
Hardware
• Off-the-shelf components
- Servers, storage, interconnect, software
• Mellanox 40/56Gb/s interconnect
• Solid State Drives (SSD)
• Latest CPU technology
OpenStack components
• Nova compute
• Network controller node
• Storage node - Cinder & Ceph
• Mellanox OpenStack plug-ins
The Building Blocks for the Most Efficient Cloud
CloudX Applications Performance Examples
4X Faster Run Time! Benchmark: TestDFSIO (1TeraByte, 100 files)
4X Higher Performance! Benchmark: 1M Records Workload (4M Operations)
2X faster run time and 2X higher throughput
2X Faster Run Time! Benchmark: MemCacheD Operations
3X Faster Run Time! Benchmark: Redis Operations
The Only Provider of End-to-End 40/56Gb/s Solutions
The interconnect provider for 10Gb/s and beyond, from data center to Metro and WAN
X86, ARM and Power based compute and storage platforms
Comprehensive end-to-end InfiniBand and Ethernet portfolio: host/fabric software, ICs, switches/gateways, adapter cards, cables/modules, Metro/WAN