TRANSCRIPT
© Atos Technologies
On the road to exascale: how to boost communication performance
BXI: Bull eXascale Interconnect
June 24, 2015
Jean-Pierre Panziera
Bull, Atos Technologies
Bull Exascale program:
– SEQUANA (bullx S6000 series)
– BXI interconnect
– bullx supercomputer suite
– parallel programming environment
Yales – MPI profile on 1024 threads
Dominant MPI calls: MPI_Allreduce, MPI_Recv, MPI_Wait, MPI_Barrier
With FDR 56 Gb/s: total time 700 s, MPI time 320 s
With EDR 100 Gb/s: total time 630 s(*), MPI time 250 s(*)
(*) estimations
BXI: Efficient communications for Exascale applications
▶ Bull Interconnect for the Exascale
– HW acceleration: sustained performance under heavy load
– High Bandwidth, Low latency, High message rate
– Exascale scalability
▶ BXI full acceleration in hardware
– Portals 4 (unconnected protocol) - Minimum constant memory footprint
– HW support for MPI communications (send/recv, collectives, asynchronous)
– PGAS – Partitioned Global Address Space – new programming languages
▶ BXI highly scalable, efficient and reliable
– up to 64k nodes
– Adaptive Routing
– QoS / 16 virtual channels; typically 2 virtual networks (data + IO)
– Reliable: end-to-end error checking + link level CRC & ECC
The BXI network is based on 2 ASICs:
– BXI NIC (“Lutetia”): PCI Express Gen3 x16 host interface, 1 BXI link at 100 (4×25) Gb/s, MPI latency <1 µs, issue rate 100 Mmsg/s
– BXI switch (“Divio”): 48 BXI link ports, 9600 Gb/s aggregate bandwidth
BXI fabric
Diagram: BXI network with indirect connections. Each node attaches through its BXI NICs to a fabric of interconnected BXI switches.
BXI: offloading MPI communication in HW (1/2)
#include <mpi.h>
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Wait(MPI_Request *request, MPI_Status *status)
Diagram: the source posts Isend, the destination posts Irecv; both ranks compute and then call Wait. The NIC resolves the virtual-to-physical address translation (V2P, with size) and the logical-to-physical rank translation (L2P) in hardware.
BXI: offloading MPI communication in HW (2/2)
Timeline: without HW offload, the Isend/Irecv transfers progress only inside the Wait calls, which stretch and delay compute; with HW offload, the NIC progresses communication while the host computes, so the Wait calls return almost immediately and total time shrinks.
BXI software stack (top to bottom):
– User applications, ISV applications
– Middleware: Open MPI and other MPIs; PGAS (SHMEM, UPC, CAF); IP; Lustre; services (NFS, …)
– Portals 4, in userspace and in the kernel (over the kernel driver)
– PCIe / BXI NIC
BXI: 48-port switch ASIC
Diagram: each of the 48 BXI ports pairs an input port with an output port. The input port holds its own routing tables (static and adaptive entries); the output port drives output FIFOs carrying 16 virtual channels.
– 48 ports
– Adaptive routing
– 1 routing table per port
– 16 VCs
BXI fabric management
Diagram: each BXI “Divio” switch embeds an ARM microcontroller that monitors fans, sensors, etc. (“JTAG”) and connects over GbE to an Ethernet management network; the management node attaches at 10 GbE and runs the fabric management software, which covers:
– Routing → up to 64k nodes
– Supervision → failure handling
– Topology → cable checking
– Performance → counter access
BXI Fabric Management: QuickRepair
▶ 64k nodes
▶ Computing the full routing tables takes >5 minutes (the routing algorithm cost is O(N^4)), while the goal is <5 s
▶ On link failure and re-activation, BXI QuickRepair computes local table updates in <1 s
BXI PCI adapter card and 48-port standalone switch
– BXI PCIe card: BXI NIC, PCIe Gen3 x16 host interface, 1 BXI port, optical cables (100 Gb/s)
– Standalone switch (1U): 48 BXI ports, redundant power supplies, redundant fans
“Sequana” – Embedded interconnect
Diagram: fast interconnect layout. Within each 4-blade group, nodes reach the L1 switches through an “octopus” NIC–L1 connection; the L1 switches connect to the L2 switches through a cable backplane. Each L1 switch serves 24 nodes downward (compute and IO/service nodes) and has 12+12 uplinks to the L2 level.
Sequana cells interconnection
Diagram: many Sequana cells, each with its own L2/L1 switch levels serving compute nodes (Nd) and IO/service nodes, are interconnected through L3 switches (48×). IO/service nodes are distributed across the cells.
Questions?