Introduction
Current Heterogeneous Systems
• Clusters with accelerators
• Accelerators are host-centric
• No integrated network interconnect
• Static assignment (1 CPU : N accelerators)
• PCIe can become a bottleneck
• Explicit programming required
Research Objective
Divide the Architecture in Two Parts
• Cluster based on multi-core chips
  - Executes scalar code
• Booster based on many-core technology
  - Runs highly scalable code
  - EXTOLL: switchless direct 3D torus [1]
=> Components-off-the-shelf philosophy

Goals of the Architecture
• Accelerator directly connected to the network
• Static and dynamic workload assignments
• Scale the number of accelerators and host CPUs independently in an N-to-M ratio
[Figs. 1–2 diagrams: a conventional heterogeneous system with accelerators (Acc) attached to compute nodes (CN) behind the interconnection network, versus the booster-based architecture with accelerators attached directly to the interconnection network.]
Results II – Application-level Evaluation
• MPI version of the LAMMPS application
• Equal thread-to-MIC distribution
• Communication time between MICs can be improved by up to 32%
[Fig. 3 diagram: system view – cluster nodes CN1…CNM on the Cluster Network reach booster nodes BN0…BNN (each an accelerator plus NIC on the Booster Network) through booster interface nodes BI0…BIK; user view – a node CNi sees Accelerator 0…Accelerator N directly.]
Network-Attached Accelerators Communication Model
• Simplified, transparent user view
• Direct accelerator-to-accelerator communication
• Dynamic workload distribution
• Accelerators accessible from any node without host interaction
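Because every accelerator is addressable over the network, a compute node is no longer pinned to locally attached devices: it can claim any free accelerator at runtime. A minimal sketch of such dynamic N-to-M assignment is shown below; the pool structure and function names are illustrative assumptions, not an API from the poster.

```c
#include <assert.h>

/* Sketch of dynamic accelerator assignment in an N-to-M system:
 * a global pool of network-attached accelerators from which any
 * compute node (CN) may claim a device, instead of the static
 * 1 CPU : N accelerators binding of host-centric clusters.
 * All names here are hypothetical. */
#define NUM_ACC 6

struct acc_pool {
    int owner[NUM_ACC];       /* -1 = free, otherwise claiming CN id */
};

static void pool_init(struct acc_pool *p)
{
    for (int i = 0; i < NUM_ACC; i++)
        p->owner[i] = -1;
}

/* Claim any free accelerator for compute node `cn`; returns its
 * global id, or -1 when the pool is exhausted. */
static int pool_claim(struct acc_pool *p, int cn)
{
    for (int i = 0; i < NUM_ACC; i++) {
        if (p->owner[i] == -1) {
            p->owner[i] = cn;
            return i;
        }
    }
    return -1;
}

static void pool_release(struct acc_pool *p, int acc)
{
    p->owner[acc] = -1;
}
```

A released accelerator immediately becomes claimable by any other node, which is what allows the accelerator and host CPU counts to scale independently.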
Prototype Implementation
• High-density booster node card (BNC) with two Intel Xeon Phis (61 cores) & NICs
Accelerator Access
• Distributed shared memory for communication
• MMIO range can be located anywhere in the network
• Loads and stores are encapsulated into network transactions
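The encapsulation idea can be sketched as follows: each remote accelerator's MMIO range is mapped into a contiguous window of the local address space, and a load or store into that window is translated into a network transaction carrying the destination node, the offset inside its BAR, and the payload. The wire format and window constants below are illustrative assumptions, not the EXTOLL/SMFU protocol.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical encapsulation of a forwarded MMIO store; field names
 * are illustrative, not the actual on-wire format. */
typedef struct {
    uint16_t dst_node;   /* booster node owning the target MMIO window */
    uint64_t offset;     /* offset inside that node's accelerator BAR   */
    uint32_t payload;    /* the 32-bit value being stored               */
} mmio_store_txn;

/* One full 32-bit MMIO range per booster node, mapped back-to-back
 * into the local address space (cf. Fig. 4); base is assumed. */
#define WINDOW_BASE 0x100000000ULL
#define WINDOW_SIZE 0x100000000ULL

/* Translate a local address inside the mapped region into a network
 * transaction: the window index selects the destination node, the
 * remainder is the offset within its BAR. */
static mmio_store_txn encapsulate_store(uint64_t local_addr, uint32_t value)
{
    uint64_t rel = local_addr - WINDOW_BASE;
    mmio_store_txn t;
    t.dst_node = (uint16_t)(rel / WINDOW_SIZE);
    t.offset   = rel % WINDOW_SIZE;
    t.payload  = value;
    return t;
}
```

In hardware this translation happens transparently on every access, which is why the MMIO range can live anywhere in the network without the issuing core noticing.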
Software Design
• Transparent to the accelerator software stack
• No accelerator driver changes
• Virtual PCI layer responsible for device & MSI configuration and IRQ handling
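The reason no driver changes are needed is that the driver only sees a generic PCI accessor, while a backend decides whether the device is local or reached over the network. The sketch below illustrates that dispatch idea with stand-in values; the structure and function names are assumptions for illustration, not the actual DEEP module API.

```c
#include <stdint.h>
#include <assert.h>

/* Minimal sketch of the idea behind a virtual PCI layer: the unmodified
 * accelerator driver calls one config-space entry point, and a pluggable
 * backend services it either locally or across the network. */
typedef uint32_t (*cfg_read_fn)(uint16_t bus_dev_fn, uint16_t reg);

struct pci_backend {
    const char *name;
    cfg_read_fn read_cfg;
};

static uint32_t local_read_cfg(uint16_t bdf, uint16_t reg)
{
    (void)bdf; (void)reg;
    return 0x00001234;   /* stand-in for a real local config access */
}

static uint32_t remote_read_cfg(uint16_t bdf, uint16_t reg)
{
    (void)bdf; (void)reg;
    /* In the real system this would be encapsulated into a network
     * transaction to the booster node that owns the device. */
    return 0x00005678;   /* stand-in for the forwarded reply */
}

/* The driver only ever calls this; swapping the backend underneath is
 * invisible to it, which is why no driver changes are required. */
static uint32_t pci_read_config(const struct pci_backend *be,
                                uint16_t bdf, uint16_t reg)
{
    return be->read_cfg(bdf, reg);
}
```

The same indirection covers MSI configuration and IRQ delivery: interrupts raised on a booster node are forwarded to the node that configured them.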
Results I – MPI Performance
[Figs. 7–8 plots: MIC-to-MIC latency (usec) and bandwidth (MB/sec) over message size (1 B–4 MB) for mic0–mic1 Booster, mic0–mic1 OFED/SCIF, mic0–mic1 TCP/SCIF, and mic0–remote mic0 EXTOLL.]
Network-Attached Accelerators: Host-independent Accelerators for Future HPC Systems
The research leading to these results has been conducted in the frame of the DEEP (Dynamically Exascale Entry Platform) project, which has received funding from the European Union's Seventh Framework Programme for research, technological development, and demonstration under grant agreement no 287530.
Sarah Neuwirth, Dirk Frey, and Ulrich Brüning Institute of Computer Engineering (ZITI) – University of Heidelberg, Germany; {sarah.neuwirth,dirk.frey,ulrich.bruening}@ziti.uni-heidelberg.de
References
[1] H. Fröning, M. Nüssle, C. Leber, and U. Brüning, "On Achieving High Message Rates," in CCGrid '13, pp. 498–505.
[2] S. Neuwirth, D. Frey, M. Nüssle, and U. Brüning, "Scalable Communication Architecture for Network-Attached Accelerators," in HPCA '15, pp. 627–638.
Fig. 1: Heterogeneous system.
Fig. 2: Booster-based architecture.
Fig. 5: Comparison of the software stacks.
Fig. 4: Memory mapping between BNs and BI.
Fig. 3: Communication model and 3D torus topology [2].
Fig. 6: Prototype system (BNC with backplane and prototype BI).
Fig. 8: Internode MIC-to-MIC Bandwidth and bi-bandwidth. (a) Bandwidth. (b) Bidirectional bandwidth.
Fig. 7: Latency MIC-to-MIC internode half round-trip. (a) Small messages. (b) Large messages.
Fig. 9: LAMMPS performance using a bead-spring polymer, Lennard-Jones, and copper metallic solid benchmark. (a) 32 Threads, 16 Threads/MIC. (b) 64 Threads, 32 Threads/MIC.
[Fig. 8(b) plot: bidirectional bandwidth (MB/sec) over message size (1 B–4 MB) for mic0–mic1 Booster, mic0–mic1 OFED/SCIF, mic0–mic1 TCP/SCIF, and mic0–remote mic0 EXTOLL.]
More information available at www.deep-project.eu
Conclusion
• Novel communication architecture
• Scales the number of accelerators and CPUs independently
• Host-independent direct accelerator-to-accelerator communication with very low latency
• High-density implementation with promising MPI performance
[Fig. 5 diagram: conventional system stack (User Level Tools → Accelerator Driver → Kernel PCI API → Hardware) versus the booster interface node stack, where a Virtual PCI layer and the EXTOLL driver (loadable/built-in modules) sit beneath the unmodified User Level Tools and Accelerator Driver.]
[Fig. 4 diagram: booster nodes BN0, BN1, … each expose the MMIO range (00000000h–FFFFFFFFh) of their MIC through the SMFU of the EXTOLL NIC; the BI maps each accelerator MMIO window (MIC0, MIC1, …, MICN) at an addressx within its own address space.]