Enabling Multi-FPGA Clusters as an OpenMP Acceleration Device
Ramon Nepomuceno Renan Sterle Guido Araujo
2
Motivation: Seismic Imaging
image taken from the internet:https://www.saltambiental.com.br/geofisica/apresentacao/
3
Motivation: Seismic Imaging
image taken from the internet:https://www.saltambiental.com.br/geofisica/apresentacao/
4
Motivation: Seismic Imaging
image taken from the internet:https://www.saltambiental.com.br/geofisica/apresentacao/
Introduction
Heterogeneous Systems
5
Introduction
Heterogeneous Systems
6
Introduction
Heterogeneous Systems
7
Introduction
Heterogeneous Systems
8
Introduction
Heterogeneous Systems
9
Introduction
• An FPGA is a powerful device that can speedup pipelined applications
10
Introduction
11
• But even a powerful FPGA cannot handle very large problems like Seismic Imaging
• Need Multi-FPGA architectures
Introduction
• Moreover....
• FPGA toolchains are hardware oriented and unfriendly to software designers
Introduction
• Moreover....
• FPGA toolchains are hardware oriented and unfriendly to software designers
• Need a simpler and transparent approach to embed FPGA accelerators into software applications
Proposal
How to program such systems?
• We propose using OpenMP task-based programming model for Multi-FPGA architectures. We are extending it to:
• Enable automatic bitstream/data offloading to FPGAs
• Transparently map data dependencies to IPs/boards interconnects
• Allow a seamless flow of data to/from FPGA.
14
Background
15
OpenMP Task Directive
16
#pragma omp parallel #pragma omp single #pragma omp task depend (out:A) A = foo(); for(int i = 0; i < 2; i ++){ #pragma omp task depend(in:A)\\ depend(out: B[i])
B[i]=bar(a) } for(int i = 0; i < 4; i ++){ #pragma omp task depend(in:B[i/2])
fun(B[i/2]); }
OpenMP Task Directive
17
#pragma omp parallel #pragma omp single #pragma omp task depend (out:A) A = foo(); for(int i = 0; i < 2; i ++){ #pragma omp task depend(in:A)\\ depend(out: B[i])
B[i]=bar(a); } for(int i = 0; i < 4; i ++){ #pragma omp task depend(in:B[i/2])
fun(B[i/2]); }
Create a pool of worker threads
OpenMP Task Directive
18
#pragma omp parallel #pragma omp single #pragma omp task depend (out:A) A = foo(); for(int i = 0; i < 2; i ++){ #pragma omp task depend(in:A)\\ depend(out: B[i])
B[i]=bar(a); } for(int i = 0; i < 4; i ++){ #pragma omp task depend(in:B[i/2])
fun(B[i/2]); }
Select the control thread
#pragma omp parallel #pragma omp single #pragma omp task depend (out:A) A = foo(); for(int i = 0; i < 2; i ++){ #pragma omp task depend(in:A)\\ depend(out: B[i])
B[i]=bar(a) } for(int i = 0; i < 4; i ++){ #pragma omp task depend(in:B[i/2])
fun(B[i/2]); }
OpenMP Task Directive
foo
19
#pragma omp parallel #pragma omp single #pragma omp task depend (out:A) A = foo(); for(int i = 0; i < 2; i ++){ #pragma omp task depend(in:A)\\ depend(out: B[i])
B[i]=bar(a) } for(int i = 0; i < 4; i ++){ #pragma omp task depend(in:B[i/2])
fun(B[i/2]); }
OpenMP Task Directive
foo
bar bar
20
#pragma omp parallel #pragma omp single #pragma omp task depend (out:A) A = foo(); for(int i = 0; i < 2; i ++){ #pragma omp task depend(in:A)\\ depend(out: B[i])
B[i]=bar(a) } for(int i = 0; i < 4; i ++){ #pragma omp task depend(in:B[i/2])
fun(B[i/2]); }
OpenMP Task Directive
foo
fun
bar
fun fun fun
bar
21
control thread
OpenMP Task Runtime
22
OpenMP Runtime
OpenMP Task Runtime
23
dependency graph
worker threads
ready queue
OpenMP Task Runtime
24
#pragma omp target device(FPGA) map(to: X) map(from: Y)
CPU FPGA
…
…
X
Y
Shared Memory
OpenMP Target Directive
25
#pragma omp target device(FPGA) map(to: X) map(from: Y)
CPU FPGA
…
…
X
Y
Shared Memory
1
1
OpenMP Target Directive
26
#pragma omp target device(FPGA) map(to: X) map(from: Y)
CPU FPGA
…
…
X
Y
Shared Memory
2
2
OpenMP Target Directive
27
#pragma omp target device(FPGA) map(to: X) map(from: Y)
CPU FPGA
…
…
X
Y
Shared Memory
3
3
OpenMP Target Directive
28
#pragma omp parallel #pragma omp single #pragma omp target device(GPU)\\ map(from: A) depend (out: A) nowait A = foo(); for(int i = 0; i < 2; i ++){ #pragma omp target device(GPU)\\ map(to: A) map(from: B[i]) \\ depend (in: A) depend (out: B[i]) nowait
B[i]=bar(a) } for(int i = 0; i < 4; i ++){ #pragma omp target device(GPU)\\ map(to: B[i/2]) \\ depend (in: B[i/2]) nowait
fun(B[i/2]); }
OpenMP Target Depend Directive
29
But LLVM/OpenMP cannot handle FPGAs
LLVM/OpenMP FPGA Limitations:
● Does not support for bitstream offloading;● Does not support for mapping data dependencies to
IP/boards interconnects; ● Does not allow a seamless flow of data to/from
multiple-FPGA.
30
The Software
31
LLVMgeneratedHost Code libomptarget.so
Data GPU Plugin
DSP Plugin
X86_64 Plugin
GPU
DSP
X86_64 core
Host Devices
GPU Code
DSP Code
X86_64 Code
VC709 Code
VC709 Plugin
VC709 Board
Our modules: LLVM/OpenMP VC709 Plugin
32
fat binary
The Hardware
Single Node
33
The FPGA
● Field Programmable Gate Array.○ Xilinx VC709 Board
■ 2x 4GB DDR3 SODIMM memories;■ 8-lane PCIe;■ 4x SFP;
34
The Board
35
4 x 10Gb/s optical links
VFI
FO
DM
A IP
PC
Ie IP
256-bit @250 MHz
256-bit @250 MHz
Configuration Registers
64-bit @156.25 MHz
NET
64-bit @156.25 MHz
PC
I Exp
ress
Inte
rface
sfp+
VC709-board
Virtex-7
NET
NET
NET
sfp+
sfp+
sfp+
2x DDR3
VC709 Target Reference Design
36
VFI
FO
PC
IE/D
MA
net
AX
IS-IC
IP.0
Configuration registers
netIP.1
IP.2
IP.3
AXI-L AXI-L
Virtex-7
VC709-board
sfp+
sfp+
net
net
sfp+
2x DDR3
sfp+
2x DDR3
Our modules
37
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
Single node: How it works
Create pool of threads
38
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
Single node: How it works
Select the control thread
39
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
Single node: How it works
create the tasks
40
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
Single node: How it works
+1 +2 +3 +4
41
Single node: How it works
+1 +2 +3 +4
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
42
Single node: How it works
+1 +2 +3 +4
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
43
Single node: How it works
+1 +2 +3 +4
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
44
Single node: How it works
+1 +2 +3 +4
PC
IE/D
MA
A
XIS-
IC
Virtex-7
VC709-board
+1+2+3+4
45
The Hardware
Multiple Nodes
46
Multiple Nodes
47
Multi-Nodes
48
• Need to forward dependency data across optical links○ Designed a MAC Frame Handler (MFH)○ Mount and unmount a MAC frame.
Adding MFH
VFI
FO
PC
IE/D
MA
net
AXI-S AXI-SAXI-S
AXI-L: AXI-LITEAXI-S: AXI-STREAM
AX
IS-IC
IP.0
Configuration registers
netIP.1
IP.2
IP.3
AXI-L AXI-L
AXI-L
Virtex-7
VC709-board
sfp+
sfp+
net
net
sfp+
2x DDR3
sfp+
2x DDR3
MFH
49
Multiple nodes: How it works
Virtex-7
VC709-board A
+4
+1
A-S
WT
configuration registers
MFH
Virtex-7
VC709-board B
+3
+2A-S
WT
configuration registers
MFH
NODE-A NODE-B
50
Multiple nodes: How it works
Virtex-7
VC709-board A
+4
+1
A-S
WT
configuration registers
MFH
Virtex-7
VC709-board B
+3
+2A-S
WT
configuration registers
MFH
NODE-A NODE-B
51
Multiple nodes: How it works
Virtex-7
VC709-board A
+4
+1
A-S
WT
configuration registers
MFH
Virtex-7
VC709-board B
+3
+2A-S
WT
configuration registers
MFH
NODE-A NODE-B
52
Multiple nodes: How it works
Virtex-7
VC709-board A
+4
+1
A-S
WT
configuration registers
MFH
Virtex-7
VC709-board B
+3
+2A-S
WT
configuration registers
MFH
NODE-A NODE-B
53
Multiple nodes: How it works
Virtex-7
VC709-board A
+4
+1
A-S
WT
configuration registers
MFH
Virtex-7
VC709-board B
+3
+2A-S
WT
configuration registers
MFH
NODE-A NODE-B
54
Ongoing Experiments
VC709-board A
laplace 3
laplace 0
A-S
WT
configuration registers
MFH
NODE-A
laplace 1
laplace 2
VC709-board A
laplace 3
laplace 0
A-S
WT
configuration registers
MFH
NODE-B
laplace 1
laplace 2
Related Works
56
Year Multi-FPGA Compiler ToolKit Target System
Leow 2006 no C-Breeze Celoxica RC100 Board-Spartan
Cabrera 2009 no Mercurium, GCC, SGI RASClib SGI RASC 2.2 Board-Virtex 4
Cilardo 2013 no Custom OpenMP, Xilinx EDK Xilinx EDK Boards
Choi 2013 no LLVM GCC Altera Board with Stratix IV
Filgueras 2014 no Mercurium, Nanos++ Xilinx Board with Zynq-700
Podobas 2014 no Custom C89 Compiler Altera ED5 Stratix V
Sommer 2017 no Clang, LLVM, libomptarget, TPC Xilinx VC709 Board
Ceissler(FCCM) 2018 no Clang, LLVM, libomptarget Intel HARP2 and Amazon AWS
Bosh 2018 no Mercurium, Nanos++ Zynq Ultrascale+
Knaust 2019 no Clang LLVM Intel Board Arria 10 GX
Our work 2021 yes Clang LLVM, libomptarget Cluster of Xilinx VC709 Boards
OpenMP for FPGAs
57
Year Multi-FPGA Compiler ToolKit Target System
Leow 2006 no C-Breeze Celoxica RC100 Board-Spartan
Cabrera 2009 no Mercurium, GCC, SGI RASClib SGI RASC 2.2 Board-Virtex 4
Cilardo 2013 no Custom OpenMP, Xilinx EDK Xilinx EDK Boards
Choi 2013 no LLVM GCC Altera Board with Stratix IV
Filgueras 2014 no Mercurium, Nanos++ Xilinx Board with Zynq-700
Podobas 2014 no Custom C89 Compiler Altera ED5 Stratix V
Sommer 2017 no Clang, LLVM, libomptarget, TPC Xilinx VC709 Board
Ceissler(FCCM) 2018 no Clang, LLVM, libomptarget Intel HARP2 and Amazon AWS
Bosh 2018 no Mercurium, Nanos++ Zynq Ultrascale+
Knaust 2019 no Clang LLVM Intel Board Arria 10 GX
Our work 2021 yes Clang LLVM, libomptarget Cluster of Xilinx VC709 Boards
OpenMP for FPGAs
58