Direct Universal Access:Making Data Center Resources Available to FPGA
Ran Shu1, Peng Cheng1, Guo Chen2, Zhiyuan Guo1, 3, Lei Qu1, Yongqiang Xiong1,
Derek Chiou4, Thomas Moscibroda4
Microsoft Research1, Hunan University2,Beihang University3, Microsoft Azure4
Microsoft Research Microsoft Azure
1
FPGA Deployment in Data Centers
• Wide deployment– Major cloud service providers
• Microsoft, Amazon, Facebook, Alibaba, Tencent, Baidu, IBM, etc.
• Accelerated applications– Computation
• Web search ranking• Deep neural networks• Big data analytics
– Networking• Network processing
– Database/Storage• SQL• Key-value store
Image from A. Caulfield et al., Micro 2016
Image from D. Firestone et al., NSDI 2018
2
Resource Access Requirements
• Heterogeneous resources
– CPU
– Memory
– Other FPGAs
– GPU
– SSD
Data center network fabric
CPUHost
DRAMSSDGPU
NIC
...
FPGA Onboard DRAM
FPGA board
PCIe fabric
Server
FPGA Onboard DRAM
FPGA board
CPUHost
DRAMSSDGPU
NIC
...
FPGA Onboard DRAM
FPGA board
PCIe fabric
Server
FPGA Onboard DRAM
FPGA board
...
EthernetEthernet
... ...
Image from A. Putnam et al., ISCA 2014 3
FPGA Board in Data Center
FPGA chip
QSFP
DDR 4
PCIe Gen 3
Image from https://www.cnet.com/news/microsoft-project-brainwave-speeds-ai-with-fpga-chips-on-azure-build-conference/ 4
Current FPGA Communication Architecture
Application Layer
Physical Layer
Application
Application
DDR 4 IP PCIe Gen 3 IP
DDR Stack LTL NVMeHostDMA
FPGA Chip
QSFP IP
DMA
PCIe Gen 3QSFP
DDR 4
FPGA Board
DCQCN(in LTL) Transport Layer
Data Link Layer
5
Current FPGA Communication Architecture
MAC layer
Physical layer
Application
Application
DDR 4 IP PCIe Gen 3 IP
DDR Stack LTL NVMeHostDMA
FPGA Chip
QSFP IP
DMA
PCIe Gen 3QSFP
DDR 4
FPGA Board
???
Image from L. Zhang et al., CCR 2014
DCQCN(in LTL)
6
Problem #1 – Programming Interface
Application 1 Application 2
DDR IP PCIe IPEthernet IP
DDR Stack LTL NVMe StackHostDMA
Lines of code to use each interfaceHost DRAM: 294
Host CPU program: 205
Onboard DRAM: 517
Remote FPGA: 1356
7
FPGA Chip
Problem #2 – Accessibility
Application Application
DDR IP PCIe IPEthernet IP
DDR Stack LTL NVMe StackHostDMA
Data center naming space
Server-area naming space
On-board naming space
8
FPGA Chip
Problem #3 – Multiplexing
Application 1 Application 2
DDR IP PCIe IPEthernet IP
Host DMAMux/Demux
FPGA Chip
DDR Stack LTL NVMe StackHostDMA
9
Problem #3 – Multiplexing
Application 1 Application 2
DDR IP PCIe IPEthernet IP
LTLMux/Demux
DDR Stack LTL NVMe StackHostDMA
10
FPGA Chip
Problem #3 – Multiplexing
Application 1 Application 2
PCIe Mux/Demux
DDR IP PCIe IPEthernet IP
DDR Stack LTL NVMe StackHostDMA
11
FPGA Chip
Problem #4 – Security
DDR Stack LTL NVMe StackHostDMA
DDR IP PCIe IPEthernet IP
MaliciousApplication
Access unauthorized
address
Host Memory
PCIe Fabric12
FPGA Chip
Problem #4 – Security
DDR Stack LTL NVMe StackHostDMA
DDR IP PCIe IPEthernet IP
Victim Application
Attack application
MaliciousCPU
Program
PCIe Fabric13
FPGA Chip
Existing Problems
• Complex programming interface
• Separate naming space
• No general multiplexing
• Security issue
14
Direct Universal Access
FPGA 1
Server
QSFPDDR
Server
LTLDDR
accessFPGA
ConnectHostDMA
PCIe Gen3DDR
App 2
LTLDDR
accessFPGA
ConnectHostDMA
Datacenter networking fabric
QSFP
FPGA 2
DDR
FPGA Connect
DDR access
Host DMA
PCIe Gen3 PCIe Gen3CPU
DUA DUA
App 1
DUA
Intra-server networking fabric
③
②
① ④
FPGA 3
15
DUA Overview
FPGA
Server
QSFPDDR
Server
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
PCIe Gen3DDR
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
Datacenter networking fabric
QSFP
FPGA
DDR
FPGA Connect
DDR access
Host DMA
PCIe Gen3 PCIe Gen3CPU
DUA DUA
AppApp
DUA
Intra-server networking fabric
③
②
① ④
DUA is an “IP layer”An abstract overlay network
Leverage all existing h/w stacksHierarchical addressing & routing
16
DUA Overview
FPGA
Server
QSFPDDR
Server
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
PCIe Gen3DDR
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
Datacenter networking fabric
QSFP
FPGA
DDR
FPGA Connect
DDR access
Host DMA
PCIe Gen3 PCIe Gen3CPU
DUA DUA
AppApp
DUA
Intra-server networking fabric
DUA is an “IP layer”
③
②
① ④Efficient Routing
Direct resource access by FPGA, totally bypass CPU
17
DUA Overview
FPGA
Server
QSFPDDR
Server
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
PCIe Gen3DDR
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
Datacenter networking fabric
QSFP
FPGA
DDR
FPGA Connect
DDR access
Host DMA
PCIe Gen3 PCIe Gen3CPU
DUA DUA
AppApp
DUA
Intra-server networking fabric
③
②
① ④
Compatible BSD-socket Interfacefor both applications and communication stacks
Efficient RoutingDUA is an “IP layer”
18
DUA Overview
FPGA
Server
QSFPDDR
Server
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
PCIe Gen3DDR
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
Datacenter networking fabric
QSFP
FPGA
DDR
FPGA Connect
DDR access
Host DMA
PCIe Gen3 PCIe Gen3CPU
DUA DUA
AppApp
DUA
Intra-server networking fabric
③
②
① ④
Efficient RoutingCompatible BSD-socket Interface
General Multiplexingfor both applications and communication stacks
DUA is an “IP layer”
19
DUA Overview
FPGA
Server
QSFPDDR
Server
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
PCIe Gen3DDR
AppApp
LTLDDR
accessFPGA
ConnectHostDMA
Datacenter networking fabric
QSFP
FPGA
DDR
FPGA Connect
DDR access
Host DMA
PCIe Gen3 PCIe Gen3CPU
DUA DUA
AppApp
DUA
Intra-server networking fabric
③
②
① ④
Efficient RoutingCompatible BSD-socket InterfaceCompatible BSD-socket interface
DUA is an “IP layer”
Unified Multiplexing
SecurityProtect against both inside and outside attacks
20
App
FPGA
App
App
NIC
FPGA
CPU
FPGA
NIC
CPU
Datacenter networking fabric
Intra-server networking fabric
QSFP
DUA Underlay
PCIe Gen3DDR
Server
Server
CPU Control Agent
DUAOverlay
FPGA Control Agent
FPGA Host Stack
NVMe Stack
LTL
FPGA Connect
Host DMA
DDR ControllerDUA
data plane
System Architecture
DUA Data Plane
21
App
FPGA
App
App
NIC
FPGA
CPU
FPGA
NIC
CPU
Datacenter networking fabric
Intra-server networking fabric
QSFP
DUA Underlay
PCIe Gen3DDR
Server
Server
CPU Control Agent
DUAOverlay
FPGA Control Agent
FPGA Host Stack
NVMe Stack
LTL
FPGA Connect
Host DMA
DDR Controller
System Architecture
DUA Data Plane DUAcontrol plane
22
DUA Control Plane
• Challenge: large-scale resource and routing info dissemination– Limited h/w resource
• DUA solution– Hierarchical addressing
– Hierarchical routing
– Leverage existing infrastructure
• Fully distributed and lightweight– Need no global synchronization
Resource table
Interconnection table
Src Resource (UID) Dst Resource (UID) / Stack
FPGA 2 (192.168.0.2:2) / FPGA Connect
Host DRAM (192.168.0.2:3) / DMA
Onboard DRAM (192.168.0.2:4) / DDR
FPGA 1 (192.168.0.2:1) / FPGA Connect
Host DRAM (192.168.0.2:3) / DMA
Resources on other servers (*:*) / LTL
FPGA 1 (192.168.0.2:1)
FPGA 2 (192.168.0.2:2)
UID
(serverID:deviceID)Address /port Resource description
192.168.0.2:1 0x00000001CFFFF000 1st block of host DRAM
192.168.0.2:1 0x000000019FFFF000 2nd block of host DRAM
192.168.0.2:2 0x80000000 1st block of FPGA onboard
192.168.0.2:3 8000 1st application on FPGA
192.168.0.2:3 8001 2nd application on FPGA
23
DUA Data Plane
• Overlay– Unified interface
– Routing
• Stacks– Leverage all the existing (or
adopt future) stacks
• Underlay– Efficient multiplexing
– Security
App
FPGA
App
App
DUA data plane
FPGA Host stack
NVMe Stack
LTL
FPGA Connect
Host DMA
DDR Access
QSFP
DUA underlay
PCIe Gen3DDR
DUAoverlay
24
DUA Data Plane – Overlay
• Efficient & extensible design– Switch fabric
• High capacity cross-bar switch
– Connector• All cached routing tables
– Translator• Protocol translation
• High performance data path– Line-rate
– Near zero-delay
DUAoverlay
LTL DDRFPGA
ConnectHost DMA
Connector Connector Connector Connector
App App
Switch Fabric
Connector Connector
LTL Translator
DDR Translator
Host DMA Translator
FPGA CA
App
Connector
25
Evaluation – efficiency
Extreme low latency (< 50 ns/fwd)
FPGA 1 FPGA 4FPGA 2 FPGA 3
APP APPDUA overlay DUA overlay
FPGA Connect
FPGA Connect
LTL LTLFPGA
ConnectFPGA
Connect
0.153 0.176 0.196
1.266 1.266 1.316
2.444 2.479 2.545
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
64 128 256
Late
ncy
(u
s)
Packet Size (B)
DUA LTL FPGA Connect
Round Trip Time through FPGA Connect and DUA for 4 times, LTL twice
26
Evaluation – Logic Overhead
DUA Overlay• 2 Ports: 4.24%• 4 Ports: 9.29%• 8 Ports: 19.86%
DUA Underlay
• 4 Stacks and 3 PHY Interfaces: 0.25%
27
Evaluation – Deep Crossing
45.28% Latency Reduction
SPMV
SPMV
SPMV
DMV
DMV DMV DMV
640 × 64
64 × 640
640 × 128
128 × 640
Single FPGA Board: Parall = 32, 2 FPGA Board: Parall = 64
28
Evaluation – Regex Matching
• Up to 105~107 higher than CPU, Up to 105 lower than CPU• Up to 3 times throughput and up to 55% latency reduction compared to
using CPU to move data between FPGAs
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
0 2000 4000 6000 8000 10000 12000 14000 16000 18000
Thro
ugh
pu
t (G
B/s
)
Input String Length (Byte)
through FPGA Connect
through CPU
Pure CPU
1.00E+00
4.00E+00
1.60E+01
6.40E+01
2.56E+02
1.02E+03
4.10E+03
1.64E+04
6.55E+04
2.62E+05
1.05E+06
4.19E+06
1.68E+07
64 128 256 512 1024 2048 4096 8192 16384
Late
ncy
(u
s)
Input String Length (Byte)
through FPGA Connect through CPU Pure CPU
29
Conclusion
• Current FPGA communication architecture– No universal access
• DUA: build the “IP” layer for FPGA in data center– Leverage existing data center network
– Efficient routing
– Compatible BSD socket interface
– Unified multiplexing
– Security
• Open source soon
30
Thank you!Questions?
31