Download - Direct Universal Access: Making Data Center Resources Available … · 2019-03-12 · Direct Universal Access: Making Data Center Resources Available to FPGA Ran Shu 1, Peng Cheng

Direct Universal Access:Making Data Center Resources Available to FPGA

Ran Shu1, Peng Cheng1, Guo Chen2, Zhiyuan Guo1, 3, Lei Qu1, Yongqiang Xiong1,

Derek Chiou4, Thomas Moscibroda4

Microsoft Research1, Hunan University2,Beihang University3, Microsoft Azure4

Microsoft Research Microsoft Azure

1

FPGA Deployment in Data Centers

• Wide deployment– Major cloud service providers

• Microsoft, Amazon, Facebook, Alibaba, Tencent, Baidu, IBM, etc.

• Accelerated applications– Computation

• Web search ranking• Deep neural networks• Big data analytics

– Networking• Network processing

– Database/Storage• SQL• Key-value store

Image from A. Caulfield et al., Micro 2016

Image from D. Firestone et al., NSDI 2018

2

Resource Access Requirements

• Heterogeneous resources

– CPU

– Memory

– Other FPGAs

– GPU

– SSD

Data center network fabric

CPUHost

DRAMSSDGPU

NIC

...

FPGA Onboard DRAM

FPGA board

PCIe fabric

Server

FPGA Onboard DRAM

FPGA board

CPUHost

DRAMSSDGPU

NIC

...

FPGA Onboard DRAM

FPGA board

PCIe fabric

Server

FPGA Onboard DRAM

FPGA board

...

EthernetEthernet

... ...

Image from A. Putnam et al., ISCA 2014 3

FPGA Board in Data Center

FPGA chip

QSFP

DDR 4

PCIe Gen 3

Image from https://www.cnet.com/news/microsoft-project-brainwave-speeds-ai-with-fpga-chips-on-azure-build-conference/ 4

Current FPGA Communication Architecture

Application Layer

Physical Layer

Application

Application

DDR 4 IP PCIe Gen 3 IP

DDR Stack LTL NVMeHostDMA

FPGA Chip

QSFP IP

DMA

PCIe Gen 3QSFP

DDR 4

FPGA Board

DCQCN(in LTL) Transport Layer

Data Link Layer

5

Current FPGA Communication Architecture

MAC layer

Physical layer

Application

Application

DDR 4 IP PCIe Gen 3 IP

DDR Stack LTL NVMeHostDMA

FPGA Chip

QSFP IP

DMA

PCIe Gen 3QSFP

DDR 4

FPGA Board

???

Image from L. Zhang et al., CCR 2014

DCQCN(in LTL)

6

Problem #1 – Programming Interface

Application 1 Application 2

DDR IP PCIe IPEthernet IP

DDR Stack LTL NVMe StackHostDMA

Lines of code to use each interfaceHost DRAM: 294

Host CPU program: 205

Onboard DRAM: 517

Remote FPGA: 1356

7

FPGA Chip

Problem #2 – Accessibility

Application Application



Data center naming space

Server-area naming space

On-board naming space

8

FPGA Chip

Problem #3 – Multiplexing



Host DMAMux/Demux

FPGA Chip


9




LTLMux/Demux


10

FPGA Chip



PCIe Mux/Demux



11

FPGA Chip

Problem #4 – Security



MaliciousApplication

Access unauthorized

address

Host Memory

PCIe Fabric12

FPGA Chip

Problem #4 – Security



Victim Application

Attack application

MaliciousCPU

Program

PCIe Fabric13

FPGA Chip

Existing Problems

• Complex programming interface

• Separate naming space

• No general multiplexing

• Security issue

14

Direct Universal Access

FPGA 1

Server

QSFPDDR

Server

LTLDDR

accessFPGA

ConnectHostDMA

PCIe Gen3DDR

App 2

LTLDDR

accessFPGA

ConnectHostDMA

Datacenter networking fabric

QSFP

FPGA 2

DDR

FPGA Connect

DDR access

Host DMA

PCIe Gen3 PCIe Gen3CPU

DUA DUA

App 1

DUA

Intra-server networking fabric

③

②

① ④

FPGA 3

15

DUA Overview

FPGA

Server

QSFPDDR

Server

AppApp

LTLDDR

accessFPGA

ConnectHostDMA

PCIe Gen3DDR

AppApp

LTLDDR

accessFPGA

ConnectHostDMA


QSFP

FPGA

DDR

FPGA Connect

DDR access

Host DMA


DUA DUA

AppApp

DUA


③

②

① ④

DUA is an “IP layer”An abstract overlay network

Leverage all existing h/w stacksHierarchical addressing & routing

16

DUA Overview

FPGA

Server

QSFPDDR

Server

AppApp

LTLDDR

accessFPGA

ConnectHostDMA

PCIe Gen3DDR

AppApp

LTLDDR

accessFPGA

ConnectHostDMA


QSFP

FPGA

DDR

FPGA Connect

DDR access

Host DMA


DUA DUA

AppApp

DUA


DUA is an “IP layer”

③

②

① ④Efficient Routing

Direct resource access by FPGA, totally bypass CPU

17

DUA Overview

FPGA

Server

QSFPDDR

Server

AppApp

LTLDDR

accessFPGA

ConnectHostDMA

PCIe Gen3DDR

AppApp

LTLDDR

accessFPGA

ConnectHostDMA


QSFP

FPGA

DDR

FPGA Connect

DDR access

Host DMA


DUA DUA

AppApp

DUA


③

②

① ④

Compatible BSD-socket Interfacefor both applications and communication stacks

Efficient RoutingDUA is an “IP layer”

18

DUA Overview

FPGA

Server

QSFPDDR

Server

AppApp

LTLDDR

accessFPGA

ConnectHostDMA

PCIe Gen3DDR

AppApp

LTLDDR

accessFPGA

ConnectHostDMA


QSFP

FPGA

DDR

FPGA Connect

DDR access

Host DMA


DUA DUA

AppApp

DUA


③

②

① ④

Efficient RoutingCompatible BSD-socket Interface

General Multiplexingfor both applications and communication stacks


19

DUA Overview

FPGA

Server

QSFPDDR

Server

AppApp

LTLDDR

accessFPGA

ConnectHostDMA

PCIe Gen3DDR

AppApp

LTLDDR

accessFPGA

ConnectHostDMA


QSFP

FPGA

DDR

FPGA Connect

DDR access

Host DMA


DUA DUA

AppApp

DUA


③

②

① ④

Efficient RoutingCompatible BSD-socket InterfaceCompatible BSD-socket interface


Unified Multiplexing

SecurityProtect against both inside and outside attacks

20

App

FPGA

App

App

NIC

FPGA

CPU

FPGA

NIC

CPU



QSFP

DUA Underlay

PCIe Gen3DDR

Server

Server

CPU Control Agent

DUAOverlay

FPGA Control Agent

FPGA Host Stack

NVMe Stack

LTL

FPGA Connect

Host DMA

DDR ControllerDUA

data plane

System Architecture

DUA Data Plane

21

App

FPGA

App

App

NIC

FPGA

CPU

FPGA

NIC

CPU



QSFP

DUA Underlay

PCIe Gen3DDR

Server

Server

CPU Control Agent

DUAOverlay

FPGA Control Agent

FPGA Host Stack

NVMe Stack

LTL

FPGA Connect

Host DMA

DDR Controller

System Architecture

DUA Data Plane DUAcontrol plane

22

DUA Control Plane

• Challenge: large-scale resource and routing info dissemination– Limited h/w resource

• DUA solution– Hierarchical addressing

– Hierarchical routing

– Leverage existing infrastructure

• Fully distributed and lightweight– Need no global synchronization

Resource table

Interconnection table

Src Resource (UID) Dst Resource (UID) / Stack

FPGA 2 (192.168.0.2:2) / FPGA Connect

Host DRAM (192.168.0.2:3) / DMA

Onboard DRAM (192.168.0.2:4) / DDR

FPGA 1 (192.168.0.2:1) / FPGA Connect

Host DRAM (192.168.0.2:3) / DMA

Resources on other servers (*:*) / LTL

FPGA 1 (192.168.0.2:1)

FPGA 2 (192.168.0.2:2)

UID

(serverID:deviceID)Address /port Resource description

192.168.0.2:1 0x00000001CFFFF000 1st block of host DRAM

192.168.0.2:1 0x000000019FFFF000 2nd block of host DRAM

192.168.0.2:2 0x80000000 1st block of FPGA onboard

192.168.0.2:3 8000 1st application on FPGA

192.168.0.2:3 8001 2nd application on FPGA

23

DUA Data Plane

• Overlay– Unified interface

– Routing

• Stacks– Leverage all the existing (or

adopt future) stacks

• Underlay– Efficient multiplexing

– Security

App

FPGA

App

App

DUA data plane

FPGA Host stack

NVMe Stack

LTL

FPGA Connect

Host DMA

DDR Access

QSFP

DUA underlay

PCIe Gen3DDR

DUAoverlay

24

DUA Data Plane – Overlay

• Efficient & extensible design– Switch fabric

• High capacity cross-bar switch

– Connector• All cached routing tables

– Translator• Protocol translation

• High performance data path– Line-rate

– Near zero-delay

DUAoverlay

LTL DDRFPGA

ConnectHost DMA

Connector Connector Connector Connector

App App

Switch Fabric

Connector Connector

LTL Translator

DDR Translator

Host DMA Translator

FPGA CA

App

Connector

25

Evaluation – efficiency

Extreme low latency (< 50 ns/fwd)

FPGA 1 FPGA 4FPGA 2 FPGA 3

APP APPDUA overlay DUA overlay

FPGA Connect

FPGA Connect

LTL LTLFPGA

ConnectFPGA

Connect

0.153 0.176 0.196

1.266 1.266 1.316

2.444 2.479 2.545

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

64 128 256

Late

ncy

(u

s)

Packet Size (B)

DUA LTL FPGA Connect

Round Trip Time through FPGA Connect and DUA for 4 times, LTL twice

26

Evaluation – Logic Overhead

DUA Overlay• 2 Ports: 4.24%• 4 Ports: 9.29%• 8 Ports: 19.86%

DUA Underlay

• 4 Stacks and 3 PHY Interfaces: 0.25%

27

Evaluation – Deep Crossing

45.28% Latency Reduction

SPMV

SPMV

SPMV

DMV

DMV DMV DMV

640 × 64

64 × 640

640 × 128

128 × 640

Single FPGA Board: Parall = 32, 2 FPGA Board: Parall = 64

28

Evaluation – Regex Matching

• Up to 105~107 higher than CPU, Up to 105 lower than CPU• Up to 3 times throughput and up to 55% latency reduction compared to

using CPU to move data between FPGAs

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 2000 4000 6000 8000 10000 12000 14000 16000 18000

Thro

ugh

pu

t (G

B/s

)

Input String Length (Byte)

through FPGA Connect

through CPU

Pure CPU

1.00E+00

4.00E+00

1.60E+01

6.40E+01

2.56E+02

1.02E+03

4.10E+03

1.64E+04

6.55E+04

2.62E+05

1.05E+06

4.19E+06

1.68E+07

64 128 256 512 1024 2048 4096 8192 16384

Late

ncy

(u

s)

Input String Length (Byte)

through FPGA Connect through CPU Pure CPU

29

Conclusion

• Current FPGA communication architecture– No universal access

• DUA: build the “IP” layer for FPGA in data center– Leverage existing data center network

– Efficient routing

– Compatible BSD socket interface

– Unified multiplexing

– Security

• Open source soon

30

Thank you!Questions?

31

Download - Direct Universal Access: Making Data Center Resources Available … · 2019-03-12 · Direct Universal Access: Making Data Center Resources Available to FPGA Ran Shu 1, Peng Cheng

Top Related