TRANSCRIPT
1
OPNFV Plugfest Jun 4-8 2018
Hardware Acceleration over NFV in China Mobile
Wang Xu, China Mobile
2
Problems we face today
• Why do we need hardware acceleration in NFV?
• Which VNFs need to be accelerated?
• Which Functions need to be accelerated?
• Which accelerator do we need? ASIC or FPGA or GPU
• Look-aside or In-line?
• One card with one function, or with n functions?
• Detailed spec of the accelerator
• Multiple PRs (partial reconfiguration regions)
3
ITU 5G Vision
Key targets: connection density 1 million/km²; latency 1 ms
4
mMTC: massive Machine Type Communication
Connection density: 1 million/km²
Traditional terminals: smart phones, home appliances, medical instruments
Sensor-based: smart city, environment monitoring, intelligent agriculture, forest-fire prevention
5
uRLLC: ultra-Reliable & Low-Latency Communications
Latency target: 1 ms (current business latency: 40-50 ms)
- V2X: driverless vehicles ≤ 1 ms; driver-assistance vehicles ≤ 20 ms; intelligent parking, intelligent motorcades; driverless vehicles need 0.75 GB/s
- Unmanned aerial vehicles: network latency ≤ 2 ms
- VR image rendering
- Industrial automation: sensor-based network latency ≤ 1 ms
6
eMBB: Enhanced Mobile Broadband
- User experience rate: 1 Gbps; peak rate: 10 Gbps; traffic density: 10-100 Tbps per km²
- AR/VR, 3D/4K video, telemedicine, distance learning
- Large number of flows, heavy traffic, low latency (<20 ms)
- Forwarding capability: MBB → FBB
To meet eMBB and uRLLC requirements, local DCs and edge DCs near end users handle the heavy bandwidth and operations, saving core DC resources.
7
MEC Vision
DC hierarchy from the access layer through the aggregation layer to the core layer: Access DC → County DC → City DC → Province Core DC (province NW) → Nation Core DC (nation NW).

                    Access DC      County DC   City DC
Distance            1~10 km        5~50 km     80~300 km
Latency (transfer)  <70 us         <290 us     0.45 ms~1.55 ms
Hops                ~4 hops        ~8 hops     ~10 hops
Latency (total)     2.2 ms         2.44 ms     2.6~3.7 ms
Bandwidth           50GE           100GE       200GE
Typical VNFs        sCPE, CRAN-CU  MECP, 5GUP  CDN, SAE-GW, SD-WAN, 5GUP
8
Why we need hardware acceleration in NFV
More VNFs will be located in edge DCs (MEC):
- Limited compute resources
- Limited power/cooling
- Limited space
Higher performance is required by 5G:
- Lower latency/jitter
- Predictable performance
- More security
Workloads to accelerate:
- High-throughput packet forwarding, real-time media, encryption, complex computing, etc.
- Auto NW optimization, failure analysis & prediction, deep learning, big data in the future
9
Current situation
Media VNFs are developing slowly:
- Lack of performance requirements
- Slow development of new business
HW accelerators lack standardization:
- Many acceleration solutions (FPGA, SoC, NP, GPU, ...)
- Accelerators depend on the VNF and NFVI
- Each VNF uses its own accelerator
China Mobile's acceleration needs:
- CDN, SD-WAN, SAE-GW, 5G core UP in city-level DCs
- sCPE, MECP, 5G edge UP, CRAN-CU in county-level DCs
- Various acceleration techniques: DPDK, SR-IOV, GPU, SmartNIC, FPGA
10
Acceleration points

vEPC (SAE-GW); scale: 1.2M bearers (1 user = 1.1~1.2 bearers)
• L2/L3/L4 forwarding: throughput 50 Gbps
• DPI: 100% of packets need DPI

VoLTE IMS (SBC); scale: 1M users, 35,000 concurrent sessions
• Media transcoding:
  - AMR-WB → AMR: calls after SRVCC
  - AMR-WB → G.711/G.729: VoLTE & fixed-line IMS
  - AMR-WB → EVS: VoLTE terminals that support the EVS codec
• IPsec: signaling plane only
• Forwarding

5G (UPF); scale: 1M users
• L2/L3/L4 forwarding: throughput 200 Gbps
• DPI: 100% of packets need DPI
• IPsec: signaling plane only

BRAS; scale: 512K
• L2/L3/MPLS forwarding: overall two-way forwarding capacity >2000 Gbps, overall throughput >1000 Gbps, overall packet forwarding capacity >1500 Mpps
• PPPoE/IPoE tunnel: 2000 sessions/up/s
• HQoS: supports FQ, SQ, GP, PQ 4-level control
• Multicast support
• Traffic statistics
11
Acceleration targets
Function         Keyword                        Description
Checksum         TCP/UDP/SCTP                   Generate a checksum to verify data integrity
Tunnel offload   VLAN                           VLAN tag offload
                 VxLAN                          VxLAN tag offload
                 NVGRE                          NVGRE tag offload
TCP              TCP segmentation               TCP segmentation due to MTU
                 RSC (Receive Side Coalescing)  Receive-side offload
Service offload  GTP                            GTP encap/decap offload
                 Packet label filter            Packet filter offload
                 FW filter                      FW ACL offload
                 NAT                            NAT translation offload
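Many of these targets map directly onto standard NIC offload flags. As a minimal illustration (not from the slides, and assuming a 17.11-era DPDK API), the sketch below shows how an application could query a port's capabilities and request checksum, TCP segmentation, and VxLAN tunnel offloads; enable_offloads is a name invented here.

```c
#include <string.h>
#include <rte_ethdev.h>

/* Enable checksum / TSO / VxLAN offloads on one port, keeping only
 * the flags the NIC actually advertises. Returns 0 on success. */
static int enable_offloads(uint16_t port_id)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_conf port_conf;

    memset(&port_conf, 0, sizeof(port_conf));
    rte_eth_dev_info_get(port_id, &dev_info);

    uint64_t wanted = DEV_TX_OFFLOAD_IPV4_CKSUM |
                      DEV_TX_OFFLOAD_TCP_CKSUM  |
                      DEV_TX_OFFLOAD_UDP_CKSUM  |
                      DEV_TX_OFFLOAD_TCP_TSO    |
                      DEV_TX_OFFLOAD_VXLAN_TNL_TSO;

    /* Only request what the hardware supports; the rest stays on the CPU. */
    port_conf.txmode.offloads = wanted & dev_info.tx_offload_capa;

    /* One RX queue and one TX queue are enough for this sketch. */
    return rte_eth_dev_configure(port_id, 1, 1, &port_conf);
}
```

Anything masked out here falls back to software, which is exactly the per-packet CPU cost the accelerators are meant to remove.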
12
Carrier Network
Figure: carrier network topology. In the central office & cloud sit the VNFs (vEPC, vCPE, vALG, vBNG, vSaeGW, vMonitor) serving EPC regions A, B, and C over carrier-grade NW aggregation. OLT nodes with embedded radio modems feed the aggregation and access side; optical splitters, ONT nodes, DPU/MDUs, and eNodeBs reach residences, communities, and hotels, with CPE/IoT on radio. NID/CPE/MEC sit at the first mile, and uCPE/SD-WAN/MEC at aggregation.
13
Accelerator types
• Smart NIC: the accelerator is part of the NIC; China Mobile is considering this option
• NW-attached accelerator: accelerators in the network that any host can use
• PCIe accelerator: accelerators linked to the CPU over the PCIe bus
14
Which accelerator
AI, high-traffic complex service forwarding, and big data keep growing, while x86 performance growth is slow, and after hardware/software decoupling performance is even worse; heterogeneous acceleration is the way forward.
CPU + FPGA is the mainstream technical route: flexible, high performance, low power, low price.

             FPGA   GPU   ASIC   NP
Flexibility  ☆☆☆    ☆☆    ☆      ☆
Performance  ☆      ☆☆    ☆☆☆    ☆☆☆
Power        ☆☆     ☆     ☆☆☆    ☆☆☆
Price        ☆☆     ☆     ☆☆☆    ☆☆☆
Ecosystem    ☆☆     ☆☆☆   ☆      ☆
15
Industry trends
Intel
- 2020: 1/3 of cloud DC servers → FPGA servers
- 2017: A10 FPGA PCIe card
- 2015: buys Altera for $16.7B

Microsoft
- 2017: AI platform BrainWave based on S10 FPGA
- 2016: nearly all newly purchased servers carry FPGA
- 2014-15: Bing and Azure use FPGA
- 2012: large-scale pilot with FPGA (1,632 servers)
- 2011: Catapult plan

Baidu
- 2017-08: AI platform XPU based on KU115 FPGA
- 2017-06: FPGA service on Baidu Cloud
- 2016: SQL-ACC using FPGA in internal analysis

(Figures: Intel A10 FPGA (GX660); BrainWave architecture; XPU architecture)
16
X86+FPGA
Figure: an x86 CPU (per-core L1/L2 caches, a shared L3 cache, and a home agent/memory controller with DRAM DIMMs) pairs with an FPGA over two links: a coherent interface on QPI and a PCIe interface. Behind the FPGA's PCIe interface and HW acc driver sit a message queue and a data buffer; message control rides the message stream, payload rides the data stream, and data dissemination fans work out inside the FPGA.
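To make the message-queue/data-buffer idea concrete, here is a minimal, hypothetical sketch (not from the slides) of the kind of single-producer descriptor ring a host driver might use to hand packet buffers to the FPGA; all names (acc_desc, acc_ring, acc_ring_post) are invented for illustration.

```c
#include <stdint.h>

#define RING_SIZE 256  /* power of two so we can mask instead of modulo */

/* One work descriptor: where the packet lives and how big it is. */
struct acc_desc {
    uint64_t buf_addr;   /* DMA address of the data buffer */
    uint32_t len;        /* payload length in bytes */
    uint32_t flags;      /* bit 0 = descriptor valid */
};

/* Shared ring: the host produces at head, the FPGA consumes at tail. */
struct acc_ring {
    struct acc_desc desc[RING_SIZE];
    volatile uint32_t head;  /* written by the host driver */
    volatile uint32_t tail;  /* written by the FPGA */
};

/* Post one buffer to the accelerator; returns -1 if the ring is full. */
static int acc_ring_post(struct acc_ring *r, uint64_t buf_addr, uint32_t len)
{
    uint32_t head = r->head;
    uint32_t next = (head + 1) & (RING_SIZE - 1);

    if (next == r->tail)
        return -1;                /* full: the consumer has not caught up */

    r->desc[head].buf_addr = buf_addr;
    r->desc[head].len      = len;
    __sync_synchronize();         /* make the descriptor visible before... */
    r->desc[head].flags    = 1;   /* ...it is marked valid */
    r->head = next;
    return 0;
}
```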
17
Look-aside vs. In-line
Look-aside: the CPU handles the data plane. Packets arrive through a standard NIC (PCIe x8/x16, 2×40GE); the fast path, distribution & load balancing, protocol processing, services, and DPI all run on the CPU, with the acceleration card off to the side.

In-line: the data plane is offloaded to the NIC. A Smart NIC (PCIe x8/x16, 2×100GE) processes the main data path before anything reaches the CPU.

Because the NIC offloads the main data path in in-line mode, a dual-socket server can realize 100 Gbps of GTP offload.
18
Acceleration Architecture
Figure: the FPGA is divided into a static region plus dynamically loaded partial-reconfiguration slots (PR1, PR2, PR3, ..., PRn). On the server, KVM hosts VMs (VM1 for transcoding, VM2 for DPI) whose virtio front-ends connect to a virtio back-end in the hypervisor; alongside run nova-compute, cyborg-agent, the L2 agent, and the FPGA device driver. OpenStack (Nova, Neutron, Cinder, Cyborg, Keystone, Ceilometer, Horizon) supports the NFVO and VNFM above it.

1. Use the FPGA in-line
2. Use Cyborg to manage and orchestrate the FPGA
3. Use a common API to support multiple FPGAs (see the sketch below)
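As an illustration of what such a common API might look like, here is a hypothetical C sketch (invented for this writeup; not an existing Cyborg or vendor interface): a vendor-neutral operations table that a driver for each FPGA family would implement, so the management layer never touches vendor-specific details.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle for one discovered accelerator device. */
struct acc_dev {
    uint32_t vendor_id;
    uint32_t device_id;
    int      num_pr_slots;   /* partial-reconfiguration slots available */
    void    *priv;           /* driver-private state */
};

/* Hypothetical vendor-neutral operations table. Each FPGA family's
 * driver fills this in; the layer above (e.g. cyborg-agent) calls
 * only these entry points. */
struct acc_ops {
    /* Enumerate devices; returns the number found (<= max). */
    int (*discover)(struct acc_dev *devs, int max);
    /* Load an acceleration image (bitstream) into one PR slot. */
    int (*program)(struct acc_dev *dev, int pr_slot,
                   const void *image, size_t image_len);
    /* Expose a programmed slot to a VM, e.g. as a virtio device. */
    int (*attach)(struct acc_dev *dev, int pr_slot, const char *vm_uuid);
    /* Reclaim a slot so it can be reprogrammed for another tenant. */
    int (*detach)(struct acc_dev *dev, int pr_slot);
};
```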
19
vEPC Acceleration
Processing pipeline between the RAN/PTN link and the IP link: FWD LB → Security → GTP rule lookup → DPI → QoS/Charging → IP FWD.

Per-stage CPU usage labels from the figure: L2 13.95%, L3 4.30%, L4 6.98%, L4 13.95%, L4 4.19%, L7 6.28%, L7 17.45%, L3 2.68%, L2 13.95%, L7 16.25%. Which of these stages should move off the CPU?

Functions per stage:
- OVS (ingress): IPv4/v6, MAC, tunnel, MPLS, ACL, security, multicast
- Load balance: packet monitor, DDoS, anti-spoofing, APN ACL
- Flow handling: flow match, flow operations
- GTP: GTP encap, GTP decap, GTP control
- DPI: protocol match, protocol analysis, policy
- QoS: CAR, remark, shaping
- Charging: CG, AAA, OCS
- OVS (egress): IPv4/v6, MAC, tunnel, MPLS, ACL, multicast
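GTP encap/decap is the stage most naturally pushed into the in-line FPGA path. As a point of reference (a software sketch, not the FPGA implementation), the mandatory 8-byte GTP-U header from 3GPP TS 29.281 can be prepended like this; gtpu_encap is a name invented here.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Mandatory 8-byte GTP-U header (3GPP TS 29.281). */
struct gtpu_hdr {
    uint8_t  flags;    /* version=1, PT=1 -> 0x30 when E/S/PN are unset */
    uint8_t  msg_type; /* 0xFF = G-PDU (encapsulated user data) */
    uint16_t length;   /* payload bytes following this 8-byte header */
    uint32_t teid;     /* tunnel endpoint ID from the GTP rule lookup */
};

/* Prepend a GTP-U header to an inner IP packet; returns the new length. */
static size_t gtpu_encap(uint8_t *out, const uint8_t *inner,
                         size_t inner_len, uint32_t teid)
{
    struct gtpu_hdr h = {
        .flags    = 0x30,
        .msg_type = 0xFF,
        .length   = htons((uint16_t)inner_len),
        .teid     = htonl(teid),
    };
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, inner, inner_len);
    return sizeof h + inner_len;
}
```

Doing this (plus the outer UDP/IP headers) per packet on the CPU is what the percentages above measure; in in-line mode the Smart NIC performs the same rewrite at line rate.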
20
vEPC Acceleration
FPGA: Xilinx VU9P (logic cells: 2,586K; CLB flip-flops: 2,364K; memory: 345.9 Mb; I/O: 832)

Business background: GW-U/UPF with 300 Gbps forwarding capability (FC), city-level DC, 1.5M user access, average rate 200 Kbps per user.

DPI forwarding (L3~L7 policy):   X86: 3.5 Gbps/VM, 54 VMs, 18.9 kW  |  X86+FPGA: 5 Gbps/VM, 38 VMs, 17.1 kW
SPI forwarding (L3~L4 policy):   X86: 7 Gbps/VM, 27 VMs, 9.45 kW    |  X86+FPGA: 10 Gbps/VM, 19 VMs, 10.07 kW
Base forwarding (encap only):    X86: 30 Gbps/VM, 7 VMs, 2.45 kW    |  X86+FPGA: 42.86 Gbps/VM, 5 VMs, 2.65 kW

X86+FPGA saves at least 16 VMs (for DPI forwarding: 54 - 38 = 16).
21
Telcos need an E2E acceleration solution, not only HW
- VNFs use the Acc API
- Provide Acc images, handle Acc requests, orchestrate & manage the Acc HW
- The VIM virtualizes accelerators and creates pools
- A common API spans accelerator types
- Accelerators such as FPGA are managed using Cyborg
22
Cyborg
• Release timeline: Queens (2018-02); Rocky (2018-09)
• Queens:
  - Bring accelerators into the mainstream
  - Optimize cyborg-db
  - Add reporting & traits; interact with Placement
• Rocky:
  - Extend the Nova API to provide acceleration
  - Add FPGA versions
  - Add the python-cyborgclient API
  - Extend Placement to describe accelerators by function
23
China Mobile testing on FPGA (Smart NIC)
Servers: E5-2650 v4, 256 GB memory

Node                            Spec & number
Controller node                 Huawei 2288 ×1
Hypervisor with Smart NIC       ZTE R5300 ×1 + Huawei 2288 ×1
Hypervisor with 40G SR-IOV NIC  ZTE R5300 ×1
Hypervisor with normal NIC      ZTE R5300 ×1 + Huawei 2288 ×1

Tester: IXIA XM2; VIM: TECS v3.0 17.14 (OpenStack Mitaka)
24
Testing result
Functions: SR-IOV supports only L2 forwarding; vSwitch and Smart NIC both support DVR, SFC, security groups, mirroring, and SDN VTEP.
Live migration: not supported on SR-IOV; supported on vSwitch and Smart NIC.
CPU usage: none for SR-IOV; very high for vSwitch; low for Smart NIC.
Decoupling: SR-IOV is tightly coupled; vSwitch and Smart NIC use the common virtio driver and are decoupled.
Performance: high for SR-IOV and Smart NIC; low for vSwitch.
25
Testing result - performance
480 flows (control-plane VNFs)
NIC type             Line rate (82B/256B/512B) %   Latency (82B/256B/512B) us
SR-IOV               25.5 / 73.3 / 91.6            6.5 / 7.2 / 10.0
vSwitch (3 cores)    11.0 / 30.6 / 53.8            30.5 / 45.6 / 50.4
Smart NIC (3 cores)  18.8 / 50.5 / 90.0            43.1 / 45.8 / 53.1

480,000 flows (data-plane VNFs)
NIC type             Line rate (82B/256B/512B) %   Latency (82B/256B/512B) us
SR-IOV               25.8 / 73.6 / 89.0            6.5 / 7.2 / 10.0
vSwitch (3 cores)    6.6 / 19.0 / 32.4             31.8 / 57.1 / 45.6
Smart NIC (3 cores)  18.9 / 50.7 / 90.1            43.7 / 46.7 / 53.6
26
Testing Conclusion
Smart NIC vs. SR-IOV
- Line rate: at 512B, Smart NIC ≈ SR-IOV; at 256B and 82B, Smart NIC ≈ 70% of SR-IOV
- Latency: Smart NIC > SR-IOV, due to the software processing in the virtio back-end offload
Smart NIC vs. vSwitch
- As flows grow, Smart NIC ≈ 3× vSwitch (e.g. 18.9% vs. 6.6% line rate at 82B with 480,000 flows)
- By a simple calculation, vSwitch needs 3 × 3 = 9 cores to match the Smart NIC's performance; considering cross-NUMA placement it needs over 10 physical cores, leaving not enough CPU for the VNFs
27
Future research
1. FPGA? FPGA+NP?
2. GPU + FPGA resource-pool architecture design
3. Deeper mining of acceleration points
4. Start vEPC acceleration testing
5. Design a common accelerator API & push it to the community
6. Design an accelerator management spec
7. Detailed spec of the FPGA we need
8. GPU usage in MEC, AR/VR
9. OPNFV project research on the 4G/5G forwarding plane & data plane, including APIs