TRANSCRIPT
1
OPNFV Plugfest Jun 4-8 2018
Hardware Acceleration over NFV in China Mobile
Wang Xu, China Mobile
2
Problems we face today
• Why do we need hardware acceleration in NFV?
• Which VNFs need to be accelerated?
• Which Functions need to be accelerated?
• Which accelerator do we need? ASIC or FPGA or GPU
• Look-aside or In-line?
• One card with one function, or with n functions?
• Detailed spec of the accelerator
• Multiple PRs (partial reconfiguration regions)
3
ITU 5G Vision
Key targets: connection density 1 million/km²; latency 1 ms
4
mMTC: massive Machine Type Communication
Connection density: 1 million/km²
Traditional terminals: smart phones, home appliances, medical instruments
Sensor-based: smart city, environment monitoring, intelligent agriculture, forest-fire prevention
5
uRLLC: ultra-Reliable & Low-Latency Communications
Latency target: 1 ms (current business latency: 40-50 ms)
- V2X: driverless vehicles ≤ 1 ms; driver-assistance vehicles ≤ 20 ms; intelligent parking, intelligent motorcades; driverless vehicles need 0.75 GB/s
- Unmanned aerial vehicles: network latency ≤ 2 ms
- VR image rendering
- Industrial automation: sensor-based network latency ≤ 1 ms
6
eMBB: Enhanced Mobile Broadband
- User experience rate: 1 Gbps; peak rate: 10 Gbps; traffic density: 10-100 Tbps per km²
- AR/VR, 3D/4K video, telemedicine, distance learning
- Large number of flows, heavy traffic, low latency (<20 ms)
- Forwarding capability: MBB → FBB
To meet eMBB and uRLLC requirements, local DCs and edge DCs near end users handle the heavy bandwidth and operations, saving core DC resources.
7
MEC Vision
DC hierarchy from the access layer through the aggregation layer to the core layer: Access DC → County DC → City DC → Province Core DC (province NW) → Nation Core DC (nation NW).

                    Access DC      County DC   City DC
Distance            1~10 km        5~50 km     80~300 km
Latency (transfer)  <70 us         <290 us     0.45 ms~1.55 ms
Hops                ~4 hops        ~8 hops     ~10 hops
Latency (total)     2.2 ms         2.44 ms     2.6~3.7 ms
Bandwidth           50GE           100GE       200GE
Typical VNFs        sCPE, CRAN-CU  MECP, 5GUP  CDN, SAE-GW, SD-WAN, 5GUP
8
Why we need hardware acceleration in NFV
More VNFs will be located in edge DCs (MEC):
- Limited compute resources
- Limited power/cooling
- Limited space
Higher performance is required by 5G:
- Lower latency/jitter
- Predictable performance
- More security
Workloads to accelerate:
- High-throughput packet forwarding, real-time media, encryption, complex computing, etc.
- Auto NW optimization, failure analysis & prediction, deep learning, big data in the future
9
Current situation
Media VNFs are developing slowly:
- Lack of performance requirements
- Slow development of new business
HW accelerators lack standardization:
- Many acceleration solutions (FPGA, SoC, NP, GPU, ...)
- Accelerators depend on the VNF and NFVI
- Each VNF uses its own accelerator
China Mobile's acceleration needs:
- CDN, SD-WAN, SAE-GW, 5G core UP in city-level DCs
- sCPE, MECP, 5G edge UP, CRAN-CU in county-level DCs
- Various acceleration techniques: DPDK, SR-IOV, GPU, SmartNIC, FPGA
10
Acceleration points

vEPC (SAE-GW); scale: 1.2M bearers (1 user = 1.1~1.2 bearers)
• L2/L3/L4 forwarding: throughput 50 Gbps
• DPI: 100% of packets need DPI

VoLTE IMS (SBC); scale: 1M users, 35,000 concurrent sessions
• Media transcoding:
  - AMR-WB → AMR: calls after SRVCC
  - AMR-WB → G.711/G.729: VoLTE & fixed-line IMS
  - AMR-WB → EVS: VoLTE terminals that support the EVS codec
• IPsec: signaling plane only
• Forwarding

5G (UPF); scale: 1M users
• L2/L3/L4 forwarding: throughput 200 Gbps
• DPI: 100% of packets need DPI
• IPsec: signaling plane only

BRAS; scale: 512K
• L2/L3/MPLS forwarding: overall two-way forwarding capacity >2000 Gbps, overall throughput >1000 Gbps, overall packet forwarding capacity >1500 Mpps
• PPPoE/IPoE tunnel: 2000 sessions/up/s
• HQoS: supports FQ, SQ, GP, PQ 4-level control
• Multicast support
• Traffic statistics
11
Acceleration targets
Function         Keyword                        Description
Checksum         TCP/UDP/SCTP                   Generate a checksum to verify data integrity
Tunnel offload   VLAN                           VLAN tag offload
                 VxLAN                          VxLAN tag offload
                 NVGRE                          NVGRE tag offload
TCP              TCP segmentation               TCP segmentation due to MTU
                 RSC (Receive Side Coalescing)  Receive-side offload
Service offload  GTP                            GTP encap/decap offload
                 Packet label filter            Packet filter offload
                 FW filter                      FW ACL offload
                 NAT                            NAT translation offload
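Many of these targets map directly onto standard NIC offload flags. As a minimal illustration (not from the slides, and assuming a 17.11-era DPDK API), the sketch below shows how an application could query a port's capabilities and request checksum, TCP segmentation, and VxLAN tunnel offloads; enable_offloads is a name invented here.

```c
#include <string.h>
#include <rte_ethdev.h>

/* Enable checksum / TSO / VxLAN offloads on one port, keeping only
 * the flags the NIC actually advertises. Returns 0 on success. */
static int enable_offloads(uint16_t port_id)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_conf port_conf;

    memset(&port_conf, 0, sizeof(port_conf));
    rte_eth_dev_info_get(port_id, &dev_info);

    uint64_t wanted = DEV_TX_OFFLOAD_IPV4_CKSUM |
                      DEV_TX_OFFLOAD_TCP_CKSUM  |
                      DEV_TX_OFFLOAD_UDP_CKSUM  |
                      DEV_TX_OFFLOAD_TCP_TSO    |
                      DEV_TX_OFFLOAD_VXLAN_TNL_TSO;

    /* Only request what the hardware supports; the rest stays on the CPU. */
    port_conf.txmode.offloads = wanted & dev_info.tx_offload_capa;

    /* One RX queue and one TX queue are enough for this sketch. */
    return rte_eth_dev_configure(port_id, 1, 1, &port_conf);
}
```

Anything masked out here falls back to software, which is exactly the per-packet CPU cost the accelerators are meant to remove.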
12
Carrier Network
Figure: carrier network topology. In the central office & cloud sit the VNFs (vEPC, vCPE, vALG, vBNG, vSaeGW, vMonitor) serving EPC regions A, B, and C over carrier-grade NW aggregation. OLT nodes with embedded radio modems feed the aggregation and access side; optical splitters, ONT nodes, DPU/MDUs, and eNodeBs reach residences, communities, and hotels, with CPE/IoT on radio. NID/CPE/MEC sit at the first mile, and uCPE/SD-WAN/MEC at aggregation.
13
Accelerator types
• Smart NIC: the accelerator is part of the NIC; China Mobile is considering this option
• NW-attached accelerator: accelerators in the network that any host can use
• PCIe accelerator: accelerators linked to the CPU over the PCIe bus
14
Which accelerator
AI, high-traffic complex service forwarding, and big data keep growing, while x86 performance growth is slow, and after hardware/software decoupling performance is even worse; heterogeneous acceleration is the way forward.
CPU + FPGA is the mainstream technical route: flexible, high performance, low power, low price.

             FPGA   GPU   ASIC   NP
Flexibility  ☆☆☆    ☆☆    ☆      ☆
Performance  ☆      ☆☆    ☆☆☆    ☆☆☆
Power        ☆☆     ☆     ☆☆☆    ☆☆☆
Price        ☆☆     ☆     ☆☆☆    ☆☆☆
Ecosystem    ☆☆     ☆☆☆   ☆      ☆
15
Industry trends
Intel
- 2020: 1/3 of cloud DC servers → FPGA servers
- 2017: A10 FPGA PCIe card
- 2015: buys Altera for $16.7B

Microsoft
- 2017: AI platform BrainWave based on S10 FPGA
- 2016: nearly all newly purchased servers carry FPGA
- 2014-15: Bing and Azure use FPGA
- 2012: large-scale pilot with FPGA (1,632 servers)
- 2011: Catapult plan

Baidu
- 2017-08: AI platform XPU based on KU115 FPGA
- 2017-06: FPGA service on Baidu Cloud
- 2016: SQL-ACC using FPGA in internal analysis

(Figures: Intel A10 FPGA (GX660); BrainWave architecture; XPU architecture)
16
X86+FPGA
Figure: an x86 CPU (per-core L1/L2 caches, a shared L3 cache, and a home agent/memory controller with DRAM DIMMs) pairs with an FPGA over two links: a coherent interface on QPI and a PCIe interface. Behind the FPGA's PCIe interface and HW acc driver sit a message queue and a data buffer; message control rides the message stream, payload rides the data stream, and data dissemination fans work out inside the FPGA.
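To make the message-queue/data-buffer idea concrete, here is a minimal, hypothetical sketch (not from the slides) of the kind of single-producer descriptor ring a host driver might use to hand packet buffers to the FPGA; all names (acc_desc, acc_ring, acc_ring_post) are invented for illustration.

```c
#include <stdint.h>

#define RING_SIZE 256  /* power of two so we can mask instead of modulo */

/* One work descriptor: where the packet lives and how big it is. */
struct acc_desc {
    uint64_t buf_addr;   /* DMA address of the data buffer */
    uint32_t len;        /* payload length in bytes */
    uint32_t flags;      /* bit 0 = descriptor valid */
};

/* Shared ring: the host produces at head, the FPGA consumes at tail. */
struct acc_ring {
    struct acc_desc desc[RING_SIZE];
    volatile uint32_t head;  /* written by the host driver */
    volatile uint32_t tail;  /* written by the FPGA */
};

/* Post one buffer to the accelerator; returns -1 if the ring is full. */
static int acc_ring_post(struct acc_ring *r, uint64_t buf_addr, uint32_t len)
{
    uint32_t head = r->head;
    uint32_t next = (head + 1) & (RING_SIZE - 1);

    if (next == r->tail)
        return -1;                /* full: the consumer has not caught up */

    r->desc[head].buf_addr = buf_addr;
    r->desc[head].len      = len;
    __sync_synchronize();         /* make the descriptor visible before... */
    r->desc[head].flags    = 1;   /* ...it is marked valid */
    r->head = next;
    return 0;
}
```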
17
Look-aside vs. In-line
Look-aside: the CPU handles the data plane. Packets arrive through a standard NIC (PCIe x8/x16, 2×40GE); the fast path, distribution & load balancing, protocol processing, services, and DPI all run on the CPU, with the acceleration card off to the side.

In-line: the data plane is offloaded to the NIC. A Smart NIC (PCIe x8/x16, 2×100GE) processes the main data path before anything reaches the CPU.

Because the NIC offloads the main data path in in-line mode, a dual-socket server can realize 100 Gbps of GTP offload.
18
Acceleration Architecture
Figure: the FPGA is divided into a static region plus dynamically loaded partial-reconfiguration slots (PR1, PR2, PR3, ..., PRn). On the server, KVM hosts VMs (VM1 for transcoding, VM2 for DPI) whose virtio front-ends connect to a virtio back-end in the hypervisor; alongside run nova-compute, cyborg-agent, the L2 agent, and the FPGA device driver. OpenStack (Nova, Neutron, Cinder, Cyborg, Keystone, Ceilometer, Horizon) supports the NFVO and VNFM above it.

1. Use the FPGA in-line
2. Use Cyborg to manage and orchestrate the FPGA
3. Use a common API to support multiple FPGAs (see the sketch below)
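As an illustration of what such a common API might look like, here is a hypothetical C sketch (invented for this writeup; not an existing Cyborg or vendor interface): a vendor-neutral operations table that a driver for each FPGA family would implement, so the management layer never touches vendor-specific details.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle for one discovered accelerator device. */
struct acc_dev {
    uint32_t vendor_id;
    uint32_t device_id;
    int      num_pr_slots;   /* partial-reconfiguration slots available */
    void    *priv;           /* driver-private state */
};

/* Hypothetical vendor-neutral operations table. Each FPGA family's
 * driver fills this in; the layer above (e.g. cyborg-agent) calls
 * only these entry points. */
struct acc_ops {
    /* Enumerate devices; returns the number found (<= max). */
    int (*discover)(struct acc_dev *devs, int max);
    /* Load an acceleration image (bitstream) into one PR slot. */
    int (*program)(struct acc_dev *dev, int pr_slot,
                   const void *image, size_t image_len);
    /* Expose a programmed slot to a VM, e.g. as a virtio device. */
    int (*attach)(struct acc_dev *dev, int pr_slot, const char *vm_uuid);
    /* Reclaim a slot so it can be reprogrammed for another tenant. */
    int (*detach)(struct acc_dev *dev, int pr_slot);
};
```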
19
vEPC Acceleration
Processing pipeline between the RAN/PTN link and the IP link: FWD LB → Security → GTP rule lookup → DPI → QoS/Charging → IP FWD.

Per-stage CPU usage labels from the figure: L2 13.95%, L3 4.30%, L4 6.98%, L4 13.95%, L4 4.19%, L7 6.28%, L7 17.45%, L3 2.68%, L2 13.95%, L7 16.25%. Which of these stages should move off the CPU?

Functions per stage:
- OVS (ingress): IPv4/v6, MAC, tunnel, MPLS, ACL, security, multicast
- Load balance: packet monitor, DDoS, anti-spoofing, APN ACL
- Flow handling: flow match, flow operations
- GTP: GTP encap, GTP decap, GTP control
- DPI: protocol match, protocol analysis, policy
- QoS: CAR, remark, shaping
- Charging: CG, AAA, OCS
- OVS (egress): IPv4/v6, MAC, tunnel, MPLS, ACL, multicast
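GTP encap/decap is the stage most naturally pushed into the in-line FPGA path. As a point of reference (a software sketch, not the FPGA implementation), the mandatory 8-byte GTP-U header from 3GPP TS 29.281 can be prepended like this; gtpu_encap is a name invented here.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Mandatory 8-byte GTP-U header (3GPP TS 29.281). */
struct gtpu_hdr {
    uint8_t  flags;    /* version=1, PT=1 -> 0x30 when E/S/PN are unset */
    uint8_t  msg_type; /* 0xFF = G-PDU (encapsulated user data) */
    uint16_t length;   /* payload bytes following this 8-byte header */
    uint32_t teid;     /* tunnel endpoint ID from the GTP rule lookup */
};

/* Prepend a GTP-U header to an inner IP packet; returns the new length. */
static size_t gtpu_encap(uint8_t *out, const uint8_t *inner,
                         size_t inner_len, uint32_t teid)
{
    struct gtpu_hdr h = {
        .flags    = 0x30,
        .msg_type = 0xFF,
        .length   = htons((uint16_t)inner_len),
        .teid     = htonl(teid),
    };
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, inner, inner_len);
    return sizeof h + inner_len;
}
```

Doing this (plus the outer UDP/IP headers) per packet on the CPU is what the percentages above measure; in in-line mode the Smart NIC performs the same rewrite at line rate.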
20
vEPC Acceleration
FPGA: Xilinx VU9P (logic cells: 2,586K; CLB flip-flops: 2,364K; memory: 345.9 Mb; I/O: 832)

Business background: GW-U/UPF with 300 Gbps forwarding capability (FC), city-level DC, 1.5M user access, average rate 200 Kbps per user.

DPI forwarding (L3~L7 policy):   X86: 3.5 Gbps/VM, 54 VMs, 18.9 kW  |  X86+FPGA: 5 Gbps/VM, 38 VMs, 17.1 kW
SPI forwarding (L3~L4 policy):   X86: 7 Gbps/VM, 27 VMs, 9.45 kW    |  X86+FPGA: 10 Gbps/VM, 19 VMs, 10.07 kW
Base forwarding (encap only):    X86: 30 Gbps/VM, 7 VMs, 2.45 kW    |  X86+FPGA: 42.86 Gbps/VM, 5 VMs, 2.65 kW

X86+FPGA saves at least 16 VMs (for DPI forwarding: 54 - 38 = 16).
21
Telcos need an E2E acceleration solution, not only HW
- VNFs use the Acc API
- Provide Acc images, handle Acc requests, orchestrate & manage the Acc HW
- The VIM virtualizes accelerators and creates pools
- A common API spans accelerator types
- Accelerators such as FPGA are managed using Cyborg
22
Cyborg
• Release timeline: Queens (2018-02); Rocky (2018-09)
• Queens:
  - Bring accelerators into the mainstream
  - Optimize cyborg-db
  - Add reporting & traits; interact with Placement
• Rocky:
  - Extend the Nova API to provide acceleration
  - Add FPGA versions
  - Add the python-cyborgclient API
  - Extend Placement to describe accelerators by function
23
China Mobile testing on FPGA (Smart NIC)
Servers: E5-2650 v4, 256 GB memory

Node                            Spec & number
Controller node                 Huawei 2288 ×1
Hypervisor with Smart NIC       ZTE R5300 ×1 + Huawei 2288 ×1
Hypervisor with 40G SR-IOV NIC  ZTE R5300 ×1
Hypervisor with normal NIC      ZTE R5300 ×1 + Huawei 2288 ×1

Tester: IXIA XM2; VIM: TECS v3.0 17.14 (OpenStack Mitaka)
24
Testing result
Functions: SR-IOV supports only L2 forwarding; vSwitch and Smart NIC both support DVR, SFC, security groups, mirroring, and SDN VTEP.
Live migration: not supported on SR-IOV; supported on vSwitch and Smart NIC.
CPU usage: none for SR-IOV; very high for vSwitch; low for Smart NIC.
Decoupling: SR-IOV is tightly coupled; vSwitch and Smart NIC use the common virtio driver and are decoupled.
Performance: high for SR-IOV and Smart NIC; low for vSwitch.
25
Testing result - performance
480 flows (control-plane VNFs)
NIC type             Line rate (82B/256B/512B) %   Latency (82B/256B/512B) us
SR-IOV               25.5 / 73.3 / 91.6            6.5 / 7.2 / 10.0
vSwitch (3 cores)    11.0 / 30.6 / 53.8            30.5 / 45.6 / 50.4
Smart NIC (3 cores)  18.8 / 50.5 / 90.0            43.1 / 45.8 / 53.1

480,000 flows (data-plane VNFs)
NIC type             Line rate (82B/256B/512B) %   Latency (82B/256B/512B) us
SR-IOV               25.8 / 73.6 / 89.0            6.5 / 7.2 / 10.0
vSwitch (3 cores)    6.6 / 19.0 / 32.4             31.8 / 57.1 / 45.6
Smart NIC (3 cores)  18.9 / 50.7 / 90.1            43.7 / 46.7 / 53.6
26
Testing Conclusion
Smart NIC vs. SR-IOV
- Line rate: at 512B, Smart NIC ≈ SR-IOV; at 256B and 82B, Smart NIC ≈ 70% of SR-IOV
- Latency: Smart NIC > SR-IOV, due to the software processing in the virtio back-end offload
Smart NIC vs. vSwitch
- As flows grow, Smart NIC ≈ 3× vSwitch (e.g. 18.9% vs. 6.6% line rate at 82B with 480,000 flows)
- By a simple calculation, vSwitch needs 3 × 3 = 9 cores to match the Smart NIC's performance; considering cross-NUMA placement it needs over 10 physical cores, leaving not enough CPU for the VNFs
27
Future research
1. FPGA? FPGA+NP?
2. GPU + FPGA resource-pool architecture design
3. Deeper mining of acceleration points
4. Start vEPC acceleration testing
5. Design a common accelerator API & push it to the community
6. Design an accelerator management spec
7. Detailed spec of the FPGA we need
8. GPU usage in MEC, AR/VR
9. OPNFV project research on the 4G/5G forwarding plane & data plane, including APIs