vpp accelerated high performance & scalable l3dsr l4 load ... · *other names and brands may be...
TRANSCRIPT
VPP Accelerated High Performance & Scalable L3DSR L4 Load Balancer on Top Clos
Yusuke Tatsumi @ Yahoo Japan Corporation & Naoyuki Mori @ Intel
Acknowledgement:Shunya Kitada, Yuta Kinoshita @ Yahoo Japan Corporation
Hongjun Ni, Ray Kinsella @IntelPierre Pfister, Jerome Tollet @Cisco
*Other names and brands may be claimed as the property of others.
• ! ! , • FD.io VPP Overview• VPP LB data-plane• VPP LB control-plane• Summary• !
Agenda
*Other names and brands may be claimed as the property of others.
Yahoo! JAPAN introduction
*Other names and brands may be claimed as the property of others.
Daily Unique Browser
90+ MillionDaily Unique Browser(Only Smartphone)
60+ Million
Monthly Page Views
70+ BillionNumber of services
100+Many Load Balancers support our services!
https://s.yimg.jp/i/docs/ir/archives/jp/nenji/2017/jp2018integrated_report.pdf
Yahoo! JAPAN introduction
*Other names and brands may be claimed as the property of others.
Background and issues of our LB
2 3 9 1 0 2 3
0 B = BE
[1] h9ps://www.linux-kvm.org[2] https://kubernetes.io/
[1] [2]
[3] LBaaS: Load Balancer as a Service
#
:+ B = / /4 = B 4 BE B= E B B 4 = =B 1 -
[3]
*Other names and brands may be claimed as the property of others.
• G I H & ( GL CB GC / +• E GC B CHG
• B E EC E G EL ( C• G A E E C
• ) E H C E G CB B GC (
) -
) 2-
)- B 2-
Internet
• B B CHG ( GL CB GC ) C• )CAAC GL EI E -) 1 G C• L C E G CB GC ( G (+2
- ) - A --
Leaf/Spine IP Clos fabric
-- ) 2- -- ) 2- -- ) 2-CCC
((
Internet
Limitations of our systemand OSS possibilities
*Other names and brands may be claimed as the property of others.
-G I: ���
-G B L ���
(CBGEC D B:
Internet
Leaf/Spine IP Clos fabric
1 ��� 1 ��� 1 ���
(CBGEC D B:
MMM
( 2 GC .2I:EG F B
.2F I 2
Internet
(C CE G: G : FG B D:E CEA B : 1 G D B: ) C 22):I: CD CBGEC D B: ECA F E G F: CB CHE CD:E G CB : D:E :B :F
GE G: L GC E: E:: C CHE HEE:BG A G G CBF
Limitations of our systemand OSS possibilities
*Other names and brands may be claimed as the property of others.
Universal data plane Extensible Modular Design• Layer 2 – 4 Network Stack
• CP, TM, Overlays and more …
• Linux (and FreeBSD) support
• Kernel Interfaces (Netmap, FastTap)
• Container and Virtualization support
• Appliance, Infrastructure, VNF & CNF
Architecture• Pluggable, easy to understand & extend.
• Mature graph node architecture.
Plugins• Full control to reorganize the pipeline.
• Fast, plugins are equal citizens
Fast, Scalable and Deterministic Developer friendly• L2XC - 25+ Mpps per core
Deterministic• 0 packet drops, ~15µs latency
• Continuous & extensive latency testing.
Scalability• Linear scaling with core/thread count
• Supporting millions of concurrent L[2,3]tables entries.
• Runtime counters for everything.( throughput, ipc, errors etc )
• Full pipeline tracing facilities.
• Multi-language API bindings.
• VPP command line introspection.
Packet Processing: VPP
Management AgentNetconf/Yang REST BGP
ethernet-input
dpdk-inputaf-packet-input
vhost-user-input
mpls-inputlldp-input
...-no-checksum
ip4-input ip6-inputarp-inputcdp-input l2-input
ip4-lookup ip4-lookup-mulitcast
ip4-rewrite-transit
ip4-load-balance
ip4-midchain
mpls-policy-encap
interface-output
Graph N
odesSDN IntegrationFD.io* VPP Overview
*Other names and brands may be claimed as the property of others.
VPP Layeringfor modular design
Plugins
VNET
VLIB
VPP Infra VPP InfraLibrary of function primitives, for• memory management• memory operations• vectors• rings• hashing• timers
VLIBVPP application management • buffer, buffer management• graph node, node management• tracing, counters• threading• CLIand most importantly …• main()
VNET VPP networking source• Devices• Layer [2, 3, 4]• Session Management• Overlays• Traffic Management
Plugins• Plugins can be in-tree:
SNAT, Policy ACL,Flow Per Packet,ILA, IOAM,Load Balancer, SIXRD,VCGNSegment RouWng
• Separate fd.io project:NSH_SFC
*Other names and brands may be claimed as the property of others.
• 33 E E C A•
• 3 A A BG C• : 3: / 3/ 3 :
• A D BA I (/1 1 BA ( 3• -/: B 3 :
• )3 GE A 1 )3(1 K 4• -
• / A A 1• 3: /: 3 3:
�
�
�
�
�
D E D D D E D D D E D D(
A3 :
/ †
-
CCC
† - C A BAE E ABD D E BA
: 3
1
A
1
! 33 3
:
Leaf/Spine IP Clos fabric
[1] h:ps://www.nanog.org/mee?ngs/nanog51/presenta?ons/Monday/NANOG51.Talk45.nanog51-Schaumann.pdf
What’s available?& What’s missing?
Missing
Par?ally exist
Existing
*Other names and brands may be claimed as the property of others.
Category Feature Description First appeared
Framework Consistent Hash Brings great scalability and redundancy w/ LB nodes 16.09
Multi-Threading Performance scalability w/ CPU cores 16.09
Protocols GRE4/GRE6 Encap w/ GRE IPv4 and IPv6 16.09
L3DSR Layer 3 Direct Server Return 18.04
NAT4/NAT6 NAT (originally for kube-proxy) 18.07
Port # aware Care TCP/UDP port number for LBaaS integration 18.10
API Manipulation Set/Get VIP/member over API from controller 18.10
Statistics Collect more detailed statistics like # of connections On going
Added by this project
VPP LB plugin Enhancements
*Other names and brands may be claimed as the property of others.
Client
Load Balancer (VIP)
Server
• Src IP: Client• Dst IP: VIP• DSCP: unflag
• Src IP: Client• Dst IP: Server• DSCP: flagged
• Src IP: VIP• Dst IP: Client• DSCP: unflag
1
23
DSCP/VIP mapping in hostVIP DSCP192.168.1.2 (000 010)b
192.168.1.3 (000 011)b DSCP: Part of IP header
• No tunneling overhead with DSCP• DSR: Saving LB bandwidth (3. is generally x2-10 larger than 1.)
Why L3DSR (L3 Direct Server Return) ?
*Other names and brands may be claimed as the property of others.
• 33 E E C A• -
• 3 A A BG C•
• A D BA I (/1 1 BA ( 3• - / : 3
• )3 GE A 1 )3(1 K 4• - ::
• / A A 1• -
[1] https://www.nanog.org/meetings/nanog51/presentations/Monday/NANOG51.Talk45.nanog51-Schaumann.pdf
�
�
�
�
�
�
�
Filling missing parts,build up system!
D E D D D E D D D E D D(
3 -
-
†
- :
AAA
† - C A BAE E ABD D E BA
1
A
1
/
3
Leaf/Spine IP Clos fabricMissing
ParJally exist
ExisJng
Successfully Developed!
*Other names and brands may be claimed as the property of others.
Mpps
@ 9 B % 6 6 6 @ @6B ) 0 9 BA 0 A @C
Frame size
Frame size V.S. Mpps @VPP L3DSR by IXIA• 8 S B MI 2P 2P5 M I E• :66 N
• 6 2 M I 6 I LLI0) *) 3 N
• 4 GI (1• 52 2 M 0MC M 5 M I E
I M I )( -8 1 L !
( @ @6B
.
2 0 1
IL MI 1 M AI G C N
2 cores
1 core
4 cores
[1] hIps://www.ixiacom.com/
[1]
*Other names and brands may be claimed as the property of others.
VPP LB/node performance test@ 10Gbps
To VIP
To real servers
*Other names and brands may be claimed as the property of others.
1 Packet processing is decomposedinto a directed graph of nodes …
… packets move through graph nodes in vector …2
Microprocessor
… graph nodes are optimized to fit inside the instruction cache …
… packets are pre-fetched into the data cache.
Instruction Cache3
Data Cache4
3
4
Makes use of modern Intel® Xeon® Processor micro-architectures. Instruction cache & data cache always hot è Minimized memory latency and usage.
vhost-user-input
af-packet-input dpdk-input
ip4-lookup-mulitcast ip4-lookup*
ethernet-input
mpls-inputlldp-input
arp-inputcdp-input...-no-
checksum
ip6-inputl2-input ip4-input
ip4-load-balance
mpls-policy-encap
ip4-rewrite-transit
ip4-midchain
interface-output
* Each graph node implements a “micro-NF”, a “micro-NetworkFunction” processing packets.
Packet 0
Packet 1
Packet 2
Packet 3
Packet 4
Packet 5
Packet 6
Packet 7
Packet 8
Packet 9
Packet 10
FD.io VPP – The “Magic” of VectorsCompute Optimized SW Network Platform
*Other names and brands may be claimed as the property of others.
• RSS (Receive Side Scaling) enables traffic associated with one connec>on to a given thread.• Well leveraging HW feature RSS and SW mul>-threading for true scalability
VPP Load Balancer with L3DSRWorker 0 to N
NIC
ip4 load balance
Ib4 l3dsr
DPDK input
ip4 input
ip4 rewrite
ip4 lookup
IF output
RSS
Queue 0
Queue 1
Queue N
Multi-threading for performance scalability
NIC
ip4 load balance
Ib4 l3dsr
DPDK input
ip4 input
ip4 rewrite
ip4 lookup
IF output
ip4 load balance
Ib4 l3dsr
DPDK input
ip4 input
ip4 rewrite
ip4 lookup
IF output
Queue 0
Queue 1
Queue N
HW SW HWETH IP
[packet][packet][packet]
Load Balancer
Worker 0
Worker N
*Other names and brands may be claimed as the property of others.
::1 1 C -- B: 4
• // 1 4 1 : / 14 1:1 3
• /31: B B 4 + 31 1 : D
• : C - B
• /B 3 1 3
*Other names and brands may be claimed as the property of others.
/
A
B A
-
D B
(N+1) nodes
C
(
-
/C
A
ACD
/
LB node
A B A AInternet
/
/ / B
A / B
DC AD
(A
: C: :) A FG
*Other names and brands may be claimed as the property of others.
Internal VPP LB node architectureand control-plane
*Other names and brands may be claimed as the property of others.
• Full-scratch written by Golang• 22k LoC with gRPC
• Components• LB agent and drivers
• VPP API agent• BGP agent• Health check agent
• LB management system • Controller & DB• User-API & Interface
-
/
/ / /
GoBGP
/
[1] [2]
[3]
*Other names and brands may be claimed as the property of others.
LB control-plane
[1] h>ps://bird.network.cz/ [2] h>p://www.haproxy.org/ [3] h>ps://osrg.github.io/gobgp/
*Other names and brands may be claimed as the property of others.
Leaf/Spine IP Clos fabric
����� ����� �����
-
Internet
xUpgrading…
xUpgrading…
xUpgrading…
• Features• Serving API to integrate with any system• Zero-downtime upgrading• Life cycle management
• HW EoL migration
Completed first node!Completed all node!
LB control-plane (Cont.)
*Other names and brands may be claimed as the property of others.
New Data Center
• Features• Serving API to integrate with any system• Zero-downtime upgrading• Life cycle management
• HW EoL migration
����� �����
Leaf/Spine IP Clos fabric
�����
-
IP Clos fabric
�����Internet
xx x
LB control-plane (Cont.)
*Other names and brands may be claimed as the property of others.
• Setting• Traffic generator: iperf 10Gbps * 8• 8 VPP-LB nodes: 10Gbps * 8• ECMP
• Spine• Leaf• VPP-LB
• 80Gbps to the 1 VIP on 8 VPP-LB
LB system performance@ 80Gbps
( (( (
* 8
(
80Gbps
( (( (VIP
( ) * 8
(
40Gbps/Spine
20Gbps/Leaf
10Gbps/member
10Gbps/VPPLB
*Other names and brands may be claimed as the property of others.
• Setting• Traffic generator: iperf 10Gbps * 8• 8 VPP-LB nodes: 10Gbps * 8• ECMP
• Spine• Leaf• VPP-LB
• 80Gbps to the 1 VIP on 8 VPP-LB
9 0. ( 2 0 2 11 )0. )
LB system performance@ 80Gbps
vpplb-node01VIP: 172.27.156.248
vpplb-node02VIP: 172.27.156.248
vpplb-node03VIP: 172.27.156.248
vpplb-node04VIP: 172.27.156.248
vpplb-node05VIP: 172.27.156.248
vpplb-node06VIP: 172.27.156.248
vpplb-node07VIP: 172.27.156.248
vpplb-node08VIP: 172.27.156.248
( (( (
* 8
(
80Gbps
( (( (VIP
( ) * 8
(
40Gbps/Spine
20Gbps/Leaf
10Gbps/member
10Gbps/VPPLB
*Other names and brands may be claimed as the property of others.
• Setting• Traffic generator: iperf 10Gbps * 8• 8 VPP-LB nodes: 10Gbps * 8• ECMP
• Spine• Leaf• VPP-LB
• 80Gbps to the 1 VIP on 8 VPP-LB
A B ( . A BA B ) 2 C 2 C BE 9
iperf –c VIP iperf –c VIP iperf –c VIP iperf –c VIP
iperf –c VIP iperf –c VIP iperf –c VIP iperf –c VIP
LB system performance@ 80Gbps
)) ()
=> 8.29 Gbps => 9.42 Gbps => 8.50 Gbps => 8.81 Gbps
=> 8.98 Gbps => 8.18 Gbps => 9.34 Gbps => 9.08 Gbps
- AC 8 E 11 / A C 01
)) )))) ))
* 8
() &)
80Gbps
)) )))) ))VIP
&) - * 8
() &)
40Gbps/Spine
20Gbps/Leaf
10Gbps/member
10Gbps/VPPLB
*Other names and brands may be claimed as the property of others.
LBaaS Summary::/ : : /
• D 4 4 34 C -• Serving API to integrate with any system• Zero-downtime upgrading• Life cycle management (EoL migration)
• // 1 43 4 4 / 3 4• / 34 + 1 D• C - 1 4• / 4 4 4 4 4 4 B4
--
Clos fabric
Internet
-
• ! : :: : /: : - A:A: - : A -
• / : : /:• A: :A /:
• : -.: - :• - : -• : - : : :A /• / : A /:
C A: :
- : / :
*Other names and brands may be claimed as the property of others.
Legal Disclaimers• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
• Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.
• Cost reduction scenarios described are intended as examples of how a given Intel- based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
• All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
• Intel, the Intel logo and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others.
• © 2018 Intel Corporation.
27
Backup
*Other names and brands may be claimed as the property of others.
• P E 1MA 1M4 L B• 55 6 6G
• 58 1E C : E 5 II( )( 2 6
• 3 N * 0-• 41 1E C E 4 L B
EE A E :( 0 GI
. 0
[1] h9ps://www.ixiacom.com/
[1]
VPP LB/node performance test@ 10Gbps
To VIP
To real servers
VPPLB AddiFonal latency
:= all_latency – returning_by_switch_latency
( ( 12 )2 29 ( 2 9
*Other names and brands may be claimed as the property of others.
• ( CA D ( 1• - ::
• -3 3 B•
[1] https://www.nanog.org/meetings/nanog51/presentations/Monday/NANOG51.Talk45.nanog51-Schaumann.pdf
�
�
Missing
ParIally exist
ExisIng
3 A D 3 A D 3 A D
3
/ †
-
:
† A AB4 A B
3 B
)
/
Leaf/Spine IP Clos fabric
L3DSR LB configuraIon example:lb vip 192.168.1.1/32 protocol tcp port 80 encap l3dsr dscp 23 new_len 1024lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.1lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.2lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.3 * Added by this project
L3DSR LB configuration example