Deeper Dive in Docker Overlay Networks
Laurent Bernaille (@lbernail)
CTO, D2SI
Agenda
Reminder on the Docker Overlay
VXLAN Control Plane options
Using BGP as a dynamic Control Plane
What can we do with this?
Reminder on the Docker overlay
The Docker Overlay network
docker0:~$ docker network create --driver overlay --subnet 192.168.0.0/24 dockercon
d099dcc709daddbc0e143c24e7091bef6b13bdc3abb379473af4582bf1e112b1
docker1:~$ docker network ls
NETWORK ID NAME DRIVER SCOPE
d099dcc709da dockercon overlay global
docker0:~$ docker run -d --ip 192.168.0.100 --net dockercon --name C0 debian sleep infinity
docker1:~$ docker run -it --rm --net dockercon debian
root@950d67e96db7:/# ping 192.168.0.100
PING 192.168.0.100 (192.168.0.100): 56 data bytes
64 bytes from 192.168.0.100: seq=0 ttl=64 time=1.153 ms
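Docker Overlay: Data plane
[Diagram: on docker0 (10.0.0.10), container C0 (eth0, 192.168.0.100) is linked by a veth pair to bridge br0, which also holds a vxlan interface and reaches the network through the host eth0. docker1 (10.0.0.11) mirrors this for container C1 (eth0, 192.168.0.Y). A ping travels from C1 to C0.]
Encapsulation of the ping on the wire:
IP: src 10.0.0.11, dst 10.0.0.10
UDP: src X, dst 4789
VXLAN: VNI
Original L2: src 192.168.0.Y, dst 192.168.0.100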
What is VXLAN?
• Tunneling technology over UDP (L2 in UDP)
• Developed for cloud SDN to create multi-tenancy
• Without the need for L2 connectivity
• Without the normal VLAN limit (4096 VLAN Ids)
• Easy to encrypt: IPsec
• Overhead: 50 bytes
• In Linux
• Started with Open vSwitch
• Native with Kernel >= 3.7 and >=3.16 for Namespace support
Packet layout: Outer IP packet | UDP (dst: 4789) | VXLAN header | Original L2 frame
VXLAN: Virtual eXtensible LAN
VNI: VXLAN Network Identifier
VTEP: VXLAN Tunnel Endpoint
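The 50-byte overhead can be checked by adding up the headers placed in front of the original IP packet. The sketch below assumes an IPv4 underlay without IP options or VLAN tags; the function names are illustrative only.

```shell
# Header sizes in bytes (assumption: IPv4 underlay, no IP options, no VLAN tag).
OUTER_IP=20
OUTER_UDP=8
VXLAN_HDR=8
INNER_ETH=14

# Bytes that VXLAN adds in front of the original IP packet.
vxlan_overhead() {
  echo $(( $OUTER_IP + $OUTER_UDP + $VXLAN_HDR + $INNER_ETH ))
}

# Largest MTU usable inside the overlay for a given underlay MTU.
overlay_mtu() {
  echo $(( $1 - $(vxlan_overhead) ))
}

vxlan_overhead      # prints 50
overlay_mtu 1500    # prints 1450
```

With a standard 1500-byte underlay this gives 1450, which is the MTU set on the veth interfaces later in the deck.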
[Diagram: hosts docker0 (10.0.0.10) and docker1 (10.0.1.10) on the 10.0.0.0/16 network]
Let's build an overlay "manually"
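Overlay namespaces
[Diagram: on each host (docker0 at 10.0.0.10, docker1 at 10.0.1.10), an overlay namespace holds bridge br42 and interface vxlan42; VXLAN traffic leaves through the host eth0.]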
Creating the overlay namespace
# Create the overlay namespace
ip netns add overns
# Create the bridge in the namespace
ip netns exec overns ip link add dev br42 type bridge
ip netns exec overns ip addr add dev br42 192.168.0.1/24
# Create the VXLAN interface
ip link add dev vxlan42 type vxlan id 42 proxy dstport 4789
# Move it to the namespace
ip link set vxlan42 netns overns
# Add it to the bridge
ip netns exec overns ip link set vxlan42 master br42
# Bring all interfaces up
ip netns exec overns ip link set vxlan42 up
ip netns exec overns ip link set br42 up
setup_vxlan script
Attach containers
[Diagram: on each host, the container's eth0 is one end of a veth pair whose other end is attached to br42 in the overlay namespace, next to vxlan42. Container on docker0 (10.0.0.10): 192.168.0.10. Container on docker1 (10.0.1.10): 192.168.0.20.]
Create containers and attach them (plumb script)
On docker0:
# Create the container without networking
docker run -d --net=none --name=demo debian sleep infinity
# Get the network namespace of the container
# (ip netns only looks under /var/run/netns, so this assumes that path exposes Docker's namespaces, for example as a symlink to /var/run/docker/netns)
ctn_ns_path=$(docker inspect --format="{{ .NetworkSettings.SandboxKey}}" demo)
ctn_ns=${ctn_ns_path##*/}
# Create the veth pair
ip link add dev veth1 mtu 1450 type veth peer name veth2 mtu 1450
# Send veth1 to the overlay namespace
ip link set dev veth1 netns overns
# Attach it to the overlay bridge
ip netns exec overns ip link set veth1 master br42
ip netns exec overns ip link set veth1 up
# Send veth2 to the container
ip link set dev veth2 netns $ctn_ns
# Rename and configure it
ip netns exec $ctn_ns ip link set dev veth2 name eth0 address 02:42:c0:a8:00:10
ip netns exec $ctn_ns ip addr add dev eth0 192.168.0.10/24
ip netns exec $ctn_ns ip link set dev eth0 up
On docker1: same with 192.168.0.20 / 02:42:c0:a8:00:20
Does it ping?
docker0:~$ docker exec -it demo ping 192.168.0.20
PING 192.168.0.20 (192.168.0.20): 56 data bytes
92 bytes from 192.168.0.10: Destination Host Unreachable
docker0:~$ sudo ip netns exec overns ip neighbor show
docker0:~$ sudo ip netns exec overns ip neighbor add 192.168.0.20 lladdr 02:42:c0:a8:00:20 dev vxlan42
docker0:~$ sudo ip netns exec overns bridge fdb add 02:42:c0:a8:00:20 dev vxlan42 self dst 10.0.1.10 \
vni 42 port 4789
docker1: Same with 192.168.0.10, 02:42:c0:a8:00:10 and 10.0.0.10
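The two static entries per remote container follow a fixed pattern: one ARP entry (IP to MAC) and one FDB entry (MAC to remote host). The helper below is a hypothetical sketch that only prints the commands, so the symmetric entries for docker0 and docker1 can be compared before anything is applied; its name and argument order are not from the talk.

```shell
# Hypothetical helper: print (do not run) the static ARP and FDB entries
# that let the overlay namespace reach one remote container.
#   $1 = remote container IP, $2 = its MAC, $3 = IP of the remote host (VTEP)
#   $4 = namespace (default overns), $5 = VNI (default 42), $6 = port (default 4789)
print_static_entries() {
  local ip="$1" mac="$2" vtep="$3"
  local ns="${4:-overns}" vni="${5:-42}" port="${6:-4789}"
  local vx="vxlan${vni}"
  # L3 to L2: lets the vxlan interface answer ARP for the remote container
  echo "ip netns exec ${ns} ip neighbor add ${ip} lladdr ${mac} dev ${vx}"
  # L2 to VTEP: frames for that MAC are tunnelled to the remote host
  echo "ip netns exec ${ns} bridge fdb add ${mac} dev ${vx} self dst ${vtep} vni ${vni} port ${port}"
}

# Entries needed on docker0 to reach the container on docker1
print_static_entries 192.168.0.20 02:42:c0:a8:00:20 10.0.1.10
# Entries needed on docker1 to reach the container on docker0
print_static_entries 192.168.0.10 02:42:c0:a8:00:10 10.0.0.10
```

Every new container requires such a pair of entries on every other host, which is the job a control plane automates.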
Result
[Diagram: the two demo containers on docker0 (10.0.0.10) and docker1 (10.0.1.10), each attached to br42 next to vxlan42. With ARP and FDB entries present in both overlay namespaces, the ping crosses the VXLAN tunnel.]
VXLAN Control Plane options
VXLAN Control Plane options - 1: Multicast
Use a multicast group to send traffic for unknown L3/L2 addresses
PROS: simple and efficient
CONS: Multicast connectivity not always available (on public clouds for instance)
[Diagram: three vxlan endpoints joined to multicast group 239.x.x.x, which carries "ARP: Who has 192.168.0.2?" and "L2 discovery: where is 02:42:c0:a8:00:02?"]
VXLAN Control Plane options - 2: Point-to-point
Configure a remote IP address to which traffic for unknown addresses is sent
PROS: simple, no need for multicast, very good for two hosts
CONS: difficult to manage with more than 2 hosts
[Diagram: two vxlan endpoints, each configured with the other as remote IP; everything is sent to the remote IP]
VXLAN Control Plane options - 3: User-Land
Do nothing, provide ARP / FDB information from outside
PROS: very flexible
CONS: requires a daemon and a centralized database of addresses
[Diagram: each vxlan endpoint has a local daemon that modifies its ARP/FDB entries and answers "ARP: Do you know 192.168.0.2?" and "L2: where is 02:42:c0:a8:00:02?" using a shared store (consul/swarm)]
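At the interface level, the three options differ mainly in what the VXLAN interface is told about unknown destinations. The sketch below prints one possible `ip link add` command per option; `group`, `remote`, `dev` and `proxy` are iproute2 vxlan options, while the helper name, the multicast address 239.1.1.1 and the underlay device eth0 are assumptions chosen for illustration.

```shell
# Hypothetical helper: print the "ip link add" command for each option.
#   multicast <group> <underlay-dev> : unknown addresses go to a multicast group
#   p2p <remote-ip>                  : unknown addresses go to a single remote VTEP
#   userland                         : no default destination, a daemon fills ARP/FDB
vxlan_link_cmd() {
  local mode="$1" vni="${VNI:-42}" port="${PORT:-4789}"
  local base="ip link add dev vxlan${vni} type vxlan id ${vni} dstport ${port}"
  case "$mode" in
    multicast) echo "${base} group $2 dev $3" ;;
    p2p)       echo "${base} remote $2" ;;
    userland)  echo "${base} proxy" ;;
    *)         echo "unknown mode: ${mode}" >&2; return 1 ;;
  esac
}

vxlan_link_cmd multicast 239.1.1.1 eth0   # assumed group and device
vxlan_link_cmd p2p 10.0.1.10
vxlan_link_cmd userland
```

The userland form matches the interface created in the manual overlay earlier, where ARP and FDB entries were then added by hand.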
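Docker Overlay control plane (3: User-land)
[Diagram: same data plane as before: C0 (192.168.0.100) on docker0 and C1 (192.168.0.Y) on docker1, each linked by a veth pair to br0 and a vxlan interface, with NAT on the host eth0; hosts at 10.0.0.10 and 10.0.1.10. On each host, dockerd populates the ARP and FDB entries, and the dockerd daemons exchange endpoint data over Serf / Gossip.]
Encapsulation of the ping on the wire:
IP: src 10.0.0.11, dst 10.0.0.10
UDP: src X, dst 4789
VXLAN: VNI
Original L2: src 192.168.0.Y, dst 192.168.0.100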
"Deep Dive in Docker Overlay Networks", Dockercon Austin 2017
(slides, video, and blog posts)
That was a lot of information
Using BGP as a dynamic control plane
VXLAN Control Plane - Option 4: BGP-EVPN
Rely on the BGP EVPN address family to distribute L2 and L3 data
PROS: BGP is a standard to distribute addresses, supported by SDN vendors
CONS: limited Linux implementations, requires some BGP knowledge
[Diagram: each vxlan endpoint has a bgpd running next to it; endpoint data is distributed with BGP]
BGP in one slide
● Routing Protocol between network entities ("Autonomous Systems", AS)
Google ASN: 15169 / Amazon ASN: 16509
(both actually have more than one)
● BGP is an EGP: Exterior Gateway Protocol
IGP: Interior Gateway Protocol (OSPF, EIGRP, IS-IS)
IGP: next hop is the IP of a router
BGP: next hop is an Autonomous System
● BGP is what makes the Internet work
● BGP scales very well
500 000+ prefixes for a full Internet table
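A quick BGP example
[Diagram: five Autonomous Systems, AS 1 to AS 5, linked by eBGP sessions between ASes and iBGP sessions inside an AS. AS 1 announces 20.0.0.0/16.]
Paths seen for 20.0.0.0/16 as the announcement propagates:
20.0.0.0/16: AS1
20.0.0.0/16: AS4-AS1
20.0.0.0/16: AS2-AS1
20.0.0.0/16: AS5-AS4-AS1
Shortest PATH?
AS: Autonomous System
eBGP: external (different AS)
iBGP: internal (same AS)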
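The "shortest path" question in the example comes down to counting AS hops. The sketch below compares the two competing paths for 20.0.0.0/16; it looks only at AS-path length, which is a simplification since real BGP best-path selection checks other attributes first. The function names are illustrative.

```shell
# Count the AS hops in a path written as "AS2-AS1".
as_path_len() {
  local IFS=-
  set -- $1
  echo $#
}

# Print the shortest of the AS paths given as arguments (first one wins ties).
shortest_as_path() {
  local best="" best_len=0 path len
  for path in "$@"; do
    len=$(as_path_len "$path")
    if [ -z "$best" ] || [ "$len" -lt "$best_len" ]; then
      best="$path"
      best_len="$len"
    fi
  done
  echo "$best"
}

shortest_as_path "AS5-AS4-AS1" "AS2-AS1"   # prints AS2-AS1
```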
iBGP
Distribute BGP information within an Autonomous System
iBGP requires a full mesh between all peers
n peers => n * (n-1) / 2 connections
50 peers => 1225 (49 on each host)
Route reflectors (RR) simulate the mesh
More scalable and simpler
Possible to have more than one RR
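The session counts can be checked with shell arithmetic. The second function is an illustrative assumption about a route-reflector design in which every client peers with every reflector; sessions between the reflectors themselves are not counted.

```shell
# Sessions needed for a full iBGP mesh between n peers.
full_mesh_sessions() {
  local n="$1"
  echo $(( $n * ($n - 1) / 2 ))
}

# Sessions needed when each of n clients peers with each of r route reflectors.
rr_sessions() {
  local n="$1" r="$2"
  echo $(( $n * $r ))
}

full_mesh_sessions 50   # prints 1225
rr_sessions 50 2        # prints 100
```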
BGP EVPN
● Part of MP-BGP (multi-protocol BGP: not only IP prefixes)
● Announce VXLAN information instead of IP prefixes
L3: IP addresses of VXLAN endpoints (VTEP)
L2: Location of MAC addresses
● BUM (Broadcast, Unknown, Multicast) traffic unicasted to all VTEPs
● Get the scalability of BGP
Environment
[Diagram: network 10.0.0.0/16 with Docker hosts docker0 (10.0.0.10) and docker1 (10.0.1.10), and route reflectors RR1 (10.0.0.5) and RR2 (10.0.1.5), each running quagga-rr]
Creating our BGP clients on Docker hosts
docker0:~$ docker run -t -d --privileged --name quagga -p 179:179 --hostname docker0 \
    -v $(pwd)/quagga:/etc/quagga cumulusnetworks/quagga
(--privileged: the container must be able to modify routing/forwarding)
BGP configuration on Docker0
router bgp 65000
 bgp router-id 10.0.0.10
 no bgp default ipv4-unicast
 neighbor reflectors peer-group
 neighbor reflectors remote-as 65000
 neighbor reflectors capability extended-nexthop
 neighbor 10.0.0.5 peer-group reflectors
 neighbor 10.0.1.5 peer-group reflectors
 address-family evpn
  neighbor reflectors activate
  advertise-all-vni
BGP configuration on Route Reflectors
router bgp 65000
 bgp router-id 10.0.0.5
 bgp cluster-id 111.111.111.111
 no bgp default ipv4-unicast
 neighbor docker peer-group
 neighbor docker remote-as 65000
 bgp listen range 10.0.0.0/16 peer-group docker
 address-family evpn
  neighbor docker activate
  neighbor docker route-reflector-client
What we have so far
[Diagram: network 10.0.0.0/16 with RR1 (10.0.0.5) and RR2 (10.0.1.5) running quagga-rr; docker0 (10.0.0.10) and docker1 (10.0.1.10) each run a quagga container that peers with the route reflectors through the host eth0]
Let's look at the BGP data
docker0:~$ docker exec -it quagga vtysh
docker0# show run
docker0# show bgp neighbors
docker0# show bgp evpn summary
BGP router identifier 10.0.0.10, local AS number 65000 vrf-id 0
Peers 2, using 42 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
quagga0(10.0.0.5) 4 65000 42 43 0 0 0 00:02:01 0
quagga1(10.0.1.5) 4 65000 42 43 0 0 0 00:02:01 0
docker0# show bgp evpn route
No EVPN prefixes exist
Configuring VXLAN interfaces
sudo ./setup_vxlan 42 container:quagga dstport 4789 nolearning <= Only learn through EVPN
[Diagram: same environment; docker0 (10.0.0.10) and docker1 (10.0.1.10) now each have br42 and vxlan42 in the network namespace of their quagga container, with RR1 (10.0.0.5) and RR2 (10.0.1.5) unchanged]
Let's look at the BGP data
docker0:~$ docker exec -it quagga vtysh
docker0# show bgp evpn route
BGP table version is 0, local router ID is 10.0.0.10
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.0.0.10:1
*> [3]:[0]:[32]:[10.0.0.10]
10.0.0.10 32768 i
Route Distinguisher: 10.0.1.10:1
*>i[3]:[0]:[32]:[10.0.1.10]
10.0.1.10 0 100 0 i
docker0# show evpn mac vni all
Let's add containers and try pinging
[Diagram: same environment; a demo container is attached to br42 on each host: 192.168.0.10 on docker0 (10.0.0.10) and 192.168.0.20 on docker1 (10.0.1.10)]
docker0:~$ sudo ./plumb br42@quagga demo 192.168.0.10/[email protected] 02:42:c0:a8:00:10
docker1:~$ sudo ./plumb br42@quagga demo 192.168.0.20/[email protected] 02:42:c0:a8:00:20
What about BGP?
docker0:~$ docker exec -it quagga vtysh
docker0# show bgp evpn route
BGP table version is 0, local router ID is 10.0.0.10
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Route Distinguisher: 10.0.1.10:1
*>i[2]:[0]:[0]:[48]:[02:42:c0:a8:00:20]
10.0.1.10 0 100 0 i
* i[3]:[0]:[32]:[10.0.1.10]
10.0.1.10 0 100 0 i
docker0# show evpn mac vni all
VNI 42 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:00:10 local veth0pldemo
02:42:c0:a8:00:20 remote 10.0.1.10
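The route strings printed by `show bgp evpn route` follow the bracketed formats given in the legend, so the route type and the advertised MAC or VTEP can be pulled out with plain parameter expansion. The sketch below assumes lines shaped like the output above; the function names are illustrative.

```shell
# Drop the status flags (e.g. "*>i") that precede the first "[".
strip_flags() {
  local line="$1"
  local flags="${line%%\[*}"
  printf '%s\n' "${line#"$flags"}"
}

# Route type: the number in the first bracket (2 = MAC route, 3 = VTEP route).
evpn_route_type() {
  local route
  route="$(strip_flags "$1")"
  route="${route#\[}"
  printf '%s\n' "${route%%\]*}"
}

# Last field: the MAC for a type-2 route, the originating IP for a type-3 route.
evpn_last_field() {
  local route
  route="$(strip_flags "$1")"
  route="${route##*:\[}"
  printf '%s\n' "${route%\]}"
}

evpn_route_type '*>i[2]:[0]:[0]:[48]:[02:42:c0:a8:00:20]'   # prints 2
evpn_last_field '*>i[2]:[0]:[0]:[48]:[02:42:c0:a8:00:20]'   # prints 02:42:c0:a8:00:20
evpn_last_field '* i[3]:[0]:[32]:[10.0.1.10]'               # prints 10.0.1.10
```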
Overview
[Diagram: network 10.0.0.0/16. Control plane: the quagga containers on docker0 (10.0.0.10) and docker1 (10.0.1.10) peer over BGP with RR1 (10.0.0.5) and RR2 (10.0.1.5). Data plane: the demo containers (192.168.0.10 and 192.168.0.20) exchange traffic through br42 / vxlan42 directly between the two hosts.]
What's interesting about this setup?
● Standard VXLAN address distribution (used on many routers)
● Full management of BUM traffic
ARP queries
Broadcasts (DHCP)
Multicast (Discovery, keepalived)
● BUM traffic is sent as unicast to every VTEP (not efficient)
Possible optimizations: ARP suppression (Cumulus Quagga)
What can we do with this?
What if we want a second Overlay?
[Diagram: same environment; each host now has a second bridge and VXLAN interface, br66 / vxlan66, next to br42 / vxlan42 in the quagga namespace. Container demo66 is attached to br66: 192.168.66.10 on docker0, 192.168.66.20 on docker1.]
docker0:~$ sudo ./setup_vxlan 66 container:quagga dstport 4789 nolearning
docker0:~$ docker run -d --net=none --name=demo66 debian sleep infinity
docker0:~$ sudo ./plumb br66@quagga demo66 192.168.66.10/24 02:42:c0:a8:66:10
What about BGP?
docker0:~$ docker exec -it quagga vtysh
docker0# show evpn vni
Number of VNIs: 2
VNI VxLAN IF VTEP IP # MACs # ARPs # Remote VTEPs
42 vxlan42 0.0.0.0 2 0 1
66 vxlan66 0.0.0.0 2 0 1
docker0# show evpn mac vni all
VNI 42 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:00:10 local veth0pldemo
02:42:c0:a8:00:20 remote 10.0.1.10
VNI 66 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:66:10 local veth0pldemo66
02:42:c0:a8:66:20 remote 10.0.1.10
Taking advantage of broadcast: DHCP
[Diagram: same environment; a dhcp container (192.168.0.254) is attached to br42 next to the demo containers (192.168.0.10 and 192.168.0.20), and a new demodhcp container joins br42 without a static address (192.168.0.10?), to be assigned by DHCP.]
Configuring DHCP
docker0:~$ docker run -d --net=none --name dhcp -v "$(pwd)/dhcp":/data networkboot/dhcpd eth0
docker0:~$ sudo ./plumb br42@quagga dhcp 192.168.0.254/24
docker1:~$ docker run -d --net=none --name=demodhcp debian sleep infinity
docker1:~$ sudo ./plumb br42@quagga demodhcp dhcp
docker1:~$ docker exec -it demodhcp ping 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: icmp_seq=0 ttl=47 time=1.566 ms
DHCP configuration
subnet 192.168.0.0 netmask 255.255.255.0 {
  range 192.168.0.100 192.168.0.200;
  option routers 192.168.0.1;
  option domain-name-servers 8.8.8.8;
}
Getting out of our Docker environment
[Diagram: a third host, gateway0 (10.0.0.20), runs quagga with br42 and vxlan42; interface vethgw (192.168.0.1) connects the host to br42. docker0 (10.0.0.10) and docker1 (10.0.1.10) still carry dhcp (192.168.0.254), demo (192.168.0.10 and 192.168.0.20) and client (192.168.0.100) on the overlay.]
Getting out of our Docker environment
gateway0:~$ ./setup_vxlan 42 host dstport 4789 nolearning
gateway0:~$ ip link add dev vethbr type veth peer name vethgw
gateway0:~$ ip link set vethbr master br42
gateway0:~$ ip addr add 192.168.0.1/24 dev vethgw
gateway0:~$ ping 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: icmp_seq=0 ttl=47 time=0.866 ms
[Diagram: on gateway0, br42 has vxlan42 and vethbr as ports; vethgw, the peer of vethbr, carries 192.168.0.1]
Getting out of VXLAN / Quagga
[Diagram: gateway0 (10.0.0.20) routes between the overlay 192.168.0.0/24 (through vethgw, 192.168.0.1) and the underlay 10.0.0.0/16 (through eth0), and NATs traffic that leaves 10.0.0.0/16. A non-VXLAN host (10.0.0.30) sits on the underlay. docker0 (10.0.0.10) and docker1 (10.0.1.10) are unchanged.]
Getting out of VXLAN / Quagga
gateway0:~$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
gateway0:~$ iptables -t nat -A POSTROUTING ! -d 10.0.0.0/16 -s 192.168.0.0/24 -o eth0 -j MASQUERADE
docker1:~$ docker exec -it demodhcp ping 192.168.0.1 <= Local (VXLAN)
docker1:~$ docker exec -it demodhcp ping 10.0.0.30 <= Routed
docker1:~$ docker exec -it demodhcp ping 8.8.8.8 <= NATed
simple1:~$ ping 192.168.0.1
simple1:~$ ping 192.168.0.10
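The three outcomes for traffic leaving an overlay container (local, routed, NATed) follow from two prefix checks: is the destination in the overlay subnet, and if not, is it inside the 10.0.0.0/16 range that the MASQUERADE rule excludes. The sketch below reproduces that decision with shell arithmetic; the function names are illustrative and the prefixes are the ones used in this setup.

```shell
# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 belongs to network $2 (CIDR notation).
in_cidr() {
  local net="${2%/*}" bits="${2#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
}

# How a packet from an overlay container reaches destination $1.
classify_dst() {
  if in_cidr "$1" 192.168.0.0/24; then
    echo "local (VXLAN)"
  elif in_cidr "$1" 10.0.0.0/16; then
    echo "routed"
  else
    echo "NATed"
  fi
}

classify_dst 192.168.0.1   # prints local (VXLAN)
classify_dst 10.0.0.30     # prints routed
classify_dst 8.8.8.8       # prints NATed
```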
Another nice thing we can do
[Diagram: same setup with docker0 (10.0.0.10), docker1 (10.0.1.10), gateway0 (10.0.0.20, route and NAT) and the non-VXLAN host (10.0.0.30), plus a QEMU virtual machine whose tap0 interface is attached to the overlay; the VM runs dhclient and receives 192.168.0.10x from the dhcp container (192.168.0.254).]
What could a real-life setup look like?
[Diagram: many Docker hosts, each running quagga, form the VXLAN network and peer over BGP/EVPN with route reflectors RR1 and RR2. A router links the VXLAN network to standard hosts through regular routing, exchanging routes from the non-VXLAN infrastructure and routes to the VXLAN networks.]
How does it compare to other solutions?
Solution | Data plane | Control plane
Swarm Classic | VXLAN | External KV store (Consul / Etcd)
SwarmKit | VXLAN | SwarmKit (Raft / Gossip implementation)
Flannel host-gw | Routing | Etcd / Kubernetes API
Flannel VXLAN | VXLAN | Etcd / Kubernetes API
Calico | Routing / IPIP | Etcd / BGP (IP prefixes)
Weave Classic | Custom | Custom
Weave Fast Datapath | VXLAN | Custom
Contiv | VXLAN, Routing, L2 | Etcd / BGP (IP and maybe eVPN)
Disclaimer: almost no experience with any (from documentation and discussions mostly)
Perspectives
● FRRouting
Quagga fork
Cumulus has switched to FRRouting and merged EVPN support
● Open vSwitch
Alternative to the Linux native bridge and VXLAN
(Possibly) better performance and more features
Not sure how Quagga/FRRouting would integrate with Open vSwitch
● Performance
Measure the impact of VXLAN
Test VXLAN acceleration when available on NICs
● CNI plugin (to test on Kubernetes, mostly for learning purposes)
Thank you!
Questions?
https://github.com/lbernail/dockercon2017
@lbernail