Implementing a Layer 2 Framework on the Linux Network Stack
Takuya ASADA<[email protected]> @syuu1228
I worked at an embedded software company on SMP support for router firmware
Ph.D. student at Tokyo University of Technology, researching improvements to network I/O architecture on modern x86 servers
Interested in: SMP, Network, Virtualization
GSoC ’11 (FreeBSD): Multithread support for BPF
GSoC ’12 (FreeBSD): BIOS support for BHyVe
Research assistant at IIJ research laboratory, implementing BCube for Linux
Today’s topic!
BCube is a new network architecture
Designed for shipping-container based modular data centers
Server-centric network structure
◦ Servers act as end hosts
◦ Servers act as relay nodes for each other
The paper was published at ACM SIGCOMM ’09 by Microsoft Research Asia
Each server has one connection to each layer
Switches never connect to other switches
Servers relay traffic for each other
[Figure: example BCube topology. Legend: switch, server. Eight servers (000 to 111) connect to three layers of switches, nested as BCube0, BCube1, and BCube2.]
$\mathrm{BCube}_k$ has $k + 1$ layers
$\mathrm{BCube}_x$ contains $n$ $\mathrm{BCube}_{x-1}$s
$\mathrm{BCube}_0$ contains $n$ servers
Total servers = $n^{k+1}$
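The counts above can be checked with a few lines of Python. This is only a sketch: the function name `bcube_size` is made up for illustration, and the switch count assumes that every switch has exactly n ports, all in use, which the slides do not state explicitly.

```python
def bcube_size(n: int, k: int) -> dict:
    """Return element counts for a BCube_k built from n-port switches."""
    if n < 2 or k < 0:
        raise ValueError("need n >= 2 and k >= 0")
    layers = k + 1                 # BCube_k has k + 1 layers
    servers = n ** (k + 1)         # total servers, as stated on the slide
    # Assumption: each server uses one port per layer and every switch
    # has n ports, so each layer needs servers / n = n**k switches.
    switches_per_layer = servers // n
    return {
        "layers": layers,
        "servers": servers,
        "ports_per_server": layers,
        "switches": layers * switches_per_layer,
    }


# The topology in the figure: n = 2, k = 2.
print(bcube_size(2, 2))
```

For n = 2 and k = 2 this gives 8 servers and 3 layers, matching the eight addresses 000 to 111 in the figure.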
High network capacity for various traffic patterns
◦ one-to-one
◦ one-to-all
◦ one-to-several
◦ all-to-all
Performance degrades gracefully as server/switch failures increase
Needs no special hardware; uses only commodity switches
Each server has a unique BCube address
Each digit gives the switch port number the server connects to in the corresponding layer
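A small sketch of how the address digits could be read. The helper names are hypothetical, and two things are assumed rather than taken from the slides: that the leftmost digit belongs to the top layer (inferred from the routing example 000 → 100 → 110 → 111), and that two servers share a switch in a layer when their addresses differ only in that layer's digit.

```python
def port_in_layer(bcaddr: str, layer: int) -> int:
    """Return the switch port number encoded for `layer` in a BCube address.

    Assumption: one decimal digit per layer, leftmost digit = top layer.
    """
    k = len(bcaddr) - 1
    if not 0 <= layer <= k:
        raise ValueError("layer out of range")
    return int(bcaddr[k - layer])


def same_switch(a: str, b: str, layer: int) -> bool:
    """True if servers a and b would hang off the same switch in `layer`.

    Assumption: that is the case when every digit except the one for
    `layer` is identical.
    """
    if len(a) != len(b):
        raise ValueError("addresses must have the same length")
    idx = len(a) - 1 - layer
    return a[:idx] == b[:idx] and a[idx + 1:] == b[idx + 1:]


print(port_in_layer("110", 2), port_in_layer("110", 0))
print(same_switch("000", "100", 2), same_switch("000", "110", 2))
```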
Default routing rule
◦ Top layer → Bottom layer
◦ Ex: Route from 000 to 111: 000 → 100 → 110 → 111
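The default rule can be sketched as "fix one differing digit per hop, starting from the top layer". The function name `default_route` is illustrative, and the sketch assumes the leftmost digit is the top layer, as the example route suggests.

```python
def default_route(src: str, dst: str) -> list:
    """Return the list of BCube addresses visited from src to dst.

    Digits are corrected one at a time, from the top layer (leftmost
    digit, by assumption) down to the bottom layer.
    """
    if len(src) != len(dst):
        raise ValueError("addresses must have the same length")
    current = list(src)
    path = [src]
    for i, digit in enumerate(dst):
        if current[i] != digit:
            current[i] = digit           # one hop through a switch in this layer
            path.append("".join(current))
    return path


print(" -> ".join(default_route("000", "111")))  # 000 -> 100 -> 110 -> 111
```

Layers whose digits already match are skipped, so the path length equals the number of differing digits.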
There are alternate routes between any two nodes
They can bypass failed servers and switches
They can also be used to increase throughput by parallelizing traffic
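One simple way to see where alternate routes come from: the differing digits can be corrected in any order, and each order gives a different path. The sketch below enumerates those orders and drops routes that pass through a failed server. It is a simplification under that assumption, not the path-selection algorithm from the BCube paper, and the function names are hypothetical.

```python
from itertools import permutations


def alternate_routes(src: str, dst: str) -> list:
    """Enumerate routes obtained by fixing differing digits in every order."""
    if len(src) != len(dst):
        raise ValueError("addresses must have the same length")
    differing = [i for i in range(len(src)) if src[i] != dst[i]]
    routes = []
    for order in permutations(differing):
        current = list(src)
        path = [src]
        for i in order:
            current[i] = dst[i]
            path.append("".join(current))
        routes.append(path)
    return routes


def avoid_failed(routes: list, failed: set) -> list:
    """Keep only routes whose intermediate servers are all alive."""
    return [r for r in routes if not set(r[1:-1]) & failed]


routes = alternate_routes("000", "111")
print(len(routes))                          # 6 orders for 3 differing digits
print(avoid_failed(routes, {"100"})[0])     # first route that bypasses server 100
```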
The source server decides the best path for a flow
It can bypass failed paths
To propagate the routing path, the source server writes the path information into the packet header
A BCube header is added between the Ethernet header and the IP header
It holds the src/dst addresses and the routing path information, in the “Next Hop Index Array”
Packet layout: Ethernet Header | BCube Header | IP Header
BCube Header fields:
◦ BCube dest address
◦ BCube source address
◦ Protocol type
◦ Next Hop Index Array
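To illustrate the idea behind a next-hop array, the sketch below records each hop as "which digit changes, and to what value", and shows that a relay can rebuild the path from that list. The slides do not give the entry format, so the `(position, value)` tuples and the function names are hypothetical stand-ins, not the real on-wire encoding.

```python
def encode_hops(path: list) -> list:
    """Turn a list of BCube addresses into (digit position, new value) hops."""
    hops = []
    for prev, nxt in zip(path, path[1:]):
        if len(prev) != len(nxt):
            raise ValueError("addresses must have the same length")
        changed = [i for i in range(len(prev)) if prev[i] != nxt[i]]
        if len(changed) != 1:
            raise ValueError("each hop must change exactly one digit")
        pos = changed[0]
        hops.append((pos, int(nxt[pos])))
    return hops


def replay_hops(src: str, hops: list) -> list:
    """Rebuild the address sequence by applying hops to the source address."""
    current = list(src)
    path = [src]
    for pos, value in hops:
        current[pos] = str(value)
        path.append("".join(current))
    return path


hops = encode_hops(["000", "100", "110", "111"])
print(hops)                       # [(0, 1), (1, 1), (2, 1)]
print(replay_hops("000", hops))   # ['000', '100', '110', '111']
```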
Evaluating various "Data Center Network" technologies, especially for container-modular data center architectures. BCube is one of the candidates.
Try to use existing code as much as possible
Minimum implementation at first
A BCube device binds multiple interfaces and is assigned a BCube address and an IP address
What is the most similar function that already exists on Linux? → Bridge!
◦ Forked bridge.ko and the brctl command, and named them bcube.ko and bcctl
brctl addbr <bridge>
brctl delbr <bridge>
↓
bcctl addbc <bcube> <bcaddr> <N> <K>
bcctl delbc <bcube>
Modified addbr/delbr; addbc takes 3 additional args
◦ BCube address
◦ n and k parameters
Use the MAC address format/size for the BCube address
Use the BCube address as the HW address of the BCube device
◦ It works like a fake MAC address in the Linux network stack
101 → 00:00:00:01:00:01
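A sketch of the address mapping, assuming one BCube digit per octet, right-aligned and zero-padded to the six octets of a MAC address. That layout is inferred from the example on the slide rather than stated there, and `bcaddr_to_mac` is an illustrative name.

```python
def bcaddr_to_mac(bcaddr: str) -> str:
    """Render a BCube address in MAC address notation.

    Assumption: each digit occupies one octet, and the digits fill the
    low-order octets of a 6-octet address.
    """
    digits = [int(c) for c in bcaddr]
    if not digits or len(digits) > 6:
        raise ValueError("expected between 1 and 6 digits")
    octets = [0] * (6 - len(digits)) + digits
    return ":".join(f"{octet:02x}" for octet in octets)


print(bcaddr_to_mac("101"))  # 00:00:00:01:00:01
```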
brctl addif <bridge> <device>
brctl delif <bridge> <device>
↓
bcctl assignif <bcube> <layer> <device>
bcctl unassignif <bcube> <layer> <device>
Modified addif/delif into assignif/unassignif, adding a layer number argument
Need to reconsider address resolution
Normal Ethernet
◦ IP address → MAC address (ARP)
BCube network
◦ IP address → BCube address → ARP?
◦ (Neighbor) BCube address → MAC address → needs an additional neighbor discovery protocol
Once broadcast works in the BCube implementation, ARP should work on top of it
That isn’t implemented yet, so entries are configured manually with the following command:
arp -i bc0 -s 10.0.0.6 00:00:00:01:00:10
Need an ARP-like protocol
Decided to configure this manually too, and implemented the following commands:
bcctl addneighbour <bcube> <layer> <bcaddr> <macaddr>
bcctl delneighbour <bcube> <layer> <bcaddr>
bcube.ko maintains a neighbor table and uses it when transmitting and forwarding packets
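A user-space sketch of what such a neighbor table could look like, keyed the same way as the addneighbour/delneighbour arguments. The class, its methods, and the MAC addresses are all hypothetical; the in-kernel data structure in bcube.ko is not described in the slides.

```python
class NeighborTable:
    """Map (layer, neighbor BCube address) to the neighbor's MAC address."""

    def __init__(self):
        self._entries = {}

    def add(self, layer: int, bcaddr: str, mac: str) -> None:
        # Mirrors: bcctl addneighbour <bcube> <layer> <bcaddr> <macaddr>
        self._entries[(layer, bcaddr)] = mac.lower()

    def delete(self, layer: int, bcaddr: str) -> None:
        # Mirrors: bcctl delneighbour <bcube> <layer> <bcaddr>
        self._entries.pop((layer, bcaddr), None)

    def lookup(self, layer: int, bcaddr: str):
        """Return the MAC address, or None if the neighbor is unknown."""
        return self._entries.get((layer, bcaddr))


table = NeighborTable()
table.add(2, "100", "02:00:00:00:01:00")   # made-up MAC address
print(table.lookup(2, "100"))
table.delete(2, "100")
print(table.lookup(2, "100"))              # None after deletion
```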
bridge.ko maintains an FDB (forwarding database), a hash table that maps a destination MAC address to an output port
Deleted the FDB and implemented a function that decides the next-hop BCube address, the output port, and the next hop’s MAC address
Haven’t implemented source routing yet; just default routing for now
◦ Top layer → Bottom layer
◦ Ex: Route from 000 to 111: 000 → 100 → 110 → 111
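The per-packet forwarding decision described above might look roughly like this in user-space Python. It is a sketch, not the bcube.ko code: `next_hop` is a hypothetical name, the neighbor table is a plain dict keyed by (layer, BCube address), the leftmost digit is assumed to be the top layer, and the MAC address is made up.

```python
def next_hop(current: str, dst: str, neighbors: dict):
    """Decide the next hop under default (top layer to bottom layer) routing.

    Returns (next BCube address, output layer, next-hop MAC address),
    or None when the packet has already reached its destination.
    """
    if len(current) != len(dst):
        raise ValueError("addresses must have the same length")
    k = len(current) - 1
    for i in range(len(current)):
        if current[i] != dst[i]:
            nxt = current[:i] + dst[i] + current[i + 1:]
            layer = k - i                  # interface assigned to this layer
            mac = neighbors.get((layer, nxt))
            if mac is None:
                raise LookupError(f"no neighbor entry for {nxt} in layer {layer}")
            return nxt, layer, mac
    return None


neighbors = {(2, "100"): "02:00:00:00:01:00"}   # hypothetical entry
print(next_hop("000", "111", neighbors))
```

Each relay server repeats the same decision with its own address as `current`, which yields the 000 → 100 → 110 → 111 sequence hop by hop.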
To add the BCube header between the Ethernet header and the IP header, I forked net/ethernet/eth.c
ETH_HLEN (14 bytes) → BCUBE_HLEN (24 bytes)
struct ethhdr (MAC header) → struct bcubehdr (MAC & BCube header)
eth_header_ops → bc_header_ops, to handle the BCube header
Unfortunately, GRO accesses the Ethernet header directly and runs before BCube handles a packet, so it needs to be disabled
Found a way to implement a new L2 framework using the existing bridge implementation
◦ Much easier than implementing it from scratch
Development status
◦ Implemented basic features, debugging now
◦ Will consider adding more features: broadcast / multicast; intermediate node/switch failure detection and rerouting; source routing; an address resolution protocol
Planning a more detailed evaluation in our data center testbed
Any comments and suggestions are welcome
This work was done as part of research assistant work at IIJ research laboratory.