DevConf 2014 Kernel Networking Walkthrough
DESCRIPTION
This presentation features a walk through the Linux kernel networking stack, covering the essentials and recent developments a developer needs to know. Our starting point is the network card driver as it feeds a packet into the stack. We will follow the packet as it traverses various subsystems such as packet filtering, routing, protocol stacks, and the socket layer. We will pause here and there to look into concepts such as segmentation offloading, TCP small queues, and low latency polling. We will cover APIs exposed by the kernel that go beyond use of write()/read() on sockets and will look into how they are implemented on the kernel side.
TRANSCRIPT
Kernel Networking Walkthrough
Thomas Graf – Principal Software Engineer, Networking Services, Red Hat
Feb 7, 2014
Agenda
● How does a packet get in and out of the net stack?
  ● NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO
● How does a packet get through the net stack?
  ● RX Handler, IP Processing, TCP Processing, TCP Fast Open
● How to account for memory and do flow control?
  ● Socket Buffers, Flow Control, TCP Small Queues
● Q&A
Touring the Network Stack
Expectation Reality
How does a packet get in and out of the Network Stack?
Receive & Transmit Process
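Diagram regions: NIC, Network Stack (Kernel Space), Process (User Space)
Receive: Ring Buffer → DMA → Parse IP → Local? → Parse TCP/UDP → Socket Buffer → read() → Task
Forward: Local? → Forward → Device?
Transmit: Task → write() → Socket Buffer → Construct TCP/UDP → Construct IP → Device? → Ring Buffer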
The 3 ways into the Network Stack
● Interrupt Driven: Ring Buffer → Network Stack
● NAPI based Polling: Network Stack → poll() → Ring Buffer
● Busy Polling: Task → Network Stack → busy_poll() → Ring Buffer
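The NAPI based polling path can be illustrated with a toy model. This is only a sketch of the general idea (poll a bounded number of packets, then fall back to interrupts once the ring is drained); the names `napi_poll`, `ring`, and `budget` are illustrative and not kernel symbols.

```python
from collections import deque

def napi_poll(ring, budget, deliver):
    """Process up to `budget` packets from `ring`.

    Returns (packets_processed, irq_reenabled).
    """
    done = 0
    while ring and done < budget:
        deliver(ring.popleft())      # hand the packet to the network stack
        done += 1
    # Budget not used up means the ring is drained, so this toy model
    # switches back to interrupt driven mode.
    irq_reenabled = done < budget
    return done, irq_reenabled

ring = deque(range(10))              # ten pending packets (made-up data)
received = []
first = napi_poll(ring, 8, received.append)
second = napi_poll(ring, 8, received.append)
```

In this model the first call uses its full budget and stays in polling mode, while the second call drains the ring and re-enables interrupts.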
RSS – Receive Side Scaling
● NIC distributes packets across multiple RX queues allowing for parallel processing.
● Separate IRQ per RX queue, so the queue choice also selects the CPU that runs the hardware interrupt handler.
Diagram: a filter on the NIC distributes packets to RX-queue-1 through RX-queue-4; the queues are shown with CPU 1, CPU 3, CPU 1, and CPU 5.
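A rough sketch of how a flow could be mapped to an RX queue follows. Real NICs typically compute a Toeplitz hash in hardware; `zlib.crc32` is used here purely as a stand-in, and the table size and the function name `rx_queue_for` are assumptions made for illustration.

```python
import zlib

# Hypothetical indirection table: 128 slots spread over 4 RX queues.
INDIRECTION_TABLE = [0, 1, 2, 3] * 32

def rx_queue_for(src_ip, dst_ip, src_port, dst_port, table=INDIRECTION_TABLE):
    """Pick an RX queue for a flow so that all its packets land on one queue."""
    flow = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    flow_hash = zlib.crc32(flow)     # stand-in for the NIC's hardware hash
    return table[flow_hash % len(table)]
```

Because the hash depends only on the flow tuple, packets of one connection are not reordered across queues.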
RPS – Receive Packet Steering
● Software filter to select CPU # for processing
● Use it to ...
  ● ... redo queue - CPU mapping
  ● ... distribute single queue to multiple CPUs
Diagram: RX-queue-1 through RX-queue-4 steered to CPU 1, CPU 2, and CPU 3.
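RPS is configured per RX queue through a hexadecimal CPU bitmask in sysfs. The helper below is a small sketch that only computes the mask and the path; it does not write anything. The function names are made up, CPU numbers are zero-based as in the kernel, and masks wider than 32 CPUs (which need comma-separated groups) are not handled.

```python
def rps_cpus_mask(cpus):
    """Return the hex bitmask string for a collection of zero-based CPU ids."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def rps_cpus_path(dev, queue):
    """Return the sysfs file that holds the RPS mask of one RX queue."""
    return f"/sys/class/net/{dev}/queues/rx-{queue}/rps_cpus"
```

For example, steering queue 0 of a device named `eth0` (a placeholder name) to CPUs 0 to 2 would mean writing `rps_cpus_mask([0, 1, 2])` to `rps_cpus_path("eth0", 0)` as root.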
Hardware Offload
● RX/TX Checksumming
  ● Perform CPU intensive checksumming in hardware.
● Virtual LAN filtering and tag stripping
  ● Strip 802.1Q header and store VLAN ID in network packet meta data.
  ● Filter out unsubscribed VLANs.
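To show the kind of work that checksum offload takes off the CPU, here is a sketch of the 16-bit one's complement Internet checksum (RFC 1071) used by IP, TCP, and UDP. It is a plain software version for illustration, not the kernel's optimized routine.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the RFC 1071 Internet checksum over `data`."""
    if len(data) % 2:
        data += b"\x00"                        # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # sum big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF
```

A receiver can verify data by running the same function over the payload with the checksum appended; a result of 0 means the data is intact.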
Generic Receive Offload
Diagram: poll() takes MTU-sized packets from the Ring Buffer; NAPI based GRO merges them into packets of up to 64K and passes these to the Network Stack.
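The merging step can be sketched with a simplified model. It only merges a segment into the immediately preceding one when both belong to the same flow and are contiguous; actual GRO tracks several flows at once and compares headers. The tuple layout and the name `gro_merge` are assumptions.

```python
GRO_MAX = 65535   # upper bound for a merged packet ("Up to 64K")

def gro_merge(segments):
    """Merge contiguous segments; each segment is (flow, seq, payload)."""
    merged = []
    for flow, seq, payload in segments:
        if merged:
            m_flow, m_seq, m_payload = merged[-1]
            contiguous = flow == m_flow and seq == m_seq + len(m_payload)
            if contiguous and len(m_payload) + len(payload) <= GRO_MAX:
                merged[-1] = (m_flow, m_seq, m_payload + payload)
                continue
        merged.append((flow, seq, payload))   # start a new packet
    return merged
```

The stack then processes one large packet instead of many MTU-sized ones, which saves per-packet overhead.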
Segmentation Offload
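Diagram: the Network Stack hands down packets of up to 64K.
● Generic Segmentation Offload (GSO): packets are split to MTU size in software before they reach the Ring Buffer.
● TCP Segmentation Offload (TSO): packets of up to 64K reach the Ring Buffer and the NIC splits them to MTU size.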
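Segmentation is the reverse of GRO. The sketch below splits one large payload into segments of at most `mss` bytes and assigns each a sequence number; header replication and checksums are left out, and the function name `segment` is illustrative.

```python
def segment(payload: bytes, seq: int, mss: int):
    """Split `payload` into (seq, chunk) pairs of at most `mss` bytes each."""
    return [(seq + off, payload[off:off + mss])
            for off in range(0, len(payload), mss)]
```

With GSO this split happens in software just before the driver; with TSO the NIC performs the equivalent work in hardware.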
How does a packet get through the Network Stack?
(c) Karen Sagovac
Packet Processing
● Link Layer
● Ingress QoS
● Proto Handler: IPv4, IPv6, ARP, IPX, ...
● Drop
● Packet Socket (ETH_P_ALL): tcpdump
● RX Handler: Open vSwitch, Team, Bonding, Bridge, macvlan, macvtap
Image caption: Feast of the hungry chicks
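The protocol handler step amounts to a lookup on the EtherType field of the frame. A minimal sketch, assuming an untagged Ethernet II frame and using a plain dictionary in place of the kernel's handler lists:

```python
import struct

ETH_P_IP, ETH_P_IPV6, ETH_P_ARP, ETH_P_IPX = 0x0800, 0x86DD, 0x0806, 0x8137

HANDLERS = {
    ETH_P_IP: "IPv4",
    ETH_P_IPV6: "IPv6",
    ETH_P_ARP: "ARP",
    ETH_P_IPX: "IPX",
}

def dispatch(frame: bytes) -> str:
    """Return the name of the handler for `frame`, or "Drop" if none matches."""
    _dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return HANDLERS.get(ethertype, "Drop")
```

A packet socket bound to ETH_P_ALL, as used by tcpdump, would see every frame regardless of this lookup.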
IP Processing
Receive: IP Handler → PREROUTING → Route Lookup
Local: Local Delivery → INPUT → L4 (TCP, ...) → User Space
Forward: Forwarding → FORWARD → POSTROUTING → Link Layer
Send: IPv4 Construction → Route Lookup → Local Output → OUTPUT → POSTROUTING → Link Layer
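The netfilter hooks a packet passes depend on where it comes from and where the route lookup sends it. The following sketch encodes the three paths for a single pass through the IP layer; the function name and its parameters are illustrative.

```python
def netfilter_hooks(origin: str, local_destination: bool = False):
    """Return the hooks traversed in one pass through the IP layer.

    origin: "network" for a received packet, "local" for a locally
    generated one. local_destination only matters for received packets.
    """
    if origin == "local":
        return ["OUTPUT", "POSTROUTING"]
    if local_destination:
        return ["PREROUTING", "INPUT"]
    return ["PREROUTING", "FORWARD", "POSTROUTING"]
```

For instance, a packet that is merely routed through the host never touches INPUT or OUTPUT.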
TCP Processing
IP → Receive TCP (Parse TCP, Lookup Socket) → Socket Filter, then one of:
● Backlog (socket locked)
● Prequeue (task exists)
● Receive Socket Buffer
process context ← softirq
Task: poll(), read()
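The choice between the three queues can be written down as a small decision function. This follows the labels on the slide (socket locked, task exists) and is a sketch of the decision only, not of the kernel code.

```python
def tcp_queue_for(socket_locked: bool, task_waiting: bool) -> str:
    """Pick the queue a received TCP segment is placed on in softirq context."""
    if socket_locked:
        return "backlog"               # processed when the owner unlocks
    if task_waiting:
        return "prequeue"              # processed in the reader's context
    return "receive socket buffer"     # processed right away in softirq
```

Backlog and prequeue both defer the TCP work from softirq to process context.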
TCP Fast Open (net.ipv4.tcp_fastopen)
Regular (Client ↔ Server)
● 1st Req: SYN, SYN+ACK, ACK+HTTP GET, Data (2x RTT)
● 2nd Req: SYN, SYN+ACK, ACK+HTTP GET, Data (2x RTT)
Fast Open (Client ↔ Server)
● 1st Req: SYN, SYN+ACK+Cookie, ACK+HTTP GET, Data (2x RTT)
● 2nd Req: SYN+Cookie+HTTP GET, SYN+ACK+Data (1x RTT)
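From user space, Fast Open is used through a socket option on the listener and a flag on the client's first send. The sketch below assumes Linux and a Python build that exposes `socket.TCP_FASTOPEN` and `socket.MSG_FASTOPEN`; the helper names and the queue length of 16 are arbitrary choices. The `net.ipv4.tcp_fastopen` sysctl is a bitmask in which bit 0x1 enables the client side and bit 0x2 the server side.

```python
import socket

TFO_CLIENT, TFO_SERVER = 0x1, 0x2    # bits of net.ipv4.tcp_fastopen

def tfo_modes(sysctl_value: int) -> dict:
    """Decode which Fast Open roles a sysctl value enables."""
    return {"client": bool(sysctl_value & TFO_CLIENT),
            "server": bool(sysctl_value & TFO_SERVER)}

def tfo_listen(port: int, queue_len: int = 16) -> socket.socket:
    """Create a listening socket that accepts data carried in the SYN."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, queue_len)
    srv.listen()
    return srv

def tfo_request(addr, data: bytes) -> socket.socket:
    """Connect and send `data` in a single call instead of connect() + send()."""
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.sendto(data, socket.MSG_FASTOPEN, addr)
    return cli
```

On the first request to a server the client has no cookie yet, so the kernel falls back to a regular handshake; only later requests can carry data in the SYN.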
Memory Accounting & Flow Control
Socket Buffers & Flow Control (net.ipv4.tcp_{r|w}mem)
Transmit: ssh → write() → Socket Buffer → TCP/IP → TX Ring Buffer
● wmem over limit? → Block or EWOULDBLOCK
● wmem += packet-size
● wmem -= packet-size
Receive: RX Ring Buffer → TCP/IP → Socket Buffer → ssh
● rmem over limit? → Reduce TCP Window
● rmem += packet-size
● rmem -= packet-size
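The "Block or EWOULDBLOCK" branch can be observed from user space. The sketch below fills the send buffer of a non-blocking socket whose peer never reads. It uses an AF_UNIX `socketpair()` as a stand-in for a TCP connection so that it needs no network setup; buffer sizes therefore differ from what net.ipv4.tcp_wmem would give a TCP socket.

```python
import socket

def fill_send_buffer(chunk_size: int = 4096):
    """Write until the kernel refuses more data; return (bytes_sent, SO_SNDBUF)."""
    a, b = socket.socketpair()       # stand-in for a connected TCP socket
    a.setblocking(False)
    sent = 0
    try:
        while True:
            sent += a.send(b"x" * chunk_size)
    except BlockingIOError:          # EWOULDBLOCK/EAGAIN: buffer is over limit
        pass
    sndbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    a.close()
    b.close()
    return sent, sndbuf
```

A blocking socket would instead sleep in write() at the same point until the peer drains some data.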
TCP Small Queues (net.ipv4.tcp_limit_output_bytes)
Diagram: ssh and torrent each write() into their own Socket Buffer; packets then pass through TCP/IP → Queuing Discipline → Driver → TX Ring Buffer.
TSQ: max 128 KB in flight per socket
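The per-socket limit can be modelled as a counter of bytes that sit in the queuing discipline and driver. This toy model is a sketch under simplifying assumptions (the kernel's actual check and wake-up logic differ in detail); the class name `Flow` and its methods are made up, and the limit is the 128 KB figure from the slide.

```python
TSQ_LIMIT = 128 * 1024   # bytes allowed below the TCP layer per socket

class Flow:
    def __init__(self, name):
        self.name = name
        self.queued = 0          # bytes currently in qdisc/driver queues
        self.throttled = False

    def try_transmit(self, size):
        """Queue `size` bytes unless that would exceed the per-socket limit."""
        if self.queued + size > TSQ_LIMIT:
            self.throttled = True    # wait for a TX completion
            return False
        self.queued += size
        return True

    def tx_completed(self, size):
        """The NIC finished sending `size` bytes; the flow may continue."""
        self.queued -= size
        self.throttled = False
```

The effect is that a bulk sender such as the torrent client cannot fill the shared queues on its own, so packets from an interactive ssh session wait behind less data.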
Q&A
Feedback Page
● http://devconf.cz/f/1
Coming Up Next:
NetworkManager for Enterprise
Dan Williams