ESLEA Closing Conference, Edinburgh, March 2007, R. Hughes-Jones, Manchester (Slide 1)
Protocols
Working with 10 Gigabit Ethernet
Richard Hughes-Jones The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks”
Slide 2
Introduction
10 GigE on SuperMicro X7DBE
10 GigE on SuperMicro X5DPE-G2
10 GigE and TCP – monitor with web100, disk writes
10 GigE and Constant Bit Rate transfers
UDP + memory access
GÉANT 4 Gigabit tests
Slide 3
Udpmon: Latency & Throughput Measurements
UDP/IP packets are sent between back-to-back systems. The processing is similar to TCP/IP, but there are no flow-control or congestion-avoidance algorithms.

Latency
Round-trip times are measured using Request-Response UDP frames, giving latency as a function of frame size.
The slope s is the sum over the data paths of the per-byte transfer times: s = Σ (dt/db),
i.e. mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s).
The intercept indicates processing times + hardware latencies.
Histograms of 'singleton' measurements.

UDP Throughput
Send a controlled stream of UDP frames spaced at regular intervals. Vary the frame size and the frame transmit spacing & measure: the time of first and last frames received; the number of packets received, lost & out of order; a histogram of the inter-packet spacing of received packets; the packet-loss pattern; 1-way delay; CPU load; number of interrupts.

Tells us about: the behaviour of the IP stack; the way the hardware operates; interrupt coalescence; the capacity & available throughput of the LAN / MAN / WAN paths.
Slide 4
Throughput Measurements
UDP throughput with udpmon: send a controlled stream of UDP frames spaced at regular intervals.

Sender–Receiver exchange (time runs downward):
Zero stats → OK done
Send data frames (n bytes) at regular intervals (number of packets, wait time); record time to send and time to receive; inter-packet time (histogram)
Signal end of test → OK done
Get remote statistics → receiver sends statistics: no. received; no. lost + loss pattern; no. out-of-order; CPU load & no. of interrupts; 1-way delay
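As a rough illustration of the bookkeeping behind these receiver statistics, here is a hedged sketch. This is not udpmon's actual code: the function name and the simulated arrival list are invented for illustration; a real tool would read UDP frames carrying a sequence number and a send timestamp.

```python
# Sketch of the receiver-side bookkeeping a udpmon-style tool performs.
# Packet arrivals are simulated; each arrival is (sequence no., recv time in µs).

def receiver_stats(arrivals, frame_bytes):
    """arrivals: list of (seq, recv_time_us) in arrival order."""
    n_recv = len(arrivals)
    seqs = [s for s, _ in arrivals]
    # lost = gap between the sequence-number span and the packets seen
    n_lost = (max(seqs) - min(seqs) + 1) - len(set(seqs))
    # out-of-order: arrivals whose seq is lower than an earlier arrival's
    n_ooo = sum(1 for i in range(1, n_recv) if seqs[i] < max(seqs[:i]))
    # throughput from the time of first and last frames received
    elapsed_us = arrivals[-1][1] - arrivals[0][1]
    rate_mbit = frame_bytes * 8 * (n_recv - 1) / elapsed_us  # Mbit/s
    return n_recv, n_lost, n_ooo, rate_mbit

# Simulate 1000 frames of 8972 bytes sent every 8 µs, with frame 500 lost
arrivals = [(s, s * 8.0) for s in range(1000) if s != 500]
n_recv, n_lost, n_ooo, rate = receiver_stats(arrivals, 8972)
```

With 8972-byte frames every 8 µs this gives roughly 9 Gbit/s, the regime the later slides operate in.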
Slide 5
High-end Server PCs for 10 Gigabit
Boston/Supermicro X7DBE: two dual-core Intel Xeon "Woodcrest" 5130 @ 2 GHz.
Independent 1.33 GHz front-side buses.
530 MHz fully buffered (serial) memory; parallel access to 4 banks.
Chipsets: Intel 5000P MCH (PCIe & memory); ESB2 (PCI-X, GigE, etc.).
PCI: 3 × 8-lane PCIe buses; 3 × 133 MHz PCI-X.
2 × Gigabit Ethernet; SATA.
Slide 6
10 GigE Back2Back: UDP Latency
Motherboard: Supermicro X7DBE; chipset: Intel 5000P MCH; CPU: 2 × dual-core Intel Xeon 5130 @ 2 GHz with 4096k L2 cache; memory bus: 2 independent @ 1.33 GHz; PCI-e 8-lane; Linux kernel 2.6.20-web100_pktd-plus.
Myricom NIC 10G-PCIE-8A-R Fibre; myri10ge v1.2.0 + firmware v1.4.10; rx-usecs=0 (coalescence OFF); MSI=1; checksums ON; tx_boundary=4096; MTU 9000 bytes.
Latency 22 µs & very well behaved.
Latency slope 0.0028 µs/byte; B2B expectation: 0.00268 µs/byte
(Mem 0.0004 + PCI-e 0.00054 + 10GigE 0.0008 + PCI-e 0.00054 + Mem 0.0004).
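The expected slope is just the sum of the per-byte times along the data path. A quick check using only the slide's figures:

```python
# Expected back-to-back latency slope (µs/byte) as the sum of the
# per-byte transfer times along the data path, using the slide's figures.
components = {
    "mem-mem copy (send)": 0.0004,
    "PCI-e (send)":        0.00054,
    "10GigE wire":         0.0008,
    "PCI-e (recv)":        0.00054,
    "mem-mem copy (recv)": 0.0004,
}
expected_slope = sum(components.values())   # µs/byte, should be 0.00268

# Predicted latency for a 9000-byte message, using the measured fit
measured_slope, intercept = 0.0028, 21.937  # from y = 0.0028x + 21.937
latency_9000 = intercept + measured_slope * 9000  # µs
```

The measured 0.0028 µs/byte sits just above the 0.00268 µs/byte expectation, and the fit predicts about 47 µs for a 9000-byte message.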
[Plot gig6-5_Myri10GE_rxcoal=0: latency (µs) vs message length (bytes), fit y = 0.0028x + 21.937. Latency histograms N(t) for 64-, 3000- and 8900-byte messages; histogram FWHM ~1-2 µs.]
Slide 7
10 GigE Back2Back: UDP Throughput
Kernel 2.6.20-web100_pktd-plus; Myricom 10G-PCIE-8A-R Fibre; rx-usecs=25 (coalescence ON); MTU 9000 bytes.
Max throughput 9.4 Gbit/s. Note the rate for 8972-byte packets.
~0.002% packet loss in 10M packets, in the receiving host.
Sending host: 3 CPUs idle; for <8 µs frame spacing, 1 CPU is >90% in kernel mode, incl. ~10% soft interrupts.
Receiving host: 3 CPUs idle; for <8 µs frame spacing, 1 CPU is 70-80% in kernel mode, incl. ~15% soft interrupts.
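The 8972-byte payload follows from header accounting at MTU 9000. A quick check, using standard Ethernet/IP/UDP header sizes (only the 8972-byte payload comes from the slide):

```python
# On-wire accounting for a UDP frame at MTU 9000: why the user payload
# tops out at 8972 bytes, and what a 10 Gbit/s line can carry.
UDP_HDR, IP_HDR = 8, 20
ETH_HDR, CRC, PREAMBLE, IFG = 14, 4, 8, 12   # Ethernet framing overheads

user = 8972
ip_packet = user + UDP_HDR + IP_HDR           # must fit the 9000-byte MTU
on_wire = ip_packet + ETH_HDR + CRC + PREAMBLE + IFG  # bytes occupied per frame

line_rate = 10_000.0                          # Mbit/s
max_user_rate = line_rate * user / on_wire    # Mbit/s of user data
```

The wire limit is ~9.93 Gbit/s of user data, so the measured 9.4 Gbit/s plateau is host-limited rather than wire-limited.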
[Plots gig6-5_myri10GE: recv wire rate (Mbit/s), sender % CPU in kernel mode, and receiver % CPU in kernel mode vs spacing between frames (µs), for packet sizes 1000-8972 bytes.]
Slide 8
10 GigE UDP Throughput vs packet size
Motherboard: Supermicro X7DBE; Linux kernel 2.6.20-web100_pktd-plus; Myricom NIC 10G-PCIE-8A-R Fibre; myri10ge v1.2.0 + firmware v1.4.10; rx-usecs=0 (coalescence ON); MSI=1; checksums ON; tx_boundary=4096.
Steps at 4060 and 8160 bytes, within 36 bytes of 2^n boundaries.
Model the data-transfer time as t = C + m × bytes, where C includes the time to set up the transfers. A reasonable fit gives C = 1.67 µs, m = 5.4e-4 µs/byte. The steps are consistent with C increasing by 0.6 µs.
The Myricom driver segments the transfers, limiting each DMA to 4096 bytes – PCI-e chipset dependent!
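A minimal sketch of this model. The per-segment step is taken as the slide's fitted 0.6 µs; the exact segmentation behaviour (one extra setup cost per additional 4096-byte DMA) is an assumption consistent with, but not stated by, the slide.

```python
import math

# Transfer-time model t = C + m*bytes, with the driver segmenting each
# DMA at 4096 bytes so an extra ~0.6 µs setup cost is paid per segment.
C   = 1.67     # µs, base setup time (fitted on the slide)
dC  = 0.6      # µs, extra cost per additional 4096-byte segment (fitted)
m   = 5.4e-4   # µs/byte (fitted)
SEG = 4096     # bytes, driver DMA boundary (assumed exact)

def transfer_time_us(nbytes):
    extra_segments = math.ceil(nbytes / SEG) - 1
    return C + extra_segments * dC + m * nbytes

# Crossing a 4096-byte boundary adds one extra setup cost dC
step = transfer_time_us(4100) - transfer_time_us(4092)

# Implied rate for an 8972-byte packet (3 segments)
rate_mbit = 8972 * 8 / transfer_time_us(8972)
```

The model puts the 8972-byte rate at roughly 9.3 Gbit/s, close to the observed 9.4 Gbit/s plateau.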
[Plot gig6-5_myri_udpscan: recv wire rate (Mbit/s) vs size of user data in packet (bytes), 0-10000.]
Slide 9
10 GigE via Cisco 7600: UDP Latency
Motherboard: Supermicro X7DBE, PCI-e 8-lane; Linux kernel 2.6.20 SMP; Myricom NIC 10G-PCIE-8A-R Fibre; myri10ge v1.2.0 + firmware v1.4.10; rx-usecs=0 (coalescence OFF); MSI=1; checksums ON; MTU 9000 bytes.
Latency 36.6 µs & very well behaved.
Switch latency 14.66 µs; switch internal slope 0.0011 µs/byte
(cf. PCI-e 0.00054, 10GigE 0.0008 µs/byte).
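The switch numbers fall out of comparing the via-switch fit with the back-to-back fit. A hedged check of that arithmetic (the decomposition of the extra slope into store-and-forward serialisation plus switch-internal cost is my reading of the slide, not its exact derivation):

```python
# Via-switch fit vs back-to-back fit: intercept difference = switch
# latency; extra slope = one more 10GigE serialisation (store-and-
# forward) plus the switch's internal per-byte cost.
b2b_slope, b2b_intercept = 0.0028, 21.937   # y = 0.0028x + 21.937
sw_slope,  sw_intercept  = 0.0046, 36.6     # y = 0.0046x + 36.6

switch_latency = sw_intercept - b2b_intercept          # ≈ 14.66 µs
extra_slope = sw_slope - b2b_slope                     # 0.0018 µs/byte
tengige_serialisation = 0.0008                         # µs/byte, from the B2B slide
switch_internal = extra_slope - tengige_serialisation  # µs/byte
```

This gives ~0.0010 µs/byte for the switch internals, consistent with the slide's quoted 0.0011 µs/byte given the precision of the fits.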
[Plot gig6-Cisco-5_Myri_rxcoal0: latency (µs) vs message length (bytes), fit y = 0.0046x + 36.6.]
Slide 10
The “SC05” Server PCs
Boston/Supermicro X7DBE: two Intel Xeon "Nocona" @ 3.2 GHz, 2048k cache.
Shared 800 MHz front-side bus; DDR2-400 memory.
Chipset: Intel 7520 "Lindenhurst".
PCI: 2 × 8-lane PCIe buses; 1 × 4-lane PCIe bus; 3 × 133 MHz PCI-X.
2 × Gigabit Ethernet.
Slide 11
10 GigE X7DBE → X6DHE: UDP Throughput
Kernel 2.6.20-web100_pktd-plus; Myricom 10G-PCIE-8A-R Fibre; myri10ge v1.2.0 + firmware v1.4.10; rx-usecs=25 (coalescence ON); MTU 9000 bytes.
Max throughput 6.3 Gbit/s.
Packet loss ~40-60% in the receiving host.
Sending host: 3 CPUs idle; 1 CPU is >90% in kernel mode.
Receiving host: 3 CPUs idle; for <8 µs frame spacing, 1 CPU is 70-80% in kernel mode, incl. ~15% soft interrupts.
[Plots gig6-X6DHE_MSI_myri: recv wire rate (Mbit/s), receiver % CPU in kernel mode, and % packet loss vs spacing between frames (µs), for packet sizes 1000-8972 bytes.]
Slide 12
So now we can run at 9.4 Gbit/s
Can we do any work?
Slide 13
10 GigE X7DBE → X7DBE: TCP iperf
No packet loss. MTU 9000; TCP buffer 256k; BDP ≈ 330k.
Cwnd: slow start, then slow growth – limited by the sender!
Duplicate ACKs: one event of 3 DupACKs.
Packets retransmitted.
Iperf throughput 7.77 Gbit/s.
Web100 plots of TCP parameters.
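The quoted BDP can be sanity-checked from figures elsewhere in the talk. This is my reconstruction, not the slide's: I assume the ~9.4 Gbit/s achievable rate from the UDP tests and the 36 µs RTT quoted on the CBR/TCP slide, and read the "~330k" as bits.

```python
# Bandwidth-delay product check for the back-to-back TCP tests.
rate_bit_s = 9.4e9     # bit/s, achievable rate from the UDP tests (assumed)
rtt_s = 36e-6          # s, RTT quoted on the CBR/TCP slide (assumed)

bdp_bits  = rate_bit_s * rtt_s   # ≈ 3.4e5 bits, i.e. "~330k"
bdp_bytes = bdp_bits / 8         # ≈ 42 kB

tcp_buffer_bytes = 256 * 1024    # the 256k buffer amply covers the BDP
```

Under these assumptions the 256k TCP buffer is several times the BDP, so buffering is not what limits the 7.77 Gbit/s result; the web100 traces point at the sender.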
Slide 14
10 GigE X7DBE → X7DBE: TCP iperf
Packet loss 1:50,000 (recv-kernel patch). MTU 9000; TCP buffer 256k; BDP ≈ 330k.
Cwnd: slow start, then slow growth – limited by the sender!
Duplicate ACKs: ~10 DupACKs for every lost packet.
Packets retransmitted: one per lost packet.
Iperf throughput 7.84 Gbit/s.
Web100 plots of TCP parameters.
Slide 15
10 GigE X7DBE → X7DBE: CBR/TCP
Packet loss 1:50,000 (recv-kernel patch). tcpdelay: message 8120 bytes, wait 7 µs; RTT 36 µs; TCP buffer 256k; BDP ≈ 330k.
Cwnd dips as expected.
Duplicate ACKs: ~15 DupACKs for every lost packet.
Packets retransmitted: one per lost packet.
tcpdelay throughput 7.33 Gbit/s.
Web100 plots of TCP parameters.
Slide 16
B2B UDP with memory access
Send UDP traffic B2B over 10GigE; on the receiver, run an independent memory-write task (L2 cache 4096 kB; 8000 kB blocks; 100% user mode).
Achievable UDP throughput (UDP alone / UDP+cpu1 / UDP+cpu3): mean 9.39 Gbit/s (sigma 106) / mean 9.21 Gbit/s (sigma 37) / mean 9.2 Gbit/s (sigma 30).
Packet loss: mean 0.04% / mean 1.4% / mean 1.8%.
CPU load:
Cpu0: 6.0% us, 74.7% sy, 0.0% ni, 0.3% id, 0.0% wa, 1.3% hi, 17.7% si, 0.0% st
Cpu1: 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si, 0.0% st
Cpu2: 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si, 0.0% st
Cpu3: 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si, 0.0% st
[Plots gig6-5_udpmon_membw: recv wire rate (Mbit/s, 9000-9600 range) and % packet loss vs trial number, for UDP, UDP+cpu1 and UDP+cpu3.]
Slide 17
ESLEA-FABRIC: 4 Gbit flows over GÉANT
Set up a 4 Gigabit lightpath between GÉANT PoPs, in collaboration with Dante: GÉANT Development Network, London – London or London – Amsterdam; and GÉANT Lightpath service, CERN – Poznan. PCs in their PoPs with 10 Gigabit NICs.
VLBI tests: UDP performance – throughput, jitter, packet loss, 1-way delay, stability; continuous (days-long) data flows – VLBI_UDP – and multi-Gigabit TCP performance with current kernels; experience for FPGA Ethernet packet systems.
Dante interests: multi-Gigabit TCP performance; the effect of (Alcatel) buffer size on bursty TCP using bandwidth-limited lightpaths.
Slide 18
Options Using the GÉANT Development Network
10 Gigabit SDH backbone; Alcatel 1678 MCC.
Node locations: London, Amsterdam, Paris, Prague, Frankfurt.
Can do traffic routing, so long-RTT paths can be made.
Available now (2007); less pressure for long-term tests.
Slide 19
Options Using the GÉANT Lightpaths
Set up a 4 Gigabit lightpath between GÉANT PoPs, in collaboration with Dante; PCs in Dante PoPs.
10 Gigabit SDH backbone; Alcatel 1678 MCC.
Node locations: Budapest, Geneva, Frankfurt, Milan, Paris, Poznan, Prague, Vienna.
Can do traffic routing, so long-RTT paths can be made.
Ideal: London – Copenhagen.
Slide 20
Any Questions?
Slide 21
Backup Slides
Slide 22
10 Gigabit Ethernet: UDP Throughput
A 1500-byte MTU gives ~2 Gbit/s. Used a 16144-byte MTU, max user length 16080.
DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire-rate throughput of 2.9 Gbit/s.
CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.7 Gbit/s.
SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s.
[Plot "an-al 10GE Xsum 512kbuf MTU16114 27Oct03": recv wire rate (Mbit/s) vs spacing between frames (µs), packet sizes 1472-16080 bytes.]
Slide 23
10 Gigabit Ethernet: Tuning PCI-X
16080-byte packets every 200 µs; Intel PRO/10GbE LR adapter; PCI-X bus occupancy vs mmrbc.
Measured times, and times based on PCI-X timings from the logic analyser.
Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s.
[Plots: PCI-X transfer time (µs) vs max memory read byte count, for the HP Itanium (kernel 2.6.1 #17, Intel 10GE, Feb 04) and the DataTAG Xeon 2.2 GHz; curves show measured rate (Gbit/s), rate from expected time (Gbit/s), and max PCI-X throughput. Logic-analyser traces show CSR access, PCI-X sequence, data transfer, and interrupt & CSR update, for mmrbc 512, 1024, 2048 and 4096 bytes; mmrbc 4096 bytes gives 5.7 Gbit/s.]
Slide 24
10 Gigabit Ethernet: TCP Data transfer on PCI-X
Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons, connected via a 6509; XFrame II NIC; PCI-X mmrbc 4096 bytes @ 66 MHz.
Two 9000-byte packets back-to-back; average rate 2.87 Gbit/s.
Burst of packets, length 646.8 µs; gap between bursts 343 µs; 2 interrupts per burst.
[Logic-analyser trace: CSR access and data transfer.]
Slide 25
10 Gigabit Ethernet: UDP Data transfer on PCI-X
Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons, connected via a 6509; XFrame II NIC; PCI-X mmrbc 2048 bytes @ 66 MHz.
One 8000-byte packet: 2.8 µs for CSRs, 24.2 µs data transfer; effective rate 2.6 Gbit/s.
2000-byte packets, wait 0 µs: ~200 ms pauses.
8000-byte packets, wait 0 µs: ~15 ms between data blocks.
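The "effective rate" quoted above is just the payload over the data-transfer time; a quick check of that arithmetic, including what happens if the CSR overhead is counted too (my addition, not on the slide):

```python
# Checking the PCI-X numbers for one 8000-byte packet on the XFrame II:
# 2.8 µs of CSR accesses plus 24.2 µs of data transfer.
payload_bits = 8000 * 8
csr_us, dma_us = 2.8, 24.2

rate_data_only = payload_bits / dma_us            # Mbit/s, data transfer alone
rate_with_csr  = payload_bits / (dma_us + csr_us) # Mbit/s, incl. CSR overhead
```

The data-transfer time alone gives ~2.64 Gbit/s, matching the slide's 2.6 Gbit/s; including the CSR accesses pulls the per-packet rate down to ~2.4 Gbit/s.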
[Logic-analyser trace: CSR access (2.8 µs) and data transfer.]
Slide 26
10 Gigabit Ethernet: Neterion NIC Results
X5DPE-G2 Supermicro PCs back-to-back: dual 2.2 GHz Xeon CPU, FSB 533 MHz; XFrame II NIC, PCI-X mmrbc 4096 bytes.
Low UDP rates, ~2.5 Gbit/s; large packet loss.
TCP: one iperf TCP data stream – 4 Gbit/s; two bi-directional iperf TCP data streams – 3.8 & 2.2 Gbit/s.
[Plots "s2io 9k 3d Feb 06": recv wire rate (Mbit/s) and % packet loss vs spacing between frames (µs), packet sizes 1472-8972 bytes.]
Slide 27
SC|05 Seattle-SLAC 10 Gigabit Ethernet
2 lightpaths: routed over ESnet; layer 2 over UltraScience Net.
6 Sun V20Z systems per λ.
dCache remote disk data access: 100 processes per node; each node sends or receives; one data stream is 20-30 Mbit/s.
Used Neterion NICs & Chelsio TOE. Data also sent to StorCloud using Fibre Channel links.
Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node; 8.5-9 Gbit/s on the trunk.