TRANSCRIPT

Slide: 1
Using TCP/IP on High Bandwidth Long Distance Optical Networks
Real Applications on Real Networks
Richard Hughes-Jones University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Rank”
Slide: 2
Bandwidth Challenge at SC2004
SCinet: setting up the BW bunker
The BW Challenge at the SLAC booth
Working with S2io, Sun, Chelsio
Slide: 3
The Bandwidth Challenge – SC2004
The peak aggregate bandwidth from the booths was 101.13 Gbit/s: that is 3 full-length DVDs per second!
4 times greater than SC2003 (with its 4.4 Gbit/s transatlantic flows)
Saturated TEN 10 Gigabit Ethernet waves
SLAC Booth: Sunnyvale to Pittsburgh, LA to Pittsburgh, and Chicago to Pittsburgh (with UKLight)
Slide: 4
TCP has been around for ages and it just works fine
So what's the Problem?
The users complain about the Network!
Slide: 5
TCP – provides reliability
Positive acknowledgement (ACK) of each received segment
Sender keeps a record of each segment sent
Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
Sender starts a timer when it sends a segment, so it can re-transmit
[Diagram: sender/receiver timeline. Segment n (Sequence 1024, Length 1024) is answered one RTT later by Ack 2048; Segment n+1 (Sequence 2048, Length 1024) by Ack 3072.]
Inefficient – sender has to wait
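A quick worked illustration (my numbers, not on the slide) of why waiting is inefficient: with one segment outstanding per RTT, throughput is bounded by MSS/RTT no matter how fast the link is.

    # Stop-and-wait picture: one segment, then one RTT waiting for its ACK.
    MSS_BYTES = 1460      # payload of a 1500 byte Ethernet frame
    RTT_S = 0.200         # roughly a transatlantic round trip time

    throughput = MSS_BYTES * 8 / RTT_S       # bits per second
    print(f"{throughput / 1e3:.0f} kbit/s")  # ~58 kbit/s, however fast the link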
Slide: 6
Flow Control: Sender – Congestion Window
Uses a congestion window, cwnd, a sliding window to control the data flow
Byte count giving the highest byte that can be sent without an ACK
Transmit buffer size and advertised receive buffer size are important
ACK gives the next sequence number to receive AND the available space in the receive buffer
A timer is kept for each packet
[Diagram: the TCP cwnd slides along the byte stream. Regions, left to right: data sent and ACKed; sent data buffered awaiting ACK; unsent data that may be transmitted immediately; data waiting for the window to open (the application writes here). A received ACK advances the trailing edge; the receiver's advertised window advances the leading edge; the sending host advances a marker as data is transmitted.]
Slide: 7
How it works: TCP Slowstart
Probe the network to get a rough estimate of the optimal congestion window size
The larger the window size, the higher the throughput: Throughput = Window size / Round-trip Time
Exponentially increase the congestion window size until a packet is lost
cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
Send 1st packet, get 1 ACK, increase cwnd to 2; send 2 packets, get 2 ACKs, increase cwnd to 4
Time to reach cwnd size W = RTT*log2(W): the rate doubles each RTT
[Diagram: cwnd vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; on timeout, slow start again.]
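To make the RTT*log2(W) figure concrete, here is a minimal sketch (my illustration, assuming one ACK per segment and no loss) of idealised slow-start growth:

    import math

    RTT_S = 0.2                 # round trip time
    TARGET_W = 4096             # target window, in segments

    cwnd, t = 1, 0.0
    while cwnd < TARGET_W:      # every segment is ACKed, so cwnd doubles per RTT
        cwnd *= 2
        t += RTT_S
    print(t, RTT_S * math.log2(TARGET_W))   # both 2.4 s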
Slide: 8
How it works: TCP AIMD Congestion Avoidance
Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
cwnd increased by a/cwnd for each ACK, a linear increase in rate (about one MTU per RTT):
cwnd -> cwnd + a/cwnd  (Additive Increase, a = 1)
TCP takes packet loss as an indication of congestion!
Multiplicative decrease: cut the congestion window size aggressively if a packet is lost; standard TCP reduces cwnd by half:
cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b = 1/2)
The slow start to congestion avoidance transition is determined by ssthresh
Packet loss is a killer
[Diagram: cwnd vs time, as on the previous slide. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; on timeout, slow start again.]
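A compact sketch (my own, not code from the talk) of the per-ACK and per-loss cwnd updates just described, with Reno's a = 1, b = 1/2:

    def on_ack(cwnd, a=1.0):
        """Additive increase: cwnd grows by ~a segments per RTT."""
        return cwnd + a / cwnd

    def on_loss(cwnd, b=0.5):
        """Multiplicative decrease: Reno halves the window."""
        return cwnd - b * cwnd

    cwnd = 100.0
    for _ in range(100):        # one RTT's worth of ACKs
        cwnd = on_ack(cwnd)
    print(round(cwnd, 1))       # ~101.0: one extra segment per RTT
    print(on_loss(cwnd))        # ~50.5: halved by a single loss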
Slide: 9
TCP (Reno) – Details of problem
The time for TCP to recover its throughput from 1 lost 1500 byte packet, for a link of capacity C, is given by:

    τ = C * RTT² / (2 * MSS)

for an rtt of ~200 ms: ~2 min
[Plot: time to recover (sec, log scale) vs rtt (ms), one curve per link speed from 10 Mbit to 10 Gbit. Annotated examples: UK 6 ms: 1.6 s; Europe 25 ms: 26 s; USA 150 ms: 28 min.]
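As a check on the plot's annotations, my own arithmetic, assuming a 1 Gbit/s link and a 1500 byte MSS:

    def recover_s(capacity_bps, rtt_s, mss_bytes=1500):
        """Reno recovery time after one loss: C * RTT^2 / (2 * MSS)."""
        return capacity_bps * rtt_s**2 / (2 * mss_bytes * 8)

    for name, rtt in [("UK", 0.006), ("Europe", 0.025), ("USA", 0.150)]:
        print(name, round(recover_s(1e9, rtt), 1), "s")
    # UK 1.5 s, Europe 26.0 s, USA ~937 s (~16 min) at 1 Gbit/s;
    # at ~200 ms rtt the figure approaches the ~28 min quoted on the slide.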
Slide: 10
TCP: Simple Tuning – Filling the Pipe
Remember, TCP has to hold a copy of the data in flight
The optimal (TCP buffer) window size depends on:
Bandwidth end to end, i.e. min(BW of the links), AKA the bottleneck bandwidth
Round Trip Time (RTT)
The number of bytes in flight to fill the entire path is the Bandwidth*Delay Product: BDP = RTT*BW
Can increase bandwidth by orders of magnitude
Windows are also used for flow control
[Diagram: sender/receiver timeline of segments and ACKs over one RTT; segment time on wire = bits in segment / BW]
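A sketch of the buffer-size arithmetic (illustrative values of mine, not from the slide):

    def bdp_bytes(bw_bps, rtt_s):
        """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
        return bw_bps * rtt_s / 8

    # 1 Gbit/s transatlantic path, rtt 150 ms:
    print(bdp_bytes(1e9, 0.150) / 1e6, "MB")   # ~18.75 MB of TCP buffer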
Slide: 11
Investigation of new TCP Stacks
The AIMD Algorithm – Standard TCP (Reno)
For each ACK in an RTT without loss:
cwnd -> cwnd + a/cwnd  (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b = 1/2)
High Speed TCP
a and b vary depending on the current cwnd, using a table
a increases more rapidly with larger cwnd – returns to the 'optimal' cwnd size sooner for the network path
b decreases less aggressively and, as a consequence, so does the cwnd; the effect is that there is not such a decrease in throughput
Scalable TCP
a and b are fixed adjustments for the increase and decrease of cwnd
a = 1/100 – the increase is greater than TCP Reno
b = 1/8 – the decrease on loss is less than TCP Reno
Scalable over any link speed
Fast TCP
Uses round trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
Also: HSTCP-LP, Hamilton-TCP, BiC-TCP
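A side-by-side sketch (my own simplification; HighSpeed TCP really looks a and b up in a table by cwnd, so it is omitted) of the per-ACK increase and per-loss decrease rules just listed:

    def reno(cwnd):     return cwnd + 1.0 / cwnd, 0.5     # a = 1,     b = 1/2
    def scalable(cwnd): return cwnd + 0.01, 1.0 / 8       # a = 1/100, b = 1/8

    for name, rule in (("Reno", reno), ("Scalable", scalable)):
        cwnd = 1000.0
        for _ in range(1000):            # roughly one RTT's worth of ACKs
            cwnd, b = rule(cwnd)
        print(name, round(cwnd - 1000), "segs gained/RTT,",
              round(cwnd * (1 - b)), "after one loss")
    # Reno: +1 seg/RTT, cwnd ~500 after a loss
    # Scalable: +10 segs/RTT, cwnd ~884 after a loss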
Slide: 12
Let's check out this theory about new TCP stacks
Does it matter?
Does it work?
Slide: 13
Packet Loss with new TCP Stacks
TCP Response Function: throughput vs loss rate – further to the right means faster recovery
Packets dropped in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
Slide: 14
High Throughput Demonstration
[Diagram: host man03 in Manchester (Geneva), 1 GEth to a Cisco 7609, Cisco GSRs across the 2.5 Gbit SDH MB-NG core, Cisco 7609 and 1 GEth to host lon01 in London (Chicago); dual 2.2 GHz Xeon PCs at each end.]
Send data with TCP; drop packets; monitor TCP with Web100
Slide: 15
High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network, rtt 128 ms, drop 1 in 10^6
High-Speed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
Slide: 16
Throughput for real users
Transfers in the UK for BaBar using MB-NG and SuperJANET4
Slide: 17
Topology of the MB-NG Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS access; MPLS; admin domains. Three domains – Manchester (man01, man02, man03), UCL (lon01, lon02, lon03) and RAL (ral01, ral02) – each with edge/boundary Cisco 7609 routers and HW RAID hosts at the endpoints, interconnected across the UKERNA development network.]
Slide: 18
Topology of the Production Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS access; 10 Gbit POS. Manchester domain (man01, HW RAID) to RAL domain (ral01, HW RAID) across 3 routers and 2 switches.]
Slide: 19
iperf Throughput + Web100
SuperMicro on MB-NG network: HighSpeed TCP, line speed 940 Mbit/s; DupACKs? <10 (expect ~400)
BaBar on production network: Standard TCP, 425 Mbit/s; DupACKs 350-400 – re-transmits
Slide: 20
Applications: Throughput Mbit/s
HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
[Chart: throughput for bbcp, bbftp, Apache, GridFTP]
Previous work used RAID0 (not disk limited)
Slide: 21
bbftp: What else is going on? Scalable TCP
SuperMicro + SuperJANET: instantaneous 220 - 625 Mbit/s
BaBar + SuperJANET: instantaneous 200 - 600 Mbit/s; disk-mem ~590 Mbit/s
Congestion window – duplicate ACK
Throughput variation not TCP related? Disk speed / bus transfer? Application?
Slide: 22
Average Transfer Rates Mbit/s

App       TCP Stack   SuperMicro    SuperMicro on   BaBar on      SC2004 on
                      on MB-NG      SuperJANET4     SuperJANET4   UKLight
iperf     Standard    940           350-370         425           940
          HighSpeed   940           510             570           940
          Scalable    940           580-650         605           940
bbcp      Standard    434           290-310         290           -
          HighSpeed   435           385             360           -
          Scalable    432           400-430         380           -
bbftp     Standard    400-410       325             320           825
          HighSpeed   370-390       380             -             -
          Scalable    430           345-532         380           875
apache    Standard    425           260             300-360       -
          HighSpeed   430           370             315           -
          Scalable    428           400             317           -
Gridftp   Standard    405           240             -             -
          HighSpeed   320           -               -             -
          Scalable    335           -               -             -

New stacks give more throughput
Rate decreases
Slide: 23
Transatlantic Disk to Disk Transfers
With UKLight
SuperComputing 2004
Slide: 24
SC2004 UKLIGHT Overview
[Network diagram: MB-NG 7600 OSR in Manchester, with UCL HEP and the UCL network, connects through ULCC to UKLight; UKLight 10G (four 1GE channels) to Chicago Starlight; SURFnet / EuroLink 10G (two 1GE channels) to Amsterdam; NLR lambda NLR-PITT-STAR-10GE-16 to the SC2004 show floor, with the Caltech booth (UltraLight IP, Caltech 7600) and the SLAC booth (Cisco 6509); K2 and Ci routers along the paths.]
Slide: 25
Transatlantic Ethernet: TCP Throughput Tests
Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz, 1500 byte MTU, 2.6.6 Linux kernel
Memory-to-memory TCP throughput, Standard TCP
Wire rate throughput of 940 Mbit/s
Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing
[Web100 plots: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms), for the full run and for the first 10 sec – InstantaneousBW, AveBW, CurCwnd.]
Slide: 26
SC2004 Disk-Disk bbftp
The bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
Move a 2 Gbyte file
Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
Disk-TCP-Disk at 1 Gbit/s
[Web100 plots: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms) for each stack – InstantaneousBW, AveBW, CurCwnd.]
Slide: 27
Network & Disk Interactions (work in progress)
Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size
Measure memory to RAID0 transfer rates with & without UDP traffic:
Disk write: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
[Plots: throughput (Mbit/s) vs trial number for the RAID0 6-disk 1 Gbyte 64k writes with and without UDP traffic, and % CPU kernel mode scatter plots (system mode L3+4 vs L1+2) for 8k and 64k transfers, with fits y = -1.017x + 178.3 and y = -1.048x + 174.4, i.e. roughly y = 178 - 1.05x.]
Slide: 28
Remote Computing Farms
in the ATLAS TDAQ Experiment
Slide: 29
Remote Computing Concepts
[Diagram: ATLAS detectors and the Level 1 Trigger feed the ROBs; the Data Collection Network links ROBs, L2PUs (Level 2 Trigger) and SFIs (Event Builders); local event processing farms (PFs) and SFOs write to mass storage (experimental area / CERN B513); the Back End Network, a switch and GÉANT lightpaths connect remote event processing farms (PFs) at Copenhagen, Edmonton, Krakow and Manchester. Labelled data rates: ~PByte/sec from the detectors, 320 MByte/sec into mass storage.]
Slide: 30
ATLAS Application Protocol
Event Request: EFD requests an event from SFI; SFI replies with the event, ~2 Mbytes
Processing of event
Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
tcpmon, an instrumented TCP request-response program, emulates the Event Filter EFD to SFI communication
[Diagram: time sequence between the Event Filter Daemon (EFD) and SFI/SFO – request event, send event data, process event, request buffer, send OK, send processed event; the request-response time is histogrammed.]
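A minimal sketch (my own, in the spirit of tcpmon rather than its actual code) of the request-response pattern being measured: send a small request, read a large response, and record the latency. The host name and port are placeholders.

    import socket, time

    def request_response(host, port, req=b"x" * 64, resp_bytes=1_000_000):
        """One 64 byte request, one ~1 Mbyte response; returns latency in s."""
        with socket.create_connection((host, port)) as s:
            t0 = time.perf_counter()
            s.sendall(req)
            got = 0
            while got < resp_bytes:          # read until the full event arrives
                chunk = s.recv(65536)
                if not chunk:
                    break
                got += len(chunk)
            return time.perf_counter() - t0

    # latencies = [request_response("sfi.example", 5000) for _ in range(100)]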
Slide: 31
tcpmon: TCP Activity Manc-CERN Req-Resp
Web100 instruments the TCP stack
Round trip time 20 ms; 64 byte request (green), 1 Mbyte response (blue)
TCP in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window gets re-set on each request – a TCP stack implementation detail to reduce Cwnd after inactivity (see the note after this slide)
Even after 10 s, each response takes 13 rtt or ~260 ms
Transfer achievable throughput: 120 Mbit/s
[Web100 plots vs time (ms): DataBytesOut/DataBytesIn deltas, CurCwnd, and TCP achieved bandwidth (Mbit/s).]
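One plausible tuning behind the improvement on the next slide (an assumption on my part; the talk only says the stack was tuned): later Linux kernels expose the RFC 2861 decay of cwnd after idle as a sysctl, which can be switched off for this request-response pattern.

    # Disable cwnd decay after idle (Linux, needs root). Note: this sysctl
    # appeared in kernels later than the 2.6.6 used in the talk.
    with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w") as f:
        f.write("0")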
Slide: 32
tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
Round trip time 20 ms; 64 byte request (green), 1 Mbyte response (blue)
TCP starts in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window grows nicely; response takes 2 rtt after ~1.5 s
Rate ~10/s (with 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
[Web100 plots vs time (ms): DataBytesOut/DataBytesIn deltas, PktsOut/PktsIn with CurCwnd, and TCP achieved bandwidth (Mbit/s).]
Slide: 33
tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
Round trip time 150 ms; 64 byte request (green), 1 Mbyte response (blue)
TCP starts in slow start: 1st event takes 11 rtt or ~1.67 s
TCP congestion window in slow start to ~1.8 s, then congestion avoidance
Response in 2 rtt after ~2.5 s; rate 2.2/s (with 50 ms wait)
Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Web100 plots vs time (ms): DataBytesOut/DataBytesIn deltas, PktsOut/PktsIn with CurCwnd, and TCP achieved bandwidth (Mbit/s).]
Slide: 34
Time Series of Request-Response Latency
Alberta – CERN: round trip time 150 ms; 1 Mbyte of data returned; stable for ~150 s at 300 ms, then falls to 160 ms with ~80 μs variation
Manchester – CERN: round trip time 20 ms; 1 Mbyte of data returned; stable for ~18 s at ~42.5 ms, then alternate points at 29 & 42.5 ms
[Plots: round trip latency (ms) vs request time (s) for each path, with a zoom of the 160.30-160.60 ms band for Alberta.]
Slide: 35
Radio Astronomy
e-VLBI
Slide: 36
e-VLBI at the GÉANT2 Launch, Jun 2005
[Map: Jodrell Bank (UK), Medicina (Italy) and Torun (Poland) connected to Dwingeloo over a DWDM link.]
Slide: 37
e-VLBI UDP Data Streams
Slide: 38
UDP Performance: 3 Flows on GÉANT
Throughput, 5 hour run:
Jodrell → JIVE: 2.0 GHz dual Xeon – 2.4 GHz dual Xeon, 670-840 Mbit/s
Medicina (Bologna) → JIVE: 800 MHz PIII – mark623 1.2 GHz PIII, 330 Mbit/s, limited by the sending PC
Torun → JIVE: 2.4 GHz dual Xeon – mark575 1.2 GHz PIII, 245-325 Mbit/s, limited by security policing (>400 Mbit/s → 20 Mbit/s)?
Throughput over a 50 min period: the period is ~17 min
[Plots, BW 14Jun05: received wire rate (Mbit/s) vs time (10 s steps) for Jodrell, Medicina and Torun – full run and the 50 min zoom.]
Slide: 39
UDP Performance: 3 Flows on GÉANT
Packet loss & re-ordering:
Jodrell (2.0 GHz Xeon): loss 0 - 12%, re-ordering significant
Medicina (800 MHz PIII): loss ~6%, re-ordering insignificant
Torun (2.4 GHz Xeon): loss 6 - 12%, re-ordering insignificant
[Plots, 14Jun05: number re-ordered and number lost vs time (10 s bins) for each site.]
Slide: 40
18 Hour Flows on UKLight: Jodrell – JIVE, 26 June 2005
Throughput, Jodrell → JIVE: 2.4 GHz dual Xeon – 2.4 GHz dual Xeon, 960-980 Mbit/s
Traffic through SURFnet
Packet loss: only 3 groups with 10-150 lost packets each; no packets lost the rest of the time
Packet re-ordering: none
[Plots, man03-jivegig1 26Jun05: received wire rate (Mbit/s) vs time (10 s steps), a zoom of the 900-1000 Mbit/s band, and packet loss on a log scale.]
Slide: 41
Summary & Conclusions
The end hosts themselves: the performance of motherboards, NICs, RAID controllers and disks matters
Plenty of CPU power is required to sustain Gigabit transfers, for the TCP/IP stack as well as the application
Packets can be lost in the IP stack due to lack of processing power
New TCP stacks are stable and give better response & performance
Still need to set the TCP buffer sizes! (see the sketch after this slide)
Check other kernel settings, e.g. window-scale
Take care over the difference between the Protocol and the Implementation
Packet loss is a killer: check campus links & equipment, and access links to backbones
Application architecture & implementation is also important
The interaction between HW, protocol processing and the disk sub-system is complex
The work is applicable to other areas including: remote iSCSI; remote database accesses; real-time Grid computing, e.g. real-time interactive medical image processing
MB-NG
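A minimal sketch of the buffer-size tuning the summary calls for (the values are mine, sized for a 1 Gbit/s, 150 ms path; not from the talk):

    import socket

    BDP = int(1e9 * 0.150 / 8)          # ~19 MB for 1 Gbit/s at 150 ms rtt

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel for send/receive buffers of about one BDP.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BDP)
    # The kernel caps these at net.core.wmem_max / rmem_max, so those
    # sysctls must be raised as well for the request to take effect.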
Slide: 42
More Information – Some URLs
Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
UKLight web site: http://www.uklight.ac.uk
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Slide: 43
Any Questions?
Slide: 44
Backup Slides
Slide: 45
Multi-Gigabit Flows at the SC2003 BW Challenge
Three server systems with 10 Gigabit Ethernet NICs
Used the DataTAG altAIMD stack, 9000 byte MTU
Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
Palo Alto PAIX: rtt 17 ms, window 30 MB; shared with the Caltech booth; 4.37 Gbit HighSpeed TCP I=5%; then 2.87 Gbit I=16%, falling when 10 Gbit on the link; 3.3 Gbit Scalable TCP I=8%; tested 2 flows, sum 1.9 Gbit I=39%
Chicago Starlight: rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz; 3.1 Gbit HighSpeed TCP I=1.6%
Amsterdam SARA: rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz; 4.35 Gbit HighSpeed TCP I=6.9%; very stable
Both used Abilene to Chicago
[Plots, 19 Nov 2003 15:59-17:25: throughput (Gbit/s) vs time for SC2003 → PAIX (HS-TCP and two Scalable-TCP flows, with router traffic to LA/PAIX) and for SC2003 → Chicago & Amsterdam (with router traffic to Abilene).]
Slide: 46
Latency Measurements
UDP/IP packets sent between back-to-back systems
Processed in a similar manner to TCP/IP, but not subject to flow control & congestion avoidance algorithms
Used the UDPmon test program
Latency: round trip times measured using request-response UDP frames; latency as a function of frame size
The slope is the sum over the data paths of the inverse data rates, s = Σ_paths 1/(db/dt):
mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
The intercept indicates processing times + HW latencies
Histograms of 'singleton' measurements tell us about: the behaviour of the IP stack, the way the HW operates, and interrupt coalescence
Slide: 47
Throughput Measurements
UDP Throughput: send a controlled stream of UDP frames spaced at regular intervals
[Diagram: sender/receiver time sequence. Zero stats (OK done); send n-byte data frames at regular intervals (number of packets, wait time); time to send, time to receive, inter-packet time (histogram); signal end of test (OK done); get remote statistics. Statistics sent back: number received; number lost + loss pattern; number out-of-order; CPU load & number of interrupts; 1-way delay.]
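A minimal sketch (my own, in the spirit of UDPmon rather than its actual code) of sending such a paced UDP stream; host and port are placeholders:

    import socket, time

    def send_stream(host, port, n_bytes=1472, n_packets=1000, wait_s=20e-6):
        """Send n_packets UDP frames of n_bytes, spaced wait_s apart."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        payload = bytes(n_bytes)
        t0 = time.perf_counter()
        for i in range(n_packets):
            sock.sendto(payload, (host, port))
            # Busy-wait to the next slot: sleep() is too coarse for ~us spacing.
            while time.perf_counter() < t0 + (i + 1) * wait_s:
                pass
        return time.perf_counter() - t0   # time to send

    # elapsed = send_stream("receiver.example", 14196)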
Slide: 48
PCI Bus & Gigabit Ethernet Activity
PCI activity measured with a logic analyser:
PCI probe cards in the sending PC
Gigabit Ethernet fibre probe card
PCI probe cards in the receiving PC
[Diagram: CPU, memory, chipset and NIC in each PC; the PCI buses and the Gigabit Ethernet link between the NICs are instrumented, all feeding the logic analyser display. Possible bottlenecks are highlighted.]
Slide: 49
"Server Quality" Motherboards
SuperMicro P4DP8-2G (P4DP6)
Dual Xeon; 400/533 MHz front side bus
6 PCI/PCI-X slots; 4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
Dual Gigabit Ethernet
Adaptec AIC-7899W dual channel SCSI
UDMA/100 bus master/EIDE channels; data transfer rates of 100 MB/sec burst
Slide: 50
"Server Quality" Motherboards
Boston/Supermicro H8DAR
Two dual-core Opterons
200 MHz DDR memory: theory BW 6.4 Gbit
HyperTransport
2 independent PCI buses: 133 MHz PCI-X
2 Gigabit Ethernet; SATA; (PCI-e)
Slide: 51
End Hosts & NICs: CERN-nat-Manc
Use UDP packets to characterise host, NIC & network: request-response latency, throughput, packet loss, re-ordering
SuperMicro P4DP8 motherboard; dual Xeon 2.2 GHz CPU; 400 MHz system bus; 64 bit 66 MHz PCI / 133 MHz PCI-X bus
[Plots, pcatb121-nat-gig6 13Aug04: received wire rate (Mbit/s), % packet loss and number re-ordered vs spacing between frames (μs) for frame sizes 50-1472 bytes; latency histograms N(t) for 256, 512 and 1400 byte frames over 20900-21500 μs.]
The network can sustain 1 Gbit/s of UDP traffic
The average server can lose smaller packets
Packet loss is caused by lack of power in the PC receiving the traffic
Out-of-order packets are due to WAN routers; lightpaths look like extended LANs and have no re-ordering
Slide: 52
TCP (Reno) – Details
The time for TCP to recover its throughput from 1 lost packet, for a link of capacity C, is given by:

    τ = C * RTT² / (2 * MSS)

for an rtt of ~200 ms: ~2 min
[Plot: time to recover (sec, log scale) vs rtt (ms), one curve per link speed from 10 Mbit to 10 Gbit; annotated rtts: UK 6 ms, Europe 20 ms, USA 150 ms.]
Slide: 53
Network & Disk Interactions
Disk write, mem-disk: 1735 Mbit/s; tends to be in 1 die
Disk write + UDP 1500, mem-disk: 1218 Mbit/s; both dies at ~80%
Disk write + CPU mem, mem-disk: 1341 Mbit/s; 1 CPU at ~60%, the other at 20%; large user-mode usage; below the cut = high BW; high BW = die 1 used
Disk write + CPU load, mem-disk: 1334 Mbit/s; 1 CPU at ~60%, the other at 20%; all CPUs saturated in user mode
[Scatter plots for the RAID0 6-disk 1 Gbyte writes (8k and 64k transfers, 3w8506-8 controller; write, udp, membw and cpuload runs, Dec04-Jan05): total and kernel-mode CPU load, % CPU system mode L3+4 vs % CPU system mode L1+2, with fits y = -1.017x + 178.3 / y = -1.048x + 174.4 (64k write) and y = -1.022x + 215.6 / y = -1.053x + 206.5 (8k write), plus the cut line y = 178 - 1.05x; also throughput (Mbit/s) vs trial number with the L3+L4 < cut selection.]
Slide: 54
TCP Fast Retransmit & Recovery
Duplicate ACKs are due to lost segments or segments out of order
Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the transmitting host sends the missing segment
Set ssthresh to 0.5*cwnd – so enter the congestion avoidance phase
Set cwnd = 0.5*cwnd + 3 – the 3 dup ACKs
Increase cwnd by 1 segment for each further duplicate ACK
Keep sending new data if allowed by cwnd
Set cwnd to half the original value on a new ACK – no need to go into "slow start" again
At steady state, CWND oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again
[Diagram: cwnd vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; on timeout, slow start again.]
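A sketch (my own condensation of the steps above, not kernel code) of the fast retransmit / fast recovery state changes:

    def retransmit_missing_segment():
        pass  # stand-in for resending the first unacknowledged segment

    def on_dup_ack(state):
        """Per duplicate ACK. state: {'cwnd': .., 'ssthresh': .., 'dup': ..}"""
        state["dup"] += 1
        if state["dup"] == 3:                       # fast retransmit
            state["ssthresh"] = 0.5 * state["cwnd"]
            state["cwnd"] = state["ssthresh"] + 3   # inflate by the 3 dup ACKs
            retransmit_missing_segment()
        elif state["dup"] > 3:
            state["cwnd"] += 1                      # keep new data flowing

    def on_new_ack(state):
        """ACK of new data ends recovery: cwnd drops to half its old value,
        with no return to slow start."""
        state["dup"] = 0
        state["cwnd"] = state["ssthresh"]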
Slide: 55
Packet Loss and new TCP Stacks
TCP Response Function
UKLight London-Chicago-London; rtt 177 ms; 2.6.6 kernel
Agreement with theory good
Some new stacks good at high loss rates
[Plots, sculcc1-chi-2 iperf 13Jan05: TCP achievable throughput (Mbit/s, log and linear scales) vs packet drop rate (1 in n) for A0 1500 (standard), A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A7 Vegas and A8 Westwood, with the A0 and Scalable theory curves.]