TRANSCRIPT
High Performance Networking for ALL
R. Hughes-Jones, Manchester
GridPP Meeting, Edinburgh, 4-5 Feb 2004
Members of GridPP are in many network collaborations, including:
MB-NG
Close links with:
SLAC; UKERNA, SURFnet and other NRNs; Dante; Internet2; Starlight, Netherlight; GGF; RIPE; Industry…
Network Monitoring [1]
Architecture
DataGrid WP7 code extended by Gareth (Manchester)
Technology transfer to UK e-Science
Developed by Mark Lees (DL)
Fed back into DataGrid by Gareth
Links to GGF NM-WG, Dante, Internet2: characteristics, schema & web services
Success
Network Monitoring [2]
[Plots: scheduled TCP iperf throughput, 24 Jan to 4 Feb 04]
TCP iperf, RAL to HEP sites: only 2 sites >80 Mbit/s
TCP iperf, DL to HEP sites
HELP!
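The plots above come from scheduled TCP iperf tests between sites. For readers unfamiliar with what such a test actually does, here is a minimal memory-to-memory TCP throughput probe in Python; a sketch only, not iperf or the WP7 monitoring code, and the port and block size are arbitrary placeholders.

```python
# tcp_mem_probe.py - minimal memory-to-memory TCP throughput test in the spirit
# of the scheduled iperf measurements above (illustration only).
import socket, sys, time

PORT = 5002                    # placeholder port
DURATION = 10.0                # seconds to transmit
BLOCK = b"\0" * 65536          # 64 kByte send buffer

def sink():
    """Accept one connection and discard everything received, reporting the rate."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    total, t0 = 0, time.perf_counter()
    while True:
        data = conn.recv(65536)
        if not data:
            break
        total += len(data)
    dt = time.perf_counter() - t0
    print(f"received {total} bytes in {dt:.1f} s = {8 * total / dt / 1e6:.1f} Mbit/s")

def source(host):
    """Send zero-filled blocks for DURATION seconds and report the rate."""
    s = socket.socket()
    s.connect((host, PORT))
    total, t0 = 0, time.perf_counter()
    while time.perf_counter() - t0 < DURATION:
        s.sendall(BLOCK)
        total += len(BLOCK)
    s.close()
    dt = time.perf_counter() - t0
    print(f"sent {total} bytes in {dt:.1f} s = {8 * total / dt / 1e6:.1f} Mbit/s")

if __name__ == "__main__":
    # usage: no argument = receiver (sink); "<host>" = sender (source)
    sink() if len(sys.argv) == 1 else source(sys.argv[1])
```

Run it with no arguments on the receiving host and with the receiver's hostname as the argument on the sending host; each end prints the achieved Mbit/s.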
RIPE-47, Amsterdam, 29 January 2004
High bandwidth, long distance… Where is my throughput?
Robin Tasker, CCLRC, Daresbury Laboratory, UK
DataTAG is a project sponsored by the European Commission: EU Grant IST-2001-32459
Throughput… What’s the problem?
One Terabyte of data transferred in less than an hour
On February 27-28 2003, the transatlantic DataTAG network was extended: CERN - Chicago - Sunnyvale (>10,000 km).
For the first time, a terabyte of data was transferred across the Atlantic in less than one hour using a single TCP (Reno) stream.
The transfer ran from Sunnyvale to Geneva at a rate of 2.38 Gbits/s.
Internet2 Land Speed Record
On October 1 2003, DataTAG set a new Internet2 Land Speed Record by transferring 1.1 Terabytes of data in less than 30 minutes from Geneva to Chicago across the DataTAG provision, corresponding to an average rate of 5.44 Gbits/s using a single TCP (Reno) stream
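As a quick check of the quoted figures (my arithmetic, taking 1 Terabyte as 10^12 bytes):

$$\frac{10^{12}\ \text{bytes}\times 8}{2.38\times 10^{9}\ \text{bit/s}} \approx 3.4\times 10^{3}\ \text{s} \approx 56\ \text{min}\ (<1\ \text{hour}), \qquad \frac{1.1\times 10^{12}\times 8}{5.44\times 10^{9}} \approx 1.6\times 10^{3}\ \text{s} \approx 27\ \text{min}\ (<30\ \text{min}).$$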
So how did we do that? Management of the End-to-End Connection
Memory-to-Memory transfer; no disk system involved
Processor speed and system bus characteristics
TCP Configuration – window size and frame size (MTU); see the window-size estimate after this list
Network Interface Card and associated driver and their configuration
End-to-End “no loss” environment from CERN to Sunnyvale!
At least a 2.5 Gbits/s capacity pipe on the end-to-end path
A single TCP connection on the end-to-end path
No real user application
That’s to say - not the usual User experience!
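On the window-size point above: a single TCP stream only fills the pipe if its window covers the bandwidth-delay product of the path. The RTT is not quoted on the slide; assuming roughly 180 ms for CERN - Sunnyvale,

$$\text{window} \ge \text{BW}\times\text{RTT} = 2.5\ \text{Gbit/s}\times 0.18\ \text{s} \approx 450\ \text{Mbit} \approx 56\ \text{MBytes},$$

far above default socket-buffer sizes, which is why the window (and the MTU, to reduce per-packet overhead) had to be tuned.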
Realistically – what’s the problem & why do network research?
End System Issues:
Network Interface Card and driver, and their configuration
TCP and its configuration
Operating system and its configuration
Disk system
Processor speed
Bus speed and capability
Network Infrastructure Issues:
Obsolete network equipment
Configured bandwidth restrictions
Topology
Security restrictions (e.g., firewalls)
Sub-optimal routing
Transport protocols
Network Capacity and the influence of Others!
Many, many TCP connections
Mice and Elephants on the path
Congestion
End Hosts: Buses, NICs and Drivers
Latency
Throughput
Bus Activity
Use UDP packets to characterise the Intel PRO/10GbE Server Adaptor (a minimal sketch of this kind of UDP probe follows the list)
SuperMicro P4DP8-G2 motherboard
Dual Xeon 2.2 GHz CPUs
400 MHz system bus
133 MHz PCI-X bus
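This is not the UDPmon tool used for these measurements; it is just a minimal, self-contained sketch of the kind of UDP request-response latency probe involved (port, packet count and size are arbitrary placeholders).

```python
# udp_latency_probe.py - minimal UDP round-trip latency probe (illustration only,
# not the UDPmon tool used for the measurements on this slide).
import socket, sys, time

PORT = 5001          # placeholder port
NPKTS = 1000         # number of request/response packets
SIZE = 64            # payload size in bytes

def reflector():
    """Echo every UDP packet straight back to its sender."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    while True:
        data, addr = s.recvfrom(65535)
        s.sendto(data, addr)

def probe(host):
    """Send NPKTS packets, time each round trip, report the mean latency."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(1.0)
    payload = b"x" * SIZE
    rtts = []
    for _ in range(NPKTS):
        t0 = time.perf_counter()
        s.sendto(payload, (host, PORT))
        try:
            s.recvfrom(65535)
        except socket.timeout:
            continue                      # count as a lost packet
        rtts.append(time.perf_counter() - t0)
    if rtts:
        print(f"{len(rtts)}/{NPKTS} replies, mean RTT {1e6 * sum(rtts) / len(rtts):.1f} us")

if __name__ == "__main__":
    # usage: "reflect" on one host, "<other-host>" on the other
    if sys.argv[1:] == ["reflect"]:
        reflector()
    else:
        probe(sys.argv[1])
```

Run `python udp_latency_probe.py reflect` on one host and `python udp_latency_probe.py <other-host>` on the other; histogramming the individual RTTs gives a latency distribution for the NIC and host under test.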
End Hosts: Understanding NIC Drivers
Linux driver basics – TX:
Application makes a system call
Data is encapsulated in UDP/TCP and IP headers
Packet is enqueued on the device send queue
Driver places the information in the DMA descriptor ring
NIC reads the data from main memory via DMA and sends it on the wire
NIC signals to the processor that the TX descriptor has been sent
Linux driver basics – RX:
NIC places data in main memory via DMA to a free RX descriptor
NIC signals that the RX descriptor has data
Driver passes the frame to the IP layer and cleans the RX descriptor
IP layer passes the data to the application
Linux NAPI driver model:
On receiving a packet, the NIC raises an interrupt
Driver switches off RX interrupts and schedules an RX DMA ring poll
Frames are pulled off the DMA ring and processed up to the application
When all frames are processed, RX interrupts are re-enabled
Dramatic reduction in RX interrupts under load
See: Improving the performance of a Gigabit Ethernet driver under Linux
http://datatag.web.cern.ch/datatag/papers/drafts/linux_kernel_map/
Protocols: TCP (Reno) – Performance
AIMD and high-bandwidth, long-distance networks
Poor performance of TCP in high-bandwidth wide-area networks is due in part to the TCP congestion control algorithm:
For each ACK in an RTT without loss: cwnd → cwnd + a/cwnd (Additive Increase, a = 1)
For each window experiencing loss: cwnd → cwnd − b·cwnd (Multiplicative Decrease, b = ½)
Protocols: HighSpeed TCP & Scalable TCP
Adjusting the AIMD algorithm – TCP Reno:
For each ACK in an RTT without loss: cwnd → cwnd + a/cwnd (Additive Increase, a = 1)
For each window experiencing loss: cwnd → cwnd − b·cwnd (Multiplicative Decrease, b = ½)
HighSpeed TCP: a and b vary depending on the current cwnd, where
a increases more rapidly with larger cwnd and, as a consequence, the connection returns to the 'optimal' cwnd for the network path sooner; and
b decreases less aggressively and, as a consequence, so does cwnd, so throughput does not drop as sharply on loss.
Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd, such that the increase is greater than for TCP Reno and the decrease on loss is smaller.
A toy comparison of the Reno and Scalable update rules is sketched below.
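Here is the toy numerical comparison of the Reno and Scalable update rules referenced above. This is my own sketch: one ACK per segment, no slow start and no timeouts; the Scalable TCP constants (a = 0.01 per ACK, b = 0.125 on loss) are the ones usually quoted for Kelly's algorithm, and HighSpeed TCP's cwnd-dependent a and b (RFC 3649) are not modelled.

```python
# cwnd_toy.py - how long does congestion avoidance take to refill a large pipe
# after a single loss?  Purely illustrative (no slow start, no timeouts).

def reno(cwnd, lost):
    if lost:
        return cwnd * 0.5        # multiplicative decrease, b = 1/2
    return cwnd + 1.0            # +a/cwnd per ACK, cwnd ACKs per RTT => +1 per RTT

def scalable(cwnd, lost):
    # constants usually quoted for Scalable TCP: +0.01 per ACK, -12.5% on loss
    if lost:
        return cwnd * (1.0 - 0.125)
    return cwnd * 1.01           # +0.01 per ACK, cwnd ACKs per RTT => *1.01 per RTT

def rtts_to_recover(update, target=1000.0):
    """After one loss at cwnd = target segments, count RTTs until cwnd regains target."""
    cwnd = update(target, lost=True)
    rtts = 0
    while cwnd < target:
        cwnd = update(cwnd, lost=False)
        rtts += 1
    return rtts

if __name__ == "__main__":
    for name, rule in (("TCP Reno", reno), ("Scalable TCP", scalable)):
        print(f"{name:13s}: {rtts_to_recover(rule):4d} RTTs to regain cwnd = 1000 segments")
```

With a 1000-segment window (roughly the BDP of a 1 Gbit/s path with ~12 ms RTT and 1500-byte packets), Reno needs about 500 RTTs to recover from a single loss, while Scalable TCP needs about 14.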
Protocols: HighSpeed TCP & Scalable TCP
[Plots: HighSpeed TCP and Scalable TCP]
HighSpeed TCP implemented by Gareth (Manchester)
Scalable TCP implemented by Tom Kelly (Cambridge)
Integration of the stacks into the DataTAG kernel: Yee (UCL) + Gareth
Success
Some Measurements of Throughput: CERN - SARA
Using the GÉANT backup link; 1 GByte file transfers (blue: data, red: TCP ACKs)
[Plots: interface rate and receive rate (Mbit/s) vs time for each stack]
Standard TCP (txlen 100, 25 Jan 03): average throughput 167 Mbit/s
Users see 5 - 50 Mbit/s!
HighSpeed TCP (txlen 2000, 26 Jan 03): average throughput 345 Mbit/s
Scalable TCP (txlen 2000, 27 Jan 03): average throughput 340 Mbit/s
Users, The Campus & the MAN [1]
NNW to SJ4 access: 2.5 Gbit PoS; hits 1 Gbit 50%
Manchester to NNW access: 2 * 1 Gbit Ethernet
Pete White, Pat Meyrs
[Plot: traffic in/out (Mbit/s, 0-250), 24-31 Jan 2004]
Users, The Campus & the MAN [2]
LMN to site 1 access: 1 Gbit Ethernet; LMN to site 2 access: 1 Gbit Ethernet
[Plots: traffic in from / out to SJ4 (Mbit/s), 24-31 Jan 2004; ULCC-JANET traffic, 30/1/2004; traffic in/out of site 1 (Mbit/s), 24-31 Jan 2004]
Message:
Continue to work with your network group
Understand the traffic levels
Understand the Network Topology
10 GigEthernet: Tuning PCI-X
[Logic-analyser plots: PCI-X transfer of 16114 bytes with mmrbc = 512, 1024, 2048 and 4096 bytes; 2048-byte PCI-X segments and CSR accesses are visible]
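mmrbc is the PCI-X maximum memory read byte count, i.e. the largest burst the NIC may request in a single read transaction; the traces show the same 16114-byte transfer broken into progressively fewer, longer bursts as mmrbc is raised. Roughly (my arithmetic):

$$\lceil 16114/512 \rceil = 32\ \text{bursts} \qquad\text{vs}\qquad \lceil 16114/4096 \rceil = 4\ \text{bursts},$$

so raising mmrbc from 512 to 4096 bytes cuts the number of read transactions, and their per-transaction protocol overhead, by about a factor of eight.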
10 GigEthernet at SC2003 BW Challenge (Phoenix)
Three server systems with 10 GigEthernet NICs; used the DataTAG altAIMD stack; 9000 byte MTU
Streams from the SLAC/FNAL booth in Phoenix to:
Palo Alto PAIX, 17 ms rtt
Chicago Starlight, 65 ms rtt
Amsterdam SARA, 175 ms rtt
[Plots, 19 Nov 03: 10 Gbit/s throughput from SC2003 to PAIX (Phoenix-PAIX with HS-TCP and two Scalable-TCP streams, plus router traffic to LA/PAIX); 10 Gbit/s throughput from SC2003 to Chicago & Amsterdam (Phoenix-Chicago and Phoenix-Amsterdam, plus router traffic to Abilene)]
Helping Real Users [1]: Radio Astronomy VLBI
Proof of concept with NRNs & GÉANT: 1024 Mbit/s, 24 on 7, NOW
VLBI Project: Throughput, Jitter & 1-way Delay
1472 byte packets, Manchester -> Dwingeloo (JIVE): jitter FWHM 22 µs (back-to-back 3 µs)
[Histogram: jitter (µs), 1472 bytes, w=50, Gnt5-DwMk5, 28 Oct 03]
1-way delay – note the packet loss (points with zero 1-way delay)
[Plots: 1-way delay (µs) vs packet number, 1472 bytes, w=12, Gnt5-DwMk5, 21 Oct 03]
[Plot: received wire rate (Mbit/s) vs spacing between frames (µs), 1472 bytes, Gnt5-DwMk5 11 Nov 03 and DwMk5-Gnt5 13 Nov 03]
VLBI Project: Packet Loss Distribution
Measure the time between lost packets in the time series of packets sent.
Lost 1410 packets in 0.6 s. Is it a Poisson process?
Assume the Poisson process is stationary: λ(t) = λ
Use the probability density function: P(t) = λ e^(−λt)
Mean λ = 2360 /s, i.e. a mean of 426 µs between losses
Plot the log: slope −0.0028, expect −0.0024
Could be an additional process involved
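The expected slope follows directly from the measured loss rate: for a stationary Poisson process the times between losses are exponentially distributed, so a log plot of the histogram should be a straight line of slope −λ. With the slide's numbers,

$$\lambda \approx 2360\ \mathrm{s^{-1}} = 2.36\times 10^{-3}\ \mathrm{\mu s^{-1}} \;\Rightarrow\; \ln P(t) = \ln\lambda - \lambda t, \quad \text{slope} \approx -0.0024\ \mathrm{\mu s^{-1}}.$$

The measured slope of −0.0028 is steeper than this, which is why an additional (non-Poisson) loss process is suspected.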
[Histogram: packet-loss distribution, 12 µs bins; number in bin vs time between lost frames (µs), measured vs Poisson]
[Log plot with exponential fits: measured y = 41.832 e^(−0.0028x), Poisson y = 39.762 e^(−0.0024x)]
VLBI Traffic Flows – only testing!
Manchester – NetNorthWest – SuperJANET access links: two 1 Gbit/s
Access links: SJ4 to GÉANT; GÉANT to SURFnet
Throughput & PCI transactions on the Mark5 PC
Test pattern: read/write n bytes, then wait a set time
Mark5 uses a Supermicro P3TDLE motherboard:
1.2 GHz PIII
Memory bus 133/100 MHz
2 * 64-bit 66 MHz PCI
4 * 32-bit 33 MHz PCI
[Diagram: SuperStor input card, NIC, IDE disc pack and Ethernet on the PCI buses, with a logic analyser display attached]
PCI Activity: Read Multiple Data Blocks, 0 wait
Read 999424 bytes; each data block: setup CSRs, data movement, update CSRs
For 0 wait between reads: data blocks ~600 µs long, take ~6 ms, then a 744 µs gap
PCI transfer rate 1188 Mbit/s (148.5 Mbytes/s); Read_sstor rate 778 Mbit/s (97 Mbyte/s)
PCI bus occupancy: 68.44%
Concern about Ethernet traffic: 64-bit 33 MHz PCI needs ~82% occupancy for 930 Mbit/s, so expect ~360 Mbit/s (see the arithmetic below)
[Logic-analyser trace: data transfer and CSR access; PCI bursts of 4096 bytes; data blocks of 131,072 bytes]
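One way to read those last figures (my own arithmetic, assuming effective throughput scales linearly with bus occupancy): the disk reads deliver 778 Mbit/s for 68.44% occupancy, so

$$930\ \mathrm{Mbit/s}\times\frac{68.44\%}{778\ \mathrm{Mbit/s}} \approx 82\%, \qquad \frac{100\% - 68.44\%}{68.44\%}\times 778\ \mathrm{Mbit/s} \approx 360\ \mathrm{Mbit/s},$$

i.e. a full-rate Gigabit Ethernet stream would need ~82% of the bus on its own, and with the disk traffic already using ~68% only about 360 Mbit/s of network traffic can be expected.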
PCI Activity: Read Throughput
Flat, then 1/t dependence: ~860 Mbit/s for read blocks >= 262144 bytes
CPU load ~20%; concern about the CPU load needed to drive a Gigabit link
Helping Real Users [2]: HEP
BaBar & CMS Application Throughput
BaBar Case Study: Disk Performance
BaBar disk server: Tyan Tiger S2466N motherboard, one 64-bit 66 MHz PCI bus, Athlon MP2000+ CPU, AMD-760 MPX chipset, 3Ware 7500-8 RAID5, 8 * 200 GB Maxtor IDE 7200 rpm disks
Note the VM parameter readahead max
Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as RAID0]
BaBar: Serial ATA RAID Controllers
3Ware controller, 66 MHz PCI:
[Plots: read and write throughput (Mbit/s) vs file size (0-2000 MBytes), RAID5 over 4 SATA disks, for readahead max = 31, 63, 127, 256, 512 and 1200]
ICP controller, 66 MHz PCI:
[Plots: read and write throughput (Mbit/s) vs file size (0-2000 MBytes), RAID5 over 4 SATA disks, same readahead max settings]
BaBar Case Study: RAID Throughput & PCI Activity
3Ware 7500-8 RAID5, parallel EIDE; 3Ware forces the PCI bus to 33 MHz
BaBar Tyan to MB-NG SuperMicro:
Network memory-to-memory: 619 Mbit/s
Disk to disk throughput with bbcp: 40-45 Mbytes/s (320-360 Mbit/s)
PCI bus effectively full!
[Traces: read from RAID5 disks; write to RAID5 disks]
BaBar Case Study: MB-NG and SuperJANET4 Development Network
[Diagram: BaBar PCs with 3ware RAID5 at RAL and Manchester (MCC), connected via OSM-1OC48-POS-SS routers across the MB-NG and SJ4 Dev domains; Gigabit Ethernet, 2.5 Gbit POS access, 2.5 Gbit POS core, MPLS admin. domains]
Status / Tests:
Manchester host has the DataTAG TCP stack; RAL host now available
BaBar-BaBar memory-to-memory
BaBar-BaBar real data over MB-NG
BaBar-BaBar real data over SJ4
mbng-mbng real data over MB-NG
mbng-mbng real data over SJ4
Different TCP stacks already installed
Study of Applications: MB-NG and SuperJANET4 Development Network
[Diagram: as on the previous slide, but with PCs with 3ware RAID0 at UCL and Manchester (MCC) across the MB-NG and SJ4 Dev domains]
24 Hours HighSpeed TCP memory-to-memory
TCP mem-mem lon2-man1 (Tx 64, Tx-abs 64, Rx 64, Rx-abs 128): 941.5 Mbit/s +/- 0.5 Mbit/s
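941.5 Mbit/s is essentially the theoretical ceiling for TCP payload over Gigabit Ethernet with standard 1500-byte frames and TCP timestamps enabled (my calculation, not stated on the slide):

$$\frac{1500 - 20 - 20 - 12}{1500 + 38}\times 1000\ \mathrm{Mbit/s} = \frac{1448}{1538}\times 1000 \approx 941.5\ \mathrm{Mbit/s},$$

where the 38 bytes are the Ethernet preamble, header, FCS and inter-frame gap, and 20 + 20 + 12 bytes are the IP and TCP headers plus the timestamp option.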
Gridftp Throughput, HighSpeed TCP
Interrupt coalescence 64/128, txqueuelen 2000, TCP buffer 1 Mbyte (rtt*BW = 750 kbytes)
[Plots: interface throughput; ACKs received]
Data moved at 520 Mbit/s
Same for back-to-back tests
So it's not that simple!
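A quick check of that buffer sizing (my arithmetic, assuming the ~1 Gbit/s MB-NG path): the stated rtt*BW product of 750 kbytes implies an RTT of about 6 ms,

$$\mathrm{RTT} \approx \frac{750\times 10^{3}\ \text{bytes}\times 8}{10^{9}\ \text{bit/s}} = 6\ \text{ms},$$

and the configured 1 Mbyte TCP buffer comfortably exceeds 750 kbytes, so socket-buffer size alone should not be what limits the 520 Mbit/s observed.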
Gridftp Throughput + Web100
Throughput (Mbit/s): see it alternate between 600/800 Mbit/s and zero
Cwnd is smooth; no duplicate ACKs / send stalls / timeouts
http data transfers HighSpeed TCP
Apache web server out of the box!
Prototype client using the curl http library; 1 Mbyte TCP buffers; 2 Gbyte file; throughput 72 MBytes/s
Cwnd: some variation; no duplicate ACKs / send stalls / timeouts
More Information: Some URLs
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html