Networkshop March 2005 Richard Hughes-Jones Manchester Bandwidth Challenge, Land Speed Record, TCP/IP and You


Page 1:

Networkshop March 2005
Richard Hughes-Jones, Manchester

Bandwidth Challenge, Land Speed Record, TCP/IP and You

Page 2:

The SC Network

Working with S2io, Cisco & folks

At the SLAC Booth, running the BW Challenge

Bandwidth Lust at SC2003

Page 3:

The Bandwidth Challenge at SC2003
The peak aggregate bandwidth from the 3 booths was 23.21 Gbit/s
1-way link utilisations of >90%
6.6 TBytes transferred in 48 minutes

Page 4:

Multi-Gigabit flows at SC2003 BW Challenge
Three server systems with 10 Gigabit Ethernet NICs
Used the DataTAG altAIMD stack, 9000 byte MTU
Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:

Palo Alto PAIX: rtt 17 ms, window 30 MB, shared with the Caltech booth
4.37 Gbit HighSpeed TCP I=5%; then 2.87 Gbit I=16% (fall when 10 Gbit on the link)
3.3 Gbit Scalable TCP I=8%; tested 2 flows, sum 1.9 Gbit I=39%

Chicago Starlight: rtt 65 ms, window 60 MB, Phoenix CPU 2.2 GHz
3.1 Gbit HighSpeed TCP I=1.6%

Amsterdam SARA: rtt 175 ms, window 200 MB, Phoenix CPU 2.2 GHz
4.35 Gbit HighSpeed TCP I=6.9%
Very stable; both used Abilene to Chicago
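The windows quoted above are roughly the bandwidth-delay product of each path. A minimal sketch of that sizing (Python; the 10 Gbit/s target rate is an assumption for illustration, the RTTs are those on this slide):

    def window_bytes(rate_gbit_s, rtt_ms):
        # TCP window needed to keep a path full: rate * round-trip time
        return rate_gbit_s * 1e9 / 8 * rtt_ms / 1e3

    for site, rtt in [("Palo Alto PAIX", 17), ("Chicago Starlight", 65), ("Amsterdam SARA", 175)]:
        w = window_bytes(10, rtt)                 # assume a 10 Gbit/s target rate
        print(f"{site:18s} rtt {rtt:3d} ms -> window ~{w / 2**20:4.0f} MBytes")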

[Chart: 10 Gbit/s throughput from SC2003 to PAIX - throughput (Gbit/s, 0-10) vs date & time, 19 Nov 2003 15:59 to 17:25; traces: router to LA/PAIX, Phoenix-PAIX HS-TCP, Phoenix-PAIX Scalable-TCP, Phoenix-PAIX Scalable-TCP #2]

[Chart: 10 Gbit/s throughput from SC2003 to Chicago & Amsterdam - throughput (Gbit/s, 0-10) vs date & time, 19 Nov 2003 15:59 to 17:25; traces: router traffic to Abilene, Phoenix-Chicago, Phoenix-Amsterdam]

Page 5:

SCINet

Collaboration at SC2004 – setting up the BW Bunker

The BW Challenge at the SLAC Booth

Working with S2io, Sun, Chelsio

Page 6:

UKLight & ESLEA at SC2004
UK e-Science researchers from Manchester, UCL & ULCC were involved in the Bandwidth Challenge
Collaborated with scientists & engineers from Caltech, CERN, FERMI, SLAC, Starlight, UKERNA & U. of Florida
Networks used by the SLAC/UK team:

10 Gbit Ethernet link from SC2004 to the ESnet/QWest PoP in Sunnyvale
10 Gbit Ethernet link from SC2004 to the CENIC/NLR/Level(3) PoP in Sunnyvale
10 Gbit Ethernet link from SC2004 to Chicago and on to UKLight

UKLight focused on Gigabit disk-to-disk transfers between UK sites and Pittsburgh
The UK had generous support from Boston Ltd, who loaned the servers
The BWC collaboration had support from S2io (NICs), Chelsio (TOE) and Sun (who loaned servers)

Essential support from Boston, Sun & Cisco

Page 7:

The Bandwidth Challenge – SC2004
The peak aggregate bandwidth from the booths was 101.13 Gbit/s
That is 3 full-length DVDs per second!
4 times greater than SC2003!
Saturated TEN 10 Gigabit Ethernet waves
SLAC booth: Sunnyvale to Pittsburgh, LA to Pittsburgh, and Chicago to Pittsburgh (with UKLight).

Page 8:

Land Speed Record – SC2004
Pittsburgh-Tokyo-CERN, single-stream TCP

LSR = distance * speed; categories for single stream, multiple stream, IPv4 and IPv6, standard TCP
Current single-stream IPv4 record: University of Tokyo, Fujitsu & WIDE, 9 Nov 2004
20,645 km connection, SC2004 booth - CERN via Tokyo; latency 433 ms RTT
10 Gbit Chelsio TOE card
7.21 Gbit/s (TCP payload), 1500 B MTU, sustained for about 10 min
148,850 terabit-metre / second (Internet2 LSR approved record)
Full DVD in 5 s
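A quick check of the record arithmetic, as a sketch (Python; the distance and rate are the figures above, and a single-layer DVD of ~4.7 GBytes is assumed):

    distance_km = 20645      # SC2004 booth - Tokyo - CERN
    rate_gbit_s = 7.21       # single-stream TCP payload rate

    # Internet2 LSR metric: distance * rate, in terabit-metres per second
    lsr_tbit_m_s = distance_km * 1e3 * rate_gbit_s * 1e9 / 1e12
    print(f"LSR metric ~{lsr_tbit_m_s:,.0f} Tbit*m/s")                  # ~148,850

    dvd_bytes = 4.7e9        # assumed size of a single-layer DVD
    print(f"Full DVD in ~{dvd_bytes * 8 / (rate_gbit_s * 1e9):.1f} s")  # ~5.2 s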

Page 9:

Just a Well Engineered End-to-End Connection

End-to-End “no loss” environment

NO contention, NO sharing on the end-to-end path

Processor speed and system bus characteristics

TCP Configuration – window size and frame size (MTU)

Tuned PCI-X bus

Tuned Network Interface Card driver

A single TCP connection on the end-to-end path

Memory-to-Memory transfer

no disk system involved

No real user application (but did file transfers!!)

Not a typical User or Campus situation BUT …

So what’s the matter with TCP – Did we cheat?

[Diagram: typical end-to-end path, Client - Campus - Regional - Internet - Regional - Campus - Server, contrasted with the same Client/Server pair connected over UKLight; from Robin Tasker]

Page 10:

TCP (Reno) – What's the problem?

TCP has 2 phases:
Slowstart: probe the network to estimate the available bandwidth; exponential growth
Congestion Avoidance: the main data transfer phase; the transfer rate grows "slowly"

AIMD and high-bandwidth, long-distance networks
Poor performance of TCP in high-bandwidth wide area networks is due in part to the TCP congestion control algorithm.
For each ACK in an RTT without loss:
cwnd -> cwnd + a / cwnd        (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd - b * cwnd        (Multiplicative Decrease, b = 1/2)

Packet loss is a killer !!
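A minimal sketch of the rule above (Python; the window is counted in segments, and one loss event per congested window is assumed):

    def reno_on_ack(cwnd, a=1.0):
        # Additive increase: each ACK grows cwnd by a/cwnd, about +a segment per RTT
        return cwnd + a / cwnd

    def reno_on_loss(cwnd, b=0.5):
        # Multiplicative decrease: a loss event halves the window
        return cwnd * (1.0 - b)

    cwnd = 1000.0                 # segments in flight on a long fat pipe
    cwnd = reno_on_loss(cwnd)     # one lost packet -> 500 segments
    # roughly 500 more round trips of additive increase are needed to get back to 1000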

Page 11:

TCP (Reno) – Details
The time for TCP to recover its throughput from 1 lost packet is given by:

time = C * RTT^2 / (2 * MSS)

where C is the link capacity; for an rtt of ~200 ms recovery is of order 2 min.

[Plot: time to recover (s, log scale) vs rtt (0-200 ms) for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links; markers at UK 6 ms, Europe 20 ms, USA 150 ms]
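A worked version of that expression, as a sketch (Python; an MSS of 1460 bytes is an assumption, the link rates and RTT markers are those on the plot):

    def recovery_time_s(rate_bit_s, rtt_s, mss_bytes=1460):
        # Time for Reno to climb back after one loss: C * RTT^2 / (2 * MSS)
        return rate_bit_s * rtt_s ** 2 / (2 * mss_bytes * 8)

    for rate, label in [(100e6, "100 Mbit/s"), (1e9, "1 Gbit/s"), (10e9, "10 Gbit/s")]:
        for rtt_ms, place in [(6, "UK"), (20, "Europe"), (150, "USA")]:
            t = recovery_time_s(rate, rtt_ms / 1e3)
            print(f"{label:10s} {place:7s} rtt {rtt_ms:3d} ms -> {t:10.2f} s")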

Page 12:

Investigation of new TCP Stacks
The AIMD algorithm – standard TCP (Reno)
For each ACK in an RTT without loss:
cwnd -> cwnd + a / cwnd        (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd - b * cwnd        (Multiplicative Decrease, b = 1/2)

HighSpeed TCP
a and b vary depending on the current cwnd, using a table
a increases more rapidly with larger cwnd, so it returns to the 'optimal' cwnd size for the network path sooner
b decreases less aggressively and, as a consequence, so does the cwnd; the effect is that there is not such a decrease in throughput

Scalable TCP
a and b are fixed adjustments for the increase and decrease of cwnd
a = 1/100 – the increase is greater than TCP Reno
b = 1/8 – the decrease on loss is less than TCP Reno
Scalable over any link speed (see the sketch after this list)

Fast TCP

Uses round trip time as well as packet loss to indicate congestion with rapid convergence to fair equilibrium for throughput.

HSTCP-LP, H-TCP, BiC-TCP
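A minimal sketch contrasting the fixed Scalable TCP adjustments above with Reno (Python; the window is counted in segments and the 10,000-segment starting point is an assumption for illustration):

    def per_rtt_increase(cwnd, stack):
        # Window growth over one loss-free round trip
        return 1.0 if stack == "reno" else 0.01 * cwnd     # a = 1 segment vs a = 1/100 of cwnd

    def on_loss(cwnd, stack):
        # Window reduction on a loss event: b = 1/2 (Reno) vs b = 1/8 (Scalable)
        return cwnd * (0.5 if stack == "reno" else 0.875)

    for stack in ("reno", "scalable"):
        target = 10000.0                  # segments in flight before the loss
        cwnd, rtts = on_loss(target, stack), 0
        while cwnd < target:              # round trips needed to recover the window
            cwnd += per_rtt_increase(cwnd, stack)
            rtts += 1
        print(f"{stack:9s}: recovers in ~{rtts} round trips")

Run as written, Reno needs ~5000 round trips to recover while Scalable needs ~14, which is the behaviour the following slides measure.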

Page 13:

Packet Loss with new TCP Stacks
TCP response function: throughput vs loss rate – the further to the right, the faster the recovery
Packets dropped in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
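The "theory" curve in these response-function plots is, to a good approximation, the standard Reno formula; a sketch of it (Python; the Mathis-style approximation and the 1460-byte MSS are assumptions, the RTTs are those quoted above):

    from math import sqrt

    def reno_throughput_mbit(loss_prob, rtt_s, mss_bytes=1460):
        # Approximate Reno response function: rate ~ (MSS / RTT) * sqrt(3 / (2 * p))
        return mss_bytes * 8 / rtt_s * sqrt(3.0 / (2.0 * loss_prob)) / 1e6

    for rtt_ms, path in [(6, "MB-NG"), (120, "DataTAG")]:
        for n in (1e3, 1e5, 1e7):         # drop 1 packet in n
            bw = reno_throughput_mbit(1.0 / n, rtt_ms / 1e3)
            print(f"{path:8s} rtt {rtt_ms:3d} ms, drop 1 in {n:.0e}: {bw:9.1f} Mbit/s")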

Page 14:

Packet Loss and new TCP Stacks
TCP response function
UKLight London-Chicago-London, rtt 177 ms, 2.6.6 kernel
Agreement with theory is good

[Charts: sculcc1-chi-2 iperf 13 Jan 05 - TCP achievable throughput (Mbit/s) vs packet drop rate 1 in n (10^2 to 10^8), shown on log and linear throughput scales; series: A0 1500, A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood, A7 Vegas, A0 Theory, Scalable Theory]

Page 15:

High Throughput Demonstrations

[Diagram: Manchester (Geneva) host man03 - 1 GEth - Cisco 7609 - Cisco GSRs across the 2.5 Gbit SDH MB-NG core - Cisco 7609 - 1 GEth - London (Chicago) host lon01; dual 2.2 GHz Xeon machines at each end]

Send data with TCP, drop packets
Monitor TCP with Web100

Page 16:

High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network; rtt 128 ms, drop 1 in 10^6

HighSpeed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins

Page 17:

Is TCP fair?

TCP Flows – Sharing the Bandwidth

Page 18:

Test of TCP Sharing: Methodology (1 Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic
Each run was 16 minutes, in 7 regions

[Diagram: iperf or UDT traffic plus 1/s ICMP/ping traffic from SLAC through a TCP/UDP bottleneck to Caltech/UFL/CERN; timeline marks at 2 mins and 4 mins]

Les Cottrell PFLDnet 2005

Page 19:

Low performance on fast long-distance paths
AIMD (add a = 1 packet to cwnd per RTT; decrease cwnd by factor b = 0.5 on congestion)
Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput and unequal sharing

TCP Reno single stream

Congestion has a dramatic effect

Recovery is slow

Increase recovery rate

SLAC to CERN

RTT increases when it achieves best throughput

Les Cottrell PFLDnet 2005

Remaining flows do not take up slack when flow removed

Page 20:

UK Transfers MB-NG and SuperJANET4

Throughput for real users

Page 21:

iperf Throughput + Web100
SuperMicro on the MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs? <10 (expect ~400)

BaBar on the production network, standard TCP: 425 Mbit/s; DupACKs 350-400 – re-transmits

Page 22:

Applications: Throughput Mbit/s
HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET

bbcp

bbftp

Apache

Gridftp

Previous work used RAID0 (not disk limited)

Page 23:

bbftp: What else is going on? Scalable TCP

BaBar + SuperJANET

SuperMicro + SuperJANET

Congestion window – duplicate ACKs
Variation not TCP related? Disk speed / bus transfer / application

Page 24:

SC2004 & Transfers with UKLight

A Taster for Lambda & Packet Switched Hybrid Networks

Page 25:

Transatlantic Ethernet: TCP Throughput Tests

Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz
1500 byte MTU, 2.6.6 Linux kernel
Memory-to-memory TCP throughput, standard TCP

Wire-rate throughput of 940 Mbit/s
Plots show the full run and the first 10 sec

Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing

[Web100 plots: TCP achieved throughput (Mbit/s) and CurCwnd vs time (ms) - instantaneous BW, average BW and cwnd for the full ~140 s run and for the first 10 s]

Page 26:

SC2004 Disk-Disk bbftp (work in progress)

The bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 MBytes; rtt 177 ms; SACK off
Move a 2 GByte file; Web100 plots:

Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)

Disk-TCP-Disk at 1 Gbit/s is here!

[Web100 plots: TCP achieved throughput (Mbit/s) and CurCwnd vs time (ms) over the ~20 s transfers - instantaneous BW, average BW and cwnd for the standard and Scalable TCP runs]

Page 27:

Summary, Conclusions & Thanks

The Super Computing Bandwidth Challenge gives the opportunity to make world-wide high-performance tests.
The Land Speed Record shows what can be achieved with state-of-the-art kit.
Standard TCP is not optimum for high-throughput, long-distance links: packet loss is a killer for TCP.

Check your campus links & equipment, and the access links to the backbones.
Users need to collaborate with the campus network teams and the Dante PERT.

New stacks are stable and give better response & performance.
You still need to set the TCP buffer sizes (see the buffer-sizing sketch below)!
Check other kernel settings, e.g. the window-scale maximum.
Watch for "TCP stack implementation enhancements".

The host is critical: think server quality, not supermarket PC.
Motherboards, NICs, RAID controllers and disks matter.
The NIC should use 64 bit 133 MHz PCI-X; 66 MHz PCI can be OK, but 32 bit 33 MHz is too slow for Gigabit rates.
Worry about the CPU-memory bandwidth as well as the PCI bandwidth: data crosses the memory bus at least 3 times.
Separate the data transfers – use motherboards with multiple 64 bit PCI-X buses.
Choose a modern high-throughput RAID controller; consider SW RAID0 of RAID5 HW controllers.

Users are now able to perform sustained 1 Gbit/s transfers.
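A sketch of the buffer-size arithmetic and the matching kernel settings (Python; the sysctl names are the standard Linux 2.4/2.6 TCP tunables, and the 1 Gbit/s, 150 ms example path is an assumption for illustration):

    def bdp_bytes(rate_gbit_s, rtt_ms):
        # Bandwidth-delay product: the socket buffer needed to fill the path
        return int(rate_gbit_s * 1e9 / 8 * rtt_ms / 1e3)

    buf = bdp_bytes(1, 150)               # assumed example: 1 Gbit/s to the US, rtt ~150 ms
    print(f"Needed buffer ~{buf / 2**20:.0f} MBytes")

    # Matching Linux settings (min, default, max): raise the maxima so applications
    # (or autotuning) can actually reach the bandwidth-delay product.
    print(f"net.core.rmem_max = {buf}")
    print(f"net.core.wmem_max = {buf}")
    print(f"net.ipv4.tcp_rmem = 4096 87380 {buf}")
    print(f"net.ipv4.tcp_wmem = 4096 65536 {buf}")
    print("net.ipv4.tcp_window_scaling = 1")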

Page 28:

More Information – Some URLs
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue 2004: http://www.hep.man.ac.uk/~rich/
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Page 29:

Any Questions?

Page 30:

Backup Slides

Page 31:

Topology of the MB – NG Network

[Diagram: MB-NG network topology - Manchester domain (man01, man02, man03, HW RAID), UCL domain (lon01, lon02, lon03, HW RAID) and RAL domain (ral01, ral02), each behind Cisco 7609 edge/boundary routers, interconnected over the UKERNA development network; key: Gigabit Ethernet access, 2.5 Gbit POS, MPLS admin domains]

Page 32:

Topology of the Production Network

[Diagram: production network topology - man01 (Manchester domain, HW RAID) to ral01 (RAL domain, HW RAID) across a path of 3 routers and 2 switches; key: Gigabit Ethernet access, 2.5 Gbit POS, 10 Gbit POS]

Page 33:

SC2004 UKLIGHT Overview

[Diagram: SC2004 UKLight overview - MB-NG 7600 OSR in Manchester and the UCL HEP / UCL network via ULCC UKLight; UKLight 10G (four 1GE channels) to Chicago Starlight; Surfnet/EuroLink 10G (two 1GE channels) to Amsterdam; NLR lambda NLR-PITT-STAR-10GE-16 to the SC2004 show floor, where a Cisco 6509 serves the SLAC booth and a Caltech 7600 serves the Caltech booth (UltraLight IP)]

Page 34:

High Performance TCP – MB-NG
Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
Stacks shown: Standard, HighSpeed, Scalable

Page 35:

bbftp: Host & Network Effects
2 GByte file; RAID5 disks: 1200 Mbit/s read, 600 Mbit/s write
Scalable TCP

BaBar + SuperJANET: instantaneous 220 - 625 Mbit/s

SuperMicro + SuperJANET: instantaneous 400 - 665 Mbit/s for 6 sec, then 0 - 480 Mbit/s

SuperMicro + MB-NG: instantaneous 880 - 950 Mbit/s for 1.3 sec, then 215 - 625 Mbit/s

Page 36:

Average Transfer Rates Mbit/s

App      TCP Stack   SuperMicro    SuperMicro       BaBar on       SC2004 on
                     on MB-NG      on SuperJANET4   SuperJANET4    UKLight
iperf    Standard    940           350-370          425            940
iperf    HighSpeed   940           510              570            940
iperf    Scalable    940           580-650          605            940
bbcp     Standard    434           290-310          290            -
bbcp     HighSpeed   435           385              360            -
bbcp     Scalable    432           400-430          380            -
bbftp    Standard    400-410       325              320            825
bbftp    HighSpeed   370-390       380              -              -
bbftp    Scalable    430           345-532          380            875
apache   Standard    425           260              300-360        -
apache   HighSpeed   430           370              315            -
apache   Scalable    428           400              317            -
Gridftp  Standard    405           240              -              -
Gridftp  HighSpeed   320           -                -              -
Gridftp  Scalable    335           -                -              -

New stacks give more throughput
Rate decreases

Page 37:

UKLight and ESLEA Collaboration forming for SC2005

Caltech, CERN, FERMI, SLAC, Starlight, UKLight, …

Current proposals include:
Bandwidth Challenge with even faster disk-to-disk transfers between UK sites and SC2005
Radio astronomy demo at 512 Mbit or 1 Gbit of user data – Japan, Haystack (MIT), Jodrell Bank, JIVE
High-bandwidth link-up between UK and US HPC systems; 10 Gig NLR wave to Seattle
Set up a 10 Gigabit Ethernet test bench – experiments (CALICE) need to investigate >25 Gbit to the processor

ESLEA/UKLight need resources to study:
New protocols and congestion / sharing
The interaction between protocol processing, applications and storage
Monitoring L1/L2 behaviour in hybrid networks

Page 38:

10 Gigabit Ethernet: UDP Throughput Tests
A 1500 byte MTU gives ~2 Gbit/s; used a 16144 byte MTU, max user length 16080

DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire-rate throughput of 2.9 Gbit/s

CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire rate of 5.7 Gbit/s

SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s

[Chart: an-al 10GE Xsum 512kbuf MTU16114 27 Oct 03 - received wire rate (Mbit/s, 0-6000) vs spacing between frames (0-40 µs) for packet sizes from 1472 to 16080 bytes]

Page 39:

10 Gigabit Ethernet: Tuning PCI-X
16080 byte packets every 200 µs; Intel PRO/10GbE LR adapter
PCI-X bus occupancy vs mmrbc

Measured times, and times based on PCI-X timing from the logic analyser
Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s

[Charts: PCI-X transfer time (µs) and PCI-X transfer rate (Gbit/s) vs max memory read byte count (mmrbc 512, 1024, 2048, 4096 bytes), measured vs expected, with the PCI-X maximum throughput; mmrbc 4096 bytes gives 5.7 Gbit/s; kernel 2.6.1#17, HP Itanium, Intel 10GE, Feb 04. Logic-analyser trace segments: CSR access, PCI-X sequence, data transfer, interrupt & CSR update]

Page 40:

10 Gigabit Ethernet: SC2004 TCP Tests
Sun AMD Opteron v20z compute servers, Chelsio TOE, tests between Linux 2.6.6 hosts

10 Gbit Ethernet link from SC2004 to the CENIC/NLR/Level(3) PoP in Sunnyvale
Two 2.4 GHz AMD 64 bit Opteron processors with 4 GB of RAM at SC2004; 1500 B MTU, all Linux 2.6.6
In one direction 9.43 Gbit/s, i.e. 9.07 Gbit/s goodput; in the reverse direction 5.65 Gbit/s, i.e. 5.44 Gbit/s goodput; a total of 15+ Gbit/s on the wire

10 Gbit Ethernet link from SC2004 to the ESnet/QWest PoP in Sunnyvale
One 2.4 GHz AMD 64 bit Opteron at each end; 2 MByte window, 16 streams, 1500 B MTU, all Linux 2.6.6
In one direction 7.72 Gbit/s, i.e. 7.42 Gbit/s goodput, for 120 mins (6.6 Tbits shipped)

S2io NICs with Solaris 10 in a 4 * 2.2 GHz Opteron v40z to one or more S2io or Chelsio NICs with Linux 2.6.5 or 2.6.6 in 2 * 2.4 GHz v20zs
LAN 1: S2io NIC back to back: 7.46 Gbit/s
LAN 2: S2io in a v40z to 2 v20zs: each NIC ~6 Gbit/s, total 12.08 Gbit/s

Page 41:

Transatlantic Ethernet: disk-to-disk Tests

Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz
1500 byte MTU, 2.6.6 Linux kernel, RAID0 (6 SATA disks)
bbftp (disk-disk) throughput, standard TCP

Throughput of 436 Mbit/s
Plots show the full run and the first 10 sec

Work in progress to study: throughput limitations; helping real users

[Web100 plots: TCP achieved throughput (Mbit/s) and CurCwnd vs time (ms) for the full ~20 s transfer and for the first 10 s - instantaneous BW, average BW and cwnd. Additional chart, sculcc1-chi-2: TCP achievable throughput (Mbit/s, 0-1000) vs TCP buffer size (0-30 MBytes) for the iperf sender and for bbftp]

Page 42:

SC2004 Disk-Disk bbftp (work in progress)

UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 MBytes; rtt 177 ms; SACK off
Move a 2 GByte file; Web100 plots:

HS TCP

Don’t believe this is a protocol problem !

[Web100 plots for the HS TCP run, all vs time (ms) over ~45 s, each with CurCwnd overlaid: TCP achieved throughput (instantaneous and average BW), number of duplicate ACKs (DupAcksIn), number of timeouts, and number of other congestion-window reductions (OtherReductions)]