TRANSCRIPT

Slide: 1
Using TCP/IP on High Bandwidth Long Distance Optical Networks
Real Applications on Real Networks
Richard Hughes-Jones University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Rank”
Slide: 2
Bandwidth Challenge at SC2004
SCinet: setting up the BW bunker
The BW Challenge at the SLAC booth
Working with S2io, Sun, Chelsio
Slide: 3
The Bandwidth Challenge – SC2004
The peak aggregate bandwidth from the booths was 101.13 Gbit/s: that is 3 full-length DVDs per second!
4 times greater than SC2003 (with its 4.4 Gbit/s transatlantic flows)
Saturated TEN 10 Gigabit Ethernet waves
SLAC Booth: Sunnyvale to Pittsburgh, LA to Pittsburgh, and Chicago to Pittsburgh (with UKLight)
Slide: 4
TCP has been around for ages and it just works fine
So what's the Problem?
The users complain about the Network!
Slide: 5
TCP – provides reliability
Positive acknowledgement (ACK) of each received segment
Sender keeps a record of each segment sent
Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
Sender starts a timer when it sends a segment, so it can re-transmit
[Diagram: sender/receiver timeline. Segment n (Sequence 1024, Length 1024) is answered one RTT later by Ack 2048; Segment n+1 (Sequence 2048, Length 1024) by Ack 3072.]
Inefficient – sender has to wait
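A quick worked illustration (my numbers, not on the slide) of why waiting is inefficient: with one segment outstanding per RTT, throughput is bounded by MSS/RTT no matter how fast the link is.

    # Stop-and-wait picture: one segment, then one RTT waiting for its ACK.
    MSS_BYTES = 1460      # payload of a 1500 byte Ethernet frame
    RTT_S = 0.200         # roughly a transatlantic round trip time

    throughput = MSS_BYTES * 8 / RTT_S       # bits per second
    print(f"{throughput / 1e3:.0f} kbit/s")  # ~58 kbit/s, however fast the link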
Slide: 6
Flow Control: Sender – Congestion Window
Uses a congestion window, cwnd, a sliding window to control the data flow
Byte count giving the highest byte that can be sent without an ACK
Transmit buffer size and advertised receive buffer size are important
ACK gives the next sequence number to receive AND the available space in the receive buffer
A timer is kept for each packet
[Diagram: the TCP cwnd slides along the byte stream. Regions, left to right: data sent and ACKed; sent data buffered awaiting ACK; unsent data that may be transmitted immediately; data waiting for the window to open (the application writes here). A received ACK advances the trailing edge; the receiver's advertised window advances the leading edge; the sending host advances a marker as data is transmitted.]
Slide: 7
How it works: TCP Slowstart
Probe the network to get a rough estimate of the optimal congestion window size
The larger the window size, the higher the throughput: Throughput = Window size / Round-trip Time
Exponentially increase the congestion window size until a packet is lost
cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
Send 1st packet, get 1 ACK, increase cwnd to 2; send 2 packets, get 2 ACKs, increase cwnd to 4
Time to reach cwnd size W = RTT*log2(W): the rate doubles each RTT
[Diagram: cwnd vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; on timeout, slow start again.]
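To make the RTT*log2(W) figure concrete, here is a minimal sketch (my illustration, assuming one ACK per segment and no loss) of idealised slow-start growth:

    import math

    RTT_S = 0.2                 # round trip time
    TARGET_W = 4096             # target window, in segments

    cwnd, t = 1, 0.0
    while cwnd < TARGET_W:      # every segment is ACKed, so cwnd doubles per RTT
        cwnd *= 2
        t += RTT_S
    print(t, RTT_S * math.log2(TARGET_W))   # both 2.4 s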
Slide: 8
How it works: TCP AIMD Congestion Avoidance
Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
cwnd increased by a/cwnd for each ACK, a linear increase in rate (about one MTU per RTT):
cwnd -> cwnd + a/cwnd  (Additive Increase, a = 1)
TCP takes packet loss as an indication of congestion!
Multiplicative decrease: cut the congestion window size aggressively if a packet is lost; standard TCP reduces cwnd by half:
cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b = 1/2)
The slow start to congestion avoidance transition is determined by ssthresh
Packet loss is a killer
[Diagram: cwnd vs time, as on the previous slide. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; on timeout, slow start again.]
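A compact sketch (my own, not code from the talk) of the per-ACK and per-loss cwnd updates just described, with Reno's a = 1, b = 1/2:

    def on_ack(cwnd, a=1.0):
        """Additive increase: cwnd grows by ~a segments per RTT."""
        return cwnd + a / cwnd

    def on_loss(cwnd, b=0.5):
        """Multiplicative decrease: Reno halves the window."""
        return cwnd - b * cwnd

    cwnd = 100.0
    for _ in range(100):        # one RTT's worth of ACKs
        cwnd = on_ack(cwnd)
    print(round(cwnd, 1))       # ~101.0: one extra segment per RTT
    print(on_loss(cwnd))        # ~50.5: halved by a single loss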
Slide: 9
TCP (Reno) – Details of problem
The time for TCP to recover its throughput from 1 lost 1500 byte packet, for a link of capacity C, is given by:

    τ = C * RTT² / (2 * MSS)

for an rtt of ~200 ms: ~2 min
[Plot: time to recover (sec, log scale) vs rtt (ms), one curve per link speed from 10 Mbit to 10 Gbit. Annotated examples: UK 6 ms: 1.6 s; Europe 25 ms: 26 s; USA 150 ms: 28 min.]
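As a check on the plot's annotations, my own arithmetic, assuming a 1 Gbit/s link and a 1500 byte MSS:

    def recover_s(capacity_bps, rtt_s, mss_bytes=1500):
        """Reno recovery time after one loss: C * RTT^2 / (2 * MSS)."""
        return capacity_bps * rtt_s**2 / (2 * mss_bytes * 8)

    for name, rtt in [("UK", 0.006), ("Europe", 0.025), ("USA", 0.150)]:
        print(name, round(recover_s(1e9, rtt), 1), "s")
    # UK 1.5 s, Europe 26.0 s, USA ~937 s (~16 min) at 1 Gbit/s;
    # at ~200 ms rtt the figure approaches the ~28 min quoted on the slide.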
Slide: 10
TCP: Simple Tuning – Filling the Pipe
Remember, TCP has to hold a copy of the data in flight
The optimal (TCP buffer) window size depends on:
Bandwidth end to end, i.e. min(BW of the links), AKA the bottleneck bandwidth
Round Trip Time (RTT)
The number of bytes in flight to fill the entire path is the Bandwidth*Delay Product: BDP = RTT*BW
Can increase bandwidth by orders of magnitude
Windows are also used for flow control
[Diagram: sender/receiver timeline of segments and ACKs over one RTT; segment time on wire = bits in segment / BW]
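A sketch of the buffer-size arithmetic (illustrative values of mine, not from the slide):

    def bdp_bytes(bw_bps, rtt_s):
        """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
        return bw_bps * rtt_s / 8

    # 1 Gbit/s transatlantic path, rtt 150 ms:
    print(bdp_bytes(1e9, 0.150) / 1e6, "MB")   # ~18.75 MB of TCP buffer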
Slide: 11
Investigation of new TCP Stacks
The AIMD Algorithm – Standard TCP (Reno)
For each ACK in an RTT without loss:
cwnd -> cwnd + a/cwnd  (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b = 1/2)
High Speed TCP
a and b vary depending on the current cwnd, using a table
a increases more rapidly with larger cwnd – returns to the 'optimal' cwnd size sooner for the network path
b decreases less aggressively and, as a consequence, so does the cwnd; the effect is that there is not such a decrease in throughput
Scalable TCP
a and b are fixed adjustments for the increase and decrease of cwnd
a = 1/100 – the increase is greater than TCP Reno
b = 1/8 – the decrease on loss is less than TCP Reno
Scalable over any link speed
Fast TCP
Uses round trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
Also: HSTCP-LP, Hamilton-TCP, BiC-TCP
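A side-by-side sketch (my own simplification; HighSpeed TCP really looks a and b up in a table by cwnd, so it is omitted) of the per-ACK increase and per-loss decrease rules just listed:

    def reno(cwnd):     return cwnd + 1.0 / cwnd, 0.5     # a = 1,     b = 1/2
    def scalable(cwnd): return cwnd + 0.01, 1.0 / 8       # a = 1/100, b = 1/8

    for name, rule in (("Reno", reno), ("Scalable", scalable)):
        cwnd = 1000.0
        for _ in range(1000):            # roughly one RTT's worth of ACKs
            cwnd, b = rule(cwnd)
        print(name, round(cwnd - 1000), "segs gained/RTT,",
              round(cwnd * (1 - b)), "after one loss")
    # Reno: +1 seg/RTT, cwnd ~500 after a loss
    # Scalable: +10 segs/RTT, cwnd ~884 after a loss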
Slide: 12
Let's check out this theory about new TCP stacks
Does it matter?
Does it work?
Slide: 13
Packet Loss with new TCP Stacks
TCP Response Function: throughput vs loss rate – further to the right means faster recovery
Packets dropped in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
Slide: 14
High Throughput Demonstration
[Diagram: host man03 in Manchester (Geneva), 1 GEth to a Cisco 7609, Cisco GSRs across the 2.5 Gbit SDH MB-NG core, Cisco 7609 and 1 GEth to host lon01 in London (Chicago); dual 2.2 GHz Xeon PCs at each end.]
Send data with TCP; drop packets; monitor TCP with Web100
Slide: 15
High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network, rtt 128 ms, drop 1 in 10^6
High-Speed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
Slide: 16
Throughput for real users
Transfers in the UK for BaBar using MB-NG and SuperJANET4
Slide: 17
Topology of the MB-NG Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS access; MPLS; admin domains. Three domains – Manchester (man01, man02, man03), UCL (lon01, lon02, lon03) and RAL (ral01, ral02) – each with edge/boundary Cisco 7609 routers and HW RAID hosts at the endpoints, interconnected across the UKERNA development network.]
Slide: 18
Topology of the Production Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS access; 10 Gbit POS. Manchester domain (man01, HW RAID) to RAL domain (ral01, HW RAID) across 3 routers and 2 switches.]
Slide: 19
iperf Throughput + Web100
SuperMicro on MB-NG network: HighSpeed TCP, line speed 940 Mbit/s; DupACKs? <10 (expect ~400)
BaBar on production network: Standard TCP, 425 Mbit/s; DupACKs 350-400 – re-transmits
Slide: 20
Applications: Throughput Mbit/s
HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
[Chart: throughput for bbcp, bbftp, Apache, GridFTP]
Previous work used RAID0 (not disk limited)
Slide: 21
bbftp: What else is going on? Scalable TCP
SuperMicro + SuperJANET: instantaneous 220 - 625 Mbit/s
BaBar + SuperJANET: instantaneous 200 - 600 Mbit/s; disk-mem ~590 Mbit/s
Congestion window – duplicate ACK
Throughput variation not TCP related? Disk speed / bus transfer? Application?
Slide: 22
Average Transfer Rates Mbit/s

App       TCP Stack   SuperMicro    SuperMicro on   BaBar on      SC2004 on
                      on MB-NG      SuperJANET4     SuperJANET4   UKLight
iperf     Standard    940           350-370         425           940
          HighSpeed   940           510             570           940
          Scalable    940           580-650         605           940
bbcp      Standard    434           290-310         290           -
          HighSpeed   435           385             360           -
          Scalable    432           400-430         380           -
bbftp     Standard    400-410       325             320           825
          HighSpeed   370-390       380             -             -
          Scalable    430           345-532         380           875
apache    Standard    425           260             300-360       -
          HighSpeed   430           370             315           -
          Scalable    428           400             317           -
Gridftp   Standard    405           240             -             -
          HighSpeed   320           -               -             -
          Scalable    335           -               -             -

New stacks give more throughput
Rate decreases
Slide: 23
Transatlantic Disk to Disk Transfers
With UKLight
SuperComputing 2004
Slide: 24
SC2004 UKLIGHT Overview
[Network diagram: MB-NG 7600 OSR in Manchester, with UCL HEP and the UCL network, connects through ULCC to UKLight; UKLight 10G (four 1GE channels) to Chicago Starlight; SURFnet / EuroLink 10G (two 1GE channels) to Amsterdam; NLR lambda NLR-PITT-STAR-10GE-16 to the SC2004 show floor, with the Caltech booth (UltraLight IP, Caltech 7600) and the SLAC booth (Cisco 6509); K2 and Ci routers along the paths.]
Slide: 25
Transatlantic Ethernet: TCP Throughput Tests
Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz, 1500 byte MTU, 2.6.6 Linux kernel
Memory-to-memory TCP throughput, Standard TCP
Wire rate throughput of 940 Mbit/s
Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing
[Web100 plots: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms), for the full run and for the first 10 sec – InstantaneousBW, AveBW, CurCwnd.]
Slide: 26
SC2004 Disk-Disk bbftp
The bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
Move a 2 Gbyte file
Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
Disk-TCP-Disk at 1 Gbit/s
[Web100 plots: TCP achieved bandwidth (Mbit/s) and Cwnd vs time (ms) for each stack – InstantaneousBW, AveBW, CurCwnd.]
Slide: 27
Network & Disk Interactions (work in progress)
Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size
Measure memory to RAID0 transfer rates with & without UDP traffic:
Disk write: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
[Plots: throughput (Mbit/s) vs trial number for the RAID0 6-disk 1 Gbyte 64k writes with and without UDP traffic, and % CPU kernel mode scatter plots (system mode L3+4 vs L1+2) for 8k and 64k transfers, with fits y = -1.017x + 178.3 and y = -1.048x + 174.4, i.e. roughly y = 178 - 1.05x.]
Slide: 28
Remote Computing Farms
in the ATLAS TDAQ Experiment
Slide: 29
Remote Computing Concepts
[Diagram: ATLAS detectors and the Level 1 Trigger feed the ROBs; the Data Collection Network links ROBs, L2PUs (Level 2 Trigger) and SFIs (Event Builders); local event processing farms (PFs) and SFOs write to mass storage (experimental area / CERN B513); the Back End Network, a switch and GÉANT lightpaths connect remote event processing farms (PFs) at Copenhagen, Edmonton, Krakow and Manchester. Labelled data rates: ~PByte/sec from the detectors, 320 MByte/sec into mass storage.]
Slide: 30
ATLAS Application Protocol
Event Request: EFD requests an event from SFI; SFI replies with the event, ~2 Mbytes
Processing of event
Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
tcpmon, an instrumented TCP request-response program, emulates the Event Filter EFD to SFI communication
[Diagram: time sequence between the Event Filter Daemon (EFD) and SFI/SFO – request event, send event data, process event, request buffer, send OK, send processed event; the request-response time is histogrammed.]
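A minimal sketch (my own, in the spirit of tcpmon rather than its actual code) of the request-response pattern being measured: send a small request, read a large response, and record the latency. The host name and port are placeholders.

    import socket, time

    def request_response(host, port, req=b"x" * 64, resp_bytes=1_000_000):
        """One 64 byte request, one ~1 Mbyte response; returns latency in s."""
        with socket.create_connection((host, port)) as s:
            t0 = time.perf_counter()
            s.sendall(req)
            got = 0
            while got < resp_bytes:          # read until the full event arrives
                chunk = s.recv(65536)
                if not chunk:
                    break
                got += len(chunk)
            return time.perf_counter() - t0

    # latencies = [request_response("sfi.example", 5000) for _ in range(100)]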
Slide: 31
tcpmon: TCP Activity Manc-CERN Req-Resp
Web100 instruments the TCP stack
Round trip time 20 ms; 64 byte request (green), 1 Mbyte response (blue)
TCP in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window gets re-set on each request – a TCP stack implementation detail to reduce Cwnd after inactivity (see the note after this slide)
Even after 10 s, each response takes 13 rtt or ~260 ms
Transfer achievable throughput: 120 Mbit/s
[Web100 plots vs time (ms): DataBytesOut/DataBytesIn deltas, CurCwnd, and TCP achieved bandwidth (Mbit/s).]
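One plausible tuning behind the improvement on the next slide (an assumption on my part; the talk only says the stack was tuned): later Linux kernels expose the RFC 2861 decay of cwnd after idle as a sysctl, which can be switched off for this request-response pattern.

    # Disable cwnd decay after idle (Linux, needs root). Note: this sysctl
    # appeared in kernels later than the 2.6.6 used in the talk.
    with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w") as f:
        f.write("0")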
Slide: 32
tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
Round trip time 20 ms; 64 byte request (green), 1 Mbyte response (blue)
TCP starts in slow start: 1st event takes 19 rtt or ~380 ms
TCP congestion window grows nicely; response takes 2 rtt after ~1.5 s
Rate ~10/s (with 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
[Web100 plots vs time (ms): DataBytesOut/DataBytesIn deltas, PktsOut/PktsIn with CurCwnd, and TCP achieved bandwidth (Mbit/s).]
Slide: 33
tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
Round trip time 150 ms; 64 byte request (green), 1 Mbyte response (blue)
TCP starts in slow start: 1st event takes 11 rtt or ~1.67 s
TCP congestion window in slow start to ~1.8 s, then congestion avoidance
Response in 2 rtt after ~2.5 s; rate 2.2/s (with 50 ms wait)
Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Web100 plots vs time (ms): DataBytesOut/DataBytesIn deltas, PktsOut/PktsIn with CurCwnd, and TCP achieved bandwidth (Mbit/s).]
Slide: 34
Time Series of Request-Response Latency
Alberta – CERN: round trip time 150 ms; 1 Mbyte of data returned; stable for ~150 s at 300 ms, then falls to 160 ms with ~80 μs variation
Manchester – CERN: round trip time 20 ms; 1 Mbyte of data returned; stable for ~18 s at ~42.5 ms, then alternate points at 29 & 42.5 ms
[Plots: round trip latency (ms) vs request time (s) for each path, with a zoom of the 160.30-160.60 ms band for Alberta.]
Slide: 35
Radio Astronomy
e-VLBI
Slide: 36
e-VLBI at the GÉANT2 Launch, Jun 2005
[Map: Jodrell Bank (UK), Medicina (Italy) and Torun (Poland) connected to Dwingeloo over a DWDM link.]
Slide: 37
e-VLBI UDP Data Streams
Slide: 38
UDP Performance: 3 Flows on GÉANT
Throughput, 5 hour run:
Jodrell → JIVE: 2.0 GHz dual Xeon – 2.4 GHz dual Xeon, 670-840 Mbit/s
Medicina (Bologna) → JIVE: 800 MHz PIII – mark623 1.2 GHz PIII, 330 Mbit/s, limited by the sending PC
Torun → JIVE: 2.4 GHz dual Xeon – mark575 1.2 GHz PIII, 245-325 Mbit/s, limited by security policing (>400 Mbit/s → 20 Mbit/s)?
Throughput over a 50 min period: the period is ~17 min
[Plots, BW 14Jun05: received wire rate (Mbit/s) vs time (10 s steps) for Jodrell, Medicina and Torun – full run and the 50 min zoom.]
Slide: 39
UDP Performance: 3 Flows on GÉANT
Packet loss & re-ordering:
Jodrell (2.0 GHz Xeon): loss 0 - 12%, re-ordering significant
Medicina (800 MHz PIII): loss ~6%, re-ordering insignificant
Torun (2.4 GHz Xeon): loss 6 - 12%, re-ordering insignificant
[Plots, 14Jun05: number re-ordered and number lost vs time (10 s bins) for each site.]
Slide: 40
18 Hour Flows on UKLight: Jodrell – JIVE, 26 June 2005
Throughput, Jodrell → JIVE: 2.4 GHz dual Xeon – 2.4 GHz dual Xeon, 960-980 Mbit/s
Traffic through SURFnet
Packet loss: only 3 groups with 10-150 lost packets each; no packets lost the rest of the time
Packet re-ordering: none
[Plots, man03-jivegig1 26Jun05: received wire rate (Mbit/s) vs time (10 s steps), a zoom of the 900-1000 Mbit/s band, and packet loss on a log scale.]
Slide: 41
Summary & Conclusions
The end hosts themselves: the performance of motherboards, NICs, RAID controllers and disks matters
Plenty of CPU power is required to sustain Gigabit transfers, for the TCP/IP stack as well as the application
Packets can be lost in the IP stack due to lack of processing power
New TCP stacks are stable and give better response & performance
Still need to set the TCP buffer sizes! (see the sketch after this slide)
Check other kernel settings, e.g. window-scale
Take care over the difference between the Protocol and the Implementation
Packet loss is a killer: check campus links & equipment, and access links to backbones
Application architecture & implementation is also important
The interaction between HW, protocol processing and the disk sub-system is complex
The work is applicable to other areas including: remote iSCSI; remote database accesses; real-time Grid computing, e.g. real-time interactive medical image processing
MB-NG
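A minimal sketch of the buffer-size tuning the summary calls for (the values are mine, sized for a 1 Gbit/s, 150 ms path; not from the talk):

    import socket

    BDP = int(1e9 * 0.150 / 8)          # ~19 MB for 1 Gbit/s at 150 ms rtt

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel for send/receive buffers of about one BDP.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BDP)
    # The kernel caps these at net.core.wmem_max / rmem_max, so those
    # sysctls must be raised as well for the request to take effect.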
Slide: 42
More Information – Some URLs
Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
UKLight web site: http://www.uklight.ac.uk
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Slide: 43
Any Questions?
Slide: 44
Backup Slides
Slide: 45
Multi-Gigabit Flows at the SC2003 BW Challenge
Three server systems with 10 Gigabit Ethernet NICs
Used the DataTAG altAIMD stack, 9000 byte MTU
Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
Palo Alto PAIX: rtt 17 ms, window 30 MB; shared with the Caltech booth; 4.37 Gbit HighSpeed TCP I=5%; then 2.87 Gbit I=16%, falling when 10 Gbit on the link; 3.3 Gbit Scalable TCP I=8%; tested 2 flows, sum 1.9 Gbit I=39%
Chicago Starlight: rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz; 3.1 Gbit HighSpeed TCP I=1.6%
Amsterdam SARA: rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz; 4.35 Gbit HighSpeed TCP I=6.9%; very stable
Both used Abilene to Chicago
[Plots, 19 Nov 2003 15:59-17:25: throughput (Gbit/s) vs time for SC2003 → PAIX (HS-TCP and two Scalable-TCP flows, with router traffic to LA/PAIX) and for SC2003 → Chicago & Amsterdam (with router traffic to Abilene).]
Slide: 46
Latency Measurements
UDP/IP packets sent between back-to-back systems
Processed in a similar manner to TCP/IP, but not subject to flow control & congestion avoidance algorithms
Used the UDPmon test program
Latency: round trip times measured using request-response UDP frames; latency as a function of frame size
The slope is the sum over the data paths of the inverse data rates, s = Σ_paths 1/(db/dt):
mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
The intercept indicates processing times + HW latencies
Histograms of 'singleton' measurements tell us about: the behaviour of the IP stack, the way the HW operates, and interrupt coalescence
Slide: 47
Throughput Measurements
UDP Throughput: send a controlled stream of UDP frames spaced at regular intervals
[Diagram: sender/receiver time sequence. Zero stats (OK done); send n-byte data frames at regular intervals (number of packets, wait time); time to send, time to receive, inter-packet time (histogram); signal end of test (OK done); get remote statistics. Statistics sent back: number received; number lost + loss pattern; number out-of-order; CPU load & number of interrupts; 1-way delay.]
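A minimal sketch (my own, in the spirit of UDPmon rather than its actual code) of sending such a paced UDP stream; host and port are placeholders:

    import socket, time

    def send_stream(host, port, n_bytes=1472, n_packets=1000, wait_s=20e-6):
        """Send n_packets UDP frames of n_bytes, spaced wait_s apart."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        payload = bytes(n_bytes)
        t0 = time.perf_counter()
        for i in range(n_packets):
            sock.sendto(payload, (host, port))
            # Busy-wait to the next slot: sleep() is too coarse for ~us spacing.
            while time.perf_counter() < t0 + (i + 1) * wait_s:
                pass
        return time.perf_counter() - t0   # time to send

    # elapsed = send_stream("receiver.example", 14196)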
Slide: 48
PCI Bus & Gigabit Ethernet Activity
PCI activity measured with a logic analyser:
PCI probe cards in the sending PC
Gigabit Ethernet fibre probe card
PCI probe cards in the receiving PC
[Diagram: CPU, memory, chipset and NIC in each PC; the PCI buses and the Gigabit Ethernet link between the NICs are instrumented, all feeding the logic analyser display. Possible bottlenecks are highlighted.]
Slide: 49
"Server Quality" Motherboards
SuperMicro P4DP8-2G (P4DP6)
Dual Xeon; 400/533 MHz front side bus
6 PCI/PCI-X slots; 4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
Dual Gigabit Ethernet
Adaptec AIC-7899W dual channel SCSI
UDMA/100 bus master/EIDE channels; data transfer rates of 100 MB/sec burst
Slide: 50
"Server Quality" Motherboards
Boston/Supermicro H8DAR
Two dual-core Opterons
200 MHz DDR memory: theory BW 6.4 Gbit
HyperTransport
2 independent PCI buses: 133 MHz PCI-X
2 Gigabit Ethernet; SATA; (PCI-e)
Slide: 51
End Hosts & NICs: CERN-nat-Manc
Use UDP packets to characterise host, NIC & network: request-response latency, throughput, packet loss, re-ordering
SuperMicro P4DP8 motherboard; dual Xeon 2.2 GHz CPU; 400 MHz system bus; 64 bit 66 MHz PCI / 133 MHz PCI-X bus
[Plots, pcatb121-nat-gig6 13Aug04: received wire rate (Mbit/s), % packet loss and number re-ordered vs spacing between frames (μs) for frame sizes 50-1472 bytes; latency histograms N(t) for 256, 512 and 1400 byte frames over 20900-21500 μs.]
The network can sustain 1 Gbit/s of UDP traffic
The average server can lose smaller packets
Packet loss is caused by lack of power in the PC receiving the traffic
Out-of-order packets are due to WAN routers; lightpaths look like extended LANs and have no re-ordering
Slide: 52
TCP (Reno) – Details
The time for TCP to recover its throughput from 1 lost packet, for a link of capacity C, is given by:

    τ = C * RTT² / (2 * MSS)

for an rtt of ~200 ms: ~2 min
[Plot: time to recover (sec, log scale) vs rtt (ms), one curve per link speed from 10 Mbit to 10 Gbit; annotated rtts: UK 6 ms, Europe 20 ms, USA 150 ms.]
Slide: 53
Network & Disk Interactions
Disk write, mem-disk: 1735 Mbit/s; tends to be in 1 die
Disk write + UDP 1500, mem-disk: 1218 Mbit/s; both dies at ~80%
Disk write + CPU mem, mem-disk: 1341 Mbit/s; 1 CPU at ~60%, the other at 20%; large user-mode usage; below the cut = high BW; high BW = die 1 used
Disk write + CPU load, mem-disk: 1334 Mbit/s; 1 CPU at ~60%, the other at 20%; all CPUs saturated in user mode
[Scatter plots for the RAID0 6-disk 1 Gbyte writes (8k and 64k transfers, 3w8506-8 controller; write, udp, membw and cpuload runs, Dec04-Jan05): total and kernel-mode CPU load, % CPU system mode L3+4 vs % CPU system mode L1+2, with fits y = -1.017x + 178.3 / y = -1.048x + 174.4 (64k write) and y = -1.022x + 215.6 / y = -1.053x + 206.5 (8k write), plus the cut line y = 178 - 1.05x; also throughput (Mbit/s) vs trial number with the L3+L4 < cut selection.]
Slide: 54
TCP Fast Retransmit & Recovery
Duplicate ACKs are due to lost segments or segments out of order
Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the transmitting host sends the missing segment
Set ssthresh to 0.5*cwnd – so enter the congestion avoidance phase
Set cwnd = 0.5*cwnd + 3 – the 3 dup ACKs
Increase cwnd by 1 segment for each further duplicate ACK
Keep sending new data if allowed by cwnd
Set cwnd to half the original value on a new ACK – no need to go into "slow start" again
At steady state, CWND oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again
[Diagram: cwnd vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; on timeout, slow start again.]
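A sketch (my own condensation of the steps above, not kernel code) of the fast retransmit / fast recovery state changes:

    def retransmit_missing_segment():
        pass  # stand-in for resending the first unacknowledged segment

    def on_dup_ack(state):
        """Per duplicate ACK. state: {'cwnd': .., 'ssthresh': .., 'dup': ..}"""
        state["dup"] += 1
        if state["dup"] == 3:                       # fast retransmit
            state["ssthresh"] = 0.5 * state["cwnd"]
            state["cwnd"] = state["ssthresh"] + 3   # inflate by the 3 dup ACKs
            retransmit_missing_segment()
        elif state["dup"] > 3:
            state["cwnd"] += 1                      # keep new data flowing

    def on_new_ack(state):
        """ACK of new data ends recovery: cwnd drops to half its old value,
        with no return to slow start."""
        state["dup"] = 0
        state["cwnd"] = state["ssthresh"]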
Slide: 55
Packet Loss and new TCP Stacks
TCP Response Function
UKLight London-Chicago-London; rtt 177 ms; 2.6.6 kernel
Agreement with theory good
Some new stacks good at high loss rates
[Plots, sculcc1-chi-2 iperf 13Jan05: TCP achievable throughput (Mbit/s, log and linear scales) vs packet drop rate (1 in n) for A0 1500 (standard), A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A7 Vegas and A8 Westwood, with the A0 and Scalable theory curves.]