Page 1: Internet data transfer record between CERN and California

Sylvain Ravot (Caltech)

Paolo Moroni (CERN)

Page 2: Summary

Ravot - Moroni, 12 May 2003, EOF, RIPE-45 meeting - Barcelona

Internet2 Land Speed Record Contest

New LSR

DataTAG project and network configuration

Establishing an LSR: hardware and tuning

Conclusions

Acknowledgements

Page 3: Internet2 LSR Contest (I)

From http://lsr.internet2.edu:

“A minimum of 100 megabytes must be transferred a minimum terrestrial distance of 100 kilometers with a minimum of two router hops in each direction between the source node and the destination node across one or more operational and production-oriented high-performance research and education networks.”

“The contest unit of measurement is […] bit-meters/second.”

Page 4: Internet2 LSR Contest (II)

“Instances of all hardware units and software modules used to transfer contest data on the source node, the destination node, the links, and the routers must be offered for commercial sale or as open source software to all U.S. members of the Internet community by their respective vendors or developers prior to or immediately after winning the contest.”

Award classes: single or multiple TCP streams, on top of IPv4 or IPv6

Generally available networks and equipment, vs. lab prototypes

Page 5: Former LSR

TCP/IPv4 single stream

By NIKHEF, Caltech and SLAC

Established on November 19th 2002

10978 km of network: Geneva-Amsterdam-Chicago-Sunnyvale

0.93 Gb/sec

10136.15 terabit-meters/second

Page 6: Current LSR

TCP/IPv4 single stream

By Caltech, CERN, LANL and SLAC, within the DataTAG project framework

Established on February 27th-28th 2003 by Sylvain Ravot (Caltech) using IPERF

10037 km of network: Geneva-Chicago-Sunnyvale (shorter distance than the former LSR)

2.38 Gb/sec (sustained: a Terabyte of data moved in an hour)

23888 terabit-meters/second
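
The contest metric is throughput multiplied by path length. As a quick sanity check on the figures quoted on this slide, the arithmetic can be sketched in Python. The function names are illustrative only and not part of any contest tooling, and decimal units are assumed (1 Gb = 10^9 bits, 1 TB = 10^12 bytes).

```python
def lsr_metric_tbit_m_per_s(throughput_gbps: float, distance_km: float) -> float:
    """Throughput times distance, expressed in terabit-meters/second."""
    bits_per_second = throughput_gbps * 1e9
    meters = distance_km * 1e3
    return bits_per_second * meters / 1e12


def terabytes_moved(throughput_gbps: float, seconds: float) -> float:
    """Data volume in terabytes moved at a constant rate for the given time."""
    total_bits = throughput_gbps * 1e9 * seconds
    return total_bits / 8 / 1e12


if __name__ == "__main__":
    # Inputs are the rounded values quoted on the slide.
    print(f"{lsr_metric_tbit_m_per_s(2.38, 10037):.0f} terabit-meters/second")
    print(f"{terabytes_moved(2.38, 3600):.2f} TB in one hour")
```

With the slide's rounded inputs this gives about 23888 terabit-meters/second and slightly more than one terabyte per hour, consistent with the quoted figures.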

Page 7: DataTAG project

Full project title: “Research and technological development for a transatlantic GRID”

IST project (EU funded), supported by the NSF and the DoE (Caltech): http://www.datatag.org

Partners: PPARC (UK), INRIA (FR), University of Amsterdam (NL), INFN (IT) and CERN (CH)

Researchers also from Caltech, Los Alamos, SLAC and Canada

Test-bed kernel: transatlantic STM-16 (T-Systems) between Geneva (CERN) and Chicago (StarLight), with interconnected workstations at each side.

Test-bed extensions provided by GEANT, SURFnet, VTHD and other partners, in Europe and North America

Page 8: DataTAG as test-bed for LSR

Research on TCP as part of the DataTAG programme

The DataTAG test-bed was the main environment for the LSR

Network extension kindly made available by Level3 Communications, Inc.: Chicago-Sunnyvale STM-64

Router at Sunnyvale kindly lent by Cisco

PC 10 GbE interfaces kindly made available as a pre-release product by Intel

Page 9: LSR network configuration

[Network diagram. Sites: CERN, Geneva (Switzerland); StarLight, Chicago (Illinois, USA); Level3 PoP, Sunnyvale (California, USA). Links: STM-16 (T-Systems), STM-64 (Level3 loan), 10 GbE. Routers: Cisco 7606, Cisco 7609, Juniper T640 (TeraGrid), Cisco 12406 (Cisco loan). End hosts: two PCs with 10 GbE interfaces (Intel loan). Label: DataTAG network.]

Page 10: Establishing an LSR: hardware (I)

No LSR without good hardware

A lot of bandwidth: minimum 2.5 Gb/sec on the whole path (thanks to Level3 for the STM-64 on loan between Chicago and Sunnyvale)

Powerful routers (Cisco 7600 and GSR, Juniper T640)

Powerful Linux PCs on both sides

Intel 10 GbE interfaces

Page 11: Establishing an LSR: hardware (II)

Linux PC at CERN:
    Dual Intel® Xeon™ processors, 2.40GHz with 512K L2 cache
    SuperMicro P4DP8-G2 Motherboard
    Intel E700 chipset
    2 GB RAM, PC2100 ECC Reg. DDR
    Hard drive: 1 x 140 GB - Maxtor ATA-133
    On board Intel 82546EB dual port Gigabit Ethernet controller
    4U Rack-mounted server

Linux PC at Sunnyvale:
    Dual Intel® Xeon™ processors, 2.40GHz with 512K L2 cache
    SuperMicro P4DPE-G2 Motherboard
    2 GB RAM, PC2100 ECC Reg. DDR
    2 x 3ware 7500-8 RAID controllers
    16 Western Digital IDE disk drives for RAID and 1 for system
    2 Intel 82550 fast Ethernet
    2 x SysKonnect Gigabit Ethernet card SK-9843 SK-NET GE SX
    4U Rack-mounted server
    480 W to run, 600 W to spin up

Page 12: Establishing an LSR: hardware (III)

Intel 10 GbE interfaces: Intel Pro/10 GbE-LR

Not yet commercially available when the LSR was set, but announced as commercially available shortly afterwards

Page 13: Establishing an LSR: standard tuning

MTU set to 9000 bytes

TCP window size increased from the Linux default of 64K: essential over long distance

But standard Linux kernel (2.4.20)

Standard tuning is not enough for LSR: why?
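
To see why the 64K default window matters over long distance: a single TCP stream can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT. A minimal sketch follows; the 180 ms round-trip time is a hypothetical value chosen for illustration, since the slides do not state the RTT of the path.

```python
def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on single-stream throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    ASSUMED_RTT_S = 0.180  # hypothetical intercontinental RTT, not from the slides
    default_window = 64 * 1024  # the 64K Linux default mentioned on the slide
    limit = window_limited_throughput_bps(default_window, ASSUMED_RTT_S)
    print(f"64K window over {ASSUMED_RTT_S * 1000:.0f} ms: {limit / 1e6:.1f} Mbit/s at most")
```

Under that assumed RTT the default window caps a stream at roughly 3 Mbit/s, regardless of how fast the link is.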

Page 14: TCP WAN problems

Responsiveness to packet losses (the time needed to recover from a loss) is proportional to the square of the RTT: R = C*(RTT**2)/(2*MSS), where C is the link capacity and MSS is the max segment size. This makes it very difficult to take advantage of full capacity over a long-distance WAN: not a real problem for standard traffic on a shared link, but a serious penalty for LSR

Slow start mode is “too” slow using default parameters: they are good for standard traffic, but not for LSR
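
The responsiveness formula can be worked through numerically. The sketch below reads it as R = C * RTT^2 / (2 * MSS), with MSS converted to bits so that the result comes out in seconds; that unit handling is an assumption. The reasoning: after a loss the window is halved, and congestion avoidance then regains one segment per round trip. The RTT and MSS values are hypothetical examples, not figures taken from the slides.

```python
def responsiveness_s(capacity_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Time to grow back from half the full window to the full window."""
    # Full window, in segments, needed to fill the link.
    window_segments = capacity_bps * rtt_s / (mss_bytes * 8)
    # Half the window was lost; it is regained at one segment per RTT.
    rtts_to_recover = window_segments / 2
    return rtts_to_recover * rtt_s


if __name__ == "__main__":
    # Hypothetical parameters: a 622 Mbit/s link, 120 ms RTT, 1460-byte MSS.
    r = responsiveness_s(622e6, 0.120, 1460)
    print(f"622 Mbit/s, 120 ms, MSS 1460: {r / 60:.1f} minutes")
    # Same link with twice the RTT: recovery takes four times as long.
    print(f"ratio for doubled RTT: {responsiveness_s(622e6, 0.240, 1460) / r:.1f}")
```

The quadratic dependence on RTT is the point: doubling the round-trip time quadruples the recovery time, while a larger MSS (for example via a 9000-byte MTU) shortens it proportionally.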

Page 15: Example: recovering from a packet loss

[Figure: TCP throughput CERN-Chicago over the 622 Mbit/s link. Throughput (Mbit/s, 0 to 200) against time (s, 0 to 1600), with an interval of 6 min marked on the plot.]

TCP reactivity: the time to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Chicago and CERN.

Packet losses are a disaster for the overall throughput

Page 16: Example: slow start vs. congestion avoidance

[Figure: congestion window (cwnd) over the life of a connection, showing the slow start phase and the congestion avoidance phase, separated at SSTHRESH. Curves: cwnd average of the last 10 samples; cwnd average over the life of the connection to that point.]

Page 17: Establishing an LSR: non-standard tuning

The TCP stack was designed a long time ago and for much slower networks: looking at the previous page’s formula, if C is very small, it keeps the responsiveness low enough for any terrestrial RTT. Modern, fast WAN links are “bad” for TCP performance

TCP tries to increase its window size until something breaks (packet loss, congestion, …); then it restarts from half of the previous value until it breaks again. This gradual approximation process takes very long over long distance and degrades performance

Page 18: Establishing an LSR: further tuning

Knowing a priori the available bandwidth, prevent TCP from trying larger windows by restricting the amount of buffers it may use: without buffers, it won’t try to use larger windows and packet losses can be avoided

The product C*RTT yields the optimal TCP window size for a link of capacity C

So, allocate just enough buffers to let TCP squeeze the maximum performance from the existing bandwidth and nothing else
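
The C*RTT product is easy to evaluate for a concrete path. A minimal sketch, assuming a capacity of 2.5 Gb/sec (the minimum path bandwidth quoted earlier in the talk) and a hypothetical RTT of 180 ms that is not taken from the slides:

```python
def optimal_window_bytes(capacity_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product C*RTT, converted from bits to bytes."""
    return capacity_bps * rtt_s / 8


if __name__ == "__main__":
    ASSUMED_RTT_S = 0.180  # hypothetical round-trip time, for illustration only
    window = optimal_window_bytes(2.5e9, ASSUMED_RTT_S)
    print(f"buffer needed to fill the link: {window / 1e6:.2f} MB")
```

Buffers sized to this value let one stream fill the link; making them larger only lets TCP probe past the available capacity and provoke losses, which is what the tuning above is meant to avoid.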

Page 19: Further tuning: Linux implementation

Tuning TCP buffers (numbers for STM-16):

    echo "4096 87380 128388607" > /proc/sys/net/ipv4/tcp_rmem
    echo "4096 65530 128388607" > /proc/sys/net/ipv4/tcp_wmem
    echo 128388607 > /proc/sys/net/core/wmem_max
    echo 128388607 > /proc/sys/net/core/rmem_max

Tuning the network device buffers:

    /sbin/ifconfig eth1 txqueuelen 10000
    /sbin/ifconfig eth1 mtu 9000

Both on sender and receiver

Page 20: Even further tuning: Linux implementation

TCP slow start mode vs. congestion avoidance mode is another performance penalty in the sender for the LSR

On Linux (sender side only):

    sysctl -w net.ipv4.route.flush=1

This prevents TCP from using any previously cached window value, i.e. it speeds up slow start mode and gets to congestion avoidance mode at exponential speed (instead of stopping the exponential growth of the congestion window at half of some previously cached value)
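
The cost of a stale cached threshold can be illustrated with a toy model. The sketch below is a deliberately simplified simulation, not the Linux implementation: the window doubles every round trip while it is below the slow start threshold and grows by one segment per round trip afterwards. The function name and the segment counts are hypothetical.

```python
def rtts_to_reach(target_segments: int, ssthresh_segments: int, initial_cwnd: int = 1) -> int:
    """Round trips needed for the congestion window to reach the target size."""
    cwnd = initial_cwnd
    rtts = 0
    while cwnd < target_segments:
        if cwnd < ssthresh_segments:
            # Slow start: exponential growth, capped at the threshold.
            cwnd = min(cwnd * 2, ssthresh_segments)
        else:
            # Congestion avoidance: linear growth, one segment per RTT.
            cwnd += 1
        rtts += 1
    return rtts


if __name__ == "__main__":
    target = 1024  # hypothetical window, in segments, needed to fill the link
    print("no cached threshold:", rtts_to_reach(target, ssthresh_segments=target))
    print("stale threshold of 64:", rtts_to_reach(target, ssthresh_segments=64))
```

In this model the unconstrained case reaches the target in 10 round trips, while a stale threshold of 64 segments needs 966, because almost all of the growth happens in the linear phase.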

Page 21: IPERF parameters

On the sender:

    iperf -c 192.91.239.213 -i 5 -P 3 -w 40M -t 180

On the receiver:

    iperf-1.6.5 -s -w 128M

IPERF is available at http://dast.nlanr.net

Page 22: Conclusions: how useful in practice?

The LSR result cannot be immediately translated into practical general-purpose recommendations: it relies on
    some a priori knowledge (the physical link speed)
    dedicated bandwidth
    ad hoc TCP tuning: good for LSR, not for general-purpose traffic

Nevertheless, work is ongoing for a more modern TCP stack: the new LSR demonstrates that fast WAN TCP is possible in practice, by tweaking TCP a bit

Page 23: Conclusions: sustained throughput

As stated by Internet2, the LSR definition makes only very limited provision for requiring sustained throughput (100 Megabytes are not much for today’s network bandwidth)

However, apart from the achieved throughput, this LSR shows that high sustained throughput is in principle possible with TCP over long distance

This is new with respect to former results, where the throughput could be sustained only for 40-60 seconds, before some TCP feedback mechanism kicked in and ruined the performance

Page 24: Other remarks

A big difference with respect to the past is that the bottleneck for things like the LSR is now in the end hosts: no non-trivial tuning was needed on the network where the LSR was established

Incidentally, although single-stream, the new LSR was also good enough to establish the new LSR for multiple IPv4 streams

No TCP packet was lost during the LSR trial window

Details of the new record are not published on http://lsr.internet2.edu yet

Page 25: Acknowledgements: people

Sylvain Ravot (working for Caltech at CERN)

Wu-Chun Feng (LANL)

Les Cottrell (SLAC)

Page 26: Acknowledgements: industrial partners

Page 27: Acknowledgements: organisations

Thank you