
Page 1:

Masaki Hirabaru masaki@nict.go.jp

NICT

The 3rd International HEP DataGrid Workshop, August 26, 2004

Kyungpook National Univ., Daegu, Korea

High Performance Data Transfer over TransPAC

Page 2:

Acknowledgements

• NICT Kashima Space Research Center: Yasuhiro Koyama, Tetsuro Kondo

• MIT Haystack Observatory: David Lapsley, Alan Whitney

• APAN Tokyo NOC

• JGN II NOC

• NICT R&D Management Department

• Indiana U. Global NOC

Page 3:

Contents

• e-VLBI

• Performance Measurement

• TCP test over TransPAC

• TCP test in the Laboratory

Page 4:

Motivations

• MIT Haystack – NICT Kashima e-VLBI experiment on August 27, 2003, to measure UT1-UTC within 24 hours
  – 41.54 GB CRL => MIT at 107 Mbps (~50 mins)
  – 41.54 GB MIT => CRL at 44.6 Mbps (~120 mins)
  – RTT ~220 ms; UDP throughput 300-400 Mbps, but TCP only ~6-8 Mbps (per session, tuned)
  – BBFTP with 5 x 10 TCP sessions to gain performance

• HUT – NICT Kashima Gigabit VLBI experiment
  – RTT ~325 ms; UDP throughput ~70 Mbps, but TCP ~2 Mbps (as is), ~10 Mbps (tuned)
  – NetAnts (5 TCP sessions with FTP stream restart extension)

These applications need high-speed, real-time, reliable, long-haul transfer of huge data volumes.

Page 5:

VLBI (Very Long Baseline Interferometry)

[Diagram: a radio signal from a star reaches widely separated antennas; each antenna A/D-samples it against a local clock and sends the data over the Internet to a correlator, which measures the delay between stations.]

• e-VLBI: geographically distributed observation, interconnecting radio antennas around the world

• Gigabit / real-time VLBI: multi-gigabit-rate sampling (data rate 512 Mbps and up)

High Bandwidth-Delay Product Network issue

(NICT Kashima Radio Astronomy Applications Group)

Page 6:

Recent Experiment of UT1-UTC Estimation between NICT Kashima and MIT Haystack (via Washington DC)

• July 30, 2004, 4am-6am JST. Kashima was upgraded to 1G through the JGN II 10G link. All processing was done in ~4.5 hours (~21 hours last time). Average ~30 Mbps transfer by bbftp (under investigation).

Page 7:

Network Diagram for e-VLBI and test servers

[Diagram: Kashima and Koganei connect at 1G (10G planned) to the Tokyo XP; TransPAC / JGN II carries traffic at 10G over ~9,000 km to Los Angeles; Abilene (10G, 2.4G x2) continues ~4,000 km via Indianapolis, Chicago, and Washington DC to MIT Haystack. On the Korean side, APII/JGN II links Fukuoka, the Genkai XP, and Kitakyushu to Busan over 2.5G SONET, and KOREN (Busan, Taegu, Daejon, Kwangju, Seoul XP) runs at 2.5G/1G. bwctl, perf, and e-VLBI test servers are placed along the path.]

*An info and key exchange page is needed, like: http://e2epi.internet2.edu/pipes/ami/bwctl/

e-VLBI:
– Done: 1 Gbps upgrade at Kashima
– On-going: 2.5 Gbps upgrade at Haystack
– Experiments using 1 gigabit per second or more
– Using real-time correlation

Page 8:

APAN JP maps (written in Perl and fig2dev)

Page 9:

Purposes

• Measure, analyze, and improve end-to-end performance in high bandwidth-delay product networks
  – to support networked science applications
  – to help operations find a bottleneck
  – to evaluate advanced transport protocols (e.g. Tsunami, SABUL, HSTCP, FAST, XCP, [ours])

• Improve TCP under easier conditions
  – with a single TCP stream
  – memory to memory
  – a bottleneck but no cross traffic

Consume all the available bandwidth.

Page 10:

Path

[Diagram: a sender and a receiver connected by access links (B1 on the sender side, B3 on the receiver side) and a backbone (B2).]

a) Without a bottleneck queue: B1 <= B2 and B1 <= B3, so the sender's access link is the slowest hop and no queue builds inside the path.

b) With a bottleneck queue: B1 > B2 or B1 > B3, so a queue builds where traffic enters the bottleneck link (a small sketch of this condition follows).
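As a small illustration (not from the slides), the condition above can be checked directly; the example capacities below are hypothetical:

```python
# Illustrative sketch: classify a sender -> backbone -> receiver path using
# the condition from this slide (B1 = sender access, B2 = backbone,
# B3 = receiver access, all in Mbps). Example values are hypothetical.
def has_bottleneck(b1: float, b2: float, b3: float) -> bool:
    """True when the sender can outrun the backbone or the far-end access link."""
    return b1 > b2 or b1 > b3

print(has_bottleneck(1000, 10000, 1000))  # case a): False, no interior queue
print(has_bottleneck(1000, 800, 1000))    # case b): True, queue at the 800M link
```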

Page 11:

TCP on a path with bottleneck

[Diagram: packets queue at the bottleneck; the queue overflows and packets are lost.]

• The sender may generate burst traffic.
• The sender recognizes the overflow only after a delay (< RTT).
• The bottleneck may change over time.

Page 12:

Limiting the Sending Rate

a) The sender transmits at the full 1 Gbps into a congested path: throughput collapses to ~20 Mbps.

b) The sender limits itself to 100 Mbps on the same congested path: throughput reaches ~90 Mbps (better!).

Page 13:

• Web100 (http://www.web100.org)
  – A kernel patch for monitoring/modifying TCP metrics in the Linux kernel
  – We need to know TCP behavior to identify a problem.

• Iperf (http://dast.nlanr.net/Projects/Iperf/)
  – TCP/UDP bandwidth measurement

• bwctl (http://e2epi.internet2.edu/bwctl/)
  – Wrapper for iperf with authentication and scheduling
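For illustration only, a minimal sketch of how such a memory-to-memory iperf test might be driven from Python; the hostname and parameter values are placeholders, not taken from the talk, and the receiver is assumed to already be running "iperf -s -w 64M".

```python
# Minimal sketch (hypothetical hostname and parameters): run single-stream
# memory-to-memory tests with the iperf command-line tool via subprocess.
import subprocess

RECEIVER = "receiver.example.org"  # placeholder test host

# TCP: one stream, 64 MB socket buffer, 60 s test, report every 5 s.
subprocess.run(["iperf", "-c", RECEIVER, "-w", "64M", "-t", "60", "-i", "5"],
               check=True)

# UDP: 900 Mbps offered load for 30 s, to look for host/NIC limits and loss.
subprocess.run(["iperf", "-c", RECEIVER, "-u", "-b", "900M", "-t", "30"],
               check=True)
```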

Page 14:

1st Step: Tuning a Host with UDP

• Remove any bottlenecks on the host
  – CPU, memory, bus, OS (driver), …

• Dell PowerEdge 1650 (*not enough power)
  – Intel Xeon 1.4 GHz x1 (2), memory 1 GB
  – Intel Pro/1000 XT onboard, PCI-X (133 MHz)

• Dell PowerEdge 2650
  – Intel Xeon 2.8 GHz x1 (2), memory 1 GB
  – Intel Pro/1000 XT, PCI-X (133 MHz)

• Iperf UDP throughput 957 Mbps (the GbE wire rate after subtracting UDP (8 B) + IP (20 B) + Ethernet II framing (38 B) overhead; worked out below)
  – Linux 2.4.26 (RedHat 9) with web100
  – PE1650: TxIntDelay=0
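The 957 Mbps figure is simply the GbE payload ceiling once per-packet header and framing overhead are removed; the short check below reproduces it, along with the 941 Mbps TCP figure on the next page (standard header sizes, 1500-byte IP MTU).

```python
# Check of the GbE throughput ceilings quoted on these slides.
# Per frame on the wire: 1500 B IP packet + 38 B Ethernet II overhead
# (preamble 8 + header 14 + FCS 4 + inter-frame gap 12).
GBE_BPS = 1_000_000_000
IP_MTU = 1500
ETH_OVERHEAD = 38

def goodput_mbps(l4_header_bytes: int) -> float:
    payload = IP_MTU - 20 - l4_header_bytes        # 20 B IPv4 header
    return GBE_BPS * payload / (IP_MTU + ETH_OVERHEAD) / 1e6

print(round(goodput_mbps(8)))    # UDP, 8 B header                  -> ~957 Mbps
print(round(goodput_mbps(32)))   # TCP with timestamps, 32 B header -> ~941 Mbps
```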

Page 15:

2nd Step: Tuning a Host with TCP

• Maximum socket buffer size (TCP window size); see the sketch after this list
  – net.core.wmem_max, net.core.rmem_max (64 MB)
  – net.ipv4.tcp_wmem, net.ipv4.tcp_rmem (64 MB)

• Driver descriptor length
  – e1000: TxDescriptors=1024, RxDescriptors=256 (default)

• Interface queue length
  – txqueuelen=100 (default)
  – net.core.netdev_max_backlog=300 (default)

• Interface queue discipline
  – fifo (default)

• MTU
  – mtu=1500 (IP MTU)

• Iperf TCP throughput 941 Mbps (the GbE wire rate after subtracting TCP (32 B) + IP (20 B) + Ethernet II framing (38 B) overhead)
  – Linux 2.4.26 (RedHat 9) with web100

• Web100 (incl. HighSpeed TCP)
  – net.ipv4.web100_no_metric_save=1 (do not store TCP metrics in the route cache)
  – net.ipv4.WAD_IFQ=1 (do not send a congestion signal on buffer full)
  – net.ipv4.web100_rbufmode=0, net.ipv4.web100_sbufmode=0 (disable auto-tuning)
  – net.ipv4.WAD_FloydAIMD=1 (HighSpeed TCP)
  – net.ipv4.web100_default_wscale=7 (default)
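A minimal sketch, assuming a Linux host with /proc/sys and root privileges, of how the 64 MB socket-buffer limits listed above could be applied. The min/default values below are ordinary stock values rather than figures from the talk, and the web100-specific knobs (which exist only on a web100-patched kernel) are left out.

```python
# Minimal sketch: apply the 64 MB socket-buffer limits from this slide by
# writing the corresponding /proc/sys entries (requires root on Linux).
# tcp_wmem / tcp_rmem take three values: min, default, max (bytes).
MAX_BUF = 64 * 1024 * 1024  # 64 MB

settings = {
    "/proc/sys/net/core/wmem_max": str(MAX_BUF),
    "/proc/sys/net/core/rmem_max": str(MAX_BUF),
    "/proc/sys/net/ipv4/tcp_wmem": f"4096 65536 {MAX_BUF}",
    "/proc/sys/net/ipv4/tcp_rmem": f"4096 87380 {MAX_BUF}",
}

for path, value in settings.items():
    with open(path, "w") as f:
        f.write(value + "\n")
    print(f"{path} <- {value}")
```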

Page 16:

Network Diagram for TransPAC/I2 Measurement (Oct. 2003)

[Diagram: Kashima (0.1G) and Koganei (1G) connect to the Tokyo XP; TransPAC (2.5G, 1G x2) carries traffic ~9,000 km to Los Angeles; Abilene (10G) continues ~4,000 km via Indianapolis and Washington DC to the I2 venue (1G) and MIT Haystack. General-purpose and e-VLBI test servers sit at both ends of the path.]

Test hosts (sender/receiver pair):
– Mark5: Linux 2.4.7 (RH 7.1), P3 1.3 GHz, 256 MB memory, GbE SK-9843
– PE1650: Linux 2.4.22 (RH 9), Xeon 1.4 GHz, 1 GB memory, GbE Intel Pro/1000 XT

Iperf UDP: ~900 Mbps (no loss)

Page 17:

TransPAC/I2 #1: High Speed (60 mins)

Page 18:

TransPAC/I2 #2: Reno (10 mins)

Page 19:

TransPAC/I2 #3: High Speed (Win 12MB)

Page 20:

Test in a laboratory – with bottleneck

Setup: sender (PE 2650) and receiver (PE 1650) connected over GbE/T through an L2 switch (FES12GCF, GbE/SX) and a PacketSphere emulator set to bandwidth 800 Mbps, buffer 256 KB, delay 88 ms, loss 0. 2*BDP = 16 MB.

• #1: Reno => Reno
• #2: HighSpeed TCP => Reno

Page 21:

Laboratory #1, #2: 800M bottleneck

[Throughput plots: Reno and HighSpeed.]

Page 22:

Laboratory #3, #4, #5: High Speed (Limiting)

[Throughput plots comparing three ways of limiting the sender: window size (16 MB), rate control (270 us every 10 packets, 95%), and cwnd clamp, each combined with limited slow-start (100 or 1000).]

Page 23:

How to know when the bottleneck changed

• The end host probes periodically (e.g. with a packet train; a toy sketch follows below)
• The router notifies the end host (e.g. XCP)
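As a toy illustration of the packet-train idea (a generic estimator, not the method used in this work): if a burst of back-to-back packets is spread out by the bottleneck, the receiver can estimate the bottleneck rate from the arrival dispersion.

```python
# Toy packet-train estimator (generic technique, illustrative values only):
# N back-to-back packets of L bytes arriving over T seconds imply the
# bottleneck forwarded them at roughly (N - 1) * L * 8 / T bits per second.
def bottleneck_estimate_mbps(arrival_times_s, packet_len_bytes=1500):
    n = len(arrival_times_s)
    dispersion = arrival_times_s[-1] - arrival_times_s[0]
    return (n - 1) * packet_len_bytes * 8 / dispersion / 1e6

# Example: 10 packets spaced 15 microseconds apart -> ~800 Mbps bottleneck.
times = [i * 15e-6 for i in range(10)]
print(round(bottleneck_estimate_mbps(times)))   # ~800
```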

Page 24:

Another approach: enough buffer on the router

• At least 2 x BDP (bandwidth-delay product), e.g. 1 Gbps x 200 ms x 2 = 400 Mbit ~ 50 MB (see the check below)

• Replace fast SRAM with DRAM in order to reduce space and cost
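The 50 MB figure follows directly from the definition; a one-line check:

```python
# Check of the router-buffer sizing above: two bandwidth-delay products
# for a 1 Gbps path with a 200 ms round-trip time.
bandwidth_bps = 1_000_000_000
rtt_s = 0.200

buffer_bits = 2 * bandwidth_bps * rtt_s
print(buffer_bits / 1e6)      # 400.0 Mbit
print(buffer_bits / 8 / 1e6)  # 50.0  MB
```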

Page 25:

Test in a laboratory – with bottleneck (2)

Setup: as before, sender (PE 2650) and receiver (PE 1650) over GbE/T through an L2 switch (FES12GCF, GbE/SX), but with a network emulator providing a larger buffer: bandwidth 800 Mbps, buffer 64 MB, delay 88 ms, loss 0. 2*BDP = 16 MB.

• #6: HighSpeed TCP => Reno

Page 26:

Laboratory #6: 800M bottleneck

[Throughput plot: HighSpeed.]

Page 27:

Report on MTU

• Increasing the MTU (packet size) results in better performance (see the check below). The standard MTU is 1500 B; a 9 KB MTU is available throughout the Abilene, TransPAC, and APII backbones.

• On Aug 25, 2004, the remaining 1500 B link in the Tokyo XP was upgraded to 9 KB. A 9 KB MTU is now available from Busan to Los Angeles.
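The benefit is easy to quantify: per-packet header and framing overhead shrinks as the MTU grows. Using the same header assumptions as on the tuning slides (20 B IP, 32 B TCP, 38 B Ethernet framing):

```python
# TCP goodput ceiling on GbE as a function of IP MTU, with 20 B IP + 32 B TCP
# headers and 38 B of Ethernet framing overhead per frame.
def tcp_ceiling_mbps(ip_mtu_bytes: int) -> float:
    return 1e9 * (ip_mtu_bytes - 52) / (ip_mtu_bytes + 38) / 1e6

print(round(tcp_ceiling_mbps(1500)))   # ~941 Mbps with the standard 1500 B MTU
print(round(tcp_ceiling_mbps(9000)))   # ~990 Mbps with a 9 KB jumbo MTU
```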

Page 28:

Current and Future Plans of e-VLBI

• KOA (Korean Observatory of Astronomy) has one existing radio telescope, but in a different band from ours. They are building another three radio telescopes.

• Using a dedicated light path from Europe to Asia through the US is being considered.

• An e-VLBI demonstration at SuperComputing 2004 (November) is being planned, interconnecting radio telescopes from Europe, the US, and Japan.

• A gigabit A/D converter is ready, and a 10G version is now being implemented.

• Our performance measurement infrastructure will be merged into a framework of a Global (Network) Observatory maintained by NOC people (Internet2 piPEs, APAN CMM, and e-VLBI).

Page 29:

Questions?

• See http://www2.nict.go.jp/ka/radioastro/index.html for VLBI