TRANSCRIPT
On the joint use of new TCP proposals and IP-QOS on high bandwidth-RTT product paths
Andrea Di Donato
Centre of Excellence on Networked Systems, University College London, United Kingdom
TERENA 2004 - Rhodes
Pillars….
Case study: a few users require a reliable bulk transcontinental data transfer at multi-Gb/s speed
Why TCP?
• Provides reliable data transfer of large data sets
• Allows the usage of existing transport programs, e.g. FTP, GridFTP etc.
TCP’s issues for transcontinental Gbps transfers
• TCP’s congestion control algorithm, additive increase multiplicative decrease (AIMD), performs poorly over paths of high bandwidth-round-trip-time product (BW*RTT)
• The TCP aggregate’s stability is even more precarious if such an aggregate consists of only a few flows
• This is exactly the case for the scientific community, where…
– few users
– reliable bulk data transfer
– multi-Gb/s speed
ρ (c³τ³)/(4N⁴) · [cτ/(2N) + 1 + 1/(αcτ)] < π(1−β)² / √[4β² + π²(1−β)²] = const.

[Low] Linear stability of TCP/RED and a scalable control. Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle. May 6, 2003.
So….. let’s try different TCPs
Standard, HS-TCP and Scalable TCP’s AIMDs
The Standard TCP AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + 1/w (w = congestion window)
• CWND increases by one at most every RTT
– and in response to a congestion event: w’ = 0.5 * w
The modified HighSpeed TCP AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + a(w)/w (higher ‘w’ gives higher a(w))
• CWND increases more than it does for Standard TCP
– and in response to a congestion event: w’ = (1 − b(w)) * w (higher ‘w’ gives lower b(w))
The modified Scalable TCP AIMD algorithm is as follows:
– for each acknowledgement received in a round-trip time: cwnd’ = cwnd + 0.01
• CWND increases by 1 every 100 ACKs received
– and on the first detection of congestion in a given round-trip time: cwnd’ = cwnd − 0.125 * cwnd
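The three per-ACK / per-congestion-event rules above can be sketched as follows (a simplified Python model; function names are ours, and slow start, thresholds and packet timing are ignored):

```python
def standard_tcp(w, event):
    """Standard (Vanilla) TCP AIMD: +1/w per ACK, halve on a congestion event."""
    if event == "ack":
        return w + 1.0 / w          # cwnd grows by ~1 segment per RTT
    return 0.5 * w                  # multiplicative decrease

def highspeed_tcp(w, event, a, b):
    """HS-TCP: a(w) and b(w) are table-driven; higher w gives larger a(w), smaller b(w)."""
    if event == "ack":
        return w + a(w) / w
    return (1.0 - b(w)) * w

def scalable_tcp(w, event):
    """Scalable TCP: +0.01 per ACK (i.e. +1 every 100 ACKs), -12.5% on congestion."""
    if event == "ack":
        return w + 0.01
    return w - 0.125 * w
```

Feeding toy a(w)/b(w) functions into `highspeed_tcp` makes the design point visible: for large windows HS-TCP climbs faster and backs off less than Standard TCP.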
Are the new TCPs more adequate for a trans-continental data transfer?
Test-bed layout
Path Bottleneck = 1Gbps Min RTT = 118 ms
PCs: Supermicro 6022P-6, dual Intel® Xeon; Linux kernel: 2.4.20; NIC: SysKonnect 98xx; NIC driver: sk98lin 6.19
Juniper M10
AF
BE
Standard TCP
HS-TCP
Scalable TCP
Chicago
TCP applications
Competing background traffic
Geneva: UDP CBR
Standard vs new TCP stacks: the inadequacy of standard TCP in fat long pipes
Throughput Coefficient of Variance = St.Dev.Th / Mean Th.
• The purpose is to show that a small number of standard-TCP flows is not adequate to take advantage of the very large spare capacity when operating in a high-RTT environment.
• The figures show how poor the Standard (Vanilla) TCP throughput is, even when competing with very small CBR background traffic.
• For Standard (Vanilla) TCP, the average throughput shows little dependence on the available bandwidth, unlike the other stacks. This rather passive behaviour reveals that Standard TCP is weak on a non-protected path in these ranges of bandwidth-delay products.
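The throughput Coefficient of Variance metric defined at the top of the slide is straightforward to compute; a minimal sketch (using the population standard deviation, one of several reasonable conventions):

```python
from statistics import mean, pstdev

def throughput_cov(samples):
    """Coefficient of Variance of a throughput time series: st. dev. / mean.

    A low CoV means a stable transfer rate; a high CoV means large
    oscillations around the mean throughput.
    """
    return pstdev(samples) / mean(samples)
```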
CBR Load
Are the new TCP stacks invasive?
Can these new stacks impact traditional traffic? User Impact Factors investigation: the METRIC

case 1: N Standard TCP flows
case 2: N flows: N−1 Standard TCP flows + 1 new-TCP-stack flow (red)

User Impact Factor (UIF) = Avg. throughput of 1 Standard TCP in case 1 (green) / Avg. throughput of 1 Standard TCP in case 2 (green)
New stacks: User Impact Factors investigation (1)
UIF = Avg. throughput of 1 Standard TCP when with N Standard TCP flows / Avg. throughput of 1 Standard TCP when with N−1 Standard TCP + 1 new TCP

[Plot: UIF vs. number of background Vanilla TCP flows, with a UIF = 1 reference line]

UIF is greater than one for a certain range of values…
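The UIF metric above reduces to a ratio of two measured averages; a minimal sketch (the argument lists are hypothetical throughput samples, in Mbps, of the observed Standard TCP flow in each scenario):

```python
def user_impact_factor(case1_samples, case2_samples):
    """UIF = avg throughput of one Standard TCP flow among N Standard flows (case 1)
    divided by the avg throughput of the same flow when one competitor is replaced
    by a new-stack flow (case 2).

    UIF > 1 means the Standard flow got less throughput once the new stack
    joined, i.e. the new stack is invasive.
    """
    avg_case1 = sum(case1_samples) / len(case1_samples)
    avg_case2 = sum(case2_samples) / len(case2_samples)
    return avg_case1 / avg_case2
```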
The results obtained so far suggest we need a mechanism for segregating traffic sharing the same packet-switched path. This is in order to:
1. protect traditional BE traffic from the possible aggression of the new TCP stacks
2. guarantee a level of end-to-end service predictability for TCP flows which is sufficient to enforce a network-resources reservation system through the use of some sort of middleware architecture (GRID)
The need for QoS….
Test-bed layout
Path Bottleneck = 1Gbps Min RTT = 118 ms
PCs: Supermicro 6022P-6, dual Intel® Xeon; Linux kernel: 2.4.20; NIC: SysKonnect 98xx; NIC driver: sk98lin 6.19
Juniper M10
AF
AF
BE
BE
Standard TCP
HS-TCP
Scalable TCP
Geneva
AF
BE
Priority and X Mbps
1000-X Mbps
AF: class for TCP applications
BE: class for the legacy web traffic
Chicago
QoS
1 Gb/s UDP CBR
QoS tests: metric used
• QoS Efficiency Factor (QEF) measures how efficiently traffic utilises a minimum guaranteed bandwidth reservation:
– AF-QEF = AF_Throughput / AF_BW-reserved
– BE-QEF = BE_Throughput / BE_BW-reserved
• TCP convergence-speed estimation: (AF-QEF)s − (AF-QEF), where s = steady-state phase of the test
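Both metrics on this slide are simple ratios and differences; a minimal sketch (function names are ours, throughputs and reservations in the same unit, e.g. Mbps):

```python
def qef(measured_throughput, reserved_bw):
    """QoS Efficiency Factor: fraction of a class's minimum-bandwidth
    reservation that the class actually achieves (1.0 = reservation fully used)."""
    return measured_throughput / reserved_bw

def convergence_gap(af_qef_steady, af_qef_whole_test):
    """(AF-QEF)s - (AF-QEF): gap between steady-state efficiency and
    whole-test efficiency; a large gap indicates slow TCP convergence."""
    return af_qef_steady - af_qef_whole_test
```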
BE/AF QoS Efficiency Factors (1)
• txqueuelen and backlog = 50000
• TCP started with an offset of 30 sec
• 100 MB socket buffer size
• Congestion moderation on

[Plots: AF-QEF and BE-QEF in steady state vs. AF allocation (%)]
BE/AF QoS Efficiency Factors (2)

[Plot: (AF-QEF)s − (AF-QEF)ts vs. AF allocation]

This is where HS and Standard TCP differ.
Let’s see some TYPICAL dynamics in time when AF is allocated 900 Mbps and BE 100 Mbps
Standard TCP with 900 Mbps guaranteed

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]
HS-TCP with 900 Mbps guaranteed

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]
Scalable TCP with 900 Mbps guaranteed

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]
Clamping (50 MBytes socket buffer size)

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]

Same performance regardless of the TCP stack used, since it is “NOT” TCP.
Results analysis…
• HS-TCP and Scalable TCP experienced (with different magnitudes) “unusual” slow-start periods.
• “Unusual” as there were not any timeouts experienced.
Zoom on the “unusual” slow-start period experienced by all the TCP senders, but with different magnitude: we found silences before…

[Plots: cwnd vs. time, zoomed on the silent periods]
QoS Results Interpretation (1): Background - what are SACKs?
Selective Acknowledgements (SACKs) are control packets…
• sent from the TCP receiver to the sender
• informing the sender which data segments the receiver has
• allowing the sender to infer and re-transmit lost data
• required to prevent TCP going into slow start after a multiple drop - MANDATORY for high ranges of the CWND!
We know the current SACK processing (*) is not yet optimal.
(*) Although the Tom Kelly SACK patch improvement is used.
BUT…
QoS results interpretation (2): Hypothesis
Here we formulate a segment-level perspective hypothesis:
The burst of SACKs coming back might cause the sender host to hang (periods of silence) when AF is allocated a high fraction of the BW, causing the host to reset everything and, as a consequence, making TCP enter slow start after each period of silence.
• Vanilla wouldn’t be affected by this, as its congestion-avoidance CWND gradient doesn’t increase with higher CWND values.
• HS-TCP and Scalable would be affected, as their congestion-avoidance CWND gradients increase with higher CWND values.
• Scalable is the most affected, since it has the steepest congestion-avoidance gradient, and this provokes a longer SACK burst than the one HS-TCP presents.
Let’s try and validate such a segment-level perspective hypothesis through the use of a network solution: WRED.
This is inspired by the fact that WRED is already a solution for TCP stability, but from a flow-level perspective… remember???
ρ (c³τ³)/(4N⁴) · [cτ/(2N) + 1 + 1/(αcτ)] < π(1−β)² / √[4β² + π²(1−β)²] = const.
What is WRED?
WRED (Weighted Random Early Detection) is a router AQM (Active Queue Management) solution where a drop probability, as a function of router queue occupancy, is assigned to each class queue.
[Plot: WRED profiles wred1, wred2, wred3: probability of dropping vs. queue occupancy (%)]
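The profiles in the plot are piecewise-linear curves of drop probability vs. queue occupancy. A generic sketch of such a profile (the textbook RED shape with hypothetical min/max thresholds, not the exact Juniper profiles shown):

```python
def wred_drop_probability(occupancy, min_th, max_th, max_p):
    """Piecewise-linear RED/WRED drop profile.

    occupancy, min_th, max_th: percentages of queue depth.
    Below min_th nothing is dropped; between min_th and max_th the drop
    probability ramps linearly up to max_p; at or above max_th everything
    is dropped. A 'gentle' profile uses a small max_p / shallow ramp.
    """
    if occupancy < min_th:
        return 0.0
    if occupancy >= max_th:
        return 1.0
    return max_p * (occupancy - min_th) / (max_th - min_th)
```

A gentle profile (e.g. `max_p = 0.01`) spreads a few early drops over time instead of bursts of tail drops, which is exactly the smoothing effect the tests below rely on.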
WRED: test rationale
• Dual-approach rationale: the same tool (WRED) but from different perspectives.
• Flow-level perspective: the oscillations are the effect of protocol instability. This being the case, a gentle (small ρ) Weighted Random Early Detection (WRED) drop profile should be able to facilitate the validation of the inequality expression governing the stability condition for the TCP flow [low]. This is what control theory says at flow level.
• Segment-level perspective (our rationale): a smoother distribution of the loss pattern along the AF scheduler buffer would help reduce the LENGTH of the bursts of SACKs coming back, and therefore would help in not stalling the sender. This, if successful, would validate the hypothesis made before!

[low] Linear stability of TCP/RED and a scalable control. Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle. May 6, 2003.
WRED “test-bed”
TRANSATLANTIC
Juniper M10
AF
AF
BE
BE
Standard TCP
HS-TCP
Scalable TCP
Chicago
AF
BE
WRED + Priority and 900 Mbps
100 Mbps
AF: class for TCP applications
1Gbps BE UDP CBR Background traffic
Geneva
1 Gb/s UDP CBR
[Plot: WRED profiles wred1, wred2, wred3: probability of dropping vs. queue occupancy (%)]
Results with WRED…
As expected, only the “gentle” WRED works. Other profiles are found to be too aggressive, inducing too much loss.
[Plots: throughput vs. time; cwnd vs. time]
Results with WRED: summary
gentle WRED

[Plots: cwnd and throughput vs. time]
Conclusions (1/5): Summary/Results
With pretty empty fat-long pipes, TCP as it is now would invariably fail in terms of high-performance throughput.
New TCP stacks are therefore felt to be necessary (Scalable, HS-TCP, …)
………BUT
New TCP stacks can affect the performance of traditional BE traffic: IP-QoS is therefore tried out as a solution for
– legacy (BE) traffic protection
– service predictability
With a 1 Gbps IP-QoS-enabled static fat-long path and the current state of the TCP Linux kernel/CPU speed, Standard TCP performs surprisingly better than HS-TCP and Scalable, with HS-TCP much closer to Standard than to Scalable TCP.
Conclusions (2/5): results-proven theory
SACK-burst processing is not an issue for Vanilla TCP, since its congestion-avoidance slope doesn’t increase with increasing CWND.
SACK-burst processing is an issue for HS-TCP and Scalable TCP, as their congestion-avoidance slopes increase with increasing CWND, with Scalable’s steeper than HS-TCP’s.
For high values of the BW allocated to HS and Scalable TCP, the steeper the CWND congestion-avoidance gradient (Scalable), the higher/longer the SACK rate/burst, making the sender host:
1. stall
2. experience a period of silence every time CWND “hits” the QoS pipe allocated
3. reset all its variables, and thus also the TCP connection, to slow start
Conclusions (3/5): problem generalisation and remedies
For high BW allocations there is an inherently higher risk, for very aggressive transport protocols like Scalable TCP, of stalling the sender, with the effect of drastically reducing stability and thus throughput performance. A proven remedy is switching off code whose bad performance is worse than not having it in place at all, as with Congestion Moderation.
For even higher BW allocations, the network itself needs to smooth out the loss pattern, thereby reducing the SACK rate/burst. The use of a gentle WRED has been proven extremely effective.
Conclusions (4/5)
In a QoS-enabled high bandwidth-delay product path, Standard TCP performs better than the other protocols because it is protected (large buffers on the Juniper)… CWND grows so slowly (and UNDOs + Congestion Moderation also help greatly) that it never hits the CWND limit; therefore real drops are prevented.
HS-TCP under QoS shows more or less the same performance as Standard. BUT in case of random drop (even in a protected environment there is always a BER > 0…), which would be lethal to a high-performance bulk data transfer over Standard TCP, HS-TCP is preferred, as it would react notably better by design.
Conclusions (5/5)
All the possible solutions to the SACK processing issue:
• Software: optimising the TCP code
• Hardware: more CPU speed
• Network: Active Queue Management (WRED)
The network/AQM solution found can relax the requirement of any further optimisation of the SACK code.
However, we believe that the main limitation in using aggressive TCP protocols lies in not having enough CPU speed.
In an environment where the available bandwidth is greater than 1 Gbit/s, the problems are definitely further compounded for Scalable and, above a certain CWND value, also for HS-TCP. This is due to the larger range of CWND values needed to utilise those higher available bandwidths. Bigger CWND values mean longer bursts of dropped packets, which induce longer SACK bursts to be processed at the sender.
OUR “third way” SOLUTION!
Publications
“Using QoS for High Throughput TCP Transport Over Fat Long Pipes”. Andrea Di Donato, Yee-Ting Li, Frank Saka and Peter Clarke. In proceedings of the Second International Workshop on Protocols for Fast Long-Distance Networks, February 16-17, 2004, Argonne National Laboratory, Argonne, Illinois, USA.
“On the joint use of new TCP proposals and IP-QoS on high Bandwidth-RTT product paths”. Andrea Di Donato, Frank Saka, Javier Orellana and Peter Clarke. In proceedings of the Trans-European Research and Education Networking Association (TERENA) networking conference 2004, 7-10 June 2004, Rhodes, Greece.
THANK YOU……….
APPENDIX
CWND and UNDOs definitions
• Congestion moderation:
– triggered by a dubious event, typically the acknowledgement of more than 3 packets. The action induced is that CWND is set to the number of packets thought to be in flight.
• UNDOs:
– ssthresh and cwnd are reduced because a loss is thought to have happened, but the sudden reception of THE ack resets both cwnd and ssthresh to their previous values.
Four things to keep in mind when looking at the AF-QEF results:
1. Regardless of the TCP implementation, a large bandwidth allocation means higher CWND values. This means a higher probability of dropping more packets within the same CWND.
• This, along with the smaller value of the AF bandwidth-delay products as well as the CWND threshold, explains why all the protocols tested perform the same under small BW allocations.
2. For the new TCP stacks tested, the bigger the allocated bandwidth, the higher the loss rate induced by the scheduler during congestion, since the loss recovery (congestion-avoidance gradient) for bigger CWND is more aggressive.
3. Selective Acknowledgements (SACKs) are feedback from the receiver telling the sender which packets the receiver has, so that the sender can deduce which packets to resend.
4. Too many SACKs (as a result of a major multiple drop in a CWND) to be processed at once (a burst of SACKs) is recognised to be an issue for the current kernels and CPUs in use.
• This, along with points 1 and 2, leads us to say that for protocols with a steep congestion-avoidance gradient like Scalable, and for high allocated bandwidth, SACK processing becomes an issue. This explains why Scalable’s performance for higher bandwidth allocations was so poor in our tests.
• Thus, the difference in performance is due to an implementation issue whose magnitude is, though, a function of the dynamics of the TCP protocol in use, and not due to the protocol design per se.
Vanilla TCP (again)
• The standard AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + 1/w (w = congestion window)
– and in response to a congestion event: w’ = 0.5 * w
• The increase and decrease parameters of the AIMD algorithm are fixed at 1 and 0.5.
• With 1500-byte packets and a 100 ms RTT, achieving a steady-state throughput of 10 Gbps would require a packet drop rate of at most one drop every 1 2/3 hours.
• TCP’s congestion control algorithm additive increase multiplicative decrease (AIMD) performs poorly over paths of high bandwidth-Round-Trip-Time product and the aggregate stability is even more precarious if such aggregate consists of only few flows as it is the case for the scientific community where few users require a reliable bulk transcontinental data transfer at Gb/s speed
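The drop-rate figure above can be checked with the usual deterministic AIMD sawtooth model (one drop per sawtooth cycle; the average window is 3/4 of the peak, since the window oscillates between peak/2 and peak). A sketch of the arithmetic:

```python
MSS = 1500 * 8          # segment size in bits
RTT = 0.100             # round-trip time in seconds
TARGET = 10e9           # target steady-state throughput in bits/s

# Average number of segments in flight needed to sustain TARGET.
avg_window = TARGET * RTT / MSS            # ~83,333 segments

# Sawtooth model: window oscillates between peak/2 and peak, avg = 3/4 peak.
peak_window = avg_window * 4 / 3           # ~111,111 segments

# Climbing back from peak/2 to peak at +1 segment per RTT takes peak/2 RTTs.
cycle_rtts = peak_window / 2
cycle_seconds = cycle_rtts * RTT           # ~5,556 s, i.e. roughly 1 2/3 hours

packets_per_cycle = (TARGET / MSS) * cycle_seconds   # ~4.6e9 packets per drop
```

So one congestion event may be tolerated only every ~1.5 hours (a few billion packets), which is the slide’s point about why Standard TCP cannot fill such a pipe in practice.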
HS-TCP
• If the congestion window <= threshold, use the Standard AIMD algorithm; else use the HighSpeed AIMD algorithm.
• The modified HighSpeed AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + a(w)/w (higher ‘w’ gives higher a(w))
– and in response to a congestion event: w’ = (1 − b(w)) * w (higher ‘w’ gives lower b(w))
• The increase and decrease parameters vary based on the current value of the congestion window.
• For the same packet size and RTT, a steady state throughput of 10 Gbps can be achieved with a packet drop rate at most once every 12 seconds
Scalable
• The slow-start phase of the original TCP algorithm is unmodified. The congestion-avoidance phase is modified as follows:
– for each acknowledgement received in a round-trip time: cwnd’ = cwnd + 0.01
– and on the first detection of congestion in a given round-trip time: cwnd’ = cwnd − 0.125 * cwnd
• Like HighSpeed TCP this has a threshold window size and the modified algorithm is used only when the size of the congestion window is above the threshold window size. Though values of 0.01 and 0.125 are suggested for the increase and decrease parameters, they (as well as threshold window size) can be configured using the proc file system (by a superuser).
New stacks: User Impact Factors investigation (2)
The same as for the previous plot, but with 10 new-TCP-stack flows: the UIFs are even higher than in the 1-new-TCP-flow scenario, and this corroborates the results even more, as this is a much more realistic scenario.

[Plot: UIF (10 new TCP flows) vs. number of background Vanilla TCP flows, with a UIF = 1 reference line]
IP-QOS (à la DiffServ)
• Packets are DSCP-marked at the sender host.
• Two classes are defined: BE and AF.
• Packets are DSCP-classified at the router input ports.
• Each class is put in a physically different IP-level router output queue, each assigned a minimum guaranteed bandwidth (WRR): low priority for BE, high priority for AF.
Juniper M10 BW scheduler “priority”
• The queue weight ensures the queue is provided a given minimum amount of bandwidth which is proportional to the weight. As long as this minimum has not been served, the queue is said to have a “positive credit”. Once this minimum amount is reached, the queue has a “negative credit”.
• The credits are decreased immediately when a packet is sent. They are increased frequently (by the weight assigned).
• For each packet, the WRR algorithm strictly follows this queue service order:
1. High priority, positive credit queues 2. Low priority, positive credit queues 3. High priority, negative credit queues 4. Low priority, negative credit queues
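The four-step service order above can be sketched as a selection function (the queue data model here is hypothetical, for illustration only; the real scheduler works per packet inside the router):

```python
def next_queue(queues):
    """Pick the next queue to serve following the Juniper M10 WRR order:
    1) high priority, positive credit   2) low priority, positive credit
    3) high priority, negative credit   4) low priority, negative credit.

    Each queue is a dict with 'name', 'priority' ('high'/'low') and
    'credit' (positive while the queue is below its guaranteed minimum).
    """
    service_order = [("high", True), ("low", True), ("high", False), ("low", False)]
    for priority, wants_positive in service_order:
        for q in queues:
            if q["priority"] == priority and (q["credit"] > 0) == wants_positive:
                return q["name"]
    return None  # no queue has packets to send
```

Note the consequence visible in the order: a low-priority queue that has not yet received its guaranteed minimum (positive credit) is served before a high-priority queue that has already exceeded its own (negative credit), which is how the BE reservation is protected from AF.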
Router BW scheduler benchmarking
JUNIPER M10
• Cards used: 1x G/E, 1000BASE-SX REV 01
• OS version used: “JunOS 5.3R2.4”
BE
AF