TRANSCRIPT
On the joint use of new TCP proposals and IP-QOS on high bandwidth-RTT product paths
Andrea Di Donato
Centre of Excellence on Networked Systems, University College London, United Kingdom
TERENA 2004 - Rhodes
Pillars….
Case study: a few users require a reliable bulk transcontinental data transfer at multi-Gb/s speed
Why TCP?
• Provides reliable data transfer of large data sets
• Allows the usage of existing transport programs, e.g. FTP, GridFTP etc.
TCP’s issues for transcontinental Gbps transfers
• TCP’s congestion control algorithm, additive increase multiplicative decrease (AIMD), performs poorly over paths of high bandwidth-round-trip-time product (BW*RTT)
• The TCP aggregate’s stability is even more precarious if such an aggregate consists of only a few flows
• This is exactly the case for the scientific community, where…
– few users
– reliable bulk data transfer
– multi-Gb/s speed
ρ (c³τ³)/(4N⁴) · [cτ/(2N) + 1 + 1/(αcτ)] < π(1−β)² / √[4β² + π²(1−β)²] = const.

[Low] Linear stability of TCP/RED and a scalable control. Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle. May 6, 2003.
So….. let’s try different TCPs
Standard, HS-TCP and Scalable TCP’s AIMDs
The Standard TCP AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + 1/w (w = congestion window)
• CWND increases by one at most every RTT
– and in response to a congestion event: w’ = 0.5 * w
The modified HighSpeed TCP AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + a(w)/w (higher ‘w’ gives higher a(w))
• CWND increases more than it does for Standard TCP
– and in response to a congestion event: w’ = (1 − b(w)) * w (higher ‘w’ gives lower b(w))
The modified Scalable TCP AIMD algorithm is as follows:
– for each acknowledgement received in a round-trip time: cwnd’ = cwnd + 0.01
• CWND increases by 1 every 100 ACKs received
– and on the first detection of congestion in a given round-trip time: cwnd’ = cwnd − 0.125 * cwnd
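The three per-ACK / per-congestion-event rules above can be sketched as follows (a simplified Python model; function names are ours, and slow start, thresholds and packet timing are ignored):

```python
def standard_tcp(w, event):
    """Standard (Vanilla) TCP AIMD: +1/w per ACK, halve on a congestion event."""
    if event == "ack":
        return w + 1.0 / w          # cwnd grows by ~1 segment per RTT
    return 0.5 * w                  # multiplicative decrease

def highspeed_tcp(w, event, a, b):
    """HS-TCP: a(w) and b(w) are table-driven; higher w gives larger a(w), smaller b(w)."""
    if event == "ack":
        return w + a(w) / w
    return (1.0 - b(w)) * w

def scalable_tcp(w, event):
    """Scalable TCP: +0.01 per ACK (i.e. +1 every 100 ACKs), -12.5% on congestion."""
    if event == "ack":
        return w + 0.01
    return w - 0.125 * w
```

Feeding toy a(w)/b(w) functions into `highspeed_tcp` makes the design point visible: for large windows HS-TCP climbs faster and backs off less than Standard TCP.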
Are the new TCPs more adequate for a trans-continental data transfer?
Test-bed layout
Path Bottleneck = 1Gbps Min RTT = 118 ms
PCs: Supermicro 6022P-6, dual Intel® Xeon; Linux kernel: 2.4.20; NIC: SysKonnect 98xx; NIC driver: sk98lin 6.19
Juniper M10
AF
BE
Standard TCP
HS-TCP
Scalable TCP
Chicago
TCP applications
Competing background traffic
Geneva: UDP CBR
Standard vs new TCP stacks: the inadequacy of standard TCP in fat long pipes
Throughput Coefficient of Variance = St.Dev.Th / Mean Th.
• The purpose is to show that a small number of standard-TCP flows is not adequate to take advantage of the very large spare capacity when operating in a high-RTT environment.
• The figures show how poor the Standard (Vanilla) TCP throughput is, even when competing with very small CBR background traffic.
• For Standard (Vanilla) TCP, the average throughput shows little dependence on the available bandwidth, unlike the other stacks. This rather passive behaviour reveals that Standard TCP is weak on a non-protected path in these ranges of bandwidth-delay products.
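The throughput Coefficient of Variance metric defined at the top of the slide is straightforward to compute; a minimal sketch (using the population standard deviation, one of several reasonable conventions):

```python
from statistics import mean, pstdev

def throughput_cov(samples):
    """Coefficient of Variance of a throughput time series: st. dev. / mean.

    A low CoV means a stable transfer rate; a high CoV means large
    oscillations around the mean throughput.
    """
    return pstdev(samples) / mean(samples)
```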
CBR Load
Are the new TCP stacks invasive?
Can these new stacks impact traditional traffic? User Impact Factors investigation: the METRIC

case 1: N Standard TCP flows
case 2: N flows: N−1 Standard TCP flows + 1 new-TCP-stack flow (red)

User Impact Factor (UIF) = Avg. throughput of 1 Standard TCP in case 1 (green) / Avg. throughput of 1 Standard TCP in case 2 (green)
New stacks: User Impact Factors investigation (1)
UIF = Avg. throughput of 1 Standard TCP when with N Standard TCP flows / Avg. throughput of 1 Standard TCP when with N−1 Standard TCP + 1 new TCP

[Plot: UIF vs. number of background Vanilla TCP flows, with a UIF = 1 reference line]

UIF is greater than one for a certain range of values…
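The UIF metric above reduces to a ratio of two measured averages; a minimal sketch (the argument lists are hypothetical throughput samples, in Mbps, of the observed Standard TCP flow in each scenario):

```python
def user_impact_factor(case1_samples, case2_samples):
    """UIF = avg throughput of one Standard TCP flow among N Standard flows (case 1)
    divided by the avg throughput of the same flow when one competitor is replaced
    by a new-stack flow (case 2).

    UIF > 1 means the Standard flow got less throughput once the new stack
    joined, i.e. the new stack is invasive.
    """
    avg_case1 = sum(case1_samples) / len(case1_samples)
    avg_case2 = sum(case2_samples) / len(case2_samples)
    return avg_case1 / avg_case2
```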
The results obtained so far suggest we need a mechanism for segregating traffic sharing the same packet-switched path. This is in order to:
1. protect traditional BE traffic from the possible aggression of the new TCP stacks
2. guarantee a level of end-to-end service predictability for TCP flows which is sufficient to enforce a network-resources reservation system through the use of some sort of middleware architecture (GRID)
The need for QoS….
Test-bed layout
Path Bottleneck = 1Gbps Min RTT = 118 ms
PCs: Supermicro 6022P-6, dual Intel® Xeon; Linux kernel: 2.4.20; NIC: SysKonnect 98xx; NIC driver: sk98lin 6.19
Juniper M10
AF
AF
BE
BE
Standard TCP
HS-TCP
Scalable TCP
Geneva
AF
BE
Priority and X Mbps
1000-X Mbps
AF: class for TCP applications
BE: class for the legacy web traffic
Chicago
QoS
1 Gb/s UDP CBR
QoS tests: metric used
• QoS Efficiency Factor (QEF) measures how efficiently traffic utilises a minimum guaranteed bandwidth reservation:
– AF-QEF = AF_Throughput / AF_BW-reserved
– BE-QEF = BE_Throughput / BE_BW-reserved
• TCP convergence-speed estimation: (AF-QEF)s − (AF-QEF), where s = steady-state phase of the test
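Both metrics on this slide are simple ratios and differences; a minimal sketch (function names are ours, throughputs and reservations in the same unit, e.g. Mbps):

```python
def qef(measured_throughput, reserved_bw):
    """QoS Efficiency Factor: fraction of a class's minimum-bandwidth
    reservation that the class actually achieves (1.0 = reservation fully used)."""
    return measured_throughput / reserved_bw

def convergence_gap(af_qef_steady, af_qef_whole_test):
    """(AF-QEF)s - (AF-QEF): gap between steady-state efficiency and
    whole-test efficiency; a large gap indicates slow TCP convergence."""
    return af_qef_steady - af_qef_whole_test
```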
BE/AF QoS Efficiency Factors (1)
• txqueuelen and backlog = 50000
• TCP started with an offset of 30 sec
• 100 MB socket buffer size
• Congestion moderation on

[Plots: AF-QEF and BE-QEF in steady state vs. AF allocation (%)]
BE/AF QoS Efficiency Factors (2)

[Plot: (AF-QEF)s − (AF-QEF)ts vs. AF allocation]

This is where HS and Standard TCP differ.
Let’s see some TYPICAL dynamics in time when AF is allocated 900 Mbps and BE 100 Mbps
Standard TCP with 900 Mbps guaranteed

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]
HS-TCP with 900 Mbps guaranteed

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]
Scalable TCP with 900 Mbps guaranteed

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]
Clamping (50 MBytes socket buffer size)

[Plots: cwnd vs. time; AF (green) and BE (red) Rx. throughput vs. time]

Same performance regardless of the TCP stack used, since it is “NOT” TCP.
Results analysis…
• HS-TCP and Scalable TCP experienced (with different magnitudes) “unusual” slow-start periods.
• “Unusual” as there were not any timeouts experienced.
Zoom on the “unusual” slow-start period experienced by all the TCP senders, but with different magnitude: we found silences before…

[Plots: cwnd vs. time, zoomed on the silent periods]
QoS Results Interpretation (1): Background - what are SACKs?
Selective Acknowledgements (SACKs) are control packets…
• sent from the TCP receiver to the sender
• informing the sender which data segments the receiver has
• allowing the sender to infer and re-transmit lost data
• required to prevent TCP going into slow start after a multiple drop - MANDATORY for high ranges of the CWND!
We know the current SACK processing (*) is not yet optimal.
(*) Although the Tom Kelly SACK patch improvement is used.
BUT…
QoS results interpretation (2): Hypothesis
Here we formulate a segment-level perspective hypothesis:
The burst of SACKs coming back might cause the sender host to hang (periods of silence) when AF is allocated a high fraction of the BW, causing the host to reset everything and, as a consequence, making TCP enter slow start after each period of silence.
• Vanilla wouldn’t be affected by this, as its congestion-avoidance CWND gradient doesn’t increase with higher CWND values.
• HS-TCP and Scalable would be affected, as their congestion-avoidance CWND gradients increase with higher CWND values.
• Scalable is the most affected, since it has the steepest congestion-avoidance gradient, and this provokes a longer SACK burst than the one HS-TCP presents.
Let’s try and validate such a segment-level perspective hypothesis through the use of a network solution: WRED.
This is inspired by the fact that WRED is already a solution for TCP stability, but from a flow-level perspective… remember???
ρ (c³τ³)/(4N⁴) · [cτ/(2N) + 1 + 1/(αcτ)] < π(1−β)² / √[4β² + π²(1−β)²] = const.
What is WRED?
WRED (Weighted Random Early Detection) is a router AQM (Active Queue Management) solution where a drop probability, as a function of router queue occupancy, is assigned to each class queue.
[Plot: WRED profiles wred1, wred2, wred3: probability of dropping vs. queue occupancy (%)]
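The profiles in the plot are piecewise-linear curves of drop probability vs. queue occupancy. A generic sketch of such a profile (the textbook RED shape with hypothetical min/max thresholds, not the exact Juniper profiles shown):

```python
def wred_drop_probability(occupancy, min_th, max_th, max_p):
    """Piecewise-linear RED/WRED drop profile.

    occupancy, min_th, max_th: percentages of queue depth.
    Below min_th nothing is dropped; between min_th and max_th the drop
    probability ramps linearly up to max_p; at or above max_th everything
    is dropped. A 'gentle' profile uses a small max_p / shallow ramp.
    """
    if occupancy < min_th:
        return 0.0
    if occupancy >= max_th:
        return 1.0
    return max_p * (occupancy - min_th) / (max_th - min_th)
```

A gentle profile (e.g. `max_p = 0.01`) spreads a few early drops over time instead of bursts of tail drops, which is exactly the smoothing effect the tests below rely on.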
WRED: test rationale
• Dual-approach rationale: the same tool (WRED) but from different perspectives.
• Flow-level perspective: the oscillations are the effect of protocol instability. This being the case, a gentle (small ρ) Weighted Random Early Detection (WRED) drop profile should be able to facilitate the validation of the inequality expression governing the stability condition for the TCP flow [low]. This is what control theory says at flow level.
• Segment-level perspective (our rationale): a smoother distribution of the loss pattern along the AF scheduler buffer would help reduce the LENGTH of the bursts of SACKs coming back, and therefore would help in not stalling the sender. This, if successful, would validate the hypothesis made before!

[low] Linear stability of TCP/RED and a scalable control. Steven H. Low, Fernando Paganini, Jiantao Wang, John C. Doyle. May 6, 2003.
WRED “test-bed”
TRANSATLANTIC
Juniper M10
AF
AF
BE
BE
Standard TCP
HS-TCP
Scalable TCP
Chicago
AF
BE
WRED + Priority and 900 Mbps
100 Mbps
AF: class for TCP applications
1Gbps BE UDP CBR Background traffic
Geneva
1 Gb/s UDP CBR
[Plot: WRED profiles wred1, wred2, wred3: probability of dropping vs. queue occupancy (%)]
Results with WRED…
As expected, only the “gentle” WRED works. Other profiles are found to be too aggressive, inducing too much loss.
[Plots: throughput vs. time; cwnd vs. time]
Results with WRED: summary
gentle WRED

[Plots: cwnd and throughput vs. time]
Conclusions (1/5): Summary/Results
With pretty empty fat-long pipes, TCP as it is now would invariably fail in terms of high-performance throughput.
New TCP stacks are therefore felt to be necessary (Scalable, HS-TCP, …)
………BUT
New TCP stacks can affect the performance of traditional BE traffic: IP-QoS is therefore tried out as a solution for
– legacy (BE) traffic protection
– service predictability
With a 1 Gbps IP-QoS-enabled static fat-long path and the current state of the TCP Linux kernel/CPU speed, Standard TCP performs surprisingly better than HS-TCP and Scalable, with HS-TCP much closer to Standard than to Scalable TCP.
Conclusions (2/5): results-proven theory
SACK-burst processing is not an issue for Vanilla TCP, since its congestion-avoidance slope doesn’t increase with increasing CWND.
SACK-burst processing is an issue for HS-TCP and Scalable TCP, as their congestion-avoidance slopes increase with increasing CWND, with Scalable’s steeper than HS-TCP’s.
For high values of the BW allocated to HS and Scalable TCP, the steeper the CWND congestion-avoidance gradient (Scalable), the higher/longer the SACK rate/burst, making the sender host:
1. stall
2. experience a period of silence every time CWND “hits” the QoS pipe allocated
3. reset all its variables, and thus also the TCP connection, to slow start
Conclusions (3/5): problem generalisation and remedies
For high BW allocations there is an inherently higher risk, for very aggressive transport protocols like Scalable TCP, of stalling the sender, with the effect of drastically reducing stability and thus throughput performance. A proven remedy is switching off code whose bad performance is worse than not having it in place at all, as with Congestion Moderation.
For even higher BW allocations, the network itself needs to smooth out the loss pattern, thereby reducing the SACK rate/burst. The use of a gentle WRED has been proven extremely effective.
Conclusions (4/5)
In a QoS-enabled high bandwidth-delay product path, Standard TCP performs better than the other protocols because it is protected (large buffers on the Juniper)… CWND grows so slowly (and UNDOs + Congestion Moderation also help greatly) that it never hits the CWND limit; therefore real drops are prevented.
HS-TCP under QoS shows more or less the same performance as Standard. BUT in case of random drop (even in a protected environment there is always a BER > 0…), which would be lethal to a high-performance bulk data transfer over Standard TCP, HS-TCP is preferred, as it would react notably better by design.
Conclusions (5/5)
All the possible solutions to the SACK processing issue:
• Software: optimising the TCP code
• Hardware: more CPU speed
• Network: Active Queue Management (WRED)
The network/AQM solution found can relax the requirement of any further optimisation of the SACK code.
However, we believe that the main limitation in using aggressive TCP protocols lies in not having enough CPU speed.
In an environment where the available bandwidth is greater than 1 Gbit/s, the problems are definitely further compounded for Scalable and, above a certain CWND value, also for HS-TCP. This is due to the larger range of CWND values needed to utilise those higher available bandwidths. Bigger CWND values mean longer bursts of dropped packets, which induce longer SACK bursts to be processed at the sender.
OUR “third way” SOLUTION!
Publications
“Using QoS for High Throughput TCP Transport Over Fat Long Pipes”. Andrea Di Donato, Yee-Ting Li, Frank Saka and Peter Clarke. In proceedings of the Second International Workshop on Protocols for Fast Long-Distance Networks, February 16-17, 2004, Argonne National Laboratory, Argonne, Illinois, USA.
“On the joint use of new TCP proposals and IP-QoS on high Bandwidth-RTT product paths”. Andrea Di Donato, Frank Saka, Javier Orellana and Peter Clarke. In proceedings of the Trans-European Research and Education Networking Association (TERENA) networking conference 2004, 7-10 June 2004, Rhodes, Greece.
THANK YOU……….
APPENDIX
CWND and UNDOs definitions
• Congestion moderation:
– triggered by a dubious event, typically the acknowledgement of more than 3 packets. The action induced is that CWND is set to the number of packets thought to be in flight.
• UNDOs:
– ssthresh and cwnd are reduced because a loss is thought to have happened, but the sudden reception of THE ack resets both cwnd and ssthresh to their previous values.
Four things to keep in mind when looking at the AF-QEF results:
1. Regardless of the TCP implementation, a large bandwidth allocation means higher CWND values. This means a higher probability of dropping more packets within the same CWND.
• This, along with the smaller value of the AF bandwidth-delay products as well as the CWND threshold, explains why all the protocols tested perform the same under small BW allocations.
2. For the new TCP stacks tested, the bigger the allocated bandwidth, the higher the loss rate induced by the scheduler during congestion, since the loss recovery (congestion-avoidance gradient) for bigger CWND is more aggressive.
3. Selective Acknowledgements (SACKs) are feedback from the receiver telling the sender which packets the receiver has, so that the sender can deduce which packets to resend.
4. Too many SACKs (as a result of a major multiple drop in a CWND) to be processed at once (a burst of SACKs) is recognised to be an issue for the current kernels and CPUs in use.
• This, along with points 1 and 2, leads us to say that for protocols with a steep congestion-avoidance gradient like Scalable, and for high allocated bandwidth, SACK processing becomes an issue. This explains why Scalable’s performance for higher bandwidth allocations was so poor in our tests.
• Thus, the difference in performance is due to an implementation issue whose magnitude is, though, a function of the dynamics of the TCP protocol in use, and not due to the protocol design per se.
Vanilla TCP (again)
• The standard AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + 1/w (w = congestion window)
– and in response to a congestion event: w’ = 0.5 * w
• The increase and decrease parameters of the AIMD algorithm are fixed at 1 and 0.5.
• With 1500-byte packets and a 100 ms RTT, achieving a steady-state throughput of 10 Gbps would require a packet drop rate of at most one drop every 1 2/3 hours.
• TCP’s congestion control algorithm additive increase multiplicative decrease (AIMD) performs poorly over paths of high bandwidth-Round-Trip-Time product and the aggregate stability is even more precarious if such aggregate consists of only few flows as it is the case for the scientific community where few users require a reliable bulk transcontinental data transfer at Gb/s speed
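The drop-rate figure above can be checked with the usual deterministic AIMD sawtooth model (one drop per sawtooth cycle; the average window is 3/4 of the peak, since the window oscillates between peak/2 and peak). A sketch of the arithmetic:

```python
MSS = 1500 * 8          # segment size in bits
RTT = 0.100             # round-trip time in seconds
TARGET = 10e9           # target steady-state throughput in bits/s

# Average number of segments in flight needed to sustain TARGET.
avg_window = TARGET * RTT / MSS            # ~83,333 segments

# Sawtooth model: window oscillates between peak/2 and peak, avg = 3/4 peak.
peak_window = avg_window * 4 / 3           # ~111,111 segments

# Climbing back from peak/2 to peak at +1 segment per RTT takes peak/2 RTTs.
cycle_rtts = peak_window / 2
cycle_seconds = cycle_rtts * RTT           # ~5,556 s, i.e. roughly 1 2/3 hours

packets_per_cycle = (TARGET / MSS) * cycle_seconds   # ~4.6e9 packets per drop
```

So one congestion event may be tolerated only every ~1.5 hours (a few billion packets), which is the slide’s point about why Standard TCP cannot fill such a pipe in practice.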
HS-TCP
• If the congestion window <= threshold, use the Standard AIMD algorithm; else use the HighSpeed AIMD algorithm.
• The modified HighSpeed AIMD algorithm is as follows:
– on the receipt of an acknowledgement: w’ = w + a(w)/w (higher ‘w’ gives higher a(w))
– and in response to a congestion event: w’ = (1 − b(w)) * w (higher ‘w’ gives lower b(w))
• The increase and decrease parameters vary based on the current value of the congestion window.
• For the same packet size and RTT, a steady state throughput of 10 Gbps can be achieved with a packet drop rate at most once every 12 seconds
Scalable
• The slow-start phase of the original TCP algorithm is unmodified. The congestion-avoidance phase is modified as follows:
– for each acknowledgement received in a round-trip time: cwnd’ = cwnd + 0.01
– and on the first detection of congestion in a given round-trip time: cwnd’ = cwnd − 0.125 * cwnd
• Like HighSpeed TCP this has a threshold window size and the modified algorithm is used only when the size of the congestion window is above the threshold window size. Though values of 0.01 and 0.125 are suggested for the increase and decrease parameters, they (as well as threshold window size) can be configured using the proc file system (by a superuser).
New stacks: User Impact Factors investigation (2)
The same as for the previous plot, but with 10 new-TCP-stack flows: the UIFs are even higher than in the 1-new-TCP-flow scenario, and this corroborates the results even more, as this is a much more realistic scenario.

[Plot: UIF (10 new TCP flows) vs. number of background Vanilla TCP flows, with a UIF = 1 reference line]
IP-QOS (à la DiffServ)
• Packets are DSCP-marked at the sender host.
• Two classes are defined: BE and AF.
• Packets are DSCP-classified at the router input ports.
• Each class is put in a physically different IP-level router output queue, each assigned a minimum guaranteed bandwidth (WRR): low priority for BE, high priority for AF.
Juniper M10 BW scheduler “priority”
• The queue weight ensures the queue is provided a given minimum amount of bandwidth which is proportional to the weight. As long as this minimum has not been served, the queue is said to have a “positive credit”. Once this minimum amount is reached, the queue has a “negative credit”.
• The credits are decreased immediately when a packet is sent. They are increased frequently (by the weight assigned).
• For each packet, the WRR algorithm strictly follows this queue service order:
1. High priority, positive credit queues 2. Low priority, positive credit queues 3. High priority, negative credit queues 4. Low priority, negative credit queues
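The four-step service order above can be sketched as a selection function (the queue data model here is hypothetical, for illustration only; the real scheduler works per packet inside the router):

```python
def next_queue(queues):
    """Pick the next queue to serve following the Juniper M10 WRR order:
    1) high priority, positive credit   2) low priority, positive credit
    3) high priority, negative credit   4) low priority, negative credit.

    Each queue is a dict with 'name', 'priority' ('high'/'low') and
    'credit' (positive while the queue is below its guaranteed minimum).
    """
    service_order = [("high", True), ("low", True), ("high", False), ("low", False)]
    for priority, wants_positive in service_order:
        for q in queues:
            if q["priority"] == priority and (q["credit"] > 0) == wants_positive:
                return q["name"]
    return None  # no queue has packets to send
```

Note the consequence visible in the order: a low-priority queue that has not yet received its guaranteed minimum (positive credit) is served before a high-priority queue that has already exceeded its own (negative credit), which is how the BE reservation is protected from AF.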
Router BW scheduler benchmarking
JUNIPER M10
• Cards used: 1x G/E, 1000BASE-SX REV 01
• OS version used: “JunOS 5.3R2.4”
BE
AF