
Page 1

CompSci 514: Computer Networks

Lecture 14 Datacenter Transport protocols II

Xiaowei Yang

Page 2

Roadmap

• Clos topology

• Datacenter TCP

• Re-architecting datacenter networks and stacks for low latency and high performance
  – Best Paper Award, SIGCOMM '17

Page 3

Motivation for Clos topology

• The Clos topology aims to achieve the performance of a crossbar switch

• When the number of ports n is large, it is hard to build a single n×n crossbar switch

Page 4

Clos topology

• A multi-stage switching network
• A path from any input port to any output port
• Each switch has a small number of ports
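For concreteness, here is a small sketch (an assumption of this note, not from the slides) of the three-stage folded Clos, i.e. the k-ary fat-tree variant commonly used in datacenters, showing how many hosts can be connected using only small k-port switches. The function name and layout are illustrative.

    # Sketch: sizing a k-ary fat-tree, a common folded-Clos datacenter topology
    # built entirely from identical k-port switches.

    def fat_tree_size(k: int) -> tuple[int, int]:
        """Return (hosts, switches) for a 3-stage folded Clos (k-ary fat-tree)."""
        assert k % 2 == 0, "a fat-tree needs an even port count"
        edge = agg = k * (k // 2)        # k pods, each with k/2 edge + k/2 aggregation switches
        core = (k // 2) ** 2             # (k/2)^2 core switches
        hosts = k * (k // 2) * (k // 2)  # k pods x (k/2 edge switches) x (k/2 hosts each)
        return hosts, edge + agg + core

    print(fat_tree_size(48))   # (27648, 2880): ~27k hosts, yet no switch exceeds 48 ports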

Page 5

Roadmap

• Clos topology

• Datacenter TCP

• Re-architecting datacenter networks and stacks for low latency and high performance
  – Best Paper Award, SIGCOMM '17

Page 6

Datacenter Impairments

• Incast

• Queue Buildup

• Buffer Pressure

6

Page 7

Queue Buildup

7

[Figure: Sender 1 and Sender 2 share a switch queue toward the Receiver.]

• Big flows build up queues.
  Ø Increased latency for short flows.

• Measurements in a Bing cluster
  Ø For 90% of packets: RTT < 1 ms
  Ø For 10% of packets: 1 ms < RTT < 15 ms

Page 8

Data Center Transport Requirements

8

1. High Burst Tolerance
   – Incast due to Partition/Aggregate is common.

2. Low Latency
   – Short flows, queries

3. High Throughput
   – Continuous data updates, large file transfers

The challenge is to achieve these three together.

Page 9

Tension Between Requirements

9

High Burst Tolerance, High Throughput, Low Latency: DCTCP targets all three.

• Deep buffers:
  Ø Queuing delays increase latency.
• Shallow buffers:
  Ø Bad for bursts & throughput.
• Reduced RTOmin (SIGCOMM '09):
  Ø Doesn't help latency.
• AQM – RED:
  Ø Average queue not fast enough for incast.

Objective: Low Queue Occupancy & High Throughput

Page 10

The DCTCP Algorithm

10

Page 11

Review: The TCP/ECN Control Loop

11

[Figure: Sender 1 and Sender 2 send through a switch to the Receiver; the switch sets a 1-bit ECN mark that the receiver echoes back in its ACKs.]

ECN = Explicit Congestion Notification
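For contrast with the DCTCP reaction introduced on the next slides, a hypothetical sketch (not from the slides) of how a standard ECN-enabled TCP sender reacts to this single-bit feedback: one halving per window of data, no matter how many packets were actually marked. Class and method names are illustrative.

    # Classic TCP + ECN: the feedback is effectively one bit per window, so the
    # sender halves cwnd whether 1% or 100% of the packets were marked.

    class TcpEcnSender:
        def __init__(self, cwnd_pkts: float):
            self.cwnd = cwnd_pkts
            self.cut_this_window = False

        def on_ack(self, ece: bool) -> None:
            if ece and not self.cut_this_window:
                self.cwnd = max(1.0, self.cwnd / 2)  # one multiplicative halving
                self.cut_this_window = True          # then set CWR and wait

        def on_window_end(self) -> None:
            self.cut_this_window = False             # new window of data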

Page 12

Small Queues & TCP Throughput: The Buffer Sizing Story

17

• Bandwidth-delay product rule of thumb:
  – A single flow needs C × RTT of buffering for 100% throughput.

[Figure: cwnd sawtooth over time above a buffer of size B; throughput stays at 100%.]

Page 13

Small Queues & TCP Throughput: The Buffer Sizing Story

17

• Bandwidth-delay product rule of thumb:
  – A single flow needs C × RTT of buffering for 100% throughput.

• Appenzeller rule of thumb (SIGCOMM '04):
  – Large # of flows: C × RTT / sqrt(N) is enough.

[Figure: cwnd sawtooth over time above a buffer of size B; throughput stays at 100%.]

Page 14

Small Queues & TCP Throughput: The Buffer Sizing Story

17

• Bandwidth-delay product rule of thumb:
  – A single flow needs C × RTT of buffering for 100% throughput.

• Appenzeller rule of thumb (SIGCOMM '04):
  – Large # of flows: C × RTT / sqrt(N) is enough.

• Can't rely on stat-mux benefit in the DC.
  – Measurements show typically 1-2 big flows at each server, at most 4.

Page 15

Small Queues & TCP Throughput: The Buffer Sizing Story

17

• Bandwidth-delay product rule of thumb:
  – A single flow needs C × RTT of buffering for 100% throughput.

• Appenzeller rule of thumb (SIGCOMM '04):
  – Large # of flows: C × RTT / sqrt(N) is enough.

• Can't rely on stat-mux benefit in the DC.
  – Measurements show typically 1-2 big flows at each server, at most 4.

Real Rule of Thumb:
Low Variance in Sending Rate → Small Buffers Suffice
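As a worked example of these rules of thumb (link speed, RTT, and flow counts are assumed values, not from the slides):

    # The two buffer-sizing rules of thumb quoted above, evaluated for an
    # assumed 10 Gbps link with a 100 us datacenter RTT.
    import math

    def bdp_bytes(link_bps: float, rtt_s: float) -> float:
        """Bandwidth-delay product in bytes: the classic single-flow buffer rule."""
        return link_bps * rtt_s / 8

    def appenzeller_bytes(link_bps: float, rtt_s: float, n_flows: int) -> float:
        """Appenzeller et al. (SIGCOMM '04): C*RTT/sqrt(N) suffices for many flows."""
        return bdp_bytes(link_bps, rtt_s) / math.sqrt(n_flows)

    C, RTT = 10e9, 100e-6
    print(bdp_bytes(C, RTT))                 # 125000 bytes (~83 x 1500-byte packets)
    print(appenzeller_bytes(C, RTT, 10000))  # 1250 bytes: big win with many WAN flows
    print(appenzeller_bytes(C, RTT, 2))      # ~88388 bytes: little help with 1-2 DC flows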

Page 16

Two Key Ideas

1. React in proportion to the extent of congestion, not its presence.
   ü Reduces variance in sending rates, lowering queuing requirements.

2. Mark based on instantaneous queue length.
   ü Fast feedback to better deal with bursts.

18

ECN Marks              TCP                  DCTCP
1 0 1 1 1 1 0 1 1 1    Cut window by 50%    Cut window by 40%
0 0 0 0 0 0 0 0 0 1    Cut window by 50%    Cut window by 5%
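A tiny sketch reproducing the table above, under the simplifying assumption that the EWMA gain is ignored (g = 1), so DCTCP's cut is simply half the fraction of marked packets:

    # TCP halves the window on any mark; DCTCP scales the cut by the fraction marked.

    def window_cut(ecn_marks):
        frac = sum(ecn_marks) / len(ecn_marks)
        tcp_cut = 0.5 if any(ecn_marks) else 0.0
        dctcp_cut = frac / 2
        return tcp_cut, dctcp_cut

    print(window_cut([1, 0, 1, 1, 1, 1, 0, 1, 1, 1]))  # (0.5, 0.4): cut 50% vs 40%
    print(window_cut([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]))  # (0.5, 0.05): cut 50% vs 5%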

Page 17

Data Center TCP Algorithm

Switch side:
– Mark packets when Queue Length > K.

19

Sender side:
– Maintain running average of the fraction of packets marked (α).
  In each RTT:  F = (# of marked ACKs) / (# of total ACKs),   α ← (1 − g)α + gF

Ø Adaptive window decrease:  W ← (1 − α/2)W
  – Note: the decrease factor is between 1 and 2.

[Figure: switch queue with marking threshold K; mark arrivals when the queue is above K, don't mark below.]
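A minimal sketch of the switch-side marking rule just described; the function name, the per-port counter, and the value of K are illustrative assumptions.

    # Switch-side AQM: mark the CE codepoint on arrival whenever the
    # instantaneous queue length exceeds the single threshold K
    # (RED re-purposed with low = high = K and no averaging).

    K_PACKETS = 30  # marking threshold (hypothetical value)

    def on_packet_arrival(packet, queue_len_packets: int) -> None:
        """Called for each arriving packet before it is enqueued."""
        if queue_len_packets > K_PACKETS:
            packet.ce = True   # set Congestion Experienced codepoint
        # below the threshold, the packet is forwarded unmarked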

Page 18

Working with delayed ACKs

Figure 9: CDF of RTT to the aggregator. 10% of responses see an unacceptable queuing delay of 1 to 14 ms caused by long flows sharing the queue.

even be many synchronized short flows. Since the latency is caused

by queueing, the only solution is to reduce the size of the queues.

2.3.4 Buffer pressure

Given the mix of long and short flows in our data center, it is very common for short flows on one port to be impacted by activity on any of the many other ports, as depicted in Figure 6(c). Indeed, the loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports. The explanation is that activity on the different ports is coupled by the shared memory pool.

The long, greedy TCP flows build up queues on their interfaces. Since buffer space is a shared resource, the queue buildup reduces the amount of buffer space available to absorb bursts of traffic from Partition/Aggregate traffic. We term this impairment buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.

3. THE DCTCP ALGORITHM

The design of DCTCP is motivated by the performance impairments described in § 2.3. The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput, with commodity shallow-buffered switches. To this end, DCTCP is designed to operate with small queue occupancies, without loss of throughput.

DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion. DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold. The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor.

It is important to note that the key contribution here is not the control law itself. It is the act of deriving multi-bit feedback from the information present in the single-bit sequence of marks. Other control laws that act upon this information can be derived as well. Since DCTCP requires the network to provide only single-bit feedback, we are able to re-use much of the ECN machinery that is already available in modern TCP stacks and switches.

The idea of reacting in proportion to the extent of congestion is also used by delay-based congestion control algorithms [5, 31]. Indeed, one can view path delay information as implicit multi-bit feedback. However, at very high data rates and with low-latency network fabrics, sensing the queue buildup in shallow-buffered switches can be extremely noisy. For example, a 10 packet backlog constitutes 120 µs of queuing delay at 1 Gbps, and only 12 µs at 10 Gbps. The accurate measurement of such small increases in queueing delay is a daunting task for today's servers.

The need for reacting in proportion to the extent of congestion is especially acute in the absence of large-scale statistical multiplexing. Standard TCP cuts its window size by a factor of 2 when it receives ECN notification. In effect, TCP reacts to the presence of congestion, not to its extent². Dropping the window in half causes a large mismatch between the input rate to the link and the available capacity. In the high speed data center environment where only a small number of flows share the buffer (§ 2.2), this leads to buffer underflows and loss of throughput.

Figure 10: Two state ACK generation state machine.

3.1 Algorithm

The DCTCP algorithm has three main components:

(1) Simple Marking at the Switch: DCTCP employs a very simple active queue management scheme. There is only a single parameter, the marking threshold, K. An arriving packet is marked with the CE codepoint if the queue occupancy is greater than K upon its arrival. Otherwise, it is not marked. This scheme ensures that sources are quickly notified of the queue overshoot. The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. We simply need to set both the low and high thresholds to K, and mark based on instantaneous, instead of average, queue length.

(2) ECN-Echo at the Receiver: The only difference between a DCTCP receiver and a TCP receiver is the way information in the CE codepoints is conveyed back to the sender. RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. A DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. The simplest way to do this is to ACK every packet, setting the ECN-Echo flag if and only if the packet has a marked CE codepoint.

However, using Delayed ACKs is important for a variety of reasons, including reducing the load on the data sender. To use delayed ACKs (one cumulative ACK for every m consecutively received packets³), the DCTCP receiver uses the trivial two-state state machine shown in Figure 10 to determine whether to set the ECN-Echo bit. The states correspond to whether the last received packet was marked with the CE codepoint or not. Since the sender knows how many packets each ACK covers, it can exactly reconstruct the runs of marks seen by the receiver.

(3) Controller at the Sender: The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly one RTT) as follows:

    α ← (1 − g) × α + g × F,    (1)

where F is the fraction of packets that were marked in the last window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K and does not receive any marks when the queue length is below K, Equation (1) implies that α estimates the probability that the queue size is greater than K. Essentially, α close to 0 indicates low, and α close to 1 indicates high levels of congestion.

² Other variants which use a variety of fixed factors and/or other fixed reactions have the same issue.
³ Typically, one ACK every 2 packets.
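A hedged sketch of the receiver logic in component (2) above; the class name and the send_ack callback are illustrative, and m = 2 follows footnote 3. On a change of the CE bit, the pending run is ACKed immediately with the old ECN-Echo value, so the sender can reconstruct the exact run lengths of marks despite delayed ACKs.

    M = 2  # packets covered by one delayed ACK (typically 2)

    class DctcpReceiver:
        def __init__(self):
            self.last_ce = 0    # current state: CE bit of the last received packet
            self.unacked = 0    # packets received since the last ACK was sent

        def on_packet(self, ce: int, send_ack) -> None:
            if ce != self.last_ce and self.unacked > 0:
                send_ack(ece=self.last_ce)      # flush the old run immediately
                self.unacked = 0
            self.last_ce = ce
            self.unacked += 1
            if self.unacked >= M:
                send_ack(ece=self.last_ce)      # normal delayed ACK
                self.unacked = 0

    # Usage sketch: DctcpReceiver().on_packet(1, lambda ece: print("ACK, ECE =", ece))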

Page 19

DCTCP in Action

20

Setup: Win 7, Broadcom 1 Gbps switch
Scenario: 2 long-lived flows, K = 30 KB

[Figure: switch queue length (KBytes) over time.]

Page 20

Why it Works

1. High Burst Tolerance
   ü Large buffer headroom → bursts fit.
   ü Aggressive marking → sources react before packets are dropped.

2. Low Latency
   ü Small buffer occupancies → low queuing delay.

3. High Throughput
   ü ECN averaging → smooth rate adjustments, low variance.

21

Page 21

Analysis

• How low can DCTCP maintain queues without loss of throughput?
• How do we set the DCTCP parameters?

22

Ø Need to quantify queue size oscillations (Stability).

[Figure: window size over time, oscillating between (W*+1)(1−α/2) and W*+1.]

Page 22

Packets sent in this RTT are marked.

Analysis

• How low can DCTCP maintain queues without loss of throughput?
• How do we set the DCTCP parameters?

22

Ø Need to quantify queue size oscillations (Stability).

[Figure: window size sawtooth between (W*+1)(1−α/2) and W*+1; packets sent in the final RTT of each period are marked.]

Page 23

Analysis

• Q(t) = NW(t) − C × RTT

• The key observation is that with synchronized senders, the queue size exceeds the marking threshold K for exactly one RTT in each period of the saw-tooth, before the sources receive ECN marks and reduce their window sizes accordingly.

• S(W1, W2) = (W2² − W1²)/2

• Critical window size when ECN marking occurs: W∗ = (C × RTT + K)/N

Page 24

• α = S(W∗, W∗ + 1) / S((W∗ + 1)(1 − α/2), W∗ + 1)

• α²(1 − α/4) = (2W∗ + 1)/(W∗ + 1)² ≈ 2/W∗

• α ≈ sqrt(2/W∗)

• Single-flow oscillation amplitude
  – D = (W∗ + 1) − (W∗ + 1)(1 − α/2)

Prior work [26, 20] on congestion control in the small buffer regime has observed that at high line rates, queue size fluctuations become so fast that you cannot control the queue size, only its distribution. The physical significance of α is aligned with this observation: it represents a single point of the queue size distribution at the bottleneck link.

The only difference between a DCTCP sender and a TCP sender is in how each reacts to receiving an ACK with the ECN-Echo flag set. Other features of TCP such as slow start, additive increase in congestion avoidance, or recovery from packet loss are left unchanged. While TCP always cuts its window size by a factor of 2 in response⁴ to a marked ACK, DCTCP uses α:

    cwnd ← cwnd × (1 − α/2).    (2)

Thus, when α is near 0 (low congestion), the window is only slightly reduced. In other words, DCTCP senders start gently reducing their window as soon as the queue exceeds K. This is how DCTCP maintains low queue length, while still ensuring high throughput. When congestion is high (α = 1), DCTCP cuts the window in half, just like TCP.
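A minimal sketch of the sender reaction in equations (1) and (2); the class name, the per-window bookkeeping, and the choice g = 1/16 (the value used later in Figure 12) are assumptions of the sketch, not a prescribed implementation.

    G = 1.0 / 16   # estimation gain (value used in the paper's Figure 12)

    class DctcpSender:
        def __init__(self, cwnd_pkts: float):
            self.cwnd = cwnd_pkts
            self.alpha = 0.0

        def on_window_end(self, marked_pkts: int, total_pkts: int) -> None:
            """Called once per window of data (roughly one RTT)."""
            F = marked_pkts / total_pkts                 # fraction marked this window
            self.alpha = (1 - G) * self.alpha + G * F    # equation (1)
            if marked_pkts > 0:                          # cut at most once per window
                self.cwnd *= (1 - self.alpha / 2)        # equation (2)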

3.2 Benefits

DCTCP alleviates the three impairments discussed in § 2.3 (shown in Figure 6) as follows.

Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queueing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. Also, more buffer space is available as headroom to absorb transient micro-bursts, greatly mitigating costly packet losses that can lead to timeouts.

Buffer pressure: DCTCP also solves the buffer pressure problem because a congested port's queue length does not grow exceedingly large. Therefore, in shared memory switches, a few congested ports will not exhaust the buffer resources, harming flows passing through other ports.

Incast: The incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP—or any congestion control scheme that does not attempt to schedule traffic—can do to avoid packet drops.

However, in practice, each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and resulting timeouts.

3.3 Analysis

We now analyze the steady state behavior of the DCTCP control loop in a simplified setting. We consider N infinitely long-lived flows with identical round-trip times RTT, sharing a single bottleneck link of capacity C. We further assume that the N flows are synchronized; i.e., their "sawtooth" window dynamics are in-phase. Of course, this assumption is only realistic when N is small. However, this is the case we care about most in data centers (§ 2.2).

Because the N window sizes are synchronized, they follow identical sawtooths, and the queue size at time t is given by

    Q(t) = N × W(t) − C × RTT,    (3)

⁴ Both TCP and DCTCP cut their window size at most once per window of data [27].

[Figure 11 sketch: the window size of a single flow follows a sawtooth between (W*+1)(1−α/2) and W*+1 with amplitude D and period TC; the queue size follows a sawtooth above threshold K with maximum Qmax, amplitude A, and period TC. Packets sent in the final RTT of each period are marked.]

Figure 11: Window size of a single DCTCP sender, and the queue size process.

where W(t) is the window size of a single source [4]. Therefore the queue size process is also a sawtooth. We are interested in computing the following quantities which completely specify the sawtooth (see Figure 11): the maximum queue size (Qmax), the amplitude of queue oscillations (A), and the period of oscillations (TC). The most important of these is the amplitude of oscillations, which quantifies how well DCTCP is able to maintain steady queues, due to its gentle proportionate reaction to congestion indications.

We proceed to computing these quantities. The key observation is that with synchronized senders, the queue size exceeds the marking threshold K for exactly one RTT in each period of the sawtooth, before the sources receive ECN marks and reduce their window sizes accordingly. Therefore, we can compute the fraction of marked packets, α, by simply dividing the number of packets sent during the last RTT of the period by the total number of packets sent during a full period of the sawtooth, TC.

Let's consider one of the senders. Let S(W1, W2) denote the number of packets sent by the sender while its window size increases from W1 to W2 > W1. Since this takes W2 − W1 round-trip times, during which the average window size is (W1 + W2)/2,

    S(W1, W2) = (W2² − W1²)/2.    (4)

Let W* = (C × RTT + K)/N. This is the critical window size at which the queue size reaches K, and the switch starts marking packets with the CE codepoint. During the RTT it takes for the sender to react to these marks, its window size increases by one more packet, reaching W* + 1. Hence,

    α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1).    (5)

Plugging (4) into (5) and rearranging, we get:

    α²(1 − α/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*,    (6)

where the approximation in (6) is valid when W* >> 1. Equation (6) can be used to compute α as a function of the network parameters C, RTT, N and K. Assuming α is small, this can be simplified as α ≈ sqrt(2/W*). We can now compute A and TC in Figure 11 as follows. Note that the amplitude of oscillation in window size of a single flow, D (see Figure 11), is given by:

    D = (W* + 1) − (W* + 1)(1 − α/2).    (7)

Since there are N flows in total,

    A = N × D = N(W* + 1)α/2 ≈ (N/2) × sqrt(2W*) = (1/2) × sqrt(2N(C × RTT + K)),    (8)

    TC = D = (1/2) × sqrt(2(C × RTT + K)/N)  (in RTTs).    (9)

Finally, using (3), we have:

    Qmax = N(W* + 1) − C × RTT = K + N.    (10)
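The closed-form quantities above can be evaluated numerically. The following sketch uses assumed parameters chosen to roughly match the N = 2 panel of Figure 12 (10 Gbps of 1500-byte packets, 100 µs RTT, K = 40); the function name and packet size are illustrative.

    # Steady-state sawtooth quantities from equations (4)-(10):
    # C_pps in packets/s, rtt_s in seconds, K in packets.
    import math

    def dctcp_sawtooth(N: int, C_pps: float, rtt_s: float, K: float) -> dict:
        bdp = C_pps * rtt_s                      # C x RTT in packets
        W_star = (bdp + K) / N                   # critical window size
        alpha = math.sqrt(2 / W_star)            # approximation from eq. (6)
        D = (W_star + 1) * alpha / 2             # single-flow amplitude, eq. (7)
        A = N * D                                # queue amplitude, eq. (8)
        T_C = D                                  # period in RTTs, eq. (9)
        Q_max = K + N                            # eq. (10)
        return dict(W_star=W_star, alpha=alpha, A=A, T_C=T_C,
                    Q_max=Q_max, Q_min=Q_max - A)

    # 10 Gbps, 1500-byte packets, 100 us RTT, K = 40 packets, N = 2:
    # amplitude A is roughly 11 packets, so the queue stays in the low tens of packets.
    print(dctcp_sawtooth(N=2, C_pps=10e9 / (1500 * 8), rtt_s=100e-6, K=40))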

Page 25

Analysis

• How low can DCTCP maintain queues without loss of throughput?
• How do we set the DCTCP parameters?

22

Ø Need to quantify queue size oscillations (Stability).

85% Less Buffer than TCP

[Figure 12 panels for N = 2, N = 10, and N = 40: queue length (packets) vs. time (s), comparing the NS-2 simulation with the theoretical analysis.]

Figure 12: Comparison between the queue size process predicted by the analysis with NS-2 simulations. The DCTCP parameters are set to K = 40 packets, and g = 1/16.

We have evaluated the accuracy of the above results using NS-2 simulations in a variety of scenarios. Figure 12 shows the results for N = 2, 10, and 40 long-lived DCTCP flows sharing a 10 Gbps bottleneck, with a 100 µs round-trip time. As seen, the analysis is indeed a fairly accurate prediction of the actual dynamics, especially when N is small (less than 10). For large N, as in the N = 40 case, de-synchronization of the flows leads to smaller queue variations than predicted by the analysis.

Equation (8) reveals an important property of DCTCP: when N is small, the amplitude of queue size oscillations with DCTCP is O(sqrt(C × RTT)), and is therefore much smaller than the O(C × RTT) oscillations of TCP. This allows for a very small marking threshold K, without loss of throughput in the low statistical multiplexing regime seen in data centers. In fact, as we verify in the next section, even with the worst case assumption of synchronized flows used in this analysis, DCTCP can begin marking packets at (1/7)th of the bandwidth-delay product without losing throughput.

3.4 Guidelines for choosing parameters

In this section, C is in packets/second, RTT is in seconds, and K is in packets.

Marking Threshold. The minimum value of the queue occupancy sawtooth is given by:

    Qmin = Qmax − A    (11)
         = K + N − (1/2) × sqrt(2N(C × RTT + K)).    (12)

To find a lower bound on K, we minimize (12) over N, and choose K so that this minimum is larger than zero, i.e. the queue does not underflow. This results in:

    K > (C × RTT)/7.    (13)

Estimation Gain. The estimation gain g must be chosen small enough to ensure the exponential moving average (1) "spans" at least one congestion event. Since a congestion event occurs every TC round-trip times, we choose g such that:

    (1 − g)^TC > 1/2.    (14)

Plugging in (9) with the worst case value N = 1 results in the following criterion:

    g < 1.386 / sqrt(2(C × RTT + K)).    (15)
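A small sketch evaluating guidelines (13) and (15) for an assumed 10 Gbps link with a 100 µs RTT and 1500-byte packets; the chosen K = 65 packets is only an illustrative conservative value in the spirit of the burstiness discussion in § 3.5.

    import math

    C_PPS = 10e9 / (1500 * 8)   # link capacity in packets/s (~0.83 Mpps)
    RTT = 100e-6                # seconds
    BDP = C_PPS * RTT           # ~83 packets

    K_min = BDP / 7                              # eq. (13): lower bound on K
    K = 65                                       # illustrative conservative choice
    g_max = 1.386 / math.sqrt(2 * (BDP + K))     # eq. (15): upper bound on g

    print(K_min, g_max)   # ~11.9 packets, ~0.08; g = 1/16 = 0.0625 satisfies the bound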

3.5 Discussion

AQM is not enough: Before designing DCTCP, we evaluated Active Queue Management (AQM) schemes like RED⁵ and PI [17] that do not modify TCP's congestion control mechanism. We found they do not work well when there is low statistical multiplexing and traffic is bursty—both of which are true in the data center. Essentially, because of TCP's conservative reaction to congestion indications, any AQM scheme operating with TCP in a data-center-like environment requires making a tradeoff between throughput and delay [9]: either accept large queue occupancies (and hence delay), or accept loss of throughput.

We will examine performance of RED (with ECN) in some detail in § 4, since our testbed switches are RED/ECN-capable. We have evaluated other AQM schemes such as PI extensively using NS-2. See [3] for detailed results. Our simulation results show that with few flows (< 5), PI suffers from queue underflows and a loss of utilization, while with many flows (20), queue oscillations get worse, which can hurt the latency of time-critical flows.

Convergence and Synchronization: In both analysis and experimentation, we have found that DCTCP achieves both high throughput and low delay, all in an environment with low statistical multiplexing. In achieving this, DCTCP trades off convergence time: the time required for a new flow to grab its share of the bandwidth from an existing flow with a large window size. This is expected since a DCTCP source must make incremental adjustments to its window size based on the accumulated multi-bit feedback in α. The same tradeoff is also made by a number of TCP variants [22, 23].

We posit that this is not a major concern in data centers. First, data center round-trip times are only a few hundred microseconds, two orders of magnitude less than RTTs in the Internet. Since convergence time for a window-based protocol like DCTCP is proportional to the RTT, the actual differences in time caused by DCTCP's slower convergence compared to TCP are not substantial. Simulations show that the convergence times for DCTCP are on the order of 20-30 ms at 1 Gbps, and 80-150 ms at 10 Gbps, a factor of 2-3 more than TCP⁶. Second, in a data center dominated by microbursts, which by definition are too small to converge, and big flows, which can tolerate a small convergence delay over their long lifetimes, convergence time is the right metric to yield.

Another concern with DCTCP is that the "on-off" style marking can cause synchronization between flows. However, DCTCP's reaction to congestion is not severe, so it is less critical to avoid synchronization [10].

Practical considerations: While the recommendations of § 3.4 work well in simulations, some care is required before applying these recommendations in real networks. The analysis of the previous section is for idealized DCTCP sources, and does not capture any of the burstiness inherent to actual implementations of window-based congestion control protocols in the network stack. For example, we found that at 10G line rates, hosts tend to send bursts of as many as 30-40 packets, whenever the window permitted them to do so. While a myriad of system details (quirks in TCP stack implementations, MTU settings, and network adapter configurations) can cause burstiness, optimizations such as Large Send Offload (LSO) and interrupt moderation increase burstiness noticeably⁷. One must make allowances for such bursts when selecting the value of K. For example, while based on (13) a marking threshold as low as 20 packets can be used for 10 Gbps, we found that a more conservative marking threshold larger than 60 packets is required to avoid loss of throughput. This excess is in line with the burst sizes of 30-40 packets observed at 10 Gbps.

Based on our experience with the intrinsic burstiness seen at 1 Gbps and 10 Gbps, and the total amount of available buffering in our switches, we use the marking thresholds of K = 20 packets for

⁵ We always use RED with ECN: i.e. random early marking, not random early drop. We call it RED simply to follow convention.
⁶ For RTTs ranging from 100 µs to 300 µs.
⁷ Of course, such implementation issues are not specific to DCTCP and affect any protocol implemented in the stack.

Minimizing Qmin

Page 26

Evaluation

• Implemented in Windows stack.
• Real hardware, 1 Gbps and 10 Gbps experiments
  – 90-server testbed
  – Broadcom Triumph: 48 1G ports, 4 MB shared memory
  – Cisco Cat4948: 48 1G ports, 16 MB shared memory
  – Broadcom Scorpion: 24 10G ports, 4 MB shared memory

• Numerous micro-benchmarks
  – Throughput and Queue Length
  – Multi-hop
  – Queue Buildup
  – Buffer Pressure
  – Fairness and Convergence
  – Incast
  – Static vs Dynamic Buffer Mgmt

• Cluster traffic benchmark

23

Page 27

Cluster Traffic Benchmark

• Emulate traffic within 1 rack of a Bing cluster
  – 45 1G servers, 10G server for external traffic

• Generate query and background traffic
  – Flow sizes and arrival times follow distributions seen in Bing

• Metric:
  – Flow completion time for queries and background flows.

24

We use RTOmin = 10ms for both TCP & DCTCP.

Page 28

Baseline

25

Background Flows Query Flows

Page 29

Baseline

25

Background Flows Query Flows

ü Low latency for short flows.

Page 30

Baseline

25

Background Flows Query Flows

ü Low latency for short flows.ü High throughput for long flows.

Page 31

Baseline

25

Background Flows Query Flows

ü Low latency for short flows.ü High throughput for long flows.ü High burst tolerance for query flows.

Page 32

Scaled Background & Query
10x Background, 10x Query

26

Query    Short messages

Page 33

Conclusions

• DCTCP satisfies all our requirements for Data Center packet transport.
  ü Handles bursts well
  ü Keeps queuing delays low
  ü Achieves high throughput

• Features:
  ü Very simple change to TCP and a single switch parameter.
  ü Based on mechanisms already available in silicon.

27

Page 34

Comments

• Real-world data
• A novel idea
• Comprehensive evaluation

• Didn't compare with the alternative scheme of eliminating RTOmin via microsecond-granularity RTT measurements and retransmission timers

• Deadline-based scheduling research

Page 35

Discussion

• How does DCTCP differ from TCP?

• Will DCTCP work well on the Internet? Why?

• Is there a tradeoff between generality and performance?

Page 36

Re-architecting datacenter networks and stacks for low latency and high performance

Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik

Page 37

Motivation

• Low latency
• High throughput

Page 38

Design assumptions

• Clos Topology

• Designer can change end system protocol stacks as well as switches

Page 39

• https://www.youtube.com/watch?v=OI3mh1Vx8xI

Page 40

Discussion

• Will NDP work well on the Internet? Why?

• Is there a tradeoff between generality and performance?

• Will it work well on non-Clos topologies?

Page 41

Summary

• How to overcome the transport challenges in DC networks

• DCTCP
  – Uses the fraction of CE-marked packets to estimate congestion
  – Smooths sending rates

• NDP
  – Start, spray, trim (start sending a full window in the first RTT, spray packets across paths, trim payloads to headers when queues overflow)