IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 3, NO. 4, AUGUST 1995

How a Large ATM MTU Causes Deadlocks in TCP Data Transfers

Kjersti Moldeklev and Per Gunningberg, Member, IEEE

Abstract— The implementation of protocols, such as TCP/IP, and their integration into the operating system environment is crucial for protocol performance. Putting TCP on high-speed networks, e.g., ATM, with large maximum transmission units causes the TCP maximum segment size to be relatively large. What Nagle's algorithm considers a "small" segment is no longer small, which affects the TCP performance. We report on TCP/IP throughput and RPC response time performance measurements for various sizes of send and receive socket buffers, using the Sparc10 architecture machines Axil 311/5.1 running SunOS 4.1.3 connected to a FORE Systems ATM network. For some common combinations of socket buffer sizes we observe a dramatic performance degradation to less than 1% of expected throughput and to one order-of-magnitude longer response time than expected. The performance degradation is caused by a deadlock situation in the TCP connection which is resolved by the 200 ms spaced timer generated TCP delayed acknowledgment. We explain the causes of the deadlock situations, and discuss means to avoid or prevent them.

I. INTRODUCTION

THE TCP/IP protocols are often the first protocol suite to be put on high-speed networks such as ATM (asynchronous transfer mode). This is in spite of the fact that TCP/IP originally was not designed to match the characteristics of high-performance networks. Several extensions of the TCP protocol [1], [2], [3] have been suggested to make it perform better over these networks and for connections with a high bandwidth-delay product [4]. New applications demanding high bit-rate, such as multimedia conference systems, and the characteristics of high-speed networks, have triggered much research on new transport protocols [5].

It is well known that the implementation of protocols, and their integration into the operating system environment, may dominate all improvements of protocol mechanisms. Implementation optimization for TCP/IP on low-speed networks may be totally wrong for high-performance networks. In this paper we show that some implementation optimizations for the "ethernet era" in fact degrade TCP/IP performance on high-speed networks with large data units.

Manuscript received October 7, 1994; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor J. Smith.

K. Moldeklev is with Norwegian Telecom Research, Telenor AS, Norway (e-mail: kjersti.moldeklev@tf.telenor.no).

P. Gunningberg is with the Department of Computer Science, Uppsala University and the Swedish Institute of Computer Science, Sweden (e-mail: per.gunningberg@docs.uu.se).

IEEE Log Number 9412641.

The ATM network transfers small data units, called cells, which are 53 bytes long with a 48 byte payload. For computer communication the recommended end-to-end ATM service is through an ATM adaptation layer (AAL), either AAL3/4 or AAL5. The adaptation layer aggregates cells into much larger data units which are more efficiently handled by higher layers and better match the application data units. The TCP/IP protocols use the size of these AAL data units—the ATM network MTU (maximum transmission unit)—to compute the maximum segment size (MSS) [6], [7]. In our measurements the AAL5 payload is 9188 bytes. The normal TCP/IP header is 40 bytes, which means that the MSS used by TCP is 9148 bytes.

We have measured the throughput performance of TCP over AAL5 ATM between two Axil 311/5.1 Sparc10 architecture machines running SunOS 4.1.3, for various sizes of send and receive socket buffers. Dependent on TCP window size, we normally measured sustained throughput in the order of tens of Mb/s, but for some combinations of socket buffer sizes we observed a dramatic drop to as low as 0.16 Mb/s. This is less than 1% of the normal throughput. It occurred for common combinations of socket buffer sizes, such as a send socket buffer of 16 kbytes and a receive socket buffer of 32 kbytes. The dramatic drop in performance is caused by a deadlock situation in the TCP connection which is broken by the 200 ms timer generated TCP delayed acknowledgment. It causes TCP to behave as a stop-and-wait protocol with one or two data segments sent every 200 ms. The deadlock occurs when the amount of data sent is not enough to trigger a TCP window update packet at the receiver, and at the same time there is not enough space in the send buffer to create a segment of size MSS bytes. Nagle's algorithm prohibits the sending of non-MSS segments if there are unacknowledged bytes. Since TCP piggybacks acknowledgments onto window updates, the connection is deadlocked until the receiver sends a timer generated acknowledgment.

The factors which in combination force the TCP connection into "throughput deadlocks" are: 1) a large maximum segment size, 2) asymmetry of the socket buffer sizes, 3) use of Nagle's algorithm, 4) the delayed acknowledgment strategy, 5) the sequence of actions on acknowledgment reception, and finally 6) the socket layer optimization for efficient memory management. The "RPC deadlocks" are caused by a combination of RPC message size, socket layer memory management, TCP window size in relation to RPC message size, the TCP delayed acknowledgment strategy, Nagle's algorithm, as well as the relative processing capacity of the RPC client and server.


The deadlock problem also exists for small MSS's and low-speed connections, but it is not as likely. Actually, it will not happen for socket send buffers which are larger than three MSS segments. Furthermore, for small MSS's the discrepancy between the normal and degraded throughput is not that big, which makes it difficult to detect or uninteresting to investigate. We know of one paper reporting on performance degradation due to the interaction between the delayed acknowledgment strategy and Nagle's algorithm, namely Crowcroft et al. [8]. They report on a boundary effect of remote procedure call (RPC) response time over TCP and ethernet. Our RPC response time measurements relate to this paper, which we feel fails to describe the correct reason for the observed glitches in the RPC performance. A deadlock situation within the transmission of the RPC request and/or response message can cause an increase in response time of up to 400 ms.

The most straightforward way to prevent many of the deadlock situations is to switch off Nagle's algorithm or to generate an acknowledgment for each segment. As will be shown, for measurements with no deadlock situations there is hardly any performance penalty in having it switched off. For the throughput deadlocks, a straightforward avoidance solution is to ensure that the send socket buffer is equal to or greater than the receive socket buffer, or greater than three MSS's. These and other alternatives which require small changes in the TCP implementation will be discussed.

This paper explains the causes of the deadlocks and discusses some means for solving the underlying problems. We distinguish between causes that are due to TCP and its implementation requirement specifications, to BSD Unix, and to specific SunOS optimizations. We expect some of the deadlock situations to appear on other platforms as well. Indeed, many other researchers have reported similar problems to us on SPARC/Solaris 2.3, SGI/IRIX and IBM RS6000/AIX. We have not been able to verify and analyze the causes of their problems due to the unavailability of platforms and source code. Thus we believe that this should be of interest to a broader audience than just SunOS users.

The rest of this paper is outlined as follows. Section II summarizes the BSD Unix socket layer, the TCP protocol, and the operating system and implementation issues needed to understand the protocol behavior. Section III gives a detailed description of the causes of the throughput degradation and discusses possible solutions to the throughput deadlocks. Section IV presents the RPC deadlock problem and Section V contains the conclusions. Readers with experience of TCP implementations in BSD Unix environments may want to skip the next section.

II. TCP AND BSD-BASED ENVIRONMENTS

In most Unix systems, and many other systems, the transport and lower layer protocols are implemented as part of the kernel. There are several reasons for doing this, see for example [9], [10]. The user data to be transmitted is located in user space. On a write system call, user data in the write call is copied from application address space to kernel address space so that TCP and other protocols can do the further processing of the data. Similarly, on reception the requested amount of user data is copied from kernel address space to application space.

A. Network Memory Management in the BSD Unix Socket Layer

In BSD Unix based systems, as for instance SunOS 4.1.3, the socket layer acts as the interface between the application in user space and the protocols within the kernel. Other systems have similar layers and buffer management systems. The socket layer offers an application programming interface, i.e., system calls such as write and read, and kernel buffers for application data. The application process is put to "sleep" if the socket layer is unable to copy all the application data in the write system call into the buffer. For further progress it has to wait until buffer space is released.

A system provided identifier is used in the system callsin order to identify different connections with applicationprocesses. Associated with this identifier are two socket databuffers, one for data to the kernel (write) and one for data to theapplication (read). Each socket buffer consists of an orderedchain of mbufs [12]. An mbuf is a data structure used by allkernel protocols in SunOS 4.x and by many other BSD Unixsystems as well. Data to and from the application is copiedto and from these mbuf chains. Associated with the socketidentifier is a variable which holds the number of used mbufsin each direction and a variable for the current number of bytesin these chains of mbufs. The user can set a maximum allowednumber of bytes in these chains by using the SO_RCVBUFand SO_SNDBUF socket options.
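For reference, the following minimal sketch shows how an application would set these per-direction limits through the socket API. The 16 kbyte and 32 kbyte values are arbitrary example sizes, not recommendations, and error handling is reduced to a perror() message.

    #include <stdio.h>
    #include <sys/socket.h>

    /* Set the per-direction socket buffer limits discussed above. */
    static int set_socket_buffers(int fd, int sndbuf, int rcvbuf)
    {
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
            return -1;
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
            return -1;
        return 0;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        /* Example only: 16 kbyte send buffer, 32 kbyte receive buffer. */
        if (fd < 0 || set_socket_buffers(fd, 16 * 1024, 32 * 1024) < 0)
            perror("socket buffer setup");
        return 0;
    }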

There are two types of mbufs, the "small" mbufs which hold 112 bytes of data and "cluster" mbufs which can take 1 kbyte of data [12]. Whenever possible, the system tries to use cluster mbufs when copying data from user address space to the socket mbuf chain. Use of cluster mbufs means that the system can avoid copy operations within the kernel by using a pointer and a reference count instead.

For SunOS 4.1.x this is roughly as illustrated in Fig. 1. Note that the TCP protocol output routine is called either after all data in the write system call has been copied or after 4096 bytes of data have been copied, whichever occurs first. The copy of 4096 bytes is SunOS specific and it may differ in other BSD Unix systems. The reason for copying only 4096 bytes into the socket send buffer before calling TCP is to exploit parallelism between the kernel protocol execution and the network interface packet transmission. However, this is only true for networks with a maximum transmission unit (MTU) smaller than 4096 bytes.

[Fig. 1 is a flow chart of this socket layer copy procedure for a write system call; its detailed content is not recoverable from the scan.]
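The copy-then-call behavior can be summarized by the following sketch. It is our illustration of the description above, not the SunOS socket layer code; sosend_sketch() and tcp_output_stub() are illustrative names, and the real code would sleep when the send buffer is full rather than truncate the chunk.

    #include <stdio.h>

    #define SOSEND_CHUNK 4096   /* SunOS-specific: call TCP after at most 4096 copied bytes */

    /* Stub standing in for the real TCP output routine. */
    static void tcp_output_stub(int bytes_in_sndbuf)
    {
        printf("tcp_output called with %d bytes in the send buffer\n", bytes_in_sndbuf);
    }

    /* Rough model of the copy loop: copy user data into the send buffer in
     * chunks of at most 4096 bytes and call the protocol output routine
     * after each chunk, or when the user data is exhausted. */
    static void sosend_sketch(int usr_data, int sndbuf_space)
    {
        int buffered = 0;
        while (usr_data > 0 && sndbuf_space > 0) {
            int chunk = usr_data < SOSEND_CHUNK ? usr_data : SOSEND_CHUNK;
            if (chunk > sndbuf_space)
                chunk = sndbuf_space;       /* the real code would sleep here */
            buffered += chunk;
            sndbuf_space -= chunk;
            usr_data -= chunk;
            tcp_output_stub(buffered);      /* per 4096 bytes or at end of data */
        }
    }

    int main(void)
    {
        /* An 8192 byte write into an empty 16 kbyte send buffer results in
         * two calls to TCP output, each after 4096 bytes have been copied. */
        sosend_sketch(8192, 16 * 1024);
        return 0;
    }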

B. TCP Acknowledgment Strategy and Flow Control

TCP is a bidirectional protocol. It establishes a connection and each peer informs the other about its current window size. The window size refers to the number of bytes rather than the number of packets. The sender sends segments which could be as small as a single byte and as large as the maximum IP datagram size, 65535 bytes, less the TCP/IP header [11], [12]. TCP sets an MSS per connection. It is normally set to


the network MTU minus the size of the TCP/IP header [6], [7]. The MTU is 1500 bytes for ethernet and 9188 bytes for our FORE ATM SBA-200 network adapter with driver version 2.2.6. The normal TCP/IP header is 40 bytes, which means that the MSS for the measurements in this paper is 9148 bytes.

TCP's end-to-end flow control is through a sliding window mechanism where the receiver announces its free buffer space to the transmitter. Therefore, when data is copied to the application the receiver checks if a window update and acknowledgment packet should be returned. According to [7] TCP should implement delayed acknowledgments, and the delay must be less than 0.5 s. Furthermore, in a stream of MSS's there should be an acknowledgment for at least every other segment. According to [13] acknowledgments are suggested to be delayed if the TCP PUSH bit is not set. The acknowledgments are delayed until they can be piggybacked onto either a data segment or a window update packet [13]. To limit the interval of delay, an explicit acknowledgment is suggested to be generated every 200 ms to 300 ms. The PUSH bit is set if the data in a segment empties the send buffer [2], [7]. On the receiving side, however, BSD Unix derived implementations ignore the PUSH bit because they never delay the delivery of received data to the application [2], and no acknowledgment is returned when a received segment has the PUSH bit set.

The algorithm for sending a window update and acknowledgment as implemented in SunOS is roughly depicted in Fig. 2. A separate window update with a piggybacked acknowledgment is sent if the window can slide more than either a) 35% of the receive buffer size or b) two MSS's. In addition, 200 ms spaced timer generated acknowledgments are transmitted. The time for transmitting such an acknowledgment is independent of the time of connection set-up or the last time of (not necessarily final) segment reception over this connection.

Fig. 2. TCP window update algorithm.
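The rule in Fig. 2 can be restated compactly as a predicate. The sketch below is our illustration of the algorithm as described, not the SunOS code; the example numbers correspond to the S = 8 k, R = 16 k case analyzed in Section III.

    #include <stdio.h>

    /* Window update rule of Fig. 2: a separate window update (with a
     * piggybacked acknowledgment) is sent only if the window can slide by
     * at least 35% of the receive buffer size or by at least two MSS. */
    static int should_send_window_update(long window_slide, long rcv_buf, long mss)
    {
        return window_slide >= 0.35 * rcv_buf || window_slide >= 2 * mss;
    }

    int main(void)
    {
        /* A 4096 byte segment slides a 16 kbyte window only 25%, far below
         * 2 * 9148 bytes, so no window update (and no ACK) is returned. */
        printf("update sent: %d\n",
               should_send_window_update(4096, 16 * 1024, 9148));
        return 0;
    }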

Fig. 3. TCP segmentation and Nagle's algorithm.

Both the allocated send and receive buffers limit the amount of data that can be outstanding between two communicating TCP peers. The available space (maximum minus current amount of data bytes) of the receive buffer is used to set the announced TCP window size to ensure that the sender will not send more data than can be received. In BSD Unix systems the send socket buffer is used as the repository for TCP segments in case of retransmission. Since data bytes remain in the send socket buffer until they are acknowledged, the available space for copying new data into the socket send buffer is further limited. This repository approach may be different in other systems.

C. Nagle's Algorithm

Nagle's algorithm [14] was introduced as a solution to the "small-packet problem." It was observed that TCP sent many small segments which resulted in unnecessary header and processing overhead. Nagle's algorithm inhibits sending TCP segments which are smaller than the assigned MSS if any previously transmitted data on the connection remains unacknowledged. Fig. 3 illustrates the TCP segmentation of the application byte stream [1], [7], [12]. Nagle's algorithm can be switched off by the TCP_NODELAY Unix socket option. Switching off Nagle's algorithm is necessary for applications which send small amounts of data with no replies, such as a stream of mouse events, which have no data in the reverse direction. Still, Nagle's algorithm is recommended on both telnet and ftp connections [14], and the default TCP configuration uses Nagle's algorithm [7].
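For completeness, disabling Nagle's algorithm on a connection is done with a setsockopt() call of the following form (a minimal sketch; error handling is reduced to a perror() message).

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Disable Nagle's algorithm on an existing TCP socket. */
    static int disable_nagle(int fd)
    {
        int on = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || disable_nagle(fd) < 0)
            perror("TCP_NODELAY");
        return 0;
    }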


According to Fig. 3, a "small" segment is less than MSS bytes. A reflection is that in high-speed networks with large MTU's, Nagle's "small" segments are not actually small any more. For our ATM network a "small" segment is less than 9148 bytes.
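The segmentation decision of Fig. 3 can be summarized by the following sketch. It is our restatement of the figure, not the BSD code; all names are illustrative, and the example call corresponds to the S = 8 k, R = 16 k situation discussed in Section III.

    #include <stdio.h>

    /* Sketch of the TCP output decision of Fig. 3. Returns the number of
     * bytes to transmit now, or 0 to hold the data back. */
    static long tcp_segment_decision(long unsent, long unacked, long window,
                                     long mss, long max_peer_window, int nagle_on)
    {
        long can_send = unsent < window ? unsent : window;

        if (can_send >= mss)
            return mss;                /* a full-sized segment is always allowed  */
        if (unacked == 0 || !nagle_on)
            return can_send;           /* nothing outstanding, or Nagle disabled  */
        if (can_send >= max_peer_window / 2)
            return can_send;           /* at least half the peer's maximum window */
        return 0;                      /* Nagle: hold back the small segment      */
    }

    int main(void)
    {
        /* After the first 4096 byte segment there are 4096 unsent and 4096
         * unacknowledged bytes; with MSS = 9148 the remainder is held back. */
        printf("bytes sent: %ld\n",
               tcp_segment_decision(4096, 4096, 16 * 1024, 9148, 16 * 1024, 1));
        return 0;
    }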

D. Invocation of TCP Routines in BSD Unix

On transmit, the TCP output routine is initiated by the write system call. In BSD Unix, a process calling the kernel is never pre-empted by another process while executing a system call [12]. The kernel call must explicitly give up the processor by a sleep call or run to completion of the system call. System calls appear synchronously to the application, i.e., the application process is blocked until the system call returns. The time until the return from the system call includes any sleep calls while in the kernel. This means that a write system call with large user data sizes is blocked until all data are processed by both the socket and the protocol layers. As can be seen in Fig. 1, the process does a sleep when there is not enough space in the socket buffer for all the application data.

System call execution might however be interrupted by the bottom half of the kernel, by hardware interrupts. They occur asynchronously, unrelated to the current system call processing.

The receiving application does a read system call which is blocked until there is something to read from the socket layer. On a frame arrival the network interface will generate a hardware interrupt. The hardware interrupt routine runs the device driver, copies data to mbufs, and thereafter initiates a software interrupt. This software interrupt handling routine calls the higher-layer protocol, e.g., IP, which calls the TCP input routine. After TCP has processed data and put it into the receive socket buffer, a wake-up call is executed which puts the application process back on the ready queue.

E. Action Sequence on Acknowledgment Reception

An incoming segment with window update and/or acknowl-edgment information may trigger new data segments to be sentin the other direction. This depends on the current number ofbytes in the send buffer, Nagle’s algorithm and the size of theannounced window as previously described.

The initiation of segment transfer(s) on acknowledgment reception is one of the points which causes deadlocks, and it is implementation dependent. One might expect that the send algorithm should strive to form segments as large as possible in order to reduce overhead. The obvious thing to do would be to copy as much data as possible into the buffer before TCP output is called, especially since an acknowledgment will release buffer space. This is not the case in BSD Unix. On the contrary, the action is to first transmit available bytes and then to ask for a refill of the buffer. The argument for doing it in this order is that the application process must be woken up, put on the ready queue and eventually executed in order to copy more data. This could cause an unacceptable delay. As a consequence TCP may send small segments. As will be shown later, this order has a more devastating consequence when Nagle's algorithm decides not to send non-MSS segments.

III. OBSERVED TCP THROUGHPUT DEADLOCKS

All throughput performance measurements are run between two Axil 311/5.1 (135.5 MIPS) Sparc10 architecture machines using the FORE ASX-100 ATM switch and FORE ATM SBA-200/175 network adapter cards (140 Mb/s physical transmission) with the version 2.2.6 device driver. The ATM network interface MTU is 9188 bytes, making TCP compute its MSS as 9148 bytes.

We used the ttcp program to measure the TCP memory-to-memory throughput. The ttcp program was modified to set the size of the send and receive socket buffers dependent on the values of its command-line arguments. The user data size of the write/read system call was 8192 bytes. Each reported throughput measure is an average of 25 runs, each transferring 16 Mbytes of data between the two machines. The standard deviation of the majority of the measured points is less than 1% of the average.

In order to analyze the flow of segments on the ATM connections we used probes within the network drivers to log all packets on a connection. The probes parse the TCP/IP packets and register them as events in a log table during a data transfer. Included in each event is a time stamp, an event code and a length field. The time stamp is generated using the SunOS uniqtime() kernel function which accesses the internal microsecond hardware clock. The length field is used to log the announced window size, the TCP packet length, and sequence numbers. The contents of the log table are printed off-line using the kvm library functions available in SunOS 4.1.x. The probes were not active during the performance measurements themselves.

Table I presents the throughput for different sizes of the socket send and receive buffers.¹ In the following we will describe the throughput degradations depicted as grey shaded entries in Table I. S is the size of the send socket buffer, R is the size of the receive socket buffer. As can be seen from the table, dramatic drops in throughput occur when the receive buffer space is larger than the sender space. The slow-start [15] behavior is not an issue in these TCP performance measurements, since both the sender and receiver reside on the same IP subnetwork. Thus, there is no slow-start behavior in the following traces. Nevertheless, we have run the same measurements and included the slow-start behavior. Deadlocks occur during the slow-start, and when the send congestion window is up to its maximum size, the behavior is as without the slow-start.

The degradations in Table I are due to either the inherent delayed acknowledgment strategy itself, a combination of the delayed acknowledgment strategy and use of Nagle's algorithm, or the sender-side silly-window avoidance rule.

¹ In SunOS 4.1.x the maximum allowed socket buffer size is 52428 bytes. In our measurements we have increased this to the TCP maximum window size 2^16 - 1 = 65535.


TABLE I
TCP Mb/s THROUGHPUT OVER ATM FOR DIFFERENT SOCKET BUFFER SIZES AND A USER DATA SIZE OF 8192 BYTES

  S \ R   4k    8k    16k   24k   32k   40k   48k   52k   56k   64k
  4k      23    23    -     -     -     -     -     -     -     -
  8k      26    -     -     -     -     -     -     -     -     -
  16k     -     -     -     -     -     -     -     -     -     -
  24k     26    34    40    40    40    40    0.8   2     4     3
  32k     24    32    35    51    47    47    47    47    47    47
  40k     26    35    36    55    58    58    58    57    58    56
  48k     26    36    37    56    58    62    63    63    63    64
  52k     26    35    37    54    58    62    63    64    64    64
  56k     26    35    37    56    61    63    64    64    64    64
  64k     26    36    35    54    61    62    62    63    63    64

(Entries marked "-", among them the grey shaded deadlock entries discussed in the text, are not recoverable from the scan. The shading in the original distinguishes degradations due to sender-side silly-window syndrome avoidance, degradations predictable from the acknowledgment strategy, combinations of the socket copy rule and Nagle's algorithm, and combinations of the timer acknowledgment and Nagle's algorithm.)

A. Classification of the Throughput Anomalies

In this section we will classify the shaded areas in Table I. The degradation caused by the silly-window avoidance effect will not be discussed. The sender-side silly-window syndrome [12] may occur when the send socket buffer size is more than MSS bytes larger than the receive socket buffer size. We will focus on the other classes.

For the grey entries in Table I the throughput drop is caused by the same phenomenon: the sender cannot transmit enough data to trigger a window update at the receiver which an acknowledgment can piggyback on. We have a deadlock situation where the sender refrains from sending more data, and the receiver refrains from returning an acknowledgment. This deadlock can only be resolved by the cyclic timer-generated acknowledgment. Thereafter, the sender starts to send again, but after a while the connection will be back in the same deadlocked situation. Hence, the connection has a stop-and-wait behavior in which data is prompted by the 200 ms cyclically generated acknowledgment. The deadlock throughput is decided by the cycle for timer generated acknowledgments and the amount of data transmitted until the next deadlock. (The differences in deadlock throughput for the grey table entries are only due to the size and number of the segments transmitted in-between connection deadlocks.)

The combinations of send, S, and receive, R, socket buffer sizes which cause deadlock situations are marked with two shades of grey in Fig. 4. For combinations in the darker grey area, the connection goes directly into deadlock. For the lighter grey area it may take some time before deadlock occurs. It takes longer for the entries S = 24 k, R = 48 k to 64 k. Due to the measuring method we therefore get a higher average throughput for them. When the connection gets into the deadlock situation, the throughput will be 0.66 Mb/s (7236 + 9148 = 16 kbytes transmitted every 200 ms).

Fig. 4. Anomalous socket buffer size combinations for MSS = 9148 bytes (boundaries at S = R, S = 0.35R, S = 0.35R + MSS, 2MSS, and 4096/0.35 = 11702 bytes).
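The deadlock throughput figures above follow directly from the amount of data moved per 200 ms acknowledgment cycle. The helper below restates that arithmetic; it is an illustration only, not part of the measurement code.

    #include <stdio.h>

    /* Throughput of a deadlocked connection: the bytes transmitted between
     * deadlocks divided by the 200 ms delayed-acknowledgment timer period. */
    static double deadlock_throughput_mbps(long bytes_per_cycle)
    {
        return (bytes_per_cycle * 8.0) / 0.200 / 1e6;
    }

    int main(void)
    {
        printf("4096 bytes/cycle  -> %.2f Mb/s\n",
               deadlock_throughput_mbps(4096));          /* about 0.16 */
        printf("16384 bytes/cycle -> %.2f Mb/s\n",
               deadlock_throughput_mbps(7236 + 9148));   /* about 0.66 */
        return 0;
    }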

The throughput deadlocks can be partitioned into three classes:

● deadlocks predictable from the window update rules;
● deadlocks caused by the socket layer data copying rules and Nagle's algorithm;
● deadlocks caused by the timer generated acknowledgment and Nagle's algorithm.

B. Deadlocks Predictable from the Window Update Rules

In this class, in the dark grey area in Fig. 4, the send socket buffer size is less than 35% of the receive socket buffer size, and also less than twice the MSS of the ATM network. Knowing the window update algorithm and acknowledgment strategy of TCP, these results are predictable and are implementation independent. Even if the whole send socket buffer is sent, it is not enough to trigger a window update onto which to piggyback an acknowledgment.

The following inequalities hold between the send socket buffer and the receive socket buffer marked with the dark grey area in Fig. 4:

(S < 0.35R) ∧ (S < 2MSS).    (1)

For example, consider the entry S = 8 k, R = 24 k in Table I, which yields 0.16 Mb/s. The socket layer copies 4 kbytes into the socket send buffer before it calls TCP. At the receiving side, these 4 kbytes are less than 35% of 24 k and less than 2MSS. Therefore, no window update will be returned. The sender can, and will, copy another 4 kbytes into the send socket buffer, but due to Nagle's algorithm the sender refrains from sending these 4 kbytes until an acknowledgment has been received. There is now no more space for copying in additional data. The connection is deadlocked and the returned acknowledgment will be timer generated. When this acknowledgment is received by the sender, the sender first transmits the remaining bytes in the send socket buffer before it starts copying more data from the application to the socket buffer.


Thus, the connection is stop-and-wait with 4 kbytes sent every 200 ms, which gives a throughput of 0.16 Mb/s. Actually, the connection will deadlock independently of Nagle's algorithm. Even if the whole 8 kbyte send socket buffer is sent, it will still be less than 35% of the receive buffer.

Now consider the entries S = 16 k, R = 48 k to 64 k, which yield 0.49 Mb/s. Here 3140 + 9148 bytes are immediately sent as two segments when the timer generated acknowledgment releases buffer space. Otherwise the behavior is as described above.

C. Deadlocks Caused by the Data Copying Rules and Nagle's Algorithm

In this class, the light grey area in Fig. 4, S is larger than 0.35R or 2MSS. A deadlock situation is therefore not predictable according to the TCP window update strategy. In any case, the TCP connection sooner or later phases into a behavior which relies on timer generated acknowledgments to resolve deadlocks. Deadlocks in this class are caused by BSD Unix implementation decisions and SunOS optimizations.

The area in Fig. 4 is bounded by the dark grey area and the following inequalities between the send socket buffer, S, and the receive socket buffer, R:

(S < 0.35R + MSS) ∧ (S < 3MSS).    (2)

For small S and R there are some boundary effects, which will be discussed later. The upper limit 3MSS is caused by the implementation decision to first send available bytes in the socket send buffer and thereafter copy from user space. If it were done the other way round, the upper limit would instead be 2MSS.
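The risky buffer combinations described by (1) and (2) can be written as predicates. The sketch below is an illustration of the inequalities only; boundary effects for very small buffers are ignored, and the function names are ours.

    #include <stdio.h>

    /* Dark grey area of Fig. 4, inequality (1): deadlock is predictable. */
    static int predictable_deadlock(long S, long R, long mss)
    {
        return S < 0.35 * R && S < 2 * mss;
    }

    /* Light plus dark grey areas, inequality (2): deadlock is possible. */
    static int possible_deadlock(long S, long R, long mss)
    {
        return S < 0.35 * R + mss && S < 3 * mss;
    }

    int main(void)
    {
        long mss = 9148;
        printf("S=8k,  R=24k: predictable=%d possible=%d\n",
               predictable_deadlock(8 * 1024, 24 * 1024, mss),
               possible_deadlock(8 * 1024, 24 * 1024, mss));
        printf("S=16k, R=32k: predictable=%d possible=%d\n",
               predictable_deadlock(16 * 1024, 32 * 1024, mss),
               possible_deadlock(16 * 1024, 32 * 1024, mss));
        return 0;
    }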

For example, consider the entry S = 8 k, R = 16 k which yields 0.16 Mb/s. S is big enough (50% of R) to trigger the 35% window update rule. This is what happens: The first write system call of 8 kbytes results in only 4 kbytes being copied into the send socket buffer before TCP is called, see Fig. 1. Fig. 5(a) illustrates this by showing, for the S = 8 k, R = 16 k entry, the sender side data segment transmission and acknowledgment reception, that is, the number of outstanding unacknowledged bytes. TCP transmits a segment with 4096 bytes, and the last 4096 bytes of the write call are copied into the send socket buffer. At this point in time there are 4 k unacknowledged bytes, and 4 k new unsent bytes in the send buffer. Due to Nagle's algorithm these new 4 kbytes cannot be transmitted since they are less than MSS. At the receiver, the window can slide only 25% (4 k/16 k), so there is no window update to piggyback an acknowledgment onto. At this stage, the connection is deadlocked and TCP acts as a stop-and-wait protocol with a window of 4096 bytes, and acknowledgments generated every 200 ms. The achieved throughput is 20 kbyte/s or 0.16 Mb/s. After a small initial phase the entry S = 16 k, R = 40 k also goes directly into deadlock and transmits 12 kbytes in-between deadlocks.

Common to these two examples is that the connection immediately gets into deadlock. A deadlock situation occurs when b_sent bytes are sent which are not enough to advance the window, but are enough to block the sender from creating a new MSS segment. Connection deadlock happens when:

(b_sent < 2MSS) ∧ (b_sent < 0.35R) ∧ ((S_byte - b_sent) < MSS)    (3)

where S_byte is the number of data bytes in the socket send buffer. The inequality is illustrated in Fig. 5(b).

Fig. 5. (a) Trace of b_sent, S = 8 k, R = 16 k (the dotted line marks 0.35 of R = 16 k). (b) Snapshot of the number of data bytes in the send and receive socket buffers.
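Condition (3) can be checked directly; the sketch below plugs in the S = 8 k, R = 16 k trace of Fig. 5 as an example (function and variable names are ours).

    #include <stdio.h>

    /* Deadlock condition (3): b_sent outstanding bytes are too few to trigger
     * a window update at the receiver, and the free space in the send buffer
     * (S_byte is the number of data bytes currently in it) is too small to
     * form a full MSS segment. */
    static int connection_deadlocked(long b_sent, long S_byte, long R, long mss)
    {
        return b_sent < 2 * mss
            && b_sent < 0.35 * R
            && (S_byte - b_sent) < mss;
    }

    int main(void)
    {
        /* The S = 8 k, R = 16 k trace: 4096 bytes sent, 8192 bytes in the
         * send buffer, 16 kbyte receive buffer, MSS of 9148 bytes. */
        printf("deadlocked: %d\n",
               connection_deadlocked(4096, 8192, 16 * 1024, 9148));
        return 0;
    }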

The sender may send several segments in-between deadlocks, thus the throughput may vary. The absolute upper limit on the throughput after deadlock is, however, (0.35R + MSS) per 200 ms.

D. Deadlocks Caused by the Timer Generated Acknowledgment and Nagle's Algorithm

For entries like S = 16 k, R = 24 k/32 k it seems that there should be no problem, because after 4 kbytes are transmitted a full MSS of 9148 bytes can be constructed, which should prompt a window update. But assume now instead that at some point in time there will be, say, an 8 kbyte segment sent. This 8 kbyte segment is not big enough to advance the window, and the available space in the send socket buffer is not big enough to create a full MSS. But how could there be an 8 k segment sent? In short, the timer generated acknowledgment may arrive just after 8 k has been copied into an empty send buffer. When this situation is reached, 8 k is sent every 200 ms. Observe that this is caused by the fact that the implementation, on reception of an acknowledgment, first sends what is available in the send buffer and thereafter copies more bytes into the buffer. This order is BSD Unix dependent but may very well exist in other implementations as well. The Appendix gives a detailed presentation of the 0.35 Mb/s throughput result with a 16 k send buffer and a 32 k receive buffer.

[Fig. 6: (a) trace of the outstanding unacknowledged bytes over time for S = 16 k, R = 32 k; (b) snapshot of the socket send buffer. The plot detail is not recoverable from the scan.]

Fig. 6(a) depicts how an S = 16 k, R = 32 k connection gets into deadlock after about 350 ms. Assume the sender has transmitted b_sent bytes. The connection deadlocks if a timer generated acknowledgment acknowledges b_ack of these b_sent bytes such that b_sent - b_ack satisfies (3), that is,

((b_sent - b_ack) < 0.35R) ∧ ((b_sent - b_ack) < 2MSS) ∧ ((S_byte - (b_sent - b_ack)) < MSS).    (4)

Fig. 6(b) presents a snapshot of the socket send buffer in this situation. Due to the segment flow on the TCP connection, the probability that a timer generated acknowledgment actually acknowledges b_ack bytes is very high: the connection deadlocks within 600 ms. See Appendix A for details.

With the SunOS 4.1.x socket layer optimization of calling the protocol for at least every 4 kbytes, the first segment is a maximum of 4 kbytes, independent of the user data size in the write call as long as this is larger than 4 kbytes. This will get the connection directly into deadlock if R is larger than 4096/0.35 = 11702 bytes. This gives a vertical line at 4096/0.35 = 11702 bytes in Fig. 4. If R < 11702, the deadlock depends on the user data size. A user data size smaller than 4 k will move this boundary to the left.

Another boundary effect is as follows. When b_sent bytes are sent, there is potential for another S_byte - b_sent bytes to be transmitted. Depending on b_sent and S_byte, this segment may be less than MSS, and with Nagle's algorithm in use (see Fig. 3) such a small segment is sent only if:

((S_byte - b_sent) ≥ R/2).    (5)

Inequality (5) is due to TCP sending a segment if the size of the segment is at least half the maximum advertised receive window. This strategy is used to cope with an initial problem of the sender avoidance of the silly-window syndrome when communicating with hosts with tiny buffers, e.g., 512 bytes [12]. Thus, if S_byte - b_sent is larger than half the maximum advertised window R, for R less than 2MSS a small segment is transmitted independent of Nagle's algorithm. In Fig. 4, (5) is drawn with b_sent equal to 4 kbytes.

Fig. 7. Ethernet anomalous socket buffer size combinations, MSS = 1460 bytes (boundaries at S = R, S = 0.35R, S = 0.35R + MSS, 2MSS, and 3MSS).

F. Throughput Deadlock Area Depends on MSS

The deadlocks above also occur on other networks. The smaller/larger the network MTU, the smaller/larger the hazardous send and receive socket size combination area. The larger the deadlock area is relative to the total send and receive buffer/window space, the higher the probability of being within the deadlock area.

The hazardous socket size combinations for ethernet (without the boundary effects) are depicted in Fig. 7. Due to the smaller MTU (1500 bytes) the number of combinations to avoid is much smaller, and the deadlock area is much smaller relative to the total window and buffer size space. Increasing the total window size by using the TCP window scale option [4] would, for an unchanged value of the MSS, reduce the vulnerable area. If the MSS is increased, the deadlock area would increase accordingly.

G. Defeating the Throughput Deadlocks

This section discusses how to avoid or prevent the deadlock situations. Within existing implementations, the obvious avoidance solution is to keep away from the dangerous S and R combinations, i.e., to ensure that S ≥ 3MSS or S ≥ R. Another straightforward prevention solution is to turn off Nagle's algorithm. A third alternative is to let TCP explicitly acknowledge each and every incoming segment. We have done measurements with Nagle's algorithm turned off and with an acknowledgment for each segment. These measurements are presented in Table II. The throughput result in the upper row of a Table II entry is with Nagle's algorithm turned on; for comparison, the middle row has results from measurements with Nagle's algorithm turned off, and the lower row has results from when there is an acknowledgment for each segment. Other possible changes within an implementation are discussed in [16].


TABLE II
TCP Mb/s THROUGHPUT OVER ATM: (a) WITH NAGLE'S ALGORITHM, (b) WITHOUT NAGLE'S ALGORITHM, (c) ACKNOWLEDGMENT ON EACH SEGMENT
[Each entry has three rows: (a) Nagle's algorithm on, (b) Nagle's algorithm off, (c) an acknowledgment on each segment. Shading marks degradations predictable from the acknowledgment strategy, caused by the socket copy rule and Nagle's algorithm, or caused by the timer acknowledgment and Nagle's algorithm. The individual entries are not recoverable from the scan.]

These include changing Nagle's algorithm, the size of the MSS, and removing the 4 k limit on the number of bytes copied to the socket buffer before TCP is called. We believe that these changes are less attractive.

Switching off Nagle's algorithm, i.e., setting the TCP_NODELAY option, removes the lightly grey shaded low throughput entries without a significant performance penalty for other socket size combinations. As expected, the throughput in the darkly shaded area is still doomed to be low. In some of these entries there is an increase, since now the whole send buffer is sent in-between deadlocks instead of only 4 kbytes. For small buffers (< 40 k) turning off Nagle's algorithm has no negative effect on performance, while larger buffers result in a marginal throughput degradation due to more small segments.

Letting TCP explicitly acknowledge every incoming segment is a kernel compile option in SunOS 4.1.x. From Table II it is evident that all deadlocks disappear when the connection is independent of delayed acknowledgments. For buffer sizes greater than 40 kbytes the increased number of segments and acknowledgments has a slightly negative effect on throughput. For smaller buffer sizes, there is instead a performance gain.

If the receive side returned an acknowledgment whenever the TCP PUSH bit is set, as suggested in [13], it would only prevent the predictable deadlocks which transmit just one segment between the timer generated acknowledgments.

IV. OBSERVED TCP RESPONSE TIME DEADLOCKS

When we measured TCP response time we observed another class of deadlock situations in the otherwise safe areas, S ≥ R and S ≥ 3MSS. TCP was used as a reliable transport service for request and response messages of remote procedure calls (RPC's). For some message and buffer sizes the TCP RPC relied on the timer generated acknowledgment, which prolonged the transmission time of a message by up to 200 ms.

socket layer and thereafter blocks for a server response. Theclient message will be sent as one or several segments to theserver. The TCP at the server side will collect these segmentsand deliver them as a request to the server user process.The system will not, and cannot start the server process untilthe whole request has arrived. When the request arrives, theserver action starts and a response message is eventually sentback to the client. A deadlock situation occurs when the TCP

client refrains from sending the last segment of a request dueto Nagle’s algorithm and when there is no window updateto piggyback an acknowledgment onto. There could be noresponse message to piggyback onto either since the serverprocess has not yet started the processing because the requestis not complete.

The increase in the message transmission time depends on when the TCP timer generated acknowledgment is returned. Moreover, the RPC response time may increase by up to 400 ms if both the request and the response messages are caught in deadlock situations.

As with the “unpredictable” throughput deadlocks, turningoff Nagle’s algorithm removes the dependence on a timergenerated acknowledgment. The request and response mes-sages may then be transmitted as several non-MSS segments,regardless of the number of outstanding unacknowledgedbytes.

A. RPC Deadlocks on the ATM Network

The RPC response time measurements were done by aclient-server program that sets up a TCP connection, executesa specified number of RPC’s back-to-back and measures thetime for each of them. The command-line arguments of theclient-server program include parameters for the size of thesocket buffers, the size of the application send and receivebuffers, the size of the request and response message, and thenumber of RPC’s to be executed.
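A sketch of what the client side of one such timed RPC looks like is given below. It is an illustration only, not the measurement program itself; it assumes an already connected TCP socket and a 10 byte response, and it uses gettimeofday() for the timestamps (the kernel probes of Section III used uniqtime()). In the measurement program a function like this would be called 100 times back-to-back for each request message size.

    #include <sys/types.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Write the request on a connected TCP socket, block until the 10 byte
     * response arrives, and return the elapsed time in milliseconds
     * (or a negative value on error). */
    static double timed_rpc_ms(int fd, const char *req, size_t req_len)
    {
        char resp[10];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        if (write(fd, req, req_len) != (ssize_t)req_len)
            return -1.0;
        if (read(fd, resp, sizeof(resp)) <= 0)   /* response fits in one segment */
            return -1.0;
        gettimeofday(&t1, NULL);

        return (t1.tv_sec - t0.tv_sec) * 1000.0 +
               (t1.tv_usec - t0.tv_usec) / 1000.0;
    }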

In the following figures we present RPC response time measurements as a function of the user request message size. In all measurements, the server returns a response message of 10 bytes which will be transmitted as one segment and will not suffer a deadlock. Each of the measured points is the average time of 100 consecutive RPC calls. When deadlocks occur within consecutive calls, the first time-out will be between 0 and 200 ms and the remaining time-outs will be close to 200 ms, since a new request is immediately sent after the response arrives. Therefore, when a deadlock situation is repeated for each RPC, the measured average response time will be close to 200 ms.

Fig. 8. TCP over ATM response times for 4, 16, and 32 kbyte socket buffers.

Why do we rely on a timer generated acknowledgment to transmit the whole request message? Fig. 8 presents request-response time measurements over TCP and ATM for different request message sizes and for 4, 16 and 32 kbyte symmetric socket buffer sizes. Referring to the throughput deadlock situations, there should be no deadlock situations with symmetric buffer sizes. With a 4 kbyte window there are indeed no RPC deadlocks. But on the other hand, both 16 and 32 kbyte windows experience deadlocks for a continuous range of user request message sizes. When Nagle's algorithm is turned off, there are no deadlocks.

A 4 kbyte symmetric buffer size TCP connection will behave as a stop-and-wait protocol. For user request messages less than 35% of 4096 bytes, which is 1433 bytes, the whole message is delivered at once to the server and the acknowledgment is piggybacked onto the response message. For messages larger than 1433 bytes and up to 4096 a window update is generated, so for these messages there is no problem. For messages above 4096, one or several 4096 byte segments are sent with a window update for each of them. The last segment may be less than 35% of 4096, but then the corresponding acknowledgment will be piggybacked onto the response message.

With a 16 kbyte window, request messages between 4097 and 12287 bytes face a deadlock situation. Since the socket layer copies at most 4096 bytes before TCP is called, all messages will have an initial segment of 4096 bytes. The next segment length depends on the request message size. If the message is less than 12288 bytes, the TCP client refrains from sending the next segment due to Nagle's algorithm. At

the same time the 4096 byte segment is not acknowledged, since the window can slide only 4/16, i.e., 25%. Thus, a timer generated acknowledgment is required before the second segment is transmitted. When the message size is 12288 bytes, the size of the second segment is 8192 bytes. The reason Nagle's algorithm does not apply here is that the segment is exactly half the maximum receiver announced window size (see Fig. 3).

With a 32 kbyte window, the situation is the same as for the 16 kbyte window, except for the fact that 8192 bytes now are less than half the maximum announced window size. Thus, a segment is not transmitted until it constitutes MSS bytes. Thereby, deadlocks will occur for messages between 4097 and 4096 + 9147 = 13243 bytes.
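The message-size ranges above follow from the 4096 byte first segment plus the point at which the second segment may be sent, i.e. when it reaches either MSS bytes or half the maximum announced window. The helper below restates this arithmetic for the 16 and 32 kbyte windows; it is an illustration only and does not model the 4 kbyte case, where window updates prevent the deadlock.

    #include <stdio.h>

    /* Largest request size that is caught in the RPC deadlock for the
     * symmetric-buffer measurements of Fig. 8. */
    static long largest_deadlocked_request(long window, long mss)
    {
        long second_segment_limit = mss < window / 2 ? mss : window / 2;
        return 4096 + second_segment_limit - 1;
    }

    int main(void)
    {
        printf("16 kbyte window: 4097 .. %ld bytes\n",
               largest_deadlocked_request(16 * 1024, 9148));   /* 12287 */
        printf("32 kbyte window: 4097 .. %ld bytes\n",
               largest_deadlocked_request(32 * 1024, 9148));   /* 13243 */
        return 0;
    }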

This type of deadlock also occurs on other networks, such as ethernet. Fig. 9 shows TCP response time measurements over ethernet as a function of the request message size on a connection with 4 kbyte socket buffers. From the figure it is evident that ethernet has a similar behavior. Table III shows a segment transmission trace of a 4852 byte request-response message over an ethernet leading to a deadlock. The deadlock occurs at line 6, where 756 bytes are held back in the send buffer because the segment DATA_1176 was not enough to generate a window update. When Nagle's algorithm is turned off, the 756 byte segment on line 6 will instead be sent immediately and the whole RPC message can be delivered to the server, which in turn generates a response message DATA_10 with a piggybacked acknowledgment.

TABLE III
TRACE OF AN RPC SEGMENT FLOW WITH EQUAL CAPACITY SERVER AND CLIENT

     Write/read   Copy before  TCP wakes  Send buf  Unacked  Packets on             Recv buf  Wakeup,  Read/write
     data         TCP call     process    bytes     bytes    the wire               bytes     TO       data
  1  write(4852)  4096                    4096      1460     >> DATA_1460 >>        1460      wkup     read(1460)
  2               sleep                   4096      2920     >> DATA_1460 >>        1460      wkup     read(1460)
  3                                       4096      4096     >> DATA_1176 >>        1176      wkup     read(1176)
  4                            wkup       2636      2636     << ACK_1460 <<         0
  5                            wkup       1176      1176     << ACK_1460 <<         0
  6               756                     1932      1176                            0
  7                            wkup       756       0        << ACK_1176 <<         0         TO
  8                                       756       756      >> DATA_756 >>         756       wkup     read(756)
  9  read(10)                  wkup       0         0        << DATA_10, ACK_756 <<                    write(10)

TABLE IV
TRACE OF RPC SEGMENT FLOW WITH SERVER CAPACITY LESS THAN CLIENT CAPACITY
[The trace again covers a 4852 byte request, lines 1-7, this time with a slower server; the individual entries are not recoverable from the scan. It is discussed in Section IV-B below.]

Fig. 9. TCP over ethernet response times for 4 kbyte buffers.

Amazingly, the probability of RPC deadlocks strongly depends on the processing capacity and the load of the server relative to the client. The paradox is that the better the server machine can match its processing of incoming packets to their arrival rate, the more likely there will be a deadlock situation, i.e., the worse the average response time!

B. Server and Client Processing Capacity

The average response time measurement results vary significantly dependent on the server processing capacity relative to the client capacity. To illustrate this we present RPC response time measurements over ethernet between machines with different processing capacity (a 28.5 MIPS SunIPX and a 12.5 MIPS SunSLC). For these measurements 4 kbyte buffers were used. The results are presented in Fig. 10(a)-(c). One hundred requests were sent for each request message size.

Fig. 10. RPC response time dependent on relative client-server processing capacity. (a) Equal capacity server and client. (b) Server capacity higher than client capacity. (c) Server capacity less than client capacity.

With an equal capacity server and client, as in Fig. 10(a), there is one expected plateau after 4096 bytes. The response time ripples around 8 and 12 kbytes show that only a few of the 100 requests got into the deadlock situation. For a faster server, as in Fig. 10(b), the deadlock plateau is repeated at 4096k + n bytes, where 4096k is an integer multiple of the window size and n = 1, 2, ..., 1459.

When the server has lower capacity than the client, as in Fig. 10(c), there is no plateau; instead there are some ripples around 4096 bytes. Ripples occur when only a few requests are caught in deadlocks. For example, in one measurement with


the request message size of 4852 bytes, there were only 30 messages out of 1000 RPC's that were caught in the deadlock situation. Table IV illustrates what happens with this 4852 byte request. The acknowledgment on line 5 acknowledges two segments, while a faster server (compare with Table III) would be able to generate a separate acknowledgment for each of them. The effect of this acknowledgment is that the send buffer will be cleared. Hence, Nagle's algorithm will allow the next small segment to be sent directly. Thus, a relatively fast server may be able to process and deliver each segment individually, while a slow server may have to back up segments and deliver, as well as acknowledge, several of them at the same time.

We have also run throughput measurements with a receiver capacity less than the sender capacity and found no differences from the results presented in previous sections. The occurrence of a throughput deadlock is independent of the relative processing capacity, since it is caused by the socket layer memory management or a timer generated acknowledgment. This is independent of the end-system processing capacity. For the RPC deadlock to occur, there is a dependence on the server coping with several back-to-back segments. This clearly depends on the server processing capacity.

Fig. 10. RPC response time dependent on relative client-server processing capacity. (a) Equal capacity server and client. (b) Server capacity higher than client capacity. (c) Server capacity less than client capacity. [Each panel plots response time (ms, 0-200) versus request message size (0-16384 bytes), with Nagle's algorithm on and off.]

C. Relation to the Work of Crowcroft et al.

Crowcroft et al. [8] report on the response time plateau effect as a "boundary effect" and conclude that it is due to a mismatched interface between the socket and protocol communication layers. They touch upon the buffering strategy in the socket layer between the user and the protocol, the delayed acknowledgment strategy of TCP, and the use of Nagle's algorithm. We ran an ethernet experiment similar to that in [8]. It sends a 4852 byte request message 1000 times. The message is sent over a TCP connection with a 4096 byte window and a 4096 byte send and receive buffer. Crowcroft et al. reported that they had turned off Nagle's algorithm but observed little improvement. In their traces with Nagle's algorithm turned off, we cannot see any timer generated acknowledgments as they claim there to be. Hence, we question their observation that turning off Nagle's algorithm showed little improvement on the RPC measurements. Their experiment seems to focus more on the number and size of segments and less on the dependence on the timer generated acknowledgment. Their solution is to set the socket low-watermark to at least MSS bytes. This will indeed affect the number and size of small segments to be transmitted on the connection, but it will not remove the RPC deadlock. Therefore, matching the user buffer to the socket buffer would not solve the problem.
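For concreteness, the low-watermark remedy proposed in [8] corresponds on BSD-derived systems to a setsockopt() call along the following lines. This is a sketch only: whether SO_SNDLOWAT is actually adjustable is system dependent, and, as argued above, the setting does not remove the RPC deadlock.

#include <sys/types.h>
#include <sys/socket.h>

/* Sketch of the remedy proposed in [8]: raise the send low-watermark to
 * at least one MSS so the socket layer does not hand TCP (or wake the
 * writer for) less than a full-sized segment's worth of buffer space.
 * Assumes a BSD-style stack that honors SO_SNDLOWAT.                   */
static int set_send_lowat(int sock, int mss)
{
    return setsockopt(sock, SOL_SOCKET, SO_SNDLOWAT, &mss, sizeof mss);
}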

V. CONCLUSIONS

The deadlock situations in the TCP data transfer phase are primarily due to an interaction between the TCP delayed acknowledgment scheme, Nagle's algorithm, and a large MSS. A TCP connection is caught in a deadlock situation when the sender refrains from sending a non-MSS segment and the receiver refrains from returning a window update with a piggybacked acknowledgment. The deadlock is resolved by a timer generated acknowledgment. The window update rules and Nagle's definition of a "small" segment are important. In high-speed networks with large maximum transmission units, such as ATM AAL5, the unit which Nagle's algorithm considers small is no longer small, which makes the deadlock situations more likely to occur.
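The two refusals can be written down as a small model. The sketch below is not kernel code; it assumes the 35% window-slide threshold observed in the measured SunOS 4.1.3 stack, ignores other acknowledgment triggers, and uses invented names.

/* A model of the deadlock condition described above (not kernel code). */
struct snd_state {
    long ready;     /* bytes queued in the send buffer, not yet sent */
    long unacked;   /* bytes sent but not yet acknowledged           */
    long mss;
    int  nodelay;   /* nonzero if TCP_NODELAY is set                 */
};

struct rcv_state {
    long can_slide; /* bytes the advertised window could be advanced */
    long rcvbuf;    /* receive socket buffer size R                  */
};

/* Nagle's algorithm: a sub-MSS segment leaves only when nothing is
 * outstanding (or Nagle's algorithm is disabled).                    */
static int sender_sends(const struct snd_state *s)
{
    return s->ready >= s->mss || s->unacked == 0 || s->nodelay;
}

/* Window update rule (35% threshold as observed in these measurements):
 * no immediate acknowledgment unless the window can slide far enough.  */
static int receiver_updates(const struct rcv_state *r)
{
    return 100 * r->can_slide > 35 * r->rcvbuf;
}

/* Neither side acts: only the 200 ms delayed-ack timer resolves this. */
static int deadlocked(const struct snd_state *s, const struct rcv_state *r)
{
    return s->ready > 0 && !sender_sends(s) && !receiver_updates(r);
}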

The throughput deadlocks depend heavily on the send and receive socket buffer sizes. For a send buffer size S less than 35% of the receive buffer size R and less than 2MSS, the connection will always get into a deadlock. The throughput deadlocks in the area ((S < 0.35R + MSS) ∧ (S < 3MSS)) are caused by the way the implementation is structured, such as the BSD Unix sequence of actions on acknowledgment reception and the socket layer optimization for efficient memory management.
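The same area can be written as a predicate; the function below is a direct transcription of the condition above, with S, R, and MSS in bytes and nothing added.

/* Transcription of the throughput-deadlock area quoted above:
 * (S < 0.35 R + MSS) and (S < 3 MSS), with all sizes in bytes. */
static int in_throughput_deadlock_area(long s, long r, long mss)
{
    return s < 0.35 * r + mss && s < 3 * mss;
}

The configuration of the Appendix satisfies both inequalities: with S = 16 384, R = 32 768, and MSS = 9148, we have 0.35R + MSS ≈ 20 617 and 3MSS = 27 444.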

The RPC deadlocks also depend on the size of the request and response messages relative to the socket buffer sizes, as well as on the relative processing capacity of the RPC client and server. Turning off Nagle's algorithm prevents the RPC deadlocks. Still, the transmission time for a large request or response message may be prolonged due to a throughput deadlock.

We have presented ways of avoiding and preventing deadlocks within TCP implementations (a sketch of the corresponding socket settings follows the list):

• Avoiding the throughput deadlock area can be done by careful settings of the size of the socket buffers. A send socket buffer equal to or larger than three MSS's, or equal to or larger than the receive socket buffer, avoids all deadlocks. By setting the send socket buffer to a size of at least three MSS's, the sender can avoid deadlocks independent of knowledge of the socket buffer size at the receiver. The sender could also adjust its send buffer size, based on the window updates received from the receiver, by a careful setting of the buffer sizes such that S is greater than or equal to R.

• The sender can actively prevent unpredictable throughput deadlocks and all RPC deadlocks by turning off Nagle's algorithm, using the TCP_NODELAY option. This allows the transmission of small segments even if there are outstanding unacknowledged bytes. Thereby, the receiver may get the amount of bytes required to slide the window and return a window update (or the response message in the RPC case) onto which an acknowledgment is piggybacked.

• The receiver can prevent all deadlocks by not relying on the delayed acknowledgment strategy and instead explicitly acknowledging every incoming segment. In SunOS 4.1.x there is a kernel compile option which forces all TCP connections to acknowledge every segment. For hosts connected to heterogeneous networks this may create a potential performance disadvantage, as the delayed acknowledgment strategy was introduced to increase performance [13].
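The first two remedies are ordinary socket options. The sketch below shows one way an application could apply them; it is an illustration with example values, not the paper's code, and the third remedy is a kernel build option with no per-socket call.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Sketch of the sender-side remedies listed above (illustrative values).
 * Either make the send buffer at least 3 MSS (and/or at least as large
 * as the peer's receive buffer), or disable Nagle's algorithm.          */
static int apply_remedies(int sock, int mss)
{
    int sndbuf = 3 * mss;          /* S >= 3 MSS avoids the deadlock area */
    int on = 1;

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf) < 0)
        return -1;
    /* Alternatively (or additionally): allow sub-MSS segments to leave
     * even while bytes are outstanding.                                  */
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}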

All the above solutions are within the current implementation requirements [7] and operate independently of the TCP implementation of the peer. An alternative solution, but not within the requirements, is to modify Nagle's algorithm to allow small segments [16].

For the next generation of TCP we believe that additional per-connection options should be offered in order to adjust to application and network requirements. One such option could be the setting of the acknowledgment strategy per connection. Another is the setting of the MSS, which today depends only on the underlying network MTU and the MSS option exchange with the peer [7]. Assuming that the window size cannot be changed, for instance due to some network traffic contract, the user could set the connection MSS to assure that the window size is at least three MSS's. Such a setting of the MSS could also be used as part of adapting to a network traffic parameter contract. In addition, different application types, e.g., terminal emulation, file transfers, and RPC's, could take advantage of this setting.
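No such per-connection MSS interface exists in the stack measured here, so the following helper is purely hypothetical; it only illustrates the arithmetic of choosing an MSS small enough that a fixed window holds three full segments.

/* Hypothetical helper (no such per-connection option exists in the
 * measured stack): pick an MSS so that the fixed window holds at least
 * three full segments, without exceeding the MTU-derived maximum.      */
static long choose_mss(long window, long mtu_mss)
{
    long cap = window / 3;              /* ensure window >= 3 * MSS */
    return mtu_mss < cap ? mtu_mss : cap;
}

With a 16 kbyte effective window, for example, choose_mss(16384, 9148) returns 5461, and three such segments fit inside the window.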


APPENDIX
THROUGHPUT WITH A 16 k SEND AND A 32 k RECEIVE SOCKET BUFFER

In the following we will describe in detail how a throughput deadlock arises with a socket send buffer of 16 kbytes and a socket receive buffer of 32 kbytes. The diagram in Table V is used to illustrate TCP internal actions and state and the packet flow on the network.

TABLE V
ATM AND 16 k SEND AND 32 k RECEIVE SOCKET BUFFER, USER DATA SIZE OF 8 kbytes
[A numbered packet-and-state trace, referenced by line number in the text below. For the transmitting side it records the write(8k) calls (8192 bytes of new data per write), the socket state before each TCP call (4096 byte copies, sleep), the wakeups from TCP, the bytes in the socket send buffer, and the number of unacknowledged bytes; in the middle, the DATA and ACK segments on the wire, with TO marking the 200 ms timer generated acknowledgments; for the receiving side, the bytes in the socket receive buffer, the wakeups from TCP to the process, and the read() calls (at most 8192 bytes per system call).]

The sender is a loop of write(8k) calls, while the receiver is a loop of read(8k) calls. The data segments are represented by DATA_X, where X is the number of user bytes in the packet, and acknowledgments by ACK_Y, where Y is the number of bytes the packet acknowledges.
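The traffic pattern behind Table V can be pictured with the following sketch. It is not the original benchmark code: error handling is omitted, the names are invented, and the setsockopt() calls assume the buffers are set before the transfer starts.

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of the traffic pattern behind Table V (not the original
 * benchmark): the sender loops over 8 kbyte write()s on a socket with
 * a 16 kbyte send buffer; the receiver loops over 8 kbyte read()s on a
 * socket with a 32 kbyte receive buffer.                               */
#define CHUNK 8192

static void sender_loop(int sock)
{
    char buf[CHUNK];
    int sndbuf = 16 * 1024;

    memset(buf, 'x', sizeof buf);
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);
    for (;;)
        if (write(sock, buf, sizeof buf) < 0)
            break;
}

static void receiver_loop(int sock)
{
    char buf[CHUNK];
    int rcvbuf = 32 * 1024;

    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf);
    for (;;)
        if (read(sock, buf, sizeof buf) <= 0)
            break;
}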

The data transfer phase starts with a write(8k). The socket layer copies 4 kbytes into the socket buffer before it calls the TCP protocol through tcp_output(). A packet with 4 kbytes of user data is transmitted. After the return from the tcp_output() routine, the socket layer continues by copying more bytes from the current and next write(8k) call until the socket buffer is full. At this stage, the call produces a segment of length 9148 (MSS) bytes. 13 244 bytes have now been transmitted, but are not acknowledged. The last 3140 bytes in the send socket buffer are not transmitted since they are less than MSS.

After the segments have arrived at the receiver, the application reads them in two chunks. After the second read(), the window can slide 13 244/32 768 > 35%, and a window update with an acknowledgment is returned. The acknowledgment releases 13 244 bytes in the send socket buffer and acknowledges all outstanding bytes. When the window update is received, TCP first sends the remaining 3140 data bytes in the send socket buffer before the application is scheduled to run (line 7). Thereafter, another segment of MSS bytes will be transmitted. Due to the generated segment size pattern on the sender side, an acknowledgment will be returned immediately after having received a 9148 byte segment, since the window will slide more than 35%.
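The 35% test can be checked directly against the numbers in the trace (a small sanity check, not part of the paper's tooling): the 13 244 bytes freed by the two initial reads and the 3140 + 9148 = 12 288 bytes of a regular cycle both pass it, while the 8192 bytes of the deadlocked cycles later in the trace do not.

#include <stdio.h>

/* Sanity check of the window-update threshold quoted above: a window
 * update is returned immediately only when the window can slide by
 * more than 35% of the 32 kbyte receive buffer.                       */
int main(void)
{
    const long rcvbuf = 32768;
    long freed[] = { 13244, 12288, 8192 };

    for (size_t i = 0; i < sizeof freed / sizeof freed[0]; i++) {
        double share = 100.0 * freed[i] / rcvbuf;
        printf("%5ld bytes freed = %4.1f%% of R -> %s\n", freed[i], share,
               share > 35.0 ? "window update now" : "wait for 200 ms timer");
    }
    return 0;
}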

The sequence of sending a 3140 byte and a 9148 byte segment followed by an acknowledgment is repeated until the sender receives a timer generated acknowledgment, TO. Such an acknowledgment is generated on line 21, and it arrives at the sender on line 25 and acknowledges less than MSS bytes. The timer generated acknowledgment on line 21 releases only 3140 bytes in the socket send buffer when it is received on line 25. Thus, in the socket send buffer, 9148 bytes still remain unacknowledged. The 3140 bytes ready to be transmitted on line 25 are not sent due to Nagle's algorithm. The reception of the acknowledgment wakes up the application. Another 4 kbytes can be copied into the send buffer, which now contains 16 kbytes. There are (16 384 - 9148) = 7236 bytes left in the socket send buffer ready to be transmitted. Due to Nagle's algorithm, no segment is transmitted because there are outstanding unacknowledged bytes.

On line 28 a new 200 ms spaced timer generated acknowledgment is returned. It acknowledges 9148 bytes, and at this time the sender does not have any outstanding unacknowledged bytes. Therefore, the remaining 7236 bytes in the send socket buffer are transmitted in one segment. At the receiver, this segment (line 29) does not trigger a window update. At the sender side there are now only 7236 bytes in the send socket buffer. Hence, there is space for another (16 384 - 7236) = 9148 bytes, that is, exactly MSS bytes. The acknowledgment on line 28 wakes up the application and 4 kbytes are copied twice into the send socket buffer. A new segment is not transmitted, since its length is less than MSS. The write call on line 32 immediately does a sleep because there is not enough free space in the socket buffer to copy an additional 1024 bytes. This is the case even though there is space for an additional 956 bytes, which would create a full MSS; see Fig. 1. At this point in time, there are 15 428 bytes in the socket buffer, of which 7236 have been transmitted.

The next action is the timer generated acknowledgment on line 33. It acknowledges 7236 bytes. On reception of this acknowledgment at the sender, 8192 bytes are transmitted. Again, no progress is made, since the sender cannot send an MSS segment. The receiver does not return an acknowledgment, as the window can slide only 25%. We have reached the point at which 8192 bytes will be sent every 200 ms. This behavior is repeated (lines 38 through 52) and gives a throughput of about 40 kbyte/s, or 0.32 Mb/s.

Fig. 11 presents traces of the data segment transmissions and acknowledgment receptions on an S = 16 k, R = 32 k connection: (a) presents the segment transfer until the connection deadlocks, (b) is a magnified vertical slice of the initial segment flow of (a), and (c) presents the timer generated acknowledgment. Taking a closer look at Fig. 11(b), it is evident that the connection will always deadlock within 600 ms. The sender transmits the 3140 byte segments approximately 6 ms apart. About 1.7 ms after the transmission of the 3140 byte segment, the 9148 byte segment is transmitted. Within each 200 ms cycle, the time interval in which the receiver has only 3140 unacknowledged bytes is shifted by approximately 2 ms relative to the time of the timer generated acknowledgment. That is, within the first 3-5 timer generated acknowledgments the connection is deadlocked.

Fig. 11. Trace of data segments and acknowledgments, S = 16 k, R = 32 k. [Panels (a)-(c) plot sequence bytes (up to 32 768, with the 0.35R level marked) against time in ms.]

REFERENCES

[1] J. Postel, "Transmission control protocol, protocol specification," RFC 793, Sept. 1981.
[2] W. R. Stevens, TCP/IP Illustrated, Volume 1: The Protocols. Reading, MA: Addison-Wesley, 1994.
[3] D. Comer, Internetworking with TCP/IP, Volume I: Principles, Protocols, and Architecture, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1991.
[4] V. Jacobson, R. Braden, and D. Borman, "TCP extensions for high performance," RFC 1323, May 1992.
[5] C. Partridge, Gigabit Networking. Reading, MA: Addison-Wesley, 1993.
[6] J. Postel, "The TCP maximum segment size and related topics," RFC 879, Nov. 1983.
[7] R. Braden, Ed., "Requirements for Internet hosts -- communication layers," RFC 1122, Oct. 1989.
[8] J. Crowcroft et al., "Is layering harmful?" IEEE Network, vol. 6, no. 1, pp. 20-24, Jan. 1992.
[9] D. D. Clark, "Modularity and efficiency in protocol implementation," RFC 817, July 1982.
[10] J. C. Mogul, R. F. Rashid, and M. J. Accetta, "The packet filter: An efficient mechanism for user-level network code," in Proc. ACM SOSP, 1987, pp. 39-51.
[11] J. Postel, "Internet protocol, protocol specification," RFC 791, Sept. 1981.
[12] S. J. Leffler et al., The Design and Implementation of the 4.3BSD Unix Operating System. Reading, MA: Addison-Wesley, 1989.
[13] D. D. Clark, "Window and acknowledgment strategy in TCP," RFC 813, July 1982.
[14] J. Nagle, "Congestion control in IP/TCP internetworks," RFC 896, Jan. 1984.
[15] V. Jacobson, "Congestion avoidance and control," in Proc. ACM SIGCOMM '88, 1988, pp. 314-329.
[16] K. Moldeklev and P. Gunningberg, "Deadlock situations in TCP over ATM," in Proc. 4th IFIP Workshop on Protocols for High-Speed Networks, Vancouver, BC, Canada, Aug. 10-12, 1994.

Kjersti Moldeklev received the B.Sc. and M.Sc. degrees in computer engineering from the Norwegian Institute of Technology, Trondheim, Norway, in 1988. Her Master's thesis was written at the University of Karlsruhe, Germany, in 1987-1988. In 1989 she received an M.Sc. in computer science from Stanford University.
Since 1989 she has been working for Norwegian Telecom Research (NTR). In 1992 (financed by NTR) she entered the Ph.D. program at the Norwegian Institute of Technology. From 1993 to 1994 she spent a total of four months as a Visiting Researcher at the Swedish Institute of Computer Science, Sweden. Her research interests focus on high-performance communication of end systems.

Per Gunningberg (M'83) received the M.Sc. degree from the University of California, Los Angeles, in 1981, and the Ph.D. degree from Uppsala University, Sweden, in 1983, both in computer science.
Since 1995 he has been an Associate Professor at Uppsala University and a part-time Researcher at the Swedish Institute of Computer Science (SICS). He joined the SICS research staff in 1985, and prior to SICS he spent a year and a half as a Visiting Assistant Professor at the University of California, Los Angeles. His interests include protocol implementations, real-time systems, distributed operating systems, and dependable computing.
Dr. Gunningberg was co-chairman of the 3rd IFIP Workshop on Protocols for High-Speed Networks in 1992.