

End-to-End Internet Packet Dynamics

Vern Paxson

Network Research Group
Lawrence Berkeley National Laboratory
University of California, Berkeley

[email protected]

LBNL-40488

June 23, 1997

Abstract

We discuss findings from a large-scale study of Internet packet dynamics conducted by tracing 20,000 TCP bulk transfers between 35 Internet sites. Because we traced each 100 Kbyte transfer at both the sender and the receiver, the measurements allow us to distinguish between the end-to-end behaviors due to the different directions of the Internet paths, which often exhibit asymmetries. We characterize the prevalence of unusual network events such as out-of-order delivery and packet corruption; discuss a robust receiver-based algorithm for estimating “bottleneck bandwidth” that addresses deficiencies discovered in techniques based on “packet pair”; investigate patterns of packet loss, finding that loss events are not well-modeled as independent and, furthermore, that the distribution of the duration of loss events exhibits infinite variance; and analyze variations in packet transit delays as indicators of congestion periods, finding that congestion periods also span a wide range of time scales.

1 Introduction

As the Internet grows larger, measuring and characterizing its dynamics grows harder. Part of the problem is how quickly the network changes. Another part, though, is its increasing heterogeneity. It is more and more difficult to measure a plausibly representative cross-section of its behavior. The few studies to date of end-to-end packet dynamics have all been confined to measuring a handful of Internet paths, because of the great logistical difficulties presented by larger-scale measurement [Mo92, Bo93, CPB93, Mu94]. Consequently, it is hard to gauge how representative their findings are for today's Internet. Recently, we devised a measurement framework in which a number of sites run special measurement daemons (“NPDs”) to facilitate measurement. With this framework, the number of Internet paths available for measurement grows as N^2 for N participating sites, yielding an attractive scaling. We previously used the framework with 37 sites to study end-to-end routing dynamics of about 1,000 Internet paths [Pa96].

In this study we report on a large-scale experiment to study end-to-end Internet packet dynamics. Our analysis is based on measurements of TCP bulk transfers conducted between 35 NPD sites (§ 2). Using TCP—rather than fixed-rate UDP or ICMP “echo” packets as done in [Bo93, CPB93, Mu94]—reaps significant benefits. First, TCP traffic is “real world,” since TCP is widely used in today's Internet. Consequently, any network path properties we can derive from measurements of a TCP transfer can potentially be directly applied to tuning TCP performance. Second, TCP packet streams allow fine-scale probing without unduly loading the network, since TCP adapts its transmission rate to current congestion levels.

Using TCP, however, also incurs two serious analysis headaches. First, we need to distinguish between the apparently intertwined effects of the transport protocol and the network. To do so, we developed tcpanaly, a program that understands the specifics of the different TCP implementations in our study and thus can separate TCP behavior from network behavior [Pa97a]. tcpanaly also forms the basis for the analysis in this paper: after removing TCP effects, it then computes a wide range of statistics concerning network dynamics. Second, TCP packets are sent over a wide range of time scales, from milliseconds to many seconds between consecutive packets.

[Footnote: This paper appears in the Proceedings of SIGCOMM '97. The work was supported by the Director, Office of Energy Research, Scientific Computing Staff, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. © 1997 Association for Computing Machinery. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.]

[Footnote: This paper is necessarily terse due to space limitations. A longer version is available [Pa97b].]


Such irregular spacing greatly complicates correlational and frequency-domain analysis, because a stream of TCP packets does not give us a traditional time series of constant-rate observations to work with. Consequently, in this paper we do not attempt these sorts of analyses, though we hope to pursue them in future work. See also [Mu94] for previous work in applying frequency-domain analysis to Internet paths.

In § 3 we characterize unusual network behavior: out-of-order delivery, replication, and packet corruption. Then in § 4 we discuss a robust algorithm for estimating the “bottleneck” bandwidth that limits a connection's maximum rate. This estimation is crucial for subsequent analysis because knowing the bottleneck rate lets us determine when the closely-spaced TCP data packets used for our network probes are correlated with each other. (We note that the stream of ack packets returned by the TCP data receiver in general is not correlated, due to the small size and larger spacing of the acks.) Once we can determine which probes were correlated and which not, we then can turn to analysis of end-to-end Internet packet loss (§ 5) and delay (§ 6). In § 7 we briefly summarize our findings, a number of which challenge commonly-held assumptions about network behavior.

2 The Measurements

We gathered our measurements using the “NPD” measurement framework we developed and discussed in [Pa96]. 35 sites participated in two experimental runs. The sites include educational institutes, research labs, network service providers, and commercial companies, in 9 countries. We conducted the first run, N1, during Dec. 1994, and the second, N2, during Nov–Dec. 1995. Thus, differences between N1 and N2 give an indication how Internet packet dynamics changed during the course of 1995. Throughout this paper, when discussing such differences, we always limit discussion to the 21 sites that participated in both N1 and N2.

Each measurement was made by instructing daemons running at two of the sites to send or receive a 100 Kbyte TCP bulk transfer, and to trace the results using tcpdump [JLM89]. Measurements occurred at Poisson intervals, which, in principle, results in unbiased measurement, even if the sampling rate varies [Pa96]. In N1, the mean per-site sampling interval was about 2 hours, with each site randomly paired with another. Sites typically participated in about 200 measurements, and we gathered a total of 2,800 pairs of traces. In N2, we sampled pairs of sites in a series of grouped measurements, varying the sampling rate from minutes to days, with most rates on the order of 4–30 minutes. These groups then give us observations of the path between the site pair over a wide range of time scales. Sites typically participated in about 1,200 measurements, for a total of 18,000 trace pairs. In addition to the different sampling rates, the other difference between N1 and N2 is that in N2 we used Unix socket options to assure that the sending and receiving TCPs had big “windows,” to prevent window limitations from throttling the transfer's throughput.

We limited measurements to a total of 10 minutes. This limit leads to under-representation of those times during which network conditions were poor enough to make it difficult to complete a 100 Kbyte transfer in that much time. Thus, our measurements are biased towards more favorable network conditions. In [Pa97b] we show that the bias is negligible for North American sites, but noticeable for European sites.
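To make the sampling concrete, the following Python sketch (not part of the actual NPD measurement framework; the function and parameter names are illustrative) generates measurement start times with exponentially distributed gaps, i.e., a Poisson sampling process of the kind described above:

    import random

    def poisson_measurement_times(mean_interval_sec, duration_sec, start=0.0):
        """Generate measurement start times whose spacings are exponentially
        distributed, so the measurements form a Poisson sampling process."""
        t = start
        times = []
        while t < start + duration_sec:
            t += random.expovariate(1.0 / mean_interval_sec)  # exponential gap
            times.append(t)
        return times

    # Example: roughly one measurement every 2 hours over one week.
    schedule = poisson_measurement_times(mean_interval_sec=2 * 3600,
                                         duration_sec=7 * 24 * 3600)

Because Poisson arrivals sample the network's state without bias toward any particular phase of its behavior, such a schedule in principle yields representative observations even if the mean sampling rate is later changed.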

3 Network Pathologies

We begin with an analysis of network behavior we might consider “pathological,” meaning unusual or unexpected: out-of-order delivery, packet replication, and packet corruption. It is important to recognize pathological behaviors so subsequent analysis of packet loss and delay is not skewed by their presence. For example, it is very difficult to perform any sort of sound queueing delay analysis in the presence of out-of-order delivery, since the latter indicates that a first-in-first-out (FIFO) queueing model of the network does not apply.

3.1 Out-of-order delivery

Even though Internet routers employ FIFO queueing, any time a route changes, if the new route offers a lower delay than the old one, then reordering can occur [Mo92]. Since we recorded packets at both ends of each TCP connection, we can detect network reordering, as follows. First, we remove from our analysis any trace pairs suffering packet filter errors [Pa97a]. Then, for each arriving packet, we check whether it was sent after the last non-reordered packet. If so, then it becomes the new such packet. Otherwise, we count its arrival as an instance of a network reordering. So, for example, if a flight of ten packets all arrive in the order sent except the last one arrives before all of the others, we consider this to reflect 9 reordered packets rather than 1. Using this definition emphasizes “late” arrivals rather than “premature” arrivals. It turns out that counting late arrivals gives somewhat higher numbers than counting premature arrivals—not a big difference, though.
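The “late arrival” counting rule just described can be made concrete with a short sketch (a hypothetical helper, not tcpanaly's actual code); each arriving packet is represented simply by its position in the sending order:

    def count_reordered(arrivals):
        """Count out-of-order deliveries using the 'late arrival' definition:
        each arriving packet is compared against the most recently arrived
        packet that was not itself reordered.  `arrivals` lists packets in
        arrival order; each entry is the packet's send index."""
        reordered = 0
        last_in_order = None
        for sent_index in arrivals:
            if last_in_order is None or sent_index > last_in_order:
                last_in_order = sent_index      # new non-reordered packet
            else:
                reordered += 1                  # arrived after a later-sent packet
        return reordered

    # A flight of ten packets in which the last-sent packet arrives first
    # counts as nine reordered ("late") packets, matching the text above:
    assert count_reordered([9, 0, 1, 2, 3, 4, 5, 6, 7, 8]) == 9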

Observations of reordering. Out-of-order delivery is fairly prevalent in the Internet. In N1, 36% of the traces included at least one packet (data or ack) delivered out of order, while in N2, 12% did. Overall, 2.0% of all of the N1 data packets and 0.6% of the N1 acks arrived out of order (0.3% and 0.1% in N2). Data packets are no doubt more often reordered than acks because they are frequently sent closer together (due to ack-every-other policies), so their reordering requires less of a difference in transit times.

We should not infer from the differences between reordering in N1 and N2 that reordering became less likely over the course of 1995, because out-of-order delivery varies greatly from site to site.



Figure 1: Out-of-order delivery with two distinct slopes

For example, fully 15% of the data packets sent by the “ucol” site during N1 arrived out of order, much higher than the 2.0% overall average. As discussed in [Pa96], we do not claim that the individual sites participating in the measurement framework are plausibly representative of Internet sites in general, so site-specific behavior cannot be argued to reflect general Internet behavior.

Reordering is also highly asymmetric. For example, only 1.5% of the data packets sent to ucol during N1 arrived out of order. This means a sender cannot soundly infer whether the packets it sends are likely to be reordered based on observations of the acks it receives, which is too bad, as otherwise the reordering information would aid in determining the optimal duplicate ack threshold to use for fast retransmission (see below).

The site-to-site variation in reordering coincides with our earlier findings concerning route flutter among the same sites [Pa96]. We identified two sites as particularly exhibiting flutter, ucol and the “wustl” site. For the part of N1 during which wustl exhibited route flutter, 24% of all of the data packets it sent arrived out of order, a rather stunning degree of reordering. If we eliminate ucol and wustl from the analysis, then the proportion of all of the data packets delivered out of order falls by a factor of two. We also note that in N2, packets sent by ucol were reordered only 25 times out of nearly 100,000 sent, though 3.3% of the data packets sent to ucol arrived out of order, dramatizing how over long time scales, site-specific effects can completely change.

Thus, we should not interpret the prevalence of out-of-order delivery summarized above as giving representative numbers for the Internet, but instead form the rule of thumb: Internet paths are sometimes subject to a high incidence of reordering, but the effect is strongly site-dependent, and apparently correlated with route fluttering, which makes sense since route fluttering provides a mechanism for frequently reordering packets.

We observed reordering rates as high as 36% of all packets arriving in a single connection. Interestingly, some of the most highly reordered connections did not suffer any packet loss, and no needless retransmissions due to false signals from duplicate acks. We also occasionally observed humongous reordering “gaps.” However, the evidence suggests that these gaps are not due to route changes, but a different effect.

[Footnote: See [Pa96] for specifics concerning the sites mentioned in this paper.]

Figure 1 shows a sequence plot exhibiting a massive reordering event. This plot reflects packet arrivals at the TCP receiver, where each square marks the upper sequence number of an arriving data packet. All packets were sent in increasing sequence order.

Fitting a line to the upper points yields a data rate of a little over 170 Kbyte/sec, which was indeed the true (T1) bottleneck rate (§ 4). The slope of the packets delivered late, though, is just under 1 Mbyte/sec, consistent with an Ethernet bottleneck. What has apparently happened is that a router with Ethernet-limited connectivity to the receiver stopped forwarding packets for 110 msec just as sequence 72,705 arrived, most likely because at that point it processed a routing update [FJ94]. It finished between the arrival of 91,137 and 91,649, and began forwarding packets normally again at their arrival rate, namely T1 speed. Meanwhile, it had queued 35 packets while processing the update, and these it now finally forwarded whenever it had a chance, so they went out as quickly as possible, namely at Ethernet speed, but interspersed with new arrivals.

We observed this pattern a number of times in our data—not frequent enough to conclude that it is anything but a pathology, but often enough to suggest that significant momentary increases in networking delay can be due to effects different from both route changes and queueing; most likely they are due to router forwarding lulls.

Impact of reordering. While out-of-order delivery can violate one's assumptions about the network—in particular, the abstraction that it is well-modeled as a series of FIFO queueing servers—we find it often has little impact on TCP performance. One way it can make a difference, however, is in determining the TCP “duplicate ack” threshold a sender uses to infer that a packet requires retransmission. If the network never exhibited reordering, then as soon as the receiver observed a packet arriving that created a sequence “hole,” it would know that the expected in-sequence packet was dropped, and could signal to the sender calling for prompt retransmission. Because of reordering, however, the receiver does not know whether the packet in fact was dropped; it may instead just be late. Presently, TCP senders retransmit if three “dups” arrive, a value chosen so that “false” dups caused by out-of-order delivery are unlikely to lead to spurious retransmissions.

The threshold of three was chosen primarily to assure that it was conservative. Large-scale measurement studies were not available to further guide the selection of the threshold. We now examine two possible ways to improve the fast retransmit mechanism: by delaying the generation of dups to better disambiguate packet loss from reordering, and by altering the threshold to improve the balance between seizing retransmission opportunities, versus avoiding unneeded retransmissions.

We first look at packet reordering time scales to determine how long a receiver needs to wait to disambiguate reordering from loss. We only look at the time scales of data packet reorderings, since ack reorderings do not affect the fast retransmission process.



We find a wide range of times between an out-of-order arrival and the later arrival of the last packet sent before it. One noteworthy artifact in the distribution is the presence of “spikes” at particular values, the strongest at 81 msec. This turns out to be due to a 56 Kbit/sec link, which has a bottleneck bandwidth of about 6,320 user data bytes/sec. Consequently, transmitting a 512 byte packet across the link requires 81.0 msec, so data packets of this size can arrive no closer, even if reordered. Thus we see that reordering can have associated with it a minimum time, which can be quite large.

Inspecting the distributions further, we find that a strategy of waiting 20 msec would identify 70% of the out-of-order deliveries in N1. For N2, the same proportion can be achieved waiting 8 msec, due to its overall shorter reordering times (presumably due to overall higher bandwidths). Thus, even though the upper end of the distribution is very large (12 seconds!), a generally modest wait serves to disambiguate most sequence holes.

We now look at the degree to which false fast retransmit signals due to reordering are actually a problem. We classify each sequence of dups as either good or bad, depending on whether a retransmission in response to it was necessary or unnecessary. When considering a refinement to the fast retransmission mechanism, our interest lies in the resulting ratio of good to bad dup series, which is controlled by both the dup ack threshold we consider and the waiting time observed by the receiver before generating a dup upon the advent of a sequence hole.

For current TCP, the threshold is three dups and there is no waiting time. For these values, we find that good dup series far outnumber bad ones in N1, and do so by a further order of magnitude in N2. The order of magnitude improvement between N1 and N2 is due to the use in N2 of bigger windows (§ 2), and hence more opportunity for generating good dups. Clearly, the current scheme works well. While raising the threshold to four dups improves the good-to-bad ratio by about a factor of 2.5, it also diminishes fast retransmit opportunities by about 30%, a significant loss.

For a threshold of two dups, we gain about 65–70% more fast retransmit opportunities, a hefty improvement, each generally saving a connection from an expensive timeout retransmission. The cost, however, is that the good-to-bad ratio falls by about a factor of three. If the receiving TCP instead waited 20 msec before generating a second dup, then the ratio falls only slightly (30% for N1, not at all for N2). Unfortunately, lowering the threshold to two coupled with the 20 msec receiver delay requires both sender and receiver modifications, greatly increasing the problem of deploying the change. Since partial deployment of only the sender change (a threshold of two) significantly increases spurious retransmissions, we conclude that, due to the size of the Internet's installed base, safely lowering the dup ack threshold is impractical.
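A minimal sketch of the receiver-side refinement considered above—waiting a short interval before emitting a dup, so that a late (merely reordered) packet can cancel it—is shown below. The class and callback names are hypothetical, and real TCP state handling is omitted:

    import threading

    class DupAckDelayer:
        """Sketch of a receiver that delays duplicate acks by a small wait
        (e.g., 20 msec) to disambiguate reordering from loss.
        `send_dup_ack` is a hypothetical callback into the receiving TCP."""

        def __init__(self, send_dup_ack, wait_sec=0.020):
            self.send_dup_ack = send_dup_ack
            self.wait_sec = wait_sec
            self.pending = None  # timer for an unanswered sequence hole

        def on_hole_created(self, expected_seq):
            # Delay the dup ack; if the hole is still open when the timer
            # fires, the dup goes out and fast retransmit proceeds as usual.
            self.pending = threading.Timer(self.wait_sec, self.send_dup_ack,
                                           args=(expected_seq,))
            self.pending.start()

        def on_hole_filled(self):
            # The "missing" packet was only late: suppress the dup ack.
            if self.pending is not None:
                self.pending.cancel()
                self.pending = None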

We note that the TCP selective acknowledgement (“SACK”) option, now pending standardization, also holds promise for honing TCP retransmission [MMSR96]. SACK provides sufficiently fine-grained acknowledgement information that the sending TCP can generally tell which packets require retransmission and which have safely arrived (§ 5.4). To gain any benefits from SACK, however, requires that both the sender and the receiver support the option, so the deployment problems are similar to those discussed above. Furthermore, use of SACK aids a TCP in determining what to retransmit, but not when to retransmit. Because these considerations are orthogonal, the effects of lowering the dup ack threshold to two merit investigation, even in the face of impending deployment of SACK.

We observed one other form of dup ack series potentially leading to unnecessary retransmission. Sometimes a series occurs for which the original ack (of which the others are dups) had acknowledged all of the outstanding data. When this occurs, the subsequent dups are always due to an unnecessary retransmission arriving at the receiving TCP, until at least a round-trip time (RTT) after the sending TCP sends new data. These sorts of series are 2–15 times more frequent than bad series, which is why they merit discussion. They are about 10 times rarer than good series. They occur during retransmission periods when the sender has already filled all of the sequence holes and is now retransmitting unnecessarily. Use of SACK eliminates these series. So would the following heuristic: whenever a TCP receives an ack, it notes whether the ack covers all of the data sent so far. If so, it then ignores any duplicates it receives for the ack; otherwise it acts on them in accordance with the usual fast retransmission mechanism.
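The heuristic can be sketched as a small sender-side predicate (the names are illustrative, not taken from any actual TCP implementation):

    def should_fast_retransmit(dup_count, acked_seq, highest_seq_sent, threshold=3):
        """Sketch of the heuristic described above: ignore duplicate acks whose
        original ack already covered all data sent so far, since such dups can
        only stem from unnecessary retransmissions arriving at the receiver."""
        if acked_seq >= highest_seq_sent:
            return False                 # ack covered everything: ignore its dups
        return dup_count >= threshold    # otherwise, usual fast retransmit rule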

3.2 Packet replication

In this section we look at packet replication: the network delivering multiple copies of the same packet. Unlike reordering, it is difficult to see how replication can occur. Our imaginations notwithstanding, it does happen, albeit rarely. We suspect one mechanism may involve links whose link-level technology includes a notion of retransmission, and for which the sender of a packet on the link incorrectly believes the packet was not successfully received, so it sends the packet again.

In N1, we observed only one instance of packet replication, in which a pair of acks, sent once, arrived 9 times, each copy coming 32 msec after the last. The fact that two packets were together replicated does not fit with the explanation offered above for how a single packet could be replicated, since link-layer effects should only replicate one packet at a time. In N2, we observed 65 instances of the network infrastructure replicating a packet, all of a single packet, the most striking being 23 copies of a data packet arriving in a short blur at the receiver. Several sites dominated the replication events: in particular, the two Trondheim sites, “sintef1” and “sintef2”, accounted for half of the events (almost all of these involving sintef1), and the two British sites, “ucl” and “ukc”, for half the remainder.

[Footnote: We have observed traces (not part of this study) in which more than 10% of the packets were replicated. The problem was traced to an improperly configured bridging device.]



After eliminating these sites, we still observed replication events among connections between 7 different sites, so the effect is somewhat widespread.

Surprisingly, packets can also be replicated at the sender, before the network has had much of a chance to perturb them. We know these are true replications and not packet filter duplications, as discussed in [Pa97a], because the copies have had their TTL fields decremented. There were no sender-replicated packets in N1, but 17 instances in N2, involving two sites (so the phenomenon is clearly site-specific).

3.3 Packet corruption

The final pathology we look at is packet corruption, in which the network delivers to the receiver an imperfect copy of the original packet. For data packets, tcpanaly cannot directly verify the checksum because the packet filter used in our study only recorded the packet headers, and not the payload. (For “pure acks,” i.e., acknowledgement packets with no data payload, it directly verifies the checksum.) Consequently, tcpanaly includes algorithms to infer whether data packets arrive with invalid checksums, discussed in [Pa97a]. Using that analysis, we first found that one site, “lbli,” was much more prone to checksum errors than any other. Since lbli's Internet connectivity is via an ISDN link, it appears quite likely that these errors are due to noise on the ISDN channels.

After eliminating lbli, the proportion of corrupted packets is about 0.02% in both datasets. No other single site strongly dominated in suffering from corrupted packets, and most of the sites receiving corrupted packets had fast (T1 or greater) Internet connectivity, so the corruptions are not primarily due to noisy, slow links. Thus, this evidence suggests that, as a rule of thumb, the proportion of Internet data packets corrupted in transit is around 1 in 5,000 (but see below).

A corruption rate of 1 packet in 5,000 is certainly not negligible, because TCP protects its data with a 16-bit checksum. Consequently, on average one bad packet out of 65,536 will be erroneously accepted by the receiving TCP, resulting in undetected data corruption. If the 1 in 5,000 rate is indeed correct, then about one in every 300 million Internet packets is accepted with corruption—certainly, many each day. In this case, we argue that TCP's 16-bit checksum is no longer adequate, if the goal is that globally in the Internet there are very few corrupted packets accepted by TCP implementations. If the checksum were instead 32 bits, then only about one in 2·10^13 packets would be accepted with corruption.
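The arithmetic behind these figures is simple; the sketch below assumes a corrupted packet is equally likely to map to any checksum value:

    # Back-of-the-envelope numbers from the argument above (assumed rates):
    corruption_rate = 1 / 5000                 # inferred fraction of corrupted packets
    undetected_16 = corruption_rate / 2**16    # slips past a 16-bit checksum
    undetected_32 = corruption_rate / 2**32    # slips past a hypothetical 32-bit checksum

    print(f"16-bit: about 1 in {1 / undetected_16:,.0f} packets accepted corrupted")
    print(f"32-bit: about 1 in {1 / undetected_32:.1e} packets accepted corrupted")
    # => roughly 1 in 330 million vs. roughly 1 in 2e13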

Finally, we note that the data checksum error rate of 0.02% of the packets is much higher than that measured directly (by verifying the checksum) for pure acks. For pure acks, we found only 1 corruption out of 300,000 acks in N1 and, after eliminating lbli, 1 out of 1.6 million acks in N2. This discrepancy can be partially addressed by accounting for the different lengths of data packets versus pure acks. It can be further reconciled if “header compression” such as CSLIP is used along the Internet paths in our study [Ja90], as that would greatly increase the relative size of data packets to that of pure acks. But it seems unlikely that header compression is widely used for high-speed links, and most of the inferred data packet corruptions occurred on T1 and faster network paths.

One possibility is that the packets tcpanaly infers as arriving corrupted—because the receiving TCP did not respond to them in any fashion—actually were never received by the TCP for a different reason, such as inadequate buffer space. We partially tested for this possibility by computing corruption rates for only those traces monitored by a packet filter running on a machine separate from the receiver (but on the same local network), versus those running on the receiver's machine. The former resulted in slightly higher inferred corruption rates, but not significantly so; hence if the TCP is failing to receive the packets in question, it must be due to a mechanism that still enables the packet filter on the receiving machine to receive a copy. One can imagine such mechanisms, but it seems unlikely they would lead to drop rates of 1 in 5,000.

Another possibility is that data packets are indeed much more likely to be corrupted than the small pure ack packets, because of some artifact in how the corruption occurs. For example, it may be that corruption primarily occurs inside routers, where it goes undetected by any link-layer checksum, and that the mechanism (e.g., botched DMA, cache inconsistencies) only manifests itself for packets larger than a particular size.

Finally, we note that bit errors in packets transmitted using CSLIP can result in surprising artifacts when the CSLIP receiver reconstructs the packet header—such as introducing the appearance of in-sequence data, when none was actually sent!

In summary, we cannot offer a definitive answer as to overall Internet packet corruption rates, but the conflicting evidence that corruption may occur fairly frequently argues for further study in order to resolve the question.

4 Bottleneck Bandwidth

In this section we discuss how to estimate a fundamental property of a network connection, the bottleneck bandwidth that sets the upper limit on how quickly the network can deliver the sender's data to the receiver. The bottleneck comes from the slowest forwarding element in the end-to-end chain that comprises the network path. We make a crucial distinction between bottleneck bandwidth and available bandwidth. The former gives an upper bound on how fast a connection can possibly transmit data, while the less-well-defined latter term denotes how fast the connection should transmit to preserve network stability. Thus, available bandwidth never exceeds bottleneck bandwidth, and can in fact be much smaller (§ 6.3).

We will denote a path's bottleneck bandwidth as rho_B.



For measurement analysis, rho_B is a fundamental quantity because it determines what we term the self-interference time constant, Q_b. Q_b measures the amount of time required to forward a given packet through the bottleneck element. If a packet carries a total of b bytes and the bottleneck bandwidth is rho_B byte/sec, then:

    Q_b = b / rho_B    (1)

in units of seconds. From a queueing theory perspective, Q_b is simply the service time of a b-byte packet at the bottleneck link. We use the term “self-interference” because if the sender transmits two b-byte packets with an interval smaller than Q_b between them, then the second one is guaranteed to have to wait behind the first one at the bottleneck element (hence the use of “Q” to denote “queueing”). We will always discuss Q_b in terms of user data bytes, i.e., TCP packet payload, and for ease of discussion will assume b is constant. We will not use the term for acks.

For our measurement analysis, accurate assessment of Q_b is critical. Suppose we observe a sender transmitting two packets an interval Delta T_s apart. Then if Delta T_s < Q_b, the delays experienced by the two packets are perforce correlated, and if Delta T_s >= Q_b, their delays, if correlated, are not due to self-interference but some other source (such as additional traffic from other connections, or processing delays). Thus, we need to know Q_b so we can distinguish those measurements that are necessarily correlated from those that are not. If we do not do so, then we will skew our analysis by mixing together measurements with built-in delays (due to queueing at the bottleneck) with measurements that do not reflect built-in delays.
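As a concrete illustration (with made-up example numbers), the following sketch computes Q_b from Eqn 1 and flags probe pairs whose sending gap makes their delays necessarily correlated:

    def self_interference_const(packet_bytes, bottleneck_bytes_per_sec):
        """Eqn 1: service time of a b-byte packet at the bottleneck."""
        return packet_bytes / bottleneck_bytes_per_sec

    def necessarily_correlated(send_gap_sec, packet_bytes, bottleneck_bytes_per_sec):
        """Two probes are necessarily correlated when the sending gap is
        smaller than Q_b, since the second must queue behind the first."""
        return send_gap_sec < self_interference_const(packet_bytes,
                                                      bottleneck_bytes_per_sec)

    # Example: 512-byte packets through a T1-limited path (~170 Kbyte/sec payload).
    q_b = self_interference_const(512, 170_000)          # about 3 msec
    print(necessarily_correlated(0.001, 512, 170_000))   # True: 1 msec < Q_b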

4.1 Packet pair

The bottleneck estimation technique used in previous work is based on “packet pair” [Ke91, Bo93, CC96]. The fundamental idea is that if two packets are transmitted by the sender with an interval Delta T_s < Q_b between them, then when they arrive at the bottleneck they will be spread out in time by the transmission delay of the first packet across the bottleneck: after completing transmission through the bottleneck, their spacing will be exactly Q_b. Barring subsequent delay variations, they will then arrive at the receiver spaced not Delta T_s apart, but Delta T_r = Q_b apart. We then compute rho_B = b / Delta T_r via Eqn 1.

The principle of the bottleneck spacing effect was noted in Jacobson's classic congestion paper [Ja88], where it in turn leads to the “self-clocking” mechanism. Keshav formally analyzed the behavior of packet pair for a network of routers that all obey the “fair queueing” scheduling discipline (not presently used in the Internet), and developed a provably stable flow control scheme based on packet pair measurements [Ke91]. Both Jacobson and Keshav were interested in estimating available rather than bottleneck bandwidth, and for this the variations from Q_b due to queueing are of primary concern (§ 6.3). But if, as for us, the goal is to estimate rho_B, then these variations instead become noise we must deal with.

Bolot used a stream of packets sent at fixed intervals to probe several Internet paths in order to characterize delay and loss [Bo93]. He measured round-trip delay of UDP echo packets and, among other analysis, applied the packet pair technique to form estimates of bottleneck bandwidths. He found good agreement with known link capacities, though a limitation of his study is that the measurements were confined to a small number of Internet paths.

Recent work by Carter and Crovella also investigates the utility of using packet pair in the Internet for estimating rho_B [CC96]. Their work focusses on bprobe, a tool they devised for estimating rho_B by transmitting 10 consecutive ICMP echo packets and recording the interarrival times of the consecutive replies. Much of the effort in developing bprobe concerns how to filter the resulting raw measurements in order to form a solid estimate. bprobe currently filters by first widening each estimate into an interval by adding an error term, and then finding the point at which the most intervals overlap. The authors also undertook to calibrate bprobe by testing its performance for a number of Internet paths with known bottlenecks. They found in general it works well, though some paths exhibited sufficient noise to sometimes produce erroneous estimates.

One limitation of both studies is that they were based on measurements made only at the data sender. (This is not an intrinsic limitation of the techniques used in either study.) Since in both studies the packets echoed back from the remote end were the same size as those sent to it, neither analysis was able to distinguish whether the bottleneck along the forward and reverse paths was the same. The bottleneck could differ in the two directions due to asymmetric routing, for example [Pa96], or because some media, such as satellite links, can have significant bandwidth asymmetries depending on the direction traversed [DMT96].

For estimating bottleneck bandwidth by measuring TCP traffic, a second problem arises: if the only measurements available are those at the sender, then “ack compression” (§ 6.1) can significantly alter the spacing of the small ack packets as they return through the network, distorting the bandwidth estimate. We investigate the degree of this problem below.

For our analysis, we consider what we term receiver-based packet pair (RBPP), in which we look at the pattern of data packet arrivals at the receiver. We also assume that the receiver has full timing information available to it. In particular, we assume that the receiver knows when the packets sent were not stretched out by the network, and can reject these as candidates for RBPP analysis. RBPP is considerably more accurate than sender-based packet pair (SBPP), since it eliminates the additional noise and possible asymmetry of the return path, as well as noise due to delays in generating the acks themselves. We find in practice this additional noise can be quite large.
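A bare-bones receiver-based packet-pair computation might look like the sketch below; it omits the clustering and filtering that a robust estimator needs (§ 4.3), and the variable names are illustrative:

    def rbpp_estimates(send_times, recv_times, packet_bytes):
        """For each adjacent pair of data packets that arrived in order and
        whose spacing was expanded by the network (i.e., the bottleneck
        stretched them out), estimate the bottleneck as b / Delta_T_r."""
        estimates = []
        for i in range(1, len(recv_times)):
            dt_s = send_times[i] - send_times[i - 1]
            dt_r = recv_times[i] - recv_times[i - 1]
            if dt_r > 0 and dt_r > dt_s:          # in order and expanded
                estimates.append(packet_bytes / dt_r)
        return estimates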



Figure 2: Bottleneck bandwidth change

4.2 Difficulties with packet pair

As shown in [Bo93] and [CC96], packet pair techniques often provide good estimates of bottleneck bandwidth. We find, however, four potential problems (in addition to noise on the return path for SBPP). Three of these problems can often be addressed, but the fourth is more fundamental.

Out-of-order delivery. The first problem stems from the fact that for some Internet paths, out-of-order packet delivery occurs quite frequently (§ 3.1). Clearly, packet pairs delivered out of order completely destroy the packet pair technique, since they result in Delta T_r < 0, which then leads to a negative estimate for rho_B. Out-of-order delivery is symptomatic of a more general problem, namely that the two packets in a pair may not take the same route through the network, which then violates the assumption that the second queues behind the first at the bottleneck.

Limitations due to clock resolution. Another problem relates to the receiver's clock resolution, C_r, meaning the minimum difference in time the clock can report. C_r can introduce large margins of error around estimates of rho_B. For example, if C_r = 10 msec, then for b = 512 bytes, packet pair cannot distinguish between rho_B = 51,200 byte/sec and an arbitrarily faster bottleneck. We had several sites in our study with C_r = 10 msec. A technique for coping with large C_r is to use packet bunch, in which a bunch of k back-to-back packets is used, rather than just two. Thus, the overall arrival interval spanned by the k packets will be about k-1 times larger than that spanned by a single packet pair, diminishing the uncertainty due to C_r.
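The effect of clock resolution can be illustrated with a small sketch (a simplified model, not the algorithm used in our analysis): given an arrival span reported in clock ticks, it bounds the bandwidths consistent with that reading, and shows how larger bunches tighten the bounds:

    def bandwidth_bounds(packet_bytes, measured_ticks, clock_res_sec, bunch_size=2):
        """Bounds on a bottleneck estimate given a clock of resolution C_r.
        A bunch of k packets spans (k-1) bottleneck service times, so larger
        bunches shrink the relative uncertainty introduced by C_r."""
        spacing = measured_ticks * clock_res_sec           # reported arrival span
        payload = packet_bytes * (bunch_size - 1)          # bytes behind the first
        lo = payload / (spacing + clock_res_sec)           # span rounded down
        hi = payload / max(spacing - clock_res_sec, 1e-9)  # span rounded up
        return lo, hi

    # With C_r = 10 msec and 512-byte packets, a pair whose arrivals fall in
    # the same clock tick could reflect anything from ~51,200 byte/sec on up,
    # exactly the ambiguity described above:
    print(bandwidth_bounds(512, measured_ticks=0, clock_res_sec=0.010))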

Changes in bottleneck bandwidth. Another problem that any bottleneck bandwidth estimation must deal with is the possibility that the bottleneck changes over the course of the connection. Figure 2 shows a sequence plot of data packets arriving at the receiver for a trace in which this happened. The eye immediately picks out a transition from one overall slope to another partway through the trace. The first slope corresponds to 6,600 byte/sec, while the second is 13,300 byte/sec, an increase of a factor of two. Here, the change is due to lbli's ISDN link activating a second channel to double the link bandwidth, but in general bottleneck shifts can occur due to other mechanisms, such as routing changes.

Figure 3: Enlargement of part of previous figure's right half

Multi-channel bottleneck links. A more fundamental problem with packet-pair techniques arises from the effects of multi-channel links, for which packet pair can yield incorrect overestimates even in the absence of any delay noise. Figure 3 expands a portion of Figure 2. The slope of the large linear trend in the plot corresponds to 13,300 byte/sec, as earlier noted. However, we see that the line is actually made up of pairs of packets. The slope between the pairs corresponds to a data rate of 160 Kbyte/sec. However, this trace involved lbli, a site with an ISDN link that has a hard limit of 128 Kbit/sec = 16 Kbyte/sec, a factor of ten smaller! Clearly, an estimate of 160 Kbyte/sec must be wrong, yet that is what a packet-pair calculation will yield.

What has happened is that the bottleneck ISDN link uses two channels that operate in parallel. When the link is idle and a packet arrives, it goes out over the first channel, and when another packet arrives shortly after, it goes out over the other channel. They don't queue behind each other! Multi-channel links violate the assumption that there is a single end-to-end forwarding path, with disastrous results for packet pair, since in their presence it can form completely misleading overestimates for rho_B.

We stress that the problem is more general than the circumstances shown in this example. First, while in this example the parallelism leading to the estimation error came from a single link with two separate physical channels, the exact same effect could come from a router that balances its outgoing load across two different links. Second, it may be tempting to dismiss this problem as correctable by using bunches of three packets instead of packet pair. This argument is not compelling without further investigation, however, because packet bunch could be more prone to error for regular bottlenecks; and, more fundamentally, a bunch of three only works if the parallelism comes from two channels. If it came from three channels (or load-balancing links), then bunches of three will still yield misleading estimates.

4.3 Robust bottleneck estimation

Motivated by the shortcomings of packet pair, we developed a significantly more robust procedure, “packet bunch modes” (PBM). The main observation behind PBM is that we can deal with packet pair's shortcomings by forming estimates for a range of packet bunch sizes, and by allowing for multiple bottleneck values or apparent bottleneck values.



By considering different bunch sizes, we can accommodate limited receiver clock resolutions and the possibility of multiple channels or load-balancing across multiple links, while still avoiding the risk of underestimation due to noise diluting larger bunches, since we also consider small bunch sizes. By allowing for finding multiple bottleneck values, we again accommodate multi-channel (and multi-link) effects, and also the possibility of a bottleneck change.

Allowing for multiple bottleneck values rules out use of the most common robust estimator, the median, since it presupposes unimodality. We instead focus on identifying modes, i.e., local maxima in the density function of the distribution of the estimates. We then observe that:

(i) If we find two strong modes, for which one is found only at the beginning of the connection and one at the end, then we have evidence of a bottleneck change.

(ii) If we find two strong modes which span the same portion of the connection, and if one is found only for a packet bunch size of k and the other only for bunch sizes larger than k, then we have evidence for a k-channel bottleneck link.

(iii) We can find both situations together, for a link that exhibits both a bottleneck change and multi-channel behavior, such as shown in Figure 2.

Turning these observations into a working algorithm entails a great degree of niggling detail, as well as the use of a number of heuristics. Due to space limitations, we defer the particulars to [Pa97b]. We note, though, that one salient aspect of PBM is that it forms its final estimates in terms of error bars that nominally encompass ±20% around the bottleneck estimate, but might be narrower if estimates cluster sharply around a particular value, or wider if limited clock resolution prevents finer bounds. PBM always tries bunch sizes ranging from two packets to five packets. If required by limited clock resolution or the failure to find a compelling bandwidth estimate (about one quarter of all of the traces, usually due to limited clock resolution), it tries progressively larger bunch sizes, up to a maximum of 21 packets. We also note that nothing in PBM is specific to analyzing TCP traffic. All it requires is knowing when packets were sent relative to one another, how they arrived relative to one another, and their size.
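As a rough illustration of the mode-finding idea only—PBM's actual clustering, error bars, and heuristics are far more involved and are deferred to [Pa97b]—one could bin the per-bunch bandwidth estimates on a logarithmic grid and keep the local maxima:

    import math
    from collections import Counter

    def bandwidth_modes(estimates, rel_width=0.20):
        """Crude stand-in for PBM's mode finding: bin bandwidth estimates on a
        log grid whose bins are roughly +/- rel_width wide, and report bins
        that are local maxima of the resulting density.  Illustrates keeping
        multiple modes rather than collapsing to a single median."""
        width = math.log1p(rel_width)
        bins = Counter(round(math.log(e) / width) for e in estimates if e > 0)
        modes = []
        for b, count in bins.items():
            if count >= bins.get(b - 1, 0) and count >= bins.get(b + 1, 0):
                modes.append((math.exp(b * width), count))   # (approx. rate, support)
        return sorted(modes, key=lambda m: -m[1])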

We applied PBM to N1 and N2 for those traces for which tcpanaly's packet filter and clock analysis did not uncover any uncorrectable problems [Pa97a, Pa97b]. After removing lbli, which frequently exhibited both bottleneck changes and multi-channel effects, PBM detected a single bottleneck 95–98% of the time; failed to produce an estimate 0–2% of the time (due to excessive noise or reordering); detected a bottleneck change in about 1 connection out of 250; and inferred a multi-channel bottleneck in 1–2% of the connections (though some of these appear spurious).

Figure 4: Histogram of single-bottleneck estimates for N2

Since all but single bottlenecks are rare, we defer discussion of the others to [Pa97b], and focus here on the usual case of finding a single bottleneck.

bandwidths for many of the paths in our study. We thus mustfall back on self-consistency checks in order to gauge the ac-curacy of PBM. Figure 4 shows a histogram of the estimatesformed for . (The estimates are similar, though lowerbandwidth estimates are more common.) The 170 Kbyte/secpeak clearly dominates, and corresponds to the speed of aT1 circuit after removing overhead. . The 7.5 Kbyte/seccorresponds to 64 Kbit/sec links and the 13–14 Kbyte/secpeak reflects 128 Kbit/sec links. The 30 Kbyte/sec peakcorresponds to a 256 Kbit/sec link, seen almost exclusivelyfor connections involving a U.K. site. The 1 Mbyte/secpeaks are due to Ethernet bottlenecks, and likely reflect T3-connectivity beyond the limiting Ethernet.We speculate that the 330 Kbyte/sec peak reflects use

of two T1 circuits in parallel, 500 Kbyte/sec reflects threeT1 circuits (not half an Ethernet, since there is no easy way tosubdivide an Ethernet's bandwidth), and 80 Kbyte/sec comesfrom use of half of a T1. Similarly, the 100 Kbyte/sec peakmost likely is due to splitting a single E1 circuit in half. In-deed, we find non-NorthAmerican sites predominating theseconnections, as well exhibiting peaks at 200–220 Kbyte/sec,above the T1 rate and just a bit below E1. This peak is absentfrom North American connections.In summary, we believe we can offer plausible explana-

tions for all of the peaks. Passing this self-consistency testin turn argues that PBM is indeed detecting true bottleneckbandwidths.We next investigate the stability of bottleneck bandwidth
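As a rough sanity check of the 170 Kbyte/sec interpretation (the framing and header assumptions below are ours, not measured):

    # A T1 carries about 1.536 Mbit/s of payload above the DS1 framing,
    # i.e., 192,000 byte/s.  With 512-byte TCP segments plus 40 bytes of
    # TCP/IP header, the user-data share is 512/552 of that:
    t1_link_bytes_per_sec = 1_536_000 / 8
    user_data_rate = t1_link_bytes_per_sec * 512 / (512 + 40)
    print(round(user_data_rate))   # ~178,000 byte/s; additional link-layer
                                   # overhead brings this near the observed
                                   # 170 Kbyte/sec peak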

We next investigate the stability of bottleneck bandwidth over time. If we consider successive estimates for the same sender/receiver pair, then we find that 50% differ by less than 1.75%; 80%, by less than 10%; and 98% differ by less than a factor of two. Clearly, bottlenecks change infrequently.

The last property of bottleneck bandwidth we investigate is symmetry: how often is the bottleneck from host A to host B the same as that from B to A? Bottleneck asymmetries are an important consideration for sender-based “echo” measurement techniques, since these will observe the minimum bottleneck of the two directions [Bo93, CC96].

[Footnote: Recall that we compute rho_B in terms of TCP payload bytes.]



We find that for a given pair of hosts, the median estimates in the two directions differ by more than 20% about 20% of the time. This finding agrees with the observation that Internet paths often exhibit major routing asymmetries [Pa96]. The bottleneck differences can be quite large, with for example some paths T1-limited in one direction but Ethernet-limited in the other. In light of these variations, we see that sender-based bottleneck measurement will sometimes yield quite inaccurate results.

4.4 Efficacy of packet-pair

We finish with a look at how packet pair performs compared to PBM. We confine our analysis to those traces for which PBM found a single bottleneck. If packet pair produces an estimate lying within 20% of PBM's, then we consider it as agreeing with PBM, otherwise not.

We evaluate “receiver-based packet pair” (RBPP, per § 4.1) by considering it as PBM limited to packet bunch sizes of 2 packets (or larger, if needed to resolve limited clock resolutions). We find RBPP estimates almost always (97–98%) agree with PBM. Thus, if (1) PBM's general clustering and filtering algorithms are applied to packet pair, (2) we do packet pair estimation at the receiver, (3) the receiver benefits from sender timing information, so it can reliably detect out-of-order delivery and lack of bottleneck “expansion,” and (4) we are not concerned with multi-channel effects, then packet pair is a viable and relatively simple means to estimate the bottleneck bandwidth.

We also evaluate “sender-based packet pair” (SBPP), in which the sender makes measurements by itself. SBPP is of considerable interest because a sender can use it without any cooperation from the receiver, making it easy to deploy in the Internet. To fairly evaluate SBPP, we assume use by the sender of a number of considerations for forming sound bandwidth estimates, detailed in [Pa97b]. Even so, we find, unfortunately, that SBPP does not work especially well. In both datasets, the SBPP bottleneck estimate agrees with PBM only about 60% of the time. About one third of the estimates are too low, reflecting inaccuracies induced by excessive delays incurred by the acks on their return. The remaining 5–6% are overestimates (typically 50% too high), reflecting ack compression (§ 6.1).

5 Packet Loss

In this section we look at what our measurements tell us about packet loss in the Internet: how frequently it occurs and with what general patterns (§ 5.1); differences between loss rates of data packets and acks (§ 5.2); the degree to which loss occurs in bursts (§ 5.3); and how well TCP retransmission matches genuine loss (§ 5.4).

5.1 Loss rates

A fundamental issue in measuring packet loss is to avoid confusing measurement drops with genuine losses. Here is where the effort to ensure that tcpanaly understands the details of the TCP implementations in our study pays off [Pa97a]. Because we can determine whether traces suffer from measurement drops, we can exclude those that do from our packet loss analysis and avoid what could otherwise be significant inaccuracies.

For the sites in common, in N1 2.7% of the packets were lost, while in N2, 5.2%, nearly twice as many. However, we need to address the question of whether the increase was due to the use of bigger windows in N2 (§ 2). With bigger windows, transfers will often have more data in flight and, consequently, load router queues much more.

We can assess the impact of bigger windows by looking at loss rates of data packets versus those for ack packets. Data packets stress the forward path much more than the smaller ack packets stress the reverse path, especially since acks are usually sent at half the rate of data packets due to ack-every-other-packet policies. On the other hand, the rate at which a TCP transmits data packets adapts to current conditions, while the ack transmission rate does not unless an entire flight of acks is lost, causing a sender timeout. Thus, we argue that ack losses give a clearer picture of overall Internet loss patterns, while data losses tell us specifically about the conditions as perceived by TCP connections.

In N1, 2.88% of the acks were lost and 2.65% of the data packets, while in N2 the figures are 5.14% and 5.28%. Clearly, the bulk of the difference between the N1 and N2 loss rates is not due to the use of bigger windows in N2. Thus we conclude that, overall, packet loss rates nearly doubled during 1995. We can refine these figures in a significant way, by conditioning on observing at least one loss during a connection. Here we make a tacit assumption that the network has two states, “quiescent” and “busy,” and that we can distinguish between the two because when it is quiescent, we do not observe any (ack) loss.

In both N1 and N2, about half the connections had no ack loss. For “busy” connections, the loss rates jump to 5.7% in N1 and 9.2% in N2. Thus, even in N1, if the network was busy (using our simplistic definition above), loss rates were quite high, and for N2 they shot upward to a level that in general will seriously impede TCP performance.

So far, we have treated the Internet as a single aggregated network in our loss analysis. Geography, however, plays a crucial role. To study geographic effects, we partition the connections into four groups: “Europe,” “U.S.,” “Into Europe,” and “Into U.S.” European connections have both a European sender and receiver, U.S. connections have both in the United States. “Into Europe” connections have European data senders and U.S. data receivers. The terminology is backwards here because what we assess are ack loss rates, and these are generated by the receiver. Hence, “Into Europe” loss rates reflect those experienced by packet streams traveling from the U.S. into Europe. Similarly, “Into U.S.” connections have U.S. data senders and European receivers.



Region        % Quiescent (N1)  % Quiescent (N2)  Busy loss (N1)  Busy loss (N2)  Change
Europe              48%               58%              5.3%            5.9%        +11%
U.S.                66%               69%              3.6%            4.4%        +22%
Into Europe         40%               31%              9.8%           16.9%        +72%
Into U.S.           35%               52%              4.9%            6.0%        +22%
All regions         53%               52%              5.6%            8.7%        +55%

Table 1: Conditional ack loss rates for different regions

Table 1 summarizes loss rates for the different regions, conditioning on whether any acks were lost (“quiescent” or “busy”). The second and third columns give the proportion of all connections that were quiescent in N1 and N2, respectively. We see that except for the trans-Atlantic links going into the U.S., the proportion of quiescent connections is fairly stable. Hence, loss rate increases are primarily due to higher loss rates during the already-loaded “busy” periods. The fourth and fifth columns give the proportion of acks lost for “busy” periods, and the final column summarizes the relative change of these figures. None of the busy loss rates is especially heartening, and the trends are all increasing. The 17% loss rate going into Europe is particularly glum.

Within regions, we find considerable site-to-site variation in loss rates, as well as variation between loss rates for packets inbound to the site and those outbound (§ 5.2). We did not, however, find any sites that seriously skewed the above figures.

the day, here omitted due to limited space. We find an un-surprising diurnal pattern of “busy” periods corresponding toworking hours and “quiescent” periods to late night and es-pecially early morning hours. However, we also find that oursuccessful measurements involving European sites exhibit adefinite skew towards oversampling the quiescent periods,due to effects discussed in 2. Consequently, the Europeanloss rates given above are underestimates.We finish with a brief look at how loss rates evolve over

time. We find that observing a zero-loss connection at agiven point in time is quite a good predictor of observingzero-loss connections up to several hours in the future, andremains a useful predictor, though not as strong, even fortime scales of days and weeks [Pa97b]. Similarly, observinga connection that suffered loss is also a good predictor thatfuture connections will suffer loss. The fact that predictionloses some power after a few hours supports the notion devel-oped above that network paths have two general states, “qui-escent” and “busy,” and provides evidence that both statesare long-lived, on time scales of hours. This again is notsurprising, since we discussed earlier how these states ex-hibit clear diurnal patterns. That they are long-lived, though,means that caching loss information should prove beneficial.Finally, we note that the predictive power of observing a

specific loss rate is much lower than that of observing thepresence of zero or non-zero loss. That is, even if we knowit is a “busy” or a “quiescent” period, the loss rate measuredat a given time only somewhat helps us predict loss rates attimes not very far (minutes) in the future, and is of little helpin predicting loss rates a number of hours in the future.

5.2 Data packet loss vs. ack loss

We now turn to evaluating how patterns of packet loss differ between data packets (those carrying any user data) and ack packets. We make a key distinction between "loaded" and "unloaded" data packets. A "loaded" data packet is one that presumably had to queue at the bottleneck link behind one of the connection's previous packets, while an unloaded data packet is one that we know did not have to queue at the bottleneck behind a predecessor. We distinguish between the two by computing each packet's load, as follows. Suppose the methodology in § 4 estimates the bottleneck

bandwidth as $\rho_B$. It also provides bounds on the estimate, i.e., a minimum value $\rho_B^-$ and a maximum $\rho_B^+$. We can then determine the maximum amount of time required for a $b$-byte packet to transit the bottleneck, namely $Q_b^+ = b / \rho_B^-$ sec.

Let $T_i$ be the time at which the sender transmits the $i$th data packet. We then sequentially associate a maximum load $\lambda_i^+$ with each packet (assume for simplicity that $b$ is constant). The first packet's load is:

    $\lambda_1^+ = Q_b^+$

Subsequent packets have a load:

    $\lambda_i^+ = \max(\lambda_{i-1}^+ - (T_i - T_{i-1}),\, 0) + Q_b^+$

$\lambda_i^+$ thus reflects the maximum amount of extra delay the $i$th packet incurs due to its own transmission time across the bottleneck link, plus the time required to first transmit any preceding packets across the bottleneck link, if packet $i$ will arrive at the bottleneck before they have completed transmission. In queueing theory terms, $\lambda_i^+$ reflects the $i$th packet's (maximum) waiting time at the bottleneck queue, in the absence of competing traffic from exogenous sources. If $\lambda_i^+ > Q_b^+$, then we will term packet $i$ "loaded,"

meaning that it had to wait for pending transmission of earlier packets. Otherwise, we term it "unloaded." (We can also develop "central" load estimates $\lambda_i$, rather than maximum estimates, by using $\rho_B$ instead of $\rho_B^-$ in this chain of reasoning. These are the values used in § 6.3.)
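This sequential load computation is straightforward to express in code. The sketch below is our own rendering (not the paper's tcpanaly), with assumed argument names; it labels each packet as loaded or unloaded from its send time, its size, and the lower bound on the estimated bottleneck bandwidth.

    # Sketch: per-packet maximum loads lambda_i^+ and the loaded/unloaded split.
    def classify_load(send_times, sizes, rho_b_minus):
        """send_times: transmission times T_i (sec); sizes: packet sizes (bytes);
        rho_b_minus: lower bound on the bottleneck bandwidth (bytes/sec)."""
        labels, loads = [], []
        prev_load, prev_time = 0.0, None
        for t, b in zip(send_times, sizes):
            q_plus = b / rho_b_minus          # maximum bottleneck transit time Q_b^+
            if prev_time is None:
                load = q_plus                 # first packet: lambda_1^+ = Q_b^+
            else:
                # residual queueing behind predecessors, plus own transit time
                load = max(prev_load - (t - prev_time), 0.0) + q_plus
            labels.append("loaded" if load > q_plus else "unloaded")
            loads.append(load)
            prev_load, prev_time = load, t
        return labels, loads

Substituting the central bandwidth estimate for the lower bound yields the central loads $\lambda_i$ used in § 6.3.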

Using this terminology, in both N1 and N2 about two-thirds of the data packets were loaded. Figure 5 shows the distributions of per-connection loss rates for unloaded data packets, loaded data packets, and acks. All three distributions show considerable probability of zero loss. We immediately see that loaded packets are much more likely to be lost than unloaded packets, as we would expect. In addition, acks are consistently more likely than unloaded packets to be lost, but generally less likely to be lost than loaded packets, except


Figure 5: Loss rates for data packets and acks

during times of severe loss. We interpret the difference between ack and data loss rates as reflecting the fact that, while an ack stream presents a much lighter load to the network than a data packet stream, the ack stream does not adapt to the current network conditions, while the data packet stream does, lowering its transmission rate in an attempt to diminish its loss rate.

It is interesting to note the extremes packet loss can reach. The largest unloaded data packet loss rate we observed was 47%. For loaded packets it climbed to 65%, and for acks, 68%. As we would expect, these connections all suffered egregiously. However, they did manage to successfully complete their transfers within their allotted ten minutes, a testimony to TCP's tenacity. For all of these extremes, no packets were lost in the reverse direction! Clearly, packet loss on the forward and reverse paths is sometimes completely independent. Indeed, the coefficient of correlation between combined (loaded and unloaded) data packet loss rates and ack loss rates in N1 is 0.21, and in N2 the loss rates appear uncorrelated (coefficient of 0.02), perhaps due to the greater prevalence of significant routing asymmetry [Pa96].

Further investigating the loss rate distributions, one interesting feature we find is that the non-zero portions of both the unloaded and loaded data packet loss rate distributions agree closely with exponential distributions, while that for acks is not so persuasive a match. Figure 6 shows the distributions of the per-connection loss rates for unloaded data packets (top) and acks (bottom), for those connections that suffered at least one loss. In both plots we have added an exponential distribution fitted to the mean of the loss rate (dotted). We see that for unloaded data packets (and also for loaded packets, not shown), the loss rate distribution is quite close to exponential, with the only significant disagreement in the lower tail. (This tail is subject to granularity effects, since for a trace with $n$ packets, the minimum non-zero loss rate will be $1/n$.) The close fit is widespread, not dominated by a few sites. For ack loss rates, however, we see that the fit is considerably less compelling.
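One simple way to examine such a fit, sketched below under the assumption that the non-zero per-connection loss rates have already been extracted, is to compare empirical quantiles against those of an exponential distribution fitted to the sample mean (the dotted curves in Figure 6):

    # Sketch: empirical vs. fitted-exponential quantiles for non-zero loss rates.
    import math

    def exponential_fit_check(loss_rates, quantiles=(0.25, 0.5, 0.75, 0.9)):
        """loss_rates: per-connection loss rates, already filtered to non-zero values."""
        rates = sorted(loss_rates)
        mean = sum(rates) / len(rates)
        rows = []
        for q in quantiles:
            empirical = rates[int(q * (len(rates) - 1))]
            fitted = -mean * math.log(1.0 - q)   # quantile of an exponential with this mean
            rows.append((q, empirical, fitted))
        return rows

Close agreement between the empirical and fitted columns corresponds to the close fits seen for the data packet loss rates.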

Figure 6: Distributions of unloaded data packet and ack non-zero loss rates (solid), with fitted exponential distributions (dotted)

While striking, interpreting the fit to the exponential distribution is difficult. If, for example, packet loss occurs independently and with a constant probability, then we would expect the loss rate to reflect a binomial distribution, but that is not what we observe. (We also know from the results in § 5.1 that there is not a single Internet packet loss rate, or

anything approaching such a situation.)

It seems likely that the better exponential fit for data loss rates than for ack loss rates holds a clue. The most salient difference between the transmission of data packets and that of acks is that the rate at which the sender transmits data packets adapts to the current network conditions, and furthermore it adapts based on observing data packet loss. Thus, if we passively measure the loss rate by observing the fate of a connection's TCP data packets, then we are in fact making measurements using a mechanism whose goal is to lower the value of what we are measuring (by spacing out the measurements). Consequently, we need to take care to distinguish between measuring overall Internet packet loss rates, which is best done using non-adaptive sampling, and measuring loss rates experienced by a transport connection's packets; the two can be quite different.

Finally: the link between the adaptive sampling and the striking exponential distribution eludes us. We suspect it will prove an interesting area for further study.


Type of loss        Uncond. (N1)  Uncond. (N2)  Cond. (N1)  Cond. (N2)
Loaded data pkt          2.8%          4.5%          49%         50%
Unloaded data pkt        3.3%          5.3%          20%         25%
Ack                      3.2%          4.3%          25%         31%

Table 2: Unconditional and conditional loss rates

5.3 Loss bursts

In this section we look at the degree to which packet loss occurs in bursts of more than one consecutive loss. The first question we address is the degree to which packet losses are well-modeled as independent. In [Bo93], Bolot investigated this question by comparing the unconditional loss probability of a packet with the conditional loss probability, where the latter is conditioned on the fact that the previous packet was also lost. He investigated the relationship between the two for different packet spacings, ranging from 8 msec to 500 msec. He found that the conditional probability approaches the unconditional one as the spacing increases, indicating that loss correlations are short-lived, and concluded that "losses of probe packets are essentially random as long as the probe traffic uses less than 10% of the available capacity of the connection over which the probes are sent." The path he analyzed, though, included a heavily loaded trans-Atlantic link, so the patterns he observed might not be typical.

Table 2 summarizes the unconditional and conditional loss probabilities for the different types of packets and the two datasets. Clearly, for TCP packets we must discard the assumption that loss events are well-modeled as independent. Even for the low-burden, relatively low-rate ack packets, the loss probability jumps by a factor of seven if the previous ack was lost. We would expect to find the disparity strongest for loaded data packets, as these must contend for buffer space with the connection's own previous packets, as well as any additional traffic, and indeed this is the case. We find the effect least strong for unloaded data packets, which accords with these not having to contend with the connection's previous packets, and having their rate diminished in the face of previous loss.
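The two quantities in Table 2 are simple to compute; the sketch below (with a hypothetical input format) derives both from an ordered sequence of per-packet loss indicators for a given packet type:

    # Sketch: unconditional and conditional (on previous loss) loss probabilities.
    def loss_probabilities(lost):
        """lost: ordered list of booleans, one per packet of a given type (True = lost)."""
        p_uncond = sum(lost) / len(lost)
        after_loss = [cur for prev, cur in zip(lost, lost[1:]) if prev]
        p_cond = sum(after_loss) / len(after_loss) if after_loss else float("nan")
        return p_uncond, p_cond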

The relative differences between the unconditional and conditional loss probabilities in Table 2 all exceed those computed by Bolot by a large factor. His greatest observed ratio of conditional to unconditional loss probability was about 2.5:1. However, his unconditional probabilities were all much higher than those in Table 2, even at his largest packet spacing of 500 msec, suggesting that the path he measured differed considerably from a typical path in our study.

Given that packet losses occur in bursts, the next natural question is: how big? To address this question, we group successive packet losses into outages. Figure 7 shows the distribution of outage durations for those lasting more than 200 msec (the majority). We see that all four distributions agree fairly closely.

It is interesting that loaded packets are unconditionally less likely to be lost than unloaded packets. We suspect this reflects the fact that lengthy periods of heavy loss or outages will lead to timeout retransmissions, and these are unloaded. Note that these statistics differ from the distributions shown in Figure 5, because those are per-connection loss rates, while Table 2 summarizes loss probabilities over all the packets in each dataset.

Figure 7: Distribution of packet loss outage durations exceeding 200 msec

Figure 8: Log-log complementary distribution plot of ack outage durations

It is clear from Figure 7 that outage durations span several

orders of magnitude. For example, 10% of the ack outages were 33 msec or shorter (not shown in the plot), while another 10% were 3.2 sec or longer, a factor of a hundred larger. Furthermore, the upper tails of the distributions are consistent with Pareto distributions. Figure 8 shows a complementary distribution plot of the duration of ack outages, for those lasting more than 2 sec (about 16% of all the outages). Both axes are log-scaled. A straight line on such a plot corresponds to a Pareto distribution. We have added a least-squares fit. We see that the long outages fit quite well to a Pareto distribution with an estimated shape parameter below two, except for the extreme upper tail, which is subject to truncation because of the 600 sec limit on connection durations (§ 2).
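The sketch below shows one way to carry out such an analysis, assuming a reasonably large tail sample: group lost packets into outages, then estimate a Pareto shape parameter from the slope of the log-log complementary distribution of the long outages. The grouping rule and helper names are ours, not the paper's.

    # Sketch: outage grouping and a least-squares Pareto shape estimate for the tail.
    import math

    def outage_durations(loss_times, max_gap):
        """loss_times: sorted timestamps of lost packets; max_gap: largest spacing (sec)
        for consecutive losses to count as the same outage.  Single-loss outages get
        duration 0 in this simplification."""
        outages, start, prev = [], None, None
        for t in loss_times:
            if start is None or t - prev > max_gap:
                if start is not None:
                    outages.append(prev - start)
                start = t
            prev = t
        if start is not None:
            outages.append(prev - start)
        return outages

    def pareto_shape(durations, cutoff):
        """Slope of log P(X > x) vs. log x, fitted by least squares above the cutoff."""
        tail = sorted(d for d in durations if d >= cutoff)
        n = len(tail)
        pts = [(math.log(tail[i - 1]), math.log(1.0 - i / n)) for i in range(1, n)]
        mean_x = sum(x for x, _ in pts) / len(pts)
        mean_y = sum(y for _, y in pts) / len(pts)
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
                 / sum((x - mean_x) ** 2 for x, _ in pts))
        return -slope                      # Pareto shape parameter alpha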

A shape parameter below two means that the distribution has infinite variance, indicating immense variability. Pareto distributions for activity and inactivity periods play key roles in some models of self-similar network traffic [WTSW95], suggesting that packet loss outages could contribute to how TCP network traffic might fit ON/OFF-based self-similarity models.

Finally, we note that the patterns of loss bursts we observe might be greatly shaped by the use of "drop tail" queueing. In particular, deployment of Random Early Detection could significantly affect these patterns and the corresponding connection dynamics [FJ93].


Type of RR         Solaris (N1)  Solaris (N2)  Other (N1)  Other (N2)
% of all packets        6%            6%           1%          2%
% of retrans.          66%           59%          26%         28%
Unavoidable            14%           33%          44%         17%
Coarse feedback         1%            1%          51%         80%
Bad RTO                84%           66%           4%          3%

Table 3: Proportion of redundant retransmissions (RRs) due to different causes

5.4 Efficacy of TCP retransmission

The final aspect of packet loss we investigate is how efficiently TCP deals with it. Ideally, TCP retransmits if and only if the retransmitted data was indeed lost. However, the transmitting TCP lacks perfect information, and consequently can retransmit unnecessarily. We analyzed each TCP transmission in our measurements to determine whether it was a redundant retransmission (RR), meaning that the data sent had already arrived at the receiver, or was in flight and would successfully arrive. We classify three types of RRs (a classification sketch in code follows the list):

unavoidable because all of the acks for the data were lost;

coarse feedback, meaning that had earlier acks conveyed finer information about sequence holes (such as provided by SACK), then the retransmission could have been avoided; and

bad RTO, meaning that had the TCP simply waited longer, it would have received an ack for the data (bad retransmission timeout).
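A minimal sketch of this classification, assuming the trace analysis has already determined the relevant facts about each retransmitted segment (the boolean flags and function name below are ours, not tcpanaly's):

    # Sketch: assign a redundant retransmission (RR) to one of the three causes.
    def classify_rr(all_acks_lost, finer_feedback_would_have_helped, ack_arrived_later):
        """all_acks_lost: every ack covering the data was dropped;
        finer_feedback_would_have_helped: SACK-style information would have avoided it;
        ack_arrived_later: simply waiting longer would have produced an ack."""
        if all_acks_lost:
            return "unavoidable"
        if finer_feedback_would_have_helped:
            return "coarse feedback"
        if ack_arrived_later:
            return "bad RTO"
        return "unclassified"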

Table 3 summarizes the prevalence of the different types of RRs in N1 and N2. We divide the analysis into Solaris 2.3/2.4 TCP senders and others, because in [Pa97a] we identified the Solaris 2.3/2.4 TCP as suffering from significant errors in computing RTO, which the other implementations do not exhibit. We see that in N1, a fair proportion of the RRs were unavoidable. (Some of these might, however, have been avoided had the receiving TCP generated more acks.) But for N2, only about 1/6 of the RRs for non-Solaris TCPs were unavoidable, the difference no doubt due to N2's use of bigger windows (§ 2) increasing the mean number of acks in flight.

"Coarse feedback" RRs would presumably all be fixed using SACK, and these are the majority of RRs for non-Solaris TCPs. Solaris TCPs would not immediately benefit from SACK because many of their RRs occur before a SACK ack could arrive anyway.

"Bad RTO" RRs indicate that the TCP's computation of the retransmission timeout was erroneous. These are the bane of Solaris 2.3/2.4 TCP, as noted above. Fixing the Solaris RTO calculation eliminates about 4-5% of all of the data traffic generated by the TCP. (We note that this problem has been fixed in Solaris 2.5.1.) For non-Solaris TCPs, bad RTO RRs are rare, providing solid evidence that the standard TCP RTO estimation algorithm developed in [Ja88] performs quite well for avoiding RRs. A separate question is whether the RTO estimation is overly conservative. A thorough investigation of this question is complex, because a revised estimator might take advantage of both higher-resolution clocks and the opportunity to time multiple packets per flight. Thus, we leave this interesting question for future work.

In summary: ensuring standard-conformant RTO calculations and deploying the SACK option together eliminate virtually all of the avoidable redundant retransmissions. The remaining RRs are rare enough not to present serious performance problems.

6 Packet Delay

The final aspect of Internet packet dynamics we analyze is that of packet delay. Here we focus on network dynamics rather than transport protocol dynamics. Consequently, we confine our analysis to variations in one-way transit times (OTTs) and omit discussion of RTT variation, since RTT measurements conflate delays along the forward and reverse paths.

For reasons noted in § 1, we do not attempt frequency-domain analysis of packet delay. We also do not summarize the marginal distribution of packet delays. Mukherjee found that packet delay along a particular Internet path is well-modeled using a shifted gamma distribution, but the parameters of the distribution vary from path to path and over the course of the day [Mu94]. Since we have about 1,000 distinct paths in our study, measured at all hours of the day, and since the gamma distribution varies considerably as its parameters are varied, it is difficult to see how to summarize the delay distributions in a useful fashion. We hope to revisit this problem in future work.

Any accurate assessment of delay must first deal with the issue of clock accuracy. This problem is particularly pronounced when measuring OTTs, since doing so involves comparing measurements from two separate clocks. Accordingly, we developed robust algorithms for detecting clock adjustments and relative skew by inspecting sets of OTT measurements, described in [Pa97b]. The analysis in this section assumes these algorithms have first been used to reject or adjust traces with clock errors.

OTT variation was previously analyzed by Claffy and colleagues in a study of four Internet paths [CPB93]. They found that mean OTTs are often not well approximated by dividing RTTs in half, and that variations in the paths' OTTs are often asymmetric. Our measurements confirm this latter finding. If we compute the inter-quartile range (75th percentile minus 25th) of OTTs for a connection's unloaded data packets versus that of the acks coming back, in N1 the coefficient of correlation between the two is an anemic 0.10, and in N2 it drops to 0.006.


6.1 Timing compression

Packet timing compression occurs when a flight of packets sent over an interval $\Delta T_s$ arrives at the receiver over an interval $\Delta T_r < \Delta T_s$. To first order, compression should not occur, since the main mechanism at work in the network for altering the spacing between packets is queueing, which in general expands flights of packets (cf. § 4.1). However, compression can occur if a flight of packets is at some point held up by the network, such that transmission of the first packet stalls and the later packets have time to catch up to it.

Zhang et al. predicted from theory and simulation that acks could be compressed ("ack compression") if a flight arrived at a router without any intervening packets from cross traffic (hence, the router's queue is draining) [ZSC91]. Mogul subsequently analyzed a trace of Internet traffic and confirmed the presence of ack compression [Mo92]. His definition of ack compression is somewhat complex, since he had to infer endpoint behavior from an observation point inside the network. Since we can compute both $\Delta T_s$ and $\Delta T_r$ from our data, we can instead directly evaluate the presence of compression. He found compression was correlated with packet loss but considerably more rare. His study was limited, however, to a single 5-hour traffic trace.

Ack compression. To detect ack compression, for each

group of at least 3 acks we compute:

    $\xi = (\Delta T_r + C_r) / (\Delta T_s - C_s)$     (2)

where $C_r$ and $C_s$ are the receiving and sending hosts' clock resolutions, so $\xi$ is a conservative estimate of the degree of compression. We consider a group compressed if $\xi$ falls well below one. We term such a group a compression event. In N1, 50% of the connections experienced at least one compression event, and in N2, 60% did. In both, the mean number of events was around two, and 1% of the connections experienced 15 or more. Almost all compression events are small, with only 5% spanning five or more acks. Finally, a significant minority (10-25%) of the compression events occurred for dup acks. These are sent with less spacing between them than regular acks sent by ack-every-other policies, so it takes less timing perturbation to compress them.
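A sketch of this computation, with assumed argument names and ignoring the degenerate case in which the sending interval is no larger than the sender's clock resolution:

    # Sketch: conservative compression metric xi for a group of at least three acks.
    def compression_metric(send_times, arrival_times, c_s, c_r):
        """send_times: ack transmission times at the host sending the acks (sec);
        arrival_times: corresponding arrival times at the other endpoint (sec);
        c_s, c_r: clock resolutions of the sending and receiving hosts (sec)."""
        assert len(send_times) == len(arrival_times) >= 3
        delta_t_s = send_times[-1] - send_times[0]        # spread when sent
        delta_t_r = arrival_times[-1] - arrival_times[0]  # spread on arrival
        # adding c_r and subtracting c_s biases xi upward, understating compression
        return (delta_t_r + c_r) / (delta_t_s - c_s)

A returned value well below one flags the group as a compression event.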

Were ack compression frequent, it would present two problems. First, as acks arrive they advance TCP's sliding window and "clock out" new data packets at the rate reflected by their arrival [Ja88]. For compressed acks, this means that the data packets go out faster than they otherwise would, which can result in network stress. Second, sender-based measurement techniques such as SBPP (§ 4.1) can misinterpret compressed acks as reflecting greater bandwidth than is truly available. Since, however, we find ack compression relatively rare and small in magnitude, the first problem is not serious, and the second can be dealt with by judiciously removing upper extremes from sender-based measurements. (Indeed, it has been argued that occasional ack compression is beneficial, since it provides an opportunity for self-clocking to discover newly-available bandwidth.)

Figure 9: Data packet timing compression

Data packet timing compression. For data packet timing compression, our concerns are different. Sometimes a flight of data packets is sent at a high rate due to a sudden advance in the receiver's offered window. Normally these flights are spread out by the bottleneck and arrive at the receiver with a spacing equal to the bottleneck transmission time between each packet (§ 4). If after the bottleneck their timing is compressed, then use of Eqn 2 will not detect this fact unless they are compressed to a greater degree than their sending rate. Figure 9 illustrates this concern: the flights of data packets arrived at the receiver at 170 Kbyte/sec (T1 rate), except for the central flight, which arrived at Ethernet speed. However, it was also sent at Ethernet speed, so for it $\xi \approx 1$.

Consequently, we consider a group of data packets as "compressed" if they arrive at greater than twice the upper bound on the estimated bottleneck bandwidth, $\rho_B^+$. We only consider groups of at least four data packets, as these, coupled with ack-every-other policies, have the potential to then elicit a pair of acks reflecting the compressed timing, leading to bogus self-clocking.

These compression events are rarer than ack compression, occurring in only 3% of the N1 traces and 7% of those in N2. We were interested in whether some paths might be plagued by repeated compression events due to either peculiar router architectures or network dynamics. Only 25-30% of the traces with an event had more than one, and just 3% had more than five, suggesting that such phenomena are rare. But those connections with multiple events are dominated by a few host pairs, indicating that the phenomenon does occur repeatedly, and is sometimes due to specific routers.

It appears that data packet timing compression is rare enough not to present a problem. That it does occur, though, again highlights the necessity of outlier-filtering when conducting timing measurements. (It also has a measurement benefit: from the arrival rate of the compressed packets, we can estimate the downstream bottleneck rate.)

6.2 Queueing time scales

In this section we briefly develop a rough estimate of the time scales over which queueing occurs. If we take care to



Figure 10: Proportion (normalized) of connections with given time scale of maximum delay variation ($\tau^*$)

eliminate suspect clocks, reordered packets, compressed timing, and traces exhibiting TTL shifts (which indicate routing changes), then we argue that the remaining measured OTT variation reflects queueing delays.

We compute the queueing variation on the time scale $\tau$ as follows. We partition the packets sent by a TCP into intervals of length $\tau$. For each interval, let $n_l$ and $n_r$ be the numbers of successfully-arriving packets in the left and right halves of the interval. If either is zero, or if one greatly exceeds the other, then we reject the interval as containing too few measurements or too much imbalance between the halves. Otherwise, let $m_l$ and $m_r$ be the median OTTs of the two halves. We then define the interval's queueing variation as $|m_r - m_l|$. Finally, let $Q_\tau$ be the median of this quantity over all accepted intervals.

Thus, $Q_\tau$ reflects the "average" variation we observe in packet delays over a time scale of $\tau$. By using medians, this estimate is robust in the presence of noise due to non-queueing effects, or queueing spikes. By dividing intervals in two and comparing only the variation between the two halves, we confine $Q_\tau$ to variations on the time scale of $\tau$; shorter- or longer-lived variations are in general not included.
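The sketch below (our own notation and helpers, not the paper's code) computes $Q_\tau$ for one time scale from (send time, OTT) pairs; ranging $\tau$ over successively larger values and taking the $\tau$ that maximizes $Q_\tau$ then yields the per-connection $\tau^*$ plotted in Figure 10.

    # Sketch: median OTT variation Q_tau on time scale tau from (send_time, OTT) pairs.
    import statistics

    def q_tau(samples, tau):
        """samples: list of (send_time_sec, ott_sec) for one connection's acks."""
        t0 = min(t for t, _ in samples)
        intervals = {}
        for t, ott in samples:
            intervals.setdefault(int((t - t0) // tau), []).append((t - t0, ott))
        variations = []
        for k, pkts in intervals.items():
            mid = (k + 0.5) * tau
            left = [ott for t, ott in pkts if t < mid]
            right = [ott for t, ott in pkts if t >= mid]
            # reject sparse or imbalanced intervals (factor-of-two rule is our assumption)
            if not left or not right or len(left) > 2 * len(right) or len(right) > 2 * len(left):
                continue
            variations.append(abs(statistics.median(right) - statistics.median(left)))
        return statistics.median(variations) if variations else None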

We now analyze $Q_\tau$ for different values of $\tau$, confining ourselves to variations in ack OTTs, as these are not clouded by self-interference and adaptive transmission rate effects. The question is: are there particular values of $\tau$ on which most queueing variation occurs? If so, then we can hope to engineer for those time scales. For example, if the dominant $\tau$ is less than a connection's RTT, then it is pointless for the connection to try to adapt to queueing fluctuations, since it cannot acquire feedback quickly enough to do so.

For each connection, we range through successively larger values of $\tau$, up to about 65 sec, to find $\tau^*$, the value of $\tau$ for which $Q_\tau$ is greatest. $\tau^*$ reflects the time scale on which the connection experienced the greatest OTT variation. Figure 10 shows the normalized proportion of the connections in N1 and N2 exhibiting different values of $\tau^*$. Normalization is done by dividing the number of connections that exhibited a given $\tau^*$ by the number whose durations were at least as long as $\tau^*$. For both datasets, time scales of 128-2048 msec primarily dominate. This range, though, spans more than an order of magnitude, and also exceeds typical RTT values. Furthermore, while less prevalent, $\tau^*$ values all the way up to 65 sec remain common, with N1 having a strong peak at 65 sec (which appears genuine; perhaps due to periodic outages caused by router synchronization [FJ94], eliminated by the end of 1995).

We summarize the figure as indicating that Internet delay variations occur primarily on time scales of 0.1-1 sec, but extend out quite frequently to much larger times.

6.3 Available bandwidth

The last aspect of delay variation we look at is an interpretation of how it reflects the available bandwidth. In § 5.2 we developed a notion of data packet $i$'s "load," $\lambda_i$, meaning how much delay it incurs due to queueing at the bottleneck behind its predecessors, plus its own bottleneck transmission time $Q_b$. Since every packet requires $Q_b$ to transit the bottleneck, variations in OTT do not include $Q_b$, but will reflect $\lambda_i - Q_b$. Term this value $\gamma_i$, and let $\delta_i$ denote the difference between packet $i$'s measured OTT and the minimum observed OTT. If the network path is completely unloaded except for the connection's load itself (no competing traffic), then we should have $\gamma_i = \delta_i$, i.e., all of packet $i$'s delay variation is due to queueing behind its predecessors. More generally, define

    $\beta = \sum_i \gamma_i \,/\, \sum_i \delta_i$

$\beta$ then reflects the proportion of the packets' delay due to the connection's own loading of the network. If $\beta = 1$, then all of the delay variation is due to the connection's own queueing load on the network, while, if $\beta = 0$, then the connection's load is insignificant compared to that of other traffic in the network. More generally, $\sum_i \gamma_i$ reflects the resources consumed by the connection, while $\sum_i (\delta_i - \gamma_i)$ reflects the resources consumed by the competing connections. Thus, $\beta$ captures the proportion of the total resources that

were consumed by the connection itself, and we interpret $\beta$ as reflecting the available bandwidth. Values of $\beta$ close to 1 mean that the entire bottleneck bandwidth was available, and values close to 0 mean that almost none of it was actually available.
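Given the central per-packet loads from § 5.2 and the measured OTTs, $\beta$ reduces to a short computation; the sketch below uses assumed argument names and is not the study's implementation.

    # Sketch: available-bandwidth proxy beta from central loads and measured OTTs.
    def available_bandwidth_beta(loads, otts, q_b):
        """loads: central load estimates lambda_i (sec); otts: measured one-way
        transit times (sec) for the same packets; q_b: central bottleneck transit
        time of a full-sized packet (sec)."""
        min_ott = min(otts)
        gammas = [lam - q_b for lam in loads]      # self-induced queueing, gamma_i
        deltas = [ott - min_ott for ott in otts]   # total OTT variation, delta_i
        total_delta = sum(deltas)
        return sum(gammas) / total_delta if total_delta > 0 else 1.0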

Note that we can have $\beta = 1$ even if the connection does not consume all of the network path's capacity. All that is required is that, to the degree that the connection did attempt to consume network resources, they were readily available. This observation provides the basis for hoping that we might be able to use $\beta$ to estimate available bandwidth without fully stressing the network path.

We can gauge how well $\beta$ truly reflects available bandwidth by computing the coefficient of correlation between $\beta$ and the connection's overall throughput (normalized by dividing by the bottleneck bandwidth). For N1 this is 0.44, while for N2 it rises to 0.55.


Figure 11: Density and cumulative distribution of inferred available bandwidth ($\beta$)

Figure 11 shows the density and cumulative distribution of $\beta$. Not surprisingly, we find that Internet connections encounter a broad range of available bandwidth. (A depressed region of the density reflects a measurement bias [Pa97b].) As is generally the case with Internet characteristics, a single figure like this can oversimplify the situation. We note, for example, that confining the evaluation of $\beta$ to European connections results in a sharp leftward shift in the density, indicating generally less available bandwidth, while for U.S. connections the density shifts to the right. Furthermore, for paths with higher bottleneck bandwidths, we generally find lower values of $\beta$, reflecting that such paths tend to be shared among more competing connections. Finally, we note that the predictive power of $\beta$ tends to be fairly good. On average, a given observation of $\beta$ will be within 0.1 of later observations of $\beta$ for the same path, for time periods up to several hours.

7 Conclusions

Several conclusions emerge from our study:

With due diligence to remove packet filter errors and TCP effects, TCP-based measurement provides a viable means for assessing end-to-end packet dynamics.

We find wide ranges of behavior, such that we must exercise great caution in regarding any aspect of packet dynamics as "typical."

Some common assumptions such as in-order packet delivery, FIFO bottleneck queueing, independent loss events, single congestion time scales, and path symmetries are all violated, sometimes frequently.

When implemented correctly, TCP's retransmission strategies work in a sufficiently conservative fashion.

The combination of path asymmetries and reverse-path noise renders sender-only measurement techniques markedly inferior to those that include receiver cooperation.

This last point argues that, when the measurement of interest concerns a unidirectional path (be it for measurement-based adaptive transport techniques such as TCP Vegas [BOP94], or general Internet performance metrics such as those in development by the IPPM effort [A+96]), the extra complications incurred by coordinating sender and receiver are worth the effort.

8 Acknowledgements

This work would not have been possible without the efforts of the many volunteers who installed the Network Probe Daemon at their sites. I am indebted to:

G. Almes, J. Alsters, J-C. Bolot, K. Bostic, H-W. Braun, D. Brown, R. Bush, B. Camm, B. Chinoy, K. Claffy, P. Collinson, J. Crowcroft, P. Danzig, H. Eidnes, M. Eliot, R. Elz, M. Flory, M. Gerla, A. Ghosh, D. Grunwald, T. Hagen, A. Hannan, S. Haug, J. Hawkinson, TR Hein, T. Helbig, P. Hyder, A. Ibbetson, A. Jackson, B. Karp, K. Lance, C. Leres, K. Lidl, P. Linington, S. McCanne, L. McGinley, J. Milburn, W. Mueller, E. Nemeth, K. Obraczka, I. Penny, F. Pinard, J. Polk, T. Satogata, D. Schmidt, M. Schwartz, W. Sinze, S. Slaymaker, S. Walton, D. Wells, G. Wright, J. Wroclawski, C. Young, and L. Zhang.

I am likewise indebted to Keith Bostic, Evi Nemeth, Rich Stevens, George Varghese, Andres Albanese, Wieland Holfelder, and Bernd Lamparter for their invaluable help in recruiting NPD sites. Thanks, too, to Peter Danzig, Jeff Mogul, and Mike Schwartz for feedback on the design of NPD.

This work greatly benefited from discussions with Domenico Ferrari, Sally Floyd, Van Jacobson, Mike Luby, Greg Minshall, John Rice, and the comments of the anonymous reviewers. My heartfelt thanks.

References

[A+96] G. Almes et al., "Framework for IP Provider Metrics," Internet draft, ftp://ftp.isi.edu/internet-drafts/draft-ietf-bmwg-ippm-framework-00.txt, Nov. 1996.


[Bo93] J-C. Bolot, "End-to-End Packet Delay and Loss Behavior in the Internet," Proc. SIGCOMM '93, pp. 289-298, Sept. 1993.

[BOP94] L. Brakmo, S. O'Malley and L. Peterson, "TCP Vegas: New Techniques for Congestion Detection and Avoidance," Proc. SIGCOMM '94, pp. 24-35, Sept. 1994.

[CC96] R. Carter and M. Crovella, "Measuring Bottleneck Link Speed in Packet-Switched Networks," Tech. Report BU-CS-96-006, Computer Science Department, Boston University, Mar. 1996.

[CPB93] K. Claffy, G. Polyzos and H-W. Braun, "Measurement Considerations for Assessing Unidirectional Latencies," Internetworking: Research and Experience, 4(3), pp. 121-132, Sept. 1993.

[DMT96] R. Durst, G. Miller and E. Travis, "TCP Extensions for Space Communications," Proc. MOBICOM '96, pp. 15-26, Nov. 1996.

[FJ93] S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance," IEEE/ACM Transactions on Networking, 1(4), pp. 397-413, Aug. 1993.

[FJ94] S. Floyd and V. Jacobson, "The Synchronization of Periodic Routing Messages," IEEE/ACM Transactions on Networking, 2(2), pp. 122-136, Apr. 1994.

[Ja88] V. Jacobson, "Congestion Avoidance and Control," Proc. SIGCOMM '88, pp. 314-329, Aug. 1988.

[JLM89] V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via anonymous ftp to ftp.ee.lbl.gov, June 1989.

[Ja90] V. Jacobson, "Compressing TCP/IP Headers for Low-Speed Serial Links," RFC 1144, Network Information Center, SRI International, Menlo Park, CA, Feb. 1990.

[Ke91] S. Keshav, "A Control-Theoretic Approach to Flow Control," Proc. SIGCOMM '91, pp. 3-15, Sept. 1991.

[MMSR96] M. Mathis, J. Mahdavi, S. Floyd and A. Romanow, "TCP Selective Acknowledgment Options," RFC 2018, DDN Network Information Center, Oct. 1996.

[Mo92] J. Mogul, "Observing TCP Dynamics in Real Networks," Proc. SIGCOMM '92, pp. 305-317, Aug. 1992.

[Mu94] A. Mukherjee, "On the Dynamics and Significance of Low Frequency Components of Internet Load," Internetworking: Research and Experience, Vol. 5, pp. 163-205, Dec. 1994.

[Pa96] V. Paxson, "End-to-End Routing Behavior in the Internet," Proc. SIGCOMM '96, pp. 25-38, Aug. 1996.

[Pa97a] V. Paxson, "Automated Packet Trace Analysis of TCP Implementations," Proc. SIGCOMM '97, Sept. 1997.

[Pa97b] V. Paxson, "Measurements and Analysis of End-to-End Internet Dynamics," Ph.D. dissertation, University of California, Berkeley, April 1997.

[WTSW95] W. Willinger, M. Taqqu, R. Sherman, and D. Wilson, "Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level," Proc. SIGCOMM '95, pp. 100-113, Sept. 1995.

[ZSC91] L. Zhang, S. Shenker, and D. Clark, "Observations on the Dynamics of a Congestion Control Algorithm: The Effects of Two-Way Traffic," Proc. SIGCOMM '91, pp. 133-147, Sept. 1991.
