chapter3 5th aug 2009

Chapter 3Transport LayerComputer Networking: A Top Down Approach 5th edition. Jim Kurose, Keith RossAddison-Wesley, April 2009.

A note on the use of these ppt slides:Were making these slides freely available to all (faculty, students, readers). Theyre in PowerPoint form so you can add, modify, and delete slides (including this one) and slide content to suit your needs. They obviously represent a lot of work on our part. In return for use, we only ask the following: If you use these slides (e.g., in a class) in substantially unaltered form, that you mention their source (after all, wed like people to use our book!) If you post any slides in substantially unaltered form on a www site, that you note that they are adapted from (or perhaps identical to) our slides, and note our copyright of this material.

Thanks and enjoy! JFK/KWR

All material copyright 1996-2009J.F Kurose and K.W. Ross, All Rights Reserved

Chapter 3: Transport LayerOur goals: understand principles behind transport layer services:multiplexing/demultiplexingreliable data transferflow controlcongestion control

learn about transport layer protocols in the Internet:UDP: connectionless transportTCP: connection-oriented transportTCP congestion control

Chapter 3 outline3.1 Transport-layer services3.2 Multiplexing and demultiplexing3.3 Connectionless transport: UDP3.4 Principles of reliable data transfer3.5 Connection-oriented transport: TCPsegment structurereliable data transferflow controlconnection management3.6 Principles of congestion control3.7 TCP congestion control

Transport services and protocolsprovide logical communication between app processes running on different hoststransport protocols run in end systems send side: breaks app messages into segments, passes to network layerrcv side: reassembles segments into messages, passes to app layermore than one transport protocol available to appsInternet: TCP and UDP

Transport vs. network layernetwork layer: logical communication between hoststransport layer: logical communication between processes relies on, enhances, network layer servicesHousehold analogy:12 kids sending letters to 12 kidsprocesses = kidsapp messages = letters in envelopeshosts = housestransport protocol = Ann and Billnetwork-layer protocol = postal service

Internet transport-layer protocolsreliable, in-order delivery (TCP)congestion control flow controlconnection setupunreliable, unordered delivery: UDPno-frills extension of best-effort IPservices not available: delay guaranteesbandwidth guarantees

Multiplexing/demultiplexingapplicationtransportnetworklinkphysicalP1applicationtransportnetworklinkphysicalapplicationtransportnetworklinkphysicalP2P3P4P1host 1host 2host 3= process= socketdelivering received segmentsto correct socketgathering data from multiplesockets, enveloping data with header (later used for demultiplexing)

How demultiplexing workshost receives IP datagramseach datagram has source IP address, destination IP addresseach datagram carries 1 transport-layer segmenteach segment has source, destination port number host uses IP addresses & port numbers to direct segment to appropriate socketsource port #dest port #32 bitsapplicationdata (message)other header fieldsTCP/UDP segment format

Connectionless demultiplexingCreate sockets with port numbers:DatagramSocket mySocket1 = new DatagramSocket(12534);DatagramSocket mySocket2 = new DatagramSocket(12535);UDP socket identified by two-tuple:(dest IP address, dest port number)When host receives UDP segment:checks destination port number in segmentdirects UDP segment to socket with that port numberIP datagrams with different source IP addresses and/or source port numbers directed to same socket

Connectionless demux (cont)DatagramSocket serverSocket = new DatagramSocket(6428);

SP provides return address

Connection-oriented demuxTCP socket identified by 4-tuple: source IP addresssource port numberdest IP addressdest port numberreceiving host uses all four values to direct segment to appropriate socketServer host may support many simultaneous TCP sockets:each socket identified by its own 4-tupleWeb servers have different sockets for each connecting clientnon-persistent HTTP will have different socket for each request

Connection-oriented demux (cont)ClientIP:BserverIP: CSP: 9157DP: 80D-IP:CS-IP: AD-IP:CS-IP: BD-IP:CS-IP: B

Connection-oriented demux: Threaded Web ServerClientIP:BserverIP: CSP: 9157DP: 80P4D-IP:CS-IP: AD-IP:CS-IP: BD-IP:CS-IP: B

UDP: User Datagram Protocol [RFC 768]no frills, bare bones Internet transport protocolbest effort service, UDP segments may be:lostdelivered out of order to appconnectionless:no handshaking between UDP sender, receivereach UDP segment handled independently of others

Why is there a UDP?no connection establishment (which can add delay)simple: no connection state at sender, receiversmall segment headerno congestion control: UDP can blast away as fast as desired

UDP: moreoften used for streaming multimedia appsloss tolerantrate sensitiveother UDP usesDNSSNMPreliable transfer over UDP: add reliability at application layerapplication-specific error recovery!source port #dest port #32 bitsApplicationdata (message)UDP segment formatlengthchecksumLength, inbytes of UDPsegment,includingheader

UDP checksumSender:treat segment contents as sequence of 16-bit integerschecksum: addition (1s complement sum) of segment contentssender puts checksum value into UDP checksum field

Receiver:compute checksum of received segmentcheck if computed checksum equals checksum field value:NO - error detectedYES - no error detected. But maybe errors nonetheless? More later .

Goal: detect errors (e.g., flipped bits) in transmitted segment

Internet Checksum ExampleNoteWhen adding numbers, a carryout from the most significant bit needs to be added to the resultExample: add two 16-bit integers1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 01 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1

1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 01 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1wraparoundsumchecksum

Principles of Reliable data transferimportant in app., transport, link layerstop-10 list of important networking topics!

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Reliable data transfer: getting startedsendsidereceiveside

Reliable data transfer: getting startedWell:incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)consider only unidirectional data transferbut control info will flow on both directions!use finite state machines (FSM) to specify sender, receiverevent causing state transitionactions taken on state transitionstate: when in this state next state uniquely determined by next event

Rdt1.0: reliable transfer over a reliable channelunderlying channel perfectly reliableno bit errorsno loss of packetsseparate FSMs for sender, receiver:sender sends data into underlying channelreceiver read data from underlying channelWait for call from abovepacket = make_pkt(data)udt_send(packet)rdt_send(data)extract (packet,data)deliver_data(data)Wait for call from belowrdt_rcv(packet)senderreceiver

Rdt2.0: channel with bit errorsunderlying channel may flip bits in packetchecksum to detect bit errorsthe question: how to recover from errors:acknowledgements (ACKs): receiver explicitly tells sender that pkt received OKnegative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errorssender retransmits pkt on receipt of NAKnew mechanisms in rdt2.0 (beyond rdt1.0):error detectionreceiver feedback: control msgs (ACK,NAK) rcvr->sender

rdt2.0: FSM specificationWait for call from abovesnkpkt = make_pkt(data, checksum)udt_send(sndpkt)extract(rcvpkt,data)deliver_data(data)udt_send(ACK)rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)rdt_rcv(rcvpkt) && isACK(rcvpkt)udt_send(sndpkt)rdt_rcv(rcvpkt) && isNAK(rcvpkt)senderreceiverrdt_send(data)L

rdt2.0: operation with no errorsWait for call from abovesnkpkt = make_pkt(data, checksum)udt_send(sndpkt)extract(rcvpkt,data)deliver_data(data)udt_send(ACK)rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)rdt_rcv(rcvpkt) && isACK(rcvpkt)udt_send(sndpkt)rdt_rcv(rcvpkt) && isNAK(rcvpkt)Wait for call from belowrdt_send(data)L

rdt2.0: error scenarioWait for call from abovesnkpkt = make_pkt(data, checksum)udt_send(sndpkt)extract(rcvpkt,data)deliver_data(data)udt_send(ACK)rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)rdt_rcv(rcvpkt) && isACK(rcvpkt)udt_send(sndpkt)rdt_rcv(rcvpkt) && isNAK(rcvpkt)Wait for call from belowrdt_send(data)L

rdt2.0 has a fatal flaw!What happens if ACK/NAK corrupted?sender doesnt know what happened at receiver!cant just retransmit: possible duplicate

Handling duplicates: sender retransmits current pkt if ACK/NAK garbledsender adds sequence number to each pktreceiver discards (doesnt deliver up) duplicate pktSender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKsWait for call 0 from abovesndpkt = make_pkt(0, data, checksum)udt_send(sndpkt)rdt_send(data)udt_send(sndpkt)rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) ||isNAK(rcvpkt) )sndpkt = make_pkt(1, data, checksum)udt_send(sndpkt)rdt_send(data)rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) udt_send(sndpkt)rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) ||isNAK(rcvpkt) )rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) LL

rdt2.1: receiver, handles garbled ACK/NAKssndpkt = make_pkt(NAK, chksum)udt_send(sndpkt)rdt_rcv(rcvpkt) && not corrupt(rcvpkt) && has_seq0(rcvpkt)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) extract(rcvpkt,data)deliver_data(data)sndpkt = make_pkt(ACK, chksum)udt_send(sndpkt)rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) extract(rcvpkt,data)deliver_data(data)sndpkt = make_pkt(ACK, chksum)udt_send(sndpkt)rdt_rcv(rcvpkt) && (corrupt(rcvpkt)sndpkt = make_pkt(ACK, chksum)udt_send(sndpkt)rdt_rcv(rcvpkt) && not corrupt(rcvpkt) && has_seq1(rcvpkt)

rdt_rcv(rcvpkt) && (corrupt(rcvpkt)sndpkt = make_pkt(ACK, chksum)udt_send(sndpkt)sndpkt = make_pkt(NAK, chksum)udt_send(sndpkt)

rdt2.1: discussionSender:seq # added to pkttwo seq. #s (0,1) will suffice. Why?must check if received ACK/NAK corrupted twice as many statesstate must remember whether current pkt has 0 or 1 seq. #

Receiver:must check if received packet is duplicatestate indicates whether 0 or 1 is expected pkt seq #note: receiver can not know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocolsame functionality as rdt2.1, using ACKs onlyinstead of NAK, receiver sends ACK for last pkt received OKreceiver must explicitly include seq # of pkt being ACKed duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragmentssndpkt = make_pkt(0, data, checksum)udt_send(sndpkt)rdt_send(data)udt_send(sndpkt)rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) sender FSMfragmentrdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) extract(rcvpkt,data)deliver_data(data)sndpkt = make_pkt(ACK1, chksum)udt_send(sndpkt)rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))udt_send(sndpkt)receiver FSMfragmentL

rdt3.0: channels with errors and lossNew assumption: underlying channel can also lose packets (data or ACKs)checksum, seq. #, ACKs, retransmissions will be of help, but not enoughApproach: sender waits reasonable amount of time for ACK retransmits if no ACK received in this timeif pkt (or ACK) just delayed (not lost):retransmission will be duplicate, but use of seq. #s already handles thisreceiver must specify seq # of pkt being ACKedrequires countdown timer

rdt3.0 sendersndpkt = make_pkt(0, data, checksum)udt_send(sndpkt)start_timerrdt_send(data)rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) ||isACK(rcvpkt,1) )sndpkt = make_pkt(1, data, checksum)udt_send(sndpkt)start_timerrdt_send(data)rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) ||isACK(rcvpkt,0) )rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) stop_timerstop_timerudt_send(sndpkt)start_timertimeoutudt_send(sndpkt)start_timertimeoutrdt_rcv(rcvpkt)Lrdt_rcv(rcvpkt)LLL

rdt3.0 in action

Performance of rdt3.0rdt3.0 works, but performance stinksex: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

U sender: utilization fraction of time sender busy sending 1KB pkt every 30 msec -> 33kB/sec thruput over 1 Gbps linknetwork protocol limits use of physical resources!

U

sender

=

.008

30.008

RTT + L / R

L / R

=

=

0.00027 microseconds

rdt3.0: stop-and-wait operationfirst packet bit transmitted, t = 0senderreceiverRTT last packet bit transmitted, t = L / Rfirst packet bit arriveslast packet bit arrives, send ACKACK arrives, send next packet, t = RTT + L / R

U

sender

=

.008

30.008

RTT + L / R

L / R

=

=

0.00027 microseconds

Pipelined protocolsPipelining: sender allows multiple, in-flight, yet-to-be-acknowledged pktsrange of sequence numbers must be increasedbuffering at sender and/or receiverTwo generic forms of pipelined protocols: go-Back-N, selective repeat

Pipelining: increased utilizationfirst packet bit transmitted, t = 0senderreceiverRTT last bit transmitted, t = L / Rfirst packet bit arriveslast packet bit arrives, send ACKACK arrives, send next packet, t = RTT + L / Rlast bit of 2nd packet arrives, send ACKlast bit of 3rd packet arrives, send ACKIncrease utilizationby a factor of 3!

U

sender

=

.024

30.008

RTT + L / R

3 * L / R

=

=

0.0008 microseconds

Pipelining ProtocolsGo-back-N: overviewsender: up to N unACKed pkts in pipelinereceiver: only sends cumulative ACKsdoesnt ACK pkt if theres a gapsender: has timer for oldest unACKed pktif timer expires: retransmit all unACKed packetsSelective Repeat: overviewsender: up to N unACKed packets in pipelinereceiver: ACKs individual pktssender: maintains timer for each unACKed pktif timer expires: retransmit only unACKed packet

Go-Back-NSender:k-bit seq # in pkt headerwindow of up to N, consecutive unACKed pkts allowed

ACK(n): ACKs all pkts up to, including seq # n - cumulative ACKmay receive duplicate ACKs (see receiver)timer for each in-flight pkttimeout(n): retransmit pkt n and all higher seq # pkts in window

GBN: sender extended FSMstart_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])udt_send(sndpkt[nextseqnum-1])

timeout

rdt_send(data) if (nextseqnum < base+N) { sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum) udt_send(sndpkt[nextseqnum]) if (base == nextseqnum) start_timer nextseqnum++ }else refuse_data(data)base = getacknum(rcvpkt)+1If (base == nextseqnum) stop_timer else start_timerrdt_rcv(rcvpkt) && notcorrupt(rcvpkt)

base=1nextseqnum=1

rdt_rcv(rcvpkt) && corrupt(rcvpkt)

L

GBN: receiver extended FSMACK-only: always send ACK for correctly-received pkt with highest in-order seq #may generate duplicate ACKsneed only remember expectedseqnumout-of-order pkt: discard (dont buffer) -> no receiver buffering!Re-ACK pkt with highest in-order seq #Waitudt_send(sndpkt)default

rdt_rcv(rcvpkt) && notcurrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum) extract(rcvpkt,data)deliver_data(data)sndpkt = make_pkt(expectedseqnum,ACK,chksum)udt_send(sndpkt)expectedseqnum++expectedseqnum=1sndpkt = make_pkt(expectedseqnum,ACK,chksum)

L

GBN inaction

Selective Repeatreceiver individually acknowledges all correctly received pktsbuffers pkts, as needed, for eventual in-order delivery to upper layersender only resends pkts for which ACK not receivedsender timer for each unACKed pktsender windowN consecutive seq #sagain limits seq #s of sent, unACKed pkts

Selective repeat: sender, receiver windows

Selective repeatdata from above :if next available seq # in window, send pkttimeout(n):resend pkt n, restart timerACK(n) in [sendbase,sendbase+N]:mark pkt n as receivedif n smallest unACKed pkt, advance window base to next unACKed seq #

pkt n in [rcvbase, rcvbase+N-1]send ACK(n)out-of-order: bufferin-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pktpkt n in [rcvbase-N,rcvbase-1]ACK(n)otherwise: ignore

Selective repeat in action

Selective repeat: dilemmaExample: seq #s: 0, 1, 2, 3window size=3

receiver sees no difference in two scenarios!incorrectly passes duplicate data as new in (a)

Q: what relationship between seq # size and window size?

TCP: Overview RFCs: 793, 1122, 1323, 2018, 2581full duplex data:bi-directional data flow in same connectionMSS: maximum segment sizeconnection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchangeflow controlled:sender will not overwhelm receiverpoint-to-point:one sender, one receiver reliable, in-order byte steam:no message boundariespipelined:TCP congestion and flow control set window sizesend & receive buffers

TCP segment structureURG: urgent data (generally not used)ACK: ACK #validPSH: push data now(generally not used)RST, SYN, FIN:connection estab(setup, teardowncommands)# bytes rcvr willingto acceptcountingby bytes of data(not segments!)Internetchecksum(as in UDP)

TCP seq. #s and ACKsSeq. #s:byte stream number of first byte in segments dataACKs:seq # of next byte expected from other sidecumulative ACKQ: how receiver handles out-of-order segmentsA: TCP spec doesnt say, - up to implementerHost AHost BSeq=42, ACK=79, data = CSeq=79, ACK=43, data = CSeq=43, ACK=80UsertypesChost ACKsreceipt of echoedChost ACKsreceipt ofC, echoesback Csimple telnet scenario

TCP Round Trip Time and TimeoutQ: how to set TCP timeout value?longer than RTTbut RTT variestoo short: premature timeoutunnecessary retransmissionstoo long: slow reaction to segment lossQ: how to estimate RTT?SampleRTT: measured time from segment transmission until ACK receiptignore retransmissionsSampleRTT will vary, want estimated RTT smootheraverage several recent measurements, not just current SampleRTT

TCP Round Trip Time and TimeoutEstimatedRTT = (1- )*EstimatedRTT + *SampleRTTExponential weighted moving averageinfluence of past sample decreases exponentially fasttypical value: = 0.125

Example RTT estimation:

TCP Round Trip Time and TimeoutSetting the timeoutEstimtedRTT plus safety marginlarge variation in EstimatedRTT -> larger safety marginfirst estimate of how much SampleRTT deviates from EstimatedRTT: TimeoutInterval = EstimatedRTT + 4*DevRTTDevRTT = (1-)*DevRTT + *|SampleRTT-EstimatedRTT|

(typically, = 0.25) Then set timeout interval:

TCP reliable data transferTCP creates rdt service on top of IPs unreliable servicepipelined segmentscumulative ACKsTCP uses single retransmission timer

retransmissions are triggered by:timeout eventsduplicate ACKsinitially consider simplified TCP sender: ignore duplicate ACKsignore flow control, congestion control

TCP sender events:data rcvd from app:create segment with seq #seq # is byte-stream number of first data byte in segmentstart timer if not already running (think of timer as for oldest unACKed segment)expiration interval: TimeOutInterval timeout:retransmit segment that caused timeoutrestart timer ACK rcvd:if acknowledges previously unACKed segmentsupdate what is known to be ACKedstart timer if there are outstanding segments

TCP sender(simplified) NextSeqNum = InitialSeqNum SendBase = InitialSeqNum

loop (forever) { switch(event)

event: data received from application above create TCP segment with sequence number NextSeqNum if (timer currently not running) start timer pass segment to IP NextSeqNum = NextSeqNum + length(data)

event: timer timeout retransmit not-yet-acknowledged segment with smallest sequence number start timer

event: ACK received, with ACK field value of y if (y > SendBase) { SendBase = y if (there are currently not-yet-acknowledged segments) start timer }

} /* end of loop forever */ Comment: SendBase-1: last cumulatively ACKed byteExample: SendBase-1 = 71; y= 73, so the rcvr wants 73+ ; y > SendBase, so that new data is ACKed

TCP: retransmission scenariosHost ASeq=100, 20 bytes dataACK=100premature timeoutHost BSeq=92, 8 bytes dataACK=120Seq=92, 8 bytes dataACK=120Seq=92 timeoutSendBase= 100SendBase= 120SendBase= 120Sendbase= 100

TCP retransmission scenarios (more)SendBase= 120

TCP ACK generation [RFC 1122, RFC 2581]Event at Receiver

Arrival of in-order segment withexpected seq #. All data up toexpected seq # already ACKed

Arrival of in-order segment withexpected seq #. One other segment has ACK pending

Arrival of out-of-order segmenthigher-than-expect seq. # .Gap detected

Arrival of segment that partially or completely fills gap

TCP Receiver action

Delayed ACK. Wait up to 500msfor next segment. If no next segment,send ACK

Immediately send single cumulative ACK, ACKing both in-order segments

Immediately send duplicate ACK, indicating seq. # of next expected byte

Immediate send ACK, provided thatsegment starts at lower end of gap

Fast Retransmittime-out period often relatively long:long delay before resending lost packetdetect lost segments via duplicate ACKs.sender often sends many segments back-to-backif segment is lost, there will likely be many duplicate ACKs for that segment

If sender receives 3 ACKs for same data, it assumes that segment after ACKed data was lost:fast retransmit: resend segment before timer expires

Host AtimeoutHost BtimeXresend seq X2seq # x1seq # x2seq # x3seq # x4seq # x5ACK x1ACK x1ACK x1ACK x1tripleduplicateACKs

Fast retransmit algorithm: event: ACK received, with ACK field value of y if (y > SendBase) { SendBase = y if (there are currently not-yet-acknowledged segments) start timer } else { increment count of dup ACKs received for y if (count of dup ACKs received for y = 3) { resend segment with sequence number y } a duplicate ACK for already ACKed segmentfast retransmit

TCP Flow Controlreceive side of TCP connection has a receive buffer:speed-matching service: matching send rate to receiving applications drain rateapp process may be slow at reading from buffer

TCP Flow control: how it works(suppose TCP receiver discards out-of-order segments)unused buffer space:= rwnd= RcvBuffer-[LastByteRcvd - LastByteRead]receiver: advertises unused buffer space by including rwnd value in segment headersender: limits # of unACKed bytes to rwndguarantees receivers buffer doesnt overflow

TCP Connection ManagementRecall: TCP sender, receiver establish connection before exchanging data segmentsinitialize TCP variables:seq. #sbuffers, flow control info (e.g. RcvWindow)client: connection initiator Socket clientSocket = new Socket("hostname","port number"); server: contacted by client Socket connectionSocket = welcomeSocket.accept();Three way handshake:Step 1: client host sends TCP SYN segment to serverspecifies initial seq #no dataStep 2: server host receives SYN, replies with SYNACK segmentserver allocates buffersspecifies server initial seq. #Step 3: client receives SYNACK, replies with ACK segment, which may contain data

TCP Connection Management (cont.)Closing a connection:client closes socket: clientSocket.close(); Step 1: client end system sends TCP FIN control segment to server Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN.

TCP Connection Management (cont.)Step 3: client receives FIN, replies with ACK. Enters timed wait - will respond with ACK to received FINs Step 4: server, receives ACK. Connection closed. Note: with small modification, can handle simultaneous FINs.clientFINserverACKACKFINclosingclosingclosedtimed waitclosed

TCP Connection Management (cont)TCP clientlifecycleTCP serverlifecycle

Principles of Congestion ControlCongestion:informally: too many sources sending too much data too fast for network to handledifferent from flow control!manifestations:lost packets (buffer overflow at routers)long delays (queueing in router buffers)a top-10 problem!

Causes/costs of congestion: scenario 1 two senders, two receiversone router, infinite buffers no retransmission

large delays when congestedmaximum achievable throughput

Causes/costs of congestion: scenario 2 one router, finite buffers sender retransmission of lost packet

finite shared output link buffersHost Alin : original dataHost Bloutl'in : original data, plus retransmitted data

Causes/costs of congestion: scenario 2 always: (goodput)perfect retransmission only when loss:retransmission of delayed (not lost) packet makes larger (than perfect case) for same

costs of congestion: more work (retrans) for given goodputunneeded retransmissions: link carries multiple copies of pkt

Causes/costs of congestion: scenario 3 four sendersmultihop pathstimeout/retransmit

Q: what happens as and increase ?finite shared output link bufferslin : original dataloutl'in : original data, plus retransmitted data

Causes/costs of congestion: scenario 3 another cost of congestion: when packet dropped, any upstream transmission capacity used for that packet was wasted!lout

Approaches towards congestion controlend-end congestion control:no explicit feedback from networkcongestion inferred from end-system observed loss, delayapproach taken by TCPnetwork-assisted congestion control:routers provide feedback to end systemssingle bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)explicit rate sender should send attwo broad approaches towards congestion control:

Case study: ATM ABR congestion controlABR: available bit rate:elastic service if senders path underloaded: sender should use available bandwidthif senders path congested: sender throttled to minimum guaranteed rateRM (resource management) cells:sent by sender, interspersed with data cellsbits in RM cell set by switches (network-assisted) NI bit: no increase in rate (mild congestion)CI bit: congestion indicationRM cells returned to sender by receiver, with bits intact

Case study: ATM ABR congestion controltwo-byte ER (explicit rate) field in RM cellcongested switch may lower ER value in cellsender send rate thus maximum supportable rate on pathEFCI bit in data cells: set to 1 in congested switchif data cell preceding RM cell has EFCI set, sender sets CI bit in returned RM cell

TCP congestion control:goal: TCP sender should transmit as fast as possible, but without congesting networkQ: how to find rate just below congestion leveldecentralized: each TCP sender sets its own rate, based on implicit feedback: ACK: segment received (a good thing!), network not congested, so increase sending ratelost segment: assume loss due to congested network, so decrease sending rate

TCP congestion control: bandwidth probingprobing for bandwidth: increase transmission rate on receipt of ACK, until eventually loss occurs, then decrease transmission rate continue to increase on ACK, decrease on loss (since available bandwidth is changing, depending on other connections in network)

ACKs being received, so increase rateXXXXsending ratetimeQ: how fast to increase/decrease?details to followTCPssawtoothbehavior

TCP Congestion Control: detailssender limits rate by limiting number of unACKed bytes in pipeline:

cwnd: differs from rwnd (how, why?)sender limited by min(cwnd,rwnd)roughly,

cwnd is dynamic, function of perceived network congestion

LastByteSent-LastByteAcked cwnd

cwndbytesRTT

TCP Congestion Control: more detailssegment loss event: reducing cwndtimeout: no response from receivercut cwnd to 13 duplicate ACKs: at least some segments getting through (recall fast retransmit)cut cwnd in half, less aggressively than on timeoutACK received: increase cwndslowstart phase: increase exponentially fast (despite name) at connection start, or following timeoutcongestion avoidance: increase linearly

TCP Slow Startwhen connection begins, cwnd = 1 MSSexample: MSS = 500 bytes & RTT = 200 msecinitial rate = 20 kbpsavailable bandwidth may be >> MSS/RTTdesirable to quickly ramp up to respectable rateincrease rate exponentially until first loss event or when threshold reacheddouble cwnd every RTTdone by incrementing cwnd by 1 for every ACK received

Host Aone segmentRTTHost Btwo segmentsfour segments

Transitioning into/out of slowstartssthresh: cwnd threshold maintained by TCPon loss event: set ssthresh to cwnd/2remember (half of) TCP rate when congestion last occurred when cwnd >= ssthresh: transition from slowstart to congestion avoidance phase

L

TCP: congestion avoidancewhen cwnd > ssthresh grow cwnd linearlyincrease cwnd by 1 MSS per RTT approach possible congestion slower than in slowstartimplementation: cwnd = cwnd + MSS/cwnd for each ACK received

ACKs: increase cwnd by 1 MSS per RTT: additive increaseloss: cut cwnd in half (non-timeout-detected loss ): multiplicative decrease

AIMDAIMD: Additive IncreaseMultiplicative Decrease

TCP congestion control FSM: overviewnew ACKloss:3dupACKloss:3dupACK

TCP congestion control FSM: detailsL

Popular flavors of TCPssthreshssthreshTCP TahoeTCP RenoTransmission roundcwnd window size (in segments)

Summary: TCP Congestion Controlwhen cwnd < ssthresh, sender in slow-start phase, window grows exponentially.when cwnd >= ssthresh, sender is in congestion-avoidance phase, window grows linearly.when triple duplicate ACK occurs, ssthresh set to cwnd/2, cwnd set to ~ ssthreshwhen timeout occurs, ssthresh set to cwnd/2, cwnd set to 1 MSS.

TCP throughputQ: whats average throughout of TCP as function of window size, RTT?ignoring slow startlet W be window size when loss occurs.when window is W, throughput is W/RTTjust after loss, window drops to W/2, throughput to W/2RTT. average throughout: .75 W/RTT

TCP Futures: TCP over long, fat pipesexample: 1500 byte segments, 100ms RTT, want 10 Gbps throughputrequires window size W = 83,333 in-flight segmentsthroughput in terms of loss rate: L = 210-10 Wownew versions of TCP for high-speed

TCP Fairnessfairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

Why is TCP fair?Two competing sessions:Additive increase gives slope of 1, as throughout increasesmultiplicative decrease decreases throughput proportionally RRequal bandwidth shareConnection 1 throughputConnection 2 throughputcongestion avoidance: additive increaseloss: decrease window by factor of 2congestion avoidance: additive increaseloss: decrease window by factor of 2

Fairness (more)Fairness and UDPmultimedia apps often do not use TCPdo not want rate throttled by congestion controlinstead use UDP:pump audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connectionsnothing prevents app from opening parallel connections between 2 hosts.web browsers do this example: link of rate R supporting 9 connections; new app asks for 1 TCP, gets rate R/10new app asks for 11 TCPs, gets R/2 !

Chapter 3: Summaryprinciples behind transport layer services:multiplexing, demultiplexingreliable data transferflow controlcongestion controlinstantiation and implementation in the InternetUDPTCPNext:leaving the network edge (application, transport layers)into the network core

Kurose and Ross forgot to say anything about wrapping the carry and adding it to low order bit

chapter3 5th aug 2009

Documents