Transport Layer 3-1
Introduction to Communication 89350, Lectures 3-5: The Transport Layer
(chapter 3 – Transport)
Prof. Amir Herzberg
Computer Networking: A Top Down Approach Featuring the Internet, 3rd edition. Jim Kurose, Keith Ross. Addison-Wesley, July 2005.
Based on foils by Kurose & Ross ©, see: http://www.aw.com/kurose-ross/
My site: http://amirherzberg.com
Transport Layer 3-2
Today (& Chapter 3): outline
3.1 Transport-layer services
3.2 Multiplexing and demultiplexing
3.3 Connectionless transport: UDP
3.4 Principles of reliable data transfer
3.5 Connection-oriented transport: TCP
  segment structure
  reliable data transfer
  flow control
  connection management
3.6 Principles of congestion control
3.7 TCP congestion control
Transport Layer 3-3
Processes & communication
Process: program running within a host.
Within the same host, two processes communicate using inter-process communication (IPC). (Not in this course.)
Processes in different hosts communicate by exchanging messages. (In this course.)
Client process: process that initiates communication
Server process: process that waits to be contacted; always on
Transport Layer 3-4
Transport services and protocols
Communication between application processes (not just hosts)
Run in end systems:
  send side: send segments via network layer
  rcv side: reassemble segments to messages
  pass to application via socket (buffer)
Internet transport protocols: TCP and UDP
[Figure: two end hosts with full protocol stacks (application, transport, network, data link, physical) joined by intermediate nodes that have only network, data link, and physical layers; an arrow labeled "logical end-end transport" connects the two transport layers.]
Transport Layer 3-5
Internet transport-layer protocols
Connection-oriented: TCP
  reliable, in-order delivery
  congestion control
  flow control
  connection setup
Connectionless: UDP
  unreliable, unordered delivery
  "best-effort", like IP
Neither ensures:
  delay guarantees
  bandwidth guarantees
[Figure: end-to-end protocol stack diagram, "logical end-end transport" between the transport layers of the two end hosts.]
Transport Layer 3-7
Multiplexing of Packets within Host
Network layer delivers packet to correct host (by IP address)
Transport layer delivers packet to correct process
How to identify process (in host)? Use 16-bit port number (in Transport Layer header)
  Multiplexing: adding port identifiers to packet delivered to IP
  Demultiplexing: deliver packets to correct apps (using ports in packet)
Server: fixed port # <1024, so requests reach server
  HTTP server: 80
  Mail server: 25
Client: OS-assigned (`random`), dynamic port >1024
  Multiple client processes can work with same server
Process queues data of a connection/port via socket
Transport Layer 3-8
Sockets and Sockets API
Processes send/receive messages via socket API
  API = Application Program Interface
Socket: in/out buffer
  Metaphor: mailbox
Sockets API functions:
  Send, receive
  Connection (TCP, not UDP): open, close
  Fix a few parameters, e.g. buffer size (more on this later)
  Choose transport service…
[Figure: two hosts (client / server) connected through the Internet; in each host a process sits above a socket, which sits above TCP/UDP.]
Transport Layer 3-10
How demultiplexing works
Host receives datagram
  Datagram carries (one) transport-layer segment
  Segment has source, destination port numbers
Use IP addresses & ports to identify socket
How? Depends…
  UDP: one socket per port
  TCP server: one socket per <SrcIP, SrcPrt, DstIP, DstPrt>
[Figure: Internet layer packet = IP header (source IP address, dest IP address, other Internet layer header fields) followed by the transport layer segment. The segment, drawn 32 bits wide, holds source port #, dest port #, other transport header fields, and application data (message).]
Transport Layer 3-12
NAPT/NAT: Network Address (Port) Translation
Goal: share IP addresses among multiple hosts
How: identify host by port
  1. SrcIP=10.0.0.1, SrcPort=3373, DstPort=80
  2. SrcIP=133.44.5.8, SrcPort=6678, DstPort=80
  3. DstIP=133.44.5.8, SrcPort=80, DstPort=6678
  4. DstIP=10.0.0.1, SrcPort=80, DstPort=3373
Transport Layer 3-13
NAPT: Network Address/Port Translation
[Figure: local network with addresses 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.4 behind a NAT router whose WAN-side address is 138.76.29.7.]
1: Host 10.0.0.1 sends datagram to 128.119.40.186, 80
   S: 10.0.0.1, 3345  D: 128.119.40.186, 80
2: NAT router changes datagram source addr from 10.0.0.1, 3345 to 138.76.29.7, 5001, updates table
   S: 138.76.29.7, 5001  D: 128.119.40.186, 80
3: Reply arrives, dest. address: 138.76.29.7, 5001
   S: 128.119.40.186, 80  D: 138.76.29.7, 5001
4: NAT router changes datagram dest addr from 138.76.29.7, 5001 to 10.0.0.1, 3345
   S: 128.119.40.186, 80  D: 10.0.0.1, 3345
NAT translation table
  WAN side addr        LAN side addr
  138.76.29.7, 5001    10.0.0.1, 3345
  ……                   ……
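The four steps above can be sketched as a small translation table. This is an illustrative toy, not a real NAT implementation: the class name `Napt`, its methods, and the choice to hand out WAN ports sequentially from 5001 are assumptions made only to reproduce the numbers on the slide.

```python
import itertools

class Napt:
    """Toy NAPT box (hypothetical): one public IP shared by many LAN hosts."""

    def __init__(self, public_ip, first_port=5001):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)  # next free WAN-side port
        self.out = {}    # (lan_ip, lan_port) -> wan_port
        self.back = {}   # wan_port -> (lan_ip, lan_port)

    def outbound(self, src, dst):
        """Step 2: rewrite the source of a LAN->WAN datagram, updating the table."""
        if src not in self.out:
            wan_port = next(self._ports)
            self.out[src] = wan_port
            self.back[wan_port] = src
        return (self.public_ip, self.out[src]), dst

    def inbound(self, src, dst):
        """Step 4: rewrite the destination of a reply; None if no table entry."""
        ip, port = dst
        if ip != self.public_ip or port not in self.back:
            return None
        return src, self.back[port]

nat = Napt("138.76.29.7")
server = ("128.119.40.186", 80)
step2 = nat.outbound(("10.0.0.1", 3345), server)
step4 = nat.inbound(server, ("138.76.29.7", 5001))
print(step2)
print(step4)
```

A reply addressed to a WAN port with no table entry is simply dropped in this sketch (`inbound` returns `None`).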
Transport Layer 3-14
UDP: Connectionless demultiplexing
Create sockets with port numbers:
  DatagramSocket mySocket1 = new DatagramSocket(9111);
  DatagramSocket mySocket2 = new DatagramSocket();
UDP socket identified by destination port (and IP)
When host receives UDP segment:
  checks destination port number in segment
  directs UDP segment to socket with that port number
  regardless of source IP and port
Transport Layer 3-15
UDP: Connectionless demux (cont)
  DatagramSocket serverSocket = new DatagramSocket(6428);
[Figure: clients at IP A and IP B and a server at IP C. Segments travel to the server with SP: 9157, DP: 6428 and SP: 5775, DP: 6428; the replies carry SP: 6428, DP: 9157 and SP: 6428, DP: 5775.]
DP = Destination Port, SP = Source Port
Client's SP provides "return address" (server's DP)
Client's DP (destination port on server) known to client (e.g. standard)
Transport Layer 3-16
TCP: Connection-oriented demux
TCP socket identified by 4-tuple:
  source IP address
  source port number
  dest IP address
  dest port number
Recv host uses all four values to direct segment to appropriate socket
Server host may support many simultaneous TCP sockets:
  each socket identified by its own 4-tuple
E.g., web servers have different sockets for each connecting client
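The contrast between UDP and TCP demultiplexing can be sketched with two lookup tables. The `Host` class and the socket names below are hypothetical; the fallback to a listening socket for an unknown 4-tuple is an assumption added so that new connections have somewhere to go.

```python
class Host:
    """Toy demultiplexer for one host (illustrative only)."""

    def __init__(self):
        self.udp = {}          # dst_port -> socket name
        self.tcp_conn = {}     # (src_ip, src_port, dst_ip, dst_port) -> socket name
        self.tcp_listen = {}   # dst_port -> listening socket name

    def demux_udp(self, src_ip, src_port, dst_ip, dst_port):
        # UDP looks only at the destination port; the source is ignored.
        return self.udp.get(dst_port)

    def demux_tcp(self, src_ip, src_port, dst_ip, dst_port):
        # TCP uses the full 4-tuple; unknown tuples fall back to the listener.
        key = (src_ip, src_port, dst_ip, dst_port)
        return self.tcp_conn.get(key, self.tcp_listen.get(dst_port))

server = Host()
server.udp[6428] = "udp-socket"
server.tcp_listen[80] = "listener"
server.tcp_conn[("A", 9157, "C", 80)] = "conn-A"
server.tcp_conn[("B", 5775, "C", 80)] = "conn-B"

# Two UDP clients reach the same socket; two TCP clients reach different ones.
print(server.demux_udp("A", 9157, "C", 6428), server.demux_udp("B", 5775, "C", 6428))
print(server.demux_tcp("A", 9157, "C", 80), server.demux_tcp("B", 5775, "C", 80))
```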
Transport Layer 3-17
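TCP: Connection-oriented demux
[Figure: clients at IP A and IP B and a server at IP C. Segments travel to the server with SP: 9157, DP: 80 and SP: 5775, DP: 80; the replies carry SP: 80, DP: 9157 and SP: 80, DP: 5775. Several processes (P1, P3, P4) are shown.]
What is different here? Would this work well for web server?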
Transport Layer 3-18
TCP: Connection-oriented demux
[Figure: same clients (IP A, IP B) and server (IP C) with segments SP: 9157, DP: 80 and SP: 5775, DP: 80 and replies SP: 80, DP: 9157 and SP: 80, DP: 5775.]
Multithreading server may listen to multiple ports
Transport Layer 3-19
Internet transport layer services
TCP service:
  connection-oriented: setup overhead required
    TCP setup: one round trip
  reliable transport: no loss (but some overhead)
  flow control: feed (send) at receiver's speed
  congestion control: throttle sender when network overloaded
  does not provide: delay, bandwidth guarantees
UDP service:
  Datagram: no setup
  Unreliable transport: loss
  does not provide:
    Reliability
    Flow/congestion control
    Delay or bandwidth guarantees
Others (e.g. RDP)
Q: Would we ever choose UDP?
Transport Layer 3-20
Choosing transport service…
Reliability, congestion, flow controls
  TCP: yes, UDP: no
  Overhead, slowdown
  Often required, e.g. file transfer, e-mail
  Not always, e.g., audiocast
Size and complexity
  Code size, complexity are sometimes critical (… UDP)
  E.g., dust computer, sensor
Bandwidth & Delay Guarantee
  Not in TCP, UDP
  Ok for "elastic" applications
    Use whatever bandwidth & delay they get
    E.g. e-mail, Web
  Phone & other apps require minimal Quality of Service
Setup overhead?
  TCP: yes, UDP: no
  Ignore in long-lived app, e.g. file transfer, telnet
  Sometimes significant, e.g. DNS queries
Transport Layer 3-22
UDP: User Datagram Protocol [RFC 768]
"No frills," "bare bones" transport protocol
"Best effort" service, UDP segments may be:
  lost
  delivered out of order to app
Connectionless:
  no handshaking
  each UDP segment handled independently of others
Why is there a UDP?
  no connection setup (delay)
  simple: no connection state at sender, receiver
  no retransmission
  no congestion control: UDP can blast away as fast as desired (or more…)
  smaller segment header
Transport Layer 3-23
UDP: header and more
UDP often used for streaming multimedia apps
  loss tolerant
  rate sensitive
Other UDP uses
  DNS (why?)
  SNMP (Simple Network Management Protocol)
UDP congestion/flow control?
  By application; flexible
  Limited buffer per port [notes]
Reliability over UDP?
  Checksum detects (most) corruptions [hidden foils]
  Requires application-specific error recovery!
UDP segment format (each row is 32 bits wide):
  source port # | dest port #
  length        | checksum
  application data (message)
Length: in bytes of UDP segment, including header
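To make "checksum detects (most) corruptions" concrete, here is a minimal sketch of the 16-bit one's-complement Internet checksum that UDP uses. It is simplified: a real UDP checksum also covers a pseudo-header with the IP addresses, which this sketch omits, and the function names are illustrative.

```python
def ones_sum(data: bytes) -> int:
    """16-bit one's-complement sum of the data (zero-padded to even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return total

def internet_checksum(data: bytes) -> int:
    """Sender side: complement of the one's-complement sum."""
    return ~ones_sum(data) & 0xFFFF

def verify(data: bytes, checksum: int) -> bool:
    """Receiver side: sum plus checksum must be all ones (0xFFFF)."""
    total = ones_sum(data) + checksum
    total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF

msg = b"hello world"
c = internet_checksum(msg)
print(verify(msg, c), verify(b"hellp world", c))
```

The checksum only detects corruption; as the slide says, recovery is left to the application.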
Transport Layer 3-27
Principles of Reliable data transfer
Important in app., transport, link layers
Top-10 list of important networking topics!
Characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Transport Layer 3-29
RDT Protocol: Events
Finite state machine
Inputs:
  rdt_send(m): message m from application
  rdt_rcv(p): packet p from (unreliable) channel
Outputs:
  udt_send(p): packet p to (unreliable) channel
  deliver_data(m): message m to application
  ready( ): to receive another message from application
Execution:
  Sequence of input, output events
  Outputs defined by protocol
Transport Layer 3-30
Reliable data transfer: plan
We'll:
Present a few (improving) versions of RDT protocol
  From strong to weak assumptions on input events (channel)
  Later, also add more inputs/outputs…
Consider only unidirectional message transfer
  but bidirectional packets (& control info) flow
Define protocol by pseudo-code or state diagram:
  States S={S1,S2,…}
  Events E={e1,e2,…}
  Actions A={a1,a2,…}
  Transition rules, S×E → A×S, e.g.:
[Figure: state diagram with states S1, S2, S3. Event e1 causes transition S1→S2, with action a1 taken on the state transition; further arcs are labeled "Event e2 / Action a2" and "Event e3 / Action a3".]
Transport Layer 3-31
Rdt1.0: reliable transfer over a reliable channel
Underlying channel (udt) perfectly reliable
  no bit errors
  no loss of packets
  no need to set up and tear down connection (always up)
Separate state machines for sender, receiver:
  sender sends data into underlying channel
  receiver reads data from underlying channel
Sender FSM, single state "Wait for call from above":
  rdt_send(data): packet = make_pkt(data); udt_send(packet)
Receiver FSM, single state "Wait for call from below":
  rdt_rcv(packet): extract(packet,data); deliver_data(data)
Transport Layer 3-32
Rdt2.0: udt channel with bit errors
Underlying channel may flip bits in packet
  recall: UDP checksum to detect bit errors
  Assume: errors detected, no packet loss / reordering
The question: how to recover from errors:
  acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  sender retransmits pkt on receipt of NAK
New mechanisms in rdt2.0 (beyond rdt1.0):
  error detection
  receiver feedback: control msgs (ACK, NAK) rcvr->sender
  `ready` output [why?]
Transport Layer 3-33
rdt2.0: FSM specification
Sender:
  State "Wait for call from above":
    rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
  State "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt); stay in this state
    rdt_rcv(rcvpkt) && isACK(rcvpkt): ready; go to "Wait for call from above"
Receiver:
  State "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt,data); deliver_data(data); udt_send(ACK)
Transport Layer 3-34
rdt2.0: operation with no errors
[Figure: the rdt2.0 sender and receiver FSMs, repeated for the error-free case: rdt_send(data) sends sndpkt; the receiver sees notcorrupt(rcvpkt), delivers the data and sends ACK; the sender sees isACK(rcvpkt) and is `ready` again.]
Transport Layer 3-35
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSMs, repeated for the error case: the receiver sees corrupt(rcvpkt) and sends NAK; the sender sees isNAK(rcvpkt) and calls udt_send(sndpkt) again.]
Transport Layer 3-36
Analysis of RDT 2.0
Goals of RDT2.0:
  Delivered messages are a prefix of sent messages
  Eventually, all messages get delivered, and sender is `ready` for more
Assumptions of RDT 2.0:
  Every packet sent is eventually delivered
  But: possibly corrupted (always detected)
  1-to-1 mapping from packets delivered to packets sent
Goals are not met. Why?
Transport Layer 3-37
rdt2.0 has a fatal flaw!
What happens if ACK/NAK corrupted?
  1st: add checksum… but sender still doesn't know what happened at receiver!
  can't just retransmit: possible duplicate
What to do?
  sender ACKs/NAKs receiver's ACK/NAK? What if this ACK/NAK garbled?
  Retransmit on garbled. Danger: duplicates
Handling duplicates:
  sender adds sequence number to each pkt
  sender retransmits current pkt if ACK/NAK garbled
  receiver discards (doesn't deliver) duplicate pkt
Sender sends one packet, then waits for receiver's response; use (1-bit) sequence number
RDT 2.1, aka `stop & wait` protocol
Transport Layer 3-40
RDT2.1 in pseudo-code…
Sender:
  state=ready; i=0;
  On rdt_send(m):
    if state=ready then { p=<m,i,EDC(m,i)>; udt_send(p); state=wait; }
  On rdt_rcv(p'):
    if p'=<"ack",EDC("ack")> then { i=i+1 mod 2; state=ready; }
    else udt_send(p);
Receiver:
  i=0;
  On rdt_rcv(p):
    if p=<m,j,EDC(m,j)> then {
      c=EDC("ack"); udt_send("ack",c);
      if i=j then { deliver(m); i=i+1 mod 2; }
    }
    else udt_send("nack");
EDC: Error Detection Code (e.g. checksum)
Transport Layer 3-41
Analysis of RDT 2.1 [sketch]
Goals, Assumptions: same as RDT 2.0
  Every packet sent is eventually delivered
  But: possibly corrupted (always detected)
  1-to-1 mapping from packets delivered to packets sent
Define `legitimate` states: [E=Even, O=Odd]
  1. RE/RO: Ready (to receive E/O message); channel empty
  2. ME/MO: Wait for E/O Ack; channel has new E/O message
  3. AE/AO: Wait for E/O Ack; channel has E/O Ack
  4. MER/MOR: Wait for E/O Ack; channel has E/O message which was already delivered
  5. WEAO: Wait for E Ack but channel has O Ack
  6. WOAE: Wait for O Ack but channel has E Ack
Prove by induction: system stays in legitimate state!
Transport Layer 3-42
Analysis of RDT 2.1 [sketch, cont']
Claim: system always in legitimate state
Base: initial state is (legitimate) RE
Consider moves from legitimate states:
  1. RE/RO: only to ME/MO
  2. ME/MO: If received Ok, move to AE/AO; otherwise, to WEAO/WOAE
  3. AE/AO: If received Ok, move to RO/RE; otherwise, to MER/MOR
  4. MER/MOR: same as ME/MO!
  5. WEAO/WOAE: move to MER/MOR
     If received correctly… and also if corrupted!
System always stays in legit state!! □
Transport Layer 3-43
RDT2.1: Discussion
Nack/Ack, 1-bit counter
Handles (detectable) packet corruptions
Assume:
  No processor faults
  No connection setup / teardown (always up)
  All errors in packets detected (by checksum)
  No packet loss / reordering / duplication
    Duplication of message is no problem
    But duplicate ACK can cause loss
RDT2.2: handle duplication: add seq # to ack…
Transport Layer 3-44
RDT2.2: duplicate-tolerant protocol
Same functionality as rdt2.1, using ACKs only
Instead of NAK, receiver sends ACK for last pkt received OK
  Explicitly include seq # of pkt being ACKed
  Resilient to packet (incl. Ack) duplication
    • Exercise: why does RDT2.1 fail if Ack is duplicated?
  Important: allows sender to retransmit… see later!
Duplicate ACK at sender results in same action as NAK: retransmit current pkt
Transport Layer 3-46
RDT2.2: pseudo-code…
Sender:
  state=ready; i=0;
  On rdt_send(m):
    if state=ready then { p=<m,i,EDC(m,i)>; udt_send(p); state=wait; }
  On rdt_rcv(p'):
    Let c=EDC("ack",i);
    if p'=<"ack",i,c> then { i=i+1 mod 2; state=ready; }
    else udt_send(p);
Receiver:
  i=0;
  On rdt_rcv(p'):
    if p'=<m,i,EDC(m,i)> then { p=<"ack",i,EDC("ack",i)>; udt_send(p); deliver(m); i=i+1 mod 2; }
    else udt_send(p);
Transport Layer 3-47
RDT3.0: Alternating Bit Protocol
Allow packet loss: channel can also lose packets (data or ACKs)
  checksum, seq. #, ACKs, retransmissions are not enough
Q: how to deal with loss?
1st idea: sender waits maximal round-trip time (RTT), then retransmits
  Pro: simple, no duplicates
  Good for fixed, known RTT
  What for variable RTT??
Better idea: sender waits "reasonable" amount of time for ACK (< max RTT)
  retransmits if no ACK received during this time
  if pkt (or ACK) just delayed (not lost): retransmission will be duplicate, but use of seq. #'s already handles this
    Recall: receiver includes seq # of pkt being ACKed
  requires countdown timer
Handles errors, dup and loss, but no reordering
Transport Layer 3-49
Alternating Bit Protocol (RDT3.0): pseudo-code for sender…
  state=ready; i=0;
  On rdt_send(m):
    if state=ready then { p=<m,i,EDC(m,i)>; udt_send(p); state=wait; sleep("timeout",T); }
  On rdt_rcv(p'):
    if p'=<"ack",i,EDC("ack",i)> then { i=i+1 mod 2; state=ready; abort("timeout"); }
  On WakeUp("timeout"):
    { udt_send(p); sleep("timeout",T); }
Recipient: as in RDT 2.2!
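The sender and receiver above can be exercised in a small simulation. This is a sketch under stated assumptions, not the lecture's code: the channel only loses packets (no corruption, no reordering), losses are scripted through a hypothetical `drops` set of transmission indices, and a "timeout" is modeled simply as the loop retrying when no matching ACK comes back.

```python
def alternating_bit(messages, drops):
    """Send `messages` over a channel that loses the transmissions whose
    0-based index is in `drops` (data packets and ACKs share one counter).
    Returns (delivered messages, number of data transmissions)."""
    delivered, sent = [], 0
    tx = 0          # index of the next transmission on the channel
    recv_bit = 0    # receiver: sequence bit it expects next

    def channel(pkt):
        nonlocal tx
        lost = tx in drops
        tx += 1
        return None if lost else pkt

    for k, m in enumerate(messages):
        bit = k % 2
        while True:                           # one pass = one (re)transmission
            sent += 1
            pkt = channel((m, bit))
            ack = None
            if pkt is not None:
                data, seq = pkt
                if seq == recv_bit:           # new packet: deliver, flip bit
                    delivered.append(data)
                    recv_bit ^= 1
                ack = channel(("ack", seq))   # ACK names the packet just received
            if ack == ("ack", bit):
                break                         # abort("timeout"); next message
            # otherwise the timer expires and the same packet is resent
    return delivered, sent

print(alternating_bit(["a", "b", "c"], drops={0, 3}))
print(alternating_bit(["a", "b"], drops={1}))
```

In the second call the ACK for "a" is lost, so "a" is retransmitted; the receiver recognizes the duplicate by its sequence bit and re-ACKs without delivering it twice.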
Transport Layer 3-53
Alternating bit (rdt 3.0): problems
No pipelining (at most one packet on link)
  Performance problem
Assumes FIFO (no reordering)
  But datagram networks may not ensure FIFO!
No setup and teardown of connection
  Teardown required to allow appl to take corrective/alternate action if not connected
  And to avoid keeping state indefinitely
  Also: to handle processor faults
Transport Layer 3-54
Rest of Transport Layer…
  TCP Overview
  TCP Reliability
  TCP Timeouts
  TCP Flow Control
  3.6 Principles of congestion control
  3.7 TCP congestion control
  TCP Fairness
Transport Layer 3-55
TCP: Overview
RFCs: 793, 1122, 1323, 2018, 2581
Point-to-point: one sender, one receiver
Reliable, in-order byte stream: no "message boundaries"
Pipelined: TCP congestion and flow control set window size
Send & receive buffers
Full duplex data: bi-directional data flow in same connection
Connection-oriented:
  Handshake before data exchange
  Teardown (to free resources, detect failure)
Flow control: sender will not overwhelm receiver
[Figure: application writes data through a socket door into the TCP send buffer; segments carry it to the TCP receive buffer, from which the application reads data through its socket door.]
Transport Layer 3-56
TCP segment structure (each row is 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | Receive window
  checksum | Urg data pointer
  Options (variable length)
  application data (variable length)
Sequence and acknowledgement numbers: counting by bytes of data (not segments!)
U (URG): urgent data (generally not used)
A (ACK): this is an ack of ack-number
P (PSH): push data (don't aggregate)
RST, SYN, FIN: connection establishment (setup, teardown commands)
Receive window: amount of data rcvr is willing to accept (the slide says # segments; the header field itself counts bytes)
Checksum: Internet checksum (as in UDP)
Transport Layer 3-57
TCP Services
TCP connection-based transport service:
  Setup connection
  Use connection
  Close connection
TCP connection services:
  Flow control: don't send faster than receiver can absorb [later]
  Congestion control: slow down if network gets congested [later]
  Fairness: fair sharing of Net resources among connections [later]
  Reliable, pipelined stream service: next!
Transport Layer 3-58
TCP reliable data transfer
TCP: reliability on top of IP's unreliable service
  Packet loss / error
  Duplicates
  Out of order
  Setup/teardown
  Loss of state in proc.
Pipelined segments
  hybrid of GBN (Go-Back-N) & SR (Selective Repeat)
  But a single retransmission timer
  Substantial flexibility
Cumulative acks
  Resend (duplicate) ACK instead of NACK
Retransmissions are triggered by:
  timeout events
  duplicate ACKs (later)
Initially consider simplified TCP sender:
  ignore duplicate acks
  ignore flow control, congestion control
  Window of N bytes
Transport Layer 3-59
Reliable (TCP) Connection
What is a reliable (TCP) connection?
Reliable connection management
  Setup
  Teardown
Reliable communication in connection
  Delivery
  FIFO…
Connection fails if and only if packets cannot pass
Transport Layer 3-60
Reliable connection: More Interface…
Connection status at both ends:
  Link Up/Down indicators [Sender: also Ready/Wait]
  Open/Close Connection commands
  rdt_send, deliver only when link Up [sender: and Ready]
[Figure: send side and receive side, each with Open/Close commands and a Link up / down (Ok/Fault) indicator; the send side also has a Ready / Wait indicator.]
Transport Layer 3-61
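TCP Reliability
Reliable Connection Management
  Setup (SYN, SYN-ACK), tear-down (later)
Reliable Communication in Connection
  Counters, timeout, retransmit [like RDT3.0]
[Figure: timeline between A (Client) and B (Server): Syn, SynAck, after which both sides are Up; data packets numbered 1, 2, 2 with Ack1, where one transmission is lost (X) and resent after a Timeout.]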
Transport Layer 3-62
Reliable Connection Management
Three basic states:
  Up
  Down
  Syn (in progress)
  Teardown states: later
Send, receive only when connection is Up
Connection setup (SYN):
  Completes in finite time
  But may fail
Connection closure (FIN):
  Completes in finite time
Transport Layer 3-63
TCP Connection States
Three basic states:
  Up
  Down
  Syn (in progress)
Send, receive only when connection Up
[Figure: timelines for A and B alternating between Syn and Up periods; several Syn attempts are lost (X) before both sides reach Up.]
Transport Layer 3-64
TCP Connection Teardown
Ordered and Forced Teardown mechanisms
Forced Teardown:
  When `fatal` error detected: send RST signal and abort (break) connection
  Same when receiving RST signal
    • If sequence number in RST is within `window` [why?]
Ordered Teardown:
  Ensure all messages sent are received
  Ensure peer will also close connection
Transport Layer 3-65
TCP Connection Tear-Down
Closing a connection:
  client closes socket: clientSocket.close();
Step 1: client end system sends TCP FIN control segment to server
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN.
[Figure: client/server timeline. Client: close, sends FIN, receives ACK and FIN, sends ACK, timed wait, closed. Server: closing, closed. Steps 1 and 2 are marked.]
Transport Layer 3-66
TCP Connection Tear-Down (cont.)
Step 3: client receives FIN, replies with ACK.
  Enters "timed wait": will respond with ACK to received FINs (WHY??)
Step 4: server receives ACK. Connection closed.
Exercise: Modify to allow (also) server to initiate TearDown (`simultaneous FINs`)
[Figure: client/server timeline with FIN, ACK, FIN, ACK exchanged; both sides pass through closing; the client stays in timed wait before closed. Steps 1 to 4 are marked.]
Transport Layer 3-67
TCP Reliable Connection Closure
Fin, Fin-Wait: new states
  Fin: closure begun; receive OK but don't send
  Fin-Wait: wait (30 sec), to ensure peer is done
    • Only by closure initiator (usually client)
Initiator: send FIN, wait for ACK and FIN
Responder: ACK (to FIN), last messages, FIN
  Done when receiving ACK to FIN
[Figure: timeline between A and B, both initially Up. Messages exchanged: FIN, Ack, data, Ack, FIN, Ack. States shown: Fin, Fin-Wait (30 sec), Down.]
Transport Layer 3-70
Reliable Connection Management
Send, receive only when connection Up
Setup (SYN): completes/fails in finite time
Closure (FIN): completes/fails in finite time
Reset (RST): only due to disconnection
Each Up at A (B) maps to unique Syn at B (A)
[Figure: timelines for A and B alternating between Syn and Up periods; several Syn attempts are lost (X).]
Transport Layer 3-72
Mapping Between Connections
Each Up at A (B) maps to unique Syn at B (A)
Use mapping functions MAPA, MAPB
  b=MAPA(a): map from A's Up to B's Syn
  In example: MAPA is identity, i.e. MAPA(a)=a
  But: MAPB(1)=1, MAPB(2)=4 !
In general: MAP is monotonic, one-to-one
[Figure: timelines for A and B with numbered periods: Syn(1), Up(1), Syn(2), Syn(3), Syn(4), Up(2), Up(3); several Syn attempts are lost (X).]
Transport Layer 3-73
Stateful Connection Management
Reliable connection management is easy…
If we count Syn periods:
  Each party maintains Syn periods counter
  Send your Syn counter with SYN
  Echo other party's Syn counter in SYN-ACK
  Ignore SYN-ACK with old echoed counter
[Figure: timelines for A and B with numbered Syn and Up periods, as on the previous slide. Message labels: Syn,1; Syn,1;1; Ack,1; Fin,1; Syn,2; Syn,3; Syn,2;3; Ack,2; Ack,3; Syn,4. Several messages are lost (X).]
Transport Layer 3-74
Stateless Connection Management
Stateful connection management is easy
But servers, clients may lose state
And… too much state: should cnn.com keep state for each client (forever)?
And… how to identify repeating client?
  IP address: which (DHCP, NAPT)?
  IP + port: fails for NAPT; ports are temporary
Need Stateless Connection Management
Use random numbers instead of counters!
Transport Layer 3-75
Stateless Connection Management
Use random seq# to identify Syn period
  Select, send your seq# with SYN, SYN-ACK
  Echo other party's seq# with SYN-ACK, ACK
  Ignore SYN responses with wrong (old) echoed identifier
[Figure: timelines for A and B with numbered Syn and Up periods. Message labels: 37; 66;A37; A66; F37; 88; 42; 91;A42; A91; A63; 34; 13:A37. Several messages are lost (X).]
Transport Layer 3-76
Stateless Three Way Handshake
Setup connection
  client: initiator
  server: responding
Each chooses a random seq#
  Match SYN, SYN-ACK
  Seq# to ensure no loss of data
Use flags in TCP header: SYN, ACK, RST, FIN
Three way handshake:
Step 1: client sends TCP SYN segment to server
  Specify client seq#
  no data
Step 2: server receives SYN, replies with SYN-ACK segment
  Allocate buffers
  Ack client's seq#
  Send server's seq#
  no data
Step 3: client receives SYN-ACK, validates #, Acks server # and (optionally) sends data
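The three steps can be sketched as message construction plus one validation check. The dictionaries and function names below are illustrative assumptions. The sketch follows the lecture's simplified notation, in which each side echoes the peer's seq# unchanged; real TCP acknowledges the peer's initial sequence number plus one.

```python
import random

def make_syn(seq=None):
    """Step 1: client picks a random 32-bit seq# (or uses the one given)."""
    if seq is None:
        seq = random.randrange(2**32)
    return {"flags": "SYN", "seq": seq}

def make_syn_ack(syn, seq=None):
    """Step 2: server picks its own seq# and echoes the client's."""
    if seq is None:
        seq = random.randrange(2**32)
    return {"flags": "SYN-ACK", "seq": seq, "echo": syn["seq"]}

def client_accepts(syn, syn_ack):
    """Step 3 validation: the echo must match the SYN of the current attempt."""
    return syn_ack["flags"] == "SYN-ACK" and syn_ack["echo"] == syn["seq"]

def make_ack(syn_ack):
    """Step 3: client echoes the server's seq#."""
    return {"flags": "ACK", "echo": syn_ack["seq"]}

old_syn = make_syn(37)                   # an earlier, abandoned attempt
stale = make_syn_ack(old_syn, 66)        # late reply to that old attempt
new_syn = make_syn(42)
fresh = make_syn_ack(new_syn, 91)
print(client_accepts(new_syn, stale), client_accepts(new_syn, fresh))
print(make_ack(fresh))
```

Because the echoed number identifies the Syn period, the late reply to the old attempt is ignored without either side keeping per-peer counters.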
Transport Layer 3-77
Reliable Communication
Whenever a connection terminates (even with failure)
  Sequence received is prefix of sequence sent
  No gaps
  No reordering
  No insertions
  No duplications
Well-terminated connections:
  At one or (usually) both ends
    • Can't guarantee to always detect at both ends
  All bytes delivered
    • Sequence received = sequence sent
    • No truncations
Transport Layer 3-78
TCP Reliability in Connection
Sequence numbers kept by both parties
  32-bit counters, initialized randomly
  When overflow, `wrap` to begin from zero
Send with each packet sent to peer:
  Sequence number of first byte sent in this packet
  Seq. number of next byte expected from peer
    All bytes till this one were received Ok (ACK)
[Figure: exchange between A and B]
  A→B: SYN,37
  B→A: 66;A37
  A→B: "Hi",37,A66
  B→A: "Ready",66,A39
  A→B: "Wait",39,A71
  B→A: 71;A43
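The numbers in the exchange above follow from one rule: each side's next sequence number advances by the number of payload bytes it sent, modulo 2^32. A small sketch that reproduces the trace, assuming nothing is lost so that every packet acknowledges everything the peer has sent so far (the `trace` helper is hypothetical):

```python
MOD = 2**32   # 32-bit sequence numbers wrap around

def trace(a_start, b_start, exchange):
    """exchange: list of (sender, payload) with sender "A" or "B".
    Returns one (payload, seq, ack) tuple per packet."""
    nxt = {"A": a_start, "B": b_start}   # next byte each side will send
    out = []
    for sender, payload in exchange:
        peer = "B" if sender == "A" else "A"
        seq = nxt[sender]
        out.append((payload, seq, nxt[peer]))        # ack = next byte expected
        nxt[sender] = (seq + len(payload)) % MOD
    return out

for pkt in trace(37, 66, [("A", b"Hi"), ("B", b"Ready"), ("A", b"Wait"), ("B", b"")]):
    print(pkt)
```

The last tuple has an empty payload: it is the pure acknowledgement 71;A43 from the slide.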
Transport Layer 3-79
Loss and Retransmissions
Packet or Ack is lost
  Often due to congested router
TCP retransmits lost packet
Loss detected by timeout
  Another method later
[Figure: the exchange SYN,37; 66;A37; "Hi",37,A66; "Ready",66,A39; "Wait",39,A71; 71;A43 between A and B, where one transmission is lost (X); after a Timeout, A retransmits "Wait",39,A71.]
Transport Layer 3-80
Premature Timeout → Duplicates
What if packet is delayed, but not lost?
  Often due to congested router…
Delayed too long → TCP retransmits it → duplication
No harm: TCP discards duplicate packets
  Send another ACK, in case first one was lost
[Figure: the exchange SYN,37; 66;A37; "Hi",37,A66; "Ready",66,A39; "Wait",39,A71; 71;A43 between A and B, with a Timeout firing at A before the delayed response arrives.]
Transport Layer 3-81
Stop-and-wait [RDT3.0]: inefficiency
[Figure: sender/receiver timeline]
  first packet bit transmitted, t = 0
  last packet bit transmitted, t = L / R
  first packet bit arrives at receiver
  last packet bit arrives, receiver sends ACK
  ACK arrives, send next packet, t = RTT + L / R
U_sender = (L / R) / (RTT + L / R)
Transport Layer 3-82
Pipeline / Window for Efficiency
Protocol as described is inefficient:
  Transmission rate = 10 Mb/sec
  Max segment size (MSS) = 1250 B = 10000 bits
  Round Trip Time (RTT) = 100 msec
    Includes transmit, propagation, queuing
  File size = 125 KB = 1 million bits = 100 segments
  Time to send = 100 segments * 100 msec = 10 sec
Idea: send file without waiting for ACK?
  Transmission time = (1 Mb)/(10 Mb/sec) = 100 msec
  Transmission time + RTT = 200 msec
  Send 100 segments in pipeline / window
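The arithmetic of this example can be checked directly. The helper names below are illustrative; the stop-and-wait figure here includes the 1 msec transmission time per segment that the slide's 10 sec estimate rounds away.

```python
def utilization(L, R, rtt):
    """Fraction of time a stop-and-wait sender is transmitting: (L/R)/(RTT + L/R)."""
    return (L / R) / (rtt + L / R)

def stop_and_wait_time(n_segments, L, R, rtt):
    """Each segment costs its transmission time plus one round trip."""
    return n_segments * (rtt + L / R)

def pipelined_time(file_bits, R, rtt):
    """Send everything back-to-back, then wait one round trip for the last ACK."""
    return file_bits / R + rtt

R = 10e6                 # 10 Mb/sec
L = 1250 * 8             # MSS in bits
RTT = 0.100              # seconds
FILE_BITS = 125_000 * 8  # 125 KB
N = FILE_BITS // L       # number of segments

print(N, stop_and_wait_time(N, L, R, RTT), pipelined_time(FILE_BITS, R, RTT))
print(utilization(L, R, RTT))
```

With these numbers stop-and-wait takes about 10.1 sec against about 0.2 sec pipelined, and the stop-and-wait sender is busy roughly 1% of the time.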
Transport Layer 3-83
Pipelined protocols
Pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  Need `real` sequence numbers (not 1 bit)
    also deals with out-of-order (old) packets
    TCP: count bytes, begin with random numbers from setup
  buffering at sender and (optionally) receiver
Pipelining improves link utilization
Transport Layer 3-84
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight]
  first packet bit transmitted, t = 0
  last bit transmitted, t = L / R
  first packet bit arrives at receiver
  last packet bit arrives, receiver sends ACK
  last bit of 2nd packet arrives, receiver sends ACK
  last bit of 3rd packet arrives, receiver sends ACK
  ACK arrives, send next packet, t = RTT + L / R
Transport Layer 3-85
TCP seq. #'s and ACKs with Pipeline
Seq. #'s:
  byte stream "number" of first byte in segment's data
ACKs:
  seq # of next byte expected from other side
  cumulative ACK
Q: how receiver handles out-of-order segments
  A: ignore old pkts; buffer or ignore future packets
[Figure: simple telnet scenario between Host A and Host B, time running downward]
  A→B: Seq=42, ACK=79, data = 'Req'   (user sends 'Req')
  B→A: Seq=79, ACK=45, data = 'Ok'    (host ACKs receipt of 'Req', sends back 'Ok')
  A→B: Seq=45, ACK=81                 (client ACKs receipt of 'Ok')
Transport Layer 3-86
TCP: retransmission scenarios
Lost ACK scenario (Host A sends, Host B receives):
  A→B: Seq=92, 8 bytes data
  B→A: ACK=100, lost (X)
  Seq=92 timeout at A
  A→B: Seq=92, 8 bytes data (retransmission)
  B→A: ACK=100
  SendBase = 100
Premature timeout scenario:
  A→B: Seq=92, 8 bytes data
  A→B: Seq=100, 20 bytes data
  B→A: ACK=100
  B→A: ACK=120
  Seq=92 timeout at A before the ACKs arrive
  A→B: Seq=92, 8 bytes data (retransmission)
  SendBase = 100, then SendBase = 120 as the ACKs arrive
  B→A: ACK=120 (for the duplicate); SendBase stays 120
Transport Layer 3-87
TCP retransmission scenarios (more)
Cumulative ACK scenario (Host A sends, Host B receives):
  A→B: Seq=92, 8 bytes data
  A→B: Seq=100, 20 bytes data
  B→A: ACK=100, lost (X)
  B→A: ACK=120, arrives before the timeout
  SendBase = 120; nothing is retransmitted
Transport Layer 3-88
TCP sender events (simplified):
TCP_send(m):
  segment=<seq#, m>; seq # is byte-stream number of first data byte in segment
  start timer if not already running
    Timer is for oldest unacked segment
Timeout:
  retransmit segment that caused timeout
  restart timer
Ack rcvd:
  If acknowledges previously unacked segments
    update what is known to be acked
    start timer if there are outstanding segments
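The three events above can be sketched as a small class. This is an illustrative model only: the class and attribute names are assumptions, the timer is reduced to a boolean flag, and sequence-number wraparound, flow control, and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Toy model of the simplified TCP sender events (hypothetical names)."""

    def __init__(self, isn):
        self.send_base = isn       # oldest unacknowledged byte
        self.next_seq = isn        # seq # of the next new byte
        self.unacked = {}          # seq -> payload of segments in flight
        self.timer_running = False
        self.wire = []             # (seq, payload) in transmission order

    def send(self, payload):
        seq = self.next_seq
        self.unacked[seq] = payload
        self.wire.append((seq, payload))
        self.next_seq += len(payload)
        self.timer_running = True  # started if it was not already running

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)    # oldest unacked segment
        self.wire.append((seq, self.unacked[seq]))
        self.timer_running = True  # restart timer

    def on_ack(self, y):
        if y > self.send_base:     # acknowledges previously unacked data
            self.send_base = y
            done = [s for s, p in self.unacked.items() if s + len(p) <= y]
            for s in done:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)

# Cumulative ACK scenario: ACK=100 is lost, ACK=120 covers both segments.
s = SimpleTcpSender(92)
s.send(b"x" * 8)
s.send(b"y" * 20)
s.on_ack(120)
print(s.send_base, s.unacked, s.timer_running)
```

Because ACKs are cumulative, the single ACK=120 clears both segments and stops the timer, so nothing is retransmitted.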
Transport Layer 3-90
Duplicate Ack & Fast Retransmit
Time-out period often relatively long:
  long delay before resending lost packet
Detect lost segments via duplicate ACKs.
  Sender often sends many segments back-to-back
  If segment is lost, there will likely be many duplicate ACKs.
If sender receives 3 ACKs for the same data, it supposes that segment after ACKed data was lost:
  fast retransmit: resend segment before timer expires
  Why not two? Exercise…
Transport Layer 3-91
Fast retransmit algorithm:
On TCP_RCV(ACK,y) // and valid checksum
  if (y > SendBase) then {
    remove y-SendBase bytes from Unacked;
    SendBase = y;
    if (NextSeqNum > SendBase) then start timer;
  } else {   // a duplicate ACK for already ACKed segment
    increment count of dup ACKs received for y;
    if (count of dup ACKs received for y = 3) {
      resend segment with sequence number y;   // fast retransmit
    }
  }
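A runnable sketch of the ACK handler above, under the same simplifications (no wraparound, timer reduced to a flag). The class name and the string return values are assumptions added so the behavior can be observed; the duplicate counter is reset on every new ACK.

```python
class AckTracker:
    """Counts duplicate ACKs and reports when fast retransmit should fire."""

    def __init__(self, send_base, next_seq_num):
        self.send_base = send_base
        self.next_seq_num = next_seq_num
        self.dup_acks = 0
        self.timer_running = True
        self.retransmitted = []    # seq #s resent by fast retransmit

    def on_ack(self, y):
        if y > self.send_base:                    # new data acknowledged
            self.send_base = y
            self.dup_acks = 0
            self.timer_running = self.next_seq_num > self.send_base
            return "new"
        self.dup_acks += 1                        # duplicate ACK
        if self.dup_acks == 3:
            self.retransmitted.append(y)          # resend segment starting at y
            return "fast-retransmit"
        return "dup"

tracker = AckTracker(send_base=100, next_seq_num=180)
events = [tracker.on_ack(y) for y in (120, 120, 120, 120)]
print(events, tracker.retransmitted)
```

The first ACK=120 is new; the third duplicate after it triggers the resend of the segment starting at byte 120, before any timeout.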
Transport Layer 3-92
TCP ACK generation [RFC 1122, RFC 2581]
Event at receiver: Arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed.
  TCP receiver action: Delayed ACK. Wait 200 ms (up to 500 ms) for next segment. If no next segment, send ACK.
Event at receiver: Arrival of in-order segment with expected seq #. One other segment has ACK pending.
  TCP receiver action: Immediately send single cumulative ACK, ACKing both in-order segments.
Event at receiver: Arrival of higher-than-expected sequence number (gap).
  TCP receiver action: Immediately send duplicate ACK, indicating seq. # of next expected byte; usually, buffer data.
Transport Layer 3-95
Pipelining in TCP: Summary
Cumulative & Delayed Acknowledgements
  Ack with seq # n → all bytes up to n received
  Send Ack only after receiving 2 MSS or 200 msec
Recipient usually buffers out-of-order bytes
  As long as they are inside the window
  Optional selective-ack mechanism [RFC2018; we ignore]
Sender retransmits
  On timeout, or `fast retransmit` (3 dup ack)
  Usually: resend only oldest (first) unACKed packet
Transport Layer 3-96
Rest of Transport Layer…
  TCP Overview
  TCP Reliability
  TCP Flow Control
  3.6 Principles of congestion control
  TCP Timeouts
  3.7 TCP congestion control
  TCP Fairness
Transport Layer 3-97
TCP Header: RcvWindow
RcvWindow: TCP's flow control
  Maximal amount of un-acked data peer can send (counted on these slides in segments; the header field itself is in bytes)
TCP header layout:
  Source Port | Destination Port
  Sequence Number
  Acknowledge Number
  Hdr Len | RSV | Flags | RcvWindow
  Checksum | Urgent pointer
  Options (variable length)
  Payload (Application Data)
Transport Layer 3-98
TCP Flow Control
Problem: application on receiver may be slower than network => no place for incoming data
Flow Control: prevent receiver buffer overflow
  By limiting amount of (unacked) segments in transit
  Send: RcvWindow = (|buffer| - |used|) / MSS
  MSS (Max Segment Size): indicated by initiator at SYN
  Ignore segments received out of order
[Figure: receiver buffer. Data received from the net fills the "used" part, the application reads from it, and the remaining free part is RcvWindow (bytes).]
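A one-function Python sketch of the advertised window, using the slide's formula RcvWindow = (|buffer| - |used|) / MSS. The function and parameter names are hypothetical, and rounding down to whole segments is an assumption of this sketch.

```python
def rcv_window_segments(buffer_bytes, used_bytes, mss):
    """Advertised window in whole segments: (|buffer| - |used|) / MSS."""
    free_bytes = max(buffer_bytes - used_bytes, 0)  # never advertise < 0
    return free_bytes // mss                        # round down (assumption)
```

For example, an 8000-byte buffer with 3000 bytes still unread and MSS = 1000 advertises 5 segments; once the buffer is nearly full the advertised window drops to 0 and the sender must pause.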
Transport Layer 3-99
Window Size (1)
Window size is critical for performance
If RcvWindow is too small, performance suffers
  Even if network is available and application clears buffers
Example: consider simple network below
  Client and server connected by two routers
  W = Window size (in bits)
  R = Transmission rate of all links
  RTT = Round Trip Time (RTT >> MSS/R)
Transport Layer 3-101
[Figure: timeline. The sender transmits a window for W/R, then idles until the Ack arrives after RTT, then transmits for W/R again.]
Average Rate ≤ [R*(W/R) + 0*(RTT - W/R)] / RTT = W/RTT
Average Rate ≤ R
Average Rate ≤ min(R, W/RTT)
Transport Layer 3-102
Optimal Receiver Window
Average Rate = min{R, W/RTT}
Flow control is not the bottleneck, as long as W ≥ R * RTT
  Receiver buffer ≥ Bandwidth x Delay
  If window too small: slows down connection
  Fat links: huge Bandwidth x Delay (e.g. satellite)
Aggressive flow control: send RcvWindow larger than available buffers
  By estimating drain rate; risk of buffer overflow
  Works well when delay does not change much
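A short Python sketch of the bound Average Rate ≤ min(R, W/RTT) and of the window needed so that flow control is not the bottleneck. The function names and the example numbers (a 10 Mbps link, a 0.5 sec RTT, a 65,535-byte window) are illustrative assumptions, not values from the lecture.

```python
def average_rate_bound(R, W, rtt):
    """Upper bound on average rate (bits/sec): min(R, W/RTT).

    R: link rate in bits/sec, W: window in bits, rtt: seconds.
    """
    return min(R, W / rtt)


def min_window_bits(R, rtt):
    """Smallest window (bits) with W >= R * RTT: bandwidth x delay."""
    return R * rtt


# Illustrative numbers (assumptions): 10 Mbps link, RTT = 0.5 sec.
R = 10_000_000
RTT = 0.5
W = 65_535 * 8                                # 65,535-byte window, in bits
print(average_rate_bound(R, W, RTT))          # window-limited, well below R
print(min_window_bits(R, RTT))                # window needed to fill the link
```

With these numbers the window, not the link, limits the rate: W/RTT is about 1 Mbps, while filling the link would need a window of 5,000,000 bits.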
Transport Layer 3-103
TCP Buffers - Critical Resource
TCP Receiver Buffers: critical for performance
But large buffers limit number of connections
  10,000 connections, 10 KB each => 100 MB
  Costs and performance implications
  Abused by SYN-flooding DoS attack (Net. Security course)
Servers should minimize open connections
  Client should close connection (and wait 30 sec)
  Many servers initiate close, but don't wait 30 sec
  • If Ack is lost, client may time out (abnormal close)
Transport Layer 3-105
Lecture 7 outline
TCP Overview; TCP Flow Control; 3.6 Principles of congestion control; TCP Timeouts; 3.7 TCP congestion control; TCP Fairness
Transport Layer 3-106
Principles of Congestion Control
Congestion: informally, "too many sources sending too much data too fast for network to handle"
  Compare with flow control (limited by receiver's buffers)
Manifestations:
  lost packets (buffer overflow at routers)
  long delays (queueing in router buffers)
A top-10 problem!
  Main cause for packet loss
  Main cause for delay and jitter (changing delay)
Transport Layer 3-107
Congestion scenario 1: infinite buffers
Two senders, two receivers, one router
Throughput limit: in-rate, capacity: λout = min(λin, c/2)
  Assume no loss, no retransmissions
Congestion: as λin reaches c/2, throughput is maximal but delays grow without bound
What if buffers are finite?
[Figure: Hosts A and B send original data at rate λin through one router with unlimited shared output link buffers; λout is the output rate.]
Transport Layer 3-108
Congestion scenario 2: finite buffers
One router, finite buffers; λout = λin
λ'in ≥ λin since sender retransmits packet if…
  Packet or Ack lost due to buffer overflow (or noise)
  Excessive delay: if timeout < |queue| ∙ transmit time + AckTime
c/2 ≥ λ'in ≥ λ'out ≥ λout since delivery is only in order, no dups
[Figure: Hosts A and B share one router with finite shared buffers. λin: original data; λ'in: data + retransmissions; output rates λ'out and λout.]
Transport Layer 3-109
Congestion scenario 2: finite buffers
One router, finite buffers; λout = λin
Sender retransmits packet if…
  Lost due to buffer overflow (or noise)
  Excessive delay (> timeout)
No congestion: queue usually not full
  No retransmission, no loss: λout = λ'out = λ'in = λin
  Happens if λin << c/2 (capacity), for both senders…
Congestion: queues often/usually full
  Packet loss: whenever queue full (i.e., usually…): λ'in ≥ λ'out
  Cannot deliver retransmissions, out-of-order packets: λ'out > λout = λin
  • Especially with Go-Back-N!
  Excessive delay (at least |Q| * TransmitTime)
  Overhead: retransmissions due to loss, time-out
Transport Layer 3-110
Causes/costs of congestion: scenario 3
Four senders
Multihop paths
Timeout/retransmit
Q: what happens as λin and λ'in increase?
[Figure: Hosts A and B with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: output rate.]
Transport Layer 3-111
Causes/costs of congestion: scenario 3
Another "cost" of congestion: when a packet is dropped, any upstream transmission capacity used for that packet was wasted!
  Causes sender to retransmit… vicious cycle!
[Figure: Host A, Host B; axis label λout.]
Transport Layer 3-112
Approaches towards congestion control
Two broad approaches towards congestion control:
End-end congestion control:
  no explicit feedback from network
  congestion inferred from end-system observed loss, delay
  approach taken by TCP
Network-assisted congestion control:
  routers provide feedback to end systems:
    To source, e.g. ICMP Source Quench packet
    Or via destination, often piggybacked on message; may use flow control
    From single `congestion bit` (SNA) to explicit rate (ATM ABR): in book
  Main issue: router efficiency; avoid state
Transport Layer 3-113
Lecture 7 outline
TCP Overview; TCP Timeouts; TCP Flow Control; 3.6 Principles of congestion control; 3.7 TCP congestion control; TCP Fairness
Transport Layer 3-114
TCP Congestion Control
End-end control (no network assistance)
  Assume all senders comply
  Protocols must be `TCP friendly`
Sender limits transmission: LastByteSent - LastByteAcked ≤ min(CongWin, RcvWin)
Roughly, rate = CongWin / RTT bytes/sec (RTT = Round Trip Time)
CongWin is dynamic, function of perceived congestion
How does sender perceive congestion?
  loss event = timeout or 3 duplicate acks
  Sender reduces rate (CongWin) after loss
Two modes:
  `Slow start`: with exponential build-up
  Congestion avoidance (AIMD)
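The sender-side limit and the rough rate estimate can be sketched as two small Python functions. The names `can_send` and `approx_rate_bytes_per_sec` and their parameters are hypothetical; all quantities are in bytes and seconds.

```python
def can_send(last_byte_sent, last_byte_acked, nbytes, cong_win, rcv_win):
    """True if sending nbytes more keeps un-ACKed data within the limit
    LastByteSent - LastByteAcked <= min(CongWin, RcvWin)."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= min(cong_win, rcv_win)


def approx_rate_bytes_per_sec(cong_win, rtt):
    """Rough sending rate: CongWin / RTT (bytes/sec)."""
    return cong_win / rtt
```

Whichever of the two windows is smaller governs: flow control protects the receiver, congestion control protects the network.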
Transport Layer 3-115
TCP Slow Start
Slow start: when connection begins, CongWin = 1 MSS (MSS = Maximal Segment Size)
  Example: MSS = 500 bytes & RTT = 200 msec
  initial rate = (8*500)/0.2 = 20 kbps
Available bandwidth may be >> MSS/RTT
  Need to increase congestion window (and rate)
  By how much? Rapid, exponential build-up
  Until we detect possible congestion (by loss)…
Transport Layer 3-116
TCP Slow Start
Exponential rate build-up until first loss event:
  double CongWin every RTT
  by incrementing CongWin (by 1 MSS) for every ACK received
Summary: initial rate is slow (CongWin = 1 MSS) but ramps up rapidly
Loss: change to Congestion Avoidance (AIMD)…
[Figure: Host A sends Host B one segment in the first RTT, two segments in the second, four segments in the third.]
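A tiny Python simulation shows why incrementing CongWin by 1 MSS per ACK doubles it every RTT. It assumes no loss, one ACK per segment (no delayed ACKs), and that a full window is sent each RTT; the function name `slow_start_rounds` is hypothetical.

```python
def slow_start_rounds(rounds, mss=500):
    """Return CongWin (bytes) at the start of each RTT during slow start."""
    cong_win = mss                      # connection begins with 1 MSS
    history = [cong_win]
    for _ in range(rounds):
        acks = cong_win // mss          # one ACK per segment sent this RTT
        for _ in range(acks):
            cong_win += mss             # +1 MSS for every ACK received
        history.append(cong_win)
    return history


print(slow_start_rounds(3))             # [500, 1000, 2000, 4000]
```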
Transport Layer 3-117
TCP Congestion Avoidance (AIMD: Additive Increase, Multiplicative Decrease)
Multiplicative Decrease: cut CongWin in half after 3 duplicate Acks (of possibly lost packet) [TimeOut: see later…]
Additive Increase: increase CongWin by 1 MSS every RTT in the absence of loss events
[Figure: congestion window (8, 16, 24 Kbytes) versus time for a long-lived TCP connection.]
Transport Layer 3-118
Response to TimeOut vs. 3dupACKs
After 3 dup ACKs:
  CongWin is cut in half (multiplicative decrease)
  window then grows linearly (additive increase)
After Timeout: restart connection
  CongWin set to 1 MSS; exponential build-up
  Up to a threshold, then grows linearly (AIMD)
Philosophy:
• Duplicate ACK sent if segments received out of order (maybe some lost) => network delivers some segments
• Timeout happens if no segments get through => no connection, need drastic response (restart)
• Restart is like start, except threshold…
• What's the threshold?
Transport Layer 3-119
Restart Threshold
Restart is like slow start, except exponential ramp-up ends at threshold (not failure), then switch to congestion avoidance (AIMD)
Idea: slow down before reaching rate (CongWin) which (may have) caused congestion
Implementation: variable Threshold
  At loss event, Threshold is set to 1/2 of CongWin just before loss event
Transport Layer 3-122
TCP Round Trip Time and Timeouts
Q: how to set TCP timeout value?
  longer than RTT, but RTT varies
  too short: premature timeout, unnecessary retransmissions
  too long: slow reaction to segment loss
Q: how to estimate RTT?
  SampleRTT: measured time from segment transmission until ACK receipt
    ignore retransmissions
  SampleRTT will vary, want estimated RTT "smoother"
    average several recent measurements, not just current SampleRTT
TimeoutInterval = EstimatedRTT + SafetyMargin
(RTT = Round Trip Time)
Transport Layer 3-123
Difficulties to `measure delay`
Delayed (accumulated) Acks
For retransmitted segments, can't tell whether acknowledgement is response to original transmission or retransmission
  Don't estimate delay from retransmissions
Network conditions may change suddenly
  Yet don't overreact to impact of burst traffic
Solution applicable to many situations of estimating dynamic quantities!
Transport Layer 3-124
TCP Round Trip Time and Timeout
EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT
  Exponential weighted moving average
  influence of past sample decreases exponentially fast
  typical value: α = 0.125
[Figure: SampleRTT and EstimatedRTT; RTT (milliseconds, 100 to 350) versus time (seconds, 1 to 106).]
Usually: symmetric, no queuing delay
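The exponential weighted moving average is one line of Python. The function name is hypothetical; α = 0.125 is the typical value given on the slide.

```python
def update_estimated_rtt(estimated, sample, alpha=0.125):
    """EstimatedRTT = (1 - alpha)*EstimatedRTT + alpha*SampleRTT."""
    return (1 - alpha) * estimated + alpha * sample


# One outlier sample of 180 ms moves a 100 ms estimate only to 110 ms.
print(update_estimated_rtt(100.0, 180.0))   # 110.0
```

Each new sample contributes only 1/8 of its value, so a single burst of queueing delay barely moves the estimate, while a sustained change is tracked within a few dozen samples.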
Transport Layer 3-125
TCP Timeout: Safety Margin
Setting the timeout
  EstimatedRTT plus "safety margin"
  large variation in EstimatedRTT -> larger safety margin (usually changes in queuing delay)
First: estimate how much SampleRTT deviates from EstimatedRTT:
  DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
Then set timeout interval:
  TimeoutInterval = EstimatedRTT + SafetyMargin = EstimatedRTT + 4*DevRTT
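Both updates and the resulting timeout fit in one Python function. The function name is hypothetical. The slide does not fix the order of the two updates; this sketch assumes DevRTT is updated first, using the previous EstimatedRTT, as in RFC 6298.

```python
def update_timeout(estimated, dev, sample, alpha=0.125, beta=0.25):
    """Return (EstimatedRTT, DevRTT, TimeoutInterval) after one sample."""
    # Assumption: deviation is measured against the old estimate.
    dev = (1 - beta) * dev + beta * abs(sample - estimated)
    estimated = (1 - alpha) * estimated + alpha * sample
    return estimated, dev, estimated + 4 * dev


# EstimatedRTT = 100 ms, DevRTT = 10 ms, then a 180 ms sample arrives.
print(update_timeout(100.0, 10.0, 180.0))   # (110.0, 27.5, 220.0)
```

The outlier raises the estimate by only 10 ms but raises the timeout from 140 ms to 220 ms, because the safety margin reacts to the increased variation.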
Transport Layer 3-126
Summary: TCP Congestion Control
Initially: no (or: infinite) Threshold
When CongWin is below Threshold, sender is in slow-start (rapid acceleration) phase, window grows exponentially (every Ack increments CongWin)
When CongWin is above Threshold, sender is in congestion-avoidance phase, window grows linearly.
When a triple duplicate ACK occurs, Threshold set to CongWin/2 and CongWin set to Threshold (multiplicative decrease).
When timeout occurs, Threshold set to CongWin/2 and CongWin is set to 1 MSS (restart)
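The four rules of the summary can be sketched as a small Python class. The class and method names are hypothetical. In congestion avoidance the per-ACK increment MSS*MSS/CongWin is a common approximation of "1 MSS per RTT" and is an assumption here, as is treating CongWin equal to Threshold as congestion avoidance.

```python
import math


class TcpCongestionControl:
    """Toy model of the summary rules (window measured in bytes)."""

    def __init__(self, mss=1):
        self.mss = mss
        self.cong_win = mss
        self.threshold = math.inf       # initially: no (infinite) Threshold

    def on_ack(self):
        if self.cong_win < self.threshold:
            self.cong_win += self.mss   # slow start: +1 MSS per ACK
        else:
            # Congestion avoidance: about +1 MSS per RTT (approximation).
            self.cong_win += self.mss * self.mss / self.cong_win

    def on_triple_dup_ack(self):
        self.threshold = self.cong_win / 2
        self.cong_win = self.threshold  # multiplicative decrease

    def on_timeout(self):
        self.threshold = self.cong_win / 2
        self.cong_win = self.mss        # restart from 1 MSS
```

A triple duplicate ACK halves the window and continues linearly, while a timeout falls back to 1 MSS and ramps up exponentially only as far as the new Threshold.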
Transport Layer 3-127
Lecture 7 outline
TCP Overview; TCP Timeouts; TCP Flow Control; 3.6 Principles of congestion control; 3.7 TCP congestion control; TCP Fairness
Transport Layer 3-128
TCP Fairness
Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 pass through a bottleneck router of capacity R.]
New (non-TCP) protocols should be `friendly, well behaved` towards TCP…
Transport Layer 3-129
Why is TCP fair?
Two competing sessions:
  Additive increase gives slope of 1, as throughput increases
  Multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput versus Connection 2 throughput, each axis up to R, with the equal bandwidth share line, the total bandwidth limit line, and a trajectory through points A, B, C, D. Congestion avoidance: additive increase; loss: decrease window by factor of 2.]
Transport Layer 3-130
Steady state throughput area
Whenever the total bandwidth exceeds R, one or both senders identify congestion (by receiving 3 dup acks) and reduce their rate by half.
[Figure: Connection 1 throughput versus Connection 2 throughput, each axis up to R, with the equal bandwidth share line, the total bandwidth limit line, and the threshold line.]
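A short Python simulation of two AIMD senders illustrates the convergence toward the equal share line. It assumes synchronized steps and that both senders detect every loss (the slide allows "one or both"); the function name and parameters are hypothetical.

```python
def simulate_aimd(x1, x2, capacity, steps, increase=1.0):
    """Throughputs of two AIMD senders sharing a link of given capacity."""
    for _ in range(steps):
        if x1 + x2 > capacity:
            # Loss: both halve (assumption), which halves their difference.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # No loss: both add the same amount, difference unchanged.
            x1, x2 = x1 + increase, x2 + increase
    return x1, x2


a, b = simulate_aimd(10.0, 80.0, capacity=100.0, steps=200)
print(a, b, abs(a - b))
```

Additive increase keeps the gap between the two senders constant and every multiplicative decrease halves it, so an unequal start such as (10, 80) drifts toward equal shares.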
Transport Layer 3-131
Fairness (more)
Fairness and non-TCP
  Multimedia apps often do not use TCP
    Admission control instead of fairness
  May use UDP: pump audio/video e.g. at constant rate, tolerate packet loss
  Or: new transport protocol (e.g. non-standard TCP)
  Many research areas (e.g. TCP friendly)
Fairness and parallel TCP connections
  App may open many parallel connections
  Web browsers do this
    Esp. pipelined, non-persistent
  Example: link of rate R with 9 connections;
    new app asks for 1 TCP connection, gets rate R/10
    new app asks for 9 TCP connections, gets R/2!
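The arithmetic of the example, assuming every connection gets an equal share of the link, as a Python sketch (the function name `app_share` is hypothetical):

```python
from fractions import Fraction


def app_share(existing, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    when existing connections already share the link equally."""
    return Fraction(new_connections, existing + new_connections)


print(app_share(9, 1))   # 1/10
print(app_share(9, 9))   # 1/2
```

Per-connection fairness therefore does not give per-application fairness: opening more connections buys a larger share.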
Transport Layer 3-132
Summary: TCP Services
Reliable Connection management
  Setup, teardown
Reliable connection-based communication
Flow control
Congestion control
Fairness
Efficiency improvements:
  Pipelining / Window
  Fast retransmit
  Others…