TCP/IP How it Works
Les Cottrell – SLAC. Lecture # 1 presented at the Workshop on Scientific Information in the Digital Age: Access and Dissemination, ICTP, Trieste, Italy, October 2009
www.slac.stanford.edu/grp/scs/net/talk09/ictp-tcpip.ppt


Page 1: TCP/IP How  it Works

TCP/IP How it Works
Les Cottrell – SLAC
Lecture # 1 presented at the Workshop on Scientific Information in the Digital Age: Access and Dissemination
ICTP, Trieste, Italy, October 2009
www.slac.stanford.edu/grp/scs/net/talk09/ictp-tcpip.ppt

Page 2: TCP/IP How  it Works

Overview
• This is not a lecture on how to program TCP/IP, but rather an introduction to how its major portions work. It does not cover IPv6.
• IP
• Addressing: IP addresses, ARP, routing
• ICMP
• UDP
• TCP: flow control, error recovery, establishment, disconnect
• References:
– “Internetworking with TCP/IP, Volume I: Principles, Protocols & Architecture”, by Douglas Comer
– “TCP/IP Illustrated: The Protocols”, by W. Richard Stevens
– Most information is also available free via Web searches

Page 3: TCP/IP How  it Works

Internet Protocol (IP, RFC 791)
TCP/IP Internet provides 3 layers of service (top to bottom):
– Application services
– Transport services
– Connectionless packet delivery service
• Layering allows one to replace one service without affecting the others
• IP layer (basic unit of transfer in TCP/IP) provides:
– Best-effort (does not discard capriciously), unreliable (no guarantees) delivery
– Packets may be lost, duplicated, or delivered out of order, with no notification
– Connectionless (each packet treated independently)
• IP software provides routing

Page 4: TCP/IP How  it Works

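Internet datagram (“packet”)
• Basic transfer unit
• Format of Internet datagram: datagram header followed by datagram data area
• Header layout (bit positions 0-31 within each 32 bit word):
Word 1: Vers (bits 0-3) | Hlen (bits 4-7) | Type of serv. (bits 8-15) | Total length (bits 16-31)
Word 2: Identification (bits 0-15) | Flags (bits 16-18) | Fragment offset (bits 19-31)
Word 3: TTL (bits 0-7) | Protocol (bits 8-15) | Header Checksum (bits 16-31)
Word 4: Source IP address
Word 5: Destination IP address
Then: IP Options (if any) | Padding
Then: Data…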

Page 5: TCP/IP How  it Works

IP Datagram format (cont.)
• Source & destination IP address (32 bits each): contain the IP addresses of the sender and the intended recipient
• Options (variable length): mainly used to record a route, record timestamps, or specify routing
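
The header layout described above can be decoded with a few lines of Python. This is a minimal sketch, not a reference parser: the function name `parse_ipv4_header` and the sample field values (identification, TTL) are invented for illustration, options are ignored, and the checksum is not verified.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (ver_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_hlen >> 4,             # Vers: top 4 bits of first byte
        "hlen_words": ver_hlen & 0x0F,        # Hlen: header length in 32-bit words
        "tos": tos,
        "total_length": total_len,
        "identification": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset_bytes": (flags_frag & 0x1FFF) * 8,  # stored in 8-byte units
        "ttl": ttl,
        "protocol": proto,                    # 1 = ICMP, 6 = TCP, 17 = UDP
        "header_checksum": checksum,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }

# Hypothetical sample header: version 4, Hlen 5, DF set, TTL 64, protocol ICMP.
sample = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 84, 0x1234, 0x4000, 64, 1, 0,
    socket.inet_aton("134.79.125.205"), socket.inet_aton("210.56.16.10"),
)
header = parse_ipv4_header(sample)
print(header)
```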

Page 6: TCP/IP How  it Works

IP Fragmentation
• How do we send a datagram with, say, 1400 bytes of data through a link that has a Maximum Transmission Unit (MTU) of, say, 620 bytes?
• Answer: the datagram is broken into fragments
– The router fragments the 1400 byte datagram into fragments carrying 600 bytes, 600 bytes and 200 bytes of data (note that each fragment also needs 20 bytes for its IP header)
– Routers do NOT reassemble; that is up to the end host
• Example path: Net 1 (MTU=1500), Net 2 (MTU=620), Net 3 (MTU=1500)
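
The arithmetic in this example can be checked with a short sketch. It assumes a 20-byte IP header without options on every fragment; the function name `fragment` is made up for illustration.

```python
def fragment(data_len: int, mtu: int, header_len: int = 20):
    """Split data_len bytes of payload for a link with the given MTU.

    Returns a list of (offset_bytes, data_bytes, more_fragments) tuples.
    """
    # Every fragment except the last must carry a multiple of 8 data bytes,
    # because the fragment offset field counts in units of 8 bytes.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    offset = 0
    while offset < data_len:
        size = min(max_data, data_len - offset)
        more = offset + size < data_len
        fragments.append((offset, size, more))
        offset += size
    return fragments

for offset, size, more in fragment(1400, 620):
    print(f"offset={offset} (field value {offset // 8}) data={size} more_fragments={more}")
```

With 1400 bytes of data and an MTU of 620 this yields the 600, 600 and 200 byte fragments from the slide.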

Page 7: TCP/IP How  it Works

Fragmentation Control
• Identification: copied into each fragment, allows the destination to know which fragments belong to which datagram
• Fragment Offset (13 bits): specifies the offset in the original datagram of the data being carried in the fragment
– Measured in units of 8 bytes, starting at 0
• Flags (3 bits): control fragmentation
– Reserved (0-th bit)
– Don’t Fragment – DF (1st bit):
  • useful for simple (e.g. computer bootstrap) applications that can’t handle fragmentation
  • also used for MTU discovery (see later)
  • if the datagram needs to be fragmented and can’t be, the router discards it & sends an error to the source
– More Fragments (least significant bit): set on every fragment except the last, so a cleared bit tells the receiver it has got the last fragment
• TCP traffic is hardly ever fragmented (due to use of MTU discovery). About 0.1% to 0.5% of TCP packets are fragmented.

Page 8: TCP/IP How  it Works

Fragment series composition
NB. If the data segment contains its own header, that header is not replicated in each fragment
• Fragment 1: Offset=0, More frags
• Fragment 2: Offset=1480, More frags
• Fragment 3: Offset=2960, More frags
• Fragment 4: Offset=3440, Last frag

Page 9: TCP/IP How  it Works

Internet Addressing
• IP address is a 32 bit integer
– Refers to an interface rather than a host
– Consists of network and host portions
  • Enables routers to keep 1 entry/network instead of 1/host
– Class A, B, C for unicast
– Class D for multicast
– Class E reserved
– Classless addresses
• Written as 4 octets/bytes in decimal format
– E.g. 134.79.16.1, 127.0.0.1

Page 10: TCP/IP How  it Works

Internet Class-based addresses
• Class A: large number of hosts, few networks
– 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
  • 7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
  • Initial byte 1-127 (decimal)
• Class B: medium number of hosts and networks
– 10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
  • 16,384 class B networks, 65,534 hosts/network
  • Initial byte 128-191 (decimal)
• Class C: large number of small networks
– 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
  • 2,097,152 networks, 254 hosts/network
  • Initial byte 192-223 (decimal)
• Class D: 224-239 (decimal) Multicast [RFC1112]
• Class E: 240-255 (decimal) Reserved
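
Because the class is encoded in the leading bits, it can be read off the initial byte alone. The helper below is an illustrative sketch (the name `address_class` is not from the talk) that follows the initial-byte ranges listed above.

```python
def address_class(address: str) -> str:
    """Return the historical class (A-E) of a dotted-decimal IPv4 address."""
    first = int(address.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError(f"bad first octet in {address!r}")
    if first < 128:   # leading bit 0
        return "A"
    if first < 192:   # leading bits 10
        return "B"
    if first < 224:   # leading bits 110
        return "C"
    if first < 240:   # leading bits 1110, multicast
        return "D"
    return "E"        # leading bits 1111, reserved

for addr in ("10.1.2.3", "134.79.16.1", "192.168.0.1", "224.0.0.1", "250.0.0.1"):
    print(addr, "is class", address_class(addr))
```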

Page 11: TCP/IP How  it Works

Subnets
• A subnet mask is applied to the host bits to determine how the network is subnetted, e.g. if the host is 137.138.28.228 and the subnet mask is 255.255.255.0, then the right-hand 8 bits are for the host (255 is decimal for all bits set in an octet)
• Host addresses with all bits set or no bits set indicate a broadcast, i.e. the packet is sent to all hosts.
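
The example host and mask can be worked through with Python's standard `ipaddress` module. This is only a sketch of the masking step; the variable names are illustrative.

```python
import ipaddress

# The host and subnet mask from the example above.
iface = ipaddress.ip_interface("137.138.28.228/255.255.255.0")

network = iface.network                               # bits selected by the mask
host_part = int(iface.ip) & int(network.hostmask)     # remaining right-hand 8 bits
broadcast = network.broadcast_address                 # host bits all set

print("network:  ", network)
print("host part:", host_part)
print("broadcast:", broadcast)
```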

Page 12: TCP/IP How  it Works

Subnet Mask Conversions

Prefix Length   Subnet Mask        Prefix Length   Subnet Mask
/1              128.0.0.0          /17             255.255.128.0
/2              192.0.0.0          /18             255.255.192.0
/3              224.0.0.0          /19             255.255.224.0
/4              240.0.0.0          /20             255.255.240.0
/5              248.0.0.0          /21             255.255.248.0
/6              252.0.0.0          /22             255.255.252.0
/7              254.0.0.0          /23             255.255.254.0
/8              255.0.0.0          /24             255.255.255.0
/9              255.128.0.0        /25             255.255.255.128
/10             255.192.0.0        /26             255.255.255.192
/11             255.224.0.0        /27             255.255.255.224
/12             255.240.0.0        /28             255.255.255.240
/13             255.248.0.0        /29             255.255.255.248
/14             255.252.0.0        /30             255.255.255.252
/15             255.254.0.0        /31             255.255.255.254
/16             255.255.0.0        /32             255.255.255.255

Decimal Octet   Binary Number
128             1000 0000
192             1100 0000
224             1110 0000
240             1111 0000
248             1111 1000
252             1111 1100
254             1111 1110
255             1111 1111
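
The conversion table can be regenerated from the prefix length with simple bit arithmetic. The two helper names below are illustrative, and `mask_to_prefix` assumes the mask has contiguous leading one bits, as every entry in the table does.

```python
def prefix_to_mask(prefix_len: int) -> str:
    """Convert a prefix length (0-32) to a dotted-decimal subnet mask."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("prefix length must be between 0 and 32")
    # Set the top prefix_len bits of a 32 bit word.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return ".".join(str((mask >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def mask_to_prefix(mask: str) -> int:
    """Count the one bits in a dotted-decimal mask (assumed contiguous)."""
    value = 0
    for octet in mask.split("."):
        value = (value << 8) | int(octet)
    return bin(value).count("1")

for length in (1, 8, 21, 24, 30):
    print(f"/{length:<3} {prefix_to_mask(length)}")
```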

Page 13: TCP/IP How  it Works

Address depletion
• In 1991 the IAB identified 3 dangers
– Running out of class B addresses
– Increase in nets has resulted in routing table explosion
– Increase in nets/hosts exhausting the 32 bit address space
• Four strategies to address these
– Creative address space allocation {RFC 2050}
– Private addresses {RFC 1918}, Network Address Translation (NAT) {RFC 1631}
– Classless InterDomain Routing (CIDR) {RFC 1519}
– IP version 6 (IPv6) {RFC 1883}

Page 14: TCP/IP How  it Works

Creative IP address allocation
• Class A addresses 64 – 127 reserved
– Handled on an individual basis, got some back (e.g. Stanford)
• Class B only assigned given a demonstrated need
• Class C
– divided up into 8 blocks allocated to regional authorities
– 208-223 remains unassigned and unallocated
• Four main registries handle assignments
– APNIC – Asia & Pacific www.apnic.net
– ARIN – N. & S. America, Caribbean & sub-Saharan Africa www.arin.net
– RIPE – Europe and surrounding areas www.ripe.net
– AFRINIC

Page 15: TCP/IP How  it Works

Private IP Addresses
• IP addresses that are not globally unique, but are used exclusively within an organization
• Three ranges:
– 10.0.0.0 - 10.255.255.255: a single class A net
– 172.16.0.0 - 172.31.255.255: 16 contiguous class Bs
– 192.168.0.0 - 192.168.255.255: 256 contiguous class Cs
• Connectivity provided by a Network Address Translator (NAT)
– translates an outgoing private IP address to an Internet IP address, and a returning Internet IP address to a private address
– Only for TCP/UDP packets

Page 16: TCP/IP How  it Works

Classless InterDomain Routing (CIDR)
• Many organizations have > 256 computers but few have more than several thousand
• Instead of giving out a class B (there are only 16384 class B nets), give sufficient contiguous class C addresses to satisfy needs
– < 256 addresses: assign 1 class C
– …
– < 8192 addresses: assign 32 contiguous class C nets

Page 17: TCP/IP How  it Works

CIDR & Supernetting
• Since assigned contiguously, the class C nets in a CIDR block have the same most significant bits & so need only one routing table entry
• CIDR block represented by a prefix and prefix length
– Prefix = single address representing a block of nets, e.g.
  • 192.32.136.0 = 11000000 00100000 10001000 00000000 while
  • 192.32.143.0 = 11000000 00100000 10001111 00000000
  • the leading 21 bits are identical: a 21 bit prefix (2048 host addresses)
– Prefix length indicates the number of routing bits, e.g. 192.32.136.0/21 means 21 bits are used for routing; mask = 255.255.248.0
• CIDR collects all nets in the range 192.32.136.0 through 192.32.143.0 into a single router entry, which reduces router table entries
• Removes the class A, B & C address boundaries
• For more details see RFC 1519
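
The aggregation in this example can be reproduced with the standard `ipaddress` module: the eight contiguous class C sized nets 192.32.136.0 through 192.32.143.0 collapse into one /21 entry. A sketch, with illustrative variable names:

```python
import ipaddress

# The eight contiguous /24 ("class C") networks from the example.
class_c_nets = [
    ipaddress.ip_network(f"192.32.{third}.0/24") for third in range(136, 144)
]

# collapse_addresses merges adjacent networks into the fewest covering prefixes.
aggregated = list(ipaddress.collapse_addresses(class_c_nets))
block = aggregated[0]

print("routing entries before:", len(class_c_nets))
print("routing entries after: ", len(aggregated))
print("block:", block, "mask:", block.netmask, "addresses:", block.num_addresses)
```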

Page 18: TCP/IP How  it Works

Address Resolution Protocol (ARP)
• IP address is at the network layer; need to map it to the link layer MAC (Ethernet) address
• Use ARP to map a 32 bit IP address to a 48 bit Ethernet address
– IP requests the MAC address for an IP address from the local ARP table
– If not there, then an ARP request packet for the IP address is sent using the physical broadcast address (all bits set, FF:FF:FF:FF:FF:FF)
– Host with the requested IP address responds with its MAC address in a unicast packet
– On return, the requesting host updates its ARP table and returns the MAC address
– ARP cache entries time out
– ARP packets are carried directly on top of Ethernet

Page 19: TCP/IP How  it Works

ARP cont.
• ARP requests are local only, do not cross routers
• Compare local IP and subnet mask => local subnet
• Compare local subnet to destination IP
– if local, ARP for MAC address
– else remote, so
  • if ROUTE entry, ARP for router to subnet
  • if default route, ARP for default gateway
  • otherwise, drop packet & return error
• Figure: User A (134.79.10.17) on Subnet 1 and User B (134.79.15.3) on Subnet 2, joined by a router with addresses 134.79.10.1 and 134.79.15.1
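
The decision procedure above can be sketched as a small function that returns the IP address a host would ARP for. The function name, the route-table representation, and the /24 subnet mask are assumptions made for illustration; the slide does not state the mask.

```python
import ipaddress

def arp_target(local_if: str, destination: str, routes: dict, default_gateway=None) -> str:
    """Return the IP address whose MAC address must be resolved via ARP.

    local_if is "address/prefix"; routes maps a destination prefix to a router IP.
    """
    iface = ipaddress.ip_interface(local_if)
    dest = ipaddress.ip_address(destination)
    if dest in iface.network:            # destination is on the local subnet
        return str(dest)
    for prefix, router in routes.items():
        if dest in ipaddress.ip_network(prefix):
            return router                # explicit route entry: ARP for that router
    if default_gateway is not None:
        return default_gateway           # fall back to the default gateway
    raise OSError("no route to destination, packet dropped")

# User A reaching User B through the router, assuming /24 subnets.
print(arp_target("134.79.10.17/24", "134.79.15.3", {"134.79.15.0/24": "134.79.10.1"}))
```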

Page 20: TCP/IP How  it Works

Routing
• Routers must select the next hop for a packet
• Get route information from other routers via a routing protocol (RIP, OSPF, EIGRP, BGP etc.)
• Note the following are non-routable:
– private networks: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
– Loopback 127.0.0.0/8

Page 21: TCP/IP How  it Works

ICMP Purpose (RFC 792)
• Communicates control & error information
– Between routers and hosts
– Only reports to original source, suggests corrections
– Error messages about error messages are not generated
– Never generated due to multicasts
• Packet format (bit positions 0-31):
Type (bits 0-7) | Code (bits 8-15) | Checksum (bits 16-31)
ICMP data (depends on type/code)

Page 22: TCP/IP How  it Works

Main ICMP message types

Type   Meaning
0      Echo reply (ping)
3      Destination unreachable (code 1 host, code 3 port, code 4 DF set and must fragment)
4      Source quench
5      Redirect (change a route)
8      Echo request
11     Time exceeded (code 0 ttl=0, code 1 reassembly)
12     Parameter problems

Page 23: TCP/IP How  it Works

ICMP Echo/Ping
• Very commonly used diagnostic tool
• Implementations vary between OSes
• Build echo request
– Identifier used to match requests to replies (e.g. pid)
– Sequence number, starts at 0, increments by 1 for each ping packet
  • Used to detect loss, reordering, duplicates
– Optional data, sent by requester, returned by replier
  • Usually contains a timestamp of when the request was sent, plus pad data
• Echo request format (bit positions 0-31):
Type=8 (bits 0-7) | Code=0 (bits 8-15) | Checksum (bits 16-31)
Identifier (bits 0-15) | Sequence number (bits 16-31)
Optional data
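
An echo request with this layout can be assembled as follows. This sketch only builds the bytes; actually sending them needs a raw socket and usually administrator privileges, which is not shown. The function names and the sample identifier and payload are illustrative.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16 bit one's complement of the one's complement sum (the ICMP checksum)."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of 16 bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Build an ICMP echo request: Type=8, Code=0, checksum, identifier, sequence."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)  # checksum field 0
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload

packet = build_echo_request(identifier=0x1234, sequence=0, payload=b"ping data")
print(packet.hex())
```

A receiver can validate the packet by running the same checksum over all of it, which gives 0 for an undamaged message.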

Page 24: TCP/IP How  it Works

What do we learn from Ping
• Host reachable
– Host may respond to ping but not be running services
• Round trip timing
• Lost packets
• Packet reordering, duplicate packets
• Example:
13cottrell@noric05:~>ping -c 4 lhr.comsats.net.pk
PING lhr.comsats.net.pk (210.56.16.10) from 134.79.125.205 : 56(84) bytes of data.
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=0 ttl=242 time=716.962 msec
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=1 ttl=242 time=720.375 msec
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=2 ttl=242 time=725.907 msec
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=3 ttl=242 time=710.734 msec
--- lhr.comsats.net.pk ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/mdev = 710.734/718.494/725.907/5.566 ms

Page 25: TCP/IP How  it Works

Time Exceeded
• Time-to-live has expired at a router (code=0)
– ttl sets a bound on the number of routers a datagram can transit
  • Prevents infinite routing loops
  • Initialized by sender, decremented by 1 each time the datagram passes a router
  • When ttl = 0 the datagram is thrown away & the sender is notified by an ICMP message
• Fragment reassembly timer expired (code=1)
• Message format (bit positions 0-31):
Type=11 (bits 0-7) | Code (bits 8-15) | Checksum (bits 16-31)
Unused (bits 0-31)
Internet header & 8 bytes of data of the discarded datagram

Page 26: TCP/IP How  it Works

MTU Discovery
• Path MTUs vary
• Fragmentation is bad
• Small transmission units are bad
• So need to discover the optimum MTU (largest without fragmentation)
• Host sends a packet with the Don’t Fragment bit set
– Length is the lesser of the local MTU and the MSS announced by the remote system
– If the MTU between the hosts requires fragmentation (e.g. at an intermediate router), then
  • since the DF bit is set & the packet must be fragmented, an ICMP message is sent back to the source saying “I can’t fragment”
  • the source tries again with a smaller size.
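
The retry loop can be illustrated with a toy simulation. It assumes the router's ICMP "can't fragment" message reports the MTU of the link that was too small (the next-hop MTU field described in RFC 1191), so the sender knows which smaller size to try. No real packets are sent and both function names are invented.

```python
def send_with_df(size: int, link_mtus: list):
    """Simulate sending a DF packet along a path.

    Returns None if the packet fits every link, otherwise the MTU of the
    first link that would have required fragmentation.
    """
    for mtu in link_mtus:
        if size > mtu:
            return mtu      # router drops the packet and reports its MTU
    return None

def discover_path_mtu(local_mtu: int, link_mtus: list):
    """Return (path_mtu, attempts) by retrying with smaller sizes."""
    size = local_mtu
    attempts = 0
    while True:
        attempts += 1
        reported = send_with_df(size, link_mtus)
        if reported is None:
            return size, attempts
        size = reported     # try again with the smaller size

print(discover_path_mtu(1500, [1500, 620, 1500]))
```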

Page 27: TCP/IP How  it Works

User Datagram Protocol - UDP
• RFC 768, Protocol 17
• Provides unreliable, connectionless delivery on top of IP
• Minimal overhead, high performance
– No setup/teardown, 1 datagram at a time
• Application responsible for reliability
– Includes handling datagram loss, duplication, delay, out-of-sequence delivery, multiplexing, loss of connectivity
• Protocol stack (bottom to top):
– Network: IP, demultiplexes on IP protocol number to TCP or UDP
– Transport: TCP and UDP, each demultiplexes on port number
– App.: Port 1, Port 2 (for each transport protocol)

Page 28: TCP/IP How  it Works

UDP Datagram format
• Header layout (bit positions 0-31):
Source port (bits 0-15) | Destination port (bits 16-31)
UDP message len (bits 0-15) | Checksum (opt.) (bits 16-31)
Data…
• Source/destination port: port numbers identify the sending & receiving processes
– Port number & IP address allow any application in any computer on the Internet to be uniquely identified
– Used to demultiplex datagrams to processes
– Ports can be static or dynamic
  • Static (< 1024) assigned centrally, known as well known ports
  • Dynamic
• Message length in bytes includes the UDP header and data
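
The four 16 bit header fields can be packed and unpacked directly. In this sketch the checksum is left at 0, which in UDP over IPv4 means "not computed"; the port numbers, payload and function names are illustrative.

```python
import struct

UDP_HEADER_LEN = 8  # four 16 bit fields

def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Build a UDP datagram with the optional checksum left as 0 (not computed)."""
    length = UDP_HEADER_LEN + len(payload)   # message length covers header and data
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + payload

def parse_udp_datagram(datagram: bytes):
    """Return (src_port, dst_port, length, checksum, payload)."""
    src, dst, length, checksum = struct.unpack("!HHHH", datagram[:UDP_HEADER_LEN])
    return src, dst, length, checksum, datagram[UDP_HEADER_LEN:length]

datagram = build_udp_datagram(40000, 53, b"query")
print(parse_udp_datagram(datagram))
```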

Page 29: TCP/IP How  it Works

UDP applications
• Message oriented, e.g. SNMP, DNS, time, some Real Time data (e.g. VoIP data, but not setup)
• Some file systems, e.g. NFS, AFS
• Lightweight file transfer, e.g. tftp, bootp

Page 30: TCP/IP How  it Works

Transmission Control Protocol - TCP
• RFC 793 & host requirements RFC 1122
• Reliable stream transport
• Connection oriented (full duplex virtual circuit)
– Conceptually place a call; the two ends communicate to agree on details
– After agreeing, the application is notified of the connection
– During transfer, the ends communicate continuously to verify data is received correctly
– When done, the ends tear down the connection
– If UDP is like regular mail, TCP is like a phone call
• Provides buffering and flow control
• Takes care of lost packets, out of order packets, duplicates, long delays
• Isolates application program from network details
• Jargon
– Segment = TCP packet
– Socket = source (address + port) + destination (address + port)

Page 31: TCP/IP How  it Works

TCP layering
• To ID a connection need:
– Source: (address, port) AND Destination: (address, port)
– Only need one port on a host to allow multiple connections, since each connection will have a different (host, port) at the other end
  • E.g. a single host can serve multiple telnet connections
• Passive open: application contacts OS & indicates it will accept an incoming connection; OS assigns a port and listens
• Active open: application requests OS to connect to a (host, port)
• Protocol stack (bottom to top):
– Network: IP, demultiplexes on IP protocol number (TCP is IP protocol 6)
– Transport: TCP and UDP, each demultiplexes on port number
– App.: Port 1, Port 2 (for each transport protocol)

Page 32: TCP/IP How  it Works

TCP – providing reliability
• Positive acknowledgement (ACK) with retransmission
– Sender keeps a record of each packet sent
– Sender awaits an ACK
– Sender starts a timer when it sends a packet
• Network messages (time runs downward):
Sender site                  Receiver site
Send pkt 1
                             Rcv pkt 1
                             Send ACK 1
Rcv ACK 1
Send pkt 2
                             Rcv pkt 2
                             Send ACK 2
Rcv ACK 2

Page 33: TCP/IP How  it Works

TCP – simple lost packet recovery
• Network messages (time runs downward):
Sender site                        Receiver site
Send pkt 1, start timer
(pkt 1 lost in network)            Pkt should arrive
                                   ACK should be sent
ACK normally arrives
Timer expires
Retransmit pkt 1, start timer
                                   Rcv pkt 1
                                   Send ACK 1
Rcv ACK 1

Page 34: TCP/IP How  it Works

TCP – improving performance
• BUT the simple ACK protocol wastes bandwidth since it must delay sending the next packet until it gets an ACK
• Use a sliding window
• Sender can send 4 packets of data without an ACK
– When the sender gets an ACK it can send another packet
– Window = unacknowledged packets/bytes
– Keeps a timer for each packet
• Figure: packets 1 2 3 4 5 6 7 8 … with an initial window of 4 packets; as ACKs arrive the window slides, dividing the stream into packets successfully sent, packets sent and awaiting ACK, and packets to be sent
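
The window behaviour can be traced with a toy, loss-free sender that records the order of sends and ACKs. It is a simplified model, not TCP: packets are numbered rather than bytes, ACKs arrive in order one at a time, and there are no timers. The function name is illustrative.

```python
from collections import deque

def sliding_window_send(num_packets: int, window: int):
    """Return the sequence of ("send", n) and ("ack", n) events for a lossless path."""
    events = []
    in_flight = deque()          # packets sent but not yet acknowledged
    next_pkt = 1
    while next_pkt <= num_packets or in_flight:
        # Send as long as the window allows.
        while next_pkt <= num_packets and len(in_flight) < window:
            in_flight.append(next_pkt)
            events.append(("send", next_pkt))
            next_pkt += 1
        # The oldest outstanding packet is acknowledged; the window slides by one.
        events.append(("ack", in_flight.popleft()))
    return events

for event, pkt in sliding_window_send(8, 4):
    print(event, pkt)
```

With a window of 1 the trace degenerates to the simple send, wait, ACK protocol; with a window of 4 the first four packets go out before any ACK is needed.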

Page 35: TCP/IP How  it Works

Tuning to fill pipe
• Optimal window size depends on:
– Bandwidth end to end, i.e. min(BWlinks), AKA bottleneck bandwidth
– Round Trip Time (RTT)
– For TCP, keep the pipe full
  • Window (sometimes called pipe) ~ RTT*BW
– Can increase throughput by orders of magnitude
• Windows also used for flow control
• Figure: Src sends packets to Rcv, which returns ACKs; t = bits in packet/link speed, and RTT is the time from sending a packet until its ACK returns
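
The rule Window ~ RTT*BW is easy to evaluate numerically. The link speed, RTT and window size below are hypothetical values chosen only to show the calculation, and the function names are illustrative.

```python
def window_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Window needed to keep the pipe full: bandwidth times round trip time."""
    return bandwidth_bits_per_s * rtt_s / 8

def max_throughput_bits_per_s(window_size_bytes: float, rtt_s: float) -> float:
    """Best throughput with a fixed window: one window per round trip."""
    return window_size_bytes * 8 / rtt_s

# Hypothetical path: 1 Gbit/s bottleneck, 150 ms round trip time.
needed = window_bytes(1e9, 0.150)
# The same path limited to a 64 kbyte (65,535 byte) window.
limited = max_throughput_bits_per_s(65535, 0.150)

print(f"window to fill pipe: {needed / 1e6:.2f} MBytes")
print(f"throughput with 64 kbyte window: {limited / 1e6:.2f} Mbits/s")
```

Under these assumptions the small window limits the transfer to a few Mbits/s on a Gbit/s path, which is the orders-of-magnitude gap the slide refers to.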

Page 36: TCP/IP How  it Works

Implementation
• Sliding window operates at byte level, NOT packet level
• Receiver keeps a similar window to put the stream back together
• Since full duplex, altogether 4 windows & pointer sets
• Figure: byte stream 1 2 3 4 5 6 7 8 … with the current window marked by 3 pointers: bytes sent and acknowledged, highest byte sent, and highest byte that can be sent

Page 37: TCP/IP How  it Works

TCP flow control
• Windows vary over time
– Receiver advertises (in ACKs) how much data it can receive
  • Based on buffers etc. available
– Sender adjusts its window to match the advertisement
– If receiver buffers fill, it sends smaller advertisements
• Used to match the buffer requirements of the receiver
• Also used to address congestion control (e.g. in intermediate routers)

Page 38: TCP/IP How  it Works

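TCP Segment format
• Header layout (bit positions 0-31):
Source port (bits 0-15) | Destination port (bits 16-31)
Sequence number (bits 0-31)
Acknowledgement number (bits 0-31)
Hlen (bits 0-3) | Resv (bits 4-9) | Code (bits 10-15) | Window (bits 16-31)
Checksum (bits 0-15) | Urgent ptr (bits 16-31)
Options (if any) | Padding
Data if any
• Source/Dest port: TCP port numbers to ID applications at both ends of the connection
• Sequence number: identifies the position in the sender’s byte stream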

Page 39: TCP/IP How  it Works

TCP segment format – cont.
• Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
• Hlen: specifies the length of the segment header in 32 bit multiples. If there are no options, Hlen = 5 (20 bytes)
• Reserved: for future use, set to 0
• Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG

Page 40: TCP/IP How  it Works

TCP Segment format – cont.
• Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining.
• Checksum: verifies the integrity of the TCP header and data. It is mandatory.
• Urgent pointer: used with the URG flag to indicate where the urgent data starts in the data stream. Typically used for a file transfer abort during FTP or when pressing an interrupt key in telnet.
• Options: used for window scaling, SACK, timestamps, maximum segment size etc.
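
The fields described on the last three slides can be pulled out of a raw segment as sketched below. The sample bytes are built from the first segment of the handshake example given later in the talk (atlas:1174 to sunstats:23, seq=180839, W=8192, SYN, one MSS option); the checksum is left at 0 and not verified, and the function name is illustrative.

```python
import struct

# Code bits, from least significant upward.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]

def parse_tcp_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte TCP header plus any option bytes."""
    (src, dst, seq, ack, hlen_code,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    hlen_words = hlen_code >> 12                       # Hlen: top 4 bits
    flags = [name for bit, name in enumerate(FLAG_NAMES) if hlen_code & (1 << bit)]
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "hlen_words": hlen_words,
        "flags": flags,
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
        "options": raw[20:hlen_words * 4],             # bytes beyond the fixed header
    }

# SYN segment: Hlen = 6 words (24 bytes), Code = SYN, option 020405B4 (MSS 1460).
syn = struct.pack("!HHIIHHHH", 1174, 23, 180839, 0, (6 << 12) | 0x02, 8192, 0, 0)
syn += bytes.fromhex("020405B4")
segment = parse_tcp_header(syn)
print(segment)
```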

Page 41: TCP/IP How  it Works

TCP timeout
• Need a timeout estimate that will work from LANs (RTT < 1 msec.) to satellite WANs (hundreds of msec. to secs). RTT can vary a lot with time of day, day of week, or from one second to the next.
– TCP records the time a segment is sent and the time its ACK is received
– Then calculates an RTT sample
– Smooths the samples (RTTs) & uses them to estimate the timeout, e.g.
  • Timeout = beta * RTTs
  • Timeout = RTTs + eta{=4} * f(dev(RTTs))
– Needs to take account of losses, e.g.
  • New_timeout = gamma{=2} * timeout
• Figure: RTT (ms.) versus time of day, May 12th
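
A minimal sketch of such an estimator follows. The slide gives eta = 4 and gamma = 2 but not the smoothing function, so the exponentially weighted averages and their gains (1/8 for the RTT, 1/4 for the deviation, the values commonly used in TCP implementations) are assumptions, as are the class and method names.

```python
class RttEstimator:
    """Smoothed RTT and deviation, giving Timeout = RTTs + eta * dev."""

    def __init__(self, rtt_gain=0.125, dev_gain=0.25, eta=4.0, gamma=2.0):
        self.rtt_gain = rtt_gain    # assumed weight of a new RTT sample
        self.dev_gain = dev_gain    # assumed weight of a new deviation sample
        self.eta = eta              # multiplier on the deviation (4 on the slide)
        self.gamma = gamma          # back-off factor after a loss (2 on the slide)
        self.srtt = None
        self.dev = None
        self.timeout = None

    def sample(self, rtt: float) -> float:
        """Fold in one measured RTT (seconds) and return the new timeout."""
        if self.srtt is None:
            self.srtt = rtt
            self.dev = rtt / 2
        else:
            self.dev = (1 - self.dev_gain) * self.dev + self.dev_gain * abs(rtt - self.srtt)
            self.srtt = (1 - self.rtt_gain) * self.srtt + self.rtt_gain * rtt
        self.timeout = self.srtt + self.eta * self.dev
        return self.timeout

    def on_timeout(self) -> float:
        """A retransmission timer expired: back off the timeout."""
        self.timeout *= self.gamma
        return self.timeout

est = RttEstimator()
for rtt in (0.717, 0.720, 0.726, 0.711):   # RTTs similar to the earlier ping example
    print(f"rtt={rtt:.3f}s timeout={est.sample(rtt):.3f}s")
```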

Page 42: TCP/IP How  it Works

TCP connection establishment
• 3 way handshake
• Initial sequence numbers (x, y) are chosen randomly
• Guarantees both sides are ready & know it, sets initial sequence numbers, and also sets window & mss
• Once the connection is established, data can flow in both directions equally well; there is no master or slave
• Network messages (time runs downward):
Site 1 (active; Win 4096, mss 1024)      Site 2 (passive; Win 4096, mss 1024)
Send SYN seq=x
                                         Rcv SYN segment
                                         Send SYN seq=y, ACK x+1
Rcv SYN/ACK
Send ACK y+1
                                         Rcv ACK segment
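
The sequence and acknowledgement numbers exchanged in the handshake can be written out as a small model. It only computes the numbers; no connection is made. Sequence numbers are 32 bit, so the +1 wraps modulo 2^32. The function name and the dictionary layout are illustrative.

```python
SEQ_MODULUS = 2 ** 32   # TCP sequence numbers are 32 bit and wrap around

def three_way_handshake(x: int, y: int):
    """Return the three segments exchanged for initial sequence numbers x and y."""
    return [
        ("site1 -> site2", {"flags": "SYN", "seq": x}),
        ("site2 -> site1", {"flags": "SYN,ACK", "seq": y, "ack": (x + 1) % SEQ_MODULUS}),
        ("site1 -> site2", {"flags": "ACK", "seq": (x + 1) % SEQ_MODULUS,
                            "ack": (y + 1) % SEQ_MODULUS}),
    ]

# Initial sequence numbers taken from the worked example later in the talk.
for direction, segment_fields in three_way_handshake(180839, 1383568304):
    print(direction, segment_fields)
```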

Page 43: TCP/IP How  it Works

TCP close connection
• Modified 3 way handshake (or 4 way termination)
• App tells TCP to close; TCP sends remaining data & waits for ACK, then sends FIN
• Site 2 TCP ACKs the FIN, tells its application “end of data”
• Site 2 sends FIN when its app closes the connection (may be a long delay, e.g. if it requires human interaction)
• Network messages (time runs downward, connection states in brackets):
Site 1                                   Site 2
(App closes) Send FIN seq=x, ACK=y
[FIN Wait1]
                                         Rcv FIN segment
                                         Send seq=y, ACK x+1 (inform app)
                                         [Close Wait]
Rcv ACK segment
[FIN Wait2]
                                         (app closes connection)
                                         Send FIN seq=y, ACK x+1
                                         [Last ACK]
Rcv FIN + ACK seg
Send ACK y+1
[Time Wait]
                                         Receive ACK segment
[Closed]

Page 44: TCP/IP How  it Works

More Information
• Lectures, tutorials etc:
– www.nv.cc.va.us/home/joney/tcp_ip.htm
– www.cs.pdx.edu/~jrb/tcpip.lectures.html
– www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
– www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
– www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
– www.jbmelectronics.com/tcp.htm
• Encyclopaedia
– http://www.freesoft.org/CIE/index.htm
• TCP/IP Resources
– www.private.org.il/tcpip_rl.html
• Understanding IP addresses
– http://www.3com.com/solutions/en_US/ncs/501302.html
• Configuring TCP (RFC 1122)
– ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
• Assigned protocols, ports etc (RFC 1010)
– http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Page 45: TCP/IP How  it Works

Example: 3 way handshake
• atlas> telnet sunstats.cern.ch
– atlas is a WNT PC, sunstats is a Sun Solaris 5.6 host
– MSS is set in a TCP option in a SYN segment; it communicates the MSS the sender wants to receive
– len=ip_hlen/tcp_hlen:ip_total_len
– Initial Sequence Numbers are randomly selected
– Telnet = port 23
– W = Receive window size, advertises how much data this host will accept

Page 46: TCP/IP How  it Works

Example: 3 way handshake - cont.
• TCP from atlas:1174 to sunstats:23 seq=180839, A=0, W=8192, SYN [len=5/6:44, opt=020405B4 <opt=2, len=4, mss=0x5B4=1460>]
• TCP from sunstats:23 to atlas:1174 seq=1383568304, A=180840, W=64240, SYN/ACK [len=5/6:44, opt=020405B4]
• TCP from atlas:1174 to sunstats:23 seq=180840, A=1383568305, W=8760 [len=5/5:40, opt=nul]
– Notice window size can vary from segment to segment depending on buffer space available
– Notice smaller PC window advertisement
– Notice ephemeral port selected by telnet client
– Notice acknowledgement of next expected byte (=seq+1)
– 0x020405B4: 02 = option type, 04 = len, 0x5B4 = 1460
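
The option bytes 020405B4 can be decoded mechanically. The sketch below walks a TCP options field in type/length/value form and extracts the MSS; it handles the single-byte end-of-list (0) and no-operation (1) options and treats everything else as type, length, value. Function names are illustrative.

```python
def parse_tcp_options(raw: bytes):
    """Return a list of (option_type, value_bytes) from a TCP options field."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:            # end of option list
            break
        if kind == 1:            # no-operation, used for padding
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]      # length covers type, length and value bytes
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options

def mss_from_options(raw: bytes):
    """Return the maximum segment size (option type 2), or None if absent."""
    for kind, value in parse_tcp_options(raw):
        if kind == 2 and len(value) == 2:
            return int.from_bytes(value, "big")
    return None

print(mss_from_options(bytes.fromhex("020405B4")))
```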

Page 47: TCP/IP How  it Works

Session start
SLAC>CERN: 256kbyte window, 1 stream, full speed > 30msec, 13MBytes in 20s, 5.1MBytes/s
[Figure: trace of the session start showing the Rcvr advertised window, ACKs returned by Rcvr, segments sent, and the congestion window]

Page 48: TCP/IP How  it Works

Unreachable
76cottrell@flora06:~>ping islamabad-server2.comsats.net.pk
ICMP 13 Unreachable from gateway 207.45.205.18
for icmp from FLORA06.SLAC.Stanford.EDU (134.79.16.101) to islamabad-server2.comsats.net.pk (210.56.8.8)
What does this mean? See the exercise.