radia perlman: principles - tu berlinstefan/npa10_january-19.pdf · stefan schmid 1 radia perlman:...
Stefan Schmid 1
Radia Perlman: Principles
Radia Perlman: "mother of the Internet"
• Contributions to the spanning tree protocol (for network bridges)
• PhD @ MIT, now with Sun Microsystems
Entertaining summary of the memo, have a look: http://www.usenix.org/event/usenix01/invitedtalks/perlman.pdf
Radia Perlman’s Folklore of protocol design
Collects various tricks and "gotchas" (literally: "caught you!") in protocol design
"Here are several ways to solve problem X", with a technical explanation of pros/cons
Some "real world" examples
We'll cover most, not all, "tricks and gotchas"
Simplicity vs. flexibility vs. optimality??
Is a more complex protocol reasonable?
Is "optimal" important? (Is an approximation enough, e.g., because the system is dynamic anyway?)
KISS: "The simpler the protocol, the more likely it is to be successfully implemented and deployed."
Why are protocols overly complex?
• Design by committee (different stakeholders)
• Backward compatibility
• Flexibility: heavyweight Swiss army knife
• Unreasonable striving for optimality
• Underspecification
• Exotic/unneeded features
"Making the simple complicated is commonplace; making the complicated simple, awesomely simple, that's creativity!" (Charles Mingus)
Make it Well-defined and Scalable!
Know the problem you are trying to solve:
• Have at least one well-defined problem in mind
• Then: can other problems be solved without complicating the solution?
Think about scaling:
• Think about what happens if you're successful: the protocol is used by millions (prevent a "success disaster")
• Think also about the other extreme: does the protocol make sense in small situations as well?
Think about Overload/Failures!
Operation above capacity:
• The protocol should degrade gracefully under overload, or at least detect overload and complain
• How does the protocol break and die?
• It can't just die under overload!
Example: How to design identifiers?
Identifiers: Protocols often contain a field identifying something, e.g., the protocol type.
Two approaches: global or hierarchical
• Highly encoded universal numbers, e.g., upper-layer protocol numbers assigned by IANA: compact and interoperable, but require central administration
  e.g., "Next Header" field in IPv6: 1=ICMP, 9=IGP, etc.
• General-purpose object identifiers, as in ASN.1 (Abstract Syntax Notation One): hierarchical structure; not compact (memory, BW, CPU, …), name clashes, but "federalistic"
SNMP naming
Question: how to name every possible standard object (protocol, data, more...) in every possible network standard?
Answer: ISO Object Identifier tree:
• Hierarchical naming of all objects
• Each branch point has a name and a number
Example: 1.3.6.1.2.1.7.1
  1 = ISO (0 = ITU)
  3 = ISO-identified organization
  6 = US DoD
  1 = Internet
  2 = management
  1 = MIB2
  7 = UDP
  1 = udpInDatagrams
SNMP: Simple Network Management Protocol (central protocol to monitor network elements)
Object identifiers are also used, e.g., for X.509 public key certificates (objects therein...)
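The hierarchical naming can be illustrated with a small lookup: each prefix of the dotted identifier is one branch point in the tree. This is only a sketch; the table `OID_NAMES` and the function `resolve_oid` are hypothetical names introduced here, and the table covers just the single path from the example.

```python
# Hypothetical table: only the branch points on the path to udpInDatagrams.
OID_NAMES = {
    (1,): "iso",
    (1, 3): "identified-organization",
    (1, 3, 6): "dod",
    (1, 3, 6, 1): "internet",
    (1, 3, 6, 1, 2): "mgmt",
    (1, 3, 6, 1, 2, 1): "mib-2",
    (1, 3, 6, 1, 2, 1, 7): "udp",
    (1, 3, 6, 1, 2, 1, 7, 1): "udpInDatagrams",
}


def resolve_oid(dotted):
    """Translate a dotted OID into the list of branch names along its path.

    Branch points missing from the table are reported by their number.
    """
    arcs = tuple(int(part) for part in dotted.split("."))
    names = []
    for depth in range(1, len(arcs) + 1):
        prefix = arcs[:depth]
        names.append(OID_NAMES.get(prefix, str(prefix[-1])))
    return names


if __name__ == "__main__":
    print(" / ".join(resolve_oid("1.3.6.1.2.1.7.1")))
```

The cost of the "federalistic" scheme is visible here: the identifier carries eight numbers where a globally assigned protocol number needs one.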
Check out www.alvestrand.no/harald/objectid/top.html
OSI Object Identifier Tree
Assigned Internet Protocol numbers
From RFC 1700:

Decimal  Keyword      Protocol                        References
-------  -------      --------                        ----------
0                     Reserved                        [JBP]
1        ICMP         Internet Control Message        [RFC792,JBP]
2        IGMP         Internet Group Management       [RFC1112,JBP]
3        GGP          Gateway-to-Gateway              [RFC823,MB]
4        IP           IP in IP (encapsulation)        [JBP]
5        ST           Stream                          [RFC1190,IEN119,JWF]
6        TCP          Transmission Control            [RFC793,JBP]
7        UCL          UCL                             [PK]
8        EGP          Exterior Gateway Protocol       [RFC888,DLM1]
9        IGP          any private interior gateway    [JBP]
10       BBN-RCC-MON  BBN RCC Monitoring              [SGC]
11       NVP-II       Network Voice Protocol          [RFC741,SC3]
12       PUP          PUP                             [PUP,XEROX]
13       ARGUS        ARGUS                           [RWS4]
14       EMCON        EMCON                           [BN7]
15       XNET         Cross Net Debugger              [IEN158,JFH2]
Example: Design for Common Case
Optimize for the common case (seen this before…)
Nice example: IPv6 payload (packet) length field
• Payload length: only 2 bytes
• If the packet is longer: payload length = 0, and a 4-byte length field is found in the IP options
• Designers chose against a 4-byte header field to optimize the common case: 2 bytes are typically enough
Of course, if no alternative is available, better overestimate than underestimate! (e.g., the IP packet identifier is arguably too small)
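The escape mechanism described above (a short field for the common case, a zero value pointing to a longer field elsewhere) might be sketched as follows. This is a simplified illustration, not the actual IPv6 wire format: the function names are hypothetical, and the 4-byte extension is modelled as a bare integer without the option type and length bytes that a real option carries.

```python
import struct

MAX_16BIT = 0xFFFF      # largest value that fits the 2-byte field
MAX_32BIT = 0xFFFFFFFF  # largest value that fits the 4-byte extension


def encode_payload_length(length):
    """Return (two_byte_field, extension); extension is None in the common case."""
    if not 0 <= length <= MAX_32BIT:
        raise ValueError("length does not fit in 4 bytes")
    if length <= MAX_16BIT:
        return struct.pack("!H", length), None
    # Rare case: the short field is set to 0 and the real length moves
    # into a separate 4-byte field.
    return struct.pack("!H", 0), struct.pack("!I", length)


def decode_payload_length(field, extension=None):
    """Inverse of encode_payload_length."""
    (short_length,) = struct.unpack("!H", field)
    if short_length != 0 or extension is None:
        return short_length
    (long_length,) = struct.unpack("!I", extension)
    return long_length


if __name__ == "__main__":
    for n in (1500, 65535, 100000):
        field, ext = encode_payload_length(n)
        print(n, field.hex(), ext.hex() if ext else None)
```

Every ordinary packet pays for 2 bytes only; the extra 4 bytes are spent just on the rare oversized packet.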
Compatibility & Use of Parameters
Forward compatibility: think about future changes, evolution
• Make fields large enough
• Reserve some spare bits
• Specify an options field that can be used/augmented later (see the IP length discussion before!)
Parameters: yes or no?
Protocol parameters can be useful:
• Designers can't determine reasonable values
• Tradeoffs exist: leave the parameter choice to users
Parameters can be bad:
• Users (often not well informed) will need to choose values
• Try to make values plug-and-play!
Robustness: Notions
Making systems "robust": many forms of robustness
• Immediately adapt to failure/change
• Self-stabilization: eventually adapt to failure/change (example: self-stabilizing peer-to-peer overlays like SKIP+: regaining logarithmic degree and diameter from any initially connected overlay network!)
• Byzantine robustness: will work in spite of malicious users
• Maybe better to crash than to degrade when problems occur: signals that a problem exists
• Techniques for limiting the spread of failures
Reference: A Polylogarithmic Time Algorithm for Distributed Self-Stabilizing Skip Graphs, PODC 2009.
Missing folklore/advice?
Summary: Implementation principles
Identify and study principles that can guide the implementation of network protocols:
• Common principles among many protocols
• "Folklore" of protocol design
Synthesis: big picture
Architecture and implementation: both more art than science
Where we are now…
Goals:
• Identify, study common architectural components, protocol mechanisms
• Synthesis: big picture
• Depth: important topics not covered in introductory courses
Overview:
• Signaling
• State
• Multiplexing / Resource Allocation
• Randomization
• Indirection
• Service location
• Network virtualization
Randomization
Randomization is used in many protocols, e.g., to:
• break symmetries (e.g., among symmetric elements)
• desynchronize (e.g., when only one answer is needed)
• "avoid worst cases"
• ... or just make the protocol simpler!
We'll study examples:
• Shared medium/bus access: Ethernet multiple access protocol
• Router (de)synchronization
• Switch scheduling
Ethernet
[Figure: Metcalfe's Ethernet sketch. Labels: tap ("vampire tap", "T-piece", …; "voltage measurement", etc.); transceiver: transmitter and receiver]
• Single shared broadcast channel
• 2+ simultaneous transmissions by nodes: interference; only one node can send successfully at a time
• Multiple access protocol: distributed algorithm that determines how nodes share the channel, i.e., determines when a node can transmit
• Inspired by the ALOHAnet of Hawaii (first radio network, connecting the Hawaiian islands...), quite efficient (close to 100% at low utilization)
• Originally a shared bus (coaxial cable with taps), later a star topology connected by a hub, nowadays a switch in the center...
Deterministic Algorithms
How to share the medium using deterministic algorithms...?
• Time Division Multiplexing? But how to organize it? What if someone has nothing to send? What if additional hosts are added and removed? Etc.
• Polling?
• Virtual ring?
• Etc.
Randomized is often simpler and more efficient!
Ethernet: uses CSMA/CD (Carrier Sense Multiple Access / Collision Detection)

A: sense channel ("CS");
   if idle then {
       transmit and monitor the channel;    // "asynchronous protocol"!
       if detect another transmission ("CD") then {
           abort and send jam signal;
           update # collisions;
           delay as required by exponential backoff algorithm;
           goto A
       } else {
           done with the frame;
           set collisions to zero
       }
   } else {
       wait until ongoing transmission is over and goto A
   }
Ethernet’s CSMA/CD: Jam Signal
Jam signal (48 bits): makes sure all other transmitters are aware of the collision
Why? A starts to send; shortly before A's signal reaches B, B starts to send too:
• B immediately notices the collision and stops; but to make sure that A notices the collision too and also stops its transmission, a higher-power signal is needed
• Ethernet limits its spatial extension... (a sender must notice the collision before it has finished sending!)
[Figure: stations A and B on the shared medium]
Ethernet’s CSMA/CD: Backoff
Exponential Backoff Algorithm:
• First collision for a given packet: choose K randomly from {0,1}; the delay is K x 512 bit transmission times
• After the second collision: choose K randomly from {0,1,2,3}; after the third, from {0,1,2,3,4,5,6,7}; etc.
• After ten or more collisions: choose K randomly from {0,1,2,3,4,…,1023} (limited scale!)
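The backoff rule can be written down in a few lines. The sketch below follows the numbers on the slide (doubling window, cap after ten collisions, slot of 512 bit times); the function and constant names are made up for illustration, and the retry limit that real adapters apply is left out.

```python
import random

SLOT_BITS = 512     # one backoff slot = 512 bit transmission times
MAX_EXPONENT = 10   # the window stops growing after ten collisions


def backoff_window(collisions):
    """Number of values K can take after the given number of collisions."""
    if collisions < 1:
        raise ValueError("backoff is only defined after a collision")
    return 2 ** min(collisions, MAX_EXPONENT)


def backoff_delay_bits(collisions, rng=random):
    """Draw K uniformly from {0, ..., window-1}; return the delay in bit times."""
    k = rng.randrange(backoff_window(collisions))
    return k * SLOT_BITS


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    for n in (1, 2, 3, 10, 16):
        print(n, backoff_window(n), backoff_delay_bits(n, rng))
```

Doubling the window after every collision is what makes the scheme load-adaptive: the more stations collide, the longer the interval over which the retries are spread.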
Ethernet’s use of randomization
Resulting behavior: the probability of a retransmission attempt (equivalently: the length of the randomization interval) adapts to the current load
more collisions → heavier load (most likely), more nodes trying to send → randomize retransmissions over a longer time interval, to reduce the collision probability
Simple, load-adaptive multiple access!
Ethernet Comments
• Upper bounding K at 1023 limits the max network size!
• The max spatial extension of Ethernet makes sure that the sender with the lowest K value has a good chance to successfully send its entire packet before the next collision
• Could remember the last value of K when we were successful... but instead, a new packet is tried with minimal backoff again! (Analogy: TCP remembers the last value of the congestion window size)
• Q: why use binary backoff rather than something more sophisticated such as AIMD? A: simplicity
The bottom line
Why does Ethernet use randomization?
E.g., to desynchronize:
A distributed (=“each host runs the protocol independently”) adaptive algorithm to spread out load over time when there is contention for multiple access channel
Efficiency of Ethernet?
Approximation formulas, e.g.
Eff = 1/(1+5*prop/dur)
where
• prop = max propagation time between two adapters
• dur = time to transmit a packet of max size
Intuition:
• If prop is very small, colliding transmissions are stopped immediately, so efficient!
• If dur is very large, the channel is used for a long time without collisions, which is efficient again.
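A short numeric check of the approximation makes the two intuitions concrete. The function name `ethernet_efficiency` is an assumption for this sketch, and the sample values are arbitrary (any time unit works as long as `prop` and `dur` use the same one).

```python
def ethernet_efficiency(prop, dur):
    """Approximation Eff = 1 / (1 + 5 * prop / dur) from the slide."""
    if prop < 0 or dur <= 0:
        raise ValueError("need prop >= 0 and dur > 0")
    return 1.0 / (1.0 + 5.0 * prop / dur)


if __name__ == "__main__":
    # Arbitrary sample values: efficiency rises as prop/dur shrinks.
    for prop, dur in ((1.0, 5.0), (1.0, 50.0), (0.0, 1.0)):
        print(prop, dur, round(ethernet_efficiency(prop, dur), 3))
```

Only the ratio prop/dur matters: halving the cable length has the same effect as doubling the frame duration.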
Excursion: Medium Access on Wireless Networks: What changes…?
Typical wireless networks…:
• are not full-duplex (just one channel...)
• have nodes that cannot sense the medium during their own transmissions (just one antenna...)
• have no bounded propagation domain
• are multihop (hidden and exposed terminal problems):
Hidden terminal (A - B - C): C does not notice that B is currently receiving a transmission from A => no "remote carrier sense"
Exposed terminal (A - B - C): B sends to A, and C wants to send to someone on the right: C waits because it hears B, but B would not reach the recipient of C, so C could actually send! => inefficient
Excursion: Medium Access on Wireless Networks?
Therefore, CD is often replaced by (best effort) Collision Avoidance (CA)
Side note: still ongoing research, e.g., there are randomized distributed medium access protocols which optimally coordinate medium access probabilities and exploit the unpredictable non-jammed time periods (jamming, e.g., due to external interference), e.g., the Jade protocol.
Reference: A Jamming-Resistant MAC Protocol for Multi-Hop Wireless Networks, DISC 2010.
Randomization to avoid synchronization!
Phenomenon: many apparently independent processes synchronize over time
Classic example: 17th century (Huygens)
• Two pendulums synchronize if attached to the same wall!
• Try putting two metronomes on the same floor...
• Similar phenomena: blinking of fireflies, road traffic and car kinetics (one car reduces speed: collective decrease in flow), TCP window increase/decrease cycles in the presence of a shared bottleneck gateway, client/server scenarios where the server is busy, etc.
Youtube!
http://www.youtube.com/watch?v=tlYIyKic3w8
Fireflies...
Routing messages can get synchronized over time!
• Emergent phenomenon: no synchronization up to a certain scale, and then fully synchronized!
• Can result in long delays...
• Randomization can help, but quite a lot is needed!
(de)Synchronization of periodic routing updates
Periodic losses observed in end-to-end Internet traffic at 90 sec intervals
• Ping messages to Harvard and MIT (1-sec intervals)
• Round trip times in figure: losses shown as negative RTT
Why?
IGRP routing updates: routers could not forward other packets while large routing updates were processed; similar phenomena with RIP...
Found paths with 318 sec / 45 sec / 15 sec spikes...
Router Updates
A simplified model (for EGP, IGRP, RIP, etc.):
Routers transmit routing messages at periodic intervals (ensures consistent tables even after losses)
1. A router prepares and sends its routing message. In the absence of incoming routing messages, the router resets its timer Tc seconds after Step 1 begins, where Tc is the time to process an outgoing or incoming message. Other nodes receive this router's message after Td seconds.
2. If a router receives an incoming routing message while preparing its own outgoing routing message, it also processes the incoming routing message, which takes another Tc seconds.
3. After Steps 1 and 2, a router sets its timer; the timer expires after a time drawn uniformly from the interval [Tp-Tr, Tp+Tr], where Tr describes the random fluctuation (e.g., OS overhead). When the timer expires, the router goes back to Step 1.
4. If a router receives a message after the timer has been set, the routing message is processed immediately. If it is a triggered update (e.g., link failure), the router goes directly to Step 1 without waiting for the timer to expire.
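A rough simulation sketch of this model follows. It is deliberately cruder than the model itself: it assumes Td = 0, ignores triggered updates, and treats every group of routers whose timers expire within the same busy period as one cluster that finishes together after (cluster size) x Tc. The function name `simulate_rounds` and all parameter names are hypothetical.

```python
import random


def simulate_rounds(first_expiry, tp, tr, tc, rounds, rng):
    """Fire `rounds` clusters and return the size of each one.

    first_expiry: initial timer expiry time of every router.
    A cluster starts when the earliest timer expires. Each member adds Tc
    of processing for all members, so a router joins the cluster if its
    timer expires before the (growing) busy period ends.
    """
    expiry = list(first_expiry)
    sizes = []
    for _ in range(rounds):
        order = sorted(range(len(expiry)), key=lambda i: expiry[i])
        start = expiry[order[0]]
        cluster = [order[0]]
        for i in order[1:]:
            if expiry[i] <= start + len(cluster) * tc:
                cluster.append(i)
            else:
                break
        done = start + len(cluster) * tc
        # All members reset their timers when the cluster is done (Step 3).
        for i in cluster:
            expiry[i] = done + rng.uniform(tp - tr, tp + tr)
        sizes.append(len(cluster))
    return sizes


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed only to make the demo repeatable
    starts = [rng.uniform(0, 120) for _ in range(20)]
    sizes = simulate_rounds(starts, tp=120.0, tr=0.1, tc=0.1, rounds=5000, rng=rng)
    print("largest cluster seen:", max(sizes))
```

Even in this crude form the coupling is visible: a cluster of k routers stays busy for k x Tc, which makes it more likely to absorb the next router whose timer is about to expire.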
Stefan Schmid 34
Router Update Operation:
prepareown routing
update(time: TC)
receive update from neighborprocess (time: TC)
wait
receive update from neighborprocess
<ready>send update (time: Td
to arrive at dest)start_timer (uniform: Tp
+/-
Tr)
timeout, or link fail
update
time spent in statedepends on msgs
received from others(weak coupling between routers
processing)
Stefan Schmid 35
Router Synchronization
Simulation: 20 routers broadcasting updates to each other
• x-axis: time until routing update sent, relative to start of round
• By t=100,000, all router rounds are of length 120 and synchronized! Yields long delays... (20*Tc instead of Tc seconds!)
• Synchronization or the lack thereof depends on system parameters… (e.g., crucially on network size according to the paper)
• Often a robust trend to or away from synchronization...
Details
Blowup of previous graph (figure labels: short interval, long interval)
Note expansion of computation phase → increased period
A's timer expires and A begins to send its message, but before A finishes, B's timer expires; A needs to process B's message too before resetting its timer, so this takes time 2*Tc: A and B are synchronized and become a cluster... (for Td=0)
Desynchronization due to some random event...
Sync
Coupled routers: example of spontaneous synchronization, like
• fireflies
• sleep cycle
• heart beat
• etc.
Reference: Steven Strogatz. Sync, Hyperion Books, 2003.
Avoiding Synchronization?
• Enforce a max time spent in the prepare state
• Make things independent of external events (e.g., spec of RIP)? Problem: if initially synchronized, never desynchronizes...
• Choose the random timer component Tr large
Router (de)synchronization
One use of randomization:
Desynchronization of routers!
Our model was simplistic: ignores collisions, Ethernet retransmissions, etc.
Randomization in Reliable Multicast
Reliable Multicast (RM): how to transfer data "reliably" from source(s) to R receivers.
"Like in real life": all current RM error and congestion control approaches have an analogy in human-human communication
Scalability: Feedback Implosion
[Figure: many rcvrs each send an ACK to the single sender]
If all receivers ACK immediately upon reception, the sender has to process a large number of messages!
Smart and scalable reliable MC?
Reliable Mcast
Thus, we can distinguish between two main types of multicasts:
Sender-oriented multicast: how to implement? Pro and con?
Receiver-oriented multicast: how to implement? Pro and con?
What is better? Two sample implementations next...
Sender Oriented Reliable Mcast
Sender:
• mcasts all (re)transmissions
• selective repeat on loss (only the lost packet, but mcast to all again)
• timers for loss detection
• (positive) ACK table: for each packet, a list of who has ACKed already
• pkt removed when all ACKs are in
Rcvr: ACKs received pkts
Note: group membership is important (sender needs to know…)
[Figure: sender multicasts to receivers, one copy is lost (X); receivers send ACKs back]
Burden at sender: ACK lists, timers, ACK implosion, ...
How to do it reliably with less burden at the server?! Without ACKs?!
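To make the sender's burden concrete, here is a minimal sketch of the ACK table described above. The class `AckTable` and its methods are hypothetical names; timers and retransmissions are left out, and treating an ACK for an unknown sequence number as a harmless duplicate is an assumption of this sketch.

```python
class AckTable:
    """Per-packet record of which receivers still owe an ACK."""

    def __init__(self, receivers):
        # The sender must know the full group membership up front.
        self.receivers = frozenset(receivers)
        self.pending = {}  # seq number -> set of receivers not yet ACKed

    def sent(self, seq):
        """Register a freshly multicast packet."""
        self.pending[seq] = set(self.receivers)

    def ack(self, seq, receiver):
        """Record an ACK; return True once the packet can be released."""
        waiting = self.pending.get(seq)
        if waiting is None:
            return True  # assumed: already released, duplicate ACK
        waiting.discard(receiver)
        if not waiting:
            del self.pending[seq]
            return True
        return False

    def missing(self, seq):
        """Receivers that have not ACKed the packet yet."""
        return set(self.pending.get(seq, ()))


if __name__ == "__main__":
    table = AckTable(["r1", "r2", "r3"])
    table.sent(7)
    table.ack(7, "r1")
    print(sorted(table.missing(7)))
```

State and work grow with both the number of outstanding packets and the number of receivers, and every receiver's ACK has to be processed by the single sender.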
(simple) Rcvr Oriented Reliable Mcast
Sender:
• mcasts (re)transmissions
• selective repeat (but to all)
• responds to NAKs
• Problem: when to stop buffering a pkt? (sender does not know who is there and interested!)
Rcvr:
• NAKs (unicast to sender) missing pkts (e.g., gap in seq numbers)
• timer to detect lost retransmission
Note: easy to allow joins/leaves: no list at sender
[Figure: sender multicasts to receivers, one copy is lost (X); the affected receiver sends a NAK]
Receiver- vs Sender-oriented RM: Observations? (Dis)Advantages?
Rcvr-oriented: shifts the recovery burden to rcvrs
• loss detection "responsibility", timers
• scaling: the protocol's computational resources grow as R (# receivers) grows ("receivers scale, sender does not!")
• weaker notion of "group" (no explicit lists at server…)
• also cool: receivers can transparently choose their own, individual reliability semantics!
but …
• when does the sender "release" data rcvd by all?
• heartbeat needed to detect a lost last pkt (receivers won't notice a lost last packet: no gap in seq numbers…)
Evaluation of Approaches
Let's examine resource requirements!
Processing requirements:
• expected time to process a pkt (mean value approach)
• at sender: X, E[X]
• at rcvr: Y, E[Y]
Network requirements
Processing in Sender-Initiated Protocol
For Mcast, sender must:
• Obtain data from higher layers (app)
• Construct packet
• Set timer
• Process every ACK for each packet and receiver
• Handle timer interrupts and context switches...
• If error: rebroadcast, set timer again...
Assumptions for Analysis
• one sender, R receivers
• computational load matters!
• independent errors (not true for spanning tree propagation!), with loss probability p per rcvr
• lossless signaling (okay: short ACKs are less likely to get lost, and sometimes get a "better service")
M: total number of transmissions per packet:
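$P[M \le m] = (1 - p^m)^R, \quad m = 1, 2, \ldots$

$E[M] = \sum_{m=0}^{\infty} P[M > m] = \sum_{m=0}^{\infty} \left(1 - (1 - p^m)^R\right)$

The probability that none of the m transmissions arrives at a given rcvr is $p^m$, so that rcvr has the packet with probability $1 - p^m$; the probability that all R rcvrs have it is the product, $(1 - p^m)^R$.
$E[M] = \sum_m m \cdot P[M = m]$, which can be rewritten as a sum over $P[M > m]$ by counting each outcome multiple times.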
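The series for E[M] is easy to evaluate numerically. The sketch below sums the terms 1 - (1 - p^m)^R, starting at m = 0, until they become negligible; the function name and the tolerance `tol` are assumptions for illustration, and the independence of losses is the slide's assumption.

```python
def expected_transmissions(p, receivers, tol=1e-12):
    """E[M] = sum over m >= 0 of P[M > m] = 1 - (1 - p**m)**receivers."""
    if not 0.0 <= p < 1.0:
        raise ValueError("loss probability must be in [0, 1)")
    if receivers < 1:
        raise ValueError("need at least one receiver")
    total = 0.0
    m = 0
    while True:
        term = 1.0 - (1.0 - p ** m) ** receivers
        if term < tol:
            return total
        total += term
        m += 1


if __name__ == "__main__":
    for r in (1, 10, 100, 1000):
        print(r, round(expected_transmissions(0.05, r), 3))
```

For a single receiver the sum reduces to the familiar 1/(1-p); with more receivers E[M] grows, because a retransmission is needed as soon as any one receiver misses the packet.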
Analysis…
E.g.:
Sender vs Receiver: Simulation
Metric: rcvr-oriented thruput / sender-oriented thruput
• the sender is the bottleneck (see paper), so sender throughput = overall system throughput
• much better throughput in receiver-oriented MC
• especially for many receivers (scales better) and low error probability (hardly any NAKs…)
• in many-to-many multicasts, the gain is smaller…
[Figure: two panels, "One-to-Many Comparison" and "Many-to-Many Comparison"; x-axis: No. Receivers (0 to 1000); curves for p=0.01, p=0.05, p=0.10, p=0.25]
RM: Coping with Scale, Heterogeneity
Issues:
• avoid feedback implosion in the reverse path
• avoid receiving unneeded data (retrans.) in the forward path
• recover data quickly, avoid long repair times
Techniques:
• feedback suppression
• local recovery: "local retransmission"
How to do even better?
Feedback Suppression
• randomly delay NAKs
• "listen" to NAKs generated by others
• if no NAK for the lost pkt has been heard when the timer expires, multicast the NAK
• widely used in RM
Tradeoffs:
• reduces bandwidth, especially with correlated errors (e.g., along a spanning tree)
• but: additional complexity at receivers (timers, etc.), maybe higher delay…
[Figure: sender multicasts; two receivers lose the pkt (X X); a single NAK is sent]
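The suppression rule can be sketched by looking only at the random timers. In the simplified model below, every receiver that lost the packet draws a delay, the earliest timer multicasts a NAK, and any receiver whose timer fires before that NAK could be heard sends its own NAK as well. The names `naks_sent`, `draw_delays` and `hear_delay` (a single, uniform propagation delay between receivers) are assumptions of this sketch.

```python
import random


def naks_sent(delays, hear_delay):
    """Indices of receivers that end up multicasting a NAK.

    delays: random NAK timer of each receiver that lost the packet.
    hear_delay: time until a multicast NAK is heard by the others.
    """
    if not delays:
        return []
    first = min(delays)
    # The earliest timer always fires; others are suppressed only if the
    # first NAK reaches them before their own timer expires.
    return [i for i, d in enumerate(delays)
            if d == first or d < first + hear_delay]


def draw_delays(n, max_delay, rng):
    """Uniform random NAK timers in [0, max_delay]."""
    return [rng.uniform(0, max_delay) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed only to make the demo repeatable
    for max_delay in (0.1, 1.0, 10.0):
        sent = naks_sent(draw_delays(100, max_delay, rng), hear_delay=0.05)
        print(max_delay, len(sent))
```

The tradeoff from the slide shows up directly: a longer randomization interval suppresses more NAKs but delays the repair.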
Feedback Suppression: Performance Gains
Metric: suppression thruput / no-suppression thruput
• with high error rates, suppression helps more, and the gain scales almost linearly in the number of receivers!
• gains/losses depend on whether 1-many or many-many (receivers are also senders, the additional complexity now matters, etc.)
[Figure: two panels, "One-to-Many Comparison" and "Many-to-Many Comparison"; x-axis: No. Receivers (0 to 1000); curves for p=0.01, p=0.05, p=0.10, p=0.25]
Local Recovery in SRM
Another idea: fix locally!
Allow a rcvr to recover a lost pkt from a "nearby" rcvr
• "ask your neighbor": send localized NAK (repair request)
• multicast: randomize the local repair transmission time to avoid too many replies
• orthogonal (complementary) to feedback suppression
Who to recover from?
• don't want the repair request to go to everyone
• scoping: how to restrict how far the request will travel: IP time-to-live field
Local Recovery: Example
• R2 detects a lost pkt and multicasts a repair request with limited scope (not seen by R4)
• R1 and R3 have the pkt
• R3 times out first and sends the repair
[Figure: receivers R1 to R4, with the NAK and the repair marked]
Reliable multicast (SRM)
Use of randomization:
• avoid synchronizing all replies, to reduce feedback implosion
• in local recovery, to reduce the number of retransmissions of the same message
• could scale the randomization interval to be load-adaptive…
Sidenote: Multicast vs N Unicasts
Multicast "group concept" (e.g., IP multicast, indirection) preferable to N unicasts (e.g., N TCP connections): no redundant transmissions over a link
Challenges for (reliable) multicast:
• "fate-sharing" is clear in unicast: either sender or receiver must detect and recover from errors (e.g., in TCP: the sender); but in multicast, receivers can come and go at any time?
• smart round trip time estimate with heterogeneous receivers?? (in unicast clear...)
• congestion window size?
=> receiver-based is often better... (e.g., IP multicast, RSVP)
To be continued next week…