september 1998 volume 1, number 2 - cisco › c › dam › en_us › about › ac123 › ac147 ›...

September 1998 Volume 1, Number 2

Missed the first issue of IPJ?Download your copy in

PDF format from: www.cisco.com/ipj

A Quarterly Technical Publication for Internet and Intranet Professionals

In This Issue

From the Editor .......................1

What Is a VPN?–Part II ...........2

Reliable Multicast Protocolsand Applications....................19

Layer 2 and Layer 3Switch Evolution....................38

Book Review..........................44

Fragments ..............................47

F r o m T h e E d i t o r

We begin this issue with Part II of “What Is a VPN?” by Paul Fergusonand Geoff Huston. In Part I they introduced a definition of the term “Vir-tual Private Network” (VPN) and discussed the motivations behind theadoption of such networks. They outlined a framework for describingthe various forms of VPNs, and examined numerous network-layer VPNstructures, in particular, that of controlled route leakage and tunneling. InPart II the authors conclude their examination of VPNs by describing vir-tual private dial networks and network-layer encryption. They alsoexamine link-layer VPNs, switching and encryption techniques, andissues concerning Quality of Service and non-IP VPNs.

IP Multicast

is an emerging set of technologies and standards thatallow many-to-many transmissions such as conferencing, or one-to-many transmissions such as live broadcasts of audio and video over theInternet. Kenneth Miller describes multicast in general, and reliablemulticast protocols and applications in particular. Although multicastapplications are primarily used in the research community today, thissituation is likely to change as the demand for Internet multimediaapplications increases and multicast technologies improve.

Successful deployment of networking technologies requires an under-standing of a number of technology options ranging from wiring andtransmissions systems via switches, routers, bridges and other pure net-working components, to networked applications and services.

TheInternet Protocol Journal

(IPJ) is designed to look at all aspects of these“building blocks.” This time, Thayumanavan Sridhar details some ofthe issues in the evolution of Layer 2 and Layer 3 switches.

Interest in the first issue of IPJ has exceeded our expectations, and hardcopies are almost gone. However, you can still view and print the issuein PDF format on our Web site at

www.cisco.com/ipj

. The currentedition is also available on the Web. If you want to receive our nextissue, please complete and return the enclosed card.

We welcome your comments, questions and suggestions regarding any-thing you read in this journal. We are also actively seeking authors fornew articles. The Call for Papers and Author Guidelines can be foundon our Web page. Please send your comments to

[email protected]

—Ole J. Jacobsen, Editor and Publisher

[email protected]

T h e I n t e r n e t P r o t o c o l J o u r n a l

2

What Is a VPN? — Part II

by Paul Ferguson, Cisco Systemsand Geoff Huston, Telstra

n Part I we introduced a working definition of the term “VirtualPrivate Network” (VPN), and discussed the motivations behind theadoption of such networks. We outlined a framework for describ-

ing the various forms of VPNs, and then examined numerous network-layer VPN structures, in particular, that of controlled route leakage andtunneling techniques. We begin Part II with examining other network-layer VPN techniques, and then look at issues that are concerned withnon-IP VPNs and Quality-of-Service (QoS) considerations.

Types of VPNs

This section continues from Part I to look at the various types of VPNsusing a taxonomy derived from the layered network architecturemodel. These types of VPNs segregate the VPN network at the net-work layer.

Network-Layer VPNs

A network can be segmented at the network layer to create an end-to-end VPN in numerous ways. In Part I we described a controlled routeleakage approach that attempts to perform the segregation only at theedge of the network, using route advertisement control to ensure thateach connected network received a view of the network (only peer net-works). We pick up the description at this point in this second part ofthe article.

Tunneling

As outlined in Part I, the alternative to a model of segregation at theedge is to attempt segregation throughout the network, maintaining theintegrity of the partitioning of the substrate network into VPN compo-nents through the network on a hop-by-hop basis. Part I examinednumerous tunneling technologies that can achieve this functionality.Tunneling is also useful in servicing VPN requirements for dial access,and we will resume the description of tunnel-based VPNs at this point.

Virtual Private Dial Networks

Although several technologies (vendor-proprietary technologies as wellas open, standards-based technologies) are available for constructing a

Virtual Private Dial Network

(VPDN), there are two principal meth-ods of implementing a VPDN that appear to be increasing inpopularity—

Layer 2 Tunneling Protocol

(L2TP) and

Point-to-PointTunneling Protocol

(PPTP) tunnels. From an historical perspective,L2TP is the technical convergence of the earlier Layer 2 Forwarding(L2F)

[1]

protocol specification and the PPTP protocol. However, onemight suggest that because PPTP is now being bundled into the desk-top operating system of many of the world’s personal computers, itstands to be quite popular within the market.

I


3

At this point it is worthwhile to distinguish the difference between “cli-ent-initiated” tunnels and “NAS-initiated” (Network Access Server,otherwise known as a Dial Access Server) tunnels. The former is com-monly referred to as “voluntary” tunneling, whereas the latter is com-monly referred to as “compulsory” tunneling. In voluntary tunneling,the tunnel is created at the request of the user for a specific purpose; incompulsory tunneling, the tunnel is created without any action from theuser, and without allowing the user any choice in the matter.

L2TP, as a compulsory tunneling model, is essentially a mechanism to“off-load” a dialup subscriber to another point in the network, or toanother network altogether. In this scenario, a subscriber dials into aNAS, and based on a locally configured profile (or a NAS negotiationwith a policy server) and successful authentication, a L2TP tunnel isdynamically established to a predetermined endpoint, where the sub-scriber’s

Point-to-Point Protocol

(PPP) session is terminated (Figure 1).

Figure 1:PPP Tunnel

Termination Modelof L2TP

PPTP, as a voluntary tunneling model, on the other hand, allows endsystems (for example, desktop computers) to configure and establishindividual discrete point-to-point tunnels to arbitrarily located PPTPservers, without the intermediate NAS participating in the PPTPnegotiation and subsequent tunnel establishment. In this scenario, asubscriber dials into a NAS, but the PPP session is terminated on theNAS, as in the traditional Internet access PPP model. The layered PPTPsession is then established between the client end system and anyupstream PPTP server that the client desires to connect to. The onlycaveats on PPTP connectivity are that the client can reach the PPTPserver via conventional routing processes, and that the user has beengranted the appropriate privileges on the PPTP server (Figure 2).

Figure 2:PPP Tunnel

Termination Modelof PPTP

Client Host L2TP AccessServer

Dial Access Provider

PPP Access Protocol

Internet or VPN Service

Dial AccessServer

Non-routedforwarding path

V.x modem protocol L2TP

Client Host PPTP AccessServer

Dial Access Provider VPN Service

Dial AccessServer

PPTP VirtualInterface

SerialInterface

Dial IP Access

PPP Access Protocol

What Is a VPN? — Part II:

continued


4

Although L2TP and PPTP may sound extraordinarily similar, there aresubtle differences that deserve further examination. The applicability ofboth protocols is very much dependent on what problem is beingaddressed. It is also about control—who has it, and why it is needed. Italso depends heavily on how each protocol implementation isdeployed—in either the voluntary or the compulsory tunneling models.

With PPTP in a voluntary tunneling implementation, the dial-in usercan choose the PPTP tunnel destination (the PPTP server) after the ini-tial PPP negotiation has completed. This feature is important if thetunnel destination changes frequently, because no modifications areneeded to the client’s view of the base PPP access when there is achange in the server and the transit path to the server. It is also asignificant advantage that the PPTP tunnels are transparent to the ser-vice provider, and no advance configuration is required between theNAS operator and the overlay dial access VPN. In such a case, the ser-vice provider does not house the PPTP server, and simply passes thePPTP traffic along with the same processing and forwarding policies asall other IP traffic. In fact, this feature should be considered asignificant benefit of this approach. The configuration and support of atunneling mechanism within the service provider network would beone less parameter that the service provider has to operationally man-age, and the PPTP tunnel can transparently span multiple serviceproviders without any explicit service provider configuration. How-ever, the economic downside to this feature for the service provider, ofcourse, is that a “VPDN-enabled” network service can be marketed toyield an additional source of revenue. Where the client undertakes theVPDN connection, there is no direct service provider involvement andno consequent value added to the base access service.

From the subscriber’s perspective, this is a “win-win” situation, becausethe user is not reliant on the upstream service provider to deliver theVPDN service—at least no more than any user is reliant for basic IP-level connectivity. The other “win” is that the subscriber does not haveto pay a higher subscription fee for a VPN service. Of course, the situa-tion changes when the service provider takes an active role in providingthe VPDN, such as housing the PPTP servers, or if the subscriber resideswithin a subnetwork in which the parent organization wants the ser-vice provider’s network to make the decision concerning where tunnelsare terminated. The major characterization of PPTP-based VPDN is oneof a roaming client base, where the clients of the VPDN use a local con-nection to the public Internet data network, and then overlay a privatedata tunnel from the client’s system to the desired remote service point.Another perspective is to view this approach as “on-demand” VPDNvirtual circuits.

With L2TP in a “compulsory” tunneling implementation, the serviceprovider controls where the PPP session is terminated. This setup can beextremely important in situations where the service provider to whom


5

the subscriber is actually dialing into (let’s call it the “modem pool pro-vider” network) must transparently hand off the subscriber’s PPPsession to another network (let’s call this network the “content pro-vider”). To the subscriber, it appears as though the local system isdirectly attached to the content provider’s network, when in fact theaccess path has been passed transparently through the modem pool pro-vider’s network to the subscribed content service. Very large contentproviders, for instance, may outsource the provisioning and mainte-nance of thousands of modem ports to a third-party access provider,who in turn agrees to transparently pass the subscribers’ access sessionsback to the content provider. This setup is generally called “wholesaledial.” The major motivation for such L2TP-based wholesale dial lies inthe typical architecture of the

Public Switched Telephone Network

(PSTN), where the use of wholesale dial facilities can create a morerational PSTN call load pattern with Internet access PSTN calls termi-nated in the local Central Office.

Of course, if all subscribers who connect to the modem pool provider’snetwork are destined for the same content provider, then there are cer-tainly easier ways to hand this traffic off to the content provider’snetwork—such as simply aggregating all the traffic in the local CentralOffice and handing the content provider a “big fat pipe” of the aggre-gated session traffic streams. However, in situations where the modempool provider is providing a wholesale dial service for multipleupstream “next-hop” networks, the methods of determining how eachsubscriber’s traffic must be forwarded to his/her respective content pro-vider are somewhat limited. Packet forwarding decisions could be madeat the NAS, based on the source address of the dialup subscriber’s com-puter. This scenario would allow for traffic to be forwarded along theappropriate path to its ultimate destination, in turn intrinsically provid-ing a virtual connection. However, the use of assigning static IPaddresses to dial-in subscribers is highly discouraged because of theinefficiencies in IP address utilization policies, and the critical success ofthe

Dynamic Host Configuration Protocol

(DHCP).

There are, however, some serious scaling concerns in deploying a large-scale L2TP network; these concerns revolve around the issue ofwhether large numbers of tunnels can actually be supported with littleor no network performance impact. Since there have been no large-scale deployments of this technology to date, there is no empirical evi-dence to support or invalidate these concerns.

In some cases, however, appearances are everything—some contentproviders do not wish for their subscribers to know that when theyconnect to their service, they have instead been connected to anotherservice provider’s network, and then passed along ultimately to the ser-vice to which they have subscribed. In other cases, it is merely designedto be a matter of convenience, so that subscribers do not need to loginto a device more than once.


continued


6

Regrettably, the L2TP draft does not detail all possible implementa-tions or deployment scenarios for the protocol. The basic deploymentscenario is quite brief when compared to the rest of the document, andis arguably biased toward the compulsory tunneling model. Nonethe-less, there are implementations of L2TP that follow the voluntarytunneling model. To the best of our knowledge, there has never beenany intent to exclude this model of operation. In addition, at variousrecent interoperability workshops, several different implementations ofa voluntary L2TP client have been modeled. Nothing in the L2F proto-col would prohibit deploying it in a voluntary tunneling manner, butto date it has not been widely implemented. Further, PPTP has alsobeen deployed using the compulsory model in a couple of specific ven-dor implementations.

In summary, consideration of whether PPTP or L2TP is more appro-priate for deployment in a VPDN depends on whether control needs tolie with the service provider or with the subscriber. Indeed, the differ-ence can be characterized with respect to the client of the VPN, wherethe L2TP model is one of a “wholesale” access provider who hasnumerous configured client service providers who appear as VPNs onthe common dial access system, whereas the PPTP model is one of dis-tributed private access where the client is an individual end user andthe VPN structure is that of end-to-end tunnels. One might also sug-gest that the difference is also a matter of economics, because the L2TPmodel allows service providers to actually provide a “value-added”service, beyond basic IP-level connectivity, and charge their subscribersaccordingly for the ability to access it, thus creating new revenuestreams. By contrast, the PPTP model enables distributed reach of theVPN at a much more basic level, enabling corporate VPNs to extendaccess capabilities without the need for explicit service contracts with amultitude of network access providers.

Network-Layer Encryption

Encryption technologies are extremely effective in providing the seg-mentation and virtualization required for VPN connectivity, and theycan be deployed at almost any layer of the protocol stack. The evolv-ing standard for network-layer encryption in the Internet is

IP Security

(IPSec)

[3, 4]

. (IPSec is actually an architecture—a collection of proto-cols, authentication, and encryption mechanisms. The IPSec securityarchitecture is described in detail in [3].)

While the

Internet Engineering Task Force

(IETF) is finalizing thearchitecture and the associated protocols of IPSec, there is relatively lit-tle network-layer encryption being done in the Internet today.However, some vendor proprietary solutions are currently in use.

Whereas IPSec has yet to be deployed in any significant volume, it isworthwhile to review the two methods in which network-layer encryp-tion is predominantly implemented. The most secure method for network-


7

layer encryption to be implemented is end-to-end, between participatinghosts. End-to-end encryption allows for the highest level of security. Thealternative is more commonly referred to as “tunnel mode,” in which theencryption is performed only between intermediate devices (routers), andtraffic between the end system and the first-hop router is in plaintext. Thissetup is considerably less secure, because traffic intercepted in transitbetween the first-hop router and the end system could be compromised.

As a more general observation on this security vulnerability, where aVPN architecture is based on tunnels, the addition of encryption to thetunnel still leaves the tunnel ingress and egress points vulnerable,because these points are logically part of the host network as well asbeing part of the unencrypted VPN network. Any corruption of theoperation, or interception of traffic in the clear, at these points willcompromise the privacy of the private network.

In the end-to-end encryption scheme, VPN granularity is to the individ-ual end-system level. In the tunnel mode scheme, the VPN granularityis to the subnetwork level. Traffic that transits the encrypted linksbetween participating routers, however, is considered secure. Network-layer encryption, to include IPSec, is merely a subset of a VPN.

Link-Layer VPNs

One of the most straightforward methods of constructing VPNs is touse the transmission systems and networking platforms for the physi-cal and link-layer connectivity, yet still be able to build discretenetworks at the network layer. A link-layer VPN is intended to be aclose (or preferably exact) functional analogy to a conventional pri-vate data network.

ATM and Frame Relay Virtual Connections

A conventional private data network uses a combination of dedicatedcircuits from a public carrier, together with an additional private com-munications infrastructure, to construct a network that is completelyself-contained. Where the private data network exists within privatepremises, the network generally uses a dedicated private wiring plantto carry the VPN. Where the private data network extends outside theprivate boundary of the dedicated circuits, it is typically provisionedfor a larger public communications infrastructure by using some formof time-division or frequency-division multiplexing to create the dedi-cated circuit. The essential characteristic of such circuits is thesynchronization of the data clock, such that the sender and receiverpass data at a clocking rate that is fixed by the capacity of the dedi-cated circuit.

A link-layer VPN attempts to maintain the critical elements of this self-contained functionality, while achieving economies of scale and opera-tion, by utilizing a common switched public network infrastructure.Thus, a collection of VPNs may share the same infrastructure for con-nectivity, and share the same switching elements within the interior of


continued


8

the network, but explicitly must have no visibility, either direct orinferred, of one another. Generally, these “networks” operate at Layer3 (the network layer) or higher in the OSI Reference Model, and the“infrastructure” itself commonly consists of either a

Frame Relay

or

Asynchronous Transfer Mode

(ATM) network (Figure 3). The essen-tial difference here between this architecture of virtual circuits and thatof dedicated circuits is that there is now no synchronized data clockshared by the sender and receiver, nor necessarily is there a dedicatedtransmission path that is assigned from the underlying common hostnetwork. The sender generally has no a priori knowledge of the avail-able capacity of the virtual circuit, because the capacity varies inresponse to the total demand placed on it by other simultaneous trans-mission and switching activity. Instead, the sender and receiver can useadaptive clocking of data, where the sender can adjust the transmis-sion rate to match the requirements of the application and anysignaling received from the network and the receiver. It should benoted that a dedicated circuit system using synchronized clocking can-not be oversubscribed, whereas the virtual circuit architecture (wherethe sender does not have a synchronized end-to-end data clock) canindeed be oversubscribed. It is the behavior of the network when ittransitions into this oversubscribed state that is of most interest here.

Figure 3:Conceptualization of

Discrete Layer 3Networks on a

Common Layer 2Infrastructure

One of the nice things about a public switched wide-area network thatprovides virtual circuits is that it can be extraordinarily flexible. Mostsubscribers to Frame Relay services, for example, have subscribed tothe service for economic reasons—it is cheap, and the service providerusually adds a

Service-Level Agreement

(SLA) that “guarantees” somepercentage of frame delivery in the Frame Relay network itself.

The remarkable thing about this service offering is that the customer isgenerally completely unaware of whether the service provider can actu-ally deliver the contracted service at all times and under all possibleconditions. The Layer 2 technology is not a synchronized clock block-ing technology in which each new service flow is accepted or deniedbased on the absolute ability to meet the associated resource demands.Each additional service flow is accepted into the network and carriedon a best-effort basis. Admission functions provide the network with asimple two-level discard mechanism that allows a graduated responseto instances of overload; however, when the point of saturated over-load is reached within the network, all services will be affected.

This situation brings up several other important issues: The first con-cerns the engineering practices of the Frame Relay service provider. Ifthe Frame Relay network is poorly engineered and is constantly con-gested, then obviously the service quality delivered to the subscriberswill be affected. Frame Relay uses a notion of a per-virtual circuit

Com-mitted Information Rate

(CIR), which is an ingress function associatedwith Frame Relay that checks the ingress traffic rate against the CIR.

Discrete Layer 3Network

Layer 2Infrastructure

(e.g., FrameRelay, ATM)




9

Frames that exceed this base rate are still accepted by the Frame Relaynetwork, but they are marked as

discard eligible

(DE). Because the net-work can be oversubscribed, the data rate within a switch will at timesexceed both the egress transmission rate and the local buffer storage.When this situation occurs, the switch will begin to discard dataframes, and will do so initially for frames with the DE marker present.This scenario is essentially a two-level discard precedence architecture.It is an administrative decision by the service provider as to the relativelevels of provisioning of core transmission and switching capacity, andthe ratio of network ingress capacity used by subscribers. The associ-ated CIRs of the virtual circuits against this core capacity are criticaldeterminants of the resultant deliverable quality of performance of thenetwork and the layered VPNs.

For example, at least one successful (and popular) Frame Relay serviceprovider provides an economically attractive Frame Relay service thatpermits a zero-rate CIR on PVCs, combined with an SLA that ensuresthat at least 99.8 percent of all frame-level traffic presented to theFrame Relay network will be delivered successfully. If this SLA is notmet, then the subscriber’s monthly service fee will be appropriatelyprorated the following month. The Frame Relay service provider pro-vides frame level statistics to each subscriber every month, culled fromthe Frame Relay switches, to measure the effectiveness of this SLA“guarantee.” This particular Frame Relay service provider is remark-ably successful in honoring the SLAs because they conduct ongoingnetwork capacity management on a weekly basis, provisioning newtrunks between Frame Relay switches when trunk utilization exceeds50 percent, and ensuring that trunk utilization never exceeds 75 per-cent. In this fashion, traffic on PVCs with a zero-rate CIR can generallyavoid being discarded in the Frame Relay network.

Having said that, the flexibility of PVCs allows discrete VPNs to beconstructed across a single Frame Relay network. And in manyinstances, this scenario lends itself to situations where the Frame Relaynetwork provider also manages each discrete VPN via a telemetry PVC.Several service providers have

Managed Network Services

(MNS) thatprovide exactly this type of service.

Whereas the previous example revolves around the use of Frame Relayas a link-layer mechanism, essentially the same type of VPN mechan-ics hold true for ATM. As with Frame Relay, there is no data clocksynchronization between the sender, the host network, and thereceiver. In addition, the sender’s traffic is passed into the ATM net-work via an ingress function, which can mark cells with a

Cell LossPriority

(CLP) indication. And, as with Frame Relay, where a switchexperiences congestion, the switch will attempt to discard marked(CLP) cells as the primary load shedding mechanism, but if this step isinadequate, the network must shed other cells that are not so marked.Once again, the quality of the service depends on proper capacity engi-neering of the network, and there is no guarantee of service qualityinherently in the technology itself.


continued


1 0

The generic observation is that the engineering of Frame Relay andATM common carriage data networks is typically very conservative.The inherent capabilities of both of these link-layer architectures donot permit a wide set of selective responses to network overload, sothat in order for the network to service the broadest spectrum ofpotential VPN clients, the network must provide high-quality carriageand very limited instances of any form of overload. In this way, suchnetworks are typically positioned as a high-quality alternative to dedi-cated circuit private network architectures, which are intended tooperate in a very similar manner (and, not surprisingly, are generallypriced as a premium VPN offering). Technically, the architecture oflink-layer VPNs is almost indistinguishable from the dedicated circuitprivate data network—the network can support multiple protocols,private addressing, and routing schemes, because the essential differ-ence between a dedicated circuit and a virtual link-layer circuit is theabsence of synchronized clocking between the sender and the receiver.In all other aspects, the networks are very similar.

These approaches to constructing VPNs certainly involve scaling con-cerns, especially with regard to configuration management of pro-visioning new

Virtual Connections

(VCs) and routing issues. Configura-tion management still tends to be one of the controversial points in VPNmanagement—adding new subscribers and new VPNs to the networkrequires VC path construction and provisioning, a tedium that requiresongoing administrative attention by the VPN provider. Also, as alreadymentioned, full mesh networks encounter scaling problems, in turnresulting in construction of VPNs in which partial meshing is done toavoid certain scaling limitations. The liabilities in these cases need to beexamined closely, because partial meshing of the underlying link-layernetwork may contribute to suboptimal routing (for example, extra hopscaused by hub-and-spoke issues, or redirects).

These problems apply to all types of VPNs built on the “overlay”model—not just ATM and Frame Relay. Specifically, the problems alsoapply to

Generic Routing Encapsulation

(GRE) tunnels.

MPOA and the “Virtual Router” Concept

Another unique model of constructing VPNs is the use of

Multiproto-col over ATM

(MPOA)

[5]

, which uses RFC 1483 encapsulation

[6]

. ThisVPN approach is similar to other “cut-through” mechanisms in whicha particular switched link layer is used to enable all “Layer 3” egresspoints to be only a single hop away from one another.

In this model, the edge routers determine the forwarding path in theATM switched network, because they have the ability to determinewhich egress point packets need to be forwarded to. After a network-layer reachability decision is made, the edge router forwards the packetonto a VC designated for a particular egress router. However, since theegress routers cannot use the

Address Resolution Protocol

(ARP) fordestination address across the cloud, they must rely on an externalserver for address resolution (ATM address to IP address).


1 1

The first concern here is a sole reliance on ATM—this particular modeldoes not encompass any other types of data link layer technologies,rendering the technology less than desirable in a hybrid network.Whereas this scenario may have some domain of applicability within ahomogenous ATM environment, when looking at a broader VPN envi-ronment that may encompass numerous link-layer technologies, thisapproach offers little benefit to the VPN provider.

Secondly, there are serious scaling concerns regarding full mesh mod-els of connectivity, where suboptimal network-layer routing may resultbecause of cut-through. And the reliance on address resolution serversto support the ARP function within the dynamic circuit frameworkbrings this model to the point of excessive complexity.

The advantage of the MPOA approach is the use of dynamic circuitsrather than more cumbersome, statically configured models. The tradi-tional approach to supporting private networks involves extensivemanual design and operational support to ensure that the variousconfigurations on each of the bearer switching elements are mutuallyconsistent. The desire within the MPOA environment is to attempt touse MPOA to govern the creation of dynamically controlled, edge-to-edge ATM VCs. Although this setup may offer the carrier operatorsome advantages in reduced design and operational overhead, it doesrequire the uniform availability of ATM, and in many heterogeneousenvironments this scenario is not present.

In summary, this model is another overlay model, with some seriousconcerns regarding the ability of the model to withstand scale.

“Peer” VPN models that allow the egress nodes to maintain separaterouting tables have also been introduced—one for each VPN—effec-tively allowing separate forwarding decisions to be made within eachnode for each distinctive VPN. Although this is an interesting model, itintroduces concerns about approaches in which each edge device runs aseparate routing process and maintains a separate

Routing InformationBase

(RIB, or routing table) process for each VPN community of inter-est. It also should be noted that the “virtual router” concept requiressome form of packet labeling, either within the header or via some light-weight encapsulation mechanism, in order for the switch to be able tomatch the packet against the correct VPN routing table. If the label isglobal, the issue of operational integrity is a relevant concern, whereas ifthe label is local, the concept of label switching and maintenance ofedge-to-edge label switching contexts is also a requirement.

Among the scaling concerns are issues regarding the number of sup-ported VPNs in relation to the computational requirements, and stabilityof the routing system within each VPN (that is, instability in one VPNaffecting the performance of other VPNs served by the same device). Theaggregate scaling demands of this model are also significant. Given achange in the underlying physical or link-layer topology, the consequent


continued


1 2

requirement to process the routing update on a per-VPN basis becomesa significant challenge. Use of distance vector protocols to manage therouting tables would cause a corresponding sudden surge in traffic load,and the surge grows in direct proportion to the number of supportedVPNs. The use of link-state routing protocols would require the conse-quent link-state calculation to be repeated for each VPN, causing therouter to be limited by available CPU capacity.

Multiprotocol Label Switching

One method of addressing these scaling issues is to use VPN labelswithin a single routing environment, in the same way that packet labelsare necessary to activate the correct per-VPN routing table. The use oflocal label switching effectively recreates the architecture of a Multi-protocol Label Switching VPN. It is perhaps no surprise that whenpresented with two basic approaches to the architecture of the VPN—the use of network-layer routing structures and per-packet switching,and the use of link-layer circuits and per-flow switching—the industrywould devise a hybrid architecture that attempts to combine aspects ofthese two approaches. This hybrid architecture is referred to as

Multi-protocol Label Switching

(MPLS)

[7, 8]

.

The architectural concepts used by MPLS are generic enough to allow itto operate as a peer VPN model for switching technology for a varietyof link-layer technologies, and in heterogeneous Layer 2 transmissionand switching environments. MPLS requires protocol-based routingfunctionality in the intermediate devices, and operates by making theinterswitch transport infrastructure visible to the routing. In the case ofIP over ATM, each ATM bearer link becomes visible as an IP link, andthe ATM switches are augmented with IP routing functionality. IP rout-ing is used to select a transit path across the network, and these transitpaths are marked with a sequence of labels that can be thought of aslocally defined forwarding path indicators. MPLS itself is performedusing a label swapping forwarding structure. Packets entering the MPLSenvironment are assigned a local label and an outbound interface basedon a local forwarding decision. The local label is attached to the packetvia a lightweight encapsulation mechanism. At the next MPLS switch,the forwarding decision is based on the incoming label value, where theincoming label determines the next hop interface and next hop label,using a local forwarding table indexed by label. This lookup table isgenerated by a combination of the locally used IP routing protocol,together with a label distribution protocol, which creates end-to-endtransit paths through the network for each IP destination. It is not ourintention to discuss the MPLS architecture in detail, apart from notingthat each MPLS switch uses a label-indexed forwarding table, where theattached label of an incoming packet determines the next-hop interfaceand the corresponding outgoing label.


1 3

The major observation here is that this lightweight encapsulation,together with the associated notion of boundary-determined transitpaths, provides many of the necessary mechanisms for the support ofVPN structures

[9]

. MPLS VPNs have not one, but three key ingredients:(1) constrained distribution of routing information as a way to formVPNs and control inter-VPN connectivity; (2) the use of VPN-IDs, andspecifically the concatenation of VPN-IDs with IP addresses to turn(potentially) nonunique addresses into unique ones; and (3) the use oflabel switching (MPLS) to provide forwarding along the routesconstructed via (1) and (2). The generic architecture of deployment isthat of a label-switched common host network and a collection of VPNenvironments that use label-defined virtual circuits on an edge-to-edgebasis across the MPLS environment. An example is indicated in Figure4, which shows how MPLS virtual circuits are constructed.

Figure 4:MPLS “Tunnels,”

or VPNs

Numerous approaches are possible to support VPNs within an MPLSenvironment. In the base MPLS architecture, the label applied to apacket on ingress to the MPLS environment effectively determines theselection of the egress router, as the sequence of label switches definesan edge-to-edge virtual path. The extension to the MPLS local labelhop-by-hop architecture is the notion of a per-VPN global identifier (or

Closed User Group

(CUG) identifier, as defined in [5]), which is usedeffectively within an edge-to-edge context. This global identifier couldbe assigned on ingress, and is then used as an index into a per-VPNrouting table to determine the initial switch label. On egress from theMPLS environment, the CUG identifier would be used again as anindex into a per-VPN global identifier table to undertake next-hopselection.

Service ProviderMPLS Network

BNetwork BNetwork A Level 1 MPLS

paths for VPN A

6

71

1

4

A

B

B

B

A

B

A

A

B

A

MPLS Table

1 3 4

L O L

5 3 1

7 4 6

5


continued


1 4

Routing protocols in such an environment need to carry the CUGidentifier to trigger per-VPN routing contexts, and a number of sugges-tions are noted in [5] as to how this could be achieved.

It should be stressed that MPLS itself, as well as the direction of VPNsupport using MPLS environments, is still within the area of activeresearch, development, and subsequent standardization within the IETF,so this approach to VPN support is still somewhat speculative in nature.

Link-Layer Encryption

As mentioned previously, encryption technologies are extremely effec-tive in providing the segmentation and virtualization required for VPNconnectivity, and can be deployed at almost any layer of the protocolstack. Because there are no intrinsically accepted industry standards forlink-layer encryption, all link-layer encryption solutions are generallyvendor specific and require special encryption hardware.

Although this scenario can avoid the complexities of having to dealwith encryption schemes at higher layers of the protocol stack, it canbe economically prohibitive, depending on the solution adopted. Invendor proprietary solutions, multivendor interoperability is certainly agenuine concern.

Transport and Application-Layer VPNs

Although VPNs can certainly be implemented at the transport andapplication layers of the protocol stack, this setup is not very com-mon. The most prevalent method of providing virtualization at theselayers is to use encryption services at either layer; for example,encrypted e-mail transactions, or perhaps authenticated

Domain NameSystem

(DNS) zone transfers between different administrative nameservers, as described in DNSSec (

Domain Name System Security

)

[10]

.

Some interesting, and perhaps extremely significant, work is beingdone in the IETF to define a

Transport Layer Security

(TLS) proto-col

[11]

, which would provide privacy and data integrity between twocommunicating applications. The TLS protocol, when finalized anddeployed, would allow applications to communicate in a fashion thatis designed to prevent eavesdropping, tampering, or message forgery. Itis unknown at this time, however, how long it may be before this workis finalized, or if it will be embraced by the networking community as awhole after the protocol specification is completed.

The significance of a “standard” transport-layer security protocol,however, is that when implemented, it could provide a highly granularmethod for virtualizing communications in TCP/IP networks, thusmaking VPNs a pervasive commodity, and native to all desktop com-puting platforms.


1 5

Non-IP VPNs

Although this article has focused on TCP/IP and VPNs, it is recognizedthat multiprotocol networks may also have requirements for VPNs.Most of the same techniques previously discussed can also be appliedto multiprotocol networks, with a few obvious exceptions—many ofthe techniques described herein are solely and specifically tailored forTCP/IP protocols.

Controlled route leaking is not suitable for a heterogeneous VPN pro-tocol environment, in that it is necessary to support all protocolswithin the common host network. GRE tunnels, on the other hand, areconstructed at the network layer in the TCP/IP protocol stack, butmost routable multiprotocol traffic can be transported across GRE tun-nels (for example, IPX and AppleTalk). Similarly, the VPDNarchitectures of L2TP and PPTP both provide a PPP end-to-end trans-port mechanism that can allow per-VPN protocols to be supported,with the caveat that it is a PPP-supported protocol in the first place.

The reverse of heterogeneous VPN protocol support is also a VPNrequirement in some cases, where a single VPN is to be layered above aheterogeneous collection of host networks. The most pervasive methodof constructing VPNs in multiprotocol networks is to rely upon applica-tion-layer encryption, and the resulting VPNs are generally vendorproprietary, although some would contend that one of the most perva-sive examples of this approach was the mainstay of the emergentInternet in the 1970s and 1980s—that of the UNIX-to-UNIX Copy Pro-gram (UUCP) network, which was (and remains) an open technology.

Quality-of-Service Considerations

In addition to creating a segregated address environment to allow pri-vate communications, the expectation that the VPN environment willbe in a position to support a set of service levels also exists. Such per-VPN service levels may be specified either in terms of a defined servicelevel that the VPN can rely upon at all times, or in terms of a level ofdifferentiation that the VPN can draw upon the common platformresource with some level of priority of resource allocation.

Using dedicated leased circuits, a private network can establish fixedresource levels available to it under all conditions. Using a sharedswitched infrastructure, such as Frame Relay virtual circuits or ATMvirtual connections, a quantified service level can be provided to theVPN through the characteristics of the virtual circuits used to imple-ment the VPN.

When the VPN is moved away from such a circuit-based switchingenvironment to that of a general Internet platform, is it possible for theInternet Service Provider to offer the VPN a comparable service levelthat attempts to quantify (and possibly guarantee) the level of resourcesthat the VPN can draw upon from the underlying host Internet?


continued


1 6

This area is evolving rapidly, and much of it remains within the realmof speculation rather than a more concrete discussion about the rela-tive merits of various Internet QoS mechanisms. Efforts within the

Integrated Services Working Group

of the IETF have resulted in a setof specifications for the support of guaranteed and controlled load end-to-end traffic profiles using a mechanism that loads per-flow state intothe switching elements of the network

[12, 13]

. There are numerous cave-ats regarding the use of these mechanisms, in particular relating to theability to support the number of flows that will be encountered on thepublic Internet

[14]

. Such caveats tend to suggest that these mechanismswill not be the ones that are ultimately adopted to support service lev-els for VPNs in very large networking environments.

If the scale of the public Internet environment does not readily supportthe imposition of per-flow state to support guarantees of service levelsfor VPN traffic flows, the alternative query is whether this environ-ment could support a more relaxed specification of a differentiatedservice level for overlay VPN traffic. Here, the story appears to offermore potential, given that differentiated service support does not neces-sarily imply the requirement for per-flow state, so stateless servicedifferentiation mechanisms can be deployed that offer greater levels ofsupport for scaling the differentiated service

[15]

. However, the precisenature of these differentiated service mechanisms, and their capabilityto be translated to specific service levels to support overlay VPN trafficflows, still remain in the area of future activity and research.

Conclusions

So what is a virtual private network? As we have discussed, a VPN cantake several forms. A VPN can be between two end systems, or it canbe between two or more networks. A VPN can be built using tunnelsor encryption (at essentially any layer of the protocol stack), or both,or alternatively constructed using MPLS or one of the “virtual router”methods. A VPN can consist of networks connected to a service pro-vider’s network by leased lines, Frame Relay, or ATM, or a VPN canconsist of dialup subscribers connecting to centralized services or otherdialup subscribers.

The pertinent conclusion here is that although a VPN can take manyforms, a VPN is built to solve some basic common problems, whichcan be listed as virtualization of services and segregation of communi-cations to a closed community of interest, while simultaneouslyexploiting the financial opportunity of economies of scale of the under-lying common host communications system.

To borrow a popular networking axiom, “When all you have is a ham-mer, everything looks like a nail.” Every organization has its ownproblem that it must solve, and each of the tools mentioned in this arti-cle can be used to construct a certain type of VPN to address aparticular set of functional objectives. More than a single “hammer” is


1 7

available to address these problems, and network engineers should becognizant of the fact that VPNs are an area in which many people usethe term generically—there is a broad problem set with equally as manypossible solutions. Each solution has numerous strengths and alsonumerous weaknesses and vulnerabilities. No single mechanism forVPNs that will supplant all others in the months and years to comeexists, but instead a diversity of technology choices in this area of VPNsupport will continue to emerge.

Acknowledgments

Thanks to Yakov Rekhter, Eric Rosen, and W. Mark Townsley, all ofCisco Systems, for their input and constructive criticism.

References

[1] Valencia, A., M. Littlewood, and T. Kolar. “Layer Two Forwarding(Protocol) ‘L2F’.”

draft-valencia-l2f-00.txt

, work in progress,October 1997.

[2] Droms, R. “Dynamic Host Configuration Protocol.” RFC 2131, March1997.

[3] Kent, S., and R. Atkinson. “Security Architecture for the InternetProtocol.” draft-ietf-ipsec-arch-sec-04.txt, work inprogress, March 1998.

[4] Additional information on IPSec can be found on the IETF IPSec homepage, located at http://www.ietf.org/html.charters/ipsec-charter.html

[5] Heinanen, J. “Multiprotocol Encapsulation over ATM Adaptation Layer5.” RFC 1483, July 1993.

[6] The ATM Forum. “Multi-Protocol Over ATM Specification v1.0.” af-mpoa-0087.000, July 1997.

[7] Callon, R., P. Doolan, N. Feldman, A. Fredette, G. Swallow, and A.Viswanathan. “A Framework for Multiprotocol Label Switching.”draft-ietf-mpls-framework-02.txt, work in progress, Nov-ember 1997.

[8] Rosen, E., A. Viswanathan, and R. Callon. “A Proposed Architecture forMPLS.” draft-ietf-mpls-arch-01.txt, work in progress, March1998.

[9] Heinanen, J. and E. Rosen. “VPN Support for MPLS.” draft-heinanen-mpls-vpn-01.txt, work in progress, March 1998.

[10] Eastlake, D. and C. Kaufman. “Domain Name System SecurityExtensions.” RFC 2065, January 1997. For further informationregarding DNSSec, see: http://www.ietf.org/html.charters/dnssec-charter.html

What Is a VPN? — Part II: continued

T h e I n t e r n e t P r o t o c o l J o u r n a l1 8

[11] Dierks, T. and C. Allen. “The TLS Protocol—Version 1.0.” draft-ietf-tls-protocol-05.txt, work in progress, November 1997.For more information on the IETF TLS working group, see http://www.ietf.org/html.charters/tls-charter.html. See also thearticle on SSL in the Internet Protocol Journal, Volume 1, No. 1, June1998.

[12] Wroclawski, J. “Specification of the Controlled-Load Network ElementService.” RFC 2211, September 1997.

[13] Shenker, S., C. Partridge, and R. Guerin. “Specification of GuaranteedQuality of Service.” RFC 2212, September 1997.

[14] Mankin, A., F. Baker, S. Bradner, M. O’Dell, A. Romanow, A. Weinrib,and L. Zhang. “Resource ReSerVation Protocol (RSVP) Version 1—Applicability Statement, Some Guidelines on Deployment.” RFC 2208,September 1997.

[15] “Differentiated Services Operational Model and Definitions.” draft-nichols-dsopdef-00.txt, work in progress, K. Nichols and S.Blake (editors), February 1998.

PAUL FERGUSON is a consulting engineer at Cisco Systems and an activeparticipant in the Internet Engineering Task Force (IETF). His principal areas ofexpertise include large-scale network architecture and design, global routing, Qualityof Service (QoS) issues, and Internet Service Providers. Prior to his current position atCisco Systems, he worked in network engineering, analytical, and consultingcapacities for Sprint, Computer Sciences Corporation (CSC), and NASA. He iscoauthor of Quality of Service: Delivering QoS on the Internet and in CorporateNetworks, published by John Wiley & Sons, ISBN 0-471-24358-2, a collaborationwith Geoff Huston. E-mail: [email protected]

GEOFF HUSTON holds a B.Sc and a M.Sc from the Australian National University.He has been closely involved with the development of the Internet for the pastdecade, particularly within Australia, where he was responsible for the the initialbuild of the Internet within the Australian academic and research sector. Huston iscurrently the Chief Technologist in the Internet area for Telstra. He is also an activemember of the IETF, and was an inaugural member of the Internet Society Board ofTrustees. He is coauthor of Quality of Service: Delivering QoS on the Internet and inCorporate Networks, published by John Wiley & Sons, ISBN 0-471-24358-2, acollaboration with Paul Ferguson. E-mail: [email protected]


Reliable Multicast Protocols and Applications by C. Kenneth Miller, StarBurst Communications

ulticast IP network services offer new opportunities toprovide value-added applications that involve many-to-many transmission such as conferencing or network

gaming, or one-to-many transmission such as multimedia events,tickertape feeds, and file transfer, where the many could be thousandsor even conceivably millions. Multicast IP services use a different kindof IP address, called Class D. In contrast to individual host addresses(Classes A–C), which include a host and a network component andusually are semipermanent, Class D multicast addresses may by designbe used only for a particular session, or can be semipermanent, asmulticast groups may be set up and torn down relatively quickly, onthe order of seconds. The IP address structure is shown in Figure 1.

Figure 1:IP Address Types

Hosts join groups at the receiver’s initiation using the Internet GroupManagement Protocol (IGMP). When a host joins a group, it notifiesthe nearest multicast subnet router of its presence in the group, asshown in Figure 2. First defined in RFC 1112[1], IGMPv1 is still theversion of IGMP most widely supported. IGMPv2 has recently beendocumented as an official RFC (RFC 2236[2]). The main feature thatIGMPv2 brings is reduced latency for leaving groups. In IGMPv1, thedesignated multicast router for the subnet polls for multicast groupmembers; no response between polls indicates that all hosts in aparticular multicast group have left the group, and that the routerscan prune back the multicast routing tree.

Figure 2:IGMPv1 Dialog

M

Class A netid0 hostid

netid1 0 hostid

netid1 1 0 hostid

1 1 1 0 multicast address

Class B

Class C

Class D

0 1 2 3 4 8 16 24 31

Router

IGMP Queryto 224.0.0.1

Host MembershipReport to Group Address

RouterNetwork

Host A

Host B

Host C

Reliable Multicast Protocols and Applications: continued


Network infrastructure devices, for example, routers, need to providea routing protocol to forward multicast packets to group members, ina fashion similar to that performed for unicast routing. Multicast IPpacket forwarding is best effort, just as it is with unicast packetforwarding. However, most unicast applications use TCP as atransport layer to provide guaranteed packet ordering and delivery.Some examples of applications that use TCP are the File TransferProtocol (FTP) for file transfer and the Hypertext Transfer Protocol(HTTP) for Web access.

However, TCP is a unicast (point-to-point) only transport protocol.Thus, all multicast applications must run on top of the User Data-gram Protocol (UDP) or alternatively, interface directly to IP via“raw” sockets and provide their own customized transport layer, asshown in Figure 3. UDP provides only minimal transport-layer ser-vices, error detection, and port multiplexing. Thus, if any errors orpacket loss due to congestion occur, packets are simply lost to theapplication, and they are not recoverable. Thus, all multicast applica-tions must have a specific transport-layer service to support thatparticular application. When that transport layer operates over UDP,it operates in the application layer with the application. When it inter-faces directly to IP using “raw” sockets, the specialized transport layeroperates at the transport layer, but is specialized to the particularapplication that uses it.

It should be noted that TCP supports only data reliability; it is notsuited for transport of multimedia streams, which require consistenttime delivery at the receiver and only need to be semireliable. Thus,multimedia streaming applications need a specialized transport layersuch as the Real-Time Transport Protocol (RTP)[3] for unicast as wellas multicast transmissions.

Figure 3:Specialized Multicast

Transport ProtocolsOperate over UDP

or IP

MulticastApplication

SpecializedTransport

TCP UDPTCP UDP

IP

Link Layer

Physical Layer

SpecializedTransport

MulticastApplication

IP

Link Layer

Physical Layer

4

3

2

1


Many equate multicast with multimedia, thinking that the Internet andprivate intranets will become an alternative entertainment media totelevision by using multicast IP network services and multimediastreaming technology. However, numerous other multicast applicationsrequire reliability rather than timeliness; they are multicast applicationsthat are similar to those unicast applications that operate over TCP,except that delivery is to many recipients rather than just one.

Reliable Multicast Application Categories and Requirements Reliable multicast applications come in three basic categories withdiffering requirements, as shown in Figure 4.

Figure 4:Reliable Multicast

ApplicationCategories

Collaborative applications such as data conferences (whiteboarding)and network-based games are many-to-many applications with modestscaling requirements of less than 100 participants. This kind of applica-tion requires low latency of less than 400 msec so that responses do notcause discomfort to the human participants. Transmission does notalways need strict reliability; for example, refresh of background infor-mation for a network game could wait for the next refresh.

Message streaming applications such as tickertape and news feeds alsooften require low latency. Tickertape feeds to brokerage houses needto be very timely because the information loses value greatly withtime. Time is very much money in this application, and there is also aneed for strict reliability.

Tickertape feeds to consumers are purposely delayed by minutesbecause they are usually transmitted without charge, but they cannotbe so stale as to be viewed as “old” information. This data does nothave a strict reliability requirement because the next trade of aparticular security refreshes the data. News feeds likewise have onlya moderate latency requirement. If the news feeds are sent in acarousel fashion, that is, each news story is repeated, strict reliabilitymay not be needed because it is refreshed in the next transmission ofthe same story.

Bulk data delivery has no specific latency requirement. Often there is adesire to schedule delivery during the night, when there is less networktraffic. At other times, the desire is to receive the data almost

Application Type Latency Req. Reliability Scalability

Collaborative Low Semi/Strict <100

Message Str. Low/Medium Semi/Strict to Millions

Bulk Data Not Real Time Strict to Millions



immediately. However, at all times the entire “file” or piece of dataneeds to be received to be complete. Strict reliability is the rule; forexample, if any bit of a software image is lost, the data is worthless.

Message streaming and bulk data application scaling requirementsspan the gamut from tens to possibly even millions.

Reliable multicast transport protocols, in contrast to multimediastreaming transport protocols, have not yet been standardized.However, numerous reliable multicast protocols exist; some have beenused only for research, while others have been commercialized.

The Reliable Multicast Research Group (RMRG) in the InternetResearch Task Force (IRTF) is now studying reliable multicast. It ischartered to recommend techniques for a working group in theInternet Engineering Task Force (IETF) to create a set of reliablemulticast standards.

Standardization Effort The standardization effort has been started in an IRTF research groupto study the problems and possible solutions by Internet researchers.This effort was first placed in the hands of researchers because theproblems were considered very difficult to solve in the global Internet.Some of the concerns about reliable multicast were discussed in anexpired Internet Draft published in November 1996 by the TransportArea Directors of IETF.

These concerns formed the basis for the work of the RMRG, whichwas formed in early 1997. The concerns from that document follow:

“A particular concern for the IETF (and a dominant concern for theTransport Services Area) is the impact of reliable multicast traffic onother traffic in the Internet in times of congestion (more specifially,the effect of reliable multicast traffic on competing TCP traffic). Thesuccess of the Internet relies on the fact that best-effort trafficresponds to congestion on a link (as currently indicated by packetdrops) by reducing the load presented on that link. Congestioncollapse in today’s Internet is prevented only by the congestioncontrol mechanism in TCP.

There are a number of reasons to be particularly attentive to the con-gestion-related issues raised by reliable multicast proposals. Multicastapplications in general have the potential to do more congestion-related damage to the Internet than do unicast applications. This isbecause a single multicast flow can be distributed along a large, glo-bal multicast tree reaching throughout the entire Internet.


Further, reliable multicast applications have the potential to do morecongestion-related damage than do unreliable multicast applications.First, unreliable multicast applications such as audio and video are, atthe moment, usually accompanied by a person at the receiving end,and people typically unsubscribe from a multicast group if congestionis so heavy that the audio or video stream is unintelligible. Reliablemulticast applications such as group file transfer applications, on theother hand, are likely to be between computers, with no humans inattendance monitoring congestion levels.

In addition, reliable multicast applications do not necessarily havethe natural time limitations typical of current unreliable multicastapplications. For a file transfer application, for example, the datatransfer might continue until all of the data is transferred to all of theintended receivers, resulting in a potentially-unlimited duration foran individual flow. Reliable multicast applications also have tocontend with a potential explosion of control traffic (e.g., ACKs,NAKs, status messages), and with control traffic issues in generalthat may be more complex than for unreliable multicast traffic.

The design of congestion control mechanisms for reliable multicastfor large multicast groups is currently an area of active research. Thechallenge to the IETF is to encourage research and implementationsof reliable multicast, and to enable the needs of applications forreliable multicast to be met as expeditiously as possible, while at thesame time protecting the Internet from the congestion disaster orcollapse that could result from the widespread use of applicationswith inappropriate reliable multicast mechanisms. Because of thesetbacks and costs that could result from the widespread deploymentof reliable multicast with inadequate congestion control, the IETFmust exercise care in the standardization of a reliable multicastprotocol that might see widespread use.”

One of the statements in this document is very specious:

“First, unreliable multicast applications such as audio and video are,at the moment, usually accompanied by a person at the receivingend, and people typically unsubscribe from a multicast group if con-gestion is so heavy that the audio or video stream is unintelligible.Reliable multicast applications such as group file transfer applica-tions, on the other hand, are likely to be between computers, with nohumans in attendance monitoring congestion levels.”

This statement is a very weak argument; it is not reliable to depend ona human to turn off a nonfunctioning event. Do we typically turn offthe television when we leave the house? Or leave the room to dosomething else?



In contrast, some of the reliable multicast protocols such as theMulticast File Transfer Protocol (MFTP) have the sense of a finitesession, and automatically time out and leave a group, even if allgroup members did not receive all the content.

Essentially what is desired is a reliable multicast protocol that behaveslike TCP in that it backs off in the face of congestion approximatelythe same way as TCP and shares the bandwidth with TCP traffic“fairly.” This feature is of prime importance to Internet researcherswho wish to specify protocols that can scale to the global Internet andnot cause harm to the traffic already present.

Two additional significant problems need to be solved: scalability andthe ability to operate with scalability over many different networkinfrastructures.

Scaling Issues and How Current Reliable Multicast Protocols Solve Them Two primary issues are related to scaling, that is, the ability to handlelarge groups. The first and most significant is widely known asacknowledgment/negative acknowledgment (ACK/NAK) implosion.As the number of receivers grows, the amount of back traffic to thesender eventually overwhelms its capacity to handle them.Additionally, the network at the sender site becomes congested fromthe cumulative back traffic from the receivers.

The second issue is one of retransmissions (often referred to as“repairs”). If the packet loss is uncorrelated at the receivers, retrans-missions grow, so the data may need to be sent multiple times tosatisfy all the receivers. Measurements of the Multicast backbone(Mbone) have shown that loss consists of both correlated and uncor-related parts[4]. Satellite networks will also exhibit mostly uncorrelatedloss, unless receivers are geographically close.

Various methods have been used to achieve scaling by reducing theamount of ACK/NAK administrative traffic while still retainingreliability. A straightforward approach is to simply deploy repeaters/aggregators in the network, as shown in Figure 5. This approach isprovided by the Reliable Multicast Transport Protocol (RMTP)[5].RMTP provides for designated receivers (DRs) that collect statusmessages from nodes in a local RMTP domain and provide repairs(retransmissions of missing data), if available. Receivers direct theadministrative messages to the DR by unicast. Thus, the DR providesboth local recovery and consolidation of control traffic to the next DRin the hierarchy if the data requested is not available.


Figure 5:RMTP Designated

Receivers

A second approach is to allow any receiver to provide the repair,biasing the request to the nearest receiver that has the requested data.This approach, called Scalable Reliable Multicast (SRM)[6], dependson the concept of repair by any receiver that has the data to gainscalability in reducing administrative back traffic to the source,putting the onus of responsibility on receivers to ensure that they getmissed data.

Group members in SRM send low-frequency session messages to thegroup so that their neighbors can learn their status, measure the delayamong group members and learn group membership, and detect thelast packet in a burst. Session messages are designed to take onlyabout five percent of the traffic in the session.

Receivers with missing data wait a random time period before issuingrepair requests, allowing suppression of duplicate requests similar tothe mechanism that IGMP uses on its subnet. A similar process occursfor making the actual repairs. The random backoff time for bothrepair requests made by receivers and repairs made by senders is a

RouterSender Receiver DesignatedReceiver

Ack



function of “closeness” to the sender and requesting receiver. Thus,those closest to each other time out first and make the repair requestor the actual repair in an attempt to keep repairs as local as possible.A receiver that sees the first request and determines that it is the samerequest that it would have made simply stays silent, reducing potentialredundant requests. The requester continues to send repair requestsuntil the repair is received.

Any receiver may satisfy the repair request, because all receivers arerequired to cache previously sent data. Any receiver that can satisfythe request is prepared to do so; a random backoff timer is usedbefore a repair is sent, and if it sees the repair being sent by anothergroup member, it stays silent to reduce the probability of sendingduplicate repairs.

SRM was first developed to be the reliable multicast protocol to operatewith the wb whiteboard data conferencing tool developed by LawrenceBerkeley Labs (LBL) researchers, SRM is currently operational over theMbone, the experimental multicast network of the Internet.

A third approach is to have the network infrastructure, that is,routers, help in providing scaling. This approach, called Pretty GoodMulticast (PGM)[7], is a new proposal that was first publicly presentedto the RMRG meeting held in February 1998.

One design goal of the creators of PGM was simplicity and the abilityto optimally leverage routers in the network to provide scalability.PGM is an example of a protocol that bypasses UDP and interfacesdirectly to IP via “raw” sockets, as shown in Figure 6.

Figure 6:PGM Interfaces

Directly to IP

PGM provides no notion of group membership; it simply providesreliability within a source’s transmit window from the time a receiverjoins a group until it departs.

PGM UDP

Applications

IP

Data Link

Physical

TCP


PGM has only a few data packets that are defined:

ODATA: original content data

NAK: selective negative acknowledgment

NCF: NAK confirmation

RDATA: retransmission (repair)

SPM: source path message

Each PGM packet contains a Transport Session Identifier (TSI) toidentify the session and source of that data, so multiple sessions maybe easily identified by PGM-aware routers and receivers. ODATA,NCF, RDATA, and SPM packets flow downstream in the distributiontree, and NAK packets flow upstream toward the source.

PGM is designed for scalability as well as the ability to serve real-timeapplications. Thus there is a need for timeliness. This need is handledby the transmit window, which defines a sliding window of data suchthat if no NAKs are received by the sender or a designated localretransmitter by the time the window is up, the data is simply notavailable for repairs.

PGM is totally NAK based, so the scaling issue is to reduce thenumber of NAKs sent back to the source, while at the same timeprotecting against lost NAKs. Enter here the router assist, as shown inFigure 7.

Figure 7:PGM NAK/NCF

Dialog

NAKs are unicast from PGM-router to PGM-router, initiated by thereceiver that lost data sending a NAK to its nearest PGM-aware router.Each PGM-aware router keeps forwarding NAKs until it sees an NCFor RDATA, which indicates that a repair is being sent. NAK suppres-

PGMReceiver

PGMReceiver

PGMReceiver

PGMSender

Router Subnet

NAK

NAK

NCF

NCF

NAK

NCF

NCF

NAK

NAK

NCF

NCF, ODATA, RDATA

NAK



sion is provided by a receiver’s subnet PGM-aware router, and allPGM-aware routers eliminate duplicate NAKs all the way upstream tothe source.

The unicast path back to the source must be the same path as thedownstream multicast tree. SPMs are sent downstream interleavedwith ODATA packets to establish a source path state for a givensource and session. PGM-aware routers use this information todetermine the unicast path back to the source for forwarding NAKs.SPMs also alert receivers that the oldest data in the transmit windowis about to be retired from the window and will thus no longer beavailable for repairs from the source. SPMs are sent by a source at arate that is at least the rate at which the transmit window is advanced.This rate provokes “last call” NAKs from receivers and updates thereceive window state at receivers.

PGM-aware routers also keep state on where the NAKs come fromin the distribution tree so that they may constrain the forwarding ofRDATA repairs to only those ports from which NAKs requestingthat repair were received. This scenario eliminates the transmissionof repair data to parts of the distribution tree where the repair isnot needed.

The PGM feature can also optionally redirect NAKs to a designatedlocal retransmitter (DLR) rather than the source. A DLR announces itspresence to provoke the redirection of NAKs for that session and source.

A fourth approach is to not have a low-latency requirement (that is,only serve “bulk data” delivery applications) and use this feature toadvantage to gain scalability. MFTP was first published as an InternetDraft in February 1997, and an update was submitted in April 1998[8].

MFTP also has a provision for sender-based group creation, withdifferent group models, and the group setup protocol to notify receiversto join the group. Group creation is discussed later in this article.

The basic MFTP protocol breaks the data entity to be sent intomaximum size “blocks,” where a block by default consists ofthousands or tens of thousands of packets, depending on packet sizeused. This setup is shown in Figure 8.

Figure 8:MFTP Blocks

Sender Receiver

File

Block 1Block 2Block 3

NAK 1NAK 2

• packet 54• packet 150

• packet 101• packet 290


MFTP is a “NAK-only” protocol; that is, if data is received correctlyin a block, nothing is sent back to the sender. If one or more packetsare in error or missing in a block, receivers respond with a NAK thatconsists of a bit map of the bad packets in the block. It is thus aselective reject mechanism. In this respect, MFTP is similar to RMTP;the main difference is that MFTP explicitly attempts to make theblock as large as possible for scaling purposes.

NAKs are normally sent unicast back to the source, unless aggrega-tion to improve scaling using enabled network routers is used. In thiscase, the NAKs are sent multicast to a special administrative trafficgroup address.

MFTP does not repair after each block, however; it takes advantage ofthe non-real time nature of the application for benefit. The data entity,such as a file, is sent initially in its entirety in a first pass. The sendercollects the NAK packets for a block from all the receivers. One NAKpacket from a receiver can represent thousands or even tens ofthousands of bad packets, reducing NAK implosion by orders ofmagnitudes. The collection of NAKs received by the sender from allthe receivers is logically OR-ed together to represent the collectiveneed for repairs for the receiving group. These repairs are sent by thesender in a second pass to the group. If certain receivers already havethe repair, it is simply ignored. This scenario is repeated, if necessary,until all repairs are received by all receivers or until a configurabletimeout occurs.

Thus, packet ordering services are not provided, and holes in thedata caused by dropped packets or packets in error are filled in asthey are received.

The sender is rate based; in other words, it transmits at a data rate setby the operator to be less than or equal to what the network canhandle. The protocol is thus very efficient with high-latency networkssuch as satellites, and it is impervious to network asymmetry. It alsoattempts to be as scalable as possible on one-hop networks such assatellite networks, and it provides for extensions so that networkelements may aggregate downstream responses to increase scalabilityfurther, depending on the network configuration.

This aggregation capability is shown in Figure 9. The networkelement, which can be a router, collects MFTP administrative backtraffic routers are members. These routers aggregate back traffic fromall nodes downstream in the multicast tree from the source, includingregistrations, NAKs, and dones. Registration and done messages areused by MFTP’s group setup protocol, and they are described later inthis article.



Depending on the network configuration, this aggregation capabilitycan further improve the scalability of MFTP by orders of magnitudes.

Figure 9:Routers as Network

Aggregators

The upper limit to scalability with no network aggregation ofadministrative traffic is in the tens of thousands of receivers. Forexample, for a Maximum Transmission Unit (MTU) of 1500 bytes(the Ethernet maximum), the default block size is over 11,000packets. If the number of receivers is 10,000 and each receiver has atleast one bad packet per block, then there will be a total of 10,000NAK packets coming back to the sender from the group about thatblock, approximately the same number of packets as were sent in theforward direction in that block. MFTP provides for a NAK backofftimer to spread the NAKs out in time to the sender to avoid bursts. Ifthe bandwidth is symmetric at the sender, the sender should be able tohandle this maximum NAK. In many situations, the amount of backtraffic could exceed forward traffic.

MFTP also has provision for a crude congestion control mechanism.The sender at the beginning of a session sends announce messages.These messages are used for many functions, including the setting upof groups. Additionally, it conveys a packet loss parameter to allreceivers. This packet loss threshold parameter may be used byreceivers to leave the group if the packet loss exceeds the threshold.Leaving the group prunes the distribution tree, relieving thecongestion in that section of the tree.

Router ReceiverSender (Grey routersprovide aggregation)


Commercial Usage The reliable multicast protocols previously discussed are the mostprominent ones on the market today. RMTP has been deployed in itsmessage streaming version for a billing record distribution applicationwithin a very large telecommunications carrier, but it has hadgenerally limited deployment. It also does not scale over satellitenetworks, where most of the early multicast deployments reside.

SRM has been used by the research community only over the Mbone,and it is still being refined. Another problem with SRM is that in its cur-rent incarnation, it supports neither asymmetric nor satellite networks.Some early Internet Service Provider (ISP) multicast implementations,offer multicast support in only one direction; SRM requires total multi-cast support.

PGM is new and offers promise, but there is no deployment yet, and itlikely will not occur until early 1999. PGM also requires router sup-port in a terrestrial land-line network to gain scaling.

MFTP has the limitation that it supports only bulk transfer applica-tions. However, one trade-off is that it can support all networkinfrastructures, including satellite infrastructures with scaling. MFTPhas also been available commercially in products with the longestapplication support, dating back to 1995. Thus, MFTP-based prod-ucts have the largest installed base of any reliable multicast-basedproduct being used over WANs. The largest commercial installationof over 8,500 remote sites in the group is the General Motors[9] dealernetwork. Several other commercial installations of MFTP-based appli-cations number over 1,000 group members.

Advanced Research Topics Discussed in Reliable Multicast Research Group A promising technique to reduce the amount of repair data that needsto be retransmitted is called erasure correction. This technique cansignificantly reduce the amount of repairs that need to be resent if thepacket loss is largely uncorrelated at the receivers. It uses a forwarderror correction (FEC) code to generate parity packets to be used forrepairs only. This setup provides benefit if errors at receivers areuncorrelated. For example, suppose 16 receivers each have one miss-ing packet, but they are all different. Rather than send all 16 originaldata packets, one FEC packet could be sent that could correct the onemissing packet at all 16 receivers, requiring retransmission of only onepacket rather than 16.

If the loss is correlated, then many of the receivers lose the same data,and erasure correction is of no benefit. However, there is also no pen-alty, except for the need for computing power at both the sender andthe receivers to perform the FEC correction calculations. Simulationshave show[10] that there is a greater than 2:1 reduction in the numberof repairs needed to be sent with our example of 10,000 receivers.This benefit will be even larger when group sizes become larger thantens of thousands.



Perhaps a more significant application for FEC is a congestion controltechnique known as layering[11,12]. With layering, numerous groups areset up by the sender, all with different rates. Receivers that can receiveat the highest rate join all the “layer” groups. Those receivers thatcannot receive at the highest rate simply leave “layers” until congestionis relieved, and they take longer to receive the data. For this to workwithout sending data redundantly, the number of parity packetscreated must be very large compared to the number of data packets.

There are some further issues that have been pointed out by theresearchers with the Other issues with the layering approaches havebeen pointed out by the researchers, however. For layering to beeffective, the routing tree should be identical for the different groups;otherwise congestion will not be relieved on a part of the tree. Thismay not always be the case, especially in sparse mode routingprotocols, where selection of the rendezvous point or core is based ongroup address.

Even if the same distribution tree is used for the different layers, it hasbeen pointed out[12] that leaves of hosts downstream from a congestedlink should be coordinated; otherwise the action of less than all ofthem has no effect on congestion. Additionally, a receiver could causecongestion by adding a layer that another receiver could interpret ascongestion, causing it to drop a layer with no effect.

Thus, layering using FEC techniques is an interesting technique thatshows promise for use in congestion control. However, there areissues associated with this type of layering that researchers still needto address.

Another technique that has been proposed for congestion control isbulk feedback to the sender[13]. If the sender receives an excessivenumber of NAKs from receivers, it drops the sender’s transmissionrate with an algorithm that attempts to emulate the behavior of TCP.This approach is an obvious one because it is an extension of theprocess in which TCP falls back in the face of congestion.

This approach, however, has two basic problems. The first is that thereis delay, because the sender needs to get feedback from the multitudeof receivers before it acts. This delay can be considerably longer thanin the case of TCP, which needs feedback from only one receiver.

The second flaw is that one errant receiver can effectively penalize thewhole group, because the sender reduces the rate to the total group.

This approach is not viewed as a viable solution for these reasons. Infact, the general consensus is that congestion control decision makingwill be required at the multiple receivers rather than at the sender forboth scaling and timeliness reasons.


Another idea that is now receiving intense study by researchers is thatof “subcasting”[14,15,16]. The key idea in subcasting is to optimize localrepair to be a retransmitter that may be just above a link congestionpoint, as shown in Figure 10. The problem is to gain knowledge of thenetwork topology so as to locate a receiving host that is willing toretransmit and that has the repair data.

Then the repairs need to be contained within only the region of thenetwork that lost the original transmission, that is, the “subcast” region.

Figure 10:Optimized Local

Repair

One proposal is to ask for assistance from the network routers. Theyknow the topology and could be used to find the closest willingretransmitter that has the repair. The router could also direct therepair to only the affected region: the subcast.

This technique can be viewed as an extension of concepts originallyproposed in SRM to provide local recovery. It assumes that most lossis caused by congested links, and that uncorrelated loss is caused by aseries of mildly congested links with few group members. This modelis probably the right one for many land-line routed networks; it isproblematical with other network infrastructures.

Nevertheless, it is an interesting proposal that merits further researcheffort. Local repair is destined to be an important tool to meet thegoal of improved scalability with minimal traffic overhead.

Group Creation and Destruction The process of joining a group and leaving a group in IP multicast isleft to a potential group member that uses IGMP to notify the nearestmulticast router of its membership state. However, mechanisms needto be in place to allow potential members of a group to gain theinformation needed to decide to join the group.

CongestedLink

RepairRequest

SourceSource

Retransmitter

”Turning Point“

SubcastRegion

”Subcast“ Repair



There are two basic ways to accomplish this scenario for one-to-many sessions. The first and most common is the “broadcast TV”model. The Multiparty Multimedia Session Control (MMUSIC)working group of the IETF has developed some protocols that can beused to advertise content. The Session Announcement Protocol(SAP)[17] provides the mechanism to send a stream on a “well-known” multicast address to announce content to any potential lis-teners who may be interested. It uses the Session DescriptionProtocol (SDP)[18] to describe the contents that are announced. Thesetwo protocols together have been used to create a session directorytool that is available on the Mbone. This setup creates essentially theequivalent of a “preview channel” such as is often available on cabletelevision systems.

SDP is also used to post content on Web sites, which advertise thatcontent to anyone who wishes to receive it.

Although these protocols were originally developed primarily to adver-tise multimedia streaming applications, they are also applicable fordata. They provide a useful tool for “push” vendors to advertise multi-cast “channels” based on content that any consumer can “tune in” to.

Internet researchers describe this model as providing “looselycoupled” sessions, because the sender does not know who is listening,much like radio or TV broadcasters do not know who tunes in totheir stations.

MFTP also includes a group setup protocol. The “closed group”option in MFTP provides a mechanism to create a “tightly coupled”session that is very useful to organizations that wish to deliver criticalinformation from a central site to many remote branch offices. Theclosed group provides a means for the sender to define a group listcentrally and direct those members so defined to join the group. Thisscenario is somewhat similar to e-mail, except more robust.

These instructions are sent in an “announce” message on a specialmulticast group address that the superset of possible candidatereceivers always listens to. Hosts so directed to join the group notifytheir designated multicast router of their membership directed to jointhe group notify their designated multicast router of their membershipusing IGMP and “register” back to the sender of their presence. Thus,the sender knows group membership before transmission commences,and the sender can then also positively confirm delivery.


This approach has proven very desirable for organizations that havemany branches where information is desired to be sent at thediscretion and time determined by the and desire to send informationat the discretion and time determined by the sender, and usually theinformation is delivered to a branch office server. Several deploymentsof applications that use MFTP and the closed group model with groupmembers approaching 10,000 exist.

The MMUSIC group has also created the Session Invitation Protocol(SIP)[19], which is used to invite members to a conference of some sort,including possibly a data conference. This protocol is appropriate foruse with whiteboard applications, for example.

Summary and Conclusions Although multicast has often been viewed as synonymous withmultimedia, there is a wide spectrum of reliable multicast applicationsthat involve the transfer of data to multiple group members. Becausethis wide spectrum of applications has many different requirements, asshown in Figure 4, no one reliable multicast protocol can handle allapplications and network infrastructures. The result is that numerousreliable multicast protocols are likely to become standardized, andtoday numerous reliable multicast protocols are either in commercialproducts/toolkits or due to be available soon.

The reliable multicast standardization effort now resides in the IRTF,because Internet researchers are concerned about congestion controland fairness to TCP for any protocols that might become standardizedfor general Internet use. This problem is difficult to solve, given thedisparate requirements placed on protocols by the wide variety ofapplications and different network infrastructures.

Nevertheless, a significant number of reliable multicast-based productdeployments have already occurred over private networks. These havebeen shown to save organizations much money and to help create newbusiness opportunities for them.

Stay tuned; reliable multicast-based applications are ready to be main-streamed. Together with multimedia multicast applications, multicastapplications of all forms will become common soon, first in privateintranets and extranets and then in the Internet as a whole.



References [1] Deering, S. “Host Extensions for IP Multicasting.” RFC 1112, August

1989.

[2] Fenner, W. “Internet Group Management Protocol, Version 2.” RFC2236, November 1997.

[3] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V. “RTP: ATransport Protocol for Real-Time Applications.” RFC 1889, January1996.

[4] Handley, M. “An Examination of Mbone Performance.” ISI Report,January 10, 1997.

[5] Paul, S., Sabnani, K. K., Lin, J. C., and Bhattacharyya, S. “ReliableMulticast Transport Protocol (RMTP).” IEEE Journal on Selected Areasin Communications, April 1997.

[6] Floyd, S., Jacobson, V., Liu, C., McCanne, S., and Zhang, L. “A ReliableMulticast Framework for Light-weight Sessions and Application LevelFraming.” ACM Transactions on Networking, November 1996.

[7] Farinacci, D., Lin, A., Speakman, T., and Tweedly, A. “PGM ReliableTransport Protocol Specification.” Work in progress, Internet Draft,draft-speakman-pgm-spec-01.txt, January 29, 1998.

[8] Miller, K., Robertson, K., Tweedly, A., and White, M. “StarBurstMulticast File Transfer Protocol (MFTP) Specification.” Work in progress,Internet Draft, draft-miller-mftp-spec-03.txt, April 1998.

[9] Miller, K. “Reliable Multicast Protocols: A Practical View.” 22ndConference on Local Computer Networks, November 1997.

[10] Kasera, S. K., Kurose, J., Towsley, D., “Scalable Reliable MulticastUsing Multiple Multicast Groups.” CMPSCI Technical Report TR 96-73, October 1996.

[11] Nonnenmacher, J., and Biersack, E. W. “Asynchronous Multicast Push:AMP.” Proceedings of ICCC ’97 International Conference on ComputerCommunications, Cannes, France, November 1997.

[12] Crowcroft, J., Rizzo, L., and Vicisano, L. “TCP-Like CongestionControl for Layered Multicast Data Transfer.” Submitted to INFOCOM’98, August 1997.

[13] Sano, T., Yamanouchi, N., et al. “Flow and Congestion Control for BulkReliable Multicast Protocols—toward coexistence with TCP.” Submittedto INFOCOM ’98, presented at RMRG meeting in Cannes, France,September 1997.

[14] Hofmann, M. “Enabling Group Communication in Global Networks.”Proceedings of Global Networking ’97, June 1997.

[15] Papadopoulos, C., Parulkar, G., and Varghese, G. “An Error ControlScheme for Large-Scale Multicast Applications.” Submitted toINFOCOM ’98, presented at RMRG meeting in Cannes, France,September 1997.


[16] Levine, B. N., Paul, S., and Garcia-Luna-Aceves, J. J. “DeterministicOrganization of Multicast Receivers Based on Packet Loss Correlation.”Presented at RMRG meeting in Orlando, Fla., February 1998, submittedfor publication.

[17] Handley, M. “SAP: Session Announcement Protocol.” Work in progress,Internet Draft, draft-ietf-mmusic-sap-00.txt, November 1996.

[18] Handley, M., and Jacobson, V. “SDP: Session Description Protocol.”Work in progress, Internet Draft, draft-ietf-mmusic-sdp-07.txt,April 1998.

[19] Handley, M., Schulzrinne, H., and Schooler, E. “SIP: Session InvitationProtocol.” Work in progress, Internet Draft, draft-ietf-mmusic-sip-04.txt, November 1997.

(This article is based in part on material in the book MulticastNetworking and Applications written by C. Kenneth Miller to bepublished by Addison Wesley Longman, Inc. in 1998.ISBN 0-201-30979-3. Used with permission.)

C. KENNETH MILLER is the founder, Chairman, and Chief Technology Officer ofStarBurst Communications. StarBurst Communications provides reliable multicastsolutions for commercial applications with such corporate customers as GM, Ford,Chrysler, Toys ‘R Us, Thomson Financial, and many others. Miller has been in thedata communications industry since 1972. He founded Concord Data Systems in late1980 and served as its President and CEO until 1986. Concord Data Systems pro-duced high-speed dial modems. He was the author of the IEEE 802.4 LAN standard,which became the lower layer for the Manufacturing Automation Protocol (MAP)factory LAN standard. Miller was a regular columnist in Data CommunicationsMagazine from 1992 to 1994. He has also published numerous articles and partici-pated in many panels at trade show and other industry events. He is now writing abook entitled Multicast Networking and Applications, to be published in 1998 byAddison-Wesley. Miller received a BEE degree from Rensselaer Polytechnic Instituteand a MSEE degree from the University of Pennsylvania, specializing in communica-tions. Miller can be reached at [email protected]


Layer 2 and Layer 3 Switch Evolution by Thayumanavan Sridhar, Future Communications Software

ayer 2 switches are frequently installed in the enterprise forhigh-speed connectivity between end stations at the data linklayer. Layer 3 switches are a relatively new phenomenon, made

popular by (among others) the trade press. This article details some ofthe issues in the evolution of Layer 2 and Layer 3 switches. Wehypothesize that that the technology is evolutionary and has its originsin earlier products.

Layer 2 Switches Bridging technology has been around since the 1980s (and maybeeven earlier). Bridging involves segmentation of local-area networks(LANs) at the Layer 2 level. A multiport bridge typically learnsabout the Media Access Control (MAC) addresses on each of itsports and transparently passes MAC frames destined to those ports.These bridges also ensure that frames destined for MAC addressesthat lie on the same port as the originating station are not forwardedto the other ports. For the sake of this discussion, we consider onlyEthernet LANs.

Layer 2 switches effectively provide the same functionality. They aresimilar to multiport bridges in that they learn and forward frames oneach port. The major difference is the involvement of hardware thatensures that multiple switching paths inside the switch can be active atthe same time. For example, consider Figure 1, which details a four-port switch with stations A on port 1, B on port 2, C on port 3 and Don port 4. Assume that A desires to communicate with B, and Cdesires to communicate with D. In a single CPU bridge, thisforwarding would typically be done in software, where the CPUwould pick up frames from each of the ports sequentially and forwardthem to appropriate output ports. This process is highly inefficient in ascenario like the one indicated previously, where the traffic between Aand B has no relation to the traffic between C and D.

Figure 1:Layer 2 switch with External Router

for Inter-VLAN traffic and connectingto the Internet

L

1 4

2 3Station A Station D

Station CStation B

Layer 2Switch

Internet

Router


Enter hardware-based Layer 2 switching. Layer 2 switches with theirhardware support are able to forward such frames in parallel so that Aand B and C and D can have simultaneous conversations. The parallel-ism has many advantages. Assume that A and B are NetBIOS stations,while C and D are Internet Protocol (IP) stations. There may be no rea-son for the communication between A and C and A and D. Layer 2switching allows this coexistence without sacrificing efficiency.

Virtual LANs In reality, however, LANs are rarely so clean. Assume a situationwhere A,B,C, and D are all IP stations. A and B belong to the same IPsubnet, while C and D belong to a different subnet. Layer 2 switchingis fine, as long as only A and B or C and D communicate. If A and C,which are on two different IP subnets, need to communicate, Layer 2switching is inadequate—the communication requires an IP router. Acorollary of this is that A and B and C and D belong to differentbroadcast domains—that is, A and B should not “see” the MAC layerbroadcasts from C and D, and vice versa. However, a Layer 2 switchcannot distinguish between these broadcasts—bridging technologyinvolves forwarding broadcasts to all other ports, and it cannot tellwhen a broadcast is restricted to the same IP subnet.

Virtual LANs (VLANs) apply in this situation. In short, Layer 2VLANs are Layer 2 broadcast domains. MAC broadcasts arerestricted to the VLANs that stations are configured into. How canthe Layer 2 switch make this distinction? By configuration. VLANsinvolve configuration of ports or MAC addresses. Port-based VLANsindicate that all frames that originate from a port belong to the sameVLAN, while MAC address-based VLANs use MAC addresses todetermine VLAN membership. In Figure 1, ports 1 and 2 belong tothe same VLAN, while ports 3 and 4 belong to a different VLAN.Note that there is an implicit relationship between the VLANs and theIP subnets—however, configuration of Layer 2 VLANs does notinvolve specifying Layer 3 parameters.

We indicated earlier that stations on two different VLANs can com-municate only via a router. The router is typically connected to one ofthe switch ports (Figure 1). This router is sometimes referred to as aone-armed router since it receives and forwards traffic on to the sameport. In reality, of course, such routers connect to other switches or towide-area networks (WANs). Some Layer 2 switches provide thisLayer 3 routing functionality within the same box to avoid an exter-nal router and to free another switch port. This scenario is reminiscentof the large multiprotocol routers of the early ’90s, which offeredrouting and bridging functions.

A popular classification of Layer 2 switches is “cut-through” versus“store-and-forward.” Cut-through switches make the forwardingdecision as the frame is being received by just looking at the header ofthe frame. Store-and-forward switches receive the entire Layer 2 frame

Layer 2 and Layer 3 Switch Evolution: continued


before making the forwarding decision. Hybrid adaptable switcheswhich adapt from cut-through to store-and-forward based on theerror rate in the MAC frames are very popular.

Characteristics Layer 2 switches themselves act as IP end nodes for Simple NetworkManagement Protocol (SNMP) management, Telnet, and Web basedmanagement. Such management functionality involves the presence ofan IP stack on the router along with User Datagram Protocol (UDP),Transmission Control Protocol (TCP), Telnet, and SNMP functions.The switches themselves have a MAC address so that they can beaddressed as a Layer 2 end node while also providing transparentswitch functions. Layer 2 switching does not, in general, involvechanging the MAC frame. However, there are situations whenswitches change the MAC frame. The IEEE 802.1Q Committee isworking on a VLAN standard that involves “tagging” a MAC framewith the VLAN it belongs to; this tagging process involves changingthe MAC frame. Bridging technology also involves the Spanning-TreeProtocol. This is required in a multibridge network to avoid loops.The same principles also apply towards Layer 2 switches, and mostcommercial Layer 2 switches support the Spanning-Tree Protocol.

The previous discussion provides an outline of Layer 2 switching func-tions. Layer 2 switching is MAC frame based, does not involve alteringthe MAC frame, in general, and provides transparent switching in par-allel with MAC frames. Since these switches operate at Layer 2, theyare protocol independent. However, Layer 2 switching does not scalewell because of broadcasts. Although VLANs alleviate this problem tosome extent, there is definitely a need for machines on differentVLANs to communicate. One example is the situation where an orga-nization has multiple intranet servers on separate subnets (and henceVLANs), causing a lot of intersubnet traffic. In such cases, use of arouter is unavoidable; Layer 3 switches enter at this point.

Layer 3 Switches Layer 3 switching is a relatively new term, which has been “extended”by a numerous vendors to describe their products. For example, oneschool uses this term to describe fast IP routing via hardware, whileanother school uses it to describe Multi Protocol Over ATM (MPOA).For the purpose of this discussion, Layer 3 switches are superfast rout-ers that do Layer 3 forwarding in hardware. In this article, we willmainly discuss Layer 3 switching in the context of fast IP routing,with a brief discussion of the other areas of application.

Evolution Consider the Layer 2 switching context shown in Figure 1. Layer 2switches operate well when there is very little traffic between VLANs.Such VLAN traffic would entail a router—either “hanging off” one ofthe ports as a one-armed router or present internally within theswitch. To augment Layer 2 functionality, we need a router—which


leads to loss of performance since routers are typically slower thanswitches. This scenario leads to the question: Why not implement arouter in the switch itself, as discussed in the previous section, and dothe forwarding in hardware?

Although this setup is possible, it has one limitation: Layer 2 switchesneed to operate only on the Ethernet MAC frame. This scenario inturn leads to a well-defined forwarding algorithm which can beimplemented in hardware. The algorithm cannot be extended easily toLayer 3 protocols because there are multiple Layer 3 routableprotocols such as IP, IPX, AppleTalk, and so on; and second, theforwarding decision in such protocols is typically more complicatedthan Layer 2 forwarding decisions.

What is the engineering compromise? Because IP is the most commonamong all Layer 3 protocols today, most of the Layer 3 switchestoday perform IP switching at the hardware level and forward theother protocols at Layer 2 (that is, bridge them). The second issue ofcomplicated Layer 3 forwarding decisions is best illustrated by IPoption processing, which typically causes the length of the IP headerto vary, complicating the building of a hardware forwarding engine.However, a large number of IP packets do not include IP options—so,it may be overkill to design this processing into silicon. Thecompromise is that the most common (fast path) forwarding decisionis designed into silicon, whereas the others are handled typically by aCPU on the Layer 3 switch.

To summarize, Layer 3 switches are routers with fast forwarding donevia hardware. IP forwarding typically involves a route lookup,decrementing the Time To Live (TTL) count and recalculating thechecksum, and forwarding the frame with the appropriate MACheader to the correct output port. Lookups can be done in hardware,as can the decrementing of the TTL and the recalculation of thechecksum. The routers run routing protocols such as Open ShortestPath First (OSPF) or Routing Information Protocol (RIP) tocommunicate with other Layer 3 switches or routers and build theirrouting tables. These routing tables are looked up to determine theroute for an incoming packet.

Combined Layer 2/Layer 3 Switches We have implicitly assumed that Layer 3 switches also provide Layer2 switching functionality, but this assumption does not always holdtrue. Layer 3 switches can act like traditional routers hanging offmultiple Layer 2 switches and provide inter-VLAN connectivity. Insuch cases, there is no Layer 2 functionality required in these switches.This concept can be illustrated by extending the topology in Figure1—consider placing a pure Layer 3 switch between the Layer 2 Switchand the router. The Layer 3 Switch would off-load the router frominter-VLAN processing.

Layer 2 and Layer 3 Switch Evolution: continued


Figure 2:Combined Layer2/

Layer3 Switchconnecting directly

to the Internet

Figure 2 illustrates the combined Layer 2/Layer 3 switching function-ality. The combined Layer 2/Layer 3 switch replaces the traditionalrouter also. A and B belong to IP subnet 1, while C and D belong toIP subnet 2. Since the switch in consideration is a Layer 2 switch also,it switches traffic between A and B at Layer 2. Now consider the situ-ation when A wishes to communicate with C. A sends the IP packetaddressed to the MAC address of the Layer 3 switch, but with an IPdestination address equal to C’s IP address. The Layer 3 switch stripsout the MAC header and switches the frame to C after performingthe lookup, decrementing the TTL, recalculating the checksum andinserting C’s MAC address in the destination MAC address field. Allof these steps are done in hardware at very high speeds.

Now how does the switch know that C’s IP destination address is Port3? When it performs learning at Layer 2, it only knows C’s MACaddress. There are multiple ways to solve this problem. The switchcan perform an Address Resolution Protocol (ARP) lookup on all theIP subnet 2 ports for C’s MAC address and determine C’s IP-to-MACmapping and the port on which C lies. The other method is for theswitch to determine C’s IP-to-MAC mapping by snooping into the IPheader on reception of a MAC frame.

CharacteristicsConfiguration of the Layer 3 switches is an important issue. When theLayer 3 switches also perform Layer 2 switching, they learn the MACaddresses on the ports—the only configuration required is the VLANconfiguration. For Layer 3 switching, the switches can be configuredwith the ports corresponding to each of the subnets or they canperform IP address learning. This process involves snooping into theIP header of the MAC frames and determining the subnet on that portfrom the source IP address. When the Layer 3 switch acts like a one-armed router for a Layer 2 switch, the same port may consist ofmultiple IP subnets.

Management of the Layer 3 switches is typically done via SNMP.Layer 3 switches also have MAC addresses for their ports—this setupcan be one per port, or all ports can use the same MAC address. TheLayer 3 switches typically use this MAC address for SNMP, Telnet,and Web management communication.

1 4

2 3Station A Station D

Station CStation B

Combined Layer 2/3Switch

Internet


Conceptually, the ATM Forum’s LAN Emulation (LANE) specificat-ion is closer to the Layer 2 switching model, while MPOA is closer tothe Layer 3 switching model. Numerous Layer 2 switches areequipped with ATM interfaces and provide a LANE client function onthat ATM interface. This scenario allows the bridging of MAC framesacross an ATM network from switch to switch. The MPOA is closerto combined Layer2/Layer 3 switching, though the MPOA client doesnot have any routing protocols running on it. (Routing is left to theMPOA server under the Virtual Router model.)

Do Layer 3 switches completely eliminate need for the traditionalrouter ? No, routers are still needed, especially where connections tothe wide area are required. Layer 3 switches may still connect to suchrouters to learn their tables and route packets to them when thesepackets need to be sent over the WAN. The switches will be veryeffective on the workgroup and the backbone within an enterprise,but most likely will not replace the router at the edge of the WAN(read Internet in many cases). Routers perform numerous otherfunctions like filtering with access lists, inter-Autonomous System (AS)routing with protocols such as the Border Gateway Protocol (BGP),and so on. Some Layer 3 switches may completely replace the need fora router if they can provide all these functions (see Figure 2).

References [1] Computer Networks, 3rd Edition, Andrew S. Tanenbaum, ISBN 0-13-

349945-6, Prentice-Hall, 1996.

[2] Interconnections: Bridges and Routers, Radia Perlman, ISBN 0-201-56332-0, Addison-Wesley, 1992.

[3] “MAC Bridges,” ISO/IEC 10038, ANSI/IEEE Standard 802.1 D-1993.

[4] “Draft Standard for Virtual Bridged Local Area Networks,” IEEEP802.1Q/D6, May 1997.

[5] “Internet Protocol,” Jon Postel, RFC 791, 1981.

[6] “Requirements for IP Version 4 Routers,” Fred Baker, RFC 1812, June1995.

[7] “LAN Emulation over ATM Version 1.0,” af-lane-0021.000, TheATM Forum, January 1995.

[8] “Multiprotocol over ATM (MPOA) Specication Version 1.0” af-mpoa-0087.000, The ATM Forum, July 1997.

THAYUMANAVAN SRIDHAR is Director of Engineering at Future CommunicationsSoftware in Santa Clara, CA. He received his BE in Electronics and CommunicationsEngineering from the College of Engineering, Guindy, Anna University, Madras, India,his Master of Science in Electrical and Computer Engineering from the University ofTexas at Austin. He can be reached at [email protected]


Book ReviewGigabit Ethernet Gigabit Ethernet: Technology and Applications for High-Speed LANs,

by Rich Seifert, ISBN 0-201-18553-9, Addison-Wesley, 1998, http://www.awl.com/cseng/titles/0-201-18553-9.

Gigabit Ethernet is storming its way onto the high-speed LAN scene.From a concept in 1984 to an emerging commercial reality in 1998,Gigabit Ethernet promises to give other high-speed LAN technologies,especially ATM, a serious run for their money. Capitalizing on thebasic ease of use and deployment that has made other forms of Ether-net the most popular LAN technology of all, Gigabit Ethernet promisesto add major bandwidth to such networks in a straightforward, com-pletely compatible, and relatively affordable way. This book performsan excellent survey of the technologies, algorithms, and design princi-ples that make Gigabit Ethernet possible, and also explains where thetremendous appeal of Gigabit Ethernet really lies. Much of the book isdevoted to explaining Ethernet principles and operation in general, aswell as exploring recent developments that have enabled gigabit tech-nologies to emerge.

Organization The book is divided into three parts. Part I explores the foundationsthat underpin Gigabit Ethernet, starting with a brief but cogent explo-ration of Ethernet before gigabit versions loomed on the horizon. Therest of Part I covers the trends in LAN usage in general, and Ethernet inparticular, that laid the groundwork for Gigabit Ethernet. These trendsinclude the move from shared media to dedicated media on manyLANs, and likewise from shared LANs to dedicated LANs, and theconcomitant deployment of full-duplex technologies to support bidirec-tional, high-bandwidth communications. Seifert, an original member ofthe DIX (Digital-Intel-Xerox) team that developed Ethernet, writesclearly and compellingly about complex issues, such as flow control,medium independence, and automatic configuration, as he explainswhat made Gigabit Ethernet possible, if not inevitable.

In Part II, Seifert turns his focus onto Gigabit Ethernet itself, beginningwith an overview. In the rest of Part II, he explains how Media AccessControl (MAC) works for half-duplex and full-duplex versions ofGigabit Ethernet, and makes a strong case for the essential irrelevancyof shared-media and half-duplex operation for Gigabit Ethernet. Alongthe way, Seifert also covers how Gigabit Ethernet networking devices,such as repeaters and switching and routing hubs, must be designedand how they work, and covers the behavior and operation of thephysical layer at gigabit speeds.


He concludes this section of the book with a brief overview of the cur-rent IEEE Draft 802.3z specification that governs current GigabitEthernet operations, and mentions ongoing work in the 802.3ab sub-committee to define a workable implementation for Gigabit Etherneton twisted-pair media (1000BaseT, as it will probably be known).

In Part III, Seifert tackles some of the most interesting material in thisbook. He begins with a discussion of how LANs and computers changeroles over time in acting as the bottleneck for network use. The pointhere is that because of its extremely high bandwidth relative to thedemands of most applications and end-user requirements, GigabitEthernet is likely to remain a backbone or clustering technology for theforeseeable future. He also explores the performance considerations forboth networks and applications involved when extreme speeds orexcessive bandwidths are available, to point out how bandwidth aggre-gation is presently Gigabit’s most immediate and compellingcontribution to networking.

Finally, he explores how Gigabit Ethernet compares to other high-speed networking technologies, including Fast Ethernet, Fiber Distrib-uted Data Interface (FDDI), High-Performance Parallel Interface(HIPPI), Fibre Channel, and ATM. His discussion of why both ATMand Gigabit Ethernet are necessary, and why neither can fully sup-plant the other, represents a humorous and insightful analysis of whyconnection-oriented and connectionless communications and applica-tions are both good, and why the two can never truly converge.

An Outstanding Contribution A rundown of Seifert’s layout and content, however, fails to do com-plete justice to this book. For one thing, Seifert’s work includes thefunniest and most ingenious footnotes I’ve seen in recent publications,including some truly horrendous puns and some downright howlers.For example, when discussing how repeaters work, he comments that“A jabbering station causes carrier sense to be continuously assertedand blocks all use of a shared LAN. A repeater looks for this conditionand isolates the offending station.” To this last sentence, he appendsthe following footnote: “Research is underway to determine if thismechanism can be extended for use on politicians and university lectur-ers.” And this is just one of dozens of such gems that help to relieve thedryness that deeply technical material can sometimes manifest.

This book is also masterful simply because the author understands hismaterial so well, and does such an outstanding job of explaining andexploring even the most abstruse networking concepts. Although I’vebeen working with Ethernet for 15 years, I learned a great deal of newmaterial from Part I of the book because old concepts were explainedin new ways that improved my understanding. I suspect other readerswill have one or two “Aha!” experiences from this tome as well.

Book Review: continued


But it’s when making the case for full-duplex Gigabit Ethernet andexploring the requirements for switching and routing behaviors inGigabit Ethernet networking devices that this material really shines.

Without a doubt, this book is among the very best of any of the litera-ture available on high-speed networking today. I give it an A+ rating,not only because of the breadth and depth of its technical coverage andits compilation of essential concepts and information, but also becausethe author’s deep understanding of networking protocols and commu-nications needs enlivens all of his discussions of matters technical,business, and political. If you want to understand Gigabit Ethernet,this book is the obvious place to begin (and for many, to end) yoursearch for enlightenment.

But even if all you want is a good read about expensive, exotic, andhigh-performance technology, Seifert’s book offers the opportunityfor outright enjoyment of the prose, and shared delight at untanglingthe technical dilemmas that any good design engineer must unravelon the road between a set of requirements and working implementa-tion thereof.

—Ed TittelLANWrights, Inc.

[email protected]

More Book Reviews We have more book reviews awaiting publication:

• Internet Cryptography, by Richard E. Smith, ISBN 0-201-92480-3,Addison-Wesley, 1998. Reviewed by Fred Avolio.

• Web Security: A Step-by-Step Reference Guide, by Lincoln D. Stein,ISBN 0-201-63489-9, Addison-Wesley, December 1997. Reviewedby Richard Perlman

• IP Multicasting: The Complete Guide to Interactive CorporateNetworks, by Dave Kosiur ISBN 0-471-24359-0, Wiley ComputerPublishing, 1998. Reviewed by Neophytos Iacovou.

So, make sure you receive the next issue of The Internet Protocol Jour-nal due out in December 1998.


FragmentsMore on The Future of the Domain Name System (DNS)Shortly after our first issue went to press, the US Government issued aso-called White Paper as a follow on to the Green Paper. The WhitePaper, entitled “Management of Internet Names and Addresses,” canbe found at:http://www.ntia.doc.gov/ntiahome/domainname/domainhome.htm

In early July, The International Forum on The White Paper (IFWP)was formed. The IFWP is “an ad hoc coalition of professional, tradeand educational associations representing a diversity of Internet stake-holder groups.” The IFWP held a series of meetings in Reston,Brussels, Geneva, Singapore and Buenos Aires to discuss the WhitePaper, specifically the incorporation of the Internet Assigned NumbersAuthority (IANA). For more information on the IFWP process, see:http://www.ifwp.org

The IANA has posted draft bylaws for its incorporation on the IANAweb site at: http://www.iana.org, and asked for community input.By the time you read this, the incorporation should already have takenplace. We will provide an update in our next issue.

IETF Wins AwardThe Computer Professionals for Social Responsibility (CPSR) has cho-sen the Internet Engineering Task Force (IETF) to be honored with theNorbert Wiener award for the group’s influential role in the evolutionof the Internet. In its 12-year history, this is only the second time theCPSR has recognized an organization rather than an individual. TheIETF will accept the award at CPSR’s annual conference, on Saturdayevening, October 10, 1998, in Boston. The IETF is noted for its highlyopen and democratic processes that have affected the development ofthe Internet. The CPSR believes that such open processes are bothextremely important and seriously threatened, and have accordinglymade Internet governance the focus of its 1998 program year. TheNorbert Wiener award was established in 1987 by the CPSR in mem-ory of the originator of the field of cybernetics, whose pioneering workwas one of the pillars on which the computer technology was created.See: http://www.cpsr.org and http://www.ietf.org

Send us your comments! We look forward to hearing your comments and suggestions regardinganything you read in this publication. Send us e-mail at: [email protected]

This publication is distributed on an “as-is” basis, without warranty of any kind eitherexpress or implied, including but not limited to the implied warranties of merchantability,fitness for a particular purpose, or noninfringement. This publication could containtechnical inaccuracies or typographical errors. Later issues may modify or updateinformation provided in this issue. Neither the publisher nor any contributor shall have anyliability to any person for any loss or damage caused directly or indirectly by theinformation contained herein.

The Internet Protocol JournalOle J. Jacobsen, Editor and Publisher

Editorial Advisory BoardDr. Vint Cerf, Sr. VP, Internet Architecture and EngineeringMCI Communications, USA

David Farber The Alfred Fitler Moore Professor of Telecommunication Systems University of Pennsylvania, USA

Edward R. Kozel, Sr. VP, Corporate Development Cisco Systems, Inc., USA

Peter Löthberg, Network ArchitectStupi AB, Sweden

Dr. Jun Murai, Professor, WIDE Project Keio University, Japan

Dr. Deepinder Sidhu, Professor, Computer Science & Electrical Engineering, University of Maryland, Baltimore County Director, Maryland Center for Telecommunications Research, USA

Pindar Wong, Chairman and President,VeriFi Limited, Hong Kong

The Internet Protocol Journal is published quarterly by the Cisco News Publications Group, Cisco Systems, Inc.www.cisco.comTel: +1 408 526-4000E-mail: [email protected]

Cisco, Cisco Systems, and the Cisco Systems logo are registered trademarks of Cisco Systems, Inc. in the USA and certain other countries. All other trademarks mentioned in this document are the property of their respective owners.

Copyright © 1998 Cisco Systems Inc.All rights reserved. Printed in the USA.

The Internet Protocol Journal, Cisco Systems170 West Tasman Drive, M/S SJ-J4San Jose, CA 95134-1706USA

ADDRESS SERVICE REQUESTED

Bulk Rate MailU.S. Postage

PAIDCisco Systems, Inc.

september 1998 volume 1, number 2 - cisco › c › dam › en_us › about › ac123 › ac147 ›...

Documents