the case for an internet health monitoring system · the case for an internet health monitoring...

The Case for an Internet Health Monitoring SystemMatthew Caesar, Lakshminarayanan Subramanian, Randy H. Katz�

mccaesar,lakme,randy � @cs.berkeley.edu

AbstractInternet routing is plagued with several problems today,including chronic instabilities, convergence problems, andmisconfiguration of routers. We believe that a first step to-wards making the Internet robust to these problems is bydeveloping a systematic methodology for analyzing rout-ing changes and inferring why they happen and where theyoriginate. In this paper, we motivate the need as well as de-scribe the design of an Internet health monitoring systemthat identifies the source of routing instabilities purely bypassively observing routing updates from different vantagepoints. We believe such a system could be used to contin-uously infer the state of the network. Such inferences maythen be used offline for network performance monitoringand troubleshooting, or online to improve path selectionand damping of instability.

1 IntroductionDiagnosing Internet routing problems is a highly challeng-ing task for network operators. On many occasions, locat-ing the source of a problem and fixing it requires the col-lective effort of several operators in different administra-tive domains. Unfortunately, the current diagnosis method-ologies used by operators are often very ad-hoc and time-consuming processes involving phone calls and email ex-changes [13].

Compounding this problem is the fact that the Border Gate-way Protocol (BGP), the de facto inter-domain routing pro-tocol, has evolved into a complex protocol with a numberof configurable policies, making the dynamics of Internetrouting hard to model and comprehend. BGP operates atthe granularity of Autonomous Systems (ASes) where op-erators in each AS independently make policy decisions.This makes it difficult for operators to predict the impactof simple configuration changes on global routing dynam-ics. Hence, existing efforts to diagnose problems in BGPand to modify the protocol to eliminate its shortcomingshave become a black art.

To assist in diagnosis, this paper proposes the design of anInternet health monitoring system that analyzes route up-dates from multiple vantage points and localizes the sourceand potential cause of routing events that triggered eachupdate. Although pinpointing the cause of certain routingchanges may be fundamentally hard [5], for certain types

of massive routing events (e.g.,, session resets), it is oftenpossible to pinpoint the exact inter-AS link which mighthave underwent the reset due to the large number of con-stituent updates [17]. We believe such a system can beuseful for network operators in several ways. First, shar-ing health inferences across administrative domains can re-duce operator time spent investigating problems and speedtheir repair. Second, creating a repository for historical in-formation about observed events using the inferences fromour system can be used by operators to set policies. Third,the current mechanisms used for damping route flaps (toimprove stability) is known to exacerbate route conver-gence [7]. Using a health inference system could enableapplication of damping selectively at a much finer granu-larity. Finally, overlay networks like RON [2] or route con-trol devices [1] can utilize these inferences to selectivelyprobe and choose alternative paths that have better stabil-ity properties.

We have developed a full-fledged health monitoring sys-tem that has been operational for over �� months. Duringthis period, we have been able to pinpoint several routinganomalies of high magnitude many of which were previ-ously unknown. We found a number of inter-AS links tobe perennially unstable and found certain links to be par-ticularly prone to failure when the Internet was congested(e.g., when Internet worms were propagating). While ourproposed design currently uses only passive BGP updateinformation, we believe that the precision can be enhancedby coupling it with active probing diagnostic tools like AS-traceroute [8].

2 Types of routing problemsBGP’s rich set of configurable policies and cross-protocolinteractions make it highly subject to a wide range of rout-ing problems. We illustrate three very important problemsthat occur on a regular basis today:

Continuously flapping routes: BGP is known to havepoor convergence properties [7]. A continuously flappingroute is a sign of an extreme convergence problem wherea prefix is repeatedly updated over long periods of time.Flapping is clearly undesirable since it affects route avail-ability as well as places a phenomenal load on routers. Wefound that routes to a surprisingly large fraction of prefixes(25%) are involved in a continuously flapping event at any

point in time. Moreover, some flapping events would lastfor long periods of time, for example, one such event lastedfor 80 days.

High magnitude events: Some BGP events simultane-ously affect a large number of routes within a short periodof time. For example, a single session between two BGProuters that resets can simultaneously affect route availabil-ity for nearly �� prefixes as well as trigger a separateroute convergence process for each prefix. Session failuresoccur on a daily basis and form the primary reason for mas-sive bursts of updates and routing changes in the Internet.They can be triggered by flash crowds and worm outbreaks,which can cause the TCP channel carrying the BGP sessionto suffer significant packet loss. We observed over �� session resets during our �� month period. However, wefound that most resets were caused by a small number ofsessions: the 9% most unstable sessions contributed 49%of the total number of observed resets.

Misconfigurations: Nearly �� prefixes are af-fected by misconfigurations every day [9]. There are threekey types of misconfigurations. (1) address hijacking:where an AS announces one or more prefixes belongingto another AS. Certain hijacking events are hard to pro-tect against and have caused large outages in the Interneton several occasions. (2) policy violations: where the ASadvertises a route that is in violation of its policy. For ex-ample, routes from providers and peers are typically notforwarded to other providers and peers, as there is no eco-nomic incentive for an ISP to forward such traffic. Theseincidents typically do not cause connectivity problems butthey do bring more traffic into the AS, which can trig-ger congestion [9]. (3) leaking of routes: where routes thatshould be filtered due to configuration or aggregation rulesare unintentionally leaked to neighbors. Leaking of routescan increase load on routers and routing table size, and cantrigger router reboots if the routing table exceeds memorycapacity [11]. One key class of leaked routes are adver-tised prefixes that are a subblock of a larger, more persis-tently available prefix. We found that routes in this classare particularly unstable, with 50% of these routes beingwithdrawn within 8 minutes of being advertised.

3 Challenges to diagnosisIn this section we discuss three key challenges to diagnosisin BGP and describe ways we aim to address them.

AS as a distributed system: Each AS is a large networkcomprising of thousands of routers. BGP advertisementsat the AS-level provide zero visibility into intra-AS routingdynamics. Hence it is hard to determine whether a problemexists within an AS or on inter-AS peering links. More-over, since ASes may be connected by multiple peeringlinks, two AS-level paths that intersect in the AS-topologymight not intersect in the network-level topology, as shown

Figure 1: Two paths that intersect in the AS-level topology mightnot intersect in the router-level topology. Hence an event takingplace in the intersection of the two AS-level paths might not affectboth routes.

in Figure 1. We address these issues by designing inferencemechanisms that do not make assumptions about networklevel topology.

Simultaneous events: To determine the cause of a routingevent, it is necessary to determine the set of updates trig-gered by the event. However, given the vast size of the In-ternet and the high rate of routing events, multiple routingevents may simultaneously affect routes to the same pre-fix. This may cause route advertisements triggered by thetwo events to overlap in time. Hence, it becomes a difficulttask to determine when one can safely determine whether agroup of routing updates for a prefix is triggered by a singleevent.

We address this by separating prefixes into two groups: sta-ble prefixes that are rarely updated, and continuously flap-ping prefixes that are repeatedly updated over long periodsof time. For stable prefixes, events are separated by longperiods of time, making separation of updates tractable.Most popular prefixes (prefixes that sink the most traffic)also tend to be very stable, and hence events affecting datatraffic are more likely to be observable. Separating the up-dates triggered by different events for a continuously flap-ping prefix is fundamentally hard.

Visibility of events: Not all events may cause an observ-able route update in BGP. There are two key reasons whyupdates from an event might not be observed: (i) The stateof the route might change through several intermediatestates while the route is being filtered or dampened. (ii)The event may not affect any of the views due to its loca-tion. In addition, minor events may be unobservable in thepresence of major events, since the few updates caused bya minor event get subsumed as noise when a major eventtriggers many route updates at the same time. However, ifa minor event affects a prefix that is not continuously flap-ping, then updates to it are typically visible.

We found that if an event affects a prefix that is not con-tinuously flapping, then updates to it are typically visible.Moreover, we found that an event with a large magnitudeis visible at a vantage point provided the vantage point

has a large number of routes traversing the location wherethe event occurred. In addition, the observed magnitude ofan event is fairly constant regardless of distance (AS hopcount) from the view. To increase the chances we observean event, we use multiple vantage points. In practice, wefound that using � � �

vantage points is sufficient to ob-serve most minor events.

4 Diagnosis approach

Figure 2: Architecture of health monitoring system.

The health monitoring system (Figure 2) collects updatesfrom multiple vantage points (routers) located in variousASes, whose owners volunteer to make their routing in-formation available to the health monitor. The monitor, inturn, reports its inferences to each vantage point. Route-views [15] and RIPE [14] are two-real world examples ofupdate collection systems where one can deploy a healthmonitor. In this section we provide a brief overview of thesystem.

We define a routing event as an activity taking place atsome location in the network that generates one or moreroute updates. A route update observed by the health mon-itoring system is a clear signal of some routing event inthe Internet. BGP route updates are at the granularity ofprefixes and every update contains an ASPATH attributewhich describes the entire routing path at the AS level tothe destination prefix. This ASPATH attribute is the pri-mary clue at the disposal of a health monitor to determinethe location and potential cause of routing events.

4.1 System components

Our health monitoring system consists of three compo-nents. First, the Local-view inference engine collects rout-ing updates via BGP peering sessions with routers. Its roleis to infer the location of events purely based on localobservations. Next, the Aggregator aggregates the poten-tially large volume of inferences from each view and de-termines a refined set of events that might have potentiallyoccurred. Finally, the Health monitor computes differentstatistics on the inferences, and triggers alarms when un-healthy behavior is detected (e.g., events of high magni-

tude, recurrent events etc.). This component also providesan interface for applications to query statistics or set trapsfor various alarms. We continuously publish the resultsfrom our health monitor based on Routeviews data on aweb site [18].

Separating the functionality of our system into these com-ponents provides several benefits. First, information is ag-gregated and refined locally before sending them to thehealth monitor, thereby reducing bandwidth requirementsand improving scalability (especially given that these com-ponents may be geographically distributed). Second, weobtain flexibility in deployment where an operator maychoose to deploy a set of components within their ownAS for internal monitoring, without joining the centralizedhealth monitor. Third, information considered proprietarycan be filtered or anonymized between components (espe-cially given that operators may wish to restrict informationflow across AS boundaries due to privacy considerations).Finally, components can be replicated and appropriatelyplaced to improve fault-tolerance and availability duringfailures.

4.2 Inference methodology

Figure 3: Algorithm flow.

The primary design challenge in building a health monitor-ing system is the methodology used for inferring routingevents purely based on passively observing route updates.In particular, the goal is to determine the set of locationswhere the event could have occurred, and for each locationa list of possible causes that could have occurred at that

location.

Given the complicated nature of this problem and the asso-ciated challenges, we split the problem into different sub-problems and tackle each in isolation, as shown in Figure 3.We only touch upon the salient aspects of our method-ology in this section (refer to [17] for details). Previouswork [4] [3] has demonstrated the ability to localize eventsbased solely on BGP updates. Our inference methodologyextends previous work along three dimensions:

Separating stable from continuously flapping prefixes:We separate prefixes which are relatively stable from thosethat get continuously updated. We observe that for stableprefixes, two properties commonly hold: (a) routing eventsaffecting these prefixes are typically visible and not af-fected by damping; (b) two different events affecting thesame prefix are separable in time, making it possible todistinguish updates from different events. These two prop-erties make root-cause analysis for these prefixes feasible.Root-cause analysis is more challenging for continuouslyflapping prefixes, and so we use a different methodologyfor these prefixes as discussed in our technical report [17].

0

0.01

0.02

0.03

0.04

0.05

0 300 600 900 1200

Per

cent

Number of prefixes updated

Figure 4: Frequency of time intervals with various numbers ofupdated prefixes for a view in AS 1239. We use the gap (sepa-ration shown with vertical dashed lines) to distinguish Turbulentfrom Quiescent periods. Less than 0.2% of the mass is within thegap. We found similarly large gaps for other links and views.

Improve vs. Worsen classes: While the list of causes ofrouting events is innumerable, many of these causes can becategorized into two classes: events that cause the path toImprove, and those that cause the path to Worsen accord-ing to the BGP path selection process. For each update,our system determines which of the two classes each ASin the AS path falls into, giving rough information aboutthe cause of the event. We can also intersect observationsacross views and prefixes for each class to further narrowdown location. As future work, we aim to define a morerich set of classes, to obtain a finer distinction between dif-ferent types of causes.

For example, suppose a view � observes a single routechange from path �� to �� to a destination � as shownin Figure 5. While one can associate several causes for this

Figure 5: Example: A single prefix undergoes an ASPATHchange.

route change, we can definitely conclude that the routingevent either improves the properties of � � causing it to bethe more preferred path or worsens the properties of � �by making it less preferred. Hence, the underlying causeof the event, though unknown, can be classified into theimprove or worsen categories. If the event occurred along�� , then the cause belongs to the Improve class, otherwiseit belongs to the Worsen class if it occurred in path �� .Knowing the classes can also help narrow down location:since an event must be in the same class regardless of view,our system intersects ASes appearing in the Improve (i.e.ASes along path �� ) and Worsen (ASes along �� ) classesacross views. Thus far, we have defined only a small set ofclasses and defining a more complete set to obtain a finerdistinction between different types of causes is a subject offuture work.

Correlation of routing updates: One of the key problemswe need to address is how we correlate updates that arecaused by the same event without incorrectly correlatingobservations from different events. We correlate updatesacross three dimensions: time, prefixes and views. From asingle view, we cluster updates to each prefix based on howclose they occur together in time so as to separate routingupdates triggered by different routing events. Since the pre-fix is not continuously flapping, bursts of updates are likelyto be caused by the same event. Next, we cluster acrossprefixes by noting that a large number of prefixes simul-taneously updated in a short period of time can indicatethat the multiple prefixes are affected by the same event.We found that time intervals can be clearly classified intoQuiescent and Turbulent periods based on the number ofprefixes updated during that time (Figure 4), and show thatmany observations in a Turbulent period are triggered bythe same event.

Figure 6: Example: Effects of a MED decrease on a single view.

For example, suppose � uses � to reach � and � , asshown in Figure 6(a). Suppose many prefixes simultane-ously change to start using � . Given that such a large burstof updates is rare, it is likely they were all caused by thesame event. Moreover, we can determine the event mostlikely occurred on the common sub-path across all theseprefixes. Hence there are two possible causes: either anevent took place on �� that worsened the propertiesof the path, or an event took place on �� that improvedthe properties of the path. A second example is shown inFigure 6(b). Suppose many prefixes simultaneously changeto start using �� . Since it is unlikely two simultaneousmajor events occurred on both � � �� and � � � � �� , itis likely that a single major event either (a) took place at�� that improved the properties of the path, or (b) tookplace internally in � that worsened the paths of several pre-fixes using �� and �� (since � is the onlycommon AS across the sub-paths). In both these examples,we can separate ASes into improve and worsen categoriesand perform Turbulent inference on each class indepen-dently.

5 Implementation resultsWe have built a prototype of our system and used it to an-alyze route route updates collected from Routeviews andRIPE over a period of 18 months. We describe some pre-liminary observations that we believe show such a systemcan both be used by network operators to isolate and re-pair faults, and also to provide better insight into Internetrouting dynamics arising from those faults.

Figure 7: Breakdown of updates by cause.

Although we were unable to pinpoint the cause for any up-date, we were able to narrow down the cause into a severalcategories. Figure 7 gives the number of updates for eachtype of event. We found that we could pinpoint the locationof virtually all major events we detected, and could narrowdown the location of Quiescent events to within 2 ASesfor 20% of updates. We found the majority of continuouslyflapping prefixes could be pinpointed to a single AS. Over-all, we could pinpoint the location to a single inter-AS link(pair of ASes) for 70% of updates.

Previously unnoticed events: A number of importantrouting anomalies go unnoticed in the Internet today. Our

system was able to detect many events which have goneunnoticed in the past. We give three examples here. First,on January 2002, a Chinese ISP underwent on average twosession resets per day for a period of two weeks, affectingreachability to over 1000 prefixes in China. Second, in July2003, the peering link between AS 3561 and AS 1239 un-derwent a large number of session-reset like events, affect-ing the reachability of over 20,000 prefixes including thedomains cnet.com, excite.com, and weather.com. Third, inJune 2004, AS 2500 began to advertise paths for over 500prefixes it did not own, affecting reachability to several ma-jor providers.

Continuously flapping routes: From our analysis of BGPupdates, we detected two distinct categories of flappingprefixes: (a) a prefix is classified as continuously flap-ping by several views (near-origin flaps); (b) a prefix con-tinuously flaps only with respect to a single view (near-view flaps). Many near-origin flaps are triggered due to thewidespread deployment of route control products [1] fortraffic engineering purposes, while many near-view flapsare caused by instabilities within the AS containing theview. In practice, we found that near-origin flaps are muchmore likely to affect reachability than near-view flaps.Also, we found that a small fraction of the prefixes (1%)are particularly prone to continuously flapping events, andthereby trigger a disproportionately large number (90%) ofrouting changes.

0

0.2

0.4

0.6

0.8

1

Jan 20 8:21 Jan 24 00:00 Jan 28 00:00

Per

cent

age

of e

vent

s

Time (UTC)

(3257)(3549)(4777)(7018)

Figure 8: Effect of the SQL slammer worm on the number of ma-jor events detected (we filter out effects of local resets). Each linecorresponds to observations made at a different vantage point.

High magnitude events: We observed 39,387 large events(events affecting more than 1000 prefixes) during the 18month period. Roughly two thirds of these appeared toaffect inter-AS adjacencies while the rest appeared to bedue to instability inside an AS. We found certain links tobe perennially unstable, and we found certain links to behighly prone to instability: over 50% of large events oc-curred within 2 hours of another large event affecting thesame link/AS. Although large events are more rare in theInternet core (tier-1 ISPs) than at the edges, we found therate at which these events occur in the core increases by an

order of magnitude when Internet worms are propagating.For example, Figure 8 shows a sharp increase in the num-ber of major events taking place while the SQL slammerworm spreads.

Correctness: Although validation can be difficult becauseISPs are (understandably) unwilling to share informationabout faults occurring inside their networks, we use severalapproaches to validate the correctness of our inferences.First, we compare our inferences on certain route updateswhere the cause and location are well known. These up-dates may have been manually injected as part of exper-iments [6] or they may be due to large historical eventsreported on in the media. Next, we find large events in ourinferences and perform post-mortem analysis to attempt toexplain their cause [16]. Then we cross-validate inferencesmade at each view in isolation and check for conflicts. Fi-nally, we correlate our observations with logs from twolarge ISPs with cases when we infer that ISP to be the causebased on external observations. In each of these cases, wefound our inferences to match with what was known to becorrect.

6 Conclusions and research agendaThis paper has made the case that building a health mon-itoring system for the Internet is feasible. We presentedsome preliminary measurements from a prototype imple-mentation that suggest such a system could benefit networkoperators and end users. However, there is substantial fu-ture work to be done to improve upon the system design.First, we aim to improve inference accuracy, by leverag-ing active measurements, and by developing a methodol-ogy to determine which parts of the network are most crit-ical for deploying vantage points. Second, we plan to de-velop a more robust distributed implementation that canhandle the heavy update loads of the Internet core. We planto complete the design of the health monitor, by develop-ing guidelines for determining which observations consti-tute unhealthy behavior, so system can then trigger alarmswhen these observations occur. Finally, demonstrating fea-sibility of this system opens the door to several interestingavenues of future work, including how the system can beused to simplify network management and troubleshoot-ing, and how the inferences can be used in an online fash-ion to improve stability of Internet and overlay routing.

References[1] A. Akella, J. Pang, B. Maggs, S. Seshan and A. Shaikh,

“A Comparison of Overlay Routing and Multihoming RouteControl”, ACM SIGCOMM 2004.

[2] D. Andersen, H. Balakrishnan, M. Kaashoek, R. Morris,“Resilient overlay networks,” in Proc. of SOSP, Banff,Canada, October 2001.

[3] D. Chang, R. Govindan, J. Heidemann, “The Temporal andTopological Characteristics of BGP Path Changes,” in Proc.of ICNP, Atlanta, GA, November 2003.

[4] A. Feldmann, O. Maennel, Z. Mao, A. Berger, B. Maggs,“Locating Internet routing instabilities,” in Proc. of SIG-COMM, Portland, Oregon, August 2004.

[5] T. Griffin, “What is the sound of one route flapping?,” pre-sentation made at the Network Modeling and SimulationSummer Workshop, 2002.

[6] Z. Mao, R. Bush, T. Griffin, M. Roughan, “BGP beacons,” inProc. Internet Measurement Conference, October 2003.

[7] Z. Mao, R. Govindan, D. Varghese, R. Katz, “Route flapdamping exacerbates Internet routing convergence,” in Proc.of SIGCOMM, Pittsburgh, PA, August 2002.

[8] Z. Mao, J. Rexford, J. Wang, R. Katz, “Towards an accurateAS-level traceroute tool,” in Proc. of SIGCOMM, Karlsruhe,Germany, August 2003.

[9] R. Mahajan, D. Wetherall, T. Anderson, “UnderstandingBGP misconfiguration,” in Proc. of SIGCOMM, Pittsburgh,Pennsylvania, August 2002.

[10] Y. Rekhter, T. Li, “A Border Gateway Protocol 4 (BGP-4)”,IETF, RFC 1771, March 1995. Section 9, pgs 33-46

[11] L. Wang, X. Zhao, D. Pei, R. Bush, D. Massey, A. Mankin,S. Wu, L. Zhang, “Observation and analysis of BGP behaviorunder stress,” in IMW, November 2002.

[12] C. Villamizar, R. Chandra, R. Govindan, “BGP route flapdamping,” RFC 2439, November 1998.

[13] “NANOG Mailing List,” http://www.merit.edu/mail.archives/nanog/

[14] “RIPE RIS,” http://www.ripe.net/ris/.[15] “Route Views Project,” http://www.routeviews.

org.[16] “Sprintlink: Scheduled Maintenence and Outage” http:

//www.sprintlink.net/maintview/[17] M. Caesar, L. Subramanian, R. Katz, ”Towards a BGP

health inferencing system,” U.C. Berkeley Technical ReportUCB/CSD-04-1321, November 2004.

[18] http://www.cs.berkeley.edu/˜mccaesar/hmon.html

the case for an internet health monitoring system · the case for an internet health monitoring...

Documents