ieee transactions on dependable and secure ...cseweb.ucsd.edu/~savage/papers/tdsc06.pdfpresents a...

15
Detecting and Isolating Malicious Routers Alper Tugay Mzrak, Student Member, IEEE, Yu-Chung Cheng, Keith Marzullo, Member, IEEE, and Stefan Savage, Member, IEEE Abstract—Network routers occupy a unique role in modern distributed systems. They are responsible for cooperatively shuttling packets amongst themselves in order to provide the illusion of a network with universal point-to-point connectivity. However, this illusion is shattered—as are implicit assumptions of availability, confidentiality, or integrity—when network routers are subverted to act in a malicious fashion. By manipulating, diverting, or dropping packets arriving at a compromised router, an attacker can trivially mount denial-of-service, surveillance, or man-in-the-middle attacks on end host systems. Consequently, Internet routers have become a choice target for would-be attackers and thousands have been subverted to these ends. In this paper, we specify this problem of detecting routers with incorrect packet forwarding behavior and we explore the design space of protocols that implement such a detector. We further present a concrete protocol that is likely inexpensive enough for practical implementation at scale. Finally, we present a prototype system, called Fatih, that implements this approach on a PC router and describe our experiences with it. We show that Fatih is able to detect and isolate a range of malicious router actions with acceptable overhead and complexity. We believe our work is an important step in being able to tolerate attacks on key network infrastructure components. Index Terms—Communication/networking and information technology, network-level security and protection, network protocols, routing protocols, fault tolerance. Ç 1 INTRODUCTION T HIS paper addresses part of a simple, yet increasingly important network security problem: how to detect the existence of compromised routers in a network and then remove them from the routing fabric. The root of this problem arises from the key role that routers play in modern packet switched data networks. To a first approx- imation, networks can be modeled as a series of point-to- point links connecting pairs of routers to form a directed graph. Since few endpoints are directly connected, data must be forwarded—hop-by-hop—from router to router, toward its ultimate destination. Therefore, if a router is compromised, it stands to reason that an attacker may drop, delay, reorder, corrupt, modify, or divert any of the packets passing through. Such a capability can then be used to deny service to legitimate hosts, to implement ongoing network surveillance, or to provide an efficient man-in-the-middle functionality for attacking end systems. Such attacks are not mere theoretical curiosities, but they are actively employed in practice. Attackers have repeat- edly demonstrated their ability to compromise routers, through combinations of social engineering and exploita- tion of weak passwords or latent software vulnerabilities [1], [2], [3]. One network operator recently documented more than 5,000 compromised routers as well as an underground market for trading access to them [4]. Once a router is compromised an attacker need not modify the router’s code base to exploit its capabilities. Current standard command-line interfaces from vendors such as Cisco and Juniper are sufficiently powerful enough to drop and delay packets, send copies of packets to a third party, or “divert” packets through a third party and back. In fact, several widely published documents provide a standard cookbook for transparently “tunneling” packets from a compromised router through an arbitrary third-party host and back again—effectively amplifying the attacker’s abilities, including arbitrary packet sniffing, injection, or modification [5], [6]. Such attacks can be extremely difficult to detect manually, and it can be even harder to isolate which particular router or group of routers has been compromised. One approach to this problem is to detect the act of router intrusion as it happens. For example, traditional host-based intrusion detection techniques, such as system call anomaly analysis [7], [8], [9], could be applied to the router environment as well. However, this approach also has the same limitations as in the host environment—once a router is compromised, the detection software is no longer dependable. Indeed, it has become increasingly common for malware to disable antivirus and IDS products and we see no reason to believe the router environment will be different. Thus, while such attack detection mechanisms are undoubtedly an important defensive element, in this paper, we start from the assumption that routers may be successfully compromised. Given this threat model, the problem of detecting a compromised router falls to its neighbors: A compromised router can potentially be identified by correct routers when it deviates from exhibiting expected behavior. This overall approach can be broken into three distinct subproblems: 1. Traffic validation. Traffic information is the basis of detecting anomalous behavior: Given traffic entering a part of the network, and an expected behavior for IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006 1 . The authors are with the Computer Science and Engineering Department, University of California at San Diego, EBU3B, 9500 Gilman Drive, La Jolla, CA 92093-0404. E-mail: {amizrak, ycheng, marzullo, savage}@cs.ucsd.edu. Manuscript received 14 Sept. 2005; revised 12 Apr. 2006; accepted 25 Apr. 2006; published online 3 Aug. 2006. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-0123-0905. 1545-5971/06/$20.00 ß 2006 IEEE Published by the IEEE Computer Society

Upload: others

Post on 06-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

Detecting and Isolating Malicious RoutersAlper Tugay M�zrak, Student Member, IEEE, Yu-Chung Cheng, Keith Marzullo, Member, IEEE, and

Stefan Savage, Member, IEEE

Abstract—Network routers occupy a unique role in modern distributed systems. They are responsible for cooperatively shuttling

packets amongst themselves in order to provide the illusion of a network with universal point-to-point connectivity. However, this

illusion is shattered—as are implicit assumptions of availability, confidentiality, or integrity—when network routers are subverted to act

in a malicious fashion. By manipulating, diverting, or dropping packets arriving at a compromised router, an attacker can trivially mount

denial-of-service, surveillance, or man-in-the-middle attacks on end host systems. Consequently, Internet routers have become a

choice target for would-be attackers and thousands have been subverted to these ends. In this paper, we specify this problem of

detecting routers with incorrect packet forwarding behavior and we explore the design space of protocols that implement such a

detector. We further present a concrete protocol that is likely inexpensive enough for practical implementation at scale. Finally, we

present a prototype system, called Fatih, that implements this approach on a PC router and describe our experiences with it. We show

that Fatih is able to detect and isolate a range of malicious router actions with acceptable overhead and complexity. We believe our

work is an important step in being able to tolerate attacks on key network infrastructure components.

Index Terms—Communication/networking and information technology, network-level security and protection, network protocols,

routing protocols, fault tolerance.

Ç

1 INTRODUCTION

THIS paper addresses part of a simple, yet increasinglyimportant network security problem: how to detect the

existence of compromised routers in a network and thenremove them from the routing fabric. The root of thisproblem arises from the key role that routers play inmodern packet switched data networks. To a first approx-imation, networks can be modeled as a series of point-to-point links connecting pairs of routers to form a directedgraph. Since few endpoints are directly connected, datamust be forwarded—hop-by-hop—from router to router,toward its ultimate destination. Therefore, if a router iscompromised, it stands to reason that an attacker may drop,delay, reorder, corrupt, modify, or divert any of the packetspassing through. Such a capability can then be used to denyservice to legitimate hosts, to implement ongoing networksurveillance, or to provide an efficient man-in-the-middlefunctionality for attacking end systems.

Such attacks are not mere theoretical curiosities, but they

are actively employed in practice. Attackers have repeat-

edly demonstrated their ability to compromise routers,

through combinations of social engineering and exploita-

tion of weak passwords or latent software vulnerabilities

[1], [2], [3]. One network operator recently documented

more than 5,000 compromised routers as well as an

underground market for trading access to them [4]. Once

a router is compromised an attacker need not modify the

router’s code base to exploit its capabilities. Current

standard command-line interfaces from vendors such asCisco and Juniper are sufficiently powerful enough to dropand delay packets, send copies of packets to a third party, or“divert” packets through a third party and back. In fact,several widely published documents provide a standardcookbook for transparently “tunneling” packets from acompromised router through an arbitrary third-party hostand back again—effectively amplifying the attacker’sabilities, including arbitrary packet sniffing, injection, ormodification [5], [6]. Such attacks can be extremely difficultto detect manually, and it can be even harder to isolatewhich particular router or group of routers has beencompromised.

One approach to this problem is to detect the act ofrouter intrusion as it happens. For example, traditionalhost-based intrusion detection techniques, such as systemcall anomaly analysis [7], [8], [9], could be applied to therouter environment as well. However, this approach alsohas the same limitations as in the host environment—once arouter is compromised, the detection software is no longerdependable. Indeed, it has become increasingly common formalware to disable antivirus and IDS products and we seeno reason to believe the router environment will bedifferent. Thus, while such attack detection mechanismsare undoubtedly an important defensive element, in thispaper, we start from the assumption that routers may besuccessfully compromised.

Given this threat model, the problem of detecting acompromised router falls to its neighbors: A compromisedrouter can potentially be identified by correct routers whenit deviates from exhibiting expected behavior. This overallapproach can be broken into three distinct subproblems:

1. Traffic validation. Traffic information is the basis ofdetecting anomalous behavior: Given traffic enteringa part of the network, and an expected behavior for

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006 1

. The authors are with the Computer Science and Engineering Department,University of California at San Diego, EBU3B, 9500 Gilman Drive, LaJolla, CA 92093-0404.E-mail: {amizrak, ycheng, marzullo, savage}@cs.ucsd.edu.

Manuscript received 14 Sept. 2005; revised 12 Apr. 2006; accepted 25 Apr.2006; published online 3 Aug. 2006.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TDSC-0123-0905.

1545-5971/06/$20.00 � 2006 IEEE Published by the IEEE Computer Society

Page 2: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

the routers in the network (i.e., a known routingconfiguration), anomalous behavior is detectedwhen the monitored traffic leaving one part of thenetwork differs significantly from what is expected.However, implementing such validation practicallycan be quite tricky and requires tradeoffs betweenthe overhead of monitoring, communication, andaccuracy.

2. Distributed detection. It is impossible for a singlerouter to establish that its neighbor is anomalous.Thus, detection requires synchronizing a collectionof traffic information and distributing the results soanomalous behavior can be detected by sets ofcorrect routers.

3. Response. Once a router, or set of routers, is thoughtto be faulty, the forwarding tables of correct routersmust be changed to avoid using those compromisednodes. In addition, over longer time scales, anappropriate alert must be raised so human forensicexperts can respond appropriately.

This paper addresses each of these problems in turn.For different threats, we describe a range of appropriateand efficient traffic validation functions. Next, we examinehow these functions can be used to build an anomalousbehavior detector for compromised routers. Finally, weshow how this detector can be integrated into link-staterouting to automatically isolate network paths demonstrat-ing anomalous behavior. This paper is substantiallyextended version of [10].

2 RELATED WORK

There are two threats posed by a compromised router: theattacker may attack by means of the routing protocol (forexample, by sending false advertisements) or by having therouter violate the forwarding decisions it should makebased on its routing tables. The first is often referred to as anattack on the control plane, while the second is termed anattack on the data plane.

The first threat has received, by far, the lion’s share of theattention in the research community, perhaps due itspotential for catastrophic effects. By issuing false routingadvertisements, a compromised router may manipulate howother routers view the network topology, and thereby disruptservice globally. For example, if a router claims that it isdirectly connected to all possible destinations, it may becomea “black hole” for most traffic in the network. While thisproblem is by no means solved in practice, there has beensignificant progress towards this end in the researchcommunity, beginning with the seminal work of Perlman.In her PhD thesis [11], Perlman described robust floodingalgorithms for delivering the key state across any connectednetwork and a means for explicitly signing route advertise-ments. There have subsequently been a variety of efforts toimpart similar guarantees to existing routing protocols withvarying levels of cost and protection based on ensuring theauthenticity of route updates and detecting inconsistencybetween route updates [12], [13], [14], [15], [16], [17], [18], [19].

By contrast, the threat posed by subverting the forward-ing process has received comparatively little attention. This

is surprising since, in many ways, this kind of attackpresents a wider set of opportunities to the attacker—notonly denial-of-service, but also packet sniffing, modifica-tion, and insertion—and is both trivial to implement (a fewlines typed into a command shell) and difficult to detect.This paper focuses entirely on the problem of maliciousforwarding.

The earliest work on fault-tolerant forwarding is also dueto Perlman [11], [20]. Perlman developed a novel methodfor robust routing based on source routing, digitally signedroute-setup packets, reserved buffers. However, many im-plementation details are left open and the protocol requireshigher network level participation to detect anomalies.Several researchers have subsequently proposed lighter-weight protocols for actively probing the forwarding pathto test for consistency with advertised routes. Subramanianet al.’s Listen protocol [12] does this by comparing TCPData and Acknowledgment packets to provide evidencethat a path is part of end-to-end connectivity. This approachonly tests for gross connectivity and cannot reveal whetherpackets have been diverted, modified, created, reordered,or selectively dropped. Padmanabhan and Simon’s SecureTraceroute [21] achieves a similar goal monitoring thetraffic to the intermediate routers. Recently, Avramopouloset al. [22], [23] present a secure router routing a combinationof source routing, hop by hop authentication, end-to-endreliability mechanisms, and timeouts. But, it still has a highoverhead to be deployable in modern networks. Theapproach most similar to our own is the WATCHERS[24], [25] protocol, which detects disruptive routers basedon a distributed network monitoring approach and a trafficinvariant called conservation of flow. However, theWATCHERS protocol had many limitations in both itstraffic validation mechanism and in its control protocol,many of which were documented by Hughes et al. [26].Many of these weaknesses arose from the absence of aformal specification, a weak threat model, and an excessiverequirement for per-router state (bounded only by the totalsize of the network). Herzberg and Kutten [27] presents anabstract model for Byzantine detection of compromisedrouters based on timeouts and acknowledgments from thedestination and possibly from some of the intermediaterouters to the source. The requirement of information fromintermediate routers offers a trade-off between faultdetection time and message communication overhead. Thistrade-off is analyzed within the given abstract model.

3 SYSTEM MODEL

Our work proceeds from an informed, yet abstracted,model of how the network is constructed, the capabilities ofthe attacker, and the complexities of the traffic validationproblem. In this section, we describe and motivate theassumptions underlying our model.

3.1 Network Model

We consider a network as consisting of individual routersinterconnected via directional point-to-point links. In usingthis model, we purposely ignore several real-life complex-ities. For example, real routers are not homogeneous nodes,but in fact represent a collection of independent network

2 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

Page 3: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

interfaces which are themselves interconnected and are, inpractice, addressed and controlled distinctly. We haveeliminated this detail for the convenience of description;our analysis can easily accommodate this expanded model.Similarly, we have chosen to ignore the possibility ofbroadcast channels in our model since they are rare intoday’s wired networks and can be easily incorporated ascollections of point-to-point links (although there may beopportunities for additional optimization in broadcastenvironments).

We assume that each router has sufficient data processingcapability to generate traffic summaries describing thenetwork traffic they are forwarding. For the most precisesummary functions this may involve touching all packetcontent. This is well within the capabilities of modern ASICs,but might require sampling in software implementations.We also assume that each router has sufficient computa-tional capability to exchange and reconcile summaries withits neighbors. We discuss the issue of overhead further inSection 4 and Section 9, but, in general, our algorithms areefficient and practical for existing router CPUs.

Within a network, we presume that packets are for-warded in a hop-by-hop fashion—each router following thedirections of a local forwarding table. As well, we assumethat these forwarding tables are updated via a distributedlink-state routing protocol such as OSPF or IS-IS. This iscritical, as we depend on the routing protocol to provideeach node with a global view of the current networktopology, which we assume to be consistent. Indeed, thisassumption is valid in practice since studies of OSPFbehavior have shown that in-network topology changes areextremely rare [28], [29]. When topology changes do occurthere may be transient inconsistencies between routers thatmake it impossible to determine whether traffic is beingforwarded correctly or not. Thus, if a compromised routerfrequently changes its connectivity, it may hide someforwarding attacks. For this reason, among others, anysecure network infrastructure also requires software todetect security violations or anomalies in the control plane[12], [13], [14], [15], [16], [17], [18], [19].

Finally, we assume the administrative ability to assignand distribute shared keys to sets of nearby routers. Thisoverall model is consistent with the typical construction oflarge enterprise IP networks or the internal structure ofsingle ISP backbone networks, but is not well-suited fornetworks that are composed of multiple administrativedomains using BGP.

At this level of abstraction, we can assume a synchronousnetwork model of synchronized clocks and boundedmessage delays. Our goal is to extend the routing protocolto detect compromised routers. If the network behavesasynchronously for too long, then the routing tables will beupdated, thereby changing the network topology. Thisassumption is common to all protocols we know of that haveaddressed the problem of detecting compromised routers.

We define a path to be a finite sequence hr1; r2; . . . rni ofadjacent routers. Operationally, a path defines a sequence ofrouters a packet can follow. We call the first router of thepath the source and the last router its sink; together, these are

called terminal routers. A path might consist of only onerouter, in which case the source and sink are the same.

An x-path segment is a sequence of x consecutive routersthat is a subsequence of a path. A path segment is an x-pathsegment for some value of x > 0. For example, if a networkconsists of the single path ha; b; c; di, then hc; di and hb; ci areboth 2-path segments, but ha; ci is not because a and c arenot adjacent.

We do not rely on source routing, as has been done bysome work in the past [11], [22], [27]. We do assume someknowledge of the path a packet will take, at least in thestable state. In link state protocols, this can be problematicbecause they can take advantage of multiple paths withequal cost for load balancing purposes. Fortunately, thecurrent generation of routers uses a deterministic hashalgorithm to spread the traffic load across availableinterfaces (e.g., Cisco Express Forwarding [30] and Juniperrouters with an Internet Processor ASIC [31]). Thus, a routercan predict the path a packet will take in the stable statebased on its own routing tables and the hash functions.

3.2 Threat Model

We assume that attackers can compromise one or morerouters in a network and may even compromise sets ofadjacent routers as well. In general, we parameterize thestrength of the adversary in terms of the maximum numberof adjacent routers along a given path that can becompromised. However, we assume that between any twouncompromised routers that there is sufficient pathdiversity that the malicious routers do not partition thenetwork. In some sense, this assumption is pedantic since itis impossible to guarantee any network communicationacross such a partition. Another way to view this constraintis that path diversity between two points in the network is anecessary, but insufficient, condition for tolerating compro-mised routers. What we are proposing is a set of protocolsthat offer the sufficiency condition in the presence of thenecessary diversity.

Recently, Teixeira et al. [32] empirically measured pathdiversity in ISP networks and found that multiple pathsbetween pairs of nodes were common. Similarly, manyenterprise networks are designed with such diversity inorder to mask the impact of link failures. Consequently, webelieve that this assumption is reasonable in practice.

It is worth noting however, that this diversity usuallydoes not extend to individual hosts on local-area net-works—single workstations rarely have multiple paths totheir network infrastructure. In these situations, for fate-sharing reasons, there is little that can be done. If the host’saccess router is compromised, then the host is partitionedand there is no routing remedy even if an anomaly isdetected; the fate of individual hosts and their accessrouters are directly intertwined. Moreover, from thestandpoint of the network such traffic originates from acompromised router, and, therefore, cannot demonstrateanomalous forwarding behavior.1 To summarize, ourprotocols are designed to detect anomalies between pairs

M�ZRAK ET AL.: DETECTING AND ISOLATING MALICIOUS ROUTERS 3

1. This issue can be partially mitigated by extending our protocol toinclude hosts as well as routers, but this simply pushes the problem to endhosts. Traffic originating from a compromised node can be modified beforeany correct node witnesses it.

Page 4: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

of correct nodes. Thus, for simplicity, we assume that aterminal router is not faulty with respect to trafficoriginating or being consumed by that router.

Finally, since link-state protocols operate by periodicallymeasuring and disseminating information, we assume asynchronous system. The failure of a router is defined interms of an interval of time, which in practice correspondswith a period of time during which traffic measurementsare made. Specifically, a router r is traffic faulty with respectto a path segment � during � if � contains r and, during theperiod of time � , r exhibits anomalous behavior with respectto forwarding data that traverses �. For example, router rcan selectively alter, misroute, drop, reorder, or delay thedata that flows through �, and it can fabricate new data tosend along � such that the packets, if they were valid,would have been routed through �.

4 TRAFFIC VALIDATION

The first problem we address is traffic validation: whatinformation is kept about packet traffic and how it is usedto determine that a router has been compromised.

We assume that a compromised router can arbitrarilyalter the forwarding behavior of that router. For example, acompromised router can drop or modify selected (or all)packets, or divert them to other routers. However, given thedistributed nature of packet forwarding, such bad behaviorcan be detected. For example, suppose traffic traverses apath hr1; r2; r3i and router r2 modifies this traffic. Ifrouters r1 and r3 are not compromised, then one couldsimply compare what r1 sent to r2 and what r3 receivedfrom r2 to detect r2’s behavior.

At an abstract level, we represent traffic validationmechanisms as a predicate TV ð�; infoðri; �; �Þ; infoðrj; �; �ÞÞ,where:

. � is a path segment hr1; r2; . . . rxi whose traffic is tobe validated between routers ri and rj 2 �.

. infoðr; �; �Þ is information about traffic that router rforwarded to � over some time interval � .

. If routers ri and rj are not faulty, then

TV ð�; infoðr; �; �Þ; infoðrj; �; �ÞÞ

evaluates to false iff � contains a router r that wasfaulty in forwarding traffic along � during � , androuter r is between routers ri and rj within �.

Implementing a traffic validation mechanism is a trickyengineering problem. The simplest, and most precise,representation of infoðr; �; �Þ is a complete copy of thepackets sent, with each packet tagged with the time it wasforwarded. However, the storage requirements to bufferthese packets and the bandwidth consumed by resendingthem, make this approach impractical at best. In practice,implementing traffic validation is a trade off betweenaccuracy and overhead. For example, sampling can be usedto decrease overhead, but depending on how sampling isdone and the duration of the attack, accuracy may bereduced. The overhead-accuracy trade-off also depends onwhat limits one might place on the kind of bad behavior acompromised router can exhibit.

Similarly, TV could be implemented simply usingequality infoðri; �; �Þ ¼ infoðrj; �; �Þ. However, real net-

works occasionally lose packets due to congestion, reorderpackets due to internal multiplexing, and corrupt packets

due to interface errors. Consequently, TV needs to besomewhat more sophisticated to accommodate this abnor-

mal, but nonmalicious behavior. Thus, implementing traffic

validation involves striking a balance between the accep-table number of false positives and false negatives.

Note that we have already made one engineering decision:

We describe aggregate traffic rather than individual packets.This is in contrast to some prior works, e.g., Perlman [11],

Herzberg et al. [27], and Avramopoulos et al. [22]. While wecompute state over each packet locally, limiting distributed

reconciliation to traffic aggregates amortizes the commu-nication and synchronization overhead (otherwise, prohibi-

tive) across many packets. This also makes it feasible to applya threshold mechanism to distinguish between acceptable

bad behavior (e.g., small amounts of packet loss andreordering) and malicious behavior.

4.1 Characterizing Traffic

While the most precise description of traffic sent into thenetwork is an exact copy of that traffic, many characteristics

of the traffic can be summarized far more concisely. Inparticular, if we concentrate on particular threats—ways in

which a malicious entity might alter the traffic—we canlimit our effort to detecting just those actions.

One limitation of previous work has been that they

assumed limited kinds of threats. For example, WATCHERS[26] and Herzberg et al. [27] effectively detect only packet

losses that result from a compromised router. In all of the

previous work, reordering attacks have been ignored, eventhough reordering of packets can effect TCP performance

tremendously [33], [34]. It would not be hard to extendSecure Traceroute [21] to include this kind of attack, but it

cannot be done with the protocols of Perlman [11],Herzberg et al. [27] and Avramopoulos et al. [22] since

they monitor a single packet at each round. We insteaddivide up arbitrary behavior into five different threats:

1. Packet loss. A compromised router can drop anysubset of the packets. As per Almes et al. [35], losscan be measured as the amount of data arriving atthe sink of path segment subtracted from the amountof data sent from its source.

2. Packet fabrication. A compromised router can gen-erate packets and inject them into the traffic stream.This can be measured as the number packets whichare reported at the sink of a packet segment but notmonitored as being sent by its source.

Misrouting packets can be considered an instanceof both packet loss and packet fabrication.

3. Packet modification. One can consider this threat as acombination of packet loss and fabrication, but itmay not be detectable by simply comparing thenumber of packets arriving at the sink with thenumber sent from the source. Instead, some sum-mary of the content needs to be maintained, and onemeasures the number of modified packets.

4 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

Page 5: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

4. Packet reordering. A compromised router can reorderpackets. Doing this can lead to performance pro-blems or, in the extreme, denial of service. There aremany reasonable and incompatible methods ofmeasuring the amount of reordering, e.g., [33], [34],[36], [37]. We use the definition from [36] because ofits simplicity: Given a transmitted stream S and areceived stream F , remove from both all lost,fabricated, and modified packets. Then, find thelongest common subsequence ‘ between thesemodified S and F . The amount of reordering isdefined as jSj � j‘j.

5. Time behavior. A compromised router can delaytraffic. Like reordering, doing this can lead toperformance problems or, in the extreme, denial ofservice. There are simple metrics one can use, suchas the first n moments of the interpacket delaydistribution. However, such metrics are notoriouslysensitive in packet networks; we believe this arearequires more research.

These threats completely cover the set of bad behaviors a

router can exhibit in forwarding data. When all of these

metrics are zero, then no router is forwarding traffic in a

faulty manner.

4.2 Traffic Summary Functions

We have implemented three traffic summary functions,

each reflecting a different trade off among accuracy,

completeness, and threat. They are:

. Conservation of Flow. Our first summary functionresembles the “conservation of flow” approach usedby the WATCHERS protocol [25]. The traffic informa-tion infoðri; �; �Þ collected by router ri consists oftwo counters, each counting the number of packetsin one of the two directions of traffic along thepath �. Periodically, ri passes these counters to otherrouters collecting information about �. Trafficvalidation is done by comparing the values of thesecounters. In general, this is a fragile summaryfunction because it only detects actions that causepacket losses, and it assumes that malicious routerscannot fabricate packets to “fudge” the countsappropriately. However, it is extremely cheap toimplement, both in the per packet cost and in theassociated overhead to communicate the trafficinformation among routers. It might be possible toextract such information from existing traffic analy-sis tools, such as Cisco’s Netflow [38], withoutrequiring router modifications although we havenot attempted this (there are a number of complex-ities in how Netflow manages flow records that posenontrivial synchronization challenges).

. Conservation of content. To detect modification ofpackets, we can use a fingerprint (that is, a hash), ofthe payload in place of a simple counter. Analogousto conservation of flow, each router then periodicallycommunicates a set of packet fingerprints with otherrouters for traffic validation, which is calculated viaset difference. In addition to detecting packet

modification, this approach also detects packet loss,packet fabrication, and misrouting.

One technical difficulty with this approach is thatpackets are naturally modified as they traverserouters. In particular, the TTL field in the IP headeris decremented and the IP header checksum isupdated. We address this by having each routercompute fingerprints based on the TTL value andchecksum at one end of the path �.

One downside to this approach is that it requiresstoring and communicating a fingerprint for eachpacket forwarded by a router. This is a significantoverhead. One can save significant space andbandwidth by using more sophisticated algorithmsfor calculating set differences. The simplest ap-proach is to simply use Bloom filters [39] torepresent the set of fingerprints. One can then usethe population of the bitwise difference between thefilters to calculate the size of the set difference. Thisapproach is far cheaper to implement, but comes atsome expense in accuracy. More problematic is thatit is difficult to know in advance the appropriateparameters for the Bloom filter. A too-small filter canresult in significant errors in estimation. A morepromising approach is to leverage distributed setreconciliation algorithms [40]. This approach hasgreater computation overhead than Bloom filters,but it is optimal in bandwidth utilization [40].

. Conservation of content order. Packet reordering can bedetected by maintaining ordered lists of packetfingerprints rather than simply sets. Like withconservation of content, this can result in a sig-nificant storage overhead. Traffic validation can thenbe done using the metric defined in Section 4.1. Thishas a higher overhead than simply computing thesize of the set difference. We have not yet looked intomethods that would reduce the storage or computa-tional overheads.

4.3 Detection Accuracy

Any traffic validation mechanism presents the potential forfalse negatives—attacks which are not detected—and falsepositives—benign traffic that is incorrectly identified asinvalid. While the strengths and weaknesses of specificmechanisms are beyond the scope of this paper (our focushere is on the general problems posed by distributeddetection and the remainder of the paper considers TV tobe a black box), the overall issues bear some discussion. Theissue of false negatives is ultimately tied to the precisionand completeness of the validation technique used. Asystem based on conservation of flow could not expect todetect attacks that modify packet content without changingtheir number. Similarly, false positives relate to how wellthe validation mechanism can model benign networkbehavior. For example, the TV mechanism must accountfor IP packets whose TTL field is set to 1 and will becorrectly dropped at the next router. Similarly, there aresome cases where traffic is legitimately tampered by therouters such as lawful intercept [41] and GRE tunneling[42]. In these cases, a traffic validation mechanism would

M�ZRAK ET AL.: DETECTING AND ISOLATING MALICIOUS ROUTERS 5

Page 6: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

also need to model the associated overlay topology andclassification rules.

Ultimately, there is little data about the prevalence ofdifferent router attacks and how different validationmechanisms would interact with these. Instead, we haveattempted to describe attacks and associated validationmechanisms in a generic fashion. The detailed design oftraffic validation mechanisms and their empirical evalua-tion remains an ongoing research activity.

5 DISTRIBUTED DETECTION

The second problem is synchronizing the collection oftraffic information and distributing the results for detectionpurposes. The solution to this problem is a protocol and, so,we consider specifications of the problem as well asimplementation.

Since routers collect the information upon which trafficvalidation is based, there will be some uncertainty indetermining which router is faulty. For example, considertraffic that traverses a path containing the sequence ofrouters hr1; r2; r3i. Suppose router r3 receives 10 packetsfrom r2 and r1’s traffic information state that r1 sent20 packets to r2. There’s no way for r3 to determine whetherr2 did indeed drop 10 packets, or whether r1 is misreportingthe traffic.

We cast the problem as a failure detector with accuracyand completeness properties. This failure detector reportssuspicions as path segments, which are sequences ofadjacent routers. More specifically, a failure detector reportsa path segment � if it suspects a router in � is forwardingtraffic along � in a faulty manner. A failure detector also hasa precision, which is the maximum length of a path segmentit suspects.

5.1 Specification

Due to the nature of the problem, our specification is morecomplex than the typical accuracy and completenessproperties of an imperfect failure detector [43]. Since thisdetector is based on evaluating traffic collected over aperiod of time, we have the failure detector reportpairs ðr; �Þ, which means that r was suspected as beingfaulty during the time interval � . A perfect failure detectorwould implement the following two properties:

1. Accuracy (tentative): A failure detector is accurate if,whenever a correct router suspects ðr; �Þ, then r wasfaulty during � .

2. Completeness (tentative): A failure detector is completeif, whenever a router r is faulty at some time t, thenall correct routers eventually suspect ðr; �Þ for some� containing t.

As noted above, though, our failure detector is based ontraffic information collected by untrustworthy routers. Thisresults in detections that are imprecise: we may not be ableto pin down which router is compromised. So, we have thefailure detector return a pair ð�; �Þ, where � is a pathsegment. This also allows us to give more information: Werestrict the detection to a router being faulty with respect toforwarding traffic along �:

. a-Accuracy: A failure detector is a-Accurate if,whenever a correct router suspects ð�; �Þ, then j�j �a and some router r 2 � was faulty in � during � .

. a-Completeness (tentative): A failure detector is a–Complete if, whenever a router r is faulty at sometime t, then all correct routers eventually suspectð�; �Þ for some path segment � : j�j � a such that rwas faulty in � at t, and for some interval �containing t.

Note that a faulty router can report an incorrect value forthe traffic that traversed a path as well as alter this traffic.We use the term t-faulty (that is, traffic faulty) to indicate arouter that alters traffic and the term p-faulty (that is, protocolfaulty) to indicate a router that misreports traffic. A faultyrouter is one that is t-faulty, p-faulty, or both. As before, wewill add the phrase “in �” to indicate that the faultybehavior is with respect to traffic that transits the path �.Thus, the a-Accuracy requirement can result in a detection ifa router is either p-faulty or t-faulty.

Distinguishing between p-faulty and t-faulty behavior isuseful because, while it is important to detect routers thatare t-faulty, it isn’t as critical to detect routers that are onlyp-faulty: Routers that are only p-faulty are not altering thetraffic flow. Hence, we weaken a-Completeness:

. a-Completeness: A failure detector is a-Complete if,whenever a router r is t-faulty at some time t, thenall correct routers eventually suspect ð�; �Þ for somepath segment � : j�j � a such that r was t-faulty in �at t, and for some interval � containing t.

One cannot depend on faulty servers to detect faultyservers. Hence, failure detection is influenced more by themaximum number of adjacent faulty routers rather than thetotal number of faulty routers. So, we impose an upperbound badðkÞ on the number of adjacent faulty routers. Forexample, if badð3Þ holds, then there can be no more thanthree adjacent faulty routers in any path.

Making a badðkÞ assumption has an effect on failuredetection. Allowing compromised routers to be adjacentcomplicate detection further, since they can cooperate to hidethe evidence that a router is faulty. For example, assume thatbadð3Þ holds, and consider a path hr1; r2; r3; r4; r5; r6; r7i inwhich r3; r4, and r5 are faulty. Suppose that over an interval� , r1 through r5 report having forwarded 100 packets thatwere to traverse this path, and r6 and r7 reports only 20 suchpackets. Let r1 obtain these counters. It could be the case thatr4 dropped the traffic, and r3 and r5 misreported the traffic inan effort to hide the fact that r4 is faulty. From r1’s point ofview, however, something is wrong with either r5 or r6, sincethe counters indicate that traffic has disappeared betweenthem. To satisfy a-Completeness, p1 needs to detect anypossible routers that could have led to this traffic discre-pancy. So, the failure detector at r1 could report that the pathsegment hr3; r4; r5; r6i contains a faulty router, since if r5 isfaulty, then r3 and r4 could be as well, given badð3Þ. That is,we could have the failure detector report a kþ 1-length pathsegment to accommodate the fact that badðkÞ holds.

Another way to accommodate this kind of scenario isto weaken a-Completeness. In the example just given, wecould have r1 just suspect the path segment hr5; r6i. A

6 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

Page 7: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

countermeasure protocol would stop routing data throughthis path, and if r3 or r4 continue to behave in a faultymanner, then they would be suspected later. We for-malize this approach by defining a condition we call

being fault connected. Given a path segment � and aninterval � , we say that a router r is fault connected torouter s with respect to � if both r and s are in �, and allof the routers between s and r are faulty in � during � .

Trivially, any router r is fault connected to itself even if ris correct. We then weaken a-Completeness again:

. a-FC Completeness: A failure detector is a-FCComplete if, whenever a router r is t-faulty at sometime t, then all correct routers eventually suspectð�; �Þ for some � : j�j � a and some � containing tsuch that there is a router r0 that was faulty in � attime t0 in � and is fault-connected to r.

The choice between the two completeness properties isnot obvious. One would expect the precision obtainable

under a-FC Completeness to be better, but it could increasethe latency in detecting some t-faulty routers. If the value ofk for which badðkÞ holds is not large, then a-Completeness isprobably a better choice because of its simplicity.

Since we are assuming arbitrarily faulty routers, we have

to allow faulty router to suspect correct routers. We addressthis issue in the countermeasure protocol: as with WATCH-ERS, the only suspicion that elicits a countermeasure iswhen a router r suspects a path segment that is adjacent tor. In this case, the countermeasure protocol will cause r not

to route traffic along the suspected path segment. A faultyrouter can already drop packets and, so, allowing a faultyrouter to break links with its neighbor (as a result of faultysuspicions) adds no further disadvantage.

Using this terminology, Perlman [11] protocol is

M-Accurate and M-Complete, where M is the maximumlength of a path between any pair of routers. All other priorprotocols we know of are 2-Accurate. However, thedetection is attained at only the source router in Perlman[11], Avramopoulos et al. [22], and Secure Traceroute [21]. If

other routers learn of the fault from the detecting router r,then such a router has to consider the possibility that r is p-faulty. WATCHERS [26] is neither Complete nor FC Com-plete, due to a flaw, which we have shown in [44]. The

claims in [21] indicate that Secure Traceroute is 2-Accurateand 2-FC Complete. But, the description of the protocol(especially on how the overhead is reduced by pretendingto monitor all prefixes of a path segment from a source to adestination) is not detailed enough for us to be confident of

the actual accuracy and completeness of the protocol.Herzberg et al. [27] and Avramopoulos et al. [22] protocolsare 2-FC Complete.2

5.2Q

2 : A 2-FC Complete and 2-Accurate Protocol

Given a complete and accurate traffic validation mechan-ism, an obvious approach is to have each router collecttraffic information over some agreed-upon interval, andthen use consensus to have all correct routers agree upon

the traffic information. With this information, each routercan agree upon which routers might be faulty.

For example, consider the 4-path segment

� ¼ hr1; r2; r3; r4i;

where r1 and r4 are not faulty. Let infoði; �; �Þ be the trafficinformation that router ri collects over during the agreedupon time interval � for the path segment �. If at least one ofthe other routers is t-faulty with respect to � during thisinterval, then TV ð�; infoð1; �; �Þ; infoð4; �; �ÞÞ will be false.This implies that for some i, TV ð�; infoði; �; �Þ; infoðiþ1; �; �ÞÞ is false, which means that at least one of routers riand riþ1 is faulty. Since infoði; �; �Þ and infoðiþ 1; �; �Þweredisseminated using consensus, all correct routers will knowthat at least one of fri; riþ1g is faulty.

We use these observations to construct a 2-Accuratefailure detector protocol. The first issue to address is overwhich paths a router should record information about. Notethat it will need to run an instance of the protocol for eachpath about which it records information. An obviousanswer—over each data stream’s path—could result in anenormous set of paths. We can make the set smaller byhaving each router keep track of each x-path segment ofwhich it is a member, for some value of x. The number ofx-path segments can grow very quickly with increasing x,and, so, x should be as small as possible. It must be largeenough so that any sequence of faulty routers will besurrounded by correct routers since this is necessary todetect faulty behavior.

If we assume that badðkÞ holds, then the minimum valueof x satisfying the above constraint is kþ 2. Note that givensource and destination may not be separated by at leastk routers, and, so, a router also collects information aboutall paths of which it is a member whose length is less thankþ 2.

The resulting protocolQ

2 is shown in Fig. 1. In thisprotocol, each router r maintains the current topology Tfrom which it derives its routing table. Each router r alsomaintains a set of path segments Pr that contain r and thatr monitors. A router r runs a thread for each path segmentin Pr. Pr, which is computed from T , contains all ðkþ2Þ-path segments containing r and all x-path segments, 3 �x < kþ 2 whose ends are terminal routers.

For each path segment � 2 Pr, r synchronizes with theother routers in � and collects information for the sametraffic passing through � for an agreed-upon interval � .Periodically, r sends that traffic information to all routers in� using consensus. This data is digitally signed to preventan attack during consensus. We use ½x�i to indicate that x isdigitally signed by i.

Consider the traffic passing through a path segment �.The traffic will be consistent—that is,

TV ð�; infoði; �; �Þ; infoðiþ 1; �; �ÞÞ

will be true—for each pair of routers hi; iþ 1i in � unless adiscrepancy is introduced by a faulty router. In otherwords, if TV ð�; infoði; �; �Þ; infoðiþ 1; �; �ÞÞ is false then atleast one of the two routers i or iþ 1 is faulty. Note that itcould either be t-faulty or p-faulty (because it reports trafficinformation that does not represent the actual traffic that

M�ZRAK ET AL.: DETECTING AND ISOLATING MALICIOUS ROUTERS 7

2. However, the detection has been done only at the source router. Notevery router in the network agree on the detection as we specified above.

Page 8: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

transited during �). In either case, a correct router r in � willput the 2-path segment hi; iþ 1i into suspect�r ½�� set, andreliably broadcast the evidence of the failure detection:(½infoði; �; �Þ�i, ½infoðiþ 1; �; �Þ�iþ1). Upon receiving thisinformation, all other correct routers will evaluateTV ð�; infoði; �; �Þ; infoðiþ 1; �; �ÞÞ as false and detect thefault on the 2-path segment hi; iþ 1i.Q

2 is given in Fig. 1. In Appendix 1.2, we show thatQ

2

is 2-Accurate and 2-FC Complete.

5.3Q

kþ2 : A ðkþ 2Þ-Complete and ðkþ 2Þ-AccurateProtocol

Q2 has considerable requirements in terms of collecting

traffic information, synchronization, and consensus. Theserequirements can be avoided by making the detection beless accurate.

The idea is to apply TV just for the end nodes of eachpath segment in Pr. For example, consider the 4-pathsegment � ¼ hr1; r2; r3; r4i, where r1 and r4 are not faulty.Let infoðr1; �; �Þ; infoðr4; �; �Þ be the traffic information thatrouter r1 and r4 collect during � . If at least one of the otherrouters is t-faulty with respect to � during this interval, thenTV ð�; infoðr1; �; �Þ; infoðr4; �; �ÞÞ will be false. In this case,r1 suspects hr2; r3; r4i. Similarly, r4 suspects hr1; r2; r3i.

For this protocol, a router need not monitor as many pathsegments as with

Q2 . Instead, a router needs only keep

track of each x-path segment of which it is one of the endnodes, for some value of x. The number of x-path segmentscan grow very quickly with increasing x, and, so, x shouldbe as small as possible. It must be large enough so that anysequence of faulty routers will be surrounded by correctrouters, as this is necessary to detect faulty behavior.

If we assume that badðkÞ holds, then the minimum valueof x satisfying the above constraint is kþ 2. However,monitoring only kþ 2-path segments is not sufficient. Atrivial reason this is so is that not all path segments need be

kþ 2 long. A more substantial reason is that compromisedrouters may hide another router’s bad behavior. Forexample, given that badð2Þ holds, consider the 4-pathsegment � ¼ hr1; r2; r3; r4i, where r1 and r3 are correct andr2 and r4 are faulty. In this case, r1 and r4 monitors �, but r4

can hide the fact that r2 is t-faulty by simply sending trafficinformation to r1 such that TV ð�; infoðr1; �; �Þ; infoðr4; �; �ÞÞholds. If r1 were to instead also monitor the path hr1; r2; r3i,then r1 could detect r2’s faulty behavior. So, it is necessaryfor a router r to monitor all x-path segments for 3 � x �kþ 2 of which r is an end.

For each path segment � 2 Pr, r synchronizes with theother end router of � and collects information for the trafficpassing through � during an agreed-upon interval � .Router r then exchanges digitally signed traffic informationwith the router r0 on the other end. If the exchangeoperation fails within a prespecified timeout interval �, orif r finds TV ð�; infoðr; �; �Þ; infoðr0; �; �ÞÞ is false, then thereis at least one faulty router in � during � . In particular,either r0 is p-faulty or some router in � is t-faulty. So, rdetects �� hri. However, when it announces this detectionto the other routers, a correct router receiving thisinformation suspects � since r might be faulty. Forsimplicity, we also have router r suspect �.Q

kþ2 is given in Fig. 2. In Appendix 1.3, we show thatQkþ2 is (k+2)-Accurate and (k+2)-Complete.

6 RESPONSE

Once a path segment � is detected as containing compro-mised routers, some countermeasure should be taken. Ofcourse, the most important countermeasure is to log thesuspicion and alert the administrator of the affected routers.Less obvious is how the routers should react to a detection.

Suppose that some path segment � is detected. Anobvious countermeasure would be to remove all of the

8 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

Fig. 1.Q

2 : A 2-FC Complete and 2-Accurate protocol.

Page 9: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

routers in � from the routing fabric. By doing so, we avoidusing any router that has been suspected of beingcompromised. Doing this, though, could also have aserious, and perhaps unnecessarily high, impact on net-work performance. A less aggressive countermeasurewould be to only remove the path segment � from therouting fabric. In doing so, we may allow compromisedrouters to keep forwarding packets, but only along pathsover which no faulty behavior has been observed.

We have chosen the second approach because of its lessdisruptive behavior. If only a single interface is compro-mised (today’s interfaces are effectively their own CPUs),then only the path segments incident on that interface willbe excluded. If a router is disrupting traffic along severalpaths, then each of these paths will be separately detectedand then routed around. Finally, if a router is uniformlymalicious (i.e., causes traffic validation to fail for all trafficpassing through it) then all intersecting path segments willbe excluded and the router will be completely isolated.

7 SYSTEM ARCHITECTURE

We have implemented a prototype system, called Fatih, thatincorporates our approach into a Linux 2.4-based routerplatform running OSPF. We have implemented a variety oftraffic validation mechanisms mentioned in Section 4,including conservation of flow, content, and content order,as well as the

Qkþ2 distributed detection algorithm

explained in Section 5.3. The system architecture, shownin Fig. 3, consists of five principal components.

7.1 FATIH Coordinator

This module implements the distributed detection algo-rithm, namely,

Qkþ2 , and carries out the general scheduling

and the communication with the Routing Daemon and theTraffic Summary Generator.

1. Based on the network topology exported by theRouting Daemon, the Coordinator decides which

path segments to monitor depending on the systemparameter k (Section 5). We have configured ourprototype with k ¼ 1, thus each router monitors all3-path segments originating from itself. Besidesimplicity, our implementation focuses on this pointof the design space because it reflects the mostcommon capabilities available to an attacker. For anadversary to compromise adjacent routers in amanner that they attempt to conceal their actions(i.e., k > 1) is considerably more difficult as thisrequires an adversary to modify the executableprotocol code running on the routers.

M�ZRAK ET AL.: DETECTING AND ISOLATING MALICIOUS ROUTERS 9

Fig. 2.Q

kþ2 : A ðkþ 2Þ-Complete and ðkþ 2Þ-Accurate Protocol.

Fig. 3. Fatih system architecture.

Page 10: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

2. The Coordinator receives information collected bythe Traffic Summary Generator, determines the pathbeing traversed, and delivers the received informa-tion to the corresponding Traffic Validator modules.

3. Finally, the Coordinator, schedules and synchro-nizes validation rounds among the Traffic Valida-tors. In the current implementation, rounds areconfigured at a five second interval. A longer timeinterval requires more traffic summary state to bemaintained, while a shorter time interval placesmore stringent synchronization requirements on thesystem.

Traffic Validator. For each monitored path segment, thereexists one corresponding Traffic Validator module, whichkeeps state for the traffic sent and received during the lasttime interval. At the end of each round, it exchangessummary information with its corresponding peers anddecides whether any discrepancies exceed acceptable limits.Messages between traffic validation modules on differentrouters is via an authenticated TCP connection (usingmanually configured shared keys in the current prototype).If a Traffic Validator does not receive any traffic informationfrom its peer within a timeout interval, or if it decides that atraffic discrepancy is excessive, the path segment isidentified as “suspicious” to the Routing Daemon.

7.2 Traffic Summary Generator

As described in Section 4, the Traffic Summary Generatorupdates traffic information with each forwarded packet.Depending on the underlying validation mechanism, thegenerator computes a packet fingerprint using the UHASHfunction [45] and associates a timestamp with it. Theperformance of this module is critical and, therefore, wehave implemented it in the kernel to avoid unnecessarycopies of packet contents. Traffic summaries are ultimatelypassed back to the Coordinator.

7.3 Link-State Routing Daemon

As described in Section 3, Fatih cooperates with a standardlink state routing protocol. The Routing Daemon, which isbased on Zebra [46] in the current prototype, implementsthe OSPF [47] protocol and manages link state announce-ments, shortest path computation and forwarding tablecalculation and installation. In addition, we have modifiedthe protocol to incorporate input from Fatih. When theRouting Daemon receives an alert from the Coordinator (orvia link-state updates from other routers), it recomputesnew shortest path routes to avoid any suspicious pathsegments identified. This alert is then flooded via the link-state update mechanism to allow other routers to excludethe suspicious path segments as well.

However, it is not possible to reflect these routingchanges in a single forwarding table update because arouter, in the middle of a suspicious path segment �, mightneed to forward traffic traversing a prefix of �, but destinedfor an path which is not a suffix of �. To allow thisdistinction, we exploit policy-based routing to forward trafficusing a combination of the source and destination ad-dresses. The source address is used, effectively, to select theparticular path segment prefix upon which a packet hasbeen delivered. The coordinator is kept abreast of routingchanges so it always knows which path segments should bemonitored, which peers to synchronize with and, so, thesource address can be efficiently mapped to a particular

path segment and forwarding table.

7.4 Packet Forwarding

Linux supports policy-based routing through the use ofmultiple forwarding tables and an associated routing policydatabase. The database defines the criteria used to decidewhich forwarding table should be used to look-up a packet.For example, in our environment, a router maintains adistinct forwarding table for each detected suspicious pathsegment that the current router is in the middle.

7.5 Time Synchronization

Fatih requires routers to have clocks synchronized closelyenough so that there is not a significant disagreement overthe intervals that traffic information is collected. For ourpurposes, clocks that are synchronized within a fewmilliseconds is sufficient. We use NTP [48] to synchronizethe routers clocks.

8 EXPERIENCES

In this section, we describe the behavior of Fatih in asimulated network environment. While this methodology issufficient to capture the gross behavior of Fatih in adistributed setting, it is not detailed enough to predictFatih’s precise performance in an actual deployment.Similarly, our network traffic load is simulated as welland, thus, any end-to-end performance measurementscould be misleading. Instead, we describe overhead on a“component” basis—fingerprint computation, router state,and synchronization overhead—and then explore how theymight scale and be combined in the next section.

We have chosen a topology based on the Abilenenetwork [49], Fig. 4a, because its structure, link delays,bandwidths, and link-state metrics are all public. As well,the topology has sufficient connectivity to demonstrate thedynamics of Fatih during an attack.

We represent each Abilene Point of Presence (POP) as asingle router and configure the link delay and routingmetrics accordingly. Each of these routers is in turnemulated by a User-Mode Linux [50] virtual machine,configured with 64MB of memory and implementing theFatih system architecture as described earlier. The routersare interconnected through Ethernet bridging in the Linuxhost operating system, modified to emulate configured linkdelays. The host system is a 2.6Ghz Pentium 4 server with1GB of physical memory. Although our emulation testbed isincapable of processing the traffic volume of a real Internetbackbone, it handles sufficient traffic to demonstrate Fatih’skey behaviors.

Fig. 4b demonstrates Fatih in progress. Time is shown onthe x-axis in seconds, different routers are shown on the lefty-axis, and the right y-axis is used for the latency measuredbetween the New Y ork and Sunnyvale during the experi-ment. At the beginning of the experiment, each routerdiscovers its immediate neighbors, transmits and receivesOSPF link state updates, and computes new routing tableswith the most recent link state database. After roughly55 seconds, all routers have agreed on a common networktopology and a stable forwarding path is available betweenall pairs of routers.

10 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

Page 11: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

At this point in time, we inject a synthetic traffic load intothe network and initiate round trip time (RTT) measure-ments between New Y ork and Sunnyvale. Initially, theforwarding path between these routers is

hSunnyvale;Denver;Kansas City;Indianapolis; Chicago;New Y orki

with a configured one-way latency of 25 ms in theconfiguration files. As expected the measured round-triptime is roughly 50 ms.

At roughly 117 seconds, our simulated attacker compro-mises the Kansas City router and modifies its behaviorsuch that 20 percent of its transit traffic is dropped oraltered. We chose the Kansas City router as a victimbecause most intercoastal traffic traverses it and is thereforea obvious target. Using the Fatih protocol, the pathsegments

hDenver;Kansas City; Indianapolisi;hDenver;Kansas City;Houstoni; and

hHouston;Kansas City; Indianapolisi

are all validated every � ¼ 5 seconds by their terminalrouters. Thus, at the end of the current traffic validationround (about 3 seconds after the attack), Denver, Houston,and Indianapolis detect that traffic through these monitoredpath segments is inconsistent and notify their OSPF routingdaemons.

There are two parameters of the OSPF routing protocolthat affect the following events. OSPF_delay_time is thetime passed before computing a new routing table as aresult of a triggering event (e.g., a new link-state updatemessage or an alert, as in this case). OSPF hold time is thetime passed before any consecutive routing table computa-tions. These values default to 5 seconds and 10 seconds,respectively, in the Zebra OSPF implementation. As aresult, an additional 15 seconds pass before the detectedtraffic inconsistency causes the associated path segments tobe removed from the routing topology. At roughly135 seconds, this process completes and the path betweenNew York and Sunnyvale is changed to

hSunnyvale; Los Angeles;Houston;Atlanta;Washington DC;New Y orki:

This new path has a configured one-way latency of 28 ms,thus the measured RTT becomes 56 ms. Note that theKansas City router continues to operate, but its neighbor-ing routers will no longer forward traffic through it.

9 ANALYSIS OF THE PROTOCOLS

In this section, we consider the overhead of our approach.

9.1 Fingerprinting

Like any protocol that detects packet alteration [11], [21],[22], our Conservation of content and Conservation of contentorder traffic summary functions compute a fingerprint foreach packet. To be practical, computing a fingerprint musthave a low overhead. One possibility for fingerprinting isCRC32. This is already computed by most networkinterfaces at line rate and is a good uniform hash function.However, it is reversible and an adversary could createmodified packets that have the same hash. In somecircumstances, such as a compute limited adversary, itmight serve as a good fingerprinting function.

We implemented fingerprinting using UHASH which isan unkeyed version of the UMAC algorithm [45]. Thisalgorithm allows a trade off between security and perfor-mance and is designed to support parallel implementations.Black et al. [45] and Rogaway [51] demonstrated UHASHperformance of more than 1Gbps on a 700Mhz Pentium IIIprocessor computing a 4 byte hash value and delivering a2�30 forging probability (which is more than sufficient forour application). In hardware, of course this performancecould be increased still further.

Between software and hardware are emerging networkprocessors which are designed to perform highly parallelactions on data packets [52]. Recently, Feghali et al. [53]described an implementation of DES [54] and AES [55]3 onthe Intel IXP28xx network processors that was able to keep

M�ZRAK ET AL.: DETECTING AND ISOLATING MALICIOUS ROUTERS 11

3. DES and AES are two well-known encryption algorithms that provideconfidentiality.

Fig. 4. (a) Abilene network topology. (b) Fatih in progress.

Page 12: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

pace with a 10Gigabit/sec forwarding rate. Since DES andAES encryption algorithms are significantly more expen-sive to compute than functions such as UHASH, we caninfer that it is possible to compute these packet finger-prints up to 10Gbps.

However, if there are insufficient computational re-sources, one can easily trade off accuracy for overhead bysubsampling the packets to be considered. As in Duffieldand Grossglauser’s Trajectory Sampling [56], if the samerandom hash function is used to subsample packets at eachend of a path segment, then each router should observe thesame subset of packets. Further, each pair of routers is freeto select such sampling functions independently and neednot rely on a global secret.

9.2 State Size

One of the most important factors in terms of a protocolbeing practical is the size of the state at each router thatneeds to be maintained. Assume that we wish to providefault-tolerant forwarding between each source and destina-tion pair in the network. Let N be the number of routers inthe network,R be the maximum number of links incident ona router, and k is the value used in the badðkÞ assumption.Perlman [11], Herzberg et al. [27], and Avramopoulos et al.[22] require OðN2Þ state, since each individual routermaintains state for each (source, destination) pair. WATCH-ERS [26] reduces the state requirement to OðRNÞ, whereeach individual router keeps seven counters for each itsneighbors for each destination. In Secure Traceroute [21], if arouter monitors paths to all destinations, then OðNÞ space isrequired (assuming that there is only one path in active usebetween a given source and destination).

Our protocols require a router r to record state for eachpath segment Pr that it monitors. By construction, this isOðk�Rkþ1Þ for

Qkþ2 and OðminfRkþ1; NgÞ for

Qkþ2 . In

practice, though, we expect jPrj to be much smaller. Forexample, we have examined two network topologies,Sprintlink and EBONE, that were measured by the Rock-etfuel project [57]. We counted the number of distinct pathsegments that a router monitors for different values of k inbadðkÞ assumption. For instance, the Sprintlink networkconsists of 315 routers and 972 links. On the average, arouter has 6.17 links, and the maximum number of linksthat a router has is 45. In Fig. 5, the maximum, average, and

median of jPrj that a router monitors is given for thisnetwork for

Q2 and

Qkþ2 . For this network (as well as for

EBONE), the empirical results are much smaller than thetheoretical upper bounds.

As a point of comparison, for the Sprintlink network, onaverage, a router under WATCHERS would maintainapproximately 13,600 counters, and the maximum numberof counters that a router maintains is 99,205. With ourconservation of flow traffic summary function4 for

Qkþ2 , a

router maintains 232 counters on average and 496 in theworst case if badð2Þ holds. Even with the weak assumptionof badð7Þ, each router maintains 616 counters on averageand 626 in the worst case.

9.3 Synchronization

Protocols that validate aggregate traffic requires synchro-nization to agree on a time interval to collect trafficinformation. WATCHERS synchronizes all routers using asnapshot algorithm [58]. In

Q2 , for each path segment � in

Pr, a router r synchronizes with all the routers in � to agreeon when and for how long the next measurement interval �will be. It would probably be more efficient, though, to haveall the routers in the network synchronize with each otherinstead of having many more, smaller synchronizationrounds. Perfect synchronization would not be necessary inpractice, since the traffic validation function TV could bewritten to accommodate a small skew. The synchronizationrequirements for

Qkþ2 are lower than for

Q2 . As for each

path segment � that a router r monitors, r needs to agreewith only the other end router r0 of �. This is the samerequirement for Secure Traceroute.

9.4 Information DisseminationQ

2 requires each router in path segment � to reachconsensus about the traffic information collected over �during a time interval � . To do so necessitates digitallysigning the traffic information, since, otherwise, thereplication is not high enough for consensus to be solvable.Thus, there is an issue of key distribution depending on thecryptographic tools that are used. Finally, there must beenough path connectivity among the routers to supportconsensus [59]. For

Qkþ2 , in order to exchange traffic

information, neither Consensus nor the good neighbor

12 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

4. This requires two counters per path segment, one for each direction.

Fig. 5. Based on badðkÞ; maximum, average, and median size of Pr that a router monitors in (a)Q

2 and (b)Q

kþ2 .

Page 13: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

condition of WATCHERS is required (good neighbor requiresthat each compromised router is adjacent to a noncompro-mised router). Furthermore, with

Qkþ2 , the end routers can

use the same path segment they are monitoring to exchangetraffic information. This is because if an intermediate routerwere to fail to forward the information, then one end woulddetect it, which would lead to the path segment beingsuspected. To avoid impersonation attacks,

Qkþ2 requires

authentication. Meanwhile, the final reliable broadcast ofboth

Q2 and

Qkþ2 can be done as part of the LSA

distribution of the link state protocol.

9.5 Multicast

Multicast forwards traffic along more than one path. Sinceconservation of flow as stated by WATCHERS inherentlyassumes single-path communication, WATCHERS cannotbe easily extended for multicast communications. All of theother protocols discussed above, including our own, can beextended easily for multicast. Doing so would require therouters to agree on the network topology and the multicasttree (something that, for example, with MOSPF [51], wouldbe straightforward to do).

9.6 Fragmentation

Fragmentation does not change the data that is carried innetwork traffic, but rather changes the way that data ispacketized. Thus, traffic validation based on packets issensitive to fragmentation, while traffic validation based ondata is not. Of the protocols mentioned in this paper, onlyWATCHERS bases its traffic validation just on the data (ituses byte count) and is therefore insensitive to fragmenta-tion. However, fragmentation occurs very rarely in theInternet [52], and, so, we do not consider sensitivity tofragmentation to be a practical concern.

10 CONCLUSION

In this paper, we motivated and specified the problem ofdetecting routers with incorrect packet forwarding behaviorand provided a general framework for formally analyzingprotocols addressing this problem. Our approach is toseparate the problem into three subproblems: 1) determin-ing the traffic information to record upon which to base thedetection, 2) synchronizing routers to collect traffic in-formation and distributing this information among them sodetection can occur, and 3) taking countermeasures whendetection occurs. We presented a concrete protocol,

Qkþ2 ,

that is likely inexpensive enough for a practical implemen-tation at scale. We believe our work is an important step inbeing able to tolerate attacks on key network infrastructurecomponents.

APPENDIX

PROPERTIES OF THE PROTOCOLS

A.1 Basic Theorems

Theorem 1. If a router r is t-faulty at some time t and badðkÞholds, then there exists a path segment �, such that:

. r 2 �,

. r is t-faulty in � during some � that contains t,

. only the first and last routers of � are correct, and

. 3 � j�j � kþ 2.

Proof. If r is t-faulty at time t, then there is a path �, such

that r is t-faulty in � during some � that contains t. From

the system assumption, the source and sink routers of �

are correct, and, so, � must contain at least three routers

to include the faulty router r.For each path segment � of � that contains r, r is

t-faulty in � during � . Given badðkÞ, r can be in a group ofno less than one and no more than k adjacent faultyrouters. This group, by definition, is bounded on bothsides by nonfaulty routers. tu

Theorem 2. If, for a path segment �,

TV ð�; infoðh; �; �Þ; infoðj; �; �ÞÞ

is false where 1 � h < j � j�j, then there exists a link hi; iþ1i such that TV ð�; infoði; �; �Þ; infoðiþ 1; �; �ÞÞ is false and

h � i < iþ 1 � j.Proof. By contradiction. Assume that there is no link hi; iþ

1i such that TV ð�; infoði; �; �Þ; infoðiþ 1; �; �ÞÞ is false

and h � i < iþ 1 � j. For each link hi; iþ 1i such that

h � i < iþ 1 � j, TV ð�; infoði; �; �Þ; infoðiþ 1; �; �ÞÞ i s

true. Since TV is transitive,

TV ð�; infoðh; �; �Þ; infoðj; �; �ÞÞ

is true, which leads us a contradiction. tu

A.2 Properties ofQ

2

Theorem 3. The protocolQ

2 is 2-Accurate.

Proof. By construction, all suspicions are path segments of

length 2. For a correct router s to suspect h�; �i, that

router found TV ð�; infoði; �; �Þinfoðiþ 1; �; �ÞÞ to be

false, for some � that contains i and iþ 1. Furthermore,

since the traffic information is digitally signed, the two

routers did report this traffic information. Hence, at least

one of the two routers must be t-faulty or p-faulty. tuTheorem 4. The protocol

Q2 is 2-FC Complete.

Formally, if a router r is t-faulty at some time t, then all

correct routers eventually suspect h�; �i for some path

segment � : j�j � 2 and some � containing t such that there

is a router r0 that was faulty in � at time t0 in � and is fault-

connected to r.

Proof. By Theorem 1, if a router r is t-faulty at time t, then

there exists a path segment �0, such that: r 2 �0; r is also

t-faulty in �0 during � containing t; only the first and last

routers of �0 (which we’ll call f and ‘) are correct and

3 � j�0j � kþ 2.By construction of Pf and P‘, both f and ‘ monitor at

least one path segment �00 such that ff; r; ‘g 2 �00 and �00

contains �0.Both f and ‘ compute

TV ð�00; infoðf; �00; �Þ; infoð‘; �00; �ÞÞ

to be false. By Theorem 2, there exists a 2-path segment

� ¼ hi; iþ 1i s u c h t h a t TV ð�00; infoði; �00; �Þ; infoðiþ1; �00; �ÞÞ is false where f � i < iþ 1 � ‘. Since all routers

between f and ‘ are faulty and fault-connected to r, at

least one of fi; iþ 1g is faulty and fault-connected to r.

M�ZRAK ET AL.: DETECTING AND ISOLATING MALICIOUS ROUTERS 13

Page 14: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

Both correct routers f and ‘ detects this failure andreliably broadcasts to all correct routers with theevidence of infoði; �00; �Þ; infoðiþ 1; �00; �Þ that are digi-tally signed by routers i and iþ 1, respectively.Eventually, all correct routers suspect � ¼ hi; iþ 1i. tu

A.3 Properties ofQ

kþ2

Theorem 5. The protocolQ

kþ2 is (k+2)-Accurate.

Proof. If a correct router suspects h�; �i, then j�j � a andsome router r 2 � was faulty in � during � .

For a correct router, to suspect a path segment �,router s that is either the first or last router of �announces that “� is unreliable.”

1. If this announcement is incorrect, then s isp-faulty.

2. If this announcement is correct, then s foundTV ð�; infoð1; �; �Þ; infoðj�j; �; �ÞÞ to be false. As-sume there exists no faulty router in � exhibitingfaulty manner with respect to � during � . Thus,each router in � forwards the traffic traversing �correctly. Since both router 1 and router s arecorrect, they collection and exchange trafficinformation correctly. Thus, both routers will findTV ð�; infoð1; �; �Þ; infoðj�j; �; �ÞÞ to be true, whichcontradicts our assumption.

A correct router applies the protocolQ

kþ2 tox-path segments where x � kþ 2. Hence,

Qkþ2 is

(k+2)-Accurate. tuTheorem 6. The protocol

Qkþ2 is (k+2)-Complete.

We show that if a router r is t-faulty at some time t, thenall correct routers eventually suspect h�; �i for some pathsegment � : j�j � kþ 2 such that r was t-faulty in � at t, andfor some interval � containing t.

Proof. Let r have introduced discrepancy into the trafficpassing through itself during � containing t. Then, fromTheorem 1, there exists a path segment � such that:

. r 2 �,

. r is t-faulty in � during � containing t,

. only f and ‘—the first and last routers of �—arecorrect, and

. 3 � j�j � kþ 2.

f and ‘ monitor � and apply the protocolQ

kþ2 for �.

After exchanging their traffic information, both f and ‘

compute TV ð�; infoðf; �; �Þ; infoð‘; �; �ÞÞ to be false and

suspect � and disseminate this information to the all

other correct routers by reliable broadcast. Since �

contains a t-faulty router r and the length of � might

be at most kþ 2, the protocolQ

kþ2 is (k+2)-Complete.tu

ACKNOWLEDGMENTS

This work was supported by US National Science Founda-tion Grant 0311690.

REFERENCES

[1] X. Ao, “Report on DIMACS Workshop on Large-Scale InternetAttacks,” Sept. 2003, http://dimacs.rutgers.edu/Workshops/Attacks/internet-attack-9-03.pdf.

[2] K.J. Houle, G.M. Weaver, N. Long, and R. Thomas, “Trends inDenial of Service Attack Technology,” CERT Coordination Center,technical report, Oct. 2001, http://www.cert.org/archive/pdf/DoS_trends.pdf.

[3] C. Labovitz, A. Ahuja, and M. Bailey, “Shining Light on DarkAddress Space,” Arbor Networks, technical report, Nov. 2001,http://research.arbornetworks.com/downloads/research38/dar-k_address_spa%ce.pdf.

[4] R. Thomas, “ISP Security BOF, NANOG 28,” June 2003, http://www.nanog.org/mtg-0306/pdf/thomas.pdf.

[5] ? Gauis, “Things to Do in Ciscoland When You’re Dead,” Jan.2000, http://www.phrack.org/phrack/56/p56-0x0a.org/phrack/56/p56-0x0a.

[6] D. Taylor, “Using a Compromised Router to Capture NetworkTraffic,” unpublished technical report, July 2002, http://www.netsys.com/library/papers/GRE_sniffing.PDF.

[7] A.P. Kosoresow and S.A. Hofmeyr, “Intrusion Detection viaSystem Call Traces,” IEEE Software, vol. 14, no. 5, pp. 35-42, 1997.

[8] K. Jain and R. Sekar, “User-Level Infrastructure for System CallInterposition: A Platform for Intrusion Detection and Confine-ment,” Proc. Network and Distributed Systems Security Symp., 2000.

[9] T. Garfinkel, “Traps and Pitfalls: Practical Problems in System CallInterposition Based Security Tools,” Proc. Network and DistributedSystems Security Symp., Feb. 2003.

[10] A.T. Mizrak, Y.-C. Cheng, K. Marzullo, and S. Savage, “Fatih:Detecting and Isolating Malicious Routers,” DSN ’05: Proc. 2005Int’l Conf. Dependable Systems and Networks (DSN ’05), pp. 538-547,2005.

[11] R. Perlman, “Network Layer Protocols with Byzantine Robust-ness,” PhD dissertation, MIT LCS TR-429, Oct. 1988.

[12] L. Subramanian, V. Roth, I. Stoica, S. Shenker, and R. Katz, “Listenand Whisper: Security Mechanisms for BGP,” Proc. Symp.Networked Systems Design and Implementation, Mar. 2004.

[13] S. Kent, C. Lynn, J. Mikkelson, and K. Seo, “Secure BorderGateway Protocol (Secure-BGP),” IEEE J. Selected Areas in Comm.,vol. 18, no. 4, pp. 582-592, Apr. 2000.

[14] Y.-C. Hu, A. Perrig, and D.B. Johnson, “Ariadne: A Secure On-Demand Routing Protocol for Ad Hoc Networks,” Proc. EighthACM Int’l Conf. MobiCom, Sept. 2002.

[15] B.R. Smith and J. Garcia-Luna-Aceves, “Securing the BorderGateway Routing Protocol,” Proc. Global Internet ’96, 1996.

[16] S. Cheung, “An Efficient Message Authentication Scheme for LinkState Routing,” Proc. Ann. Computer Security Applications Conf.,pp. 90-98, 1997.

[17] M.T. Goodrich, “Efficient and Secure Network Routing Algo-rithms,” provisional patent filing, Jan. 2001.

[18] Y. Cosendai, M. Dacier, and P. Scotton, “Intrusion DetectionMechanism to Detect Reachability Attacks in PNNI Networks,”Recent Advances in Intrusion Detection, 1999.

[19] Y. Jou, F. Gong, C. Sargor, X. Wu, S. Wu, H. Chang, and F. Wang,“Design and Implementation of a Scalable Intrusion DetectionSystem for the Protection of Network Infrastructure,” Proc.DARPA Information Survivability Conf. & Exposition, vol. 2,pp. 69-83, 2000.

[20] R. Perlman, Interconnections: Bridges and Routers. Addison WesleyLongman Publishing Co. Inc., 1992.

[21] V.N. Padmanabhan and D. Simon, “Secure Traceroute to DetectFaulty or Malicious Routing,” SIGCOMM Comp. Comm. Rev.,vol. 33, no. 1, pp. 77-82, 2003.

[22] I. Avramopoulos, H. Kobayashi, R. Wang, and A. Krishnamurthy,“Highly Secure and Efficient Routing,” Proc. INFOCOM Conf.,Mar. 2004.

[23] I. Avramopoulos, H. Kobayashi, R. Wang, and A. Krishnamurthy,“Amendment to: Highly Secure and Efficient Routing,” amend-ment, Feb. 2004.

[24] S. Cheung and K. Levitt, “Protecting Routing Infrastructures fromDenial of Service Using Cooperative Intrusion Detection,” Proc.New Security Paradigms Workshop, 1997.

[25] K.A. Bradley, S. Cheung, N. Puketza, B. Mukherjee, and R.A.Olsson, “Detecting Disruptive Routers: A Distributed NetworkMonitoring Approach,” Proc. IEEE Symp. Security and Privacy,pp. 115-124, May 1998.

[26] J.R. Hughes, T. Aura, and M. Bishop, “Using Conservation ofFlow as a Security Mechanism in Network Protocols,” Proc. IEEESymp. Security and Privacy, pp. 132-131, 2000.

14 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

Page 15: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE ...cseweb.ucsd.edu/~savage/papers/TDSC06.pdfpresents a wider set of opportunities to the attacker—not only denial-of-service, but also

[27] A. Herzberg and S. Kutten, “Early Detection of MessageForwarding Faults,” SIAM J. Computing, vol. 30, no. 4, pp. 1169-1196, 2000.

[28] A. Shaikh, C. Isett, A. Greenberg, M. Roughan, and J. Gottlieb, “ACase Study of OSPF Behavior in a Large Enterprise Network,”IMW ’02: Proc. Second ACM SIGCOMM Workshop Internet Measur-ment, pp. 217-230, 2002.

[29] D. Watson, F. Jahanian, and C. Labovitz, “Experiences withMonitoring OSPF on a Regional Service Provider Network,”ICDCS ’03: Proc. 23rd Int’l Conf. Distributed Computing Systems,p. 204, 2003.

[30] Cisco Systems, “Load Balancing with Cisco Express Forwar-ding,”http://www.cisco.com/warp/public/cc/pd/ifaa/pa/much/prodlit/loadb_an.pdf, 2006.

[31] Juniper Networks, “JUNOS 6.4 Routing Protocols Configura-tion Guide,”http://www.juniper.net/techpubs/software/junos/junos64/swconfig64-routing/html/, 2006.

[32] R. Teixeira, K. Marzullo, S. Savage, and G.M. Voelker, “In Searchof Path Diversity in ISP Networks,” Proc. ACM/SIGCOMM IMC,pp. 313-318, 2003.

[33] J. Bellardo and S. Savage, “Measuring Packet Reordering,” Proc.ACM SIGCOMM Internet Measurement Workshop (IMW ’02), 2002.

[34] J.C. R. Bennett, C. Partridge, and N. Shectman, “Packet ReorderingIs Not Pathological Network Behavior,” IEEE/ACM Trans.Networking (TON), vol. 7, no. 6, pp. 789-798, 1999.

[35] G. Almes, S. Kalidindi, and M. Zekauskas, “A One-Way PacketLoss Metric for IPPM,” RFC 2680, IETF, Sept. 1999.

[36] D. Pullin, A. Corlett, B. Mandeville, and S. Critchley, “PacketReordering: The Minimal Longest Ascending Subsequence Me-tric,” Feb. 2002, http://www.globecom.net/ietf/draft/draft-critchley-mlas-reordering-00.t%xt.

[37] A. Morton, L. Ciavattone, G. Ramachandran, S. Shalunov, and J.Perser, “Packet Reordering Metric for IPPM,” Mar. 2003, http://www.globecom.net/ietf/draft/draft-ietf-ippm-reordering-00.txt.

[38] Cisco Systems, “Detecting and Analyzing Network Threats withNetFlow,”http://www.cisco.com/univercd/cc/td/doc/pro-duct/software/ios124/124tcg /tnf_c/ch15/nfhtdt.pdf, 2006.

[39] B.H. Bloom, “Space/Time Trade-Offs in Hash Coding withAllowable Errors,” Comm. ACM, vol. 13, no. 7, pp. 422-426, July1970.

[40] Y. Minsky, A. Trachtenberg, and R. Zippel, “Set Reconciliationwith Nearly Optimal Communication Complexity,” Proc. Int’lSymp. Information Theory, p. 232, June 2001.

[41] “Communications Assistance for Law Enforcement Act of 1994,”pub. L. No. 103-414, 108 Stat. 4279, 103rd Congress of the UnitedStates of Am.

[42] S. Hanks, T. Li, D. Farinacci, and P. Traina, “Generic RoutingEncapsulation GRE,” RFC 1701, IETF, Oct. 1994.

[43] T.D. Chandra and S. Toueg, “Unreliable Failure Detectors forReliable Distributed Systems,” J. ACM, vol. 43, no. 2, pp. 225-267,1996.

[44] A.T. Mizrak, K. Marzullo, and S. Savage, “Detecting MaliciousRouters,” Univ. of California ay San Diego, technical reportCS2004-0789, 2004, http://www.cs.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2004-0789.

[45] J. Black, S. Halevi, H. Krawczyk, T. Krovetz, and P. Rogaway,“UMAC: Fast and Secure Message Authentication,” Lecture Notesin Computer Science, vol. 1666, pp. 216-233, 1999.

[46] GNU Zebra, http://www.zebra.org, 2006.[47] J.T. Moy, “OSPF version 2,” RFC 2328, IETF, Apr. 1998.[48] D.L. Mills, “Network Time Protocol (version 3) Specification,

Implementation,” RFC 1305, IETF, Mar. 1992.[49] Abilene Network, http://abilene.internet2.edu/, 2006.[50] User Mode Linux, http://user-mode-linux.sourceforge.net/, 2006.[51] P. Rogaway, “UMAC Performance (More),” www.cs.ucdavis.e-

du/~rogaway/umac/2000/perf00bis.html, 2006.[52] N. Shah, “Understanding Network Processors,” Master’s thesis,

Univ. of California, Berkeley, Sept. 2001.[53] W. Feghali, B. Burres, G. Wolrich, and D. Carrigan, “Security:

Adding Protection to the Network via the Network Processor,”Intel Technology J., vol. 06, pp. 40-49, Aug. 2002.

[54] National Institute of Standards and Technology, “Data encryptionStandard,” FIPS PUBS 46-3, Oct. 1999.

[55] National Institute of Standards and Technology, “AdvancedEncryption Standard,” FIPS PUBS 197, Nov. 2001.

[56] N.G. Duffield and M. Grossglauser, “Trajectory Sampling forDirect Traffic Observation,” Proc. ACM SIGCOMM’00, pp. 271-282, 2000.

[57] N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISPTopologies with Rocketfuel,” Proc. ACM/SIGCOMM, pp. 133-145, 2002.

[58] K.M. Chandy and L. Lamport, “Distributed Snapshots: Determin-ing Global States of Distributed Systems,” ACM Trans. ComputerSystems, vol. 3, no. 1, pp. 63-75, 1985.

[59] L. Lamport, R. Shostak, and M. Pease, “The Byzantine GeneralsProblem,” ACM Trans. Programming Languages and Systems, vol. 4,no. 3, pp. 382-401, 1982.

Alper Tugay M�zrak received the BS degree incomputer engineering from Bilkent University,Ankara, Turkey, in 2000 and the MSc degree incomputer science from the University of Califor-nia, San Diego, in 2002. He is pursuing his Doctorof Philosophy degree in computer science in theUniversity of California, San Diego. His advisorsare Stefan Savage and Keith Marzullo. Hisresearch interests are computer networks andfault-tolerant distributed systems. He is a student

member of the IEEE.

Yu-Chung Cheng received the BS degree fromNational Taiwan University, Taiwan, in 1998.Since 2001, he has been working toward thePhD degree, advised by Geoff Voelker andStefan Savage, in computer science at theUniversity of California at San Diego. Hisresearch interests are the performance andreliability of wide-area and wireless networks.His thesis focuses on cross-layer 802.11 wire-less network benchmarking with multiple van-

tage points. He coauthored several papers in wide-area networksincluding one award paper in DSN 2005, and one paper on Webbenchmarking by replying TCP flows from Google Web sites.

Keith Marzullo received the PhD degree fromStanford University in 1984. He is a professor inthe Computer Science and Engineering Depart-ment at the University of California, San Diegowhere he has been since 1993. Dr. Marzulloworks on both theoretical and practical issues offault-tolerant distributed computing in variousdomains, including grid computing, mobile com-puting, and clusters. He is a member of theIEEE.

Stefan Savage received the BS degree inapplied history from Carnegie-Mellon Universityand the PhD degree in computer science andengineering from the University of Washington.He is an assistant professor of computer scienceand engineering at the University of California,San Diego, and he currently serves as director ofthe Cooperative Center for Internet Epidemiol-ogy and Defenses (CCIED), a joint projectbetween UCSD and the International Computer

Science Institute. His research interests lie at the intersection ofoperating systems, networking, and computer security. More informationabout him can be found at http://www.cs.ucsd.edu/ savage. He is amember of the IEEE.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

M�ZRAK ET AL.: DETECTING AND ISOLATING MALICIOUS ROUTERS 15