network resilience: a systematic approach · 90 ieee communications magazine • july 2011...

IEEE Communications Magazine • July 201188 0163-6804/11/$25.00 © 2011 IEEE

INTRODUCTION

Data communication networks are serving allkinds of human activities. Whether used for pro-fessional or leisure purposes, for safety-criticalapplications or e-commerce, the Internet in par-ticular has become an integral part of our every-day lives, affecting the way societies operate.However, the Internet was not intended to serveall these roles and, as such, is vulnerable to awide range of challenges. Malicious attacks, soft-ware and hardware faults, human mistakes (e.g.,software and hardware misconfigurations), and

large-scale natural disasters threaten its normaloperation.

Resilience, the ability of a network to defendagainst and maintain an acceptable level of ser-vice in the presence of such challenges, is viewedtoday, more than ever before, as a major require-ment and design objective. These concerns arereflected in, among other ways, in the CyberStorm III exercise carried out in the UnitedStates in September 2010, and the “cyber stresstests” conducted in Europe by the EuropeanNetwork and Information Security Agency(ENISA) in November 2010; both aimed pre-cisely at assessing the resilience of the Internet,this “critical infrastructure used by citizens, gov-ernments, and businesses.”

Resilience evidently cuts through several the-matic areas, such as information and networksecurity, fault tolerance, software dependability,and network survivability. A significant body ofresearch has been carried out around thesethemes, typically focusing on specific mecha-nisms for resilience and subsets of the challengespace. We refer the reader to Sterbenz et al. [1]for a discussion on the relation of variousresilience disciplines, and to a survey by Choldaet al. [2] on research work for network resilience.

A shortcoming of existing research anddeployed systems is the lack of a systematic viewof the resilience problem, that is, a view of howto engineer networks that are resilient to chal-lenges that transcend those considered by a sin-gle thematic area. A non-systematic approach tounderstanding resilience targets and challenges(e.g., one that does not cover thematic areas)leads to an impoverished view of resilienceobjectives, potentially resulting in ill suited solu-tions. Additionally, a patchwork of resiliencemechanisms that are incoherently devised anddeployed can result in undesirable behavior andincreased management complexity under chal-

ABSTRACT

The cost of failures within communicationnetworks is significant and will only increase astheir reach further extends into the way our soci-ety functions. Some aspects of networkresilience, such as the application of fault-toler-ant systems techniques to optical switching, havebeen studied and applied to great effect. Howev-er, networks — and the Internet in particular —are still vulnerable to malicious attacks, humanmistakes such as misconfigurations, and a rangeof environmental challenges. We argue that thisis, in part, due to a lack of a holistic view of theresilience problem, leading to inappropriate anddifficult-to-manage solutions. In this article, wepresent a systematic approach to buildingresilient networked systems. We first study fun-damental elements at the framework level suchas metrics, policies, and information sensingmechanisms. Their understanding drives thedesign of a distributed multilevel architecturethat lets the network defend itself against, detect,and dynamically respond to challenges. We thenuse a concrete case study to show how the frame-work and mechanisms we have developed can beapplied to enhance resilience.

TOPICS IN NETWORK AND SERVICE MANAGEMENT

Paul Smith, Lancaster University

David Hutchison, Lancaster University

James P. G. Sterbenz, University of Kansas and Lancaster University

Marcus Schöller, NEC Laboratories Europe

Ali Fessi, Technische Universität München

Merkouris Karaliopoulos, NKU Athens

Chidung Lac, France Telecom (Orange Labs)

Bernhard Plattner, ETH Zurich

Network Resilience:A Systematic Approach

SMITH LAYOUT 6/20/11 4:20 PM Page 88

IEEE Communications Magazine • July 2011 89

lenge conditions, encumbering the overall net-work management task [3].

The EU-funded ResumeNet project argues forresilience as a critical and integral property ofnetworks. It advances the state of the art byadopting a systematic approach to resilience,which takes into account the wide-variety ofchallenges that may occur. At the core of ourapproach is a coherent resilience framework,which includes implementation guidelines, pro-cesses, and toolsets that can be used to underpinthe design of resilience mechanisms at variouslevels in the network. In this article, we firstdescribe our framework, which forms the basisof a systematic approach to resilience. Central tothe framework is a control loop, which definesnecessary conceptual components to ensure net-work resilience. The other elements — a riskassessment process, metrics definitions, policy-based network management, and informationsensing mechanisms — emerge from the controlloop as necessary elements to realize our system-atic approach. We show how these elementsdrive the design of a novel architecture andmechanisms for resilience. Finally, we illustratethese mechanisms in a concrete case study beingexplored in ResumeNet: a future Internet smartenvironments application.

FRAMEWORK FOR RESILIENCEOur resilience framework builds on work by Ster-benz et al. [1], whereby a number of resilienceprinciples are defined, including a resiliencestrategy, called D2R2 + DR: Defend, Detect,Remediate, Recover, and Diagnose and Refine.The strategy describes a real-time control loop toallow dynamic adaptation of networks in responseto challenges, and a non-real time control loopthat aims to improve the design of the network,including the real-time loop operation, reflectingon past operational experience.

The framework represents our systematicapproach to the engineering of network

resilience. At its core is a control loop compris-ing a number of conceptual components thatrealize the real-time aspect of the D2R2 + DRstrategy, and consequently implement networkresilience. Based on the resilience control loop,other necessary elements of our framework arederived, namely resilience metrics, understand-ing challenges and risks, a distributed informa-tion store, and policy-based management. Theremainder of this section describes the resiliencecontrol loop, then motivates the need for theseframework elements.

RESILIENCE CONTROL LOOPBased on the real-time component of the D2R2

+ DR strategy, we have developed a resiliencecontrol loop, depicted in Fig. 1, in which a con-troller modulates the input to a system undercontrol in order to steer the system and its out-put towards a desired reference value. The con-trol loop forms the basis of our systematicapproach to network resilience — it defines nec-essary components for network resilience fromwhich the elements of our framework, discussedin this section, are derived. Its operation can bedescribed using the following list; items corre-spond to the numbers shown in Fig. 1:

1. The reference value we aim to achieve isexpressed in terms of a resilience target, which isdescribed using resilience metrics. The resiliencetarget reflects the requirements of end users,network operators, and service providers.

2. Defensive measures need to be put in placeproactively to alleviate the impact of challengeson the network, and maintain its ability to real-ize the resilience target. A process for identify-ing the challenges that should be considered inthis defense step of the strategy (e.g., those hap-pening more frequently and having high impact)is necessary.

3. Despite the defensive measures, some chal-lenges may cause the service delivered to usersto deviate from the resilience target. These chal-lenges could include unforeseen attacks or mis-

Figure 1. The resilience control loop: derived from the real-time component of the D2R2 + DR resilience strategy.

Defensive measures

Providedservice

Used to determinewhether targets are

being met

Management decisionsto control resilience

mechanisms

Detects andcharacteriseschallenges

Redundant anddiverse

infrastructureprovisioning andself-protecting

services

Resilienceobjectivesdescribed,

e.g.,in SLAs

Resiliencetarget

Protocolsand servicesembedded

in thenetwork

Challenges

Resilienceestimator

Resiliencemanager Resilience

mechanisms

Networksand services

Challengeanalysis

1

2

3

4 5

5


IEEE Communications Magazine • July 201190

configurations. Challenge analysis componentsdetect and characterize them using a variety ofinformation sources.

4. Based on output from challenge analysisand the state of the network, a resilience estima-tor determines whether the resilience target isbeing met. This measure is based on resiliencemetrics, and is influenced by the effectiveness ofdefense and remediation mechanisms to respondto challenges.

5. Output from the resilience estimator andchallenge analysis is fed to a resilience manager.It is then its responsibility to control resiliencemechanisms embedded in the network and ser-vice infrastructure, to preserve the target serviceprovision level or ensure its graceful degrada-tion. This adaptation is directed using resilienceknowledge, not shown in Fig. 1, such as policiesand challenge models. We anticipate a cost ofremediation in terms of a potentially unavoid-able degradation in quality of service (QoS),which should not be incurred if the challengeabates. Consequently, the network should aim torecover to normal operation after a challengehas ceased.

The purpose of the background loop in theD2R2 + DR strategy is to improve the operationof the resilience control loop such that it meetsan idealized system operation. This improvementcould be in response to market forces, leading tonew resilience targets, new challenges, or subopti-mal performance. The diagnose phase identifiesareas for improvement, including defense, thatare enacted through refinement. In reality, and forthe foreseeable future, we anticipate this outerloop to be realized with human intervention.

RESILIENCE METRICS

Defining a resilience target requires appropriatemetrics. Ideally, we would like to express theresilience of a network using a single value, R,in the interval [0,1], but this is not a simpleproblem because of the number of parametersthat contribute to and measure resilience, anddue to the multilayer aspects in which each levelof resilience (e.g., resilient topology) is the foun-dation for the next level up (e.g., resilient rout-ing). We model resilience as a two-dimensionalstate space in which the vertical axis P is a mea-sure of the service provided when the opera-tional state N is challenged, as shown in Fig. 2.Resilience is then modeled as the trajectorythrough the state as the network goes from deliv-ering acceptable service under normal opera-tions S0 to degraded service Sc. Remediationimproves service to Sr and recovery returns tothe normal state S0. We can measure resilienceat a particular service level as the area underthis trajectory, R.

We have developed a number of tools forevaluating network resilience. For example, weuse MATLAB or ns-3 simulation models tomeasure the service at each level and plot theresults under various challenges and attacks, asin Fig. 2, where each axis is an objective functionof the relevant parameters [4]. Furthermore, wehave developed the Graph Explorer tool [5] thattakes as input a network topology and associatedtraffic matrix, a description of challenges, and aset of metrics to be evaluated. The result of theanalysis is a series of plots that show the metricenvelope values (mi(min), mi(max)) for eachspecified metric mi, and topology maps indicat-ing the resilience across network regions.

Figure 3 shows an example of the resilienceof the European academic network GÉANT2 tolink failures. The set of plots in Fig. 3a showmetric envelopes at different protocol levels —the aim is to understand how jitter responds incomparison with metrics at other levels, such asqueue length and connectivity. Surprisingly, jitteris not clearly related to queue length, and amonotonic increase in path length does not yielda similar increase in queue length for all scenar-ios of link failures. In fact, the fourth link failuredisconnects a region of the network; whereas upto three failures, the heavy use of a certain pathresulted in increasing queue lengths and jitter.The partition increases path length, becauseroute lengths are set to infinity, and decreasesconnectivity, which is accompanied by a reduc-tion in jitter, shown with the blue arrows in Fig.3a. The topology map in Fig. 3b highlights thevulnerability of regions of GÉANT2 with a heatmap, which can be used by network planners.

Our framework for resilience metrics (i.e., themultilevel two-dimensional state space and theuse of metric envelopes) can be used to under-stand the resilience of networks to a broad rangeof challenges, such as misconfigurations, faults,and attacks. The ability to evaluate a given net-work’s resilience to a specific challenge is limitedby the capability of the tools to create complexchallenge scenarios — this is an area for furtherwork, in which our effort should be focused onmodeling pertinent high-impact challenges.

Figure 2. Resilience state space

Severelydegraded

Partiallydegraded

Operational state NNormal

operation

Remediate

Recover

Detect

Defend

Remediate

Recover

Detect

Defend

Acc

epta

ble

Impa

ired

Serv

ice

para

met

ers

P

Una

ccep

tabl

e

Sr

S0

Sc



Figure 3. Example output from the Graph Explorer, developed in the ResumeNet project: a) plots showing the relationship between met-rics at various layers in response to link failures on the GÉANT2 topology; b) a heat map showing vulnerable regions of the topologywith respect to a given set of metrics. Reprinted from [5].

Level 4transport

Level 3routing

Level 2link

Level 1physical

(b)

(a)

Level 7application

Network design

Network operationC

lust

erin

g

Failures10

0.13

Clu

tter

ing

coef

ficie

nt

0.260.240.220.2

0.180.16

0.28

2 3 4

Ass

orta

tivi

ty

Failures10

-0.27

-0.05-0.1

-0.15

-0.2

0

2 3 4

Failures10

50

Betw

eenn

ess 75

70656055

80

2 3 4

Betw

eenn

ess

Failures10

0.93

Con

nect

ivit

y 0.990.980.970.960.950.94

1

2 3 4

Con

nect

ivit

y

Failures10

0

Que

ue le

ngth

200400600800

100012001400

2 3 4

Que

ue le

ngth

Failures10

0.012

0.01

Jitt

er

0.011

0.013

0.014

2 3 4 End-

to-e

nd ji

tter

Failures10

3

Hop

cou

nt

579

11

14

2 3 4Pa

th le

ngth



UNDERSTANDING CHALLENGES AND RISKS

Engineering resilience has a monetary cost. Tomaximize the effectiveness of the resources com-mitted to resilience, a good understanding of thechallenges a network may face is mandatory. Wehave developed a structured risk assessmentapproach that identifies and ranks challenges inline with their probability of occurrence and theirimpact on network operation (i.e., how disruptivethey are to the provision of its services). Theapproach should be carried out at the stage of net-work design when proactive defensive measuresare deployed, and repeated regularly over time aspart of the process of network improvements.

Central to determining the impact of a chal-lenge is to identify the critical services the networkprovides and the cost of their disruption: a mea-sure of impact. Various approaches can be used toidentify the critical services, such as discussiongroups involving the network’s stakeholders. Net-worked systems are implemented via a set ofdependent subsystems and services (e.g., web andSession Initiation Protocol [SIP] services rely onDomain Name Service [DNS]). To identify whetherchallenges will cause a degradation of a service, itis necessary to explicate these dependencies.

The next phase is to identify the occurrenceprobabilities of challenges (challenge_prob).Some challenges will be unique to a network’scontext (e.g., because of the services it provides),while others will not. In relation to these chal-lenges, shortcomings of the system (e.g., in termsof faults) should be identified. The aim is todetermine the probability that a challenge willlead to a failure (fail_prob). We can use tools,such as our Graph Explorer, analytical modeling,and previous experience (e.g., in advisories) tohelp identify these probabilities. Given thisinformation, a measure of exposure can bederived using the following equation:

exposure = (challenge_prob × fail_prob) × impact

With the measures of exposure at hand,resilience resources can be targeted at the chal-lenges that are likely to have the highest impact.

INFORMATION SOURCES ANDSHARING FOR RESILIENCE

For the most part, network management deci-sions are made based on information obtainedfrom monitoring systems in the network (e.g., viaSimple Network Management Protocol[SNMP]). However, to be able to make auto-nomic decisions about the nature of a wide rangeof challenges and how to respond to them — anecessary property of resilient networks — abroader range of information needs to be used.In addition to traditional network monitoringinformation, context information, which is some-times “external” to the system can be used. Ear-lier work has demonstrated how the use ofweather information, an example of context,improves the resilience of millimeter-wave wire-less mesh networks, which perform poorly inheavy rain [4]. Also, in addition to node-centricmonitoring tools, such as NetFlow and SNMP,task-centric tools can be used to determine the

root cause of failures. For example, X-trace [6]is a promising task-centric monitoring approachthat can be used to associate network and ser-vice state (e.g., router queue lengths and DNSrecords) with service requests (e.g., retrieving aweb page). This multilevel information can thenbe used to determine the root causes of failures.

We are developing a Distributed Store forChallenges and their Outcome (DISco), whichuses a publish-subscribe messaging pattern todisseminate information between subsystemsthat realize the real-time loop. Such informationincludes actions performed to detect and reme-diate challenges. Information sources may reportmore data than we can afford or wish to relay onthe network, particularly during challenge occur-rences. DISco is able to aggregate informationfrom multiple sources to tackle this problem.Decoupling information sources from compo-nents that use them allows adaptation of chal-lenge analysis components without needing tomodify information sources. To assist the twophases of the outer loop, DISco employs a dis-tributed peer-to-peer storage system for longer-term persistence of data, which is aware ofavailable storage capacity and demand.

POLICIES FOR RESILIENCEWe advocate the use of a policy-based manage-ment framework to define the behavior of real-time loop instantiations. Consequently, theimplementation of resilience mechanisms can bedecoupled from the resilience management strate-gies, which are expressed in policies. This hastwo immediate benefits: the nature of challengeschanges over time — management strategies canbe adapted accordingly without the need for net-work down-time; and policies allow networkoperators to clearly express when they wouldlike to intervene in the network’s operation (e.g.,when a remediation action needs to be invoked).

Research outcomes from the policy-basedmanagement field can help address the complex-ities of resilience management [7]. A difficulttask is deriving implementable policies fromhigh-level resilience requirements, say, expressedin service level agreements (SLAs). With appro-priate modifications, techniques for policy refine-ment can be used to build tools to automateaspects of this process. Policy-based learning,which relies on the use of logical rules for knowl-edge representation and reasoning, is beingexploited to assist with the improvement stagesof our strategy. Techniques for policy ratificationare currently used to ensure that invocation ofdifferent resilience strategy sets does not yieldundesirable conflicting behavior. Conflicts canoccur horizontally between components thatrealize the resilience control loop, and verticallyacross protocol levels. For example, a mecha-nism that replicates a service using virtualizationtechniques at the service level could conflict witha mechanism that is rate-limiting traffic at thenetwork level. Example policies of this sort areshown in Fig. 4. So that these forms of conflictcan be detected, Agrawal et al. [8] provide a the-oretical foundation for conflict resolution thatneeds to be extended with domain-specificknowledge, for example, regarding the nature ofresilience mechanisms.

We advocate the use

of a policy-based

management frame-

work to define the

behavior of real-time

loop instantiations.

Consequently, the

implementation of

resilience mecha-

nisms can be decou-

pled from the

resilience manage-

ment strategies,

which are expressed

in policies.



DEFENSE AND DYNAMICADAPTATION ARCHITECTURE

In this section, we describe a set of defensivemechanisms and an architecture that realize oursystematic approach to resilience, described ear-lier. The architecture, shown in Fig. 5, consistsof several subsystems implementing the varioustasks of the communication system as well asthe challenge detection components and adapta-tion capabilities. The behavior of all these sub-systems is directed by the resilience managerusing policies, which are held in a resilienceknowledge base. Central to this architecture isDISco, which acts as a publish-subscribe andpersistent storage system, containing informa-tion regarding ongoing detection and remedia-tion activities. From an implementationperspective, based on the deployment context,we envisage components of the architecture tobe distributed (e.g., in an Internet serviceprovider [ISP] network) or functioning entirelyon a single device (e.g., nodes in a delay-toler-ant network).

DEFENSIVE MEASURESAs a first step, defensive measures need to beput in place to alleviate the impact of challengeson the network. Since challenges may varybroadly from topology-level link failures toapplication-level malware, defensive measuresagainst anticipated high-impact challenges needto be applied at different levels and locations: inthe network topology design phase, and withinprotocols; across a network domain, as well as atindividual nodes. Defensive measures can eitherprevent a challenge from affecting the system orcontain erroneous behavior within a subsystemin such a way that the delivered service still

meets its specification. A selection of defensivemeasures developed in the ResumeNet project isshown in Table 1.

DETECTION SUBSYSTEMSThe second step is to detect challenges affectingthe system leading to a deviation in deliveredservice. We propose an incremental approach tochallenge analysis. Thereby, the understandingabout the nature of a challenge evolves as moreinputs become available from a variety of infor-mation sources. There are two apparent advan-tages of this incremental approach. First, itreadily accommodates the varying computationaloverhead, timescales, and potentially limitedaccuracy of current detection approaches [9].Second, relatively lightweight detection mecha-nisms that are always on can be used to prompt-ly initiate remediation, thus providing thenetwork with a first level of protection, whilefurther mechanisms are invoked to better under-stand the challenge and improve the networkresponse. Lightweight detection mechanisms canbe driven by local measurements carried out inthe immediate neighborhood of affected nodes.

For example, consider high-traffic volumechallenges, such as a DDoS attack or a flashcrowd event. Initially, always-on simple queuemonitoring could generate an alarm if queuelengths exceed a threshold for a sustained peri-od. This could trigger the rate limiting of linksassociated with high traffic volumes. Moreexpensive traffic flow classification could then beused to identify and block malicious flows, con-sequently not subjecting benign flows to ratelimiting. Challenge models, shown in Fig. 5,describe symptoms of challenges and drive theanalysis process. They can be used to initiallyidentify broad classes of challenge, and later torefine identification to more specific instances.

Figure 4. Potentially conflicting policies at the service level (the replication of a service) and the networklevel (rate-limiting traffic) that could be triggered by the same challenge, such as a distributed denial of ser-vice (DDoS) attack. Rate limiting traffic could cause the replication to fail.

DDoSattackers

DDoS trafficReplication traffic

on highUtilisation (link) { do { RateLimiterMO limit (link , 50%) }}

on highServiceUtilisation (service) { do { VMReplicator replicateService (service, ServerB) }}

Virtualisedservices

Internet

Conflict

ServerB

ServerA

Since challenges may

vary broadly from

topology-level link

failures to applica-

tion-level malware,

defensive measures

against anticipated

high-impact

challenges need to

be applied at

different levels

and locations.



REMEDIATION AND RECOVERY SUBSYSTEMS

The challenge detection subsystem interfaceswith the remediation and recovery subsystem,the third and final step, by issuing alerts to DIScousing the publish(challenge) primitive.These alerts contain information about the chal-lenge and its impact on the network, in terms ofthe metrics that are falling short of the resiliencetarget. The network resilience manager takes thisinformation as context data, and, based on poli-cies, selects an adaptation strategy. In doing so,the network resilience manager realizes theresilience management functionality in Fig. 1. Iffurther information is required by the networkresilience manager that is not contained in thealert, the lookup(alarm) primitive can beused. Furthermore, the network resilience man-ager can make use of consultants, such as pathcomputation elements, which can compute newtopological configurations, such as new channelallocations or new forwarding structures.Resilience mechanisms are deployed by enforc-ing new configurations on the managed entities(e.g., routers and end hosts) in the network. Toimplement the resilience estimator, the networkresilience manager assesses the success of cho-sen remedies. The assessment is stored in DIScoto aid the diagnosis and refinement steps of thebackground loop. Carrying out this assessment isnot straightforward since it requires spatio-tem-poral correlation of changes in network state,which is an issue for further work.

RESILIENCE IN SMARTENVIRONMENTS: A CASE STUDY

We are currently evaluating our genericapproach to resilience through concrete studycases that cover a range of future networking

paradigms: wireless mesh and delay-tolerantnetworks, peer-to-peer voice conferencing,and service provision over heterogeneoussmart environments. Herein, we focus our dis-cussion on the last study case. The widespreaduse of smart mobile devices, together withidentifiers such as radio frequency identifica-tion (RFID), embedded in objects such asproducts, enables communication with, andabout, these objects. The French national pro-ject Infrastructure for the Future Trade(ICOM) has developed an intra- and inter-enterprise infrastructure, depicted in Fig. 6,that allows the connection of objects withenterprise information systems and fixed ormobile terminals. This ICOM platform can beused as a foundation for a number of enter-prise applications. The experimentation makesuse of three different entities:• The data acquisition site is the data source

— items identified by RFID, for example,are read and their information sent to aprocessing centre located remotely.

• The data processing site houses differentmodules of the platform (e.g., data collec-tion, aggregation, and tracking), which willforward the enriched data to the core appli-cation.

• The application provision site hosts theplatform’s central element — it is alsowhere the data subscriber applications (webservices, legal application, etc.) are linked.Based on outcomes of our risk assessment

approach, high-impact challenges to the plat-form include those that are intentional and acci-dental: malicious attacks that threaten theconfidentiality and integrity of commerciallysensitive data, DDoS attacks by extortionists,and, given the immature nature of the platform,software and hardware faults. This understand-ing ensures that we implement appropriate

Figure 5. A dynamic adaptation architecture that realizes the resilience control loop.

Update

ActivateRegister

Remediation and recovery

Apply

Respond

RequestNetworkresiliencemanager

Consultants

DefendDetection

DIScoChallengeanalysis

Policies

Deliver(alarm)

Publish(alarm)

Inform

Inform

Deliver(challenge)

Subscribe(alarm) /publish(challenge)

Subscribe(challenge) /publish(report) /

lookup(alarm)

Resilience knowledgebase

Managedentities

Informationsources

Challengemodels

Defensivemeasures

We are currently

evaluating our

generic approach to

resilience through

concrete study cases

that cover a range of

future networking

paradigms: wireless

mesh and delay tol-

erant networks,

peer-to-peer voice

conferencing and

service provision over

heterogeneous smart

environments.



defensive measures and dynamic adaptationstrategies.

Consequently, defensive measures primarilyinclude secure VPN connections between sites,enabling confidentiality and integrity of the datain transit. Security mechanisms, such as authenti-cation and firewalls, are also implemented.Redundancy of infrastructure and implementa-tion diversity of services are exploited to main-tain reliability and availability in the presence offailures caused by software faults.

Incremental challenge analysis is realizedusing the Chronicle Recognition System (CRS),a temporal reasoning system aimed at alarm-driven automated supervision of data networks[10]. Lightweight detection mechanisms gener-ate alarms based on metrics, such as anomalousapplication response times and data processingrequest rates. Finally, policy-based adaptation,implementing remediation and recovery, isachieved through the specification of the plat-form’s nominal and challenge context behavior(i.e., its configuration in response to anticipatedchallenges). In our case study, challenge con-text policies describe configurations in responseto alarms indicating a DDoS attack. For exam-ple, modified firewall configurations are definedto block traffic deemed to be malicious; servicevirtualization configurations that make use ofredundant infrastructure are also specified toload balance increased resource demands. Thetransition between behaviors is based on alertmessages, generated via challenge analysis, andoutcomes from continuous threat level assess-ment. The case study sketched above illustratesthe gain from applying our resilience strategiesin a systematic approach: starting from a riskassessment, challenges are derived, allowingdefense measures to be deployed. The follow-ing step is the specification of chronicles —

temporal descriptions of challenges — fordetection by the CRS, and policy-driven mecha-nisms to remediate and recover from unfore-seen failures.

CONCLUSIONGiven the dependence of our society on net-work infrastructures, and the Internet in partic-ular, we take the position that resilience shouldbe an integral property of future networks. Inthis article, we have described a systematicapproach to network resilience. Aspects of ourwork represent a longer-term vision of resilienceand necessitate more radical changes in the waynetwork operators currently think aboutresilience. Further experimentation and closerengagement with operators through initiativeslike ENISA, which focus on the resilience ofpublic communication networks and services,are required before some of this researchbecomes standard practice. On the other hand,application-level measures, such as service virtu-alization, necessitate fewer changes at the net-work core and lend to easier implementation.Further benefits for network practitioners areanticipated through the use of tools like theGraph Explorer, which can explore correlationsamong metrics at various levels of networkoperation.

ACKNOWLEDGEMENTSThe work presented in this article is supportedby the European Commission under Grant No.FP7-224619 (the ResumeNet project). Theauthors are grateful to the members of theResumeNet consortium, whose research hascontributed to this article, and in particular toChristian Doerr and his colleagues at TU Delftfor the work presented in Fig. 3.

Table 1. A selection of defensive measures developed in the ResumeNet Project.

Defensive measure Description Innovation

Survivable NetworkDesign (SND) [11]

During network planning, SND optimizes networkoperations, such as routing and transport, in thepresence of high-impact challenges.

Expansion of the methodology to derive a cooperation-friendly routing scheme for Wireless Mesh Networks(WMNs) to cope with node selfishness, explicitlyaccounting for radio interference constraints [12].

Game-theoreticalnode protection[13]

Node protection schemes are deployed againstpropagation of malware, which may compromisenetwork nodes and threaten the network resilience.

The game-theoretic formulation of the problem con-firms heavy dependence on the underlying topology andallows for optimal tuning of node protection level.

Rope-ladder routing[14]

Multi-path forwarding structure combining link andnode protection in a way that the loss gap and QoSpenalty, e.g., delay, during fail-over is minimized.

Better use of path diversity for support of real-time traf-fic, e.g., voice flows, for which burst packet loss duringthe path recovery time matters.

Cooperative SIP(CoSIP) [15]

An extension of the Session Initiation Protocol (SIP),whereby endpoints are organized into a peer-to-peer (P2P) network. The P2P network stores locationinformation and is used when the SIP server infra-structure is unavailable.

Optimal setting of the number of replica nodes in theP2P network for given service reliability levels, inlinewith an enhanced trace-driven reliability model.

Virtual servicemigration

Enables redundancy and spatial diversity by relocat-ing service instances on-the-fly, such that a continu-ous acceptable service can be provided to its users.

Existing approaches are tailored toward resilience tohardware failures within data centers. The derivation ofservice migration strategies from migration primitives,providing resilience against a variety of challenges.



REFERENCES[1] J. P. G. Sterbenz et al., “Resilience and Survivability in

Communication Networks: Strategies, Principles, andSurvey of Disciplines,” Elsevier Computer Networks,Special Issue on Resilient and Survivable Networks, vol.54, no. 8, June 2010, pp. 1243–42.

[2] P. Cholda et al., “A Survey of Resilience DifferentiationFrameworks in Communication Networks,” IEEE Com-mun. Surveys & Tutorials, vol. 9, no. 4, 2007, pp.32–55.

[3] ENISA Virtual Working Group on Network Providers’Resilience Measures, “Network Resilience and Securi-ty: Challenges and Measures,” tech. rep. v1.0, Dec.2009.

[4] J. P. G. Sterbenz et al., “Evaluation of NetworkResilience, Survivability, and Disruption Tolerance: Anal-ysis, Topology Generation, Simulation, and Experimen-tation (invited paper),” Springer Telecommun. Sys.,2011, accepted Mar. 2011.

[5] C. Doerr and J. Martin-Hernandez, “A ComputationalApproach to Multi-Level Analysis of NetworkResilience,” Proc. 3rd Int’l. Conf. Dependability, Venice,Italy, July 2010.

[6] R. Fonseca et al., “X-trace: A Pervasive Network TracingFramework,” 4th USENIX Symp. Networked Sys. Design& Implementation, Santa Clara, CA, June 2007, pp.271–84.

[7] P. Smith et al., “Strategies for Network Resilience: Capi-talizing on Policies,” AIMS 2010, Zürich, Switzerland,June 2010, pp. 118–22.

[8] D. Agrawal et al., “Policy Ratification,” 6th IEEE Int’l.Wksp. Policies for Distrib. Sys. and Networks, Stock-holm, Sweden, June 2005, pp. 223–32.

[9] V. Chandola, A. Banerjee, and V. Kumar, “AnomalyDetection: A Survey,” ACM Comp. Surveys, vol. 41, July2009, pp. 1–58.

[10] M.-O. Cordier and C. Dousson, “Alarm Driven Monitor-ing Based on Chronicles,” 4th Symp. Fault Detection,Supervision and Safety for Technical Processes ,Budapest, Hungary, June 2000, pp. 286–91.

[11] E. Gourdin, “A Mixed-Integer Model for the SparsestCut Problem,” Int’l. Symp. Combinatorial Optimization,Hammamet, Tunisia, Mar. 2010, pp. 111–18.

[12] G. Popa et al., “On Maximizing Collaboration in Wire-less Mesh Networks Without Monetary Incentives,” 8thInt’l. Symp. Modeling and Optimization in Mobile, AdHoc and Wireless Networks, May 2010, pp. 402–11.

[13] J. Omic, A. Orda, and P. Van Mieghem, “ProtectingAgainst Network Infections: A Game Theoretic Perspec-tive,” Proc. 28th IEEE INFOCOM, Rio de Janeiro, Brazil,Apr. 2009, pp. 1485–93.

[14] J. Lessman et al., “Rope Ladder Routing: Position-Based Multipath Routing for Wireless Mesh Networks,”Proc. 2nd IEEE WoWMoM Wksp. Hot Topics in MeshNetworking, Montreal, Canada, June 2010, pp. 1–6.

Figure 6. The ICOM platform connecting enterprise sites that perform data processing and application provisioning with objects in a smartenvironment. Selected resilience mechanisms are shown that can be used to mitigate identified challenges.

VPN

Defend

Def

end

VPN

VPN

Application provision site

Acquisition

RFID tag RFID reader

2D barcode Mobile photo

1D barcode Scanner

NFC mobile Paymentdevice

Data acquisition site

Legalapplications

Webapplications

Webservices

Softwarepackages

Collectionaggregation

Data processing site

Persistantstorage

Transport

Traceability

Def

endD

efen

d

Def

end

Nominalbehaviour

Challengecontexts

Alarm

Timeout

Remediate and recover

Internet

SMITH LAYOUT 6/20/11 4:20 PM 6


[15] A. Fessi et al., “A Cooperative SIP Infrastructure forHighly Reliable Telecommunication Services,” ACMConf. Principles, Sys. and Apps. of IP Telecommun.,New York, NY, July 2007, pp. 29–38.

BIOGRAPHIESPAUL SMITH is a senior research associate at Lancaster Uni-versity’s School of Computing and Communications. Hesubmitted his Ph.D. thesis in the area of programmablenetworking resource discovery in September 2003, andgraduated in 1999 with an honors degree in computer sci-ence from Lancaster. In general, he is interested in the vari-ous ways that networked (socio-technical) systems fail toprovide a desired service when under duress from variouschallenges, such as attacks and misconfigurations, anddeveloping approaches to improving their resilience. In par-ticular, his work has focused on the rich set of challengesthat face community-driven wireless mesh networks.

DAVID HUTCHISON is director of InfoLab21 and professor ofcomputing at Lancaster University, and has worked in theareas of computer communications and networking formore than 25 years, recently focusing his research effortson network resilience. He has served as member or chair ofnumerous TPCs (including the flagship ACM SIGCOMM andIEEE INFOCOM), and is an editor of the renowned SpringerLecture Notes in Computer Science and the Wiley CNDSbook series.

JAMES P. G. STERBENZ is director of the ResiliNets researchgroup at the Information & Telecommunication TechnologyCenter and associate professor of electrical engineeringand computer science at The University of Kansas, a visit-ing professor of computing in InfoLab21 at Lancaster Uni-versity, and has held senior staff and research managementpositions at BBN Technologies, GTE Laboratories, and IBMResearch. He received a doctorate in computer sciencefrom Washington University in St. Louis, Missouri. Hisresearch is centered on resilient, survivable, and disruption-tolerant networking for the future Internet for which he isinvolved in the NSF FIND and GENI programs as well as theEU FIRE program. He is principal author of the book High-Speed Networking: A Systematic Approach to High-Band-width Low-Latency Communication. He is a member of theACM, IET/IEE, and IEICE.

MARCUS SCHÖLLER is a research scientist at NEC LaboratoriesEurope, Germany. He received a diploma in computer sci-ence from the University of Karlsruhe, Germany, in 2001and his doctorate in engineering in 2006 on robustnessand stability of programmable networks. Afterward he

held a postdoc position at Lancaster University, UnitedKingdom, focusing his research on autonomic networksand network resilience. He is currently working on resiliencefor future networks, fault management in femtocelldeployments, and infrastructure service virtualization. Hisinterests also include network and system security, intru-sion detection, self-organization of networks, future net-work architectures, and mobile networks including meshand opportunistic networks.

ALI FESSI is a researcher at the Technische UniversitätMünchen (TUM). He holds a Ph.D. from TUM and a Diplom(Master’s) from the Technische Universität Kaiserslautern.His research currently focuses on the resilience of networkservices, such as web and SIP, using different techniques(e.g., P2P networking, virtualization, and cryptographicprotocols). He is a regular reviewer of several scientific con-ferences and journals, such as ACM IPTComm, IEEE GLOBE-COM, IFIP Networking, and IEEE/ACM Transactions onNetworking.

MERKOURIS KARALIOPOULOS is a Marie Curie Fellow in theDepartment of Informatics and Telecommunications,National and Kapodistrian University of Athens, Greece,since September 2010. He was a postdoctoral researcher inthe University of North Carolina, Chapel Hill, in 2006 and asenior researcher and lecturer at ETH Zurich, Switzerland,from 2007 until 2010. His research interests lie in the gen-eral area of wireless networking, currently focusing on net-work resilience problems related to node selfishness andmisbehavior.

CHIDUNG LAC is a senior researcher at France Telecom(Orange Labs). Besides activities linked with network archi-tecture evolution, for which he contributes to the designof scenarios and roadmaps, his research interests are cen-tered on network and services resilience, particularlythrough his involvement in European projects such as theReSIST Network of Excellence (2006–2009) and the presentSTREP ResumeNet (2008–2011). He holds a Doctoratd’Etat-és-Sciences Physiques (1987) from the University ofParis XI Orsay.

BERNHARD PLATTNER is a professor of computer engineeringat ETH Zurich, where he leads the communication systemsresearch group. He has a diploma in electrical engineeringand a doctoral degree in computer science from ETHZurich. His research currently focuses on self-organizingnetworks, systems-oriented aspects of information security,and future Internet research. He is the author or co-authorof several books and has published over 160 refereedpapers in international journals and conferences.


network resilience: a systematic approach · 90 ieee communications magazine • july 2011...

Documents