A Lightweight Approach to Distributed Network Diagnosis under Uncertainty


Abstract. Network management faces major challenges nowadays. Management applications have not kept pace with changing networks and services, and still rely on centralized approaches. Moreover, uncertainty is part of the reality. This paper presents a lightweight collaborative approach to network troubleshooting, based on multi-agent and probabilistic techniques. The proposed architecture has been applied to three different network environments.

Index Terms: Bayesian Network, Multi Agent System, Network Troubleshooting, Uncertainty, Collaborative Diagnosis.

I. INTRODUCTION

Telecommunication networks are growing in size and complexity, with a rich blend of services being deployed on top of them. All kinds of organizations, from small companies to NGOs, academic institutions and Fortune 500 corporations, use the Internet as an infrastructure that supports their operations. However, the power and flexibility of this kind of solution has some drawbacks. Since there are multiple players and no central authority, uncertainty becomes a burden.

Network management is facing a change of paradigm after a long period of stability. For decades, well-established reference models like ITU-T TMN [1] or, more recently, TMF NGOSS [2] have guided the design of commercial and in-house systems, with a wide range of applications for the telecom industry and corporate networks. IETF SNMP has played a similar role for small and medium-size business solutions.

Classical architectures share an underlying design principle: the state of all existing entities can be fully known at any given moment in time. A hierarchy of layers allows a well-engineered distribution of functions. The five big management areas (FCAPS: Fault, Configuration, Accounting, Performance, Security) guarantee that the network is always under control.

This deterministic approach has proved very useful when the entire infrastructure belongs to the same domain, as is the case inside a telecom operator network, when well-defined interfaces allow the interconnection of different domains, or when all players speak the same management language.

Another important feature of the classical model, related to the previous one, is the centralized design. Huge network inventories, extremely complex end-to-end monitoring applications or even trouble ticketing workflows behave as part of a Big Brother that needs every piece of information to react when something unexpected happens.

Manuscript received August 9, 2009. This work was supported in part by the Spanish Ministerio de Industria, Turismo y Comercio, Avanza I+D 2008 grant TSI-020400-2008-27, under the CELTIC Initiative project MAGNETO.

We can compare this situation to the state of development of classical mechanics by the end of the 18th century, when it seemed that the Universe was like a perfect clock. Pierre-Simon Laplace wrote in 1814, in the Essai philosophique sur les probabilités:

Une intelligence qui, à un instant donné, connaîtrait toutes les forces dont la nature est animée et la situation respective des êtres qui la composent embrasserait dans la même formule les mouvements des plus grands corps de l'univers et ceux du plus léger atome; rien ne serait incertain pour elle, et l'avenir, comme le passé, serait présent à ses yeux.

(An intelligence which, at a given instant, knew all the forces by which nature is animated and the respective situation of the beings that compose it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; nothing would be uncertain for it, and the future, like the past, would be present to its eyes.)

The problem with Laplace's demon is that, when the number of state variables reaches a critical threshold, it is no longer computable due to scalability problems. Telecom networks share this property: the number of managed entities grows so quickly that it is impossible for centralized solutions to keep pace. There are also many other sources of uncertainty that challenge the traditional view, such as the multiplicity of management domains or the emergence of properties in complex networks [3]-[6].

Uncertainty cannot be underestimated or dismissed as an undesirable collateral effect in network management. It must be considered an intrinsic property of telecom networks, and it should be taken into account to avoid expensive workarounds.

We present in this paper a lightweight, collaborative and distributed approach to network troubleshooting, developed as an internal initiative of Telefónica I+D to foster innovative management solutions. The same principles have been applied to three different scenarios: KOWGAR, an online troubleshooter to help the end users of a geographically distributed corporate network; KOWLAN, an automatic diagnosis system for the Ethernet/VPN commercial service of Telefónica España (MACROLAN); and KOWGAR@HOME, the application of these concepts to Home Area Networks, where two different management domains (ISP and Home Area Network) have to cooperate to reach valid conclusions.

Javier García-Algarra, Pablo Arozarena-Llopis, Sergio García-Gómez, Álvaro Carrera-Barroso
Telefónica I+D, Spain
{algarra,pabloa,sergg,[email protected]}

II. DISTRIBUTED DIAGNOSIS UNDER UNCERTAINTY

Troubleshooting is one of the network management fields most sensitive to uncertainty when traditional centralized approaches are used. Based on remote access to testing and information capabilities, network management tries to come up with a conclusion but, if the information is incomplete or inaccurate, the process may get blocked. Diagnosing affected services is usually very complicated: it involves dozens of elements of heterogeneous technologies and finally requires human intervention to solve the puzzle. As human expertise is a scarce and expensive resource, the research community is endeavoring to build systems that emulate network engineers working under uncertainty.

There are already several efforts to design frameworks for distributed network management. AutoMON [7] uses a P2P-based solution. Performance and reliability are tested by distributed agents, although the nodes do not cooperate or use historical information about failures. Distributed fault management in the Connected Home [8] also uses an agent-based approach. DYSWIS [9] presents an Architecture for Automated Diagnosis of Networks that consists of detection and diagnosis nodes. The former look for failures by passive traffic monitoring and active probing, while the latter determine the root cause using historical information and by performing active tests. Network dependency relationships are encoded as rules. MADEIRA [10] introduced a distributed architecture for dynamic device management. The use of a dynamic hierarchy allows the system to adapt to network failures and state changes, something that is considered essential for future large-scale networks, which must exhibit adaptive, decentralized control behaviors.

Several authors have explored the application of probabilistic techniques to network management [11]-[13]. For fault diagnosis, many of these advances use Bayesian inference [14]-[18] to infer the cause of the observed network problems. In particular, CAPRI [19] defines a Common Architecture for Distributed Probabilistic Internet Fault Diagnosis. The whole picture is similar to the one in DYSWIS, but it uses Bayesian networks instead of rules to infer the root cause.

Bayesian networks (BN), a term coined by Judea Pearl [20], are based upon probability theory. The problem domain is represented as a directed acyclic graph where the nodes represent variables and the arcs represent conditional dependencies between them. Graphs are easy to work with, so Bayesian networks can be used to produce models that are simple for humans to understand, and they admit effective algorithms for inference and learning [21]. Bayesian networks have been successfully applied to numerous areas, including medicine, decision support systems, and text analysis [22].
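To make the graph-plus-tables idea concrete, the following minimal sketch encodes a toy network and computes a posterior by brute-force enumeration of the joint distribution. The variable names and all probability values are hypothetical and chosen only for illustration; they are not taken from the systems described in this paper, and a real inference engine would use far more efficient algorithms.

```python
from itertools import product

# Hypothetical toy network: two root causes and two observable symptoms.
# Each entry maps a variable to (parents, CPT), where the CPT maps a tuple
# of parent values to P(variable = True | parents).
NETWORK = {
    "link_failure": ((), {(): 0.05}),
    "dns_failure": ((), {(): 0.10}),
    "ping_fails": (("link_failure",), {(True,): 0.95, (False,): 0.02}),
    "http_fails": (
        ("link_failure", "dns_failure"),
        {(True, True): 0.99, (True, False): 0.97,
         (False, True): 0.90, (False, False): 0.03},
    ),
}


def joint_probability(assignment):
    """Probability of one full assignment: product of the local CPT entries."""
    p = 1.0
    for var, (parents, cpt) in NETWORK.items():
        p_true = cpt[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p


def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over unobserved variables."""
    hidden = [v for v in NETWORK if v != query and v not in evidence]
    totals = {True: 0.0, False: 0.0}
    for q in (True, False):
        for values in product((True, False), repeat=len(hidden)):
            assignment = {**evidence, query: q, **dict(zip(hidden, values))}
            totals[q] += joint_probability(assignment)
    return totals[True] / (totals[True] + totals[False])


# HTTP fails but ping works: DNS becomes a much more likely culprit than a link failure.
p_dns = posterior("dns_failure", {"http_fails": True, "ping_fails": False})
p_link = posterior("link_failure", {"http_fails": True, "ping_fails": False})
```

Enumeration is exponential in the number of unobserved variables, so it is only practical for very small graphs such as this one.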

III. PRINCIPLES AND FEATURES

Our strategy is based on three principles and three design decisions, taken before the study of each particular scenario.

The first principle is that any new solution must be neutral and fit on any OSS (Operations Support Systems) map. One common mistake in the management field is to expect the surrounding IT systems to adapt to our needs, and this assumption is a source of delays and expenditure. The second is that deployment disruption must be minimal, so it should not be necessary to replace any existing system. The third is that uncertainty is unavoidable, so new systems have to take it properly into account.

The first design decision is that systems have to be distributed and grow organically as a part of the network, not as an external watchman. The second is that semantics is part of human knowledge, so systems must be based on semantics from the very beginning. The third is the application of probabilistic techniques to deal with uncertainty. The combination of BNs with semantic technologies is an active research area [23]-[25] that provides a high degree of flexibility.

Besides these principles and design decisions there is an additional economic constraint: systems must be cheap to develop, cheap to deploy and cheap to maintain. Economy is a key issue, since network operation costs have not dropped at the same rate as equipment prices.

In order to cut down development and installation costs, only Open Source building blocks have been used to build this environment. A second strategic decision was to adopt an Agile methodology [28], to focus the effort on running software and to reduce the management overhead that is common in complex organizations. In particular, Scrum was the methodology chosen [29].

Regarding hardware, one of the goals is to take advantage of the unused CPU resources that are distributed all over the network. The JADE platform (jade.tilab.com) allows the deployment of Java-coded agents with minimal CPU and memory requirements [31]. When dedicated hardware is needed to run parts of the system, commodity hardware is the best option. For example, a 700 Quad Core Intel PC with Ubuntu Linux is enough to run the MACROLAN diagnosis prototype for the whole Telefónica España network.

Maintenance costs are a main concern for IT systems in general. One of the advantages of this approach is that business logic is encoded in a Bayesian Network that can be distributed all over the network at any time. This allows new diagnosis capabilities to be added, or the BNs to be modified, without stopping the system. The Bayesian inference engine chosen is SamIam, a light and fast Java implementation developed at UCLA (http://reasoning.cs.ucla.edu/samiam/).

IV. BAYESIAN KNOWLEDGE CAPTURE

When modelling a Bayesian Network, two things have to be defined: first the structure of the network and then the parameter values of the Conditional Probability Tables (CPTs).

There are several alternatives for initially building a BN. When no domain expert knowledge is available but there is a large amount of historical data, the structure can be built automatically by using structural learning algorithms, such as K2 [32] and tree augmented Naïve Bayes (TAN) [33]. K2 is a simple and very fast learning algorithm, but its results depend on the initial ordering of the input data, so it makes sense to run it several times with different random orderings. TAN is another good learning algorithm for Bayesian network classifiers; it is linear in the number of instances and quadratic in the number of attributes. The main problem with these structure learning algorithms is that, since they are based only on statistics, they may come up with wrong causal dependencies.

For this reason, the structure of the Bayesian Networks is preferably defined manually as a first step. Together with the structure, the initial CPT parameters are set. This task may require the cooperation of a team of experts in the problem domain with knowledge engineers to properly model the BN.

Another challenge is the distribution of the intelligence across the physical network. When the scenario is complex, as is the case in network troubleshooting, a smarter approach than using a centralised BN is to partition the whole domain into smaller BNs [26]. Following this principle, different elements in different parts of the network may have different views and knowledge. For instance, some agents may only diagnose network problems while others diagnose service problems, exchanging their conclusions to cooperatively reach a valid diagnosis. In other words, the single BN that could exist within a centralised solution is fragmented and distributed over the overlay agent network.

The Virtual Evidence Method (VEM) [27] is an algorithm that partitions the network into several pieces to perform partial inference and belief interchange. With VEM, each agent only needs to know a subset of the whole BN. This makes the diagnosis process more scalable, since computing resources are distributed across the physical network. It also facilitates reusing knowledge in other diagnosis processes that share common parts of the BN.

Moreover, if the Bayesian Network is partitioned according to the physical network topology, VEM enables mapping the diagnosis process to the different network domains.
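The belief interchange between two fragments can be illustrated with the general notion of virtual (soft) evidence: one agent summarises what it has observed about a shared variable as a likelihood vector, and the other agent folds that vector into its own belief. The sketch below shows only this general idea under simplifying assumptions (a single shared variable, one local observation); it is not the specific algorithm of [27], and the variable names and numbers are hypothetical.

```python
def normalize(dist):
    """Scale a dictionary of non-negative weights so that it sums to 1."""
    total = sum(dist.values())
    return {state: weight / total for state, weight in dist.items()}


def likelihood_message(cpt, observed):
    """Message sent by the observing agent: lambda(x) = P(observed | shared = x)."""
    return {state: cpt[state][observed] for state in cpt}


def apply_virtual_evidence(belief, message):
    """Receiving agent: multiply its belief by the message and renormalize."""
    return normalize({state: belief[state] * message[state] for state in belief})


# Hypothetical shared variable "access_link" with states "ok" and "down".
# Agent A (e.g. inside the customer premises) knows how its local symptom
# depends on the shared variable; agent B only holds a belief about the link.
cpt_sync = {
    "ok": {"sync_lost": 0.05, "synced": 0.95},
    "down": {"sync_lost": 0.90, "synced": 0.10},
}
belief_b = {"ok": 0.9, "down": 0.1}

message = likelihood_message(cpt_sync, "sync_lost")
updated = apply_virtual_evidence(belief_b, message)
```

Only the likelihood vector crosses the domain boundary, so agent B never needs to know the structure or the raw observations of agent A's fragment.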

V. SELF LEARNING

One significant goal of our work is to allow the diagnostic intelligence to adapt itself and improve over time. To achieve this, the system must be able to learn from past actions, which requires a feedback loop to mark past diagnoses as successful or not. This is very challenging, since there is no straightforward way to assess diagnosis results. A possible solution is to request human feedback from the network operators on the usefulness of each diagnosis. Once this is done, parametric learning algorithms may be used to improve diagnosis quality. These methods update the existing BN, but they only modify the link weights in the CPTs and not its structure.

Expectation Maximization (EM) is a well-known parametric learning algorithm used in statistics to find maximum likelihood estimates of parameters in probabilistic models [30]. EM is an iterative algorithm that estimates the missing values in the input data representing previous diagnoses. This is relevant because for some diagnoses only a subset of the possible evidences may be available. Once this is done, statistical methods are used to recalculate the weights in the BN based on the set of previous diagnoses. KOWGAR uses this algorithm because it provides the necessary features for updating the values of the CPTs which, along with the structure, embody the knowledge embedded in a BN.
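As an illustration of parametric learning with incomplete records, the sketch below runs EM on the smallest possible network, a single fault node with one symptom child. It assumes that each past diagnosis is stored as a (fault, symptom) pair, where the fault value is None when no operator feedback was received; this record layout, the starting values and the sample data are hypothetical.

```python
def em_fit(records, iterations=50):
    """Estimate P(fault) and P(symptom | fault) from partially labelled records.

    records: list of (fault, symptom) pairs; fault is True, False or None
    (None means the outcome of that diagnosis was never confirmed).
    """
    p_fault = 0.5                      # arbitrary starting point
    p_sym = {True: 0.6, False: 0.4}    # P(symptom = True | fault)
    for _ in range(iterations):
        # E-step: expected value of "fault is True" for every record.
        weights = []
        for fault, symptom in records:
            if fault is None:
                like_t = p_fault * (p_sym[True] if symptom else 1 - p_sym[True])
                like_f = (1 - p_fault) * (p_sym[False] if symptom else 1 - p_sym[False])
                weights.append(like_t / (like_t + like_f))
            else:
                weights.append(1.0 if fault else 0.0)
        # M-step: re-estimate the CPT entries from the expected counts.
        n_true = sum(weights)
        n_false = len(records) - n_true
        p_fault = n_true / len(records)
        p_sym[True] = sum(w for w, (_, s) in zip(weights, records) if s) / n_true
        p_sym[False] = sum(1 - w for w, (_, s) in zip(weights, records) if s) / n_false
    return p_fault, p_sym


# Fully labelled history: EM reduces to plain frequency counting.
labelled = ([(True, True)] * 3 + [(True, False)]
            + [(False, False)] * 5 + [(False, True)])
p_fault_full, p_sym_full = em_fit(labelled)

# Same history plus four diagnoses without operator feedback.
partial = labelled + [(None, True)] * 3 + [(None, False)]
p_fault_part, p_sym_part = em_fit(partial)
```

The sketch has no guard against degenerate data (for example, a history in which the fault is never observed), which a production implementation would need.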

VI. ARCHITECTURE OVERVIEW

The core of KOWGAR is based on a multi-agent architecture. Different types of agents have been envisaged to carry out the diagnosis process.

- The Interface Agents are in charge of communication with outside systems, such as a user interface, a trouble ticketing system, a network inventory, etc. In some cases, these agents provide an interface as an external service. In other cases, they use the interfaces of other systems.

- The Observation Agents' mission is to get evidences from the managed networks and services. These agents have interfaces with the network resources and their Management Information Bases (MIBs), or they can exploit external testing tools or services.

- The Diagnosis Agents orchestrate the diagnosis process and gather evidences from the Observation Agents in order to carry out the Bayesian inference. This process is driven by the Bayesian Network specifications. Usually, these agents are specialized for a network/service configuration.

- The Persistence Agents manage the storage and retrieval of KOWGAR information from the system database when required.

- The Knowledge Agent's task is to distribute the Bayesian knowledge to the agents that need it.

All the communications between agents are based on FIPA-ACL messages and a specific ontology that the agents understand. This ontology defines the Bayesian Network structure (hypotheses, evidences, conditional probabilities, thresholds, etc.), the information about a diagnosis operation (observations, beliefs, additional information, etc.), and the actions that the agents can carry out. The communications with external systems are usually based on XML messages over HTTP or on standard Web Services.

As explained in the following sections, this architecture is customized depending on the needs of each scenario. The types of agents must be specialized to adapt the architecture, and different Bayesian Networks are defined for each situation.

VII. KOWGAR: BAYESIAN DIAGNOSIS IN A CORPORATE NETWORK

KOWGAR is a proof of concept to test the power of a distributed Bayesian diagnosis system, deployed on Telefónica I+D's own corporate Intranet. The system targeted a reduced set of problems, those related to web navigation of internal and external sites. The scenario is very common in any kind of institutional Intranet with geographically distant locations linked via a VPN. In this case Telefónica España acts as the ISP that also provides connectivity. Like any other customer, Telefónica I+D network managers perceive this infrastructure as a cloud or a black box. KOWGAR was designed to help the end user so, to allow a quick and cheap deployment, a tiny Firefox plug-in was developed as the Graphical Interface. A diagnosis is automatically launched when the browser detects a problem, and the user only has to check the final result.

Two different types of possible HTTP connections can be distinguished:

- Connections to HTTP servers within the intranet.
- Connections to HTTP servers located outside the TID's intranet, i.e. in the Internet.

The universe of root causes is selected prior to the construction of the BN. Some of them are: Local misconfiguration, Link failure, Routing failure, DNS server unreachable, DNS information incorrect, Destination host unreachable, Destination port unreachable or Destination application unavailable.

There were three locations involved in the experiment: Madrid, Valladolid and Huesca. To simulate DNS failures without disrupting the daily Company operations, a pair of mock DNS servers was installed. Fig. 1 shows the testbed and an example of what happens when an end user in the Madrid subnetwork is trying to access a web server in Boecillo that is refusing connections. When an error is detected, Firefox triggers the plug-in, which in turn activates the JADE User Agent.

This User Agent performs a set of basic testing procedures and, if everything is OK, sends a request to the Aggregation Agent, in charge of orchestrating KOWGAR's actions. This agent knows the BN and the associated ontology that explains which kind of test must be executed to gather the different possible symptoms. So, it invokes three different testing agents: routing, DNS and HTTP. The BN reaches a conclusion that, in this case, is server congestion with a high probability. The information can be displayed from the browser just by clicking the plug-in icon.

Fig. 1: KOWGAR testbed scenario and agents deployment.

This simple example shows some of the advantages of the distributed diagnosis. Problems are detected locally if possible. If the User Agent detects a network wire failure, for instance, no further tests would be necessary and the process stops. But let us take a look at the Aggregation Agent behavior when, instead of a server congestion in Madrid, there is a problem in the routing tables of a Valladolid router. In this case, the evidences gathered by the User and Routing Agents would be enough to reach this conclusion without performing additional tests. So, instead of performing every possible test available, the ontology sets the cost of each test action, which is used to select the next test to perform (the one with the lowest cost). Each time a new test is performed, providing additional evidence, the Aggregation Agent evaluates the output of the BN. When a sufficient degree of certainty is reached (also called confidence, which is set in the ontology), the process stops and returns the result with the highest probability. This avoids unnecessary tests.

Another important advantage is that the logic is not hardwired. The couple BN+Ontology, coded together as a FIPA-SL string, governs the system. Modifying the structure of the BN or the weight of each arc is as fast as distributing a new string to all interested agents. In our tests, this operation took less than one second for the whole network, using the JADE communication mechanisms. As KOWGAR was just a short-term experience developed in three months, no self-learning algorithms were implemented, but the distribution of minor manual changes in the BN and/or the ontology was actually tested.

The last important feature of KOWGAR we want to highlight is its multiplatform nature. As everything runs on top of JADE/Java, KOWGAR agents were deployed on different flavors of Windows, Linux, Solaris and HP-UX. KOWGAR agents were executed even on a 10[…] micro PC with 256 MB RAM and Puppy Linux.

We can conclude that this approach can benefit any type of organization that wants to manage common IP services using its own programming skills. Besides, all the tools employed in the prototype are Open Source and no extra HW is necessary.

VIII. KOWLAN

MACROLAN is Telefónica España's solution to build Virtual Private Networks (VPN) connecting multiple enterprise sites over Ethernet-based accesses. MACROLAN supports service speeds from 2 Mbit/s to 1 Gbit/s. By using standard L2 and L3 VPN technologies, MACROLAN enables geographically distant customer sites to communicate as if they belonged to the same LAN, in terms of speed, reliability and transparency.

The MACROLAN service is built on top of a diverse set of networks. The local loop from the customer location to the Central Office (CO) can be either fiber with media converters, when available, or copper, in rural areas. Synchronous Digital Hierarchy (SDH) circuits make it possible to extend the distance of this access segment when the closest MAN access point is in a different CO. MACROLAN traffic is aggregated into province-wide Metropolitan Ethernet networks (MANs) and then into an IP backbone network that provides national coverage.

MACROLAN poses an interesting challenge, since it involves a number of different technologies. Diagnosis requires a high degree of skill, experience and ability, since it involves accessing half a dozen different OSSs simultaneously. The goal was then to capture the expertise of human operators, model it in BNs and thus reduce manual intervention for the most common types of failures. There were two additional constraints for the experience: the system should be deployed without any change in the existing OSSs, and it should be running on a live network from the first month, with a six-month deadline for the whole Scrum project.

The system architecture consists of three main blocks: the multi-agent platform, the web user interface and the database.

The behavior of KOWLAN agents is based on Bayesian inference, as we have explained. For this experience there are six different BNs, one for each type of access technology, depending on the combination of fiber, copper and the existence of an SDH path. Knowledge capture required only two working days of a team of expert technicians and two Telefónica I+D engineers.

In order to provide a common language so that the agents can share knowledge about BNs and diagnosis operations, a top-level ontology has been defined. This ontology includes information about hypotheses and evidences, information about the probabilities of the current hypothesis beliefs, information about topology and inventory, and information about probabilistic dependencies among hypotheses and evidences.

These are the steps each diagnosis procedure follows:

1. Requests can be launched by a human operator from the user interface or, more commonly, as the result of the arrival of a Trouble Ticket (TT). An interface agent periodically polls the TT system so, when a new one is assigned to the MACROLAN technical center, a KOWLAN diagnosis procedure is triggered without human intervention.

2. The only pieces of information available about the failure are the circuit identification and the reported symptom. From two corporate inventories, specialized KOWLAN agents get the full circuit description: topology, scenario, equipment data and configuration.

3. The appropriate Diagnosis Agent is instantiated according to the scenario. This Agent orchestrates the ontology-coded behavior:

   a. The Diagnosis Agent selects the best available test to perform, taking into account its cost and whether there is enough data to execute it. Then, it requests the test from the appropriate Observation Agent, gets the results, performs the Bayesian inference and checks if any hypothesis threshold has been reached. If not, it selects the next test to perform.

   b. When enough certainty in the diagnosis is reached or there are no more tests to perform, it stores the results in the database.
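The loop in step 3 can be sketched as follows. This is a simplified illustration, not the KOWLAN implementation: it assumes that test outcomes are conditionally independent given the hypothesis (a naive Bayes update instead of a full BN), that the cheapest remaining test is always the best one, and that every test is a pass/fail check. The hypothesis names, test names, costs and probabilities are all hypothetical.

```python
def diagnose(prior, tests, run_test, threshold=0.9):
    """Run tests in order of increasing cost until one hypothesis is certain enough.

    prior:    dict hypothesis -> prior probability
    tests:    list of (name, cost, likelihood), where likelihood maps each
              hypothesis to P(test reports a failure | hypothesis)
    run_test: callable taking a test name and returning True if it failed
    """
    belief = dict(prior)
    performed = []
    for name, _cost, likelihood in sorted(tests, key=lambda t: t[1]):
        failed = run_test(name)
        performed.append(name)
        # Bayesian update of every hypothesis with the new evidence.
        belief = {h: p * (likelihood[h] if failed else 1.0 - likelihood[h])
                  for h, p in belief.items()}
        total = sum(belief.values())
        belief = {h: p / total for h, p in belief.items()}
        # Stop as soon as the confidence threshold is reached.
        if max(belief.values()) >= threshold:
            break
    best = max(belief, key=belief.get)
    return best, belief[best], performed


prior = {"loop_fault": 0.3, "router_misconfigured": 0.3, "backbone_fault": 0.4}
tests = [
    ("config_audit", 10, {"loop_fault": 0.05, "router_misconfigured": 0.95, "backbone_fault": 0.05}),
    ("ping_router", 1, {"loop_fault": 0.90, "router_misconfigured": 0.80, "backbone_fault": 0.10}),
    ("line_test", 5, {"loop_fault": 0.95, "router_misconfigured": 0.05, "backbone_fault": 0.05}),
]
# Simulated outcomes standing in for the Observation Agents.
outcomes = {"ping_router": True, "line_test": True, "config_audit": False}

best, confidence, performed = diagnose(prior, tests, outcomes.__getitem__)
```

In this run the two cheapest tests are enough to exceed the threshold, so the most expensive one is never requested.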

The user interface is a decoupled PHP application, since KOWLAN has been designed to be fully automatic when possible. The operator can request a diagnosis on a circuit, display its details and evaluate the accuracy of the diagnosis through a simple Internet-like polling mechanism. This evaluation may be useful for further refinement of the BN parameters by using self-learning mechanisms.

After two months of usage, KOWLAN was able to diagnose around 45% of failures with accuracy higher than 90%. Thanks to this, human intervention has been reduced by 30%, as the analysis of a sample of 3000 trouble tickets has shown.

The key to the successful application of Bayesian diagnosis in KOWLAN has been the high involvement of the MACROLAN maintenance team in the project. Knowledge capture has proved to be very accurate: the system is only as good as the technicians are, and in this case they are exceptionally skilled. Besides, some shell scripts developed in past years by this group were integrated as part of the KOWLAN diagnosis toolbox. This kind of in-house development should be carefully considered as a repository of experience instead of as a threat to the formal IT structure of organizations.

IX. KOWGAR@HOME: THE OUTER EDGE PROBLEM

As we have seen, KOWGAR is a system for distributed Bayesian diagnosis in a corporate network, while KOWLAN applies the same principles to a big telecom operator network. In the first case, the leased infrastructure is seen as a fuzzy cloud, while in the second case diagnostic tools can only test the last operator equipment (customer router) but are blind beyond this point. Each domain sees the other one as an outer edge, a region of extreme uncertainty. This situation is becoming usual in daily operations, and the final customer needs a whole picture despite the managed entities belonging to different players.

Network management in the outer edge scenario is a complex task, not only because of the technical constraints, but also because of legal, regulatory, security and commercial issues. In the scope of the MAGNETO project [34], a fault diagnosis prototype is being developed to address Home Area Network (HAN) troubleshooting.

Management of HANs represents a huge challenge for telecom operators, since it has to combine the management of different network domains. Some of these domains are under the control of the telecom operator, while the rest belong to the end customer. Current management architectures address HAN management from a centralized perspective, where management tasks are performed in a management system that remotely accesses customer equipment using protocols such as TR-069 [35].

In order to allow some degree of autonomy to themanagement of HANs, MAGNETO is exploring a distributedarchitecture where management tasks are locally executedimproving efficiency and reducing the burden on centralizedservers. An important MAGNETO feature is its capability ofself diagnosing problems in the HAN. This self diagnosis mayrequire cooperation between management agents placed in


6/7

different HANs or even network domain

MAGNETO agents sitting on HAN equip

some conclusions based on the evidences loc

may also need to cooperate with agents in th

other HANs to exchange their views of the pr

reach a valid diagnosis.

KOWGAR@HOME is the testbed to vali

solution for distributed troubleshooting that

further in the wider scope of MAGNETO. Itapproach to the one already described for

KOWLAN. It uses Bayesian Network Infer

the cause of service and network failures eve

uncertainty where there is not full knowled

and network status. Regarding the modeling

knowledge, the initial definition of the Bayes

to diagnose each service will be based on

experts in the field, like network operators.

In KOWGAR@HOME, some parts of the

Network can be executed in HAN dev

gateways, set top boxes, etc.) while other pa

ISPs network or in other HANs. Each of t

domains will have a different perspective

being diagnosed and will exchange inform

conclusions to cooperatively reach a valid di

purpose, it has been decided to use the

Method for the partition of the BNs, as desc

paragraphs.

Figure 2 depicts a possible

KOWGAR@HOME agents. As can be seen

deployed at different domains, from the

environment. It is foreseen to have

KOWGAR@HOME agent inside the HAN,

in a well-equipped device like a residential ga

Fig.2: KOWGAR@HOME deployment in a HAN

s. For example,

ment may reach

ally available but

e ISP network or

oblem in order to

ate the technical

ill be integrated

follows a similarKOWGAR and

ence to diagnose

n in situations of

ge about service

of the diagnosis

ian Network used

knowledge from

overall Bayesian

ices (like home

rts run inside the

ese management

on the problems

ation about their

iagnosis. For that

irtual Evidence

ribed in previous

deployment of

, there are agents

AN to the ISP

at least one

ost likely sitting

teway.

nvironment.

Additional agents can be deployed in other devices like set top boxes, although this is not compulsory. Moreover, given the resource limitations of most multimedia devices, it may be preferable to access them remotely from the residential gateway.

Agents deployed in the ISP network take advantage of existing computing resources like OSS servers or even network equipment. These agents have access to information that is not available inside the HAN and can therefore be used to complete the diagnosis process, also making use of the results of the partial inference conducted inside the HAN. In addition, ISP agents can communicate with OSSs belonging to the ISP for two purposes: firstly, to request relevant diagnosis information from them and, secondly, to automatically feed them the results of the diagnosis process, when appropriate, by triggering trouble tickets or alarms.

Another important KOWGAR@HOME feature is its capability to improve its own diagnosis capabilities by means of self-learning processes. To make this more valuable and accurate, a central server in the ISP network is in charge of gathering historical data about past diagnoses from the different domains involved. Periodically, these data will be used to update the BNs by applying parametric learning algorithms. The new knowledge acquired will then be distributed to the appropriate KOWGAR@HOME agents so it can be used for further diagnoses.

CONCLUSIONS

Troubleshooting is a very common task in Network Management. This paper has described a lightweight solution for distributed diagnosis based on Bayesian Networks and built with Open Source components on top of a Multi Agent framework. The proposed architecture has been tested on three different scenarios: a corporate Intranet, a telco VPN infrastructure and an experimental testbed for a multidomain Digital Home proof of concept.

The most relevant results are the flexibility and short time-to-market of the proposed solution. This approach is a good starting point for any organization that wants to develop an in-house troubleshooting system to cope with the growth and complexity of its own network.

One important contribution of KOWGAR to this kind of problem is its capacity to deal with uncertainty. It is often difficult to obtain information and tests from different systems. The probabilistic approach provides an answer in any case, although the more evidence is available, the higher the certainty the system attains.

Another important benefit is related to scalability, as it has been seen that the multi-agent paradigm offers a good solution to easily deploy decoupled pieces of software in an environment where the overall architecture is not known. Agents can be proactively exploited to locally detect and solve problems, avoiding the participation of more centralized agents in the architecture.

In the future we plan to further explore some of the challenges identified, such as improving self-learning capabilities, smart partitioning of the BNs, automatic feedback on the success of diagnosis and dynamic generation of BNs.
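As an illustration of the parametric learning step that the central ISP server is described as running over past diagnoses, the following minimal sketch re-estimates one conditional probability table (CPT) of a BN by counting. It is not the KOWGAR implementation: the function `learn_cpt`, the variable names `link_fault` and `ping_test`, the sample records and the additive smoothing constant are all assumptions made for the example. It also assumes every record is fully observed; with missing values an iterative method such as EM would be needed instead.

```python
from collections import Counter


def learn_cpt(records, child, parents, child_states, prior_count=1.0):
    """Estimate P(child | parents) from fully observed records.

    Uses relative frequencies with additive smoothing (prior_count),
    so states never seen in the data still get a non-zero probability.
    All names here are hypothetical, chosen only for illustration.
    """
    joint = Counter()  # counts of (parent configuration, child state)
    for rec in records:
        config = tuple(rec[p] for p in parents)
        joint[(config, rec[child])] += 1

    configs = {cfg for cfg, _ in joint}
    cpt = {}
    for cfg in configs:
        total = sum(joint[(cfg, s)] for s in child_states)
        total += prior_count * len(child_states)
        cpt[cfg] = {
            s: (joint[(cfg, s)] + prior_count) / total for s in child_states
        }
    return cpt


# Hypothetical history of past diagnoses gathered by the central server.
history = [
    {"link_fault": "yes", "ping_test": "fail"},
    {"link_fault": "yes", "ping_test": "fail"},
    {"link_fault": "yes", "ping_test": "ok"},
    {"link_fault": "no", "ping_test": "ok"},
    {"link_fault": "no", "ping_test": "ok"},
    {"link_fault": "no", "ping_test": "ok"},
]

cpt = learn_cpt(history, "ping_test", ["link_fault"], ["ok", "fail"])
for config, dist in sorted(cpt.items()):
    print(config, dist)
```

In a deployment along the lines described in the paper, the updated table would then be pushed to the relevant agents. The smoothing constant acts as a weak prior, so a handful of historical cases cannot drive any probability to exactly zero.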

ACKNOWLEDGMENT

The authors of this paper would like to thank the MACROLAN Technical Center staff in Barcelona, Spain, for their support during the KOWLAN experience.

REFERENCES

[1] ITU-T, "Principles for a Telecommunications Management Network", Recommendation M.3010, 1996.
[2] Creaner, M., Reilly, J.: NGOSS Distilled: The Essential Guide to Next Generation Telecoms Management, The Lean Corporation, August 2005.
[3] Chih-Chun Chen, Sylvia B. Nagi and Christopher D. Clack, "Complexity and Emergence in Engineering Systems", in Complex Systems in Knowledge based Environments: Theory, Models and Applications. Tolk, Andreas; Jain, Lakhmi C. (editors). Springer: New York, NY, USA, 2009, ch. 5, pp. 99-128.
[4] M. Faloutsos, P. Faloutsos, C. Faloutsos, "On power-law relationships of the Internet topology", Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, pp. 251-262, ACM, 1999.
[5] J. Spencer, D. Johnson, A. Hastie, L. Sacks, "Emergent properties of the BT SDH network", BT Technology Journal, vol. 21, no. 2, April 2003, pp. 28-36.
[6] A. Santiago, J. P. Cárdenas, M. L. Mouronte, V. Feliu and R. M. Benito, "Modeling the topology of SDH networks", International Journal of Modern Physics C, vol. 19, no. 12, 2008, pp. 1809-1820.
[7] A. Binzenhöfer et al., "A P2P-based framework for distributed network management", in New Trends in Network Architectures and Services, LNCS vol. 3883, pp. 198-210, Loveno di Menaggio, Como, Italy, 2006.
[8] P. Utton and E. Scharf, "A fault diagnosis system for the connected home", IEEE Communications Magazine, vol. 42, no. 11, pp. 128-134, November 2004.
[9] V. Singh, "Dyswis: An architecture for automated diagnosis of networks", in IEEE Network Operations and Management Symposium, NOMS 2008, pp. 851-854, Salvador de Bahia, Brazil, 2008.
[10] Fallon L., Parker D., Collins S., Zach M., Leitner M., "Self-Forming Network Management Topologies in the Madeira Management System", Proceedings of the Autonomous Infrastructure, Management and Security International Conference, AIMS 2007, pp. 61-72, Oslo, Norway, 2007.
[11] R. Badonnel, R. State, O. Festor, "Probabilistic Management of Ad-Hoc Networks", 10th IEEE/IFIP Network Operations and Management Symposium, NOMS 2006, pp. 339-350, Vancouver, Canada, 2006.
[12] Jianguo Ding, Bernd Krämer, Shihao Xu, Hansheng Chen and Yingcai Bai, "Predictive Fault Management in the Dynamic Environment of IP Networks", Proceedings IEEE Workshop on IP Operations and Management, pp. 233-239, 2004.
[13] Marcus Brunner, Dominique Dudkowski, Chiara Mingardi and Giorgio Nunzi, "Probabilistic Decentralized Network Management", Proceedings IEEE INM 2009, Hofstra University, Long Island, New York, USA, 2009, pp. 25-32.
[14] Ferat Sahin, "A Bayesian Network Approach to the Self-organization and Learning in Intelligent Agents", Ph.D. dissertation, Virginia Polytechnic, USA, 2000.
[15] Jianguo Ding, Ningkang Jiang, Xiaoyong Li, Bernd Krämer, Franco Davoli and Yingcai Bai, "Construction of Simulation or Probabilistic Inference in uncertain and Dynamic Networks Based on Bayesian Networks", International Conference on ITS Telecommunications Proceedings, 2006, pp. 983-986.
[16] Jianguo Ding, "Probabilistic Fault Management in Distributed Systems", Ph.D. dissertation, FernUniversität in Hagen, Germany, 2008.
[17] Raquel Barco-Moreno, "Bayesian modeling of fault diagnosis in mobile communication networks", Ph.D. dissertation, Universidad de Málaga, Spain, 2007.
[18] Lu Cheng, Xue-song Qiu, Luoming Meng, Yan Qiao, Zhi-qing Li, "Probabilistic Fault Diagnosis for IT Services in Noisy and Dynamic Environments", Proceedings IEEE INM 2009, Hofstra University, Long Island, New York, USA, 2009, pp. 149-156.
[19] George J. Lee, "CAPRI: A Common Architecture for Distributed Probabilistic Internet Fault Diagnosis", Ph.D. dissertation, CSAIL-MIT, Cambridge, MA, USA, 2007.
[20] Judea Pearl, "Bayesian networks: A model of self-activated memory for evidential reasoning", UCLA Report CSD-850017, 1985.
[21] Richard E. Neapolitan, Learning Bayesian Networks, in Prentice-Hall Series in Artificial Intelligence, Prentice-Hall, 2003.
[22] Uffe B. Kjaerulff, Anders L. Madsen, Bayesian Networks and Influence Diagrams: A Guide to Construction and Analysis, Springer, 2008.
[23] P.C.G. da Costa, "Bayesian Semantics for the Semantic Web", Ph.D. dissertation, George Mason University, USA, 2005.
[24] Kathryn Blackmond Laskey and Paulo Cesar G. da Costa, "Uncertainty Representation and Reasoning in Complex Systems", in Complex Systems in Knowledge based Environments: Theory, Models and Applications. Tolk, Andreas; Jain, Lakhmi C. (editors). Springer: New York, NY, USA, 2009, ch. 2, pp. 7-40.
[25] Zhongli Ding, "BayesOWL: A Probabilistic Framework for Uncertainty in Semantic Web", Ph.D. dissertation, University of Maryland, USA, 2005.
[26] Yang Xiang, Probabilistic Reasoning in Multiagent Systems: A Graphical Models Approach, Cambridge University Press, 2002.
[27] Rong Pan, Yun Peng, Zhongli Ding, "Belief Update in Bayesian Networks Using Uncertain Evidence", 18th IEEE International Conference on Tools with Artificial Intelligence, 2006, pp. 441-444.
[28] Manifesto for Agile Software Development, 2001. Available: http://agilemanifesto.org/, last visited July 2009.
[29] Linda Rising and Norman S. Janoff, "The Scrum Software Development Process for Small Teams", IEEE Software, July/August 2000, pp. 2-8.
[30] Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The EM algorithm", in The Elements of Statistical Learning. New York, USA: Springer, 2001, pp. 236-243.
[31] Fabio Luigi Bellifemine, Giovanni Caire, Dominic Greenwood, Developing Multi-Agent Systems with JADE, John Wiley & Sons, Chichester, UK, 2007.
[32] Gregory F. Cooper, Edward Herskovits, "A Bayesian method for the induction of probabilistic networks from data", Technical Report KSL-91-02, Knowledge Systems Laboratory, Medical Computer Science, Stanford University School of Medicine, Stanford, CA 94305-5479, Nov. 1993.
[33] Nir Friedman, Dan Geiger, Moises Goldszmidt, "Bayesian Network Classifiers", Machine Learning, vol. 29, pp. 131-163, 1997.
[34] CELTIC Initiative MAGNETO project, CP5-012, 2008. http://www.celtic-initiative.org/Projects/MAGNETO/abstract-magneto.asp. Last visited July 2009.
[35] TR-069 CPE WAN Management Protocol. http://www.broadband-forum.org/technical/download/TR-069.pdf. Last visited July 2009.
