A Lightweight Approach to Distributed Network Diagnosis under Uncertainty
7/31/2019 A Lightweight Approach to Distributed Network Diagnosis under Uncertainty
Abstract: Network Management faces major challenges nowadays. Management applications have not kept pace with the changes in networks and services and still rely on centralized approaches. Moreover, uncertainty is part of the reality. This paper presents a lightweight collaborative approach to network troubleshooting, based on multi-agent and probabilistic techniques. The proposed architecture has been applied to three different network environments.
Index Terms: Bayesian Network, Multi Agent System, Network Troubleshooting, Uncertainty, Collaborative diagnosis.
I. INTRODUCTION
Telecommunication networks are growing in size and complexity, with a rich blend of services being deployed on top of them. All kinds of organizations, from small companies to NGOs, academic institutions or Fortune 500 corporations, use the Internet as an infrastructure that supports their operations. However, the power and flexibility of this kind of solution has some drawbacks. Since there are multiple players and no central authority, uncertainty becomes a burden.
Network Management is facing a change of paradigm after a long period of stability. For decades, well established reference models like ITU-T TMN [1] or, more recently, TMF NGOSS [2] have been the guide for the design of commercial and in-house systems, with a wide range of applications for the telecom industry and corporate networks. IETF SNMP has played a similar role for small and medium size business solutions.
Classical architectures have a common underlying design principle: the state of all existing entities can be fully known at any given moment in time. A hierarchy of layers allows a well engineered distribution of functions. The five big management areas (FCAPS: Fault, Configuration, Accounting, Performance, Security) guarantee that the network is always under control.
This deterministic approach has proved very useful when the entire infrastructure belongs to the same domain, as is the case inside a telecom operator network, when well defined interfaces allow the interconnection of different domains or when all players speak the same management language.
Another important feature of the classical model, related to the previous one, is the centralized design. Huge network inventories, extremely complex end-to-end monitoring applications or even trouble ticketing workflows behave as part of a "Big Brother" that needs every piece of information to react when something unexpected happens.
1 Manuscript received August 9, 2009. This work was supported in part by the Spanish Ministerio de Industria, Turismo y Comercio, Avanza I+D 2008 grant TSI-020400-2008-27 under the CELTIC Initiative project MAGNETO.
We can compare this situation to the state of development of classical mechanics by the end of the 18th century, when it seemed that the Universe was like a perfect clock. Pierre-Simon Laplace wrote in 1814, in the Essai philosophique sur les probabilités:
"An intelligence which, at a given instant, knew all the forces by which nature is animated and the respective situation of the beings that compose it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; nothing would be uncertain for it, and the future, like the past, would be present before its eyes."
The problem with Laplace's demon is that when the number of state variables reaches a critical threshold it is no longer computable due to scalability problems. Telecom networks share this property: the number of managed entities grows so quickly that it is impossible for centralized solutions to keep pace. There are also other sources of uncertainty that challenge the traditional view, such as the multiplicity of management domains or the emergence of properties in complex networks [3]-[6].
Uncertainty must not be underestimated or dismissed as an undesirable collateral effect in network management. It must be considered an intrinsic property of telecom networks, and it should be taken into account to avoid expensive workarounds.
We present in this paper a lightweight, collaborative and distributed approach for network troubleshooting, developed as an internal initiative of Telefónica I+D to foster innovative management solutions. The same principles have been applied to three different scenarios: KOWGAR, an online troubleshooter to help the final users of a geographically distributed corporate network; KOWLAN, an automatic diagnosis system for the Ethernet/VPN commercial service of Telefónica España (MACROLAN); and KOWGAR@HOME, the application of these concepts to Home Area Networks, where two different management domains (ISP and Home Area Network) have to cooperate to reach valid conclusions.
II. DISTRIBUTED DIAGNOSIS UNDER UNCERTAINTY
Troubleshooting is one of the Network Management fields most sensitive to uncertainty when traditional centralized approaches are used. Based on remote access to testing and
A Lightweight Approach to Distributed Network Diagnosis under Uncertainty
Javier García-Algarra, Pablo Arozarena-Llopis, Sergio García-Gómez, Álvaro Carrera-Barroso
Telefónica I+D, Spain
{algarra,pabloa,sergg,[email protected]}
information capabilities, Network Management tries to come up with a conclusion but, if the information is incomplete or inaccurate, the process may get blocked. Usually, diagnosing affected services is very complicated: it involves dozens of elements of heterogeneous technologies and finally requires human intervention to solve the puzzle. As human expertise is a scarce and expensive resource, the research community is endeavoring to build systems that emulate network engineers working under uncertainty.
There are already several efforts on designing frameworks for distributed network management. AutoMON [7] uses a P2P-based solution. Performance and reliability are tested by distributed agents, although the nodes do not cooperate or use historical information about failures. Distributed fault management in Connected Home also uses an agent-based approach for this scenario [8]. DYSWIS [9] presents an Architecture for Automated Diagnosis of Networks that consists of detection and diagnosis nodes. The former look for failures by passive traffic monitoring and active probing, while the latter determine the root cause using historical information and by performing active tests. Network dependency relationships are encoded as rules. MADEIRA [10] introduced a distributed architecture for dynamic devices management. The use of a dynamic hierarchy allows the system to adapt to network failures and state changes, something that is considered essential for future large scale networks, which must exhibit adaptive, decentralized control behaviors.
Several authors have explored the application of probabilistic techniques to network management [11]-[13]. For fault diagnosis, many of these advances use Bayesian Inference [14]-[18] to conclude the cause of the observed network problems. In particular, CAPRI [19] defines a Common Architecture for Distributed Probabilistic Internet Fault Diagnosis. The whole picture is similar to the one in DYSWIS, but it uses Bayesian networks instead of rules to infer the root cause.
Bayesian networks (BN), a term coined by Judea Pearl [20], are based upon probability theory. The problem domain is represented as a directed acyclic graph where the nodes represent variables and the arcs represent conditional dependencies between them. Graphs are easy to work with, so Bayesian networks can be used to produce models that are simple for humans to understand and that support effective algorithms for inference and learning [21]. Bayesian networks have been successfully applied to numerous areas, including medicine, decision support systems, and text analysis [22].
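As an illustration of the inference step, the following is a minimal sketch of Bayes' rule on a two-node network (cause -> symptom). The cause names and all probabilities are hypothetical values chosen for the example; they are not taken from the systems described in this paper.

```python
# Hypothetical two-node Bayesian network: Cause -> Symptom.
# All names and numbers below are illustrative assumptions.
prior = {"dns_failure": 0.1, "server_down": 0.2, "no_fault": 0.7}

# P(page_load_fails = True | cause)
likelihood = {"dns_failure": 0.9, "server_down": 0.95, "no_fault": 0.01}


def posterior(prior, likelihood):
    """Bayes' rule: P(cause | symptom) is proportional to P(symptom | cause) * P(cause)."""
    joint = {cause: prior[cause] * likelihood[cause] for cause in prior}
    total = sum(joint.values())
    return {cause: p / total for cause, p in joint.items()}


# Beliefs about the cause after observing that the page load failed.
beliefs = posterior(prior, likelihood)
most_likely = max(beliefs, key=beliefs.get)
```

Observing the symptom moves most of the probability mass away from "no_fault", even though it was the most likely state a priori.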
III. PRINCIPLES AND FEATURES
Our strategy is based on three principles and three design decisions made prior to the study of each particular scenario.
The first principle is that any new solution must be neutral and fit on any OSS (Operations Support Systems) map. One common mistake in the management field is to expect the surrounding IT systems to adapt to our needs, and this assumption is a source of delays and expenditure. The second one is that deployment disruption must be minimal, so it is not necessary to replace any previous system. The third one is that uncertainty is unavoidable, so new systems have to properly take it into account.
The first design decision is that systems have to be distributed and grow organically as a part of the network, not as an external watchman. The second one is that semantics is part of human knowledge, so systems must be based on semantics from the very beginning. The third one is the application of probabilistic techniques to deal with uncertainty. The combination of BNs with semantic technologies is an active research area [23]-[25] that provides a high degree of flexibility.
Besides these principles and design decisions there is also an additional economic constraint: systems must be cheap to develop, cheap to deploy and cheap to maintain. Economy is a key issue, since network operation costs have not dropped at the same rate as equipment prices.
In order to cut down development and installation costs, only Open Source bricks have been used to build this environment. A second strategic decision was to adopt an Agile methodology [28] to focus the effort on running software and to reduce the management overhead that is common in complex organizations. In particular, Scrum was the methodology chosen [29].
Regarding hardware, one of the goals is to take advantage of the unused CPU resources that are distributed all over the network. The JADE platform (jade.tilab.com) allows the deployment of Java coded agents with minimal requirements of CPU and memory [31]. When dedicated hardware is needed to run parts of the whole system, commoditized hardware is the best option. For example, a 700 Quad Core Intel PC with Ubuntu Linux is enough to run the MACROLAN diagnosis prototype for the whole Telefónica España network.
Maintenance costs are a main concern for IT systems in general. One of the advantages of this approach is that business logic is encoded in a Bayesian Network that can be distributed all over the network at any time. This allows the addition of new diagnosis capabilities or the modification of the BNs without stopping the system. The Bayesian Inference engine chosen is SamIam, a light and fast Java implementation developed at UCLA (http://reasoning.cs.ucla.edu/samiam/).
IV. BAYESIAN KNOWLEDGE CAPTURE
When modelling a Bayesian Network two things have to be defined: first the structure of the Bayesian Network and then the parameter values of the Conditional Probability Tables (CPT).
In order to initially build a BN, there are several alternatives. When no domain expert knowledge is available, but there is a large amount of historical data, the structure of the Bayesian network can be automatically built by using structural learning algorithms. There are several structural learning algorithms, like K2 [32] and tree augmented Naive Bayes [33]. K2 is a simple and very fast learning algorithm, but its results depend on the initial ordering of input data, so it makes sense to run the algorithm several times with different random orderings. Another good learning algorithm for Bayesian network classifiers is called tree augmented Naive Bayes (TAN). This algorithm is linear in the number of instances and quadratic in the number of attributes. The main problem with these structure learning algorithms is that, since they are based only on statistics, they may come up with wrong causal dependencies.
For this reason, the structure of the Bayesian Networks is preferably defined manually as a first step. Together with the structure, the initial CPT parameters are set. This task may require the cooperation of a team of experts in the problem domain with knowledge engineers to properly model the BN.
Another challenge deals with the distribution of the intelligence across the physical network. Instead of using a centralised BN, a smarter approach is to partition the whole domain into smaller BNs [26] when the scenario is complex, as is the case in network troubleshooting. Following this principle, different elements in different parts of the network may have different views and knowledge. For instance, some agents may only diagnose network problems while others diagnose service problems, exchanging their conclusions to cooperatively reach a valid conclusion. In other words, the single BN that could exist within a centralised solution will be fragmented and distributed to the overlay agent network.
The Virtual Evidence Method (VEM) [27] is an algorithm that partitions the network in several pieces to perform partial inference and belief interchange. With VEM, each agent only needs to know a subset of the whole BN. This makes the diagnosis process more scalable, since computing resources are distributed across the physical network. It also facilitates reusing knowledge in other diagnosis processes that share common parts of the BN.
Moreover, if the Bayesian Network is partitioned according to the physical network topology, VEM enables mapping the diagnosis process to the different network domains.
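The belief interchange between two agents can be sketched with plain virtual (soft) evidence on a single shared variable. This is a simplified illustration of the underlying idea, not the VEM algorithm of [27]; the variable, its states and all probabilities are hypothetical.

```python
def normalize(dist):
    """Scale a dictionary of non-negative weights so that it sums to 1."""
    total = sum(dist.values())
    return {state: value / total for state, value in dist.items()}


def local_message(local_cpt, observed):
    """Agent A: likelihood of its local observation for each state of the shared variable."""
    return {state: probs[observed] for state, probs in local_cpt.items()}


def absorb_virtual_evidence(belief, message):
    """Agent B: multiply its belief by the received likelihoods and renormalize."""
    return normalize({state: belief[state] * message[state] for state in belief})


# Agent A only knows P(ping_result | access_link) (hypothetical values).
cpt_agent_a = {
    "ok": {"reply": 0.95, "timeout": 0.05},
    "faulty": {"reply": 0.10, "timeout": 0.90},
}
# Agent B holds the current belief about the shared variable access_link.
belief_agent_b = {"ok": 0.8, "faulty": 0.2}

message = local_message(cpt_agent_a, "timeout")
updated = absorb_virtual_evidence(belief_agent_b, message)
```

Agent B never sees Agent A's raw observation or its part of the model; it only receives one likelihood value per state of the shared variable.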
V. SELF LEARNING
One significant goal in our work is to allow the diagnostic intelligence to self adapt and improve over time. To achieve this, the system must be able to learn from past actions, something which requires a feedback loop to mark past diagnoses as successful or not. This is very challenging, since there is not a straightforward way to assess diagnosis results. A possible solution is to request human feedback from the network operators on the usefulness of each diagnosis. Once this is done, parametric learning algorithms may be used to improve diagnosis quality. These methods update the existing BN, but modify only the link weights in the CPTs and not its structure.
The Expectation Maximization (EM) algorithm is a parametric learning algorithm used in statistics to find maximum likelihood estimates of parameters in probabilistic models [30]. EM uses an iterative algorithm that estimates the missing values in the input data representing previous diagnoses. This is relevant since for some diagnoses only a subset of the possible evidences may be available. Once this is done, statistical methods are used to recalculate the weights in the BN based on the set of previous diagnoses. KOWGAR uses this algorithm because it provides the necessary features for updating the values of the CPTs which, along with the structure, embody the knowledge embedded in a BN.
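The EM iteration can be sketched for the simplest possible case: a single CPT entry, theta = P(evidence = True | cause), estimated from past diagnoses of one cause in which the evidence is sometimes missing. The function name and the data are hypothetical, and a real BN would update all CPT entries jointly.

```python
def em_estimate(records, theta=0.5, iterations=50):
    """Estimate theta = P(evidence=True | cause) from True/False/None (missing) records."""
    n_true = sum(1 for r in records if r is True)
    n_missing = sum(1 for r in records if r is None)
    n_total = len(records)
    for _ in range(iterations):
        # E-step: expected number of True values, filling the gaps with theta.
        expected_true = n_true + n_missing * theta
        # M-step: maximum likelihood estimate from the completed counts.
        theta = expected_true / n_total
    return theta


# Six past diagnoses of the same cause; the evidence was not collected in two.
records = [True, True, True, False, None, None]
theta = em_estimate(records)
```

In this degenerate case the iteration converges to the frequency among the observed records (3 out of 4), which makes the behaviour easy to check by hand.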
VI. ARCHITECTURE OVERVIEW
The core of KOWGAR is based on a Multiagent architecture. Different types of agents have been envisaged to carry out the diagnosis process.
- The Interface Agents are in charge of communication with outside systems, such as a user interface, a trouble ticketing system, a network inventory, etc. In some cases, these agents provide an interface as an external service. In other cases, they use other system interfaces.
- The Observation Agents' mission is to get evidences from the managed networks and services. These agents have interfaces with the network resources and their Management Information Bases (MIBs), or they can exploit external testing tools or services.
- The Diagnosis Agents orchestrate the diagnosis process and gather evidences from the Observation Agents in order to carry out the Bayesian Inference. This process is driven by the Bayesian Network specifications. Usually, these agents are specialized for a network/service configuration.
- The Persistence Agents manage the storage and retrieval of KOWGAR information from the system database when required.
- The Knowledge Agent's task is to distribute the Bayesian knowledge to the agents that need it.
All the communications between agents are based on FIPA-ACL messages and a specific ontology that the agents understand. This ontology defines the Bayesian Network structure (hypotheses, evidences, conditional probabilities, thresholds, etc.), the information about a diagnosis operation (observations, beliefs, additional information, etc.), and the actions that the agents can carry out. The communications with external systems are usually based on XML messages over HTTP or on standard Web Services.
As explained in the following sections, this architecture is customized depending on the needs of each scenario. The types of agents must be specialized to adapt the architecture, and different Bayesian Networks are defined for each situation.
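The kind of information carried by the ontology can be sketched as a small data model. The class and field names below are hypothetical, and JSON is used only as a stand-in for serialization; the agents described here exchange this knowledge through FIPA-ACL messages.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TestAction:
    """A test that an Observation Agent can run (hypothetical fields)."""
    name: str
    cost: float
    required_inputs: list = field(default_factory=list)


@dataclass
class DiagnosisKnowledge:
    """Knowledge shared between agents: hypotheses, evidences, thresholds, tests."""
    hypotheses: list
    evidences: list
    thresholds: dict
    tests: list


knowledge = DiagnosisKnowledge(
    hypotheses=["dns_failure", "routing_failure"],
    evidences=["dns_lookup_ok", "traceroute_ok"],
    thresholds={"dns_failure": 0.9, "routing_failure": 0.9},
    tests=[TestAction("dns_lookup", 1.0, ["hostname"]),
           TestAction("traceroute", 3.0, ["destination_ip"])],
)

# Serialize for distribution and parse it back on the receiving side.
payload = json.dumps(asdict(knowledge))
restored = json.loads(payload)
```

Keeping thresholds and test costs in the distributed knowledge, rather than in agent code, is what allows behaviour to change without redeploying agents.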
VII. KOWGAR: BAYESIAN DIAGNOSIS IN A CORPORATE NETWORK
KOWGAR is a proof of concept to test the power of a distributed Bayesian diagnosis system, deployed on Telefónica I+D's own corporate Intranet. The system targeted a reduced set of problems, those related to web navigation of internal and external sites. The scenario is very common in any kind of institutional Intranet with geographically distant locations linked via a VPN. In this case Telefónica España acts as the ISP that also provides connectivity. Like any other customer, Telefónica I+D network managers perceive this infrastructure as a cloud or a black box. KOWGAR was designed to help the end user so, to allow a quick and cheap deployment, a tiny Firefox plug-in was developed as the Graphical Interface. A diagnosis is automatically launched when the browser detects a problem and the user only has to check the final result.
Two different types of possible HTTP connections can be distinguished:
- Connections to HTTP servers within the intranet.
- Connections to HTTP servers located outside TID's intranet, i.e. in the internet.
The universe of root causes is selected prior to the construction of the BN. Some of them are: Local misconfiguration, Link failure, Routing failure, DNS server unreachable, DNS information incorrect, Destination host unreachable, Destination port unreachable or Destination application unavailable.
There were three locations involved in the experiment: Madrid, Valladolid and Huesca. To simulate DNS failures without disrupting the daily Company operations, a pair of mock DNS servers was installed. Fig. 1 shows the testbed and an example of what happens when a final user in the Madrid subnetwork is trying to access a web server in Boecillo that is refusing connections. When an error is detected, Firefox triggers the plug-in, which in turn activates the JADE User Agent.
This User Agent performs a set of basic testing procedures and, if everything is OK, sends a request to the Aggregation Agent, in charge of orchestrating KOWGAR's actions. This agent knows the BN and the associated ontology that explains which kind of test must be executed to gather the different possible symptoms. So, it invokes three different testing agents: routing, DNS and HTTP. The BN reaches a conclusion that, in this case, is server congestion with a high probability. The information can be displayed from the browser just by clicking the plug-in icon.
Fig. 1: KOWGAR testbed scenario and agents deployment.
This simple example shows some of the advantages of the distributed diagnosis. Problems are detected locally if possible. If the User Agent detects a network wire failure, for instance, no further tests would be necessary and the process stops. But let us take a look at the Aggregation Agent behavior when, instead of a server congestion in Madrid, there is a problem in the routing tables of a Valladolid router. In this case, the evidences gathered by the User and Routing Agents would be enough to reach this conclusion without performing additional tests. So, instead of performing every possible test available, the ontology sets the cost of each test action, which is used to select the next test to perform (the one with the lowest cost). Each time a new test is performed, providing additional evidence, the Aggregation Agent evaluates the output of the BN. When a sufficient degree of certainty is reached (also called confidence, which is set in the ontology), the process stops and returns the result with the highest probability. This avoids unnecessary tests.
Another important advantage is that the logic isn't hardwired. The couple BN+Ontology, coded together as a FIPA-SL string, governs the system. Modifying the structure of the BN or the weight of each arc is as fast as distributing a new string to all interested agents. In our tests, this operation took less than one second for the whole network, using the JADE communication mechanisms. As KOWGAR was just a short term experience developed in three months, no self-learning algorithms were implemented, but the distribution of minor manual changes in the BN and/or the ontology was actually tested.
The last important feature of KOWGAR we want to highlight is its multiplatform nature. As everything runs on top of JADE/Java, KOWGAR agents were deployed on different flavors of Windows, Linux, Solaris and HP-UX. KOWGAR agents were executed even on a 10[...] micro PC with 256MB RAM and Puppy Linux.
We can conclude that this approach can benefit any type of organization which wants to manage common IP services using its own programming skills. Besides, all the tools employed in the prototype are Open Source and no extra HW is necessary.
VIII. KOWLAN
MACROLAN is Telefónica España's solution to build Virtual Private Networks (VPN) connecting multiple enterprise sites over Ethernet based accesses. MACROLAN supports service speeds from 2 Mbit/s to 1 Gbit/s. By using standard L2 and L3 VPN technologies, MACROLAN enables geographically distant customer sites to communicate as if they belonged to the same LAN, in terms of speed, reliability and transparency.
The MACROLAN service is built on top of a diverse set of networks. The local loop from the customer location to the Central Office (CO) can be either fiber with media converters, when available, or copper, in rural areas. Synchronous Digital Hierarchy (SDH) circuits allow the distance of this access segment to be extended when the closest MAN access point is in a different CO. MACROLAN traffic is aggregated into province-wide Metropolitan Ethernet networks (MANs) and then into an IP backbone network that provides national coverage.
MACROLAN poses an interesting challenge, since it involves a number of different technologies. Diagnosis requires a high degree of skill, experience and ability, since it involves accessing half a dozen different OSSs simultaneously. The goal was then to capture the expertise from human operators, model it in BNs and thus reduce manual intervention for the most common types of failures. There were two additional constraints for the experience: the system should be deployed without any change in the existing OSSs, and it should be running on a live network from the first month, with a six month deadline for the whole Scrum project.
The system architecture consists of three main blocks: the multi-agent platform, the web user interface and the database.
The behavior of KOWLAN agents is based on Bayesian inference, as we have explained. For this experience, there are six different BNs, one for each type of access technology, depending on the combination of fiber, copper and the existence of an SDH path. Knowledge capture required only two working days of a team of expert technicians and two Telefónica I+D engineers.
In order to provide a common language so that the agents can share knowledge about BNs and diagnosis operations, a top level ontology has been defined. This ontology includes information about hypotheses and evidences, information about probabilities of the current hypothesis beliefs, information about topology and inventory, and information about probabilistic dependencies among hypotheses and evidences.
These are the steps each diagnosis procedure follows:
1. Requests can be launched by a human operator from the user interface or, more commonly, as the result of the arrival of a Trouble Ticket (TT). An interface agent periodically polls the TT system so, when a new one is assigned to the MACROLAN technical center, a KOWLAN diagnosis procedure is triggered without human intervention.
2. The only pieces of information available about the failure are the circuit identification and the reported symptom. From two corporate inventories, specialized KOWLAN agents get the full circuit description: topology, scenario, equipment data and configuration.
3. The appropriate Diagnosis Agent is instantiated according to the scenario. This Agent orchestrates the ontology coded behavior:
   a. The Diagnosis Agent selects the best available test to perform, taking into account its cost and whether there is enough data to execute it. Then, it requests the test from the appropriate Observation Agent, gets the results, performs the Bayesian inference and checks if any hypothesis threshold has been reached. If not, it selects the next test to perform.
   b. When enough certainty in the diagnosis is reached or there are no more tests to perform, it stores the results in the database.
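The loop in step 3 can be sketched as follows. This is a minimal illustration, assuming that test outcomes are conditionally independent given the hypothesis (a naive Bayes simplification of a full BN) and omitting the check that enough data exists to run a test. All hypotheses, tests, costs and probabilities are hypothetical.

```python
def normalize(dist):
    total = sum(dist.values())
    return {h: v / total for h, v in dist.items()}


def update(beliefs, cpt, test, outcome):
    """Bayesian update of the hypothesis beliefs with one test outcome."""
    return normalize({h: p * cpt[test][h][outcome] for h, p in beliefs.items()})


def diagnose(prior, cpt, costs, run_test, threshold):
    """Run the cheapest pending test until a hypothesis reaches the threshold."""
    beliefs = dict(prior)
    pending = set(costs)
    performed = []
    while pending and max(beliefs.values()) < threshold:
        test = min(pending, key=costs.get)  # lowest-cost test still available
        pending.remove(test)
        beliefs = update(beliefs, cpt, test, run_test(test))
        performed.append(test)
    best = max(beliefs, key=beliefs.get)
    return best, beliefs[best], performed


def outcome_probs(p_fail):
    return {"fail": p_fail, "ok": 1.0 - p_fail}


prior = {"customer_router": 0.5, "access_line": 0.3, "backbone": 0.2}
costs = {"ping_router": 1, "line_test": 5, "backbone_probe": 10}
# cpt[test][hypothesis][outcome] = P(outcome | hypothesis)
cpt = {
    "ping_router": {"customer_router": outcome_probs(0.9),
                    "access_line": outcome_probs(0.8),
                    "backbone": outcome_probs(0.7)},
    "line_test": {"customer_router": outcome_probs(0.05),
                  "access_line": outcome_probs(0.95),
                  "backbone": outcome_probs(0.05)},
    "backbone_probe": {"customer_router": outcome_probs(0.02),
                       "access_line": outcome_probs(0.02),
                       "backbone": outcome_probs(0.95)},
}
# Simulated Observation Agents: fixed outcomes for this example.
simulated = {"ping_router": "fail", "line_test": "fail", "backbone_probe": "ok"}

best, confidence, performed = diagnose(prior, cpt, costs, simulated.get, threshold=0.85)
```

With these numbers the loop stops after the two cheapest tests, so the most expensive probe is never requested.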
The user interface is a decoupled PHP application, since KOWLAN has been designed to be fully automatic when possible. The operator can request a diagnosis on a circuit, display its details and evaluate the accuracy of the diagnosis through a simple Internet-like polling mechanism. This evaluation may be useful for further refinement of the BN parameters by using self-learning mechanisms.
After two months of usage, KOWLAN was able to diagnose around 45% of failures with accuracy higher than 90%. Thanks to this, human intervention has been reduced by 30%, as the analysis of a sample of 3000 trouble tickets has shown.
The key to the successful application of Bayesian diagnosis in KOWLAN has been the high involvement of the MACROLAN maintenance team in the project. Knowledge capture has proved to be very accurate: the system is only as good as the technicians are, and in this case they are exceptionally skilled. Besides, some shell scripts developed in past years by this group were integrated as part of the KOWLAN diagnosis toolbox. This kind of in-house development should be considered carefully as a repository of experience rather than as a threat to the formal IT structure of organizations.
IX. KOWGAR@HOME: THE OUTER EDGE PROBLEM
As we have seen, KOWGAR is a system for distributed Bayesian diagnosis in a corporate network, while KOWLAN applies the same principles to a big telecom operator network. In the first case, the leased infrastructure is seen as a fuzzy cloud, while in the second case diagnostic tools can only test the last operator equipment (customer router) but are blind beyond this point. Each domain sees the other one as an outer edge, a region of extreme uncertainty. This situation is becoming usual in daily operations, and the final customer needs a whole picture despite the managed entities belonging to different players.
Network Management in the outer edge scenario is a complex task, not only because of the technical constraints, but also because of legal, regulatory, security and commercial issues. In the scope of the MAGNETO project [34], a fault diagnosis prototype is being developed to address Home Area Network (HAN) troubleshooting.
Management of HANs represents a huge challenge for telecom operators since it has to combine the management of different network domains. Some of these domains are under the control of the telecom operator, while the rest belong to the end customer. Current management architectures address HAN management from a centralized perspective, where management tasks are performed in a management system that remotely accesses customer equipment using protocols such as TR-069 [35].
In order to allow some degree of autonomy to themanagement of HANs, MAGNETO is exploring a distributedarchitecture where management tasks are locally executedimproving efficiency and reducing the burden on centralizedservers. An important MAGNETO feature is its capability ofself diagnosing problems in the HAN. This self diagnosis mayrequire cooperation between management agents placed in
-
7/31/2019 A Lightweight Approach to Distributed Network Diagnosis under Uncertainty
6/7
different HANs or even network domain
MAGNETO agents sitting on HAN equip
some conclusions based on the evidences loc
may also need to cooperate with agents in th
other HANs to exchange their views of the pr
reach a valid diagnosis.
KOWGAR@HOME is the testbed to vali
solution for distributed troubleshooting that
further in the wider scope of MAGNETO. Itapproach to the one already described for
KOWLAN. It uses Bayesian Network Infer
the cause of service and network failures eve
uncertainty where there is not full knowled
and network status. Regarding the modeling
knowledge, the initial definition of the Bayes
to diagnose each service will be based on
experts in the field, like network operators.
In KOWGAR@HOME, some parts of the
Network can be executed in HAN dev
gateways, set top boxes, etc.) while other pa
ISPs network or in other HANs. Each of t
domains will have a different perspective
being diagnosed and will exchange inform
conclusions to cooperatively reach a valid di
purpose, it has been decided to use the
Method for the partition of the BNs, as desc
paragraphs.
Figure 2 depicts a possible
KOWGAR@HOME agents. As can be seen
deployed at different domains, from the
environment. It is foreseen to have
KOWGAR@HOME agent inside the HAN,
in a well-equipped device like a residential ga
Fig.2: KOWGAR@HOME deployment in a HAN
s. For example,
ment may reach
ally available but
e ISP network or
oblem in order to
ate the technical
ill be integrated
follows a similarKOWGAR and
ence to diagnose
n in situations of
ge about service
of the diagnosis
ian Network used
knowledge from
overall Bayesian
ices (like home
rts run inside the
ese management
on the problems
ation about their
iagnosis. For that
irtual Evidence
ribed in previous
deployment of
, there are agents
AN to the ISP
at least one
ost likely sitting
teway.
nvironment.
Additional agents can be deployed in other devices like set-top boxes, although this is not compulsory. Moreover, given the resource limitations of most multimedia devices, it may be preferable to access them remotely from the residential gateway.
Agents deployed in the ISP network take advantage of existing computing resources like OSS servers or even network equipment. These agents have access to information that is not available inside the HAN and can therefore be used to complete the diagnosis process, also making use of the results of the partial inference conducted inside the HAN. In addition, ISP agents can communicate with OSSs belonging to the ISP for two purposes: firstly, to request relevant diagnosis information from them and, secondly, to automatically feed them the results of the diagnosis process, when appropriate, by triggering trouble tickets or alarms.
Another important KOWGAR@HOME feature is its ability to improve its own diagnosis capabilities by means of self-learning processes. To make this more valuable and accurate, a central server in the ISP network is in charge of gathering historical data about past diagnoses from the different domains involved. Periodically, these data will be used to update the BNs by applying parametric learning algorithms. The new knowledge acquired will then be distributed to the appropriate KOWGAR@HOME agents so it can be used for further diagnoses.
CONCLUSIONS
Troubleshooting is a very common task in Network Management. This paper has described a lightweight solution for distributed diagnosis based on Bayesian Networks and built with Open Source components on top of a Multi Agent framework. The proposed architecture has been tested on three different scenarios: a corporate Intranet, a telco VPN infrastructure and an experimental testbed for a multidomain Digital Home proof of concept.
The most relevant results are the flexibility and short time-to-market of the proposed solution. This approach is a good starting point for any organization that wants to develop an in-house troubleshooting system to cope with the growth and complexity of its own network.
One important contribution of KOWGAR to this kind of problem is its capacity to deal with uncertainty. It is often difficult to obtain information and test results from different systems. The probabilistic approach provides an answer in any case, although the more evidence is available, the higher the certainty the system reaches.
Another important benefit is related to scalability, as it has been seen that the multi-agent paradigm offers a good solution for easily deploying decoupled pieces of software in an environment where the overall architecture is not known. Agents can be proactively exploited to detect and solve problems locally, avoiding the participation of more centralized agents in the architecture.
In the future we plan to further explore some of the challenges identified, such as improving self-learning capabilities, smart partitioning of the BNs, automatic feedback on the success of diagnosis and dynamic generation of BNs.
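The self-learning step discussed above, in which historical diagnoses are periodically used to update the BNs through parametric learning, could be sketched roughly as follows. This is only an illustrative sketch and not the KOWGAR implementation: it assumes a single binary symptom node with one parent (the fault), and it blends the current conditional probability with observed counts using a Dirichlet-style equivalent sample size. The function name `update_cpt`, the fault labels and the `prior_weight` parameter are all hypothetical.

```python
from collections import defaultdict


def update_cpt(prior_cpt, records, prior_weight=10.0):
    """Re-estimate P(symptom = True | fault) from past diagnoses.

    prior_cpt    -- dict mapping fault -> current P(symptom = True | fault)
    records      -- iterable of (fault, symptom_seen) pairs from confirmed cases
    prior_weight -- assumed number of "virtual" cases backing the current CPT
    """
    # fault -> [cases where the symptom was seen, total cases]
    counts = defaultdict(lambda: [0, 0])
    for fault, symptom_seen in records:
        counts[fault][0] += int(symptom_seen)
        counts[fault][1] += 1

    updated = {}
    for fault, p0 in prior_cpt.items():
        seen, total = counts[fault]
        # Posterior mean: old belief weighted by prior_weight plus new counts.
        updated[fault] = (prior_weight * p0 + seen) / (prior_weight + total)
    return updated


# Hypothetical history gathered by the central server.
prior = {"dsl_link_down": 0.9, "wifi_misconfigured": 0.5}
history = (
    [("dsl_link_down", True)] * 8
    + [("dsl_link_down", False)] * 2
    + [("wifi_misconfigured", False)] * 10
)
new_cpt = update_cpt(prior, history)
print(new_cpt)
```

With few historical cases the updated values stay close to the existing probabilities, and they drift towards the observed frequencies as more confirmed diagnoses accumulate. A real deployment would more likely rely on the learning facilities of a BN library (for example EM when some observations are missing) than on hand-written counting.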
ACKNOWLEDGMENT
The authors of this paper would like to thank the MACROLAN Technical Center staff in Barcelona, Spain, for their support during the KOWLAN experience.
REFERENCES
[1] ITU-T, "Principles for a Telecommunications Management Network," Recommendation M.3010, 1996.
[2] Creaner, M., Reilly, J., "NGOSS Distilled: The Essential Guide to Next Generation Telecoms Management," The Lean Corporation, August 2005.
[3] Chih-Chun Chen, Sylvia B. Nagl and Christopher D. Clack, "Complexity and Emergence in Engineering Systems," in Complex Systems in Knowledge based Environments: Theory, Models and Applications. Tolk, Andreas; Jain, Lakhmi C. (editors). Springer: New York, NY, USA, 2009, ch. 5, pp. 99-128.
[4] M. Faloutsos, P. Faloutsos, C. Faloutsos, "On power-law relationships of the Internet topology," Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, pp. 251-262, ACM, 1999.
[5] J. Spencer, D. Johnson, A. Hastie, L. Sacks, "Emergent properties of the BT SDH network," BT Technology Journal, vol. 21, no. 2, April 2003, pp. 28-36.
[6] A. Santiago, J. P. Cárdenas, M. L. Mouronte, V. Feliu and R. M. Benito, "Modeling the topology of SDH networks," International Journal of Modern Physics C, vol. 19, no. 12, 2008, pp. 1809-1820.
[7] A. Binzenhöfer et al., "A P2P-based framework for distributed network management," in New Trends in Network Architectures and Services, LNCS vol. 3883, pp. 198-210, Loveno di Menaggio, Como, Italy, 2006.
[8] P. Utton and E. Scharf, "A fault diagnosis system for the connected home," IEEE Communications Magazine, vol. 42, no. 11, pp. 128-134, November 2004.
[9] V. Singh, "Dyswis: An architecture for automated diagnosis of networks," in IEEE Network Operations and Management Symposium, NOMS 2008, pp. 851-854, Salvador de Bahia, Brazil, 2008.
[10] Fallon L., Parker D., Collins S., Zach M., Leitner M., "Self-Forming Network Management Topologies in the Madeira Management System," Proceedings of the Autonomous Infrastructure, Management and Security International Conference, AIMS 2007, pp. 61-72, Oslo, Norway, 2007.
[11] R. Badonnel, R. State, O. Festor, "Probabilistic Management of Ad-Hoc Networks," 10th IEEE/IFIP Network Operations and Management Symposium, NOMS 2006, pp. 339-350, Vancouver, Canada, 2006.
[12] Jianguo Ding, Bernd Krämer, Shihao Xu, Hansheng Chen and Yingcai Bai, "Predictive Fault Management in the Dynamic Environment of IP Networks," Proceedings IEEE Workshop on IP Operations and Management, pp. 233-239, 2004.
[13] Marcus Brunner, Dominique Dudkowski, Chiara Mingardi and Giorgio Nunzi, "Probabilistic Decentralized Network Management," Proceedings IEEE INM 2009, Hofstra University, Long Island, New York, USA, 2009, pp. 25-32.
[14] Ferat Sahin, "A Bayesian Network Approach to the Self-organization and Learning in Intelligent Agents," Ph.D. dissertation, Virginia Polytechnic, USA, 2000.
[15] Jianguo Ding, Ningkang Jiang, Xiaoyong Li, Bernd Krämer, Franco Davoli and Yingcai Bai, "Construction of Simulation or Probabilistic Inference in uncertain and Dynamic Networks Based on Bayesian Networks," International Conference on ITS Telecommunications Proceedings, 2006, pp. 983-986.
[16] Jianguo Ding, "Probabilistic Fault Management in Distributed Systems," Ph.D. dissertation, FernUniversität in Hagen, Germany, 2008.
[17] Raquel Barco-Moreno, "Bayesian modeling of fault diagnosis in mobile communication networks," Ph.D. dissertation, Universidad de Málaga, Spain, 2007.
[18] Lu Cheng, Xue-song Qiu, Luoming Meng, Yan Qiao, Zhi-qing Li, "Probabilistic Fault Diagnosis for IT Services in Noisy and Dynamic Environments," Proceedings IEEE INM 2009, Hofstra University, Long Island, New York, USA, 2009, pp. 149-156.
[19] George J. Lee, "CAPRI: A Common Architecture for Distributed Probabilistic Internet Fault Diagnosis," Ph.D. dissertation, CSAIL-MIT, Cambridge, MA, USA, 2007.
[20] Judea Pearl, "Bayesian networks: A model of self-activated memory for evidential reasoning," UCLA Report CSD-850017, 1985.
[21] Richard E. Neapolitan, "Learning Bayesian Networks," in Prentice-Hall Series in Artificial Intelligence, Prentice-Hall, 2003.
[22] Uffe B. Kjaerulff, Anders L. Madsen, "Bayesian Networks and Influence Diagrams: A Guide to Construction and Analysis," Springer, 2008.
[23] P.C.G. da Costa, "Bayesian Semantics for the Semantic Web," Ph.D. dissertation, George Mason University, USA, 2005.
[24] Kathryn Blackmond Laskey and Paulo Cesar G. da Costa, "Uncertainty Representation and Reasoning in Complex Systems," in Complex Systems in Knowledge based Environments: Theory, Models and Applications. Tolk, Andreas; Jain, Lakhmi C. (editors). Springer: New York, NY, USA, 2009, ch. 2, pp. 7-40.
[25] Zhongli Ding, "BayesOWL: A Probabilistic Framework for Uncertainty in Semantic Web," Ph.D. dissertation, University of Maryland, USA, 2005.
[26] Yang Xiang, "Probabilistic Reasoning in Multiagent Systems: A Graphical Models Approach," Cambridge University Press, 2002.
[27] Rong Pan, Yun Peng, Zhongli Ding, "Belief Update in Bayesian Networks Using Uncertain Evidence," 18th IEEE International Conference on Tools with Artificial Intelligence, 2006, pp. 441-444.
[28] Manifesto for Agile Software Development, 2001. Available: http://agilemanifesto.org/, last visited July 2009.
[29] Linda Rising and Norman S. Janoff, "The Scrum Software Development Process for Small Teams," IEEE Software, July/August 2000, pp. 2-8.
[30] Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The EM algorithm," in The Elements of Statistical Learning. New York, USA: Springer, 2001, pp. 236-243.
[31] Fabio Luigi Bellifemine, Giovanni Caire, Dominic Greenwood, "Developing Multi-Agent Systems with JADE," John Wiley & Sons, Chichester, UK, 2007.
[32] Gregory F. Cooper, Edward Herskovits, "A Bayesian method for the induction of probabilistic networks from data," Technical Report KSL-91-02, Knowledge Systems Laboratory, Medical Computer Science, Stanford University School of Medicine, Stanford, CA 94305-5479, Nov. 1993.
[33] Nir Friedman, Dan Geiger, Moises Goldszmidt, "Bayesian Network Classifiers," Machine Learning, vol. 29, pp. 131-163, 1997.
[34] CELTIC Initiative MAGNETO project, CP5-012, 2008. http://www.celtic-initiative.org/Projects/MAGNETO/abstract-magneto.asp. Last visited July 2009.
[35] TR-069 CPE WAN Management Protocol. http://www.broadband-forum.org/technical/download/TR-069.pdf. Last visited July 2009.