A Lightweight Approach to Distributed Network Diagnosis under Uncertainty

Upload: javigarciaalgarra

Post on 04-Apr-2018


  • 7/31/2019 A Lightweight Approach to Distributed Network Diagnosis under Uncertainty

    1/7

    Abstract: Network Management faces major challenges nowadays. Management applications have not kept pace with changing networks and services and still rely on centralized approaches. Moreover, uncertainty is part of reality. This paper presents a lightweight collaborative approach to network troubleshooting, based on multi-agent and probabilistic techniques. The proposed architecture has been applied to three different network environments.

    Index Terms: Bayesian Network, Multi Agent System, Network Troubleshooting, Uncertainty, Collaborative diagnosis.

    I. INTRODUCTION

    Telecommunication networks are growing in size and complexity, with a rich blend of services being deployed on top of them. All kinds of organizations, from small companies to NGOs, academic institutions or Fortune 500 corporations, use the Internet as an infrastructure that supports their operations. However, the power and flexibility of this kind of solution has some drawbacks. Since there are multiple players and no central authority, uncertainty becomes a burden.

    Network Management is facing a change of paradigm after a long period of stability. For decades, well-established reference models like ITU-T TMN [1] or, more recently, TMF NGOSS [2] have been the guide for the design of commercial and in-house systems, with a wide range of applications for the telecom industry and corporate networks. IETF SNMP has played a similar role for small and medium-size business solutions.

    Classical architectures have a common underlying design principle: the state of all existing entities can be fully known at any given moment in time. A hierarchy of layers allows a well-engineered distribution of functions. The five big management areas (FCAPS: Fault, Configuration, Accounting, Performance, Security) guarantee that the network is always under control.

    This deterministic approach has proved very useful when the entire infrastructure belongs to the same domain, as is the case inside a telecom operator network, when well-defined interfaces allow the interconnection of different domains, or when all players speak the same management language.

    Another important feature of the classical model, related to the previous one, is the centralized design. Huge network inventories, extremely complex end-to-end monitoring applications or even trouble ticketing workflows behave as part of a "Big Brother" that needs every piece of information to react when something unexpected happens.

    1 Manuscript received August 9, 2009. This work was supported in part by the Spanish Ministerio de Industria, Turismo y Comercio, Avanza I+D 2008 grant TSI-020400-2008-27 under the CELTIC Initiative project MAGNETO.

    We can compare this situation to the state of development of classical mechanics by the end of the 18th century, when it seemed that the Universe was like a perfect clock. Pierre-Simon Laplace wrote in 1814, in the Essai philosophique sur les probabilités:

    "Une intelligence qui, à un instant donné, connaîtrait toutes les forces dont la nature est animée et la situation respective des êtres qui la composent embrasserait dans la même formule les mouvements des plus grands corps de l'univers et ceux du plus léger atome; rien ne serait incertain pour elle, et l'avenir, comme le passé, serait présent à ses yeux."

    (An intelligence which, at a given instant, knew all the forces by which nature is animated and the respective situation of the beings that compose it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; nothing would be uncertain for it, and the future, like the past, would be present before its eyes.)

    The problem with Laplace's demon is that when the number of state variables reaches a critical threshold, it is no longer computable due to scalability problems. Telecom networks share this property: the number of managed entities grows so quickly that it is impossible for centralized solutions to keep pace. There are also many other sources of uncertainty that challenge the traditional view, such as the multiplicity of management domains or the emergence of properties in complex networks [3]-[6].

    Uncertainty cannot be underestimated or dismissed as an undesirable collateral effect in network management. It must be considered an intrinsic property of telecom networks, and it should be taken into account to avoid expensive workarounds.

    We present in this paper a lightweight, collaborative and distributed approach for network troubleshooting, developed as an internal initiative of Telefónica I+D to foster innovative management solutions. The same principles have been applied to three different scenarios: KOWGAR, an online troubleshooter to help the final users of a geographically distributed corporate network; KOWLAN, an automatic diagnosis system for the Ethernet/VPN commercial service of Telefónica España (MACROLAN); and KOWGAR@HOME, the application of these concepts to Home Area Networks, where two different management domains (ISP and Home Area Network) have to cooperate to reach valid conclusions.

    II. DISTRIBUTED DIAGNOSIS UNDER UNCERTAINTY

    Troubleshooting is one of the Network Management fields most sensitive to uncertainty when traditional centralized approaches are used. Based on remote access to testing and

    A Lightweight Approach to Distributed Network Diagnosis under Uncertainty

    Javier García-Algarra, Pablo Arozarena-Llopis, Sergio García-Gómez, Álvaro Carrera-Barroso
    Telefónica I+D, Spain

    {algarra,pabloa,sergg,[email protected]}

    information capabilities, Network Management tries to come up with a conclusion but, if the information is incomplete or inaccurate, the process may get blocked. Usually, diagnosing affected services is very complicated: it involves dozens of elements of heterogeneous technologies and ends up requiring human intervention to solve the puzzle. As human expertise is a scarce and expensive resource, the research community is endeavoring to build systems that emulate network engineers working under uncertainty.

    There are already several efforts on designing frameworks for distributed network management. AutoMON [7] uses a P2P-based solution. Performance and reliability are tested by distributed agents, although the nodes do not cooperate or use historical information about failures. Distributed fault management in the Connected Home [8] also uses an agent-based approach. DYSWIS [9] presents an "Architecture for Automated Diagnosis of Networks" that consists of detection and diagnosis nodes. The former look for failures by passive traffic monitoring and active probing, while the latter determine the root cause using historical information and by performing active tests. Network dependency relationships are encoded as rules. MADEIRA [10] introduced a distributed architecture for dynamic device management. The use of a dynamic hierarchy allows the system to adapt to network failures and state changes, something that is considered essential for future large-scale networks, which must exhibit adaptive, decentralized control behaviors.

    Several authors have explored the application of probabilistic techniques to network management [11]-[13]. For fault diagnosis, many of these advances use Bayesian inference [14]-[18] to conclude the cause of the observed network problems. In particular, CAPRI [19] defines a "Common Architecture for Distributed Probabilistic Internet Fault Diagnosis". The whole picture is similar to the one in DYSWIS, but it uses Bayesian networks instead of rules to infer the root cause.

    Bayesian networks (BN), a term coined by Judea Pearl [20], are based upon probability theory. The problem domain is represented as a directed acyclic graph where the nodes represent variables and the arcs represent conditional dependencies between them. Graphs are easy to work with, so Bayesian networks can be used to produce models that are simple for humans to understand, and they also admit effective algorithms for inference and learning [21]. Bayesian networks have been successfully applied to numerous areas, including medicine, decision support systems, and text analysis [22].
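    As a minimal illustration of the definition above, the following sketch encodes a tiny Bayesian network as a graph factorisation with one conditional probability table and computes a posterior by brute-force enumeration. The variable names (`link_failure`, `dns_failure`, `page_timeout`) and every probability are assumptions invented for the example; they are not taken from the systems described in this paper.

```python
from itertools import product

# Hypothetical toy network: two independent root causes and one observable symptom.
# All numbers below are illustrative assumptions.
PRIORS = {"link_failure": 0.02, "dns_failure": 0.05}

# CPT: P(page_timeout = True | link_failure, dns_failure)
CPT_TIMEOUT = {
    (False, False): 0.01,
    (False, True): 0.70,
    (True, False): 0.90,
    (True, True): 0.99,
}


def joint(link, dns, timeout):
    """Joint probability from the DAG factorisation P(L) * P(D) * P(T | L, D)."""
    p_link = PRIORS["link_failure"] if link else 1 - PRIORS["link_failure"]
    p_dns = PRIORS["dns_failure"] if dns else 1 - PRIORS["dns_failure"]
    p_timeout = CPT_TIMEOUT[(link, dns)]
    return p_link * p_dns * (p_timeout if timeout else 1 - p_timeout)


def posterior(cause, timeout=True):
    """P(cause = True | page_timeout = timeout), enumerating every assignment."""
    numerator = denominator = 0.0
    for link, dns in product([False, True], repeat=2):
        p = joint(link, dns, timeout)
        denominator += p
        if {"link_failure": link, "dns_failure": dns}[cause]:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    for cause in PRIORS:
        print(cause, round(posterior(cause), 3))
```

    Enumeration is exponential in the number of variables, so it only serves to show what the inference engine computes; a real deployment would rely on an engine with efficient algorithms.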

    III. PRINCIPLES AND FEATURES

    Our strategy is based on three principles and three design decisions made prior to the study of each particular scenario.

    The first principle is that any new solution must be neutral and fit on any OSS (Operations Support Systems) map. One common mistake in the management field is to expect the surrounding IT systems to adapt to our needs, and this assumption is a source of delays and expenditure. The second one is that deployment disruption must be minimal, so it is not necessary to replace any previous system. The third one is that uncertainty is unavoidable, so new systems have to take it into account properly.

    The first design decision is that systems have to be distributed and grow organically as a part of the network, not as an external watchman. The second one is that semantics is part of human knowledge, so systems must be based on semantics from the very beginning. The third one is the application of probabilistic techniques to deal with uncertainty. The combination of BNs with semantic technologies is an active research area [23]-[25] that provides a high degree of flexibility.

    Besides these principles and design decisions, there is an additional economic constraint: systems must be cheap to develop, cheap to deploy and cheap to maintain. Economy is a key issue, since network operation costs have not dropped at the same rate as equipment prices.

    In order to cut down development and installation costs, only Open Source building blocks have been used to build this environment. A second strategic decision was to adopt an Agile methodology [28] to focus the effort on running software and to reduce the management overhead that is common in complex organizations. In particular, Scrum was the methodology chosen [29].

    Regarding hardware, one of the goals is to take advantage of the unused CPU resources that are distributed all over the network. The JADE platform (jade.tilab.com) allows the deployment of Java-coded agents with minimal CPU and memory requirements [31]. When dedicated hardware is needed to run parts of the system, commodity hardware is the best option. For example, a 700 € Quad Core Intel PC with Ubuntu Linux is enough to run the MACROLAN diagnosis prototype for the whole Telefónica España network.

    Maintenance costs are a main concern for IT systems in general. One of the advantages of this approach is that business logic is encoded in a Bayesian Network that can be distributed all over the network at any time. This allows the addition of new diagnosis capabilities or the modification of the BNs without stopping the system. The Bayesian inference engine chosen is SamIam, a light and fast Java implementation developed at UCLA (http://reasoning.cs.ucla.edu/samiam/).

    IV. BAYESIAN KNOWLEDGE CAPTURE

    When modelling a Bayesian Network, two things have to be defined: first the structure of the network and then the parameter values of the Conditional Probability Tables (CPT).

    There are several alternatives for building the initial BN. When no domain expert knowledge is available but there is a large amount of historical data, the structure of the Bayesian network can be built automatically by using structural learning algorithms, such as K2 [32] and tree augmented Naïve Bayes [33]. K2 is a simple and very fast learning algorithm, but its results depend on the initial ordering of the input data, so it makes sense to run the algorithm several times with different random orderings. Another good learning algorithm for Bayesian network classifiers is tree augmented Naïve Bayes (TAN). This algorithm is linear in the number of instances and quadratic in the number of attributes. The main problem with these structure learning algorithms is that, since they are based only on statistics, they may come up with wrong causal dependencies.

    For this reason, the structure of the Bayesian Networks is preferably defined manually as a first step. Together with the structure, the initial CPT parameters are set. This task may require the cooperation of a team of experts in the problem domain with knowledge engineers to model the BN properly.

    Another challenge is the distribution of the intelligence across the physical network. Instead of using a centralised BN, a smarter approach is to partition the whole domain into smaller BNs [26] when the scenario is complex, as is the case in network troubleshooting. Following this principle, different elements in different parts of the network may have different views and knowledge. For instance, some agents may only diagnose network problems while others diagnose service problems, exchanging their partial results to reach a valid conclusion cooperatively. In other words, the single BN that could exist within a centralised solution is fragmented and distributed across the overlay agent network.

    The Virtual Evidence Method (VEM) [27] is an algorithm that partitions the network into several pieces to perform partial inference and belief interchange. With VEM, each agent only needs to know a subset of the whole BN. This makes the diagnosis process more scalable, since computing resources are distributed across the physical network. It also facilitates reusing knowledge in other diagnosis processes that share common parts of the BN.

    Moreover, if the Bayesian Network is partitioned according to the physical network topology, VEM enables mapping the diagnosis process to the different network domains.
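    The idea of partial inference and belief interchange can be sketched on the simplest possible case, a chain R -> S -> E split between two agents that share only the interface variable S. This is an illustration of the virtual-evidence principle under our own assumptions, not the full algorithm of [27]: the agent names, the variables and all probabilities are hypothetical.

```python
# Fragment owned by a hypothetical "ISP agent": root cause R and interface variable S.
P_R = {True: 0.1, False: 0.9}                      # P(R)
P_S_GIVEN_R = {True: {True: 0.8, False: 0.2},      # P(S | R): outer key R, inner key S
               False: {True: 0.05, False: 0.95}}

# Fragment owned by a hypothetical "home agent": interface variable S and evidence E.
P_E_GIVEN_S = {True: {True: 0.9, False: 0.1},      # P(E | S): outer key S, inner key E
               False: {True: 0.2, False: 0.8}}

STATES = (True, False)


def home_agent_message(observed_e):
    """Likelihoods lambda(s) = P(E = observed_e | S = s), sent as virtual evidence on S."""
    return {s: P_E_GIVEN_S[s][observed_e] for s in STATES}


def isp_agent_posterior(virtual_evidence):
    """P(R | e) from the local fragment plus the likelihoods received for S."""
    scores = {
        r: P_R[r] * sum(P_S_GIVEN_R[r][s] * virtual_evidence[s] for s in STATES)
        for r in STATES
    }
    total = sum(scores.values())
    return {r: score / total for r, score in scores.items()}


def centralised_posterior(observed_e):
    """Reference result: the same query on the unpartitioned network R -> S -> E."""
    scores = {
        r: sum(P_R[r] * P_S_GIVEN_R[r][s] * P_E_GIVEN_S[s][observed_e] for s in STATES)
        for r in STATES
    }
    total = sum(scores.values())
    return {r: score / total for r, score in scores.items()}


if __name__ == "__main__":
    message = home_agent_message(observed_e=True)
    print(isp_agent_posterior(message))
    print(centralised_posterior(observed_e=True))
```

    The ISP agent never sees the CPT of the home fragment, only two numbers for the shared variable, yet both computations give the same posterior for this topology.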

    V. SELF LEARNING

    One significant goal of our work is to allow the diagnostic intelligence to adapt and improve over time. To achieve this, the system must be able to learn from past actions, which requires a feedback loop to mark past diagnoses as successful or not. This is very challenging, since there is no straightforward way to assess diagnosis results. A possible solution is to request human feedback from the network operators on the usefulness of each diagnosis. Once this is done, parametric learning algorithms may be used to improve diagnosis quality. These methods update the existing BN, but they only modify the weights in the CPTs, not the structure.

    The Expectation Maximization (EM) algorithm is a parametric learning algorithm used in statistics to find maximum likelihood estimates of parameters in probabilistic models [30]. EM is an iterative algorithm that estimates the missing values in the input data representing previous diagnoses. This is relevant because for some diagnoses only a subset of the possible evidence may be available. Once this is done, statistical methods are used to recalculate the weights in the BN based on the set of previous diagnoses. KOWGAR uses this algorithm because it provides the necessary features for updating the values of the CPTs which, along with the structure, embody the knowledge embedded in a BN.
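    A minimal sketch of EM-based parameter learning follows, assuming a two-node network H -> E in which the hypothesis H is sometimes unknown (for instance, because no operator feedback was given for that diagnosis). The record layout, the starting values and the data set are assumptions made for the example; the structure of the network is fixed and only the CPT values are re-estimated.

```python
import math


def log_likelihood(records, p_h, p_e1, p_e0):
    """Log-likelihood of the observed data; records with h = None marginalise over H."""
    total = 0.0
    for h, e in records:
        a = p_h * (p_e1 if e else 1 - p_e1)          # P(H = 1, E = e)
        b = (1 - p_h) * (p_e0 if e else 1 - p_e0)    # P(H = 0, E = e)
        total += math.log(a + b if h is None else (a if h else b))
    return total


def em_fit(records, p_h=0.5, p_e1=0.6, p_e0=0.4, iterations=20):
    """Estimate P(H), P(E=1 | H=1) and P(E=1 | H=0) from partially observed records."""
    history = []
    for _ in range(iterations):
        # E-step: soft value of H for every record (observed values count as 0 or 1).
        weights = []
        for h, e in records:
            if h is None:
                a = p_h * (p_e1 if e else 1 - p_e1)
                b = (1 - p_h) * (p_e0 if e else 1 - p_e0)
                weights.append(a / (a + b))
            else:
                weights.append(1.0 if h else 0.0)
        # M-step: re-estimate the CPT entries from the expected counts.
        total_w = sum(weights)
        p_h = total_w / len(records)
        p_e1 = sum(w for w, (_, e) in zip(weights, records) if e) / total_w
        p_e0 = sum(1 - w for w, (_, e) in zip(weights, records) if e) / (len(records) - total_w)
        history.append(log_likelihood(records, p_h, p_e1, p_e0))
    return (p_h, p_e1, p_e0), history


# Hypothetical diagnosis log: (confirmed hypothesis or None, symptom observed).
RECORDS = (
    [(True, True)] * 6 + [(True, False)] * 2
    + [(False, True)] * 1 + [(False, False)] * 7
    + [(None, True)] * 4 + [(None, False)] * 4
)

if __name__ == "__main__":
    params, history = em_fit(RECORDS)
    print(params, history[-1])
```

    Each iteration cannot decrease the likelihood of the observed records, which gives a simple convergence check when new feedback is folded into the CPTs.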

    VI. ARCHITECTURE OVERVIEW

    The core of KOWGAR is based on a multi-agent architecture. Different types of agents have been envisaged to carry out the diagnosis process.

    - The Interface Agents are in charge of communication with outside systems, such as a user interface, a trouble ticketing system, a network inventory, etc. In some cases, these agents provide an interface as an external service. In other cases, they use the interfaces of other systems.

    - The Observation Agents' mission is to get evidence from the managed networks and services. These agents have interfaces with the network resources and their Management Information Bases (MIBs), or they can exploit external testing tools or services.

    - The Diagnosis Agents orchestrate the diagnosis process and gather evidence from the Observation Agents in order to carry out the Bayesian inference. This process is driven by the Bayesian Network specifications. Usually, these agents are specialized for a network/service configuration.

    - The Persistence Agents manage the storage and retrieval of KOWGAR information from the system database when required.

    - The Knowledge Agent's task is to distribute the Bayesian knowledge to the agents that need it.

    All the communications between agents are based on FIPA-ACL messages and a specific ontology that the agents understand. This ontology defines the Bayesian Network structure (hypotheses, evidences, conditional probabilities, thresholds, etc.), the information about a diagnosis operation (observations, beliefs, additional information, etc.), and the actions that the agents can carry out. The communications with external systems are usually based on XML messages over HTTP or on standard Web Services.

    As explained in the following sections, this architecture is customized depending on the needs of each scenario. The types of agents must be specialized to adapt the architecture, and different Bayesian Networks are defined for each situation.
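    To make the content of such an ontology more tangible, the sketch below models the kind of knowledge a Knowledge Agent could distribute: hypotheses, evidences, conditional probabilities, thresholds and test actions. Every class and field name is hypothetical, and JSON is used only as a stand-in encoding for the example; the actual agents exchange FIPA-ACL messages, whose format is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TestAction:
    """A test an Observation Agent can run (hypothetical fields)."""
    name: str
    cost: float


@dataclass
class DiagnosisKnowledge:
    """Hypothetical bundle of Bayesian knowledge shared between agents."""
    hypotheses: list
    evidences: list
    cpts: dict          # evidence -> {hypothesis -> P(evidence observed | hypothesis)}
    thresholds: dict    # hypothesis -> confidence needed to stop the diagnosis
    actions: list = field(default_factory=list)

    def to_message(self) -> str:
        """Serialise the whole bundle as one string (JSON stand-in)."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_message(payload: str) -> "DiagnosisKnowledge":
        """Rebuild the bundle on the receiving agent."""
        raw = json.loads(payload)
        raw["actions"] = [TestAction(**action) for action in raw["actions"]]
        return DiagnosisKnowledge(**raw)


EXAMPLE = DiagnosisKnowledge(
    hypotheses=["server_congestion", "dns_failure"],
    evidences=["http_error"],
    cpts={"http_error": {"server_congestion": 0.9, "dns_failure": 0.7}},
    thresholds={"server_congestion": 0.9, "dns_failure": 0.9},
    actions=[TestAction("dns_lookup", 2.0), TestAction("http_get", 5.0)],
)

if __name__ == "__main__":
    print(EXAMPLE.to_message())
```

    Keeping structure, parameters and actions in one serialisable unit is what allows the diagnosis logic to be replaced by sending a new message instead of redeploying agents.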

    VII. KOWGAR: BAYESIAN DIAGNOSIS IN A CORPORATE NETWORK

    KOWGAR is a proof of concept to test the power of a distributed Bayesian diagnosis system, deployed on Telefónica I+D's own corporate Intranet. The system targeted a reduced set of problems, those related to web navigation of internal and external sites. The scenario is very common in any kind of institutional Intranet with geographically distant locations linked via a VPN. In this case Telefónica España acts as the ISP that also provides connectivity. Like any other customer, Telefónica I+D network managers perceive this infrastructure as a cloud or a black box. KOWGAR was designed to help the end user, so, to allow a quick and cheap deployment, a tiny Firefox plug-in was developed as the Graphical Interface. A diagnosis is automatically launched when the browser detects a problem and the user only has to check the final result.

    Two different types of possible HTTP connections can be distinguished:

    - Connections to HTTP servers within the intranet.

    - Connections to HTTP servers located outside TID's intranet, i.e. in the Internet.

    The universe of root causes is selected prior to the construction of the BN. Some of them are: Local misconfiguration, Link failure, Routing failure, DNS server unreachable, DNS information incorrect, Destination host unreachable, Destination port unreachable or Destination application unavailable.

    There were three locations involved in the experiment: Madrid, Valladolid and Huesca. To simulate DNS failures without disrupting the daily Company operations, a pair of mock DNS servers was installed. Fig. 1 shows the testbed and an example of what happens when a final user in a Madrid subnetwork is trying to access a web server in Boecillo that is refusing connections. When an error is detected, Firefox triggers the plug-in, which in turn activates the JADE User Agent.

    This User Agent performs a set of basic testing procedures and, if everything is OK, sends a request to the Aggregation Agent, in charge of orchestrating KOWGAR's actions. This agent knows the BN and the associated ontology that explains which kind of test must be executed to gather the different possible symptoms. So, it invokes three different testing agents: routing, DNS and HTTP. The BN reaches a conclusion that, in this case, is server congestion with a high probability. The information can be displayed from the browser just by clicking the plug-in icon.

    Fig. 1: KOWGAR testbed scenario and agents deployment.

    This simple example shows some of the advantages of the distributed diagnosis. Problems are detected locally if

    possible. If the User Agent detects a network wire failure, for instance, no further tests would be necessary and the process stops. But let us take a look at the Aggregation Agent behavior when, instead of a server congestion in Madrid, there is a problem in the routing tables of a Valladolid router. In this case, the evidences gathered by the User and Routing Agents would be enough to reach this conclusion without performing additional tests. So, instead of performing every possible test available, the ontology sets the cost of each test action, which is used to select the next test to perform (the one with the lowest cost). Each time a new test is performed, providing additional evidence, the Aggregation Agent evaluates the output of the BN. When a sufficient degree of certainty is reached (also called confidence, which is set in the ontology), the process stops and returns the result with the highest probability. This allows avoiding unnecessary tests.

    Another important advantage is that the logic isn't hardwired. The couple BN+Ontology, coded together as a FIPA-SL string, governs the system. Modifying the structure of the BN or the weight of each arc is as fast as distributing a new string to all interested agents. In our tests, this operation took less than one second for the whole network, using the JADE communication mechanisms. As KOWGAR was just a short-term experience developed in three months, no self-learning algorithms were implemented, but the distribution of minor manual changes in the BN and/or the ontology was actually tested.

    The last important feature of KOWGAR we want to highlight is its multiplatform nature. As everything runs on top of JADE/Java, KOWGAR agents were deployed on different flavors of Windows, Linux, Solaris and HP-UX. KOWGAR agents were executed even on a 10[...] micro PC with 256 MB RAM and Puppy Linux.

    We can conclude that this approach can benefit any type of organization which wants to manage common IP services using their own programming skills. Besides, all the tools employed in the prototype are Open Source and no extra HW is necessary.

    VIII. KOWLAN

    MACROLAN is Telefónica España's solution to build Virtual Private Networks (VPN) connecting multiple enterprise sites over Ethernet-based accesses. MACROLAN supports service speeds from 2 Mbit/s to 1 Gbit/s. By using standard L2 and L3 VPN technologies, MACROLAN enables geographically distant customer sites to communicate as if they belong to the same LAN, in terms of speed, reliability and transparency.

    MACROLAN service is built on top of a diverse set of networks. The local loop from the customer location to the Central Office (CO) can be either fiber with media converters, when available, or copper, in rural areas. Synchronous Digital Hierarchy (SDH) circuits allow extending the distance of this access segment when the closest MAN access point is in a different CO. MACROLAN traffic is aggregated into province-wide Metropolitan Ethernet networks (MANs) and then into an IP backbone network that provides national coverage.

    MACROLAN poses an interesting challenge, since it involves a number of different technologies. Diagnosis requires a high degree of skill, experience and ability, since it involves accessing half a dozen different OSSs simultaneously. The goal was then to capture the expertise of human operators, model it in BNs and thus reduce manual intervention for the most common types of failures. There were two additional constraints for the experience: the system should be deployed without any change in the existing OSSs, and it should be running on a live network from the first month, with a six-month deadline for the whole Scrum project.

    The system architecture consists of three main blocks: the multi-agent platform, the web user interface and the database.

    The behavior of KOWLAN agents is based on Bayesian inference, as we have explained. For this experience, there are six different BNs, one for each type of access technology, depending on the combination of fiber, copper and the existence of an SDH path. Knowledge capture required only two working days of a team of expert technicians and two Telefónica I+D engineers.

    In order to provide a common language so that the agents can share knowledge about BNs and diagnosis operations, a top-level ontology has been defined. This ontology includes information about hypotheses and evidences, information about the probabilities of the current hypothesis beliefs, information about topology and inventory, and information about probabilistic dependencies among hypotheses and evidences.

    These are the steps each diagnosis procedure follows:

    1. Requests can be launched by a human operator from the user interface or, more commonly, as the result of the arrival of a Trouble Ticket (TT). An interface agent periodically polls the TT system so, when a new one is assigned to the MACROLAN technical center, a KOWLAN diagnosis procedure is triggered without human intervention.

    2. The only pieces of information available about the failure are the circuit identification and the reported symptom. From two corporate inventories, specialized KOWLAN agents get the full circuit description: topology, scenario, equipment data and configuration.

    3. The appropriate Diagnosis Agent is instantiated according to the scenario. This Agent orchestrates the ontology-coded behavior:

       a. The Diagnosis Agent selects the best available test to perform, taking into account its cost and whether there is enough data to execute it. Then, it requests the test from the appropriate Observation Agent, gets the results, performs the Bayesian inference and checks if any hypothesis threshold has been reached. If not, it selects the next test to perform.

       b. When enough certainty in the diagnosis is reached or there are no more tests to perform, it stores the results in the database.
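    The loop in step 3 can be sketched as follows. This is a simplified illustration under our own assumptions, not the KOWLAN implementation: hypotheses are treated as mutually exclusive, test results as conditionally independent given the hypothesis, the cheapest pending test is always chosen, and the check for available input data is omitted. All names, costs and probabilities are hypothetical.

```python
def diagnose(priors, tests, run_test, threshold=0.9):
    """Run tests in order of increasing cost until one hypothesis reaches the threshold.

    priors:   {hypothesis: prior probability}
    tests:    list of {"name", "cost", "p_positive": {hypothesis: P(test reports failure)}}
    run_test: callable(name) -> bool, standing in for a request to an Observation Agent
    """
    beliefs = dict(priors)
    pending = sorted(tests, key=lambda t: t["cost"])
    executed = []
    while pending and max(beliefs.values()) < threshold:
        test = pending.pop(0)                      # cheapest remaining test
        outcome = run_test(test["name"])
        executed.append(test["name"])
        for hypothesis in beliefs:                 # Bayes update with the new evidence
            p_pos = test["p_positive"][hypothesis]
            beliefs[hypothesis] *= p_pos if outcome else 1 - p_pos
        total = sum(beliefs.values())
        beliefs = {h: b / total for h, b in beliefs.items()}
    best = max(beliefs, key=beliefs.get)
    return best, beliefs[best], executed


# Hypothetical scenario: a routing failure, observed through two cheap tests.
PRIORS = {"routing_failure": 0.3, "dns_failure": 0.3, "server_congestion": 0.4}
TESTS = [
    {"name": "http_get", "cost": 5,
     "p_positive": {"routing_failure": 0.9, "dns_failure": 0.9, "server_congestion": 0.9}},
    {"name": "ping", "cost": 1,
     "p_positive": {"routing_failure": 0.95, "dns_failure": 0.05, "server_congestion": 0.05}},
    {"name": "dns_lookup", "cost": 2,
     "p_positive": {"routing_failure": 0.1, "dns_failure": 0.95, "server_congestion": 0.05}},
]
OBSERVED = {"ping": True, "dns_lookup": False, "http_get": True}

if __name__ == "__main__":
    print(diagnose(PRIORS, TESTS, OBSERVED.__getitem__))
```

    In this scenario the threshold is reached after the two cheapest tests, so the most expensive one is never requested, which is the saving the cost-driven selection is meant to provide.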

    The user interface is a decoupled PHP application, since KOWLAN has been designed to be fully automatic when possible. The operator can request a diagnosis on a circuit, display its details and evaluate the accuracy of the diagnosis through a simple Internet-like polling mechanism. This evaluation may be useful for further refinement of the BN parameters by using self-learning mechanisms.

    After two months of usage, KOWLAN was able to diagnose around 45% of failures with an accuracy higher than 90%. Thanks to this, human intervention has been reduced by 30%, as the analysis of a sample of 3000 trouble tickets has shown.

    The key to the successful application of Bayesian diagnosis in KOWLAN has been the high involvement of the MACROLAN maintenance team in the project. Knowledge capture has proved to be very accurate: the system is only as good as the technicians are, and in this case they are exceptionally skilled. Besides, some shell scripts developed in past years by this group were integrated as part of the KOWLAN diagnosis toolbox. This kind of in-house development should be considered carefully as a repository of experience instead of as a threat to the formal IT structure of organizations.

    IX. KOWGAR@HOME: THE OUTER EDGE PROBLEM

    As we have seen, KOWGAR is a system for distributed Bayesian diagnosis in a corporate network, while KOWLAN applies the same principles to a big telecom operator network. In the first case, the leased infrastructure is seen as a fuzzy cloud, while in the second case diagnostic tools can only test the last operator equipment (customer router) but are blind beyond this point. Each domain sees the other one as an "outer edge", a region of extreme uncertainty. This situation is becoming usual in daily operations, and the final customer needs a whole picture even though the managed entities belong to different players.

    Network Management in the outer edge scenario is a complex task, not only because of the technical constraints but also because of legal, regulatory, security and commercial issues. In the scope of the MAGNETO project [34], a fault diagnosis prototype is being developed to address Home Area Network (HAN) troubleshooting.

    Management of HANs represents a huge challenge for telecom operators since it has to combine the management of different network domains. Some of these domains are under the control of the telecom operator, while the rest belong to the end customer. Current management architectures address HAN management from a centralized perspective, where management tasks are performed in a management system that remotely accesses customer equipment using protocols such as TR-069 [35].

    In order to allow some degree of autonomy to themanagement of HANs, MAGNETO is exploring a distributedarchitecture where management tasks are locally executedimproving efficiency and reducing the burden on centralizedservers. An important MAGNETO feature is its capability ofself diagnosing problems in the HAN. This self diagnosis mayrequire cooperation between management agents placed in

  • 7/31/2019 A Lightweight Approach to Distributed Network Diagnosis under Uncertainty

    6/7

    different HANs or even network domains. For example, MAGNETO agents sitting on HAN equipment may reach some conclusions based on the evidence available locally, but may also need to cooperate with agents in the ISP network or in other HANs, exchanging their views of the problem in order to reach a valid diagnosis.

    KOWGAR@HOME is the testbed to validate the technical solution for distributed troubleshooting that will be integrated further in the wider scope of MAGNETO. It follows a similar approach to the one already described for KOWGAR and KOWLAN. It uses Bayesian Network inference to diagnose the cause of service and network failures, even in situations of uncertainty where there is not full knowledge about service and network status. Regarding the modeling of the diagnosis knowledge, the initial definition of the Bayesian Network used to diagnose each service will be based on knowledge from experts in the field, such as network operators.

    In KOWGAR@HOME, some parts of the overall Bayesian Network can be executed in HAN devices (home gateways, set top boxes, etc.) while other parts run inside the ISP's network or in other HANs. Each of these management domains will have a different perspective on the problems being diagnosed and will exchange information about its conclusions to cooperatively reach a valid diagnosis. For that purpose, it has been decided to use the Virtual Evidence Method for the partition of the BNs, as described in previous paragraphs.

    Figure 2 depicts a possible deployment of KOWGAR@HOME agents. As can be seen, there are agents deployed at different domains, from the HAN to the ISP environment. It is foreseen to have at least one KOWGAR@HOME agent inside the HAN, most likely sitting in a well-equipped device like a residential gateway.

    Fig. 2: KOWGAR@HOME deployment in a HAN environment.
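
    The exchange of conclusions between management domains through virtual evidence can be illustrated with a minimal sketch. It is not the KOWGAR@HOME implementation: the shared node (the state of an access link), the probability values and the function names are all hypothetical, chosen only to show how a HAN agent could summarize its local evidence as a likelihood vector and how an ISP agent could fold that vector into its own belief (posterior proportional to belief times likelihood).

```python
# Minimal sketch of a virtual evidence exchange between two BN fragments.
# All node names, states and probabilities below are illustrative assumptions.


def normalize(dist):
    """Scale a dict of non-negative weights so that they sum to 1."""
    total = sum(dist.values())
    return {state: weight / total for state, weight in dist.items()}


def apply_virtual_evidence(belief, likelihood):
    """Update a belief over the shared node with a likelihood vector.

    posterior(s) is proportional to belief(s) * likelihood(s).
    """
    return normalize({s: belief[s] * likelihood[s] for s in belief})


# HAN fragment: P(local test result | state of the shared access link).
HAN_CPT = {
    "ok": {"pass": 0.9, "fail": 0.1},
    "faulty": {"pass": 0.2, "fail": 0.8},
}


def han_message(observed_result):
    """Likelihood vector the HAN agent sends instead of its raw evidence."""
    return {state: HAN_CPT[state][observed_result] for state in HAN_CPT}


# ISP fragment: current belief about the same shared node.
isp_belief = {"ok": 0.95, "faulty": 0.05}

# The HAN agent observes a failed local test and shares its conclusion.
message = han_message("fail")
isp_posterior = apply_virtual_evidence(isp_belief, message)
```

    In this sketch only the likelihood vector crosses the domain boundary, so the HAN agent does not need to expose its local observations or its part of the network structure to the ISP agent.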

    Additional agents can be deployed in other devices like set top boxes, although this is not compulsory. Moreover, given the resource limitations of most multimedia devices, it may be preferable to access them remotely from the residential gateway.

    Agents deployed in the ISP network take advantage of existing computing resources like OSS servers or even network equipment. These agents have access to information that is not available inside the HAN and can therefore be used to complete the diagnosis process, also making use of the results of the partial inference conducted inside the HAN. In addition, ISP agents can communicate with OSSs belonging to the ISP for two purposes: firstly, to request relevant diagnosis information from them and, secondly, to automatically feed them the results of the diagnosis process, when appropriate, by triggering trouble tickets or alarms.

    Another important KOWGAR@HOME feature is its capability to improve its own diagnosis capabilities by means of self-learning processes. To make this more valuable and accurate, a central server in the ISP network is in charge of gathering historical data about past diagnoses from the different domains involved. Periodically, these data will be used to update the BNs by applying parametric learning algorithms. The new knowledge acquired will then be distributed to the appropriate KOWGAR@HOME agents so it can be used for further diagnoses.

    CONCLUSIONS

    Troubleshooting is a very common task in Network Management. This paper has described a lightweight solution for distributed diagnosis based on Bayesian Networks and built with Open Source components on top of a Multi Agent framework. The proposed architecture has been tested on three different scenarios: a corporate Intranet, a telco VPN infrastructure and an experimental testbed for a multidomain Digital Home proof of concept.

    The most relevant results are the flexibility and short time-to-market of the proposed solution. This approach is a good starting point for any organization that wants to develop an in-house troubleshooting system to cope with the growth and complexity of its own network.

    One important contribution of KOWGAR to this kind of problem is its capacity to deal with uncertainty. It is often difficult to obtain information and test results from different systems. The probabilistic approach provides an answer in any case, although the more evidence is available, the higher the certainty the system reaches.

    Another important benefit is related to scalability, as it has been seen that the multi-agent paradigm offers a good solution to easily deploy decoupled pieces of software in an environment where the overall architecture is not known. Agents can be proactively exploited to locally detect and solve problems, avoiding the participation of more centralized agents in the architecture.

    In the future we plan to further explore some of the challenges identified, such as improving self-learning capabilities, smart partitioning of the BNs, automatic feedback on the success of diagnosis and dynamic generation of BNs.

    ACKNOWLEDGMENT

    The authors of this paper would like to thank the MACROLAN Technical Center staff in Barcelona, Spain, for their support during the KOWLAN experience.

    REFERENCES

    [1] ITU-T, "Principles for a Telecommunications Management Network", Recommendation M.3010, 1996.

    [2] Creaner, M., Reilly, J., "NGOSS Distilled: The Essential Guide to Next Generation Telecoms Management", The Lean Corporation, August 2005.

    [3] Chih-Chun Chen, Sylvia B. Nagi and Christopher D. Clack, "Complexity and Emergence in Engineering Systems", in Complex Systems in Knowledge based Environments: Theory, Models and Applications, Tolk, Andreas; Jain, Lakhmi C. (editors), Springer: New York, NY, USA, 2009, ch. 5, pp. 99-128.

    [4] M. Faloutsos, P. Faloutsos, C. Faloutsos, "On power-law relationships of the Internet topology", Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, pp. 251-262, ACM, 1999.

    [5] J. Spencer, D. Johnson, A. Hastie, L. Sacks, "Emergent properties of the BT SDH network", BT Technology Journal, vol. 21, no. 2, April 2003, pp. 28-36.

    [6] A. Santiago, J. P. Cárdenas, M. L. Mouronte, V. Feliu and R. M. Benito, "Modeling the topology of SDH networks", International Journal of Modern Physics C, vol. 19, no. 12, 2008, pp. 1809-1820.

    [7] A. Binzenhöfer et al., "A P2P-based framework for distributed network management", in New Trends in Network Architectures and Services, LNCS vol. 3883, pp. 198-210, Loveno di Menaggio, Como, Italy, 2006.

    [8] P. Utton and E. Scharf, "A fault diagnosis system for the connected home", IEEE Communications Magazine, vol. 42, no. 11, pp. 128-134, November 2004.

    [9] V. Singh, "Dyswis: An architecture for automated diagnosis of networks", in IEEE Network Operations and Management Symposium, NOMS 2008, pp. 851-854, Salvador de Bahia, Brazil, 2008.

    [10] Fallon L., Parker D., Collins S., Zach M., Leitner M., "Self-Forming Network Management Topologies in the Madeira Management System", Proceedings of the Autonomous Infrastructure, Management and Security International Conference, AIMS 2007, pp. 61-72, Oslo, Norway, 2007.

    [11] R. Badonnel, R. State, O. Festor, "Probabilistic Management of Ad-Hoc Networks", 10th IEEE/IFIP Network Operations and Management Symposium NOMS 2006, pp. 339-350, Vancouver, Canada, 2006.

    [12] Jianguo Ding, Bernd Krämer, Shihao Xu, Hansheng Chen and Yingcai Bai, "Predictive Fault Management in the Dynamic Environment of IP Networks", Proceedings IEEE Workshop on IP Operations and Management, pp. 233-239, 2004.

    [13] Marcus Brunner, Dominique Dudkowski, Chiara Mingardi and Giorgio Nunzi, "Probabilistic Decentralized Network Management", Proceedings IEEE INM 2009, Hofstra University, Long Island, New York, USA, 2009, pp. 25-32.

    [14] Ferat Sahin, "A Bayesian Network Approach to the Self-organization and Learning in Intelligent Agents", Ph.D. dissertation, Virginia Polytechnic, USA, 2000.

    [15] Jianguo Ding, Ningkang Jiang, Xiaoyong Li, Bernd Krämer, Franco Davoli and Yingcai Bai, "Construction of Simulation or Probabilistic Inference in uncertain and Dynamic Networks Based on Bayesian Networks", International Conference on ITS Telecommunications Proceedings, 2006, pp. 983-986.

    [16] Jianguo Ding, "Probabilistic Fault Management in Distributed Systems", Ph.D. dissertation, FernUniversität in Hagen, Germany, 2008.

    [17] Raquel Barco-Moreno, "Bayesian modeling of fault diagnosis in mobile communication networks", Ph.D. dissertation, Universidad de Málaga, Spain, 2007.

    [18] Lu Cheng, Xue-song Qiu, Luoming Meng, Yan Qiao, Zhi-qing Li, "Probabilistic Fault Diagnosis for IT Services in Noisy and Dynamic Environments", Proceedings IEEE INM 2009, Hofstra University, Long Island, New York, USA, 2009, pp. 149-156.

    [19] George J. Lee, "CAPRI: A Common Architecture for Distributed Probabilistic Internet Fault Diagnosis", Ph.D. dissertation, CSAIL-MIT, Cambridge, MA, USA, 2007.

    [20] Judea Pearl, "Bayesian networks: A model of self-activated memory for evidential reasoning", UCLA Report CSD-850017, 1985.

    [21] Richard E. Neapolitan, "Learning Bayesian Networks", in Prentice-Hall Series in Artificial Intelligence, Prentice-Hall, 2003.

    [22] Uffe B. Kjaerulff, Anders L. Madsen, "Bayesian Networks and Influence Diagrams: A Guide to Construction and Analysis", Springer, 2008.

    [23] P.C.G. da Costa, "Bayesian Semantics for the Semantic Web", Ph.D. dissertation, George Mason University, USA, 2005.

    [24] Kathryn Blackmond Laskey and Paulo Cesar G. da Costa, "Uncertainty Representation and Reasoning in Complex Systems", in Complex Systems in Knowledge based Environments: Theory, Models and Applications, Tolk, Andreas; Jain, Lakhmi C. (editors), Springer: New York, NY, USA, 2009, ch. 2, pp. 7-40.

    [25] Zhongli Ding, "BayesOWL: A Probabilistic Framework for Uncertainty in Semantic Web", Ph.D. dissertation, University of Maryland, USA, 2005.

    [26] Yang Xiang, "Probabilistic Reasoning in Multiagent Systems: A Graphical Models Approach", Cambridge University Press, 2002.

    [27] Rong Pan, Yun Peng, Zhongli Ding, "Belief Update in Bayesian Networks Using Uncertain Evidence", 18th IEEE International Conference on Tools with Artificial Intelligence, 2006, pp. 441-444.

    [28] Manifesto for Agile Software Development, 2001. Available: http://agilemanifesto.org/, last visited July 2009.

    [29] Linda Rising and Norman S. Janoff, "The Scrum Software Development Process for Small Teams", IEEE Software, July/August 2000, pp. 2-8.

    [30] Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The EM algorithm", in The Elements of Statistical Learning, New York, USA: Springer, 2001, pp. 236-243.

    [31] Fabio Luigi Bellifemine, Giovanni Caire, Dominic Greenwood, "Developing Multi-Agent Systems with JADE", John Wiley & Sons, Chichester, UK, 2007.

    [32] Gregory F. Cooper, Edward Herskovits, "A Bayesian method for the induction of probabilistic networks from data", Technical Report KSL-91-02, Knowledge Systems Laboratory, Medical Computer Science, Stanford University School of Medicine, Stanford, CA 94305-5479, Nov. 1993.

    [33] Nir Friedman, Dan Geiger, Moises Goldszmidt, "Bayesian Network Classifiers", Machine Learning, vol. 29, pp. 131-163, 1997.

    [34] CELTIC Initiative MAGNETO project, CP5-012, 2008. http://www.celtic-initiative.org/Projects/MAGNETO/abstract-magneto.asp. Last visited July 2009.

    [35] TR-069 CPE WAN Management Protocol. http://www.broadband-forum.org/technical/download/TR-069.pdf. Last visited July 2009.