ca spectruim

Upload: charles-patterson

Post on 04-Apr-2018


  • 7/30/2019 CA Spectruim


    TECHNOLOGY BRIEF: CA SPECTRUM EVENT CORRELATION AND ROOT CAUSE ANALYSIS

    Interpreting Events With Intelligence to Find Root Cause


    Copyright 2008 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. This document is for your informational purposes only. To the extent permitted by applicable law, CA provides this document "As Is" without warranty of any kind, including, without limitation, any implied warranties of merchantability or fitness for a particular purpose, or noninfringement. In no event will CA be liable for any loss or damage, direct or indirect, from the use of this document, including, without limitation, lost profits, business interruption, goodwill or lost data, even if CA is expressly advised of such damages.

    Table of Contents

    Executive Summary

    SECTION 1: CHALLENGE

    A Complex Problem in Need of a Solution

    The Infrastructure is the Business

    The Need to Be Proactive

    The Importance of Understanding Business Impact

    SECTION 2: OPPORTUNITY

    Event Correlation and Root Cause Analysis

    A Three-Pronged Approach for CA SPECTRUM Network Fault Manager

    Inductive Modeling Technology

    Event Management System

    Condition Correlation

    Use Case Scenarios

    Inference and Inductive Modeling Technology

    Creatively using the Event Management System

    SECTION 3: BENEFITS

    CA SPECTRUM for High Performing Infrastructures

    Patented Software Elevates CA SPECTRUM Capabilities

    Benefits for Experienced Users and New Users

    SECTION 4: CONCLUSIONS


    Executive Summary

    Challenge

    Today's complex IT infrastructures are dynamic, multi-vendor engines made of frequently changing components and technologies. The complexity of the infrastructure and the continual changes caused by business demands often lead to faults within the infrastructure. A fault in a single device can have a ripple effect that causes performance and availability problems for many users. The ripple effect also makes it difficult to pinpoint the root cause of the fault.

    To remain relevant and competitive in the marketplace, companies must minimize outages and performance degradations, and this requires effective performance and availability management. Unfortunately, many management tools are not adaptive and have not kept pace with the dynamics of real-time, on-demand IT. Other commonly used tools are niche tools that manage only a portion of the infrastructure, making them largely ineffective in large, interconnected environments.

    Opportunity

    CA SPECTRUM is an infrastructure management solution that provides integrated service, fault and configuration functionality for modeling, monitoring and reporting across multiple network device types and technologies. Using a "trust but verify" methodology, CA SPECTRUM provides an automated and intelligent management approach for your particular environment, whether you are a service provider or an enterprise.

    CA SPECTRUM leverages three types of problem solving to comprehensively manage your infrastructure:

    Model-based Inductive Modeling Technology (IMT)

    Rules-based Event Management System (EMS)

    Policy-based Condition Correlation Technology (CCT)

    Benefits

    CA's model-, rules- and policy-based analytics understand relationships between IT infrastructure assets and the users and services they are designed to support. This insight has enabled CA SPECTRUM to deliver real benefit to customers. A large service provider realized a 70% reduction in downtime while resolving 90% of availability or performance problems from a central location. Patented root cause analysis has been able to reduce the number of alarms by several orders of magnitude while significantly reducing Mean-Time-to-Repair (MTTR).

    The CA integrated approach to fault and performance management has enabled enterprise, government and service provider organizations around the world to achieve reliability, efficiency and effectiveness in managing IT infrastructures as a business service.


    A Complex Problem in Need of a Solution

    IT infrastructure management is an intensive undertaking with significant resource requirements. Most employees in an organization expect the infrastructure to work, not thinking of it as a dynamic, multi-vendor engine made up of frequently changing components and technologies. In fact, the complexity and dynamics of today's real-time, on-demand IT architectures present many opportunities for inconsistencies and failures. Invariably, the infrastructure will slow down or fail, and when it does, tools are required to quickly pinpoint the root cause, suppress symptomatic faults, prioritize based on business impact and accelerate service restoration.

    The Infrastructure is the Business

    The IT infrastructure is a collection of interdependent components including computing systems, networking devices, databases and applications. Within the set of infrastructure components are multiple versions of many vendors' products connected over a variety of networking technologies. In addition, each business environment is different from the next; there is no configuration or standard set of components that makes up an infrastructure. There also is constant change in devices, firmware versions, operating systems, networking technologies, development technologies and tools. But this dynamic and complex infrastructure serves an important purpose; the infrastructure IS the business. Either the infrastructure works and evolves or the organization is out of business. Companies must also evolve their people, processes and management tools for greater efficiency or fall competitively behind.

    To ensure the performance and availability of the infrastructure, most companies employ a dual-approach method consisting of:

    1. Highly available, fault-tolerant, load balancing designs for infrastructure devices and communication paths.

    2. A network management solution to ensure reliable operation.

    High-availability environments further complicate the job of management solutions. The management solution must understand the load balancing capacity, be able to track primary and fault-tolerant backup paths, and understand when redundant systems are active.

    The investment in the management solution is as important as the investment in the infrastructure itself. The solution must be broad, deep, integrated and intelligent to perform its intended function. The infrastructure is not static and the solution will need to embrace change while delivering an end-to-end integrated view of performance and availability across infrastructure silos. Unfortunately, many management tools are not adaptive and do not keep pace with the dynamics of IT reality.

    The Need to Be Proactive

    To remain relevant and competitive in the marketplace, companies must minimize outages and performance degradations. In order to do this, the individual or groups responsible for the care of the infrastructure (e.g., IT, Engineering, Operations) must be proactively notified of problems. There are many tools that monitor the availability and performance of infrastructure components and the business applications that rely on them.

    SECTION 1: CHALLENGE


    Many of these tools simply identify that a problem exists and notify technicians of a problem after it has happened. They often give visibility into only a small slice of the technologies under management and have no ability to understand how the various components relate to each other. This is not enough. It is important that the management solution act as an early warning system to help avoid downtime and service level agreement (SLA) violations. After a problem has occurred is too late: users are dissatisfied and SLA penalties have been levied.

    Before the true task of troubleshooting can begin, the troubleshooter has to isolate the problem. Simply knowing there is a problem and collecting all the problems on one screen is not enough. Troubleshooters need to know where the problem is (and where the problem is not) to effectively triage the issue. If multiple problems are happening simultaneously, issues should be automatically prioritized based on the criticality of the impacted service.

    It is far too costly to rely on human intervention to determine the root cause of problems, and it is also far too costly to sift through an unending storm of symptomatic problems. Knowing the root cause allows an organization to efficiently get problems fixed without wasting time pursuing symptomatic problems.

    The Importance of Understanding Business Impact

    The best management solutions will not only be able to identify problems, but also identify impacted services, assets and users. For the business, understanding impact is as important as understanding the root cause. When outages or performance degradations occur, people cannot do their jobs effectively, resulting in lower productivity and reduced efficiency. Sometimes the products or services provided by the company to its customers are affected, resulting in lost revenue, SLA penalties, lost customers and even damaged brand reputation that can take years to repair. Knowing impact allows an organization to prioritize response efforts to fix what matters most, first.
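To make the idea of impact-based prioritization concrete, here is a minimal Python sketch. It is not CA SPECTRUM code: the `Alarm` class, the `SERVICE_CRITICALITY` weights and the service names are all hypothetical, chosen only to show alarms being ordered by the criticality of the services they affect.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """A root-cause alarm plus the services it impacts (all names hypothetical)."""
    name: str
    impacted_services: list


# Hypothetical criticality weights; a real deployment would define its own.
SERVICE_CRITICALITY = {"online-ordering": 10, "email": 5, "intranet-wiki": 1}


def impact_score(alarm):
    """Score an alarm by the most critical service it touches (0 if none is known)."""
    return max((SERVICE_CRITICALITY.get(s, 0) for s in alarm.impacted_services), default=0)


def prioritize(alarms):
    """Return alarms ordered so that the highest business impact comes first."""
    return sorted(alarms, key=impact_score, reverse=True)


alarms = [
    Alarm("wiki-server-down", ["intranet-wiki"]),
    Alarm("core-router-port-down", ["online-ordering", "email"]),
    Alarm("mail-relay-slow", ["email"]),
]
ordered = [a.name for a in prioritize(alarms)]
print(ordered)
```

Scoring by the single most critical service is only one possible choice; summing weights or counting affected users would be equally reasonable under different assumptions.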

    Event Correlation and Root Cause Analysis: A Three-Pronged Approach for CA SPECTRUM Network Fault Manager

    Root Cause Analysis (RCA) can be defined as the act of interpreting a set of symptoms and events and pinpointing the source of the problem. A single problem often results in multiple events across the infrastructure. Events are typically local to a source, and without proper context do not always help with RCA because they are only symptoms of the problem. Many components provide events and events come in many forms: SNMP traps, syslog messages, application log file entries, TL1 events, ASCII streams, etc.

    Many sophisticated management systems, including CA SPECTRUM Network Fault Manager, even generate events based on proactive polling of component status to indicate parameter-based threshold violations, response time measurement threshold violations, etc. Often, correlation of events is required to determine if an actionable condition or problem exists, but correlation is almost always required to isolate problems, identify impacted assets and services, and suppress symptomatic events.


    SECTION 2: OPPORTUNITY


    Management software applications that perform RCA efficiently should raise an alarm for the root condition and should keep other events resulting from the same root condition from generating alarms of their own.

    One service provider experienced a situation where they were receiving 500,000 daily problem notifications from their management tool. Clearly, no person or team of people could keep up with that many events. CA SPECTRUM root cause analysis technology helped this service provider reduce the number of daily problem notifications to 200 actual alarm conditions, while also automatically prioritizing issues based on impact. In this environment, 500,000 symptoms had only 200 causes. Average time to find and fix a problem was reduced from over four hours to less than five minutes.
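For scale, the figures quoted in this example work out as follows. This is simple arithmetic on the numbers in the text, not additional data, and because the text says "over four hours" and "less than five minutes" the time ratio is a lower bound:

```latex
\[
\frac{500{,}000\ \text{notifications}}{200\ \text{alarms}} = 2{,}500,
\qquad
\log_{10} 2{,}500 \approx 3.4,
\qquad
\frac{4\ \text{h} \times 60\ \text{min/h}}{5\ \text{min}} = 48 .
\]
```

In other words, the notification volume fell by a little more than three orders of magnitude, and the time to find and fix a problem fell by a factor of at least 48.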

    Effective RCA must:

    Understand the relationship between information within the infrastructure and the services, assets and users that depend on that information

    Be proactive in its monitoring and not just rely on event streams

    Distinguish between a plethora of events and meaningful alarms

    Scale and adapt to the requirements of growing and dynamic infrastructures

    Work across multiple-vendor and multiple-technology environments

    Allow for extensions and customization

    CA SPECTRUM employs multiple techniques working cooperatively to deliver its event correlation and root cause analysis capabilities. These include Inductive Modeling Technology (IMT), Event Management System (EMS) and Condition Correlation Technology (CCT). Each of these techniques is employed to diagnose a diverse and often unpredictable set of problems.

    Inductive Modeling Technology

    The core of the CA SPECTRUM RCA solution is its patented Inductive Modeling Technology (IMT). IMT uses an object-oriented modeling paradigm with model-based reasoning analytics. CA SPECTRUM most often uses IMT for physical and logical topology analysis, as the software can automatically map topological relationships through its efficient AutoDiscovery engine. The models created are software representations of a real-world physical or logical device. These software models are in direct communication with their real-world counterparts, enabling CA SPECTRUM root cause analysis to not only listen, but proactively query for health status or additional diagnostic information. Models are described by their attributes, behaviors, relationship to other models and algorithmic intelligence.


    Intelligent analysis is enabled through the collaboration of models in a system. Collaboration includes the ability to exchange information and initiate processing between models within the modeling system. A model making a request to another model may in turn trigger that model to make requests on other models, and so on.

    Relationships between models provide a context for collaboration. Collaboration between models enables:

    Correlation of the symptoms

    Suppression of unnecessary/symptomatic alarms

    Impact analysis

    With CA SPECTRUM, a model is the software representation of a real-world managed device or a component of that managed element. This representation allows CA SPECTRUM not only to investigate and query an individual element within the network, but also to establish relationships between elements in order to recognize them as part of a larger system.

    By understanding the relationship between elements and the conditions of related managed elements, root cause analysis is simplified and problems are identified more quickly. Root cause and impact are determined through IMT's ability to both listen and talk to the infrastructure.

    A simple example of IMT in action can be demonstrated by a network router port transition from UP to DOWN. If a port model receives a LINK DOWN trap, it has intelligence to react by performing a status query to determine if the port is actually down. If it is in fact DOWN, then the system of models will be consulted to determine if the port has lower layer sub-interfaces. If any of the lower layer sub-interfaces are also DOWN, only the condition of the original port DOWN will be raised as an alarm. An application of this example can be described by several Frame Relay Data Link Connection Identifiers (DLCIs) transitioning to INACTIVE. If the Frame Relay port is DOWN, IMT will suppress the symptomatic DLCI INACTIVE conditions and raise an alarm on the Frame Relay port model. Additionally, when the port transitions to DOWN, IMT will query the status of the connected Network Elements (NEs) and if those are also DOWN, those conditions will be considered symptomatic of the port DOWN, will be suppressed, and will be identified as impacted by the port DOWN alarm.
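The port example above can be sketched in a few lines of Python. This is an illustration of the "listen, then verify, then suppress" idea only, not the IMT implementation: the `Model` class, `query_status` and `handle_link_down` are hypothetical names, and the status query is simulated with a stored value rather than a real poll.

```python
class Model:
    """Hypothetical software model of a managed element (port, DLCI, device)."""

    def __init__(self, name, polled_status, children=None, neighbors=None):
        self.name = name
        self.polled_status = polled_status  # what a status query would return
        self.children = children or []      # lower-layer sub-interfaces
        self.neighbors = neighbors or []    # connected network elements

    def query_status(self):
        """Stand-in for actively polling the real device."""
        return self.polled_status


def handle_link_down(port):
    """React to a LINK DOWN trap: verify it, then mark related faults as symptomatic."""
    if port.query_status() != "DOWN":
        return None  # the trap was not confirmed by the status query; raise nothing
    related = port.children + port.neighbors
    symptomatic = [m.name for m in related if m.query_status() in ("DOWN", "INACTIVE")]
    # One alarm on the port; related faults are suppressed and listed as impacted.
    return {"alarm": port.name + " DOWN", "suppressed": symptomatic, "impacted": symptomatic}


dlcis = [Model("dlci-100", "INACTIVE"), Model("dlci-200", "INACTIVE")]
remote = Model("remote-router", "DOWN")
port = Model("frame-relay-port-1", "DOWN", children=dlcis, neighbors=[remote])
result = handle_link_down(port)
print(result["alarm"], result["suppressed"])
```

The point of the sketch is the shape of the result: a single alarm on the Frame Relay port, with the DLCI and remote-device conditions attached to it rather than raised separately.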

    IMT is a very powerful analytical system and can be applied to many problem domains. A more in-depth discussion of IMT in action will be covered later in the paper.


    Event Management System

    There are times when the only source of management information is through event streams local to a specific source. There may be no way to talk to the managed element, but only a way to listen to it. Any one event may or may not be a significant occurrence but, in the context of other events or information, may be an actionable condition.

    Event Rules in SPECTRUM's Event Management System (EMS) provide a comprehensive decision-making system to indicate how events should be processed. Event Rules can be applied to look for a series of events to occur on a model in a certain pattern or within a specific time frame or within certain data value ranges. Event Rules can be used to generate other events or even alarms. If events occur such that the preconditions of a rule are met, another event may be generated allowing cascading events, or the event can be logged for later reporting/troubleshooting purposes, or it can be promoted into an actionable alarm.

    CA SPECTRUM provides a series of customizable meta Event Rule types that form the basis of the EMS. These rule types are building blocks that can be used individually or cooperatively to effect an alarm on the most simple or sophisticated event-oriented scenarios. The rules engine allows for the correlation of event frequency/duration, event sequence and event persistence/coincidence. More than 80% of rule conditions fall into one of the following three event types: frequency/duration, sequence or coincidence. Keep in mind that Event Rules can be run against the aforementioned models to avoid the need to constantly re-write rules to reflect infrastructure move/add/change activity. The Event Rule types are highlighted below, followed by usage examples later in the paper:

    Event Pair (Event Coincidence): This rule is used when you expect certain events to happen in pairs. If the second event in a series does not occur, this may indicate a problem. Event rules created using the Event Pair rule type generate a new event when an event occurs without its paired event. It is possible for other events to happen between the specified event pair without affecting this event rule.

    Event Rate Counter (Event Frequency): This rule type is used to generate a new event based on events at a specified rate in a specified time span. A few events of a certain type may be tolerated, but once the number of these events reaches a certain threshold within a specified time period, notification is required. No additional events will be generated as long as the rate stays at or above the threshold. If the rate drops below the threshold and then subsequently rises above the threshold, another event will be generated.

    Event Rate Window (Event Frequency): This rule type is used to generate a new event when a number of the same events are generated in a specified time period. This rule type is similar to the Event Rate Counter. The Event Rate Counter type is best suited for detecting a long, sustained burst of events. The Event Rate Window type is best suited for accurately detecting shorter bursts of events. It monitors an event that is not significant if it happens occasionally, but is significant if it happens frequently. If the event occurs above a certain rate, then another event is generated. No additional events will be generated as long as the rate stays at or above the threshold. If the rate drops below the threshold and then subsequently rises above the threshold, another event will be generated.


    Event Sequence (Event Sequence): This rule type is used to identify a particular order or sequence of events that might be significant in your IT infrastructure. This sequence can include any number and any type of event. When the sequence is detected in the given period of time, a new event is generated.

    Event Combo (Event Coincidence): This rule type is used to identify a certain combination of events happening in any order. The combination can include any number and type of event. When the combination is detected within a given time period, a new event is generated.

    Event Condition (Event Coincidence): This rule type is used to generate an event based on a conditional expression. In keeping with the CA SPECTRUM "trust but verify" methodology, a series of conditional expressions can be listed within the event rule and the first expression that is found to be true will generate the event. Rules can be constructed to provide correlation through a combination of evaluating event data with IMT model data. For example, if a trap is received notifying the management system of memory buffer overload, to validate that an alarm condition has occurred, an Event Condition rule can initiate a request to the device to check actual memory utilization.
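The frequency rule types described above share one behavior worth seeing in code: fire once when the threshold is reached, stay silent while the rate remains at or above it, and fire again only after the rate has dropped and risen again. The Python sketch below illustrates that behavior with a sliding time window. The `RateRule` class and its parameters are hypothetical and do not reflect how CA SPECTRUM configures or evaluates its rules; in particular, this sketch only re-evaluates the rate when a new event arrives.

```python
from collections import deque


class RateRule:
    """Hypothetical frequency rule: fire once when `threshold` events fall within `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.times = deque()  # timestamps of events still inside the window
        self.armed = True     # False while the rate remains at or above the threshold

    def observe(self, t):
        """Record an event at time t (seconds); return True if a new event should be generated."""
        self.times.append(t)
        # Drop events that have aged out of the window.
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()
        if len(self.times) >= self.threshold:
            fire = self.armed
            self.armed = False  # stay quiet while the rate is sustained
            return fire
        self.armed = True       # rate fell below the threshold; re-arm
        return False


rule = RateRule(threshold=3, window=10)
fired = [rule.observe(t) for t in (0, 1, 2, 3, 30, 31, 32)]
print(fired)
```

With a threshold of three events in ten seconds, the first burst fires on its third event and not on its fourth, and the second burst fires again because the rate dropped in between.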

    CA SPECTRUM implements a number of Event Rules out-of-the-box by applying one or more rule types to event streams. Users can create or customize Event Rules using any of the rule types. Further implementation of Event Rules using the Event Management System will be discussed later in this paper.
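As one more illustration of a rule type, the Event Pair idea (an event that should be followed by a partner event) can be sketched as below. The function name `find_unpaired`, the event names and the explicit `timeout` and `end_time` parameters are assumptions made for this example; the brief does not specify how the pairing interval is configured.

```python
def find_unpaired(events, first, second, timeout, end_time):
    """Return timestamps of `first` events not followed by `second` within `timeout` seconds.

    `events` is a time-ordered list of (timestamp, event_type) tuples. Other
    event types occurring between the pair are ignored. `end_time` is the
    time up to which the stream has been observed.
    """
    unpaired = []
    pending = None  # timestamp of a `first` event still waiting for its partner
    for t, kind in events:
        if pending is not None and t - pending > timeout:
            unpaired.append(pending)  # the partner never arrived in time
            pending = None
        if kind == first and pending is None:
            pending = t
        elif kind == second:
            pending = None
    if pending is not None and end_time - pending > timeout:
        unpaired.append(pending)
    return unpaired


events = [
    (0, "LINK_DOWN"),
    (5, "SYSLOG"),       # unrelated event between the pair
    (20, "LINK_UP"),     # pairs with the LINK_DOWN at t=0
    (100, "LINK_DOWN"),  # never followed by LINK_UP
    (400, "SYSLOG"),
]
unpaired = find_unpaired(events, "LINK_DOWN", "LINK_UP", timeout=60, end_time=400)
print(unpaired)
```

Each timestamp returned is a point where a rule of this kind would generate its new event.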

    Condition Correlation

    In order to perform more complex user-defined or user-controlled correlations, a broader set of capabilities is required. CA SPECTRUM policy-based Condition Correlation Technology enables:

    Creation of correlation policies

    Creation of correlation domains

    Correlation of seemingly disparate event streams or conditions

    Correlation across sets of managed elements

    Correlation within managed domains

    Correlation across sets of managed domains

    Correlation of component conditions as they map to higher order concepts such as business services or customer access

    In order to understand these capabilities, the terminology is described in this context:

    Conditions: A condition is similar to state. A condition can be set by an event and cleared by an event. It is also possible to have an event set a condition but require a user-based action to clear the condition. The condition exists from the time it is set until the time it is cleared. A very simple example of a condition is the "port down" condition. The port down condition will exist for a particular interface from the time the LINK DOWN trap or set event (such as a failed status poll) is received until the time the LINK UP trap or clear event (such as a successful status poll) is received. A number of conditions that may be of use for establishing domain level correlations are defined out-of-the-box and more can be added by the user.


    Seemingly Disparate Conditions: Many devices in an IT infrastructure provide a specific function. The device level function is often without context as it relates to the functions of other devices. Most managed devices can emit event streams but those event streams are local to each component. A simple example is when a response time test identifies a result exceeding a threshold. At the same time, an event may identify a condition of a router port exceeding a transmit bandwidth threshold. These conditions are seemingly disparate, as they are created independently and without context or knowledge of each other. In reality the two are quite related.

    Rule Patterns: Rule Patterns are used to associate conditions when specific criteria are met. A simple example is a "port down" condition caused by a "board pulled" condition, but only if the port's slot number is equal to the slot number of the board that's been pulled. Figure A illustrates this rule pattern. The result of applying a rule pattern can be the creation of an actionable alarm or the suppression of symptomatic alarms.

    RULE PATTERN

    Correlation Policy: Multiple rule patterns can be bundled or grouped into a Correlation Policy. Correlation Policies can then be applied to a Correlation Domain. For example, a set of rule patterns applicable to OSPF correlation can be created and labeled the OSPF Correlation Policy. This policy can be applied to each Correlation Domain as defined by each autonomous OSPF region and the supporting routers in that region.

    Correlation Domain: A Correlation Domain is used to both define and limit the scope of one or more Correlation Policies. A Correlation Domain can be applied to a specific service. For example, in the cable broadband environment, a return path monitoring system may detect a return path failure in a certain geographic area. This return path failure condition is causing subscribers' high speed cable modems to become unreachable and causing Video on Demand (VoD) pay-per-view streams to fail. The knowledge that the return path failure, the modem problems and the failed video streams are all in the same correlation domain is essential to correlating the events and ultimately identifying the root cause. However, it is also important to have the ability to distinguish that a return path failure condition occurring in one correlation domain (Philadelphia, PA) is not correlated with VoD stream failure conditions occurring in a different correlation domain (Portsmouth, NH).
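The terms defined above (condition, rule pattern, correlation policy, correlation domain) fit together roughly as in the following Python sketch. It is a simplified illustration under stated assumptions, not the Condition Correlation Technology itself: the `Condition` fields, the `correlate` function and the domain labels are hypothetical, and a policy is modeled simply as a list of rule-pattern functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """An active condition on a managed element (field names are hypothetical)."""
    kind: str
    element: str
    domain: str
    attrs: tuple = ()  # for example (("slot", 3),)

    def attr(self, key):
        return dict(self.attrs).get(key)


def board_pulled_causes_port_down(cause, symptom):
    """Rule pattern: a pulled board explains a port-down only when the slots match."""
    return (cause.kind == "board pulled" and symptom.kind == "port down"
            and cause.attr("slot") == symptom.attr("slot"))


def correlate(conditions, policy):
    """Apply each rule pattern in `policy`, but only within a single correlation domain."""
    suppressed = set()
    for cause in conditions:
        for symptom in conditions:
            if cause is symptom or cause.domain != symptom.domain:
                continue  # never correlate across domains
            if any(rule(cause, symptom) for rule in policy):
                suppressed.add(symptom)
    alarms = [c for c in conditions if c not in suppressed]
    return alarms, suppressed


conditions = [
    Condition("board pulled", "chassis-1/board-3", "philadelphia", (("slot", 3),)),
    Condition("port down", "chassis-1/port-3-1", "philadelphia", (("slot", 3),)),
    Condition("port down", "chassis-1/port-4-1", "philadelphia", (("slot", 4),)),
    Condition("port down", "chassis-9/port-3-1", "portsmouth", (("slot", 3),)),
]
alarms, suppressed = correlate(conditions, [board_pulled_causes_port_down])
print([c.element for c in alarms])
```

Only the port in the matching slot and the same domain is suppressed. The port in slot 4 and the port in the other domain remain alarms in their own right, which mirrors the Philadelphia and Portsmouth distinction in the text.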


    FIGURE A: Rule patterns determine the sequence of investigation that will result in either the creation of an alarm or the suppression of symptomatic alarms.


    Condition-based correlations are very powerful and provide a mechanism to develop Correlation Policies and apply them to Correlation Domains. When applied to Business Service Management, Correlation Policies can be likened to metrics of an SLA and Correlation Domains can be likened to service, user or geographical groupings.

    There are times when the only way to infer a causal relationship between two or more seemingly disparate conditions is when those conditions occur in a common Correlation Domain. These mechanisms are necessary when causal relationships cannot be discovered through interrogations or receipt of events to/from the infrastructure components.

    Use Case Scenarios

    Out-of-the-box, CA SPECTRUM addresses a wide range of different scenarios where it can perform root cause analysis. This section provides specific scenarios where the techniques described in the previous section are employed to determine root cause and impact analysis. The detail will be limited to the basic processing for the sake of simplicity and brevity. Also for the purpose of the discussion and figures, the following table is provided showing the color of alarms that are associated with the icon status of models at any given time.

    MODEL STATUS COLORS

    Inference and Inductive Modeling Technology

    Communication outages are often described as "black-outs" or "hard faults". With these types of faults, one or more communication paths are degraded to the point that communication is no longer possible. The cause could be broken copper/fiber cables/connections, improperly configured routers or switches, hardware failures, severe performance problems and security attacks. Often the difficulty with these hard communication failures is that there is limited information available to the management system, as it is unable to exchange information with one or more managed elements.


    FIGURE B: Alarms are color coded reflecting the model status.


    The CA SPECTRUM system of sophisticated models, relationships and behaviors available through IMT allows it to infer the fault and impact. IMT inference algorithms are also called inference handlers, and a set of inference handlers designed for a purpose is labeled as an intelligence circuit or simply "intelligence". This section will outline how intelligence is applied to isolate communication outages.

    BUILDING THE MODEL The accurate representation of the infrastructure through the modeling system is the key to being able to determine the fault and the impact of the fault. CA SPECTRUM has specific solutions for discovering multi-path networks over a variety of technologies supporting different architectures. It offers support for meshed and redundant, physical and logical topologies based on ATM, Ethernet, Frame Relay, HSRP, ISDN, ISL, MPLS, Multicast, PPP, VoIP, VPN, VLAN and 802.11 wireless environments, even legacy technologies such as FDDI and Token Ring. Its modeling is extremely extensible and can be used to model OSI Layers 1-7 in a communication infrastructure.

    CA SPECTRUM provides four different m ethods for build ing the physical and logical topologymodel and inter-dependent connectivity for any given infrastructure:

    The AutoDiscovery functionality can be used to automatically interrogate the managed

    infrastructure about its physical and logical relationships. Aut oDiscovery works in tw o

    distinct phases (alt hough there are many different stages with in each phase that are not

    covered here) and dynamically.

    W hen initiated, Aut oDiscovery automatically discovers the devices that exist in t he

    infrastructure. This provides an inventor y of devices that cou ld be m anaged.

    The second phase is M odeling. A utoD iscovery uses management and discovery protocol s

    query the devices it has found to gain information that wi ll be used to determ ine the Layer

    and Layer 3 connectivity between m anaged devices. For example, Aut oDiscovery uses SNM

    to examine route tables, bridge/ swit ch tables and interface tables, but also uses trafficanalysis and vendor p roprietary d iscovery protocols such as Cisco Discovery Protocol (CD P

    AutoDiscovery is a very thorough, accurate and automated mechanism to build the

    infrastructure model.

    Alternately, the SPECTRUM Modeling Gateway can be used to import a description of the entire infrastructure's components, as well as physical and logical connectivity information from external sources, such as provisioning systems, network topology databases or configuration management databases (CMDBs).

    The command line interface or programmatic APIs can also be used to build a custom integration or application to import information from external sources.

    SPECTRUM's OneClick Network Console can be used to quickly point and click to manually build the model.

    CA SPECTRUM allows a single managed element to be logically broken up into any number of sub-models. This collection of models and the relationships between them is often referred to as the semantic data model for that device. Thus, a typical semantic data model for a device may include a chassis model with board models related to the chassis. Associated to the board models would be physical interface models. Each physical interface model may have a set of sub-interface models associated below it.
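
    The chassis, board, interface and sub-interface hierarchy described above can be pictured as a small tree. The following is a minimal sketch only: the Model class, its fields and the device names are hypothetical illustrations and are not CA SPECTRUM's internal data structures or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Model:
    """One node in a device's semantic data model (hypothetical structure)."""
    name: str
    kind: str  # for example "chassis", "board", "interface", "sub-interface"
    children: List["Model"] = field(default_factory=list)

    def add(self, child: "Model") -> "Model":
        """Associate a child model below this one and return the child."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, model) pairs, parents before their children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def build_example_device() -> Model:
    """Build a chassis with one board, one interface and two sub-interfaces."""
    chassis = Model("router-1", "chassis")
    board = chassis.add(Model("slot-1", "board"))
    port = board.add(Model("ge-1/0", "interface"))
    port.add(Model("ge-1/0.100", "sub-interface"))
    port.add(Model("ge-1/0.200", "sub-interface"))
    return chassis


if __name__ == "__main__":
    for depth, model in build_example_device().walk():
        print("  " * depth + f"{model.kind}: {model.name}")
```

    Walking the tree from the chassis downward mirrors the way the text describes board, interface and sub-interface models as being associated below their parent.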

    CA SPECTRUM has a set of well-defined associations that define how different semantic data model sets interact with one another. When the software determines the connectivity between two devices, a relationship is established between the two ports that form the link between them, as well as the relationships that form between device models and the corresponding interface and port models of other devices. This is depicted in Figure C.

    MODELING DEVICE AND INTERFACE LEVEL CONNECTIVITY

    WHEN DOES THE ANALYSIS BEGIN? CA SPECTRUM can begin to solve a problem proactively upon receipt of a single event. Many problems share the same set of symptoms and only through further analysis can the root cause be determined. For communication outages, the analysis is triggered when a model recognizes a communication failure. Failed polling, traps, events, performance threshold violations or lack of response can all lead to this recognition. CA SPECTRUM validates the communication failures through retries, alternative protocols and alternative path checking as part of its "trust but verify" methodology.

    CA SPECTRUM will refer to the model that triggered the intelligence as the initiator model, although more than one model can trigger the intelligent validation procedures. The initiator model intelligence requests a list of other models that are directly connected to it. These connected models are referred to as the initiator model's neighbors.

    FIGURE C

    Modeled relationships between devices are reflected in a relationship between the ports and interfaces that connect them.

    THE INITIATOR MODEL AND NEIGHBORS

    With a list of neighbors determined, CA SPECTRUM directs each neighbor model to check its current status. This check is referred to as the "Are you OK?" check. "Are you OK?" is a relative term, and a unique set of attributes related to performance and availability will vary from model to model based on the real-world capabilities of the device that the model is representing.

    When a model is asked "Are you OK?", the model can initiate a variety of tests to verify its current operational status. For example, with most SNMP managed devices the check is typically a combination of SNMP requests but could be more involved by interrogating an Element Management System or as simple as an ICMP ping. A comprehensive check could include threshold performance calculations or execution of response time tests.

    Each neighbor model returns an answer to "Are you OK?" and CA SPECTRUM then begins its analysis of the answers.

    DETERMINING THE HEALTH OF NEIGHBORS

    FIGURE D

    The initiator model triggers the intelligent validation procedure. It requests a list of models that are directly connected and these are called the initiator model's neighbors.

    FIGURE E

    Once the list of neighbors is established, the status of each neighbor is checked with the "Are you OK?" message.

    FAULT ISOLATION If the initiator model has a neighbor that responds that it is OK (Figure F, Model A), then it can be inferred the problem lies between the unaffected neighbor and the affected initiator (Figure F, Model B). In this case, the initiator model that triggered the intelligence is a likely culprit for this particular infrastructure failure. The result? A critical alarm will be asserted on the initiator model and it is considered the root cause alarm.

    FAULT ISOLATION IN PROGRESS

    ALARM SUPPRESSION As the analysis continues beyond isolating the device at fault (Figure G, Model B), the next step is to analyze the effects of the fault, the goal of which is intelligent alarm suppression. If a neighbor (Figure G, Models C, D or E) of the initiator model responds "No, I am not OK", then this particular neighbor is considered to be affected by a failure that is occurring somewhere else. As a result, CA SPECTRUM will place these models into a suppressed condition (grey color) because any alarms from these devices are symptomatic of a problem elsewhere.
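
    The neighbor check, fault isolation and alarm suppression steps described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name isolate_fault, the topology dictionary and the are_you_ok callable are hypothetical stand-ins, not SPECTRUM APIs. The text does not say what happens when no neighbor answers OK, so the sketch simply reports no root cause in that case.

```python
def isolate_fault(initiator, neighbors, are_you_ok):
    """Apply the "Are you OK?" check to the initiator's neighbors.

    neighbors:  dict mapping a model name to the names directly connected to it
                (hypothetical stand-in for the modeled topology).
    are_you_ok: callable(name) -> bool standing in for each model's status check.

    Returns (root_cause, suppressed): the model that receives the critical
    alarm (or None), and the set of models placed in a suppressed condition.
    """
    answers = {name: are_you_ok(name) for name in neighbors[initiator]}
    healthy = [name for name, ok in answers.items() if ok]
    unhealthy = {name for name, ok in answers.items() if not ok}

    # A healthy neighbor implies the fault lies between it and the initiator,
    # so the initiator is treated as the root cause. The case where no
    # neighbor is healthy is not covered by the text; None is an assumption.
    root_cause = initiator if healthy else None
    return root_cause, unhealthy


if __name__ == "__main__":
    # Mirrors Figure G: B is the initiator, A is healthy, C, D and E are not.
    topology = {"B": ["A", "C", "D", "E"]}
    status = {"A": True, "C": False, "D": False, "E": False}
    print(isolate_fault("B", topology, status.get))
```

    Keeping the status check as a callable reflects the point made earlier that "Are you OK?" means something different for each model type.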

    FAULT ISOLATION COMPLETE

    FIGURE F

    The root cause alarm is established through a sequence of sharing status between models.

    FIGURE G

    Models that respond with a "No, I am not OK" status are put in a suppressed condition to suppress the alarms that are symptomatic of a problem elsewhere in the infrastructure.

    IMPACT ANALYSIS CA SPECTRUM continues to analyze the total impact of the fault. It will analyze each Fault Domain, a Fault Domain being the collection of models with suppressed alarms that are affected by the same failure. These impacted models are linked to the root fault for presentation and analysis. The intelligence provides the impact measurement this fault is creating, by examining the models that are included within this Fault Domain and calculating an Impact Severity value. The ranking allows operators to quickly assess the relative impact of each fault and prioritize corrective actions.
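
    The text does not give the formula behind the Impact Severity value, so the following sketch should be read as one plausible illustration rather than the product's calculation. It assumes, hypothetically, that each impacted model carries a numeric weight and that a fault's severity is the sum of the weights in its Fault Domain; the function names and weights are invented for the example.

```python
def impact_severity(fault_domain, weights, default_weight=1):
    """Sum hypothetical per-model weights over one Fault Domain."""
    return sum(weights.get(model, default_weight) for model in fault_domain)


def rank_faults(fault_domains, weights):
    """Return (root_fault, severity) pairs, highest severity first.

    fault_domains: dict mapping a root-fault model to the suppressed models
                   linked to it.
    weights:       dict of assumed importance values per model.
    """
    scored = [
        (root, impact_severity(models, weights))
        for root, models in fault_domains.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    domains = {"B": ["C", "D", "E"], "X": ["Y"]}
    weights = {"Y": 10}  # assume Y carries a business-critical service
    for root, severity in rank_faults(domains, weights):
        print(root, severity)
```

    Ranking by a single number is what lets an operator compare a fault that suppresses many minor models against one that suppresses a single critical model.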

    Creatively using the Event Management System

    There are many applications of Event Rules that will allow higher order correlation of event streams. Event Rule processing is required for situations when the event stream is the only source of management information. For example, this situation can occur when CA SPECTRUM accepts event streams from devices and applications that it does not directly monitor, so that CA SPECTRUM can only listen; it cannot talk. CA SPECTRUM provides many out-of-the-box event rules, but also provides easy-to-use methods for creating new rules using one or more of the event rule types. This section highlights a couple of out-of-the-box event rules and also a few customer examples of event rule applications.

    AN OUT-OF-THE-BOX EVENT PAIR RULE CA SPECTRUM has the ability to interpret Cisco syslog messages as event streams. Each syslog message is generated on behalf of a managed switch or router and is directed to the model representing that managed element. One of the many Cisco syslog messages indicates a new configuration has been loaded into the router. The Reload message should always be followed by a Restart message, indicating the device has been restarted to adopt the newly loaded configuration. If not, a failure during reload is probable. An event rule based on the Event Pair rule type is used to raise an alarm with cause ERROR DURING ROUTER RELOAD if the restart message is not received within 15 minutes of the reload message. Figure H diagrams the events and timing.
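
    The timing logic of an event pair rule can be sketched as follows. This is a simplified illustration, not SPECTRUM's rule syntax: the EventPairRule class, the event names "reload" and "restart", and the polling method check are all assumptions made for the example. Only the 15-minute window and the alarm cause come from the text.

```python
from datetime import datetime, timedelta


class EventPairRule:
    """Raise an alarm when a first event is not followed by its partner in time."""

    def __init__(self, first, second, window, alarm_cause):
        self.first = first
        self.second = second
        self.window = window
        self.alarm_cause = alarm_cause
        self._pending = {}  # source -> time the first event was seen

    def on_event(self, source, kind, when):
        """Record a first event, or clear it if the partner arrives in time."""
        if kind == self.first:
            self._pending[source] = when
        elif kind == self.second:
            started = self._pending.get(source)
            if started is not None and when - started <= self.window:
                del self._pending[source]  # pair completed in time

    def check(self, now):
        """Return (source, cause) alarms for pairs not completed in time."""
        expired = [s for s, t in self._pending.items() if now - t > self.window]
        for source in expired:
            del self._pending[source]
        return [(source, self.alarm_cause) for source in expired]


if __name__ == "__main__":
    rule = EventPairRule(
        "reload", "restart", timedelta(minutes=15), "ERROR DURING ROUTER RELOAD"
    )
    start = datetime(2008, 1, 1, 12, 0)
    rule.on_event("router-a", "reload", start)
    print(rule.check(start + timedelta(minutes=16)))
```

    State is kept per source so that a restart message from one router cannot satisfy a reload message from another.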

    EVENT PAIR RULE

    FIGURE H

    This figure depicts an example of an event pair rule in operation. Reload messages indicating new router configurations should always be followed by restart messages to indicate the router has adopted the new configuration. An alarm is raised to indicate a probable failure if the restart message is not received within the expected time period.

    MANAGING SECURITY EVENTS USING AN EVENT RATE COUNTER RULE CA SPECTRUM is often used to collect event feeds from many sources. Some customers send events from security devices such as intrusion detection systems and firewalls. These types of devices can generate millions of log file entries. One customer utilizes an Event Rate Counter rule to distinguish between sporadic client connection rejections and real security attacks. The rule was constructed to generate a critical alarm if 20 or more connection failures occurred in less than one minute. Figure I depicts this alarm scenario.
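
    A rate-based rule of this kind is commonly implemented as a sliding time window. The sketch below assumes that approach; the EventRateCounter class and its parameters are hypothetical and do not reflect how SPECTRUM implements the rule type. The threshold of 20 events and the one-minute window are taken from the customer example above.

```python
from collections import deque


class EventRateCounter:
    """Signal when `threshold` or more events fall inside a sliding window."""

    def __init__(self, threshold=20, window_seconds=60.0):
        self.threshold = threshold
        self.window = window_seconds
        self._times = deque()  # timestamps of events still inside the window

    def on_event(self, timestamp):
        """Record one event (timestamp in seconds); return True if the rate is met."""
        self._times.append(timestamp)
        # Drop events that are a full window or more older than the newest one.
        while timestamp - self._times[0] >= self.window:
            self._times.popleft()
        return len(self._times) >= self.threshold


if __name__ == "__main__":
    counter = EventRateCounter()
    # Twenty connection failures one second apart: the last one meets the rate.
    results = [counter.on_event(float(second)) for second in range(20)]
    print(results[-1])
```

    In this sketch every further event inside a busy window also returns True; a production rule would normally raise the alarm once and then hold it until cleared.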

    EVENT RATE COUNTER RULE

    MANAGING SERVER MEMORY GROWTH USING AN EVENT SEQUENCE RULE A common problem with some applications is the inability to manage memory usage. There are applications that will take system memory and never give it back for other applications to reuse. When the application does not return the memory, and also no longer requires the memory, it is called a memory leak. The result is that performance on the host machine will degrade and eventually cause the application to fail.

    At one customer environment this problem regularly occurs on a Web Server application. The customer has a standard operating procedure to reboot the server once a week to compensate for the memory consumption. However, if the memory leak occurs too quickly, there is a deviation from normal behavior and the server needs to be rebooted before the scheduled maintenance window. The customer employs a combination of progressive thresholds with an Event Sequence rule to monitor for abnormal behavior. Monitoring was set to create events as the memory usage passed threshold points of 50%, 75% and 90%. If those threshold points are reached in a period of less than one week, an alarm is generated to provide notification to reboot the server prior to the scheduled maintenance window. Figure J depicts the fault scenario.
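
    The sequence logic can be sketched as a small state machine that tracks how far through the ordered thresholds a server has progressed and how long it took. The EventSequenceRule class and the event names are hypothetical; the three thresholds and the one-week window come from the example above. Restarting the sequence whenever the first threshold event is seen again is an assumption standing in for the weekly reboot.

```python
from datetime import datetime, timedelta


class EventSequenceRule:
    """Detect an ordered sequence of events completing inside a time window."""

    def __init__(self, sequence, window):
        self.sequence = list(sequence)
        self.window = window
        self._start = None  # time the first event of the sequence was seen
        self._index = 0     # number of sequence events matched so far

    def on_event(self, kind, when):
        """Return True when the whole sequence completes within the window."""
        if kind == self.sequence[0]:
            # Seeing the first event again (e.g. after a reboot) restarts tracking.
            self._start, self._index = when, 1
        elif self._index and kind == self.sequence[self._index]:
            self._index += 1
        else:
            return False  # out-of-order or unrelated event: ignore it
        if self._index == len(self.sequence):
            completed_in_time = when - self._start < self.window
            self._start, self._index = None, 0
            return completed_in_time
        return False


if __name__ == "__main__":
    rule = EventSequenceRule(["mem>50", "mem>75", "mem>90"], timedelta(weeks=1))
    day0 = datetime(2008, 1, 1)
    for offset, kind in [(0, "mem>50"), (2, "mem>75"), (4, "mem>90")]:
        print(kind, rule.on_event(kind, day0 + timedelta(days=offset)))
```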

    FIGURE I

    This figure depicts an example of the Event Rate Counter rule. Events are often sent by other devices, such as intrusion detection systems, that can generate millions of events. To separate event noise from real events, this rule limits alarms to situations where 20 or more connection failures are logged within a minute.

    EVENT SEQUENCE RULE

    CONDITION CORRELATION TECHNOLOGY There are many uses for policy-based Condition Correlation Technology (CCT). For example, consider the complexities of managing an IP network that provides VPN connectivity across an MPLS backbone with intra-area routing maintained by IS-IS and inter-area routing maintained by BGP. Any physical link or protocol failure could cause dozens of events from multiple devices. Without sophisticated correlation capability applied carefully, the network troubleshooters will spend most of their time chasing after symptoms, rather than fixing the root cause.

    AN IS-IS ROUTING FAILURE EXAMPLE A specific example experienced by one of our customers can be used to describe the power of Condition Correlation. The failure scenario and link outages are illustrated in Figure K. The situation occurs where a core router, labeled in the figure as R1, loses IS-IS adjacencies to all neighbors (labeled in the figure as R2, R3, R4). This also results in the BGP session with the route reflector (labeled in the figure as RR) being lost. This condition, if it persists, will result in routes aging out of R1 and adjacent edge routers R3 and R4. Eventually, the customer VPN sites serviced by these customer premise edge (CPE) routers will be unable to reach their peer sites (labeled in the figure as CPE1, CPE2, CPE3).

    FIGURE J

    Reaching threshold points is sometimes acceptable if the thresholds are reached over a period of time that will ensure they are reset by regular maintenance schedules before reaching critical levels. If they are reached sooner, however, they may cause outages. The Event Sequence rule will measure threshold attainment and generate an alarm only if required.

    IMPACT OF AN IS-IS ROUTING FAILURE

    This failure causes a series of syslog error messages and traps to be generated by the routers. The messages and traps that would be received are outlined in Figure L.

    SYSLOG ERROR MESSAGE AND TRAP SEQUENCE

    FIGURE K

    This diagram shows how a failure in a core router can ripple through the network, causing numerous events and alarms, if not intelligently managed.

    FIGURE L

    Error messages and traps cascade from a single core router failure.

    SOURCE  TYPE            MESSAGE
    R1      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R2 (POS5/0/0) Down, hold time expired
    R1      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R3 (POS5/0/0) Down, hold time expired
    R1      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to Rn (POS5/0/0) Down, hold time expired
    RR      Syslog message  %BGP-5-ADJCHANGE: neighbor R1 Down BGP Notification sent
    RR      Syslog message  %BGP-3-NOTIFICATION: sent to neighbor R1 4/0 (hold time expired) 0 bytes
    RR      Trap            BGP Backwards Transition trap, neighbor = R1
    R2      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (POS5/0/0) Down, hold time expired
    R3      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (POS5/0/0) Down, hold time expired
    Rn      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (POS5/0/0) Down, hold time expired

    The root cause of all these messages is an IS-IS outage problem related to R1. For many management systems the operator would see each of these traps as seemingly disparate events on the alarm console. A trained operator or experienced troubleshooter may be able to deduce, after some careful thought, that an R1 routing problem is indicated. However, in a large environment these alarms will likely be interspersed with other alarms cluttering the console. Even if the operator were capable of making the correlation manually, there would be significant effort and time spent doing so. That time is directly related to costs, lower user satisfaction and lost revenue.

    Using a combination of an Event Rule and Condition Correlation, a set of rule patterns can be applied to a Correlation Domain consisting of all core, label switch routers, enabling CA SPECTRUM to produce a single actionable alarm. This alarm will indicate that R1 has an IS-IS routing problem, and a network outage may result if this is not corrected. The seemingly disparate conditions that were correlated by the software, resulting in this alarm, will be displayed in the symptoms panel of the alarm console as follows:

    1. A local Event Rate Counter rule was used to define multiple IS-IS adjacency change syslog messages reported by the same source as a routing problem for that source.

    2. A rule pattern was used to mark an IS-IS adjacency lost event as caused by an IS-IS routing problem when the neighbor of the adjacency lost event is equal to the source of the routing problem event.

    3. A rule pattern was used to mark a BGP adjacency down event as caused by an IS-IS routing problem when the neighbor of the adjacency down event is equal to the source of the routing problem event.

    4. A rule pattern was used to mark a BGP backward transition trap event as caused by an IS-IS routing problem when the neighbor of the backward transition event is equal to the source of the routing problem event.
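
    The four rules above share one matching idea: a symptom event is attributed to a routing problem when its neighbor field equals the source of that problem. A minimal sketch of that idea follows. The event dictionaries, the type names and the threshold of three adjacency changes are hypothetical choices made for the example, and time windows are left out for brevity; none of this is SPECTRUM's rule syntax.

```python
from collections import Counter

# Event types treated as possible symptoms of a routing problem (assumed names).
SYMPTOM_TYPES = {"isis_adjacency_lost", "bgp_adjacency_down", "bgp_backward_transition"}


def correlate(events, min_adjacency_changes=3):
    """Group symptom events under the router inferred to have a routing problem.

    events: list of dicts with "type", "source" and "neighbor" keys.
    Returns a dict mapping each problem source to its correlated symptom events.
    """
    # Rule 1: several adjacency changes reported by one source are taken to
    # mean a routing problem at that source (threshold is an assumption).
    counts = Counter(e["source"] for e in events if e["type"] == "isis_adjacency_lost")
    problems = {src for src, n in counts.items() if n >= min_adjacency_changes}

    # Rules 2-4: a symptom is caused by a routing problem when the symptom's
    # neighbor equals the source of the routing problem.
    symptoms = {src: [] for src in problems}
    for event in events:
        if event["type"] in SYMPTOM_TYPES and event.get("neighbor") in problems:
            symptoms[event["neighbor"]].append(event)
    return symptoms


if __name__ == "__main__":
    events = [
        {"type": "isis_adjacency_lost", "source": "R1", "neighbor": n}
        for n in ("R2", "R3", "Rn")
    ] + [
        {"type": "bgp_adjacency_down", "source": "RR", "neighbor": "R1"},
        {"type": "bgp_backward_transition", "source": "RR", "neighbor": "R1"},
        {"type": "isis_adjacency_lost", "source": "R2", "neighbor": "R1"},
    ]
    for root, related in correlate(events).items():
        print(root, [e["source"] for e in related])
```

    Applied to the Figure L sequence, the messages from RR, R2, R3 and Rn all name R1 as their neighbor, which is what allows them to collapse under a single R1 alarm.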

    APPLYING CONDITION CORRELATION TO SERVICE CORRELATION It is common to have several services running over the same network. As an example, in the cable industry, telephone service (VoIP), Internet access (high speed data), video on demand (VoD) and digital cable are delivered over the same physical data network. Management of this network is quite a challenge. Inside the network (cable plant), the video transport equipment, video subscription services and the Cable Modem Termination System (CMTS) all work together to put data on the cable network at the correct frequencies. Uncounted miles of cable along with thousands of amplifiers and power supplies must carry the signals to the homes of literally millions of subscribers. With the flood of events and error messages that will be provided by the managed elements, the fact that there is a problem with the service will be obvious. The challenge is to translate all this data into root cause and service impact actionable information.

    Service impact relevance goes beyond understanding what is impacted; it is also important to understand what is not impacted. It's possible for the video subscription service to fail to deliver VoD content to a single service area, and yet all other services to that area could be fine. Or, a return path problem in one area could cause Internet, VoIP and VoD services to fail and digital cable to degrade, yet analog cable could still function normally. In the case of a media cut in one area, the return path monitoring system and the head end controller would report return path and power problems in that area. The CMTS would provide the number of cable modems off-line for the node. The video transport system would generate errors for video subscriptions in that area. Lastly, any customer modems that are being managed will become lost to the management system.

    CA SPECTRUM can make sense of the resulting deluge of events by using the service area of the seemingly disparate events as a factor in the Condition Correlation. If the service areas and services are appropriately modeled, Condition Correlation can be used to determine which services in which areas are affected and the root cause or causes.
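
    Using the service area as a correlation key can be illustrated with a simple grouping step. The sketch below is an assumption-laden illustration: the event fields "area" and "service", the function names and the modeled service map are all hypothetical, and real correlation would also weigh event types and timing. It shows only how grouping by area answers both questions raised above, what is impacted and what is not.

```python
from collections import defaultdict


def affected_services(events):
    """Map each service area to the set of services with reported problems."""
    by_area = defaultdict(set)
    for event in events:
        by_area[event["area"]].add(event["service"])
    return dict(by_area)


def unaffected_services(events, modeled):
    """Return, per area, the modeled services with no reported problems.

    modeled: dict mapping a service area to the set of services delivered there.
    """
    hit = affected_services(events)
    return {
        area: services - hit.get(area, set())
        for area, services in modeled.items()
    }


if __name__ == "__main__":
    modeled = {
        "area-1": {"VoIP", "Internet", "VoD", "digital cable", "analog cable"},
        "area-2": {"VoIP", "Internet", "VoD"},
    }
    events = [
        {"area": "area-1", "service": "Internet"},
        {"area": "area-1", "service": "VoIP"},
        {"area": "area-1", "service": "VoD"},
    ]
    print(affected_services(events))
    print(unaffected_services(events, modeled))
```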

    SERVICE CORRELATION

    FIGURE M

    Condition Correlation enables CA SPECTRUM to analyze a deluge of events to determine which service is impacted.

    CA SPECTRUM for High Performing Infrastructures

    A high performing IT infrastructure is at the core of today's successful businesses. Whether your business is online retail, financial services or manufacturing, your infrastructure is essential. Keeping the infrastructure running, avoiding outages, quickly finding the causes of degradation and outages, and simplifying management and event data for your IT staff are all factors in maintaining high performance.

    Patented Software Elevates CA SPECTRUM Capabilities

    CA SPECTRUM provides intelligence, multiple methods and patented solutions to apply the best in event correlation and root cause analysis to your infrastructure. Event correlation is at the heart of root cause analysis. With large and complex infrastructures, events flood event logs and your IT staff can be overwhelmed by attempting to correlate events manually. The time wasted in this effort has a direct effect on your business and its bottom line. CA SPECTRUM uses intelligence and event rules to separate true root causes from associated, symptomatic causes, thereby minimizing the amount of information and maximizing the quality of information your IT staff must address.

    Benefits for Experienced Users and New Users

    CA SPECTRUM provides out-of-the-box utilization, performance and response time thresholds that act as an early warning system when a problem is about to happen or when a service level guarantee is about to be violated. While these thresholds can be tuned for a specific customer environment, there is tremendous value in having these out-of-the-box thresholds. They enable CA SPECTRUM to deliver value on day one.

    CA SPECTRUM makes it easy for experienced users to add their own thresholds and watches such that after a unique problem happens in the environment once, new watches can help predict or prevent it from happening again. Through Event Rules and combinations of Event Rules, even complex behaviors can be captured and managed.

    SECTION 3: BENEFITS

    Change is a constant, requiring any management system to be automated, adaptable, and extensible. The number of multi-vendor, multi-technology hardware and software elements in a typical IT environment exponentially increases the complexity of managing a real-time, on-demand IT infrastructure. CA SPECTRUM currently supports several thousand distinct means to automate root cause analysis across hundreds of product families and device types from today's leading infrastructure vendors.

    Knowing about a problem is no longer enough. Predicting and preventing problems, pinpointing their root cause, and prioritizing issues based on impact are requirements for today's management solutions. The number and variety of possible fault, performance and threshold problems means that no single approach to root cause analysis is suited for all scenarios. For this reason, CA SPECTRUM incorporates model-based IMT, rules-based EMS, and policy-based CCT to provide an integrated, intelligent approach to drive efficiency and effectiveness in managing IT infrastructure as a business service.

    To learn more about CA SPECTRUM and its technical approach, visit http://www.ca.com/us/products/product.aspx?id=7832

    SECTION 4: CONCLUSIONS

    CA (NYSE: CA), one of the world's leading independent, enterprise management software companies, unifies and simplifies complex information technology (IT) management across the enterprise for greater business results. With our Enterprise IT Management vision, solutions and expertise, we help customers effectively govern, manage and secure IT.

    TB05IM SPECRCA1E M P327420