
7/30/2019 CA Spectruim


TECHNOLOGY BRIEF: CA SPECTRUM EVENT CORRELATION AND ROOT CAUSE ANALYSIS

Interpreting Events With Intelligence to Find Root Cause



Copyright 2008 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. This document is for your informational purposes only. To the extent permitted by applicable law, CA provides this document "As Is" without warranty of any kind, including, without limitation, any implied warranties of merchantability or fitness for a particular purpose, or noninfringement. In no event will CA be liable for any loss or damage, direct or indirect, from the use of this document, including, without limitation, lost profits, business interruption, goodwill or lost data, even if CA is expressly advised of such damages.

Table of Contents

Executive Summary

SECTION 1: CHALLENGE
A Complex Problem in Need of a Solution
The Infrastructure is the Business
The Need to Be Proactive
The Importance of Understanding Business Impact

SECTION 2: OPPORTUNITY
Event Correlation and Root Cause Analysis
A Three-Pronged Approach for CA SPECTRUM Network Fault Manager
Inductive Modeling Technology
Event Management System
Condition Correlation
Use Case Scenarios
Inference and Inductive Modeling Technology
Creatively using the Event Management System

SECTION 3: BENEFITS
CA SPECTRUM for High Performing Infrastructures
Patented Software Elevates CA SPECTRUM Capabilities
Benefits for Experienced Users and New Users

SECTION 4: CONCLUSIONS



Executive Summary

Challenge

Today's complex IT infrastructures are dynamic, multi-vendor engines made of frequently changing components and technologies. The complexity of the infrastructure and the continual changes caused by business demands often lead to faults within the infrastructure. A fault in a single device can have a ripple effect that causes performance and availability problems for many users. The ripple effect also makes it difficult to pinpoint the root cause of the fault.

To remain relevant and competitive in the marketplace, companies must minimize outages and performance degradations, and this requires effective performance and availability management. Unfortunately, many management tools are not adaptive and have not kept pace with the dynamics of real-time, on-demand IT. Other tools often used are niche tools that manage only a portion of the infrastructure, making them largely ineffective in large, interconnected environments.

Opportunity

CA SPECTRUM is an infrastructure management solution that provides integrated service, fault and configuration functionality for modeling, monitoring and reporting across multiple network device types and technologies. Using a "trust but verify" methodology, CA SPECTRUM provides an automated and intelligent management approach for your particular environment, whether you are a service provider or an enterprise.

CA SPECTRUM leverages three types of problem solving to comprehensively manage your infrastructure:

Model-based Inductive Modeling Technology (IMT)
Rules-based Event Management System (EMS)
Policy-based Condition Correlation Technology (CCT)

Benefits

CA's model-, rules- and policy-based analytics understand relationships between IT infrastructure assets and the users and services they are designed to support. This insight has enabled CA SPECTRUM to deliver real benefit to customers. A large service provider realized a 70% reduction in downtime while resolving 90% of availability or performance problems from a central location. Patented root cause analysis has been able to reduce the number of alarms by several orders of magnitude while significantly reducing Mean-Time-to-Repair (MTTR).

The CA integrated approach to fault and performance management has enabled enterprise, government and service provider organizations around the world to achieve reliability, efficiency and effectiveness in managing IT infrastructures as a business service.



A Complex Problem in Need of a Solution

IT infrastructure management is an intensive undertaking with significant resource requirements. Most employees in an organization expect the infrastructure to work, not thinking of it as a dynamic, multi-vendor engine made up of frequently changing components and technologies. In fact, the complexity and dynamics of today's real-time, on-demand IT architectures present many opportunities for inconsistencies and failures. Invariably, the infrastructure will slow down or fail, and when it does, tools are required to quickly pinpoint the root cause, suppress symptomatic faults, prioritize based on business impact and accelerate service restoration.

The Infrastructure is the Business

The IT infrastructure is a collection of interdependent components including computing systems, networking devices, databases and applications. Within the set of infrastructure components are multiple versions of many vendors' products connected over a variety of networking technologies. In addition, each business environment is different from the next; there is no configuration or standard set of components that makes up an infrastructure. There also is constant change in devices, firmware versions, operating systems, networking technologies, development technologies and tools. But this dynamic and complex infrastructure serves an important purpose: the infrastructure IS the business. Either the infrastructure works and evolves or the organization is out of business. Companies must also evolve their people, processes and management tools for greater efficiency or fall competitively behind.

To ensure the performance and availability of the infrastructure, most companies employ a dual-approach method consisting of:

1. Highly available, fault-tolerant, load balancing designs for infrastructure devices and communication paths.
2. A network management solution to ensure reliable operation.

High-availability environments further complicate the job of management solutions. The management solution must understand the load balancing capacity, be able to track primary and fault-tolerant backup paths, and understand when redundant systems are active.

The investment in the management solution is as important as the investment in the infrastructure itself. The solution must be broad, deep, integrated and intelligent to perform its intended function. The infrastructure is not static and the solution will need to embrace change while delivering an end-to-end integrated view of performance and availability across infrastructure silos. Unfortunately, many management tools are not adaptive and do not keep pace with the dynamics of IT reality.

The Need to Be Proactive

To remain relevant and competitive in the marketplace, companies must minimize outages and performance degradations. In order to do this, the individuals or groups responsible for the care of the infrastructure (e.g., IT, Engineering, Operations) must be proactively notified of problems. There are many tools that monitor the availability and performance of infrastructure components and the business applications that rely on them.

SECTION 1: CHALLENGE


Many of these tools simply identify that a problem exists and notify technicians of a problem after it has happened. They often give visibility into only a small slice of the technologies under management and have no ability to understand how the various components relate to each other. This is not enough. It is important that the management solution act as an early warning system to help avoid downtime and service level agreement (SLA) violations. After a problem has occurred is too late: users are dissatisfied and SLA penalties have been levied.

Before the true task of troubleshooting can begin, the troubleshooter has to isolate the problem. Simply knowing there is a problem and collecting all the problems on one screen is not enough. Troubleshooters need to know where the problem is (and where the problem is not) to effectively triage the issue. If multiple problems are happening simultaneously, issues should be automatically prioritized based on the criticality of the impacted service.

It is far too costly to rely on human intervention to determine the root cause of problems, and it is also far too costly to sift through an unending storm of symptomatic problems. Knowing the root cause allows an organization to efficiently get problems fixed without wasting time pursuing symptomatic problems.

The Importance of Understanding Business Impact

The best management solutions will not only be able to identify problems, but also identify impacted services, assets and users. For the business, understanding impact is as important as understanding the root cause. When outages or performance degradations occur, people cannot do their jobs effectively, resulting in lower productivity and reduced efficiency. Sometimes the products or services provided by the company to its customers are affected, resulting in lost revenue, SLA penalties, lost customers and even damaged brand reputation that can take years to repair. Knowing impact allows an organization to prioritize response efforts to fix what matters most, first.

Event Correlation and Root Cause Analysis
A Three-Pronged Approach for CA SPECTRUM Network Fault Manager

Root Cause Analysis (RCA) can be defined as the act of interpreting a set of symptoms and events and pinpointing the source of the problem. A single problem often results in multiple events across the infrastructure. Events are typically local to a source, and without proper context do not always help with RCA because they are only symptoms of the problem. Many components provide events, and events come in many forms: SNMP traps, syslog messages, application log file entries, TL1 events, ASCII streams, etc.

Many sophisticated management systems, including CA SPECTRUM Network Fault Manager, even generate events based on proactive polling of component status to indicate parameter-based threshold violations, response time measurement threshold violations, etc. Often, correlation of events is required to determine if an actionable condition or problem exists, but correlation is almost always required to isolate problems, identify impacted assets and services, and suppress symptomatic events.


SECTION 2: OPPORTUNITY




Management software applications efficiently performing RCA should raise an alarm for the root condition and should minimize the number of other events resulting from the same root condition that generate an alarm.

One service provider experienced a situation where they were receiving 500,000 daily problem notifications from their management tool. Clearly, no person or team of people could keep up with that many events. CA SPECTRUM root cause analysis technology helped this service provider reduce the number of daily problem notifications to 200 actual alarm conditions, while also automatically prioritizing issues based on impact. In this environment, 500,000 symptoms had only 200 causes. Average time to find and fix a problem was reduced from over four hours to less than five minutes.

Effective RCA must:

Understand the relationship between information within the infrastructure and the services, assets and users that depend on that information
Be proactive in its monitoring and not just rely on event streams
Distinguish between a plethora of events and meaningful alarms
Scale and adapt to the requirements of growing and dynamic infrastructures
Work across multiple-vendor and multiple-technology environments
Allow for extensions and customization

CA SPECTRUM employs multiple techniques working cooperatively to deliver its event correlation and root cause analysis capabilities. These include Inductive Modeling Technology (IMT), Event Management System (EMS) and Condition Correlation Technology (CCT). Each of these techniques is employed to diagnose a diverse and often unpredictable set of problems.

Inductive Modeling Technology

The core of the CA SPECTRUM RCA solution is its patented Inductive Modeling Technology (IMT). IMT uses an object-oriented modeling paradigm with model-based reasoning analytics. CA SPECTRUM most often uses IMT for physical and logical topology analysis, as the software can automatically map topological relationships through its efficient AutoDiscovery engine. The models created are software representations of a real-world physical or logical device. These software models are in direct communication with their real-world counterparts, enabling CA SPECTRUM root cause analysis to not only listen, but proactively query for health status or additional diagnostic information. Models are described by their attributes, behaviors, relationships to other models and algorithmic intelligence.



Intelligent analysis is enabled through the collaboration of models in a system. Collaboration includes the ability to exchange information and initiate processing between models within the modeling system. A model making a request to another model may in turn trigger that model to make requests on other models, and so on.

Relationships between models provide a context for collaboration. Collaboration between models enables:

Correlation of the symptoms
Suppression of unnecessary/symptomatic alarms
Impact analysis

With CA SPECTRUM, a model is the software representation of a real-world managed device or a component of that managed element. This representation not only allows CA SPECTRUM to investigate and query an individual element within the network, but also provides the means to establish relationships between elements in order to recognize them as part of a larger system.

By understanding the relationship between elements and the conditions of related managed elements, root cause analysis is simplified and problems are identified more quickly. Root cause and impact are determined through IMT's ability to both listen and talk to the infrastructure.

A simple example of IMT in action can be demonstrated by a network router port transition from UP to DOWN. If a port model receives a LINK DOWN trap, it has intelligence to react by performing a status query to determine if the port is actually down. If it is in fact DOWN, then the system of models will be consulted to determine if the port has lower layer sub-interfaces. If any of the lower layer sub-interfaces are also DOWN, only the condition of the original port DOWN will be raised as an alarm. An application of this example can be described by several Frame Relay Data Link Connection Identifiers (DLCIs) transitioning to INACTIVE. If the Frame Relay port is DOWN, IMT will suppress the symptomatic DLCI INACTIVE conditions and raise an alarm on the Frame Relay port model. Additionally, when the port transitions to DOWN, IMT will query the status of the connected Network Elements (NEs) and if those are also DOWN, those conditions will be considered symptomatic of the port DOWN, will be suppressed, and will be identified as impacted by the port DOWN alarm.
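The port example above can be sketched in a few lines of Python. This is only an illustration of the reasoning described in the text, not CA SPECTRUM code: the `Model` class, its fields, the `DOWN_STATES` set and `handle_link_down` are hypothetical names, and the status query is simulated by a stored value rather than a real poll.

```python
from dataclasses import dataclass, field

# Statuses treated as "not working" in this sketch (an assumption).
DOWN_STATES = {"DOWN", "INACTIVE"}


@dataclass
class Model:
    """Hypothetical software model of a port, sub-interface or network element."""
    name: str
    status: str = "UP"                                   # what a status query would return
    sub_interfaces: list = field(default_factory=list)   # lower layer sub-interfaces
    neighbors: list = field(default_factory=list)        # connected network elements

    def query_status(self) -> str:
        # Stand-in for actively polling the real device.
        return self.status


def handle_link_down(port: Model) -> dict:
    """React to a LINK DOWN trap received for `port`."""
    result = {"alarm": None, "suppressed": [], "impacted": []}
    # Verify first: the trap alone is not taken as proof.
    if port.query_status() not in DOWN_STATES:
        return result
    # The port itself is the root condition and gets the alarm.
    result["alarm"] = port.name
    # Sub-interfaces that are also down are symptoms, so suppress them.
    for sub in port.sub_interfaces:
        if sub.query_status() in DOWN_STATES:
            result["suppressed"].append(sub.name)
    # Connected elements that are down are suppressed and marked as impacted.
    for element in port.neighbors:
        if element.query_status() in DOWN_STATES:
            result["suppressed"].append(element.name)
            result["impacted"].append(element.name)
    return result
```

Calling `handle_link_down` on a Frame Relay port model whose DLCIs are INACTIVE would, under these assumptions, return one alarm on the port and list the DLCIs as suppressed.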

IMT is a very powerful analytical system and can be applied to many problem domains. A more in-depth discussion of IMT in action will be covered later in the paper.




Event Management System

There are times when the only source of management information is through event streams local to a specific source. There may be no way to talk to the managed element, but only a way to listen to it. Any one event may or may not be a significant occurrence but, in the context of other events or information, may be an actionable condition.

Event Rules in SPECTRUM's Event Management System (EMS) provide a comprehensive decision-making system to indicate how events should be processed. Event Rules can be applied to look for a series of events to occur on a model in a certain pattern or within a specific time frame or within certain data value ranges. Event Rules can be used to generate other events or even alarms. If events occur such that the preconditions of a rule are met, another event may be generated allowing cascading events, or the event can be logged for later reporting/troubleshooting purposes, or it can be promoted into an actionable alarm.

CA SPECTRUM provides a series of customizable meta Event Rule types that form the basis of the EMS. These rule types are building blocks that can be used individually or cooperatively to effect an alarm on the most simple or sophisticated event-oriented scenarios. The rules engine allows for the correlation of event frequency/duration, event sequence and event persistence/coincidence. More than 80% of rule conditions fall into one of the following three event types: frequency/duration, sequence or coincidence. Keep in mind that Event Rules can be run against the aforementioned models to avoid the need to constantly re-write rules to reflect infrastructure move/add/change activity. The Event Rule types are highlighted below, followed by usage examples later in the paper:

Event Pair (Event Coincidence): This rule is used when you expect certain events to happen in pairs. If the second event in a series does not occur, this may indicate a problem. Event rules created using the Event Pair rule type generate a new event when an event occurs without its paired event. It is possible for other events to happen between the specified event pair without affecting this event rule.

Event Rate Counter (Event Frequency): This rule type is used to generate a new event based on events at a specified rate in a specified time span. A few events of a certain type may be tolerated, but once the number of these events reaches a certain threshold within a specified time period, notification is required. No additional events will be generated as long as the rate stays at or above the threshold. If the rate drops below the threshold and then subsequently rises above the threshold, another event will be generated.

Event Rate Window (Event Frequency): This rule type is used to generate a new event when a number of the same events are generated in a specified time period. This rule type is similar to the Event Rate Counter. The Event Rate Counter type is best suited for detecting a long, sustained burst of events. The Event Rate Window type is best suited for accurately detecting shorter bursts of events. It monitors an event that is not significant if it happens occasionally, but is significant if it happens frequently. If the event occurs above a certain rate, then another event is generated. No additional events will be generated as long as the rate stays at or above the threshold. If the rate drops below the threshold and then subsequently rises above the threshold, another event will be generated.




Event Sequence (Event Sequence): This rule type is used to identify a particular order or sequence of events that might be significant in your IT infrastructure. This sequence can include any number and any type of event. When the sequence is detected in the given period of time, a new event is generated.

Event Combo (Event Coincidence): This rule type is used to identify a certain combination of events happening in any order. The combination can include any number and type of event. When the combination is detected within a given time period, a new event is generated.

Event Condition (Event Coincidence): This rule type is used to generate an event based on a conditional expression. In keeping with the CA SPECTRUM "trust but verify" methodology, a series of conditional expressions can be listed within the event rule and the first expression that is found to be true will generate the event. Rules can be constructed to provide correlation through a combination of evaluating event data with IMT model data. For example, if a trap is received notifying the management system of memory buffer overload, to validate that an alarm condition has occurred, an Event Condition rule can initiate a request to the device to check actual memory utilization.
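As an illustration of the frequency-based rule types described above, the following minimal Python sketch fires once when the number of events inside a sliding time window reaches a threshold, stays silent while the rate remains at or above the threshold, and re-arms after the rate drops below it. The class name `EventRateWindow`, its parameters and the use of plain numeric timestamps are assumptions made for the example; they are not the product's configuration syntax, and the drop in rate is only evaluated when a new event arrives.

```python
from collections import deque


class EventRateWindow:
    """Hypothetical sliding-window rate rule (not the CA SPECTRUM implementation)."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold      # events needed inside the window
        self.window = window_seconds    # length of the sliding window
        self.times = deque()            # timestamps still inside the window
        self.above = False              # True while the rate is at/above threshold

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True if a new event should be generated."""
        self.times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while self.times and self.times[0] <= timestamp - self.window:
            self.times.popleft()
        now_above = len(self.times) >= self.threshold
        # Fire only on the transition from below to at/above the threshold.
        fire = now_above and not self.above
        self.above = now_above
        return fire
```

For example, with `threshold=3` and `window_seconds=10`, three events within ten seconds produce a single generated event, and further events at the same pace produce none until the burst has ended and a new one begins.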

CA SPECTRUM implements a number of Event Rules out-of-the-box by applying one or more rule types to event streams. Users can create or customize Event Rules using any of the rule types. Further implementation of Event Rules using the Event Management System will be discussed later in this paper.
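To make the Event Pair idea concrete, here is a small hedged sketch in Python. It scans a time-ordered list of events and reports every "first" event that is not followed by its paired event within a time limit, ignoring unrelated events in between. The function name `find_unpaired`, the tuple layout and the `timeout` parameter are assumptions introduced for illustration; the source does not describe how the pairing interval is configured. As a simplification, one paired event may satisfy more than one preceding first event.

```python
def find_unpaired(events, first, second, timeout):
    """Return timestamps of `first` events with no `second` event within `timeout`.

    `events` is a list of (timestamp, event_type) tuples sorted by timestamp.
    Events of other types between the pair do not affect the result.
    """
    unpaired = []
    for i, (t, kind) in enumerate(events):
        if kind != first:
            continue
        # Look ahead for the paired event inside the allowed interval.
        paired = any(
            k == second and t <= ts <= t + timeout
            for ts, k in events[i + 1:]
        )
        if not paired:
            unpaired.append(t)   # a rule would generate a new event here
    return unpaired
```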

Condition Correlation

In order to perform more complex user-defined or user-controlled correlations, a broader set of capabilities is required. CA SPECTRUM policy-based Condition Correlation Technology enables:

Creation of correlation policies
Creation of correlation domains
Correlation of seemingly disparate event streams or conditions
Correlation across sets of managed elements
Correlation within managed domains
Correlation across sets of managed domains
Correlation of component conditions as they map to higher order concepts such as business services or customer access

In order to understand these capabilities, the terminology is described in this context:

Conditions: A condition is similar to a state. A condition can be set by an event and cleared by an event. It is also possible to have an event set a condition but require a user-based action to clear the condition. The condition exists from the time it is set until the time it is cleared. A very simple example of a condition is a "port down" condition. The "port down" condition will exist for a particular interface from the time the LINK DOWN trap or set event (such as a failed status poll) is received until the time the LINK UP trap or clear event (such as a successful status poll) is received. A number of conditions that may be of use for establishing domain level correlations are defined out-of-the-box, and more can be added by the user.




Seemingly Disparate Conditions: Many devices in an IT infrastructure provide a specific function. The device level function is often without context as it relates to the functions of other devices. Most managed devices can emit event streams, but those event streams are local to each component. A simple example is when a response time test identifies a result exceeding a threshold. At the same time, an event may identify a condition of a router port exceeding a transmit bandwidth threshold. These conditions are seemingly disparate, as they are created independently and without context or knowledge of each other. In reality the two are quite related.

Rule Patterns: Rule Patterns are used to associate conditions when specific criteria are met. A simple example is a "port down" condition caused by a "board pulled" condition, but only if the port's slot number is equal to the slot number of the board that's been pulled. Figure A illustrates this rule pattern. The result of applying a rule pattern can be the creation of an actionable alarm or the suppression of symptomatic alarms.

RULE PATTERN

Correlation Policy: Multiple rule patterns can be bundled or grouped into a Correlation Policy. Correlation Policies can then be applied to a Correlation Domain. For example, a set of rule patterns applicable to OSPF correlation can be created and labeled the OSPF Correlation Policy. This policy can be applied to each Correlation Domain as defined by each autonomous OSPF region and the supporting routers in that region.

Correlation Domain: A Correlation Domain is used to both define and limit the scope of one or more Correlation Policies. A Correlation Domain can be applied to a specific service. For example, in the cable broadband environment, a return path monitoring system may detect a return path failure in a certain geographic area. This return path failure condition is causing subscribers' high-speed cable modems to become unreachable and causing Video on Demand (VoD) pay-per-view streams to fail. The knowledge that the return path failure, the modem problems and the failed video streams are all in the same correlation domain is essential to correlating the events and ultimately identifying the root cause. However, it is also important to have the ability to distinguish that a return path failure condition occurring in one correlation domain (Philadelphia, PA) is not correlated with VoD stream failure conditions occurring in a different correlation domain (Portsmouth, NH).
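The terms above (conditions, rule patterns, policies and domains) can be tied together with a short Python sketch. It is an illustration under stated assumptions, not the product's policy language: the `Condition` fields, the function names, and the extra requirement that cause and symptom sit on the same device are all hypothetical. A policy is modeled simply as a list of rule-pattern functions, and correlation is only attempted between conditions that share a correlation domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical active condition raised in a correlation domain."""
    kind: str      # e.g. "port_down" or "board_pulled"
    domain: str    # correlation domain the condition belongs to
    device: str
    slot: int


def board_pulled_causes_port_down(cause: Condition, symptom: Condition) -> bool:
    """Rule pattern: "port down" is caused by "board pulled" only if slots match."""
    return (
        cause.kind == "board_pulled"
        and symptom.kind == "port_down"
        and cause.device == symptom.device   # assumption: same chassis
        and cause.slot == symptom.slot
    )


def correlate(active, policy):
    """Split active conditions into actionable alarms and suppressed symptoms."""
    suppressed = set()
    for cause in active:
        for symptom in active:
            # A policy is only applied inside a single correlation domain.
            if cause is symptom or cause.domain != symptom.domain:
                continue
            if any(rule(cause, symptom) for rule in policy):
                suppressed.add(symptom)
    alarms = [c for c in active if c not in suppressed]
    return alarms, suppressed
```

Under these assumptions, a "port down" condition in slot 3 is suppressed when a "board pulled" condition for slot 3 exists in the same domain, while the same pair in two different domains is left uncorrelated.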


FIGURE A

Rule patterns determine the sequence of investigation that will result in either the creation of an alarm or the suppression of symptomatic alarms.



Condition-based correlations are very powerful and provide a mechanism to develop Correlation Policies and apply them to Correlation Domains. When applied to Business Service Management, Correlation Policies can be likened to metrics of an SLA and Correlation Domains can be likened to service, user or geographical groupings.

There are times when the only way to infer a causal relationship between two or more seemingly disparate conditions is when those conditions occur in a common Correlation Domain. These mechanisms are necessary when causal relationships cannot be discovered through interrogations or receipt of events to/from the infrastructure components.

Use Case Scenarios

Out-of-the-box, CA SPECTRUM addresses a wide range of different scenarios where it can perform root cause analysis. This section provides specific scenarios where the techniques described in the previous section are employed to determine root cause and impact analysis. The detail will be limited to the basic processing for the sake of simplicity and brevity. Also, for the purpose of the discussion and figures, the following table is provided showing the color of alarms that are associated with the icon status of models at any given time.

MODEL STATUS COLORS

Inference and Inductive Modeling Technology

Communication outages are often described as "black-outs" or "hard faults." With these types of faults, one or more communication paths are degraded to the point that communication is no longer possible. The cause could be broken copper/fiber cables/connections, improperly configured routers or switches, hardware failures, severe performance problems and security attacks. Often the difficulty with these hard communication failures is that there is limited information available to the management system, as it is unable to exchange information with one or more managed elements.


FIGURE B

Alarms are color coded reflecting the model status.



The CA SPECTRUM system of sophisticated models, relationships and behaviors available through IMT allows it to infer the fault and impact. IMT inference algorithms are also called inference handlers, and a set of inference handlers designed for a purpose is labeled as an intelligence circuit or simply "intelligence." This section will outline how intelligence is applied to isolate communication outages.

BUILDING THE MODEL The accurate representation of the infrastructure through the modeling system is the key to being able to determine the fault and the impact of the fault. CA SPECTRUM has specific solutions for discovering multi-path networks over a variety of technologies supporting different architectures. It offers support for meshed and redundant, physical and logical topologies based on ATM, Ethernet, Frame Relay, HSRP, ISDN, ISL, MPLS, Multicast, PPP, VoIP, VPN, VLAN and 802.11 wireless environments, and even legacy technologies such as FDDI and Token Ring. Its modeling is extremely extensible and can be used to model OSI Layers 1-7 in a communication infrastructure.

CA SPECTRUM provides four different methods for building the physical and logical topology model and inter-dependent connectivity for any given infrastructure:

The AutoDiscovery functionality can be used to automatically interrogate the managed infrastructure about its physical and logical relationships. AutoDiscovery works in two distinct phases (although there are many different stages within each phase that are not covered here) and dynamically. In the first phase, when initiated, AutoDiscovery automatically discovers the devices that exist in the infrastructure. This provides an inventory of devices that could be managed. The second phase is Modeling. AutoDiscovery uses management and discovery protocols to query the devices it has found to gain information that will be used to determine the Layer 2 and Layer 3 connectivity between managed devices. For example, AutoDiscovery uses SNMP to examine route tables, bridge/switch tables and interface tables, but also uses traffic analysis and vendor proprietary discovery protocols such as Cisco Discovery Protocol (CDP). AutoDiscovery is a very thorough, accurate and automated mechanism to build the infrastructure model.

Alternately, the SPECTRUM Modeling Gateway can be used to import a description of the entire infrastructure's components, as well as physical and logical connectivity information, from external sources such as provisioning systems, network topology databases or configuration management databases (CMDBs).

The command line interface or programmatic APIs can also be used to build a custom integration or application to import information from external sources.

SPECTRUM's OneClick Network Console can be used to quickly point and click to manually build the model.

CA SPECTRUM allows a single managed element to be logically broken up into any number of sub-models. This collection of models and the relationships between them is often referred to as the semantic data model for that device. Thus, a typical semantic data model for a device may include a chassis model with board models related to the chassis. Associated to the board models would be physical interface models. Each physical interface model may have a set of sub-interface models associated below it.
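The chassis, board, interface and sub-interface hierarchy described above maps naturally onto a tree. The following Python sketch is only a way to visualize that structure; the `SubModel` class and the device names are hypothetical and do not reflect how CA SPECTRUM stores its models.

```python
from dataclasses import dataclass, field


@dataclass
class SubModel:
    """Hypothetical node in a device's semantic data model."""
    name: str
    kind: str                                   # "chassis", "board", "interface", ...
    children: list = field(default_factory=list)

    def add(self, child: "SubModel") -> "SubModel":
        """Attach a sub-model below this one and return it for chaining."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, model) pairs for this model and everything below it."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Build one illustrative device: chassis -> board -> interface -> sub-interfaces.
chassis = SubModel("router-1", "chassis")
board = chassis.add(SubModel("board-2", "board"))
port = board.add(SubModel("serial-2/0", "interface"))
port.add(SubModel("dlci-100", "sub-interface"))
port.add(SubModel("dlci-200", "sub-interface"))

# Print the hierarchy with indentation showing the containment depth.
for depth, model in chassis.walk():
    print("  " * depth + f"{model.kind}: {model.name}")
```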


CA SPECTRUM has a set of well-defined associations that define how different semantic data model sets act with one another. When the software determines the connectivity between two devices, a relationship is established between the two ports that form the link between them, as well as the relationships that form between device models and the corresponding interface and port models of other devices. This is depicted in Figure C.

MODELING DEVICE AND INTERFACE LEVEL CONNECTIVITY

WHEN DOES THE ANALYSIS BEGIN? CA SPECTRUM can begin to solve a problem proactively upon receipt of a single event. Many problems share the same set of symptoms and only through further analysis can the root cause be determined. For communication outages, the analysis is triggered when a model recognizes a communication failure. Failed polling, traps, events, performance threshold violations or lack of response can all lead to this recognition. CA SPECTRUM validates the communication failures through retries, alternative protocols and alternative path checking as part of its "trust but verify" methodology.
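One way to picture this validation step is a function that only confirms a communication failure after every available check has failed on every attempt. The Python sketch below is a hedged illustration: `confirm_failure`, the `retries` default and the idea of passing checks as zero-argument callables are assumptions for the example, and the real checks (SNMP polls, pings, alternative paths) are left to the caller.

```python
def confirm_failure(checks, retries=2):
    """Return True only if every check fails on every attempt.

    `checks` is a non-empty list of zero-argument callables, for example a
    poll over the primary protocol, a probe over an alternative protocol and
    a probe over an alternative path. Each returns True if the device answered.
    """
    if not checks:
        raise ValueError("at least one check is required")
    for check in checks:
        # One initial attempt plus `retries` further attempts per check.
        for _ in range(retries + 1):
            if check():
                return False   # any successful contact refutes the failure
    return True
```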

CA SPECTRUM will refer to the model that triggered the intelligence as the initiator model, although more than one model can trigger the intelligent validation procedures. The initiator model intelligence requests a list of other models that are directly connected to it. These connected models are referred to as the initiator model's neighbors.


FIGURE C

Modeled relationships between devices are reflected in a relationship between the ports and interfaces that connect them.



THE INITIATOR MODEL AND NEIGHBORS

With a list of neighbors determined, CA SPECTRUM directs each neighbor model to check its current status. This check is referred to as the "Are you OK?" check. "Are you OK?" is a relative term: the set of performance and availability attributes that is checked will vary from model to model, based on the real-world capabilities of the device that the model represents.

When a model is asked "Are you OK?", the model can initiate a variety of tests to verify its current operational status. For example, with most SNMP-managed devices the check is typically a combination of SNMP requests, but it could be more involved, such as interrogating an Element Management System, or as simple as an ICMP ping. A comprehensive check could include threshold performance calculations or execution of response time tests.

Each neighbor model returns an answer to "Are you OK?" and CA SPECTRUM then begins its analysis of the answers.

DETERMINING THE HEALTH OF NEIGHBORS

FIGURE D

The initiator model triggers the intelligent validation procedure. It requests a list of models that are directly connected, and these are called the initiator model's neighbors.

FIGURE E

Once the list of neighbors is established, the status of each neighbor is checked with the "Are you OK?" message.




FAULT ISOLATION If the initiator model has a neighbor that responds that it is OK (Figure F, Model A), then it can be inferred that the problem lies between the unaffected neighbor and the affected initiator (Figure F, Model B). In this case, the initiator model that triggered the intelligence is a likely culprit for this particular infrastructure failure. The result? A critical alarm will be asserted on the initiator model, and it is considered the root cause alarm.

FAULT ISOLATION IN PROGRESS

ALARM SUPPRESSION As the analysis continues beyond isolating the device at fault (Figure G, Model B), the next step is to analyze the effects of the fault, the goal of which is intelligent alarm suppression. If a neighbor of the initiator model (Figure G, Models C, D or E) responds "No, I am not OK", then this particular neighbor is considered to be affected by a failure that is occurring somewhere else. As a result, CA SPECTRUM will place these models into a suppressed condition (grey color) because any alarms from these devices are symptomatic of a problem elsewhere.
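The fault isolation and alarm suppression steps above can be sketched as a single pass over the initiator's neighbors. This is an illustrative sketch only, assuming a simple adjacency map and a caller-supplied `is_ok` check; the function name and return shape are hypothetical, and the case where no neighbor answers OK is left undecided because the text does not describe it.

```python
from typing import Callable, Dict, List


def isolate_fault(
    initiator: str,
    neighbors: Dict[str, List[str]],
    is_ok: Callable[[str], bool],
) -> Dict[str, object]:
    """Ask each neighbor of the initiator "Are you OK?" and classify the answers.

    If at least one neighbor is healthy, the fault is inferred to lie at the
    initiator, which receives the root cause alarm. Neighbors that answer
    "No, I am not OK" are suppressed as symptoms of that fault.
    """
    answers = {n: is_ok(n) for n in neighbors.get(initiator, [])}
    healthy = {n for n, ok in answers.items() if ok}
    suppressed = {n for n, ok in answers.items() if not ok}
    return {
        # Assumption: with no healthy neighbor, no root cause is asserted here.
        "root_cause": initiator if healthy else None,
        "healthy": healthy,
        "suppressed": suppressed,
    }


# Topology mirroring Figures F and G: A is unaffected, B is the initiator,
# and C, D and E are affected by B's failure.
topology = {"B": ["A", "C", "D", "E"]}
status = {"A": True, "C": False, "D": False, "E": False}
result = isolate_fault("B", topology, lambda model: status[model])
print(result["root_cause"], sorted(result["suppressed"]))
```

In this run the critical alarm lands on B, while C, D and E are placed in the suppressed condition.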

FAULT ISOLATION COMPLETE


FIGURE F

The root cause alarm is established through a sequence of sharing status between models.

FIGURE G

Models that respond with a "No, I am not OK" status are put in a suppressed condition to suppress the alarms that are symptomatic of a problem elsewhere in the infrastructure.



IMPACT ANALYSIS CA SPECTRUM continues to analyze the total impact of the fault. It will analyze each Fault Domain, a Fault Domain being the collection of models with suppressed alarms that are affected by the same failure. These impacted models are linked to the root fault for presentation and analysis. The intelligence measures the impact this fault is creating by examining the models that are included within this Fault Domain and calculating an Impact Severity value. Ranking faults by this value allows operators to quickly assess the relative impact of each fault and prioritize corrective actions.
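The text does not say how the Impact Severity value is calculated, so the following sketch simply assumes a weighted count of the models in each Fault Domain. The function names, the per-model weights and the default weight of 1 are all assumptions made for illustration, not the product's formula.

```python
from typing import Dict, List, Tuple


def impact_severity(fault_domain: List[str], weights: Dict[str, int]) -> int:
    """Sum an assumed importance weight for every suppressed model in the domain."""
    return sum(weights.get(model, 1) for model in fault_domain)


def rank_faults(
    domains: Dict[str, List[str]], weights: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Return (root fault, severity) pairs, highest severity first."""
    scored = [(root, impact_severity(models, weights)) for root, models in domains.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two root faults, each linked to the models suppressed beneath it.
domains = {"X": ["Y"], "B": ["C", "D", "E"]}
weights = {"C": 5}  # hypothetical: model C carries a critical service
ranking = rank_faults(domains, weights)
print(ranking)
```

Whatever the real calculation is, the useful property is the same: each root fault gets one comparable number, so operators can sort the alarm list by impact.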

Creatively using the Event Management System

There are many applications of Event Rules that allow higher-order correlation of event streams. Event Rule processing is required for situations in which the event stream is the only source of management information. For example, this situation can occur when CA SPECTRUM accepts event streams from devices and applications that it does not directly monitor, so that CA SPECTRUM can only listen; it cannot talk. CA SPECTRUM provides many out-of-the-box event rules, but also provides easy-to-use methods for creating new rules using one or more of the event rule types. This section highlights a couple of out-of-the-box event rules and also a few customer examples of event rule applications.

AN OUT-OF-THE-BOX EVENT PAIR RULE CA SPECTRUM has the ability to interpret Cisco syslog messages as event streams. Each syslog message is generated on behalf of a managed switch or router and is directed to the model representing that managed element. One of the many Cisco syslog messages indicates a new configuration has been loaded into the router. The Reload message should always be followed by a Restart message, indicating the device has been restarted to adopt the newly loaded configuration. If not, a failure during reload is probable. An event rule based on the Event Pair rule type is used to raise an alarm with cause ERROR DURING ROUTER RELOAD if the restart message is not received within 15 minutes of the reload message. Figure H diagrams the events and timing.
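The logic of an Event Pair rule can be sketched in a few lines. This is not the product's rule syntax; it is a plain Python illustration in which the event names, the function name and the tuple layout are assumptions. It reports every first event whose partner did not arrive within the window.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event name)


def unanswered_events(
    events: List[Event], first: str, second: str, window_s: float, now: float
) -> List[float]:
    """Return timestamps of `first` events with no `second` event within window_s.

    A `first` event only counts as unanswered once its window has fully
    elapsed at time `now`, so an alarm is never raised early.
    """
    second_times = [ts for ts, name in events if name == second]
    alarms = []
    for ts, name in events:
        if name != first:
            continue
        answered = any(ts < s <= ts + window_s for s in second_times)
        if not answered and now >= ts + window_s:
            alarms.append(ts)
    return alarms


# The reload at t=0 is followed by a restart after 5 minutes; the reload at
# t=1000 never gets one, so it should raise ERROR DURING ROUTER RELOAD.
log = [(0, "reload"), (300, "restart"), (1000, "reload")]
failed_reloads = unanswered_events(log, "reload", "restart", window_s=15 * 60, now=2000)
print(failed_reloads)
```

Checking `now` against the end of the window reflects the timing in Figure H: the alarm is a statement about what failed to happen, so it can only be raised after the 15 minutes have passed.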

EVENT PAIR RULE


FIGURE H

This figure depicts an example of an event pair rule in operation. Reload messages indicating new router configurations should always be followed by restart messages to indicate the router has adopted the new configuration. An alarm is raised to indicate a probable failure if the restart message is not received within the expected time period.



MANAGING SECURITY EVENTS USING AN EVENT RATE COUNTER RULE CA SPECTRUM is often used to collect event feeds from many sources. Some customers send events from security devices such as intrusion detection systems and firewalls. These types of devices can generate millions of log file entries. One customer utilizes an Event Rate Counter rule to distinguish between sporadic client connection rejections and real security attacks. The rule was constructed to generate a critical alarm if 20 or more connection failures occurred in less than one minute. Figure I depicts this alarm scenario.
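A rate rule of this kind is commonly implemented as a sliding window over event timestamps. The sketch below takes that approach; the class name and method are hypothetical, it assumes events arrive in time order, and it is not a description of how CA SPECTRUM implements the rule internally.

```python
from collections import deque


class EventRateCounter:
    """Fire when `threshold` or more events fall within less than `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()

    def observe(self, ts: float) -> bool:
        """Record one event (timestamps must not decrease); return True to alarm."""
        self._times.append(ts)
        # Drop events that are a full window or more older than the newest one.
        while ts - self._times[0] >= self.window_s:
            self._times.popleft()
        return len(self._times) >= self.threshold


# A burst of one connection failure per second trips the rule on the 20th event.
burst = EventRateCounter(threshold=20, window_s=60)
burst_alarms = [burst.observe(t) for t in range(20)]

# Sporadic failures, one every 10 seconds, never accumulate 20 in a window.
sporadic = EventRateCounter(threshold=20, window_s=60)
sporadic_alarms = [sporadic.observe(t * 10) for t in range(30)]
print(burst_alarms[-1], any(sporadic_alarms))
```

The window keeps memory bounded even when a device emits millions of entries, since only the events of the last minute are ever retained.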

EVENT RATE COUNTER RULE

MANAGING SERVER MEMORY GROWTH USING AN EVENT SEQUENCE RULE A common problem with some applications is the inability to manage memory usage. There are applications that will take system memory and never give it back for other applications to reuse. When the application does not return the memory, and also no longer requires the memory, it is called a memory leak. The result is that performance on the host machine will degrade and eventually cause the application to fail.

At one customer environment this problem regularly occurs on a Web server application. The customer has a standard operating procedure to reboot the server once a week to compensate for the memory consumption. However, if the memory leak occurs too quickly, there is a deviation from normal behavior and the server needs to be rebooted before the scheduled maintenance window. The customer employs a combination of progressive thresholds with an Event Sequence rule to monitor for abnormal behavior. Monitoring was set to create events as the memory usage passed threshold points of 50%, 75% and 90%. If those threshold points are reached in a period of less than one week, an alarm is generated to provide notification to reboot the server prior to the scheduled maintenance window. Figure J depicts the fault scenario.
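The Event Sequence check can be illustrated as a search for an ordered run of events that fits inside a time window. The sketch below is an assumption-laden illustration: the event names `mem50`, `mem75` and `mem90`, the function name and the tuple layout are invented here, and `sequence` is assumed to be non-empty.

```python
from typing import List, Sequence, Tuple

DAY = 24 * 60 * 60  # seconds in a day


def sequence_within(
    events: List[Tuple[float, str]], sequence: Sequence[str], window_s: float
) -> bool:
    """Return True if `sequence` occurs in order in less than window_s seconds."""
    ordered = sorted(events)
    for i, (start, first) in enumerate(ordered):
        if first != sequence[0]:
            continue
        step = 0
        for ts, name in ordered[i:]:
            if ts - start >= window_s:
                break  # everything later is outside this start's window
            if name == sequence[step]:
                step += 1
                if step == len(sequence):
                    return True
    return False


thresholds = ["mem50", "mem75", "mem90"]
# Fast leak: all three thresholds crossed within four days, so alarm.
fast_leak = [(0, "mem50"), (2 * DAY, "mem75"), (4 * DAY, "mem90")]
# Slow growth: 90% is reached only after eight days, so the weekly reboot suffices.
slow_growth = [(0, "mem50"), (3 * DAY, "mem75"), (8 * DAY, "mem90")]
print(sequence_within(fast_leak, thresholds, 7 * DAY),
      sequence_within(slow_growth, thresholds, 7 * DAY))
```

Measuring the window from the first threshold crossing is what separates the two cases: the same three events are harmless when spread over more than a week and alarming when compressed into a few days.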


FIGURE I

This figure depicts an example of the Event Rate Counter rule. Events are often sent by other devices, such as intrusion detection systems, that can generate millions of events. To separate event noise from real events, this rule limits alarms to situations where more than 20 connection failures are logged within a minute.



EVENT SEQUENCE RULE

CONDITION CORRELATION TECHNOLOGY There are many uses for policy-based Condition Correlation Technology (CCT). For example, consider the complexities of managing an IP network that provides VPN connectivity across an MPLS backbone with intra-area routing maintained by IS-IS and inter-area routing maintained by BGP. Any physical link or protocol failure could cause dozens of events from multiple devices. Without sophisticated correlation capability applied carefully, the network troubleshooters will spend most of their time chasing after symptoms, rather than fixing the root cause.

AN IS-IS ROUTING FAILURE EXAMPLE A specific example experienced by one of our customers can be used to describe the power of Condition Correlation. The failure scenario and link outages are illustrated in Figure K. The situation occurs where a core router, labeled in the figure as R1, loses IS-IS adjacencies to all neighbors (labeled in the figure as R2, R3, R4). This also results in the BGP session with the route reflector (labeled in the figure as RR) being lost. This condition, if it persists, will result in routes aging out of R1 and adjacent edge routers R3 and R4. Eventually, the customer VPN sites serviced by these customer premise edge (CPE) routers will be unable to reach their peer sites (labeled in the figure as CPE1, CPE2, CPE3).


FIGURE J

Reaching threshold points is sometimes acceptable if the thresholds are reached over a period of time that will ensure they are reset by regular maintenance schedules before reaching critical levels. If they are reached sooner, however, they may cause outages. The Event Sequence rule will measure threshold attainment and generate an alarm only if required.



IMPACT OF AN IS-IS ROUTING FAILURE

This failure causes a series of syslog error messages and traps to be generated by the routers. The messages and traps that would be received are outlined in Figure L.

SYSLOG ERROR MESSAGE AND TRAP SEQUENCE


FIGURE K

This diagram shows how a failure in a core router can ripple through the network, causing numerous events and alarms, if not intelligently managed.

FIGURE L

Error messages and traps cascade from a single core router failure.

SOURCE       TYPE            MESSAGE
R1           Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R2 (POS5/0/0) Down, hold time expired
R1           Syslog message  ...
R1           Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to Rn (POS5/0/0) Down, hold time expired
RR           Syslog message  %BGP-5-ADJCHANGE: neighbor R1 Down BGP Notification sent
RR           Syslog message  %BGP-3-NOTIFICATION: sent to neighbor R1 4/0 (hold time expired) 0 bytes
RR           Trap            BGP Backwards Transition trap, neighbor = R1
R2, R3, Rn   Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (POS5/0/0) Down, hold time expired




The root cause of all these messages is an IS-IS outage problem related to R1. For many management systems the operator would see each of these traps as seemingly disparate events on the alarm console. A trained operator or experienced troubleshooter may be able to deduce, after some careful thought, that an R1 routing problem is indicated. However, in a large environment these alarms will likely be interspersed with other alarms cluttering the console. Even if the operator were capable of making the correlation manually, there would be significant effort and time spent doing so. That time translates directly into costs, lower user satisfaction and lost revenue.

Using a combination of an Event Rule and Condition Correlation, a set of rule patterns can be applied to a Correlation Domain consisting of all core label switch routers, enabling CA SPECTRUM to produce a single actionable alarm. This alarm will indicate that R1 has an IS-IS routing problem, and a network outage may result if this is not corrected. The seemingly disparate conditions that were correlated by the software, resulting in this alarm, will be displayed in the symptoms panel of the alarm console as follows:

1. A local Event Rate Counter rule was used to define multiple IS-IS adjacency change syslog messages reported by the same source as a routing problem for that source.

2. A rule pattern was used to make an IS-IS adjacency lost event caused by an IS-IS routing problem when the neighbor of the adjacency lost event is equal to the source of the routing problem event.

3. A rule pattern was used to make a BGP adjacency down event caused by an IS-IS routing problem when the neighbor of the adjacency down event is equal to the source of the routing problem event.

4. A rule pattern was used to make a BGP backward transition trap event caused by an IS-IS routing problem when the neighbor of the backward transition event is equal to the source of the routing problem event.
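The four rules above share one matching condition: a symptom belongs to a routing problem when the symptom's neighbor equals the problem's source. A minimal sketch of that idea follows. The `Condition` class, the condition kind names and the `min_adj_changes` parameter are hypothetical stand-ins, and this is not the product's policy language.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Condition:
    kind: str                       # hypothetical names, e.g. "isis_adj_lost"
    source: str                     # device that reported the condition
    neighbor: Optional[str] = None  # peer named in the message or trap


SYMPTOM_KINDS = {"isis_adj_lost", "bgp_adj_down", "bgp_backward_transition"}


def correlate(
    conditions: List[Condition], min_adj_changes: int = 2
) -> Dict[str, List[Condition]]:
    """Map each router with an inferred IS-IS routing problem to its symptoms."""
    # Rule 1: multiple adjacency changes from one source imply a routing
    # problem at that source.
    counts = Counter(c.source for c in conditions if c.kind == "isis_adj_lost")
    roots = {src for src, n in counts.items() if n >= min_adj_changes}
    # Rules 2 to 4: a symptom is caused by the routing problem when its
    # neighbor equals the source of the routing problem.
    caused: Dict[str, List[Condition]] = {root: [] for root in roots}
    for c in conditions:
        if c.kind in SYMPTOM_KINDS and c.neighbor in roots:
            caused[c.neighbor].append(c)
    return caused


observed = [
    Condition("isis_adj_lost", "R1", "R2"),
    Condition("isis_adj_lost", "R1", "R3"),
    Condition("isis_adj_lost", "R1", "R4"),
    Condition("bgp_adj_down", "RR", "R1"),
    Condition("bgp_backward_transition", "RR", "R1"),
    Condition("isis_adj_lost", "R2", "R1"),
    Condition("isis_adj_lost", "R3", "R1"),
    Condition("isis_adj_lost", "R4", "R1"),
]
alarms = correlate(observed)
print({root: len(symptoms) for root, symptoms in alarms.items()})
```

With the conditions from the R1 scenario, the result is a single entry for R1 carrying five symptoms, which mirrors the single actionable alarm with its symptoms panel.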

APPLYING CONDITION CORRELATION TO SERVICE CORRELATION It is common to have several services running over the same network. As an example, in the cable industry, telephone service (VoIP), Internet access (high speed data), video on demand (VoD) and digital cable are delivered over the same physical data network. Management of this network is quite a challenge. Inside the network (cable plant), the video transport equipment, video subscription services and the Cable Modem Termination System (CMTS) all work together to put data on the cable network at the correct frequencies. Uncounted miles of cable along with thousands of amplifiers and power supplies must carry the signals to the homes of literally millions of subscribers.

With the flood of events and error messages that will be provided by the managed elements, the fact that there is a problem with the service will be obvious. The challenge is to translate all this data into actionable root cause and service impact information.




Service impact relevance goes beyond understanding what is impacted; it is also important to understand what is not impacted. It's possible for the video subscription service to fail to deliver VoD content to a single service area, and yet all other services to that area could be fine. Or, a return path problem in one area could cause Internet, VoIP and VoD services to fail and digital cable to degrade, yet analog cable could still function normally. In the case of a media cut in one area, the return path monitoring system and the head end controller would report return path and power problems in that area. The CMTS would provide the number of cable modems off-line for the node. The video transport system would generate errors for video subscriptions in that area. Lastly, any customer modems that are being managed will become lost to the management system.

CA SPECTRUM can make sense of the resulting deluge of events by using the service area of the seemingly disparate events as a factor in the Condition Correlation. If the service areas and services are appropriately modeled, Condition Correlation can be used to determine which services in which areas are affected and the root cause or causes.
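Using the service area as a correlation key can be sketched as a simple grouping step. The example below is a sketch under stated assumptions: events are reduced to (service area, service) pairs, the area names are invented, and the service list is taken from the cable example in the text. It reports both what is affected and what is not, per area.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Services named in the cable example; the set itself is an illustrative model.
SERVICES = {"VoIP", "Internet", "VoD", "digital cable"}


def impact_by_area(
    events: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, Set[str]]]:
    """Group (service area, service) events into affected and unaffected services."""
    affected: Dict[str, Set[str]] = defaultdict(set)
    for area, service in events:
        affected[area].add(service)
    return {
        area: {"affected": hit, "unaffected": SERVICES - hit}
        for area, hit in affected.items()
    }


# area-7 only has VoD errors; area-9 has a return path problem hitting three services.
events = [
    ("area-7", "VoD"),
    ("area-7", "VoD"),
    ("area-9", "Internet"),
    ("area-9", "VoIP"),
    ("area-9", "VoD"),
]
report = impact_by_area(events)
for area, impact in sorted(report.items()):
    print(area, sorted(impact["affected"]), sorted(impact["unaffected"]))
```

Keeping the unaffected set alongside the affected one reflects the point made earlier: knowing what is not impacted is part of service impact relevance.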

SERVICE CORRELATION


FIGURE M

Condition Correlation enables CA SPECTRUM to analyze a deluge of events to determine which service is impacted.



CA SPECTRUM for High Performing Infrastructures

A high performing IT infrastructure is at the core of today's successful businesses. Whether your business is online retail, financial services or manufacturing, your infrastructure is essential. Keeping the infrastructure running, avoiding outages, quickly finding the causes of degradation and outages, and simplifying management and event data for your IT staff are all factors in maintaining high performance.

Patented Software Elevates CA SPECTRUM Capabilities

CA SPECTRUM provides intelligence, multiple methods and patented solutions to apply the best in event correlation and root cause analysis to your infrastructure. Event correlation is at the heart of root cause analysis. With large and complex infrastructures, events flood event logs and your IT staff can be overwhelmed by attempting to correlate events manually. The time wasted in this effort has a direct effect on your business and its bottom line. CA SPECTRUM uses intelligence and event rules to separate true root causes from associated, symptomatic causes, thereby minimizing the amount of information and maximizing the quality of information your IT staff must address.

Benefits for Experienced Users and New Users

CA SPECTRUM provides out-of-the-box utilization, performance and response time thresholds that act as an early warning system when a problem is about to happen or when a service level guarantee is about to be violated. While these thresholds can be tuned for a specific customer environment, there is tremendous value in having these out-of-the-box thresholds. They enable CA SPECTRUM to deliver value on day one.

CA SPECTRUM makes it easy for experienced users to add their own thresholds and watches, such that after a unique problem happens in the environment once, new watches can help predict or prevent it from happening again. Through Event Rules and combinations of Event Rules, even complex behaviors can be captured and managed.

SECTION 3: BENEFITS




Change is a constant, requiring any management system to be automated, adaptable, and extensible. The number of multi-vendor, multi-technology hardware and software elements in a typical IT environment exponentially increases the complexity of managing a real-time, on-demand IT infrastructure. CA SPECTRUM currently supports several thousand distinct means to automate root cause analysis across hundreds of product families and device types from today's leading infrastructure vendors.

Knowing about a problem is no longer enough. Predicting and preventing problems, pinpointing their root cause, and prioritizing issues based on impact are requirements for today's management solutions. The number and variety of possible fault, performance and threshold problems means that no single approach to root cause analysis is suited for all scenarios. For this reason, CA SPECTRUM incorporates model-based IMT, rules-based EMS, and policy-based CCT to provide an integrated, intelligent approach to drive efficiency and effectiveness in managing IT infrastructure as a business service.

To learn more about CA SPECTRUM and its technical approach, visit

http://www.ca.com/us/products/product.aspx?id=7832


SECTION 4: CONCLUSIONS

CA (NYSE: CA), one of the world's leading independent, enterprise management software companies, unifies and simplifies complex information technology (IT) management across the enterprise for greater business results. With our Enterprise IT Management vision, solutions and expertise, we help customers effectively govern, manage and secure IT.

TB05IM SPECRCA1E M P327420
