ca spectruim
TRANSCRIPT
-
7/30/2019 CA Spectruim
1/24
TECHNOLOGY BRIEF: CA SPECTRUM EVENT CORRELATION AND ROOT CAUSE ANALYSIS
Interpreting Events With Intelligence
to Find Root Cause
Copyright 2008 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. This document is for your informational purposes only. To the extent permitted by applicable law, CA provides this document "As Is" without warranty of any kind, including, without limitation, any implied warranties of merchantability or fitness for a particular purpose, or noninfringement. In no event will CA be liable for any loss or damage, direct or indirect, from the use of this document, including, without limitation, lost profits, business interruption, goodwill or lost data, even if CA is expressly advised of such damages.
Table of Contents
Executive Summary
SECTION 1: CHALLENGE
A Complex Problem in Need of a Solution
The Infrastructure is the Business
The Need to Be Proactive
The Importance of Understanding Business Impact
SECTION 2: OPPORTUNITY
Event Correlation and Root Cause Analysis
A Three-Pronged Approach for CA SPECTRUM Network Fault Manager
Inductive Modeling Technology
Event Management System
Condition Correlation
Use Case Scenarios
Inference and Inductive Modeling Technology
Creatively Using the Event Management System
SECTION 3: BENEFITS
CA SPECTRUM for High Performing Infrastructures
Patented Software Elevates CA SPECTRUM Capabilities
Benefits for Experienced Users and New Users
SECTION 4: CONCLUSIONS
Executive Summary
Challenge
Today's complex IT infrastructures are dynamic, multi-vendor engines made of frequently
changing components and technologies. The complexity of the infrastructure and the continual
changes caused by business demands often lead to faults within the infrastructure. A fault in
a single device can have a ripple effect that causes performance and availability problems
for many users. The ripple effect also makes it difficult to pinpoint the root cause of the fault.
To remain relevant and competitive in the marketplace, companies must minimize outages
and performance degradations, and this requires effective performance and availability
management. Unfortunately, many management tools are not adaptive and have not kept
pace with the dynamics of real-time, on-demand IT. Other commonly used tools are niche tools
that manage only a portion of the infrastructure, making them largely ineffective in large,
interconnected environments.
Opportunity
CA SPECTRUM is an infrastructure management solution that provides integrated service,
fault and configuration functionality for modeling, monitoring and reporting across multiple
network device types and technologies. Using a "trust but verify" methodology, CA SPECTRUM
provides an automated and intelligent management approach for your particular
environment, whether you are a service provider or an enterprise.
CA SPECTRUM leverages three types of problem solving to comprehensively manage your
infrastructure:
• Model-based Inductive Modeling Technology (IMT)
• Rules-based Event Management System (EMS)
• Policy-based Condition Correlation Technology (CCT)
Benefits
CA's model-, rules- and policy-based analytics understand relationships between IT
infrastructure assets and the users and services they are designed to support. This insight has
enabled CA SPECTRUM to deliver real benefit to customers. A large service provider
realized a 70% reduction in downtime while resolving 90% of availability or performance
problems from a central location. Patented root cause analysis has been able to reduce the
number of alarms by several orders of magnitude while significantly reducing
Mean-Time-to-Repair (MTTR).
The CA integrated approach to fault and performance management has enabled enterprise,
government and service provider organizations around the world to achieve reliability,
efficiency and effectiveness in managing IT infrastructures as a business service.
A Complex Problem in Need of a Solution
IT infrastructure management is an intensive undertaking with significant resource
requirements. Most employees in an organization expect the infrastructure to work, not thinking of it
as a dynamic, multi-vendor engine made up of frequently changing components and technologies.
In fact, the complexity and dynamics of today's real-time, on-demand IT architectures present
many opportunities for inconsistencies and failures. Invariably, the infrastructure will slow
down or fail, and when it does, tools are required to quickly pinpoint the root cause, suppress
symptomatic faults, prioritize based on business impact and accelerate service restoration.
The Infrastructure is the Business
The IT infrastructure is a collection of interdependent components including computing
systems, networking devices, databases and applications. Within the set of infrastructure
components are multiple versions of many vendors' products connected over a variety of
networking technologies. In addition, each business environment is different from the next;
there is no standard configuration or set of components that makes up an infrastructure.
There also is constant change in devices, firmware versions, operating systems, networking
technologies, development technologies and tools. But this dynamic and complex infrastructure
serves an important purpose; the infrastructure IS the business. Either the infrastructure works
and evolves or the organization is out of business. Companies must also evolve their people,
processes and management tools for greater efficiency or fall competitively behind.
To ensure the performance and availability of the infrastructure, most companies employ a
dual-approach method consisting of:
1. Highly available, fault-tolerant, load-balancing designs for infrastructure devices and
communication paths.
2. A network management solution to ensure reliable operation.
High-availability environments further complicate the job of management solutions. The
management solution must understand the load-balancing capacity, be able to track primary
and fault-tolerant backup paths, and understand when redundant systems are active.
The investment in the management solution is as important as the investment in the
infrastructure itself. The solution must be broad, deep, integrated and intelligent to perform its
intended function. The infrastructure is not static, and the solution will need to embrace
change while delivering an end-to-end integrated view of performance and availability across
infrastructure silos. Unfortunately, many management tools are not adaptive and do not keep
pace with the dynamics of IT reality.
The Need to Be Proactive
To remain relevant and competitive in the marketplace, companies must minimize outages
and performance degradations. In order to do this, the individuals or groups responsible for the
care of the infrastructure (e.g., IT, Engineering, Operations) must be proactively notified of
problems. There are many tools that monitor the availability and performance of infrastructure
components and the business applications that rely on them.
SECTION 1: CHALLENGE
Many of these tools simply identify that a problem exists and notify technicians of a problem
after it has happened. They often give visibility into only a small slice of the technologies under
management and have no ability to understand how the various components relate to each
other. This is not enough. It is important that the management solution act as an early warning
system to help avoid downtime and service level agreement (SLA) violations. Acting after a problem
has occurred is too late: by then users are dissatisfied and SLA penalties have been levied.
Before the true task of troubleshooting can begin, the troubleshooter has to isolate the
problem. Simply knowing there is a problem and collecting all the problems on one screen is
not enough. Troubleshooters need to know where the problem is (and where the problem is
not) to effectively triage the issue. If multiple problems are happening simultaneously, issues
should be automatically prioritized based on the criticality of the impacted service.
It is far too costly to rely on human intervention to determine the root cause of problems, and
it is also far too costly to sift through an unending storm of symptomatic problems. Knowing the
root cause allows an organization to efficiently get problems fixed without wasting time
pursuing symptomatic problems.
The Importance of Understanding Business Impact
The best management solutions will not only be able to identify problems, but also identify
impacted services, assets and users. For the business, understanding impact is as important
as understanding the root cause. When outages or performance degradations occur, people
cannot do their jobs effectively, resulting in lower productivity and reduced efficiency.
Sometimes the products or services provided by the company to its customers are affected,
resulting in lost revenue, SLA penalties, lost customers and even damaged brand reputation
that can take years to repair. Knowing impact allows an organization to prioritize response
efforts to fix what matters most, first.
Event Correlation and Root Cause Analysis
A Three-Pronged Approach for CA SPECTRUM Network Fault Manager
Root Cause Analysis (RCA) can be defined as the act of interpreting a set of symptoms and
events and pinpointing the source of the problem. A single problem often results in multiple
events across the infrastructure. Events are typically local to a source and, without proper
context, do not always help with RCA because they are only symptoms of the problem. Many
components provide events, and events come in many forms: SNMP traps, syslog messages,
application log file entries, TL1 events, ASCII streams, etc.
Many sophisticated management systems, including CA SPECTRUM Network Fault Manager,
even generate events based on proactive polling of component status to indicate
parameter-based threshold violations, response time measurement threshold violations, etc. Often,
correlation of events is required to determine if an actionable condition or problem exists, but
correlation is almost always required to isolate problems, identify impacted assets and services
and suppress symptomatic events.
SECTION 2: OPPORTUNITY
Management software applications efficiently performing RCA should raise an alarm for the
root condition and should keep other events resulting from the same root condition from
generating alarms.
One service provider experienced a situation where they were receiving 500,000 daily
problem notifications from their management tool. Clearly, no person or team of people
could keep up with that many events. CA SPECTRUM root cause analysis technology
helped this service provider reduce the number of daily problem notifications to 200 actual
alarm conditions, while also automatically prioritizing issues based on impact. In this
environment, 500,000 symptoms had only 200 causes. Average time to find and fix a
problem was reduced from over four hours to less than five minutes.
Effective RCA must:
• Understand the relationship between information within the infrastructure and the services,
assets and users that depend on that information
• Be proactive in its monitoring and not just rely on event streams
• Distinguish between a plethora of events and meaningful alarms
• Scale and adapt to the requirements of growing and dynamic infrastructures
• Work across multiple-vendor and multiple-technology environments
• Allow for extensions and customization
CA SPECTRUM employs multiple techniques working cooperatively to deliver its event
correlation and root cause analysis capabilities. These include Inductive Modeling Technology
(IMT), Event Management System (EMS) and Condition Correlation Technology (CCT). Each
of these techniques is employed to diagnose a diverse and often unpredictable set of problems.
Inductive Modeling Technology
The core of the CA SPECTRUM RCA solution is its patented Inductive Modeling Technology
(IMT). IMT uses an object-oriented modeling paradigm with model-based reasoning analytics.
CA SPECTRUM most often uses IMT for physical and logical topology analysis, as the software
can automatically map topological relationships through its efficient AutoDiscovery engine. The
models created are software representations of real-world physical or logical devices. These
software models are in direct communication with their real-world counterparts, enabling
CA SPECTRUM root cause analysis to not only listen, but proactively query for health status
or additional diagnostic information. Models are described by their attributes, behaviors,
relationships to other models and algorithmic intelligence.
Intelligent analysis is enabled through the collaboration of models in a system. Collaboration
includes the ability to exchange information and initiate processing between models within the
modeling system. A model making a request to another model may in turn trigger that model
to make requests on other models, and so on.
Relationships between models provide a context for collaboration. Collaboration between
models enables:
• Correlation of the symptoms
• Suppression of unnecessary/symptomatic alarms
• Impact analysis
With CA SPECTRUM, a model is the software representation of a real-world managed device
or a component of that managed element. This representation allows CA SPECTRUM not
only to investigate and query an individual element within the network, but also to
establish relationships between elements in order to recognize them as part of a
larger system.
By understanding the relationship between elements and the conditions of related managed
elements, root cause analysis is simplified and problems are identified more quickly. Root
cause and impact are determined through IMT's ability to both listen and talk to the
infrastructure.
A simple example of IMT in action can be demonstrated by a network router port transition
from UP to DOWN. If a port model receives a LINK DOWN trap, it has the intelligence to react
by performing a status query to determine if the port is actually down. If it is in fact
DOWN, then the system of models will be consulted to determine if the port has lower
layer sub-interfaces. If any of the lower layer sub-interfaces are also DOWN, only the
condition of the original port DOWN will be raised as an alarm. An application of this
example can be described by several Frame Relay Data Link Connection Identifiers (DLCIs)
transitioning to INACTIVE. If the Frame Relay port is DOWN, IMT will suppress the
symptomatic DLCI INACTIVE conditions and raise an alarm on the Frame Relay port
model. Additionally, when the port transitions to DOWN, IMT will query the status of the
connected Network Elements (NEs) and if those are also DOWN, those conditions will be
considered symptomatic of the port DOWN, will be suppressed, and will be identified as
impacted by the port DOWN alarm.
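The port DOWN walk-through above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CA SPECTRUM code: the PortModel class, its fields and the handle_link_down function are hypothetical names, and the status query is a stub that returns a stored value instead of polling a device.

```python
from dataclasses import dataclass, field


@dataclass
class PortModel:
    """Hypothetical stand-in for a port or element model (not a CA SPECTRUM class)."""
    name: str
    polled_status: str = "UP"            # value a status query would return
    sub_interfaces: list = field(default_factory=list)
    neighbors: list = field(default_factory=list)

    def query_status(self) -> str:
        # A real model would poll the device; this stub returns a stored value.
        return self.polled_status


def handle_link_down(port):
    """React to a LINK DOWN trap: verify it, then separate root cause from symptoms."""
    if port.query_status() != "DOWN":
        return None                      # trap not confirmed, so no alarm is raised
    down_subs = [s.name for s in port.sub_interfaces if s.query_status() == "DOWN"]
    down_nes = [n.name for n in port.neighbors if n.query_status() == "DOWN"]
    return {
        "alarm": port.name,                  # only the original port is alarmed
        "suppressed": down_subs + down_nes,  # symptomatic conditions
        "impacted": down_nes,                # connected elements affected by the fault
    }


dlci_a = PortModel("DLCI-100", "DOWN")
dlci_b = PortModel("DLCI-200", "UP")
remote = PortModel("remote-router", "DOWN")
fr_port = PortModel("FrameRelay-0/1", "DOWN", [dlci_a, dlci_b], [remote])
result = handle_link_down(fr_port)
```

In this sketch a single alarm is raised on the Frame Relay port, while the inactive DLCI and the unreachable remote element are recorded as suppressed symptoms.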
IMT is a very powerful analytical system and can be applied to many problem domains. A more
in-depth discussion of IMT in action will be covered later in the paper.
Event Management System
There are times when the only source of management information is through event streams
local to a specific source. There may be no way to talk to the managed element, but only a way
to listen to it. Any one event may or may not be a significant occurrence but, in the context of
other events or information, may be an actionable condition.
Event Rules in SPECTRUM's Event Management System (EMS) provide a comprehensive
decision-making system to indicate how events should be processed. Event Rules can be
applied to look for a series of events to occur on a model in a certain pattern, within a
specific time frame or within certain data value ranges. Event Rules can be used to generate
other events or even alarms. If events occur such that the preconditions of a rule are met,
another event may be generated, allowing cascading events; the event can be logged for
later reporting/troubleshooting purposes; or it can be promoted into an actionable alarm.
CA SPECTRUM provides a series of customizable meta Event Rule types that form the basis of
the EMS. These rule types are building blocks that can be used individually or cooperatively to
effect an alarm on the most simple or sophisticated event-oriented scenarios. The rules engine
allows for the correlation of event frequency/duration, event sequence and event
persistence/coincidence. More than 80% of rule conditions fall into one of the following three event types:
frequency/duration, sequence or coincidence. Keep in mind that Event Rules can be run
against the aforementioned models to avoid the need to constantly rewrite rules to reflect
infrastructure move/add/change activity. The Event Rule types are highlighted below, followed
by usage examples later in the paper:
• Event Pair (Event Coincidence): This rule type is used when you expect certain events to happen
in pairs. If the second event in a series does not occur, this may indicate a problem. Event
rules created using the Event Pair rule type generate a new event when an event occurs
without its paired event. It is possible for other events to happen between the specified event
pair without affecting this event rule.
• Event Rate Counter (Event Frequency): This rule type is used to generate a new event based
on events arriving at a specified rate in a specified time span. A few events of a certain type may be
tolerated, but once the number of these events reaches a certain threshold within a specified
time period, notification is required. No additional events will be generated as long as the
rate stays at or above the threshold. If the rate drops below the threshold and then
subsequently rises above the threshold, another event will be generated.
• Event Rate Window (Event Frequency): This rule type is used to generate a new event
when a number of the same events are generated in a specified time period. This rule type is
similar to the Event Rate Counter. The Event Rate Counter type is best suited for detecting a
long, sustained burst of events. The Event Rate Window type is best suited for accurately
detecting shorter bursts of events. It monitors an event that is not significant if it happens
occasionally, but is significant if it happens frequently. If the event occurs above a certain
rate, then another event is generated. No additional events will be generated as long as the
rate stays at or above the threshold. If the rate drops below the threshold and then
subsequently rises above the threshold, another event will be generated.
• Event Sequence (Event Sequence): This rule type is used to identify a particular order or
sequence of events that might be significant in your IT infrastructure. This sequence can
include any number and any type of event. When the sequence is detected in the given
period of time, a new event is generated.
• Event Combo (Event Coincidence): This rule type is used to identify a certain combination of
events happening in any order. The combination can include any number and type of event.
When the combination is detected within a given time period, a new event is generated.
• Event Condition (Event Coincidence): This rule type is used to generate an event based on a
conditional expression. In keeping with the CA SPECTRUM "trust but verify" methodology, a
series of conditional expressions can be listed within the event rule, and the first expression
that is found to be true will generate the event. Rules can be constructed to provide correlation
through a combination of evaluating event data with IMT model data. For example, if a trap
is received notifying the management system of memory buffer overload, to validate that an
alarm condition has occurred, an Event Condition rule can initiate a request to the device to
check actual memory utilization.
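The behavior shared by the two frequency rule types above (generate one event when the rate reaches a threshold, stay silent while it remains there, and generate another only after the rate has dropped and risen again) can be illustrated with a sliding window. The sketch below is an assumption-laden illustration, not the EMS implementation: the EventRateWindow class, its parameters and the choice of a timestamp queue are all hypothetical, and a drop below the threshold is only noticed when the next event arrives.

```python
from collections import deque


class EventRateWindow:
    """Hypothetical sliding-window frequency rule; names and parameters are illustrative."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window = window_seconds
        self.times = deque()     # timestamps of events still inside the window
        self.above = False       # True while the rate is at or above the threshold

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True only when the rate newly reaches the threshold."""
        self.times.append(timestamp)
        # Drop events that have aged out of the window.
        while self.times and timestamp - self.times[0] >= self.window:
            self.times.popleft()
        now_above = len(self.times) >= self.threshold
        fired = now_above and not self.above
        self.above = now_above
        return fired


rule = EventRateWindow(threshold=3, window_seconds=10)
fired = [rule.observe(t) for t in (0, 1, 2, 3, 30, 31, 32)]
```

With these inputs the rule fires on the third event of the first burst, stays quiet for the fourth, and fires again on the third event of the second burst, after the rate has fallen below the threshold in between.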
CA SPECTRUM implements a number of Event Rules out-of-the-box by applying one or more
rule types to event streams. Users can create or customize Event Rules using any of the rule
types. Further implementation of Event Rules using the Event Management System will be
discussed later in this paper.
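As an illustration of the Event Pair rule type described earlier, the following sketch scans a time-ordered event list and reports every first event whose partner never arrives on the same model. It is a simplified offline check under stated assumptions: the unpaired_events function, the tuple layout and the timeout parameter are hypothetical (the text above does not say how long the rule waits for the second event), and the event names are made up for the example.

```python
def unpaired_events(events, first, second, timeout):
    """Find `first` events with no matching `second` event.

    events: list of (timestamp, model, event_type) tuples sorted by timestamp.
    Returns (timestamp, model) for every `first` event that is not followed by
    `second` on the same model within `timeout` seconds. Unrelated events in
    between are ignored, as the rule description requires.
    """
    missing = []
    for i, (t, model, etype) in enumerate(events):
        if etype != first:
            continue
        paired = any(
            m == model and e == second and 0 <= t2 - t <= timeout
            for t2, m, e in events[i + 1:]
        )
        if not paired:
            missing.append((t, model))   # here the rule would generate a new event
    return missing


events = [
    (0, "rtr1", "LINK_DOWN"),
    (5, "rtr1", "CONFIG_CHANGE"),   # unrelated event between the pair
    (20, "rtr1", "LINK_UP"),
    (40, "rtr2", "LINK_DOWN"),
    (200, "rtr2", "LINK_UP"),       # arrives too late to count as the pair
]
missing = unpaired_events(events, "LINK_DOWN", "LINK_UP", timeout=60)
```

Only the rtr2 event is reported, because its paired event does not arrive within the assumed 60-second limit.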
Condition Correlation
In order to perform more complex user-defined or user-controlled correlations, a broader set of
capabilities is required. CA SPECTRUM policy-based Condition Correlation Technology enables:
• Creation of correlation policies
• Creation of correlation domains
• Correlation of seemingly disparate event streams or conditions
• Correlation across sets of managed elements
• Correlation within managed domains
• Correlation across sets of managed domains
• Correlation of component conditions as they map to higher order concepts such as business
services or customer access
In order to understand these capabilities, the terminology is described in this context:
• Conditions: A condition is similar to a state. A condition can be set by an event and cleared by
an event. It is also possible to have an event set a condition but require a user-based action
to clear the condition. The condition exists from the time it is set until the time it is cleared.
A very simple example of a condition is the "port down" condition. The port down condition will
exist for a particular interface from the time the LINK DOWN trap or set event (such as a
failed status poll) is received until the time the LINK UP trap or clear event (such as a
successful status poll) is received. A number of conditions that may be of use for establishing
domain-level correlations are defined out-of-the-box, and more can be added by the user.
• Seemingly Disparate Conditions: Many devices in an IT infrastructure provide a specific
function. The device-level function is often without context as it relates to the functions of
other devices. Most managed devices can emit event streams, but those event streams are
local to each component. A simple example is when a response time test identifies a result
exceeding a threshold. At the same time, an event may identify a condition of a router port
exceeding a transmit bandwidth threshold. These conditions are seemingly disparate, as they
are created independently and without context or knowledge of each other. In reality, the two
are quite related.
• Rule Patterns: Rule Patterns are used to associate conditions when specific criteria are met.
A simple example is a "port down" condition caused by a "board pulled" condition, but
only if the port's slot number is equal to the slot number of the board that's been pulled.
Figure A illustrates this rule pattern. The result of applying a rule pattern can be the creation
of an actionable alarm or the suppression of symptomatic alarms.
RULE PATTERN
• Correlation Policy: Multiple rule patterns can be bundled or grouped into a Correlation
Policy. Correlation Policies can then be applied to a Correlation Domain. For example, a set
of rule patterns applicable to OSPF correlation can be created and labeled the OSPF
Correlation Policy. This policy can be applied to each Correlation Domain as defined by each
autonomous OSPF region and the supporting routers in that region.
• Correlation Domain: A Correlation Domain is used to both define and limit the scope of one
or more Correlation Policies. A Correlation Domain can be applied to a specific service. For
example, in the cable broadband environment, a return path monitoring system may detect a
return path failure in a certain geographic area. This return path failure condition is causing
subscribers' high-speed cable modems to become unreachable and causing Video on Demand
(VoD) pay-per-view streams to fail. The knowledge that the return path failure, the modem
problems and the failed video streams are all in the same correlation domain is essential to
correlating the events and ultimately identifying the root cause. However, it is also important
to have the ability to distinguish that a return path failure condition occurring in one
correlation domain (Philadelphia, PA) is not correlated with VoD stream failure conditions
occurring in a different correlation domain (Portsmouth, NH).
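The "port down caused by board pulled" rule pattern from the terminology above can be expressed as a small predicate over active conditions. The sketch below is illustrative only: the Condition record, its field names and both functions are hypothetical, and the extra requirement that both conditions come from the same device is an assumption added here (the text only states the slot number test).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical condition record; field names are illustrative."""
    name: str      # e.g. "port down" or "board pulled"
    device: str
    slot: int


def caused_by_board_pull(port_down: Condition, board_pulled: Condition) -> bool:
    """Rule pattern: port down is caused by board pulled only if the slots match."""
    return (
        port_down.name == "port down"
        and board_pulled.name == "board pulled"
        and port_down.device == board_pulled.device   # assumption: same device
        and port_down.slot == board_pulled.slot
    )


def apply_rule_pattern(active):
    """Split active conditions into actionable alarms and suppressed symptoms."""
    boards = [c for c in active if c.name == "board pulled"]
    alarms, suppressed = [], []
    for c in active:
        if any(caused_by_board_pull(c, b) for b in boards):
            suppressed.append(c)
        else:
            alarms.append(c)
    return alarms, suppressed


active = [
    Condition("board pulled", "sw1", 3),
    Condition("port down", "sw1", 3),   # same slot as the pulled board
    Condition("port down", "sw1", 5),   # different slot, so not explained by it
]
alarms, suppressed = apply_rule_pattern(active)
```

The port in slot 3 is treated as a symptom of the pulled board, while the port in slot 5 remains an actionable alarm of its own.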
FIGURE A
Rule patterns determine the sequence of investigation that will result in either the creation of an alarm or the suppression of symptomatic alarms.
Condition-based correlations are very powerful and provide a mechanism to develop
Correlation Policies and apply them to Correlation Domains. When applied to Business
Service Management, Correlation Policies can be likened to metrics of an SLA and
Correlation Domains can be likened to service, user or geographical groupings.
There are times when the only way to infer a causal relationship between two or more
seemingly disparate conditions is when those conditions occur in a common Correlation
Domain. These mechanisms are necessary when causal relationships cannot be discovered
through interrogations or receipt of events to/from the infrastructure components.
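Scoping correlation to a domain, as in the cable broadband example above, amounts to grouping conditions by domain before looking for a cause. The following sketch assumes a hypothetical correlate_by_domain function, a simple (domain, condition name) tuple layout, and a single designated root condition; a real Correlation Policy would hold many rule patterns.

```python
from collections import defaultdict


def correlate_by_domain(conditions, root_name="return path failure"):
    """Group conditions by correlation domain and correlate only within a domain.

    conditions: list of (domain, condition_name) tuples.
    Inside a domain that contains `root_name`, the other conditions are treated
    as its symptoms; conditions in domains without it stay uncorrelated.
    """
    by_domain = defaultdict(list)
    for domain, name in conditions:
        by_domain[domain].append(name)

    result = {}
    for domain, names in by_domain.items():
        if root_name in names:
            symptoms = [n for n in names if n != root_name]
            result[domain] = {"root": root_name, "symptoms": symptoms, "uncorrelated": []}
        else:
            result[domain] = {"root": None, "symptoms": [], "uncorrelated": names}
    return result


conditions = [
    ("Philadelphia, PA", "return path failure"),
    ("Philadelphia, PA", "modem unreachable"),
    ("Portsmouth, NH", "VoD stream failure"),
]
report = correlate_by_domain(conditions)
```

The Philadelphia modem problem is attributed to the return path failure in its own domain, while the Portsmouth VoD failure is left uncorrelated because no cause exists in that domain.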
Use Case Scenarios
Out-of-the-box, CA SPECTRUM addresses a wide range of different scenarios where it can
perform root cause analysis. This section provides specific scenarios where the techniques
described in the previous section are employed to determine root cause and impact.
The detail will be limited to the basic processing for the sake of simplicity and brevity. Also, for
the purpose of the discussion and figures, the following table is provided showing the color of
alarms that are associated with the icon status of models at any given time.
MODEL STATUS COLORS
Inference and Inductive Modeling Technology
Communication outages are often described as "black-outs" or "hard faults." With these types
of faults, one or more communication paths are degraded to the point that communication is
no longer possible. The cause could be broken copper/fiber cables or connections, improperly
configured routers or switches, hardware failures, severe performance problems or security
attacks. Often the difficulty with these hard communication failures is that there is limited
information available to the management system, as it is unable to exchange information with
one or more managed elements.
FIGURE B
Alarms are color coded, reflecting the model status.
The CA SPECTRUM system of sophisticated models, relationships and behaviors available
through IMT allows it to infer the fault and impact. IMT inference algorithms are also called
inference handlers, and a set of inference handlers designed for a purpose is labeled as an
intelligence circuit or simply "intelligence." This section will outline how intelligence is applied
to isolate communication outages.
BUILDING THE MODEL The accurate representation of the infrastructure through the modeling
system is the key to being able to determine the fault and the impact of the fault. CA SPECTRUM
has specific solutions for discovering multi-path networks over a variety of technologies
supporting different architectures. It offers support for meshed and redundant, physical and
logical topologies based on ATM, Ethernet, Frame Relay, HSRP, ISDN, ISL, MPLS, Multicast,
PPP, VoIP, VPN, VLAN and 802.11 wireless environments, and even legacy technologies such as
FDDI and Token Ring. Its modeling is extremely extensible and can be used to model OSI
Layers 1-7 in a communication infrastructure.
CA SPECTRUM provides four different methods for building the physical and logical topology
model and inter-dependent connectivity for any given infrastructure:
• The AutoDiscovery functionality can be used to automatically interrogate the managed
infrastructure about its physical and logical relationships. AutoDiscovery works dynamically in two
distinct phases (although there are many different stages within each phase that are not
covered here).
In the first phase, AutoDiscovery, when initiated, automatically discovers the devices that exist in the
infrastructure. This provides an inventory of devices that could be managed.
The second phase is Modeling. AutoDiscovery uses management and discovery protocols to
query the devices it has found to gain information that will be used to determine the Layer 2
and Layer 3 connectivity between managed devices. For example, AutoDiscovery uses SNMP
to examine route tables, bridge/switch tables and interface tables, but also uses traffic
analysis and vendor proprietary discovery protocols such as Cisco Discovery Protocol (CDP).
AutoDiscovery is a very thorough, accurate and automated mechanism to build the
infrastructure model.
• Alternately, the SPECTRUM Modeling Gateway can be used to import a description of the
entire infrastructure's components, as well as physical and logical connectivity information,
from external sources, such as provisioning systems, network topology databases or
configuration management databases (CMDBs).
• The Command Line Interface or programmatic APIs can also be used to build a custom
integration or application to import information from external sources.
• SPECTRUM's OneClick Network Console can be used to quickly point and click to manually
build the model.
CA SPECTRUM allows a single managed element to be logically broken up into any number of
sub-models. This collection of models and the relationships between them is often referred to
as the semantic data model for that device. Thus, a typical semantic data model for a device
may include a chassis model with board models related to the chassis. Associated with the board
models would be physical interface models. Each physical interface model may have a set of
sub-interface models associated below it.
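The chassis, board, interface and sub-interface hierarchy described above is naturally a tree. The sketch below shows one way to represent it; the Model class, its kind labels and the walk helper are hypothetical and do not reflect the actual CA SPECTRUM schema.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Hypothetical sub-model node in a device's semantic data model."""
    kind: str      # "chassis", "board", "interface" or "sub-interface"
    name: str
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a sub-model beneath this one and return it for chaining."""
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, model) for this model and everything beneath it."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


chassis = Model("chassis", "router-1")
board = chassis.add(Model("board", "slot-2"))
port = board.add(Model("interface", "2/1"))
port.add(Model("sub-interface", "2/1.100"))
outline = [(depth, m.kind) for depth, m in chassis.walk()]
```

Walking the tree from the chassis visits each layer in order, which is the kind of traversal a fault on a board or port would need in order to find the sub-models beneath it.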
CA SPECTRUM has a set of well-defined associations that define how different semantic data
model sets interact with one another. When the software determines the connectivity between two
devices, a relationship is established between the two ports that form the link between them,
as well as the relationships that form between device models and the corresponding
interface and port models of other devices. This is depicted in Figure C.
MODELING DEVICE AND INTERFACE LEVEL CONNECTIVITY
WHEN DOES THE ANALYSIS BEGIN? CA SPECTRUM can begin to solve a problem proactively
upon receipt of a single event. Many problems share the same set of symptoms, and only
through further analysis can the root cause be determined. For communication outages, the
analysis is triggered when a model recognizes a communication failure. Failed polling, traps,
events, performance threshold violations or lack of response can all lead to this recognition.
CA SPECTRUM validates the communication failures through retries, alternative protocols
and alternative path checking as part of its "trust but verify" methodology.
CA SPECTRUM wil l refer to the model that triggered the intelligence as the init iator m odel,
although m ore than one model can trigger the intelligent validation procedures. The initiator
model intelligence requests a list of other m odels that are directly connected t o it. These
connected m odels are referred to as the init iator m odels neighbors.
FIGURE C
Modeled relationships between devices are reflected in a relationship between the ports and
interfaces that connect them.
THE INITIATOR MODEL AND NEIGHBORS
With a list of neighbors determined, CA SPECTRUM directs each neighbor model to check its
current status. This check is referred to as the "Are you OK?" check. "Are you OK?" is a relative
term, and a unique set of attributes related to performance and availability will vary from model
to model based on the real-world capabilities of the device that the model is representing.
When a model is asked "Are you OK?", the model can initiate a variety of tests to verify its
current operational status. For example, with most SNMP managed devices the check is
typically a combination of SNMP requests, but it could be more involved, such as interrogating an
Element Management System, or as simple as an ICMP ping. A comprehensive check could
include threshold performance calculations or execution of response time tests.
Each neighbor model returns an answer to "Are you OK?" and CA SPECTRUM then begins its
analysis of the answers.
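As an illustration of how the "Are you OK?" check can differ from model to model, the sketch
below gives each model its own list of tests and treats the model as OK only if every test
passes. The function names, the model names and the stubbed probe results are assumptions made
for this example; real checks would issue ICMP or SNMP requests rather than return fixed values.

```python
from typing import Callable, Dict, List

# A check returns True when the tested aspect of the device looks healthy.
Check = Callable[[], bool]


def are_you_ok(checks: List[Check]) -> bool:
    """Answer "Are you OK?" for one model: every configured check must pass."""
    return all(check() for check in checks)


# Stand-in probes with fixed results (hypothetical; no network traffic is sent).
def ping_ok() -> bool:
    return True


def snmp_ok() -> bool:
    return True


def response_time_ok(measured_ms: float = 180.0, limit_ms: float = 100.0) -> bool:
    return measured_ms <= limit_ms


# Each model carries the set of checks that suits the device it represents.
checks_by_model: Dict[str, List[Check]] = {
    "simple-host": [ping_ok],
    "snmp-router": [ping_ok, snmp_ok, response_time_ok],
}

answers = {name: are_you_ok(checks) for name, checks in checks_by_model.items()}
print(answers)
```

In this example the simple host answers that it is OK on the strength of a ping alone, while the
router answers that it is not OK because its response time test exceeds the assumed limit.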
DETERMINING THE HEALTH OF NEIGHBORS
FIGURE D
The initiator model triggers the intelligent validation procedure. It requests a list of models
that are directly connected and these are called the initiator model's neighbors.
FIGURE E
Once the list of neighbors is established, the status of each neighbor is checked with the
"Are you OK?" message.
FAULT ISOLATION If the initiator model has a neighbor that responds that it is OK (Figure F,
Model A), then it can be inferred the problem lies between the unaffected neighbor and the
affected initiator (Figure F, Model B). In this case, the initiator model that triggered the
intelligence is a likely culprit for this particular infrastructure failure. The result? A critical
alarm will be asserted on the initiator model and it is considered the root cause alarm.
FAULT ISOLATION IN PROGRESS
ALARM SUPPRESSION As the analysis continues beyond isolating the device at fault (Figure G,
Model B), the next step is to analyze the effects of the fault, the goal of which is intelligent
alarm suppression. If a neighbor (Figure G, Models C, D or E) of the initiator model responds
"No, I am not OK," then this particular neighbor is considered to be affected by a failure that is
occurring somewhere else. As a result, CA SPECTRUM will place these models into a
suppressed condition (grey color) because any alarms from these devices are symptomatic of a
problem elsewhere.
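The isolation and suppression steps can be sketched in a few lines of Python. This is a
simplified illustration under stated assumptions, not the product's algorithm: isolate_fault and
its arguments are hypothetical names, the neighbor answers are supplied as a plain dictionary,
and the case where no neighbor answers OK is simply reported as "not the root cause" because the
brief does not describe it.

```python
from typing import Dict, List, Set, Tuple


def isolate_fault(
    initiator: str,
    neighbors: Dict[str, List[str]],
    is_ok: Dict[str, bool],
) -> Tuple[bool, Set[str]]:
    """Return (initiator_is_root_cause, suppressed_neighbors).

    The initiator is treated as the root cause when at least one neighbor
    answers "OK"; neighbors that answer "not OK" are marked as suppressed.
    """
    answers = {name: is_ok[name] for name in neighbors[initiator]}
    root_cause = any(answers.values())
    suppressed = {name for name, healthy in answers.items() if not healthy}
    return root_cause, suppressed


# Mirrors Figures F and G: B is the initiator, A is healthy, C, D and E are not.
neighbors = {"B": ["A", "C", "D", "E"]}
is_ok = {"A": True, "C": False, "D": False, "E": False}

root_cause, suppressed = isolate_fault("B", neighbors, is_ok)
print(root_cause, sorted(suppressed))
```

With the inputs shown, the critical alarm would be asserted on Model B, and Models C, D and E
would be placed in the suppressed condition.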
FAULT ISOLATION COMPLETE
FIGURE F
The root cause alarm is established through a sequence of sharing status between models.
FIGURE G
Models that respond with a "No, I am not OK" status are put in a suppressed condition to
suppress the alarms that are symptomatic of a problem elsewhere in the infrastructure.
IMPACT ANALYSIS CA SPECTRUM continues to analyze the total impact of the fault. It will
analyze each Fault Domain, a Fault Domain being the collection of models with suppressed
alarms that are affected by the same failure. These impacted models are linked to the root fault
for presentation and analysis. The intelligence measures the impact this fault is creating by
examining the models that are included within this Fault Domain and calculating
an Impact Severity value. The ranking allows operators to quickly assess the relative impact of
each fault and prioritize corrective actions.
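The brief does not state how the Impact Severity value is calculated. Purely as an illustration
of ranking faults by the models in their Fault Domains, the sketch below assumes a simple
weighted sum; the weights, the model types and the function names are all invented for this
example.

```python
from typing import Dict, List

# Hypothetical weights per model type. The brief does not describe the real
# Impact Severity calculation, so this weighted sum is only an assumption.
WEIGHTS = {"router": 5, "switch": 3, "server": 2, "host": 1}


def impact_severity(fault_domain: List[str]) -> int:
    """Sum the assumed weights of the suppressed model types in one Fault Domain."""
    return sum(WEIGHTS.get(kind, 1) for kind in fault_domain)


def rank_faults(domains: Dict[str, List[str]]) -> List[str]:
    """Order root-cause alarms from highest to lowest assumed impact."""
    return sorted(domains, key=lambda root: impact_severity(domains[root]), reverse=True)


# Root-cause model -> types of the models suppressed in its Fault Domain.
domains = {
    "access-switch-7": ["host", "host"],
    "core-router-1": ["router", "switch", "server", "server"],
}

ranking = rank_faults(domains)
print(ranking)
```

Whatever the actual formula, the point of the ranking is the same: the fault whose domain
contains the most significant models is presented first.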
Creatively using the Event Management System
There are many applications of Event Rules that will allow higher order correlation of event
streams. Event Rule processing is required for situations when the event stream is the only
source of management information. For example, this situation can occur when CA SPECTRUM
accepts event streams from devices and applications that it does not directly monitor, so that
CA SPECTRUM can only listen; it cannot talk. CA SPECTRUM provides many out-of-the-box
event rules, but also provides easy-to-use methods for creating new rules using one or more of
the event rule types. This section highlights a couple of out-of-the-box event rules and also a
few customer examples of event rule applications.
AN OUT-OF-THE-BOX EVENT PAIR RULE CA SPECTRUM has the ability to interpret Cisco syslog
messages as event streams. Each syslog message is generated on behalf of a managed switch
or router and is directed to the model representing that managed element. One of the many
Cisco syslog messages indicates a new configuration has been loaded into the router. The
Reload message should always be followed by a Restart message, indicating the device has
been restarted to adopt the newly loaded configuration. If not, a failure during reload is
probable. An event rule based on the Event Pair rule type is used to raise an alarm with cause
ERROR DURING ROUTER RELOAD if the restart message is not received within 15 minutes of
the reload message. Figure H diagrams the events and timing.
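The timing logic of such an Event Pair rule can be sketched as follows. This is an illustration
only and does not use CA SPECTRUM's rule syntax; the function name, the event tuple layout and
the "reload" and "restart" labels are assumptions made for the example.

```python
from typing import Dict, List, Tuple

WINDOW_SECONDS = 15 * 60  # the restart must follow the reload within 15 minutes


def check_event_pairs(events: List[Tuple[float, str, str]], now: float) -> List[str]:
    """Return the devices that should raise ERROR DURING ROUTER RELOAD.

    Each event is (timestamp_seconds, device, kind), where kind is the
    hypothetical label "reload" or "restart".
    """
    pending: Dict[str, float] = {}  # device -> time of the unmatched reload
    for timestamp, device, kind in sorted(events):
        if kind == "reload":
            pending[device] = timestamp
        elif kind == "restart" and device in pending:
            # A restart only completes the pair if it arrives in time.
            if timestamp - pending[device] <= WINDOW_SECONDS:
                del pending[device]
    # Any reload still unmatched after the window has expired is an alarm.
    return sorted(d for d, started in pending.items() if now - started > WINDOW_SECONDS)


events = [
    (0.0, "r1", "reload"),
    (300.0, "r1", "restart"),  # r1 restarted after 5 minutes: no alarm
    (0.0, "r2", "reload"),     # r2 never restarted
]
print(check_event_pairs(events, now=1000.0))
```

Evaluated at 1000 seconds, only r2 is reported, because its reload has gone unanswered for
longer than the 15 minute window.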
EVENT PAIR RULE
FIGURE H
This figure depicts an example of an event pair rule in operation. Reload messages indicating
new router configurations should always be followed by restart messages to indicate the router
has adopted the new configuration. An alarm is raised to indicate a probable failure if the
restart message is not received within the expected time period.
MANAGING SECURITY EVENTS USING AN EVENT RATE COUNTER RULE CA SPECTRUM is often
used to collect event feeds from many sources. Some customers send events from security
devices such as intrusion detection systems and firewalls. These types of devices can generate
millions of log file entries. One customer utilizes an Event Rate Counter rule to distinguish
between sporadic client connection rejections and real security attacks. The rule was
constructed to generate a critical alarm if 20 or more connection failures occurred in less than
one minute. Figure I depicts this alarm scenario.
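A rate counter of this kind is commonly implemented as a sliding time window. The sketch below
is one such implementation, written for illustration; the RateCounter class is hypothetical and
is not the product's rule engine. Only the threshold of 20 events and the one minute window come
from the customer example.

```python
from collections import deque
from typing import Deque


class RateCounter:
    """Sliding-window counter (illustrative sketch, not a SPECTRUM API)."""

    def __init__(self, threshold: int = 20, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one connection failure; return True if an alarm is due.

        Events that are a full window or more older than the newest event are
        dropped, so the count covers "less than one minute" of activity.
        """
        self._times.append(timestamp)
        while timestamp - self._times[0] >= self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


# A burst of 20 failures in 20 seconds trips the rule on the last event.
burst = RateCounter()
burst_results = [burst.observe(float(t)) for t in range(20)]

# One failure every 10 seconds never has more than 6 events in the window.
sporadic = RateCounter()
sporadic_results = [sporadic.observe(float(t)) for t in range(0, 200, 10)]

print(burst_results[-1], any(sporadic_results))
```

The same 20 events therefore produce an alarm when they arrive as a burst and no alarm when they
are spread out, which is the distinction the customer wanted between attacks and sporadic
rejections.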
EVENT RATE COUNTER RULE
MANAGING SERVER MEMORY GROWTH USING AN EVENT SEQUENCE RULE A common problem
with some applications is the inability to manage memory usage. There are applications that
will take system memory and never give it back for other applications to reuse. When the
application does not return the memory, and also no longer requires the memory, it is called a
memory leak. The result is that performance on the host machine will degrade and eventually
cause the application to fail.
At one customer environment this problem regularly occurs on a Web Server application.
The customer has a standard operating procedure to reboot the server once a week to
compensate for the memory consumption. However, if the memory leak occurs too quickly,
there is a deviation from normal behavior and the server needs to be rebooted before the
scheduled maintenance window. The customer employs a combination of progressive
thresholds with an Event Sequence rule to monitor for abnormal behavior. Monitoring was set
to create events as the memory usage passed threshold points of 50%, 75% and 90%. If those
threshold points are reached in a period of less than one week, an alarm is generated to provide
notification to reboot the server prior to the scheduled maintenance window. Figure J depicts
the fault scenario.
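The sequence logic can be sketched as a small function that looks for the 50%, 75% and 90%
threshold events in order and compares the elapsed time with one week. The function name and
the event layout are assumptions for this example; restarting the sequence whenever a new 50%
event arrives is also an assumption, standing in for the weekly reboot resetting memory usage.

```python
from typing import List, Optional, Tuple

WEEK_SECONDS = 7 * 24 * 3600
SEQUENCE = (50, 75, 90)  # memory usage thresholds, in percent, in expected order


def sequence_alarm(events: List[Tuple[float, int]]) -> bool:
    """Return True if 50%, 75% and 90% are crossed in order within one week.

    Each event is (timestamp_seconds, threshold_percent) for a single server,
    supplied in time order.
    """
    start: Optional[float] = None  # time of the 50% event that began the sequence
    matched = 0                    # how many thresholds of SEQUENCE have been seen
    for timestamp, threshold in events:
        if threshold == SEQUENCE[0]:
            start, matched = timestamp, 1  # assumed: a new 50% event restarts the sequence
        elif start is not None and threshold == SEQUENCE[matched]:
            matched += 1
            if matched == len(SEQUENCE):
                if timestamp - start < WEEK_SECONDS:
                    return True
                start, matched = None, 0  # completed too slowly: normal behavior
    return False


DAY = 24 * 3600
fast_leak = [(0.0, 50), (2.0 * DAY, 75), (4.0 * DAY, 90)]
slow_leak = [(0.0, 50), (4.0 * DAY, 75), (8.0 * DAY, 90)]
print(sequence_alarm(fast_leak), sequence_alarm(slow_leak))
```

The fast leak reaches 90% in four days and raises the alarm; the slow leak takes eight days, by
which time the scheduled reboot would already have taken place, so no alarm is needed.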
FIGURE I
This figure depicts an example of the Event Rate Counter rule. Events are often sent by other
devices, such as intrusion detection systems, that can generate millions of events. To separate
event noise from real events, this rule limits alarms to situations where 20 or more connection
failures are logged within a minute.
EVENT SEQUENCE RULE
CONDITION CORRELATION TECHNOLOGY There are many uses for policy-based Condition
Correlation Technology (CCT). For example, consider the complexities of managing an IP
network that provides VPN connectivity across an MPLS backbone with intra-area routing
maintained by IS-IS and inter-area routing maintained by BGP. Any physical link or protocol
failure could cause dozens of events from multiple devices. Without sophisticated correlation
capability applied carefully, the network troubleshooters will spend most of their time chasing
after symptoms, rather than fixing the root cause.
AN IS-IS ROUTING FAILURE EXAMPLE A specific example experienced by one of our customers
can be used to describe the power of Condition Correlation. The failure scenario and link
outages are illustrated in Figure K. The situation occurs where a core router, labeled in the
figure as R1, loses IS-IS adjacencies to all neighbors (labeled in the figure as R2, R3, R4). This
also results in the BGP session with the route reflector (labeled in the figure as RR) being lost.
This condition, if it persists, will result in routes aging out of R1 and adjacent edge routers R3
and R4. Eventually, the customer VPN sites serviced by these customer premise edge (CPE)
routers will be unable to reach their peer sites (labeled in the figure as CPE1, CPE2, CPE3).
FIGURE J
Reaching threshold points is sometimes acceptable if the thresholds are reached over a period
of time that will ensure they are reset by regular maintenance schedules before reaching
critical levels. If they are reached sooner, however, they may cause outages. The Event Sequence
rule will measure threshold attainment and generate an alarm only if required.
IMPACT OF AN IS-IS ROUTING FAILURE
This failure causes a series of syslog error messages and traps to be generated by the routers.
The messages and traps that would be received are outlined in Figure L.
SYSLOG ERROR MESSAGE AND TRAP SEQUENCE
FIGURE K
This diagram shows how a failure in a core router can ripple through the network, causing
numerous events and alarms, if not intelligently managed.
FIGURE L
Error messages and traps cascade from a single core router failure.
SOURCE  TYPE            MESSAGE
R1      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R2 (POS5/0/0) Down, hold time expired
R1      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R3 (POS5/0/0) Down, hold time expired
R1      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to Rn (POS5/0/0) Down, hold time expired
RR      Syslog message  %BGP-5-ADJCHANGE: neighbor R1 Down BGP Notification sent
RR      Syslog message  %BGP-3-NOTIFICATION: sent to neighbor R1 4/0 (hold time expired) 0 bytes
RR      Trap            BGP Backwards Transition trap, neighbor = R1
R2      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (POS5/0/0) Down, hold time expired
R3      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (POS5/0/0) Down, hold time expired
Rn      Syslog message  %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (POS5/0/0) Down, hold time expired
The root cause of all these messages is an IS-IS outage problem related to R1. For many
management systems the operator would see each of these messages and traps as seemingly
disparate events on the alarm console. A trained operator or experienced troubleshooter may be
able to deduce, after some careful thought, that an R1 routing problem is indicated. However, in
a large environment these alarms will likely be interspersed with other alarms cluttering the
console. Even if the operator were capable of making the correlation manually, there would be
significant effort and time spent doing so. That time translates directly into higher costs,
lower user satisfaction and lost revenue.
Using a combination of an Event Rule and Condition Correlation, a set of rule patterns can be
applied to a Correlation Domain consisting of all core, label switch routers, enabling CA
SPECTRUM to produce a single actionable alarm. This alarm will indicate that R1 has an IS-IS
routing problem, and a network outage may result if this is not corrected. The seemingly
disparate conditions that were correlated by the software, resulting in this alarm, will be
displayed in the symptoms panel of the alarm console as follows:
1. A local Event Rate Counter rule was used to define multiple IS-IS adjacency change
syslog messages reported by the same source as a routing problem for that source.
2. A rule pattern was used to make an IS-IS adjacency lost event caused by an IS-IS routing
problem when the neighbor of the adjacency lost event is equal to the source of the
routing problem event.
3. A rule pattern was used to make a BGP adjacency down event caused by an IS-IS
routing problem when the neighbor of the adjacency down event is equal to the source of
the routing problem event.
4. A rule pattern was used to make a BGP backward transition trap event caused by an
IS-IS routing problem when the neighbor of the backward transition event is equal to the
source of the routing problem event.
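The four steps above can be sketched in Python using the events from Figure L. This is a
simplified illustration, not the Condition Correlation policy language: the Event type, the
event kind labels and the correlate function are hypothetical, and the brief's "multiple"
adjacency changes is assumed here to mean two or more.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    source: str    # device that reported the event
    kind: str      # "isis_adj_lost", "bgp_adj_down" or "bgp_backward_transition"
    neighbor: str  # peer named in the event


def correlate(events: List[Event], min_changes: int = 2) -> Dict[str, List[Event]]:
    """Map each source with a routing problem to the events it is taken to cause."""
    # Step 1: several IS-IS adjacency changes from one source = routing problem there.
    lost = Counter(e.source for e in events if e.kind == "isis_adj_lost")
    roots = {source for source, count in lost.items() if count >= min_changes}
    # Steps 2-4: an event is a symptom when its neighbor is the routing problem's source.
    caused: Dict[str, List[Event]] = {root: [] for root in roots}
    for event in events:
        if event.neighbor in roots:
            caused[event.neighbor].append(event)
    return caused


events = [
    Event("R1", "isis_adj_lost", "R2"),
    Event("R1", "isis_adj_lost", "R3"),
    Event("R1", "isis_adj_lost", "Rn"),
    Event("RR", "bgp_adj_down", "R1"),
    Event("RR", "bgp_backward_transition", "R1"),
    Event("R2", "isis_adj_lost", "R1"),
    Event("R3", "isis_adj_lost", "R1"),
    Event("Rn", "isis_adj_lost", "R1"),
]

alarms = correlate(events)
for root, symptoms in alarms.items():
    print(f"{root}: IS-IS routing problem, {len(symptoms)} correlated symptoms")
```

Only R1 reports more than one adjacency change, so it is the single root, and the events from
RR, R2, R3 and Rn that name R1 as their neighbor are attached to it as symptoms.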
APPLYING CONDITION CORRELATION TO SERVICE CORRELATION It is common to have several
services running over the same network. As an example, in the cable industry, telephone
service (VoIP), Internet access (high speed data), video on demand (VoD) and digital cable are
delivered over the same physical data network. Management of this network is quite a challenge.
Inside the network (cable plant), the video transport equipment, video subscription services
and the Cable Modem Termination System (CMTS) all work together to put data on the cable
network at the correct frequencies. Uncounted miles of cable along with thousands of amplifiers
and power supplies must carry the signals to the homes of literally millions of subscribers.
With the flood of events and error messages that will be provided by the managed elements,
the fact that there is a problem with the service will be obvious. The challenge is to translate
all this data into actionable root cause and service impact information.
Service impact relevance goes beyond understanding what is impacted; it is also important
to understand what is not impacted. It's possible for the video subscription service to fail to
deliver VoD content to a single service area, and yet all other services to that area could be
fine. Or, a return path problem in one area could cause Internet, VoIP and VoD services to fail
and digital cable to degrade, yet analog cable could still function normally. In the case of a
media cut in one area, the return path monitoring system and the head end controller would
report return path and power problems in that area. The CMTS would provide the number of
cable modems off-line for the node. The video transport system would generate errors for
video subscriptions in that area. Lastly, any customer modems that are being managed will
become lost to the management system.
CA SPECTRUM can make sense of the resulting deluge of events by using the service area of
the seemingly disparate events as a factor in the Condition Correlation. If the service areas and
services are appropriately modeled, Condition Correlation can be used to determine which
services in which areas are affected and the root cause or causes.
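Using the service area as a correlation factor can be illustrated by grouping events by area and
reporting, for each area, which modeled services are affected and which are not. The service
list, the area names and the service_impact function are assumptions made for this sketch; a
real deployment would take them from the modeled services and service areas.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Services assumed to be modeled in every service area (illustrative list).
SERVICES = ["Internet", "VoIP", "VoD", "digital cable"]


def service_impact(events: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group (service_area, service) events and report impact per area.

    The result lists the unaffected services as well, since knowing what is
    not impacted matters as much as knowing what is.
    """
    hit: Dict[str, Set[str]] = defaultdict(set)
    for area, service in events:
        hit[area].add(service)
    return {
        area: {
            "affected": [s for s in SERVICES if s in services],
            "unaffected": [s for s in SERVICES if s not in services],
        }
        for area, services in hit.items()
    }


events = [
    ("north", "VoD"),       # video subscription failure in one area only
    ("south", "Internet"),  # return path problem affecting several services
    ("south", "VoIP"),
    ("south", "VoD"),
]

impact = service_impact(events)
print(impact["north"])
print(impact["south"])
```

In the north area only VoD is reported as affected, while in the south area the combination of
Internet, VoIP and VoD failures points toward a shared cause such as a return path problem.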
SERVICE CORRELATION
FIGURE M
Condition Correlation enables CA SPECTRUM to analyze a deluge of events to determine which
service is impacted.
CA SPECTRUM for High Performing Infrastructures
A high performing IT infrastructure is at the core of today's successful businesses. Whether
your business is online retail, financial services or manufacturing, your infrastructure is
essential. Keeping the infrastructure running, avoiding outages, quickly finding the causes of
degradation and outages, and simplifying management and event data for your IT staff are all
factors in maintaining high performance.
Patented Software Elevates CA SPECTRUM Capabilities
CA SPECTRUM provides intelligence, multiple methods and patented solutions to apply the
best in event correlation and root cause analysis to your infrastructure. Event correlation is at
the heart of root cause analysis. With large and complex infrastructures, events flood event
logs and your IT staff can be overwhelmed by attempting to correlate events manually. The
time wasted in this effort has a direct effect on your business and its bottom line. CA SPECTRUM
uses intelligence and event rules to separate true root causes from associated, symptomatic
causes, thereby minimizing the amount of information and maximizing the quality of
information your IT staff must address.
Benefits for Experienced Users and New Users
CA SPECTRUM provides out-of-the-box utilization, performance and response time thresholds
that act as an early warning system when a problem is about to happen or when a service level
guarantee is about to be violated. While these thresholds can be tuned for a specific customer
environment, there is tremendous value in having these out-of-the-box thresholds. They enable
CA SPECTRUM to deliver value on day one.
CA SPECTRUM makes it easy for experienced users to add their own thresholds and watches
such that after a unique problem happens in the environment once, new watches can help
predict or prevent it from happening again. Through Event Rules and combinations of Event
Rules, even complex behaviors can be captured and managed.
SECTION 3: BENEFITS
Change is a constant, requiring any management system to be automated, adaptable, and
extensible. The number of multi-vendor, multi-technology hardware and software elements
in a typical IT environment exponentially increases the complexity of managing a real-time,
on-demand IT infrastructure. CA SPECTRUM currently supports several thousand distinct
means to automate root cause analysis across hundreds of product families and device
types from today's leading infrastructure vendors.
Knowing about a problem is no longer enough. Predicting and preventing problems, pinpointing
their root cause, and prioritizing issues based on impact are requirements for today's
management solutions. The number and variety of possible fault, performance and threshold
problems mean that no single approach to root cause analysis is suited for all scenarios. For
this reason, CA SPECTRUM incorporates model-based IMT, rules-based EMS, and policy-based CCT
to provide an integrated, intelligent approach to drive efficiency and effectiveness in managing
IT infrastructure as a business service.
To learn more about CA SPECTRUM and its technical approach, visit
http://www.ca.com/us/products/product.aspx?id=7832
SECTION 4: CONCLUSIONS
CA (NYSE: CA), one of the world's leading independent, enterprise management software
companies, unifies and simplifies complex information technology (IT) management across the
enterprise for greater business results. With our Enterprise IT Management vision, solutions
and expertise, we help customers effectively govern, manage and secure IT.
TB05IM SPECRCA1E M P327420