
INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT
Int. J. Network Mgmt 2001; 11:75–90

Network and service management for wide-area electronic commerce networks

By Symeon Papavassiliou*

This paper focuses on the effective management of wide-area electronic commerce networks supporting services and applications that require high availability and reliability, as well as fast reconstitution time in the event of failures. Copyright 2001 John Wiley & Sons, Ltd.

Introduction

Wide-area electronic commerce networks continue to expand both domestically and internationally. In the network management control center, this growth adds to the volume of status and alarm data that an operator must monitor and analyze. Network management instrumentation must allow operators to easily and comprehensively monitor the various network segments, determine the troubles, and rapidly focus on a magnified portion of the network. Moreover, most electronic commerce networks will operate continuously around the clock. The challenge for network management systems vendors is to maintain the operator’s attention, and focus their activities on the most relevant actions in emergencies. Network management systems must begin to automate routine functions by improving their capabilities for automated decision making. Thus network administrators can focus on traffic analysis, trend analysis and planning.1

Typical examples of services provided by wide-area electronic commerce environments include: dedicated or switched (on-demand) connectivity to a public Internet (e.g. dial to TCP/IP and Novell IPX LANs), dial-up access to directory services, transactions of short duration where valued features are required (e.g. credit card authorization), access to Service Provider application services (e.g. mail), etc. All those different types of applications present different traffic characteristics to the network, have different performance objective criteria, may require different quality of service, and create an immediate and pressing need for enhanced automated network management operations.2,3

The success of network management will depend on three critical success factors: a well-organized set of network management functions allocated and assigned to instruments and to human skill levels; proper instrumentation with the ability to extract integrated information, to export and import it, to maintain databases, and to provide analysis and performance prediction; and personnel who understand their job responsibilities and possess the necessary qualifying skills.4 In this paper we mainly focus on several aspects of the first two factors. The major network management processes that comprise the operational support of the service and the network are identified as: configuration management, fault management, performance management, security management, accounting management, network maintenance, and capacity management. Within the scope of each one of those processes there is a set of functions that must be performed by that process.5,6

In this paper we limit our study and efforts to the network maintenance and fault management functions. In the following we may use the terms network management and network monitoring interchangeably, and we mainly refer to the functions associated with network maintenance and fault management (e.g. proactive maintenance, fault recovery, troubleshooting, alarm correlation etc.). The main purpose of an efficient network management architecture and methodology is to maintain the network in a proactive way by monitoring and troubleshooting alarmed conditions as well as other types of system occurrences that may cause or indicate a degradation of the service.

The author is a faculty member at the New Jersey Institute of Technology, New Jersey Center for Multimedia Research, Newark, NJ, USA.

*Correspondence to: Symeon Papavassiliou, New Jersey Institute of Technology, Electrical and Computer Engineering Dept., New Jersey Center for Multimedia Research, University Heights, Newark, NJ 07102, USA. Email: [email protected]

The main focus of this paper is to describe methodologies to design and implement effectively proactive service/network management for electronic commerce networks. Conventional network management systems report network faults in terms of link and/or node failures. Such high-level information is of very limited use to an operator monitoring the network of a service provider. A service provider would like to know the impact of a network fault on customers. For instance, a provider would like to know which local dial numbers went out of service, in what city, what the remaining capacity is, etc. This mission-critical data has to be provided to the service provider in almost real time, so that the service provider can inform the customers about a specific problem before the customer is impacted by the problem. Moreover, networks used in applications with high availability requirements need automatic network management schemes to handle fault and congestion. These schemes are used to detect and confine faults/congestion, reconfigure the network with adaptive routing techniques, and restore the original configuration upon restart of the faulty component or abatement of congestion. Different threshold parameters need to be specified while defining the network management procedures. Often networks with high availability requirements are used in applications that have high performance objectives as well. Hence a design of network management schemes should simultaneously consider both availability and performance metrics.7,8

The paper is organized as follows. In the next section we provide a high-level description of a generic network model for services typically supported by wide-area electronic commerce networks, while in the third section we describe enhanced data architecture guidelines that could play a critical role in the efficient implementation of the Operations Support Systems and the corresponding Network Management functions. In the fourth section we present methods, models and architectures for efficient network management. The fifth section describes how proactive service/network fault-detection methods based on real-time performance measurements and dynamically built performance profiles (‘signatures’) can be applied to facilitate the network management and operations processes, by providing enhanced and intelligent on-line network analysis and control for value-added on-line types of services (e.g. transaction access services). The final section presents conclusions.


Generic Network Model

—Service and Network Providers—

In wide-area electronic commerce communication services and applications two types of provider are usually involved in order to complete the end-to-end service offering: the Service Provider and the Network Provider. The first is responsible for the definition of the service characteristics and the maintenance of the customer premises equipment, while the latter provides the network infrastructure (e.g. high-speed network) used by the end users and/or the Service Provider. The Network Provider relieves the other parties involved in that arrangement of the cost and effort of network management by reducing labor cost and capital investment. In such an arrangement the Service Provider is essentially a Customer of the Network Provider, while the Service Provider provides the service to its own customers or end-users (usually multiple customers of small to medium size). Note that it is possible for the functions of Service Providers and Network Providers to be offered by the same provider or organization. It should be noted here that in general the providers could be either national or regional providers depending on the geographical coverage that they provide. Providers that have Points of Presence (POPs) throughout a country are called national providers, while providers that cover specific regions are called regional providers and connect themselves to other providers at one or more points. All service provider networks may exchange traffic only at the Network Access Points. The primary intention of the work presented here is to provide an enhanced network/service management model that deals with methods of providing a view of network events with higher granularity and analyses the impact due to those events, and not to address issues related to network interoperability. However, we present a method that is incorporated in our enhanced network model to address the impact of faults originated within network elements that are either not monitored at all (e.g. special cases of customer premises equipment) or are outside the jurisdiction of the network provider. Therefore, for the sake of simplicity and without loss of any of the paper’s objectives, throughout we do not distinguish between national and regional providers and we use the terms network and service providers with their loose definition as provided in the beginning of this section.

—Access Networks and Backbone Networks—

These are analogous to the highway system, where access networks are similar to primary and secondary access roads, and backbone networks are similar to major highways.9 Figure 1 shows the key access technologies that are either already in use or under development. These include dial-up network architectures, xDSL networks, cable modem networks, ISDNs, and wireless networks. Although, as we see, different user access techniques and different backbone network technologies may be used by different providers to support their services, all those techniques must provide users with ubiquitous access to the corresponding remote sites (e.g. host processors and servers) over a large geographic area (Wide Area Network).

Figure 1. Illustration of various access networks (diagram labels: dial-up access network, xDSL access network, cable access network, wireless access network; access network termination and interworking; Internet Wide Area Network, e.g. ATM, Frame Relay, IP)

Figure 2 shows a sample high-level end-to-end architecture diagram used by the Service and Network Providers to offer various electronic commerce services, with multiple access technologies (e.g. dial-up architectures, cable modem technologies, xDSL network) and backbone networks. In any of those cases the model to be followed is conceptually similar. The end-user communicates from its terminal device (e.g. cable modem) with a Terminating System (TS) (e.g. Cable Modem Termination System, CMTS) located at various Point of Presence (POP) sites, where aggregation and/or trunking processes may take place for connection to the high-speed backbone data network. The various POPs may serve either specific kinds of services (e.g. only dial-up architectures, only cable access architectures) or as the Point of Presence for multiple services and/or access architectures (Integrated POPs). Despite the fact that user access devices today are service and access network specific, the network model and the issues associated with implementation of enhanced network and service management are similar in most of those cases. The end-user is connected to a user access device that interfaces to the access network. Subscribers use the access network to reach remote service nodes (e.g. servers, host processors, toll gates, service gateways) via high-speed backbone networks.

Although multiple technologies are being developed to provide access to the end users, dial-up and ISDN architectures are still the workhorse of remote access and represent a very large portion of today’s electronic commerce business environment with general availability and widespread geographic coverage area.10,11


Figure 2. End-to-end Network/Service architecture sample (diagram labels: PC clients, cable modem, hybrid fiber coax, hub, dial-in modem subscriber, ISDN subscriber with BRI, xDSL access, access network, signal conversion system, POPs and Integrated POP, high-speed network (ATM, Frame Relay, IP), customer gateways, routers, FDDI ring, Ethernet, host processors)

—Dial Network Service Architecture/Infrastructure—

In this section we present in more detail the characteristics and various components of the dial network/service architecture that motivated the development of the methodologies and algorithms presented here. In general the end-to-end network architecture/infrastructure that supports the various services can be broken into multiple components: dial access services, access nodes, backbone infrastructure, egress, protocol converters, shared gateways. This generic architecture model is depicted in Figure 3.

Such a generic type of architecture today represents a large number of providers that offer electronic commerce services to geographically dispersed users. Although several providers may, in addition to this infrastructure, implement and support other enhanced access techniques in the future, we believe that the design of enhanced network management models and methodologies, and the experience gained by the development and deployment of those models on the existing ‘production’ networks, is a major step towards understanding the complex issues associated with the impact of the design and network management process in the electronic commerce business.



Figure 3. Dial network service architecture (diagram labels: PCs, LECs, POPs, high-speed network, customer gateways, routers, FDDI ring, Ethernet, hosts)

Therefore, although most of the methodologies and algorithms presented in this paper can be extended and used in different network access and network backbone environments supporting electronic commerce services, when we describe those techniques in more detail and present examples we will be referring to the infrastructure of Figure 3. In Figure 3 the high-speed network provides instant access to corporate sites/networks, as well as to the Internet. The dial-up network Service Providers typically would like to operate over a large geographic area and as such expect ubiquitous access to their Server Farms. The underlying supporting infrastructure consists of large numbers of Points-of-Presence (POPs) as well as the ingress and egress links. Those types of services provide the customer with a method of dialing into their hosts, along with various protocol conversions, options for supporting host protocols, various shared gateways and an Internet connectivity option. The Service Provider uses dial services for the end users to connect to the access nodes (Points of Presence) through the Local Exchange Carrier (LEC) End Office (EO). Multiple calls from various End Offices are multiplexed and forwarded to the appropriate Point of Presence, which will provide the necessary functions to set up the calls across the backbone (high-speed network) to either the Customer Premises or a shared Gateway etc.

The AT&T Transaction Access Service network— In this section we present a brief description of the network and service architecture of the AT&T Transaction Access Service offering, which actually motivated the development of the methodologies and algorithms presented here and provided the test-bed and implementation environment. The AT&T Transaction Access Service (TAS) network is a hybrid POTS-and-data wide-area network that provides ubiquitous dial-to-packet services for carrying short-duration transaction traffic in the United States, Canada, and the Caribbean countries.12 Average usage of the TAS network amounts to millions of transactions on a typical non-busy day, and is growing rapidly. Typical transactions support point-of-sale applications/services (e.g. credit/debit card authorization and settlement), health care applications, banking and vending applications, and other data-driven sales applications. The TAS Network currently services tens of service classes.

The physical topology of the TAS Network consists of three major components: the AT&T 800 Network, the TAS nodes (for POTS-to-packet protocol conversion), and the AT&T Packet Service, as illustrated in Figure 4.

The central function of the TAS network is to enable transaction-oriented communication between terminal devices (e.g. credit card scanners) scattered across the United States and their designated processing hosts (e.g. credit processing servers). Device access to TAS is accomplished through the AT&T MEGACOM 800 Network, which is terminated at a set of TAS nodes that act as protocol converters. These nodes use the DNIS (Dialed Number Identification Service) digits provided by the 4ESS switches in the 800 network to establish SVCs in the AT&T packet network. Finally, the packet network is used to complete the connection between the customer devices and their destined host processors. In a typical transaction, a call originated in a terminal device is processed by the 4ESS switches in the AT&T network, and is routed to a geographically proximate TAS modem pool. A virtual connection is further set up between the modem and the host processor through a set of packet switches. The result is an end-to-end circuit that connects the E-commerce terminal device and its processor for the duration of the transaction. This circuit is dropped as soon as the transaction is completed.
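The call flow described above, in which the dialed-number digits select the destination host and the circuit lives only for the duration of one transaction, can be illustrated with a minimal Python sketch. The table contents, the names `DNIS_TO_HOST`, `VirtualCircuit`, `set_up_call` and `complete_transaction`, and the in-memory representation are all hypothetical; the paper does not describe the actual TAS node software.

```python
from dataclasses import dataclass

# Hypothetical provisioning table: DNIS digits -> destination host processor.
# In a real deployment this would come from customer provisioning data.
DNIS_TO_HOST = {
    "5550100": "host-credit-auth",
    "5550101": "host-healthcare",
}


@dataclass
class VirtualCircuit:
    """A switched virtual circuit that exists for one transaction only."""
    dnis: str
    host: str
    is_open: bool = True


def set_up_call(dnis: str) -> VirtualCircuit:
    """Map the dialed-number digits to a host and open a circuit to it."""
    host = DNIS_TO_HOST.get(dnis)
    if host is None:
        raise LookupError(f"no host provisioned for DNIS {dnis}")
    return VirtualCircuit(dnis=dnis, host=host)


def complete_transaction(circuit: VirtualCircuit) -> None:
    """Drop the circuit as soon as the transaction is completed."""
    circuit.is_open = False
```

A caller would invoke `set_up_call` when a call arrives at the modem pool and `complete_transaction` when the transaction ends, so that no circuit outlives its transaction.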


Data Architecture Vision and Guidelines

Almost all the applications that have been developed to support operation processes are information-based systems. The collection, organization, processing and reporting of data is fundamental to operations automation.1 Consequently the Operations Support Systems (OSSs) by nature are database applications. An efficient OSS data architecture plan must tie all the OSSs together on a common infrastructure. Management or operations systems must be flexible and have a distributed, modular architecture that allows Service and Network Providers to adapt to future customer needs.

Figure 4. The AT&T TAS Network physical architecture (diagram labels: terminal equipment, e.g. credit card scanner; AT&T 800 Network; TAS nodes with modem pools; AT&T Packet Network; host processor; real-time transaction records for service-class and network fault detection)

—Data Architecture Definition—

Data architecture, as defined within this paper, is the underlying infrastructure, including data stores, platforms, systems, and access methods, used to support daily operations of a service. Data entities supported by the data architecture include, but are not limited to, the following: inventory assignments of network elements and circuits, network performance statistics, network and customer provisioning, critical business success factors such as Direct Measure of Quality (DMOQ) etc., customer profile and customer reference information.

—Objectives—

The data architecture is intended to achieve not only an improvement in the operations support functions but a significant return to the business as well, by driving future information systems development to a common logical systems framework that maximizes: (1) easy access to accurate and up-to-date information by users, (2) powerful processing of accurate and up-to-date information by users, (3) communication resources; and simultaneously minimizes: (1) overall costs of the information systems infrastructure, (2) barriers to the introduction of new service offerings, (3) costs associated with enhancing the infrastructure, (4) barriers to the introduction of new application services.

—Guidelines/Principles—

Data management represents a major cost item for Service and Network Providers due to the volume, redundancy, and difficulty in ensuring accuracy throughout a provider’s operation. To meet the objectives motivating the OSS data architecture planning and to implement the goals of the data architecture vision, the following principles should be used:

• Client/Server Application architectures: client/server architecture is the current trend in technology of support system architectures.

• Data Elements Entered Once: any particular data element should be entered manually only once. The data architecture should be based upon a network of distributed databases that maximizes the use of its data across different applications. Preferably, data should be stored in only one database location. If, for design implementation or performance reasons, the data must be copied to another database, this must be done electronically and transparently to the user. This reduces the amount of data entry performed by operations staff and increases the quality of the data, since errors made by redundant data entry are eliminated.

• Application and Data Independence: the data architecture should be based upon a network of distributed databases. The relational databases must be designed such that different applications can access and update the data without any strong relationship to them. The supporting network and infrastructure should allow various applications to access data in different distributed databases in a manner that is transparent to the user and virtually transparent to the developer of the application. Such an approach tears down the dependencies between the database and the applications that use the stored data. Consequently, this provides for the most efficient use of commercially available software packages that can interface, store, query, retrieve, and report this data with minimum development cost and intervals.

• Modularity: the data architecture should be based on modularity implemented at the component level. Modularity allows individual components to be substituted with minimal disruption to the surrounding systems infrastructure.

• Scalability: the data architecture should ensure that each OSS and the infrastructure supporting the OSSs are designed and implemented in such a way as to accommodate service and network growth. The principle of a network of distributed databases, systems, and commercially available applications is a necessity to ensure scalability. Modularity and scalability minimize upgrade costs associated with maintaining the architecture to meet the need for efficient operations support.

Network Management

To ensure that the network is available to its users and customers, and that the elements within the network components are functioning properly and according to specified requirements, various operations-support tools, algorithms and methodologies are implemented to perform real-time (or close to real-time) monitoring and management functions. Network management includes trouble management, which initiates corrective actions for service and fault recovery, and proactive maintenance, which may provide capabilities for self-healing. Trouble management correlates alarms to services and resources, initiates tests, performs diagnostics to isolate faults, triggers service restoral, and performs activities necessary to repair the diagnosed fault. Proactive maintenance responds to near-fault conditions that degrade system reliability and may eventually result in an impact on services. It tries to detect and/or correct network problems before service troubles are reported and well before the service or the network performance is considerably degraded.

—Common Implementation—

Typically, a network provider manages the network through Network Operations Centers (NOCs). These centers deploy a three-level management model, that is: managed object, element management system and the manager, to execute the network management tasks. The managed objects are those objects that are critical to the functioning of the network. Element management systems typically collect data on a class or subgroup of network elements. The manager is a central management station which collects all relevant data on the status of the managed objects. The managed objects in on-line service networks are WAN links, Routers, CSU/DSU, Network Nodes etc. If a fault is detected on any of these objects, the manager flashes the corresponding traps (SNMP events) on the screen. These events typically are displayed with an ASCII string like link XYZ is down or Node ABC is unreachable. Vendors of network elements expect the users to know the meaning of those messages. Technicians at the network operations centers are trained or have access to the information to translate the events to the impacted objects and take adequate corrective steps. However, such a reactive approach takes time, and it could take hours to identify all the customers impacted, since typically a WAN link or a node could potentially be serving thousands of local numbers. Thus the current paradigm of displaying link and/or network node failures does not adequately support on-line services network management. The issue gets more complicated when many customers/services ride on the same network infrastructure.13
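To illustrate why such raw event strings carry little service-level meaning on their own, the following minimal Python sketch parses messages of the two forms quoted above into a structured record. The message grammar, the `ElementEvent` type and the `parse_event` function are assumptions made for illustration; actual element managers emit vendor-specific formats.

```python
import re
from typing import NamedTuple, Optional


class ElementEvent(NamedTuple):
    kind: str   # "link" or "node"
    name: str   # element identifier, e.g. "XYZ"
    state: str  # "down", "up", or "unreachable"


# Assumed grammar: "<link|node> <name> is <down|up|unreachable>".
_EVENT_PATTERN = re.compile(
    r"^(link|node)\s+(\S+)\s+is\s+(down|up|unreachable)$", re.IGNORECASE
)


def parse_event(text: str) -> Optional[ElementEvent]:
    """Return a structured event, or None if the string is not recognized."""
    match = _EVENT_PATTERN.match(text.strip())
    if match is None:
        return None
    kind, name, state = match.groups()
    return ElementEvent(kind.lower(), name, state.lower())
```

The parsed record still names only a network element; relating it to impacted cities, dial numbers and customers requires the inventory and provisioning data discussed in the data architecture section.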


—Enhanced Network Management Model—

The management of service-independent multi-service on-line networks can be greatly enhanced if the granularity of network events is magnified through a proper translation algorithm. Further, the availability of capacity for a location can be computed so that the users have advance information on congestion or an impending service outage for a particular location. The enhanced network management model presented here deals with the method of providing a view of network events with higher granularity and analyses the capacity impact due to those events. The model consists of two major logical components: (1) Event Translation Algorithm, and (2) Capacity and Congestion Analysis. In Figure 5 we provide a high-level event flow diagram that presents the required steps and processes in order to implement an enhanced and realistic network management model and analyze the capacity impact due to different failures (events).

Figure 5. High-level event flow diagram (diagram labels: event arrives; critical event?; oscillating?; log/display; call link/node to dial-number mapping function; call link/node availability function; call capacity/congestion control function; display messages)

In the following we present a scenario of the event translation and capacity analysis algorithm for the event of ‘communication link goes down’. A similar approach could be taken for nodes going down or any other type of failure indicated by an event. It should be noted here that an event of the type ‘communication link goes down’ could represent either an ingress link, an intermediate link, or an egress link. For instance, of interest to the service provider could be a link that provides connection from the access network to the Point of Presence, a link that connects the POP to the high-speed network (ingress link), a link that connects two intermediate switches, or a link that provides connection from the backbone network to the customer gateways, routers and servers (egress link). The corresponding nodes affected by such failures could be either the POPs on the ingress site or the customer gateways on the egress site. Of course, the actions taken in each of those cases would be different since different network elements may be affected, but the flow and reasoning of the algorithm should be similar in all cases. In today’s environment, where communication infrastructures are evolving into multiple service-class networks, it is quite common to have multiple nodes in cities (especially in metropolitan areas), and therefore it is critical for the network provider to let the users know about the availability and status of their network. The information regarding the various network elements, as well as their impacts on other elements and specific customers, is obtained from the database systems containing the inventory assignments of network elements and circuits, network and customer-provisioning data, customer profile and reference information. Whenever a fault has triggered the calculation of the event translation and capacity analysis algorithm and caused the generation of different availability messages, then as the fault conditions get cleared the availability messages get cleared in reverse order.

Event Translation and Capacity Analysis Algorithm:

Event: link xyz is down

If xyz is not in an oscillating state:
    find all nodes N served by link xyz;
    for all nodes N, check city served;
    for each node in the city, compute total # of links and total # of up links;
    if up-links/total is
        > 0.9: “link in City ABC is Down”, no impact on capacity
        between 0.9 and 0.5: “link in City ABC is Down”, capacity impacted in city ABC
        between 0.5 and 0.0: “link in City ABC is Down”, Service Outage
else
    exit.

The threshold levels for service availability could be user definable. Typically in production networks supporting electronic commerce on-line services12 these levels are defined as follows:

(1) Service Not Impacted: 90%–99% availability

(2) Service Impacted: 80%–89% availability

(3) Reduced Capacity: 70%–79% availability

(4) Severe Congestion: 50%–69% availability

(5) Service Outage: <50% availability
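The algorithm and the five availability levels above can be combined in a short Python sketch. This is an illustrative reading under stated assumptions, not the production implementation: the `Inventory` structure and its field names are hypothetical stand-ins for the inventory and provisioning databases, availability is computed per city as up links over total links, and the level boundaries are treated as lower bounds (so that, for example, exactly 90% counts as Service Not Impacted).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Lower bound on availability (fraction of links up) -> status label.
# Checked from the highest bound down; the boundaries are an assumption.
LEVELS = [
    (0.90, "Service Not Impacted"),
    (0.80, "Service Impacted"),
    (0.70, "Reduced Capacity"),
    (0.50, "Severe Congestion"),
    (0.00, "Service Outage"),
]


def classify(up_links: int, total_links: int) -> str:
    """Map the up/total link ratio to one of the five availability levels."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    availability = up_links / total_links
    for lower_bound, label in LEVELS:
        if availability >= lower_bound:
            return label
    return "Service Outage"


@dataclass
class Inventory:
    """Hypothetical in-memory stand-in for the inventory databases."""
    link_to_nodes: Dict[str, List[str]]   # link id -> nodes it serves
    node_to_city: Dict[str, str]          # node id -> city served
    node_to_links: Dict[str, List[str]]   # node id -> links attached to it
    down_links: Set[str] = field(default_factory=set)
    oscillating: Set[str] = field(default_factory=set)


def translate_link_down(inv: Inventory, link: str) -> List[str]:
    """Translate a 'link down' event into per-city availability messages."""
    if link in inv.oscillating:
        return []  # oscillating links are not analyzed
    inv.down_links.add(link)
    served = inv.link_to_nodes.get(link, [])
    cities = sorted({inv.node_to_city[node] for node in served})
    messages = []
    for city in cities:
        # Collect every link attached to any node in this city.
        city_links = {link}
        for node, node_city in inv.node_to_city.items():
            if node_city == city:
                city_links.update(inv.node_to_links.get(node, []))
        up = len(city_links - inv.down_links)
        status = classify(up, len(city_links))
        messages.append(f"Link {link} is down in city {city}: {status}")
    return messages
```

With ten links serving a single city, the first failure leaves 9 of 10 links up and yields Service Not Impacted, while a second failure drops availability to 80% and yields Service Impacted. Node failures could be handled analogously by first resolving the node to its attached links.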

—Network Management Architecture—

The network management architecture for supporting the model described in the previous subsection is depicted in Figure 6. The architecture is based on a client/server model. The central management system and the clients reside in the NOC. The element management systems, which gather data on the health of different sets of network elements, are connected to the network management server via a communication server box. The communication server converts the incoming data stream to the appropriate format to be exported to the network management server (e.g. incoming ASCII information into TCP/IP). The database server provides the necessary information on the network assets, inventory assignments of network elements and circuits, network and customer provisioning data, customer profile and reference information etc.


In the architecture the network management server controls the event-processing and capacity analysis algorithm. The result of this processing is displayed on the respective client stations in the NOC. The views of these clients, which basically display the network of a service provider, can be projected to their respective locations. It should be noted that access to the network management servers could pose a serious security risk to the entire network. Therefore static routes should be used between the service provider and the NOC locations. Each client is dedicated to a specific service provider and therefore the sectionalization of the fault becomes less time consuming. This architecture can accommodate a sizable number of customers depending on the power and available memory of the workstations. The architecture can be further scaled by deploying a distributed network management server.

Figure 6. Network management architecture (diagram labels: Network Operations Center (NOC) with Network Management System, Comm. Server, EMSs, Central Database, Client for Customer A, Client for Customer B, Ethernet, routers; High Speed Net.; Customer A’s Site with router and workstation)

—Typical Outputs—

The output message formats can be user definable and the corresponding messages could typically be displayed as follows:

EVENT-ID  TIME   TEXT
9999      09:00  Reduced Capacity in City ABC, State abc, local dial numbers xya
9998      08:10  Severe Congestion in City ABC, State abc, local dial numbers xyb
9997      08:09  Service Outage in City ABC, State abc, local dial numbers xyc
9996      08:08  Link Up in City DEF, State KJ, local dial numbers cba available
9995      07:00  Link Down in City ABC, State abc, Service not impacted
9994      06:00  Service Outage, Node Down in City ABC, State abc, local dial numbers xya
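A minimal sketch of how such user-definable message formats might be rendered is given below. The `Event` record, the `DEFAULT_FORMAT` template and the `render` function are hypothetical names introduced for illustration; the paper does not specify how the formats are defined.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: int
    time: str   # "HH:MM", as shown in the display above
    text: str


# Hypothetical user-definable template: the column widths and field names
# are assumptions, and an operator could supply a different template.
DEFAULT_FORMAT = "{event_id:<10}{time:<7}{text}"


def render(events, fmt=DEFAULT_FORMAT):
    """Return display lines, header first, then events by descending EVENT-ID."""
    header = fmt.format(event_id="EVENT-ID", time="TIME", text="TEXT")
    ordered = sorted(events, key=lambda e: e.event_id, reverse=True)
    rows = [fmt.format(event_id=e.event_id, time=e.time, text=e.text)
            for e in ordered]
    return [header] + rows


sample_lines = render([
    Event(9998, "08:10", "Severe Congestion in City ABC"),
    Event(9999, "09:00", "Reduced Capacity in City ABC"),
])
```

Keeping the template separate from the event data means the layout can be changed for each client display without touching the event-processing logic.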

Intelligent On-line Network Analysis and Control

The network management procedures described in the previous sections usually identify network problems based on events, statistics, alarms and conditions generated by the network equipment, as well as on pre-specified thresholds applied to the various network resources. However, for some types of applications (e.g. transactions where valued features are required), the large variations in the traffic characteristics and requirements associated with those applications make it critical to provide some type of intelligent network analysis and monitoring based on specific customer requirements.

Most of the conventional network maintenance tools generate alarm conditions when they detect ‘hard network faults’ (i.e. failures and outages). Those alarms are usually engineered into the network elements, e.g. the operational status alarm of a switch (‘up’, ‘down’, or ‘maintenance’ status), and they are by nature reactive. That is, by the time they are identified a major fault (e.g. a link cut or a router interface down) has already occurred, and services have already been interrupted or compromised. The key use of these alarms is to enable network hardware to be fixed soon after alarms are captured and analyzed.14,15 However, in order to move the network maintenance process from a reactive to a proactive mode, the Network Management Center (NMC) personnel must, in addition to the hard faults, be capable of identifying potential network faults at a stage that is not yet serious enough to lead to service level failures and compromises (‘soft faults’). One example of alarms that could identify potential ‘soft faults’ is the ‘threshold-based’ alarm.16 If the threshold is appropriately set, it is possible that by the time the threshold is crossed the network fault has not yet degraded the overall network performance, and therefore has not seriously affected the corresponding services. The methodologies and issues discussed in this section mainly deal with this kind of problem or fault. In references 17 and 18 we have described a proactive service and network fault detection method that we have developed based on real-time measurements and the expected behavior of the performance of different applications.17,19 In this section we mainly discuss how those methods can be applied to facilitate the enhanced network management process described earlier.
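The distinction between reactive ‘hard fault’ alarms and threshold-based ‘soft fault’ alarms can be illustrated with a small sketch. The function name, the status strings and the 0.8 utilization threshold are assumptions chosen for illustration, not values taken from the networks described here.

```python
def classify_element(status, utilization, threshold=0.8):
    """Classify one network element reading.

    status      -- operational status reported by the element:
                   'up', 'down' or 'maintenance'
    utilization -- measured load as a fraction of capacity (0.0 to 1.0)
    threshold   -- hypothetical utilization level that raises a soft-fault alarm
    """
    if status == "down":
        return "hard fault"      # reactive: service is already interrupted
    if status == "maintenance":
        return "maintenance"
    if status != "up":
        raise ValueError(f"unknown status: {status!r}")
    if utilization >= threshold:
        return "soft fault"      # proactive: threshold crossed, element still up
    return "normal"
```

A fixed threshold like this one is the simplest case; the adaptive thresholds discussed in the rest of this section replace the constant with a value derived from baselined performance.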

Methods that detect network and service anomalies as violations of baselined performance characteristics, in addition to being applicable to faults originating within the network, could also apply to faults that may occur outside the jurisdiction of the network provider's monitoring system (e.g. external devices such as host processors or servers). In this sense external failures that may impact the performance of the whole network could be proactively identified and corrected. In the following we summarize the key characteristics of an anomaly detection system that we have developed17,18 and implemented in a high-volume, transaction-oriented wide-area network with multiple service classes. Specifically this system is capable of:

• Adaptively sampling the service transaction records in real time on a per-service-class basis, highlighting transactions that have a high probability of being anomalous,

• Automatically building dynamic thresholds for network service classes to baseline their individual performances and the overall network performance. These thresholds are updated periodically (and automatically) to account for the performance evolution of the different service classes,

• Detecting network and service anomalies as violations of the baselined performance characteristics and profiles,

• Detecting network and service faults reliably and proactively. Some of these faults may originate from the non-managed part of the network, and may lead to serious service degradations and even major failures of the network. Being able to detect these faults at an early stage (hence proactively), before the problem escalates, enables early recovery.

In references 17 and 18 the emphasis of our work was placed on the development of the corresponding rules and adaptive thresholding techniques used to implement the anomaly detection system. In this paper we direct our efforts and discussion to how the results and observations of our study could facilitate the enhanced network management process, as described earlier. We also present some interesting results and observations that we obtained by implementing this anomaly detection system as part of our enhanced network maintenance model in a production wide-area network supporting electronic commerce traffic.

For the sake of completeness, in the following we provide a brief, high-level description of the generic architecture of the anomaly detection system we are investigating. The generic architecture of an anomaly detection system is shown in Figure 7, which highlights the three crucial functional components: the sampler, the rule (or threshold) generator, and the anomaly detector. In this system, network performance data are accumulated on-line by the sampler for analysis. The sampler outputs performance measures (e.g. traffic intensities, or circuit utilization, for service classes in a transaction network) in which potentially anomalous data are highlighted. The historical network performance data output by the sampler are analyzed by the rule generator to build adaptive and dynamic (i.e. temporally based) performance thresholds. The detector compares real-time network performance data output by the sampler with performance thresholds and predefined fault criteria for anomaly detection. The outputs of the detector are typically sent to a graphical user interface (GUI) to alert network operators of network anomalies and faults, or directly to network control modules for automatic feedback and control (e.g. circuit breaker).

Figure 7. Architecture of a network anomaly detection system

Figure 8. Typical service traffic visualization instance
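The division of labor among the sampler, the rule generator and the detector can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the system of references 17 and 18: the class and method names are hypothetical, the sampler here simply stores measurements rather than sampling adaptively, and the threshold rule (historical mean plus or minus `k` standard deviations per hour of day) is a common stand-in chosen for brevity.

```python
from collections import defaultdict
from statistics import mean, pstdev


class Sampler:
    """Accumulates performance measures per (service class, hour of day)."""

    def __init__(self):
        self.history = defaultdict(list)

    def record(self, service_class, hour, value):
        self.history[(service_class, hour)].append(value)


class RuleGenerator:
    """Builds temporally based thresholds from the sampler's history.

    Assumed rule: baseline = historical mean, tolerance = k standard deviations.
    """

    def __init__(self, k=3.0):
        self.k = k

    def build(self, history):
        rules = {}
        for key, values in history.items():
            baseline = mean(values)
            tolerance = self.k * pstdev(values)
            rules[key] = (max(0.0, baseline - tolerance), baseline + tolerance)
        return rules


class Detector:
    """Compares a real-time measure with the thresholds for its time slot."""

    def __init__(self, rules):
        self.rules = rules

    def is_anomalous(self, service_class, hour, value):
        rule = self.rules.get((service_class, hour))
        if rule is None:
            return False  # no baseline has been built for this slot yet
        lower, upper = rule
        return value < lower or value > upper


# Usage: traffic intensities for a hypothetical service class at 09:00.
sampler = Sampler()
for intensity in (100, 102, 98, 101, 99):
    sampler.record("credit-auth", 9, intensity)
detector = Detector(RuleGenerator(k=3.0).build(sampler.history))
```

Rebuilding the rules periodically from fresh history is what would make the thresholds adaptive; the detector itself stays stateless.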

In Figure 8 we present a visualization instance of service-class based performance (as presented on the GUI), which provides a representative example of a typical detector output. The performance measurements used in this instance are the traffic intensities. The traffic intensity of a service class provides a measure of the total number of circuits dedicated to that service class in real time. The set of dynamic thresholds (upper and lower threshold) is an adaptive function of the predicted baseline performance and tolerance.

The output of the detector could be directly used as an event or as another input element in the enhanced network management model and the event translation algorithms described above. Specifically the detector outputs can be translated to specific events/alarms which subsequently can be categorized based on their severity (e.g. critical, major, minor). Those events may trigger the execution of the event translation and capacity analysis algorithm, as shown in our flow diagram, in order to determine the objects that are impacted by the specific event occurrence, as well as the overall impact of the performance degradation of the ‘guilty’ service on the network availability. An instance of the event translation and capacity analysis algorithm (similar to the one presented earlier) can be generated and triggered by the detector output if the real-time service or network performance persistently (for a specific time interval) exceeds the dynamically built performance thresholds and fault criteria, even if the conventional network management systems have not detected any network failures.
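The persistence condition and the severity categorization described above might be sketched as follows. Both functions are hypothetical: the paper does not state how long a violation must persist or how severity is derived, so the consecutive-sample count and the cut-offs (relative to the width of the threshold band) are assumptions for illustration only.

```python
def persistent_violation(samples, lower, upper, min_consecutive):
    """True if at least `min_consecutive` successive samples lie outside
    the band [lower, upper]."""
    run = 0
    for value in samples:
        if value < lower or value > upper:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # the violation did not persist, so start counting again
    return False


def severity(value, lower, upper):
    """Map a measurement to a severity label (assumed cut-offs).

    The excess is the distance outside the band divided by the band width.
    Returns None when the value is inside the band.
    """
    width = upper - lower
    if width <= 0:
        raise ValueError("upper threshold must exceed lower threshold")
    excess = max(lower - value, value - upper, 0.0) / width
    if excess == 0:
        return None
    if excess < 0.25:
        return "minor"
    if excess < 1.0:
        return "major"
    return "critical"
```

Requiring several successive violations before raising an event trades a short detection delay for fewer spurious triggers of the event translation and capacity analysis algorithm.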


Network failures and faults that cannot be detected directly by conventional alarm-based monitoring systems could instead be inferred from their negative impact on the performance of the network. Based on an anomaly detection system for intelligent proactive network control we can detect those types of faults in the network, and by binding that system with the enhanced network management model we presented here we can correlate them to affected services and resources in order to initiate the necessary activities to repair the diagnosed fault. The ‘self-similar’ sampling method can generate stable performance profiles for the different applications dynamically. This implies that the algorithm can be used on-line as the initial filtering and sampling module for network performance data. The algorithm can also reduce network management data flow by filtering out ‘uninteresting’ events. In addition it can be programmed to trigger on anomalous call flows, and hence can be used to search for network anomalies and transaction fraud.

The traffic performances of those applications exhibit marked temporal regularities on various time scales (e.g. weekday, weekend, weekly and holiday bases). For instance, the temporal signatures of weekdays usually resemble each other, while weekends and/or holidays must be classified apart. This implies that performance profiles should be built on a per-application basis. Moreover, because of the temporal regularity of the performance, the profiles can be used to calculate network thresholds for alarming and anomaly detection with minimal use of historical performance data. Those thresholds are based on the historical performance of the network, but they are adaptive to account for the temporally varying performance behavior of the transactions. One of the most important requirements for developing reliable and effective network/service anomaly detection is that the statistical properties of the random observables (e.g. transaction duration, traffic intensity, or rate of byte counts) should be relatively invariant in time (and hence predictable into the future on average). These observables should be relatively predictable with respect to the periodicity of statistical rule updates in anomaly detection, since departures from the statistically ‘normal’ pattern are algorithmically recognized and analyzed as the presence of network/service anomalies. As can be seen from Figure 9, the PDFs of transaction duration of one service class (service class 1) are repeatable on a daily basis for seven consecutive days. This statistical regularity applies in general to all the service classes in transaction-oriented networks.
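One way to check this kind of day-to-day repeatability numerically is to compare normalized histograms of transaction duration. The sketch below is an assumption-laden illustration: the helper names, the one-second bins over 0 to 30 seconds, the empty `HOLIDAYS` set and the use of total variation distance as the similarity measure are all choices made here, not methods taken from the paper.

```python
from datetime import date

HOLIDAYS = set()  # hypothetical: fill with date objects for known holidays


def day_type(d):
    """Group dates so weekdays share a profile and weekends/holidays stay apart."""
    if d in HOLIDAYS:
        return "holiday"
    return "weekend" if d.weekday() >= 5 else "weekday"


def duration_histogram(durations, bin_width=1.0, max_duration=30.0):
    """Normalized histogram (fractions summing to 1) of durations in seconds.

    Durations outside [0, max_duration) are ignored.
    """
    n_bins = int(max_duration / bin_width)
    counts = [0] * n_bins
    for d in durations:
        if 0 <= d < max_duration:
            index = min(int(d / bin_width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no durations fall inside the histogram range")
    return [c / total for c in counts]


def total_variation(p, q):
    """Total variation distance: 0 for identical histograms, 1 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up durations for two similar days and one shifted day.
monday = duration_histogram([1.2, 1.8, 2.5, 2.7, 3.1])
tuesday = duration_histogram([1.1, 1.9, 2.2, 2.9, 3.4])
shifted = duration_histogram([11.2, 11.8, 12.5, 12.7, 13.1])
```

A small distance between days of the same type supports reusing one profile for that day type, while a large distance on a given day is itself a candidate anomaly signal.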

Figure 9. PDFs of transaction duration of service class 1 on seven consecutive days, showing that they are highly repeatable. [Seven panels, ‘Service Class 1; Monday’ through ‘Service Class 1; Sunday’; horizontal axis: Transaction Duration (sec), 0 to 30.]

Conclusions and Future Research

In this paper we presented a methodology for designing and implementing an enhanced network management model for proactive network/service fault detection and management for electronic commerce network environments. In particular, the methodology considers strategies and algorithms for translating detected network faults to all the impacted objects and calculating the availability thresholds. By employing different translation and congestion analysis algorithms we showed how one can provide better insight into the health of the network elements. Higher granularity displays make trouble reporting and subsequent resolution less time consuming, which in turn translates into better customer support. In our ongoing work on this topic, we intend to tie this strategy with a predictive and fault diagnosis system. We believe that this coupling will further enhance our model, and act as an early warning system in a network management platform. Moreover we presented principles and guidelines that can be used in developing an evolutionary network management architecture for on-line services that will overcome the inefficiency and complexity of the existing environments and address key service and technical aspects such as: enabling rapid new service deployment within both the network and network management systems environments, promoting faster service activation, and efficiently managing and distributing the data throughout the network.

We also described how proactive service/network fault detection methods based on real-time performance measurements and dynamically built performance profiles (‘signatures’) can be applied to facilitate the network management and operations processes by providing enhanced and intelligent on-line network analysis and control for value-added on-line services (e.g. financial transaction access services). Those methods and profiles can be used to calculate adaptive temporal thresholds for performance analysis, network/application fault diagnosis, etc. The construction of adaptive thresholds for production networks, as well as the efficient implementation of those methods and algorithms for possible incorporation into the current network management systems used in production networks supporting transactions, are major topics of our current research and development.

Acknowledgments

The authors would like to thank Lawrence Ho and David Cavuto for their contributions to the design and development of algorithms and tools that realize the intelligent On-Line Network Analysis and Control methodology.



References

1. Swanson RH. Emerging technologies for network management. Business Communication Review August 1991; 53–58.
2. Aidarous S, Proudfoot DA, Dam XN. Service management in intelligent networks. IEEE Network Magazine January 1990; 4(1).
3. Aidarous S, Plevyak T. Telecommunications Network Management into the 21st Century. IEEE Press: New York, 1994.
4. Terplan K. Global area network management. IEEE/ACM Transactions on Networking February 1997; 5(1).
5. Terplan K. Communication Networks Management, 2nd edn. Prentice Hall: Englewood Cliffs, NJ, 1992.
6. Terplan K. Effective Management of Local Area Networks. McGraw-Hill: New York, 1992.
7. Veeraraghavan M. Coverage modeling for the design of network management procedures. Network Management and Control, Plenum September 1993; 2: 53–65.
8. Meyer F. On evaluating the performability of degradable computer systems. IEEE Transactions on Computers 1980; 29(8): 720–731.
9. Dixit S. Data rides high on high-speed remote access. IEEE Communications Magazine January 1999; 37: 130–141.
10. Cuffie D, Biesecker K, Kain C, Charleston G, Ma J. Emerging high-speed access technologies. IT Professional March/April 1999; 1: 20–27.
11. Schoen U, Hamann J, Jugel A, Kurzawa H, Schmidtm C. Convergence between public switching and Internet. IEEE Communications Magazine January 1998; 36: 50–65.
12. Jerabek EE. Transaction Access Service III. AT&T Technical Services Description (AT&T Proprietary). September 1996.
13. Feldkhun L, Marini M, Bigaroni V. Integrated customer focused network management architectural perspective. IFIP/IEEE International Symposium on Integrated Network Management, May 1997.
14. Katker S, Paterok M. Fault isolation and event correlation for integrated fault management. Proceedings of the Fifth IFIP/IEEE International Symposium on Integrated Network Management 1997; 583.
15. Jakobson G, Weissman MD. Alarm correlation. IEEE Network November 1993; 52.
16. Huntington-Lee J, Terplan K, Gibson JA. HP OpenView: A Manager’s Guide. McGraw-Hill: New York, 1997.
17. Ho LL, Cavuto DJ, Papavassiliou S, Zawadzki AG. Adaptive and automated detection of network/service anomalies in wide area networks. Submitted 1999.
18. Ho LL, Cavuto DJ, Papavassiliou S, Hasan MZ, Feather F, Zawadzki AG. Adaptive network/service fault detection in transaction-oriented wide area networks. Sixth IFIP/IEEE International Symposium on Integrated Network Management IM’99, May 1999.
19. Willinger W, Taqqu MS, Sherman R, Wilson DV. Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking February 1997; 5(1).


