[ieee comput. soc 7th international conference on computer communications and networks - lafayette,...

Hierarchical Approach to Performance Analysis of Self-Healing SONET Networks

Hakki C. Cankaya and V. S. S. Nair Department of Computer Science and Engineering

Southern Methodist University, Dallas, TX 75275-0122 USA {candan, nair}@seas.smu.edu

Abstract

Performance evaluation of large self-healing SONET mesh networks is of great interest to the telecommunications industry. In this paper, we first adapt a model originally proposed for the reliability analysis of SONET networks to the performance evaluation of such networks. We then propose a hierarchical approach that not only reduces the complexity of the analysis but also improves its accuracy. The sensitivity of the performance metrics to the traffic distribution and to the performance of individual subnetworks is studied experimentally.

1. Introduction

Performance has always been a very important issue in the telecommunication industry. Service providers therefore develop fault tolerance and fault avoidance techniques to keep the performance level of telecommunication networks above a certain value [7]. One such fault tolerance technique is restoration using self-healing mechanisms, and the analysis of self-healing networks is very complex [9]. Networks continue to grow over time and eventually reach a point where they are no longer easy to evaluate. Moreover, the connectivity and the traffic over such a large network are less likely to be homogeneous and uniform. Thus, the analysis of large networks tends to be less accurate.

In a previous study, we introduced a Markov model, called the parametric State Reward Markov Model (SRMM/p), for the reliability and availability analysis of self-healing SONET networks [1] [2] [3] [4]. In this paper, we modify the model to evaluate the expected performance of the communication system and further propose a hierarchical approach for the performance analysis of large non-homogeneous networks.

Section 2 briefly presents the SRMM/p model and discusses the modifications necessary for performance evaluation. In Section 3, we introduce the hierarchical approach. We give some experimental results in Section 4 and conclude the paper in Section 5.

2. The Analysis of Self-Healing SONET Mesh Networks

Self-healing SONET mesh networks are provided with restoration capability, and the performance of such networks is expected to be better than that of networks without it [6] [10]. Therefore, the model used to analyze the performance should accommodate parameters pertinent to restoration. In this section, we use a modified version of the parametric State Reward Markov Model, originally introduced in our previous work [1] [2] [3], to evaluate the expected performance and the expected restoration capabilities of a self-healing mesh network. In the following subsections, we give a brief overview of the model and explain the modifications together with the calculation of the expected performance and restoration values, named performability and restorability, respectively.

2.1. The Model (SRMM/p)

The SRMM/p is a parametric State Reward Markov Model used to analyze the probabilistic behavior of any self-healing mesh network. The parametric feature gives the designer the flexibility to choose the level of detail at which she or he wants to analyze the network. Figure 1 depicts the white-box representation of the entire performance evaluation process. The process has a set of input data fed into the model to initialize the numerical analysis of the communication system behavior. This input data set includes information about the communication demand, approximate network topology, failure and repair events, and restoration data [5].

Figure 1. White-box representation of the performance evaluation process (inputs: demand-, topology-, event-, and restoration-related data; model: M(l, m); outputs: performability and restorability)

0-8186-9014-3/98 $10.00 © 1998 IEEE

The model has three sets of states: functioning states (denoted by $W$), restoration states (denoted by $Q$), and failure states (denoted by $F$). The functioning states are those in which the system is considered to be providing a satisfactory, if not full, amount of service to the users. To accommodate varying levels of performance, the set of functioning states is divided into two subsets: the set of fully functioning states (denoted by $S$) and the set of partially functioning states (denoted by $K$), where $S \cup K = W$ and $S \cap K = \emptyset$. Failure events cause the system to go to restoration states, where the recovery procedure takes place. The rates of link failure events are denoted by $\lambda$ in Figure 1. From the restoration states, depending on the amount of recovery at the restoration completion events, the system goes either to a failure state or to a functioning state. The rate of completion of the restoration events and the coverage for successful recovery are denoted by $\delta$ and $c$, respectively, in Figure 1. Repair completion events not only bring the system back to the working states, but also change the performance level within the working states. The rate of repair completion events is denoted by $\mu$ in the model, as shown in Figure 1. The output of the model consists of the expected performability and restorability.

A 2-dimensional symbolic representation, $M(l, m)$, is used for describing the model parameters. This representation is simply the equivalent of $M'(l, h = 0, m)$, which was originally introduced in [1]. The modification simplifies the performance evaluation process: we eliminate the threshold value $h$ from the original model because we are interested in the whole performance spectrum. The parameter $l$ denotes the number of consecutive link failures that the user wants the model to consider. It also constitutes the number of stages in the model, where each stage accommodates a consecutive failure. The parameter $m$ represents the number of performance levels considered between the fully functioning level and failure at each stage. These performance levels are represented by the partially functioning states (denoted by $K$) in the model. The comprehensive model showing all the states and transitions is given in Figure 2.

We define the overall communication demand $D$ on the system as the sum of all pairwise bidirectional communication demands per unit time through the communication network:

$$D = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij} \qquad (1)$$

where $d_{ij}$ is the pairwise bidirectional communication demand between nodes $i$ and $j$ and $n$ is the number of nodes in the network. The model also considers the performance of the communication system. The performance is defined in $[0, 1.0]$, where the performance value 1.0 means that the communication demand $D$ is fully satisfied. In SRMM/p, every state has its own reward value, denoted by $Rw(\cdot)$, representing the average performance of the system in that state [8]. The reward function over the state set is written as follows:

$$Rw(q) = r, \quad \text{where } q \in \{W \cup Q \cup F\} \text{ and } 0 \le r \le 1 \qquad (2)$$

The detailed calculation of the reward values for all states is given in [1].
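As a small illustration of Equation (1), the following Python sketch sums a pairwise demand matrix over all unordered node pairs. The function name `overall_demand` and the 4-node matrix are hypothetical and chosen only for this example.

```python
from typing import Sequence


def overall_demand(d: Sequence[Sequence[float]]) -> float:
    """Sum d[i][j] over all unordered node pairs i < j, as in Equation (1)."""
    n = len(d)
    return sum(d[i][j] for i in range(n - 1) for j in range(i + 1, n))


# Hypothetical symmetric demand matrix for a 4-node network
# (bidirectional demand units per unit time).
demand = [
    [0, 10, 5, 0],
    [10, 0, 20, 5],
    [5, 20, 0, 15],
    [0, 5, 15, 0],
]

# Pairs counted once each: 10 + 5 + 0 + 20 + 5 + 15
total_demand = overall_demand(demand)
```

Only the upper triangle is read, so each bidirectional demand is counted once even when the matrix is stored symmetrically.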

2.2. Performability and Restorability

In this section, we define two metrics, namely performability and restorability. Performability is the expected performance of a communication system. By the definition of expected value, it incorporates the probabilistic behavior of the system into the performance spectrum that the system covers. Restorability is the expected restoration, that is, the failure recovery per unit time. Since the SRMM/p model captures both the transient and the steady-state probabilistic behavior of the system, we can evaluate performability and restorability in both a time-dependent and a time-independent manner. Transient performability, denoted by $TP(t)$, gives the expected time-dependent performance of the system in satisfying the initial demand $D$. This normalized value is calculated as follows:

$$TP(t) = \sum_{\forall x \in \{W \cup Q \cup F\}} P_x(t) \cdot Rw(x) \qquad (3)$$

where $P_x(t)$ gives the probability of the system being in state $x$ at time $t$. Steady-state performability, denoted by $SP$, gives the expected time-independent performance of the system in satisfying the initial demand:

$$SP = \sum_{\forall x \in \{W \cup Q \cup F\}} \Pi_x \cdot Rw(x), \quad \text{where } \Pi_x = \lim_{t \to \infty} P_x(t) \qquad (4)$$

Figure 2. The SRMM/p (state-transition diagram organized by stage: Stage 0, Stage 1, Stage 2)
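To make Equation (4) concrete, the sketch below evaluates the steady-state performability of a deliberately tiny three-state chain with one functioning state W, one restoration state Q, and one failure state F. This toy chain is an assumption made for illustration and is far smaller than the SRMM/p of Figure 2. The transition structure, the rate values, and the reward values are all hypothetical.

```python
from typing import Dict


def steady_state(lam: float, delta: float, c: float, mu: float) -> Dict[str, float]:
    """Closed-form steady-state probabilities of a toy 3-state chain.

    Assumed transitions: W -> Q at rate lam, Q -> W at rate delta * c,
    Q -> F at rate delta * (1 - c), and F -> W at rate mu.
    """
    q_ratio = lam / delta            # pi_Q / pi_W from the balance equation at Q
    f_ratio = lam * (1.0 - c) / mu   # pi_F / pi_W from the balance equation at F
    pi_w = 1.0 / (1.0 + q_ratio + f_ratio)
    return {"W": pi_w, "Q": pi_w * q_ratio, "F": pi_w * f_ratio}


def steady_state_performability(pi: Dict[str, float], reward: Dict[str, float]) -> float:
    """SP = sum over states x of Pi_x * Rw(x), as in Equation (4)."""
    return sum(pi[x] * reward[x] for x in pi)


# Hypothetical rates (per hour) and reward values in [0, 1].
pi = steady_state(lam=0.001, delta=60.0, c=0.95, mu=0.25)
reward = {"W": 1.0, "Q": 0.5, "F": 0.0}
sp = steady_state_performability(pi, reward)
```

The same weighted sum applies to a full model; only the way the probabilities $\Pi_x$ are obtained changes.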

Restorability is defined in terms of the average amount of recovery and the average restoration duration. Both transient and steady-state restorability, denoted by $TR(t)$ and $SR$, respectively, are obtained by the formulae below:

$$SR = \sum_{\forall Q_x,\; x \in \{S \cup K\}} \frac{E(\mathrm{Recovery}(Q_x))}{[\ldots]} \cdot \Pi_x \qquad (6)$$

where $E(\mathrm{Recovery}(Q_x))$ stands for the expected value of the failure recovery level, in percentage, at restoration state $x$ [1].


3. The Hierarchical Approach

In this section, we propose a hierarchical approach to the performance analysis of large self-healing SONET communication networks. Large networks lack homogeneity in many ways. Topology varies across regions of the network for geographical reasons and because of demand and cost/performance optimality considerations. The capacity of links and network elements changes according to the requested demand in a region. Moreover, it might be necessary to use a restoration mechanism for fault tolerance in some regions of the network by using redundant capacity. The allocation of this redundancy might differ. For example, redundant capacity might be dedicated to certain types of failures as opposed to being shared by a set of failure events. The aim and control of the restoration mechanisms might differ according to the various needs within the network. For example, in some regions, link restoration might be more favorable than path restoration, and centralized control might be a better fit than distributed control. Furthermore, the failure and repair behavior varies across the system. On the other hand, large networks are mostly structured hierarchically, which suits a hierarchical approach. Thus, a hierarchical approach in this case is not only less complex from the computational point of view, but also more accurate.

Figure 3. Multi-layer hierarchy

In our multi-layer hierarchical approach, shown in Figure 3, a large network might have a varying number of levels. At layer 0, we have a single large logical vertex, without any edges, representing the entire network of interest. One layer below, we have the first graph of subnetworks, denoted by $G_1$. This graph has subnetworks as its vertices and inter-subnetwork links as its edges. In the layers below, the same hierarchy continues until it reaches a graph whose vertices are real network nodes and in which no more subnetworks are involved. Such a subnetwork is plotted on layer $n$ of Figure 3. The hierarchical tree corresponding to Figure 3 is given in Figure 4. A simple subnetwork, that is, a graph of network nodes, constitutes a leaf vertex in the tree structure. A large network can be defined in a nested hierarchy as follows:

$$G_i = \{V_i, E_i, D_{G_i}\} \qquad (7)$$

where $V_i$ and $E_i$ are the vertex and edge sets, respectively, and $D_{G_i}$ is the demand matrix.

$$V_i = \begin{cases} \{G_{i.1}, \ldots, G_{i.p}\}, & \text{if } G_i \text{ is a graph of subnetworks;} \\ \{v_{i.1}, \ldots, v_{i.r}\}, & \text{if } G_i \text{ is a simple subnetwork} \end{cases} \qquad (8)$$

$$E_i = \{e_{i.1}, e_{i.2}, \ldots, e_{i.s}\} \qquad (9)$$

Figure 4. Tree structure for the multi-layer hierarchy

The communication demand for the entire network is also given in a hierarchical structure. Each level has its own demand matrix, which defines the demand between each vertex pair. Column X gives the demand between a vertex and its parent, which is called the external demand of the vertex.

Table 1. Demand matrix for network $G_i$
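One possible in-memory representation of the nested definition of a network as a vertex set, an edge set, and a demand matrix is sketched below in Python. The class name `Network`, its field names, and the example values are hypothetical; the paper does not prescribe a data structure beyond noting that a tree is used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Network:
    """One vertex of the hierarchy, holding a vertex set, an edge set, and a demand matrix."""

    name: str
    # Vertex set when this network is a graph of subnetworks.
    children: List["Network"] = field(default_factory=list)
    # Vertex set when this network is a simple subnetwork of real nodes.
    nodes: List[str] = field(default_factory=list)
    # Edge set: inter-subnetwork links or links between real nodes.
    edges: List[Tuple[str, str]] = field(default_factory=list)
    # Demand matrix; the key (v, "X") holds the external demand of vertex v.
    demand: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def is_simple(self) -> bool:
        """A simple subnetwork has no child subnetworks (a leaf of the tree)."""
        return not self.children

    def depth(self) -> int:
        """Number of layers from this network down to its deepest leaf."""
        return 1 + max((c.depth() for c in self.children), default=0)


# Hypothetical two-layer example.
g11 = Network("G1.1", nodes=["v1", "v2", "v3"], edges=[("v1", "v2"), ("v2", "v3")])
g12 = Network("G1.2", nodes=["v4", "v5"], edges=[("v4", "v5")])
g1 = Network(
    "G1",
    children=[g11, g12],
    edges=[("G1.1", "G1.2")],
    demand={("G1.1", "G1.1"): 500, ("G1.1", "G1.2"): 200, ("G1.2", "G1.2"): 500},
)
```

Storing the demand as a dictionary keyed by vertex pairs keeps internal, pairwise, and external entries in one place.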

In the previous section, we defined the performability and the restorability for a subnetwork. With the hierarchical approach, we have a set of subnetworks with an interconnection network at every instance of an intermediate level; therefore, the performability analysis of a subnetwork at an intermediate level depends on the performance values of the subnetworks corresponding to its vertices and of the interconnection mesh network. From the previous calculations, we get the expected performance value, in normalized form, towards the satisfaction of the traffic demand on a subnetwork. For an intermediate layer in the hierarchy, however, the overall expected performance is defined recursively in terms of the performability of the subnetworks on the level below and the interconnection mesh on that layer. The calculation for steady-state performability is given by a set of formulae. The first and the second formulae, $SP_1^{G_i}$ and $SP_2^{G_i}$, calculate the expected internal and external satisfied demand of the individual subnetworks that correspond to the vertices of $G_i$. These are calculated as the weighted sum of the subnetworks' steady-state performability and the corresponding traffic demand.

$$SP_1^{G_i} = \sum_{\forall v \in V_i} SP^{v} \cdot d^{G_i}_{v,v} \qquad (10)$$

$$SP_2^{G_i} = \sum_{\forall v \in V_i} \mathrm{Min}(SP^{v}, SP^{E_i}) \cdot d^{G_i}_{v,X} \qquad (11)$$

The third formula, $SP_3^{G_i}$, analyzes the pairwise expected demand within $G_i$.

$$SP_3^{G_i} = \sum_{\forall v \in V_i} \; \sum_{\forall u \in \{V_i - v\}} \mathrm{Min}(SP^{v}, SP^{u}, SP^{E_i}) \cdot d^{G_i}_{v,u} \qquad (12)$$

The expected performance for the external communication of a subnetwork, or for the communication between a pair of subnetworks, is chosen as the minimum performance among the subnetwork(s) involved and the interconnection network located at that layer. The expected satisfied demand is therefore the product of this minimum expected performance and the traffic demand between the pair. The overall steady-state performability is as follows:
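The per-layer calculation described above (internal, external, and pairwise satisfied demand with the minimum rule) can be sketched in Python as follows. The final division of satisfied demand by total demand is an assumption about how the three terms are normalized, made so that the result stays in [0, 1]; it is not a formula taken from the text. The function name and the example values are hypothetical.

```python
from typing import Dict, Tuple


def level_performability(
    sp_vertex: Dict[str, float],
    sp_edges: float,
    demand: Dict[Tuple[str, str], float],
) -> float:
    """Normalized steady-state performability of one layer.

    demand[(v, v)] is internal demand, demand[(v, "X")] is external demand,
    and demand[(v, u)] is pairwise demand between sibling subnetworks.
    """
    satisfied = 0.0
    total = 0.0
    for (v, u), d in demand.items():
        if u == v:
            perf = sp_vertex[v]                               # internal term
        elif u == "X":
            perf = min(sp_vertex[v], sp_edges)                # external term
        else:
            perf = min(sp_vertex[v], sp_vertex[u], sp_edges)  # pairwise term
        satisfied += perf * d
        total += d
    # Assumption: overall value = satisfied demand / total demand.
    return satisfied / total


# Hypothetical layer with two subnetworks A and B.
demand = {("A", "A"): 500, ("B", "B"): 500, ("A", "B"): 200, ("A", "X"): 100}
sp_layer = level_performability({"A": 0.999, "B": 0.9995}, 0.998, demand)
```

Under this normalization, a layer whose subnetworks and interconnection network all have the same performability returns exactly that value.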

To choose the minimum performance, one has to calculate the expected performance of all the subnetworks at lower layers recursively. However, the calculation for some pairs needs the performance of the same set of subnetworks, which results in solving the same subproblems more than once. In other words, the total number of distinct subproblems is smaller than the number of recursive calls. This situation is known in the literature as overlapping subproblems. We approach it with the tabular dynamic-programming technique, which solves each subproblem once and then stores the solution in a table from which it can be looked up when needed again. This table look-up method is known as memoization: a table with the solutions of subproblems is maintained, and the control structure invokes the computation procedure only if the result does not appear in the table and then stores the result in the table. The procedure for calculating the steady-state performability, $SP(\cdot)$, is given below:

procedure SP(G : {V, E, D_G})
begin
    if TABLE(G) = ∞
        then TABLE(G) := value computed for G from formulae (10)-(13)
    return (TABLE(G));
end procedure

We use a tree structure for $TABLE(G : \{V, E, D\})$; however, any other data structure can be used for the table.
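A minimal Python sketch of the memoized recursion is given below, using a dictionary as the table instead of a tree. The dictionary-based network encoding, the normalization by total demand, and the example values are assumptions made for illustration. External demand (column X) is left out to keep the sketch short.

```python
from typing import Dict


def sp(g: dict, table: Dict[str, float]) -> float:
    """Memoized steady-state performability of the (sub)network g.

    A leaf carries a precomputed "sp" value; an inner vertex carries
    "children", "sp_edges" (its interconnection network), and "demand".
    """
    name = g["name"]
    if name not in table:                      # plays the role of TABLE(G) = infinity
        if "sp" in g:
            table[name] = g["sp"]
        else:
            kids = {c["name"]: c for c in g["children"]}
            satisfied = total = 0.0
            for (v, u), d in g["demand"].items():
                perf = sp(kids[v], table)      # repeated requests hit the table
                if u != v:
                    perf = min(perf, g["sp_edges"], sp(kids[u], table))
                satisfied += perf * d
                total += d
            table[name] = satisfied / total    # assumed normalization
    return table[name]


# Hypothetical three-layer hierarchy: G0 contains G1 and leaf c; G1 contains a and b.
a = {"name": "a", "sp": 0.999}
b = {"name": "b", "sp": 0.9995}
c = {"name": "c", "sp": 1.0}
g1 = {"name": "G1", "children": [a, b], "sp_edges": 0.998,
      "demand": {("a", "a"): 100, ("b", "b"): 100, ("a", "b"): 50}}
g0 = {"name": "G0", "children": [g1, c], "sp_edges": 0.997,
      "demand": {("G1", "G1"): 250, ("c", "c"): 100, ("G1", "c"): 50}}

table: Dict[str, float] = {}
sp_g0 = sp(g0, table)
```

After the call, the table holds one entry per distinct network, even though G1 and its children are requested for several demand entries.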

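For the transient performability, denoted by $TP(\cdot)$, we incorporate the time factor into the same set of formulae.

$$TP_2^{G_i}(t) = \sum_{\forall v \in V_i} \mathrm{Min}(TP^{v}(t), TP^{E_i}(t)) \cdot d^{G_i}_{v,X} \qquad (15)$$

$$TP_3^{G_i}(t) = \sum_{\forall v \in V_i} \; \sum_{\forall u \in \{V_i - v\}} \mathrm{Min}(TP^{v}(t), TP^{u}(t), TP^{E_i}(t)) \cdot d^{G_i}_{v,u} \qquad (16)$$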

The average steady-state restorability of an intermediate subnetwork $G_i$, denoted by $SR^{G_i}$, is calculated from the restorability of its children and of the interconnection network, weighted by the traffic to which each one is exposed. The calculation is given by the set of formulae below:

The average transient restorability is calculated in a similar way by incorporating the time factor as follows:


4. Experimental Study


In this section, we show numerically how the performance of subnetworks and interconnection networks affects the performability and restorability of a large network. We also observe the effect of the traffic-demand distribution on the large network. For this purpose, we run a set of experiments on an example network whose hierarchical structure is given in Figure 5, with the demand matrix given in Table 2.

Figure 5. An example hierarchy

Table 2. Demand matrix for subnetwork $G_1$

D_G1  | G1.1 | G1.2 | G1.3 | G1.4 | G1.5
G1.1  | 500  | 200  | 200  | 200  | 200
G1.2  | 0    | 500  | 200  | 200  | 200
G1.3  | 0    | 0    | 500  | 200  | 200
G1.4  | 0    | 0    | 0    | 200  | 200
G1.5  | 0    | 0    | 0    | 0    | 200

The first experiment illustrates the effect of a subnetwork and of an interconnection network on the performability of the large network. We first fix the performability of every simple subnetwork and of the interconnection network at 0.9995 and calculate the performability of the large network, which in this case cannot differ from 0.9995. We then vary the performability of one of the subnetworks from 0.998 to 1.0 and plot the result in Figure 6 with a dotted line. The performability of the large network increases with a constant slope up to 0.9995, and the slope changes for higher values of the varying subnetwork performability. The slope decreases after 0.9995 because of the minimum function: the performance in satisfying any pairwise traffic demand can be at most the smallest performability among the subnetworks involved in the transmission. In the same manner, we vary the performability of the interconnection network and obtain the result given by the solid line in Figure 6. Both curves meet at 0.9995 because at this point all the performance values are the same. Beyond this point, the performability of the large network does not vary, because the interconnection network is involved only in pairwise transmissions, for which the minimum is always 0.9995, the performability value of the subnetwork pair. Consequently, it does not matter if the interconnection network performs better.
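The qualitative behavior described for this experiment can be reproduced with a short Python sketch built on the demand matrix of Table 2. The aggregation used here (minimum rule per demand entry, then division by total demand) is an assumed normalization, so the numbers are illustrative and are not the values plotted in Figure 6.

```python
from typing import List

# Demand matrix of Table 2 (rows and columns G1.1 to G1.5, upper triangular).
DEMAND = [
    [500, 200, 200, 200, 200],
    [0, 500, 200, 200, 200],
    [0, 0, 500, 200, 200],
    [0, 0, 0, 200, 200],
    [0, 0, 0, 0, 200],
]
BASE = 0.9995


def network_sp(sp_sub: List[float], sp_inter: float) -> float:
    """Demand-weighted performability using the minimum rule (assumed normalization)."""
    satisfied = total = 0.0
    for i, row in enumerate(DEMAND):
        for j in range(i, len(row)):
            d = row[j]
            perf = sp_sub[i] if i == j else min(sp_sub[i], sp_sub[j], sp_inter)
            satisfied += perf * d
            total += d
    return satisfied / total


def vary_subnetwork(x: float) -> float:
    """Performability of the large network when only G1.1 is set to x."""
    return network_sp([x] + [BASE] * 4, BASE)


def vary_interconnection(x: float) -> float:
    """Performability of the large network when only the interconnection is set to x."""
    return network_sp([BASE] * 5, x)


# Finite-difference slopes of the subnetwork curve below and above 0.9995.
slope_below = (vary_subnetwork(0.9995) - vary_subnetwork(0.9985)) / 0.001
slope_above = (vary_subnetwork(1.0) - vary_subnetwork(0.9995)) / 0.0005
```

Below 0.9995, every demand entry involving G1.1 follows its performability, whereas above 0.9995 only its internal demand does, which is why `slope_above` is smaller than `slope_below`. Raising the interconnection performability above 0.9995 leaves the result unchanged.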

Figure 6. The effect of a subnetwork and an interconnection network on steady-state performability

We conduct another experiment to observe the effect of the traffic distribution on the performance of the large network. For this experiment, we take the subnetwork G1.1 and define two traffic distribution parameters, $a$ and $b$. Parameter $a$ represents the share of the entire system traffic in which the subnetwork G1.1 is involved and is defined as:

The second parameter, $b$, gives the share of the inner traffic of G1.1 within the entire traffic in which it is involved. This parameter is defined for the example network in Figure 5 as:

In the experiment, we fix the expected performance of all the subnetworks except G1.1 at 0.999 and that of the interconnection network at 0.992. The expected performance of G1.1 is 1.0. Figure 7 gives the performability of the large network for values of $a$ and $b$ varying in (0, 1.0]. In the analytical expression of the performability given previously, the expected performance used for a transmission between any pair is the minimum expected performance of the subnetwork pair and the interconnection network in between (see formula (16)). This minimum effect is driven by $b$, where $b = 0$ means that all the transmissions in which the subnetwork G1.1 is involved take place between itself and its sibling subnetworks on the same layer through the interconnection network. For $a = 0$, there is no traffic activity involving the subnetwork G1.1; therefore, G1.1 does not have any effect on the performability, which can be seen as the horizontal line at the back of the 3D plot in Figure 7. As the traffic involvement increases with $a$ while there is no inner G1.1 traffic ($b = 0$), the performability decreases because the performance of these transmissions is limited by the interconnection network, which has the minimum performance value. On the other hand, as $b$ increases for any non-zero value of $a$, the performability of the system increases because of the lesser involvement of the interconnection network in the transmissions and the higher performability value of G1.1. When both parameters reach 1.0, G1.1 holds all the traffic activity within itself. This means that G1.1 constitutes the entire network in terms of service; therefore, its performance, which was originally given as 1.0, becomes the eventual performance of the entire system, as seen in Figure 7.
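The two traffic parameters can be illustrated numerically from their verbal definitions. The Python sketch below computes $a$ and $b$ for G1.1 from the demand matrix of Table 2; the expressions are a reading of the prose definitions (involved traffic over total traffic, and inner traffic over involved traffic), and the function name is hypothetical.

```python
from typing import List, Tuple

# Demand matrix of Table 2 (rows and columns G1.1 to G1.5, upper triangular).
DEMAND = [
    [500, 200, 200, 200, 200],
    [0, 500, 200, 200, 200],
    [0, 0, 500, 200, 200],
    [0, 0, 0, 200, 200],
    [0, 0, 0, 0, 200],
]


def traffic_parameters(demand: List[List[float]], k: int) -> Tuple[float, float]:
    """Return (a, b) for the subnetwork with index k.

    a: traffic involving subnetwork k divided by the entire traffic.
    b: inner traffic of subnetwork k divided by the traffic involving it.
    """
    n = len(demand)
    total = sum(sum(row) for row in demand)
    # Row k plus column k counts the diagonal entry twice, so subtract it once.
    involved = sum(demand[k][j] + demand[j][k] for j in range(n)) - demand[k][k]
    return involved / total, demand[k][k] / involved


a, b = traffic_parameters(DEMAND, 0)  # a = 1300 / 3900, b = 500 / 1300
```

For the matrix of Table 2, G1.1 takes part in one third of the total demand, and a little over 38% of that share is inner traffic.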

Figure 7. The effect of traffic distribution on the steady-state performability

We observe the transient performability of the large network in another experiment. In Figure 8, the solid curve represents the transient performability when all the subnetworks and the interconnection network are working. The dotted curve depicts the performability when one of the identical subnetworks is not working; as expected, the transient performability decreases. The dash-dot curve represents the case in which the interconnection network is not working. Here the transient performability is even lower because of the frequent involvement of the interconnection network in the transmissions in the network.

Figure 8. The effect of a non-working subnetwork and a non-working interconnection network on the transient performability (horizontal axis: time in hours, 0 to 500)

The restorability calculation for the large network is simply the weighted average of the restorability of the individual simple subnetworks and the interconnection network (see formulae (20) and (21)). The parameter $a$ therefore has a different effect on the restorability of the large network, because $a$ is not involved in a minimum function. As a result, the surface in Figure 9 has another horizontal line for the varying values of $a$. The final steady-state restorability always increases with increasing $a$ and $b$, but at different rates.

Figure 9. The effect of traffic distribution on steady-state restorability

An observation on transient restorability is given in Figure 10. The solid line shows the case where all the subnetworks are restoring. The results without one of the subnetworks and without the interconnection network working are depicted by the dotted and the dash-dot lines, respectively. The experiment is similar to the transient performability experiment. The only difference is that the result for the one-subnetwork case (dotted line) is closer to the result for the all-subnetwork case (solid line), because the malfunction of the subnetwork is not propagated by the minimum function, which would cause a larger decrease in transient restorability.

Figure 10. The effect of a non-restoring subnetwork and a non-restoring interconnection network on the transient restorability

5. Conclusion

In this paper, we adapted a previously introduced reliability model to the performance evaluation of self-healing SONET networks and defined the performability and restorability metrics. We then introduced a hierarchical approach that not only reduces the complexity of the analysis but also improves its accuracy. Finally, we studied the effect of subnetwork performance and traffic distribution on the performability and restorability of these large networks. The numerical results showed that the hierarchical approach alleviates the complexity, gives more accurate results, and is sensitive enough to subnetwork performability/restorability and to the traffic distribution to be used to evaluate large networks.

References

[1] H. C. Cankaya and V. S. S. Nair. Reliability and availability evaluation of self-healing SONET mesh networks. In IEEE Proceedings of GLOBECOM'97, volume 1, pages 252-256. IEEE Com. Soc., 1997.

[2] H. C. Cankaya and V. S. S. Nair. Survivability metrics for self-healing SONET mesh networks. In Proceedings of ISCIS'XII, volume 1, pages 269-276. Bogazici University, Bogazici University Press, 1997.

[3] H. C. Cankaya and V. S. S. Nair. Accelerated reliability analysis for self-healing SONET networks. To appear in the Proceedings of ACM SIGCOMM'98, September 1998.

[4] H. C. Cankaya and V. S. S. Nair. Survivability analysis of self-healing SONET rings. IEEE Com. Soc., 1998. To appear in the IEEE Proceedings of GLOBECOM'98.

[5] C. E. Chow, J. Bicknell, and S. Syed. Performance analysis of fast link restoration algorithms. Journal of Communication Systems, 8:325-345, 1995.

[6] W. D. Grover. The Selfhealing™ network: a fast distributed restoration technique for networks using digital cross-connect machines. In IEEE Proceedings of GLOBECOM'87, pages 1090-1095, 1987.

[7] W. G. on Network Survivability Performance. A Technical Report on Network Survivability Performance. Technical Report T1A1.2/93-001R3, Technical Committee T1, 1993.

[8] R. A. Sahner, K. S. Trivedi, and A. Puliafito. Performance and Reliability Analysis of Computer Systems. Kluwer Academic Publishers, first edition, 1995.

[9] J. Sosnosky. Service applications for SONET DCS distributed restoration. IEEE Journal on Selected Areas in Communications, 12(1):59-68, January 1994.

[10] C. H. Yang and S. Hasegawa. FITNESS: failure immunization technology for network service survivability. In IEEE Proceedings of GLOBECOM'88, pages 1549-1554, November 1988.
