Cost-Effective Error Detection Codes in Multicomputer Networks



North-Holland, Microprocessing and Microprogramming 20 (1987) 331-343


Miroslaw Malek and Kitty Hiu Yau
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712, U.S.A.

Several error detection codes (EDCs) applicable to multicomputer networks are analyzed and compared in detail with respect to cost and performance. A general guideline is provided for making the best choice for a variety of multicomputer networks. EDCs may be applied to any type of computer network and are especially useful in fault-tolerant networks where recovery actions are triggered by error detection. Making a particular code choice requires consideration of the actual network environment and specific performance requirements. The banyan network is used as an example to demonstrate the multi-dimensional tradeoff involved in selecting the most cost-effective EDC for a multicomputer network.

Keywords: Error detection codes, Error coverage, Redundancy, Random errors, Unidirectional errors, Fault-tolerant networks, Recovery, Retry, Reconfiguration.

1. Introduction

The ever increasing demand for large computing power has made multicomputers a necessity. Multicomputer systems can provide parallel or distributed processing, and therefore offer high speed computation and resource sharing. As the complexity of the problems tackled continues to increase, so does the size of computer systems. This means that a higher degree of fault tolerance is required to prevent severe performance degradation caused by any system fault. Since multicomputer systems rely heavily on interconnection networks for communication, the performance and reliability of the network constitute a crucial part of today's supersystems. Therefore, extensive research has been conducted to develop fault-tolerant multicomputer networks more effectively and economically.

This research was supported in part by the Department of Energy Grant #DE-AS05-81 ER10987 and the Office of Naval Research

A multicomputer system can be classified as either loosely coupled or tightly coupled. The former aims mainly at resource sharing, while the latter's main goal is to gain parallelism and achieve high speed computation. Because their average communication requirements differ, these two types of systems require different networks. Loosely coupled multicomputers are usually connected by simpler networks such as buses or rings, for which the fault tolerance problem is easier to handle. However, for large tightly coupled multicomputers, the class of multistage interconnection networks (MINs) seems to be one of the very few feasible solutions with current technology [1, 2]. A problem inherent to a large MIN is that the probability of having a component failure is significant. This makes fault-tolerant design indispensable.

A fault-tolerant MIN can be classified as a multipath or a multipass network. The first category includes duplicated, multiple modular redundant, and extra-stage networks, which provide more than one path between any pair of computer modules (CMs). The second category contains those networks in which communication between a pair of CMs involves some intermediary CM(s). Compared to other fault-tolerant MINs, the class of extra-stage networks provides a relatively high degree of fault tolerance at a small hardware cost [3]. We therefore refer to the extra-stage banyan network as a specific example to demonstrate the process of EDC selection for multicomputer networks.

In general, fault tolerance may be achieved by fault masking or by fault detection followed by recovery. The former provides faster correction while the latter requires less redundancy. Nevertheless, the detection-recovery approach offers more flexibility because, with careful planning, it can handle permanent and transient faults in different ways so that less hardware redundancy or recovery time is incurred. This advantage becomes more apparent if we observe that in today's VLSI systems a very high percentage of on-line faults are transient (usually ranging from 90% to 99%). By distributing retry capabilities to lower level components, fast recovery is accomplished for these faults. The much smaller number of permanent faults can be recovered by network reconfiguration. To reduce fault distribution and system contamination, prompt error detection and recovery are very important. Although error correcting codes serve this purpose, their cost is usually prohibitive. On the other hand, EDCs provide real-time error detection, which is essential in minimizing fault latency in multicomputer networks. Combined with an efficient error recovery scheme, error propagation can be prevented at a much lower cost. In data transmission, fault masking is usually implemented with error correcting codes (ECCs). Since ECCs treat permanent and transient faults equally, they inherently require much higher redundancy to obtain a certain fault coverage. On the other hand, even a well planned recovery scheme equipped with retry and reconfiguration facilities will be worthless without sufficient error detection capability, because recovery action cannot take place before an error is detected. Therefore, finding a high coverage EDC is essential for any recovery scheme in which the efficiency of remedial actions depends heavily on error detection. This paper focuses on the comparison and discussion of tradeoffs among several well known EDCs in a multicomputer network environment. The data furnished may prove useful in the code selection process for multiprocessor interconnection networks.

In the next section, the banyan network is briefly described. The tradeoff between the horizontal and longitudinal error detection schemes is discussed in Section 3, and a VLSI fault model and the associated error classes are presented in Section 4. Section 5 contains a description of several EDC candidates and detailed comparisons among them with respect to fault coverage, bit and time redundancy, checker complexity, and delays. Some guidelines for making code choices are also provided, and a specific example is given using an extra-stage banyan network. Finally, conclusions are given in Section 6.

Fig. 1. A (2,3) SW-banyan network.

2. Banyan Network and Fault-Tolerant Schemes

2.1 Extra-Stage Banyan Network

The network referred to in this paper is a rectangular SW-banyan, which provides a single path between every pair of CMs. The network is built from f × s crossbar elements, which are connected in a shuffle fashion to ensure exactly a single path between every pair of CMs. For N CMs to be connected (N is a power of f, or of s, for f = s), $\log_f N = L$ levels are required. The notation used in this paper, (f,L)-banyan, indicates that a particular banyan is built of f × f crossbar switches, has L levels, and connects $f^L$ resources. An example of a (2,3)-banyan composed of twelve 2 × 2 crossbar switches arranged in three levels is shown in Fig. 1. A banyan with 2 × 2 switching elements is also known in the literature as an omega, baseline, indirect binary cube, or butterfly network. Generally, each computer module contains a processor, a sharable memory, an I/O device, and a network interface unit (Fig. 2).

Fig. 2. A computer module (CM). [Block labels: processor, memory, interface unit, local bus.]

Fig. 3. Two disjoint paths from CM (1100) to CM (0111) in an ESB (2,5).

An extra-stage banyan (ESB) network with fanout or spread equal to f and (L+1) stages or levels is denoted by ESB (f, L+1). It has f disjoint paths between any pair of CMs and therefore can tolerate up to (f-1) permanent faults. Fig. 3 shows, in graph representation, the two disjoint paths between two CMs in an ESB (2,5) network. An ESB (f, L+1) network interconnects N CMs, where $N = f^L$, and has (L+1) stages. Each stage of the network contains N/f switching elements that are internally f × f crossbars. Communication is unidirectional and can be carried out in packet or circuit switching mode. Determining a suitable value for f involves a cost-performance tradeoff. It is suggested in [4] that f = 4 is the most cost-effective choice. This is supported by the analysis and simulation presented in [5]. Fig. 4 shows an ESB (4,3) network that is composed of twelve 4 × 4 switching elements.
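The structural relations stated above ($N = f^L$, L+1 stages, N/f switches per stage, f disjoint paths) can be tabulated with a short sketch. The function name `esb_parameters` and the dictionary keys are illustrative choices, not notation from the paper.

```python
def esb_parameters(f: int, L: int) -> dict:
    """Structural parameters of an ESB (f, L+1) network, using the relations in the text."""
    modules = f ** L                      # N = f^L computer modules
    stages = L + 1                        # L banyan levels plus one extra stage
    switches_per_stage = modules // f     # N/f crossbars of size f x f in each stage
    return {
        "modules": modules,
        "stages": stages,
        "switches_per_stage": switches_per_stage,
        "total_switches": stages * switches_per_stage,
        "disjoint_paths": f,
        "tolerated_permanent_faults": f - 1,
    }


esb_4_3 = esb_parameters(4, 2)   # ESB (4,3): the network of Fig. 4
esb_2_5 = esb_parameters(2, 4)   # ESB (2,5): the network of Fig. 3
print(esb_4_3)
print(esb_2_5)
```

For ESB (4,3) this gives 16 modules and 3 stages of 4 switches each, which matches the twelve 4 × 4 switching elements quoted for Fig. 4.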

An example of the fault recovery scheme described below involves graceful degradation of network performance. The scheme does not cause system operation to terminate when a fault occurs. Instead, the system continues to operate at a reduced level during the recovery period.

Statistics show that in many system operations, transient or intermittent faults predominate. In VLSI systems, they usually account for 90% to 99% of all detected faults. The recovery scheme therefore puts top priority on combating soft errors. This can be efficiently handled by a retry mechanism incorporated in the switch design, provided that adequate fault detection mechanisms are incorporated in the switch.

Fig. 4. An ESB (4,3) network made of 4 × 4 switches.

We assume that the switching element design used in the banyan network is based on the design described in [4], except that error detection and recovery capabilities are added. A message is sent as s consecutive w-bit wide packets, where s and w determine the message size and are system dependent. In this paper, we have assumed an 8-bit wide data bus and a variable message size of 32, 64, or 128 bits. Therefore, w = 8 and s = 4, 8, or 16.

Every message is checked at each switch through which it passes. If an error is detected, the receiving channel will not release the sending channel in the preceding switch. This triggers the sender to make a few retry attempts. A retry is automatically made by the sending channel if it is not released after a certain timeout period. If an error cannot be overcome after a chosen number of retries, a permanent fault is assumed. Reconfiguration is then required to bypass this fault if an ESB is employed, or multipass is required if a regular banyan is used. Because reconfiguration is usually implemented at the CM level, it takes longer and incurs more severe performance degradation. However, since only a small percentage (usually less than 10%) of faults are permanent, reconfiguration is not expected to occur often enough to cause a serious impact on performance. Obviously, error recovery ability depends on error detection coverage because remedial actions are initiated only by error indications. This motivated the quest for a high coverage EDC. Finding the most cost-effective EDC for a given network is the main goal of this paper.
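The retry-then-reconfigure policy described above can be sketched as follows. This is only a behavioral model under stated assumptions: `send_once` is a hypothetical callback standing in for one transmission plus the receiving switch's EDC check, and the default of three retries is an arbitrary placeholder for the paper's "chosen number of retries".

```python
from typing import Callable, Iterable


def forward_with_retry(send_once: Callable[[], bool], max_retries: int = 3) -> str:
    """Try a transfer, retrying on detected errors; escalate if all retries fail.

    send_once() returns True when the receiving channel detects no error and
    releases the sender, and False when an error is detected (sender times out).
    """
    for _ in range(1 + max_retries):      # first attempt plus the retries
        if send_once():
            return "delivered"
    return "reconfigure"                  # permanent fault assumed


def make_link(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical link model: replays the given outcomes, then always fails."""
    it = iter(outcomes)
    return lambda: next(it, False)


transient = forward_with_retry(make_link([False, False, True]))   # fault clears on 3rd try
permanent = forward_with_retry(make_link([]))                      # every attempt fails
print(transient, permanent)
```

A transient fault that clears within the retry budget is handled locally at the switch, while a persistent failure is escalated to the slower CM-level reconfiguration.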


3. Horizontal versus Longitudinal Scheme

3.1 Horizontal Scheme

In a separable EDC, if the check bits are formed over a horizontal word and are concatenated to the data bits, it is called the horizontal scheme. In this scheme the check bits induce the so called space or device redundancy because they require additional data links and pins for transmission.

3.2 Longitudinal Scheme

If the check bits in a separable EDC are formed over a serially transmitted bit stream and are appended to the end of the data stream, the code is implemented by the longitudinal scheme. In this case, the check bits do not require extra data links or pins, but they increase the message transmission time and thus incur the so called time redundancy.

3.3 Space-Time Tradeoff

The choice between the above two schemes demonstrates a space-time tradeoff. Generally speaking, a system requiring a very high network bandwidth would prefer the horizontal scheme, whereas a system under severe packaging constraints and pin limitations would favor the longitudinal scheme. Nevertheless, pin limitation is often a major constraint in VLSI design, making the longitudinal scheme a more favorable choice. This is also the case for the ESB (4, L+1) network.

4. Fault Model and Error Classes in VLSI Networks

4.1 Fault Model

The ESB (4, L+1) network and other large MINs will be more efficiently implemented using VLSI technology. The fault model considered must thus include those faults that are typical in VLSI and wafer scale integration. There are generally three groups of faults in an ESB network: (1) control logic faults, which include faults in the switch control and some buffers; (2) error detection and recovery logic faults; (3) data logic faults, which may occur in data buffers or data links. There are both internal links (data paths on the switch chip) and external links (cables or wires interconnecting the switches).

The first two groups of faults can be tackled by incorporating fault-tolerant design into the switch control and error handling logic. This can be accomplished by employing totally self-checking (TSC) techniques. Besides duplication or triplication, more desirable TSC logic can be designed to reduce the required hardware redundancy. Some examples of totally self-checking checkers are given in [6, 7]. In this paper we are mainly concerned with the third group of faults. Since the EDC selection for a network is based on the assumed fault model, the model must be realistic and should cover all faults that can be anticipated. Faults on the data links can be either permanent or transient. Permanent faults are usually caused by severe environmental stress or component wearout, and will affect all data bits transmitted through the faulty line(s). Transient faults may be caused by a variety of sources such as alpha particles and some unknown elements. They usually produce unpredictable error effects. However, impulse noise is a major source of transient faults in many systems. The cause may be power fluctuation, overheating, dirty electrical contacts, loose connections, or many others. Field study shows that the duration of these faults usually falls in the range of microseconds to milliseconds. Since the transmission time of a typical message (32 to 128 bits) in the VLSI ESB network is in the range of tens to hundreds of nanoseconds, a single transient fault is very likely to affect a number of consecutive bits and cause adjacent-bit unidirectional errors in the longitudinal scheme.

4.2 Error Classes

Eliminating errors from system operation involves both preventive and remedial actions. Error prevention requires knowing the causes, while error recovery requires knowing the effect or behavior of the errors. Errors with the same pattern may result from different causes, but are covered by the same EDC. On the other hand, the same cause may result in different error patterns in horizontally and longitudinally covered data packets. For example, a transient fault on a single line may affect several consecutive bits and cause one adjacent-bit unidirectional error in the longitudinal scheme but several single-bit errors in the horizontal scheme. Conversely, a bridge fault on some adjacent lines will cause several single-bit errors in the longitudinal scheme but one adjacent-bit unidirectional error in the horizontal scheme. Since we are dealing with error detection and recovery in this paper, errors are classified according to their effect. The various classes of errors can be related as follows.

$\{\text{single-bit error}\} \subset \{\text{adjacent-bit unidirectional error}\} \subset \{\text{unidirectional error}\} \subset \{\text{random multiple-bit error}\}$

This relationship holds for both horizontal and longitudinal schemes and can be depicted by the set diagram shown in Fig. 5.

Fig. 5. Relationship among different error classes.
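The nesting of the error classes can be illustrated by a small classifier that reports the most specific class an observed error falls into. The precise definition of an adjacent-bit unidirectional error is not spelled out in this section, so the sketch assumes it means that all flipped bits change in the same direction and occupy contiguous positions; `classify_error` is a hypothetical helper, not part of the paper.

```python
def classify_error(sent: str, received: str) -> str:
    """Return the most specific error class for a sent/received pair of bit strings."""
    if len(sent) != len(received):
        raise ValueError("words must have equal length")
    # Record (position, original bit, received bit) for every flipped position.
    flips = [(i, s, r) for i, (s, r) in enumerate(zip(sent, received)) if s != r]
    if not flips:
        return "no error"
    if len(flips) == 1:
        return "single-bit"
    directions = {(s, r) for _, s, r in flips}
    if len(directions) > 1:               # both 0->1 and 1->0 flips are present
        return "random multiple-bit"
    positions = [i for i, _, _ in flips]
    contiguous = positions[-1] - positions[0] + 1 == len(positions)
    return "adjacent-bit unidirectional" if contiguous else "unidirectional"


print(classify_error("1011", "1001"))
print(classify_error("11110000", "10000000"))
print(classify_error("1111", "0110"))
print(classify_error("1100", "1010"))
```

Because the classes are nested, an error reported as single-bit is also a member of every larger class, which is why a code that covers an outer class automatically covers the inner ones.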

If there exists an EDC that can detect all random multiple-bit errors, it will have perfect coverage for all error classes. However, perfect detection is never possible with limited resources. Our goal is therefore to ensure that the majority of errors can be detected by the selected EDC.

5. Comparison of EDC Candidates

Several EDCs are selected as candidates after a preliminary evaluation of their coverage and cost. They constitute a spectrum of code choices, ranging from the least expensive single parity to the more costly Berger and checksum codes, with a general increase in error detection coverage. These EDCs then become competitors in our code selection process. For a given multicomputer network, the most cost-effective EDC in that particular environment becomes the winner. All candidates considered here are separable EDCs because they require minimized and simplified hardware for encoding/decoding and checking the data bits. The error detection logic for each code can be generally depicted by the block diagram shown in Fig. 6.

Fig. 6. Block diagram for EDC checkers. [Block labels: data bits and check bits of a horizontal word or longitudinal bit stream, check-bits generator, comparator, error indication.]

The figure of merit or cost-effectiveness of each EDC is measured in terms of cost and performance. In turn, cost is measured by redundant links and pins, checker logic, transmission delay, and checker delay, while performance is measured by error detection capability or coverage. It is worth noting that the time redundancy incurred by checker delay or check-bit transmission delay (in the longitudinal scheme) can be counted either as increased cost or as decreased performance, or may even be considered as a separate dimension. Counting time redundancy as part of the cost is simply a choice we made to ease comparison of the EDCs. Performance can be evaluated by two measures of coverage: (1) general coverage, which is more qualitative and usually specifies the classes of errors that are detectable. It may also include the error detection probability (or percentage) for various classes of errors; (2) explicit coverage, which is the probability of an arbitrary error being detectable. This is the average of the coverages for all possible classes of errors, weighted by their respective probabilities of occurrence. It is very difficult to obtain because the relative probabilities of different error classes are implementation dependent and are usually hard to predict. In practical situations, assumptions often have to be made regarding the possible failure modes and the probabilities of various error classes. General coverage is used in this paper.

5.1 EDC Selection

If we define the cost of an EDC to include the hardware redundancy incurred by the checkers and the extra storage, links, and pins for check bits, and the time redundancy incurred by the checker delay and check-bit transmission delay, the most cost-effective EDC for a particular network can then be defined as the code that provides the highest coverage for the predominant faults/errors among all the candidates that have similar cost. Since the prevalent faults vary from system to system, the code selection is system dependent. Codes can only be chosen on an equal opportunity basis (all error classes weighted equally) if details of a system environment are unknown. This choice is usually not optimal because in reality some error classes are often more probable than others. Different weights should be given to the coverage of various errors to reflect some errors' higher frequency of occurrence. For example, the coverages of local Berger code and 3-group interlaced parity code (see the next section for definitions) for different error classes are depicted in Table 1.

Table 1. Coverage (%) comparison between Berger and 3-group interlaced parity codes.

Code                                    Adjacent-bit      Random    Random
                                        unidirectional    2-bit     3-bit
Berger code (BG)                        100               72.73     95.91
3-group interlaced parity (IP, c = 3)   93.5              74.55     100

If all errors are assumed to be equally probable, the explicit coverages are computed as follows.

$C(BG) = \frac{1}{3}(100 + 72.73 + 95.91) = 89.55$
$C(IP) = \frac{1}{3}(93.5 + 74.55 + 100) = 89.35$
$C(BG) \approx C(IP)$

However, in VLSI systems, especially those using longitudinal EDCs, adjacent-bit unidirectional errors have much higher probability than random errors. If we take a more realistic approach and assume the probabilities of the above three types of errors to be 0.7, 0.2, and 0.1 respectively, then the coverages are computed again as follows.

$C(BG) = 0.7 \times 100 + 0.2 \times 72.73 + 0.1 \times 95.91 = 94.14$
$C(IP) = 0.7 \times 93.5 + 0.2 \times 74.55 + 0.1 \times 100 = 90.36$
$C(BG) > C(IP)$

In this case, Berger code has noticeably better coverage than 3-group interlaced parity (94.14% versus 90.36%).

In making a choice among several EDCs, all factors affecting cost or performance should be considered. These include coverage, check-bit redundancy (extra links and pins for horizontal codes, transmission delay for longitudinal codes), checker delay, and checker hardware. Before making detailed comparisons among the EDC candidates, the general properties and special features of each code are described.
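The explicit-coverage arithmetic of Section 5.1 is a probability-weighted average, and it can be reproduced with a few lines. The per-class coverages are taken from Table 1; the names `COVERAGE` and `explicit_coverage` are illustrative only.

```python
# Per-class coverage in percent, ordered as in Table 1:
# (adjacent-bit unidirectional, random 2-bit, random 3-bit)
COVERAGE = {
    "BG": (100.0, 72.73, 95.91),   # local Berger code
    "IP": (93.5, 74.55, 100.0),    # 3-group interlaced parity
}


def explicit_coverage(per_class, weights):
    """Average of per-class coverages weighted by class probabilities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    return sum(c * p for c, p in zip(per_class, weights))


equal = (1 / 3, 1 / 3, 1 / 3)      # all three error classes equally probable
skewed = (0.7, 0.2, 0.1)           # adjacent-bit unidirectional errors dominate

for code, per_class in COVERAGE.items():
    print(code,
          round(explicit_coverage(per_class, equal), 2),
          round(explicit_coverage(per_class, skewed), 2))
```

Changing only the weights moves the two codes from a near tie to a gap of almost four percentage points, which is the point of the example: the ranking depends on the assumed error distribution.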

5.2 Code Description

A separable EDC is denoted by (n, k), where n is the number of bits in the entire codeword and k is the number of information bits. The number of check bits is therefore c, where c = n - k. The five EDCs selected as code candidates for discussion are described as follows.

5.2.1 Single Parity Code

A single parity codeword is constructed by adding an extra bit to the original data word such that the resulting word contains an even number of 1's (even parity) or an odd number of 1's (odd parity). The parity bit can either be concatenated horizontally to a data packet or appended longitudinally to the end of a bit stream. The former requires a checker consisting of an n-bit XOR tree that takes $d \log_2 n$ gate delays (d is the number of gate delays for an XOR gate), while the latter uses a serial parity checker formed by a single XOR gate followed by a single memory cell (flip-flop). This checker checks the data bits as they are shifted in, and therefore incurs only d gate delays for checking. The choice between even and odd parity is affected by the codeword length, n, and the transmission link failure mode. If n is even, odd parity is the clear choice because it detects both the all-0s and all-1s failure modes. If n is odd, the choice depends on whether the all-0s or the all-1s error is more probable. Odd parity detects the all-0s error whereas even parity detects the all-1s error. Single parity is one of the least expensive forms of error detection. It detects all single-bit errors as well as all multiple-bit errors affecting an odd number of bits.

Fig. 7. 2-group interlaced parity code.
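A minimal software model of single parity, assuming bits are held as Python lists of 0/1 integers, is sketched below. The serial checker mimics the longitudinal XOR-gate-plus-flip-flop arrangement; the function names are illustrative.

```python
def encode_parity(data_bits, odd=False):
    """Append one parity bit so the codeword has an even (or odd) number of 1's."""
    parity = sum(data_bits) % 2          # bit that makes the total count even
    if odd:
        parity ^= 1                      # odd parity uses the complement
    return list(data_bits) + [parity]


def serial_parity_ok(codeword, odd=False):
    """Longitudinal-style check: one XOR accumulating into one flip-flop."""
    state = 0                            # flip-flop contents
    for bit in codeword:                 # bits are folded in as they arrive
        state ^= bit
    return state == (1 if odd else 0)


word = encode_parity([1, 0, 1, 1, 0, 0, 1], odd=True)   # n = 8, an even length

single_flip = word.copy()
single_flip[2] ^= 1                      # one erroneous bit

double_flip = single_flip.copy()
double_flip[5] ^= 1                      # a second erroneous bit

print(serial_parity_ok(word, odd=True))          # valid codeword
print(serial_parity_ok(single_flip, odd=True))   # odd number of errors
print(serial_parity_ok(double_flip, odd=True))   # even number of errors
print(serial_parity_ok([0] * 8, odd=True), serial_parity_ok([1] * 8, odd=True))
```

With n = 8, odd parity rejects both the all-0s and the all-1s words, whereas even parity would accept both, which is the reason the text prefers odd parity for even n. The double-bit error passes the check, illustrating that only errors affecting an odd number of bits are caught.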

5.2.2 Interlaced Parity Code

An interlaced parity codeword contains c (c = n - k) check bits. The data bits are divided into c parity groups, each covered by one of the c parity check bits. Every bit in a string of c adjacent bits belongs to a different parity group, as shown in Fig. 7. Like single parity, an interlaced parity code can be implemented both horizontally and longitudinally.

Checkers similar to those used for single parity are also applicable here. The number of checkers required is c times that for single parity codes, but the coverage is improved. Furthermore, the XOR trees in the checker for the horizontal code are smaller, and thus faster checking can be achieved. Interlaced parity detects all single-bit errors and all multiple-bit errors that involve an odd number of bits in at least one parity group. When different parity groups alternate between even and odd parity, the coverage is further improved because the all-0s and all-1s errors are covered. Evidently, coverage is improved at the expense of increased check-bit redundancy and a larger number of checkers. In this paper we consider two values for c, namely, c = 2 and c = 3. These are called 2-group and 3-group interlaced parity codes respectively.
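The grouping rule (bit i belongs to group i mod c) can be sketched as follows. The check bits are computed over the data bits only, and in the alternating variant the odd-numbered groups are assumed to use odd parity; that assignment is an illustrative choice, since the text does not say which groups take which parity.

```python
def interlaced_check_bits(data_bits, c, alternate=False):
    """One parity bit per group; data bit i belongs to group i mod c."""
    checks = []
    for g in range(c):
        p = sum(data_bits[g::c]) % 2     # even parity over group g
        if alternate and g % 2 == 1:
            p ^= 1                       # assumed: odd parity for odd-numbered groups
        checks.append(p)
    return checks


def interlaced_ok(data_bits, check_bits, alternate=False):
    """Recompute the check bits and compare them with the received ones."""
    return interlaced_check_bits(data_bits, len(check_bits), alternate) == list(check_bits)


data = [1, 0, 1, 1, 0, 0, 1, 0]
checks = interlaced_check_bits(data, c=2)

burst = data.copy()
burst[2] ^= 1
burst[3] ^= 1                            # two adjacent bits, one in each group

same_group = data.copy()
same_group[0] ^= 1
same_group[2] ^= 1                       # two bits that share group 0

print(interlaced_ok(burst, checks))      # adjacent double error
print(interlaced_ok(same_group, checks)) # double error within one group
print(interlaced_ok([0] * 8, [0, 0]), interlaced_ok([0] * 8, [0, 0], alternate=True))
```

The adjacent double error is caught because each flipped bit lands in a different group, while two errors inside the same group cancel. The last line shows that an all-0s word passes when every group uses even parity but is rejected once the groups alternate.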

5.2.3 Berger Code A Berger code can be implemented at two different levels. When the check bits are formed over a two- dimensional message of size (s × w), where s is the number of data packets per message and w is the number of bits per packet, it is called global Berger code (BGg). When each packet or bit stream is cov- ered by its own check bits, it is called local Berger code (BGt). Local Berger code can be implemented

longitudinally or horizontally. The Berger code is an (n, k) code in which the c check bits form the l ' s complement of the number of l 's in the k informa- tion bits. I f k=ko+kl where k0 is the number of O's and kl is the number of l 's in the data bits, then check = ~1 is the c-bit check word. In general,

c = [-log2(k+l) 7 , where [-x~ is the smallest integer greater than or equal to x. When c = logz(k+ 1), o r k = 2 c - 1,

the Berger code is called the maximal length Berger code because the c check bits are being fully occu- pied. The logic for Berger code checkers is quite simple. In the horizontal scheme, the checker con- sists of an adder (sums w bits in parallel), a comple- mentor, and a comparator . Since checking is per- formed in each switch node, large delay for error detection is prohibitive. This necessitates the use of

carry lookahead logic in the adder. Consequently, the checking delay is 8 gate delays (As) for addition and 1 gate delay for comparison. In the longitudinal scheme, a counter or a serial adder can be used to obtain the number of l ' s in each bit stream. The number of l 's counted, kl, is complemented while the check bits are being shifted into the switch. The time delay is thus only 1 gate delay for comparing the generated and received check bits. The checker for the global Berger code consists of w counters (or serial adders), one w-operand adder, a complemen- tor and a comparator . I f multi-level carry save ad- ders are used, the checker delay will be 10 gate de- lays (As) minus 1 clock cycle time of the switch (T).

Typically, T = 2 0 ns [4]. T is subtracted from the checker delay because the parallel addition can start while the check word is entering the switch. The extra time taken to transfer the check word has been counted as check-bit redundancy and should not be included here. The 10 gate delays include 8 gate delays for parallel addition, 1 gate delay for complementation, and 1 gate delay for comparison. Berger code detects all single-bit errors and unidi- rectional errors. This can be easily shown as fol- lows. Any unidirectional error causes the total number of l 's to either increase or decrease. For 0 ~ 1 errors, kl increases. This must cause check = El to decrease, requiring some bit(s) in the check bits to be changed from 1 to 0, which is impossible for unidirectional 0---, 1 errors. Similar p roof holds for

$1 \to 0$ errors. Berger codes do not have perfect coverage for random multiple-bit errors. This is attributed mainly to the fact that compensating erroneous bits may occur in the data part while the check bits remain unchanged. Additionally, if any single bit in the data part and the least significant bit in the check word are changed in opposite directions, the error is undetectable. There are other undetectable errors that involve bits in both the data and check-bit parts. One such example is an error affecting one data bit and causing a double-bit flip in the two least significant check bits. Berger code is very efficient in systems where unidirectional errors predominate [9].
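The detection argument above can be checked on a small case. The following is a minimal Python sketch, not the checker hardware described in the paper: the function names are hypothetical, the check symbol is taken to be the bitwise complement of the count of 1's in $c = \lceil \log_2(k+1) \rceil$ bits for k data bits, and the check bits are assumed to be appended MSB first.

```python
from itertools import combinations

def berger_check_value(data_bits, c):
    """Check symbol: the number of 1's in the data, complemented bitwise in c bits."""
    k1 = sum(data_bits)
    return ~k1 & ((1 << c) - 1)

def berger_encode(data_bits):
    """Return (codeword, c): the data bits followed by c check bits, MSB first."""
    c = len(data_bits).bit_length()      # equals ceil(log2(k + 1)) for k data bits
    check = berger_check_value(data_bits, c)
    check_bits = [(check >> i) & 1 for i in reversed(range(c))]
    return list(data_bits) + check_bits, c

def berger_ok(word, c):
    """Recompute the check symbol from the data part and compare with the received one."""
    data_bits, check_bits = word[:-c], word[-c:]
    received = int("".join(str(b) for b in check_bits), 2)
    return berger_check_value(data_bits, c) == received

def all_unidirectional_errors_detected(data_bits):
    """Force every non-empty subset of bits to 0, then to 1, and test detection."""
    word, c = berger_encode(data_bits)
    for forced in (0, 1):
        changeable = [i for i, b in enumerate(word) if b != forced]
        for r in range(1, len(changeable) + 1):
            for positions in combinations(changeable, r):
                bad = list(word)
                for p in positions:
                    bad[p] = forced
                if berger_ok(bad, c):
                    return False
    return True

# Undetectable bidirectional error: one data bit goes 0 -> 1 while the least
# significant check bit goes 1 -> 0.
word, c = berger_encode([0] * 8)
bad = list(word)
bad[0] = 1       # data bit 0 -> 1
bad[-1] = 0      # least significant check bit 1 -> 0
print(all_unidirectional_errors_detected([1, 0, 1, 1, 0, 0, 1, 0]), berger_ok(bad, c))
```

The exhaustive loop is only practical for short words such as the 8-bit packets considered here; it is meant as a sanity check of the proof, not as a coverage estimator.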

5.2.4 Checksum Code

In single-precision checksum code, the check word is formed by adding the s w-bit packets together (modulo $2^w$) and is appended to the message. Single-precision checksum offers uneven coverage for the various data lines because some carry bits are lost in the modulo $2^w$ addition. This is very undesirable for most networks, and we therefore consider only the extended-precision checksum in this paper.

In the extended-precision checksum code, the check word is formed by adding together the s w-bit packets without truncation. Since the maximum number representable by each packet is $2^w - 1$, the upper limit for the sum is $s(2^w - 1)$. This number requires a maximum of $(\log_2 s + w + 1)$ bits to be represented. When $w > (1 + \log_2 s)$, the checksum is contained in two packets, which are appended to the message, high-order packet first. For example, if w = 8 and $s \le 128$, the checksum is always totally contained in two packets.

Checksum codes provide 100% coverage for all single and unidirectional errors. However, their random error coverage decreases as message size increases. The incomplete coverage for random errors is mainly due to the existence of compensating errors. Errors involving one bit in the data part and a corresponding vulnerable checksum bit are also undetectable. Because of the time constraint, carry lookahead logic needs to be incorporated in the adder of the checker, rendering it the most complex checker among all the EDC checkers presented in this paper. The checker consists of an s-operand $(\log_2 s + w + 1)$-bit adder and a comparator. The delay time for addition is 4, 8, and 12 gate delays for s = 4, 8, and 16, respectively. Since addition can start before the two checksum packets enter the switch, and 1 gate delay is needed for comparison, the checker delay is $(5\Delta - 2T)$, $(9\Delta - 2T)$, and $(13\Delta - 2T)$ for s = 4, 8, and 16, respectively. Checksum codes have the attraction of low bit redundancy and high error coverage, but suffer from high checker complexity.
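To make the difference between the two checksum variants concrete, here is a minimal Python sketch. The function names and the sample packets are illustrative assumptions, and the code models only the arithmetic, not the carry lookahead adder of the checker. It shows a unidirectional error on the most significant data line escaping the single-precision checksum, and a compensating error escaping the extended-precision checksum.

```python
def single_precision_checksum(packets, w):
    """Sum of the w-bit packets modulo 2**w (carries out of bit w-1 are lost)."""
    return sum(packets) % (1 << w)

def extended_checksum(packets, w):
    """Sum of the w-bit packets without truncation, as [high, low] check packets."""
    total = sum(packets)
    if total >= 1 << (2 * w):
        raise ValueError("checksum does not fit in two w-bit packets")
    return [total >> w, total & ((1 << w) - 1)]

def append_checksum(packets, w):
    """Message = data packets followed by the high-order and low-order check packets."""
    return list(packets) + extended_checksum(packets, w)

def checksum_ok(message, w):
    """Recompute the extended checksum over the data part and compare."""
    *data, high, low = message
    return extended_checksum(data, w) == [high, low]

W = 8
data = [0x01, 0x02, 0x03, 0x04]
message = append_checksum(data, W)

# Unidirectional 0 -> 1 error on the most significant data line of two packets:
# the sum grows by 2**8, which single precision cannot see.
msb_error = [0x81, 0x82, 0x03, 0x04]
missed_by_single = (single_precision_checksum(msb_error, W)
                    == single_precision_checksum(data, W))
caught_by_extended = not checksum_ok(msb_error + message[-2:], W)

# Compensating error: bit 1 goes 1 -> 0 in one packet and 0 -> 1 in another,
# so the sum is unchanged and even the extended checksum misses it.
compensating = [0x03, 0x00, 0x03, 0x04]
missed_by_extended = checksum_ok(compensating + message[-2:], W)

print(missed_by_single, caught_by_extended, missed_by_extended)
```

With w = 8 the two-packet limit is reached only when the sum exceeds $2^{16} - 1$, which cannot happen for $s \le 128$ since $128 \times 255 < 2^{16}$.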

5.2.5 Cyclic Redundancy Code (CRC)

An (n, k) CRC code is obtained by dividing a data polynomial, D(x), of degree less than k by a generator polynomial, G(x), of degree $c = n - k$. The check bits contain the remainder of this division and are appended to the data bits. CRC is easily implemented using linear feedback shift registers (LFSRs) that are composed of XOR gates and memory elements. The LFSR serves as both the check-bit generator and the checker. In this paper, the generator polynomial chosen for discussion is $G(x) = x^3 + x^2 + x + 1 = (x + 1)(x^2 + 1)$. This choice is made because we want to limit the length of the LFSR to 3, and the factor $(x + 1)$ ensures detection of all odd-bit errors. In any particular network, the generator polynomial may be modified to suit the environment. The LFSR for $G(x) = x^3 + x^2 + x + 1$ is shown in Fig. 8. This LFSR is preloaded with 0's. As data enter the switch, they are simultaneously shifted into the message buffer and the LFSR. As soon as the last data bit enters the buffer, the LFSR contains the check bits of the codeword. When the received check bits have passed through, the LFSR content should be 0 for error-free transmission. Therefore, the CRC checker introduces no extra delay besides that used to transmit the check bits. As mentioned earlier, this delay has been counted as check-bit redundancy for longitudinal EDCs.

Fig. 8. LFSR checker for CRC with $G(x) = x^3 + x^2 + x + 1$.

The (n, k) CRC code detects all single-bit errors and all b-adjacent errors of c ($c = n - k$) bits or less. The coverage for b-adjacent errors is $1 - 2^{-c}$ for $b > c + 1$ and $1 - 2^{1-c}$ for $b = c + 1$. When G(x) is divisible by $(x + 1)$, as is the case for $G(x) = x^3 + x^2 + x + 1$, all errors involving an odd number of bits are also detectable. The coverage for random errors is calculated by counting the detectable errors and dividing the count by the total number of errors. If an error pattern is such that the error polynomial, E(x), is divisible by G(x), the error is undetectable. All other errors are detected. A detailed discussion of CRC coverage for various error classes can be found in [10]. CRC code is primarily used in the longitudinal scheme.
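The LFSR behavior and the counting rule for random-error coverage can be sketched in a few lines of Python. This is an illustrative model, not the circuit of Fig. 8: the function names are hypothetical, the register is modeled as an integer, and each data line is assumed to carry s data bits followed by the 3 check bits, so the codeword length is n = s + 3. Under that assumption, enumerating all 2-bit error patterns gives the 2-bit coverage values listed for the 3-bit CRC in Tables 3, 4, and 5 (85.71%, 81.82%, and 78.95%).

```python
from itertools import combinations

C = 3                    # degree of G(x) = x^3 + x^2 + x + 1
G_LOW = 0b111            # coefficients of G(x) below x^3
MASK = (1 << C) - 1

def lfsr_run(bits, state=0):
    """Shift bits through the LFSR; the final state is x^C * M(x) mod G(x)."""
    for b in bits:
        feedback = ((state >> (C - 1)) & 1) ^ b
        state = (state << 1) & MASK
        if feedback:
            state ^= G_LOW
    return state

def crc_encode(data_bits):
    """Append the LFSR content after the last data bit as C check bits, MSB first."""
    r = lfsr_run(data_bits)
    return list(data_bits) + [(r >> i) & 1 for i in reversed(range(C))]

def crc_ok(codeword):
    """The LFSR must hold 0 after the received check bits have passed through."""
    return lfsr_run(codeword) == 0

def random_error_coverage(s, weight):
    """Percentage of weight-bit error patterns detected in an (s + C)-bit codeword."""
    codeword = crc_encode([0] * s)   # the code is linear, so any codeword will do
    patterns = list(combinations(range(s + C), weight))
    detected = 0
    for positions in patterns:
        bad = list(codeword)
        for p in positions:
            bad[p] ^= 1
        detected += not crc_ok(bad)
    return 100.0 * detected / len(patterns)

for s in (4, 8, 16):
    print(s, round(random_error_coverage(s, 2), 2), random_error_coverage(s, 3))
```

The undetected 2-bit patterns are exactly those whose two erroneous bits are a multiple of 4 positions apart, because $x^4 + 1 = (x + 1)\,G(x)$ over GF(2); all 3-bit patterns are caught by the $(x + 1)$ factor.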

5.3 General Guidelines for Code Selection

The detailed comparison among the code candidates with respect to check-bit redundancy, checker complexity, delays, and coverage is summarized in Tables 2, 3, 4, 5, and 6. Table 7 shows a comparison between Berger code and interlaced parity code when the check-bit redundancy is fixed.
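The 2-bit coverage entries for the parity codes in these tables can be reproduced by direct counting. The sketch below is a hedged reconstruction rather than the authors' stated procedure: it assumes that the k data bits are assigned to the c parity groups in round-robin order, that each group carries one parity bit, and that a random 2-bit error escapes detection exactly when both erroneous bits fall in the same group. The function names are hypothetical. Under these assumptions the results agree with the tabulated values, for example 55.56% and 72.73% for c = 2 and c = 3 with 8 data bits.

```python
from math import comb

def interlaced_parity_two_bit_coverage(k, c):
    """Percentage of random 2-bit errors detected with k data bits and c parity groups."""
    # Size of each group: its share of the data bits plus its own parity bit.
    sizes = [len(range(g, k, c)) + 1 for g in range(c)]
    total = comb(k + c, 2)
    undetected = sum(comb(m, 2) for m in sizes)   # both errors in the same group
    return 100.0 * (total - undetected) / total

def bit_redundancy_ratio(check_bits, data_bits):
    """Check bits as a percentage of data bits."""
    return 100.0 * check_bits / data_bits

for k in (4, 8, 16):
    for c in (1, 2, 3):
        print(k, c, bit_redundancy_ratio(c, k),
              round(interlaced_parity_two_bit_coverage(k, c), 2))
```

Setting c = 1 gives single parity, whose 2-bit coverage is 0 because every pair of bits shares the only group.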

As can be observed from the tables furnished, code selection for a multicomputer network involves a multi-dimensional tradeoff. A careful study of the particular network environment may be required to cut down the number of unknown factors and reduce the tradeoff to fewer parameters. Clearly, the more details we know about a network, the better chance we have of choosing the most cost-effective EDC. However, this introduces a higher level of tradeoff, between the effort spent on studying the environment and the achievable EDC efficiency. In reality, the most cost-effective choice is very hard to make because the network behavior is usually

Table 2. Coverage and bit-redundancy for horizontal codes in 8-bit wide packets

                             Check  Bit-redundancy  Coverage (%)
Code                         bits   ratio (%)       AU      1-bit   2-bit   3-bit
Single parity                1      12.5            50.11   100     0       100
Interlaced parity (c = 2)    2      25              79.27   100     55.56   100
Interlaced parity (c = 3)    3      37.5            93.5    100     72.73   100
Local Berger                 4      50              100     100     72.73   95.91

AU: Adjacent-bit unidirectional error. 2-bit and 3-bit errors are random.

Table 3. Coverage and bit-redundancy for longitudinal codes in 4-packet messages (each packet is 8-bit wide)

                             Check  Bit-redundancy  Coverage (%)
Code                         bits   ratio (%)       AU      1-bit   2-bit   3-bit
Single parity                1      25              51.85   100     0       100
Interlaced parity (c = 2)    2      50              80.85   100     60      100
Interlaced parity (c = 3)    3      75              97.01   100     76.19   100
Local Berger                 3      75              100     100     76.19   92.86
Global Berger                1      25              100     100     66.15   98.66
Checksum                     2      50              100     100     66.7    100
3-Bit CRC                    3      75              93.75   100     85.71   100

Table 4. Coverage and bit-redundancy for longitudinal codes in 8-packet messages

                             Check  Bit-redundancy  Coverage (%)
Code                         bits   ratio (%)       AU      1-bit   2-bit   3-bit
Single parity                1      12.5            50.11   100     0       100
Interlaced parity (c = 2)    2      25              79.27   100     55.56   100
Interlaced parity (c = 3)    3      37.5            93.5    100     72.73   100
Local Berger                 4      50              100     100     72.73   95.91
Global Berger                1      12.5            100     100     59.31   99.13
Checksum                     2      25              100     100     60      100
3-Bit CRC                    3      37.5            91.67   100     81.82   100

not fully understood. Another factor that makes the optimal EDC selection difficult is rapidly changing technology. Technological advances can easily alter the significance of each factor involved in the tradeoff, thus shifting the weights carried by different aspects of the network and affecting the choice. In spite of these difficulties, we proceed to describe a general guideline that can be tailored to make a good EDC choice for a particular multicomputer network.

The factors considered in the code selection process include the following: error coverage, the prevalent failure modes or error classes, the time constraint, the space constraint (hardware overhead), and the relative importance of the network performance against its cost. The space and time constraints determine the choice between horizontal and longitudinal codes. A network with a very heavy communication load may not be able to afford the transmission delay incurred by redundant check-bit packets, leaving a horizontal EDC as the only feasible choice. On the other hand, a network in which extra pins and data links are either too costly or simply impossible because of packaging constraints may have to accept a certain delay caused by a longitudinal EDC. The relative weights carried by performance and cost depend on the system design philosophy, and are usually not flexible. Given a certain

Table 5. Coverage and bit-redundancy for longitudinal codes in 16-packet messages

                             Check  Bit-redundancy  Coverage (%)
Code                         bits   ratio (%)       AU      1-bit   2-bit   3-bit
Single parity                1      6.25            50      100     0       100
Interlaced parity (c = 2)    2      12.5            68.75   100     52.94   100
Interlaced parity (c = 3)    3      18.75           91.45   100     70.18   100
Local Berger                 5      31.25           100     100     67.62   97.44
Global Berger                1      6.25            100     100     55.03   99.5
Checksum                     2      12.5            100     100     55.56   100
3-Bit CRC                    3      18.75           90      100     78.95   100

Table 6. Checkers and delays for longitudinal and horizontal EDCs

Longitudinal scheme

Code                         Checker                                                Delay
Single parity                8 serial parity checkers                               $2\Delta$
Interlaced parity (c = 2)    16 serial parity checkers, 1 FF                        $2\Delta$
Interlaced parity (c = 3)    24 serial parity checkers, 3 FFs                       $2\Delta$
Local Berger                 8 x-bit counters, complementor & comparator            $\Delta$
Global Berger                8 x-bit counters, 8-operand y-bit adder, comparator    $10\Delta - T$
Checksum                     s-operand z-bit adder, comparator                      $5\Delta - 2T$ (s = 4)
                                                                                    $9\Delta - 2T$ (s = 8)
                                                                                    $13\Delta - 2T$ (s = 16)
3-Bit CRC                    8 3-bit LFSRs                                          0

Horizontal scheme

Code                         Checker                                                Delay
Single parity                One 9-bit XOR tree                                     $4\Delta$
Interlaced parity (c = 2)    Two 5-bit XOR trees                                    $3\Delta$
Interlaced parity (c = 3)    Three 4-bit XOR trees                                  $2\Delta$
Local Berger                 Tree of FA/HA adders adding 8 bits in parallel,        $9\Delta$
                             complementor & comparator

$x = \lceil \log_2(s + 1) \rceil$, where s is the number of words per packet
$y = \log_2(p + 1)$, where p is the number of bits per packet
$z = \log_2 s + 9$
Typically, T = 20 ns and $\Delta$ = 5 ns, but actual delays depend on specific design and implementation technology.

cost-performance constraint, the choice usually depends on affordability. In other words, the most effective code affordable in the EDC spectrum is selected. Among all the factors mentioned, the predominant error classes are the most difficult to determine. This adversely influences the code selection because the EDC chosen is required to cover all error classes if the network failure modes are unknown. Since some errors are usually more likely to occur than others, and certain error patterns may never occur at all in a particular system, a price is paid for something that seldom or never exists. This reduces the efficiency of the EDC. As mentioned previously in the code description, Berger code is especially effective for unidirectional errors, while CRC is very efficient for adjacent-bit errors. If the failure modes of a network are better understood, code selection should be more cost-effective.

In short, the general procedure for selecting an EDC for a given multicomputer network is to first gather all the available information about the network and identify the factors that have significant

Table 7. Comparison of coverages for Berger code and interlaced parity with the same bit-redundancy ratio.

                             Coverage (%)
                             s = 3                   s = 7                   s = 15
Code                         AU      2-bit   3-bit   AU      2-bit   3-bit   AU      2-bit   3-bit
Local Berger                 100     70      85      100     68.89   94.17   100     64.91   96.9
Interlaced parity            86.96   60      100     90.77   73.33   100     94.79   78.95   100
($c = \log_2(s + 1)$)

Observation: Berger code is better for unidirectional errors but interlaced parity is better for random errors.

effect on code selection. A preliminary choice is then made to narrow the scope of the selection process. After some assumptions are made about the unknown factors, a specific EDC can be selected. Top priority should be given to the EDC that the known factors indicate to be preferable. For example, if unidirectional errors are definitely prevalent in the network, the choice should be Berger code. If cost is a major limiting factor, single parity should be selected.
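The procedure above can be summarized as a small decision sketch. The Python below is only an illustration of the guideline, not a rule set given by the authors: the `NetworkProfile` fields, the function name, and in particular the order in which conflicting factors are resolved (pin limitation before traffic load, error class before cost) are assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass
class NetworkProfile:
    pin_limited: bool       # extra pins or data links are too costly or impossible
    heavy_traffic: bool     # cannot afford redundant check-bit packets
    dominant_errors: str    # "unidirectional", "adjacent", or "unknown"
    cost_critical: bool     # cost is the major limiting factor

def preliminary_edc_choice(profile):
    """Return (scheme, code) as a first cut; the tie-breaking order is an assumption."""
    # Space and time constraints decide between the two schemes.
    if profile.pin_limited:
        scheme = "longitudinal"
    elif profile.heavy_traffic:
        scheme = "horizontal"
    else:
        scheme = "either"

    # Known error classes and cost then narrow the code choice.
    if profile.dominant_errors == "unidirectional":
        code = "Berger"
    elif profile.cost_critical:
        code = "single parity"
    elif profile.dominant_errors == "adjacent" and scheme != "horizontal":
        code = "CRC"        # CRC is primarily used in the longitudinal scheme
    else:
        code = "highest-coverage code affordable"
    return scheme, code

# A pin-limited network in which unidirectional errors dominate.
print(preliminary_edc_choice(NetworkProfile(True, False, "unidirectional", False)))
```

In practice the remaining choice (for instance, local versus global Berger code) still requires the cost comparison of checker logic and delay from Table 6.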

5.4 The Cost-Effective Choice

To demonstrate the tradeoffs involved in making a specific code choice, we introduce the environment of a typical ESB network. The network is designed with switch-level error detection and retry, and computer-module-level reconfiguration. It is composed of 4 × 4 switching elements and is subject to a pin limitation constraint. The preferred scheme is therefore longitudinal. Transient faults are expected to dominate in this VLSI network, and they are very likely to affect consecutive bits of the data streams and cause unidirectional errors. Therefore, the prevalent error class does not include random errors. A careful examination of Tables 3, 4, 5, and 6 shows that both Berger and checksum codes offer 100% coverage for all unidirectional errors. Thus they seem to tie in terms of performance, which calls for a comparison in cost. As mentioned earlier, Berger code can be implemented at the global or the local level. Local Berger code incurs higher bit redundancy than checksum, but has simpler checker logic. However, the transmission delay caused by the larger number of redundant bits is compensated by the smaller checker delay of local Berger code relative to checksum. Consequently, time redundancy, or delay, is also not a determining factor in the EDC selection process. It is the smaller hardware cost of the local Berger code checker that gives it the winning edge over checksum, making local Berger code the most cost-effective EDC for the ESB network.

6. Conclusion

Several EDCs that are suitable for real-time error detection in multicomputer networks are discussed and compared with respect to coverage, bit redundancy, checker complexity, and delay. The cost-performance and space-time tradeoffs are demonstrated and a general guideline for making the optimal choice is offered. The specific network example considered in this paper is an extra-stage banyan network implemented with 4 × 4 crossbar switches. This network uses retry to combat transient or intermittent faults and reconfiguration to bypass permanent faults. Error detection and retry are implemented at the switch level to ensure prompt recovery and prevent error propagation. If this efficient recovery scheme is accompanied by a high-coverage EDC, an obvious advantage over ECCs is observed because similar recovery speed is achieved at a much lower cost, which is attributed to both smaller bit redundancy and cheaper overall logic cost.

References

[1] L.R. Goke and G.J. Lipovski: "Banyan Networks for Partitioning Multiprocessor System," Proc. 1st Annu. Symp. Comput. Architecture, Dec. 1973, pp. 21-28.

[2] C.-L. Wu and T.Y. Feng: "On a Class of Multistage Interconnection Networks," IEEE Trans. Comput., Vol. C-29, pp. 694-702, Aug. 1980.

[3] G.B. Adams, III and H.J. Siegel: "The Extra Stage Cube: a Fault-Tolerant Interconnection Network for Supersystems," IEEE Trans. on Computers, Vol. C-31, No. 5, May 1982, pp. 443-454.

[4] A. Hung and M. Malek: "A 4 x 4 Modular Crossbar Design for the Multistage Interconnection Network," Proc. 1981 Real-Time Systems Symp., Dec. 1981, pp. 3-12.

[5] S. Cheemalavagu and M. Malek: "Analysis and Simulation of Banyan Interconnection Networks with 2 x 2, 4 x 4 and 8 x 8 Switching Elements," Proc. 1982 Real-Time Systems Symp., Dec. 1982, pp. 8349.

[6] V. Gaitanis: "Totally Self-Checking Checkers for Low-Cost Arithmetic Codes," IEEE Trans. on Computers, Vol. C-34, No. 7, July 1985, pp. 596-601.

[7] J. Wakerly: Error-Detecting Codes, Self-Checking Circuits and Applications, North Holland, New York, 1978.

[8] M.A. Marouf and A.D. Friedman: "Design of Self-Checking Checkers for Berger Codes," 8th Annu. Int. Symp. on Fault-tolerant Computing, 1978, pp. 179-184.

[9] W.K. Fuchs, J.A. Abraham, and K. Huang: "Concurrent Error Detection in VLSI Interconnecting Networks," Proc. 10th Annu. Int. Symp. on Comp. Arch., 1983, pp. 309-315.

[10] W.W. Peterson and D.T. Brown: "Cyclic Codes for Error Detection," Proceedings of the IRE, 1961, pp. 228-235.

Miroslaw Malek is an Associate Professor of Electrical Engineering at the University of Texas at Austin, where he has served on the faculty since September 1977. In 1977 he was a visiting scholar at the Department of Systems Design at the University of Waterloo in Canada. His research interests include parallel architectures, interconnection networks and fault-tolerant computing, on which he has published over 50 papers. Malek received the MSc degree in Electrical Engineering in 1970 and his PhD in Computer Science in 1975, both from the Technical University of Wroclaw in Poland. He is a member of the IEEE and ACM. He served as General Chairman of the 1984 Real-Time Systems Symposium. He is a consultant to industry in the areas of network performance, testability, and fault-tolerant computing.

Kitty Yau received her BS and MS degrees in electrical and computer engineering from the University of Texas at Austin, and is currently pursuing a PhD degree in the same field. Her research interests include fault-tolerant computing, multicomputer networks, and parallel processing. She worked for AT&T Bell Laboratories in the summer of 1984 on software development, and in the summer of 1985, she worked at the IBM T.J. Watson Research Center on multicomputer network fault tolerance. She is a member of the IEEE, Eta Kappa Nu and Tau Beta Pi.