
Optimal fault-tolerant design approach for VLSI array processors

C.N. Zhang, T.M. Bachtiar and W.K. Chou

Indexing terms: Fault tolerance, VLSI, Systolic array, Space/time redundancy, Space-time mapping, Concurrent error detection

Abstract: A systematic approach for designing a fault-tolerant systolic array using space and/or time redundancy is proposed. The approach is based on a fault-tolerant mapping theory which relates space-time mapping and concurrent error detection techniques. By this design approach, the resulting systolic array is fault tolerant and achieves the optimal space-time product. In addition, it has the capability to compute more problem instances simultaneously without extra cost.

1 Introduction

Fault tolerance has become an essential design requirement for VLSI/WSI array processors. Various fault-tolerant techniques for systolic arrays have been discussed and implemented [1-4]. In general, fault tolerance can be achieved through some form of redundancy, i.e. information (algorithm-based fault tolerance), space or time, or by reconfiguration. Some combinations of redundancy and reconfiguration techniques are also possible to achieve fault tolerance. This paper focuses on the use of space and time redundancy to add run-time fault tolerance to array processors.

Efforts have been made to design fault-tolerant systolic arrays, but little has been done to develop a systematic approach for designing them. This paper presents a systematic approach for designing fault-tolerant systolic arrays with concurrent error detection and correction capability using space/time redundancy. The design approach is based on a fault-tolerant mapping theory which is developed from the space-time mapping technique [5] and the theory on concurrent error detection using redundancy [6, 7]. The resulting systolic array has the flexibility either to compute one problem instance with fault tolerance, or to compute simultaneously the maximum number of problem instances that can be handled by the systolic array to achieve higher throughput.

© IEE, 1997. IEE Proceedings online no. 19970960. Paper first received 29th January 1996 and in revised form 22nd August 1996. C.N. Zhang and T.M. Bachtiar are with the Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2. W.K. Chou is with the Department of Information Science, Providence University, Taichung, Taiwan, Republic of China.


2 Mapping algorithms onto systolic arrays by linear transformation

The algorithms of interest in this paper are nested loops with regular data dependence structures. Algorithms with constant bounded index sets [7] can be described by the following form:

for i1 := v1 to l1
 for i2 := v2 to l2
  ...
   for ip := vp to lp
    begin
     Statement 1;
     Statement 2;
     ...
     Statement q;
    end;

where vγ and lγ, 1 ≤ γ ≤ p, are integers. Without loss of generality, we assume that vγ = 1 and lγ ≥ 1 for 1 ≤ γ ≤ p. The statements in the body of the loops can be assignments or others, as long as their data dependencies can be represented by an integer matrix D = (d1, d2, ..., dm), where di is a p × 1 integer vector corresponding to one of the variables in the body of the loops.

For the purpose of mapping such an algorithm onto a systolic array architecture, the above algorithm can be characterised by a pair (D, CD), where CD = {(i1, i2, ..., ip)}, 1 ≤ iγ ≤ lγ, 1 ≤ γ ≤ p, represents the index space in which the variables are used or computed. A systolic array implementation may be obtained by a p × p linear transformation matrix

    T = [ Π ]
        [ S ]

where Π is a 1 × p vector determining the time scheduling and S is a (p − 1) × p submatrix mapping CD onto a (p − 1)-dimensional space denoted by Cs. T is a valid transformation of an algorithm A = (D, CD), denoted by T(D, CD) = (D̂, ĈD), if and only if the following two conditions hold:
1. causal time is preserved, i.e. Πdi > 0, 1 ≤ i ≤ m
2. the mapping is conflict-free, i.e. det(T) ≠ 0
where det(T) is the determinant of matrix T, D̂ = (d̂1, d̂2, ..., d̂m), d̂i = Tdi, 1 ≤ i ≤ m, ĈD = {(t, x1, x2, ..., xp−1)}, (t, x1, ..., xp−1)^tr = T(i1, i2, ..., ip)^tr, (i1, i2, ..., ip) ∈ CD, and (i1, i2, ..., ip)^tr is the transpose of (i1, i2, ..., ip).
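As a concrete illustration of the two validity conditions, the short Python sketch below checks them for a triple nested loop (p = 3). The dependence matrix corresponds to the matrix multiplication example used later (Algorithm 1); the candidate T is an illustrative choice of ours, not one of the matrices from the paper.

# Minimal sketch (not from the paper): checking the two validity conditions
# of a space-time transformation T for a 3-level nested loop.
D = [(0, 1, 0),   # d_a: a(i, j, k) depends on a(i, j-1, k)
     (1, 0, 0),   # d_b: b(i, j, k) depends on b(i-1, j, k)
     (0, 0, 1)]   # d_c: c(i, j, k) depends on c(i, j, k-1)

T = [(1, 1, 1),   # Pi, the time scheduling row
     (1, 0, -1),  # first row of S  (illustrative values)
     (0, 1, -1)]  # second row of S (illustrative values)

def det3(M):
    """Determinant of a 3x3 matrix given as a list of row tuples."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_valid_transformation(T, D):
    """Condition 1: Pi * d > 0 for every dependence d; condition 2: det(T) != 0."""
    Pi = T[0]
    causal = all(sum(p * d for p, d in zip(Pi, dep)) > 0 for dep in D)
    return causal and det3(T) != 0

def map_index(T, idx):
    """(t, x, y)^tr = T (i, j, k)^tr for one index point."""
    return tuple(sum(row[c] * idx[c] for c in range(3)) for row in T)

print(is_valid_transformation(T, D))   # True for this choice
print(map_index(T, (2, 3, 1)))         # time step and PE coordinate of index (2, 3, 1)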


In [8, 9] an extended mapping approach has been discussed, in which T can be a q × p matrix (1 ≤ q ≤ p) that can be decomposed into any submatrices Π and S. However, the necessary and sufficient conditions for T being a valid transformation are then much more complicated and difficult to check in practice.

For the sake of simplicity, throughout the paper we consider the case where p = 3. In fact, theoretically, our results can be used for any size of p and for the extended mapping mentioned above. Suppose the algorithm A to be mapped is a triple nested loop and the target architecture is a 2D systolic array. Let A = (D, CD), CD = {(i, j, k)}, 1 ≤ i ≤ l1, 1 ≤ j ≤ l2, 1 ≤ k ≤ l3, and T(D, CD) = (D̂, ĈD), where ĈD = {(t, x, y)}, (t, x, y)^tr = T(i, j, k)^tr; t represents the time when a computation is performed, and (x, y) denotes the co-ordinate of the processing element (PE) where the computation is performed. Both refer to the computation at index (i, j, k) in the original algorithm A.

Let Cs = {(x, y)},

        [ t11 t12 t13 ]
    T = [ t21 t22 t23 ],   Π = (t11, t12, t13),   S = [ t21 t22 t23 ]
        [ t31 t32 t33 ]                               [ t31 t32 t33 ]

Then we have S·CD = Cs. The number of elements in Cs indicates the total number of processing elements (PEs) required.
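Since S·CD = Cs, the PE count can always be obtained by brute force: apply S to every index point and count the distinct images. A small sketch follows; the S used in the demo is an illustrative assumption, not a matrix taken from the paper.

from itertools import product

def count_pes(S, bounds):
    """|Cs| where Cs = { S (i1, ..., ip)^tr : 1 <= i_r <= l_r }.

    S      : list of p-1 rows, each a length-p tuple of integers
    bounds : (l1, ..., lp) loop upper bounds (lower bounds are 1)
    """
    images = set()
    for idx in product(*(range(1, l + 1) for l in bounds)):
        images.add(tuple(sum(r[c] * idx[c] for c in range(len(idx))) for r in S))
    return len(images)

# Illustrative 2 x 3 space submatrix (an assumption for demonstration only)
S = [(1, 0, -1),
     (0, 1, -1)]
print(count_pes(S, (3, 3, 3)))   # number of PEs for l1 = l2 = l3 = 3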

Theoretically, for a given algorithm A, there may exist an infinite number of valid transformation matrices. In practice, some constraints, such as the maximum number of PEs allowed and the maximum delay in the PEs, must be enforced to limit the search space. Several design approaches to find the set of all possible valid transformation matrices for a given algorithm under certain constraints have been reported and implemented [5, 10-12]. An important task is how to select an optimal valid transformation matrix. In most cases, the total number of PEs, np (the cost), and the total time required for a problem instance, tc (the performance), are the two major considerations. For this reason, np × tc is used as the indicator of optimality. The optimal transformation should have the minimum value of np × tc.

In [13, 14] we have studied the most important design objective functions, including np and tc. Our results show that all of these design objective functions depend only on the transformation matrix T and the lengths of the loops, and can be computed easily. Some of the design objective functions (without proofs) are described below and will be used in this paper.

Computation time, tc: The computation time, tc, is the number of clock cycles required to compute one problem instance in a systolic array:

   tc = max{t : (t, x, y) ∈ ĈD} − min{t : (t, x, y) ∈ ĈD} + 1

which can be calculated by

   tc = (l1 − 1)|t11| + (l2 − 1)|t12| + (l3 − 1)|t13| + 1

Note that tc does not take into account the time to load the initial values into the PEs and to remove the final results from the PEs if one or more variables stay in the PEs during computation. In general, those overheads are quite large and are proportional to the size of the array. For this reason we only consider valid transformations by which no variables stay in the PEs.

Number of PEs, np: The number of PEs required, np, is the number of different elements in the set Cs = {(x, y)}, which can be computed by

   np = l1 l2 l3                                         if ai > li for some 1 ≤ i ≤ 3
   np = l1 l2 l3 − (l1 − a1)(l2 − a2)(l3 − a3)           otherwise

where ai = |T1i| / gcd(|T11|, |T12|, |T13|) and T1i is the (1,i)-cofactor of matrix T, 1 ≤ i ≤ 3.

Pipelining time period, tp: The processor pipelining time period, tp, is another important design objective function. It indicates the pipeline rate of the data paths and the throughput of the array. The processor pipelining time period can be determined by the following formula:

   tp = |det(T)| / gcd(|T11|, |T12|, |T13|)

tp = 1 indicates the full computation rate. If tp > 1, then every PE in the array is active in only one out of every tp clock cycles.

Space utilisation, ux and uy: The definition of space utilisation is similar to that of the processor pipelining time period. The space utilisation of a systolic array with respect to x is ux if and only if there are ux − 1 idle PEs between any two nearest active PEs in the x direction at every clock time. The space utilisation with respect to y, uy, is defined similarly. Both of them can be computed by:

   ux = |det(T)| / gcd(|T21|, |T22|, |T23|)
   uy = |det(T)| / gcd(|T31|, |T32|, |T33|)

where T2i and T3i are the (2,i)- and (3,i)-cofactors of matrix T, 1 ≤ i ≤ 3.

Length of array, lx and ly: Parallel to the definition of computation time, we can define the length of the array in the x and y directions, denoted by lx and ly, respectively, as follows:

   lx = max{x : (x, y) ∈ Cs} − min{x : (x, y) ∈ Cs} + 1
      = (l1 − 1)|t21| + (l2 − 1)|t22| + (l3 − 1)|t23| + 1

   ly = max{y : (x, y) ∈ Cs} − min{y : (x, y) ∈ Cs} + 1
      = (l1 − 1)|t31| + (l2 − 1)|t32| + (l3 − 1)|t33| + 1
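The sketch below evaluates the objective functions listed above for a 3 × 3 transformation matrix, using the cofactor-based expressions for np, tp, ux and uy. The helper names and the example matrix are ours; the matrix is an illustrative choice, not one of the paper's T1 or T2.

from math import gcd

def gcd3(a, b, c):
    return gcd(gcd(a, b), c)

def det3(T):
    (a, b, c), (d, e, f), (g, h, i) = T
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def abs_cofactors(T, r):
    """|T_r1|, |T_r2|, |T_r3|: absolute cofactors of row r (0-based) of a 3x3 matrix."""
    (a, b, c), (d, e, f) = [T[i] for i in range(3) if i != r]
    return (abs(b * f - c * e), abs(a * f - c * d), abs(a * e - b * d))

def objectives(T, l1, l2, l3):
    """t_c, n_p, t_p, u_x, u_y, l_x, l_y for a valid 3x3 transformation T."""
    ls = (l1, l2, l3)
    tc, lx, ly = (sum((ls[i] - 1) * abs(T[r][i]) for i in range(3)) + 1 for r in range(3))
    d = abs(det3(T))
    c1, c2, c3 = (abs_cofactors(T, r) for r in range(3))
    tp, ux, uy = d // gcd3(*c1), d // gcd3(*c2), d // gcd3(*c3)
    a = [ci // gcd3(*c1) for ci in c1]              # projection direction a1, a2, a3
    np_ = l1 * l2 * l3
    if all(a[i] <= ls[i] for i in range(3)):
        np_ -= (l1 - a[0]) * (l2 - a[1]) * (l3 - a[2])
    return {"tc": tc, "np": np_, "tp": tp, "ux": ux, "uy": uy, "lx": lx, "ly": ly}

# Illustrative transformation; for n = 3 this choice happens to reproduce the
# Fig. 2 column of Table 1 (tc = 7, np = 19, tp = ux = uy = 3, lx = ly = 5).
T = [(1, 1, 1), (1, 0, -1), (0, 1, -1)]
print(objectives(T, 3, 3, 3))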

Example (matrix multiplication): Algorithm 1 (Matrix multiplication: C = A × B):

for i = 1, n
 for j = 1, n
  for k = 1, n
   begin
    a(i, j, k) = a(i, j - 1, k);
    b(i, j, k) = b(i - 1, j, k);
    c(i, j, k) = c(i, j, k - 1) + a(i, j, k) × b(i, j, k);
   end


a(i, 0, k) = aik, b(0, j, k) = bkj, c(i, j, 0) = 0, and c(i, j, n) = cij, ∀(i, j, k).

Note that each variable in this algorithm has all the index offsets which are necessary for obtaining an integer data dependence vector.
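The single-assignment recurrences above can be executed directly to confirm that they compute C = A × B; a minimal, dependency-free Python sketch follows (the array layout and names are ours).

def algorithm1(A, B):
    """Run the single-assignment recurrences of Algorithm 1 and return C = A x B."""
    n = len(A)
    # a, b, c indexed as [i][j][k] over 0..n; index 0 holds the boundary values
    a = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    b = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    c = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):                   # boundary conditions
        for k in range(1, n + 1):
            a[i][0][k] = A[i - 1][k - 1]        # a(i, 0, k) = a_ik
    for j in range(1, n + 1):
        for k in range(1, n + 1):
            b[0][j][k] = B[k - 1][j - 1]        # b(0, j, k) = b_kj
    for i in range(1, n + 1):                   # the triple nested loop body
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                a[i][j][k] = a[i][j - 1][k]
                b[i][j][k] = b[i - 1][j][k]
                c[i][j][k] = c[i][j][k - 1] + a[i][j][k] * b[i][j][k]
    return [[c[i][j][n] for j in range(1, n + 1)] for i in range(1, n + 1)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(algorithm1(A, B))   # [[19, 22], [43, 50]]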

We have

   D = (da, db, dc) = [ 0 1 0 ]
                      [ 1 0 0 ]
                      [ 0 0 1 ]

(the columns are the dependence vectors of the variables a, b and c, respectively) and l1 = l2 = l3 = n. It is easy to check that

1 0 1 1

are two valid transformation matrices. Figs. 1 and 2 show the systolic array implementations obtained by T1 and T2, respectively, for n = 3. Table 1 summarises the values of np, tc, tp, ux, uy, lx and ly of these two mappings.


Fig. 1 Systolic array obtained by T1


Fig. 2 Systolic array obtained by T2

Table 1: Values of systolic array design objective functions

Systolic array    tc       np           tp   ux   uy   lx      ly
Fig. 1            3n−2     2n²−n        2    2    2    2n−1    n
Fig. 2            3n−2     3n²−3n+1     3    3    3    2n−1    2n−1

For example, if T = T1 we have det(T) = −2, T11 = −1, T12 = 1, T13 = 0, T21 = −1, T22 = −1, T23 = 0, T31 = −1, T32 = 1, T33 = 2, a1 = 1, a2 = 1 and a3 = 0.

Thus,

   np = n³ − (n − 1)(n − 1)n = 2n² − n
   tc = (n − 1) + (n − 1) + (n − 1) + 1 = 3n − 2
   lx = (n − 1) + (n − 1) + 1 = 2n − 1
   ly = (n − 1) + 1 = n

3 Fault-tolerant and redundant mapping theory

3.1 Brief review of fault-tolerant techniques

We assume that both temporary (transient) and permanent faults may occur in PEs during the computation and that at most one PE is faulty at any given time. The ALU of a PE may be faulty, but the fault is independent of its inputs.

Various fault-tolerant techniques for systolic arrays have been proposed [1-4, 6, 7, 15], including:
1. Algorithm-based fault-tolerant approach: The advantage of this approach is that it occupies less space and consumes less time overhead. Major drawbacks of this approach are the algorithm-specific design and arithmetic errors, including truncation and overflow.
2. Reconfiguration approach: Fault tolerance is achieved by reconfiguring the array such that the faulty PEs can be isolated and replaced by working ones. However, due to the more complicated switches and data passing mechanisms, it is more difficult to implement with current VLSI technology.
3. Time and/or space redundancy approach: By repeating the same computation steps or letting duplicate arrays compute the same problem instance, a faulty PE can be detected and the correct result can be obtained by the attached majority voting circuit. Note that only transient faults can be corrected if the repeated computation is performed by the same PE. To handle permanent faults, extra hardware to do encoding and decoding, e.g. shifting out the inputs and shifting back in the results [15], is required. Most fault-tolerant designs using this approach are specific to a particular systolic architecture and may not apply to other systolic arrays with different topologies and data flows.

In this Section, we first show the relationship between fault-tolerant design by time and/or space redundancy and space-time linear transformation, and present sufficient conditions under which an algorithm is able to be mapped onto a fault-tolerant systolic array capable of detecting and correcting a single error


caused by a single failed PE. Based on this, a systematic fault-tolerant design algorithm is presented. The result of the algorithm is an optimal fault-tolerant systolic array for a given algorithm with respect to np* × tc*, where np* is the total number of PEs, including the extra PEs for fault tolerance, and tc* is the total computation time, including the time overhead for fault tolerance.

3.2 Fault-tolerant mapping theory

Consider a nested loop algorithm A = (D, CD), where CD = {(i, j, k)} ⊆ Z³ and Z is the set of integers. Let T(D, CD) = (D̂, ĈD) = Â. Â represents a systolic array implementation specified by D̂ and ĈD, where ĈD = {(t, x, y)} ⊆ Z³. Now construct two systolic arrays Â1 and Â2 based on systolic array Â as follows. Let Â1 = (D̂, ĈD¹) and Â2 = (D̂, ĈD²), where ĈD¹ = {(t + Δt1, x + Δx1, y + Δy1)}, ĈD² = {(t + Δt2, x + Δx2, y + Δy2)}, and Δt1, Δx1, Δy1, Δt2, Δx2 and Δy2 are small integers independent of l1, l2 and l3.

Definition 1: T is a fault-tolerant transformation matrix if there exist integers Δt1, Δx1, Δy1, Δt2, Δx2, Δy2 such that the following conditions are satisfied:
1. At least one of Δi1, Δj1 and Δk1 is not an integer.
2. At least one of Δi2, Δj2 and Δk2 is not an integer.
3. At least one of Δi2 − Δi1, Δj2 − Δj1, Δk2 − Δk1 is not an integer.
where (Δi1, Δj1, Δk1)^tr = T⁻¹(Δt1, Δx1, Δy1)^tr, (Δi2, Δj2, Δk2)^tr = T⁻¹(Δt2, Δx2, Δy2)^tr and T⁻¹ is the inverse matrix of T.

In fact, this definition is an extension of the definition given in [6] and is explained below. Suppose that the transformation T meets conditions 1-3. Let us construct two algorithms A1 and A2 as follows. A1 and A2 are identical to algorithm A except that the index spaces of A1 and A2 are obtained from the index space of A by shifting a small step in each of the directions i, j and k. Let A1 = (D, CD¹) and A2 = (D, CD²), where CD¹ = {(i + Δi1, j + Δj1, k + Δk1)}, CD² = {(i + Δi2, j + Δj2, k + Δk2)} and CD = {(i, j, k)}. It is clear that if algorithms A1 and A2 use the same inputs as algorithm A, they will produce the same outputs as algorithm A. Note that systolic arrays (D̂, ĈD¹) and (D̂, ĈD²) can be viewed as the results of mapping algorithms A1 and A2 by transformation T. Conditions 1-3 indicate that CD ∩ CD¹ = ∅, CD ∩ CD² = ∅ and CD¹ ∩ CD² = ∅, respectively. In fact, CD ∩ CD¹ = ∅ and CD ∩ CD² = ∅ because CD¹ ⊄ Z³ and CD² ⊄ Z³, but CD ⊆ Z³. CD¹ ∩ CD² ≠ ∅ implies that there are some integers ci, i = 1, 2, ..., 6, such that Δi1 + c1 = Δi2 + c2, Δj1 + c3 = Δj2 + c4 and Δk1 + c5 = Δk2 + c6, or Δi2 − Δi1 = c1 − c2, Δj2 − Δj1 = c3 − c4 and Δk2 − Δk1 = c5 − c6. Thus, CD¹ ∩ CD² = ∅ if and only if condition 3 holds.

Since T is a linear mapping, we have ĈD ∩ ĈD¹ = ∅, ĈD ∩ ĈD² = ∅ and ĈD¹ ∩ ĈD² = ∅. By merging systolic arrays (D̂, ĈD), (D̂, ĈD¹) and (D̂, ĈD²) together, we have a new systolic array (D̂, ĈD*), where ĈD* = ĈD ∪ ĈD¹ ∪ ĈD², called the fault-tolerant systolic array of algorithm A by mapping T.
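Conditions 1-3 of definition 1 involve only T⁻¹ applied to the two offset vectors, so they can be checked with exact rational arithmetic. The sketch below does this; the matrix in the demo is an assumed stand-in, not the paper's T2, while the offsets mirror the Δt = 0, Δx ∈ {1, 2}, Δy = 0 choice used in the example that follows.

from fractions import Fraction

def inv3(T):
    """Exact inverse of a 3x3 integer matrix, entries as Fractions."""
    (a, b, c), (d, e, f), (g, h, i) = T
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [(e * i - f * h, c * h - b * i, b * f - c * e),
           (f * g - d * i, a * i - c * g, c * d - a * f),
           (d * h - e * g, b * g - a * h, a * e - b * d)]
    return [[Fraction(x, det) for x in row] for row in adj]

def back_map(Tinv, delta):
    """(di, dj, dk)^tr = T^-1 (dt, dx, dy)^tr."""
    return [sum(Tinv[r][c] * delta[c] for c in range(3)) for r in range(3)]

def is_ft_offsets(T, d1, d2):
    """Conditions 1-3 of definition 1 for offsets d1 = (dt1, dx1, dy1), d2 = (dt2, dx2, dy2)."""
    Tinv = inv3(T)
    v1, v2 = back_map(Tinv, d1), back_map(Tinv, d2)
    has_noninteger = lambda v: any(x.denominator != 1 for x in v)
    return (has_noninteger(v1) and has_noninteger(v2)
            and has_noninteger([a - b for a, b in zip(v2, v1)]))

# Illustrative matrix and offsets (assumed values, not the paper's T2):
T = [(1, 1, 1), (1, 0, -1), (0, 1, -1)]
print(is_ft_offsets(T, (0, 1, 0), (0, 2, 0)))   # True for this choice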

Consider algorithm 1 again. Let

T = T a = [-: : ;] 0 -1 1

By choosing Δt1 = 0, Δx1 = 1, Δy1 = 0, Δt2 = 0, Δx2 = 2 and Δy2 = 0, we have Δi1 = −2/3, Δj1 = 1/3, Δk1 = 1/3, Δi2 = −4/3, Δj2 = 2/3 and Δk2 = 2/3. Thus, T is a fault-tolerant transformation matrix of algorithm 1, A = (D, CD). Figs. 3 and 4 show the systolic array implementations of algorithm A1 = (D, CD¹) and algorithm A2 = (D, CD²), denoted by (D̂, ĈD¹) and (D̂, ĈD²), respectively. Systolic arrays (D̂, ĈD¹) and (D̂, ĈD²) share the same PEs as systolic array (D̂, ĈD) does (see Fig. 2). The additional PEs required by systolic arrays (D̂, ĈD¹) and (D̂, ĈD²) depend on the values of Δx1, Δx2, Δy1 and Δy2. The values of Δxi and Δyi (i = 1, 2) indicate the number of additional columns and rows of PEs that are required. According to the definitions of lx and ly, the number of additional PEs can be determined by lx · max{Δy1, Δy2} + ly · max{Δx1, Δx2}.
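The time and space overheads of the merged array depend only on tc, np, lx, ly and the chosen offsets; the small sketch below evaluates them with the numbers of the running example (n = 3), using the additional-PE expression above together with tc* = tc + max{Δt1, Δt2} (stated later in this section).

def ft_overheads(tc, np_, lx, ly, d1, d2):
    """Total time tc* and PE count np* of the merged fault-tolerant array.

    d1 = (dt1, dx1, dy1), d2 = (dt2, dx2, dy2) are the two offset vectors.
    """
    tc_star = tc + max(d1[0], d2[0])
    np_star = np_ + ly * max(d1[1], d2[1]) + lx * max(d1[2], d2[2])
    return tc_star, np_star

# Running example, n = 3: tc = 3n-2 = 7, np = 3n^2-3n+1 = 19, lx = ly = 2n-1 = 5
print(ft_overheads(7, 19, 5, 5, (0, 1, 0), (0, 2, 0)))   # (7, 29), so tc* x np* = 203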

Fig. 3 Systolic array obtained by T2 with Δt = 0, Δx = 1 and Δy = 0

Fig. 4 Systolic array obtained by T2 with Δt = 0, Δx = 2 and Δy = 0

Fig. 5 shows the systolic array obtained by merging these three systolic arrays (D̂, ĈD), (D̂, ĈD¹) and (D̂, ĈD²) together, denoted by (D̂, ĈD*). It can be seen that three products of the matrix multiplication will be generated by this array. If all three computations use the same input data, then three identical results should come out.


Fig. 5 Optimal fault-tolerant systolic array

Thus, a computation at index point (i, j, k) in algorithm A will be performed three times, at times t, t + Δt1 and t + Δt2, by PEs located at (x, y), (x + Δx1, y + Δy1) and (x + Δx2, y + Δy2), respectively. Therefore, any single error can be corrected by a majority voting circuit.
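Under the single-fault assumption at least two of the three redundant results agree, so a two-out-of-three vote recovers the correct value; a minimal voter sketch follows.

def vote3(r0, r1, r2):
    """Two-out-of-three majority vote over redundant results."""
    if r0 == r1 or r0 == r2:
        return r0
    if r1 == r2:
        return r1
    raise ValueError("no majority: more than one copy is corrupted")

print(vote3(19, 19, 7))   # 19: the single corrupted copy is outvoted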

Fault-tolerant systolic array (D̂, ĈD*) differs from systolic array (D̂, ĈD) in the following two aspects:
1. Time: Let tc* be the total computation time of systolic array (D̂, ĈD*). We have tc* = tc + max{Δt1, Δt2}.
2. Space: Let np* be the total number of PEs in systolic array (D̂, ĈD*) and lx and ly be the lengths of the systolic array (D̂, ĈD) in the x and y directions defined in Section 2. We have np* = np + ly × max{Δx1, Δx2} + lx × max{Δy1, Δy2}.

In the case of Δt1 ≠ 0 and/or Δt2 ≠ 0, some time delay units may be required to synchronise the comparison in the majority voting circuit. In general, small values (including zero) of Δt1, Δt2, Δx1, Δx2, Δy1 and Δy2 ensure small time and space overheads. It is obvious that, for a given algorithm A, a valid transformation matrix T may or may not be able to map A onto a fault-tolerant systolic array. Furthermore, if T is able to map A onto a fault-tolerant systolic array, how do we choose Δt1, Δt2, Δx1, Δx2, Δy1 and Δy2 such that the result is guaranteed to be a fault-tolerant systolic array with the minimum value of the product of np* and tc*? First, we give a sufficient condition under which a valid transformation can be a fault-tolerant transformation.

Theorem 1: Suppose T(D, CD) = (D̂, ĈD). If max{ux, uy, tp} ≥ 3, then T is a fault-tolerant transformation of the algorithm (D, CD).

Proof: Suppose ux = max{ux, uy, tp} = 3 (proofs for the rest of the cases are similar). Construct two systolic arrays (D̂, ĈD¹) and (D̂, ĈD²) based on systolic array (D̂, ĈD) as follows. Let ĈD¹ = {(t, x + 1, y)}, (Δt1 = 0, Δx1 = 1, Δy1 = 0) and ĈD² = {(t, x + 2, y)}, (Δt2 = 0, Δx2 = 2, Δy2 = 0).

Thus, Δi1 = −T21/det(T), Δj1 = T22/det(T), Δk1 = −T23/det(T) and Δi2 = −2T21/det(T), Δj2 = 2T22/det(T), Δk2 = −2T23/det(T). We can prove:
1. At least one of Δi1, Δj1 and Δk1 is not an integer. In fact, suppose all of Δi1, Δj1 and Δk1 are integers. According to |det(T)| = 3 gcd(T21, T22, T23) (since ux = 3), we have |T21| = 3 gcd(T21, T22, T23)|Δi1|, |T22| = 3 gcd(T21, T22, T23)|Δj1| and |T23| = 3 gcd(T21, T22, T23)|Δk1|. Thus, gcd(T21, T22, T23) ≥ 3 gcd(T21, T22, T23), which is a contradiction.
2. At least one of Δi2, Δj2 and Δk2 is not an integer. The proof is the same as the above case.
3. At least one of Δi2 − Δi1, Δj2 − Δj1 and Δk2 − Δk1 is not an integer. Since Δi2 − Δi1 = −T21/det(T), Δj2 − Δj1 = T22/det(T) and Δk2 − Δk1 = −T23/det(T), according to the proof in 1, we have that at least one of Δi2 − Δi1, Δj2 − Δj1 and Δk2 − Δk1 is not an integer.

The following theorem gives the possible values of Δti, Δxi, Δyi, i = 1, 2, in order to construct a fault-tolerant systolic array.

Theorem 2: Suppose T(D, CD) = (D̂, ĈD). Possible values of Δt1, Δt2, Δx1, Δx2, Δy1 and Δy2 which may be used to construct a fault-tolerant systolic array are 0 ≤ Δti ≤ tp − 1, 0 ≤ Δxi ≤ ux − 1 and 0 ≤ Δyi ≤ uy − 1, for i = 1, 2.

Proof: Suppose ux = max{ux, uy, tp} = l ≥ 3. Let Δt = 0, Δx = q and Δy = 0. From

   T (Δi, Δj, Δk)^tr = (Δt, Δx, Δy)^tr

we have Δi = −T21 q/det(T), Δj = T22 q/det(T) and Δk = −T23 q/det(T). That is, |Δi| = |T21| q / (l gcd(T21, T22, T23)), |Δj| = |T22| q / (l gcd(T21, T22, T23)) and |Δk| = |T23| q / (l gcd(T21, T22, T23)).

There are three cases. First, if 1 ≤ q ≤ l − 1, then at least one of Δi, Δj and Δk is not an integer. According to conditions 1 and 2 of definition 1, these Δti, Δxi and Δyi for i = 1, 2 can be used to construct a fault-tolerant systolic array. Secondly, if q = l, then all of Δi, Δj and Δk are integers. Therefore, they cannot be used to construct a fault-tolerant systolic array. Thirdly, if q = h + lm, where 1 ≤ h ≤ l − 1 and m ≥ 1, then it is equivalent to the first case mentioned above. In fact, let Δt1 = 0, Δx1 = h, Δy1 = 0 and Δt2 = 0, Δx2 = h + lm, Δy2 = 0. It is easy to check that all of Δi2 − Δi1, Δj2 − Δj1 and Δk2 − Δk1 are integers. According to condition 3 of definition 1, only one of (Δt1, Δx1, Δy1) and (Δt2, Δx2, Δy2) should be considered. Since more additional time/space overheads are required by the latter, we should only consider 0 ≤ Δxi ≤ l − 1. Similarly, we can prove that, if uy = l, then all values of Δyi, i = 1, 2, should satisfy 0 ≤ Δyi ≤ l − 1, and if tp = l, then all values of Δti, i = 1, 2, should satisfy 0 ≤ Δti ≤ l − 1.

4 Mapping algorithm onto optimal fault-tolerant systolic array

Based on the results described in Section 3, we are able to develop a procedure to find a valid transformation for a given algorithm such that the resulting fault-tolerant systolic array has the minimum value of tc* × np*.

Procedure 1 (Finding an optimal fault-tolerant mapping):
Input: A = (D, CD)
Output: T*, Δt1, Δt2, Δx1, Δx2, Δy1, Δy2 and cost.
Step 1: Let cost = (tc* × np*) = ∞ and initialise T* to invalid.
Step 2: Apply a systematic design procedure (e.g. the approach described in [5]) to find the set of all possible valid transformation matrices {T}, denoted by TR = {T}, which satisfy the condition max{ux, uy, tp} ≥ 3.
Step 3: If TR = ∅, then terminate. Otherwise, pick a T from TR and set TR = TR − {T}.
Step 3.1: Try all possible combinations of (Δt1, Δx1, Δy1) and (Δt2, Δx2, Δy2) according to the result of theorem 2. Find the one which satisfies conditions 1-3 listed in definition 1 and has the minimum value of np* × tc*, denoted by newcost.
Step 3.2: If the value of newcost obtained in Step 3.1 is less than the previous value stored in cost, replace cost by newcost, T* by the matrix chosen in Step 3.1, and ((Δt1, Δx1, Δy1), (Δt2, Δx2, Δy2)) by the new values found in Step 3.1. Go to Step 3.

If the procedure terminates with T* still invalid, then there is no T able to map A onto a fault-tolerant systolic array. Otherwise, it finds a T and its corresponding fault-tolerant systolic array with the minimum value of the product of the total computation time and the number of PEs.

Note that under the constraints described in [5], the number of elements of TR is limited. According to theorem 2, the total time required in Steps 3.1 and 3.2 should be quite small.
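Steps 3-3.2 amount to a bounded search over offset pairs for each candidate T. The sketch below implements that inner search; it is self-contained, with the fault-tolerance test passed in as a predicate (e.g. the definition 1 check sketched in Section 3). The demo predicate and numbers are illustrative assumptions.

from itertools import product

def best_offsets(tc, np_, lx, ly, tp, ux, uy, ft_ok):
    """Step 3.1 of procedure 1: search 0 <= dt <= tp-1, 0 <= dx <= ux-1,
    0 <= dy <= uy-1 (theorem 2) for the offset pair minimising tc* x np*.

    ft_ok(d1, d2) must implement conditions 1-3 of definition 1 for the
    transformation matrix under consideration.
    """
    offsets = list(product(range(tp), range(ux), range(uy)))
    best = None
    for d1, d2 in product(offsets, repeat=2):
        if not ft_ok(d1, d2):
            continue
        tc_star = tc + max(d1[0], d2[0])
        np_star = np_ + ly * max(d1[1], d2[1]) + lx * max(d1[2], d2[2])
        cost = tc_star * np_star
        if best is None or cost < best[0]:
            best = (cost, d1, d2)
    return best   # (newcost, d1, d2), or None if T cannot be made fault tolerant

# Demo with the running example's numbers and a stand-in predicate that only
# accepts the offsets chosen in the example below (an assumption for illustration):
demo_ok = lambda d1, d2: {d1, d2} == {(0, 1, 0), (0, 2, 0)}
print(best_offsets(7, 19, 5, 5, 3, 3, 3, demo_ok))   # (203, (0, 1, 0), (0, 2, 0))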

Example: Consider algorithm 1 once more under the constraints that |tij| ≤ 3, that the delays Πdi are bounded, and that no data should stay in the PEs. The optimal fault-tolerant systolic array obtained by procedure 1 is shown in Fig. 5. The valid transformation matrix T chosen by procedure 1 is the same as T2 in Section 2. The fault-tolerant systolic array is constructed by selecting Δt1 = 0, Δx1 = 1, Δy1 = 0 and Δt2 = 0, Δx2 = 2, Δy2 = 0. Comparing the fault-tolerant systolic array in Fig. 5 with the original systolic array in Fig. 2, we find that it requires an additional 2 × ly PEs, where ly = 5 for the case n = 3. In general, the value of the product tc* × np* is (3n − 2) × (3n² + n − 1) = 9n³ − 3n² − 5n + 2, which is the minimum value amongst the products of all possible fault-tolerant transformations.

5 Conclusion

Based on our previous results, a new approach for designing optimal fault-tolerant systolic arrays is proposed. The approach is based on a space-time mapping technique under a linear transformation method. Fault tolerance is achieved through repeated computation of the same problem instance. The repeated computation is performed by employing space and/or time redundancy. Therefore, any single error at any given time can be detected and corrected.

A mapping algorithm to find optimal fault-tolerant systolic arrays is developed. An example of mapping a matrix multiplication algorithm onto a two-dimensional fault-tolerant systolic array is illustrated. The result is optimal in the sense that the implementation achieves the minimum value of the product of the computation time and the number of processing elements required.

Another interesting result of this new mapping approach is the capability to increase the throughput of the systolic array. By replacing the repeated problem instance with different problem instances, this approach can be used to compute multiple problem instances simultaneously, thereby multiplying the throughput of the array. As an extension to this approach, a combination of fault tolerance and higher throughput capability can be provided by varying the computation latency.

6 References

1  GULATI, R.K., and REDDY, S.M.: 'Concurrent error detection in VLSI array structures'. Proceedings of IEEE international conference on Computer design, 1986, pp. 488-491
2  HUANG, K.H., and ABRAHAM, J.A.: 'Algorithm-based fault tolerance for matrix operations', IEEE Trans. Comput., 1984, 33, (6), pp. 518-528
3  ABRAHAM, J.A., BANERJEE, P., CHEN, C.-Y., FUCHS, W.K., KUO, S.-Y., and REDDY, A.L.N.: 'Fault tolerance techniques for systolic arrays', Computer, July 1987, pp. 65-74
4  KUHN, R.H.: 'Yield enhancement by fault-tolerant systolic arrays', in KUNG, S.Y., WHITEHOUSE, H.J., and KAILATH, T. (Eds.): 'VLSI and modern signal processing' (Prentice-Hall, 1985), pp. 178-184
5  MOLDOVAN, D.I.: 'On the design of algorithms for VLSI systolic arrays', Proc. IEEE, 1983, 71, (1), pp. 113-120
6  ZHANG, C.N., LI, H.F., and JAYAKUMAR, R.: 'A systematic approach for designing concurrent error-detecting systolic arrays using redundancy', Parallel Comput., 1993, 19, pp. 745-764
7  LI, H.F., ZHANG, C.N., and JAYAKUMAR, R.: 'Latency of data-flow and concurrent error detection in systolic arrays'. CCVLSI-89, 1989, pp. 251-258
8  SHANG, W., and FORTES, J.A.B.: 'On time mapping of uniform dependence algorithms into lower dimensional processor arrays', IEEE Trans. Parallel Distrib. Syst., 1992, 3, (2), pp. 350-363
9  LEE, P.Z., and KEDEM, Z.M.: 'Mapping nested loop algorithms into multidimensional systolic arrays', IEEE Trans. Parallel Distrib. Syst., 1990, 1, (1), pp. 64-76
10 CHAN, S.W., and WEY, C.L.: 'The design of concurrent error diagnosable systolic arrays for band matrix multiplications', IEEE Trans. CAD Integr. Circuits Syst., 1988, 7, (1), pp. 21-37
11 SHEU, J.P., and CHANG, C.Y.: 'Synthesizing nested loop algorithms under nonlinear transformation method', IEEE Trans. Parallel Distrib. Syst., 1991, 2, (3), pp. 304-317
12 ESONU, M.O., AL-KHALILI, A.J., HARIRI, S., and AL-KHALILI, D.: 'Systolic arrays: how to choose them', IEE Proc. E, 1992, 139, (3), pp. 179-188
13 ZHANG, C.N., WESTON, J.H., and YAN, Y.F.: 'Determining objective functions in systolic array designs', IEEE Trans. VLSI Syst., 1994, 2, (3), pp. 357-360
14 ZHANG, C.N., BACHTIAR, T.M., and CHOU, W.K.: 'An optimal fault-tolerant design approach for array processors'. International conference on Parallel and distributed systems, Taiwan, December 1994, pp. 348-353
15 PATEL, J.H., and FUNG, L.Y.: 'Concurrent error detection in ALU's by recomputing with shifted operands', IEEE Trans. Comput., 1982, 31, pp. 589-595
