
PARALLEL PROCESSING ON VLSI ARRAYS

edited by

Josef A. Nossek Technical University of Munich

A Special Issue of JOURNAL OF VLSI SIGNAL PROCESSING

Reprinted from JOURNAL OF VLSI SIGNAL PROCESSING Vol. 3, Nos. 1-2 (1991)

SPRINGER SCIENCE+BUSINESS MEDIA, LLC


Contents

Special Issue: Parallel Processing on VLSI Arrays
Guest Editor: Josef A. Nossek

Introduction .......... Josef A. Nossek 5

Numerical Integration of Partial Differential Equations Using Principles of Multidimensional Wave Digital Filters .......... Alfred Fettweis and Gunnar Nitsche 7

Signal Processing Using Cellular Neural Networks .......... L.O. Chua, L. Yang and K.R. Krieg 25

Nonlinear Analog Networks for Image Smoothing and Segmentation .......... A. Lumsdaine, J.L. Wyatt, Jr. and I.M. Elfadel 53

A Systolic Array for Nonlinear Adaptive Filtering and Pattern Recognition .......... J.G. McWhirter, D.S. Broomhead and T.J. Shepherd 69

Control Generation in the Design of Processor Arrays .......... Jürgen Teich and Lothar Thiele 77

A Sorter-Based Architecture for a Parallel Implementation of Communication Intensive Algorithms .......... Josef G. Krammer 93

Feedforward Architectures for Parallel Viterbi Decoding .......... Gerhard Fettweis and Heinrich Meyr 105

Carry-Save Architectures for High-Speed Digital Signal Processing .......... Tobias G. Noll 121


Library of Congress Cataloging-in-Publication Data

Parallel processing on VLSI arrays / edited by Josef A. Nossek.
p. cm.
"A special issue of Journal of VLSI signal processing."
"Reprinted from Journal of VLSI signal processing, vol. 3, nos. 1-2 (1991)."
Based on papers presented at the International Symposium on Circuits and Systems held in New Orleans in May 1990.
ISBN 978-1-4613-6805-2   ISBN 978-1-4615-4036-6 (eBook)
DOI 10.1007/978-1-4615-4036-6
1. Parallel processing (Electronic computers) 2. Integrated circuits--Very large scale integration. I. Nossek, Josef A. II. International Symposium on Circuits and Systems (1990: New Orleans, La.)
QA76.58.P378 1991
004'.35--dc20   91-16484 CIP

Copyright 1991 by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 1991
Softcover reprint of the hardcover 1st edition 1991

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


Introduction

Guest Editor:

JOSEF A. NOSSEK

This is a special issue of the Journal of VLSI Signal Processing comprising eight contributions invited for publication on the basis of novel work presented in a special session on "Parallel Processing on VLSI Arrays" at the International Symposium on Circuits and Systems (ISCAS) held in New Orleans in May 1990. The key questions addressed in these eight papers are massive parallelism, needed to cope with the high-speed requirements stemming from real-time applications, and the restrictions in architectural and circuit design, such as regularity and local connectedness, brought about by VLSI technology. The papers can be grouped into three subsections elaborating on:

• Simulation of continuous physical systems, i.e., numerically solving partial differential equations.
• Neural architectures for image processing and pattern recognition.
• Systolic architectures for implementing regular and irregular algorithms in VLSI technology.

The paper by A. Fettweis and G. Nitsche advocates a signal processing approach for the numerical integration of partial differential equations (PDEs). It is based on the principles of multidimensional wave digital filters (MDWDFs), thereby preserving the passivity of energy-dissipating physical systems. It is particularly suited for systems of PDEs involving time and finite propagation speed. The basic ideas are explained using Maxwell's equations as a vehicle for the derivation of a multidimensional equivalent circuit representing the spatially infinitely extended arrangement with only very few circuit elements. This is then transformed into an algorithm along the principles available for the design of MDWDFs. The attractiveness of the approach lies in offering massive parallelism, requiring only local interconnections, and inheriting all the robustness properties of WDFs, including those with respect to finite word-length effects.

The next three papers are concerned with neural architectures. The first two of them rely on nonlinear analog circuits as basic functional units for collective analog computing.

The paper by L.O. Chua, L. Yang and K.R. Krieg describes Cellular Neural Networks (CNNs), which combine some features of fully interconnected analog neural networks with the nearest-neighbor interactions found in cellular automata. It is shown how CNNs are, on the one hand, well suited for VLSI implementation because of their exclusively local interconnections and, on the other, can perform global image processing tasks because of their dynamics. This is a very interesting and promising branch of neural networks, and a lot of research has already been initiated by the very first publication.

In the paper by A. Lumsdaine, J.L. Wyatt, Jr. and I.M. Elfadel, a series of nonlinear analog networks based on resistive fuses is developed for a very difficult early vision task, namely image smoothing and segmentation. These circuits, which are well suited for VLSI implementation, can automatically solve smoothing and segmentation problems because their solutions are characterized by an extremum principle.

The last neural network paper, by J.G. McWhirter, D.S. Broomhead and T.J. Shepherd, describes a systolic array for nonlinear adaptive filtering and pattern recognition. It consists of an RBF (radial basis function) preprocessor and a least squares processor behaving, in many respects, like a neural network of the feed-forward multilayer perceptron (MLP) type.

The highly parallel and pipelined architecture offers the potential for extremely fast computation and is much more suitable for VLSI design than the MLP architecture.

This leads us to the remaining group of contributions, which focus on systolic architectures to support piecewise regular algorithms (PRAs), irregular algorithms and regular ones, as well as describing algorithmic transformations and proper number representations that allow efficient VLSI implementations.

The paper by J. Teich and L. Thiele deals with control generation for the mapping of PRAs onto regular processor arrays. A systematic procedure is proposed for an efficient design of, and control generation for, configurable processor arrays coping with time- and space-dependent processing functions and interconnection structures. A set of rules is provided to minimize the control overhead.


The paper by J.G. Krammer proposes an architecture based on a sorting memory to cope with the communication problem encountered while executing irregular algorithms on a regularly and locally connected processor array. The sorter-based architecture is a very flexible and efficient one for performing global data transfer operations such as are needed, e.g., in sparse matrix computations. If the interconnection structure is restricted, although still global, the efficiency can be increased by exploiting these restrictions. The described architecture requires only √N processors for a problem of size N. It therefore offers a very interesting and attractive solution for communication-intensive algorithms.

The paper by G. Fettweis and H. Meyr elaborates on the Viterbi algorithm (VA), which poses a very hard problem for VLSI realization because of the nonlinear recursion involved. This nonlinear recursion is algebraically transformed into an M-step recursion. By choosing M large enough, a purely feed-forward signal flow is possible, leading to a regular parallel architecture consisting of identical cascaded modules. This is of special interest for the design of high-speed communication systems.

The paper by T.G. Noll exploits the potential of the redundant carry-save number representation for various high-speed parallel signal processing tasks. Many problems, such as overflow effects, testability, and optimized pipelining schemes, are addressed. It gives an overview of the work carried out by the author over several years in industry. It will be of special interest to anyone involved in the actual design of high-speed signal processing VLSI circuits, both at the architectural and at the circuit module level.

I would like to thank all the authors for submitting their excellent work to this special issue, which reveals interesting interrelations between various branches of parallel processing, spanning from the numerical simulation of physical systems, over analog and digital neural networks, to systolic architectures for high-speed digital signal processing, including the mapping of piecewise regular, irregular and nonlinear recursive algorithms. I would also like to thank all the reviewers for their cooperation and the Editor-in-Chief, Dr. Earl Swartzlander, for his support.


Numerical Integration of Partial Differential Equations Using Principles of Multidimensional Wave Digital Filters

ALFRED FETTWEIS AND GUNNAR NITSCHE
Ruhr-Universität Bochum, Lehrstuhl für Nachrichtentechnik, Postfach 10 21 48, D-4630 Bochum 1, Germany

Received June 15, 1990; Revised December 18, 1990.

Abstract. Physical systems described by partial differential equations (PDEs) are usually passive (due to conservation of energy) and furthermore massively parallel and only locally interconnected (due to the principle of action at proximity, as opposed to action at a distance). An approach is developed for numerically integrating such PDEs by means of algorithms that offer massive parallelism and require only local interconnections. These algorithms are based on the principles of multidimensional wave digital filtering and amount to directly simulating the actual physical system by means of a discrete passive dynamical system. They inherit all the good properties known to hold for wave digital filters, in particular the full range of robustness properties typical for these filters.

In this paper, only the linear case is considered, with particular emphasis on systems of PDEs of hyperbolic type. The main features are explained by means of an example.

1. Introduction

Partial differential equations play an important role in many scientific disciplines. Since analytic solutions can be obtained only in a few particularly simple situations, numerical methods of integration have received a very large amount of attention ever since computers for addressing such a task have become available. Indeed, the number of books on this topic is so large that it does not seem appropriate to attempt listing them in this paper.

Obviously, numerical integration always implies some form of discretization of the original continuous-domain problem, and it is thus understandable that approaches based on multidimensional (MD) digital signal processing are potential candidates [1], [2]. Among such approaches, it appears to be particularly attractive to investigate the possibility of directly simulating the original continuous-domain physical system by means of a discrete passive dynamical system. This way, one may expect to be able to preserve, e.g., the natural passivity that physical systems have due to conservation of energy, and thus to carry out the simulation by means of passive MD wave digital filters (WDFs) [3], [4], [5], especially as corresponding approaches in the one-dimensional case have already proved to be very successful [6], [7]. Such a possibility implies in particular that all stability problems that may originate from the unavoidable numerical inaccuracies can be fully solved. More generally, since stability is only one aspect of the more general problem of robustness [4], [8], simulations by means of MD WDFs may be expected to behave particularly well with respect to this general criterion. Robustness in the sense used here is defined to designate the property of guaranteeing that the strongly nonlinear phenomena induced by the unavoidable operations of rounding/truncation and overflow correction cause only a particularly small change in behavior compared to that one would obtain if all computations were carried out with infinite precision. For testing robustness, a number of individual criteria have been proposed, and these can all be satisfied by means of passive MD WDFs [4], [8].

In addition to robustness, a number of other properties can a priori be expected to be achievable by the approach to be outlined in this paper. These concern aspects such as parallelism, nonconstant parameters, boundary conditions, types of equations, specialized computers [9], etc. This will be discussed in some more detail in Section 2.

In the subsequent sections, a direct approach for achieving our goal will be presented. Another approach


is based on types of sampling obtained by appropriate rotations [4], [5], [10]; it will be described in more detail in a forthcoming further paper. Applications to linear problems such as those arising in the analysis of electrical or acoustical phenomena will be offered. Extensions to nonlinear problems in these fields appear not to be problematic since the same has been demonstrated to hold in the one-dimensional case [7]. It is to be expected that the basic ideas described in this paper can also be applied in some other and more difficult fields in which partial differential equations are of vital interest.

The method is particularly suitable for solving wave propagation problems (hyperbolic problems), but can also be applied to other situations. In the case of Maxwell's equations in 2 spatial dimensions, it leads to algorithms similar to those obtainable by the method based on the use of the transmission line (unit element) concept [11]-[14], but the formulation of the algorithm can be derived in a much more straightforward way, and this is especially true for 3 spatial dimensions. Contrary to known finite-difference methods, e.g., the Yee algorithm [15], [16], no stability problems arise in the case of nonconstant parameters, and reflection-free boundary conditions can be fulfilled in a very simple manner. A quite different point of view characterizes the methods that work entirely in the frequency domain, e.g., the Finite Integration Technique (FIT) [17].

Clearly, the topic to which this paper is devoted is rather vast, and the present text can only constitute a first introduction.

2. General Principles

As mentioned in the Introduction, the basic principle is to obtain a discrete simulation of the actual physical system described by a system of partial differential equations (PDEs). This amounts to replacing the original system of PDEs by an appropriate system of difference equations in the same independent physical variables (e.g., spatial variables and time) as those occurring in the original PDEs, or in independent variables obtained from the former by simple transformations. Specific aspects that are of relevance in doing so are listed hereafter:

(i) Physical systems are usually passive (contractive) due to conservation of energy. The simulation should preserve this natural passivity since this opens the possibility of solving all stability-related problems. Indeed, passivity and incremental passivity [3], [18], [19] are the most powerful means available for finding satisfactory solutions to such problems.

(ii) Passive simulation is greatly facilitated by starting from the original system of PDEs. In particular, elimination of dependent variables should be avoided, and this excludes the widely used principle of first deriving a global PDE by eliminating all dependent variables except one. Such a global PDE cannot characterize the passivity of a system, as is already the case for the global ordinary differential equations encountered in one-dimensional (1-D) problems. Such a global ordinary differential equation relates indeed directly the output variable to the input variable and corresponds in a simple way to the transfer function to be considered, and it is well known that any transfer function of a passive system can also be synthesized by an active system.

(iii) Physical systems (i.e., at least all those that are of engineering relevance) are by nature massively parallel and only locally interconnected. This is a way of expressing that all these physical systems are subject to the principle of action at proximity rather than that of action at a distance. Thus, the behavior at any point in space is directly influenced only by the points in its immediate neighborhood, and, since propagation speed is finite, any change originating at time t₀ at any specific point in space can cause changes at any other point only at a time > t₀. This inherent massive parallelism and exclusively local interconnectivity represents an extremely desirable feature and should be preserved in the simulation.

(iv) The simulation should preferably be done by means of the best approximation achievable in the MD frequency domain (say in the spatio-temporal frequency domain, wave numbers being called, in the terminology adopted here, frequencies, or spatial frequencies), assuming the equations to be linear and to involve only constant parameters. This ensures, under appropriate conditions, a particularly good approximation in time and space and amounts to adopting the so-called trapezoidal rule of integration. The latter aspect remains valid also in the nonconstant and even the nonlinear case.

(v) Instead of using original quantities such as voltage, current, electromagnetic field quantities, pressure, velocity, displacement, etc., one should adopt corresponding so-called wave quantities (also frequently simply called waves, especially in the context of circuit theory [20]). In the case of an electric port characterized by a voltage, u, and a current, i, this amounts to assigning to the port a suitably chosen, but otherwise arbitrary, port resistance, R, and to defining a forward wave, a, and a backward wave, b, e.g., either as voltage waves by means of

a = u + Ri,  b = u - Ri (1)

or as power waves by means of

a = (u + Ri)/(2√R),  b = (u - Ri)/(2√R). (2)

Closely related to the use of waves is the description of physical systems by means of scattering matrices. (A short numerical sketch of these conversions is given after this list.)

(vi) Such an approach corresponds to adopting a basic principle of wave digital filtering, as used also in the MD case. Just as for WDFs, this principle is quite essential for obtaining a directly recursible, thus explicitly computable, passive simulation. For understanding this, recall that waves and scattering matrices are concepts of fundamental, universal importance for describing physical systems. Using these concepts amounts, e.g., to distinguishing clearly between input quantities (incident waves) and resulting reflected and transmitted output quantities, thus to distinguishing explicitly between cause and effect and hence to making explicit use of the causality principle. It is precisely this principle which is essentially behind the principle of computability, i.e., behind the fact that in order to be able to carry out computations in a sequential machine it must be possible to give a consecutive ordering in which the required operations must be performed.

In many cases there is no need to carry out explicitly the change from the original variables to the wave quantities. All that is needed is to obtain, by means of a suitable (and usually very elementary) analogy, an MD passive electric circuit [3] and to apply to this circuit known principles for deriving corresponding MD WDFs. The choice between voltage waves and power waves is relatively irrelevant and should be made according to suitability. In many cases, voltage waves (or corresponding waves in the case of other physical variables) will lead to fewer multiplications and easier ways of guaranteeing passivity; they will therefore be preferred wherever possible [3].

(vii) If a simulation by a passive MD WDF circuit is obtained, numerical instabilities that otherwise could occur due to linear discretization, i.e., to discretization in space and time, are fully excluded. In the case of nonlinear PDEs this may hold only if power waves are adopted.

(viii) If an MD WDF is passive under ideal conditions, i.e., under the assumption that all computations are carried out with infinite precision, it can be built in such a way that passivity and usually also incremental passivity remain guaranteed even if the strongly nonlinear effects are taken into account that are due to the unavoidable operations implied by the need for rounding/truncation and overflow correction. This way, complete robustness of the algorithm carrying out the numerical integration can be ensured, i.e., it can be ensured that the behavior of this algorithm under finite-arithmetic conditions (including the particularly annoying overflow aspects, which could otherwise, e.g., even lead to chaotic behavior) differs as little as possible from the one that would be obtained in the case of exact computations [8].

Note that the term passivity must be interpreted in a somewhat wider sense than what has conventionally been done [3]. Thus, it is sufficient that the MD WDF circuit can be embedded in a suitable way in a passive circuit. Or else, consider a 1-D algorithm onto which the MD WDF algorithm can be mapped in view of its recursibility. It is sufficient that this 1-D algorithm corresponds, in whatever way, to a passive 1-D WDF [21].

(ix) The preservation of properties such as massive parallelism and the exclusively local nature of the interconnections, which are inherent to the original physical problem, is of interest not only for physically implementing the algorithm (in particular for gaining speed by increasing hardware and for enabling the use of systolic-array-type arrangements), but is quite essential from a more basic viewpoint. It makes it possible, indeed, to allow very easily for arbitrary variations (e.g., in space) of the characteristic parameters of the physical system and also for arbitrary boundary conditions.


(x) Since the approach simulates directly the behavior of the actual physical system, assumed of course to obey causality, it is particularly suitable for time-dependent problems implying propagation over finite distances in nonzero time, thus for problems of hyperbolic type.

(xi) Problems of elliptic type, as they occur for determining equilibrium states, can be dealt with in different ways. One possibility is to adopt suitable starting values and then to solve the dynamic problem until equilibrium is reached. In order to ensure convergence, one should of course introduce suitable losses that have no effect once the equilibrium is reached. As an example, if the equilibrium state of a set of electrically charged conductors is to be computed, these conductors may be made strongly lossy, since this causes dissipation only as long as currents are still flowing, i.e., as long as the equilibrium is not yet reached. In other problems, such losses may even be of a type that does not occur in the actual physical system.

(xii) Problems of parabolic type imply infinite propagation speed and thus normally imply idealizations of what is physically obtainable. Hence, they can suitably be modified in such a way that any propagation will occur at finite speed. This modified problem is then again amenable to our approach.

(xiii) Multigrid methods are known to be attractive for the numerical integration of PDEs [22]. Corresponding multigrid strategies can also be used in connection with the present approach. For determining an equilibrium as explained under item (xi), one may start out with a very coarse grid. The equilibrium computed this way may be used as initial value for a computation with a denser grid, etc.

(xiv) The multirate principle of digital signal processing [23], which is known to involve operations such as interpolation (e.g., by first applying zero stuffing) and decimation (by dropping of sampling points), should be applicable to the present approach, in particular in order to allow for grid densities that are nonuniform in space. No details for making use of this possibility have, however, been worked out so far.

(xv) The approach can be modified in order to determine steady-state solutions in an alternative way. In particular, it is not required to compute the complete time behavior in order to determine the steady state in the case of a sinusoidal or complex-exponential excitation.

(xvi) Usual digital filters are linear, and application of the present approach is thus easiest in the case of linear problems. However, extension to nonlinear problems is possible, in particular along lines similar to those already used successfully for ordinary differential equations [7].

(xvii) The approach is particularly suitable as a basis for developing specialized computers [9] that involve massively parallel processing and that are conceived for numerically solving specific classes of PDEs. Such computers would consist of large numbers of similar (or even identical) and similarly programmable (or even identically programmable) individual processors. These processors could be interconnected in the form of systolic-array-type arrangements and have essentially to carry out only additions/subtractions and multiplications. Thus, the individual processors may simply be digital signal processors, possibly even of a simplified type. Furthermore, due to the inherent advantageous properties of WDFs, these digital signal processors may be built with shorter word-lengths for coefficients and signal parameters (data).
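Item (v) is easy to make concrete. The following sketch (ours; names and values are purely illustrative and not from the paper) converts the port variables (u, i) into voltage waves per (1) or power waves per (2) and back:

import math

def to_voltage_waves(u, i, R):
    # Voltage waves, cf. (1): a = u + R*i, b = u - R*i
    return u + R * i, u - R * i

def to_power_waves(u, i, R):
    # Power waves, cf. (2): as in (1), but divided by 2*sqrt(R)
    s = 2.0 * math.sqrt(R)
    return (u + R * i) / s, (u - R * i) / s

def from_voltage_waves(a, b, R):
    # Inverse of (1), cf. (32): u = (a + b)/2, i = (a - b)/(2*R)
    return (a + b) / 2.0, (a - b) / (2.0 * R)

# Illustrative port: R = 50, u = 1, i = 0.01
a, b = to_voltage_waves(1.0, 0.01, 50.0)
print(from_voltage_waves(a, b, 50.0))   # recovers (1.0, 0.01)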

3. Direct Approach by Frequency-Domain Analysis

3.1. Representation by Means of an Equivalent Circuit

We will explain the basic ideas by means of a concrete example. For this, we first choose the equations of a set of conducting plates (possibly lossy) separated by a dielectric (possibly also lossy); in this case the variables involved are indeed electric, which is easiest to understand if one wants to establish the analogy with the electric basis of WDFs. The equations to be considered are

l ∂i₁/∂t₃ + ri₁ + ∂u/∂t₁ = f₁(t), (3a)

l ∂i₂/∂t₃ + ri₂ + ∂u/∂t₂ = f₂(t), (3b)

∂i₁/∂t₁ + ∂i₂/∂t₂ + c ∂u/∂t₃ + gu = f₃(t), (3c)

where t₃ corresponds to time, t₁ and t₂ are the two spatial coordinates, i₁ and i₂ the current densities in the direction of t₁ and t₂, respectively, u the voltage between


the two plates, and f(t) a given excitation (forcing function), with

(4)

The problem thus described is three-dimensional (3-D), at least in the terminology of digital signal processing, since it comprises 2 spatial dimensions, t₁ and t₂, and time, t₃. For a more conventional notation, t₁, t₂, and t₃ should be replaced by x₁, x₂, and t, respectively. For the parameters l, c, r, and g, we may assume

l > 0,  c > 0,  r ≥ 0,  g ≥ 0. (5a,b,c,d)

These parameters may be constants or functions of t₁ and t₂; they may also be functions of t₃, but such a dependence is of more limited practical importance, and independence of t₃ somewhat simplifies certain later discussions.

In this section we assume that l, c, r, and g are constants. Let us solve (3) for a 3-D steady-state behavior of the form (6), (7), i.e., with all signals proportional to exp(pᵀt), where I₁, I₂, I₃, E₁, E₂ and E₃ are complex constants (complex amplitudes) while r₃ is a positive, but otherwise arbitrary, constant. Furthermore,

p = (p₁, p₂, p₃)ᵀ,

where p₁, p₂, and p₃ are arbitrary complex constants. We may interpret I₁, I₂, I₃ and E₁, E₂, and E₃ as complex amplitudes, while p₁, p₂, and p₃ are complex frequencies (more specifically: complex wave numbers in the case of p₁ and p₂). Substituting (6) and (7) in (3), we obtain the set of algebraic equations

(p₃l + r)I₁ + p₁r₃I₃ = E₁, (8a)

(p₃l + r)I₂ + p₂r₃I₃ = E₂, (8b)

p₁r₃I₁ + p₂r₃I₂ + (p₃c + g)r₃²I₃ = E₃, (9)

where the last equation has actually been multiplied by r₃. The equations (8) and (9) can be interpreted in form of the circuit of figure 1a, which in turn is equivalent to that of figure 1b, δ₁ and δ₂ being arbitrary positive constants.
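The passage from (3) to (8) and (9) is easily checked symbolically. The sketch below (ours) assumes the exponential steady-state ansatz described above, i₁ = I₁ exp(pᵀt), i₂ = I₂ exp(pᵀt), u = r₃I₃ exp(pᵀt):

import sympy as sp

t1, t2, t3, p1, p2, p3, I1, I2, I3 = sp.symbols('t1 t2 t3 p1 p2 p3 I1 I2 I3')
l, c, r, g, r3 = sp.symbols('l c r g r3', positive=True)

E = sp.exp(p1*t1 + p2*t2 + p3*t3)   # assumed steady-state factor
i1, i2, u = I1*E, I2*E, r3*I3*E     # u = r3*i3, cf. (40)

lhs_3a = l*sp.diff(i1, t3) + r*i1 + sp.diff(u, t1)                    # (3a)
lhs_3c = sp.diff(i1, t1) + sp.diff(i2, t2) + c*sp.diff(u, t3) + g*u   # (3c)

print(sp.expand(lhs_3a / E))        # (p3*l + r)*I1 + p1*r3*I3, i.e., (8a)
print(sp.expand(r3 * lhs_3c / E))   # p1*r3*I1 + p2*r3*I2 + (p3*c + g)*r3**2*I3, i.e., (9)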

Fig. 1. a) Circuit representing (8) and (9). b) Circuit equivalent to that of (a).

Fig. 2. a) A symmetric two-port in T-configuration. b) Its equivalent lattice representation comprising the canonic impedances Z' and Z''. c) A simplified representation of b). d) A so-called Jaumann equivalent of b).

Consider next the well-known transformation of figure 2, where figure 2a represents a symmetric two-port in T-configuration, figure 2b its equivalent lattice representation comprising the canonic impedances Z' and Z'' given by

Z' = Za,  Z'' = Za + 2Zb, (10)

and figure 2c a simplified representation of the circuit of figure 2b. Applying this equivalence to figure 1b, we obtain the circuit of figure 3 where Z₁', Z₁'', Z₂', and Z₂'' are given by

Z₁' = p₃δ₁ - p₁r₃,  Z₁'' = p₃δ₁ + p₁r₃, (11)

Z₂' = p₃δ₂ - p₂r₃,  Z₂'' = p₃δ₂ + p₂r₃; (12)



Fig. 3. Circuit obtained from figure 1b by applying the equivalence of figure 2, the impedances Z₁', Z₁'', Z₂' and Z₂'' being given by (11) and (12).

they are the canonic impedances of the symmetric two-ports N₁ and N₂, respectively. The other impedances in figure 3 are given by

Z₁ = p₃(l - δ₁) + r,  Z₂ = p₃(l - δ₂) + r, (13)

Z₃ = p₃(cr₃² - δ₁ - δ₂) + gr₃², (14)

and we assume δ₁, δ₂, and r₃ to be chosen in such a way that

l - δ₁ ≥ 0,  l - δ₂ ≥ 0,  cr₃² - δ₁ - δ₂ ≥ 0. (15a,b,c)

The simplest choice, which we will henceforth adopt in this subsection (unless specified otherwise), is of course

δ₁ = δ₂ = l,  r₃ = √(2l/c), (16)

in which case

Z₁ = Z₂ = r,  Z₃ = 2gl/c. (17)

3.2. Derivation of WDF Arrangements

In order to arrive at a structure that can be implemented by means of difference equations, we apply the approximate substitutions

½(p₃T₃ - p_K T_K) → ψ_K',  ½(p₃T₃ + p_K T_K) → ψ_K'',  K = 1, 2, (18a,b)

where ψ_K' and ψ_K'' are defined by

ψ_K' = tanh ½(p₃T₃ - p_K T_K), (19a)

ψ_K'' = tanh ½(p₃T₃ + p_K T_K), (19b)

T₃ being also an arbitrary positive constant and T₁ and T₂ being given by

T_K = r₃T₃/δ_K,  K = 1, 2, (20)

thus, using (16), by

T₁ = T₂ = √2 T₃/√(lc). (21)

The transformations defined by (18) and (19) give a good approximation provided the quantities p₃T₃ ± p_K T_K are small; they correspond to using a generalized trapezoidal rule for carrying out the numerical integration (cf. Section 4).
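To get a feel for the quality of this approximation, note that for purely imaginary frequencies the substitution (18), (19) replaces x = (ω₃T₃ ∓ ω_K T_K)/2 by tan x (since tanh jx = j tan x), so the relative error grows like x²/3. A small numerical sketch (ours):

import numpy as np

for x in np.linspace(0.05, 0.5, 4):
    # relative error of replacing x by tan(x)
    print(f"x = {x:4.2f}   (tan x - x)/x = {(np.tan(x) - x)/x:.5f}")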

If we apply (18) to figure 3, the structure remains formally unchanged, with Z₁, Z₂ and Z₃ still given by (17) but with (11) and (12) replaced by

Z_K' = r₀ψ_K',  Z_K'' = r₀ψ_K'',  K = 1, 2, (22)

where

r₀ = 2l/T₃ = 2r₃/T_K. (23)

From this newly interpreted structure of figure 3 the corresponding WDF structure can immediately be obtained using known principles [3]; the result is shown in figure 4, all signals being indicated by t-dependent quantities rather than by steady-state quantities.

In figure 4 we see explicitly appearing 3 sources with source intensities

e₁ = f₁,  e₂ = f₂,  e₃ = r₃f₃, (24)

3 sinks with received signals b₁, b₂, and b₃, 8 adders, 4 multipliers (with coefficients simply equal to -1/2), 4 shift operators with vectorial shifts equal to

(T₁, 0, T₃)ᵀ,  (-T₁, 0, T₃)ᵀ,  (0, T₂, T₃)ᵀ,  (0, -T₂, T₃)ᵀ, (25)

a 3-port series adaptor, and 2 two-port series adaptors. The port resistances R₁ to R₇ are also shown; they are all positive and are given by

R₁ = R₂ = r,  R₃ = 2gl/c,  R₄ = R₅ = R₆ = R₇ = r₀. (26)


Fig. 4. Wave digital circuit corresponding to figure 3.

Since R₄ = R₅, the known equations describing the 3-port series adaptor [3] reduce to

b₅ = -(a₃ + a₄) + γ₃a₀,  b₄ = b₅ + a₄ - a₅,  b₃ = -a₀ - b₄ - b₅, (27a,b,c)

where

a₀ = a₃ + a₄ + a₅,  γ₃ = R₃/(R₃ + 2r₀). (28a,b)

They can thus be implemented using a single multiplier of coefficient γ₃. The two-port series adaptors are described by

b₁ = a₁ - γ(a₁ + a₆), (29a)
b₆ = -b₁ - (a₁ + a₆), (29b)

b₂ = a₂ - γ(a₂ + a₇), (30a)
b₇ = -b₂ - (a₂ + a₇), (30b)

where

γ = 2r/(r₀ + r). (31)
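A direct transcription of (27) to (31) into code (a sketch; the function and variable names are ours) makes explicit that one pass through the three adaptors costs only three multiplications:

def three_port_series_adaptor(a3, a4, a5, R3, r0):
    # 3-port series adaptor with R4 = R5 = r0, cf. (27) and (28)
    gamma3 = R3 / (R3 + 2.0 * r0)   # (28b): the single multiplier coefficient
    a0 = a3 + a4 + a5               # (28a)
    b5 = -(a3 + a4) + gamma3 * a0   # (27a)
    b4 = b5 + a4 - a5               # (27b)
    b3 = -a0 - b4 - b5              # (27c)
    return b3, b4, b5

def two_port_series_adaptor(a, a_link, r, r0):
    # Two-port series adaptor, cf. (29)-(31); used once for ports (1, 6)
    # and once for ports (2, 7)
    gamma = 2.0 * r / (r0 + r)      # (31)
    b = a - gamma * (a + a_link)    # (29a), (30a)
    b_link = -b - (a + a_link)      # (29b), (30b)
    return b, b_link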

Note that we have used two-port series rather than parallel adaptors because this gives us directly the proper orientations, thus also the proper signs for the output signals, as is explained by the diagram of figure 5. Replacement of the two-port series adaptors by corresponding parallel adaptors would amount to simply changing the signs of a₁, b₁, a₂, and b₂.

The voltages u₁ to u₇ and the currents i₁ to i₇ can be determined by means of the known relationships

u_λ = (a_λ + b_λ)/2,  i_λ = (a_λ - b_λ)/2R_λ,  λ = 1 to 7. (32)

This enables us in particular to determine i₁ to i₃ directly, or indirectly by means of the obvious relationships

i₁ = i₆,  i₂ = i₇,  i₃ = i₄ = i₅. (33)

These latter expressions have the advantage that according to (5), (23), and (26), R₄ to R₇ are positive (> 0) while R₁ to R₃ may be zero (≥ 0).

It is important to observe that every directed loop in figure 4 (including those closed via the adaptors) comprises a positive shift T₃ in the 3rd component t₃, i.e., a positive shift in time. This implies that realizability is ensured in such a way that full parallelism is available for carrying out the computations for all those points in the complete data array that correspond to the same location in time. Furthermore, since all port resistances are positive, stability is ensured under linear conditions, i.e., if all computations are carried out exactly


(thus with infinite precision), and, most importantly, full robustness can be achieved under all stability-related aspects that have ever been considered in relation with the errors introduced by the unavoidable need for applying rounding/truncation and overflow corrections.

Fig. 5. Diagram illustrating the orientation of the voltages u₁ to u₇ and the currents i₁ to i₇.

All these advantages remain valid as long as we obey (15), but not necessarily (16). Indeed, although the resulting WDF structure is then more complicated than that of figure 4, the passivity requirements that have to be observed remain satisfied since the port resistances of all adaptors involved are still positive. In order to be as general as possible, one then has to replace (22) by

Z_K' = r_K ψ_K',  Z_K'' = r_K ψ_K'',  K = 1, 2, (34)

and one finds, carrying out the approximation implied by (18) and (19),

r_K = 2δ_K/T₃ = 2r₃/T_K,  K = 1, 2, (35)

which thus generalizes (20) and (23). Defining T₀ and v₀ by

√2 T₀ = √(T₁² + T₂²),  v₀ = 1/√(lc), (36a,b)

one obtains from the requirement (15)

T₀/T₃ = r₃√(δ₁² + δ₂²)/(√2 δ₁δ₂) ≥ √((δ₁² + δ₂²)(δ₁ + δ₂)/(2c δ₁²δ₂²)),

i.e.,

T₀/T₃ ≥ √2 v₀,  T₁/T₃ = T₂/T₃ ≥ √2 v₀, (37a,b)

(37b) being the expression to which (37a) reduces for T₁ = T₂.

The bounds thus obtained for T₃ are tight since they are reached for the solution defined by (16) (it corresponds to the so-called Courant-Friedrichs-Lewy (CFL) stability condition [24], which is known to be of fundamental relevance). Clearly, the solution (16) is thus the best possible one also from the point of view that it imposes the least restriction on the density of the sampling in time for a given density of the sampling in space. In fact, v₀ is the velocity of a wave travelling freely according to (3) in the lossless case (r = g = 0). Hence, (37) expresses that the distance covered by such a wave during time T₃ may not exceed T₀/√2.
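A minimal sketch (ours; the parameter values are purely illustrative) of the sampling constraint just derived: once the spatial step T₁ = T₂ is chosen, (37b) limits the admissible time step to T₃ ≤ T₁/(√2 v₀):

import math

l, c = 1.0, 1.0                       # hypothetical (normalized) parameters
v0 = 1.0 / math.sqrt(l * c)           # free-wave velocity, cf. (36b)
T1 = 0.1                              # chosen spatial step, T1 = T2
T3_max = T1 / (math.sqrt(2.0) * v0)   # CFL-type bound from (37b)
print(f"v0 = {v0}, largest admissible time step T3 = {T3_max:.4f}")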

4. Direct Approach by Space-Time-Domain Analysis

Instead of introducing the basic approximation in the frequency domain (cf. (18)), we could have done it just as well in the space-time domain. Such an alternative approach is mandatory if we are dealing with nonconstant parameters, since for the steady-state solution according to (6) and (7) the quantities I₁, I₂, and I₃ are no longer constant, and this even if E₁, E₂, and E₃ are constant. We therefore should start directly from (3), in which l, r, c, and g may now be functions of t, i.e.,

l = l(t),  r = r(t),  c = c(t),  g = g(t). (38)


However, although we will maintain the general way of writing (38), we must keep in mind that equations such as (3) are derived under the explicit assumption that l and c are not dependent on time, thus not on t₃. Hence, we may make in (3) the substitutions

l ∂i₁/∂t₃ = ∂(li₁)/∂t₃,  l ∂i₂/∂t₃ = ∂(li₂)/∂t₃,  c ∂u/∂t₃ = ∂(cu)/∂t₃. (39)

As in Section 3, we also introduce an auxiliary constant (thus independent of t₁ to t₃), r₃ > 0, and define i₃ by (cf. (7a))

i₃ = u/r₃. (40)

Using (24), (39) and (40), the PDEs (3) can thus be written in the form

D₃(li₁) + ri₁ + D₁(r₃i₃) = e₁(t), (41a)

D₃(li₂) + ri₂ + D₂(r₃i₃) = e₂(t), (41b)

D₁(r₃i₁) + D₂(r₃i₂) + D₃(r₃²ci₃) + r₃²gi₃ = e₃(t), (41c)

where D_K is the differential operator

D_K = ∂/∂t_K,  K = 1 to k, (42)

with k = 3 in the present case.

In order to arrive at a proper circuit interpretation, we observe first that an inductance, L = L(t), and a capacitance, C = C(t), in dimension K are defined by equations of the type

u = D_K(Li),  i = D_K(Cu),  K = 1 to k, (43a,b)

(where u is not to be confused with the quantity of the same notation occurring in previous equations). Application of the trapezoidal rule amounts to replacing (43a) and (43b) by

u(t) + u(t - T_K) = R(t)i(t) - R(t - T_K)i(t - T_K), (44)

i(t) + i(t - T_K) = G(t)u(t) - G(t - T_K)u(t - T_K), (45)

respectively, where

R(t) = 2L(t)/T_K,  G(t) = 2C(t)/T_K, (46a,b)

(the T_K, K = 1 to k, being positive constants); where, for k = 3 (and similarly for any k ≥ 2),

T₁ = (T₁, 0, 0)ᵀ,  T₂ = (0, T₂, 0)ᵀ,  T₃ = (0, 0, T₃)ᵀ; (47a,b,c)

and where t takes only discrete values. We may represent (44) and (45) more compactly by

u = Δ(T_K)(Ri),  i = Δ(T_K)(Gu),  K = 1 to k, (48a,b)

the operator Δ(T_K) being defined precisely as required by (44) and (45). Using the known definitions for the incident and reflected (voltage) waves, a and b,

a(t) = u(t) + R(t)i(t),  b(t) = u(t) - R(t)i(t), (49)

(44) and (45), thus (48), are transformed into

b(t) = -a(t - T_K),  b(t)G(t) = a(t - T_K)G(t - T_K), (50)

respectively. (Note that in our present context only inductances are needed; capacitances have been included in the discussion only for the sake of completeness, with R = 1/G in (49).)

Using the principles just explained and applying steps similar to those that have previously led to figures 1 and 3, we can first represent the system of PDEs (41) as shown in figure 6, where δ₁ and δ₂ are constants such that δ₁ > 0, δ₂ > 0. In this figure, a notation such as D₃((l - δ₁)·) etc. indicates that the voltage across the corresponding inductance is equal to D₃((l - δ₁)i), where i is the current through that inductance, etc. However, the results, in principle known, that we have explained in the previous paragraph are not yet sufficient to arrive at the desired WDF arrangement. We have therefore to generalize the trapezoidal rule in an appropriate fashion. We do this for the specific situation encountered, but the generality of the approach should be quite obvious.

Thus, consider voltage-current relations such as those appearing in the upper lattice branches in figure 6:

u = (δ_K D₃ - r₃D_K)i,  K = 1, 2, (51)

where δ_K and r₃ are known to be positive constants. We approximate them by applying what we may call a generalized trapezoidal rule, i.e., by replacing them by the difference relations

u(t) + u(t - T_K') = r_K[i(t) - i(t - T_K')],  K = 1, 2, (52)

where

T₁' = (-T₁, 0, T₃)ᵀ,  T₂' = (0, -T₂, T₃)ᵀ, (53a,b)

and where T₁, T₂, T₃ and r_K satisfy (35).


Fig. 6. Circuit representation of the system of PDEs (41).

For the other lattice branches in figure 6, we have

u = (δ_K D₃ + r₃D_K)i,  K = 1, 2, (54)

and, using again a generalized trapezoidal rule, we approximate this by

u(t) + u(t - T_K'') = r_K[i(t) - i(t - T_K'')],  K = 1, 2, (55)

where

T₁'' = (T₁, 0, T₃)ᵀ,  T₂'' = (0, T₂, T₃)ᵀ, (56a,b)

with (35) holding as before. The rules (52) and (55) amount to applying the trapezoidal rule in directions fixed by T_K' and T_K'', respectively, with K = 1, 2.

Using definitions corresponding to (49), i.e.,

a = u + r_K i,  b = u - r_K i, (57)

(52) and (55) yield the desired result

b(t) = -a(t - T_K')  and  b(t) = -a(t - T_K''), (58)

respectively. Furthermore, it is easily verified that (52) and (55) lead to the same frequency-domain approximation as that given by (18) and (19), which also justifies the designation as generalized trapezoidal rule. More compactly, (52) and (55) may be written in the form, respectively,

u = r_K Δ(T_K')i  and  u = r_K Δ(T_K'')i. (59a,b)

In figure 6, passivity of the inductances involving D₃ implies that (15) still holds, i.e., since δ₁, δ₂, and r₃ are constants, that

l_min - δ₁ ≥ 0,  l_min - δ₂ ≥ 0,  c_min r₃² - δ₁ - δ₂ ≥ 0, (60)

where l_min and c_min are the minimum values admitted by l and c throughout the coordinate range of interest. An analysis similar to the one that had led to (37) shows that (37) remains valid if we replace (36b) by

v₀ = 1/√(l_min c_min). (61)

The bounds given by (37) are again tight since they are reached for the choice

δ₁ = δ₂ = l_min,  r₃ = √(2 l_min/c_min), (62)

which should be compared with (16).

We may now transform the circuit of figure 6 by applying rules such as those expressed by (46a), (48a), (35), and (59). If we use a notation similar to that explained in relation with figure 6, we obtain the circuit shown in figure 7 where

r_K = 2δ_K/T₃ = 2r₃/T_K,  r_K' = 2(l - δ_K)/T₃,  K = 1, 2, (63)

r₃' = 2(cr₃² - δ₁ - δ₂)/T₃. (64)

In this new circuit, inductances thus have to be interpreted as being described by the indicated difference relations rather than by differential relations. Kirchhoff's rules, as well as the description of sources and of elements such as resistances, are, of course, as usual. Clearly, the circuit necessarily contains more elements than the one of figure 3 (as described by (17) and (22)), but this is unavoidable because the choice (16) is now excluded.


Fig. 7. Circuit derived from that of figure 6 by applying the (generalized) trapezoidal rule.


Fig. 8. WDF arrangement derived from the circuit of figure 7.

From the reference circuit of figure 7, the WDF arrangement of figure 8 is immediately obtained by applying standard procedures [3]. We may assume δ₁ = δ₂ and thus, from (63), r₁ = r₂, r₁' = r₂', and T₁ = T₂. We use δ₀, r₀, and r₀' to designate

δ₀ = δ₁ = δ₂,  r₀ = r₁ = r₂,  r₀' = r₁' = r₂' (65)

and thus obtain for the port resistances

R₁ = R₂ = r,  R₃ = gr₃²,  R₄ = R₅ = R₆ = R₇ = r₀, (66a,b,c)

R₈ = R₉ = r₀',  R₁₀ = r₃'. (67)

From these, the multiplier coefficients in the adaptors can be determined by known procedures [3]. In particular, equations such as (27) to (33) remain valid as before, but some or all of the coefficients may now be


functions of t. The determination of i₁ to i₃ by means of (33), rather than directly by (32) with λ = 1 to 3, has now the added advantage that R₄ to R₇ are positive constants while R₁ to R₃, although being ≥ 0, may be nonconstant.

A word of explanation is needed for justifying the claim that full robustness is achieved because all port resistances are nonnegative. Indeed, if voltage waves are used, positivity of the port resistances guarantees full robustness only if, for a signal traveling through a shift element, the port resistance involved is the same when the signal enters as when it leaves this element. This requirement is indeed satisfied for the structure of figure 8 since R₄ to R₇ are constants while R₈, R₉, and R₁₀ do not depend on t₃.

5. Complementary Remarks

1. In the constant case (figure 4), all vectorial shifts involve T₃ only in combination with either ±T₁ or ±T₂. In other words, in order to compute a point (t₁, t₂, t₃), we only need to know the state at the four points (t₁ ± T₁, t₂, t₃ - T₃) and (t₁, t₂ ± T₂, t₃ - T₃), not however, e.g., at (t₁, t₂, t₃ - T₃). This allows us to divide the grid points into two mutually exclusive sets such that the state at the points in one set never has any influence upon the points in the other set. It is then even more advantageous to simply drop one of these sets, which leads us to a form of offset sampling involving a kind of checkerboard sampling pattern in space such that, for consecutive values of t₃, we alternate between black and white dots in this pattern.

In this case, the grid density (number of sampling points per unit 3-D volume) is reduced by half. On the other hand, the distance D₀ between points that exert direct influence upon one another is unchanged. Assuming (16) and (20), thus also (21), to hold, and replacing the time distance T₃ by a corresponding length v₀T₃ (cf. (36b)), we have

D₀ = √(T₁² + v₀²T₃²) = T₁√(3/2).

We can restore the original grid density by reducing the shift size by a factor ∛2 in all three directions, in which case D₀ reduces to a new minimum distance, D, given by D = D₀/∛2 ≈ 0.794 D₀. The value ∛2 ≈ 1.26 is thus a measure for the gain in accuracy obtainable by the type of offset sampling discussed here. Another advantage of this offset sampling is that it makes possible simplifications in arranging the computations.
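The two mutually exclusive point sets of the checkerboard are characterized by a simple parity test; a sketch (ours), with (n1, n2, n3) the integer grid indices:

def on_kept_grid(n1, n2, n3):
    # Offset (checkerboard) sampling: keep a grid point iff n1 + n2 + n3 is even.
    # Its predecessors (n1 +/- 1, n2, n3 - 1) and (n1, n2 +/- 1, n3 - 1) have
    # the same parity, so the even and odd sets never interact and one of
    # them may simply be dropped.
    return (n1 + n2 + n3) % 2 == 0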

In the nonconstant case the offset sampling just described can still be used if we double the step size for vectorial shifts that operate in the t₃ direction only. This amounts to approximating an expression such as (43a), for K = 3, not by (44) and (46a), but by

u(t) + u(t - 2T₃) = R(t)i(t) - R(t - 2T₃)i(t - 2T₃), (68)

R(t) = L(t)/T₃, (69)

T₃ being still given by (47c). This leads to replacing T₃ in figure 7 by 2T₃ and to changing figure 8 into figure 9, where (65) to (67) still hold.

2. The WDF structures discussed in this section are not the only ones that can be obtained by essentially the same approach as that explained. In particular, one may apply any of the WDF transformations available for deriving new structures from a first given one [3]. A simple example of such a transformation is the replacement of the series adaptors by corresponding parallel adaptors [3], [25]. In this case, all the minus signs in the multipliers of figures 4 and 8 are cancelled (although minus signs then appear in cascade, e.g., with b₁, b₂, and b₃). The same solution would have been obtained if, for proceeding from (3) or (41) to corresponding circuit descriptions, we had kept the voltage u, replaced i₁ and i₂ by voltages, and used an interpretation of the forcing functions in terms of current sources.

3. In figures 4, 8, and 9, the output waves b₁, b₂, and b₃ are not required for determining i₁, i₂, and i₃ (cf. (33)). This implies possibilities for corresponding simplifications in the three adaptors involved. Further simplifications are possible if any of the source voltages e₁, e₂, and e₃ is zero.

Care must be exercised, however, if power expressions are to be determined in figures 4, 8, and 9. According to the definitions adopted for WDFs, the power entering the rest of the circuit from source λ is given by

(a_λ² - b_λ²)/4R_λ,  λ = 1 to 3. (70)

Since the resistance R_λ is implicitly included as internal resistance of source λ, (70) is not the power delivered via the terminals of the voltage source e_λ itself, but via the terminals of the series connection of e_λ and R_λ, with λ = 1 to 3. For any specific value of this λ, say λ₀, "rest of the circuit" thus means the complete circuit except e_λ₀ and R_λ₀.

Some difficulties may arise because in figures 4, 8 and 9 the port resistances R₁, R₂, and R₃ are given by the first three equations of (26) and (66).


Fig. 9. WDF arrangement similar to that of figure 8, but suitable for using offset sampling.

Indeed, since we may have r = 0 and/or g = 0, any of these port resistances may be zero, and the corresponding expression (70) is then not applicable. There are several ways of circumventing this dilemma. One of these is to note that, using (32), (70) is also given by u_λ i_λ; furthermore, if R_λ = 0 for any λ ∈ {1, 2, 3}, we have u_λ = e_λ, while i_λ can be determined by means of (33). Another one is to treat e_λ and R_λ as separate components (thus to replace the corresponding adaptor by an adaptor with one further port) and to assign to e_λ a new port resistance R_λ' that is equal to the sum of the remaining port resistances referring to the same adaptor; this way, the port to which the source is attached becomes reflection-free [3].

Similar difficulties may arise in figures 8 and 9 if R_λ = 0 for one of the λ = 8 to 10, assuming that we have adopted the equality signs in (60) and that l reaches l_min and/or c reaches c_min. The problem can again be circumvented, e.g., by observing that

u₈ = -u₁ - u₆,  u₉ = -u₂ - u₇,  u₁₀ = -u₃ - u₄ - u₅.

4. The equations (3) can also be written

l ∂i/∂t_k + ri + grad u = f'(t), (71a)

div i + c ∂u/∂t_k + gu = f_k(t), (71b)

where

i = (i₁, ..., i_{k-1})ᵀ,  t = (t₁, ..., t_k)ᵀ, (72a,b)

f'(t) = (f₁(t), ..., f_{k-1}(t))ᵀ, (73)

with k = 3 in the present case. However, the equations (71) to (73) remain valid also for k > 3, with t_k still being the time variable. It is true that an immediate electrical interpretation cannot be given for k > 3, but for k = 4 the equations (71) are of the same type as those for (linear) sound propagation phenomena. For these, the results in the example presented are thus immediately applicable.

The same can easily be shown to be true for Maxwell's equations in the case that the fields are constant along one of the coordinates. The general case of Maxwell's equations is somewhat more complicated, but can nevertheless be dealt with by essentially the same approach.


5. One verifies easily that the equations (41) can also be written in the form

D₃((l - δ₁)i₁) + ri₁ + ½(δ₁D₃ - r₃D₁)(i₁ - i₃) + ½(δ₁D₃ + r₃D₁)(i₁ + i₃) = e₁(t), (74a)

D₃((l - δ₂)i₂) + ri₂ + ½(δ₂D₃ - r₃D₂)(i₂ - i₃) + ½(δ₂D₃ + r₃D₂)(i₂ + i₃) = e₂(t), (74b)

D₃((cr₃² - δ₁ - δ₂)i₃) + r₃²gi₃ + ½(δ₁D₃ - r₃D₁)(i₃ - i₁) + ½(δ₁D₃ + r₃D₁)(i₁ + i₃) + ½(δ₂D₃ - r₃D₂)(i₃ - i₂) + ½(δ₂D₃ + r₃D₂)(i₂ + i₃) = e₃(t). (74c)

This way of writing these equations can be shown to correspond directly to the structure of figure 6.
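The correspondence is easily verified symbolically; a sketch (ours, with sympy, taking l, δ₁, and r₃ constant) checking that (74a) expands back into (41a):

import sympy as sp

t1, t2, t3 = sp.symbols('t1 t2 t3')
l, r, r3, d1 = sp.symbols('l r r3 delta1', positive=True)
i1 = sp.Function('i1')(t1, t2, t3)
i3 = sp.Function('i3')(t1, t2, t3)
D1 = lambda f: sp.diff(f, t1)   # cf. (42)
D3 = lambda f: sp.diff(f, t3)

lhs_74a = (D3((l - d1)*i1) + r*i1
           + sp.Rational(1, 2)*(d1*D3(i1 - i3) - r3*D1(i1 - i3))
           + sp.Rational(1, 2)*(d1*D3(i1 + i3) + r3*D1(i1 + i3)))
lhs_41a = D3(l*i1) + r*i1 + D1(r3*i3)
print(sp.simplify(lhs_74a - lhs_41a))   # prints 0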

6. Boundary Conditions

6.1. General Principles

Boundary conditions can, in principle, be taken into account rather easily. This is due to the fact that the parameters involved may vary in an arbitrary fashion from one point in space to the next and that all multiplier coefficients involved remain bounded even if port resistances of adaptors go to zero or infinity. This in turn is a consequence of the fact that scattering matrices of passive circuits are always bounded.

In order to give a more detailed explanation, let us assume that a wave is propagating inside a medium M₁ that is separated by a boundary surface from a medium M₂. Let P be a point on the spatial sampling grid and let (t₁, ..., t_{k-1})ᵀ be its coordinate vector. Restricting ourselves again, for simplicity, to k = 3, we call P an inner point of M₁ or M₂ if all five points (t₁, t₂), (t₁ - T₁, t₂), (t₁ + T₁, t₂), (t₁, t₂ - T₂), and (t₁, t₂ + T₂) are located inside of M₁ or M₂, respectively. We call it a boundary point in M₁ or M₂ if (t₁, t₂) is in M₁ or M₂, respectively, and if, of the points (t₁ - T₁, t₂), (t₁ + T₁, t₂), (t₁, t₂ - T₂), and (t₁, t₂ + T₂), at least one is in M₂ or M₁, respectively. Since parameter values must be well defined, no grid point may be located on the boundary surface itself.
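The classification just described is straightforward to code; a sketch (ours), where in_M2 is a hypothetical predicate marking the spatial points of M₂:

def classify(t1, t2, T1, T2, in_M2):
    # Inner/boundary classification of a grid point per Subsection 6.1 (k = 3)
    here = in_M2(t1, t2)
    neighbors = [in_M2(t1 - T1, t2), in_M2(t1 + T1, t2),
                 in_M2(t1, t2 - T2), in_M2(t1, t2 + T2)]
    medium = 'M2' if here else 'M1'
    kind = 'inner' if all(n == here for n in neighbors) else 'boundary'
    return medium, kind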

Assume now that we are computing the state of a point P at a certain (sampled) time instant t_k. Nothing specific happens as long as P is an inner point of M₁. The same is true if P is a boundary point in M₁ and if the wave is for the first time reaching the boundary. We obviously assume indeed that M₂ is initially fully at rest, i.e., that all registers referring to points in M₂ are initially discharged. If P is a boundary point in M₂, however, we have to switch to the parameter values in M₂. Specific aspects concerning this will be discussed hereafter.

6.2. Hard Boundaries

As an example of a hard boundary, consider the case of a short-circuit, which we may characterize by g = ∞, in which case, according to (66), we have R₃ = ∞. Using the equations describing the 4-port series adaptor in figures 8 and 9, it can be shown that this yields

(75a,b)

which corresponds indeed to i₃ = 0, thus to u = 0 (cf. (32), (33), and (40)). Furthermore, we may assume l = l_min and r = 0 (losslessness) as well as e₁ = e₂ = e₃ = 0, whence from (64) to (67)

R₁ = R₂ = R₈ = R₉ = 0.

It can be verified that this leads to b₈ = a₈ and b₉ = a₉, thus (cf. (47c) and figure 8) to

a₈(t) = -a₈(t - T₃),  a₉(t) = -a₉(t - T₃). (76)

Thus, since a₈ and a₉ were initially zero, these quantities remain zero ever after. The same conclusion holds if we use figure 9 instead of figure 8, since we then simply have to replace T₃ by 2T₃. Finally, using these results it can be shown that for a boundary point in M₂ of the type considered we have not only (75) but also

(77a,b)

For the signals c₁, c₂, d₁, and d₂ in figure 8 one finds

c₁ = -b₄ + b₆,  c₂ = b₄ + b₆,  d₁ = a₄ - a₆,  d₂ = -a₄ - a₆,

i.e., using (75a) and (77a),

(78)


On the other hand, it follows from (53a), (56a), and figure 8 that

d₁(t) = c₁(t - T₁'),  d₂(t) = c₂(t - T₁'')

and thus

c₁(t) = c₂(t - T₁''),  c₂(t) = c₁(t - T₁'). (79)

Hence, if t is an inner point of M₂ (in which case t - T₁' and t - T₁'' are also still in M₂, so that (78) and thus (79) are applicable), we obtain from (47c), (53a), (56a), and (79),

c₁(t - T₁') = c₂(t - 2T₃),  c₂(t - T₁'') = c₁(t - 2T₃)

and thus

c₁(t) = c₁(t - 2T₃),  c₂(t) = c₂(t - 2T₃). (80)

Hence, since c₁ and c₂ were initially zero, they remain permanently zero for any inner point of M₂. The same conclusion can also be found for c₃ and c₄ in figure 8, and corresponding conclusions also hold in the case of figure 9.

Note that the value of c (cf. (64)) does not play any role in our derivation. Hence, we may always assume it to be such that c ≥ cmin, in which case (37) remains valid for v0 given by (61).

The other type of hard boundary is r = ∞, in which case according to (66) we have R1 = R2 = ∞. This in turn can be shown to yield the equations

(81)

which correspond indeed to i1 = i2 = 0 (cf. (32) and (33)). At the boundary points in M2 we should furthermore require g = 0 (losslessness) as well as e1 = e2 = e3 = 0, and we may put c = cmin = 2l/r3², whence from (64), (66) and (67), R3 = R10 = 0 and, as can be shown,

(82)

For obtaining these latter expressions we have made use of a10 = 0, which can indeed be justified in the same way as we had done for showing a8 = a9 = 0 in the case of the previously considered hard boundary.

If we assume that the parameters remain the same all through M2, one observes that simple relations such as (80) are now not strictly satisfied. One does find for the inner points of M2, however,

2b4(t) = 2b5(t - 2T3) - b4(t - T3') - b4(t - T3'') + b5(t - T3') + b5(t - T3''),

2b5(t) = 2b4(t - 2T3) + b4(t - T3') + b4(t - T3'') + b5(t - T3') - b5(t - T3'').

From these expressions one can conclude by Taylor expansion, as is to be expected, that for a sufficiently tight sampling, b4 and b5 remain close to zero, although they do not have to remain strictly zero.

One way to avoid this problem is to impose, e.g., the requirement g = ∞ for the inner points of M2. In view of the earlier result, this would at least guarantee that the expressions (80), as well as the conclusions we had drawn from them, hold for all inner points of M2 that are not immediate neighbors of the boundary points in M2. Another possibility is to create in the second layer of the inner points of M2 a reflection-free situation of the type explained in Subsection 6.3. A further possibility will be explained in Subsection 6.4.

6.3. Reflection-Free Boundary

A further interesting choice for M2 is a medium that does not create any reflections, i.e., that is fully absorbing. Such a medium can easily be created by permanently setting to zero the stored values in all points of M2, thus in all inner points of M2 as well as in all boundary points in M2. In practice, no computations therefore have to be carried out for the points in M2.

6.4. Modified Approach

The difficulty we had encountered in Subsection 6.2 for one of the types of hard boundary can also be overcome by modifying the approach for obtaining a discrete approximation. For explaining this, let us consider first the frequency-domain analysis presented in Subsection 3.1 and let us replace (18) by

ψκ' = tanh(pκTκ'), ψκ'' = tanh(pκTκ''), κ = 1, 2, (83)

with (19) to (21) as well as (23) holding as before. This amounts to replacing (22) by

(84)

These expressions show that Zκ' and Zκ'' are still reactance functions, although in two complex frequencies, ψκ' and ψκ'', and we have, using (19),


Zκ' + Zκ'' = r0 (ψκ' + ψκ'')/(1 + ψκ'ψκ'') = r0 tanh(pκTκ), κ = 1, 2.

This last expression is the key for obtaining the desired property. Observe also that the approximations implied by (83) are still of second order, just as for (19). On the other hand, the total expenditure has increased, since Zκ' and Zκ'' require twice as many elements as before.
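As a quick numerical check of the tanh addition theorem that underlies this last expression (our own sketch, not part of the paper; the partial-interval values are illustrative), one can verify that (ψ' + ψ'')/(1 + ψ'ψ'') indeed equals tanh(pT) whenever ψ' = tanh(pT') and ψ'' = tanh(pT'') with T' + T'' = T:

```python
import numpy as np

# Check (psi' + psi'')/(1 + psi'*psi'') == tanh(p*T) for
# psi' = tanh(p*T'), psi'' = tanh(p*T''), T' + T'' = T,
# at a few real and complex frequencies p.
T1, T2 = 0.3, 0.7   # illustrative partial intervals, T = T1 + T2 = 1.0
for p in [0.5, 1.0 + 2.0j, -0.25 + 1.5j]:
    psi1, psi2 = np.tanh(p * T1), np.tanh(p * T2)
    lhs = (psi1 + psi2) / (1.0 + psi1 * psi2)
    assert np.allclose(lhs, np.tanh(p * (T1 + T2)))
```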

If we now apply the same modification to the situation discussed in Subsection 6.2, one finds, using again known WDF principles [3], that the 4 vectorial shifts in figures 8 and 9 have to be replaced by arrangements as explained in figure 10, while (66c) has to be replaced by

R4 = R5 = R6 = R7 = r0/2,

all other quantities remaining unchanged. If this is done and if one uses (81) but temporarily ignores (82), thus the influence of the 4-port series adaptor, one finds

a4(t) = -b4(t - 2T3), a5(t) = -b5(t - 2T3).

Combined with (82), these relations thus yield

a4(t) = a5(t - 2T3), a5(t) = a4(t - 2T3).

Fig. 10. Changes to be applied to figures 8 and 9 as required by the modified approach explained in Subsection 6.4. In (a) on the one hand and in (b) on the other, upper and lower signs correspond to one another.

Hence, a4 and a5 now remain permanently zero if they were initially zero, and due to (82) the same holds for b4 and b5. The same conclusion can then be shown to be valid also for a6, b6, a7, and b7.

7. Steady-State Analysis

For an alternative approach to determining a steady-state solution one may proceed in a way we will explain by means of (41). Let us thus write

eλ = Eλ e^(p tk), iλ = Iλ e^(p tk), λ = 1, 2, 3

where

Eλ = Eλ(t1, ..., tk-1), Iλ = Iλ(t), λ = 1, 2, 3,

i.e., Eλ may depend on t1 to tk-1 while Iλ may also depend on tk. For the sake of generality we consider the case of an excitation with a general complex frequency p, this quantity being of course a constant; in the sinusoidal case we obviously have p = jω. Substituting in (41) yields, with k = 3,

D3(l'I1) + (pl + r)I1 + D1(r3I3) = E1, (85a)

D3(l'I2) + (pl + r)I2 + D2(r3I3) = E2, (85b)

D1(r3I1) + D2(r3I2) + r3²(pc + g)I3 + D3(r3²c'I3) = E3, (85c)

where we have, under the assumptions made so far, l' = l and c' = c.

These new equations are very similar to (41) except that there are now complex quantities involved. This latter aspect, however, does not cause any serious obstacle, and we may thus proceed just as for (41). Assuming

r + l Re p ≥ 0, g + c Re p ≥ 0,

with strict inequality holding for at least one of these relations, the dissipation involved will cause the Iλ to converge to values Iλ0 that are independent of tk. If the losses involved are small, the convergence may obviously be rather slow, and there is even no convergence at all in the lossless case (r = g = 0) if p = jω. In these cases, one may replace r and g by modified functions that are sufficiently large but such that for growing values of tk they converge towards the respective original values.

Despite the added complications there may remain some computational advantage over a direct integration of (41). On the one hand, the time dependence of the Iλ may be very much less than that of the iλ, λ = 1 to k. If we proceed as explained so far, however, this reduced dependence cannot usually be exploited for reducing the density of the sampling raster in the time direction, since for stability reasons we are obliged to obey conditions such as those given by (37), with v0 given by (36b). We can circumvent this restriction by observing that once the steady state is reached, i.e., once the Iλ may be replaced in (85) by the corresponding Iλ0, the terms involving D3 in (85) are zero. Hence, l' and c' in (85) have no influence upon the Iλ0, and may therefore be chosen completely arbitrarily, while the condition corresponding to, say, (37a) has then to be replaced by

T0/T3 ≥ √2 v0', v0' = 1/√(l'c').

For any given ratio T0/T3, this condition can always be satisfied by appropriately selecting l' and c'. In particular, l' and c' may be chosen constant even if l and c are not constant. Hence, in the structure realizing (85) we can always obtain further simplifications by adopting

l1 = l2 = l', r3 = √(2l'/c').

One may not conclude, however, that the number of time steps can definitely be reduced, since an increase of l' and c' not only allows us to increase T3, but at the same time increases correspondingly all time constants. Nevertheless, some improvement might be feasible in certain cases by appropriately dividing the increase between l' and c'.

8. Conclusions

A new approach for numerically integrating partial differential equations (PDEs) has been developed. It is particularly adapted to systems of PDEs that correspond to physical problems involving time and implying finite propagation speed (thus, say, to hyperbolic problems). The algorithms obtained offer massive parallelism and imply only local interdependencies. They are thus very well suited to serve as a basis for the construction of specialized computers built for solving specific classes of problems and comprising a large number of individual, only locally interconnected processors.

The derivation of the algorithms proceeds by first finding a multidimensional (MD) electric circuit describing, in a suitable equivalent form, the original physical system. The actual algorithm is then obtained by applying to this circuit the same principles as those known from the theory of MD wave digital filters (WDFs). This way one can ensure that the resulting difference equations are indeed recursible (computable), i.e., that they correspond indeed to a true algorithm, and that the full range of robustness properties known from MD WDFs holds. These robustness properties imply not only stability after discretization in space and time, but also very good behavior with respect to discretization in value, thus with respect to the highly nonlinear operations needed for producing rounding/truncation and overflow corrections.

In fact, the approach amounts to simulating the actual continuous-domain dynamical system, assumed to be passive (no creation of energy!), by a discrete passive dynamical system. Such a discrete system behaves as closely as is probably possible to the actual system described by the PDEs. Thus, arbitrary variations of the characteristic parameters as well as arbitrary boundary conditions and arbitrary boundary shapes can also easily be accommodated. The approach has been illustrated by means of an example.

References

1. R. Rabenstein, "A signal processing approach to the numerical solution of partial differential equations," in NTG-Fachbericht 84, Berlin: VDE-Verlag, 1983.

2. R. Rabenstein, "A signal processing approach to the digital simulation of multidimensional continuous systems," Proc. Eur. Signal Processing Conf., Part 2, The Hague, The Netherlands, Amsterdam: North Holland, 1986, pp. 665-668.

3. A. Fettweis, "Wave digital filters: Theory and practice," Proc. IEEE, vol. 74, 1986, pp. 270-327.

4. A. Fettweis, "New results in wave digital filtering," Proc. URSI Int. Symp. on Signals, Systems, and Electronics, Erlangen, W. Germany, 1989, pp. 17-23.

5. A. Fettweis and G. Nitsche, "Numerical integration of partial differential equations by means of multidimensional wave digital filters," Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, New Orleans, LA, May 1990, pp. 954-957.

6. H.D. Fischer, "Wave digital filters for numerical integration," ntz-Archiv, vol. 6, 1984, pp. 37-40.

7. K. Meerkötter and R. Scholz, "Digital simulation of nonlinear circuits by wave digital filters," Proc. IEEE Int. Symp. Circuits and Systems, vol. 1, Portland, OR, 1989, pp. 720-723.

8. A. Fettweis, "On assessing robustness of recursive digital filters," European Transactions on Telecommunications, vol. 1, 1990, pp. 103-109.

9. B.J. Alder (ed.), Special Purpose Computers, San Diego: Academic Press, 1988.

10. Xiaojian Liu and Alfred Fettweis, "Multidimensional digital filtering by using parallel algorithms based on diagonal processing," Multidimensional Systems and Signal Processing, vol. 1, 1990, pp. 51-56.

11. P.B. Johns and R.L. Beurle, "Numerical solution of 2-dimensional scattering problems using a transmission-line matrix," Proc. IEE, vol. 118, no. 9, 1971, pp. 1203-1208.

12. P.B. Johns, "A Symmetrical Condensed Node for the TLM Method," IEEE Trans. Microwave Theory Tech., vol. MTT-33, 1985, pp. 882-893.

13. Tatsuo Itoh, Numerical Techniques for Microwave and Millimeter-Wave Passive Structures, New York: J. Wiley, 1989.

14. Wolfgang Hoefer, "The transmission line matrix (TLM) method," in Numerical Techniques for Microwave and Millimeter-Wave Passive Structures (T. Itoh, ed.), 1989, pp. 496-591.

15. K.S. Yee, "Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media," IEEE Trans. Antennas Propagat., vol. AP-14, 1966, pp. 302-307.

16. A. Taflove and M.E. Brodwin, "Numerical Solution of Steady-State Electromagnetic Scattering Problems Using the Time-Dependent Maxwell's Equations," IEEE Trans. Microwave Theory Tech., vol. MTT-23, 1975, pp. 623-630.

17. T. Weiland, "On the unique numerical solution of Maxwellian eigenvalue problems in three dimensions," Particle Accelerators, vol. 17, 1985, pp. 227-242.

18. K. Meerkötter, "Incremental passivity of wave digital filters," Proc. Eur. Signal Processing Conference, Lausanne, Switzerland, Amsterdam: North Holland, 1980, pp. 27-31.

19. A. Fettweis, "Passivity and losslessness in digital filtering," Arch. Elektron. Übertr., vol. 42, 1988, pp. 1-8.

20. V. Belevitch, Classical Network Theory, San Francisco: Holden-Day, 1967.

21. A. Kummert and M. Pätzold, private communication, 1989.

22. W. Hackbusch, Multi-grid Methods and Applications, Berlin: Springer-Verlag, 1985.

23. R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1983.

24. A.A. Samarskij, Theorie der Differenzenverfahren, Leipzig: Akademische Verlagsgesellschaft, 1984.

25. A. Fettweis and K. Meerkötter, "On adaptors for wave digital filters," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, 1975, pp. 516-525.


12. P.B. Johns. ''A Symmetrical Condensed Node for the TLM Method," IEEE Trans. Microwave Theory Tech., vol. MTT-33, 1985, pp. 882-893.

13. Tatsuo Itoh, Numerical Techniques for Microwave and Millimeter­II!!ve Passive Structures, New York: 1. Wiley, 1989.

14. \\blfgang Hoefer, "The transmission line matrix (TLM) method," in Numerical Techniques for Microwave and Millimeter-Wave Passive Structures (T. Itoh, ed.), 1989, pp. 496-591.

15. K.S. Yee, "Numerical solution of initial bondary value problems involving Maxwell's equations in isotropic media," IEEE Trans. Antennas Propagat., vol. AP-14, 1966, pp. 302-307.

16. A. Taflove and M.E. Brodwin, "Numerical Solution of Steady­State Electromagnetic Scattering Problems Using the Time­Dependent Maxwell's Equations," IEEE Trans. Microwave Theory Tech., vol. MTT-23, 1975, pp. 623-630.

17. T. Weiland, "On the unique numerical solution of Maxwellian eigenvalue problems in three dimensions," Panicle Accelerators, vol. 17, 1985, pp. 227-242.

18. K. Meerkotter, "Incremental passivity of' wave digital filters," Proc. Eur. SignnJ Processing Conference, Lausanne, Switzerland, Amsterdam: North Holland, 1980, pp. 27-31.

19. A. Fettweis, "Passivity and losslessness in digital filtering," Arch. Elektron. Uhertr., vol. 42, 1988, pp. 1-8.

20. V. Belevitch, Classical Network Theory, San Francisco: Holden­Day, 1967.

21. A. Kummert and M. Piitzold, private communication, 1989. 22. W. Hackbusch, Multi-grid Methods and Applications, Berlin:

Springer-Verlag, 1985. 23. R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Proc­

essing, Englewood Cliffs, NJ: Prentice Hall, 1983. 24. A.A. Samarskij, Theorie der Differenzenvetjahren, Leipzig:

Akademische Verlagsgesellschaft, 1984. 25. A. Fettweis and K. Meerkotter, "On adaptors for wave digital

filters," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, 1975, pp. 516-525.

Alfred Fettweis received the degrees of Ingénieur Civil Electricien and Docteur en Sciences Appliquées from the Catholic University of Louvain, Belgium, in 1951 and 1963, respectively.

From 1951 to 1963, he was employed by the ITT Corporation in Antwerp, Belgium (1951-54 and 1956-63), in Nutley, NJ, USA (1954-56), and in Chicago, IL, USA (1956). From 1963 to 1967, he was a Professor of Theoretical Electricity at Eindhoven University of Technology, The Netherlands. Since 1967, he has been a Professor of Communications Engineering at Ruhr-Universität Bochum, Germany. He has published papers in various areas of circuits and systems, communications, and digital signal processing as well as on some general science-related subjects. He holds some 30 patents (owned either by ITT or Siemens).

A. Fettweis is a recipient of the Prix "Acta Technica Belgica" 1962-63, the Darlington Prize Paper Award (1980), the Prix George Montefiore (1980), the IEEE Centennial Medal (1984), the VDE Ehrenring of the Verband Deutscher Elektrotechniker (1984), the Karl-Küpfmüller-Preis of the Informationstechnische Gesellschaft (ITG) im VDE (1988), and the Technical Achievement Award of the IEEE Circuits and Systems Society (1988). He received honorary doctorates from the University of Linköping, Sweden (1986), the Faculté Polytechnique de Mons, Belgium (1988), and the Katholieke Universiteit Leuven, Belgium (1988). He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and a member of the Rheinisch-Westfälische Akademie der Wissenschaften (Germany), EURASIP (European Association for Signal Processing), ITG (Informationstechnische Gesellschaft, Germany), SITEL (Belgian Telecommunication Engineers' Society), Sigma Xi, and Eta Kappa Nu.

Gunnar Nitsche received the Diplom-Ingenieur degree in electrical engineering from Ruhr-Universität Bochum, Bochum, Germany in 1986. Since 1987, he has been with the Lehrstuhl für Nachrichtentechnik, Department of Electrical Engineering, Ruhr-Universität Bochum, where he is now working towards a doctoral degree. His current research interests include multidimensional digital signal processing and numerical integration of partial differential equations. He is a member of the Institute of Electrical and Electronics Engineers (IEEE).


Signal Processing Using Cellular Neural Networks†

L.O. CHUA, L. YANG, K.R. KRIEG
Dept. of EECS, University of California, Berkeley, CA 94720

Received June 25, 1990; Revised January 7, 1991.

Abstract. The cellular neural network (CNN) architecture combines the best features from traditional fully-connected analog neural networks and digital cellular automata. The network can rapidly process continuous-valued (gray-scale) input signals (such as images) and perform many computation functions which traditionally were implemented in digital form. Here, we briefly introduce the theory of CNN circuits, provide some examples of CNN applications to image processing, and discuss work toward a CNN implementation in custom CMOS VLSI. The role of analog computer-aided design (CAD) will be briefly presented as it relates to analog neural network implementation.

Introduction.

One of the unifying concepts of the field of research known as neural networks is that important computations can be realized in the collective behavior of large numbers of simple, interconnected processing elements. The cellular neural network (CNN) presented here is an example of what we will call very large scale analog processing or collective analog computation. Unlike the analog computing of the 1950's, we are not interested in constructing complicated op-amp circuits on a macro scale. The term collective computation as we will use it here refers to the aggregate behavior of a large number of simple processing elements. Though there are many ways of approaching the topic of collective computation (physicists may examine long-range ordering and spin glass models, computer scientists may study cellular automata, and biologists may research nervous system organization), our approach is that of nonlinear circuit theory. We consider a large array of simple analog computational elements to be a large-scale nonlinear analog circuit. As engineers, we are not only concerned with the elucidation of the behavior of collective analog computation, but also with implementation of our models. For this reason, we study the cellular neural network architecture, since it is locally connected, has a simple nonlinear circuit at each node, and can be implemented using current VLSI fabrication technology.

†This work is supported in part by the Office of Naval Research under Contract N00014-89-J1402, and the National Science Foundation under grant MIP-8912639.

This paper presents results and a brief theoretical background for the CNN processing model. The CNN architecture combines some features of fully connected analog neural networks [1]-[3] with the nearest neighbor interactions found in cellular automata [4]-[8]. We will show that these networks have numerous advantages both for simulation and for VLSI implementation, and can perform (though are not limited to) several important image processing functions. The recent implementation of one of these processing functions in a custom CMOS VLSI integrated circuit will also be discussed. First, in Section 1, we will briefly discuss the theory of CNN circuits, including the issue of stability for a large class of CNN circuits having symmetric coupling at each node, and some recently proven stability conditions for asymmetric coupling. Section 2 will present simulation results which show the wide variety of processing which is possible using CNN arrays. Finally, in Section 3, we will discuss the implementation of a noise removal CNN algorithm in a custom CMOS VLSI integrated circuit.

1. Theory of Cellular Neural Network Circuits

The cellular neural network architecture which we present here has several important features which enable it to perform numerous processing functions and also be implemented using standard VLSI technology; these include: local connectivity, first-order nonlinear continuous-time continuous-variable dynamics, and positive feedback within a node which ensures only two


Fig. 1. Examples of locally connected cellular neural networks having different coupling patterns. The dotted lines indicate that this pattern is repeated throughout the array.

stable equilibrium points. In this section, we show how the circuit of a cell (also referred to as a neuron or node) is designed to exhibit these features and why they are important, both theoretically and practically. A much more detailed account of the theory can be found in [9] and [10]. We also briefly discuss methods of obtaining multi-valued outputs from the CNN neurons, a desirable feature for image processing of gray-scale images.

Like a cellular automata machine, the CNN is locally connected. This means that if we arrange each cell on a grid and draw the coupling between the cells, the connections extend only a short distance from any one cell to a set of cells which are closest. Figure 1 illustrates two possible connectivity patterns in a network whose cells are arranged on a square lattice and which are coupled to their immediate neighbors. In this diagram, each set of arrowed lines indicates bidirectional coupling. It is obvious that many different size neighborhoods and connectivity patterns are possible, so we begin with a quantitative definition of the neighborhood of a cell, which we will call an r-neighborhood. Since the CNN arrays we discuss in this paper have cells which are located on a rectangular grid of M rows and N columns, a particular cell is referred to by its coordinates: the cell in the ith row and jth column is denoted C(i, j). Figure 2 is an example of this labeling in a 4x4 array. Nr(i, j) is the r-neighborhood of

Fig. 2. A simple 4x4 two-dimensional cellular neural network. The squares are the circuit units called cells. The links between the cells indicate the coupling between the cells.

cell C(i, j) and it contains all cells nearest to C(i, j) which are at most r cells away. Specifically,

Nr(i, j) = {C(k, l) | max{|k - i|, |l - j|} ≤ r, 1 ≤ k ≤ M, 1 ≤ l ≤ N}, r ≥ 0
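This definition translates directly into code. The sketch below (our own illustration, not from the paper; the function name and the 0-based indexing are assumptions of the sketch) enumerates the cells of Nr(i, j) on an M x N grid:

```python
def r_neighborhood(i, j, r, M, N):
    """Return 0-based coordinates (k, l) of all cells C(k, l) with
    max(|k - i|, |l - j|) <= r that lie on the M x N grid."""
    return [(k, l)
            for k in range(max(0, i - r), min(M, i + r + 1))
            for l in range(max(0, j - r), min(N, j + r + 1))]

# A corner cell of a 4x4 array has only 4 cells in its 3x3 (r = 1)
# neighborhood, since the rest would fall outside the grid.
print(r_neighborhood(0, 0, 1, 4, 4))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```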

Fig. 3. The geometry of a neighborhood, Nr, of cell C(i, j) for r = 1, r = 2, and r = 3, respectively.

We often refer to neighborhoods with r = 1, r = 2, and r = 3 as 3x3, 5x5, and 7x7 neighborhoods, respectively. Examples of these neighborhoods are shown in figure 3. We note that the neighborhood system described above has a symmetry property in the sense that if C(i, j) ∈ Nr(k, l) then C(k, l) ∈ Nr(i, j) for all cells in the network.

Since these cells are the processors in our CNN architecture, obviously we need a way of coupling them to an input signal. In our theoretical description of the neural network there are as many input lines to the CNN as there are cells in the network (though in practice we may limit this number). However, each cell can receive input not only from its own input line, which we will call its control input, but also from its neighbors' control inputs. Therefore, each cell sees both its input and its neighbors' inputs. The control input coupling to each cell is specified by B(i, j; k, l), which we will refer to as the control operator for the network. B(i, j; k, l) specifies how the input, uij, from the outside world to C(i, j) is coupled both to C(i, j) and its neighbors C(k, l) ∈ Nr(i, j). Likewise, to specify the coupling of the C(i, j) output, yij, to neighbors C(k, l) we will use a feedback operator, A(i, j; k, l). We refer to A(i, j; k, l) as a feedback operator because yij is coupled to C(k, l) ∈ Nr(i, j) and in return ykl is coupled to C(i, j). Therefore, the output of C(i, j) is fed back through the coupling to its neighbors. In A(i, j; k, l) there is also an explicit positive feedback term, that is, A(i, j; i, j) > 0. As we shall see, this positive feedback is essential to ensure that the output settles to a well defined value, even when using non-ideal circuit elements (this is important since we are interested in constructing the circuit).

As we stated in the introduction, each cell of our network has simple nonlinear dynamics. These first-order dynamics not only enable the network to have a richer set of behaviors than, for instance, a synchronous cellular automata machine, but also allow the dynamics of the network to be modeled accurately in the presence of circuit parasitics. Since our CNN network is asynchronous, the first-order dynamics also provide us with a simple method of specifying a processing order in architectures composed of several layers of CNN networks. In these cases, we may design some processing layers to settle (reach equilibrium) first, so that their output is stable during the settling time of the other layers.

In order to simplify the analysis, design, and implementation of CNN circuits, we have chosen to specify that the final state of each cell (after transients have decayed) should have only two possible values. As we shall see, this seemingly severe restriction still allows for many interesting and important processing functions. We should point out that even though each node eventually must settle to one of the two values ±Vy, the equilibrium state voltage, the state dynamics of each node are continuous (the state of a node is not binary valued). Section 2 includes image processing examples where the combination of continuous state variables and simple dynamics enables the CNN to extract different features of an image at various times during the network transient. Later in this section, we will discuss modifications to the nodal circuit which allow multiple-valued output functions for such purposes as gray-level processing or digital output.

In keeping with the goals of simple circuit analysis and design, the inputs uij from the outside world and the neighbor outputs yij affect the state linearly. That is, the state voltage of each cell is driven by a linear combination of the outputs of its neighbors and the inputs to the cell and its neighbor cells. A piecewise-linear sigmoid (see figure 5) converts vxij, the state of cell C(i, j), into an output vyij, which is coupled to the neighbor cells and back to the input.

Figure 4 shows a circuit diagram of a single cell, C(i, j), in terms of ideal circuit elements. In this circuit, the input is a voltage, vuij, shown as an independent voltage source. We shall refer to this controlling input voltage simply as the control input to the cell. The state of the cell is vxij, and the output is vyij. An independent current source, I, serves to bias the state voltage. We will see in Section 2 that, due to the nonlinearity in the output equation, the processing function of the CNN network is not only dependent on B(i, j; k, l) and A(i, j; k, l), but also on I. The voltage inputs, vuij, are coupled to the state through the diamond-shaped voltage-controlled current sources (VCCS) [11] Ixu(i, j; k, l). Likewise, the cell outputs vyij influence the state through the voltage-controlled current sources Ixy(i, j; k, l). For a CNN whose neighborhoods contain m cells

Fig. 4. An example of a cell circuit. C is a linear capacitor; Rx and Ry are linear resistors; I is an independent current source; Ixu(i, j; k, l) and Ixy(i, j; k, l) are linear voltage-controlled current sources with the characteristics Ixy(i, j; k, l) = A(i, j; k, l)vykl and Ixu(i, j; k, l) = B(i, j; k, l)vukl for all C(k, l) ∈ Nr(i, j); Iyx = (1/Ry)f(vxij) is a piecewise-linear voltage-controlled current source with its characteristic f(·) as shown in figure 5; Eij is an independent voltage source.

each, a cell can have m inputs (the control input to itself and its neighbors) and m outputs, so the maximum number of VCCS elements in each cell circuit is 2m. The equations for these controlled current sources are:

Ixu(i, j; k, l) = B(i, j; k, l)vukl,

Ixy(i, j; k, l) = A(i, j; k, l)vykl,

and

Iyx = (1/(2Ry))(|vxij + 1| - |vxij - 1|).

The only nonlinear element in each cell is a piecewise-linear VCCS, iyx = (1/Ry)f(vxij). The function f(·) has the characteristic shown in figure 5. This form of the output equation is similar to the sigmoid nonlinearities which are commonly used as output functions in artificial neural networks. However, in most neural networks this output function has infinite slope (or very high incremental gain) in the transition region to guarantee that the output settles at the endpoints of the curve. Since we are interested in constructing CNN devices using analog VLSI, where this very high incremental gain is difficult or impossible to realize, we use

Fig. 5. The characteristic of the nonlinear controlled source f(·): f(v) = (1/2)(|v + 1| - |v - 1|).

a slope of 1 in the transition region and use positive feedback, A(i, j; i, j) > 0, to ensure that the output settles at one of the endpoints. Thus, the output function


of each cell has a linear region, where vyij = vxij, with two saturation regions, one on either side. Employing a piecewise-linear element in the idealized cell circuit also simplifies the analysis of cell state dynamics, since we can easily find the equilibrium points of vxij [12]-[14].

If we apply KVL and KCL to the circuit in figure 4, the following equations are easily derived:

State Equation:

C dvxij(t)/dt = -(1/Rx)vxij(t) + Σ_{C(k,l)∈Nr(i,j)} A(i, j; k, l)vykl(t) + Σ_{C(k,l)∈Nr(i,j)} B(i, j; k, l)vukl(t) + I,   1 ≤ i ≤ M, 1 ≤ j ≤ N

Output Equation:

vyij(t) = (1/2)(|vxij(t) + 1| - |vxij(t) - 1|),   1 ≤ i ≤ M, 1 ≤ j ≤ N

Constraint Equations:

|vxij(0)| ≤ 1,   1 ≤ i ≤ M, 1 ≤ j ≤ N

|vuij(t)| ≤ 1,   1 ≤ i ≤ M, 1 ≤ j ≤ N

Parameter Assumptions:

A(i, j; k, l) = A(k, l; i, j),   1 ≤ i, k ≤ M, 1 ≤ j, l ≤ N

and

C > 0, Rx > 0, Ry > 0
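To make these equations concrete, the following minimal simulation sketch integrates the state and output equations by forward Euler for a single-layer, space-invariant CNN. This is our own illustration, not the PWLSPICE simulator used for the figures below; the zero padding at the array edges, the step size, and the parameter defaults are assumptions of the sketch:

```python
import numpy as np

def f(v):
    # Piecewise-linear sigmoid of figure 5.
    return 0.5 * (np.abs(v + 1.0) - np.abs(v - 1.0))

def template_sum(T, field):
    # Space-invariant 3x3 (r = 1) template acting on a 2-D field,
    # with zero padding at the array edges.
    M, N = field.shape
    padded = np.zeros((M + 2, N + 2))
    padded[1:-1, 1:-1] = field
    out = np.zeros((M, N))
    for m in range(3):
        for n in range(3):
            out += T[m, n] * padded[m:m + M, n:n + N]
    return out

def cnn_settle(A, B, I, u, x0, C=1.0, Rx=1.0, dt=0.01, steps=2000):
    """Forward-Euler integration of
    C dvx/dt = -vx/Rx + A * f(vx) + B * u + I."""
    vx = x0.astype(float)
    for _ in range(steps):
        vx += (dt / C) * (-vx / Rx + template_sum(A, f(vx))
                          + template_sum(B, u) + I)
    return f(vx)   # settled output voltages vy
```

With A(0, 0) > 1/Rx, as required in Subsection 1.1.4 below for bistable cells, the outputs settle to ±1 after the transient.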

1.1. Circuit Characteristics

There are several features of the CNN cell circuit described above which deserve further comments. These concern computer simulation of a CNN com­posed of such cells, the scaling properties of the cir­cuit components, bounds on the cell state voltage (dynamic range for Vxij), establishing the equilibrium points of the cell, and stability of the entire CNN array. We will briefly discuss them in turn. The reader who wishes an in-depth discussion of these topics should refer to [9).


1.1.1. Simulation Advantages. The ideal model (figure 4) of each cell contains at most three circuit nodes (we can choose to start the simulation with the state capacitor C charged to some initial value and set B(i, j; k, l) = 0 for all the cells; this makes the differential equations above autonomous). Since all the cells in a CNN have the same datum node (all are referenced to the same ground), and since all dependent sources are voltage controlled, CNN arrays are ideal for nodal analysis. When simulating large CNN systems, the locality of connections makes the associated matrix node equation extremely sparse and well suited for parallel simulation on a variety of computer architectures.

1.1.2. Scaling of Circuit Components. When designing a VLSI implementation of any circuit, it is helpful to be able to scale the component values to accommodate the fabrication technology, power dissipation, and parasitics. Though typically the scaling properties of nonlinear circuits are very poor, for CNNs scaling the component values in the cell circuit is quite easy. Within fabrication technology limits, currents and resistances can be scaled over very wide ranges. Since the time constant of the cell is dominated by RxC, it is important to be able to adjust this so that it is slow compared to parasitic time constants (for instance, to make C significantly larger than parasitic capacitances). This ensures that the fabricated CNN circuit will behave more like its device-level simulation (a simulation which takes into account the fabrication technology and device physics characteristics), since unmodeled parasitics are unlikely to significantly affect the dynamics of the cell. The additional benefit of scaling the circuit elements is that in multi-layer CNN architectures (where the processing is performed using several coupled 2D arrays) some layers can have faster dynamics than others. For instance, this would ensure that the output from one layer remained constant (or approximately constant) while another layer operated on that output. The designer of a VLSI multi-layer CNN architecture would then have the additional freedom to design processing functions possessing multiple time scales (within constraints imposed by the fabrication technology).

1.1.3. Bounding the State Voltage. Finding a bound on Vx, the maximum state voltage of the cell, is very important because the transient of an array of cells can be quite different from that of a single cell. When designing a CNN device, we need to be sure that vxij is within the limits of proper circuit operation. For example, when designing a transistor circuit we expect


to bias it within some operating range. Keeping within this range means that the device behavior will conform closely with the simulations. Operation outside this range may cause the device to be highly nonlinear, creating unmodeled distortions or (in extreme cases) forming undesired equilibrium states (such as latchup in CMOS circuits). Bounding the transients of the array is especially important if our processing depends on some properties of the transient, as we will see in the corner extraction CNN example in Section 2. In [9] it is shown that for any CNN array the states are bounded by:

|vxij(t)| ≤ Vx = 1 + Rx|I| + Rx · max_{1≤i≤M, 1≤j≤N} [ Σ_{C(k,l)∈Nr(i,j)} (|A(i, j; k, l)| + |B(i, j; k, l)|) ]

As an example, suppose we wish to scale our circuit values so that Rx|I| ≤ 1, Rx|A(i, j; k, l)| ≤ 1, and Rx|B(i, j; k, l)| ≤ 1, for all i, j, k, l. In practice this might be done during initial circuit design, where components are normalized to convenient values. In this case Vx = 20 volts for a CNN array with 3x3 neighborhoods. This value is quite reasonable, since many analog ICs operate with supplies between 5V and 30V.
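The bound is easy to evaluate for given templates. The sketch below (our own; the helper name is illustrative) reproduces the 20 V figure for the normalization just described:

```python
import numpy as np

def state_bound(A, B, I, Rx=1.0):
    # Vx = 1 + Rx*|I| + Rx * max over cells of sum(|A| + |B|); for a
    # space-invariant template the max is simply the template sum.
    return 1.0 + Rx * abs(I) + Rx * (np.abs(A).sum() + np.abs(B).sum())

A = np.ones((3, 3))   # Rx*|A(i, j; k, l)| = 1 for each of the 9 entries
B = np.ones((3, 3))   # Rx*|B(i, j; k, l)| = 1 likewise
print(state_bound(A, B, I=1.0))   # 1 + 1 + 9 + 9 = 20.0
```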

1.1.4. Equilibrium Points of the Cell Circuits. In [9] it was shown that under fairly mild constraints the cell circuit shown in figure 4 has at most 2 stable equilibrium points. These are the points where the circuit state will settle after the transient has decayed. This will be true when A(i, j; i, j) > 1/Rx, that is, when the feedback of the cell's own output to its own input is greater than 1/Rx. The bias current I determines the breakpoint at which the weighted sum of the cell inputs changes the derivative of the state voltage. This can be seen from the state equation of the cell, since the derivative of vxij is driven by the sum of the weighted inputs and the bias current I. We will see in Section 2 that the bias current plays a crucial role in determining the functioning of the CNN array.

1.1.5. Stability of the Arrays of Cells. Though we have shown that the transient of an array of CNN cells has reasonable bounds, another property of the array which must be examined is stability. Though we can show that an isolated cell is stable (converges to a well defined state voltage), we must prove that the entire array of cells is also stable. With so many signal paths and each cell having first-order dynamics, it is not unreasonable to expect that signal paths may exist which constitute an unstable loop. For instance, we might expect that sustained oscillations would be possible when processing signals such as gray-level images, where cell states may not start at ±1. In [9] stability was proven by constructing a Lyapunov function and using Lyapunov's method for CNN arrays whose cells have symmetric coupling. By symmetric coupling we refer to the previously stated Parameter Assumption: A(i, j; k, l) = A(k, l; i, j), 1 ≤ i ≤ M, 1 ≤ j ≤ N. This symmetric coupling assumption appears frequently in the literature on fully-connected neural networks, often in reference to Hopfield neural networks [1], [2], [15]-[17]. Though it seems that symmetric coupling is a severe restriction, many important processing functions can be implemented within this framework. However, seeking to overcome the symmetric coupling assumption, we recently proved the stability of some classes of CNN arrays having asymmetric coupling [18]. Examples of processing using asymmetric coupling are presented in Section 2. Each of these examples is taken from the class of asymmetrically coupled CNN arrays for which stability has been proven.
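For a space-invariant template, the symmetry assumption A(i, j; k, l) = A(k, l; i, j) reduces to point symmetry of the cloning template about its center, A(m, n) = A(-m, -n), which is trivial to test (our own sketch):

```python
import numpy as np

def has_symmetric_coupling(A):
    # A(m, n) == A(-m, -n) for all entries, i.e., the template equals
    # its own 180-degree rotation.
    return np.allclose(A, np.rot90(A, 2))

print(has_symmetric_coupling(np.array([[0., 1., 0.],
                                       [1., 2., 1.],
                                       [0., 1., 0.]])))   # True
print(has_symmetric_coupling(np.array([[0., 0., 0.],
                                       [1., 2., -1.],
                                       [0., 0., 0.]])))   # False
```

The second template is exactly the asymmetric one used for connected segment extraction in figure 9 below.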

2. Examples of Cellular Neural Network Signal Processing

Here, we discuss some previously discovered cellular neural network processing functions and some recently developed functions along with some insights on the dynamics of CNN processing. Many of the examples we present here are drawn from image processing­noise removal, corner extraction, edge extraction, con­nectivity analysis, the Radon transform, and thinning. Image processing is a natural application of these net­works since the 2-dimensional structure and nearest neighbor interactions of the CNN architecture are similar to the localized interactions which typify many image processing algorithms. Other examples are drawn from the simulation of physical systems, since there is an obvious analogy between the uniform arrays of CNN cells with simple local dynamics and physical systems which may be modeled by interactions of small regions possessing simple dynamics. Examples in this category are the diffusion equation (heat equation) and wave equation.

The pictures presented in this section are the results of computer simulation of CNN arrays. The simulations were run using PWLSPICE, a version of the circuit simulator SPICE which was modified to efficiently handle piecewise-linear circuit components. The input to PWLSPICE is a file describing all the components and connections in the CNN array. The output is a file containing the cell output voltages as a function of time. This file is then converted into a picture of the output voltages at each time point. Each pixel in this image represents the output voltage of the cell at that location in the array. Black and white pixels correspond to the maximum and minimum voltages (1.2V and -1.2V), respectively. Voltages between these extremes are represented by gray-scale pixels (half-tone squares). Most of these images were generated using CNN cell circuits as described in Section 1, so regardless of the initial values of the pixels, the final output image will be black and white. In these pictures, intermediate gray shades which are not in the original image illustrate the performance of the circuit during transients.

Along with each picture we present a set of templates, A(k, l) and B(k, l), which visually indicate the terms of the operators A(i, j; k, l) and B(i, j; k, l). We refer to A(k, l) and B(k, l) (or simply A and B) as the cloning templates because the coupling they represent is repeated at each cell in the array. The indices i, j in A(i, j; k, l) and B(i, j; k, l) have been dropped to emphasize the space invariance of the template, i.e., A(i, j; k, l) = A(i-k, j-l) for all i, j, k, l. Therefore, for a single layer (2-dimensional) CNN array, the A and B templates are all that is required to describe the connectivity and input coupling. This concept can be extended to multi-layer CNN architectures, as discussed in [9]. A numerical example of an A template is shown below with the corresponding operator A(i, j):

[0.0 1.0 0.0
 0.0 2.0 1.0
 0.0 0.0 0.0]

[A(-1, -1)  A(-1, 0)  A(-1, 1)
 A(0, -1)   A(0, 0)   A(0, 1)
 A(1, -1)   A(1, 0)   A(1, 1)]

We can describe the action of a template by introducing the idea of a convolution operator from image processing. This operator, *, specifies the action of the template on the associated circuit variables. In the case of the feedback operator A(k, l), we can define its action on vyij, the outputs of the cell and its neighbors, by the following equation:

A * vyij = Σ_{C(k,l)∈Nr(i,j)} A(k-i, l-j)vykl

where A(m, n) denotes the entry in the mth row and nth column of the template. The center element, A(0, 0), is defined as the origin of the template.
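In code, the action of a template is an ordinary 2-D cross-correlation restricted to the neighborhood. The sketch below (ours; the zero padding for cells whose neighborhood is clipped at the array edges is an assumption) evaluates A * vyij over a whole array at once:

```python
import numpy as np
from scipy.signal import correlate2d

# A * vyij = sum over C(k, l) in Nr(i, j) of A(k - i, l - j) * vykl
# is a cross-correlation of the output field with the template.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 0.0]])
vy = np.random.uniform(-1.0, 1.0, size=(8, 8))
Av = correlate2d(vy, A, mode="same", boundary="fill", fillvalue=0.0)
```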

Note that some of the examples presented here do not have any B template. In these examples the array begins with the state capacitor already charged to the initial state. Although in practice this capacitor would be charged to the initial state by an input signal, in the simulations we are only concerned with the dynamics of the array after t = 0, the time after the initial conditions are set at each cell. The reason for not showing the setup of initial conditions at each cell is that in practice the state capacitor is disconnected from the rest of the cell circuit during this time (see Section 3 for details), so the dynamics of state capacitor charging depend on the external charging circuit. For an external circuit with low output impedance, the capacitor charges with a rapid RC time constant, usually on the order of 500 ns, whereas the array settling times (after t = 0) are typically about 10 μs and are the determining factor in specifying the speed of CNN devices.

Also note that all examples presented here have an A template, even those which are not coupled to their neighbors' outputs (the edge and corner extraction examples). This is because the direct feedback of each cell's output to its input is specified by the center element of the A template.

In Section 1 we mentioned that the bias current, I, is also important in determining the functioning of the array. This bias current determines the threshold that the weighted input sum must cross to change the derivative of the state voltage. The influence of I is best noted in the edge extraction and corner extraction examples, where the templates are identical but the bias current is changed so that only corner pixels remain.

We begin our "applications gallery" with examples from image processing.

2.1. Noise Removal

A common operation in image processing is the removal of pixel noise which is uniformly distributed across the image and whose amplitude distribution is Gaussian. Though there are many very sophisticated algorithms for noise removal, the simplest one is simply to perform spatial low-pass filtering, or weighted averaging, of pixels in each neighborhood. In this case pixel (i, j) takes on a value which is the weighted average of its original intensity and that of its neighbors. The space-


A = [0.0 1.0 0.0
     1.0 2.0 1.0
     0.0 1.0 0.0]

Fig. 6. (a) the input image with noise; (b) the output of CNN after a short transient; (c) the output of CNN after a longer transient; (d) the output of CNN at the steady state; (e) the template of the feedback operator A. The controlling operator B is zero. The initial state voltage of each cell of the first layer is the input image pixel value and the initial state voltage of the second layer is zero.

constant of this filtering is determined by the size of the neighborhood and by the weighting values used to average neighboring pixels.

Figure 6 shows the effects of a noise removal CNN circuit on a dark square corrupted by Gaussian noise. Figure 6a is the initial image; figures 6b and 6c are snapshots of the network as it settles to the equilibrium state in figure 6d. The template describing


the local averaging is shown in figure 6e. From the template we can see that the pixels surrounding the center of each neighborhood are weighted with value 1.0 relative to the center pixel, which has weight 2.0. If we examine the circuit of figure 4 we can see that this averaging is performed by the current summation at the state node (vxij).

The CNN array in this example has been designed and recently fabricated as a CMOS VLSI IC. Details of the implementation can be found in Section 3.
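Using the Euler sketch from Section 1 (again our own illustration: the image size, noise level, and zero bias current are assumptions; as the caption of figure 6 states, B is zero and the noisy image enters as the initial state), the experiment of figure 6 looks roughly as follows:

```python
import numpy as np

# Dark square on a light background, corrupted by Gaussian noise and
# clipped to satisfy the constraint |vx(0)| <= 1.
rng = np.random.default_rng(0)
image = -np.ones((16, 16))
image[4:12, 4:12] = 1.0
noisy = np.clip(image + 0.4 * rng.standard_normal(image.shape), -1.0, 1.0)

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 0.0]])      # figure 6e feedback template
B = np.zeros((3, 3))                 # the controlling operator B is zero
clean = cnn_settle(A, B, I=0.0, u=np.zeros_like(noisy), x0=noisy)
```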


2.2. Edge Extraction

The edge extraction CNN array is an example of processing which is achieved using an A template with no coupling from the neighbors' outputs to the cell (all off-center entries of A are zero). The effects of the local dynamics can be seen in the slow fading of the center region of the square. The discussion in the next example provides a more complete explanation of the dynamics of settling.

A = [0.0 0.0 0.0          B = [-0.25 -0.25 -0.25
     0.0 1.0 0.0               -0.25  2.0  -0.25
     0.0 0.0 0.0]              -0.25 -0.25 -0.25]

Fig. 7. (a) the input image; (b) the output of CNN after a short transient; (c) the output of CNN after a longer transient; (d) the output of CNN at the steady state; (e) the template of feedback operator A and the template of control operator B. For edge extraction, the biasing current I = -0.25 μA.


The functioning of the edge extraction circuit is quite simple if we consider the effect of the template B and the bias current I. We can see from the template that the inputs to each cell in a neighborhood are summed with equal negative weights for pixels which are off-center. For example, in the center region of the example the sum would be 0.0, but for an edge (defined as fewer than 8 neighboring pixels on in our B&W images) the sum is greater than 0.0. We have set the bias current to -0.25 μA, which results in the decay of the center pixel to -1.2V (white) for any neighborhood with a pixel sum less than 0.25V. But neighborhoods where the sum is 0.25 or greater will cause the center cell to increase in value to +1.2V (the pixel black value).

2.3. Corner Extraction

The CNN array of figure 8 extracts the corner pixels from the input image. This is a very common processing function for cellular automata machines (CAM), where the processor at each pixel sums the number of neighbors which are on or off and, from this number, decides whether to remain on (it is a corner) or off (it is an edge or interior pixel). For the CAM this is a simple one-step logical process of counting pixels. As we shall see below, the CNN circuit, by virtue of its local dynamics, offers some interesting advantages compared to this one-step process. As in the edge extraction example above, this CNN array has only coupling between the inputs and each cell in a neighborhood and no dynamical coupling between the cells. In fact, the templates for the edge extraction and corner extraction circuits are the same; the only difference is the bias current I. In this case I = -1.75 μA, which means that more neighboring pixels must be off (or close to off) for the center pixel to remain.

One may compare this CNN, which has no dynamical coupling between the cells, with perceptrons [19]-[22], which have similar feed-forward connections but are often not localized and usually have no dynamics associated with the nodes. CNN local dynamics provide the advantage of enabling the array to perform processing while the circuit is settling, not only when the final state is reached.

The corner detection CNN circuit is an excellent example of the effects of cell dynamics on the aggregate behavior of the network. It is interesting to note that edges of the input image are first extracted during the transient of the corner extraction network (figures 8b and 8c). This phenomenon can be easily understood if we examine the dynamics induced by the corner detection template. In the template defining A(k, l) and B(k, l), figure 8e, we have normalized the coefficients so that the product of the cell capacitance and resistance is unity (CRx = 1). The binary (±1) image to be processed, uij, appears at t = 0 as the initial condition on the state capacitor and as the voltage input to the cell. That is, the state capacitor is assumed to be charged to the input voltage at t = 0. The normalized dynamics of the state voltage, vxij, are given by

dvxij/dt = f(vxij) + B * u + I

where vxij(0) = uij and f(vxij) = -vxij + 0.5|vxij + 1| - 0.5|vxij - 1|. Kernel convolution is denoted by *.

This equation is simply a first-order nonlinear ordinary differential equation in one variable. If we plot dvxij/dt against vxij, we can easily determine the equilibrium points of the cell dynamics and the dynamic routes to these equilibrium points, as discussed in Section 1 and in [9]. The addition of the terms B * u + I simply biases the f function (see Section 1). Note that if B * u + I ≠ 0, the cell has a unique equilibrium point which is globally asymptotically stable; the state of the cell will eventually settle to that point [11]. That is,

B * u + I > 0 → lim_{t→∞} vxij(t) = 1

B * u + I < 0 → lim_{t→∞} vxij(t) = -1

Although the circuit has multiple equilibrium points when B * u + I = 0, this problem never arises in practice. Since the input to each cell is ±1, we can see that

B * u + I = 2uij + 0.25 - 0.5n ≠ 0, 0 ≤ n ≤ 8

where n is the number of neighbor inputs equal to +1. First, assume that uij = -1. Then B * u + I ≤ -1.75 for 0 ≤ n ≤ 8. Since the initial state of the cell is -1, the output will be -1 for t ≥ 0. Now assume that uij = +1. The equations above imply that the output will remain at +1 for n ≤ 4. Typically, n = 5 for an edge pixel and n = 8 for an interior pixel. The state of the cells associated with both of these types of pixels will eventually decay to -1, leaving only corner pixels. However, since the derivative of the state of the interior pixels is more negative than the derivative of the state of the edges, the interior pixels decay faster than the edge pixels, resulting in the edges being extracted during the transient. Note that by simply increasing I to -0.25, we increase the threshold value of n at which a pixel will decay. In this case, the edge pixels will not decay and the network functions as an edge extractor.
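The case analysis above can be tabulated directly (a sketch of our own; it merely evaluates the drive B * u + I for the figure 8 templates, with n of the eight neighbors at +1 and the rest at -1):

```python
def drive(u_center, n, I):
    # Center weight 2.0 and eight neighbor weights -0.25 (figure 8e);
    # n neighbors are +1, the remaining 8 - n are -1.
    neighbor_sum = n - (8 - n)
    return 2.0 * u_center - 0.25 * neighbor_sum + I

for I, label in [(-1.75, "corner extraction"), (-0.25, "edge extraction")]:
    keep = [n for n in range(9) if drive(+1.0, n, I) > 0]
    print(label, "-> a black pixel survives for n in", keep)
# corner extraction -> n in 0..4 (edge pixels, n = 5, and interior
#                      pixels, n = 8, decay to -1)
# edge extraction   -> n in 0..7 (only interior pixels, n = 8, decay)
```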

A = [0.0 0.0 0.0          B = [-0.25 -0.25 -0.25
     0.0 1.0 0.0               -0.25  2.0  -0.25
     0.0 0.0 0.0]              -0.25 -0.25 -0.25]

Fig. 8. (a) the input image; (b) the output of CNN after a short transient; (c) the output of CNN after a longer transient; (d) the output of CNN at the steady state; (e) the template of feedback operator A and the template of control operator B. For corner extraction, the biasing current I = -1.75.

A = [0.0 0.0 0.0
     1.0 2.0 -1.0
     0.0 0.0 0.0]

Fig. 9. (a) the input image; (b, c, d) the output of CNN during the transient; (e) the output of CNN in its steady state; (f) the template of feedback operator A.

A = [0.0  1.0 0.0
     0.0  2.0 0.0
     0.0 -1.0 0.0]

Fig. 10. (a) the input image; (b, c, d) the output of CNN during the transient; (e) the output of CNN in its steady state; (f) the template of feedback operator A.


2.4. Connected Segment Extraction

These figures show processing with a CNN array which maps positive transitions along horizontal or vertical lines into pixels at the far right hand side of the image [23]. Figure 9a illustrates this mapping using a letter "A" projected horizontally. In traversing the image along a horizontal line which intersects the "-" part of the letter, there is only one positive transition (from white to black pixels). This is mapped into a single pixel at the far right. Likewise, the single positive transition at the top of the "A" is mapped to a single pixel at the far right. The portions of the "A" which have white space between them are mapped to two vertical lines in the output image. Similarly, the output image in figure 10 can be considered as a map (downward) of positive transitions in the vertical direction. Though only horizontal and vertical projections are shown here, the CNN can perform projections along any line. This can be accomplished by changing the size and coupling coefficients of the neighborhood.
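Functionally, the mapping amounts to counting the white-to-black transitions along each row (or column) and stacking that many black pixels at the line's far end. The sketch below (ours; it computes the equivalent input/output map rather than simulating the CNN dynamics, and assumes white padding outside the image) mimics the horizontal case of figure 9:

```python
import numpy as np

def horizontal_segment_map(img):
    """img: 2-D array of +1 (black) / -1 (white) pixels. Each row of
    the result carries one black pixel at the far right per positive
    (white -> black) transition found in that row."""
    out = -np.ones_like(img)
    for r, row in enumerate(img):
        padded = np.concatenate(([-1.0], row))        # white outside
        transitions = int(np.sum((padded[:-1] < 0) & (padded[1:] > 0)))
        if transitions:
            out[r, img.shape[1] - transitions:] = 1.0
    return out
```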

It is important to remember that the CNN array is not a clocked synchronous machine, but an array of analog processors. The I/O mapping shown in the figures occurs asynchronously, in parallel, across the entire network. Though the final state is a binary image, the input can have gray-scale values, which would interact as continuous voltages until the final state is achieved. This final state is simply a minimum of an energy function of the network (though it may not be a global minimum) [16], [24]. This method of processing provides an interesting contrast to typical highly-parallel image computation architectures such as cellular automata machines [4], [5], [25], where processing is performed using clocked finite state devices.

Though it may be difficult to construct a CNN array which could be programmed to perform the connected segment analysis in arbitrary directions, groups of CNN ICs may be used to decompose an image simply by using various orientations of the ICs relative to the input image. A circuit board composed of several CNN devices could be used to rapidly determine the number of connected line segments in an image. Such a processor board would have many applications to problems in high-speed character recognition and parts inspection.

2.5. Radon Transform

The Radon transform is the basis of many image processing algorithms, perhaps the most important of which is Computerized Tomographic scanning (often called CT or CAT scanning) in medicine. The Radon transform of a 2-dimensional function f(x, y) is defined as the integral along a line inclined at an angle θ from the y-axis and at a distance s from the origin. See [26] for a complete discussion. The transform can be written as

g(s, θ) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} f(x, y) δ(x cos θ + y sin θ - s) dx dy,   -∞ < s < ∞, 0 ≤ θ < π

The Radon transform operator, R, is also referred to as the projection operator, and the quantity g(s, θ) is called the ray-sum, because the integral equation above represents the summation of f(x, y) along a ray at an angle θ and distance s from the origin.

Our Radon transform CNN array operates in a similar fashion [27]. However, since our output image is binary valued, the result of our transform is not a set of single pixels of varying intensities along a line perpendicular to θ. Rather it is a histogram, where the height at each point represents the number of pixels encountered along the ray at distance s and angle θ. Essentially, the mass of the object is shifted along this ray to an endpoint. The result is a profile of the thickness of the object in the projection direction θ. Using the letter "A" as an example, figures 11 and 12 show projections along the horizontal and vertical axes, respectively. Again note that though the image begins and ends as black and white pixels, the continuous voltage dynamics are evident during the transient, where gray pixels emerge. Also, though it would seem that the transient is the result of clocking an array of digital processors, the pictures are snapshots of the asynchronous continuous-time dynamics. Since all cells in the array have approximately the same time constant, although each cell in the image is evolving asynchronously, the result is an image which appears to shift synchronously.
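In discrete terms, the binary ray-sum amounts to counting the black pixels along each ray and shifting them to the ray's endpoint. A sketch of the equivalent computation for the two axis-aligned projections shown (ours, not a simulation of the multi-layer network itself):

```python
import numpy as np

def axis_projection_histogram(img, axis=1):
    """img: +1 (black) / -1 (white) image. Shift the black 'mass' of
    each row (axis=1) or column (axis=0) to its far end, producing a
    thickness profile as in figures 11 and 12."""
    out = -np.ones_like(img)
    counts = (img > 0).sum(axis=axis)    # black pixels per row/column
    for idx, c in enumerate(counts):
        if axis == 1 and c > 0:
            out[idx, -c:] = 1.0          # histogram bar at the right
        elif axis == 0 and c > 0:
            out[-c:, idx] = 1.0          # histogram bar at the bottom
    return out
```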

The Radon transform CNN is our first example of a multi-layer CNN. As can be seen in the figures, several templates are necessary to describe the coupling in this network. A12 describes the coupling from a neighborhood in layer 2 to the center cell in layer 1. We now have neighborhoods not only in two dimensions, but in three.

2.6. Two-Layer CNN for Noise Removal and Corner Extraction

Several times throughout this paper we have mentioned multiple-layer (multi-layer) CNN arrays. The example presented in figure 13 is a simple example of a two-layer CNN which combines a noise removal layer with a corner extraction layer. The input image is a letter "a" corrupted by additive Gaussian noise. This image is the input to the noise removal layer. The output of the noise removal layer is then input to the corner extraction layer. The organization of this CNN is very similar to that of a multi-layer feedforward perceptron [3], [28], though in this case the state in each layer has continuous-variable dynamics rather than a bistable state. It is interesting to note that, because of the design of the templates for the layers, they can operate in parallel. This means that the corner extraction layer need not have slower dynamics than the noise removal layer. As in the corner extraction example presented previously, the edges of the "a" are extracted first, and finally only the corners remain.

[Figure 11 artwork: image panels (a)-(f) and, in (g), two rows of three 3×3 feedback templates with printed values

0.0 0.0 0.0   0.0 0.0 0.0   0.0 0.0  0.0
0.0 1.0 0.0   0.0 1.0 0.0   0.0 1.0 -1.0
0.0 0.0 0.0   0.0 0.0 0.0   0.0 0.0  0.0

0.0 0.0 0.0   0.0  0.0 0.0   0.0 0.0  0.0
0.0 1.0 0.0   1.0 -1.0 0.0   0.0 1.25 0.0
0.0 0.0 0.0   0.0  0.0 0.0   0.0 0.0  0.0 ]

Fig. 11. (a) the input image; (b, c, d, e) the output of the CNN during the transient; (f) the output of the CNN in its steady state; (g) the templates of feedback operator A.



[Figure 12 artwork: image panels (a)-(f) and, in (g), two rows of three 3×3 feedback templates with printed values

0.0 0.0 0.0   0.0 0.0 0.0   0.0  0.0 0.0
0.0 1.0 0.0   0.0 1.0 0.0   0.0  1.0 0.0
0.0 0.0 0.0   0.0 0.0 0.0   0.0 -1.0 0.0

0.0 0.0 0.0   0.0  0.0 0.0   0.0 0.0  0.0
0.0 1.0 0.0   0.0 -1.0 0.0   0.0 1.25 0.0
0.0 0.0 0.0   0.0  0.0 0.0   0.0 0.0  0.0 ]

Fig. 12. (a) the input image; (b, c, d, e) the output of the CNN during the transient; (f) the output of the CNN in its steady state; (g) the templates of feedback operator A.

Since the first layer performs simple local averaging, it has one ill effect which can be seen in the output image: the serif in the lower right hand corner of the "a" is missing from the output. If corner extraction were performed without noise removal, this pixel would remain in the output image. Multi-layer CNN circuits are an example of the processing which is possible through a building-block approach to CNN processor design. When designing an IC to perform more complicated processing, the ability to combine previously designed and tested layers (similar to standard-cell VLSI layout) significantly shortens the design cycle. Libraries of such CNN processing layers can be constructed, enabling the VLSI designer to easily experiment with various processing structures.



[Figure 13 artwork: image panels (a)-(e) and, in (f), the two-layer feedback templates. As far as they can be recovered from the printed values, the layer-1 (noise removal) template is

0.0 1.0 0.0
1.0 2.0 1.0
0.0 1.0 0.0

and the layer-2 (corner extraction) template is

-0.25 -0.25 -0.25
-0.25  2.0  -0.25
-0.25 -0.25 -0.25 ]

Fig. 13. An example of a two-layer CNN for corner extraction with a noise-corrupted image. (a) the input image with additive Gaussian noise; (b, c, d) the output of the CNN during the transient; (e) the output of the CNN at the steady state; (f) the templates of the two-layer feedback operator A. The B template for both layers is zero since the image is loaded as the layer-one initial state voltage. The biasing current I1 for the first layer is zero, and the biasing current I2 for the second layer is -1.75 μA.

2.7. Multi-Level Outputs from CNN Cells

In Section 1, we briefly mentioned that CNN cells need not be limited to binary outputs. We are currently exploring several methods of obtaining multiple-valued (multi-valued) outputs from each cell. This is an important research topic since multi-valued outputs enable CNN cells to produce gray-scale output. Thus, CNN arrays could perform processing with gray-scale inputs (already possible with the CNN circuit presented here) resulting in gray-scale outputs, which could then be displayed or further processed by additional CNN arrays or other processing schemes which employ gray-scale information. It is our intention to eventually construct CNN circuits which will perform such gray-scale processing.

Of the several methods we are examining to achieve multi-level outputs, the simplest is to allow the piecewise-linear curve described in figure 5 to have more than two segments of zero slope. An example of such a curve is illustrated in figure 14f, where we have added additional steps to the sigmoid. This would enable the cell circuit to settle to multiple equilibria. Obviously, this technique cannot be used to achieve very many different output levels. Theoretically, there is no limit to the number of segments (stable points). However, in practice, the number of equilibria produced by the steps in the nonlinearity is limited by circuit element tolerances and noise. Therefore, this method is expected to be used to achieve only a small number of output levels, perhaps 3 or 4 bits of equivalent gray-scale values.

[Figure 14 artwork: image panels (a)-(d), the 3×3 feedback template (e) with printed values

0.0 1.0 0.0
1.0 3.0 1.0
0.0 1.0 0.0

and the multi-step output nonlinearity f(v) in (f).]

Fig. 14. (a) the input image with noise; (b) the output of the CNN after a short transient; (c) the output of the CNN after a longer transient; (d) the output of the CNN at the steady state; (e) the template of feedback operator A; (f) the nonlinear characteristic of the output function. For this multi-level noise removal CNN, the control operator B and the biasing current I are zero. The initial state voltage of each cell is the input image pixel value.
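One way to build such a multi-step nonlinearity on paper (the particular levels and ramp width below are illustrative assumptions, not values read from figure 14f) is to sum shifted copies of the standard piecewise-linear sigmoid; each gap between ramps is a zero-slope plateau, and each plateau contributes one stable output level.

import numpy as np

def staircase_output(v, centers=(-2.0, 0.0, 2.0), step=2.0 / 3.0, ramp=0.25):
    # Each term is a narrow PWL sigmoid of height 'step' centered at c; with
    # the defaults the sum has four flat levels: -1, -1/3, 1/3, and 1.
    v = np.asarray(v, dtype=float)
    f = np.zeros_like(v)
    for c in centers:
        f += (step / (4.0 * ramp)) * (np.abs(v - c + ramp) - np.abs(v - c - ramp))
    return f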

The example of figure 14 shows a square object, composed of three levels of intensity, corrupted by noise. The template shown in figure 14e is the A template, since (as in figure 6) the B template contains only 0's (the circuit simulation was started with the image as an initial charge on the state capacitor). In contrast to the other examples in this section, as the noise is removed (using local smoothing) the gray pixels of the square are not transformed into black pixels. Instead, these pixels remain at the equilibrium points closest to their initial values. Note that due to corner effects, the initially white pixels at the corners of the image have been transformed into gray pixels. This is because the edge conditions (the absent pixels surrounding the corner) have been set to 0.0 voltage, which is equivalent to a gray pixel value. During the local averaging, the white corners are pulled up by the edge condition (the average of the gray pixels) enough to put them in the second stable region of the f(·) curve shown in figure 14f.

2.8. Thinning

In many computer vision applications, objects in a scene are best characterized by structures composed of lines or arcs. In this way, non-essential features such as line thickness need not be considered in further processing. Chromosome and cell structure analysis [4], [25], recognition of handwritten or printed characters, and automated printed circuit board inspection are examples of applications where reduction to line segments or arcs is helpful. In these applications, thinning is applied as one of the transformations prior to higher level processing.


Fig. 15. An example of a multi-layer CNN which performs thinning. The size of the image is 64×64 and the number of CNN layers is 8. (a) and (c) are the input images; (b) and (d) are the output images.


In digital image processing, thinning algorithms transform an object into a set of simple binary-valued arcs which lie approximately along its medial axes. Figure 15 presents the results of an 8-layer CNN array which performs thinning [29]. Though it would seem to be a simple procedure, the thinning problem is more difficult than it appears. Two tasks must be implemented: peeling the outer pixels off, and stopping the peeling process when the width reduces to exactly one pixel. In CNN arrays the main difficulty lies with the stopping decision, since it must be implemented automatically in analog hardware. Most of the traditional algorithms for thinning are digital, synchronous, and sequential, while the method we present here is analog, asynchronous, and parallel. Figure 15 shows two simulations of CNN arrays performing thinning on handwritten characters. Due to the complexity of the various templates (there are 16 of them, and half are 5×5 stopping templates) we will not present them here. The interested reader is urged to consult [29].

2.8.1. Simulation of Physical Systems. The locality of dynamics and nearest neighbor coupling of cellular neural networks make them ideal for real-time simulation of physical systems which can be described by partial differential equations. The phenomena can be represented by fields, where a change of the field at one point affects the value at nearby points, and so the effects propagate. This relationship is best expressed in terms of space or time derivatives [30]. It is easy to see how CNN arrays are well suited to simulate such systems: the spatial differences arise directly from analog differencing of nearest neighbors, and the time evolution from the local dynamics of each cell. Here, we present two examples of physical system models, the diffusion equation and the wave equation.

2.9. Diffusion Equation [heat equation]

The diffusion equation is often referred to as the heat equation since one of the earliest applications of this equation was to describe the flow of heat from a region of higher temperature to one of lower temperature. The equation in two dimensions is

$$\frac{\partial \rho}{\partial t} = \frac{K}{C}\left[\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right]\rho$$

where K and C are the heat conductivity and heat capacity of the material, respectively. Here, ρ is the heat per unit area, which is proportional to the temperature (ρ = C·T, where T is the temperature of the material).

The example of figure 16 shows the simulation of a CNN array which has been designed to emulate the heat equation on a square lattice, in two dimensions, with an irregular boundary condition. Physically, the boundary conditions would be a hot region (constant temperature) at the top of the slab of material, a colder region (constant temperature) at the bottom, and a short colder bar extending halfway into the slab. The sequence of pictures from (a) to (f) shows the time course of the temperature. Figure 16g presents the A template used to generate the CNN coupling. In the simulation, the circuit starts with the initial condition on the state capacitors, so there is no B template describing the control input coupling.
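The template of figure 16g (0.25 on the four nearest neighbors, 0.0 at the center) is recognizable as one explicit finite-difference step of the diffusion equation. The sketch below is an assumption-for-illustration rewrite in discrete time, not the analog circuit; the clamped top and bottom rows stand in for the constant-temperature regions (the short cold bar would be a further clamped set of cells), and the left and right edges wrap here for brevity.

import numpy as np

def diffuse(rho0, hot=1.0, cold=-1.0, steps=500):
    # Each update is rho <- rho + 0.25 * (discrete Laplacian of rho), i.e.,
    # each cell moves toward 0.25 times the sum of its four neighbors.
    rho = rho0.astype(float)
    for _ in range(steps):
        lap = (np.roll(rho, 1, 0) + np.roll(rho, -1, 0) +
               np.roll(rho, 1, 1) + np.roll(rho, -1, 1) - 4.0 * rho)
        rho = rho + 0.25 * lap
        rho[0, :] = hot    # constant-temperature region at the top
        rho[-1, :] = cold  # constant-temperature region at the bottom
    return rho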

2.10. Wave Equation

The wave equation differs from the diffusion equation by having a second time derivative instead of a first. In this case the partial differential equation in two dimensions is

$$\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right)E = \frac{1}{c^2}\frac{\partial^2 E}{\partial t^2}$$

where E is the field vector and c is the speed of propagation in the medium. Since the equation has a second time derivative, a two-layer CNN array is required (only one time derivative per layer). The results of the simulation of this array are shown in figure 17. The initial condition is +V in the upper half of the image and -V in the lower half. The boundary conditions are 0 at all the edges.

This simulation is analogous to mounting an elastic sheet on a square frame, pulling the top half of the sheet up and the bottom half down, and then letting go. It is also similar (though only in two dimensions) to a square-box electromagnetic resonator, started with the initial condition of opposite charges in the top and bottom halves of the box. Since the simulation is nondissipative (corresponding to no loss in the elastic membrane, or infinite conductivity in the walls of our resonant box), the oscillations shown in figure 17 would continue without end, so we show only a few snapshots of the oscillating field. One can confirm the oscillatory behavior by noticing the alternation of dark and light regions from the top to bottom of the box.
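In discrete time the two-layer structure amounts to splitting the second-order equation into two first-order layers, one time derivative per layer: layer 1 integrates a velocity field P, and layer 2 integrates c^2 times the spatial Laplacian of E. A minimal sketch under those assumptions (illustrative, not the circuit; left and right edges are clamped like the top and bottom):

import numpy as np

def wave_step(E, P, c=1.0, dt=0.1):
    lap = (np.roll(E, 1, 0) + np.roll(E, -1, 0) +
           np.roll(E, 1, 1) + np.roll(E, -1, 1) - 4.0 * E)
    P = P + dt * (c ** 2) * lap   # layer 2: dP/dt = c^2 * Laplacian(E)
    E = E + dt * P                # layer 1: dE/dt = P
    E[0, :] = E[-1, :] = 0.0      # boundary conditions: 0 at all edges
    E[:, 0] = E[:, -1] = 0.0
    return E, P

# Initial condition of figure 17: +V in the upper half, -V in the lower half.
n, V = 32, 1.0
E = np.vstack([np.full((n // 2, n), +V), np.full((n // 2, n), -V)])
P = np.zeros_like(E)
for _ in range(300):
    E, P = wave_step(E, P)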

[Figure 16 artwork: image panels (a)-(f) and, in (g), the 3×3 feedback template with printed values

0.0  0.25 0.0
0.25 0.0  0.25
0.0  0.25 0.0 ]

Fig. 16. Simulation of the diffusion equation (heat equation) using a CNN. (a) the input image; (b, c, d, e, f) the output of the CNN during the transient; (g) the template of feedback operator A.

This section has presented numerous applications of cellular neural networks and shown the rich behavior possible with the combination of locally coupled cells possessing simple dynamics. These are just a few examples of the processing functions and physical system simulations possible with the CNN cell and array architecture. Currently, applications to image segmentation, half-tone transformation, character recognition (both English and Chinese characters), and simulation of long-range ordering in physical systems are being examined.

3. VLSI Implementation of Cellular Neural Network Circuits

In the previous section we have seen numerous examples of important signal processing functions which can be realized using CNN circuits. However, one of the most appealing aspects of these circuits is that they can be more easily implemented as VLSI devices than networks which are fully connected or whose coupling coefficients must be changed during a learning process. The combination of local connectivity, fixed templates (coupling coefficients), and positive feedback with simple, well-defined dynamics at each node enables the circuits to be fabricated using standard microfabrication technology. In this section we discuss the 2 μm CMOS VLSI implementation of the noise removal algorithm discussed in Section 2. Most of the features of this VLSI circuit are shared by VLSI implementations of other processing functions.



[Figure 17 artwork: image panels (a)-(h) and, in (i), four 3×3 feedback templates with printed values

0.0 0.0 0.0   0.0 0.0 0.0   0.0   0.25 0.0    0.0 0.0 0.0
0.0 1.0 0.0   0.0 1.0 0.0   0.25 -1.0  0.25   0.0 1.0 0.0
0.0 0.0 0.0   0.0 0.0 0.0   0.0   0.25 0.0    0.0 0.0 0.0 ]

Fig. 17. Simulation of the two-dimensional wave equation using a CNN. (a) the input image; (b, c, d, e, f, g, h) the output of the CNN during the transient; (i) the templates of feedback operator A.

3.1. Advantages of CNN Circuits for VLSI Implementation

3.1.1. Local Connectivity. The CNN arrays we have discussed here have cells which are locally connected and are very efficient for VLSI layout. (Those who have read ahead may ask why we used double-poly double-metal fabrication; it is not because of the connectivity, but because of the size of our state capacitors, which are implemented in the poly layers.) The regular array of cells also makes the routing and layout problem much easier than with traditional analog circuits. In fact, from examining the photomicrograph of the fabricated noise-removal circuit in figure 23, one might be led to believe it was a VLSI digital memory. In fact, many of the same routing and layout tools used for digital circuit design were used in the design of the noise removal CNN chip. Additionally, the fact that the same circuit is duplicated throughout the chip enables us to employ the same standard-cell design techniques which have allowed the design of digital VLSI circuits to be completed so rapidly.

This local connectivity should be contrasted with the connectivity required for most neural networks discussed in the literature. Such fully connected networks have proven very difficult to fabricate because of the number of connections and the distances they must propagate. The examples in Section 2 show that locally connected architectures can perform important functions and are easy to fabricate.

3.1.2. Lack of Learning Ability. When most people hear the phrase neural networks, they most often think of systems that learn or adapt to input conditions. In fact, in many definitions of neural networks the ability to learn or adapt is cited as a requirement. However, this requirement of adaptability is difficult to achieve when implementation calls for many cells in a VLSI system. In order to enable our circuit designs to be readily implementable, we have chosen to make the CNN arrays fixed-function. That is, the array is designed to perform one or a related set of processing functions using fixed coefficients. More complicated processing can be performed by cascading or paralleling multiple CNN VLSI devices. By avoiding the use of programmable weights or a variable number of connections, the CNN circuits can be fabricated at a higher density and without the problems associated with the convergence of learning algorithms. This approach is similar to that taken by Mead [31] in the design of chips that emulate human sensory systems (retina, cochlea, etc.). Indeed, there is much interesting processing which can be performed with fixed-component circuits, especially when nonlinearities are employed as in CNN circuits.

3.1.3. Tolerance to Fabrication Variances. Two attributes make the CNN cell circuit tolerant of fabrication variations and unmodeled parasitics: positive feedback at each cell, and well-defined local dynamics. In employing positive feedback, we ensure that the final state of each cell is a well-defined equilibrium point. In the examples here, we have chosen the output nonlinearity so that the cell state has only two equilibrium points. However, by using different nonlinearities (and possibly circuit topologies) more than two equilibrium points are possible. Using positive feedback to force the circuit to an equilibrium point is important because uncertainties in component values and parasitics may cause the circuit to exhibit oscillations or instabilities, which the positive feedback tends to mitigate by forcing the circuit to a stable point. Using positive feedback to force the circuit to an equilibrium point has obvious advantages over circuits designed to have high gain to force the output to a particular point, as in some implementations of the neural network proposed by Hopfield [1].

Choosing to make the dynamics of each node dominated by the state capacitor and resistor may slow down the circuit somewhat (though most of the device-level simulations using realistic assumptions have a settling time under 20 μs), but it also makes the circuit less sensitive to the parasitic capacitances and resistances that occur in any fabricated circuit. In fact, in our noise-removal VLSI circuit, we purposely chose a large state capacitor value of 7.8 pF so that it would dominate the dynamics of the cell. In this way we expect the number of correctly functioning CNN chips (the chip yield) will be higher than in less conservative designs where the simulated cell dynamics is close to that which may be induced by the parasitics. Since we only recently received our 20×20 CNN chip from MOSIS fabrication, we will report on the success of these assumptions in another paper.

3.2. The VLSI Noise-Removal CNN Circuit

In this section we will discuss the design of the noise-removal chip which appears in figure 23. We will begin by examining figure 18, a redrawing of figure 5. This diagram more clearly indicates the input currents from neighbors, which are summed, and the current sources which couple this cell with its neighbors. The currents from neighbors are shown at the top of the diagram, and currents going to neighbors are indicated by the I_xy(k, l; i, j).

As one might imagine, there is a considerable amount of circuitry hidden by the diamond-shaped controlled current sources. These current sources form the heart of the CNN design and will be discussed next. Figure 19 shows the grouping of these controlled current sources into those controlled by the state voltage v_xij and those controlled by the input voltage v_uij. We refer to these current sources as Multiple-Output Voltage-Controlled Current Sources (MOVCCS). There is a MOVCCS associated with the feedback operator A and one with the controlling operator B. These are indicated in figure 19 as (A) and (B). At the top of each box representing the MOVCCS are the voltage inputs, and the bottom of each box is attached to ground.

If we examine the inside of these MOVCCS boxes we find a core controlling circuit which performs the nonlinear output transformation (the piecewise-linear sigmoid from figure 5) and drives individual current sources whose outputs go to the neighbors. This core circuit is drawn in figure 20a. In the diagram, the input and ground connections from our MOVCCS boxes (in the previous figure) appear as gate inputs on M3 and M4. Vb1 and Vb2 are global biasing voltages which are common to all cells in the network. These voltages can be adjusted from outside the chip to increase or decrease all coupling coefficients in the array. The core circuit produces four outputs, Vp+, Vn+, Vp-, and Vn-, which are used to drive the current sources. An example of how these voltages are used is illustrated in figure 20b. Here, the core circuit drives only two current sources which have opposite values. In fact, the circuits to the left of the core circuit, those which use Vp- and Vn-, produce the negative coefficients used in the templates.



[Figure 18 artwork. The printed coupling relations are I_xu(i, j; k, l) = B(i, j; k, l) v_ukl and I_xy(i, j; k, l) = A(i, j; k, l) v_ykl, and the output branch source is (1/2Ry)(|v_xij + 1| - |v_xij - 1|), loaded by Ry.]

Fig. 18. The cell circuit of figure 4 redrawn to clearly show the outputs to neighbor cells and the inputs from neighbor cells.

[Figure 19 artwork, showing the same cell with the controlled sources grouped into MOVCCS blocks; the printed coupling relations I_xu(i, j; k, l) = B(i, j; k, l) v_ukl and I_xy(i, j; k, l) = A(i, j; k, l) v_ykl are repeated on the drawing.]

Fig. 19. The current outputs to other cells illustrated as Multiple-Output Voltage-Controlled Current Source (MOVCCS) blocks. These controlled current sources are the building blocks of the coupling terms in the VLSI implementation.




Fig. 20. Transistor schematic of the MOVCCS. (a) the core circuit of the MOVCCS; (b) the positive and negative currents generated by the core MOVCCS circuit.

Positive coefficients are produced by current source circuits controlled by Vp+ and Vn+. Figure 21 shows how the core circuit from figure 20a is used in the noise removal cell. We saw in the noise removal template that the cell simply averages the local pixels using appropriate positive weights. We can see this in figure 21, because the outputs to neighbors are all from the positive controlled source side of the core circuit. The different coefficients are achieved by varying the channel width-to-length ratio (W/L) of the MOS transistors: the larger the W/L ratio, the greater the coupling coefficient to the neighbor cells.

The reader may have noted that in many of the examples in Section 2 (especially the noise removal circuit) the simulations started with the state capacitor charged to an initial value, and the circuit settled from this condition. In the actual VLSI circuit we also wish to load the state capacitor, but we cannot load all cells simultaneously (check figure 23; there are not 400 inputs!). Therefore we must load a row at a time. In order to maintain the state capacitor voltage during loading, we must disconnect it from the remainder of the cell circuit. Otherwise, by the time the last row had loaded, the state voltage would drop far enough to affect the processing. The circuit of figure 22 is therefore used both to isolate the capacitor from the remainder of the cell circuit and to provide a start signal, so that all cells may start simultaneously. This is accomplished using two simple MOS pass transistors.

In the design of the CNN VLSI array, several other factors were considered and designed into the chip.


Fig. 21. An example of a MOVCCS used in an actual CNN cell. This circuit is the MOVCCS which supplies the coefficients for a noise removal CNN cell. Notice that all the coefficients are taken from the positive current control side of the core circuit, which yields the positive coefficients found in the A template.



Fig. 22. A block diagram showing the state capacitor loading switches and the array start switch. Note that the block containing the remainder of the CNN circuit has a low input impedance, which necessitates disconnecting the state capacitor during charging.

These were: multiplexing the inputs and outputs to the chip (optical inputs are planned, but not in this version), and providing test points into the circuit so that cell internal voltages and currents may be monitored for a few cells (we cannot monitor the entire array in this way, only a few cells close to the edge of the array). Test points were also provided so that the resistances and capacitances at various points could be evaluated (the state capacitance and resistance, and the output resistance).

The result is the chip pictured in figure 23. It is a 2 μm double-poly, double-metal CMOS IC which is packaged in a 64-pin ceramic DIP. We had just received the chip from MOSIS fabrication as of the writing of this paper and have only begun static tests on the device. These tests thus far indicate that the chip has the proper biasing voltages and currents. Dynamic testing will begin shortly and will involve providing inputs to the entire array and sampling outputs at each node to characterize the dynamics of the array. It is expected that this VLSI CNN implementation will be the first in a collection of CNN chips which will contain greater numbers of nodes and more sophisticated processing.

Conclusion

We have presented a brief theory of CNN processing, demonstrated the wide range of functions which are possible using CNN arrays, and discussed the implementation of a CNN circuit for noise removal. We hope the reader may realize the importance of locally connected analog architectures such as the CNN. It is expected that they will play more roles in signal processing and computation in the near future as we gain more experience with their theoretical description and VLSI design and implementation [32], [33]. Just as this paper goes to press, we have successfully tested a simpler 6×6 CNN chip from MOSIS and found it to function perfectly [34].

Finally, we emphasize that our current CNN pertains to the circuit in figure 4 only for simplicity. We have since generalized this circuit in several ways. In the future, the terminology CNN will include a much broader class of multi-layer analog arrays having local interconnections.

References

1. J.J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Natl. Acad. Sci. USA, vol. 79, 1982, pp. 2554-2558.

2. J.J. Hopfield and D.W. Tank, "Computing with neural circuits: a model," Science, vol. 233, no. 4764, 1986, pp. 625-633.

3. D.E. Rumelhart and J.L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, Cambridge, MA: M.I.T. Press, 1986.

4. K. Preston, Jr., and M.J.B. Duff, Modern Cellular Automata: Theory and Applications, New York: Plenum Press, 1984.

5. T. Toffoli and N. Margolus, Cellular Automata Machines: A New Environment for Modeling, Cambridge, MA: M.I.T. Press, 1986.

6. J. von Neumann, The Computer and the Brain, New Haven: Yale University Press, 1958.

7. J. von Neumann, "The general logical theory of automata," Cerebral Mechanisms in Behavior-The Hixon Symposium, New York: Wiley, 1951.

8. S. Wolfram, Theory and Applications of Cellular Automata, World Scientific Publishing Co., 1986.

9. L.O. Chua and L. Yang, "Cellular Neural Networks: Theory," IEEE Trans. Circuits and Systems, vol. 35, no. 10, 1988, pp. 1257-1272.

10. L.O. Chua and L. Yang, "Cellular Neural Networks: Applications," IEEE Trans. Circuits and Systems, vol. 35, no. 10, 1988, pp. 1273-1290.

11. L.O. Chua, C.A. Desoer, and E.S. Kuh, Linear and Nonlinear Circuits, New York: McGraw-Hill, 1987.

12. L.O. Chua, Introduction to Nonlinear Network Theory, New York: McGraw-Hill, 1969.

13. L.O. Chua and P.M. Lin, Computer-Aided Analysis of Electronic Circuits: Algorithms and Computational Techniques, Englewood Cliffs, NJ: Prentice-Hall, 1975, p. 72.

14. L.O. Chua and R. Ying, "Finding All Solutions of Piecewise-Linear Circuits," International Journal of Circuit Theory and Applications, vol. 10, 1982, pp. 201-229.

15. J. Hopfield and D. Tank, "'Neural' Computation of Decisions in Optimization Networks," Biological Cybernetics, vol. 52, 1985, pp. 141-152.



Fig. 23. Photomicrograph of a recently fabricated 20×20 CNN VLSI array which implements the noise removal algorithm discussed in Section 2. The chip was fabricated by the MOSIS project using 2 μm double-metal, double-poly technology and is currently being tested and evaluated by the authors.

16. J. Hopfield and D. Tank, "Collective Computation with Continuous Variables," in Disordered Systems and Biological Organization (ed. E. Bienenstock, F. Fogelman, and G. Weisbuch), Berlin: Springer-Verlag, 1985.

17. R.J. McEliece, E.C. Posner, E.R. Rodemich, and S.S. Venkatesh, "The Capacity of the Hopfield Associative Memory," IEEE Trans. on Information Theory, vol. IT-33, no. 4, 1987, pp. 461-482.

18. L.O. Chua and T. Roska, "Stability of a Class of Nonreciprocal Cellular Neural Networks," IEEE Trans. on Circuits and Systems, vol. 37, 1990, pp. 1520-1527.

19. W.S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, 1943, pp. 115-133.

20. W. Pitts and W.S. McCulloch, "How we know universals: the perception of auditory and visual forms," Bulletin of Mathematical Biophysics, vol. 9, 1947, pp. 127-147.

21. H.D. Block, "The Perceptron: a model for brain functioning. I," Reviews of Modern Physics, vol. 34, 1962, pp. 123-135.

22. M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry, Cambridge, MA: M.I.T. Press, 1969.

23. T. Matsumoto, L.O. Chua, and H. Suzuki, "CNN Cloning Template: Connected Component Detector," IEEE Trans. on Circuits and Systems, vol. 37, 1990, pp. 633-635.

24. M.P. Kennedy and L.O. Chua, "Circuit Theoretic Solutions for Neural Networks-an old approach to a new problem," Proc. First ICNN, vol. 2, San Diego, CA: IEEE, 1987, pp. 169-176.


25. M.J.B. Duff and T.J. Fountain, Cellular Logic Image Processing, London: Academic Press, 1986.

26. Anil K. Jain, Fundamentals of Digital Image Processing, Englewood Cliffs, NJ: Prentice-Hall, 1989.

27. L.O. Chua and B. Shi, Exploiting Cellular Automata in the Design of Cellular Neural Networks for Binary Image Processing, Memorandum UCB/ERL M89/130, University of California, Berkeley, November 15, 1989.

28. T. Kohonen, Self-Organization and Associative Memory, New York: Springer-Verlag, 1984.

29. T. Matsumoto, L.O. Chua, and T. Yokohama, "Image Thinning with a Cellular Neural Network," IEEE Trans. on Circuits and Systems, vol. 37, 1990, pp. 638-640.

30. P.M. Morse and H. Feshbach, Methods of Theoretical Physics, Part I, New York: McGraw-Hill, 1953.

31. M.A. Sivilotti, M.A. Mahowald, and C.A. Mead, "Real-time visual computations using analog CMOS processing arrays," in Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, Cambridge, MA: M.I.T. Press, 1987, p. 295.

32. T. Matsumoto, L.O. Chua, and Furukawa, "CNN Cloning Template: Hole Filler," IEEE Trans. on Circuits and Systems, vol. 37, 1990, pp. 635-638.

33. Proceedings of the 1990 IEEE International Workshop on Cellular Neural Networks and Their Applications, December 16-19, 1990, Budapest, Hungary.

34. J.M. Cruz and L.O. Chua, "A CNN chip for connected component detection," IEEE Trans. on Circuits and Systems, vol. 38, 1991.

Leon O. Chua received the S.M. degree from the Massachusetts Institute of Technology in 1961 and the Ph.D. degree from the University of Illinois, Urbana, in 1964. He was also awarded a Doctor Honoris Causa from the Ecole Polytechnique Federale de Lausanne, Switzerland, in 1983 and an Honorary Doctorate from the University of Tokushima, Japan, in 1984. He is presently a professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley.


Professor Chua's research interests are in the areas of general nonlinear network and system theory, neural networks, and nonlinear dynamics. He has been a consultant to various electronic industries in the areas of nonlinear network analysis, modeling, and computer-aided design. He is the author of Introduction to Nonlinear Network Theory (New York: McGraw-Hill, 1969), and a coauthor of the books Computer-Aided Analysis of Electronic Circuits: Algorithms and Computational Techniques (Englewood Cliffs, NJ: Prentice-Hall, 1975), Linear and Nonlinear Circuits (New York: McGraw-Hill, 1987), and Practical Numerical Algorithms for Chaotic Systems (New York: Springer-Verlag, 1989). He has published many research papers in the area of nonlinear networks and systems.

Professor Chua was elected a Fellow of the IEEE in 1974. He served as Editor of the IEEE Transactions on Circuits and Systems from 1973 to 1975 and as the President of the IEEE Society on Circuits and Systems in 1976. He is presently the editor of the International Journal of Bifurcation and Chaos and a deputy editor of the International Journal of Circuit Theory and Applications.

Professor Chua holds five U.S. patents and is also a recipient of several awards and prizes, including the 1967 IEEE Browder J. Thompson Memorial Prize Award, the 1973 IEEE W.R.G. Baker Prize Award, the 1974 Frederick Emmons Terman Award, the 1976 Miller Research Professorship from the Miller Institute, the 1982 Senior Visiting Fellowship at Cambridge University, England, the 1982/83 Alexander von Humboldt Senior U.S. Scientist Award at the Technical University of Munich, W. Germany, the 1983/84 Visiting U.S. Scientist Award at Waseda University, Tokyo, from the Japan Society for Promotion of Science, the IEEE Centennial Medal in 1985, the 1985 Myril B. Reed Best Paper Prize, and the 1985 and 1988 IEEE Guillemin-Cauer Awards.

In the fall of 1986, Professor Chua received a Professeur Invité International Award at the University of Paris-Sud from the French Ministry of Education.

Lin Yang (S'87) was born in Shenyang, China, on April 15, 1953. He received the B.Sc. degree in electronic engineering from Fudan University, Shanghai, China, in 1982, and the M.S. degree in information engineering from Tsinghua University, Beijing, China, in 1985.

Currently, he is with the Department of Electrical Engineering and Computer Sciences of the University of California, Berkeley, where he is working towards the Ph.D. degree.

His research interests include nonlinear dynamical systems, image processing, neural networks, and computer-aided design of VLSI circuits.

Ken R. Krieg received the B.S.E.E. degree from the Massachusetts Institute of Technology. Currently, he is with the Department of Electrical Engineering and Computer Science of the University of California, Berkeley, working towards a Ph.D. degree.


Nonlinear Analog Networks for Image Smoothing and Segmentation

A. LUMSDAINE, J.L. WYATT, JR., I.M. ELFADEL
Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139

Received June 25, 1990; Revised October 29, 1990.

Abstract. Image smoothing and segmentation algorithms are frequently formulated as optimization problems. Linear and nonlinear (reciprocal) resistive networks have solutions characterized by an extremum principle. Thus, appropriately designed networks can automatically solve certain smoothing and segmentation problems in robot vision. This paper considers switched linear resistive networks and nonlinear resistive networks for such tasks. Following [1], the latter network type is derived from the former via an intermediate stochastic formulation, and a new result relating the solution sets of the two is given for the "zero temperature" limit. We then present simulation studies of several continuation methods that can be gracefully implemented in analog VLSI and that seem to give "good" results for these nonconvex optimization problems.

1. Introduction

One of the most important, yet most difficult, early vision tasks is that of image smoothing and segmentation. Smoothing is necessary to remove noise from an input image so that reliable processing in subsequent stages is facilitated. However, indiscriminate smoothing will blur the entire image, including edges (e.g., corresponding to object boundaries) which are necessary for later stages of processing. Many researchers are currently seeking to develop algorithms that smooth in a piecewise manner, respecting edges. There are two main approaches taken: stochastic [1]-[6], and deterministic [7]-[9]. The former relies on such methods as simulated annealing to accomplish the minimization. The deterministic approach, on the other hand, often relies on the application of continuation methods [2], [10] to certain nonlinear systems, or in the case of [11], on using a neural network similar to that of Tank and Hopfield [12].

Although efficient computational techniques exist for numerically computing the solutions to vision problems [13], even the fastest algorithms running on a parallel supercomputer (such as the Connection Machine system [14]) do not approach real-time performance. The motivation of this work is to produce solutions to the smoothing and segmentation problem that are amenable to analog VLSI network implementation, an area that has been explored in [15]-[18]. See also [11], [19], [20].

Section 2 presents the smoothing and segmentation task as a minimization problem. Section 3 presents methods for solving the minimization problem and discusses network implementations of these methods. Simulation results are provided in Section 4. Finally, conclusions and suggestions for further research are given in Section 5.

2. Image Restoration as a Minimization Problem

The difficulty with using a linear network for image smoothing is that noise and signal are equally smoothed, so that edges become blurred. We therefore seek a method for segmenting the signal into regions which can be smoothed separately. One technique for doing this is to introduce a line process (i.e., a set of binary variables) which selectively breaks the smoothness constraint at given locations. This method appears widely in the literature, e.g., [1]-[4], [6], [11].

For simplicity of notation, all equations in this paper are formulated for the one-dimensional case. The results generalize trivially to two dimensions, and the simulation results are for the two-dimensional case.

The smoothing and segmentation problem with the line process can be treated as a minimization problem. Let $u \in \mathbb{R}^N$ be the input image, $y \in \mathbb{R}^N$ be the output image, and $l \in \mathbb{R}^{N-1}$ be the line process, where the binary line process variable $l_i$ assumes the values $\{0, 1\}$ depending on whether the smoothness penalty between nodes i and i + 1 is enforced or not. Consider the following cost function:

$$J_u(y, l) \triangleq \tfrac{1}{2}\,[F_u(y) + S(y, l) + H(l)] \qquad (1)$$

where F, S, and H are the "fidelity," "smoothness," and "line" penalty terms, respectively, i.e.,

$$F_u(y) \triangleq \lambda_f \sum_{i=1}^{N} (y_i - u_i)^2 \qquad (2)$$

$$S(y, l) \triangleq \lambda_s \sum_{i=1}^{N-1} (y_i - y_{i+1})^2 (1 - l_i) \qquad (3)$$

$$H(l) \triangleq \lambda_h \sum_{i=1}^{N-1} l_i. \qquad (4)$$

This formulation assumes that the optimal reconstructed image and edges $(y_{\mathrm{opt}}, l_{\mathrm{opt}})$ satisfy as well as possible the generally conflicting requirements of agreeing with the data u, being smooth between edges, and containing as few edges as possible. The parameters $\lambda_f$, $\lambda_s$, and $\lambda_h$ determine the weights given to each of these criteria.
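Transcribed directly (a sketch assuming numpy arrays y and u of length N and a binary line-process array l of length N - 1), the cost (1)-(4) is:

import numpy as np

def J(y, u, l, lam_f, lam_s, lam_h):
    F = lam_f * np.sum((y - u) ** 2)                     # fidelity, eq. (2)
    S = lam_s * np.sum((y[:-1] - y[1:]) ** 2 * (1 - l))  # smoothness, eq. (3)
    H = lam_h * np.sum(l)                                # line penalty, eq. (4)
    return 0.5 * (F + S + H)                             # eq. (1)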

The expression (1) can be minimized with respect to y for fixed l by differentiating with respect to each $y_i$ and setting the derivatives to zero. This produces the following system of equations:

$$\lambda_f (y_i - u_i) + \lambda_s (y_i - y_{i-1})(1 - l_{i-1}) + \lambda_s (y_i - y_{i+1})(1 - l_i) = 0, \qquad (5)$$

with appropriate modifications at the boundaries i = 1 and i = N.
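For fixed l, (5) is a symmetric, positive-definite tridiagonal system in y. The sketch below performs this exact minimization step with a dense solver for clarity (a banded solver would be the natural choice for large N):

import numpy as np

def smooth_given_lines(u, l, lam_f, lam_s):
    # Assemble the KCL equations (5): the diagonal holds lam_f plus the
    # conductances of the horizontal resistors still connected at node i,
    # and the right-hand side is lam_f * u.
    N = len(u)
    w = lam_s * (1.0 - np.asarray(l, dtype=float))  # horizontal conductances
    A = np.diag(lam_f + np.r_[0.0, w] + np.r_[w, 0.0])
    A -= np.diag(w, 1) + np.diag(w, -1)
    return np.linalg.solve(A, lam_f * np.asarray(u, dtype=float))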

The new results in Section 2.1 all follow from the observation that (5), as well as the related equation (9) below, describes the solution to certain resistive electrical networks. Notice that (5) can be viewed as the Kirchhoff's current law (KCL) relation at every node of a resistive ladder in which the horizontal resistive elements have switches (corresponding to line process elements) associated with them. A network for computing y given l is shown in figure 1. Note that (F + S) is the electrical power dissipated in the resistors. Similar networks have appeared in [6] and [11]. This type of network will be referred to as a resistor-with-switch (or RWS) network. For any setting of the switches, the network solution automatically minimizes the cost function with respect to y. The difficulty is in minimizing with respect to l.


Fig. 1. A simple smoothing network with switches. The vertical and horizontal resistors have conductances $\lambda_f$ and $\lambda_s$, respectively.

Much has been said in the literature in regard to finding a global minimum of (1) by stochastic and deterministic methods. These techniques are necessary to find the minimizing l; minimizing with respect to y given l only requires the solution of a linear system. The deterministic approaches rely on the fact that the minimization problem can be recast into one in which the line process variables have been eliminated. The latter will be studied here since they appear to lead to practical VLSI implementations.

2.1. Discontinuous Resistive Fuse Elements

The line process variables can be removed from (1) by straightforward algebraic manipulations. In fact, Blake and Zisserman [7] demonstrated that the original cost function $J_u(y, l)$, containing real and boolean variables, is intimately related to the following cost function containing only real variables:

$$K_u(y) \triangleq \tfrac{1}{2}\left[\lambda_f \sum_{i=1}^{N} (y_i - u_i)^2 + \sum_{i=1}^{N-1} G(y_i - y_{i+1})\right], \qquad (6)$$

where

$$G(v) = \begin{cases} \lambda_s v^2, & |v| < \sqrt{\lambda_h/\lambda_s} \\ \lambda_h, & \text{otherwise.} \end{cases} \qquad (7)$$

The line process is found a posteriori according to:

$$l_i = \begin{cases} 0, & |y_i - y_{i+1}| < \sqrt{\lambda_h/\lambda_s} \\ 1, & \text{otherwise.} \end{cases} \qquad (8)$$

Note that $K_u$ is a nonconvex cost function with respect to y.
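For reference, (7) and (8) transcribe directly as:

import numpy as np

def G(v, lam_s, lam_h):
    v = np.asarray(v, dtype=float)
    return np.where(lam_s * v ** 2 < lam_h, lam_s * v ** 2, lam_h)  # eq. (7)

def recover_lines(y, lam_s, lam_h):
    # a posteriori line process, eq. (8)
    return (np.abs(np.diff(y)) >= np.sqrt(lam_h / lam_s)).astype(int)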



Apart from instances in which solutions occur at points where G is not differentiable, the minimum of $K_u$ is to be found among those points where $\nabla K_u(y) = 0$, i.e.,

$$\lambda_f (y_i - u_i) + g(y_i - y_{i-1}) + g(y_i - y_{i+1}) = 0, \qquad (9)$$

where

$$g(v) = \frac{1}{2}\frac{d}{dv} G(v).$$

Equation (9) can also be viewed as the KCL relation at each node of a nonlinear resistive network with the topology illustrated in figure 2. The nonlinear resistor characteristic, g(v), is that of a linear resistor that reversibly becomes an open circuit when the voltage across it exceeds a certain threshold, as shown in figure 3. Then, in electrical terms, G is twice the co-content function for this nonlinear resistor [16], [21], i.e., $G(v) = 2\int_0^v g(\nu)\,d\nu$. We refer to an element of this type as a discontinuous resistive fuse, and to a network incorporating resistive fuses as an RWF network, i.e., a resistor-with-fuse network.
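One crude way to look for a stationary point of (9) numerically (a sketch only, not the continuation methods studied later in this paper; since $K_u$ is nonconvex, it settles into a local minimum that depends on the starting point) is nonlinear Gauss-Seidel: at each node, determine which fuses are conducting and then solve the scalar KCL equation exactly.

import numpy as np

def rwf_relax(u, lam_f, lam_s, lam_h, sweeps=200):
    y = np.asarray(u, dtype=float).copy()
    thr2 = lam_h / lam_s          # squared voltage threshold of the fuse
    N = len(y)
    for _ in range(sweeps):
        for i in range(N):
            gl = lam_s if i > 0 and (y[i] - y[i - 1]) ** 2 < thr2 else 0.0
            gr = lam_s if i < N - 1 and (y[i] - y[i + 1]) ** 2 < thr2 else 0.0
            num = lam_f * u[i]
            if gl:
                num += gl * y[i - 1]
            if gr:
                num += gr * y[i + 1]
            y[i] = num / (lam_f + gl + gr)  # scalar KCL solve at node i
    return y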


Fig. 2. Resistor-With-Fuse (RWF) network topology.


Fig. 3. Characteristic of the discontinuous nonlinear resistor known as a "resistive fuse." The dotted vertical lines are not part of the constitutive relation.

For a given cost function, one can construct corresponding RWS and RWF networks. For every solution of an RWF network, there exists a similar solution to the corresponding RWS network, but there are switch configurations of an RWS network for which there is no corresponding solution in the corresponding RWF network. The question then arises whether restricting attention to the RWF network might cause one to overlook a solution to the RWS network that is in fact a local minimum, and therefore of potential use in an optimization procedure. The answer is no, by the following proposition:

PROPOSITION 1. Consider the cost function $J_u(y, l)$ as specified in (1) for a one- or two-dimensional network and the corresponding RWS and RWF networks specified by (5) and (9). If the switches are set so that the solution $y^*$ for the RWS network is not also a solution to the RWF network, then $y^*$ is not a local minimum of $J_u$, meaning that changing the setting of a single (appropriately selected) switch in the RWS network will produce a new solution (with a new value of y) for which the value of $J_u$ is strictly lower.

Remark. Proposition 1 differs from the result in [7] in that it concerns local minima and applies after the network has settled to a new y following the closing of the switch.

In order to complete the proof, we need the follow­ing lemma:

LEMMA 1. (Local Measurement Principle) Consider a one-dimensional or two-dimensional network of the type shown in figure 1, in which an arbitrary number of switches (≥ 1) are open. The power dissipated in the network is given by

$$P = \lambda_f \sum_{i=1}^{N} (y_i - u_i)^2 + \lambda_s \sum_{i=1}^{N-1} (y_i - y_{i+1})^2 (1 - l_i) = (F + S). \qquad (10)$$

For a given switch, let $P^-$ be the value of P with the switch open, $P^+$ be the value of P with the switch closed, and define the change in the dissipated power as $\Delta P = P^+ - P^-$. Let $v_{oc}$ be the voltage across the switch when it is open and let $i_{sc}$ be the current through the switch when it is closed (after the network has settled). Then the increase in dissipation which results from closing the switch is

$$\Delta P = v_{oc}\, i_{sc}. \qquad (11)$$



Remark. Lemma 1 is a startling result. The local measurement principle states that one can measure the global change in the network cost function due to a switch change (after the network settles to a new solution) merely by taking two measurements at the switch. Both proofs below use circuit theory techniques, but they can also be carried out, albeit laboriously, by mathematical arguments divorced from a network realization, e.g., the proof of Lemma 1 via a rank-one perturbation method in [22]. Related work appears in [23] and [24].

Proof of Lemma 1. Define $v_k^-$ and $i_k^-$ to be the network branch voltages and currents when the switch is open. Define $v_k$ and $i_k$ to be the network branch voltages and branch currents when the switch is closed. Define $\Delta v_k = v_k - v_k^-$ and $\Delta i_k = i_k - i_k^-$. By Tellegen's theorem [25], [26],

$$\sum_{\text{all branches}} [v_k \Delta i_k - i_k \Delta v_k] = 0. \qquad (12)$$

Group the terms in (12) according to branch element, and note that

$$\sum_{\text{voltage sources}} [v_k \Delta i_k - i_k \Delta v_k] + \sum_{\text{resistors}} [v_k \Delta i_k - i_k \Delta v_k] + \sum_{\text{fixed switches}} [v_k \Delta i_k - i_k \Delta v_k] + v_{sw} \Delta i_{sw} - i_{sw} \Delta v_{sw} = 0, \qquad (13)$$

where the subscript "sw" refers to the switch that is being closed and "fixed switches" to all others. To simplify (13), note that $\Delta v_k = 0$ for the voltage sources, $v_k$ and $\Delta v_k$ vanish for closed switches, $i_k$ and $\Delta i_k$ vanish for open switches, and for the resistors

$$v_k \Delta i_k - i_k \Delta v_k = R_k i_k \Delta i_k - i_k R_k \Delta i_k = 0. \qquad (14)$$

Equation (13) then becomes

$$0 = \sum_{\text{voltage sources}} [v_k \Delta i_k] + v_{sw} \Delta i_{sw} - i_{sw} \Delta v_{sw} = \sum_{\text{voltage sources}} [v_k \Delta i_k] + v_{sw}(i_{sw} - i_{sw}^-) - i_{sw}(v_{sw} - v_{sw}^-) = \sum_{\text{voltage sources}} [v_k \Delta i_k] + v_{sw}^-\, i_{sw}. \qquad (15)$$

The summation term in (15) is just the change in power delivered to the network, i.e., $-\Delta P$, and $v_{sw}^-\, i_{sw} = v_{oc}\, i_{sc}$. Therefore,

$$\Delta P = v_{oc}\, i_{sc}. \qquad (16)$$ ∎
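Lemma 1 is easy to check numerically. The sketch below (hypothetical values throughout) builds a small RWS ladder, solves it with one switch open and then closed, and compares the global change in dissipated power (10) with the purely local product $v_{oc}\, i_{sc}$.

import numpy as np

def solve_rws(u, l, lam_f, lam_s):
    # Exact solution of (5) for the given switch settings (dense, small N).
    N = len(u)
    w = lam_s * (1.0 - np.asarray(l, dtype=float))
    A = np.diag(lam_f + np.r_[0.0, w] + np.r_[w, 0.0])
    A -= np.diag(w, 1) + np.diag(w, -1)
    return np.linalg.solve(A, lam_f * u)

def power(y, u, l, lam_f, lam_s):  # dissipated power, eq. (10)
    return (lam_f * np.sum((y - u) ** 2) +
            lam_s * np.sum((1 - l) * (y[:-1] - y[1:]) ** 2))

lam_f, lam_s = 1.0, 2.0
u = np.random.default_rng(0).normal(size=8)
l_open = np.array([0, 0, 1, 0, 0, 0, 0])  # switch between nodes 2 and 3 open
l_closed = l_open.copy()
l_closed[2] = 0                           # ... and now closed

y0 = solve_rws(u, l_open, lam_f, lam_s)
v_oc = y0[2] - y0[3]                      # voltage across the open switch
y1 = solve_rws(u, l_closed, lam_f, lam_s)
i_sc = lam_s * (y1[2] - y1[3])            # current through it once closed

dP = (power(y1, u, l_closed, lam_f, lam_s) -
      power(y0, u, l_open, lam_f, lam_s))
assert np.isclose(dP, v_oc * i_sc)        # Lemma 1: delta P = v_oc * i_sc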


Fig. 4. Load-line diagram for an RWS network and for the RWS network with a resistive fuse substituted for one resistor-switch composite element. The dashed lines A, B, and C are possible load-lines representing the behavior of an RWS network as seen by one resistor-switch pair. The six marked points indicate possible solutions, depending on switch position. If a resistive fuse element (with characteristic i = g(v), shown with a solid line) is substituted for the resistor-switch pair, the four circled solutions remain, while the solutions marked X and W disappear.

Proof of Proposition 1. Consider any RWS network with any input u, switch configuration l, and corresponding network solution $y^*$, such that $y^*$ is not a solution of the corresponding RWF network. Then there must exist some resistor-switch composite element (element q, say), such that $y^*$ is no longer a network solution if a resistive fuse is substituted in its place. Make such a substitution and then consider the load-line describing the remainder of the linear RWS network as seen from this location. The two possible cases marked in figure 4 are Case X, in which switch q was open in the original RWS network, and Case W, in which switch q was closed. Note that the area in the first quadrant under the triangle is $\tfrac{1}{2}\lambda_h$. In Case X, closing the switch in the original RWS network would have caused the solution to move to the circled point on line A. By Lemma 1 the change would be

$$J_{u,\mathrm{closed}} - J_{u,\mathrm{open}} = \tfrac{1}{2}\,[v_{oc}\, i_{sc} - \lambda_h] < 0, \qquad (17)$$

where the inequality follows from the fact that $\tfrac{1}{2} v_{oc} i_{sc}$ (the area under the line connecting the origin to $(v_{oc}, i_{sc})$) is less than $\tfrac{1}{2}\lambda_h$ (the area under the triangle). For Case W, similar reasoning shows $J_{u,\mathrm{open}} - J_{u,\mathrm{closed}} < 0$ if opening the switch causes the network solution to move from point W to the circled point on line C. Thus points X and W in figure 4 are not local minima. ∎



Remark. The converse of the proposition is not true. If the network solution lies on load-line B, one intersection point or the other will generally have a lower cost for the RWS network, yet both are valid solutions to the RWF network.

2.2. Continuous Resistive Fuse Elements

We will show that for the purposes of numerical optimization and physical (VLSI) implementation it is advantageous to replace the discontinuous resistive fuse element in figure 3 by a controllable element with a single-parameter family of i-v curves, such as those drawn in dotted lines in figure 7. Elements of this general type have also been used in analog VLSI circuits for image enhancement. To the best of our knowledge, the first related circuit appeared in [15] and had a monotone, saturating characteristic of the general form i = a tanh(bv). John Harris at Caltech invented the first nonmonotone circuit element of this type, named it a resistive fuse, and built image processing networks using it in analog VLSI [16]-[18]. More recently, Steve Decker, Hae-Seung Lee, and John Wyatt at MIT have developed more compact nonmonotone continuous resistive fuse circuits using fewer transistors.

The behavior of the nonmonotone fuses in the network in figure 2 is intuitively easy to understand. In a smooth region of the image where the input u is nearly constant, only the linear portion of the fuse curve near the origin is excited, i.e., the fuse acts essentially as a linear smoothing element. But at any point where a discontinuity in the input occurs, i.e., where $|u_i - u_{i-1}|$ is sufficiently large, the fuse current becomes quite small and little smoothing results. An extremum formulation of this behavior is given in [16].
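For concreteness, one plausible single-parameter family with exactly this behavior (an illustrative assumption on our part; we do not claim it reproduces the curves of figure 7, though a closely related family emerges from the analysis in Section 2.4) is:

import numpy as np

def continuous_fuse(v, lam_s, lam_h, beta):
    # i(v) = lam_s * v / (1 + exp(beta * (lam_s * v**2 - lam_h))): linear
    # conductance lam_s for |v| well below sqrt(lam_h / lam_s), near-zero
    # current well above it, and the discontinuous fuse of figure 3
    # recovered as beta -> infinity.
    v = np.asarray(v, dtype=float)
    a = np.clip(beta * (lam_s * v ** 2 - lam_h), -50.0, 50.0)  # avoid overflow
    return lam_s * v / (1.0 + np.exp(a))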

A fundamental question is whether any rigorous relationship can be found that connects the continuous nonmonotone fuse curves in figure 7 with the discontinuous fuses in figure 3 or the switches in figure 2. The surprising answer is yes, due to a remarkable result of Geiger and Girosi, based on a stochastic formulation of the problem [1]. Section 2.3 gives the necessary background and Section 2.4 gives a variant on their approach, based on a formulation in terms of a marginal probability distribution function.

2.3. Stochastic Formulation of the Image Smoothing and Segmentation Problem

This section shows that the deterministic minimization problem in (1)-(4) is in fact the optimum image reconstruction procedure in a particular probabilistic formulation of the problem (see figure 5). In this formulation, the original image brightness and discontinuities are modeled as a pair of random vectors (B, D). We cannot directly measure (B, D) but have to work instead with a noisy observation U of the brightness values alone.

Fig. 5. The original image one wishes to reconstruct is modeled as a random brightness vector B and a random binary edge discontinuity vector D, where $D_i = 1$ if there is an edge between pixels i and (i + 1). The observation U is a version of B degraded by observation noise N, and the reconstruction algorithm produces an estimate $(\hat{B}, \hat{D}) = f(U)$ of (B, D). In our particular example, the function f is the optimization procedure specified in (23)-(25).

Remark. The notation in this section follows the one classically used in probability theory, where random variables or vectors are denoted by uppercase letters (B, D, U), while the values taken by them are denoted by lowercase letters (b, d, u). In (18), p(X = x) is denoted by p(x) and p(X = x | Z = z) is denoted by p(x|z) to simplify notation.

For this analysis, it is assumed that the probability distribution for (B, D) is given by

$$p(b, d) = p(b|d)\, p(d) = \left\{ c_s(\beta\lambda_s)\, e^{-\beta S(b, d)} \right\} \left\{ c_h(\beta\lambda_h)\, e^{-\beta H(d)} \right\}, \qquad (18)$$

where the functions S and H are given in (3) and (4), respectively, and where $c_s$ and $c_h$ ensure that p(b|d) and p(d) have unit area. The vector D represents a finite Bernoulli sequence. More specifically, the components of the binary random vector D are independent, identically distributed binary random variables with the probability mass function

$$p(d_i) = \frac{e^{-\beta\lambda_h d_i}}{1 + e^{-\beta\lambda_h}}, \qquad d_i \in \{0, 1\}, \qquad (19)$$

so that the joint probability mass function of d is given by

$$p(d) = \prod_{i=1}^{N-1} p(d_i) = \frac{e^{-\beta\lambda_h \sum_{i=1}^{N-1} d_i}}{(1 + e^{-\beta\lambda_h})^{N-1}}, \qquad (20)$$



which decreases with the total number of discontinuities in the scene. The brightness vector B is a Gaussian random vector dependent on D: specifically, the elements of the vector, taken as a sequence, represent, in the one-dimensional case, discrete Brownian motion with a uniformly distributed initial value at i = 1 and at the right of each discontinuity location. In the absence of a discontinuity between pixels i and (i + 1), the standard deviation of $B_{i+1} - B_i$ is $\sigma = 1/\sqrt{2\beta\lambda_s}$.

We also assume the observations U are distributed as

$$p(u|b) = c_f(\beta\lambda_f)\, e^{-\beta F_u(b)}, \qquad (21)$$

where the function $F_u$ is given in (2) and where $c_f$ normalizes p(u|b) to unit area. Equation (21) is equivalent to assuming that the observations are corrupted by additive, independent Gaussian noise, i.e., $U_i = B_i + N_i$, where $N_i$ is a Gaussian random variable with zero mean and standard deviation $\sigma = 1/\sqrt{2\beta\lambda_f}$, and N is independent of (B, D). The variable $\beta$ is actually redundant because scaling $\beta$ is equivalent to scaling $\lambda_f$, $\lambda_s$, and $\lambda_h$. Since increasing $\beta$ will reduce all the variances, $1/\beta$ is analogous to temperature in statistical mechanics.

Remark. More precisely, B is Brownian motion and N is Gaussian if the allowed values for each B_i and each N_i are continuous and unconstrained. But in that case the description of B is flawed, because a uniformly distributed initial value over the whole real line is not a normalizable probability distribution. This problem vanishes if the allowed brightness values are discrete and finite in number, or continuous and bounded, e.g., 0 ≤ B_i ≤ B_max.
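To make the generative model concrete, the following sketch draws one realization of (B, D, U) in one dimension. It assumes the standard weak-membrane reading of (18)-(21) described above (Bernoulli edges, restarted Brownian brightness, additive Gaussian noise); the parameter values and the bound b_max are illustrative, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    N, beta = 64, 1.0
    lam_s, lam_h, lam_f = 1e-1, 1e-2, 1e-1

    # Edge process D: i.i.d. Bernoulli with P(d_i = 1) from (19)
    p_edge = np.exp(-beta * lam_h) / (1.0 + np.exp(-beta * lam_h))
    d = rng.random(N - 1) < p_edge

    # Brightness B: discrete Brownian motion with sigma = 1/sqrt(2*beta*lam_s),
    # restarted with a (bounded) uniform value after each discontinuity
    b_max = 1.0
    sigma_s = 1.0 / np.sqrt(2.0 * beta * lam_s)
    b = np.empty(N)
    b[0] = rng.uniform(0.0, b_max)
    for i in range(N - 1):
        b[i + 1] = rng.uniform(0.0, b_max) if d[i] else b[i] + sigma_s * rng.normal()

    # Observation U: additive Gaussian noise with sigma = 1/sqrt(2*beta*lam_f)
    u = b + rng.normal(0.0, 1.0 / np.sqrt(2.0 * beta * lam_f), size=N)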

How should we best attempt to reconstruct (B, D) given U? Given both the a priori (i.e., prior to a particular observation) probability distribution p(b, d) of the image with discontinuities, and the noise-produced conditional distribution p(u|b, d) of the observation, Bayes' rule [27] gives the a posteriori distribution (i.e., the conditional distribution after noisy observation) of the image and discontinuities by the formula

    p(b, d \mid u) = \frac{p(u \mid b, d)\, p(b, d)}{p(u)}.   (22)

Maximum a posteriori estimation is a reconstruction technique that chooses (b̂, d̂) as the value of (b, d) that maximizes p(b, d | u), i.e.,

    (\hat{b}, \hat{d}) = f(u) = \arg\max_{b,d}\, p(b, d \mid u) = \arg\max_{b,d} \left[ \frac{p(u \mid b, d)\, p(b, d)}{p(u)} \right] = \arg\max_{b,d} \left[ p(u \mid b)\, p(b, d) \right],   (23)

where the last equality holds because the denominator is independent of (b, d) and the observation u depends directly on the brightness levels alone; d is not measured but only statistically inferred. Using (18) and (21) in the last line of (23), we have

    (\hat{b}, \hat{d}) = \arg\max_{b,d} \left\{ c_f c_s c_h\, e^{-\beta\left[ F_u(b) + S(b,d) + H(d) \right]} \right\}.   (24)

Remark. There is a useful distinction between the (y, l) notation and the (b, d) notation. In the deterministic picture, (y, l) represents algorithm or circuit variables over which we are attempting to optimize. A particular network solution (y, l) does describe the circuit behavior but may or may not bear any simple relation to (b, d). The stochastic picture adds a new quantity not present in the earlier deterministic story: the original uncorrupted random image-discontinuity pair (B, D). The variables (b, d) always refer to possible values of (B, D) and may or may not relate directly to circuit behavior. Without this dual notation, the same variables would misleadingly be used to describe both original image-discontinuity pairs and also node voltages and switch positions inside an electrical network.

Substituting our previous deterministic notation (y, l) for (b, d), we recover

    (\hat{b}, \hat{d}) = (y_{opt}, l_{opt}) = \arg\min_{y,l} \left\{ F_u(y) + S(y, l) + H(l) \right\},   (25)

as in (1)-(4). In conclusion, the optimization problem in (1)-(4) yields as its solution the maximum a posteriori estimator of the original image brightness and discontinuity vectors (B, D), assuming the a priori distribution (18) and the observation noise model (21).
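For reference, here is a minimal sketch of the deterministic cost whose minimizer is that MAP estimate. The quadratic fidelity and smoothness terms and the linear line penalty are the usual weak-membrane forms and are assumptions about (2)-(4) (their exact normalizations are not visible in this excerpt); y and u are length-N arrays and l is the binary line process of length N - 1.

    import numpy as np

    def cost(y, l, u, lam_f, lam_s, lam_h):
        """Deterministic cost of (1); minimizing it gives the MAP estimate."""
        F_u = lam_f * np.sum((y - u) ** 2)              # fidelity, cf. (2)
        S = lam_s * np.sum((1 - l) * np.diff(y) ** 2)   # smoothness, cf. (3)
        H = lam_h * np.sum(l)                           # line penalty, cf. (4)
        return F_u + S + H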

2.4. Derivation of the Continuous Fuse from the Resistor-with-Switch Network in a Probabilistic Formulation

In the Bayesian formulation, one first calculated the a posteriori distribution

    p(b, d \mid u) = c_f c_s c_h\, e^{-\beta\left[ F_u(b) + S(b,d) + H(d) \right]},   (26)

and then attempted to maximize it over (b, d). If one wishes to reconstruct only the intensities but not the


discontinuity locations, it is appropriate to maximize the simpler marginal a posteriori distribution p(b|u) over b, where

    p(b \mid u) \triangleq \sum_{d \in \Theta} p(b, d \mid u),   (27)

and Θ is the hypercube of all 2^{N-1} possible binary d-vectors. The form of this density will specify the nonlinear continuous fuse characteristics. The sum above was first calculated by Geiger and Girosi in [1], and a somewhat more detailed derivation is given below.

LEMMA 2. The marginal a posteriori distribution p(b|u) over b is given by

    p(b \mid u) = c_1\, e^{-\beta\left[ F_u(b) + J_2(b) \right]},   (28)

where c_1 is a normalizing constant, F_u(b) is given in (2), and

    J_2(b) = -\frac{1}{\beta} \sum_{i=1}^{N-1} \ln\left[ \frac{1 + e^{\beta\left[ \lambda_h - \lambda_s (b_i - b_{i+1})^2 \right]}}{1 + e^{\beta\lambda_h}} \right].   (29)

The proof of the lemma requires the following fact, which can easily be verified.

Fact. Let a = (a_1, ..., a_n) be a vector of n binary variables, a_i ∈ {0, 1}, and let A be the set of all 2^n such vectors. Then for any r ∈ ℝ^n,

    \sum_{a \in A} e^{a \cdot r} = \prod_{i=1}^{n} \left( 1 + e^{r_i} \right),   (30)

where a · r is the standard inner product.

Proof of Lemma 2. Substituting (26) into (27) yields

    p(b \mid u) = \sum_{d \in \Theta} p(b, d \mid u) = c_f c_s c_h \sum_{d \in \Theta} e^{-\beta\left[ F_u(b) + S(b,d) + H(d) \right]}.   (31)

The terms being summed in (31) can be decomposed as follows:

    e^{-\beta\left[ F_u(b) + S(b,d) + H(d) \right]} = e^{-\beta\left[ F_u(b) + \lambda_s \sum_{i=1}^{N-1} (b_i - b_{i+1})^2 \right]} \cdot e^{\beta \sum_{i=1}^{N-1} d_i \left[ \lambda_s (b_i - b_{i+1})^2 - \lambda_h \right]}.   (32)

Using (30), the second exponential term in (32) sums over d ∈ Θ to

    \exp\left\{ -\beta \left[ -\frac{1}{\beta} \sum_{i=1}^{N-1} \ln\left( 1 + e^{-\beta\left[ \lambda_h - \lambda_s (b_i - b_{i+1})^2 \right]} \right) \right] \right\},   (33)

and further algebraic manipulation shows that

    p(b \mid u) = c_f c_s c_h\, e^{-\beta\left[ F_u(b) + J_1(b) \right]},   (34)

where

    J_1(b) = \lambda_s \sum_{i=1}^{N-1} (b_i - b_{i+1})^2 - \frac{1}{\beta} \sum_{i=1}^{N-1} \ln\left( 1 + e^{-\beta\left[ \lambda_h - \lambda_s (b_i - b_{i+1})^2 \right]} \right).   (35)

Absorbing an additive constant into the normalization c_f c_s c_h, the lemma was stated in (29) in terms of

    J_2(b) = J_1(b) - J_1(0),   (36)

which is constructed so that J_2(0) = 0. This is a necessary step if we are to later interpret J_2(b) as the co-content function of a set of nonlinear resistors. ∎

The marginal distribution in (28) suggests a new cost function

    K_u^\beta(y) = F_u(y) + J_2(y),   (37)

in which the line process variables have been eliminated. The local minima of K_u^β(y) are obtained from the set of points satisfying

    \nabla K_u^\beta(y) = 0.   (38)

Taking the ith component of (38) gives

    \lambda_f (y_i - u_i) + g_\beta(y_i - y_{i-1}) + g_\beta(y_i - y_{i+1}) = 0,   (39)

where

    g_\beta(v) = \frac{2 \lambda_s v}{1 + e^{\beta\left( \lambda_s v^2 - \lambda_h \right)}}.   (40)

Equation (39) can be considered the KCL relation at each node of a nonlinear network having vertical linear resistive elements with conductance λ_f and horizontal nonlinear elements with constitutive relation i = g_β(v).

In this case, K_u^β(y) is the total co-content of the network. Notice that as β → ∞, we recover the RWF network, i.e., K_u^β(y) → K_u(y). Moreover, we have defined a family of β-dependent resistive elements, illustrated in figure 7, that can be used in continuation methods.
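A numerical sketch of this element family follows, using the reconstruction of (40) above (which is assumed here; it is what one obtains by differentiating the per-edge term of (29)). The clip constant is only to avoid floating-point overflow.

    import numpy as np

    def g_beta(v, beta, lam_s, lam_h):
        """Constitutive relation i = g_beta(v) of the horizontal element (40)."""
        v = np.asarray(v, dtype=float)
        z = np.clip(beta * (lam_s * v ** 2 - lam_h), -60.0, 60.0)
        return 2.0 * lam_s * v / (1.0 + np.exp(z))

For β = 0 this is a linear resistor; as β → ∞ the logistic factor switches the current off wherever λ_s v² > λ_h, recovering the discontinuous resistive fuse of figure 3.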


3. Solution Methods

The resistive fuse and marginal distribution approaches produced switch-free nonlinear networks with identical topologies (see figure 2) but with different constitutive relations for the nonlinear elements. For either network, multiple solutions generally exist. On the theoretical side this is a difficulty because we are trying to find the global minimum of a specific cost function. On the practical side this is a difficulty because the solution obtained by a physical network realization will depend strongly on such things as parasitic capacitances and other characteristics of the network over which we have little control. We therefore seek some modification of the network that will allow us to exercise some control over the solution it finds. In this section, we apply continuation methods to the nonlinear smoothing and segmentation networks.

3.1. Example: A Special Case

The simplest special case that nonetheless provides insight into the phenomenon of multiple solutions is the response of a one-dimensional network to a step edge input, i.e.,

    u_i = \begin{cases} u_{hi}, & i \le k \\ u_{lo}, & i > k \end{cases}   (41)

for some k < N. This corresponds to a step of u_{hi} - u_{lo} > 0 between nodes k and k + 1 and serves as a model for the simplest two-dimensional edge, i.e., a step that extends across the entire network and is parallel to one of the network "axes."

For the step input described above, the one-dimensional network has a simple circuit equivalent, shown in figure 6. The simplification proceeds as follows. First, we assume that the signal is "well-smoothed" on either side of the step, so that each nonlinear element can be replaced by an equivalent linear resistance whose value is the incremental resistance of the nonlinear element about zero volts. The network elements on either side of the step are then replaced by their Thevenin equivalents, which are combined into a single linear element and voltage source. The simplified network will be referred to as the zero-dimensional case. Analysis of the behavior of the network to a step input is reduced to solving the KCL equation at one node; some insight into the circuit behavior can be gained by using load-line techniques (see figures 6-8).

Fig. 6. Thevenin equivalent circuit for one-dimensional nonlinear smoothing and segmentation network with step input.

This "linear load-line assumption" holds exactly only for the RWS network with fixed switch positions and for the marginal distribution network with {3 = O. For the RWF network and for the marginal distribu­tion network with {3 ..... 00, it is exact over the limited voltage range in which no new discontinuities are in­troduced into y. Otherwise, it is only an approxima­tion and its applicability to other cases of interest must be individually determined.

3.2. Continuation Methods

We seek a modification to the networks so that the solution will be repeatable and also be visually and quantitatively "good." One technique that works well within the context of smoothing and segmentation is to apply a continuation method to the network [2], [10].

A continuation (sometimes called "deterministic annealing") can be realized in network form by the simultaneous application of a given homotopy (continuous deformation) to some or all of the circuit elements. Two types of continuations are particularly appropriate for our class of nonlinear networks. Assume we have a network with horizontal nonlinear resistors whose constitutive relation is described by i = g(v), and vertical linear resistors with conductance λ_f. Consider the following two homotopies for the horizontal and vertical elements, respectively:

CH: Replace g with g^{(p)}, p ∈ [a, b], such that g^{(a)} constrains the network to have a unique solution and g^{(b)} = g;

CV: Replace λ_f with λ_f^{(p)}, p ∈ [a, b], such that λ_f^{(a)} constrains the network to have a unique solution and λ_f^{(b)} = λ_f.

Note that CH and CV define where the homotopies are applied in the network to produce a continuation; we are still free to decide the specific form of the homotopy.


3.3. β-Continuation

Blake and Zisserman suggest a CH continuation method, the so-called "graduated nonconvexity" algorithm, or GNC [7]. There are some apparent weaknesses to using the GNC algorithm in network form, however. First, there is no reason to expect that, for an arbitrary image, the specific continuation used by GNC will produce the global minimum, or that it will even produce a "good" minimum. Second, the nonlinear resistive element in a network realization of GNC will have a discontinuous first-order derivative, which can cause convergence difficulties in numerical simulation.

On the other hand, the marginal distribution derivation of our nonlinear network provides a natural homotopy for realization of the CH continuation. For β = 0, the network with elements described by (40) is linear, whereas for β → ∞, the elements become identical to those in figure 3 and will (locally) solve our minimization problem. This suggests using β directly as the continuation parameter for a CH continuation for solving (39) and hence (9). Furthermore, because of the way this continuation was derived, one might expect that it would do a good job of seeking the global cost minimum.

Some insight into the behavior of this type of network can be gained by examining the zero-dimensional case. Figure 7a shows the marginal distribution nonlinear resistor characteristic for various values of β, along with two load-lines representing two different values of the input. As β is taken from 0 to ∞, the solution will follow the continuous path represented by the intersection of the resistor curve and the load-line. In this example, the smaller step will be smoothed, and the larger step will be segmented.

Interestingly, discontinuous behavior can occur with this type of continuation, as is shown in figure 7b. In this example, the initial solution point will be the inter­section of the load-line and the marginal distribution resistor characteristic for (3 = O. As (3 is increased, the "hump" of the nonlinear resistor curve will at one point pass completely beneath the load-line, at which point the solution will jump from being a smoothing solution to being a segmenting solution.
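A minimal sketch of the β-continuation, reusing g_beta from Section 2.4: the KCL equations (39) are relaxed by damped iteration at each β as β is swept upward. The sweep schedule, step size, and iteration counts are illustrative choices, not the paper's.

    import numpy as np

    def relax(y, u, lam_f, beta, lam_s, lam_h, steps=200, dt=0.1):
        """Damped iteration on the KCL residual of (39) (1-D, open ends)."""
        for _ in range(steps):
            e = -g_beta(np.diff(y), beta, lam_s, lam_h)  # e_i = g(y_i - y_{i+1})
            r = lam_f * (y - u)
            r[:-1] += e        # g(y_i - y_{i+1}) at node i
            r[1:] -= e         # g(y_{i+1} - y_i) at node i+1 (g is odd)
            y = y - dt * r
        return y

    def beta_continuation(u, lam_f, lam_s, lam_h,
                          betas=np.geomspace(1e-2, 1e4, 50)):
        y = u.copy()           # small beta: essentially a linear network
        for beta in betas:
            y = relax(y, u, lam_f, beta, lam_s, lam_h)
        return y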

3.4. λ_f-Continuation

The CV continuation can be realized in a straightforward manner by varying the vertical resistors in the network. In particular, we begin with the resistors having infinite (or sufficiently large) conductance, so that the

Fig. 7. Approximate load-line plot for marginal distribution network with β-continuation. In part (a), the solid lines with negative slope represent the load-lines for two different input values (0.5V and 2.5V). The nonlinear resistor characteristic is shown for various values of β. For β = 0, the nonlinear resistor acts as a linear resistor. As β → ∞, the nonlinear resistor characteristic becomes that of the discontinuous resistive fuse. In part (b), the solution exhibits a discontinuous jump, from a smoothing solution to a segmenting solution.

network has only one solution, namely y = u (or, for large conductance, y ≈ u). Then, we continuously decrease the value of the conductance to λ_f.

Examination of the zero-dimensional case provides some insight into the behavior of this type of network. Figure 8a shows the marginal distribution nonlinear resistor characteristic for large β, along with two sequences of load-lines representing two different values of the input. In this example, the solution for the larger input will remain at the initial intersection point of the load-line and the resistor curve as λ_f^{(p)} is taken from λ_f^{(a)} = λ_0 to λ_f^{(b)} = λ_f. On the other hand, the solution for the smaller input will follow the continuous path represented by the intersection of the resistor curve and the load-line. Hence, the larger step will be segmented and the smaller step will be smoothed.


Fig. 8. Approximate load-line plot for λ_f-continuation. In part (a), the nonlinear resistor characteristic g_β(v) is shown for large β along with two sets of load-lines, each set for a different value of the input (the load-lines intersect the g_β(v) = 0 line at the value of the input voltage: 0.5V and 2.5V). As λ_f is decreased, the load-lines rotate counter-clockwise. In part (b), the nonlinear resistor characteristic is shown for finite β. In this case, the solution exhibits a discontinuous jump, from a segmenting solution to a smoothing solution, as λ_f is decreased.

Discontinuous behavior can also occur with this type of continuation, when the continuation is used with nonlinear resistors of finite β, as is shown in figure 8b. In this example, the initial solution point will be a segmenting solution in the lower right-hand corner of the figure. As λ_f is decreased, the load-line will at some point pass completely beneath the nonlinear resistor characteristic, at which point the solution will jump from being a segmenting solution to being a smoothing solution.
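The corresponding sketch for the CV continuation reuses relax and g_beta from the earlier sketches: β is held fixed while the vertical conductance is swept down from a value large enough to pin y to u. Again, the schedule is illustrative.

    def lambda_f_continuation(u, beta, lam_s, lam_h,
                              lam_f_path=np.geomspace(1.0, 1e-4, 50)):
        y = u.copy()           # large lam_f: unique solution, y ~= u
        for lam_f in lam_f_path:
            y = relax(y, u, lam_f, beta, lam_s, lam_h)
        return y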

4. Numerical Experiments

In order to quantitatively and qualitatively demonstrate the behavior of the β-continuation and λ_f-continuation networks, the results of several numerical experiments


Fig. 9. Two-dimensional network topology.

are presented. The experiments were all conducted with two-dimensional networks, the topology of which is shown in figure 9. The experiments were conducted using serial and parallel versions of a special-purpose circuit simulator developed specifically for vision circuits [28]-[30].

The continuations were simulated by performing dynamic simulations of the networks. In order to add dynamics to the networks, a small parasitic capacitance to ground was added at each node such that the time constants of the network were much faster than the rate at which the circuit elements were varied to perform the continuation. Dynamic simulation of the networks in this way has several advantages. First, the presence of parasitic capacitances is somewhat more physical and will allow the system to perform a gradient descent, which will thereby guarantee that the network does not settle on a solution which statically satisfies KCL but is actually a local maximum of the network cost function [16]. Second, the dynamics will ensure that the network behavior is well-defined at points where solutions in the static case would disappear, as in figures 7b and 8b. (Our experience has been that discontinuous circuit behavior is much more common in the λ_f-continuation network than in the β-continuation network, which causes simulation of a λ_f-continuation network to take much more time.)

4.1. Experiments with a Synthetic Image

A series of seven experiments was conducted on a 16×16 circuit grid with a synthetic input image. Figure 10a shows the 16×16 synthetic image used for the experiments. The small step is 1V in height and the



Fig. 10. (a) Original image. (b) Original image corrupted with noise. The noisy image was used as input for all experiments in Section 4.1.

large step is 3V. The original image was then corrupted by the addition of spatially uncorrelated noise uniformly distributed between 0V and 0.5V; the corrupted image is shown in figure 10b. The noisy signal was used as input for this series of experiments.

For each experiment, a cost function was determined and the corresponding β-continuation and λ_f-continuation networks constructed. Then the networks were each simulated using the input image shown in figure 10b. For each experiment, the value of λ_s was fixed at 1.0×10^-3 and the value of λ_h was changed. For the β-continuation, the value of λ_f was fixed at 1×10^-4 and the value of β was increased from 0 to 20/λ_h. For the λ_f-continuation, the value of β was set to 20/λ_h and the value of λ_f was varied from 1 to 1×10^-4. Thus, for each experiment, the final cost-function parameter values of the β-continuation and λ_f-continuation networks were the same.

Solutions obtained by the two nonlinear networks were compared as follows:

1. Given a cost function, construct the corresponding nonlinear networks and, in addition, construct a corresponding RWS network;

2. Provide each network with the same input and allow each network to attain its solution;

3. For each nonlinear network, transfer the line process solution obtained to the RWS network by setting the switches according to equation (8);

4. Allow the RWS network to attain its voltage solution and compute the resulting cost; it is this cost that is used for comparison.

The results of the seven experiments are shown in Table 1. For the particular values of parameters used, each network computed a lower cost in roughly half the experiments. This set of experiments was actually taken from a larger set of 49. Of those, the β-continuation found the lower cost 35 times, the λ_f-continuation found the lower cost eight times, and there were six ties. Thus, in these experiments, the β-continuation performs its task of minimizing the cost function (1) extremely well.

Table 1. Experimental results showing the values of the cost function of the solutions produced by the β- and λ_f-continuation networks for different values of λ_h.

Expt    λ_h          β-continuation cost    λ_f-continuation cost
1       1.0×10^-3    1.775×10^-2            1.770×10^-2
2       5.0×10^-4    9.699×10^-3            1.254×10^-2
3       1.0×10^-4    3.299×10^-3            2.940×10^-3
4       5.0×10^-5    1.740×10^-3            1.740×10^-3
5       1.0×10^-5    7.800×10^-4            2.641×10^-3
6       5.0×10^-6    6.600×10^-4            1.650×10^-3
7       1.0×10^-6    5.518×10^-4            4.246×10^-4

Fig. 11. (a) Network solution produced by λ_f-continuation in experiments 2 and 3. (b) Network solution produced by β-continuation in experiments 2 and 3.

If the cost function were the last word on image smoothing and segmentation, we could immediately recommend the β-continuation. However, remember that the ultimate goal for a smoothing and segmentation network is essentially to recover an original image minus any noise; the cost function was introduced to give us a quantitative means for doing this. Now consider figure 11, which corresponds qualitatively to the solutions produced by the two nonlinear networks in experiments 2 and 3. Note that whereas 11a is the qualitatively correct solution, it corresponds to the higher cost in experiment two but to the lower cost in experiment three.

Naturally, this calls into question the entire cost-function methodology used for smoothing and segmentation. The difficulty arises because our efforts were concentrated only on finding an optimal solution rather than on the larger issue of determining the best cost function and parameter values. See, however, [31].

4.2. Experiments with a Real Image

The networks were then tested with a real image. Figure 12 shows the 256×256 input image, a portion of the


Fig. 12. 256×256 image of the San Francisco skyline.

San Francisco skyline. The output images shown in figures 13-16 were produced using a recently developed circuit simulation program on the Connection Machine.

Figure 13 shows the output produced by the β-continuation with fixed parameter values λ_s = 1×10^-3, λ_h = 2×10^-5, and λ_f = 1×10^-4. Figure 13a shows the output of the network at the beginning of the continuation, when β = 0; figure 13b shows the output at an intermediate point of the continuation, when β = 5×10^3; figure 13c shows the output at the end of the continuation, when β = 5×10^5. Figure 14 shows the output produced by the β-continuation with fixed parameter values λ_s = 1×10^-3, λ_h = 1×10^-5, and λ_f = 3×10^-5. Figure 14a shows the output at the beginning of the continuation, when β = 0; figure 14b shows the output at an intermediate point, when β = 2×10^4; figure 14c shows the output at the end of the continuation, when β = 1×10^6. Figure 15 shows the output produced by the λ_f-continuation with fixed parameter values λ_s = 1×10^-3, λ_h = 1×10^-5, and β = 1×10^6. Figure 15a shows the output at the beginning of the continuation, when λ_f = 1; figure 15b shows the output at an intermediate point, when λ_f = 1×10^-3; figure 15c shows the output at the end of the continuation, when λ_f = 3×10^-5. Note that the final parameter values of this network are identical to those for the network of figure 14. Figure 16 shows the output produced by the λ_f-continuation with parameter values λ_s = 1×10^-3, λ_h = 2×10^-5, and β = 5×10^4. Figure 16a shows the output at the beginning of the continuation, when λ_f = 1; figure 16b shows the output at an intermediate point, when λ_f = 5×10^-4; figure 16c shows the output at the end of the continuation, when λ_f = 1×10^-6.

4.3. Discussion

As can be seen from the experiments with the real image, not only does the selection of parameter values affect the behavior of the networks, but the continuation used also has a profound effect on the network behavior. The differences between cost functions for a particular continuation can be seen by comparing figures 13c and 14c, and by comparing figures 15c and 16c. The differences between continuation methods for a given cost function can be seen by comparing figures 14c and 15c.

One can understand the differences in the continuation methods quite readily. At the beginning of the β-continuation, the output of the β-continuation network is rather smooth, since initially the network is equivalent to a linear resistive network (see figures 13a and 14a). The edges are then added during the course of the continuation (see figures 13b and 14b). This is a difficulty because, without any initial edge information, some of the edges might be misplaced or even completely

Fig. 13. Output produced by β-continuation network. Here, the parameter values are λ_s = 1×10^-3, λ_h = 2×10^-5, λ_f = 1×10^-4, and β = 0, 5×10^3, and 5×10^5 for figures 13a, 13b, and 13c, respectively.

Fig. 14. Output produced by β-continuation network with smaller fidelity and line penalty weights and larger final β value than for the network in figure 13. Here, the parameter values are λ_s = 1×10^-3, λ_h = 1×10^-5, λ_f = 3×10^-5, and β = 0, 2×10^4, and 1×10^6 for figures 14a, 14b, and 14c, respectively.

Fig. 15. Output produced by λ_f-continuation network. Here, the parameter values are λ_s = 1×10^-3, λ_h = 1×10^-5, β = 1×10^6, and λ_f = 1, λ_f = 1×10^-3, and λ_f = 3×10^-5 for figures 15a, 15b, and 15c, respectively. Note that the final parameter values of this network are identical to those for the network of figure 14, but that the output image is much closer to the input image shown in figure 12.

Fig. 16. Output produced by λ_f-continuation network. Here, the parameter values are λ_s = 1×10^-3, λ_h = 2×10^-5, β = 5×10^4, and λ_f = 1, λ_f = 5×10^-4, and λ_f = 1×10^-6 for figures 16a, 16b, and 16c, respectively.


lost. Notice that in figures 13c and 14c, edges tend to line up along the network axes.

On the other hand, the initial output image of the λ_f-continuation network is very close (or identical) to the input one (see figures 15a and 16a). All the edges are initially present and only the spurious edges are smoothed during the course of the continuation. Since all the edge information is present at the start of the continuation, one would expect that the λ_f-continuation would more properly locate and preserve edges. This observation is borne out in figures 15c and 16c; the edges are much better preserved than those in figures 13c and 14c.

Finally, since the λ_f-continuation only requires that a linear resistance be varied, its VLSI implementation should be much more compact than that of the β-continuation, which would require varying the characteristics of a nonlinear resistor.

4.4. Behavior of the λ_f-Continuation

There were some interesting properties exhibited by networks constructed with the elements described in (40) having a fixed β < ∞. For such a network, it can be shown that there exist a λ_min > 0 and a λ_max < ∞ such that for λ_f > λ_max and for λ_f < λ_min, the network has a unique solution. In fact, for λ_f > λ_max, the output will essentially match the input (i.e., y ≈ u), whereas for λ_f < λ_min, the output will contain no edges. Consider the network behavior as a function of λ_f as λ_f is varied continuously from λ_max to λ_min. The initial solution of the network will closely match the input. Then, as λ_f is decreased, edges will begin to disappear, first the smaller, then the larger, until all the edges are gone. In other words, λ_f acts as a scale-space parameter. This has important practical applications. The dynamic network of Perona and Malik [9] has the property that time acts as a scale-space parameter. In contrast, we can exercise direct control over the scale-space parameter in the λ_f-continuation network. See also [2]. Moreover, this behavior is somewhat reminiscent of the more successful methods used for hierarchical multiscale image representation [32].

Under some mild assumptions, it can be shown that the particular solution path with the endpoints described above is continuous, connected, and can be numerically traced out in ℝ^{N+1} (λ_f × ℝ^N space) using an arc-length continuation [33]. In such a case, any particular value of λ_f would correspond to an N-dimensional hyperplane parallel to ℝ^N given by x_{N+1} = λ_f, and network solutions for that λ_f would be intersections of the solution path and that hyperplane. An interesting question now arises: why can't we just trace out the path in ℝ^{N+1}, determine the solutions, and sort them by cost to find the global minimum? The answer, unfortunately, is that there can exist solution loops that are disjoint from the main solution path, meaning that the path traced out from the starting point of large λ_f will not necessarily contain all the solutions at any given value of λ_f. The solution loops can occur in as small an example as a three-pixel circuit (see figure 17) and we offer as proof the experimental evidence in figure 18. To produce the paths, λ_f was parameterized as a function of the parameter t according to λ_f = (1 - t)λ_large + λ_small, and the plots were made in t × ℝ² space. Notice that, as predicted, there is one solution path with endpoints {t = 0, V_1 = V_in = 2.5, V_2 = 0} and {t = 1, V_1 = 0, V_2 = 0}, corresponding to an edge across the first nonlinear resistor. In addition, there is a closed

Fig. 17. Three pixel example circuit. For the result shown in figure 18, V_in = 2.5V, λ_s = 0.19, λ_h = 1.0×10^-4, β = 40, λ_f = (1 - t)λ_large + λ_small, λ_large = 0.1, λ_small = 1.0×10^-12, and t was varied from 0 to 1.

Fig. 18. Solution path in t × ℝ² space for the three pixel example circuit, demonstrating the existence of a disconnected solution loop.


solution loop centered (approximately) at {t = 0.9, V_1 = 0.2, V_2 = 1.2}, corresponding to a "misplaced" edge, i.e., an edge across the second nonlinear resistor. In general, it can be shown that for a one-dimensional RWS or RWF network with a single step input, solutions corresponding to misplaced edges always have higher cost than solutions corresponding to correctly placed edges.

5. Conclusion

In this paper, we developed and compared a series of nonlinear networks for image smoothing and segmentation. The results of several experiments indicate that the typical cost (or "energy") function minimization formulation of the smoothing and segmentation problem does not necessarily capture the essence of the task. For the specific parameter values we used, the λ_f-continuation network performed extremely well even though it did not always find the solution with minimum cost. The λ_f-continuation network has several implementation advantages over the β-continuation network. First, in certain cases, it seems to perform the smoothing and segmentation task in a more visually correct fashion. Second, λ_f can be used as a scale-space parameter. Finally, since the λ_f-continuation only requires that a linear resistance be varied, its VLSI implementation should be much more compact than that of the β-continuation.

Several open questions remain. Primary among these is the need for a comprehensive characterization of the natural behavior of these networks. By "natural behavior" we mean a set of quantitative empirical statements that relate the behavior of the network, given certain canonical edge configurations, to the cost function parameters. Furthermore, it is important to know how the networks behave in the presence of varying amounts and types of noise. Finally, Tom Richardson has developed an alternate formulation of the smoothing and segmentation problem based on a rigorous analysis of the continuous case [31]. This leads to a more complex circuit interpretation that might offer better performance than the methods investigated here. Since efficient simulation tools on the Connection Machine are now available, it is hoped that some of these questions can be addressed in the near future.

Acknowledgments

This work was supported by the National Science Foundation and the Defense Advanced Research Projects Agency under Contract No. MIP-88-14612. The first author was also supported by an AEA/Dynatech faculty development fellowship. The authors are grateful to Thinking Machines Corporation, especially Rolf Fiebrich, for providing hardware and software support for the development of the simulator used to produce the experimental results in Section 4. The authors would like to acknowledge helpful discussions with Professor Alan Yuille of Harvard University, Dr. Davi Geiger of Siemens, and Professor Jacob White of MIT.

Note

1. Connection Machine is a registered trademark of Thinking Machines Corporation.

References

1. D. Geiger and F. Girosi, "Parallel and Deterministic Algorithms from MRF's: Surface Reconstruction and Integration," IEEE Trans. Pattern Analysis and Machine Intelligence, forthcoming.

2. D. Geiger and A. Yuille, "A Common Framework for Image Segmentation," Int. Jour. Comp. Vision, forthcoming.

3. S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Trans. PAMI-6 (6), 1984, pp. 721-741.

4. J.L. Marroquin, "Optimal Bayesian Estimators for Image Segmentation and Surface Reconstruction," MIT AI Laboratory Memo 839, April 1985.

5. F.S. Cohen and D.B. Cooper, "Simple Parallel Hierarchical and Relaxation Algorithms for Segmenting Noncausal Markovian Random Fields," IEEE Trans. PAMI-9 (2), 1987, pp. 195-219.

6. J. Marroquin, S. Mitter, and T. Poggio, "Probabilistic Solution of Ill-Posed Problems in Computational Vision," Jour. Amer. Stat. Assoc. (Theory and Methods), vol. 82, no. 397, 1987, pp. 76-89.

7. A. Blake and A. Zisserman, Visual Reconstruction, Cambridge, MA: MIT Press, 1987.

8. A. Blake, "Comparison of the Efficiency of Deterministic and Stochastic Algorithms for Visual Reconstruction," IEEE Trans. PAMI-11 (1), 1989, pp. 2-12.

9. P. Perona and J. Malik, "Scale Space and Edge Detection Using Anisotropic Diffusion," IEEE Trans. PAMI-12 (7), 1990, pp. 629-639.

10. J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, New York: Academic Press, 1970.

11. C. Koch, J. Marroquin, and A. Yuille, "Analog 'Neuronal' Networks in Early Vision," Proc. Natl. Acad. Sci. USA, vol. 83, 1986, pp. 4263-4267.

12. D.W. Tank and J.J. Hopfield, "Simple 'Neural' Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit," IEEE Trans. CAS-33 (5), 1986.

13. D. Terzopoulos, "Multigrid Relaxation Methods and the Analysis of Lightness, Shading, and Flow," MIT AI Laboratory Memo 803, October 1984.

14. W.D. Hillis, The Connection Machine, Cambridge, MA: MIT Press, 1985.

15. C.A. Mead, Analog VLSI and Neural Systems, Reading, MA: Addison-Wesley, 1988.

16. J. Harris, C. Koch, J. Luo, and J. Wyatt, "Resistive Fuses: Analog Hardware for Detecting Discontinuities in Early Vision," in Analog VLSI Implementation of Neural Systems, C.A. Mead and M. Ismail, eds., Boston: Kluwer, 1989.

17. J. Harris, C. Koch, and J. Luo, "A Two-Dimensional Analog VLSI Circuit for Detecting Discontinuities in Early Vision," Science, vol. 248, 1990, pp. 1209-1211.

18. J. Harris, C. Koch, E. Staats, and J. Luo, "Analog Hardware for Detecting Discontinuities in Early Vision," Int. J. Comp. Vision, vol. 4, 1990, pp. 211-223.

19. T. Poggio and C. Koch, "Ill-Posed Problems in Early Vision: From Computational Theory to Analogue Networks," Proc. Roy. Soc. Lond. B 226, 1985, pp. 303-323.

20. B.K.P. Horn, "Parallel Networks for Machine Vision," MIT AI Laboratory Memo 1071, August 1988.

21. W. Millar, "Some General Theorems for Non-Linear Systems Possessing Resistance," Phil. Mag., 1951, pp. 1150-1160.

22. I. Elfadel, "Note on a Switching Network for Image Segmentation," unpublished manuscript, October 1988.

23. P.M. Hart, "Reciprocity, Power Dissipation, and the Thevenin Circuit," IEEE Trans. CAS-33 (7), 1986, pp. 716-718.

24. P. Cristea, F. Spinei, and R. Tuduce, "Comments on 'Reciprocity, Power Dissipation, and the Thevenin Circuit,'" IEEE Trans. CAS-34 (10), 1987, pp. 1255-1257.

25. P. Penfield, Jr., R. Spence, and S. Duinker, Tellegen's Theorem and Electrical Networks, Cambridge, MA: MIT Press, 1970.

26. B.D.H. Tellegen, "A General Network Theorem with Applications," Philips Res. Rept. 7, 1952, pp. 259-269.

27. A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd Edition, New York: McGraw-Hill, 1984.

28. A. Lumsdaine, M. Silveira, and J. White, "Simlab User's Guide," to be published as an MIT memo.

29. A. Lumsdaine, M. Silveira, and J. White, "CMVSIM User's Guide," to be published as an MIT memo.

30. L.M. Silveira, A. Lumsdaine, and J.K. White, "Parallel Simulation Algorithms for Grid-Based Analog Signal Processors," Proceedings of the International Conference on Computer-Aided Design, pp. 442-445, Santa Clara, CA, 1990.

31. T. Richardson, "Scale Independent Piecewise Smooth Segmentation of Images Via Variational Methods," MIT Laboratory for Information and Decision Systems Technical Report LIDS-TH-1940, February 1990.

32. A. Witkin, "Scale-Space Filtering," International Joint Conference on Artificial Intelligence, pp. 1019-1021, Karlsruhe, 1983.

33. H. Keller, "Numerical Solution of Bifurcation and Nonlinear Eigenvalue Problems," in Applications of Bifurcation Theory, P. Rabinowitz, ed., New York: Academic Press, 1977.

Andrew Lumsdaine received the SBEE, SMEE, and EE degrees from MIT in 1984, 1986, and 1988, respectively. During 1986, he worked as an engineer in the manufacturing development group at the Packard Electric Division of General Motors. Presently, he is again at MIT, completing the Ph.D. His current research interests include circuit simulation, parallel numerical algorithms, and robot vision. Mr. Lumsdaine is a member of IEEE and SIAM.

John L. Wyatt, Jr., received the S.B. degree from the Massachusetts Institute of Technology, Cambridge, in 1968, the M.S. degree from Princeton University, NJ, in 1970, and the Ph.D. degree from the University of California at Berkeley in 1978, all in electrical engineering. After a post-doctoral year in the Department of Physiology at the Medical College of Virginia, he joined the faculty of the Electrical Engineering and Computer Science Department at MIT, where he is currently a Professor. His research interests include nonlinear circuits and systems and analog VLSI for robot vision.

Ibrahim M. Elfadel graduated from the University of Paris (Maîtrise de Mathématiques) and from the École Centrale de Paris (Diplôme d'Ingénieur) in 1983. He moved to the US in 1984. He obtained an MS degree in Mechanical Engineering from MIT in 1986. He is currently completing his Ph.D. dissertation at the Research Laboratory of Electronics, Department of EECS, MIT. His current research interests include computer vision and image processing.


A Systolic Array for Nonlinear Adaptive Filtering and Pattern Recognition

J.G. McWHIRTER, D.S. BROOMHEAD, AND T.J. SHEPHERD, Royal Signals and Radar Establishment, St. Andrews Road, Malvern, Worcs. WR14 3PS, England

Received July 28, 1990.

Abstract. A systolic array for multi-dimensional fitting and interpolation using (nonlinear) radial basis functions is proposed. The fit may be constrained very simply to ensure that the resulting surface takes a pre-determined value at one or more specific points. The processor, which constitutes a form of nonlinear adaptive filter, behaves like a neural network based on the multi-layer, feed-forward perceptron model. One obvious application of such a network is as a pattern classifier, the constraints being used to ensure the correct classification of selected patterns.

1. Introduction

In essence, a neural network model, as applied, for example, to pattern recognition or classification, constitutes a form of nonlinear adaptive filter. In the learning phase, this form of network absorbs representative input and output training data from a given system whose response function is not known and must be suitably modelled. In the subsequent generalization phase, the network generates plausible outputs upon the input of further test data. One form of neural network, the multi-layer perceptron (MLP) [1], employs layers of simple nonlinear processors, the processing elements being highly interconnected between adjacent layers as illustrated in figure 1. The strength or weight of each connection is varied during the learning phase to minimize the difference between outputs of the network and those of the system being modelled. Unfortunately, since the output of the MLP depends nonlinearly on some of the weights, the associated learning algorithm tends to converge very slowly and may arrive at an unsatisfactory local minimum in the error surface. Furthermore, the high degree of nonlocal connectivity between processing cells of the MLP renders it much less suitable for VLSI than a regular, mesh-connected processor.

Realizing that the minimization process associated with an MLP amounts to a form of multi-dimensional curve fitting and interpolation, Broomhead and Lowe [2] have proposed an alternative technique for nonlinear adaptive filtering. In their algorithm, the unknown response function is modelled by a limited set of (nonlinear) radial basis functions [3] and a linear least squares fit is performed. In a subsequent paper, Broomhead

Fig. 1. Typical multi-layer perceptron. Each cell sums its weighted inputs and applies a sigmoidal nonlinearity to the result. The circuit above the dotted line could be regarded as a nonlinear pre-processor.

et al. [4] described a fully pipelined, mesh-connected network which combines the nonlinear radial basis function (RBF) processor with a well-known systolic array for linear least squares estimation. In this paper, we show how the RBF processor may be used in conjunction with a more general systolic array designed for linearly constrained least squares optimization. The resulting network may be used, for example, as a highly efficient pattern recognition system, the constraints being used to modify the training process such that the correct classification of certain patterns is guaranteed.


2. Radial Basis Function Model

Consider a system which takes as its input a p-dimensional vector x and produces the corresponding scalar output f(x), where f is the (unknown) nonlinear response function. Broomhead and Lowe [2] suggested that the system could be modelled by a linear expansion of the form

    \hat{f}(x) = -\sum_{i=1}^{N_c} w_i\, \phi(\|x - x_i^c\|),   (1)

where {x_i^c | i = 1, 2, ..., N_c} is a given set of center vectors in the data space and {w_i | i = 1, 2, ..., N_c} is a corresponding set of real coefficients or weights to be determined. φ(r) is a given nonlinear scalar function whose argument is the p-dimensional Euclidean distance between the data vector x and one of the center vectors x_i^c. It is therefore referred to as a radial basis function. The Gaussian function

    \phi(r) = e^{-r^2/\sigma^2}   (2)

has been found to give very good results over a wide range of practical pattern recognition problems, but many other functions such as the simple cubic or exponential could be used. The use of radial basis functions in this context may be thought of as a generalization to higher dimensional spaces of the procedure for curve fitting in one dimension using cubic or quadratic b-splines. For further information about the RBF method, the reader is referred to Powell's excellent review [3].
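A small sketch of the model (1) with the Gaussian basis (2). The width σ is written out explicitly, and the overall sign follows the convention of (1) and (6) as reconstructed above; both are assumptions about notation rather than code from the paper.

    import numpy as np

    def rbf_model(x, centers, w, sigma):
        """f_hat(x) per (1): x is (p,), centers is (Nc, p), w is (Nc,)."""
        r2 = np.sum((centers - x) ** 2, axis=1)    # ||x - x_i^c||^2
        return -(w @ np.exp(-r2 / sigma ** 2))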

3. Least Squares Formulation

It is assumed that the response of the nonlinear system is known for a discrete set of input vectors {x_n | n = 1, 2, ..., N}, where N > N_c. The known output values

    y_n = f(x_n), \qquad n = 1, 2, \ldots, N,   (3)

and corresponding input vectors are referred to as the training data, which is used to determine the optimum weights w_i in the RBF expansion. The optimum weight vector

    w^T = [w_1, w_2, \ldots, w_{N_c}]   (4)

is chosen in accordance with the least squares criterion to minimize the residual metric

    E = \sum_{j=1}^{N} e_j^2,   (5)

where e_j is the difference between the known output y_j and the modelled response of the system to the input vector x_j, i.e.,

    e_j = y_j - \hat{f}(x_j) = y_j + \sum_{i=1}^{N_c} w_i\, \phi(\|x_j - x_i^c\|).   (6)

Defining the residual vector

    e^T = [e_1, e_2, \ldots, e_N],   (7)

it can be seen that the optimum weight vector w is simply the vector which minimizes

    \|e\| = \|\Phi w + y\|,   (8)

where

    y^T = [y_1, y_2, \ldots, y_N]   (9)

and Φ is the N × N_c matrix whose (j, i) element is defined by

    \Phi_{j,i} = \phi(\|x_j - x_i^c\|).   (10)

The least squares weight vector is given by the well-known Gauss normal equations, which may be expressed in the form

    \Phi^T \Phi w + \Phi^T y = 0   (11)

and have a unique solution provided that the N_c × N_c matrix Φ^TΦ is of full rank.
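Numerically, the fit amounts to the following sketch, which builds Φ from (10) and solves min ||Φw + y|| by an orthogonal (QR-based) routine rather than by forming the normal equations; the function names and the explicit σ parameter are illustrative.

    import numpy as np

    def train_weights(X, y, centers, sigma):
        """X: (N, p) training inputs; y: (N,) known outputs y_n of (3)."""
        r2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        Phi = np.exp(-r2 / sigma ** 2)                 # eq. (10)
        w, *_ = np.linalg.lstsq(Phi, -y, rcond=None)   # minimizes ||Phi w + y||
        return w, Phi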

4. Constraining the Fit

Suppose now that the RBF model is required to output precisely a given value d in response to the input vector z. Accordingly, the weight vector must be chosen such that

    d + \sum_{i=1}^{N_c} w_i\, \phi(\|z - x_i^c\|) = 0,   (12)

i.e., the least squares fit must be carried out subject to a simple linear constraint of the form

    c^T w + d = 0,   (13)

where c is the N_c-element constraint vector whose ith element is

    c_i = \phi(\|z - x_i^c\|).   (14)


It is well known that the optimum weight vector for this type of constrained least squares problem is given by the equation

    \Phi^T \Phi w = \lambda c,   (15)

where

(16)

However, in order to preserve numerical accuracy on a finite-wordlength processor, it is better to avoid squaring the information matrix Φ as in equation (15) and to use instead an orthogonal algorithm such as the method of QR decomposition by square-root-free Givens rotations [5]. McWhirter and Shepherd [6] have shown how this particularly robust algorithm may be implemented using a triangular systolic array which is capable of achieving very high throughput rates and is eminently suitable for VLSI circuit design. This array requires as its inputs the matrix Φ, the data vector y, and the constraint vector c.
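As an off-line numerical reference (not the systolic algorithm of [6]), the constrained fit min ||Φw + y||² subject to c^T w + d = 0 can be solved directly from its KKT system. Note that this sketch squares the information matrix, which is exactly what the array avoids.

    import numpy as np

    def constrained_weights(Phi, y, c, d):
        A = Phi.T @ Phi
        n = A.shape[0]
        K = np.zeros((n + 1, n + 1))        # KKT matrix [[A, c], [c^T, 0]]
        K[:n, :n] = A
        K[:n, n] = c
        K[n, :n] = c
        rhs = np.concatenate([-Phi.T @ y, [-d]])
        sol = np.linalg.solve(K, rhs)
        return sol[:n]                      # w; sol[n] is the Lagrange multiplier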

5. RBF Preprocessor

Figure 2a depicts a systolic array designed to compute the matrix elements Φ_{j,i} as defined in equation (10). It comprises a p × N_c rectangular array of diamond-shaped cells and a row of N_c lozenge-shaped cells, where we have, for ease of illustration, assumed that p = 5 and N_c = 4. The function of the cells is specified in figure 2b. Before processing commences, the N_c center vectors x_i^c are loaded into the rectangular array of diamond-shaped cells, where they are stored such that the center vector x_1^c resides in the first column, the vector x_2^c in the second column, and so on. The sequence of input training vectors x_1, x_2, ..., x_N is then input from the left in the appropriate time-staggered manner as indicated in figure 2a. As the jth input vector moves from left to right across the ith column, the parameter s which is passed down from cell to cell serves to accumulate the value

    s = \sum_{k=1}^{p} \left( x_{k,j} - x_{k,i}^c \right)^2 = \|x_j - x_i^c\|^2,   (17)

where x_{k,j} and x_{k,i}^c denote the kth element of the vectors x_j and x_i^c, respectively. When the sum s emerges from the bottom cell in the ith column of diamond-shaped cells, it is fed into the ith lozenge-shaped cell, which generates the matrix element Φ_{j,i} as defined in equation (10) and outputs it from below. It can be seen,

Fig. 2a. RBF pre-processor array.

Fig. 2b. Cells required for RBF pre-processor array. Each diamond-shaped cell stores one element x^c of a center vector and updates the accumulating sum according to s' = s + (x - x^c)².

therefore, that the sequence of vectors φ_1^T, φ_2^T, ..., φ_N^T (where φ_j^T denotes the jth row of the matrix Φ) emerges from the bottom of the combined pre-processor array in the same time-staggered manner as the sequence of input vectors x_1, x_2, ..., x_N, but with a latency of p + 1 clock cycles.

Note that if the vector z is input to the pre-processor array in the same time-staggered manner, the corresponding output (delayed by p + 1 clock cycles) will be the time-staggered linear constraint vector c^T whose elements are defined in equation (14). The pre-processor array is therefore capable of generating the matrix Φ and the constraint vector c^T which are required as inputs to the linear least squares processor.

6. Least Squares Processor

A systolic array for performing the linear least squares optimization is illustrated in figure 3a. Since the


Fig. 3a. Systolic least squares processor array.

Fig. 3b. Cells required for least squares processor array. Boundary cell (stores d), mode 1 (adaptive): on inputs δ and z, if z = 0 or δ = 0 then (s = 0; δ' = δ); otherwise (d' = d + δz²; s = δz/d'; δ' = δd/d'; d = d'). Internal cell (stores r), mode 1: x' = x - rz; r = r + sx'. Mode 2 (frozen): the boundary cell outputs s = 0 and δ' = δ, and the internal cell outputs x' = x - rz, with no update of the stored values.

operation of this array is well documented in the literature [7], [8], it will only be described very briefly here. It takes the form of an N_c × N_c triangular array, labelled ABC, and an extra column of cells, DE, where we have, as before, chosen to depict the case N_c = 4. The entire array is constructed using two types of processing cells, each with two modes of operation as detailed in figure 3b. We note here that the frozen mode of operation may be generated very simply from the adaptive mode by setting the parameter δ = 0 in each of the boundary cells, and hence an extra instruction program is not required. In its adaptive mode, the main triangular array is designed to perform a QR decomposition of the incoming data matrix Φ as defined in equation (10), i.e., it performs a transformation of the form

    Q\Phi = \begin{bmatrix} D^{1/2} R \\ 0 \end{bmatrix},   (18)

where Q denotes an orthogonal matrix, R represents an N_c × N_c unit upper triangular matrix, and D is an N_c × N_c diagonal matrix. The QR decomposition is implemented recursively using a sequence of square-root-free Givens rotations [5]. The boundary cells (represented by circles) serve to compute the rotation parameters, which are then passed horizontally to the internal cells (represented by squares) for application to other data in the same row. Off-diagonal elements of the evolving triangular matrix R are stored within the internal cells, while elements of the diagonal matrix D are stored within the boundary cells. As each new row of the data matrix Φ passes down through the main triangular array, the stored matrices D and R are updated to take account of all data up to and including that row. When the last row has been processed, the array stores the fully transformed matrix pair [D, R] as defined in equation (18).
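The recursive update performed by the triangle can be written compactly with ordinary (square-root) Givens rotations, as in the sketch below. The actual array of [5], [6] uses the square-root-free form, carrying D and a unit-diagonal R separately, but the data flow is the same: each new row (φ, y) is annihilated into the stored (R, u).

    import numpy as np

    def qr_update(R, u, phi, y):
        """Fold one data row phi (and RHS entry y) into upper-triangular R, u."""
        phi, y = phi.astype(float).copy(), float(y)  # R, u: float arrays, updated in place
        n = phi.size
        for i in range(n):
            r = np.hypot(R[i, i], phi[i])
            if r == 0.0:
                continue                         # nothing to annihilate
            c, s = R[i, i] / r, phi[i] / r       # rotation zeroing phi[i]
            row = R[i, i:].copy()
            R[i, i:] = c * row + s * phi[i:]
            phi[i:] = -s * row + c * phi[i:]
            u[i], y = c * u[i] + s * y, -s * u[i] + c * y
        return R, u

Feeding all N training rows through qr_update, starting from R = 0 and u = 0, leaves a triangular factor playing the role of D^{1/2}R in (18) and, for the right-hand column, the vector u of (19) below.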

The function of the right-hand column DE is to apply the same sequence of Givens rotations to the incoming data vector y as defined in equation (9), i.e., it performs the transformation

    Qy = \begin{bmatrix} D^{1/2} u \\ v \end{bmatrix},   (19)

the vector u being stored within the right-hand column of cells, where it evolves recursively in a similar manner to the matrices D and R. When the QR decomposition process has been completed, the least squares weight vector may be obtained by solving the triangular system of equations

    Rw + u = 0.   (20)

However, this step proves to be unnecessary in practice. If the array storing D, R, and u is switched to the frozen mode, then for any test vector x, the value of the corresponding model f̂(x) may be obtained directly by inputting to the top of the array the vector [φ^T, 0], where

    \phi_i = \phi(\|x - x_i^c\|).   (21)


The corresponding output, which emerges from the bottom cell in the right-hand column after 2N_c clock cycles, takes the required value f̂(x). The required input vector φ^T may be generated very simply by inputting the vector x to the same pre-processor array as that defined in figure 2 for generating the data matrix Φ from the training data vectors x_1, x_2, ..., x_N.
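Off the array, the frozen-mode output can be checked against an explicit solve of (20): back-substitute for w and evaluate the model with the sign convention of (1). On the array itself no explicit back substitution occurs; the value emerges directly after 2N_c clock cycles.

    import numpy as np

    def frozen_output(R, u, phi):
        w = np.linalg.solve(R, -u)   # R w + u = 0, eq. (20)
        return -(phi @ w)            # f_hat(x) = -phi^T w, per (1)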

It remains to point out how the constraint in equation (13) may be incorporated into the least squares process. It has already been indicated that the vector c^T may be generated by passing the vector z through the RBF pre-processor. Furthermore, McWhirter and Shepherd [6] have shown how the least squares optimization may be constrained as required by storing the vector [c^T, d] in the top row of the least squares array before processing any data and ensuring that the top row operates in the frozen mode throughout the subsequent training and test phases. We note, as before, that the freezing may be accomplished very simply by setting the parameter δ = 0 in the top boundary cell. The combined RBF least squares processor array is illustrated schematically in figure 4, which shows the overall flow and sequence of data required for both the learning and test phases.

Fig. 4. Schematic of combined RBF least squares processor array indicating the input data sequence; d denotes a delay of N_c + p + 1 clock cycles. The linear constraint pre-processor is switched to mode 2 as soon as the constraint vector has been captured. The least squares array is switched to mode 2 once the training data has been processed.

7. Comments and Conclusions

In this paper we have described a novel systolic array for rapid and efficient fitting and interpolation using

radial basis functions. It may be viewed as a form of nonlinear adaptive filter which is capable of learning from a set of training data vectors in its adaptive mode and subsequently applying that knowledge to a set of test data vectors in its frozen mode. The mode of operation may be selected very simply by setting the input parameter δ to one or zero, respectively, at the top boundary cell.

The RBF processor is capable of performing a wide range of complex pattern recognition tasks and behaves, in many respects, like a neural network based on the feed-forward MLP model. It comprises a fixed nonlinear pre-processor which passes data on to a linear least squares processing network. As indicated in figure 1, the MLP may be separated into similar components assuming, as is often the case, that there is no output nonlinearity associated with the final perceptron and that the RBF processor is not constrained. The main difference is that the MLP's nonlinear pre-processor also depends on a set of weights w_i which are varied during the learning phase. Since the least squares residual metric is often quite insensitive to these nonlinear weights, the associated learning procedure tends to converge very slowly and the global minimum is rarely attained. It is worth noting that varying only the final layer of weights in such an MLP would reduce the learning process to a simple linear least squares optimization, in which case the back propagation algorithm proposed by Rumelhart et al. [1] reduces to the well-known LMS adaptive filtering algorithm [8]. Alternatively, as in the RBF processor, the least squares computation could be performed directly using the type of triangular systolic array depicted in figure 3.

Fixing the nonlinear weights in an MLP is clearly analogous to choosing a fixed set of center vectors in the RBF expansion. In principle, the position of these center vectors could also be varied with a view to reducing further the least squares residual metric. However, Lowe [9], in applying the RBF technique to the problem of vowel classification in speech recognition, has found the corresponding increase in computation to be of little benefit except when the number of center vectors must be reduced as far as possible in order to produce a circuit of minimal complexity.

Several authors have now reported on comparisons between the (unconstrained) RBF fitting technique and the MLP or other approaches to pattern recognition. For example, Renals et al. [10] have applied both methods to the speech processing problem of phoneme recognition and found the overall performance of both techniques to be very similar and to compare favorably with the use of a vector quantized hidden Markov model. However, the training of an RBF processor could be carried out two orders of magnitude faster than that of an MLP by back propagation using the same serial computer.

Bounds et al. [11] have found that the RBF technique, when applied to the diagnosis of low back pain, gave better results than three groups of doctors and was comparable in performance to the statistical methods of K-Nearest Neighbor and Closest Class Mean classification. They varied the number of centers in the RBF expansion, choosing these as a random subset of the training patterns, and found that the results were not sensitive, in general, to the number or choice of centers.

Vrckovnik et al. [12] have used the RBF method to classify impulse radar waveforms from asphalt bridge decks. They found that it could be trained faster and gave better overall performance compared to an MLP. They also varied the number of centers and found that the performance saturates quite rapidly as the number is increased from a small initial value.

As yet, the technique described in this paper of constraining the RBF fit has not been applied to any practical pattern recognition problem. However, it is envisaged that the use of one or more constraints will help to resolve patterns from closely related classes by ensuring that key members of such classes are correctly identified by the RBF processing network. In this paper, we have only discussed the introduction of a single linear constraint but, as described by McWhirter and Shepherd [6], the method may readily be extended to cope with several constraints assuming, of course, that the number of constraints is less than the number of centers N_c.

Throughout this paper, it has been assumed that the output of the RBF network is a simple scalar ŷ which is compared to known scalar outputs y_i in the training data set. However, the method may be generalized very simply to the case of a network which produces an m-dimensional vector output ŷ for comparison with m-dimensional vector values y_i in the training set. As indicated in figure 4, the systolic array is simply extended to include m columns of cells on the right hand side, each identical to the right hand column in figure 3a.

In summary then, we have described a novel systolic array for nonlinear fitting and interpolation using radial basis functions. It is capable of performing a wide variety of complex pattern recognition tasks and can be compared in many respects to an artificial neural network based on the feed-forward MLP model. The RBF fitting technique compares very favorably in terms of recognition performance but, even on a sequential computer, the underlying algorithm converges orders of magnitude faster. The highly parallel and pipelined architecture proposed in this paper offers the potential for extremely fast computation and furthermore, since it takes the form of a regular mesh-connected array, the RBF processor is much more suitable for design and fabrication than the MLP.

References

1. D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, vol. 1, (eds. D.E. Rumelhart and J.L. McClelland), Cambridge, MA: MIT Press, 1987, pp. 318-362.

2. D.S. Broomhead and D. Lowe, "Multi-variable function interpolation and adaptive networks," Complex Systems, vol. 2, 1988, pp. 321-355.

3. M.J.D. Powell, "Radial basis functions for multivariable interpolation: a review," Proc. IMA Conference on Algorithms for the Approximation of Functions and Data, RMCS Shrivenham, 1985.

4. D.S. Broomhead, R. Jones, J.G. McWhirter and T.J. Shepherd, "A systolic array for nonlinear adaptive filtering and pattern recognition," Proc. IEEE Int. Symposium on Circuits and Systems, New Orleans, 1990.

5. W.M. Gentleman, "Least squares computations by Givens rotations without square roots," J. Inst. Maths. Applics., vol. 12, 1973, pp. 329-336.

6. J.G. McWhirter and T.J. Shepherd, "A pipelined array for linearly constrained least squares optimization," in Mathematics in Signal Processing, (ed. T.S. Durrani et al.), Oxford: Clarendon Press, 1987, pp. 457-483.

7. W.M. Gentleman and H.T. Kung, "Matrix triangulation by systolic arrays," Proc. SPIE, vol. 298, "Real Time Signal Processing IV," 1981, pp. 298-303.

8. S. Haykin, Adaptive Filter Theory, Englewood Cliffs, NJ: Prentice Hall, 1986.

9. D. Lowe, "Adaptive radial basis function non-linearities and the problem of generalization," Proc. IEE Int. Conf. on Artificial Neural Networks, London, Oct. 1989.

10. S. Renals and R. Rohwer, "Learning phoneme recognition using neural networks," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Glasgow, 1989.

11. D.G. Bounds, P. Lloyd and B.G. Matthew, "A comparison of neural networks and other pattern recognition approaches to the diagnosis of low back disorders," forthcoming.

12. G. Vrckovnik, C.R. Carter and S. Haykin, "Radial basis function classification of impulse radar waveforms," private communication.



John G. McWhirter is a senior principal scientific officer at the Royal Signals and Radar Establishment in Malvern, England. He joined the establishment in 1973 and is currently carrying out research on advanced algorithms and architectures for digital signal processing with particular emphasis on systolic and wavefront arrays. He is a fellow of the Institute of Mathematics and its Applications and a visiting professor at The Queen's University of Belfast. He graduated with first class honors in applied mathematics at Queen's University, Belfast in 1970 and received the PhD from the same university in 1973 for theoretical research on atomic and molecular physics.

David Broomhead was educated at Merton College, Oxford and, in 1974, obtained a B.A. in Natural Sciences (Chemistry). He received his post-graduate training in the Department of Physical Chemistry at Oxford, and in 1976 obtained a D.Phil., having done research into the theory of relaxation of nonlinear intramolecular modes. Since then, working in Harwell, Kyoto, Warwick and, since 1983, at the Royal Signals and Radar Establishment, Malvern, he has become increasingly interested in the theory of nonlinear dynamical systems and its application to the processing and analysis of time series data.

T.J. Shepherd graduated in Physics from the Imperial College of Science and Technology, London, U.K., in 1974. He went on to gain a PhD and Diploma of Imperial College in the subject of Theoretical Elementary Particle Physics. He joined the Royal Signals and Radar Establishment in 1978, where he is now a Principal Scientific Officer. His research interests include Adaptive Systems in Digital Signal Processing, and Quantum and Laser Physics. Dr. Shepherd is a member of the S.P.I.E. and of the Institute of Physics.


Control Generation in the Design of Processor Arrays

JÜRGEN TEICH AND LOTHAR THIELE Institute of Microelectronics, University of Saarland, D-6600 Saarbrücken, West Germany

Received July 23, 1990; Revised November 29, 1990.

Abstract. The problem of mapping algorithms onto regular arrays has received great attention in the past. Results are available on the mapping of regular algorithms onto systolic or wavefront arrays. On the other hand, many algorithms that can be implemented on parallel architectures are not completely regular but are composed of a set of regular subalgorithms. Recently, a class of configurable processor arrays has been proposed that allows the efficient implementation of piecewise regular algorithms. In contrast to pure systolic or wavefront arrays, they are distinguished by a dynamic configuration structure. The known design trajectories, however, cannot be applied to the design of configurable processor arrays because the functions of the processing elements and the interconnection structure are time- and space-dependent. In this paper, a systematic procedure is introduced that allows the efficient design of configurable processor arrays, including the specification of the processing elements and the generation of control signals. Control signals are propagated through the processor array. The proposed design trajectory can be used for the design of regular arrays or configurable arrays.

1. Introduction

The advances in VLSI technology lead to an increasing interest in algorithmically specialized circuits. Especially in the area of information processing systems with real time or interactive time constraints, there is a strong demand for the development and application of new architectural and circuit concepts to handle the extreme computational and I/O requirements of the algorithms needed. Consequently, there is a definite interest in parallel hardware architectures [1], [2]. Some of the most interesting applications can be found in the areas of real-time signal processing and image processing.

In order to match the need for high throughput implementations of computationally intensive algorithms, a special class of realizations suited for VLSI has been proposed. This class, known as systolic or wavefront arrays, is distinguished by massive parallelism and pipelining, a completely regular interconnection structure and a local communication scheme [3], [1]. A large number of such arrays has been proposed for solving problems in signal processing and image processing, see [2], numerical linear algebra [2], [4], [5], [6], [7], combinatorial problems [8], [9], [10] and uniform recurrent equations [11], [12].

In order to guarantee a correct design in a short design time, the use of computer aided design tools to relate an algorithmic specification and the actual implementation is indispensable. Consequently, there is large interest in systematic design methods for processor arrays.

It is well known that there are systematic methods to map certain classes of algorithms onto regular processor arrays [13]. The class of algorithms related to regular processor arrays is called regular iterative algorithms (RIAs). These algorithms are distinguished by completely regular dependence graphs. Systematic procedures to design regular or wavefront arrays are based on affine index transformations. By now, many results on the relation between algorithms and systolic architectures are available [11], [12], [13], [14], [15]. Moreover, some of these methods have been implemented in software [16], [12].

Despite these results, the concrete design of processor arrays very often leads to problems whose solution necessitates more general concepts than those of a regular dependence graph, a regular algorithm, and a regular processor array [17]. The design trajectory we are going to follow is based on a successive refinement of program specifications. To the given program which specifies the algorithm to be executed, a sequence of program transformations is applied until an efficient program for the given target architecture is obtained. Consequently, the exploitation of new architectural features is equivalent to a refinement of the program model. The definition of a design methodology for processor arrays can be partitioned into two basic tasks:

1. Definition of a basic set of program transformations. Each of these transformations acts like a filter: an input program is transformed into a provably correct program. Despite the degree of freedom inherent in these transformations, the correctness of the syntax and semantics of the resulting program is guaranteed.

2. Optimization. The sequence of transformations and the corresponding parameters must be chosen such that certain criteria are satisfied.

In order to point out the importance of the control generation within the whole design trajectory, let us describe some basic steps:

• Very often, the algorithm to be realized is given in the form of a hierarchical program specification. Examples are fast realizations of Gaussian elimination [18], implementations of neural networks [19] and the parallel solution of Kalman filter equations [20]. In these cases, the dependency structures of the algorithms are not regular but can be partitioned into regular subsets.

• In order to satisfy resource constraints imposed by the target architecture (limited number of processing elements, given dimension of the interconnection network), the techniques of partitioning [21], [22] or multiprojection [2], [4], [6] must be applied.

• The data dependencies of the given program must be localized if the target architecture has a local interconnection network, see [23], [24]. The same problem occurs if the input and/or output data of the algorithm are required to be available at the borders of the processor array.

• The mapping of nested loop programs onto processor arrays necessitates their transformation into single assignment form in order to extract the inherent parallelism, see e.g., [25].

It can be shown that the results of any of these program transformations can be described in the form of a piecewise regular algorithm, see [26], [17]. On the other hand, it has been mentioned already that the class of algorithms directly related to regular processor arrays consists of regular iterative algorithms. The control generation as described in this paper fills this gap in the design trajectory, as any piecewise regular algorithm can be transformed into a syntactically and semantically correct regular iterative algorithm. The introduced control mechanism covers I/O control as well as functional control of the processing elements.

In order to guarantee efficient VLSI implementations for a large class of algorithms, the following constraints on the concept of control must be considered:

• Simplicity. For exploitation of massive parallelism a simple control mechanism should be imposed.

• Independence of the size of the problem. The processing elements should be independent of the size of the problem to be processed.

• Local control. In order to preserve the local communication scheme of processor arrays, the control mechanism must avoid global access to processing elements.

• Flexibility. It should be possible to implement the specified control program either in software on a general purpose parallel computer or directly in the form of a dedicated processor array.

The problem of configurability in networks of mesh connected arrays has been addressed by many authors, e.g., [27] and [28]. Intelligent operating systems have been proposed in [29] to control network configurations for solving problems in image processing. Moreover, the design of fault-tolerant processor arrays [30] is based on a dynamically reconfigurable interconnection structure. The following results concern the control of processor functions. For some special types of processing elements (e.g., [31], [32]) the problem of functional control is treated, but no systematic concept is given there that relates a certain class of algorithms and a specification of the required control. In [33], the Configurable, Highly Parallel Computer is introduced. Switches that contain a local program control the data path selections. Switches have to be loaded via a separate interconnection skeleton, and during program execution a global controller broadcasts a command to all switches. In the case of the Instruction Systolic Array [34], control signals are locally propagated. Given a mesh-connected processor array, vertical control signals control the processor functions while horizontal control signals control the data path selections. Unfortunately, no class of algorithms can be related to this architecture. In [13], canonical propagation vectors are introduced that are added to an RIA at each iteration point. Corresponding variables contain the index coordinates of each iteration. Each processing element of the resulting array selects the proper functions and data paths by checking which equations are defined at the present iteration point. The processor array, however, is problem-size dependent because the memory requirement of each processing element depends on the size of the problem. In [35], [23] the need for control has been discovered while working on localization methods that convert algorithms containing global data dependencies into local algorithms. In [35] the class of CUREs (Conditional Uniform Recurrence Equations) is considered, which is distinguished by iteration dependent conditionals described by sets of linear inequalities. In the case of equality constraints, Boolean control signals are propagated in hyperplanes of the dependence graph.

In the following we will present a control method that takes the above criteria into account and helps to overcome some of the problems of existing approaches. The result of such a procedure is a control algorithm that specifies the propagation of additional control variables and that specifies the processing elements including the control unit (CU). First we introduce the class of piecewise regular algorithms. The capabilities of configurable processor arrays will be outlined in Section 3, where we introduce the architectural concept of the control flow processor, a model of a processing element of a configurable processor array. In Section 4, a systematic procedure is given that relates a piecewise regular algorithm and a configurable processor array. The proposed method is suitable for the hierarchical mapping of algorithms, see [17] (Section 5).

2. Definitions and Notation

2.1. Notation of Programs

In the following we will use the notation introduced by Chandy and Misra [36] to specify the programs and architectures we are dealing with. We restrict ourselves to a small subset of the structures as defined in UNITY.

The basic program structure we are going to use consists of an initially-section and an always-section. The initially-section is used to define all inputs to the program, i.e., initial values of the variables. The purpose of the always-section is to define relations between the variables of the program using a set of equations. The syntax is identical to that of the initially-section. Despite the fact that an initially-section is not necessary for our purposes, programs containing their inputs explicitly are often easier to understand.

The always-section and the initially-section are composed of a set of equations. The sign ‖ is used in order to separate the equations. The notation of a quantification as introduced in [36] can be used to describe sets of equations easily. The quantification (‖ I : I ∈ I :: S[I]), where I is an iteration space, I is an iteration vector and each S[I] is an equation, denotes an enumeration of equations. For example, (‖ i, j : i = 1 ∧ 1 ≤ j ≤ 2 :: x[i, j] = x[0, j]) is equivalent to x[1, 1] = x[0, 1] ‖ x[1, 2] = x[0, 2]. The notation for conditionals can be seen in the following equation for assigning to x the absolute value of y: x = y if y ≥ 0 → −y if y < 0. The cases are separated by the symbol →. If we have an iteration dependent conditional of the form S[I] if I ∈ I, we call the set I the condition space of the indexed equation S[I].

As both the always-section and the initially-section consist of a set of equations, some restrictions are necessary to avoid circular definitions:

1. A variable appears at most once on the left hand side of an equation (single-assignment property).

2. There exists an ordering of the equations after quantifications have been expanded such that any variable appearing on the right hand side of an equation appears on the left hand side earlier in the ordering (computability); a minimal check is sketched below.
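Both restrictions can be checked mechanically once the quantifications have been expanded. A minimal Python sketch, using the two expanded equations from the quantification example above (the encoding of indexed variables as tuples is an assumption of the sketch):

```python
from graphlib import TopologicalSorter

# expanded equations x[1,1] = x[0,1] ‖ x[1,2] = x[0,2];
# map each left hand side to the variables on its right hand side
eqs = {("x", 1, 1): [("x", 0, 1)],
       ("x", 1, 2): [("x", 0, 2)]}

# restriction 1 (single assignment) holds because dict keys are unique;
# restriction 2 (computability) holds iff the dependencies are acyclic,
# in which case a valid evaluation ordering exists
print(list(TopologicalSorter(eqs).static_order()))
```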

The notation introduced can be used to easily represent a subset of regular iterative algorithms which is directly related to regular processor arrays, see [13]. In this case, the following restrictions on the program schema are imposed:

1. All indexing functions are of the form I − d where I ∈ Z^s denotes the iteration vector and d ∈ Z^s is independent of I. For example, x[i, j − 3, k + 4] is a valid indexed variable whereas x[i, i − j] is not.

2. The program consists of one quantification only. Consequently, a common iteration space is associated with all indexed equations. The common iteration space I contains all integer points in a convex region of the s-dimensional Euclidean space.

3. There are no iteration dependent conditionals.

As already mentioned in the Introduction, properties 2 and 3 are too restrictive to represent programs which result from operations like localization, partitioning, multiprojection and nested loop conversion.

2.2. Piecewise Regular Algorithms

It is important to have a consistent trajectory from a certain class of programs to the final hardware implementation. It has been shown that the class of piecewise regular algorithms as defined in [26], [17] leads to a consistent model for the program transformations mentioned in the Introduction. Before defining this class of programs, some definitions are given:



DEFINITION 2.1. An (integer) lattice in Z^s is the set of all integer linear combinations of a set of m linearly independent vectors a_i ∈ Z^s (m ≤ s). The lattice generated by a_1, a_2, ..., a_m, denoted by L(a_1, a_2, ..., a_m), is the set

{ Σ_{i=1}^{m} κ_i a_i : κ_i ∈ Z }.

If we write a_1, a_2, ..., a_m as the columns of an integer matrix A ∈ Z^{s×m}, then L(a_1, a_2, ..., a_m) = L(A) = {I ∈ Z^s : I = Aκ; κ ∈ Z^m}. In the following we add an additional integer offset vector b ∈ Z^s and define L(A, b) = {I ∈ Z^s : I = Aκ + b; κ ∈ Z^m}.

According to the following definition, a system of linear inequalities represents a polyhedron or a polytope.

DEFINITION 2.2. A polyhedron P ⊆ R^s is the set of vectors that satisfy a finite number of linear inequalities, P = {I ∈ R^s : ĀI ≤ b̄}, where Ā ∈ R^{n×s} and b̄ ∈ R^n. A bounded polyhedron is called a polytope.

Using the above definitions, a linearly bounded lattice can be defined as follows:

DEFINITION 2.3. The intersection of an integer lattice and a polyhedron is called a linearly bounded lattice.

In particular, a linearly bounded lattice I can be described as

I = {I ∈ Z^s : I = Aκ + b; κ ∈ Z^m ∧ ĀI ≥ b̄}      (1)
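As an illustration of Definition 2.3, the following small Python sketch tests membership in a linearly bounded lattice of the form (1). The exact integer solve is deliberately simple and assumes a square, non-singular lattice matrix; the example lattice is the one used later in Section 2.2 (i = 3κ₁ + κ₂, j = 2κ₂ + 1), and all names are illustrative.

```python
from fractions import Fraction

def in_lattice(A, b, I):
    """Test I ∈ L(A, b) for a square, non-singular lattice matrix A:
    solve A·κ = I − b exactly and check that κ is integral.  The general
    (rectangular) case would need a Hermite-normal-form solver."""
    n = len(A)
    M = [[Fraction(A[r][c]) for c in range(n)] + [Fraction(I[r] - b[r])]
         for r in range(n)]
    for col in range(n):                       # exact Gaussian elimination
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    kappa = [M[r][n] / M[r][r] for r in range(n)]
    return all(k.denominator == 1 for k in kappa)

def in_lbl(A, b, Abar, bbar, I):
    """Membership in the linearly bounded lattice of equation (1):
    I must lie in the lattice L(A, b) and satisfy Ā·I ≥ b̄."""
    in_poly = all(sum(a * x for a, x in zip(row, I)) >= bb
                  for row, bb in zip(Abar, bbar))
    return in_poly and in_lattice(A, b, I)

# lattice from the example in Section 2.2: i = 3κ1 + κ2, j = 2κ2 + 1
A, b = [[3, 1], [0, 2]], [0, 1]
box = ([[1, 0], [0, 1]], [0, 0])               # polyhedron i ≥ 0, j ≥ 0
print(in_lbl(A, b, *box, [4, 3]))              # True  (κ1 = 1, κ2 = 1)
print(in_lbl(A, b, *box, [4, 2]))              # False (j must be odd)
```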

where A ∈ Z^{s×m}, b ∈ Z^s, Ā ∈ R^{n×s} and b̄ ∈ R^n. Now, the class of piecewise regular algorithms can be defined as an enumeration of quantifications with some special properties:

1. All indexing functions are of the form I − d where I ∈ Z^s denotes the iteration vector and d ∈ Z^s is independent of I.

2. The iteration spaces of the quantifications are linearly bounded lattices.

3. The condition spaces of iteration dependent conditionals are linearly bounded lattices.

Before we give some remarks on the above characterization, we introduce the main example of this paper. The example has been chosen to be as simple as possible while still exemplifying the basic properties of the proposed procedures.

Example 2.1. The equation

Y_i = Σ_{j=0}^{N−1} A_j U_{i−j}

is the difference equation of a finite impulse response (FIR) filter. The filter relates the output Y_i and the filter input U_i by evaluating the weighted sum of the current and the last N − 1 input values. The weights A_j are the coefficients of the filter. A piecewise regular algorithm for FIR-filtering can now be given as follows:

initially
  (‖ i : i ≥ 0 :: y[i, −1] = 0) ‖
  (‖ i : i ≥ −N :: u[i, −1] = U_{i+1}) ‖      (2)
  (‖ i : −N ≤ i ≤ −1 :: a[i, −1] = A_{−i−1})

always                                         (3a)
  (‖ i, j : i ≥ 0 ∧ 0 ≤ j < N ::               (3b)
    y[i, j] = y[i, j − 1] + a[i, j] u[i, j]) ‖ (3c)
  (‖ i, j : i > j − N ∧ 0 ≤ j < N ::           (3d)
    a[i, j] = a[i − 1, j − 1]  if i ≤ 0        (3e)
            → a[i − 1, j]      if i > 0 ‖      (3f)
    u[i, j] = u[i − 1, j − 1])                 (3g)

The result Y_i satisfies

(‖ i, j : i ≥ 0 ∧ j = N − 1 :: Y_i = y[i, j])      (4)

This program has been generated starting from

Y_i = Σ_{j=0}^{N−1} A_j U_{i−j}.

The localization of the summation operator leads to (3b), and the localization of A_j and U_{i−j} leads to (3f) and (3g), respectively. We require all inputs and outputs of the algorithm to be at the borders of the processor array. The variables are initialized at specified boundary hyperplanes and then propagated to their destinations by means of localization. This procedure leads to (3e).
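The program can be checked mechanically. The following Python sketch enumerates the initially- and always-sections of (2) and (3) in a dependency-respecting order and verifies result (4) against the original difference equation; the tap count, coefficients, input data and iteration bounds are illustrative assumptions.

```python
# Illustrative check of (2)-(4): N, the coefficients A_j, the inputs U_i
# and the bound T on index i are assumptions of this sketch.
N = 3
A = [2, 5, 7]                                # coefficients A_0 .. A_{N-1}
U = {i: i + 1 for i in range(-N, 12)}        # inputs U_i (arbitrary data)
T = 10

y, a, u = {}, {}, {}
# initially-section (2)
for i in range(0, T):       y[i, -1] = 0
for i in range(-N, T):      u[i, -1] = U[i + 1]
for i in range(-N, 0):      a[i, -1] = A[-i - 1]
# always-section (3), enumerated in an order respecting the dependencies
for j in range(0, N):
    for i in range(j - N + 1, T):
        a[i, j] = a[i - 1, j - 1] if i <= 0 else a[i - 1, j]   # (3e), (3f)
        u[i, j] = u[i - 1, j - 1]                              # (3g)
        if i >= 0:
            y[i, j] = y[i, j - 1] + a[i, j] * u[i, j]          # (3c)
# result (4) against the direct difference equation
for i in range(0, 8):
    assert y[i, N - 1] == sum(A[j] * U[i - j] for j in range(N))
print("PRA output matches the FIR difference equation")
```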

Obviously, there are many equivalent representations of a piecewise regular algorithm. For example, the lattice I = {i, j : i = 3κ₁ + κ₂ ∧ j = 2κ₂ + 1; κ₁, κ₂ ∈ Z} can equivalently be represented as I = {i, j : 2i − j + 1 ≡ 0 (mod 6) ∧ j − 1 ≡ 0 (mod 2)}. More important, iteration dependent conditionals can be transformed into a set of quantifications without iteration dependent conditionals, as the intersection of linearly bounded lattices is again a linearly bounded lattice. For example, the set of equations defined by (3e, f) can be written as

(‖ i, j : 0 ≥ i > j − N ∧ 0 ≤ j < N :: a[i, j] = a[i − 1, j − 1]) ‖
(‖ i, j : i > 0 ∧ 0 ≤ j < N :: a[i, j] = a[i − 1, j])

It will be seen that the control overhead is greatly reduced if any variable occurs at most once on the left hand side of an indexed equation. In this case, a quantification of a piecewise regular program defining the indexed variable x_k, 1 ≤ k ≤ V, has the form

(‖ I : I ∈ I_k :: x_k[I] = F_k^1{...}        if I ∈ I_k^1
                          ...
                          → F_k^l{...}        if I ∈ I_k^l      (6)
                          ...
                          → F_k^{W_k}{...}    if I ∈ I_k^{W_k})

where F_k^l{...} denotes an arbitrary functional relation.

2.3. Dependence Graph (DG)

The dependency structure of a piecewise regular algorithm can be represented by a dependence graph. There is a node for every indexed variable; in particular, for a variable x_l[L] there is a node l at index point L. There is an edge between two nodes of a dependence graph iff the corresponding indexed variables directly depend on each other. For example, if a variable x_l[L] is evaluated using the variable x_k[K], then there is an edge from vertex k at index point K to vertex l at index point L. The definition of the dependence graph of a PRA is given below:

DEFINITION 2.4. The dependence graph of a piecewise regular algorithm has vertices corresponding to the variables named x_l at all index points I where x_l[I] is defined. There is an edge between vertex k at index point K and vertex l at index point L iff there is an equation of the form

x_l[L] = F_{kl}{..., x_k[K], ...}

with K = L − d_kl in the completely expanded program. F_{kl}{...} denotes an arbitrary many-to-one function not dependent on I. If the variable x_k[K] is not defined in the always-section, then it is an input to the algorithm and is defined in the initially-section.

Example 2.2. Figure 1 represents the dependence graph of the PRA of the FIR-filter in Example 2.1 for N = 3. Variables at j = −1 are external inputs to the algorithm.

Fig. 1. The dependence graph of the FIR-filter algorithm (Example 2.1) for N = 3.

2.4. Algorithm Transformations and Projection

The design trajectory called stepwise refinement as described in the Introduction necessitates the application of basic program transformations in order to obtain a program schema which is matched to the required target architecture. In this paper, we only make use of the piecewise regular transformation as defined in [26], [17]. The iteration vectors of the V variables defined by a piecewise regular algorithm are transformed independently using V affine mappings. In contrast to a global transformation of the iteration space, e.g., [12], [14], [15], [37], feasible schedules for more general classes of algorithms can be determined. In addition, the mapping must be restricted in the way that all transformed dependence vectors have a strictly positive time component and that dependencies between variables remain iteration-independent [26].



In order to map a piecewise regular algorithm onto a processor array, the geometric representation of its dependence graph is changed via piecewise affine transformations of the iteration spaces. Finally, the transformed dependence graph is projected onto a processor array. This is accomplished by assigning one coordinate of the iteration spaces to the sequencing of operations. Other coordinate directions represent the spatial dimensions of the processor array.

The following example describes the application of a simple affine transformation applied to all iteration spaces of the piecewise regular algorithm of the FIR-filter. As a result, a new algorithm is obtained which is related directly to a realization in the form of a processor array.

Example 2.3. We are going to transform the PRA of the FIR-filter given in (2), (3), (4). Direction t of the transformed iteration vector I′ = (t, i′)′ is associated with the coordinate of time. The affine transformation of the dependence graph shown in figure 1 with

I′ = TI + f,  T = [1 1; 0 1],  f = [0 0]′

yields the dependence graph shown in figure 2. Some of the details of the given dependence graph (multiple edges, input and output nodes) are not shown in order to simplify the representation. The transformed algorithm can be written as follows:

initially
  (‖ t : t ≥ −1 :: y[t, −1] = 0) ‖
  (‖ t : t ≥ −N − 1 :: u[t, −1] = U_{t+2}) ‖      (8)
  (‖ t : −N − 1 ≤ t ≤ −2 :: a[t, −1] = A_{−t−2})

always
  (‖ t, i′ : t ≥ i′ ∧ 0 ≤ i′ < N ::
    y[t, i′] = y[t − 1, i′ − 1] + a[t, i′] u[t, i′]) ‖
  (‖ t, i′ : t > 2i′ − N ∧ 0 ≤ i′ < N ::           (9)
    a[t, i′] = a[t − 2, i′ − 1]  if t ≤ i′
             → a[t − 1, i′]      if t > i′ ‖
    u[t, i′] = u[t − 2, i′ − 1])

The outputs are obtained as follows:

(‖ t, i′ : t ≥ i′ ∧ i′ = N − 1 :: Y_{t−i′} = y[t, i′])      (10)

A processor array that realizes this algorithm can be obtained by projecting the dependence graph of figure 2 in the direction of the time coordinate. The resulting linear array shown in figure 3 contains N identical processing elements. The processing elements are connected by register elements. Due to the inhomogeneous nature of the dependence graph in figure 2, however, an additional variable c is added to control the different processor functions and the different interconnection structures. A systematic control procedure to derive this architecture will be introduced in Section 4; a small sketch of the projection step follows figure 3 below.

Fig. 2. Transformed dependence graph of the FIR-filter algorithm for N = 3.


Fig. 3. Processor array that implements the FIR-filter algorithm of Example 2.1.
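To make the projection step concrete, the following Python sketch derives the structure of figure 3 from the transformed dependence vectors of (9): the spatial component of each vector gives the link between neighboring processing elements, and the time component gives the number of pipeline registers on that link. The dictionary keys are illustrative labels.

```python
# Dependence vectors of (9) in transformed coordinates (dt, di'):
# projecting along t maps a vector onto a link from PE i'-di' to PE i'
# carrying dt register stages.
deps = {
    "y": (1, 1),            # y[t-1, i'-1] -> y[t, i']
    "a, t <= i'": (2, 1),   # a[t-2, i'-1] -> a[t, i']
    "a, t >  i'": (1, 0),   # a[t-1, i']   -> a[t, i']
    "u": (2, 1),            # u[t-2, i'-1] -> u[t, i']
}
for var, (dt, dip) in deps.items():
    link = "feedback inside the PE" if dip == 0 else f"link from PE i'-{dip} to PE i'"
    print(f"{var:11s}: {link}, {dt} register stage(s)")
```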

3. Architectural Concepts of Configurable Processor Arrays

There are many different possibilities for mapping a given piecewise regular algorithm onto a processor array. In addition to the inherent degree of freedom in the well known program transformations, i.e., affine index transformations, we are going to consider the implementation of iteration dependent conditionals. Obviously, there are many possibilities to add appropriate control units to the execution units of the processing cells to take these conditionals into account. In the Introduction we summarized the requirements of a control mechanism that enables the efficient mapping of PRAs onto configurable processor arrays. In this section, principally different possibilities to control the functions of the processing elements and to change the configuration dynamically are compared.

3.1. Control Models

In [38] different control models for programmable VLSI processor arrays are introduced. They can be classified into a global model and local models.

• Global model. Identical control data (a typical example is the SIMD (single instruction, multiple data) machine, see e.g., [2]) or individual control data are broadcast from a central control unit to each processing element in the array (e.g., [31]). If identical data are sent, all processing elements execute the same function. On the other hand, if individual control data are used, the size and I/O rate of the central unit depend on the size of the processor array, and the model is thus not suited for massive parallelism. The architectural concept of the global control model is shown in figure 4.

Fig. 4. Global control model.

• Local models. This class can further be subdivided into a control flow model and a model with prestored local control.

a. Prestored local control. In case of prestored local control, each processing element has its own control mechanism that locally controls the processor functions (see figure 5). In some cases, the method of adding to each processing element a context-sensitive finite state mechanism makes the design of VLSI processor arrays more complicated, and also makes the processing elements problem-size dependent. In the case of the implementation of (6), processing elements must be capable of deciding whether an iteration point is part of a linearly bounded lattice or not. Consequently, run-time control is performed and the size of the dependence graph determines the specification of the processing elements. On the other hand, if the structure of the condition space is reasonably simple, prestored local control may be preferable to the concept of control flow.

Fig. 5. Local control models: the control flow models 'modify & propagate' and 'propagate only' (c′ = c), and prestored local control.

b. Control flow model. The control data are pipelined through the processing elements of the processor array just like operation data (see also figure 5). The iteration dependent conditionals as shown in (6) are converted into data dependent conditionals. The control units can be considered to be purely combinatorial at this level of hierarchy. We can distinguish between two different classes of control units:

• propagate only: Control data are decoded in the control unit of a processing element, selecting the actual processor function. These control data are propagated unmodified to the neighbors of the processing element (e.g., [35], [38]).

• modify & propagate: The only difference to the propagate only method is that in this case, control data may be modified in the control unit of a processing element before being sent to neighboring processing elements (a code sketch of both variants follows below).
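The following minimal Python sketch contrasts the two control unit variants; the control encoding, the EU functions and all names are illustrative assumptions, not taken from the paper.

```python
def propagate_only_pe(c, x):
    """Propagate only: the control word c selects the EU function and
    leaves the PE unchanged."""
    y = x + 1 if c == 1 else x           # illustrative EU functions
    return c, y                          # c is forwarded unmodified

def modify_and_propagate_pe(c, x):
    """Modify & propagate: the CU may rewrite c before forwarding it,
    e.g., a countdown that switches the function after c steps."""
    y = x + 1 if c > 0 else x
    return (c - 1 if c > 0 else 0), y    # modified control word leaves the PE

# a control word entering a chain of three PEs
c, x = 2, 10
for _ in range(3):
    c, x = modify_and_propagate_pe(c, x)
print(c, x)   # function switched off after two steps
```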

3.2. The Control Flow Processor

The fact that not only I/O switching must be controlled but also iteration-dependent functions must be handled leads to the definition of a new kind of processing element, the control flow processor, which has the following capabilities:



• Dynamic selection of data paths. The active input and output ports can be selected dynamically during program execution.

• Ability to perform time variable processor functions. The function of a processing element can be selected dynamically.

• Stateless control flow mechanism. Control data are treated like operation data and are propagated through the processor array. In order to guarantee an efficient implementation in hardware, the control flow processor contains (on this level of hierarchy) a purely combinatorial control unit (CU) and execution unit (EU). The function of the control unit is fixed for a certain class of algorithms to be implemented. The architecture of a configurable processor array consisting of control flow processing elements is shown in figure 6.

Fig. 6. Architecture of a configurable processor array consisting of control flow processing elements; the processing elements communicate via an interconnection network with register chains.

On the other hand, the flexibility of the control flow processing element can easily be increased if the function of the control unit can be programmed. In this case, operation phases of the configurable processor array can be interleaved with programming phases. Moreover, one operation of the EU may be refined to a sequence of microoperations, which leads to a finite state model of a control unit.

4. Processor Specification

In the previous section, the capabilities of the control flow processor have been defined. Now it remains to develop a control method which enables the mapping of a given piecewise regular algorithm (PRA) onto a processor array and at the same time specifies the control data, the control units (CU) and the execution units (EU) of the control flow processing elements.

According to the stepwise refinement, additional constructs must be added to the given algorithm in order to specify the necessary control. To this end, iteration dependent conditionals are replaced by control data dependent conditionals, preserving the semantics of the program. If this process is applied to all k = 1, ..., V quantifications and to all l = 1, ..., W_k iteration dependent conditionals of the form

(‖ I : I ∈ I_k :: x_k[I] = F_k^1{...}  if I ∈ I_k^1
                          ...
                          → F_k^l{...}  if I ∈ I_k^l      (11)
                          ...)

a given piecewise regular algorithm is transformed into a regular iterative algorithm which can be directly related to a regular processor array. At first we describe the design of completely homogeneous, regular processor arrays. Later, we generalize this approach to the design of more general architectures (Section 4.5).

The proposed procedure consists of the generation of the control space I_c, the homogenization of the individual iteration spaces I_k, and the control generation that specifies the control unit of the control flow processing elements.

4.2. Control Space

Obviously, any execution unit needs additional control inputs which specify the actual function at a given iteration point. The corresponding control data can simply be determined using the definitions of the iteration spaces and condition spaces of the different equations in (11). Consequently, we can define a control machine that at each time step tells the EU which processor function to perform next. This machine can be installed locally inside a PE to implement the local control model with prestored local control. On the other hand, the control signals for all processing elements can be generated in a central unit, which leads to the global control model (e.g., SIMD model).

In case of control flow, however, which is required for the implementation of a PRA on a regular array or a configurable processor array, the generation of control is more difficult, as all control data must be propagated to the processing elements from the borders of the processor array. We are going to add a local control graph to the homogenized dependence graph of a PRA by introducing additional control variables. These variables are initialized at the border of the processor array. These two steps of control graph generation and control data assignment specify the control units (CU) of the control flow processing elements.

The control generation specifies a set of control variables that provide each iteration point with the necessary information to select the operation related to this iteration point. Consequently, we are replacing iteration dependent conditionals by conditionals dependent on the control variables. After the projection of this dependence graph, the actual processor functions are determined by the local control information. A procedure for control generation has to specify the number of control variables, the corresponding data dependencies (directions of control propagation) and the initialization of the control data at the borders of the processor array.

In order to specify the iteration points that must be accessed by control signals, we define the control space I_c:

DEFINITION 4.1. Given a PRA consisting of V quantifications of the form of (11) with the iteration vector I = (t, i₁, i₂, ..., i_{s−1})′ = (t, I_p′)′, where t denotes the coordinate of time (sequence of operations). The control space I_c is a right prism that is defined as follows: the polyhedra of all V iteration spaces are projected onto the subspace defined by the variables i₁, i₂, ..., i_{s−1}. The bottom surface I_p of the control space I_c is the integer convex hull of these projected spaces and can be defined by

I_p = conv { ∪_{k=1}^{V} proj_{i₁,...,i_{s−1}}(I_k) } = {I_p ∈ Z^{s−1} : P̄ I_p ≥ p̄},

where conv(S) denotes the convex hull of a set S and proj_{i₁,...,i_{s−1}}(S) denotes the orthogonal projection of a polyhedron S onto the subspace defined by i₁, ..., i_{s−1}. The height of the prism can be determined by projecting the iteration spaces onto the subspace defined by the variable t:

I_t = conv { ∪_{k=1}^{V} proj_t(I_k) }.

Therefore, the right prism I_c is given by

I_c = {I = (t, I_p′)′ ∈ Z^s : I_p ∈ I_p ∧ t ∈ I_t}.      (12)

The control space has the form of a right prism because in case of a synchronous design, any processing element is active in a time interval of the form t_min ≤ t ≤ t_max. The polyhedron I_p describes the processor space: it identifies all locations of processing elements in the corresponding processor array. All iteration points in I_c must be accessed by control signals. The orthogonal projection of polyhedra can be computed by Fourier-Motzkin elimination (see [39]).
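Fourier-Motzkin elimination itself is short enough to sketch. The Python fragment below eliminates one variable from a system of inequalities a·x ≤ b and then projects the iteration space {t ≥ i′, 0 ≤ i′ < N} of (9) onto i′, recovering the processor space I_p; the encoding of the inequalities and the absence of redundancy removal are simplifying assumptions.

```python
from fractions import Fraction

def fourier_motzkin(rows, k):
    """Eliminate variable k from a system of inequalities a·x ≤ b.
    rows is a list of (a, b); returns the projected system (no
    redundancy removal)."""
    pos = [r for r in rows if r[0][k] > 0]
    neg = [r for r in rows if r[0][k] < 0]
    out = [r for r in rows if r[0][k] == 0]
    for ap, bp in pos:
        for an, bn in neg:
            lp, ln = Fraction(1, ap[k]), Fraction(-1, an[k])
            a = [lp * x + ln * y for x, y in zip(ap, an)]   # k-entry cancels
            out.append((a, lp * bp + ln * bn))
    return out

# project the iteration space {t ≥ i', 0 ≤ i' < N} of (9) onto i'
# by eliminating t (variable 0); rows encode a·(t, i')' ≤ b
N = 3
rows = [([-1, 1], 0),      # i' − t ≤ 0
        ([0, -1], 0),      # −i' ≤ 0
        ([0, 1], N - 1)]   # i' ≤ N − 1
print(fourier_motzkin(rows, 0))   # leaves 0 ≤ i' ≤ N−1, i.e., I_p
```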

Fig. 7. Dependence graph and control space I_c of the PRA in figure 2 (reset plane at t = t_min).

Example 4.1. Given the dependence graph of the PRA in figure 2, figure 7 shows the corresponding control space, which is obtained as follows: from Definition 4.1 we obtain I_p by projecting the iteration spaces of the equations in (9) onto the subspace defined by variable i′: I_p = conv{i′ ∈ Z : 0 ≤ i′ < N} = {i′ ∈ Z : 0 ≤ i′ < N}. Similarly, I_t = {t ∈ Z : t > −N} can be derived from Definition 4.1. Note that in this case, the equations in the initially-section are not considered for the computation of the control space. This is because all data assigned to variables in the initially-section are not projected onto locations of variables in the always-section but will be projected onto the border of the array (host). We consider the host processor(s) at the borders to be capable of treating inhomogeneous data and operations.

Thus, I_c is given by

I_c = {(t, i′)′ ∈ Z² : 0 ≤ i′ < N ∧ t > −N}.

The boundary hyperplane {I ∈ Z^s : t = t_min} of I_c is called the reset plane of I_c because we assume that the computation is initiated at time step t = t_min. All values of variables with iteration points satisfying t < t_min can be related to the initialization of synchronous registers. In case not all registers can be reset to a common arbitrary value, the control space may have to be extended towards smaller time steps. This problem has already been faced in the initial algorithm formulation in (2): by assuming that registers inside the processor array cannot be initialized with the filter coefficients, data has to be initialized at the borders and then propagated to the proper cells. Therefore, the initialization of data (initially-section) is an important part of an algorithm formulation, depending on the target architecture.

4.3. Homogenization of Iteration Spaces

In a regular array, all processing elements are identical. This can be derived from the fact that only one common iteration space is associated with all indexed equations of a regular iterative algorithm.

PROPOSITION 4.1. The iteration spaces I_k of a given PRA of the form (11), consisting of an always-section only, can be equivalently replaced by the control space I_c defined in Definition 4.1.

Proof. The relations I_k ⊆ I_c hold for all 1 ≤ k ≤ V. Therefore, we can equivalently represent the resulting program as a concatenation of two programs: the first one is identical to the given PRA and the second one contains the equations which are in the resulting PRA but not in the given PRA. As the first program does not use variables defined by the second one, the functionality of a PRA is unaffected by the homogenization of iteration spaces.

In the following we are going to deal with PRAs of the form

(‖ I : I ∈ I_c :: x_k[I] = F_k^1{...}        if I ∈ I_k^1
                          ...
                          → F_k^{W_k}{...}    if I ∈ I_k^{W_k} ‖ ...)      (13)

From the above description it is obvious that the initial representation of the PRA greatly influences the resulting algorithm. We are going to deal with these issues in the Conclusion.

4.4. Control Generation for Regular Processor Arrays

Now, we are going to describe the conversion of the iteration dependent conditionals given in (13) to conditionals depending on additional control variables. If we access all iteration points in the control space I_c with control data that uniquely determine whether an iteration point I satisfies I ∈ I_k^l for all k = 1, ..., V and l = 1, ..., W_k, then the processor function corresponding to any iteration point is uniquely identified. Obviously, it is possible to deal with each conditional separately. On the other hand, the control overhead can be greatly reduced if common control variables are used for the conversion of several conditionals. In order to simplify the presentation, we describe the procedure for each conditional separately. The general case is discussed in the Conclusion.

In contrast to the approach taken in [13], [35], the different integral lattices of the iteration spaces are taken into account in the choice of feasible control propagation directions. The procedure of control variable generation and control data assignment using this method is summarized in the following proposition:

PROPOSITION 4.2. A PRA of the form in (13) with associated time coordinate t is completely controlled if control data are propagated to all iteration points I ∈ I_c in planes parallel to all boundary hyperplanes {I ∈ Z^s : ā_k^l(i) I = b̄_k^l(i)}, k = 1, ..., V, l = 1, ..., W_k, i = 1, ..., n_k^l, of the condition spaces I_k^l related to variable x_k and conditional l. Here ā_k^l(i) denotes the ith row vector of the matrix Ā_k^l and b̄_k^l(i) denotes the ith element of the vector b̄_k^l, where the condition spaces are given in the form I_k^l = {I ∈ Z^s : I = A_k^l κ + b_k^l; κ ∈ Z^{m_k^l} ∧ Ā_k^l I ≥ b̄_k^l}.

Sketch of Proof. A constructive proof can be given by introducing a simple control method. To this end, we are going to consider the conversion of one conditional only; the general case follows from repeating this schema for all conditionals. The condition space is given as I = {I ∈ Z^s : I = Aκ + b; κ ∈ Z^m ∧ ĀI ≥ b̄}. To any of the n boundary hyperplanes {I ∈ Z^s : ā(i) I = b̄(i)}, i = 1, ..., n, of the condition space there is associated a control variable c_i which is propagated to all iteration points I ∈ I_c in planes parallel to the hyperplane. Then to all iteration points I ∈ I_c that satisfy ā(i) I ≥ b̄(i), there is assigned a control signal with the value '1' iff I ∈ L(A, b) and a signal '0' otherwise. Therefore, if an iteration point I ∈ I_c is assigned the signal '0', the equation is not defined at that point. An equation is defined at an iteration point I iff the vertex I is assigned the control signal '1' n times.

The following examples and propositions serve to clarify the procedure in the above proof. At first, we demonstrate the determination of control propagation vectors and the assignment of control data. The following proposition gives a necessary condition for a control propagation vector to be feasible.

PROPOSITION 4.3. Given a PRA with the iteration vector I = (t, i₁, ..., i_{s−1})′. A control propagation vector d_c ∈ Z^s is feasible to control the boundary hyperplane {I ∈ Z^s : ā_k^l(i) I = b̄_k^l(i)} of the condition space description I_k^l = {I ∈ Z^s : I = A_k^l κ + b_k^l; κ ∈ Z^{m_k^l} ∧ Ā_k^l I ≥ b̄_k^l} if the following conditions are satisfied:

a. ā_k^l(i) d_c = 0.

b. The first element of the vector d_c (time component) satisfies d_c(1) ≥ 0.

c. d_c ∈ L(A_k^l).

Proof. It will be shown that a vector d_c that satisfies the above conditions enables the identification of all iteration points in I_c that satisfy ā_k^l(i) I > b̄_k^l(i), ā_k^l(i) I = b̄_k^l(i) or ā_k^l(i) I < b̄_k^l(i).

a. ā_k^l(i) is a normal vector to the hyperplane defined by ā_k^l(i) I = b̄_k^l(i). In order to identify all points in that plane, d_c must be perpendicular to ā_k^l(i).

b. This condition guarantees the causality of the final realization.

c. Given an iteration point I₁ on the hyperplane with ā_k^l(i) I₁ = b̄_k^l(i). As I₂ = I₁ + d_c must also be in L(A_k^l, b_k^l), we can conclude:

I₁ = A_k^l κ₁ + b_k^l ∧ I₂ = A_k^l κ₂ + b_k^l ⟹ I₂ − I₁ = d_c = A_k^l (κ₂ − κ₁) ⟹ d_c = A_k^l κ ⟹ d_c ∈ L(A_k^l)
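The three conditions of Proposition 4.3 are easy to check mechanically. The Python sketch below does so for a two-dimensional iteration space with a square lattice matrix (the integer solve uses Cramer's rule with exact rationals); the function name and the restriction to the 2-D case are illustrative assumptions.

```python
from fractions import Fraction

def feasible(dc, a_row, A_lat):
    """Conditions (a)-(c) of Proposition 4.3 for a 2-D iteration space
    with a square lattice matrix A_lat (illustrative helper)."""
    if sum(x * y for x, y in zip(a_row, dc)) != 0:   # (a) stay in the plane
        return False
    if dc[0] < 0:                                    # (b) causality
        return False
    (p, q), (r, s) = A_lat                           # (c) dc ∈ L(A_lat):
    det = p * s - q * r                              #     Cramer's rule,
    k1 = Fraction(s * dc[0] - q * dc[1], det)        #     κ must be integral
    k2 = Fraction(-r * dc[0] + p * dc[1], det)
    return k1.denominator == 1 and k2.denominator == 1

# Example 4.2: hyperplane normal (−1, 1), dense lattice (A = identity)
print(feasible((1, 1), (-1, 1), ((1, 0), (0, 1))))    # True
print(feasible((1, -1), (-1, 1), ((1, 0), (0, 1))))   # False: leaves the plane
```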

Example 4.2. After the homogenization of the iteration spaces, the always-section of the FIR-filter algorithm in Example 2.3 has the form

always
  (‖ t, i′ : t ≥ −N + 1 ∧ 0 ≤ i′ < N ::
    y[t, i′] = y[t − 1, i′ − 1] + a[t, i′] u[t, i′] ‖
    a[t, i′] = a[t − 2, i′ − 1]  if t ≤ i′      (15)
             → a[t − 1, i′]      if t > i′ ‖
    u[t, i′] = u[t − 2, i′ − 1])

Fig. 8. Dependence graph of the homogenized FIR-filter algorithm in Example 4.2.

Figure 8 shows the dependence graph of the homogenized algorithm. Obviously, control has to be generated only for the set of equations that compute variable a in (15). We only have to identify one control propagation vector to control

(−1 1) (t, i′)′ ≥ 0.

The vector d_c = (1, 1)′ is feasible.

After a set of feasible control propagation vectors has been chosen, we want to determine the iteration space where a control variable c associated to a propagation vector d_c must be initialized.

DEFINITION 4.2. A hyperplane I_B = {I ∈ Z^s : āI = b̄} is called an initializing hyperplane for a given control variable c if the following conditions are satisfied:

a. āI ≥ b̄ is an inequality of (12).

b. ā d_c > 0, where d_c is the control propagation vector associated to c.

As control signals are propagated in the control space I_c, control propagation vectors d_c enter I_c through the plane I_B = {I ∈ Z^s : āI = b̄} if ā d_c > 0.

Example 4.3. Given the description of the control space I_c of the PRA of the FIR-filter in Example 4.1, the hyperplanes

I_B1 = {(t, i′)′ ∈ Z² : i′ = −1}
I_B2 = {(t, i′)′ ∈ Z² : t = −N}

are initializing hyperplanes for the control variable c that propagates control signals in the direction of d_c (see Example 4.2).

Now we are able to determine the iteration spaces where control variables are initialized:

PROPOSITION 4.4. Given an initializing hyperplane I_B = {I ∈ Z^s : āI = b̄} for a control variable c propagated in direction d_c with ā d_c > 0. The corresponding initializing space, where c must be initialized, is given by the polyhedron

I_init(I_B, d_c) = {I ∈ Z^s : āI = b̄ ∧ Ā⁻ I ≥ b̄⁻}.

The system Ā⁻ I ≥ b̄⁻ contains all inequalities that describe I_c according to (12) except the inequality āI ≥ b̄.

The next proposition determines the additional equations in the initially- and always-section and the modification of the iteration dependent conditionals in the always-section needed to completely control the PRA:

PROPOSITION 4.5. The control algorithm that controls the boundary hyperplane i of the condition space I_k^l with ā_k^l(i) I ≥ b̄_k^l(i) is given as follows: the corresponding control variable c is initialized using quantifications of the form

(‖ I : I ∈ I_init(I_B, d_c) :: c[I] = 1  if ā_k^l(i) I ≥ b̄_k^l(i) ∧ I ∈ L(A_k^l, b_k^l)
                               → 0  otherwise)

for any of its initializing spaces I_init(I_B, d_c). In I_c, c is propagated in the direction of the propagation vector d_c using a quantification of the form

(‖ I : I ∈ I_c :: c[I] = c[I − d_c])

If we denote the control variables corresponding to the hyperplanes i = 1, ..., n_k^l as c_i, the iteration dependent conditional "x_k[I] = F_k^l{...} if I ∈ I_k^l" can be replaced equivalently by

x_k[I] = F_k^l{...}  if c₁[I] = 1 ∧ ... ∧ c_{n_k^l}[I] = 1

The following example serves to clarify the above procedure.

Example 4.4. We develop the control algorithm for variable c of Example 4.3. The initializing spaces are

I_init(I_B1, d_c) = {(t, i′)′ ∈ Z² : i′ = −1 ∧ t ≥ −N}
I_init(I_B2, d_c) = {(t, i′)′ ∈ Z² : −1 ≤ i′ ≤ N − 1 ∧ t = −N}

Therefore we have to add to the initially-section the quantifications

(‖ I : I ∈ I_init(I_B1, d_c) :: c[t, i′] = 1  if t > i′
                                → 0  if t ≤ i′) ‖
(‖ I : I ∈ I_init(I_B2, d_c) :: c[t, i′] = 1  if t > i′
                                → 0  if t ≤ i′)



and to the always-section the quantification

(‖ I : I ∈ I_c :: c[I] = c[I − d_c])

where I = (t, i′)′. After adding the above control algorithm to the given PRA, the final step of the control procedure consists in the replacement of the iteration dependent conditionals by control data dependent conditionals:

(‖ I : I ∈ I_c :: a[t, i′] = a[t − 2, i′ − 1]  if c[t, i′] = 0
                          → a[t − 1, i′]       if c[t, i′] = 1)

The dependence graph of the FIR-filter algorithm with homogenized iteration spaces is shown on the left hand side of figure 9, and the control algorithm on the right hand side. The initializing spaces and the propagation of the control variable c along direction d_c, including the initialization of its control data, are also shown. A small simulation of this control scheme follows figure 9 below.


Fig. 9. Completely controlled PRA for FIR-filtering.
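The control algorithm of Example 4.4 can be simulated directly. The following Python sketch initializes c on the two initializing spaces, propagates it along d_c = (1, 1)′, and checks that the arriving control bits reproduce the original conditional t ≤ i′ / t > i′ for variable a; N and the time horizon T_MAX are illustrative assumptions.

```python
# Simulation of the control algorithm of Example 4.4.
N, T_MAX = 3, 8
c = {}
# initialization on the two initializing spaces
for t in range(-N, T_MAX):                 # i' = −1, t ≥ −N
    c[t, -1] = 1 if t > -1 else 0
for ip in range(-1, N):                    # t = −N, −1 ≤ i' ≤ N−1
    c[-N, ip] = 1 if -N > ip else 0
# propagation in direction dc = (1, 1)': c[t, i'] = c[t-1, i'-1]
for t in range(-N + 1, T_MAX):
    for ip in range(0, N):
        c[t, ip] = c[t - 1, ip - 1]
# the arriving control bit must reproduce the original conditional
for t in range(-N + 1, T_MAX):
    for ip in range(0, N):
        assert c[t, ip] == (1 if t > ip else 0)
print("control flow reproduces the conditionals t <= i' and t > i'")
```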

If the condition spaces of all quantifications are controlled separately, an upper bound on the number of required control variables is given by

Σ_{k=1}^{V} Σ_{l=1}^{W_k} n_k^l.

This number can be greatly reduced if control variables with the same propagation direction are combined, see the Conclusion. In contrast to [13], the control overhead is independent of the size of the problem to be implemented. Therefore, the condition space descriptions should have as few different inequalities as possible. As a result, prior to the application of the control procedure, elimination of redundant inequalities and transformations of inequalities should be performed. Moreover, the form of the given PRA greatly influences the control overhead. Control overhead can be further reduced by mapping a PRA onto a more general class of processor arrays.

4.5. Design of Configurable Processor Arrays

The proposed control method can also be used to only partially convert iteration dependent conditionals into control data dependent conditionals. In particular, if the conversion is carried out in the direction of the time coordinate only, the processing elements are not identical and the processor array consists of regular communicating subarrays. As a result, only those condition space boundary hyperplanes have to be controlled that have normal vectors with a nonzero time component. Conversely, it is possible to specify the homogeneous subarrays of the target architecture and to design a control algorithm which satisfies this requirement. Control overhead and the complexity of the processing elements are reduced in comparison to the completely controlled regular implementation. In this case, the number of different codes for the control variables can also be reduced: as there exists a set of different processing elements, an ambiguous control data assignment may be chosen, e.g., one code for different functions in different kinds of processing elements. The corresponding optimization problem can be formulated as a graph coloring problem.

In this paper, we described a procedure for the determination of control signals which are propagated unmodified through the processor array. The more complex method modify & propagate offers further opportunities. From the above considerations, the following result can be deduced:

PROPOSITION 4.6. For any PRA with associated time coordinate t, an equivalent regular iterative algorithm can be derived. In particular, any PRA can be controlled by means of control flow using the scheme of modify & propagate.

5. Conclusion

We have given a methodology which enables the control of processing elements in a configurable processor array. The control procedure that allows the mapping of a PRA onto processor arrays involves the following steps:


1. Determination of the control space.
2. Homogenization of the iteration spaces.
3. Control generation:

a. Choose a minimal number of control propagation vectors that enable the unique identification of all iteration spaces.

b. Generate a local control algorithm.

4. Substitution of iteration dependent conditionals in the homogenized algorithm by control data dependent conditionals.

The presented control flow method makes use of the propagation of control signals and avoids global access to the processing elements. Different extensions of the derived procedure are possible:

1. The control generation offers considerable freedom in the choice of propagation directions and the assignment of control signals. The following rules help to minimize the number of additional control inputs of the processing elements:

a. The control propagation vectors should be chosen equal to dependence vectors d_kl of the PRA. In this case, no new directions of data/control communication are introduced.

b. Control propagation vectors which can control several hyperplanes should be chosen.

c. Control propagation vectors should be chosen to be coprime in order to provide short interconnections.

d. Control variables with the same propagation vectors can be combined.

2. Hierarchical Aspects. Hierarchical transformations for mapping algorithms onto processor arrays of reduced size or onto arrays of lower dimension, like multiprojection [6], clustering [22], partitioning [21] or tessellation, transform iteration spaces of equations into general integral lattices. As the proposed algorithmic model and the control procedure consider iteration spaces that are integral lattices, the methodology permits a consistent hierarchical design of processor arrays. Using the methodology, the hierarchical mapping of algorithms can be carried out in different ways. In the case of multiprojection, a problem with an iteration space of dimension n is mapped onto an array of dimension n − k, 1 ≤ k ≤ n:

• Flat Mapping. The mapping of an algorithm onto an array of given dimension is carried out in one step. The homogenization and control generation are performed once for the final algorithm.

• Incremental Mapping. The dimension of the realization is decreased one by one. In this case, homogenization and control generation may also be carried out incrementally. As an example, suppose that the iteration space of the given PRA has dimension n; after adding the control algorithm, the initialization of the control variables is done on initialization hyperplanes of dimension n − 1 (see Proposition 4.4). Now it is possible to consider these initializations as separate subalgorithms and to apply the control generation procedure to them. As a result, the subalgorithms will have no iteration dependent conditionals and a new initialization space of dimension n − 2 is added. This form of incremental mapping can be continued. If applied to multiprojection, it may be possible to concentrate the control generation on one host processor only and to distribute the control data. The final processor array consists of distinct subarrays of different dimensions. The corresponding control model was introduced as the modify & propagate scheme.

Mapping a problem onto a target architecture of fixed size also involves a hierarchical control mechanism. If a problem is split into tiles of fixed size to match the size of the realization, a tile represents a subprogram that is executed sequentially by a main program. As the splitting may be expressed in the form of the presented class of piecewise regular algorithms, one can control the main program and the subprograms separately.

3. The initial representation of the PRA greatly influences the control effort, the utilization of processing elements and the possibility of taking resource constraints into account. There is a high degree of freedom in choosing the iteration spaces for each variable and the corresponding condition spaces. The two extremes can be described as follows: (a) the iteration space of a variable is the whole s-dimensional Euclidean space and the condition spaces are responsible for the selection of equations, or (b) the condition spaces of a variable are defined by as few hyperplanes as possible. In case (b), the control effort will be small, whereas (a) leads to a small number of additional equations imposed by the homogenization of the iteration spaces.

4. Alternatively, the propagation of control variables can be described by means of a localization procedure, see [24].


5. If there is a reformulation such that the condition spaces are defined by lattices only (no boundary hyperplanes), there is a very simple solution in the case of the prestored local control model. A simple counter is sufficient to generate the necessary information; a sketch is given below. The initialization of the counter may depend on the index of the processing element in the processor space. As the control procedure has been described for each condition space and each hyperplane/lattice individually, it is possible to mix the control flow and prestored models. For example, it may be efficient to realize iteration dependent conditionals which stem from partitioning transformations in the prestored model, and those which result from localization or hierarchical algorithm formulation using the control flow model.
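As an illustration of this counter-based prestored control, the sketch below uses a hypothetical lattice of period 3; the processor index and the initialization rule are invented, and the counter fires exactly on the lattice points.

```python
# Sketch of the prestored local control model for lattice-defined condition
# spaces: one counter per processing element, initialized from the (invented)
# processor index; it fires whenever the iteration hits the lattice.

period = 3                          # hypothetical lattice: count == 0 (mod period)

def make_counter(pe_index):
    count = pe_index % period       # initialization depends on the PE index
    def tick():
        nonlocal count
        fire = (count == 0)         # condition satisfied on this lattice point
        count = (count + 1) % period
        return fire
    return tick

tick = make_counter(pe_index=2)
print([tick() for _ in range(7)])   # [False, True, False, False, True, False, False]
```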

References

1. H. Kung, "Let's design algorithms for VLSI systems," in Proc. Caltech Conf. on VLSI, 1979, pp. 65-90.

2. S.Y. Kung, VLSI Processor Arrays. Englewood Cliffs, NJ: Prentice-Hall, 1987.

3. H. Kung and C. Leiserson, "Systolic arrays for VLSI," in SIAM Sparse Matrix Proceedings, Philadelphia, 1978, pp. 245-282.

4. U. Schwiegelshohn and L. Thiele, "One- and two-dimensional arrays for least squares problems," in IEEE Conf. on Acoust., Speech, Signal Processing, Dallas, 1987, pp. 791-794.

5. U. Schwiegelshohn and L. Thiele, "A systolic array for cyclic-by-rows Jacobi algorithms," J. on Parallel and Distributed Computing, vol. 4, 1987, pp. 334-340.

6. U. Schwiegelshohn and L. Thiele, "Linear processor arrays for matrix computations," J. on Parallel and Distributed Computing, vol. 7, 1989, pp. 28-39.

7. L. Thiele, "Computational arrays for Jacobi algorithms," in SVD and Signal Processing, North Holland Pub., 1988, pp. 369-383.

8. L. Guibas, H. Kung, and C. Thompson, "Direct VLSI implementation of combinatorial algorithms," in Proc. Conf. on VLSI: Architecture, Design and Fabrication, 1979, pp. 509-525.

9. M. Huber, "A systolic processor chip dedicated to the shortest path problem," in Proceedings of COMPEURO 87, Hamburg, 1987, pp. 500-501.

10. U. Schwiegelshohn and L. Thiele, "A systolic array for the assignment problem," IEEE Trans. Computers, 1988, pp. 1422-1425.

11. S.K. Rao and T. Kailath, "Systematic design of special purpose processor arrays," Proceedings of the IEEE, 1987.

12. P. Quinton, "Automatic synthesis of systolic arrays from uniform recurrent equations," in The IEEE/ACM 11th Annual Int'l Symp. on Computer Architecture, Ann Arbor, MI, 1984, pp. 208-214.

13. S.K. Rao, Regular iterative algorithms and their implementations on processor arrays. PhD thesis, Stanford University, 1985.

14. W.L. Miranker and A. Winkler, "Space-time representation of computational structures," Computing, 1984, pp. 93-114.

15. D.I. Moldovan, "On the design of algorithms for VLSI systolic arrays," Proceedings of the IEEE, 1983, pp. 113-120.

16. J. Annevelink and P. Dewilde, "HIFI: A functional design system for VLSI processing arrays," in Proc. Int'l Conf. on Systolic Arrays, San Diego, 1988, pp. 413-452.

17. L. Thiele, "On the hierarchical design of VLSI processor arrays," in IEEE Symp. on Circuits and Systems, Helsinki, 1988, pp. 2517-2520.

18. A. Benaini and Y. Robert, "Spacetime-minimal systolic arrays for Gaussian elimination and the Algebraic Path Problem," Report from the Ecole Normale Superieure de Lyon, vol. 9, 1990.

19. J. Hwang and S. Kung, "Parallel Algorithms/Architectures for Neural Networks," Journal of VLSI Signal Processing, vol. 1, 1989, pp. 221-251.

20. M. Chen and K. Yao, "On realization and implementation of Kalman filtering systolic arrays," in Proceedings of the Johns Hopkins Workshop, 1987.

21. D.I. Moldovan and J.A.B. Fortes, "Partitioning and mapping of algorithms into fixed size systolic arrays," IEEE Trans. Computers, vol. C-35, 1986, pp. 1-12.

22. H. Nelis, E.F. Deprettere, and J. Bu, "Automatic design and partitioning of algorithms for VLSI systolic/wavefront arrays," in Proc. SPIE Conference, San Diego, 1987.

23. Y. Wong and J.M. Delosme, "Broadcast removal in systolic algorithms," in Proc. of Int'l Conf. on Systolic Arrays, San Diego, 1988, pp. 403-412.

24. V. Roychowdhury, L. Thiele, S.K. Rao, and T. Kailath, "On the localization of algorithms for VLSI processor arrays," in VLSI Signal Processing III, New York: IEEE Press, 1989, pp. 459-470.

25. J. Bu, L. Thiele, and E. Deprettere, "Systolic array implementation of nested loop programs," in Application Specific Array Processors, Princeton, NJ: IEEE Computer Society Press, 1990, pp. 31-43.

26. L. Thiele, "On the design of piecewise regular processor arrays," in Proc. IEEE Symp. on Circuits and Systems, Portland, 1989, pp. 2239-2242.

27. D. Smitley and I. Lee, "Synthesizing Minimum Total Expansion Topologies for Reconfigurable Interconnection Networks," Journal of Parallel and Distributed Computing, vol. 7, 1989, pp. 178-199.

28. I. Scherson and S. Ilgen, "A Reconfigurable Fully Parallel Associative Processor," Journal of Parallel and Distributed Computing, vol. 6, 1989, pp. 69-89.

29. C.H. Chu, "A Model for an Intelligent Operating System for Executing Image Understanding Tasks on a Reconfigurable Architecture," Journal of Parallel and Distributed Computing, vol. 6, 1989, pp. 598-622.

30. M. Chean and J. Fortes, "A Taxonomy of Reconfiguration Techniques for Fault-Tolerant Processor Arrays," Computer, vol. 23, 1990, pp. 55-69.

31. P. Frison, D. Lavenier, H. Leverge, and P. Quinton, "MICMACS: A VLSI Programmable Systolic Architecture," in Proc. Int. Conf. Systolic Arrays, 1989, pp. 146-155.

32. O. Menzilcioglu, H.T. Kung, and S.W. Song, "A Highly Configurable Architecture for Systolic Arrays of Powerful Processors," in Proc. Int. Conf. Systolic Arrays, 1989, pp. 165-165.

33. L. Snyder, "Introduction to the Configurable, Highly Parallel Computer," Computer, 1982, pp. 47-56.

34. M. Kunde, H.W. Lang, M. Schimmler, and H. Schroeder, "The Instruction Systolic Array and its relation to other models of Parallel Computers," in Proceedings Parallel Computing, North-Holland, Amsterdam, 1985.


35. S. Rajopadhye and R. Fujimoto, "Systolic array synthesis by static analysis of program dependencies," in Proc. of Parallel Architectures and Languages Europe (J. Bakker, A. Nijman, and P. Treleaven, eds.), Springer-Verlag, 1987, pp. 295-310.

36. K. Chandy and J. Misra, Parallel Program Design, Reading, MA: Addison-Wesley, 1988.

37. S.K. Rao and T. Kailath, "Regular iterative algorithms and their implementations on processor arrays," Proceedings of the IEEE, vol. 76, 1988, pp. 259-282.

38. M. Huber, J. Teich, and L. Thiele, "Design of configurable processor arrays (invited paper)," in Proc. IEEE Int. Symp. Circuits and Systems, New Orleans, 1990, pp. 970-973.

39. G. Nemhauser and L. Wolsey, Integer and Combinatorial Optimization, New York: John Wiley, 1988.

Jürgen Teich received the Diplom-Ingenieur degree in electrical engineering from the University of Kaiserslautern, Germany, in 1989. Since 1989, he has been working in the research group of Professor L. Thiele at the University of Saarland. His research interests include the theoretical aspects of VLSI design and systematic concepts for massively parallel architectures.

Lothar Thiele received the Diplom-Ingenieur and Dr.-Ing. degrees in electrical engineering from the Technical University of Munich, West Germany, in 1981 and 1985, respectively. In 1986, he received the award of the Technical University for his Ph.D. thesis "Circuit synthesis using methods of linear algebra." Since 1981, he has been a research associate with Professor R. Saal at the Institute of Network Theory and Circuit Design of the Technical University of Munich. After finishing his Habilitation thesis, he joined the group of Professor T. Kailath at the Information Systems Laboratory, Stanford University, in 1987. His stay was devoted to the exploitation of relations between parallel algorithms and parallel architectures in signal and image processing. In 1988, he took up a chair of microelectronics at the Faculty of Engineering, University of Saarland, Saarbrücken, West Germany. At his institute, the research activities include theoretical aspects of VLSI, systematic methods and software tools for the design of array processors, and the development of parallel algorithms for signal and image processing, linear algebra and combinatorial optimization.

Professor Thiele has authored and co-authored more than forty papers. He received the "1987 Outstanding Young Author Award" of the IEEE Circuits and Systems Society. In 1988, he was the recipient of the "1988 Browder J. Thompson Memorial Prize Award" of the IEEE.


A Sorter-Based Architecture for a Parallel Implementation of Communication Intensive Algorithms

JOSEF G. KRAMMER Technical University Munich, Institute for Network Theory and Circuit Design, Arcisstr. 21, 8000 Munich 2, Germany

Received July 28, 1990; Revised December 21, 1990.

Abstract. This paper deals with the parallel execution of algorithms with global and/or irregular data dependencies on a regularly and locally connected processor array. The associated communication problems are solved by the use of a two-dimensional sorting algorithm. The proposed architecture, which is based on a two-dimensional sorting network, offers a high degree of flexibility and allows an efficient mapping of many irregularly structured algorithms. In this architecture a one-dimensional processor array performs all required control and arithmetic operations, whereas the sorter solves complex data transfer problems. The storage capability of the sorting network is also used as memory for data elements. The algorithms for sparse matrix computations, fast Fourier transformation and for the convex hull problem, which are mapped onto this architecture, as well as the simulation of a shared-memory computer, show that the utilization of the most complex components, the processors, is O(1).

1. Introduction

In many areas, such as digital signal processing, matrix computation, graph theory and combinatorial optimization, there are algorithms with global and/or irregular data dependencies which cannot be transformed into regular iterative form. Mapping these algorithms onto a processor array with a regular nearest neighbor interconnection topology, such as mesh connected arrays, results in extensive communication and data transfer problems. It is well known that these problems are the crucial point in the parallel execution of a large class of algorithms. Generally, these transfer problems are solved by the use of data routing algorithms, which in general require O(√N) steps for transfer operations on a √N × √N mesh. For the restricted class of bit-oriented permutations (e.g., bit-reversal, perfect shuffle, etc.), efficient routing algorithms can be found in [11], [14], [22]. For more general transfer operations, more flexible routing algorithms are necessary, e.g., [10]. Unfortunately, these methods require a lot of preprocessing for determining the data paths of the elements. Other very flexible data routing procedures, which avoid complicated preprocessing, are sorting algorithms. Using a sorting algorithm, data elements are transferred by sorting keys attached to them [2], [5], [8], [20]. Due to recent interest among several scientists in two-dimensional sorting algorithms, a variety of efficient and VLSI-suited sorting networks are available [12], [16], [17], [20], [21]. These networks turn out to be very efficient in transferring data on regular arrays. Another very important fact in favor of the use of a sorting algorithm as routing procedure is that a sorter is highly suited for fault-tolerance [9]. Since the sorter interconnects the processors, it is likely that a sorter plays a key role in many fault-tolerant processor arrays.

In many applications the time spent in performing arithmetic operations is negligible compared to the time spent moving data from one processor to another, e.g., parallel execution of the FFT [14], [22], sparse matrix computation [5], [10] and graph-theoretic algorithms [1]. Therefore, an architecture with a reduced number of processors involved in arithmetic operations is proposed. In this architecture complicated data transfer problems are solved by the use of a two-dimensional sorting algorithm. The architecture is composed of three components: a two-dimensional sorter, a one-dimensional processor array, and local memories belonging to individual processors. Due to the data routing capability of the two-dimensional sorter, a high data transfer bandwidth is available. The one-dimensional arrangement of processors has a computational capability which is matched to the data transfer bandwidth of the sorting network, but requires only a relatively small amount of chip area. Data elements


which are accessed by one processor only are stored in the local memories.

The paper is organized as follows: the next section explains how the sorting algorithm is used for data routing operations. In Section 3 the sorter-based architecture is presented and the execution of elementary operations is outlined. The implementation of algorithms for sparse matrix computation is shown in Section 4, for the fast Fourier transformation in Section 5, and for the convex hull in Section 6. Section 7 illustrates how the proposed architecture can be used as a shared-memory computer.

2. Sorting: A Useful Tool for Data Routing

The indexing scheme, which is a mapping of the numbers 1 to N to the memory locations in the two-dimensional array, determines the ordering of a sorted sequence. Figure 1(a) shows the row-major indexing scheme on a mesh with wrap-around connections. This topology and indexing scheme is used throughout this work. In figure 1(b) the mesh with N = √N × √N cells is folded around the middle axis and, thereby, the wrap-arounds are localized.


Fig. 1. Indexing scheme on a mesh with wrap-arounds (a) and on a folded mesh (b).

A sorting algorithm is able to transform all N! possible initial orderings of data items into nondecreasing order [7]. Therefore, a sorting network can perform N! different permutations of the elements in the two-dimensional array. Thus, there is no restriction to a special class of permutations.

For data routing operations, keys are attached to the data items. The keys are sorted and, thereby, the items are transferred with them. Figure 2 shows an example of a transfer operation. The keys are denoted by numbers and the data values by letters.


Fig. 2. Data transfer operation: (a) data element, (b) unsorted, and (c) sorted sequence.

In a sorted sequence, the elements e_1, e_2, ..., e_N are arranged in an order e_{p1}, e_{p2}, ..., e_{pN} in which the keys of the data elements satisfy the inequality k_{p1} ≤ k_{p2} ≤ ... ≤ k_{pN}. Therefore, data elements with consecutive keys lie next to each other, assuming that the indexing scheme is mapped locally. Due to this property the sorting procedure can achieve two tasks which turn out to be useful for data transfer operations.

T1: Bringing data elements into neighborhood: The elements e_i and e_j are brought into neighborhood if the keys of all other elements are either greater or smaller than those of the elements e_i and e_j.

It is possible to bring pairs of data elements into neighborhood or groups of elements into connected regions. Note that the place to which an element is routed is undefined, because this depends also on the keys of other elements.

T2: Transferring data elements to specified places: The element e_j with the key k_j is routed to the memory location i if the key is unique, the number of data elements with keys smaller than k_j is i − 1, and the number with keys greater than k_j is N − i.
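As a sequential illustration of T1 and T2, the following sketch routes data items purely by sorting attached keys; the keys and data values are invented, and Python's sort stands in for the two-dimensional sorting network.

```python
# Minimal sketch of tasks T1/T2: data items are routed by sorting
# (key, data) pairs; the items themselves are illustrative.

items = [(27, 'q'), (16, 's'), (5, 'w'), (14, 'h'), (10, 'a'), (7, 'd')]

# T2: with unique keys, sorting sends each item to a position determined
# solely by how many keys are smaller than its own.
routed = sorted(items, key=lambda e: e[0])
for position, (key, data) in enumerate(routed):
    # an item with i-1 smaller keys ends up at memory location i
    assert position == sum(1 for k, _ in items if k < key)

# T1: giving two items keys that are adjacent in the total order brings
# them into neighborhood, wherever that neighborhood ends up.
print(routed)
```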

Example. The following sparse matrix-vector multiplication illustrates the tasks T1 and T2. For the matrix-vector multiplication

x^(k+1) = A x^(k)    (1)

(A is a sparse matrix with e nonzero elements) the nonzero elements of the jth column of the matrix are multiplied by the jth component of the vector x^(k), and the partial products of row i are added to yield the value of the ith entry of the vector x^(k+1). Hence, column- and row-wise access to the elements of A is necessary.

In order to reduce the storage requirements, only nonzero elements are stored in the array, as shown in figure 3.



Fig. 3. Compact mapping of the nonzero elements of the matrix A (one row of A in a connected region).

In this order the elements of individual rows lie in connected regions, but not the elements of the columns. Unfortunately, it is not possible to bring both the elements of the columns and the elements of the rows into connected regions at the same time. As a consequence, complicated data transfer problems arise. These transfer problems depend on the structure of the coefficient matrix and, therefore, may vary from problem to problem.

In order to solve the communication problems with a sorting algorithm, we introduce a second matrix X, whose structure is equal to that of A. Here again, only those elements of X which correspond to nonzero elements of A are stored in the array.

A = [ a11  0    0    a14  0    0
      a21  a22  0    a24  a25  a26
      0    0    a33  0    0    a36
      a41  0    0    a44  0    0
      0    0    a53  0    a55  0
      0    a62  0    a64  0    a66 ]

X = [ x11  0    0    x14  0    0
      x21  x22  0    x24  x25  x26
      0    0    x33  0    0    x36
      x41  0    0    x44  0    0
      0    0    x53  0    x55  0
      0    x62  0    x64  0    x66 ]

The diagonal elements of X contain the values of the vector x^(k) (x_ii = x_i^(k)). The other elements x_ij, i ≠ j, of X will be filled by broadcasting the diagonal elements over their corresponding columns, as described below. The elements of X can be transferred by the sorting algorithm. Therefore, an arbitrary ordering of the elements x_ij is possible and the columns or the rows of X can be brought into connected regions at different times.

In figure 4(a) the nonzero elements of X are sorted with keys whose most-significant bits (MSBs) represent the column index. The elements of each column lie in one connected region (T1), because the keys of all elements of column j are greater than those of the columns 1, 2, ..., j − 1 and smaller than those of the columns j + 1, j + 2, .... Thus, the values of the diagonal elements, which are equal to the elements of the vector x^(k), can be broadcast over the regions, that is, over the elements of the columns of the matrix X, i.e., x_ij = x_j^(k).


Fig. 4. Data transfers for a matrix-vector multiplication.

In figure 4(b) the nonzero elements of X are sorted with keys whose most-significant bits are equal to the row index and whose least-significant bits (LSBs) are equal to the column index. Using this ordering it is guaranteed that the elements a_ij and x_ij are in the same processor cell, because the coefficient matrix A and the matrix X have an equal structure and they are in the same lexicographic order (T2). Hence, the multiplication x_ij = a_ij x_ij can be done independently in all cells at the same time. In the same ordering the elements belonging to one row of X can be added, because they lie in one connected region (T1).

An analysis of the algorithm in the example above shows that O(√e) time steps are required for broadcasting and summation. The same time is also needed for the transfer operations within the regions. This time complexity, which is a lower bound for this type of problem on a locally connected architecture, is caused by a matrix which has a row or a column with O(e) nonzero entries. In this case the time required to broadcast an element or to perform the summation within a region with O(e) elements is O(√e). The multiplication, which requires no transfer operations, can be done in one time step. The time complexity of the sorting operation depends on the sorting algorithm used. If a time optimal sorter is used, this time is also O(√e) and, therefore, of the same order as the time required for broadcasting and summation.

The choice for the application of a sorter for data routing operations is motivated by the fact that a sorter


provides the highest degree of flexibility and is highly suited for VLSI implementation. It is composed of simple, locally and regularly connected modules, which perform elementary compare and exchange operations. The global control requirements for the transfer operations are minimized, because the transfer is controlled by the keys and the sorter itself requires only a moderate amount of control. If the algorithm proposed in [17] is used, no global control is necessary. The area required for the implementation of a two-dimensional sorter is proportional to the number of bits (keys and data values) of the data elements which are stored in the network. The length of the keys is proportional to the logarithm of the number of data elements.

If a random access memory and a sequential processor were used for the sparse matrix-vector multiplication, pointers and addresses of O(log e) length would be required too. Hence, the area required for the sorter is within a constant factor of the area necessary for the implementation with a conventional memory.

3. A Sorter-Based Architecture

The example in the last section demonstrates that the time performance of some algorithms which are mapped onto locally connected architectures is determined by data transfer operations. Consequently, the utilization of the processors, which must be capable of performing complex operations, such as additions and multiplications, is very poor. The only way to increase the efficiency of the architecture (without sacrificing the local interconnection topology) is to decrease the hardware requirements for the complex components, i.e., multipliers, adders, and control units.

3.1. Architecture

In the following an architecture with a reduced number of processors is proposed. This architecture combines a two-dimensional sorting network with a one-dimensional processor array. In this composition the data transfer bandwidth of the sorter and the computational capacity of the processors allow the execution of many communication intensive algorithms (e.g., sparse matrix computations) with the minimum time complexity possible on a locally connected array. Figure 5 shows the architecture, which consists of the following three components:

1. a √N × √N two-dimensional sorting network,

2. a one-dimensional array with √N processor elements (PEs), attached to a vertical boundary of the sorter, and

3. √N local memories (LMs), each having O(√N) memory locations.

Fig. 5. The sorter-based architecture.

In this architecture the sorter performs all complex data transfer operations. Furthermore, the storage capability of the sorter is used as memory for N data items. Therefore, the sorter can be considered a special kind of shared memory. All control and arithmetic operations are performed by the processors. Hence, the complex operations are concentrated in a relatively small number of processors and the hardware requirements are reduced.

Many algorithms do not require complex transfer operations on all data elements. Therefore, every processor has its own local memory, whose size is proportional to the number of data elements in the corresponding row of the sorter. For many applications it is sufficient that the processors have sequential access to the elements stored in the memories. Therefore, first-in-first-out buffers (FIFOs) with recirculation of the data elements are sufficient. Hence, the data transfers from and to the LMs also involve only local operations.

As depicted in figure 5, the processors have access only to the right column and, via the wrap-around connections, to the left column of the sorter. Other columns must be transferred to the boundary before they can be accessed by the processors. This kind of transfer operation can also be done by sorting. Figure 6 shows how the inner columns are transferred to the boundary by local exchange operations. The elements within the white cells represent the original data, e.g., elements of a matrix. The gray cells denote a new set of data elements, which is written into the sorter by the processors. It is assumed that during such a read/write operation all exchanges are performed simultaneously throughout the columns, because otherwise the two sets



Fig. 6. R/W-pass from left (a) and keys for R/W-passes from both sides (b).

of data, the old and the new one, may be mixed. Such an exchange of data is also called a read/write-pass (R/W-pass).

Using the shortperiodic sorter presented in [17], which is based on a repeated application of horizontal and vertical compare and exchange operations, the elements can be transferred to the processors simply by writing new data elements to a boundary column of the sorter. If the R/W-pass is performed from the left boundary, these new elements (gray columns in figure 6) must have keys which are larger than those of the elements which have to be transferred to the left boundary. A similar procedure can be applied from the right side. In this case, the new elements must have keys which are smaller than those of the elements within the sorter. If the R/W-passes are applied alternately from the right and left boundary, the input and output of the data can be controlled by an additional, most-significant bit of the keys, which is alternately set to one and to zero; a small sketch of this key scheme is given below. Hence, sorting and R/W-passes are distinguished only by operations at the boundary. The whole data routing is controlled by keys, which are produced in the processors or stored in the LMs.
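The key scheme itself can be illustrated with a one-dimensional stand-in for the sorter; the keys and data below are invented, and only the role of the additional most-significant bit is taken from the description above.

```python
# Sketch of the key scheme for alternating R/W-passes (1-D analogy of the
# 2-D sorter).  Old elements carry MSB = 0, the newly written data carries
# MSB = 1, so after sorting the old data gathers at the boundary where the
# processors can read it.

old = [(0, key, f"old{key}") for key in (3, 1, 2)]   # (MSB, key, data)
new = [(1, key, f"new{key}") for key in (2, 1, 3)]   # written at the boundary

content = sorted(old + new)       # the sorter's job
boundary = content[:len(old)]     # old elements, now readable in key order
print(boundary)                   # [(0, 1, 'old1'), (0, 2, 'old2'), (0, 3, 'old3')]
```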

Whenever a column of a sorted sequence reaches the processors, its data is processed and the result is sent back to the sorter with new keys. During a R/W-pass O(N) data elements are accessed by O(√N) processors within O(√N) time steps, which is also the time complexity of a time optimal sorting procedure. As a

result, a high processor utilization of O(1) can be achieved for many algorithms with complex data dependencies. For this it is presumed that the algorithm can be decomposed into routing operations and other operations in which the overwhelming majority of the data elements in the sorter are accessed. In many cases the lowest time complexity possible for an implementation on a locally connected array is achieved.

3.2. Elementary Operations

In the following it is shown how simple operations, such as multiplication, broadcasting, and summation, can be performed with elements in the sorter and in the local memories during R/W-passes.

(i) Simple unary and binary operations with elements in the sorter and the local memories. During a R/W-pass, the elements of the sorter and the LMs are transferred to the processors. These elements are processed there and the results are sent back either to the sorter or to the local memories. Figure 7 illustrates the calculation of the partial products for the example of the matrix-vector multiplication in the previous section. The elements a_ij come from the LMs and the elements x_ij from the sorter. The products a_ij x_ij are written back to the sorter during the same R/W-pass. These elements correspond to the gray elements in figure 6.



Fig. 7. Calculating the partial products for a sparse matrix-vector multiplication.

(ii) Multiple broadcasting. This operation sends a copy of one element within a connected region to all other elements within the same region. This can be done in all regions in parallel. For that, two R/W-passes, one from the right and one from the left boundary, are necessary. Two registers in each processor are also required. During the first R/W-pass (from the right), the elements are copied to the left. The first and the last element of every row is also stored in a register of a processor at the boundary, see figure 8(a). These elements are required for broadcasting in regions which overlap several rows. After the first R/W-pass, copies of the elements in the registers are sent to all processors belonging to the same region, see figure 8(b). Afterwards, a second R/W-pass from the opposite side can complete the broadcasting operation.

(iii) Multiple summation. This operation adds elements within connected regions. During one R/W-pass the elements of regions which lie entirely in one row of the sorter can be added completely. All results, except the first in regions which overlap a row and the last result in all rows (last column), are sent back to the sorter during the same R/W-pass. The partial sums of row-overlapping regions are now stored in the registers of the processors. Individual partial sums of the same region are added before they are sent back to the sorter with the last column. Hence, one R/W-pass is sufficient for the summation.

The time required for an elementary operation depends on the number of time steps necessary for one R/W-pass, i.e., O(√N) steps. Some additional steps are needed for broadcasting and summation in cases where the regions extend over more than one row. The number of these steps is small (≤ √N) and, therefore, they have no influence on the time complexity. Note that


Fig. 8. Broadcasting the elements a-f within connected regions.

in some cases it is possible to combine successive elementary operations (e.g., multiplication and summation) and perform them during the same R/W-pass.

It is also possible to modify the elementary operations by exchanging the multiplication, addition, and copy function with other operations, e.g., with the maximum or minimum operation. Hence, a variety of elementary operations can be performed. For example, it is possible to perform prefix or scan computations in O(√N) time; these can be used as primitives for many algorithms, see [4].
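As an illustration, the following sequential sketch computes a multiple summation and a segmented prefix scan over connected regions; the region identifiers and values are invented, and a simple loop over the sorted sequence stands in for the R/W-pass.

```python
# Sketch of multiple summation and of a prefix (scan) computation within
# connected regions; a region id in the key marks each connected region.

elements = [(0, 2.0), (0, 3.0), (1, 1.0), (1, 4.0), (1, 5.0), (2, 7.0)]

# multiple summation: one running sum per connected region
sums = {}
for region, value in elements:
    sums[region] = sums.get(region, 0.0) + value

# segmented prefix scan: the running sum restarts at every region boundary
prefix, running, current = [], 0.0, None
for region, value in elements:
    if region != current:
        running, current = 0.0, region
    running += value
    prefix.append((region, running))

print(sums)    # {0: 5.0, 1: 10.0, 2: 7.0}
print(prefix)  # [(0, 2.0), (0, 5.0), (1, 1.0), (1, 5.0), (1, 10.0), (2, 7.0)]
```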

4. Parallel Solution of Sparse Linear Systems

In many iterative schemes for the solution of a system of linear equations,

Ax = b    (2)

the major computational step in each iteration is a matrix-vector multiplication [19]. In the following a system of linear equations with a sparse coefficient matrix A is solved using the Jacobi method.

Jacobi Iteration. For the Jacobi method, Equation (2) is transformed to the following iterative form:


x^(k+1) = G x^(k) + h.    (3)

If the spectral radius of G is less than one, the iteration vector converges to the solution x of Equation (2). The matrix G has the same zero entries as A. Additionally, the diagonal elements of G are zero-valued as well. In accordance with the example in Section 2, a matrix X with a structure equal to that of A is introduced for data routing. The values of the iterative solution vector x^(k) are stored in the diagonal elements of X. The nonzero elements of the matrix G and the vector h are stored in the local memories. Thereby the components of h replace the zero-valued diagonal elements g_ii, see figure 9.

Algorithm I: Single Jacobi Step

1. sort elements x_ij : key = (j, i)
2. broadcast the value of x_jj to all elements x_ij {x_ij = x_jj}
3. sort elements x_ij : key = (i, j)
4. multiplication : x_ij = g_ij x_ij for i ≠ j and x_ii = h_i
5. summation : x_ii = Σ_j x_ij
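For illustration, the following sequential sketch emulates Algorithm I; Python sorts stand in for the two-dimensional sorter, and the matrix G, the vector h and the start vector are invented example data (chosen so that the spectral radius of G is below one).

```python
# Sequential sketch of Algorithm I.  Sorting the index list mirrors the
# routing steps; the example data is invented.

G = {(0, 1): 0.3, (1, 0): 0.2, (1, 2): 0.1, (2, 1): 0.4}   # nonzero g_ij, i != j
h = {0: 1.0, 1: 2.0, 2: 3.0}
x = {0: 0.0, 1: 0.0, 2: 0.0}                               # x^(0)

for _ in range(25):                                        # Jacobi iterations
    X = [(i, j) for (i, j) in G] + [(i, i) for i in x]     # structure of X
    # steps 1-2: sort by (j, i) and broadcast the diagonal over each column
    val = {(i, j): x[j] for (i, j) in sorted(X, key=lambda ij: (ij[1], ij[0]))}
    # step 4: pointwise multiply (the diagonal element x_ii takes the value h_i)
    val = {(i, j): (h[i] if i == j else G[(i, j)] * val[(i, j)])
           for (i, j) in val}
    # steps 3 and 5: sort by (i, j) and sum each row into x_ii
    x = {i: 0.0 for i in x}
    for (i, j) in sorted(val, key=lambda ij: ij):
        x[i] += val[(i, j)]

print(x)   # approaches the fixed point of Equation (3), i.e. x = Gx + h
```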

The algorithm for a single iteration step requires two complex routing operations (sorting) to bring the elements into the necessary order. Figures 9 and 10 show the arrangement of data elements in the architecture during different steps of the algorithm.


Fig. 9. Ordering of data elements for broadcasting.


Fig. 10. Ordering of data elements for multiplication and summation.

Algorithm I runs on an architecture with an O(√e × √e) sorting network and O(√e) processors, each having a local memory for storing O(√e) elements.

If the multiplication and summation are performed during the same R/W-pass, Algorithm I requires three R/W-passes, a small number of additional steps (≤ O(√e)) for summation and broadcasting in row-overlapping regions, and two sorting operations. During the R/W-passes O(e) multiplications and additions are performed within O(√e) time steps. The time required for data transfers depends on the time complexity of the sorting algorithm used. A time optimal sorter also requires only O(√e) steps. Hence, one Jacobi step has a time complexity of O(√e) and the processor utilization is O(1).

5. Fast Fourier Transform

For many algorithms in digital signal processing only a restricted class of permutations, such as bit-oriented permutations, is necessary. In these cases the data transfer can be performed either by a sorter or by a more specialized network, which reduces the area and time requirements as well as the flexibility of the architecture. Therefore, in these cases the architecture can be optimized by replacing the sorter by a network as described in [11], [14], [22].

The FFT is a famous example of an algorithm based on bit-oriented permutations. Although we cannot execute it in the minimal time possible on a locally connected array, our solution utilizes √N complex processors with a utilization of O(1). This is consistent with the results achieved by other authors using an orthogonal multiprocessor [6].

The FFT can be computed with a constant geometry version [13], [18]. In this case the same permutation (e.g., perfect shuffle or unshuffle) interconnects consecutive stages. Figure 11 shows a decimation-in-frequency FFT algorithm.

The interconnection scheme between the stages is a perfect shuffle. This network permutes the elements in the following way:

P(i) = 2i            if 0 ≤ i ≤ N/2 − 1
P(i) = 2i + 1 − N    if N/2 ≤ i ≤ N − 1.

This can be expressed in the binary representation of the indices as follows (m = log2 N):

i = i_{m−1} 2^{m−1} + i_{m−2} 2^{m−2} + ... + i_1 · 2 + i_0
P(i) = i_{m−2} 2^{m−1} + i_{m−3} 2^{m−2} + ... + i_0 · 2 + i_{m−1}.
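The arithmetic form of P(i) and the cyclic rotation of the index bits can be checked against each other with a short sketch (N = 16 is an arbitrary choice):

```python
# Sketch: the perfect shuffle as arithmetic formula and as a cyclic left
# rotation of the m index bits agree for all indices.

N = 16
m = N.bit_length() - 1            # m = log2(N)

def shuffle_arith(i):
    return 2 * i if i < N // 2 else 2 * i + 1 - N

def shuffle_bits(i):
    msb = (i >> (m - 1)) & 1             # the leading bit i_{m-1} ...
    return ((i << 1) & (N - 1)) | msb    # ... becomes the new LSB

assert all(shuffle_arith(i) == shuffle_bits(i) for i in range(N))
```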


Fig. 11. A constant geometry version of the FFT.


For the implementation of the FFT the data samples are mapped in row-major order onto the mesh, initially and after each stage. If the perfect shuffle permutation between the stages is performed by the use of a sorting algorithm, the keys of the elements must be equal to the binary representation of the indices of the permuted sequence P(i). Consequently, sorting the data samples with these keys performs a shuffle permutation, as shown in figure 12.


Fig. 12. Perfect shuffle permutation on a mesh.

In order to reduce the number of processors for the butterfly operations, the data elements are transferred to a one-dimensional array of processors at the boundary with a R/W-pass, as shown in Section 3. Each processor reads two consecutive data elements, performs the butterfly operation, and writes the results back to the sorter. The keys of these elements are cyclically rotated and, thus, the elements are ready for the next sorting operation.

A butterfly operation involves one complex multiplication with W as well as two complex additions. In each stage every processor performs √N/2 butterfly operations and, therefore, requires √N/2 different coefficients per stage or (√N/2) log2 N coefficients for the whole transformation. Fortunately, the maximal number of √N/2 different coefficients is needed in the first stage only and is decreased by a factor of 2 in each of the following stages. Therefore, a processor requires only O(√N) different coefficients for a complete transformation. The whole architecture is composed of a sorter or permutation network for N = √N × √N data elements and a one-dimensional array of √N processors, each having a local memory for O(√N) coefficients, see figure 13.


Fig. 13. Mapping of the elements for the decimation in frequency FFT.

The time required for the complete transformation is O(√N log N). After the transformation the elements are arranged in bit-reversed order and can be brought back into the normal order by bit-reversing the keys and a subsequent sorting operation.

6. Convex Hull

The convex hull problem is defined as follows: given N points in a plane, determine which of these points (extreme points) lie at the edges of the smallest convex hull that contains all points. In the following the quickhull algorithm [15] is described.

At first, two extreme points are determined. This can be done by searching for the points with the largest and smallest x-coordinates. Then the problem can be split into two parts: an upper hull problem involving the points which lie above the line connecting the first two extreme points, and a lower hull problem involving the points lying below this line. Then, on each side the point which has the greatest distance to the line connecting the extreme points is determined. These are new extreme points, and the problem can be further subdivided into smaller problems. This is continued until all points lie within the hull, see figure 14.

In the following, a short outline of the implementation of the quickhull algorithm on the sorter-based architecture is given; a sequential sketch of the recursion follows the five steps below. During the first two steps the data elements (each represents one point) can be in an arbitrary order.


Fig. 14. Construction of a convex hull (extreme points after iteration 1: {A, I}; after iteration 2: {A, B, I, J}; after iteration 3: {A, B, G, I, J, E}).

1. The first two extreme points are determined by selecting the points with the minimal and maximal x-coordinates during one R/W-pass.

2. In a following R/W-pass, every point is classified according to whether it lies above or below the line connecting the extreme points.

3. The problem is split into two problems as follows: two additional points with the same coordinates as the two previously determined extreme points are added. Then the elements are sorted. For this sorting operation the MSBs of the keys of the elements in the upper half-plane, including the two extreme points, are set to one. Those of the elements in the lower half-plane, including the two added points, are set to zero. The remaining bits of the keys contain the values of the x-coordinates. As a consequence, the points for the upper and lower hull are separated into two regions by the sorting operation. Within the regions the points are ordered according to their x-coordinate.

4. Identify in every region the point which has the greatest distance to the line connecting the first and last points in that region (the distance must have a positive y-component for the upper and a negative one for the lower hull). This point divides the region into two smaller (connected) regions. The new extreme points are allotted to both neighboring regions. Repeat this step until no new extreme points are detected.

5. Delete the points introduced in Step 3.
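The sequential sketch below captures the geometric core of steps 2 and 4 (classification by the signed area and the search for the farthest point); the sorter-specific mechanics of keys, regions, and R/W-passes are omitted, and the sample points are invented.

```python
# Sequential sketch of the quickhull recursion; not the parallel mapping.

def cross(o, a, b):
    # twice the signed area of triangle (o, a, b); > 0 means b lies left of o->a
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def hull_side(pts, p, q):
    # extreme points strictly above the directed line p -> q
    above = [r for r in pts if cross(p, q, r) > 0]
    if not above:
        return []
    far = max(above, key=lambda r: cross(p, q, r))   # new extreme point
    return hull_side(above, p, far) + [far] + hull_side(above, far, q)

def quickhull(pts):
    p, q = min(pts), max(pts)            # step 1: extreme x-coordinates
    return [p] + hull_side(pts, p, q) + [q] + hull_side(pts, q, p)

print(quickhull([(0, 0), (4, 0), (2, 3), (2, 1), (1, 2), (3, 2)]))
```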

The algorithm requires one sorting operation and several other operations which can be done in all connected regions independently. These operations are similar to the elementary operations presented in Section 3.

Each step of the algorithm has a time complexity of O(√N). Unfortunately, O(m) steps (m: number of hull points) are required in the worst case, and m might be proportional to N. Nevertheless, for well-distributed extreme points the number of steps is significantly smaller.

7. Simulation of a Shared-Memory Computer

Following the ideas of Batcher [2], we show that the sorter-based architecture can be used as a shared-memory computer. Generally, in such computing machines N operations are performed in parallel in O(1) time. Between consecutive operations a parallel memory access is performed. This access requires at least O(log N) time or, if Batcher's sorter is applied, O(log²N) time. In the method described below, in each step O(N) operations are performed by O(√N) processors in O(√N) time, which is also the time required for the memory access.

The sorter-based architecture is slightly modified as shown in figure 15. The upper half of the architecture serves as shared memory only. The processors in this part of the architecture are required for the control of the memory access (control processors). In the lower half the actual computation takes place. The processors here must be capable of performing all necessary operations (general processors). In the sorter there are two different sets of data elements: N/2 normal elements (see figure 16(a)), which serve as storage for the shared data, and N/2 transfer elements (figure 16(b)-(e)), which are necessary for the data transfers between the processors and the normal elements. Local data is stored in the LMs.


Fig. 15. Shared-memory computer.


[Figure: formats of the data elements, each consisting of a key (whose MSB serves as control bit) and a data field — (a) normal element: element # and data; (b) transfer element (read request): element # and return address; (c) transfer element (return to processor): return address and data; (d) transfer element (write request): element # and data; (e) empty transfer element: all bits zero.]

Fig. 16. Data elements.

Parallel memory access operations can be divided into two classes: 1) exclusive-read (ER) and exclusive-write (EW) operations and 2) concurrent-read (CR) and concurrent-write (CW) operations. In exclusive accesses each memory element is addressed by one read or write request only, whereas in concurrent accesses several read or write attempts to the same memory element are allowed.

ER access. During a R/W-pass, each processor in the lower half of the architecture sends in every step one transfer element (read request) to the sorter (all normal elements are assumed to be in the upper half of the sorter). The keys of the transfer elements are equal to the keys (element #) of those normal elements which the processors want to read. Additionally, the return address is sent with the element, see figure 16(b). After this R/W-pass a sorting operation brings the request and normal elements pairwise into neighborhood. In this ordering the data of the normal elements can be copied to the transfer elements by a broadcasting operation. Additionally, the return addresses are written to the keys of the transfer elements and the MSBs are cleared, see figure 16(c). A sorting operation brings the transfer elements back to the lower half of the architecture and into the correct order, so that they can be accessed by the processors in a R/W-pass.
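A sequential sketch of this pairing-by-sorting is given below; the field layouts are simplified and the memory contents and return addresses are invented.

```python
# Sketch of an exclusive-read access: read requests and normal elements are
# merged by sorting on the element number, then each request copies the data
# of its (now neighboring) normal element.

memory   = {10: 'A', 11: 'B', 12: 'C'}          # element# -> stored data
requests = [(12, 'p0'), (10, 'p1')]             # (element#, return address)

# tag = 0 for normal elements, 1 for read requests, so a request sorts
# directly behind the normal element with the same key
stream = sorted([(k, 0, v) for k, v in memory.items()] +
                [(k, 1, ra) for k, ra in requests])

answers = []
for idx, (key, tag, payload) in enumerate(stream):
    if tag == 1:                                 # copy from the left neighbor
        _, _, data = stream[idx - 1]
        answers.append((payload, data))          # (return address, data)

print(sorted(answers))                           # [('p0', 'C'), ('p1', 'A')]
```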

EW access. For an exclusive-write access, transfer elements (write requests) are sent to the normal elements as described above. For this operation no return addresses are necessary, see figure 16(d). Then the data values of the transfer elements are copied to the normal elements and the transfer elements are cleared, i.e., all bits are set to zero, see figure 16(e). The empty transfer elements are brought back into the lower half of the sorter by sorting. These empty elements are required there as spaces for the next memory access (insertion of request elements).

CR and CW accesses. The concurrent-read access differs from the exclusive-read access only in the fact that the required copy function is done with groups of elements instead of with pairs. With a concurrent-write operation difficulties arise, because several processors may attempt to write to the same element. Such write conflicts can be solved by specifying the way in which these problems should be treated. One policy is to store the sum of all values attempting to be stored at the same element. For that, the multiple summation operation explained in Section 3 can be applied. In this case the control processors must be capable of performing additions.

8. Conclusion

The principal purpose of this paper is to show that a sorting network is a very useful and flexible component for VLSI architectures. Here the sorter is used as an efficient network for data transfer operations, and due to its storage capability the sorter also serves as a special kind of shared memory, which has the capability of sorting the data elements stored in it. The proposed architecture consists of a two-dimensional sorter, a one-dimensional array of processors, and local memories. This combination represents a compromise between the different complexities of data routing, arithmetic operations and storage of local data. The algorithms which have been mapped onto this architecture show that this sorter-based multiprocessor is very flexible and that a large class of algorithms can be mapped onto it.

The sorter provides the flexibility to perform all possible permutations and, therefore, is not restricted to a special class of transfer operations. The control of the data transfer is done by sorting keys which are attached to the data elements. In many cases these keys are produced by simple bit operations (e.g., an exchange of digits) or they are preprocessed and stored in local memories. Hence, the costs for the control of the data transfer are very small and a global synchronization is necessary only for the processors.

The example of the FFT shows that algorithms with more restricted interconnection patterns can also be efficiently mapped onto an architecture which combines a two-dimensional permutation network with a one-dimensional processor array. In general, algorithms with a global interconnection structure are slowed down by a factor of O(√N) due to the local interconnections and the limited number of processors. But in all these cases the utilization of the processors is very high.


Besides that, the clock period can be made shorter in a locally connected architecture than in one with a global topology.

The proposed architecture requires only O(√N) complex processors for a problem of size N. This makes the sorter-based architecture an attractive candidate for single-chip solutions for communication intensive algorithms. Estimations of the area requirements [3] suggest that a thousand-element problem instance is within the scope of current technology.

Cascading sorter chips for a multichip solution for large problems, however, involves a speed reduction caused by the limited I/O-bandwidth of individual chips. A chip containing a square section of N′ cells of a time optimized two-dimensional sorting or permutation network would require a multiple of √N′ pins operating at the same clock frequency as the local on-chip interconnections. The resulting pin-bottleneck is not surprising, because a sorter is a specialized circuit for transferring a large number of data elements to different locations within the array, and the on-chip bandwidth is much higher than the bandwidth over chip boundaries.

In this context it turns out to be very important that efficient methods for including fault-tolerant features into a two-dimensional sorting network exist [9]. This is a crucial point, because with such techniques it is possible to overcome the limitations set by the maximal chip area of current fabrication processes.

References

1. S.G. Akl, The Design and Analysis of Parallel Algorithms, Englewood Cliffs, NJ: Prentice-Hall, 1989.

2. K.E. Batcher, "Sorting Networks and Their Applications," Proc. AFIPS Spring Joint Computer Conf. 32, 1968, pp. 307-314.

3. E. Bernard, "CMOS-Entwurf eines 2-dim. fehlertoleranten Sortiernetzwerkes für Datentransportaufgaben," Diplomarbeit, TU München, 1990.

4. G.E. Blelloch, "Scans as Primitive Parallel Operations," IEEE Transactions on Computers, vol. 38, 1989, pp. 1526-1527.

5. J. Götze and U. Schwiegelshohn, "Sparse-Matrix-Vector Multiplication on a Systolic Array," Proc. of ICASSP, 1988, pp. 2061-2064.

6. K. Hwang, P.-S. Tseng, and D. Kim, "An Orthogonal Multiprocessor for Parallel Scientific Computations," IEEE Transactions on Computers, vol. 38, 1989, pp. 47-61.

7. D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Reading, MA: Addison-Wesley, 1973.

8. J.G. Krammer, "Parallel Processing with a Sorting Network," ISCAS, New Orleans, 1990, pp. 966-969.

9. J.G. Krammer and H. Arif, "A Fault-Tolerant Two-Dimensional Sorting Network," Proc. of ASAP, 1990, pp. 317-328.

10. M. Misra and V.K.P. Kumar, "Efficient VLSI Implementation of Iterative Solutions to Sparse Linear Systems," Technical Report, IRIS no. 246, University of Southern California, 1988.

11. D. Nassimi and S. Sahni, "An Optimal Routing Algorithm for Mesh-Connected Parallel Computers," Journal of the ACM, vol. 27, 1980, pp. 6-29.

12. D. Nassimi and S. Sahni, "Bitonic Sort on a Mesh-Connected Parallel Computer," IEEE Transactions on Computers, vol. 27, 1979, pp. 3-7.

13. M.e. Pease, ''An adaptation of the fast Fourier transform for parallel processing," Journal of the ACM, vol. 15, 1968, pp. 252-264.

14. M.e. Pease, "The Indirect Binary n-Cube Microprocessor Ar­ray," IEEE Transactions on Computers, vol. 26, 1977, pp. 458-473.

15. F.P. Preparata and M.1. Sharnos, Computational Geometry­An Introduction, New York: Springer-Verlag, 1985.

16. I.D. Scherson and S. Sen, "Parallel Sorting in Two-Dimensional VLSI Models of Computation," IEEE Transactions on Com­puters, vol. 38, 1989, pp. 238-249.

17. U. Schwiegelshohn, ''A Shortperiodic Two-Dimensional Systolic Sorting Algorithm," Int. Conf on Systolic Arrays, San Diego, Calif., 1988, pp. 257-264.

18. H.S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on Computers, vol. 20, 1971, pp. 153-161.

19. G. Strang, Linear Algebra and Its Applications, New York: Academic Press, 1980.

20. C.D. Thompson and H.T. Kung, "Sorting on a Mesh-Connected Parallel Computer;' Comm. ACM, vol. 20, 1977, pp. 263-271.

21. e.D. Thompson, "The VLSI Complexity of Sorting," IEEE Transactions on Computers, vol. 32, 1983, pp. 1171-1184.

22. KW Przytula, 1.0. Nash and S. Hansen, "Fast Fourier transfonns algorithm for two-dimensional array of processors," SPIE, vol. 826, 1987, pp. 186-198.

Josef G. Krammer received the Diplom-Ingenieur degree in electrical engineering from the Technical University of Munich, West Germany, in 1987. Currently he is working towards his Dr.-Ing. degree at the Institute for Network Theory and Circuit Design. His interests and activities are mainly in the field of VLSI architectures, bit-level systems, fault tolerance, and sorting.


Feedforward Architectures for Parallel Viterbi Decoding

G. FETTWEIS, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099

H. MEYR, Aachen University of Technology (RWTH), D-5100 Aachen, West Germany

Received August 19, 1990; Revised November 20, 1990.

Abstract. The Viterbi algorithm (VA) is a common application of dynamic programming. The algorithm contains a nonlinear feedback loop (ACS-feedback; ACS: add-compare-select) which is the bottleneck in high data rate implementations. In this paper we show that, asymptotically, the ACS-feedback no longer has to be processed recursively, i.e., there is no feedback. With only negligible performance loss, this fact can be exploited technically to design efficient and purely feedforward architectures for Viterbi decoding that have a modular, extendable structure. By designing one cascadable module, any speedup can be achieved simply by adding modules to the implementation. It is shown that optimization criteria, such as minimum latency or maximum hardware efficiency, are met by very different architectures.

1. Introduction

Dynamic programming is a well-established approach for a large variety of problems concerning multistage decision processes [1]. One common specific application of dynamic programming is the Viterbi algorithm (VA) [2]. It is a dynamic programming formulation for the search for the shortest path through a finite-state discrete-time state transition diagram, called a trellis, see figure 1.

In 1967, the VA was introduced as a method of decoding convolutional codes [3]. In the meantime it has become widely used for various applications in digital communications, magnetic recording, speech recognition, etc. [4]. Dynamic programming requires the execution of a nonlinear and data-dependent recursion, and this feedback loop is the main problem and bottleneck for a high-speed implementation of the VA.

Given a certain algorithm which is to be implemented for a high-speed application, the maximum inherent parallelism needs to be extracted. Once this is done, there exist many methods of mapping the algorithm onto (regular) architectures, as given, for instance, in [5]-[8]. However, as any hardware has a limited computational speed and an algorithm has limited inherent parallelism, an implementation of an algorithm for very high-speed applications can require the introduction of additional parallelism, to permit the algorithm to be mapped onto massively parallel and/or pipelined architectures so as to achieve an additional speedup by orders of magnitude. It is of special interest to introduce the parallelism in a way that allows the derivation of architectures that at best lead to a linear increase in hardware complexity for a linear speedup in throughput rate. This we refer to as a linear scale solution.

Fig. 1. Example of a trellis with N = 2 states s₁ and s₂.

There are three levels at which additional parallelism can be introduced into an algorithm [9], namely the algorithmic level, the word level and the bit level. At the algorithmic level, application-specific knowledge might allow a major algorithmic modification or the use of an alternative algorithm that has much more parallelism. The word level can allow transformations of the signal-flow diagram or algebraic transformations of the algorithm itself. Finally, at the bit level, transformations might allow the introduction of bit-level pipelining such that the critical path of the implementation runs only through a few bit levels and thus is extremely short and allows a very high clock frequency.


Despite the nonlinear feedback of the VA, solutions were recently presented at the bit level [10] and the word level [11]-[17]. Here we want to show solutions at the algorithmic level [17]-[20]. After recalling the VA and its major features in Section 2, the parallelization at the algorithmic level is shown in Section 3. In Section 4, knowledge at the algorithmic as well as the word level is combined to achieve efficient new parallel Viterbi decoding methods. A comparison of the methods is given in Section 5.

2. The Viterbi Algorithm

The underlying multistage decision problem which is solved by the VA is to calculate the optimum path (shortest or longest) through a graph of weighted branches. The graph is a discrete-time state transition diagram of a finite state machine, called a trellis, see e.g., figure 1. The trellis is described by a finite number of N states s_i (i ∈ {1, ..., N}) at every time instant kT (T = 1) and by branches (transitions) for the time intervals (k, k + 1) that connect the states, as shown in figure 1. Thus, the trellis is a two-dimensional graph of states and time instants. Below we refer to a specific state at a certain time instant as node s_{i,k}. Connected with each possible transition (branch) from state s_j to s_i (j → i) there is a measure called the transition metric λ_{ij,k} for the time interval (k, k + 1). The problem is to estimate the path through the trellis that, by accumulation of its transition metrics, maximizes the gain. This is achieved by recursively calculating the optimum path to each node at every time instant k.

2.1. The Algorithm

Given at time k are N paths, one optimum path to each state, see figure 2. Associated with each path (hence with each state) is a path metric γ_{i,k}. The updating of the best path to each state at time k + 1 is achieved through dynamic programming by updating the path metrics γ_{i,k+1} according to the ACS-recursion

∀ s_i:  γ_{i,k+1} = max_{∀ j→i} (λ_{ij,k} + γ_{j,k}),    (1)

which for the simple example of figure 1 leads to

γ_{1,k+1} = max(λ_{11,k} + γ_{1,k}; λ_{12,k} + γ_{2,k})    (2a)

γ_{2,k+1} = max(λ_{21,k} + γ_{1,k}; λ_{22,k} + γ_{2,k})    (2b)

Fig. 2. Decoding the optimum path to node s_{1,k+1} at time k + 1.

This ACS-recursion can easily be explained with the help of figure 2. Given that the updating is to be carried out for node s_{1,k+1} (eq. (2a)), then each path at time k is extended by the branches leading to s_{1,k+1}.

The transition metrics of the branches are added to the path metrics, resulting in new path metrics, from which the path with the maximum metric is chosen as being the optimum one for s_{1,k+1} (eq. (2a)), and all others are discarded.
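To make the add-compare-select operation concrete, the following minimal Python sketch (ours, not part of the original paper; all names are illustrative) performs one ACS step for an N-state trellis, with -inf marking branches that do not exist:

    import numpy as np

    def acs_step(gamma_k, lam_k):
        """One add-compare-select (ACS) step of the Viterbi algorithm.
        gamma_k: length-N path metrics at time k.
        lam_k:   N x N transition metrics; lam_k[i, j] is the metric
                 of the branch from state j to state i (-inf if absent).
        Returns the path metrics at time k+1 and, per state, the index
        of the surviving predecessor (the 'select' decisions)."""
        sums = lam_k + gamma_k[np.newaxis, :]      # add
        decisions = np.argmax(sums, axis=1)        # compare / select
        return sums[np.arange(sums.shape[0]), decisions], decisions

    # Two-state example corresponding to eqs. (2a)-(2b)
    gamma = np.array([0.0, 0.0])
    lam = np.array([[1.0, 3.0],
                    [2.0, 0.5]])
    print(acs_step(gamma, lam))    # metrics [3.0, 2.0], decisions [1, 0]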

The maximum selection is nonlinear and data-dependent. It seems as though all path metrics γ_i at time k have to be computed before one is able to calculate those at time k + 1. This is the feedback bottleneck mentioned in Section 1.

If the N paths at time k are traced back in time, it can be seen that they merge into a unique path, the optimum one. The survivor depth D is then defined as that depth at which it is highly probable that all paths have merged if they are traced back D steps in time to k - D. In a practical implementation of the VA, called a Viterbi decoder (VD), this allows the decoded transition to be given out with latency D.

In the block diagram of a VD shown in figure 3, we find a pipeline consisting of three units: the TMU (transition metric unit) to calculate the transition metrics λ_{ij,k}, the ACSU (add-compare-select unit) to carry out the ACS-recursion, and the SMU (survivor memory unit) to compute the paths and give out the decoded transition with latency D. All units are purely feedforward, except for the ACSU because of its nonlinear ACS-recursion. Thus, a high-speed parallel implementation can easily be found for the TMU and SMU, but the ACS-recursion of the ACSU is a bottleneck. However, this bottleneck can be broken by a word-level algebraic transformation of the ACS-recursion [12]-[13], as will be recounted in the next section.

Fig. 3. Block diagram of the Viterbi decoder.
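The survivor memory function of the SMU can be illustrated by a short trace-back sketch (again our own illustration, not the paper's circuit): the stored ACS decisions are followed backwards from the currently best state, and after D steps the reached state is, with high probability, on the unique merged path:

    def trace_back(decisions, final_state):
        """Trace the survivor path back through stored ACS decisions.
        decisions: list of length-N integer sequences; decisions[k][i]
                   is the surviving predecessor of state i at step k.
        final_state: state with the best metric at the latest time.
        Returns the reconstructed state sequence, oldest state first."""
        state = final_state
        path = [state]
        for dec in reversed(decisions):
            state = dec[state]
            path.append(state)
        return path[::-1]

    # With decisions collected over the last D steps, path[0] is the
    # (with high probability unique) state at time k - D, so the
    # decoded transition can be released with latency D.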


2.2. Algebraic Formulation of the VA

Upon closer examination of the two operations addition (add) and maximum selection (max), one sees that the distributive law holds as follows

max(a + c; b + c) = max(a; b) + c.    (3)

It can be seen that algebraically add in (3) corresponds to the multiplication ⊗ and max corresponds to the addition ⊕. Using the symbols ⊗ and ⊕, we can rewrite (3) as

a ⊗ c ⊕ b ⊗ c = (a ⊕ b) ⊗ c.    (4)

Based on the fact that the distributive law is met and that many sets under ⊗ form a semi-group and under ⊕ form an Abelian semi-group, the operations ⊗ and ⊕ form a semi-ring [12]-[13], [21]. Sets for which this holds are, for instance, the (positive) reals and integers. We need only formally add a neutral (zero) element θ of ⊕ such that a ⊕ θ = a (θ can be considered as "−∞"). Since semi-ring algebra holds also for matrices [13], [21], we may rewrite the ACS-recursion as the following matrix-vector recursion

Γ_{k+1} = Λ_k ⊗ Γ_k    (5)

where Γ_k is the vector of all N path metrics at time k, i.e.,

Γ_k = (γ_{1,k}, ..., γ_{N,k})ᵀ    (6)

and where Λ_k is the N×N transition matrix which contains all transition metrics λ_{ij,k}. It should be noticed that the operations of matrix-vector multiplication are defined in analogy to the well-known definitions of linear algebra, requiring the definition of two operations, namely ⊗ and ⊕. Rewriting the ACS-recursion (2) of the trellis of figure 1 in the form of (5) leads to the algebraic formulation of (2)

[γ₁; γ₂]_{k+1} = [λ₁₁ λ₁₂; λ₂₁ λ₂₂]_k ⊗ [γ₁; γ₂]_k = [λ₁₁ ⊗ γ₁ ⊕ λ₁₂ ⊗ γ₂; λ₂₁ ⊗ γ₁ ⊕ λ₂₂ ⊗ γ₂]_k    (7)
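In a programming language, the semiring formulation is a one-for-one substitution: ⊗ becomes +, ⊕ becomes max, and the neutral element of ⊕ is -inf. A minimal max-plus matrix-vector product in Python (our own sketch) reproduces (7) and yields exactly the metrics of the ACS-recursion (2):

    import numpy as np

    NEG_INF = float("-inf")    # neutral element of the semiring addition (max)

    def maxplus_matvec(A, x):
        """Max-plus product A (x) x: result[i] = max_j (A[i, j] + x[j])."""
        return np.max(A + x[np.newaxis, :], axis=1)

    lam = np.array([[1.0, 3.0],
                    [2.0, 0.5]])
    gamma = np.array([0.0, 0.0])
    print(maxplus_matvec(lam, gamma))    # [3.0, 2.0], as in eqs. (2a)-(2b)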

Beyond the simplified notation, the real advantage of semiring algebra is that it allows the ACS-recursion to be written as a linear recursion. This allows (5) to be transformed to an M-step ACS-recursion [12]-[13]

Γ_{k+M} = ᴹΛ_k ⊗ Γ_k    (8)

with the M-step transition matrix ᴹΛ_k defined as

ᴹΛ_k = Λ_{k+M−1} ⊗ ... ⊗ Λ_{k+1} ⊗ Λ_k    (9)

Thus, the ACS-recursion is not a bottleneck since, with the help of semiring algebra, it can be transformed to an M-step ACS-recursion (8) with look-ahead computation according to (9), in analogy to the results known for conventional linear systems [21]-[24]. Now the path metrics only have to be computed in time steps MT, to obtain either a decrease in the clock rate of an implementation by a factor of 1/(MT) or an increase in throughput by a factor of M.
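The look-ahead computation (9) can be checked numerically with a max-plus matrix-matrix product (our own sketch, with illustrative names): M single ACS steps and one application of the precomputed M-step matrix give identical path metrics:

    import numpy as np

    def maxplus_matmul(A, B):
        """Max-plus matrix product: C[i, j] = max_l (A[i, l] + B[l, j])."""
        return np.max(A[:, :, np.newaxis] + B[np.newaxis, :, :], axis=1)

    rng = np.random.default_rng(0)
    N, M = 4, 3
    lams = [rng.normal(size=(N, N)) for _ in range(M)]   # matrices for times k ... k+M-1
    gamma = np.zeros(N)

    # M single steps ...
    g1 = gamma
    for lam in lams:
        g1 = np.max(lam + g1[np.newaxis, :], axis=1)

    # ... versus one M-step with the product of eq. (9), latest matrix leftmost
    MLam = lams[-1]
    for lam in reversed(lams[:-1]):
        MLam = maxplus_matmul(MLam, lam)
    g2 = np.max(MLam + gamma[np.newaxis, :], axis=1)

    assert np.allclose(g1, g2)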

2.3. Interpretation of the M-Step Recursion

Since (8) is of the same structure as (5), (8) again corresponds to the decoding of a trellis, now in time steps MT. This Ms-trellis (Ms: M-step) is found by combining M steps of the original trellis into one M-step, see figure 4, as is algebraically formulated in (9). The Ms-transition matrix ᴹΛ_k then describes the N×N optimum transitions from all N states at time k to those at time k + M [13].

Fig. 4. Example of a trellis with N = 2 states, where each step is described by a transition matrix, and its M-step trellis (M = 4) described by the M-step transition matrices.

Since the Ms-ACS-recursion comprises operations with matrices, it is clear that efficient systolic array architectures can be derived for its implementation [12]-[13]. However, the problem with this solution to the bottleneck is that each matrix-matrix multiplication is of complexity O(N³). This can be reduced to O(N²) for most trellises [13] but, in comparison to the original set of N ACS-recursions, an additional factor of N arises. To find less complex solutions, we will in Sections 3 and 4 incorporate knowledge at the algorithmic level.

2.4. Algorithmic Properties of the VA

When a VD starts to decode in midstream of the data flow, i.e., in the middle of the trellis, a period of initial synchronization occurs in which unreliable decisions are made, up to the point that the VD decodes the trellis exactly as if it had been operating since time −∞ [25]. In a practical implementation, one chooses a fixed number of time steps, referred to as the acquisition depth A, such that, by processing A steps, the probability of a decoding error due to not-accomplished acquisition is negligible compared to the decoding error probability.¹

As was mentioned above, if all N paths at time k (one leading to each state) are traced back in time, it is highly probable that they merge at time k − D. Therefore, we can summarize the algorithmic-level information in a scheme of decoded paths of the VA, see figure 5. When a block of E steps has been processed, the decisions of the first A steps have to be discarded as unreliable. In the last D steps, the unique path branches out to N paths from which, after processing further steps of the trellis, one of the N will be chosen as the optimum path of the time interval (E − D, E). Solely by exploiting this algorithmic knowledge, new parallel VD architectures were recently derived [17]-[20].

Fig. 5. Scheme of decoded paths after processing E = A + M + D steps (uniquely decoded path of length M = E − A − D between the acquisition depth A and the N decoded paths of the survivor depth D).

2.5. Consequences for the M-Step VA

If we look at time k + M and trace the paths decoded by the M-step VA of each node s_{i,k+M} back by one M-step, i.e., back to time k, this must lead to exactly the same preceding nodes s_{j,k} as if the original 1-step ACS-recursion (5) had been carried out and the paths were also traced back from time k + M to k. This is due to the fact that the M-step approach is no modification of the dynamic programming algorithm itself, but only a modification of the type of execution. Hence, when tracing back from k + M to k with M ≥ D, only one path exists at time k, independently of whether one has used the 1-step (5) or M-step (8) ACS-recursion. Therefore, all states of such a decoded M-step have the same preceding state α. In other words, by calculating the M-step ACS-recursion

Γ_{k+M} = [ᴹλ_{1α}; ... ; ᴹλ_{Nα}]_k ⊗ γ_{α,k}    (10)

all N inner products (one for each state) decide upon the same state α. Therefore, column α of ᴹΛ_k is chosen as the new path metric vector with arithmetic additive offset γ_{α,k}, see (10). Which state is chosen as α depends on the N values of the vector Γ_k but, independently of Γ_k, the same preceding state is determined. Therefore, by modifying Γ_k such that the value of the element γ_{β,k} is increased substantially (β ≠ α), all nodes s_{i,k+M} decide upon s_{β,k} as the unique preceding node of the M-step. Hence, by increasing γ_{β,k} step by step, one arrives at a value γ̃_{β,k} for which all nodes s_{i,k+M} simultaneously change their decision on the preceding state from s_{α,k} to s_{β,k}, i.e., (10) holds with

[ᴹλ_{1α}; ... ; ᴹλ_{Nα}]_k ⊗ γ_{α,k} = [ᴹλ_{1β}; ... ; ᴹλ_{Nβ}]_k ⊗ γ̃_{β,k}    (11)

Since the choice of the β-th element of Γ_k is arbitrary, all columns (rows) of ᴹΛ_k must be linearly dependent in the ⊗, ⊕ semiring algebra if M ≥ D. Arithmetically, this means that all columns/rows of ᴹΛ_k differ only by an arithmetic additive offset

column_β(ᴹΛ_k) = column_α(ᴹΛ_k) + c_{αβ,k}    (12)
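The linear dependence (12) is easy to observe numerically (our own max-plus sketch, using a fully connected random trellis): as soon as the accumulated matrix spans the survivor depth, any two columns of the M-step matrix differ only by a constant:

    import numpy as np

    def maxplus_matmul(A, B):
        return np.max(A[:, :, np.newaxis] + B[np.newaxis, :, :], axis=1)

    rng = np.random.default_rng(1)
    N = 4
    MLam = rng.normal(size=(N, N))            # 1-step matrix
    for m in range(2, 13):
        MLam = maxplus_matmul(rng.normal(size=(N, N)), MLam)
        spread = np.ptp(MLam[:, 0] - MLam[:, 1])
        print(m, spread)    # spread drops to 0 once m >= D (survivors merged)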

2.6. The Acquisition Depth

When carrying out the VA, the absolute values of the path metrics are not relevant, but only their relative values. These determine the decisions of the ACS-recursion. Therefore, as is well known, the path metrics can be normalized by subtracting a common value from each of them. Hence, no error occurs if Γ_{k+M} of (10) is normalized to

Γ_{k+M} = column_α(ᴹΛ_k)    (13)

Page 105: Parallel Processing on VLSI Arrays: A Special Issue of Journal of VLSI Signal Processing

Feedforward Architectures for Parallel Viterbi Decoding 109

However, since all columns of ᴹΛ_k are linearly dependent, we can set Γ_{k+M} to an arbitrary column j of ᴹΛ_k

Γ_{k+M} = column_j(ᴹΛ_k)    (14)

For further decoding from time k + M on, it is of no concern which column j is chosen. As is shown in [13], the computation of column_j(ᴹΛ_k) corresponds to the decoding of the trellis over the finite block (k, k + M) with unique initial node s_{j,k} (called the rooted trellis). Therefore, (14) shows that, after processing M = D steps of the trellis, acquisition of the path metrics is completed, i.e., A = D (with high probability).² This shows that the finite length of the acquisition depth as well as of the survivor depth is based on the same fact, the linear dependency of ᴹΛ_k for M ≥ D.

3. Parallel VDs Based on the Acquisition Property

In this section it will be shown how parallel Viterbi decoders can be derived solely by using algorithmic knowledge, i.e., by exploiting the finite length of the acquisition and survivor depths [18].

3.1. Acquisition Method I

To derive the parallel VD we first examine the schematic view of the decoded paths of a VD after E := A + M + D steps (ACS cycles), as shown in figure 5. For the last D transitions, the N paths are seen to diverge. The first A transitions are shown as a dotted line, indicating the unreliable decisions in the acquisition interval. Thus a uniquely decoded block of M transitions is obtained. Assuming the VD stopped decoding after E cycles, then a second VD starting at MT could decode the subsequent block of M transitions, and so on. The resulting scheme of block-overlap decoding is shown in figure 6, which is referred to as acquisition method I. Thus L₁ parallel VDs can be implemented to decode the constant data flow, each of them operating in time multiplex on a block of E transitions. By this method, each VD on average decodes only M out of every E steps by processing every L₁-th block of data. This yields a speedup factor S for decoding compared to one VD, where S is the product of the decoding efficiency, M/E, and the gain by parallelization, L₁:

S = [M/(A + M + D)] L₁  ⇔  L₁ = [(A + M + D)/M] S.    (15)

Thus, the required amount of hardware (L₁ VDs) depends linearly on the speedup S. Therefore, this is a linear scale solution.

Fig. 6. Acquisition method I: Scheme of block-wise decoding (overlap-abut).

3.2. Acquisition Method II

The solution presented above exploits the algorithmic properties of the VA, namely that a finite length of the acquisition and survivor depths leads to negligible performance loss (i.e., for instance, that the bit error rate increases negligibly). From the decoding of the optimum path, there is a fundamental difference between A and D. The decisions of the first A steps cannot be given out, since they are only used for the acquisition of the path metrics, whereas those of the last D steps are obtained with correct path metrics. In the interval of the last D steps, exactly N paths are decoded, one optimum path to each state. And exactly one of these N paths would be decoded as the optimum path if the subsequent steps of the trellis were decoded. Therefore, the last D steps have been decoded correctly, however, not uniquely. The only information missing is which of the N paths is the correct one, i.e., which is the correct state of time instant ET? The second VD of figure 6 decodes this information. Thus, with this information fed to the first processed block, one can pick the correct path out of the N. As the first VD then decodes the larger section of M + D transitions of the first block of length A + M + D, the second processed block can be time-shifted to the right and only has to start at time instant (A + M)T to decode the state of time instant ET while decoding the subsequent block of M + D transitions. The corresponding scheme is shown in figure 7. Again, L₂ parallel VDs can be implemented to achieve the same decoding speedup S mentioned above, i.e.,

S = [(M + D)/(A + M + D)] L₂  ⇔  L₂ = [(A + M + D)/(M + D)] S  ⇒  L₂ < L₁.    (16)

Page 106: Parallel Processing on VLSI Arrays: A Special Issue of Journal of VLSI Signal Processing

110 Fettweis and Meyr

Fig. 7. Acquisition method II: Scheme of block-wise decoding with selection of the optimal of N paths of the preceding block (overlap-select).

It follows from (16) that acquisition method II requires fewer VDs (L₂) than the first approach (acquisition method I: L₁) for the same speedup S, because the efficiency of decoding in the second approach is (M + D)/(A + M + D).

We see that the second solution (like the first solution) is based on a block-wise decoding (see figure 7) but requires the implementation of an additional hierarchy of data transfer to handle the decision to be sent from block to block, i.e., from VD to VD.
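For concreteness, the hardware counts (15) and (16) can be tabulated for representative parameters (our own numerical illustration; the chosen values of A, D, M and S are arbitrary):

    from math import ceil

    A = D = 16           # acquisition and survivor depth (illustrative)
    S = 10               # desired speedup over a single VD
    for M in (16, 32, 64, 128):
        E = A + M + D
        L1 = ceil(E / M * S)          # acquisition method I,  eq. (15)
        L2 = ceil(E / (M + D) * S)    # acquisition method II, eq. (16)
        print(M, L1, L2)              # L2 < L1; both approach S for large M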

3.3. Acquisition Method III

The axioms of the semiring allow the ACS-recursion of the VA to be written as an algebraic vector-matrix recursion and to be transformed to an M-step recursion. If we further exploit the fact that the scalar operation ⊗ is commutative, we see that the well-known law of transposition of a matrix product of two matrices A and B holds, namely

(A ⊗ B)ᵀ = Bᵀ ⊗ Aᵀ.    (17)

Since ⊗ is associative, so also is the E-fold matrix product ᴱΛ_k, i.e.,

(ᴱΛ_k)ᵀ = (Λ_{k+E−1} ⊗ ... ⊗ Λ_k)ᵀ = Λ_kᵀ ⊗ ... ⊗ Λ_{k+E−1}ᵀ    (18)

As can be seen from (18), transposing ᴱΛ_k gives the E-fold matrix product in time-reversed order.
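The transposition law (17)/(18) holds verbatim in max-plus arithmetic and can be spot-checked (our own sketch; the helper is repeated so that the snippet is self-contained):

    import numpy as np

    def maxplus_matmul(A, B):
        return np.max(A[:, :, np.newaxis] + B[np.newaxis, :, :], axis=1)

    rng = np.random.default_rng(2)
    A = rng.normal(size=(3, 3))
    B = rng.normal(size=(3, 3))
    # (A (x) B)^T == B^T (x) A^T, since the scalar (x) (= +) is commutative
    assert np.allclose(maxplus_matmul(A, B).T,
                       maxplus_matmul(B.T, A.T))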

Acquisition methods I and II require a block-wise decoding of overlapping blocks of the trellis. The decoding of one block of the trellis corresponds to the calculation of the E-fold vector-matrix product

ᴱΛ_k ⊗ Γ₀    (19)

where Γ₀ is an arbitrarily chosen initial value for the path metrics. In this case, the interval (k, k + A) of the block is used for acquisition. If the same block is decoded in time-reversed manner, which we refer to as the backward-VA to distinguish it from the conventional forward-VA, then this corresponds to the computation of

(ᴱΛ_k)ᵀ ⊗ Γ₀    (20)

In this case, the interval (k + D + M, k + D + M + A) of the same block of the trellis is used for acquisition instead of the interval (k, k + A). The lost part of decoding due to acquisition is at the end of the block, see figure 8. Since both the forward-VA and the backward-VA process the same matrix ᴬ⁺ᴹ⁺ᴰΛ_k, the same optimum path is decoded (see also the backward equation of dynamic programming [1]).

Fig. 8. Scheme of decoded paths after processing E = A + M + D steps in time-reversed manner by the backward-VA (survivor depth D, uniquely decoded path M = E − A − D, acquisition depth A at the end of the block).

The acquisition and survivor depths of the backward-VA remain to be determined. The argument given in Section 2.5 showed that the limited survivor depth results from the fact that all columns and rows of a D-step matrix ᴰΛ_k are linearly dependent.

The same holds, of course, for the transposed matrix ᴰΛ_kᵀ, and thus it is clear that the backward-VA has the same acquisition and survivor depth as the forward-VA. Therefore, the method derived in Section 3.2 (figure 7) can be transformed to the solution shown in figure 9. This decoding/decision structure has no decision feedback from block to block, but a feedforwarding of one decision from block to block.

When implementing the parallel VD, an additional overhead arises from (15) and (16). But since this overhead depends on the freely choosable block length M, it can in principle be made arbitrarily small, in contrast to the methods described in a previous paper [11].

3.4. Ring-Architecture for the Acquisition Methods

To implement one of the solutions described in the previous sections, a number of identical parallel VDs are required, each of them periodically processing one block of E transitions every LM or every L(M + D) steps. A modular ring-architecture will now be presented for the acquisition methods.

Fig. 9. Acquisition method III: Scheme of block-wise decoding with selection of the optimal of N paths of the following block (overlap-select). In contrast to acquisition method II, each block is decoded in time-reversed direction by the backward-VA.

The complete system configuration is shown in figure 10. First, a serial-to-parallel conversion of the data is introduced, which supplies an input bus at the same low rate as the VDs operate at. To supply each VD with its blocks of E transitions, an input buffer is inserted between each VD and the input bus. This buffer, which mainly consists of one dual-port or two simple RAMs, carries out the parallel-to-serial conversion of the blocks of the trellis for the VD, the required buffering of the blocks, and the time reversion for acquisition method III (blockwise reversed decoding). All the VDs operate in a unidirectional ring to transmit the starting state from VD to VD, as in the decoding scheme in figure 7. Because of the feedforward decoding principle (acquisition method III) and because the data transfer on the ring is local from block to block only, it is not time critical. Therefore, an asynchronous data transfer can be implemented. Each VD processes the steps of the trellis of one block serially and feeds its results into its output buffer, which collects the decoded data and gives it out in parallel blocks onto the output bus. The different output buffers therefore store adjacent non-overlapping blocks of decoded data, whereas the blocks of data that are stored in the input buffers overlap by either A or A + D steps.

This architecture allows a modular realization of a VD with little peripheral hardware, where each VD-module (as indicated in figure 10) comprises one VD plus buffers and ring communication ports. Note that, when a VD-module is designed for a fixed input/output bus width, it is designed in principle for one speed of decoding. However, one complete parallel VD-system (with its ring cut open at one point) can again be implemented as one VD in a ring, surrounded by an additional hierarchy of input/output buffering in analogy to the system shown in figure 10.

Fig. 10. Modular ring architecture for the acquisition methods II and III.

To show the feasibility of this modular VD implementation, we designed a VD-module for 120 Mbit/s decoding of 6-state 8-PSK trellis-coded modulation [26] with 5 VD-modules (A = 16, K = 80). Each VD-module is a 2 μm CMOS ASIC [27], consisting of blocks of standard cells and RAMs, see figure 11. Because of the blockwise decoding, no register-exchange SMU but rather the smaller RAM-based trace-back method was implemented [28]. The VD-module consumes 64 mm² of chip area and operates at 15 MHz. Thus the solution described in this section enables the realization of high-speed parallel VDs in high-density, low-power technologies like CMOS. The resulting complexity is much lower than for an implementation of a single VD with high-speed technology (ECL) [30]-[32].

3.5. Systolic Architecture for the Acquisition Methods

Fig. 11. Layout of a VD-module for acquisition method III; a ring of 5 VD-modules achieves 120 Mbit/s: 6-state TCM 8-PSK code, 2 μm CMOS, 33 K transistors, 64 mm², VENUS/SIEMENS.

Instead of using the modular ring-architecture presented above in Section 3.4, the following systolic architecture can be applied. The main difference between the two is that the serial/parallel conversion in the latter takes place at once for the whole block of length E = A + M + D steps. Then, instead of implementing L individual VDs, we can use a systolic pipeline to carry out the (A + M + D)-fold vector-matrix product ᴬ⁺ᴹ⁺ᴰΛ_k ⊗ Γ₀. To do so, the systolic architecture, shown in figure 12 for acquisition method II and for the simple example A = D = 2 and M = 0, requires E = A + M + D TMUs to calculate the transition metrics of one full block in parallel. These TMUs are followed by a pipeline of E vector-matrix multipliers. Since the E-fold multiplication has to take place serially, a skewing buffer needs to be implemented as shown in figure 12. For easier understanding, the architecture in figure 12 is presented as if no pipelining were present between the different operations (buffers are shown by fat bars). However, the figure is laid out in such a way that one block of input data is fed through the architecture from left to right as one vertical block while it is being processed, as if the horizontal axis were the time axis. This allows the determination of the necessary amount of skewing buffers as well as the latency of decoding.

After the block is processed in the systolic pipeline, the decisions of the ACS-operations are buffered and then used to decode the starting state that needs to be transferred from block to block, see figures 7 and 9. With the help of the starting state, the previous block (stored in the buffers) is trace-back decoded and given out.

Fig. 12. Systolic architecture for acquisition methods II (and III), here for the simple example D = 2 and M = 0. The signal flow of arithmetic values is shown by solid lines, whereas the signal flow of decisions is drawn with dashed lines. The serial/parallel conversion takes place every D time instants with block length 2D, which results in a block overlap of D steps as required for acquisition methods II (and III). The vector-matrix multipliers are oval and the trace-back blocks of the SMU are square shaped.

One implementation feature of this systolic architecture is that the ACSUs operate in a systolic pipeline without feedback wiring. This allows the ACS-processing elements to be laid out efficiently according to the trellis [33].

Page 109: Parallel Processing on VLSI Arrays: A Special Issue of Journal of VLSI Signal Processing

Feedforward Architectures for Parallel Viterbi Decoding 113

Given the systolic architecture, an additional increase in processing/data rate by a factor e can be achieved by two alternatives. Either the block size is enlarged, which leads to an increase in latency by e and in buffering by e², or the systolic architecture is interpreted as one VD-module of the ring-architecture of figure 10. In the latter case, the VD-module can be configured as shown in figure 13, which indicates how the data flow of the individual VD-modules can be interleaved. An increase in decoding speed by a factor of e then leads neither to an increase in latency nor to larger buffers.

Fig. 13. Modular extendable systolic architecture for acquisition methods II (and III), here shown for D = 2 and M = 0. This architecture is obtained by adding more than one systolic architecture of figure 12 in parallel and interleaving the dataflow. Two possible chip boundaries are indicated for a modular implementation.

4. Combining Algebraic and Algorithmic Knowledge

In this section the algorithmic knowledge (limited A and D) will be applied to the M-step VA to derive additional efficient parallel VD architectures.

4.1. Elimination of the Feedback

The only feedback loop that exists in a VD is the ACS-recursion. However, with the help of (14), we are now able to rewrite the Ms-ACS-recursion as (Ms: M-step)

Γ_{k+M} = ᴹΛ_k ⊗ Γ_k = ᴹΛ_k ⊗ column_j(ᴹΛ_{k−M})    (21)

Thus, for M ≥ D, the sole feedback loop can be replaced by the expression (21), which has no feedback. As was mentioned in Section 2.6, when computing (21), the decisions of all N inner products yield the same result, namely the unique preceding state α of time k. Hence, to obtain the decoded state of time k it is sufficient to calculate only one inner product, namely

row_κ(ᴹΛ_k) ⊗ column_j(ᴹΛ_{k−M})    (22)

By computing this Ms-ACS-expression (22), the correct state α of time k is obtained. The degeneration of the Ms-ACS-recursion for M ≥ D yields this simple expression, which is purely feedforward, as shown in figure 14. This result was achieved by combining algorithmic properties of the VA with the word-level M-step transformation.

Fig. 14. Block diagram a) of the Ms-ACS-recursion and b) of the Ms-ACS-expression. The vector-matrix multiplier is drawn oval and the inner-product multiplier diamond shaped.

As mentioned above, in a practical implementation of the VA, the survivor depth can be kept finite with negligible performance loss. For this reason, the modification presented here can already be applied for moderately large M ≈ D, specifically for M = D, which is relatively small.
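A sketch of the degenerate case (our own illustration): for M ≥ D all N decision problems pick the same predecessor, so the decoded state of time k can be read off from a single max-plus inner product of one row of the M-step matrix at time k with one column of the M-step matrix at time k − M, without any path-metric feedback:

    import numpy as np

    def decoded_state(row_MLam_k, col_MLam_kM):
        """Ms-ACS-expression (22): the index maximizing the max-plus
        inner product is the decoded state alpha = z_k. Valid for
        M >= D, up to the negligible probability that the survivor
        paths have not merged within D steps."""
        return int(np.argmax(row_MLam_k + col_MLam_kM))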

4.2. Architectural Consequences

The key consequences of the elimination of the feedback loop of the M-step approach can be explained with the help of figure 15. In figure 15a, a block processing implementation according to [21] is shown for M = 2. This block architecture is characterized by the implementation of M = 2 Ms-ACS-recursions, so that the state of every time instant of one block is decoded in parallel by the M time-shifted Ms-ACS-recursions. If the block size is increased to M = 4 (for a 2-fold speedup), this results in an architecture as shown in figure 15b. However, if D equals 2, then it is sufficient to compute 2-step transition matrices ²Λ_k only. As can be seen in figure 15c, this results in a large hardware saving. Now neither the wiring nor the processing hardware nor the buffering increases more than linearly with the block size, i.e., this is a linear scale solution.

Furthermore, the architecture can easily be divided into identical modules implemented on one integrated circuit, such that the desired throughput rate determines only the number of parallel modules but does not require a redesign of the module itself. It is clear that the architecture shown in figure 15 is just one example out of a whole variety of different architectures which can be derived from the different solutions proposed for the implementation of M-step recursions [21]-[24]. The significant result for the VA is that it does not make sense to increase M beyond D for very high data rates, since it is then much more efficient to extend the architecture in analogy to figure 15c. Note that here we chose the very small values (M = 4 and D = 2) for easier understanding only.

Fig. 15. Block processing architecture with tree-like look-ahead. Matrix-matrix multipliers are round, and vector-matrix multipliers are oval. a) for step size M = 2, b) for step size M = 4, and c) for block size 4 but D = 2.

The realizability of this concept can be seen from the layout of the chip in figure 16, which we have designed for a 4-state linear convolutional code (D = 16). Four chips are needed to configure a complete D-step VD, which can be implemented in parallel to achieve even higher throughputs. On this chip, we did not implement the tree-like architecture for look-ahead computation (figure 15) but rather a linear pipeline as given in [12]-[13]. One chip achieves a decoding rate of 50 Mbit/s, so that a cascade of 20 chips realizes a 1 Gbit/s VD.

4.3. Minimized Method

To derive an extremely efficient architecture, the Ms-ACS-expression (22) must be further analyzed. As mentioned in Section 4, the computation of (22) determines the state α of time k. To be able to distinguish between decoded states of different time instants, we introduce the variable z_k (α = z_k).

For the computation of the Ms-ACS-expression, the whole M-step transition matrix need not be calculated, but only one row and one column of it. The column can be computed according to

column_j(ᴹΛ_{k−M}) = Λ_{k−1} ⊗ ... ⊗ Λ_{k−M+1} ⊗ column_j(Λ_{k−M}).    (23)

By carrying out the M-fold multiplication of (23) from right to left, i.e., in time sequence, it can be seen that this is exactly as if the conventional VA were carried out over M − 1 steps with the initial path metrics of time k − M + 1 set to Γ_{k−M+1} = column_j(Λ_{k−M}). Furthermore, since column_j(Λ_{k−M}) is made up of all the transition metrics of the branches that originate in node s_{j,k−M}, computing (23) is equivalent to decoding the rooted 1-step trellis of state s_j (s_j-1s-trellis [13]), as shown in figure 17. In addition, to be able to compute (22), a row of an M-step transition matrix needs to be determined, which, however, is equal to a column of the transposed matrix, i.e.,

row_κ(ᴹΛ_k) = column_κ(ᴹΛ_kᵀ) = Λ_kᵀ ⊗ ... ⊗ Λ_{k+M−2}ᵀ ⊗ column_κ(Λ_{k+M−1}ᵀ).    (24)

Since (24) is equivalent to (23) except that the time sequence of the M-fold multiplication is reversed, it can easily be seen that this corresponds to the decoding of a rooted trellis, but in the reversed time direction, i.e., with fixed end node s_{κ,k+M}. Therefore, (22) can be calculated as shown in figure 17 by decoding two rooted trellises, each M = D steps long, with the directions of decoding running towards each other. To compute the inner product of (22), the metrics are added at each state separately and the maximum of these N sums determines the correct state z_k (α = z_k).
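Eq. (23) states that a column of an M-step matrix is obtained by running the ordinary forward VA on a rooted trellis, and (24) that a row is obtained by the backward VA on a trellis rooted at its end node. A compact sketch of the forward case (our own illustration):

    import numpy as np

    def maxplus_matvec(A, x):
        return np.max(A + x[np.newaxis, :], axis=1)

    def column_of_M_step(lams, j):
        """Column j per eq. (23): start from column j of the earliest
        transition matrix (the rooted trellis of state s_j) and apply
        the remaining matrices in time sequence."""
        col = lams[0][:, j]          # column j of the earliest matrix
        for lam in lams[1:]:         # later matrices, in time order
            col = maxplus_matvec(lam, col)
        return col

The backward case (24) is the same loop run over the transposed matrices in reversed time order.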

4.4. Architectural Consequences

The complexity of processing a rooted trellis is at most that of decoding the original trellis over the same interval. Since the state of every time instant can be decoded by the Ms-ACS-expression (22), the whole trellis could therefore be decoded without feedback, however, with 2D times as much effort as is required with the conventional VA. It thus surely is more efficient to decode states in larger intervals and then separately decode the intervals with known beginning and end nodes. An especially efficient choice of the interval (block) length is 2D. As indicated in figure 18, the states are first decoded every 2D time steps, e.g., α := z_k and ω := z_{k+2D}.

Fig. 16. Layout of a Ms-VD (M = 4) which can be configured by chaining 4 chips to a D-step VD (D = 16): 4-state conv. code (K = 3), 50 Mbit/s, 2 μm CMOS, 120k transistors, 120 mm², ES2/CADENCE.

Fig. 17. Way of calculating the Ms-ACS-expression (22) to decode the state of time k ("z_k = α") by decoding two rooted trellises: column_j of ᴹΛ_{k−M} forward from k − M to k, and row_κ of ᴹΛ_k backward from k + M to k. (M: max).

In a second step the exact track of the path between α and ω has to be found, which is equivalent to the computation of ²ᴰλ_{ωα,k}, the (ω, α)-th element of the 2D-step transition matrix of the examined interval. Since

²ᴰΛ_k = ᴰΛ_{k+D} ⊗ ᴰΛ_k,    (25)

the metric ²ᴰλ_{ωα,k} can be determined by

²ᴰλ_{ωα,k} = row_ω(ᴰΛ_{k+D}) ⊗ column_α(ᴰΛ_k)    (26)

in analogy to (22). Therefore, as indicated in figure 18, the exact track of the path between α and ω can be found in analogy to the determination of the states α and ω. After the processing of (26), this leads to the correct state z_{k+D}. In a third step, the optimum path can be traced back. As indicated in figure 18, this method can easily be realized by implementing one part (of block length D) on a chip, which is then implemented in parallel to obtain the desired throughput rate.

Fig. 18. Scheme of the minimized method with schematic representation of decoded rooted trellises. As indicated, all that has to be computed for an interval of length D may be integrated on one cascadable chip.

4.5. Systolic Architecture

As can be seen from figure 18, the complexity of the minimized method is at most twice that of the conventional VD, independently of the number of states N. However, this conclusion holds only if the column or row of ᴹΛ_k is computed by serially decoding the rooted trellis. If the (M − 1)-fold matrix multiplication of ᴹΛ_k is computed in a tree-like structure, then the M − 1 vector-matrix multiplications of (26) are replaced by matrix-matrix multiplications of complexity O(N³).

A possible systolic architecture for the implementation of the minimized method of figure 18 is given for the simple example D = 3 in figure 19. Not shown are a serial/parallel conversion of block length 2D and an implementation of 2D TMUs that compute the transition matrices of each block in parallel. To distinguish between the time index k and the period of the block frequency, which is 2D times longer, the variable n is used as a block index. Going from left to right in figure 19, two pipelines of vector-matrix multipliers first run towards each other to decode the rooted trellises with arbitrary starting state (root). Using the results, the computation of the inner product (22) leads to the node of time n(2D), z_{n(2D)}. This node is the root of the next hierarchy of rooted trellises that have to be decoded. Now the upper pipeline computes row_{z_{n(2D)}}(ᴰΛ_{n(2D)−D}), which is then multiplied according to (22) with column_{z_{n(2D)−2D}}(ᴰΛ_{n(2D)−2D}). Since this column was computed by the lower pipeline while processing the previous input block of the time interval (n2D − 3D, n2D − D), the previous results are buffered and fed to the results of the upper pipeline to carry out the inner product

row_{z_{n(2D)}}(ᴰΛ_{n(2D)−D}) ⊗ column_{z_{n(2D)−2D}}(ᴰΛ_{n(2D)−2D})

to obtain z_{n(2D)−D}. Following the second hierarchy of pipelines, the decisions made in these pipelines are trace-back decoded to decode the desired path. Because of the processing of finite input blocks of length 2D, the results at the upper or lower edge of the systolic architecture need to be transferred to the other edge to be used for the processing of the previous or following input block.

Fig. 19. Systolic architecture for the minimized method, which can be modularly extended (here D = 3, block length 2D). The oval and diamond shaped multipliers are explained in figure 14. The layout of the figure indicates the decoding latency of 3D. The architecture can easily be divided into two (nearly) identical halves, each of which can be implemented on one chip. It is clear that the architecture can be extended to process input blocks of length of any multiple of 2D.

The architecture of figure 19 can easily be implemented on two (almost) identical chips, one for the upper half and the other for the lower half. The two identical chips simply need a control signal to identify their position and to modify the buffering accordingly. As can easily be seen, the chip needs little extension to permit, say, 4 chips to be configured for the parallel processing of input blocks of 4D, etc. Then any multiple of 2D can be obtained.

To show the realizability and the efficiency of the minimized method, we are currently implementing a VLSI ASIC (application-specific integrated circuit) for a constraint-length K = 3 convolutional code with D = 12 [20]. The design is being carried out with 1.2 μm CMOS standard cells and will use approximately 175 mm² of chip area. One chip comprises all units as indicated in figures 18 and 19 and will operate at 50 MHz. Taking this clock frequency into account results in a decoding rate per chip of 600 Mbit/s; the 2-chip configuration of figure 19 will achieve 1.2 Gbit/s, and multiples hereof are obtained by using more chips in parallel.

5. Complexity Considerations

All acquisition methods as well as the minimized method have a latency of decoding that is linear in D.


In the case of acquisition methods II and III, the minimal latency is obtained by setting M = 0, which results in a latency of O(4D). From figure 19, we see that the latency for the minimized method is O(3D). Using the Ms-VA with tree-like computation of the ᴹΛ_k as given in figure 15 results in a latency of O(log D) for M = D, whereas the Ms-VA with serial computation of ᴹΛ_k given in [13] has a latency of O(2D).

To measure the complexity of the hardware of the different solutions, the (area) × (cycle time) product, the AT-measure [34], needs to be computed for the different architectures. To simplify the results, we assume that the number of branches that leave/enter one node of the trellis is much smaller than N, and that the systolic architectures for vector-matrix and matrix-matrix multiplication of [13] are applied, so that the area is dominated by computational hardware and not by wiring. Since all architectures are purely feedforward, there is no theoretical bound on the number of pipeline stages that can be introduced, unlike the case for conventional VDs (even if bit-level parallel processing is introduced [10]). We found that it is possible and efficient to pipeline the ACS-hardware (in CMOS) so that the clock rate can still be doubled compared to a conventional VD with bit-level parallel processing as described in [10]. Therefore, the complexity measures given below should be reduced by about half when compared to the AT-measure of a conventional VD.

The tree architecture of figure 15 needs to compute matrix-matrix products and thus has an AT-measure of O(N³). The acquisition methods II and III can be measured by viewing the systolic architecture of figure 12 with M = 0 for minimal latency. Since the transition metrics of each transition are processed twice, this results in an AT-measure of O(2N), which is twice the complexity of a conventional VD (of O(N)). The SMU part of the acquisition methods is comparable to that of a conventional VD. The minimized method also processes the transition metrics of each step twice, which results in a complexity of O(2N). Thus it seems as if the new methods were twice as complex as the conventional VD. However, the minimized method decodes rooted trellises of finite length D; hence, much smaller wordlengths can be used in the ACS-hardware. Furthermore, as can be seen in figure 19, each decoding decision is processed only once in the trace-back decoding implementation. Therefore, the SMU part is much less complex than that of a conventional VD. Altogether, the complexity of the minimized method is about the same as that of conventional VDs, i.e., O(N), but the former allows unlimited parallelism. Another method to achieve similar results is to reduce the bandwidth (throughput rate) by introducing additional known bits to have the trellis run through a predetermined known state, as known, e.g., from blocked convolutional codes. Then each block can be decoded separately in parallel (as recalled, e.g., in [15]). This, however, leads to a reduction of bandwidth if the block length is short, or to a significant amount of buffering if the block length is chosen to be long (especially in the SMU, because no reduction of complexity is achieved, in contrast to the minimized method).

The results are summarized in table 1. As can be seen, minimal latency is achieved with the tree architecture of the Ms-VD of figure 15c, resulting in a complexity of O(N³). The acquisition methods are much less complex but have an increased latency. By combining knowledge at the word level as well as at the algorithmic level, the minimized method was derived. It has smaller latency than the acquisition methods and is less complex.

Table 1.

Parallelization Level    Architecture                  Latency    AT
bit level                carry-save ACSU [10]          4D         N
word level               pipeline architecture [13]    2D         N²
                         tree architecture             ld(D)      N³
algorithmic level        acquisition method            4D         N
all three levels         minimized method              3D         N

(All architectures with trace-back SMU.)

6. Conclusions

By exploiting algorithmic properties of the VA, namely the limited acquisition and survivor depths, purely feedforward parallel Viterbi decoding solutions were presented. These acquisition methods allow the implementation of high-speed VDs as linear scale solutions.

The ACS-recursion of the VA can be transformed algebraically at the word level to an M-step recursion. This allows the use of architectures known for linear recursions to achieve extremely high throughput rates. In addition, by exploiting algorithmic properties of the VA, this M-step ACS-recursion can be replaced for M ≥ D by a feedforward expression (with negligible performance loss). This allows a purely feedforward implementation of the VA by a large variety of different architectures. These parallel VDs are all linear scale solutions with respect to processing hardware as well as to wiring and buffering. This enables the architectures to be divided into identical modules which can be cascaded to achieve any desired throughput rate.

The examples discussed are only to be seen as instances of a wide space of solutions. The block processing architecture of figure 15 has an AT-complexity which is greater by O(N²) than that of a conventional VD, but it has a decoding latency of only O(log D). The minimized method has about the same complexity as a conventional VD and has the same latency of O(D). The methods of exploiting algorithmic knowledge presented in this paper allow the derivation of parallel Viterbi decoding architectures and open the way to a wide variety of further architectures.

Acknowledgment

This work was sponsored by the DFG under contract Me 651/7.

Notes

1. For convolutional codes, analytic bounds for A and D exist [25].

2. Note that if the paths do not merge in D steps, as may occur, then this leads to a decoding error both in the case of carrying out the VA from time −∞ on and when starting in midstream of the data.

References

1. R.E. Bellman and S.E. Dreyfus, Applied Dynamic Programming, Princeton, NJ: Princeton University Press, 1962.

2. J.K. Omura, "On the Viterbi decoding algorithm," IEEE Trans. Inf. Theory, 1969, pp. 177-179.

3. A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inf. Theory, 1967, pp. 260-269.

4. G.D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, March 1973, pp. 268-278.

5. S.Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice-Hall, 1987.

6. R.M. Karp, R.E. Miller and S. Winograd, "The organization of computations for uniform recurrence equations," J. ACM, vol. 14, 1967, pp. 563-590.

7. L. Thiele, "On the hierarchical design of VLSI processor arrays," Proc. IEEE Int. Symp. Circuits and Systems (ISCAS '88), Helsinki, 1988, pp. 2517-2520.

8. S.K. Rao and T. Kailath, "Regular iterative algorithms and their implementation on processor arrays," Proceedings of the IEEE, vol. 76, 1988, pp. 259-269.

9. G. Fettweis and H. Meyr, "On the interaction between DSP-algorithms and VLSI-architecture," Int. Zurich Seminar, March 1990, pp. 219-230.

10. G. Fettweis and H. Meyr, "A 100 Mbit/s Viterbi decoder chip: Novel architecture and its realization," Proc. IEEE Int. Conf. Commun. (ICC '90), 1990, paper 307.4, pp. 463-467; also in ITG-Fachbericht 110, Oct. 1989, pp. 163-168.

11. G. Fettweis and H. Meyr, "Parallel Viterbi algorithm implementation: Breaking the ACS-bottleneck," IEEE Trans. Commun., Aug. 1989, pp. 785-790; partly in Proc. IEEE ICC '88, pp. 719-723.

12. G. Fettweis and H. Meyr, "A systolic array Viterbi processor for high data rates," Int. Conf. on Systolic Arrays, Ireland, 1989; in Systolic Array Processors, Englewood Cliffs, NJ: Prentice-Hall, 1989, pp. 195-204.

13. G. Fettweis and H. Meyr, "High rate Viterbi processor: a systolic array solution," IEEE J. Sel. Areas Commun., Oct. 1990, pp. 1520-1534.

14. H.K. Thapar and J.M. Cioffi, "A block processing method for designing high-speed Viterbi detectors," Proc. IEEE ICC '89, pp. 1096-1100.

15. H.-D. Lin and D. Messerschmitt, "Algorithms and architectures for concurrent Viterbi decoding," Proc. IEEE ICC '89, pp. 836-840.

16. K.K. Parhi, "Look-ahead in dynamic programming and quantizer loops," Proc. IEEE ISCAS, Portland, 1989.

17. G. Fettweis, "Verfahren zur Ausführung des Viterbi-Algorithmus mit Hilfe parallelverarbeitender Strukturen," German patent pending, No. P3721884.0, July 2, 1987.

18. G. Fettweis and H. Meyr, "A modular variable speed Viterbi decoding implementation for high data rates," Signal Processing IV: Proc. EUSIPCO '88, North-Holland, 1988, pp. 339-342.

19. G. Fettweis and H. Meyr, "Cascaded feedforward architectures for parallel Viterbi decoding," Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS '90), 1990, pp. 1756-1759.

20. G. Fettweis, H. Dawid and H. Meyr, "Minimized method Viterbi decoding: 600 Mbit/s per chip," Proc. IEEE GLOBECOM '90, paper 808.5.

21. G. Fettweis, L. Thiele and H. Meyr, "Algorithm transformations for unlimited parallelism," Proc. IEEE ISCAS '90, MC-19; also L. Thiele and G. Fettweis, Electronics & Communications (AEÜ), vol. 44, April 1990, pp. 83-91.

22. P. Kogge and H. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Computers, vol. C-22, 1973, pp. 786-793.

23. C. Barnes and S. Shinnaka, "Block shift invariance and block implementation of discrete-time filters," IEEE Trans. Circuits and Systems, vol. CAS-27, 1980, pp. 667-672.

24. K.K. Parhi and D.G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters," IEEE Trans. Acoustics, Speech, and Signal Processing, July 1989, pp. 1099-1117.

25. A.J. Viterbi and J.K. Omura, Principles of Digital Communication and Coding, New York: McGraw-Hill, 1979, pp. 258-260.

26. M. Oerder and H. Meyr, "Rotationally Invariant Trellis Codes for MPSK Modulation," AEÜ, vol. 41, 1987, pp. 28-32.

27. E. Hoerbst, M. Nett and H. Schwartzel, Design of VLSI Circuits, Based on VENUS, Springer-Verlag, 1986.

28. G.C. Clark, Jr. and J.B. Cain, Error-Correction Coding for Digital Communications, New York: Plenum, 1981.

29. J. Stahl, H. Meyr and M. Oerder, "Implementation of a High Speed Viterbi Decoder," Signal Processing III: Proc. EUSIPCO '86, I.T. Young et al. (eds.), North-Holland, 1986, paper 02.6, pp. 1117-1120.

30. J. Snyder, "High Speed Viterbi Decoding of High Rate Codes," 7th ICDSC, Phoenix, AZ, 1983, conf. rec., pp. XII-16-XII-23.

31. T. Fujino, "A 120 Mbit/s 8PSK Modem With Soft-Decision Viterbi Decoder," ICDSC 1986, conf. rec., pp. 315-321.

32. R.J.F. Fang, "A coded 8-PSK system for 140-Mbit/s information rate transmission over 80-MHz nonlinear transponders," Proc. Int. Conf. on Digital Satellite Commun. (ICDSC), 1986, pp. 305-313.

33. H. Burkhardt and L.C. Barbosa, "Contributions to the application of the Viterbi algorithm," IEEE Trans. Information Theory, vol. IT-31, 1985, pp. 626-634.

34. J.D. Ullman, Computational Aspects of VLSI, Rockville, MD: Computer Science Press, 1984.

Gerhard Fettweis received the Dipl.-Ing. degree in electrical engineering in 1986 and the Dr.-Ing. degree (Ph.D.) in 1990, both from the Aachen University of Technology, Germany. During 1986 he was with the communications group of ABB corporate research in Switzerland to work on his diploma thesis. He is currently at the IBM Almaden Research Center in San Jose.

His interests are in digital communications, especially in the interactive design of algorithms and architectures for high-speed VLSI implementations. Gerhard Fettweis is serving as vice chairman for the 1991 IEEE Int. Workshop on Microelectronics in Communications.

Heinrich Meyr (M'75-SM'83-F'86) received the Dipl.-Ing. and Ph.D. degrees from the Swiss Federal Institute of Technology (ETH), Zurich, in 1967 and 1973, respectively.

From 1968 to 1970 he held research positions at Brown Boveri Corporation, Zurich, and the Swiss Federal Institute for Reactor Research. From 1970 to the summer of 1977 he was with Hasler Research Laboratory, Bern, Switzerland. His last position at Hasler was Manager of the Research Department. During 1974 he was a Visiting Assistant Professor with the Department of Electrical Engineering, University of Southern California, Los Angeles. Since the summer of 1977 he has been Professor of Electrical Engineering at the Aachen University of Technology (RWTH), Aachen, West Germany. His research focuses on synchronization, digital signal processing, and in particular, on algorithms and architectures suitable for VLSI implementation. He has published work in various fields and journals and holds over a dozen patents.

Dr. Meyr served as a Vice Chairman for the 1978 IEEE Zurich Seminar and as an International Chairman for the 1980 National Telecommunications Conference, Houston, TX. He served as Associate Editor for the IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING from 1982 to 1985, and as Vice President for International Affairs of the IEEE Communications Society.


Carry-Save Architectures for High-Speed Digital Signal Processing

TOBIAS G. NOLL
Corporate Research and Development, Siemens AG, ZFE ME MS 31, Otto-Hahn-Ring 6, D-8000 Munich 83, FRG

Received August 6, 1990, Revised January 10, 1991.

Abstract. Carry-save arithmetic, well known from multiplier architectures, can be used for the efficient CMOS implementation of a much wider variety of algorithms for high-speed digital signal processing than multiplication alone. Existing architectural strategies and circuit concepts for the realization of inner-product based and recursive algorithms are recalled. The two's complement overflow behavior of carry-save arithmetic is analyzed and efficient overflow correction schemes are given. Efficient approaches are presented for the carry-save implementation of a saturation control. The concepts are extended and refined for the high-throughput implementation of decision-directed algorithms such as division, modulo multiplication and CORDIC, which have so far been avoided because of a lack of efficient concepts for implementation.

It is shown that the carry-save technique can be extended to a comprehensive method for implementing high-speed DSP algorithms. Successfully fabricated commercial VLSI circuits emphasize the potential of this method.

1. Introduction

Most DSP algorithms can be decomposed into a few simple basic operations, of which addition usually turns out to be the speed-limiting operation due to the time-critical carry propagation (CP). Many strategies exist for upgrading the speed of addition by means of parallel processing:

• Carry path acceleration improves the linear dependency of the carry ripple delay τ on the wordlength m, i.e., τ = O(m), to practically τ = O(log m) for VLSI-suited tree-carry-look-ahead adders or τ = O(√m) for carry-select adders (figure 1).

• In pipelined structures additional pipelining is possible along the carry path, resulting in two-dimensionally pipelined structures with τ = O(1), but also in a considerably larger synchronization overhead.

• Residue number systems as well as redundant number systems in general avoid any CP by means of the number representation, i.e., τ = O(1), but suffer from the need for complicated conversions from and to conventional number systems, or require special adder stages much more complex than conventional adders. For example, redundant signed-digit number representations as proposed by Avizienis [1] require complicated conversions for radices r > 2 and expensive two-transfer adder stages for radix r = 2.

Fig. 1. Addition time τ_add in gate delays versus the wordlength m for a) carry ripple adders (CRA), b) block-carry-look-ahead adders (BCLA) with block length n = 4, c) carry-select adders (CSCA) and d) carry-save adders (CSA).

• Carry-save (CS-) adders are well known from their use in efficient pipelined array multipliers and as three-two counters in tree multipliers. The CS-principle was developed in the late 1940s and early 1950s for fast digital computers. The basic idea is to postpone the time-consuming CP over a number of multiple additions, i.e., τ = O(1). From the view of number representation, the saving of the carry word also results in a redundant number representation because of the multiple alternative representations of any given algebraic value (for a survey of redundant number representations refer to [2]). But for an implementation only simple, well optimized conventional full adders are required.

In Section 2, existing architectural strategies and circuit concepts for the efficient high-throughput realization of inner-product based operations using CS-adders and CS-arithmetic are recalled. Section 3 describes possibilities for the efficient implementation of CS-specific circuit components like vector-merging adders, required to perform the postponed CP. The overflow behavior of CS-arithmetic is analyzed, and efficient overflow correction schemes as well as possibilities for a CS-saturation control are presented. Examples of architectures and realized circuits implementing inner-product based, recursive and iterative decision directed algorithms are described in Section 4, demonstrating the potential of these strategies. Finally a summary is given in Section 5.

2. Architectural Strategies

For non-recursive algorithms based on the computation of inner products, such as convolution, correlation, matrix multiplication and so on, pipelining is applied for concurrency in time and space wherever possible. Here CS-arithmetic allows so-called complete pipelining at full-adder level with only a moderate amount of pipeline registers. To break up the CP-chain, the carry out of each digit (weight position) is stored (saved) in an extra register. This can be regarded as one-dimensional pipelining.

In contrast, two-dimensional pipelining is required to break up the carry path in CP-adder structures. Basic work on the use of this approach in more complex VLSI structures was done by McCanny and McWhirter (e.g., [3]) and Cappello and Steiglitz (e.g., [4]). Because in CMOS technology even area-optimized registers occupy about one third of the area of a full adder [5], two-dimensional pipelining results in a large synchronization overhead of typically 60-80% [6] or more.

Fig. 2. Comparison between a) carry ripple approach with two-dimensional pipelining and b) carry-save approach with one-dimensional pipelining.

Unfortunately, these two types of processing are sometimes mixed up, and only for certain types of array multipliers is the resulting structure the same. For clarity, the difference between the two approaches is illustrated in figure 2 for a typical adder array cascaded by an accumulator. For the carry ripple (CR-) approach the bits of y occur exactly as in the non-pipelined structure, only skewed in time. In the CS-approach carries are passed diagonally between rows rather than being allowed to ripple along the rows. As a consequence each intermediate result y

y = -y^0 + Σ_{i=1}^{m-1} y^{-i} 2^{-i} = s + c    (1)

(throughout the paper superscripts denote the weight of bits in an m-bit two's complement number, and a fractional data format is assumed) is decomposed into a combination of two vectors: the sum vector s and the carry vector c

c = -c^0 + Σ_{i=1}^{m-2} c^{-i} 2^{-i};   s = -s^0 + Σ_{i=1}^{m-1} s^{-i} 2^{-i}.    (2)

As will be shown below, the resulting CS-redundancy causes some difficulties with the overflow behavior and with exact level detection.
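At word level the decomposition (1), (2) is easy to model. The following minimal Python sketch (our own illustration with hypothetical names, not code from the paper) holds an m-bit two's complement word as a masked integer and performs one CS-adder stage on three operands without any carry propagation:

    def cs_add(x, y, z, m):
        """One carry-save stage: three m-bit two's complement words in,
        a (sum, carry) pair out, with no carry ripple (tau = O(1)).
        All arithmetic is modulo 2**m; the carry out of the sign
        position is truncated here (the carry overflow of Section 3.2)."""
        mask = (1 << m) - 1
        s = (x ^ y ^ z) & mask                           # per-bit sum
        c = (((x & y) | (x & z) | (y & z)) << 1) & mask  # saved, weight-shifted carries
        return s, c

    # x + y + z == (x ^ y ^ z) + 2*majority(x, y, z) holds bit by bit,
    # so the pair reproduces the true sum modulo 2**m:
    m = 8
    for x, y, z in [(3, 5, 7), (200, 100, 50), (255, 1, 0)]:
        s, c = cs_add(x, y, z, m)
        assert (s + c) % 2**m == (x + y + z) % 2**m

The postponed CP shows up only in the final conversion s + c, the VMA-operation of Section 3.1.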

Note that the CS-accumulator and the two-dimensionally pipelined CR-accumulator have an identical topology of adders and registers. Only the time skew of the input bits determines whether the accumulator acts as a fully bit-level pipelined CR-accumulator or as a CS-accumulator.

The frequently cited disadvantage of using only one-dimensional pipelining, the so-called broadcast problem for some of the input data, is in most applications only an academic one. Keeping in mind that in each such highly pipelined structure the much more extended clock network has to be charged and discharged twice as frequently as the data lines, it should always be possible to solve the broadcast problem without any significant loss in throughput rate or efficiency. Precalculation and pipelining of partial products as described in [7], [8] is one efficient solution to this problem.

The combination of the CS-principle with bit-level pipelining offers high computation speed, limited only by the delay time of a single (three-bit) full adder. This limit can be seen as an upper throughput bound for a given technology. In today's CMOS technologies CS-arithmetic offers the potential of clock frequencies of up to several hundred MHz [5]. But in practice, the maximum frequency is limited by clocking and I/O-bandwidth problems. So, the strategy for implementing efficient DSP circuits is to meet the required throughput rate using CS-arithmetic and then to reduce silicon area and power dissipation by applying:

• An optimized pipeline scheme [7], [8]. Because chip area increases and minimum clock period decreases with the degree of pipelining, there must be an optimal tradeoff between the degree of intermediate latching, and therefore throughput rate, and area cost. Figure 3 shows the silicon area of a typical multiplier basic cell normalized to the area of a full adder and the minimum clock period normalized to the delay of a full adder versus s, the number of full-adder operations per period. One full-adder operation per period (s = 1) is equivalent to full or bit-level pipelining with pipeline registers after each full adder, while s = 0.5 corresponds to a pipelining of half adders (i.e., EXOR gates). Obviously for s = 0.5 the silicon area and the asynchronous delay time of the registers become very significant. Therefore the efficiency η, defined as throughput rate divided by silicon area (the reciprocal of the area-period product) and also shown in figure 3, drops drastically.

An exceptionally efficient CMOS realization is possible for an operation rate of two full additions per period (s = 2) using dynamic registers, as is suggestive for high-throughput implementations. The registers can be split up into latches, each performing a delay of half a clock period, and distributed between the adder stages. The latches again can be realized using clocked transmission gates, with the input capacitances of the connected adders acting as dynamic storing elements. It can be seen from figure 3 that there is a distinct maximum of efficiency at that operation rate of two full additions per period.

Fig. 3. Silicon area A, minimum period T_p and normalized efficiency η versus the degree of pipelining or operation rate s in full additions per period.

Another aspect not taken into account in figure 3 is clock nonoverlapping, as practically necessary. For implementations with distributed registers as described above, the clock nonoverlapping does not increase the minimum possible period as long as the nonoverlap time is smaller than the asynchronous delay between the clocked transmission gates.

This optimization of the pipeline scheme also reduces all the problems associated with large clock network capacitances, like driver capability and delay, clock skew and power supply noise.

• Area efficient adder cells, e.g., the inverting full adder shown in figure 6a (additional inverters can be implemented for sum and carry to decrease the total delay at the price of larger silicon area). This adder has twice the delay time of a speed-optimized adder as used in [9] or [10], but only a quarter of the silicon area. That means the adder efficiency is increased by a factor of two. In cases where the throughput rate available from full adders with even minimum sized transistors still exceeds the required speed, possibilities for a further reduction of silicon area by means of timesharing and multiplexing have to be considered (e.g., see Section 4.1.6).

• Wherever applicable, an optimal succession of basic operations in order to keep wordlengths small, as well as proper recoding of fixed coefficients, e.g., into a canonical signed digit (CSD) code representation.

Attractive approaches for saving silicon area by an efficient ordering of basic operations are, for example, the use of bit planes (Section 4.1.2), the use of Horner's scheme, or the application of an MSB-first scheme (e.g., [11]). One advantage of the bit-plane approach is that no additional quantization noise is introduced in higher order systems from the accumulation of truncated intermediate results (accumulation error).

3. Carry-Save Specific Circuit Components

In this section special circuit components will be described that become necessary when using CS-arithmetic.

3.1. Vector Merging Adders

For the final CP, efficient vector-merging-adder (VMA-) structures were developed as conversion elements to conventional two's complement representation. In principle there are two possibilities to implement a VMA:

• Sum and carry vector can be summed up in an additional CS-half adder array as shown in figure 4a. Because this array is pipelined at half adder level, the possible maximum throughput rate is unnecessarily high compared to the actual CS-array. The VMAs given in figures 4b,c utilize the fact that a speed-optimized two- (or three-) bit ripple adder has nearly the same latency time as a single full adder (one or two additional carry gates). Therefore the resulting throughput rate is almost equal to that of the actual CS-array, but with significantly less hardware.

• The VMA can also be implemented using a speed-optimized CP-adder, e.g., of the carry-select type [12]. This is interesting in nonpipelined structures and where, because of moderate wordlength and throughput requirements, such adders can be used without slowing down the pipeline. The advantage of this approach is that the speed-optimized CP-adder, which certainly needs a much larger amount of silicon area and power than a CS-adder, is required only once at the end of the array.

In cases where several CS-operations have to be performed in cascade, it has to be considered whether it is efficient to implement a VMA at the end of every operation or not. Without a VMA the result of a CS-operation is represented with two words, and usually the next operation has to be done twice (on the sum and the carry word).

For the example of a multiplier/accumulator (cf. figure 2b) as required in [7] or [13], it is more efficient to skip the VMA between the multiplier and accumulator and to use a CS-accumulator with two inputs (figure 5). Then the resulting pipeline scheme with two additions per clock period fits well to the optimized pipelining of the actual CS-array.

Fig. 4. Block diagrams for optimized vector merging adders based on a) a half adder array and b), c) on 2-bit ripple adder arrays.

Fig. 5. Block diagram of a two-input carry-save accumulator.

Another example is the exploitation of the symmetry/antisymmetry of the coefficients in linear phase filters. The symmetrical taps of the delay line in a direct form 1 structure can be merged pairwise into an adder/subtractor before the multiplication with the common coefficient. Applying a CS-adder stage without VMA requires two multipliers per coefficient (for sum and carry), which brings no overall saving at all. Therefore, the adders/subtractors have to be implemented directly as CP-adders or VMAs. Exploiting the symmetry in a transposed direct form 1 structure results in a doubling of the number of tap adders. Layout studies show that, at least for high throughput rates and except for special cases (e.g., the vertical part of 2D-filters, cf. Section 4.1.4), these structures do not offer any advantages over an optimized filter structure without exploiting the symmetry.
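A word-level model of the two-input CS-accumulator of figure 5 simply chains two CS stages per period; the sketch below (our naming, with cs_add repeated from the Section 2 sketch) absorbs a stream of (sum, carry) pairs without ever rippling a carry:

    def cs_add(x, y, z, m):                     # as in the Section 2 sketch
        mask = (1 << m) - 1
        return (x ^ y ^ z) & mask, (((x & y) | (x & z) | (y & z)) << 1) & mask

    def cs_accumulate(pairs, m):
        """Two-input CS-accumulator (cf. figure 5): each input is itself
        a (sum, carry) pair, so two chained CS stages per period add the
        four words of state and input, still without carry propagation."""
        s_acc = c_acc = 0
        for s_in, c_in in pairs:
            s_acc, c_acc = cs_add(s_acc, c_acc, s_in, m)   # first CS stage
            s_acc, c_acc = cs_add(s_acc, c_acc, c_in, m)   # second CS stage
        return s_acc, c_acc

    m = 12
    pairs = [(17, 4), (100, 30), (2000, 500)]
    s, c = cs_accumulate(pairs, m)
    assert (s + c) % 2**m == sum(a + b for a, b in pairs) % 2**m  # final VMA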

3.2. Overflows

One particularity of CS-arithmetic is the existence of two types of overflow:

• The conventional overflow occurs if both carry and sum word have the same sign and the wordlength is too small to represent the actual magnitude of the number. This type of overflow has to be excluded by system measures, e.g., by proper saturation as described below.

• In the case that even small-magnitude numbers are represented by large carry and sum words of opposite sign, a second type of overflow, called carry overflow, occurs. This modulo overflow, caused by the usual truncation of the carry coming out of the sign-position full adder, must be corrected to avoid unnecessarily high wordlengths or errors in the succeeding operations.

3.2.1. Carry Overflow Correction. A simple example will explain the effect. The addition of three numbers 0.5 + (-0.5) + 0 = 0 in a CS-stage results in

  weight   -2^0 . 2^-1 2^-2

            0   .  1    0          0.5(d)
            1   .  1    0         -0.5(d)
            0   .  0    0          0.0(d)
           --------------------
      x     1   .  0    0         carry   -1.0(d)
            1   .  0    0         sum     -1.0(d)

(where the index (d) denotes decimal notation and x marks the truncated carry out). Truncation of the carry out from the FA in the sign position, as usual in conventional two's complement arithmetic, leads to a carry overflow, even if the given wordlength is sufficient to represent the expected result (in our example zero).

Because the intermediate (overflow) result minus two does not fit in a single two's complement number with the same wordlength, this modulo-overflow is compensated at least in a following VMA-operation

            1   .  0    0         carry
            1   .  0    0         sum
           --------------------
      x     0   .  0    0         0.0(d)
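The example can be replayed with the word-level model from the Section 2 sketch; in the 3-bit encoding below (ours: value = signed word / 4, sign weight -2^0) the two result words each read -1.0, and only the modulo wrap of the final VMA restores the expected zero:

    m, mask = 3, 0b111                                 # weights -2^0 . 2^-1 2^-2
    x, y, z = 0b010, 0b110, 0b000                      # 0.5, -0.5, 0.0
    s = (x ^ y ^ z) & mask                             # sum word:   0b100 = -1.0
    c = (((x & y) | (x & z) | (y & z)) << 1) & mask    # carry word: 0b100 = -1.0
    assert s == 0b100 and c == 0b100                   # together -2.0: carry overflow
    assert (s + c) & mask == 0b000                     # the VMA wraps modulo 2: 0.0

A scaling between the CS-stage and the VMA, however, would halve the two words before the wrap can cancel the overflow, which is exactly the failure discussed next.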

But in the case that the CS-addition is followed, e.g., by a scaling operation with proper sign bit extension, e.g., for our example [0.5 + (-0.5) + 0] · 0.5, the value of the modulo-overflow is also scaled by 0.5 and then fits into the given wordlength:


            1   .  1    0         carry scaled by 0.5
            1   .  1    0         sum scaled by 0.5
           --------------------
      x     1   .  0    0         -1.0(d)

After the VMA-operation the result is minus one, which is wrong. This overflow can be easily corrected if its value +2^1 is represented by 2^0 + 2^0 and substituted by replacing the two logical ones (of weight -2^0 each) in the sign position of the result by zeros:

           -2^0 . 2^-1 2^-2

            1   .  0    0    →    0   .  0    0         carry
            1   .  0    0    →    0   .  0    0         sum

Positive overflows can be treated in an analogous way (as can easily be exercised with the example of adding (-0.75) + (-0.75) + 0.75), and a common correction scheme can be formulated as follows: The carry-overflow is detected from c^0 ≠ c^1. As long as

-1 - 2^{-m+2} ≤ c + s ≤ 1 - 2^{-(m-1)},    (3)

that is, as long as the expected result fits in a single conventional two's complement number, s^0 ≠ c^1 holds and the correction can be performed according to

  (2^1)  -2^0 . 2^-1 2^-2 ... 2^-(m-1)

   c^1    c^0 . c^-1 c^-2 ...
          s^0 . s^-1 s^-2 ... s^-(m-1)

    →     ĉ^0 . c^-1 c^-2 ...
          ŝ^0 . s^-1 s^-2 ...                    (4)

where c^1 is the otherwise truncated carry out of the sign-position full adder and ĉ^0, ŝ^0 are the corrected sign bits.

A direct realization of (4)

ŝ^0 = s^0 ⊕ (c^1 ⊕ c^0)    (5)

(with EXOR operation ⊕) would cost approximately an additional delay time of a full adder. This disadvantage is completely avoided if a modified full-adder cell as shown in figure 6b is used in the sign position. This cell performs both the full addition and the carry overflow correction. It neither requires more silicon area nor introduces a longer delay time than the basic full adder in figure 6a. The advantage of the described correction scheme is that, without any overhead, sum and carry word are valid two's complement numbers. That is, for example, essential for the saturation control described below. In many recursive applications (e.g., z^{-1}/[1 - 0.5 z^{-1} + Δ], where Δ could stand for additional nonzero bits like Δ = 0.25 z^{-1} or for a second coefficient like Δ = b · z^{-2}) it is impracticable to avoid the carry-overflow using a simple sign bit extension for the sum word as it naturally occurs in multipliers.

For nonrecursive operations with cascaded additions it would cause a continuous growth of the wordlength with the number of operations performed, independent of the magnitude of the numbers to be represented, and therefore a waste of silicon area. For example, without overflow correction the bit-plane structure described in [8] would require about 80% more silicon area. Simple truncation of the overflow carry, on the other hand, would produce invalid filter results at the scaling operation between the bit-planes.

3.2.2. Wordlength Truncation Correction. When the maximum magnitude represented by a CS-number at a point in the signal flow is known, truncation of this number to a certain number of LSBs leads to a similar effect. Here, too, an efficient correction scheme can be given.

Properly speaking, we can generalize the carry overflow situation. For example, after a number of operations a CS-number might be represented with n bits in front of the radix point, but the magnitude at that point is bounded, e.g., by

-1 ≤ c + s ≤ 1 - 2^{-(m-1)}.    (6)

Truncation of the front bits, as possible in conventional two's complement arithmetic, will lead to overflows in CS-arithmetic. Again, a correction scheme can be formulated (proof in [14])

   c^0 . c^-1 c^-2 ...    →    ĉ^0 . c^-1 c^-2 ...
   s^0 . s^-1 s^-2 ...         ŝ^0 . s^-1 s^-2 ...        (7)

with ĉ^0 equal to the complement of c^0 if s^1 ≠ c^1 and equal to c^0 otherwise (and analogously for ŝ^0, where s^1 and c^1 are the least significant of the truncated front bits),

ensuring that after truncation and correction both sum and carry words are valid two's complement numbers.

(7) can be realized by

ŝ^0 = s^0 ⊕ (s^1 ⊕ c^1)
ĉ^0 = c^0 ⊕ (s^1 ⊕ c^1).    (8)


("

V dd

-IJ~ ~

.:- '- rl

~ 5

!-~

'--

H ~

-

-"-s

!.. YI ( d YI -(

~ r- I-------f--

~ ~

(a) - - - (b) - - - -Fig. 6. Transistor diagrams of an area optimized inverting full adder (a) and a full adder with carry-overflow correction (b).

Typical applications are recursive or iterative structures as described below.

3.3. Saturation Control

In conventional number representation the overflow detection for a saturation control, as frequently required in accumulators, recursive filters and so on, can be realized by a simple EXOR-operation. A typical disadvantage of redundant number representations is that for the required level slicing a complete CP would be necessary, which of course is undesirable at least inside recursive CS-structures. Fortunately, for saturation control a level estimation can be derived from the inspection of only a few most significant digits (MSDs). This leads to a saturation characteristic with some ranges of uncertainty, but which is sufficient for many applications.

Again we start from the number representation

  -2^0 . 2^-1 2^-2 ... 2^-(m-1)

   c^0 . c^-1 c^-2 ... c^-(m-1)
   s^0 . s^-1 s^-2 ... s^-(m-1)

(for simplicity the notation for an overflow correction is omitted). We assume that a saturation to an interval of ±0.5 has to be performed. For an inspection of, for example, only p = 2 MSDs (weights -2^0 and 2^-1) we can formulate the following saturation strategy:

• A positive saturation to c = 0.000... and s = 0.011... = 0.5 - 2^{-m+1} ≈ 0.5 is performed when both carry and sum word are positive (c^0 = s^0 = 0) and at least one of them becomes large, indicated by one of the two bits with weight 2^-1 being logical one:

POVL = ¬(c^0 ∨ s^0) ∧ (c^-1 ∨ s^-1).    (9)

• A negative saturation to c = 0.000... and s = 1.100... = -0.5 is performed when both carry and sum word are negative (c^0 = s^0 = 1) and at least one of the two bits with weight 2^-1 is logical zero:

NOVL = (c^0 ∧ s^0) ∧ ¬(c^-1 ∧ s^-1).    (10)

The resulting saturation characteristic is shown in figure 7. According to (9) the smallest positive value saturated to 0.5 - 2^{-m+1} is 0.5 and the largest nonsaturated value is 1 - 2^{-m+2}. Since the m-p less significant digits are neglected, it is not certain whether a saturation is performed for the interval 0.5 - 2^{-m+1} ≤ c + s ≤ 1 - 2^{-m+2} or not. For the negative part of the characteristic the same appears for the interval -1 < c + s < -0.5 - 2^{-m+2}. These ranges of uncertainty can be decreased by inspecting more digits for saturation control, but they disappear only if all the weights are comprised (p = m), equivalent to a complete CP.
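In Boolean form the two detection rules amount to the following sketch (function and signal names are ours):

    def saturation_flags(c0, c1, s0, s1):
        """Saturation overflow flags from only the p = 2 MSDs: c0/s0 are
        the sign bits (weight -2^0), c1/s1 the bits of weight 2^-1 of
        the carry and sum word; cf. eqs. (9) and (10)."""
        povl = (not (c0 or s0)) and (c1 or s1)   # both words positive, one large
        novl = (c0 and s0) and not (c1 and s1)   # both words negative, one small
        return povl, novl

    # 0.75 as the CS pair c = 0.00, s = 0.11 triggers a positive saturation:
    assert saturation_flags(0, 0, 0, 1) == (True, False)
    # two clearly negative words (c = 1.0x, s = 1.0x) trigger a negative one:
    assert saturation_flags(1, 0, 1, 0) == (False, True)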

The direct inspection of the bits as in (9) and (10) for p = 2 digits becomes unfavorable for larger p. A more attractive approach is to first calculate an estimate ŷ for y = c + s by performing a limited VMA-operation over the p most significant digits

ŷ = -c^0 + Σ_{i=1}^{p-1} c^{-i} 2^{-i} - s^0 + Σ_{i=1}^{p-1} s^{-i} 2^{-i}    (11)

and then compare ŷ with the two detection levels ŷ_{-S} and ŷ_{+S} according to the required saturation characteristic.


Fig. 7. Carry-save saturation characteristic for p = 2 inspected digits.

For small p, the slowdown of the pipeline due to the limited VMA-operation can be kept small by using fast CR-adders (e.g., p = 3, cf. Section 3.1).
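At word level the limited VMA-operation (11) can be modeled as below (our helper names, reusing the integer encoding of the earlier sketches); two's complement truncation of the m-p LSBs is a floor operation, so the estimate never exceeds the exact value y = c + s:

    def signed(w, m):
        """Interpret an m-bit word as a two's complement integer."""
        return w - (1 << m) if w & (1 << (m - 1)) else w

    def msd_estimate(c, s, p, m):
        """Limited VMA over the p most significant digits, eq. (11):
        the m-p LSBs of each word are discarded before the short
        addition; truncation rounds each word toward minus infinity."""
        keep = ~((1 << (m - p)) - 1)             # mask keeping the p MSDs
        return signed(c & keep, m) + signed(s & keep, m)

    # The estimate is at most 2*(2**(m-p) - 1) below the exact value:
    m, p = 8, 2
    for c, s in [(0b01100101, 0b10011010), (0b11110000, 0b00001111)]:
        y_hat = msd_estimate(c, s, p, m)
        y = signed(c, m) + signed(s, m)
        assert y_hat <= y < y_hat + 2 * (1 << (m - p))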

But for many applications, like the coefficient correlator in [7], or to ensure forced response stability in wave digital filters [15], [16], it suffices to inspect only the first p = 2 digits as described above. The VMA-operation outside the recursive loop can then be followed by an exact saturation in order to remove any uncertainty from the overall characteristic.

This approach preserves one of the main advantages of CS-arithmetic, the independence of the throughput from the wordlength (τ = O(1)).

4. Architecture Examples and Realizations

For a number of operations it has already been shown in the literature that CS-arithmetic combined with carefully designed pipeline registers and clocking systems offers the opportunity for highly efficient implementations. The use of CS-arithmetic allows excellent VLSI-relevant efficiencies and regularities. Some typical design and realization examples following that strategy will now be explained in more detail.

4.1. Inner Product Based Algorithms

4.1.1. Video Matrix with Fixed Coefficients. Many applications require only fixed coefficients and do not need the implementation of real multipliers. In those cases generating the partial products for multiplication reduces to simple hardwired shift operations, and most of the adder stages can be saved (two thirds on the average) by a recoding of the coefficients into a canonical signed digit form.
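A standard CSD (nonadjacent form) recoding can be sketched as follows; this is our illustration of the classic recoding rule, not code from the paper:

    def csd_digits(n):
        """Canonical signed digit (nonadjacent form) recoding of an
        integer: digits in {-1, 0, +1}, least significant first, with
        no two adjacent nonzero digits."""
        digits = []
        while n != 0:
            if n & 1:
                d = 2 - (n & 3)      # +1 if n % 4 == 1, -1 if n % 4 == 3
                n -= d
            else:
                d = 0
            digits.append(d)
            n >>= 1
        return digits

    # 110 = 1101110 in binary has five ones; its CSD form only three nonzeros:
    d = csd_digits(110)
    assert sum(di * 2**i for i, di in enumerate(d)) == 110
    assert sum(1 for di in d if di != 0) == 3
    assert all(d[i] == 0 or d[i + 1] == 0 for i in range(len(d) - 1))

For a hardwired coefficient every nonzero digit costs one adder/subtractor stage, which is why the recoding saves about two thirds of the stages on average.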

For example, a matrixing unit required for conversion of the R, G, and B components of color video camera signals into the luminance signal Y and the chrominance signals U and V will be presented. The color matrix equation according to CCIR recommendation 601, describing studio digital video coding, is

  [Y]           [  77  150   29 ] [R]
  [U] = 1/256 · [ -44  -87  131 ] [G]    (12)
  [V]           [ 131 -110  -21 ] [B]

The coefficients can be stripped and recoded according to figure 8a. Figure 8b shows the four least significant bit slices of the block diagram on full adder level for the Y-rail, which has the highest count of nonzero bits (11 in comparison with 10 for U and 9 for V) in a CS approach. Two kinds of simple basic cells are required, containing an inverting 24-transistor full adder and the transmission gates for pipelining.

Fig. 8. Realization example of an RGB-matrix for video signal conversion: a) simplified block diagram on word level, b) block diagram for the least significant bit-slices of the Y-rail on full adder level (pipeline latches not shown).

Depending on whether the next operation has the same or opposite sign, one or two inverters are implemented for the input signal x. The wiring of the input signals between the basic cells on layout level according to the weight of the next nonzero bit can be done easily by a simple routing program.

The complete matrix contains 35,000 transistors on about 10 mm^2 silicon area (1.5 µm CMOS) and allows a maximum clock frequency of 40 MHz under worst case conditions (i.e., 95 MHz under typical conditions).

4.1.2. Programmable Transversal Filter. For the realization of programmable transversal filters for high sample rates, one parallel hardware multiplier per coefficient is required. Figure 9a shows the signal flow graph of a transversal filter with three coefficients, each with a wordlength of 4 bits, in bit-plane (BP) form [17]. The original tap multipliers are stripped into their partial products c_j · 2^v · x, which are optimally reordered for equal weight in order to reduce the internal wordlength for realization. Each triangle stands for a simple (hardwired shift and) logical AND-operation on the input signal x.

In the first BP the least significant partial products are summed with the proper delays between the adder stages, equivalent to a parallel-in/serial-out filter structure with one-bit coefficients. After synchronizing the input signal x to the intermediate output signal, the next more significant partial products are processed in the second BP, and so on, up to the most significant BP. For proper two's complement processing the most significant partial products have to be subtracted in the last BP.

Between the BPs only a hardwired shift over one weight on the input signal is necessary. Obviously, at the same time one LSB of the intermediate output signal may be truncated without any accumulation error. Depending on the required final output wordlength this allows a significant reduction of the internal wordlength in the array and therefore of silicon area. Applying CS-arithmetic results in a computation rate of one full adder operation per period (s = 1, see figure 3).

In order to obtain the optimal operation rate of s = 2, the partial products in the BP-structure have to be resorted so that two partial products per coefficient are processed in one clock period. Figure 9b shows the resulting modified-bit-plane (MBP) structure. The registers in the MBPs are split up into latches, denoted by primes, and placed between the adder stages.


Fig. 9. Signal flow graph of a 3-tap transversal filter with 4-bit coefficient wordlength, using a) bit planes and b) modified bit planes.

Fig. 10. Comparison of typical basic cell layouts for a) two-dimensional pipelined carry ripple arrays and b) modified bit plane carry-save arrays.

Each latch performs a delay of half a clock period (z^{-1/2}) and can be efficiently realized as described in Section 2. The use of CS-arithmetic and the pairwise processing of coefficient bits results in a synchronization overhead per basic cell of only 15%. Figure 10 shows a comparison between the layout of a typical


basic cell for an approach using two-dimensional pipelined CR-arrays and for the MBP approach using CS-arrays.

Such transversal filters were realized and described in [7]. The realization of an adaptive digital equalizer built up using a 9-tap MBP transversal filter was reported in [18].

Fig. 11. Signal flow graph of a decimation filter in bit-plane structure with 4 taps and 2-bit coefficient wordlength.

4.1.3. Programmable Decimation/Interpolation Filters. Figure 11 shows the signal flow graph of a programmable decimation filter used, for example, as a fractionally spaced filter for equalization. The original filter is split up into two parts according to a polyphase structure. The two subfilters in bit-plane form are merged into a new structure very similar to the MBP structure in figure 9 (except for the multiplexer for decimation at the input and the additional input synchronization chain between the MBPs). Typical 1.5 µm CMOS basic cells allow an input sample frequency for this filter of 80 MHz under worst case conditions.

In [19], [20] a programmable interpolation filter chip also based on a polyphase and MBP structure is described. Typical samples could be verified up to 220 MHz.

4.1.4. Programmable Two-Dimensional Filters. The next example is a programmable two-dimensional transversal filter used for image enhancement and reconstruction in digital image and video processing. Figure 12 shows the signal flow graph of a 3×3 transversal filter with 4-bit coefficients. The line buffer provides all the data from previous video lines required to create the actual content of the operation window for 2D-convolution. MBP-filters connected to the input and to the outputs of the line buffer perform the pixel delay and weighting operation. The MBP-filter outputs are merged for the final output in an additional adder stage. The vertical symmetry/antisymmetry of the filter kernel, as generally required for linear phase of the transfer function in image processing, is exploited by merging symmetrical line buffer outputs in an adder/subtractor to a single (folded) filter input. Hence, only (N + 1)/2 instead of N one-dimensional subfilters are required. Figure 13 shows the chip photograph of a 7×7 transversal filter realized using this approach [21].

Fig. 12. Signal flow graph of a 3×3 transversal filter using modified bit planes and 4-bit coefficient wordlength.

Fig. 13. Chip photograph of a programmable 2D transversal filter. The 1.5 µm CMOS chip performs a 7×7 filtering at 40 MHz (typ.) and contains 280,000 transistors on a silicon area of 140 mm^2.

4.1.5. Matrix-Matrix Multiplier. Realizing that a transversal filter operation is equivalent to a matrix-vector multiplication, we can build up an N×N matrix-matrix multiplier by cascading N MBP filters in parallel as shown in figure 14 [22]. The elements of the first matrix X are connected to the filter inputs. Proper connection and skewing of the element bits of the second matrix Y to the coefficient inputs leads to the wanted result matrix at the filter outputs. An 8×8 matrix-matrix multiplier with 8-bit wordlength for a maximum clock frequency of 40 MHz takes 408,704 transistors on a silicon area of 86.8 mm^2 (1.5 µm CMOS technology).

Fig. 14. Signal flow graph of a 2×2 matrix-matrix multiplier using modified bit planes and 4-bit wordlength.

4.1.6. Two-Dimensional Discrete Cosine Transformation. Another example for the potential of these architectural strategies is an implementation of the two-dimensional discrete cosine transformation (DCT). High-speed DCT is widely required in transform image coding, e.g., for visual telephony, but also in future HDTV systems. The direct implementation of the matrix formulated transform equation

Y = C^T (X C);   Y, X, C: n×n matrices    (13)

(with X, Y, and C being the input, spectral coefficient and transform coefficient matrix, respectively) requires two matrix-matrix multiplications. Obviously a fully parallel implementation of matrix-matrix multipliers, as shown in figure 14, would overshoot the mark for most applications. Instead, each matrix-matrix multiplication can be decomposed into N inner product computations performed in parallel. Each inner product computation again can be decomposed into N multiply and add operations, serially performed in a CS-processing element (PE). Figure 15 shows the simplified block diagram. In the first linear array of N PEs the input matrix X is multiplied by the transform coefficient matrix C. After a parallel/serial conversion the intermediate result X·C is transposed and again multiplied by C in the second array. The result Y^T = (X·C)^T C is the transposed spectral coefficient matrix (13). Obviously, this circuit can perform any other separable orthogonal transformation as well, by reprogramming the coefficient ROM.
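To make the data flow concrete, here is a list-based sketch of the separable evaluation (ours; matmul and transpose are plain helpers, not the paper's PE arrays):

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def transpose(A):
        return [list(row) for row in zip(*A)]

    def separable_transform(X, C):
        """Y = C^T (X C) computed as in figure 15: one pass X*C in the
        first PE array, a transposition, and a second pass by C; the
        second array delivers Y^T = (X C)^T C, which is transposed back."""
        Y_T = matmul(transpose(matmul(X, C)), C)
        return transpose(Y_T)

    X = [[1, 2], [3, 4]]
    C = [[1, 1], [1, -1]]               # stand-in for the coefficient matrix
    assert separable_transform(X, C) == matmul(matmul(transpose(C), X), C)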

Figure 16 shows the layout photograph of a 2D-DCT chip designed in a 1.5 µm CMOS technology according to this CS-architecture [13]. The chip performs a high accuracy 8×8 block DCT or IDCT (inverse DCT) at 45 MHz, i.e., nearly HDTV performance, under worst case conditions. The regularity is as high as 99%, i.e., less than 3,000 of the 280k devices had to be designed individually.

Block sizes of 16×16 can be processed at 22.5 MHz with the same chip by decomposing matrices and vectors into 8×8 submatrices and subvectors of length 8 and timeshared use of the PE-arrays. This timesharing avoids an increase of silicon area by a factor of two for 16×16 block processing, but it reduces the throughput by a factor of two.

So not only the block sizes and accuracies are parametrizable by the number and widths of the PEs, but also the throughput rate available from that architecture. E.g., for small band ISDN picture telephony performance only two PEs are required.

Fig. 15. Simplified block diagram of a circuit for two-dimensional discrete cosine transformation using carry-save arrays.

Similar to the FFT, fast DCT-algorithms were developed in the past, allowing a significant reduction of the number of required multiplications by a factor of 5 to 10. For example, the well known Lee-algorithm requires only log2(N) multiplications per sample compared to 2N for the direct implementation described above.

A careful comparison taking the different accuracies n, block sizes N and technologies λ into account shows that the design described above features at least the same efficiency η = f/[A/(n^2 N λ^2)] as other designs based on fast DCT-algorithms and/or distributed arithmetic implementations, but at a much higher regularity and parametrizability and less design effort.

4.2. Recursive Algorithms

In recursive structures simple pipelining is not applicable. Here CS-arithmetic is a very attractive number representation in order to deal with the fixed delay of the time critical recursive loops.

Fig. 16. Layout photograph of a 2D-DCT chip. The 1.5 µm CMOS chip performs an 8×8 DCT at 45 MHz (worst case) and contains 280,000 transistors on a silicon area of 91 mm^2.

The basic idea is to postpone the CP and therefore move the CP-path out of the recursive loops into parts of the structure where the timing restrictions are relaxed by decimation or where pipelining can be applied. As already shown for the CS-accumulator in figure 2b, the output sum and carry word are fed back as separate numbers and manipulated as two's complement numbers (cf. Section 3.2). Typical examples are given in [23] and [24]. Examples of realized recursive CS-circuits are a coefficient accumulator for 40 MHz [7] or a wave digital filter (WDF) for 35 MHz (2 µm CMOS) [9], [25]. Because of the low coefficient accuracy requirements of WDFs, the number of nonzero bits can be kept small. In [9] the registers are distributed between the adder stages of the adaptor to get an optimal operation rate of s = 2 full additions per period.

Additional delays can be introduced into the loop of recursive filters by means of look-ahead approaches and pole/zero compensation at system level [26], [27]. Distribution of the registers again allows the realization of an optimal operation rate.

Another very attractive approach here is to apply MSB-first schemes. For example, in [11] an architecture for a recursive filter in direct form is proposed, using a redundant signed-digit number representation which allows an MSD-first scheme. The typical disadvantages of the two-transfer addition in radix-two systems [1] are avoided by again using a CS-structure for the multiplier arrays and applying the signed-digit number representation only to the final adder or VMA, required for the front digits in MSB-first array multipliers [5].

In general, all these approaches fail if a nonlinear operation within the recursive loop, like level slicing for a decision, calls for a CP. However, for the implementation of the Viterbi-algorithm, containing a nonlinear decision operation in the time critical recursive loop, a very efficient circuit concept was presented in [28]. Besides optimizations at system and architecture level, at circuit level a combination of CS-arithmetic with an MSD-first comparator structure is used, allowing very high throughput rates at a reasonable amount of hardware. For a decision feedback equalizer (DFE), an extension of the loop latency at system level is possible by cascading a transversal filter and a DFE. Thereby, the residual echoes of higher order from the transversal filter are cancelled in the time domain by far-off recursive taps [29].


Fig. 17. Chip photograph of a complex valued decision feedback equalizer using carry-save arithmetic. The 1.5 µm CMOS chip works at 70 MHz (typ.) and contains 62,000 transistors on a silicon area of 75 mm^2.

Nevertheless, in such cases CS-tree adder structures combined with CMOS-optimized final CP-adders, e.g., of the carry-select type, offer high performance. In [30] the realization of such a decision feedback equalizer circuit using CS-arithmetic for sample rates of up to 70 Megasamples/s (in 1.5 µm CMOS) is described. Figure 17 shows a chip photograph.

4.3. Iterative Decision Directed Algorithms

As with saturation control, the same problem occurs in the form of sign detection uncertainty in decision directed algorithms such as division, modulo multiplication and root extraction, or in CORDIC processors. Here the decision whether the next operation is an addition or a subtraction is made on the sign or amplitude of the previous intermediate result or partial remainder. Because these algorithms are iterative and decision directed but not recursive in principle, additional pipelining along the carry chain of CP-adders would be possible to reach τ = O(1). This is, however, impractical because of the amount of necessary skewing register hardware and latency time. With slight modifications of the estimation scheme derived for saturation control, all these algorithms can be implemented very efficiently in CS-arithmetic with throughput rates independent of the wordlengths (i.e., τ = O(1)).

4.3.1. Carry-Save Divider. First we consider the division b : d = q with normalized divisor d (i.e., 0.5 ≤ d < 1) according to the nonrestoring algorithm. After the initialization with

q^0 = 1 and r_1 = b - d,    (14)

the sign of the partial remainder r_k at the kth iteration determines whether the quotient bit q^{-k} is +1 or -1

q^{-k} = sgn(r_k);   q^{-k} ∈ {-1, +1}    (15)

and whether the next partial remainder is calculated by an addition or a subtraction of the divisor

r_{k+1} = 2 r_k - q^{-k} · d.    (16)

The problem of performing (15) in CS-arithmetic is that an accurate sign-determining mechanism is needed. As for the saturation control, an exact level slicing for sign detection would require a complete CP. So-called carry-save array dividers were proposed in the past using carry-look-ahead structures for a complete CP over the full wordlength in each iteration for sign detection [31], [32].

But similar to the approach for the saturation control, an estimate r̂ for the sign of the actual partial remainder can be evaluated as follows:

If, e.g., the p = 3 most significant digits are inspected, the neglection of the m-p least significant digits leads to an estimation characteristic as shown in figure 18. The diagram shows the estimated partial remainder r̂ versus the exact partial remainder r, and the dots denote right-hand open intervals. The maximum estimation error of r̂ is Δr = 2(2^{-(p-1)} - 2^{-(m-1)}), and for p = 3, m >> 1 we derive Δr ≈ 2^{-1} = 0.5 (length of intervals in figure 18). According to its LSB's weight, the quantization of r̂ is 2^{-(p-1)} = 2^{-2} = 0.25 (spacing of intervals in figure 18).

Fig. 18. Estimation characteristic for a carry-save divider for p = 3.

It is evident from figure 18 that only for the interval

-0.25 ≤ r < 0.25 with r̂ = -0.25    (17)

the sign of r̂ can deviate from the sign of r. So, only in the case with r̂ = -0.25 do we not know whether an addition or a subtraction of the divisor should be performed next to iterate the remainder to zero. Because we know that the magnitude of the partial remainder is smaller than or equal to 0.25, the best we can do is to add zero. For that case, the remainder can be completely counterbalanced in the following iterations. Consequently, we allow zero as an additional element in the quotient digit set

q^{-k} ∈ {-1, 0, +1};   q^{-k} = -1 if r̂_k < -0.25;  0 if r̂_k = -0.25;  +1 else    (18)

similar to SRT (Sweeney, Robertson and Tocher) division [31], but here for the purpose of applying CS-arithmetic instead of skipping iterations.

The largest magnitude of the partial remainder occurs in the nonrestoring algorithm using conventional arithmetic if in the (k-1)th iteration the remainder becomes zero and therefore the divisor is subtracted:

r_k = 2 · 0 - d = -d, i.e., |r_k| = d.    (19)

For the modified nonrestoring algorithm described above the same partial remainder can occur. The fact that in the case r̂ = -0.25 neither an addition nor a subtraction is performed results in the interval (p = 3)

-0.5 ≤ r_{k+1} = 2 r_k < 0.5    (20)

which is smaller than that given by (19) for any divisor. For p = 2 we derive a largest magnitude of 1.0, being larger than that given in (19) for most divisors, which proves that p = 3 is not only sufficient but also necessary.

The quotient digits generated this way will be in redundant signed digit form and can be interpreted analogously to a CS-number if we split the positive and the negative digits into two different numbers [31]. E.g., the conversion to a conventional two's complement representation

+1 . -1 0 +1 -1   ⇔   01.0010 - (00.1001)

can then be performed at the very end of the iterations in a VMA-stage, as customary in CS-implementations.

It is interesting that the basic idea of this approach was sketched in general for redundant number representation already in 1959 by Metze and Robertson [33], and was only reinvented here.

The check for r̂ = -0.25 can be implemented, for example, using

(c^0 ⊕ s^0) ∧ (c^-1 ⊕ s^-1) ∧ (c^-2 ⊕ s^-2).    (21)

But it seems more attractive to first calculate r̂ by a 3-bit CR-adder instead of direct inspection of the p = 3 digits. A problem may occur then for the (half-sized) interval at the lower left in figure 18, which is coupled to the (also half-sized) interval at the upper right because of wordlength effects, as marked by a dotted line. For example, a partial remainder of zero with c = 0.1101..., s = 1.0011... and a large divisor of d = 0.1111... result in the wrong estimate of r̂ = 0.75. One way out is to implement a proper overflow correction as described in Section 3.2 and to take the carry out from the sign position of the 3-bit ripple adder into account for the comparison. Another possibility is to implement two digits in front of the radix and consequently to use a 4-bit CR-adder even for p = 3.

The principle block diagram for one iteration is shown in figure 19. After the addition or subtraction of the divisor, the estimate is calculated from the first digits and compared to -0.25. The result is the quotient digit in signed digit representation, which controls the next operation. Obviously, the iterations can be performed serially on one single stage as shown in figure 19, or, for highest throughput, in parallel on L stages with pipeline registers in between.

Fig. 19. Principle block diagram of one iteration stage of a carry-save divider for p = 3 and a 4-bit ripple adder for estimation.
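At the algorithmic level the modified nonrestoring iteration can be simulated as below. This is our sketch in exact rational arithmetic: the p = 3 MSD inspection is modeled by flooring the remainder to the grid 2^-(p-1) = 0.25, so the case r̂ = -0.25 covers a sign-uncertain interval and selects the zero digit of (18):

    from fractions import Fraction

    def cs_divide(b, d, m, p=3):
        """Modified nonrestoring division b/d with 0.5 <= d < 1 and
        quotient digits in {-1, 0, +1}, eqs. (14)-(18).  The MSD
        inspection is modeled by flooring to the grid 2**-(p-1)."""
        grid = Fraction(1, 2 ** (p - 1))          # 0.25 for p = 3
        q, r = Fraction(1), b - d                 # initialization (14)
        for k in range(1, m + 1):
            r_hat = (r // grid) * grid            # estimated remainder
            if r_hat < -grid:
                digit = -1                        # surely negative
            elif r_hat == -grid:
                digit = 0                         # sign uncertain: add zero
            else:
                digit = +1                        # surely nonnegative
            q += Fraction(digit, 2 ** k)
            r = 2 * r - digit * d                 # eq. (16), digit 0 allowed
            assert b == q * d + r / 2 ** k        # loop invariant
        return q

    b, d = Fraction(3, 10), Fraction(3, 4)
    assert abs(cs_divide(b, d, m=16) - b / d) <= Fraction(1, 2 ** 15)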

4.3.2. Carry-Save Modulo Multiplier. Multiplication modulo a positive integer is an operation frequently used in encryption. E.g., in the RSA (Rivest, Shamir and Adleman) algorithm [34] the operation (c^d) mod n is decomposed into a series of modulo multiplications

e = (a · b) mod n    (22)

by polynomial evaluation. Using the law of congruence, this multiplication can be formulated as iterative partial product modulo additions

e_k = (2 e_{k-1} + b_{m-k} · a) mod n;   k = 1, 2, ..., m    (23)

with the initialization e_0 = 0 and e_m = e being the result. For typical applications thousands of such modulo additions with wordlengths of up to about 1,000 bits have to be performed. The goal of applying CS-arithmetic is to avoid the CP along the full wordlength during each of these additions.

Using conventional two's complement arithmetic, (23) is implemented according to [34]

e_k' = 2 e_{k-1} + b_{m-k} · a;
e_k = e_k' if e_k' < n;  e_k' - n if n ≤ e_k' < 2n;  e_k' - 2n if 2n ≤ e_k'.    (24)

In CS-arithmetic, the exact testing of the magnitude of e_k is impossible. Similar to the CS-divider approach described above, the use of an estimate ê_k is possible. Without going further into details, figure 20 shows the corresponding transfer characteristic ê_k(e_k). Depending on the actual implementation, the inspection of 4 or 5 digits of e_k is necessary (proof in [35]). In comparison to conventional arithmetic, the carry chain and therefore the time critical path is reduced from typically 1,000 bits to 4 or 5 bits, independent of the actual wordlength. Only at the very end of the whole modulo multiplication a few final CPs are required after the last iteration. Because that can be done in parallel to the next operation, an enormous increase in speed is possible by using CS-arithmetic. A parametrizable modulo multiplier macro for a clock frequency of 30 MHz (worst case, and independent of the actual wordlength) was designed in a 1.5 µm CMOS technology and takes a silicon area of 0.05 mm^2/bit.

Fig. 20. Estimation characteristic for a CS-modulo multiplier.
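For reference, the conventional two's complement recurrence (23), (24) can be sketched as follows (our code; the CS-implementation replaces the exact magnitude tests below by the MSD estimate of figure 20):

    def modmul(a, b, n, m):
        """Iterative modulo multiplication e = (a*b) mod n, eq. (23):
        the bits of b are consumed MSB first; since e < n and a < n,
        the intermediate 2*e + bit*a stays below 3n, so at most two
        conditional subtractions of n are needed per step (eq. (24))."""
        assert 0 <= a < n and 0 <= b < 2 ** m
        e = 0
        for k in range(1, m + 1):
            e = 2 * e + ((b >> (m - k)) & 1) * a   # b_{m-k} in eq. (23)
            if e >= n:
                e -= n
            if e >= n:
                e -= n
        return e

    assert modmul(1234, 5678, 10007, m=13) == (1234 * 5678) % 10007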

4.3.3. Carry-Save CORDIC Processor. As a last example, the outline for a CS-implementation of the CORDIC-algorithm [31], [36] in rotate mode will be given. Using the CORDIC algorithm, a vector P_0 = (x_0, y_0) is transformed into a vector P_N = (x_N, y_N) according to a rotation of the coordinate system by an angle of z_0. This transformation is performed iteratively by N incremental rotations using a prestored angle sequence


x_k = x_{k-1} - σ_k x̂... wait

and the final result is derived from an overall scaling by a constant factor K

x̂_N = K x_N;   ŷ_N = K y_N.    (26)

Thereby the total change of the angle is the sum of the incremental changes, which are the elements a_k of the angle sequence weighted by the rotation directions σ_k = ±1. In the rotate mode these rotation directions σ_k are determined in such a way that the total change in angle approximates z_0. Therefore, in each step σ_k is selected to decrease the magnitude of the remaining angle z_k, i.e.,

σ_k = sgn(z_{k-1}).    (27)
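In conventional floating-point arithmetic the rotate-mode iteration reads as below (our sketch; it keeps the paper's shift indexing 2^-k with k = 1, ..., N, and accumulates the scaling constant K from the step gains instead of prestoring it):

    import math

    def cordic_rotate(x0, y0, z0, N=24):
        """CORDIC in rotate mode, eqs. (25)-(27): rotate (x0, y0) by the
        angle z0 via N shift-and-add micro-rotations with the prestored
        angle sequence a_k = atan(2**-k) and directions sigma_k = sgn(z).
        With this indexing the convergence range is |z0| < ~0.95 rad."""
        x, y, z = x0, y0, z0
        K = 1.0
        for k in range(1, N + 1):
            sigma = 1.0 if z >= 0 else -1.0              # eq. (27)
            x, y = (x - sigma * y * 2.0 ** -k,           # eq. (25)
                    y + sigma * x * 2.0 ** -k)
            z -= sigma * math.atan(2.0 ** -k)
            K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * k))  # gain of this step
        return K * x, K * y                              # scaling, eq. (26)

    x, y = cordic_rotate(1.0, 0.0, 0.5)
    assert abs(x - math.cos(0.5)) < 1e-6 and abs(y - math.sin(0.5)) < 1e-6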

The problem in applying CS-arithmetic is again similar to the divider problem. But now, if the magnitude of the remaining angle is too small to allow an exact determination of the sign from the inspection of only a few MSDs, it is not possible to set σ_k = 0, because then the overall scaling factor K (which depends not on the rotation directions but on the angle sequence) would have to be changed.

Here one solution is to replicate all the elements in the angle sequence, and therefore its total length. Now each wrong rotation step can be completely counterbalanced to a net angle of zero in the next steps, which is almost equivalent to allowing a single rotation by zero. Of course, this overhead in hardware is undesirable. Fortunately, the replication of each element of the sequence is not necessary. A relation between the number of inspected digits and the number of angle elements which have to be replicated can be derived. For an inspection of p = 3 or 4 MSDs (depending on the actual implementation), typically only every second sequence element has to be replicated. In [37] the realization of a CORDIC processor using CS-arithmetic for 60 MHz clock frequency was described. Figure 21 shows the chip photograph of a realized CORDIC processor in rotate mode with 18 stages (12 stages effective).

A similar scheme can be derived for CORDIC processors in vectoring mode.

5. Conclusion

It was shown that CS-arithmetic can be applied for the efficient high-speed implementation of a much wider variety of DSP algorithms than only multiplication. Thereby, CS-arithmetic allows the use of simple optimized conventional full adders and two's complement sign representation.

As demonstrated, it is not sufficient merely to count and minimize the number of multiplications and additions at algorithm level. First of all, a careful mapping of VLSI-suited algorithms and a design at circuit and layout level ensure highest efficiency together with high regularity and parametrizability. The latter are extremely important not only to limit the design effort but also to preserve design certainty.

Even though the problems associated with the realization of large clock networks are reduced by the use of an optimized pipeline scheme as described, the increasing complexity and throughput rate of implementations call for an ingenious design of the clock system. Although it could not be described here in detail, well-contrived concepts are required for the clocking and synchronization as well as for the power supply network to obtain the full performance available from those highly pipelined circuits. Clock skew, transient current peaks and supply voltage drop are only a few catchwords here.

Applying CS-arithmetic results in new problems due to overflow effects and with basic operations like level slicing and sign detection, which are not problematic in conventional arithmetic. The general solution of these problems allows the efficient CS-implementation of previously avoided operations like division and so on, which are elementary functions, e.g., for algorithms of linear algebra such as matrix decomposition and inversion.

A last aspect that can only be mentioned here is testability. It is well known that the ease and speed with which a circuit can be tested is a very important factor in production. The use of CS-arithmetic offers advantages here in the sense of better and easier fault detectability. For example, considering pipelined adder arrays, CMOS stuck-open faults are much better detectable in CS-adders than in CP-adders, which show unexpected transitions during the ripple of the carry signal. The feature of C-testability allows the testing of large deep sequential arrays by only a few simple deterministic test patterns.

Acknowledgment

The author would like to thank all his colleagues and coworkers for the basic work referred to in this paper.



Fig. 21. Chip photograph of a rotate mode CORDIC processor using carry-save arithmetic with 18 stages (12 stages effective). The 1.5 µm CMOS chip works at 30 MHz (worst case) and contains 86,500 transistors on a silicon area of 60 mm².

References

1. A. Avizienis, "Signed-Digit Number Representations for Fast Parallel Arithmetic," IRE Trans. on Electronic Computers, 1961, pp. 389-400.

2. B. Parhami, "Generalized Signed-Digit Number Systems: A Unifying Framework for Redundant Number Representations," IEEE Trans. on Computers, vol. 39, 1990, pp. 89-98.

3. J.V. McCanny and J.G. McWhirter, "Completely Iterative, Pipelined Multiplier Array Suitable for VLSI," IEE Proc. Pt. G, vol. 129, 1982, pp. 40-46.

4. P.R. Cappello and K. Steiglitz, "A Note on 'Free Accumulation' in VLSI Filter Architectures," IEEE Trans. on Circuits and Systems, vol. CAS-32, 1985, pp. 291-296.

5. T.G. Noll et al., "A Pipelined 330-MHz Multiplier," IEEE Journal of Solid-State Circuits, vol. SC-21, 1986, pp. 411-416.

6. M. Hatamian and G.L. Cash, "Parallel Bit-Level Pipelined VLSI Designs for High-Speed Signal Processing," Proc. of the IEEE, vol. 75, 1987, pp. 1192-1202.

7. T.G. Noll, "Architektur- und Schaltungsentwurf eines digitalen adaptiven Entzerrers für den Digital-Richtfunk mit lokal systolischen Carry-Save-Arrays in CMOS-Technologie," Doctoral Dissertation, Ruhr-University Bochum, 1989.

8. T.G. Noll, "Semi-Systolic Maximum Rate Transversal Filters with Programmable Coefficients," in Systolic Arrays (W. Moore et al., Eds.), Bristol: Adam Hilger, 1987.

9. U. Kleine and M. Bohner, "A High-Speed Wave Digital Filter Using Carry-Save Arithmetic," Proc. of ESSCIRC '87, Bad Soden, 1987, pp. 43-46.

10. W. Kamp, K. Knauer, and E. Lackerschmid, "A Fast 16×16 Bit Asynchronous CMOS Multiplier," Proc. of ESSCIRC '86, Delft, 1986, pp. A4.4-A4.6.

11. S.C. Knowles et al., "Bit-Level Systolic Architectures for High Performance IIR Filtering," Journal of VLSI Signal Processing, vol. 1, 1989, pp. 9-24.

12. W. Ulbrich, A. Reiner, and T.G. Noll, "Digitales Rechenwerk," European Patent No. 0 130 397/B1.



13. U. Totzek et al., "CMOS VLSI Implementation of the 2D-DCT with Linear Processor Arrays," Proc. of ICASSP '90, Albuquerque, April 1990, pp. 937-940.

14. T.G. Noll and U. Kleine, Patent pending.

15. U. Kleine and T.G. Noll, "On the Forced Response Stability of Wave Digital Filters Using Carry-Save Arithmetic," AEÜ, Bd. 41, 1987, pp. 321-324.

16. U. Kleine and T.G. Noll, "Wave Digital Filters Using Carry-Save Arithmetic," Proc. of ISCAS '88, Espoo, 1988, pp. 1757-1762.

17. P.B. Denyer and D.J. Myers, "Carry-Save Arrays for VLSI Processing," Proc. of First Int. Conf. on VLSI, Edinburgh, 1981, pp. 151-160.

18. S.R. Meier et al., "A 2 µm CMOS Digital Adaptive Equalizer Chip for QAM Digital Radio Modems," IEEE Journal of Solid-State Circuits, vol. SC-23, 1988, pp. 1212-1217.

19. W. Haberecht, E. DeMan, and M. Schulz, "A Programmable 32 Tap Digital Interpolation Filter in 1.5 µm CMOS with 80 MHz Output Data Rate," Proc. of CICC '90, Boston, 1990, pp. S13.1.

20. E. DeMan, M. Schulz, and W. Haberecht, "A Digital Interpolation Filter Chip with 32 Programmable Coefficients for 80 MHz Sampling Frequency," IEEE Journal of Solid-State Circuits, forthcoming, 1991.

21. W. Kamp et al., "Programmable 2D Linear Filter for Video Applications," IEEE Journal of Solid-State Circuits, vol. SC-25, 1990, pp. 735-740.

22. S.R. Meier, R. Kunemund, and T.G. Noll, Patent pending.

23. W. Ulbrich and T.G. Noll, "Design of Dedicated MOS Digital Filters for High-Speed Applications," Proc. of ISCAS '85, Kyoto, 1985, pp. 255-258.

24. T.G. Noll and W. Ulbrich, "Semi-Systolic Arrays for High-Speed Digital Filters in VLSI-MOS," Proc. of IASTED '85, Paris, 1985, pp. 160-258.

25. W. Lao and H. Samueli, "Architecture and Design of a High-Speed CMOS 15th Order Half-Band Recursive Digital Filter," Proc. of Midwest Symp. on Circuits and Systems, St. Louis, 1988.

26. H.H. Loomis, "The Maximum Rate Accumulator," IEEE Trans. on Electronic Computers, vol. EC-15, 1966, pp. 628-639.

27. K.K. Parhi and D.G. Messerschmitt, "Pipeline Interleaving and Parallelism in Recursive Digital Filters-Part I," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, 1989, pp. 1099-1117.

28. G. Fettweis and H. Meyr, "A Systolic Array Viterbi Processor for High Data Rates," Proc. Int. Conf. on Systolic Arrays, Ireland, 1989; also in Systolic Array Processors, Englewood Cliffs, NJ: Prentice Hall, 1989.

29. G. Sebald, B. Lankl, and J.A. Nossek, "Advanced Time- and Frequency-Domain Adaptive Equalization in Multilevel QAM Digital Radio Systems," IEEE Journal on Selected Areas in Communications, vol. SAC-5 (3), 1987, pp. 448-456.

30. M. Schobinger, J. Hartl, and T.G. Noll, "CMOS Digital Adaptive Decision Feedback Equalizer Chip for Multilevel QAM Digital Radio Modems," Proc. of ISCAS '90, New Orleans, 1990, pp. 574-577.

31. K. Hwang, Computer Arithmetic, New York: John Wiley, 1979.

32. H.Y. Lo, "An Improvement of Nonrestoring Array Divider with Carry-Save and Carry Lookahead Techniques," IFIP Proc. of VLSI '85, Tokyo, 1985, pp. 243-251.

33. G. Metze and J.E. Robertson, "Elimination of Carry Propagation in Digital Computers," Proc. International Conf. on Information Processing, Paris, 1959, pp. 389-396.

34. E. Lu et al., "A Programmable VLSI Architecture for Computing Multiplication and Polynomial Evaluation Modulo a Positive Integer," IEEE Journal of Solid-State Circuits, vol. 23, 1988, pp. 204-207.

35. H. Schutzeneder, "Entwurf eines Carry-Save-Modulo-Multiplizierers in CMOS-Technik," Diploma Thesis, FH Munich, 1989.

36. J.E. Volder, "The CORDIC Trigonometric Computing Technique," IRE Trans. on Electronic Computers, vol. EC-8, 1959, pp. 330-334.

37. R. Kunemund et al., "CORDIC Processor with Carry-Save Architecture," Proc. of ESSCIRC '90, Grenoble, 1990, pp. 193-196.

Tobias G. Noll received the Ing. (grad.) degree from the Fachhochschule Koblenz, Germany, in 1974, the Dipl.-Ing. degree in electrical engineering from the Technical University of Munich in 1982 and the Dr.-Ing. degree from the Ruhr-University of Bochum in 1989.

From 1974 to 1976, he was with the Max-Planck-Institute for Radio Astronomy, Bonn, West Germany, working on the development of microwave components. From 1976 to 1982 he was with the MOS Integrated Circuits Department and from 1982 to 1984 he joined the MOS-Design Team trainee program of Siemens AG, Munich. Since 1984 he has been with the Corporate Research and Development Department of Siemens and since 1987 he has been head of a group of laboratories concerned with the design of algorithm-specific integrated CMOS circuits for high-speed digital signal processing.

Dr. Noll is a member of the Informationstechnische Gesellschaft (ITG) im Verein Deutscher Elektrotechniker.