

Designing Asynchronous Circuits for Low Power: An IFIR Filter Bank for a Digital Hearing Aid

LARS S. NIELSEN AND JENS SPARSØ, MEMBER, IEEE

Invited Paper

This paper addresses the design of asynchronous circuits for low power through an example: a filter bank for a digital hearing aid.

The asynchronous design re-implements an existing synchronous circuit which is used in a commercial product. For comparison, both designs have been fabricated in the same 0.7-µm CMOS technology. When processing typical data (less than 50-dB sound pressure), the asynchronous design consumes 85 µW, a fivefold reduction compared to the synchronous design. This has been achieved by the use of asynchronous control and data-path logic, an improved RAM design, and a mechanism that adapts the number range to the actual need (exploiting the fact that typical audio signals are characterized by numerically small samples). Apart from the improved RAM design, these measures are only viable in an asynchronous design.

The principles and techniques explained in this paper are of a general nature, and they apply to the design of asynchronous low-power digital signal-processing circuits in a broader perspective. In fact, this understanding is one of the contributions of the paper.

Finally, the paper can be read as an example-driven introduction to asynchronous low-power design.

Keywords: Acoustic signal processing, asynchronous logic circuits, CMOS digital integrated circuits, design methodology, FIR digital filters, power demand.

I. INTRODUCTION

A. Asynchronous Circuit Design

Most digital circuits designed and fabricated today are “synchronous.” In essence, they are based on two fundamental assumptions that greatly simplify their design: 1) all signals are binary and 2) all components share a common and discrete notion of time, as defined by a clock signal distributed throughout the circuit.

Manuscript received November 30, 1997; revised August 27, 1998. This work was supported by The Danish Technical Research Council.

L. S. Nielsen is with Oticon, Inc., Hellerup DK-2900 Denmark (e-mail: [email protected]).

J. Sparsø is with the Department of Information Technology, Technical University of Denmark, Lyngby DK-2800 Denmark (e-mail: [email protected]).

Publisher Item Identifier S 0018-9219(99)00882-8.

Asynchronous circuits are fundamentally different; they also assume binary signals, but there is no common or discrete time. Instead the circuits use handshaking among their components in order to perform the necessary synchronization, communication, and sequencing of operations. This difference gives asynchronous circuits inherent properties that can be and have been exploited to advantage in the following areas:

• lower power consumption [1]–[6];

• higher operating speed [7]–[9];

• robustness toward variations in supply voltage, temperature, and fabrication process parameters [10]–[12];

• less emission of electromagnetic noise [1], [13];

• better composability and modularity [14]–[18];

• no clock distribution and clock skew problems.

On the other hand, the asynchronous control logic that implements the handshaking normally represents an overhead in terms of silicon area, circuit speed, and power consumption. It is therefore a question whether the investment pays off, i.e., whether the use of asynchronous techniques results in a substantial improvement in one or more of the above areas.

Research in asynchronous design goes back to the mid-1950’s [14], [19], but it was not until the 1990’s that projects in academia and industry demonstrated that it is possible to design asynchronous circuits which exhibit significant benefits in nontrivial real-life examples. Low power consumption seems to be one of the more promising directions, and the design reported in this paper is one of these examples.

Many researchers, including the authors [20], have experienced that “just going asynchronous” results in larger, slower, and more power-consuming circuits. The crux is to use asynchronous techniques to exploit characteristics in the algorithm and architecture of the application in question. The work reported here represents a contribution toward a better understanding of this, as well as a contribution to the bulk of engineering experience needed to design efficient circuits.

0018–9219/99$10.00 © 1999 IEEE

PROCEEDINGS OF THE IEEE, VOL. 87, NO. 2, FEBRUARY 1999

B. Designing for Low Power

In CMOS circuits, the power consumption is mainly related to signal transitions and stems from the charging and discharging of the parasitic capacitances in transistors and wires and from short-circuit currents during switching [21]. Minimizing power consumption is therefore a question of avoiding unnecessary signal transitions which do not contribute to the computation in question [22]. In synchronous design this is addressed by stopping the clock signal in unused modules. This is called clock gating and is basically an ad hoc approach which is only manageable at a coarse-grain level.

Asynchronous circuits have the inherent property of only activating modules and storage elements where and when needed. This may be viewed as a systematic way of introducing fine-grain clock gating and variable-length clocks. For this reason, asynchronous design holds significant promise in applications characterized by a significant data-dependent variation in their computational complexity: loops where the number of iterations is data dependent, and arithmetic operations on predominantly small numbers are a few examples of such characteristics.

C. The Interpolated Finite Impulse Response (IFIR) Filter Bank Application

So far, the work reported on low-power asynchronous design has focused on microprocessor design [3], [4], and on error-correcting codes [1], [2]. The latter is an area that is characterized by algorithms where the number of steps is data dependent and where multiple clocks are involved.

The work presented here addresses a different area: digital signal processing with a moderate sampling rate (e.g., digital audio). At first glance, this application area does not seem to have any of the “right” characteristics: the algorithm involves a fixed number of steps and the environment is synchronous (due to the fixed sampling rate). However, as demonstrated in this paper, the asynchronous implementation is able to exploit low-level data dependencies in ways that are not available in synchronous design.

A seven-band IFIR filter bank served as a vehicle for this research. It is part of the fully digital hearing aid, DigiFocus, manufactured by Oticon, Inc., and it was chosen because it is a realistic industrial example where low power consumption is very important. The existing synchronous design and the asynchronous re-implementation have been fabricated in the same CMOS technology, and a fivefold reduction in power consumption has been measured [5], [6].

D. Contributions and Organization of the Paper

The paper makes several contributions: 1) the asynchronous audio filter bank chip is one of the very few existing asynchronous chips which exhibit significant advantage in a nontrivial industrial example and 2) the design falls within an application domain that has not previously been addressed by the asynchronous community and it exploits characteristics of this application in novel ways.

In addition to this, the paper can be read as an example-driven introduction to asynchronous low-power design. The IFIR filter bank is sufficiently complex to bring out some of the key characteristics of asynchronous design, and yet the algorithm and the architecture in question are sufficiently simple to be understood by a broad audience.

The paper is organized as follows: Section II describes the IFIR filter bank algorithm and the architecture used to implement it. Section III discusses the characteristics of the sampled audio input data which are exploited to minimize power consumption. Section IV gives a brief introduction to asynchronous handshaking protocols and describes the circuit implementation style used in the filter bank design. Section V explains the circuit implementation of key components in the filter bank design, and Section VI describes the physical implementation of the two chips and presents the speed and power figures of the two designs. Finally, Section VII concludes the paper.

II. ALGORITHM AND ARCHITECTURE

This section introduces the hearing-aid filter bank algorithm and describes the architecture of the circuit used to implement it.

A. Algorithm

A block diagram of the hearing aid is shown in Fig. 1(a). The filter bank splits the input signal into seven frequency bands. These are amplified individually and merged into two frequency bands. Finally, these two signals undergo additional signal processing.

The filter bank constitutes about half of the signal processing circuitry in the hearing aid. As illustrated in Fig. 1(b) and (c), it consists of a tree structure of complementary interpolated linear phase FIR filters [23]. Explaining the details of the algorithm is beyond the scope of this paper and it is not necessary for the following discussion. We only mention that much effort has been devoted to reducing the number of multiplications in order to save power: many of the coefficients are zero and the nonzero coefficients in the individual IFIR filters are symmetric around the midpoint. This allows a folded implementation that effectively halves the number of multiplications [see Fig. 1(c)]. Furthermore, the multiplications have been simplified by approximating the filter coefficients by numbers whose binary representation contains at most three ones. This approximation enables a simple implementation of the multiplier using a shift module and two adders.
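The two simplifications described above (folding around the midpoint and coefficients with at most three ones) can be illustrated with a minimal behavioral sketch. The function names, the representation of a coefficient as a tuple of right-shift amounts, and the example values are assumptions made for illustration; they are not taken from the actual filter bank.

```python
def shift_add_multiply(x, shifts):
    """Multiply integer x by sum(2**-s for s in shifts) using shifts and adds only.

    `shifts` holds one right-shift amount per '1' in the coefficient, so at
    most three terms (two additions) are needed. Results are truncated, as
    in fixed-point hardware.
    """
    assert 1 <= len(shifts) <= 3
    return sum(x >> s for s in shifts)


def folded_fir_output(window, half_coeff_shifts):
    """One output of a symmetric FIR filter with an odd number of taps.

    window: the n most recent samples (n odd).
    half_coeff_shifts: coefficients for taps 0..(n-1)//2; an empty tuple
    means a zero coefficient. Tap k is shared by window[k] and window[n-1-k].
    """
    n = len(window)
    acc = 0
    for k, shifts in enumerate(half_coeff_shifts):
        if not shifts:                      # zero coefficient: no operation
            continue
        mirror = n - 1 - k
        # add (the two samples that share the coefficient) ...
        operand = window[k] if k == mirror else window[k] + window[mirror]
        # ... multiply (shift-and-add) and accumulate
        acc += shift_add_multiply(operand, shifts)
    return acc


# Coefficients 0.25, 0, 0.625 (= 0.5 + 0.125), mirrored around the midpoint.
y = folded_fir_output([16, 32, 48, 64, 80], [(2,), (), (1, 3)])
```

In this example the five-tap filter needs only two multiplications: one for the outer pair and one for the midpoint.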

The following figures provide some indication of the complexity of the design. In some areas the figures are approximate, as we cannot disclose exact values for the hearing aid.

• The sampling rate is approximately 20 kHz and the input is linear up to a sound pressure level of 100 dB.

• The entire IFIR filter bank structure requires storage of several hundred data values. During the processing of one input sample, only one fourth of these are accessed.

• The data samples, the filter coefficients, and the internal buses are in the 15–25-bit range.

• The number of nonzero filter coefficients is around 30, and the values of several of these are identical. 29% of the filter coefficients are represented using three ones, and 48% of the coefficients are represented using only a single one (corresponding to multiplication by 0.5, 0.25, etc., which can be implemented as shift operations).

Fig. 1. Block diagrams of (a) the hearing aid, (b) the filter bank, and (c) a single IFIR filter with its primary and complementary output.

B. Architecture

The modest speed requirement (sampling rate) allows for a highly sequential implementation. The algorithm can be serialized in several dimensions: using bit-serial arithmetic units and/or serialization in the time domain by mapping the arithmetic units depicted in Fig. 1(c) onto a smaller set of hardware units.

To avoid excessive power consumption due to handshaking overhead, bit-serial implementations should be avoided [20]. Also, structures where data are copied unchanged down a chain of registers without being used should be avoided, as this consumes power without contributing to the computation. This means that a straightforward data-flow implementation following the structure in Fig. 1(c) would be a poor solution from a power consumption point of view.

These simple arguments hint that the optimal choice is a dedicated processor structure (see Fig. 2) with a single add–multiply–accumulate (AMA) data path, a RAM for the data samples, a ROM for the filter coefficients, and an address sequencing and control unit. Due to the folded structure of the IFIR filters, it is convenient to use a dual-port RAM. Using this architecture, the processing of one input sample requires a sequence of approximately 30 AMA operations, corresponding to approximately 600 000 AMA operations per second.


Fig. 2. Architecture of the IFIR filter bank processor.

Fig. 3. The address sequence for filter H7 (indicating the address pairs for the dual-port RAM).

The main task of the address sequencing and control unit is to generate the sequence of read and write addresses for the dual-port RAM. For each IFIR filter, a portion of the dual-port RAM is administered as a cyclic buffer: when time progresses one step and a new data sample is input, it is stored in the location which holds the oldest data sample that is no longer needed. The input sample stays in this location throughout its lifetime (in order to avoid power-consuming data shifts). For this reason, the addresses in a computation sequence for an IFIR filter must be offset by one from one input sample to the next. In combination with the many coefficients being zero, this results in a very irregular address sequence. As an example, filter H7 in Fig. 1(c) has 31 delay elements, and its address sequence is defined by the code fragment in Fig. 3.
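The cyclic-buffer addressing described above can be sketched in software as follows. This is an illustrative model only, not the code fragment of Fig. 3; the class name, the `age` parameter, and the choice of nonzero tap positions are hypothetical.

```python
class CyclicSampleBuffer:
    """RAM region managed as a cyclic buffer: samples never move once written."""

    def __init__(self, size):
        self.ram = [0] * size
        self.offset = 0                 # address holding the oldest sample

    def push(self, sample):
        # Overwrite the oldest sample in place; all other samples stay put.
        self.ram[self.offset] = sample
        self.offset = (self.offset + 1) % len(self.ram)

    def address(self, age):
        """Physical address of the sample that is `age` steps old (0 = newest)."""
        return (self.offset - 1 - age) % len(self.ram)


def address_pairs(buf, nonzero_taps):
    """Address pairs for a folded filter: tap k is paired with tap n-1-k."""
    n = len(buf.ram)
    return [(buf.address(k), buf.address(n - 1 - k)) for k in nonzero_taps]


buf = CyclicSampleBuffer(5)
for s in (10, 20, 30, 40, 50):
    buf.push(s)
before = address_pairs(buf, [0, 2])     # pairs for one input sample
buf.push(60)                            # next input sample arrives
after = address_pairs(buf, [0, 2])      # same pairs, each offset by one
```

Every address in `after` equals the corresponding address in `before` plus one (modulo the buffer size), which is the offset-by-one behavior the text describes.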

All IFIR filters have an odd number of delay elements, and in the actual implementation the writing of a new input sample is performed in the same step as the last read operation related to the processing of the previous input sample. Furthermore, in this last step the data path performs a multiply–accumulate–subtract (MAS) operation, thereby producing the two outputs from the IFIR filter. This means that the data path must be able to perform both AMA and MAS operations.

III. DATA DEPENDENCIES

This section reports on an analysis of typical real-life input samples and discusses the implications it has on the implementation of the filter bank. In essence, the analysis shows that the stream of input samples to the filter is characterized by a huge predominance of numerically small values and a significant correlation among the data samples.

A. Switching Activity in Sampled Audio Signals

Fig. 4(a) shows the average signal transition probabilities in a 5-s recording of several people speaking at the same time, using a 17.5-kHz sampling rate, 16-bit resolution, and two’s complement number representation. The figure shows a clear pattern which is typical in sampled real-life audio and video signals [24]. The most significant (MS) bits, 0–3, are outside the dynamic range of the signal and correspond to the sign bit and a number of sign extension bits. These bits change whenever the sign of the data changes. From a computational point of view the sign extension bits carry no information. The least significant (LS) bits, 8–15, exhibit a 50% switching probability, corresponding to random uncorrelated data. The middle bits, 4–7, correspond to a transition region where the switching probability falls from the 50% level to the level of the MS bits.
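A per-bit transition probability of the kind plotted in Fig. 4(a) can be estimated from any sample stream as sketched below. The function names and the tiny input sequences are illustrative assumptions; the recording analyzed in the paper is not reproduced here.

```python
def to_twos_complement(value, bits=16):
    """Bit pattern of a signed integer in `bits`-bit two's complement."""
    return value & ((1 << bits) - 1)


def bit_transition_probabilities(samples, bits=16):
    """Fraction of consecutive sample pairs in which each bit toggles.

    Index 0 is the MS bit and index bits-1 the LS bit, matching the bit
    numbering used in the text.
    """
    words = [to_twos_complement(s, bits) for s in samples]
    counts = [0] * bits
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur
        for i in range(bits):
            if (diff >> (bits - 1 - i)) & 1:
                counts[i] += 1
    return [c / (len(words) - 1) for c in counts]


# A small signal that alternates in sign toggles every sign extension bit.
alternating = bit_transition_probabilities([1, -1, 1, -1])
# A small positive ramp only toggles the least significant bits.
ramp = bit_transition_probabilities([3, 2, 1, 0])
```

The alternating example shows why the sign extension bits switch together with the sign bit even though they carry no information.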

The analysis of switching activity shown in Fig. 4(a) is based on several people speaking at the same time. A further analysis shows that during a normal conversation, the filter is idle (processing background noise) for 20–50% of the time due to pauses in the conversation. Over a full day this is even more predominant, and it is thought-provoking to realize that the battery lifetime is dominated by the power consumed when processing background noise.

The consequence of this is that the switching activity profile, shown in Fig. 4(a), is shifted toward the right when real-life audio is considered. Depending on the environment, the background noise can have different activity profiles, but a 40-dB sound pressure level is common to most environments. Representing this requires less than half of the bits in a 16-bit sample.

B. Switching Activity in the Data Path

The switching activity profile is entirely different inside the filter bank circuit. The data samples are accessed out of order and the correlation is lost. Furthermore, as demonstrated in [22], the choice of number representation (sign magnitude or two’s complement) has a significant impact on the switching activity. Fig. 4(b) and (c) show the activity on the output of the dual-port memory, and on the MS 16 bits of the multiplier output, when processing the audio sequence whose switching profile is shown in Fig. 4(a).

Fig. 4. (a) Average switching activity in 5 s of sampled speech using sign magnitude and two’s complement representation and (b) and (c) switching activity at internal interfaces in the data path.

Fig. 5. Slicing of the data path using tagging of the operands.

As can be seen, the two’s complement representation has a much higher switching activity than the sign-magnitude representation. The area between the two graphs represents wasted power for a two’s complement representation. At the multiplier output this overhead is more than 100%, and considering the reduced switching activity in real-life audio, the overhead can easily exceed 200%. In large circuits with heavily loaded buses, the overhead can have a significant impact on the power consumption of the circuit.
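The difference in switching activity between the two number representations can be reproduced on a toy sequence with the sketch below. The helper names and the four-sample input are assumptions chosen for illustration; the measured overheads quoted in the text come from the paper's own data, not from this example.

```python
def twos_complement(value, bits=16):
    """Bit pattern of a signed integer in two's complement."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits=16):
    """Bit pattern with a sign bit and a (bits-1)-bit magnitude.

    Assumes abs(value) fits in bits-1 bits.
    """
    sign = 1 if value < 0 else 0
    return (sign << (bits - 1)) | abs(value)


def total_toggles(samples, encode, bits=16):
    """Total number of bit transitions on a bus carrying the encoded samples."""
    words = [encode(s, bits) for s in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Numerically small samples of alternating sign.
samples = [1, -1, 2, -2]
tc = total_toggles(samples, twos_complement)
sm = total_toggles(samples, sign_magnitude)
```

For this sequence the two's complement bus toggles 44 times against 5 for sign magnitude, because every sign change flips all the sign extension bits.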

This is a strong argument for sign-magnitude representation, but it only considers the switching activity at the input and output ports of the arithmetic modules. Unfortunately, sign-magnitude addition and subtraction are complex operations to implement. A closer look at possible implementations reveals an internal switching activity similar to that of the two’s complement interfaces described above (unless positive and negative numbers are dealt with separately). Instead we use a different approach, as explained below.

C. Adapting the Number Range to the Actual Need

To avoid the excess power consumption inherent in two’s complement representation, the data path and the memory blocks have been sliced, and the circuit automatically adapts the number range to the actual need by conditional activation of the more significant slices. The mechanism is based on tagging and overflow detection. Fig. 5 illustrates this for the simplest possible case with only two slices. A 50-dB slicing point (0.5 × 100 dB) results in equally sized slices, and as the MS slice is rarely activated, this basically halves the power consumption.

The mechanism works as follows: a comparator at the input detects when the MS slice of the input sample carries redundant sign extension information, and a tag (“big” or “small”) is appended to the LS slice of the data sample. As long as the operands from the RAM’s and the intermediate results in the data path are all tagged “small” the MS slice is not activated. It is only activated if one or more “big” operands are present, or if an overflow in the LS slice occurs. Furthermore, it should be noted that the filter bank consists of filters with a finite impulse response. Therefore, it is not necessary to have a hardware mechanism that checks the result and resets the tag from “big” to “small.” A sequence of “small” input samples will eventually flush all “big” operands from the circuit. If an output sample is “small,” it is necessary to sign extend it to the full word length before outputting it. The actual implementation of the filter bank, described later in the paper, uses two slices as explained above. The same principle can be used in a scenario with more slices. As more slices call for more control logic this decision involves a tradeoff.
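The tagging rule can be modeled behaviorally as sketched below. The 8-bit LS slice width, the tuple representation of an operand, and the function names are assumptions for illustration; the sketch models only when the MS slice would be activated, not the circuit itself.

```python
SLICE_BITS = 8   # assumed width of the LS slice (illustrative value)


def fits_ls_slice(value):
    """True if the MS slice would carry only redundant sign extension."""
    return -(1 << (SLICE_BITS - 1)) <= value < (1 << (SLICE_BITS - 1))


def make_operand(value):
    """Input comparator: attach a 'small' or 'big' tag to a sample."""
    return (value, "small" if fits_ls_slice(value) else "big")


def tagged_add(a, b):
    """Add two tagged operands; return (tagged result, MS slice activated?).

    The MS slice is activated if either operand is tagged 'big' or if the
    LS-slice addition overflows. A 'big' tag is never reset to 'small'.
    """
    (va, ta), (vb, tb) = a, b
    total = va + vb
    ms_active = ta == "big" or tb == "big" or not fits_ls_slice(total)
    return (total, "big" if ms_active else "small"), ms_active


small_sum = tagged_add(make_operand(5), make_operand(-3))
overflow = tagged_add(make_operand(100), make_operand(100))
sticky = tagged_add(overflow[0], make_operand(-199))
```

The last call shows the point made in the text: the result is numerically small but stays tagged "big", and it is the finite impulse response of the filters that eventually flushes such operands.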

D. Scaling Down the Supply Voltage

Finally, it should be noted that the varying word length results in varying latencies of the operators in the data path. Because of the fixed sampling rate, the filter bank must be designed for the worst-case situation where the full word length is used. It is therefore possible to exploit the reduced latency in the typical case and obtain additional power savings by adaptively reducing the supply voltage. A mechanism for this is explained and analyzed in [12]. As the hearing aid operates from a very low battery voltage (close to the threshold voltage of the transistors), the advantage is limited. But for circuits operating from “normal” 3.3 or 5.0-V supplies the advantage will be significant.

Fig. 6. The three asynchronous protocols used in practical asynchronous designs: (a) the four-phase and the two-phase bundled-data protocols and (b) the four-phase dual-rail protocol.

IV. ASYNCHRONOUS IMPLEMENTATION STYLE

Asynchronous design is not a single well-defined method, but rather a wide spectrum of options in a number of areas: handshake protocol, circuit implementation style, and theoretical basis for the design of control circuits [25]. In the following, the choice of handshaking protocol and circuit implementation style is addressed in the context of low power consumption.

A. Asynchronous Handshake Protocols

The three commonly used asynchronous handshake protocols are illustrated in Fig. 6. As will be clear from the following, they all have different properties which affect power consumption.

The four-phase bundled-data and the two-phase bundled-data protocols in Fig. 6(a) are self-explanatory: when data are ready the sender issues a request, and when data are received, the receiver issues an acknowledge. The four-phase protocol uses signal levels to signal request and acknowledge, and the two-phase protocol uses signal transitions. In both cases it is important to ensure that the timing relation “data before request” is not violated at the receiver’s end due to different delays in the request and data wires. This is similar to the data setup-time and hold-time requirements in a synchronous circuit. It also leads to worst-case timing behavior, though only on a local scale where safety margins can be tighter.

The four-phase dual-rail protocol in Fig. 6(b) is insensitive to such delays. This is obtained by a combined encoding of data and request using two wires per data bit. This robustness comes at a very high price.

Table 1. Simple Comparison of Asynchronous Protocols

For the different protocols illustrated in Fig. 6, Table 1 shows the number of wires and the number of signal transitions, including the request and acknowledge signal wires, when communicating an N-bit data word from one module to another. The number of signal transitions is a measure of the associated energy consumption.

For the bundled-data protocols, the number of signal transitions depends on the transition probability of the individual data bits. The values quoted in Table 1 assume a worst-case switching probability of 0.5, corresponding to uncorrelated data. For the four-phase dual-rail protocol, N of the 2N data wires (one wire per data bit of the N-bit word) will make an up-going transition followed by a down-going transition. This is independent of the switching probability of the data bits. Consequently, the four-phase dual-rail protocol does not allow the designer to exploit the reduced switching activity found in many real-life data, as illustrated in Section III. Taking this into account, the four-phase dual-rail protocol can result in a power consumption that is an order of magnitude higher than that of a bundled-data protocol. Although the above arguments do not consider the switching activity inside the communicating circuit modules, the huge difference speaks for itself.
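The wire and transition counts discussed above can be worked out from the protocol descriptions as in the sketch below. The counts are derived here under stated assumptions (one request and one acknowledge wire for the bundled-data protocols, a single acknowledge wire for dual rail, and a per-bit switching probability `p_switch`); they are not copied from Table 1, and the function name is hypothetical.

```python
def protocol_costs(n_bits, p_switch=0.5):
    """Wires and expected signal transitions for transferring one N-bit word."""
    return {
        # N data wires + request + acknowledge; request and acknowledge
        # each go up and down once (4 control transitions).
        "four-phase bundled data": {
            "wires": n_bits + 2,
            "transitions": p_switch * n_bits + 4,
        },
        # Same wires; request and acknowledge each make one transition.
        "two-phase bundled data": {
            "wires": n_bits + 2,
            "transitions": p_switch * n_bits + 2,
        },
        # Two wires per bit + acknowledge; one wire per bit goes up and
        # down regardless of the data, and so does the acknowledge.
        "four-phase dual rail": {
            "wires": 2 * n_bits + 1,
            "transitions": 2 * n_bits + 2,
        },
    }


worst_case = protocol_costs(16)                # uncorrelated data
low_activity = protocol_costs(16, p_switch=0.1)  # e.g., numerically small samples
```

With a low switching probability the bundled-data figures drop while the dual-rail figure stays constant, which is the effect the paragraph above points to.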

The choice between the four-phase and the two-phase bundled-data protocol is also a simple one. In our experience, register implementations for the two-phase bundled-data protocol are significantly larger or significantly slower than the traditional latches used in four-phase designs. The same is true for the control circuitry used to implement conditional sequencing. The reader may find more details and circuit level insight about these matters in [4] and [20]. Furthermore, if the decision is on precharged logic rather than static logic, then the four-phase protocol comes as a natural choice since the request signal can directly control the precharging and evaluation of the circuits.

Fig. 7. The four-phase bundled-data circuit implementation style.

At this point it is relevant to mention that deciding on a four-phase bundled-data protocol conforms with what seems to be a general trend when focus is on power and area (and possibly also speed): Philips Research Laboratories have re-targeted their Tangram silicon compiler from four-phase dual-rail to four-phase bundled-data circuitry [2], [26], and the Amulet Group at Manchester University uses four-phase bundled-data circuitry in the second version of their asynchronous ARM microprocessor, where the first version used two-phase bundled-data circuitry.

For the sake of completeness, it should be noted that protocols other than the three described here do exist [27]–[29], but in general they are impractical.

B. Circuit Implementation Style

The asynchronous IFIR filter bank uses the four-phase bundled-data protocol, and Fig. 7 gives a flavor of the associated circuit implementation style used in the design. Clock signals for latches and registers are derived locally from the handshake signals by small speed-independent asynchronous sequential control circuits (CTL) [30]–[32]. The request signals either undergo matched delays or they are derived by detecting completion of the corresponding operations.

Completion detection is possible by local use of the four-phase dual-rail protocol. Section V provides some examples of this mixing of protocols. A reader familiar with CMOS transistor-level design will notice that the dual-rail protocol is basically what is found in differential precharged structures such as RAM’s, ROM’s, and DCVSL gates which are used in synchronous CMOS circuits [21], [33]. Furthermore, it should be noted that the dual-rail encoding shown in Fig. 6(b) is a one-hot encoding. It is often convenient to use four-phase n-rail one-hot generalizations of this protocol. Again, Section V provides some examples of this.

Using the design style described above, the difference be-tween synchronous and asynchronous circuits has more orless diminished. The low-level component implementations

are the same; the difference is that where a synchronousdesign uses clock distribution, clock buffering, and clockgating, the asynchronous circuit uses distributed asyn-chronous control to derive local clock signals for theindividual latches and registers. This style of design givesasynchronous circuits inherent properties that resembleclock gating and variable-length clocks taken to the ex-treme: registers, latches, and combinational circuits are onlyactivated where and when they are needed. However, thereis one important difference: asynchronous design offers asystematic approach to achieve this.

For this reason, asynchronous techniques are advantageous for the implementation of algorithms which exhibit irregular and/or low-level data dependencies, i.e., in situations where a synchronous implementation using clock gating is not viable.

V. CIRCUIT IMPLEMENTATION

This section provides a closer look at the overall organization of the filter bank circuit and gives a detailed description of the low-level implementation of some of the most important and interesting components.

A. Overall Organization, Data Flow, and Control

The architecture of the asynchronous IFIR filter bank is shown in Fig. 8. Comparing it with Fig. 2, it can be seen that the only difference is that the dual-port RAM and the address sequencing and control logic have been partitioned into nine IFIR modules, corresponding to the nine IFIR filters in Fig. 1(b). A top-level sequencer in turn requests each of the IFIR modules to “drive” the data path with the necessary sequence of operands (Fig. 3) and control signals, corresponding to the processing of one input sample. The coefficient module in Fig. 8 communicates directly with the multiplier in the data path, and the sequence of filter coefficients delivered by the coefficient module corresponds to a concatenation of the sequence dictated by the IFIR modules (but there is no direct communication between the two parts).

The partitioning of the dual-port RAM and the distribution of the control logic have several advantages that significantly reduce power consumption. In the RAM’s it minimizes the capacitance of the bit lines, and in the address sequencing logic it allows global buses and decoding logic to be avoided.

274 PROCEEDINGS OF THE IEEE, VOL. 87, NO. 2, FEBRUARY 1999


Fig. 8. Overall architecture of the asynchronous IFIR filter bank.

Fig. 8 provides a closer look at one of the IFIR modules, and the important details are explained here. In a standard CMOS RAM, the set of internal row-select signals follows a one-hot n-rail four-phase protocol, but the external address bus is typically encoded using a binary representation. Since the RAM’s in the nine IFIR modules are on-chip and of a moderate size, it is possible to generate the address signal directly as a one-hot encoded signal. The circuitry developed for the address sequencer consists of two cyclic one-hot counters, a step counter and an offset counter (cf. Fig. 3), from which the one-hot address is directly decoded. The address sequence is quite complex, yet the structure of one-hot counters and decoding logic provides a very power-efficient solution as explained in Section VI.
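A minimal sketch of a cyclic one-hot counter, and of decoding a one-hot row select directly from two such counters, is given below. The actual address sequence of Fig. 3 is not reproduced here: the rule that the selected row is the modular sum of step and offset is purely an illustrative assumption, and `OneHotCounter` and `combine` are hypothetical names.

```python
class OneHotCounter:
    """Cyclic one-hot counter: exactly one of n wires is high at any time."""

    def __init__(self, n):
        self.state = [1] + [0] * (n - 1)

    def step(self):
        """Move the single 1 one position; return how many wires toggled."""
        old = self.state
        self.state = old[-1:] + old[:-1]
        return sum(a != b for a, b in zip(old, self.state))

    def value(self):
        return self.state.index(1)


def combine(step_hot, offset_hot):
    """Assumed decode rule: row i is selected when (step + offset) % n == i.

    Each row select is an OR of two-input ANDs of counter wires, so no
    binary address bus and no binary-to-one-hot decoder are involved.
    """
    n = len(step_hot)
    return [
        int(any(step_hot[s] and offset_hot[(i - s) % n] for s in range(n)))
        for i in range(n)
    ]


counter = OneHotCounter(4)
toggled = counter.step()              # state becomes [0, 1, 0, 0]
row_select = combine([0, 1, 0, 0], [0, 0, 1, 0])
```

In this sketch every counter step toggles exactly two wires regardless of the count, and the decoded row select can drive a RAM word line directly.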

The same power-efficient one-hot counter was used in the top-level sequencer. Each of the bits in the one-hot code is connected directly to one of the IFIR modules, thus providing a request signal for that module. This connection is seen in the detailed view of Fig. 8.

In conclusion, the reader should notice the data-driven nature of the implementation and, in particular, the distributed, autonomous, and hierarchical structure of simple one-hot control units. The activation scheme in the control logic is very complex, and a corresponding synchronous implementation is not viable.

B. The AMA Data Path

It is desirable to have a fast processing unit with low latency but at the same time avoid pipeline registers, since such registers consume substantial power. To achieve these goals, the data path in Fig. 9 was developed. A block diagram of the data path is seen to the left, and the request flow and data flow are illustrated to the right. The figure is only an example illustrating the principle; it is not an exact diagram. For instance, in the real implementation, the adders of the multiplier are activated according to the number of ones in the coefficients. Sometimes the coefficients correspond to a simple shift, and in that case none of the adders are activated. The total number of additions for an entire AMA operation therefore varies from two to four. The logic required for this behavior is not included in the figure.
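The dependence of the number of additions on the coefficient can be sketched under two assumptions that are not stated in the text: the multiplier is a shift-and-add structure that needs one adder for every nonzero coefficient bit beyond the first, and no coefficient has more than three nonzero bits (which is what the stated range of two to four additions would imply). Sign handling is not modeled, and the function names are hypothetical.

```python
def multiplier_adders(coefficient_bits):
    """Adders activated in the multiplier for a coefficient given as a bit pattern.

    Assumption: a coefficient with k nonzero bits needs k - 1 additions,
    so a single nonzero bit (a pure shift) activates no adder at all.
    """
    ones = bin(coefficient_bits).count("1")
    return max(ones - 1, 0)


def ama_additions(coefficient_bits):
    """Total additions of one add-multiply-accumulate (AMA) operation."""
    input_add = 1      # the adder in front of the multiplier
    accumulate = 1     # the accumulator at the output
    return input_add + multiplier_adders(coefficient_bits) + accumulate


pure_shift = ama_additions(0b0100)   # coefficient is a power of two
two_ones = ama_additions(0b0101)
three_ones = ama_additions(0b0111)
```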

The interface of the data path follows the four-phase bundled-data protocol as illustrated in Fig. 9, but the data flow and the request flow are quite different inside the data path. This is illustrated by the shading and the arrows, respectively. The idea of the approach is simple: the validity of the sum bits in an adder can be guaranteed in a sequential order from the LS full adder to the MS full adder. The first full adder in the multiplier can therefore start computation immediately after the first bit is computed in the adder above it. The implementation of a dual-rail ripple-carry full adder that enables this is explained in Section V-C.

In the example in Fig. 9, the shaded full adders indicate full adders that have finished computation. It is noticed that the computational wavefront progresses diagonally rather than straight down. Forcing this request flow has two significant advantages: 1) the completion of the computation is directly indicated by the request output of the accumulator, and therefore no completion logic is required, and 2) all hazards can be eliminated if each full adder only computes sum and carry outputs once, i.e., when requested. The power consumption in such two-dimensional array structures can otherwise be quite severe [34].

The biggest disadvantage of the suggested approach is the fact that correct operation relies on delay matching. It must be guaranteed that the computation taking place in one row of full adders never “overtakes” the computation taking place in the row of full adders above it. The circuit implementation of the full adder is therefore critical and was carefully developed.

Fig. 9. Structure of and signal-flow in the data path.

Fig. 10. The full adder used in the data path.

C. The Full Adder

Fig. 10 shows a transistor diagram of the full adder used throughout the data path, as well as a detailed view of the top row of full adders in Fig. 9. As can be seen, the operand and sum bits are standard single-rail signals whereas the carry is a dual-rail signal. The dual-rail carry enables a strictly sequential evaluation order. To minimize the power consumption associated with the highly active carry signals, dynamic domino logic was used for the circuit implementation. This type of logic is well suited for dual-rail encoded signals [31], [33], and it reduces the load on signals to a minimum since all evaluation can be carried out in NMOS logic. The precharge is carried out by a single PMOS transistor, and to minimize the time required for the precharge operation all full adders are precharged in parallel.

Fig. 11. Tag control logic for add–multiply module.

The chosen circuit implementation of the full adder is derived from [35] and is shown in Fig. 10(b). To minimize the variation in propagation delay, two dummy transistors are inserted to make the number of transistors from the output node to ground identical for any path in the evaluation logic. Note also that the placement of the carry transistors was chosen so that only one transistor loads the carry signals in each of the two circuits (sum and carry).
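The way a dual-rail carry enforces a sequential evaluation order can be shown with a behavioral sketch. This is a logic-level illustration only, not a model of the transistor circuit in Fig. 10; the rail order (carry-is-1 rail, carry-is-0 rail) and the function names are assumptions.

```python
PRECHARGED = (0, 0)   # both carry rails low: no carry information yet
CARRY_ONE = (1, 0)    # assumed rail order: (carry-is-1 rail, carry-is-0 rail)
CARRY_ZERO = (0, 1)


def full_adder(a, b, carry_in):
    """Single-rail operands and sum, dual-rail carry.

    While carry_in is still precharged the cell does not evaluate: the sum is
    undefined (None) and carry_out stays precharged.
    """
    if carry_in == PRECHARGED:
        return None, PRECHARGED
    c = 1 if carry_in == CARRY_ONE else 0
    total = a + b + c
    return total & 1, (CARRY_ONE if total >= 2 else CARRY_ZERO)


def ripple_add(a_bits, b_bits, carry_in=CARRY_ZERO):
    """Add two bit vectors given LS bit first; the carry chain orders evaluation."""
    sums, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry


# 5 + 3 = 8, four bits, LS bit first.
sums, carry_out = ripple_add([1, 0, 1, 0], [1, 1, 0, 0])
idle_sums, idle_carry = ripple_add([1, 0, 1, 0], [1, 1, 0, 0], PRECHARGED)
```

A valid carry out of the MS position implies that every less significant cell has evaluated, so the final carry doubles as a completion signal.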

D. Slicing the Data Path

Section III-C explained briefly how the slicing and tagging used in the data path work. This section explains the details and shows that the associated circuit overhead is insignificant.

Consider the adder and the multiplier of the data path in Fig. 2. Adding two “small” numbers can only extend the result one bit beyond the slicing point of the adder, and all coefficients in the algorithm have a magnitude in the range [0; 0.5], which corresponds to shifting the operand at least one position toward the LS bits. For that reason, the combined add–multiply operation never causes an overflow, and the tag appended to the multiplier output is “big” only if one of the inputs to the adder is “big.” This requires only an OR-gate, as shown in Fig. 11.

Overflow may occur in the accumulator, and the tagging logic is therefore slightly more complex. Also, additional multiplexers are needed at the operand input of the MS part in order to extend the sign of the LS part into the MS part when overflow occurs. Fig. 12 shows the adder of the accumulator as well as the Boolean equations for the tagging logic.

The LS part of the adder is controlled directly by the request signal associated with the two operands, and the request input to the MS part is generated by the Tag Control circuit. To support this, the LS part generates a dual-rail encoded overflow signal. Whenever an overflow occurs in the LS part, or one of the operand inputs has a tag with value “big,” the MS part is activated, as indicated by the output of the Tag Control circuit. This signal is also the tag of the result.
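The tagging rules of this section can be summarized in a short behavioral sketch. It follows the rules as stated in the text (an OR of the input tags for the add–multiply module; overflow OR input tags for the accumulator) but it is not the Boolean logic of Figs. 11 and 12. The slice width used in the example, the two's-complement model, and all function names are assumptions.

```python
def fits(value, bits):
    """True if `value` is representable in `bits`-bit two's complement."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def tag(value, ls_bits):
    """A number is 'small' when it fits in the LS slice alone."""
    return "small" if fits(value, ls_bits) else "big"


def add_multiply_tag(tag_a, tag_b):
    """Add-multiply output is 'big' only if an adder input is 'big' (one OR)."""
    return "big" if "big" in (tag_a, tag_b) else "small"


def accumulate(a, b, ls_bits):
    """Sliced accumulator adder: returns (sum, result_tag, ms_activated)."""
    tag_a, tag_b = tag(a, ls_bits), tag(b, ls_bits)
    total = a + b
    # Overflow of the LS part: two small operands whose sum no longer fits.
    overflow = (tag_a, tag_b) == ("small", "small") and not fits(total, ls_bits)
    ms_activated = overflow or "big" in (tag_a, tag_b)
    # The activation signal is also the tag of the result.
    return total, ("big" if ms_activated else "small"), ms_activated


LS_BITS = 8  # hypothetical slice width, chosen only for the example
quiet = accumulate(100, 20, LS_BITS)
overflowing = accumulate(100, 100, LS_BITS)
mixed = accumulate(300, -299, LS_BITS)
```

Note that in this sketch a result computed from a “big” operand keeps the “big” tag even when its value would fit in the LS slice, since the tag is simply the activation signal.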

E. The Dual-Port RAM

The RAM in the filter bank is derived from a standard eight-transistor dual-port RAM cell [21]. Such a RAM typically uses differential signaling at the bit lines. If the size of the RAM is moderate, it is possible to avoid sense amplifiers, and read from the RAM using only a single bit line. In order to activate only one bit line during read operations, it was chosen to have separate access to each select transistor in the RAM cells. The RAM was further optimized by dedicating one of the ports for reading only, thereby eliminating one bit line and the associated select transistor in the RAM cells. The resulting seven-transistor RAM cell is shown to the left in Fig. 13. Word1 corresponds to the port with both read and write capabilities and word2 corresponds to the dedicated read port.

Fig. 12. Adder and tag control logic for accumulator module.

These optimizations lead to substantial power reductions, since most operations taking place in FIR filters are read operations. When random data are stored in the RAM, the number of transitions on the bit lines is halved.
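The halving for random data can be checked with a simple counting argument, sketched below. The model assumes precharged bit lines, that a differential read discharges exactly one line of each pair, and that a single-ended read discharges its line only when the cell stores a 0; the polarity and the function names are assumptions made for illustration.

```python
from itertools import product


def differential_discharges(word):
    """One of the two precharged bit lines is discharged for every bit read."""
    return len(word)


def single_ended_discharges(word):
    """The single bit line is discharged only for a stored 0 (assumed polarity)."""
    return sum(1 for bit in word if bit == 0)


def average_over_all_words(count, width):
    """Average discharge count when every `width`-bit word is equally likely."""
    words = list(product((0, 1), repeat=width))
    return sum(count(word) for word in words) / len(words)


differential = average_over_all_words(differential_discharges, 8)
single_ended = average_over_all_words(single_ended_discharges, 8)
```

Each discharge is followed by a precharge, so the number of bit-line transitions scales with these counts.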

In order to detect when the read and write operations have completed, one of the RAM cells in a word is a traditional eight-transistor dual-port RAM cell, as shown to the right in Fig. 13. The two (differential) bit lines from that RAM cell behave according to the four-phase dual-rail protocol, and completion is easily detected from the output of this cell alone: it is simply assumed that all the other bits (from the seven-transistor RAM cells) arrive at practically the same time. In total, less than 2% extra transistors are added to the RAM in order to make it self-timed and accommodate the slicing.

F. Discussion

The data path consists entirely of dynamic logic. This is truly remarkable in a two-dimensional structure of full adders and it is only possible due to the completion indication found in the carry signal. No evaluation sequence can be guaranteed in an implementation with single-rail signals. A standard synchronous implementation using domino logic would therefore only be possible if separate clock signals were provided for each full adder in the array, and if each clock signal guaranteed that all inputs to the full adder are valid and stable. It is therefore seen that a class of circuits can be implemented with asynchronous logic which simply cannot be implemented using traditional synchronous methods. Thus, the asynchronous data path has no synchronous equivalent, but the data path as a whole could of course be used as a module in an otherwise synchronous design.

Fig. 13. RAM cells and completion detection in the RAM arrays.

The data path illustrates an important characteristic of asynchronous design: guaranteed average-case latency. In the filter bank, the processing of an input sample is carried out as a sequence of AMA operations. The latency of the individual AMA operations depends on the filter coefficients. Since they are constants (built into the circuit), this latency is known in advance. It follows that the total latency of a sequence of AMA operations is also constant. From a performance point of view, this means that the total latency of a sequence of AMA operations is determined by the average latency of the individual AMA operations. In a synchronous design it is not possible to exploit this. The clock period would have to be set according to the worst-case AMA operation, or a higher rate clock in combination with a (carry–save) add–shift–accumulate unit could be used. The first solution has worst-case performance, and the latter solution results in significant excess power consumption from clocking the accumulator at a higher rate. Asynchronous design offers more freedom to the designer. In the filter bank design, the benefit of this property is marginal due to the overlapping evaluation in the adders in the data path, but in other applications the benefit may well be significant.
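The difference between the two timing disciplines can be made concrete with a small calculation. The latencies below are invented for illustration (arbitrary time units) and are not taken from the design; the function names are hypothetical.

```python
def asynchronous_latency(latencies):
    """Self-timed sequence: each operation takes only the time it needs."""
    return sum(latencies)


def synchronous_latency(latencies):
    """Single fixed clock: every operation is charged the worst-case period."""
    return len(latencies) * max(latencies)


# Hypothetical per-operation latencies of a fixed sequence of AMA operations.
ama_latencies = [2, 3, 4, 2, 2, 3]

self_timed = asynchronous_latency(ama_latencies)
clocked = synchronous_latency(ama_latencies)
average = self_timed / len(ama_latencies)
```

Because the coefficients are constants, the self-timed total is itself a constant, equal to the number of operations times their average latency, and it can therefore be guaranteed in advance.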

Finally, a word on handshake protocols. At the module level, the design uses the four-phase bundled-data protocol, but locally, inside the RAM modules and inside the data path, dual-rail signaling is used. The primary reason for this is a power-efficient circuit implementation, but it also has the advantage that completion detection becomes possible. In the filter bank this module-level completion detection comes at basically no cost. The RAM modules are inherently dual-rail and the asynchronous overhead is only 2%. In the data path, the carry out of the most significant position of the accumulator directly indicates completion. Such a hybrid use of handshake protocols is typical in the design of efficient asynchronous circuits.

VI. RESULTS

To compare synchronous and asynchronous design techniques, both the asynchronous IFIR filter bank and its synchronous counterpart have been fabricated and tested. This section reports on the physical implementation of the two designs and the measured power consumption. It also includes a breakdown of the sources of power consumption in order to provide more insight into where and how power is saved.

Fig. 14. Micrograph of the synchronous chip.

A. Physical Implementation of the Two Chips

The two designs were fabricated in pairs on the same wafer in a standard 0.7-μm CMOS technology with transistor threshold voltages V and V. Die micrographs of the two chips are shown in Figs. 14 and 15.

The layout of the synchronous design was provided by Oticon, Inc. The layout was generated automatically using standard cells and a single-port RAM generator. From the chip micrograph in Fig. 14, it is seen that all logic is placed in one block of standard cells at the bottom of the chip and that the RAM at the top of the chip has been divided into four blocks. Consequently, several IFIR filters are mapped onto each of the four RAM’s. The chip contains 48 000 transistors and the size of the core is 3.6 × 2.7 mm (excluding pad cells). The transistors in the standard cells are scaled individually with small transistors for logic and larger transistors in the output drivers. As in the asynchronous design, the RAM blocks are laid out using custom-made generators and the dimensions of the transistors are close to those used in the asynchronous design. In total, the layout of the synchronous design comprises a good low-power design.


Fig. 15. Micrograph of the asynchronous chip.

The layout of the asynchronous design involved more manual work. We developed a number of dedicated asynchronous standard cells. Examples of these are: 1) a set of Muller C-elements; 2) the precharged dual-rail-carry full adder described in the previous section; 3) various cells for building one-hot counters; and 4) a couple of latch controllers. Furthermore, an asynchronous dual-port RAM layout generator was developed (cf. Section V-E). This generator sizes transistors according to the number of words and the capacitive loading of the bit lines in order to minimize power consumption without compromising the rise and fall time of signals. The result of this effort is a highly customized and optimized set of building blocks.

In order to be able to control the delay matching (request versus data) at the many bundled-data handshake interfaces, cells and modules were assembled, placed, and routed manually. Tools for this task do exist; they were just not available in this (university) project. For this reason, the layout is not very dense. The chip micrograph is shown in Fig. 15. It consists of a data path and eight IFIR modules (one is shared by two IFIR filters). The IFIR modules are easily identified from their dense and regular RAM blocks. As explained previously, they consist of a RAM block and the associated address sequencing logic. The core of the design contains approximately 70 000 transistors, and the size of the design is 4.4 × 4.4 mm (excluding I/O pad cells).

B. Power Measurements

Two different test vector sets were developed to measure the power consumption: one set to activate only the least significant slice of the asynchronous chip (typical data), and the other to activate the two slices. Both chips were functionally correct and operational at the required sampling rate, down to a supply voltage of 1.55 V. Table 2 shows the measured power consumption of the cores. As expected, the power consumption in the asynchronous design depends on the input data. Furthermore, the table shows that the asynchronous design has a remarkably low power consumption: 4.0–5.5 times lower than the synchronous design.

Table 2. Power Consumption of the Two Chips

Table 3. Breakdown of the Power Consumption in the Asynchronous Design for Worst-Case Data (>50 dB)

In order to provide more insight into these matters, Table 3 shows the breakdown of the power consumption in the asynchronous design. This breakdown is based on HSPICE simulations of extracted layout, calibrated to match the measured power consumption in the data path and in the IFIR modules.

The power consumption of the IFIR dual-port RAM’s and the AMA array is directly proportional to the degree of slicing of the data path, whereas the power consumption in the other modules is independent of this. The slicing reduces power consumption by 30% in the typical case, and it is the single most important factor contributing to the reduced power consumption.

For worst-case data, the difference between the two designs is a factor of four. A closer look at the two designs shows that this difference is found across all modules in the two designs: approximately 1 : 4 in the RAM’s and the data path, and 1 : 6 in the control logic. For the RAM’s, the optimizations described in Section V-E account for the 1 : 4 power reduction, and the same techniques could be exploited in a synchronous design.

The power savings that can be attributed to the asynchronous techniques stem from the hierarchical one-hot address sequencing logic, the AMA, and the slicing of the data path. Altogether, this accounts for a 1 : 4.3 difference in power consumption for typical-case input data.


VII. CONCLUSION

The paper explores the use of asynchronous circuit techniques in the design of digital signal-processing circuits with low power consumption. The vehicle for this study is a seven-band IFIR filter bank. This circuit constitutes a major part of the fully digital hearing aid, DigiFocus, manufactured by Oticon, Inc., an industrial application where low power consumption is of paramount importance.

The asynchronous re-implementation and the existing synchronous counterpart were fabricated in the same 0.7-μm CMOS technology. The synchronous design contains 48 000 transistors and its power consumption is approximately 470 μW. The asynchronous design contains 70 000 transistors and its power consumption is 85 μW when processing input data corresponding to a sound pressure level less than 50 dB.

This fivefold power reduction is a strong argument in favor of asynchronous design, but a note of warning is appropriate: it is difficult to make a fair comparison of different design methods based on quantitative power figures for the resulting circuits. First, such a comparison is only valid for the particular benchmark circuit considered. Second, if the benchmark circuit is too small it may well be biased and favor one method, and if it is too complex, many factors other than the design methods themselves may offset the result. The IFIR filter bank considered in this paper represents a nontrivial yet moderately sized circuit. The synchronous design and its asynchronous counterpart use the same basic architecture: some RAM blocks and a single data path. At the layout level, the transistor sizing and standard-cell style are also comparable. Therefore, the two designs do allow the conclusion that asynchronous circuits are advantageous for implementation of low-power signal processing circuits.

Having said this, the most important contribution of the paper is that it demonstrates how asynchronous design techniques offer flexibility and freedom to exploit low-level data dependencies in the algorithm and obtain significant power savings. This comprehension is particularly interesting because, in the first place, the application seemed to be an obvious candidate for a synchronous implementation due to the fixed sampling rate and the fixed number of steps in the algorithm.

The asynchronous design exploits the fact that typical real-life audio signals are dominated by numerically small samples, and it adapts the number range to the actual need. This is implemented by slicing the data path and the RAM’s and by using a tagging and overflow detection scheme that only activates the most significant slice when it is necessary. This alone accounts for 30% of the power reduction, making it the single most important measure.

The asynchronous design uses two slices to demonstrate the principle, and the 50-dB slicing point basically distinguishes between background noise and “real” audio signals. The idea could easily be extended to more than two slices, leading to additional power savings. Because of the asynchronous implementation, the slices can have arbitrary sizes.

The slicing reduces power consumption by minimizing the switching activity in the circuit, but it also reduces the data path latency in the typical case. At the same time, the fixed sampling rate results in a worst-case design where response time must be guaranteed. This combination gives room for additional and significant power savings by adaptive scaling of the supply voltage. This option has not been pursued in the present design because of the very low nominal supply voltage in the hearing aid.

The transistor count of the asynchronous design is 45% higher than that of the synchronous design. This is due to the distributed organization of the address sequencing logic and the higher number of RAM modules, each requiring its own control and precharge logic. Such a tradeoff between area (transistor count) and power consumption is typical for low-power design in general.
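The area versus power tradeoff can be recomputed from the rounded figures quoted in this section. Since the transistor counts and the synchronous power figure are stated as approximate values, the ratios below are approximate as well; the variable names are hypothetical.

```python
# Figures quoted in the conclusion (power in microwatts, typical-case data).
sync_power_uw = 470.0
async_power_uw = 85.0
sync_transistors = 48_000
async_transistors = 70_000

# Power ratio for typical data: the upper end of the reported 4.0-5.5 range.
power_ratio = sync_power_uw / async_power_uw

# Relative increase in transistor count paid for the saving.
transistor_overhead = async_transistors / sync_transistors - 1

summary = f"{power_ratio:.1f}x less power for {transistor_overhead:.0%} more transistors"
```

With these rounded inputs the overhead comes out at roughly 46%, in line with the 45% stated in the text.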

Finally, it must be said that synchronous design and asynchronous design are not opposites. They are alternatives to the designer. Both have advantages and disadvantages, and most circuits involve both synchronous and asynchronous parts. The balance seems to be shifting toward more asynchronous circuitry, and this paper represents a contribution toward a better understanding of where and how asynchronous techniques can be exploited to advantage.

ACKNOWLEDGMENT

The authors would like to thank P. Korger for his contributions to the design and layout of the asynchronous IFIR filter bank chip. They would also like to thank P. Kokholm Sørensen and T. Christensen from Oticon, Inc. for many good discussions, and for sharing design details of the hearing aid design. The CAD tools used in the project were provided through EUROPRACTICE. The layout of the synchronous IFIR filter bank was provided by Oticon, Inc.

REFERENCES

[1] C. H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and F. Schalij, “Asynchronous circuits for low power: A DCC error corrector,” IEEE Design Test., vol. 11, no. 2, pp. 22–32, Summer 1994.

[2] K. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, F. Schalij, and R. van de Wiel, “A single-rail re-implementation of a DCC error detector using a generic standard-cell library,” in Proc. 2nd Working Conf. Asynchronous Design Methodologies, London, U.K., May 30–31, 1995, pp. 72–79.

[3] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, S. Temple, and J. V. Woods, “The design and evaluation of an asynchronous microprocessor,” in Proc. Int. Conf. Computer Design, Oct. 1994, pp. 217–220.

[4] S. B. Furber, J. D. Garside, S. Temple, J. Liu, P. Day, and N. C. Paver, “AMULET2e: An asynchronous embedded controller,” in Proc. Int. Symp. Advanced Research in Asynchronous Circuits and Systems, 1997, pp. 290–299.

[5] L. S. Nielsen, “Low-power asynchronous VLSI design,” Ph.D. dissertation, Dep. Inform. Technol., Tech. Univ. Denmark, Lyngby, IT-TR:1997-12, 1997.

[6] L. S. Nielsen and J. Sparsø, “An 85 μW asynchronous filter-bank for a digital hearing aid,” in Proc. IEEE Int. Solid-State Circuits Conf., 1998, pp. 108–109.


[7] T. E. Williams and M. A. Horowitz, “A zero-overhead self-timed 160 ns 54 bit CMOS divider,” IEEE J. Solid-State Circuits, vol. 26, pp. 1651–1661, Nov. 1991.

[8] T. Williams, N. Patkar, and G. Shen, “SPARC64: A 64-b 64-active-instruction out-of-order-execution MCM processor,” IEEE J. Solid-State Circuits, vol. 30, pp. 1215–1226, Nov. 1995.

[9] A. J. Martin, A. Lines, R. Manohar, M. Nystrom, P. Penzes, R. Southworth, U. V. Cummings, and T. K. Lee, “The design of an asynchronous MIPS R3000,” in Proc. 17th Conf. Advanced Research in VLSI, 1997, pp. 164–181.

[10] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus, “The first asynchronous microprocessor: The test results,” Comput. Architecture News, vol. 17, no. 4, pp. 95–98, 1989.

[11] C. D. Nielsen, J. Staunstrup, and S. R. Jones, “Potential performance advantages of delay-insensitivity,” in Proceedings of the IFIP Workshop on Silicon Architectures for Neural Nets, St. Paul-de-Vence, France, Nov. 1990, M. Sami and J. Calzadilla-Daguerre, Eds. Amsterdam, The Netherlands: North-Holland, 1991, pp. 367–376.

[12] L. S. Nielsen, C. Niessen, J. Sparsø, and C. H. van Berkel, “Low-power operation using self-timed circuits and adaptive scaling of the supply voltage,” IEEE Trans. VLSI Syst., vol. 2, pp. 391–397, Dec. 1994.

[13] N. C. Paver, P. Day, C. Farnsworth, D. L. Jackson, W. A. Lien, and J. Liu, “A low-power, low-noise configurable self-timed DSP,” in Proc. Int. Symp. Advanced Research in Asynchronous Circuits and Systems, 1998, pp. 32–42.

[14] D. E. Muller, “Asynchronous logics and application to information processing,” in Proc. Symp. Application of Switching Theory in Space Technology, H. Aiken and W. F. Main, Eds., 1963, pp. 289–297.

[15] A. J. Martin, “Compiling communicating processes into delay-insensitive VLSI circuits,” Distrib. Comput., vol. 1, no. 4, pp. 226–234, 1986.

[16] C. H. (Kees) van Berkel, C. Niessen, M. Rem, and R. W. J. J. Saeijs, “VLSI programming and silicon compilation,” in Proc. Int. Conf. Computer Design (ICCD), 1988, pp. 150–166.

[17] I. E. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, no. 6, pp. 720–738, June 1989.

[18] J. Sparsø and J. Staunstrup, “Delay-insensitive multi-ring structures,” Integr. VLSI J., vol. 15, no. 3, pp. 313–340, Oct. 1993.

[19] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in Proceedings of an International Symposium on the Theory of Switching, Cambridge, Apr. 1957, Part I (Annals of the Computation Laboratory of Harvard University), vol. XXIX. Cambridge, MA: Harvard Univ. Press, 1959, pp. 204–243.

[20] J. Sparsø, C. D. Nielsen, L. S. Nielsen, and J. Staunstrup, “Design of self-timed multipliers: A comparison,” IFIP Trans., vol. A-28, pp. 165–180, 1993.

[21] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed. Reading, MA: Addison-Wesley, 1993.

[22] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consumption in digital CMOS circuits,” Proc. IEEE, vol. 83, pp. 498–523, Apr. 1995.

[23] T. Lunner and J. Hellgren, “A digital filterbank hearing aid: Design, implementation and evaluation,” in Proc. ICASSP’91, Toronto, Ont., Canada, 1991, pp. 3661–3664.

[24] P. Landman and J. Rabaey, “Architectural power analysis: The dual bit type method,” IEEE Trans. VLSI Syst., vol. 3, pp. 173–187, June 1995.

[25] S. Hauck, “Asynchronous design methodologies: An overview,” Proc. IEEE, vol. 83, pp. 69–93, Jan. 1995.

[26] A. Peeters and K. van Berkel, “Single-rail handshake circuits,” in Proc. 2nd Working Conf. Asynchronous Design Methodologies, London, U.K., May 30–31, 1995, pp. 53–62.

[27] C. L. Seitz, “System timing,” in Introduction to VLSI Systems, C. A. Mead and L. A. Conway, Eds. Reading, MA: Addison-Wesley, 1980, ch. 7.

[28] T. Verhoeff, “Delay-insensitive codes: An overview,” Distrib. Comput., vol. 3, no. 1, pp. 1–8, 1988.

[29] M. Dean, T. Williams, and D. Dill, “Efficient self-timing with level-encoded 2-phase dual-rail (LEDR),” in Advanced Research in VLSI: Proceedings of the 1991 UC Santa Cruz Conference, C. H. Sequin, Ed. Cambridge, MA: MIT Press, 1991, pp. 55–70.

[30] T. H.-Y. Meng, R. W. Brodersen, and D. G. Messerschmitt, “Automatic synthesis of asynchronous circuits from high-level specifications,” IEEE Trans. Computer-Aided Design, vol. 8, pp. 1185–1205, Nov. 1989.

[31] G. M. Jacobs and R. W. Brodersen, “A fully asynchronous digital signal processor using self-timed circuits,” IEEE J. Solid-State Circuits, vol. 25, pp. 1526–1537, Dec. 1990.

[32] S. B. Furber and P. Day, “Four-phase micropipeline latch control circuits,” IEEE Trans. VLSI Syst., vol. 4, pp. 247–253, June 1996.

[33] L. G. Heller, W. R. Griffin, J. W. Davis, and N. G. Thomas, “Cascode voltage switch logic: A differential CMOS logic family,” in Proc. Int. Solid State Circuits Conf., Feb. 1984, pp. 16–17.

[34] J. Haans, K. van Berkel, A. Peeters, and F. Schalij, “Asynchronous multipliers as combinational handshake circuits,” IFIP Trans., vol. A-28, pp. 149–163, 1993.

[35] K. M. Chu and D. L. Pulfrey, “Design procedures for differential cascode voltage switch circuits,” IEEE J. Solid-State Circuits, vol. SSC-21, pp. 1082–1087, Dec. 1986.

Lars S. Nielsen received the M.Sc. degree in electrical engineering in 1992 from the Department of Computer Science, Technical University of Denmark, Lyngby. In 1997, he received the Ph.D. degree, also from the Technical University of Denmark.

In 1997, he joined the hearing-aid company Oticon, Inc., Hellerup, Denmark, and his main research interests are DSP applications with a special focus on low-power design and asynchronous circuit design.

Jens Sparsø (Member, IEEE) was born in Silkeborg, Denmark, in 1955. He received the M.Sc. degree in electrical engineering from the Technical University of Denmark, Lyngby, in 1981.

Since 1982, he has been with the Department of Computer Science at the Technical University of Denmark, where he became an Associate Professor in 1986. He is teaching courses on very large scale integration (VLSI), digital systems design, and computer architecture, and his research interests are architecture and design of VLSI systems, i.e., design methods, circuit techniques, and the interplay between technology and system architecture. This has included the design and implementation of several error-correcting decoders for telecommunication applications. For the past eight years, his main focus has been on design of asynchronous circuits and circuits with low power consumption. He spent the 1995–1996 academic year as a visiting Associate Professor with the Computer Science Department, University of Utah, Salt Lake City.

Prof. Sparsø is on the steering committee of several conferences in the area, and he has given a number of tutorials on asynchronous circuit design at European conferences.
