
IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY, VOL. 21, NO. 3, JUNE 2011 847

8-Bit Asynchronous Wave-Pipelined RSFQ Arithmetic-Logic Unit

T. Filippov, M. Dorojevets, A. Sahu, A. Kirichenko, C. Ayala, and O. Mukhanov, Senior Member, IEEE

Abstract—We have designed and demonstrated an Arithmetic-Logic Unit (ALU) based on RSFQ technology as a required step toward building an 8-bit RSFQ processor datapath. The circuit was designed and fabricated with HYPRES’ standard 4.5 kA/cm² process. The target clock frequency of the ALU is 20 GHz. In this paper, we present the design and functionality (low-speed) test results of the 8-bit ALU.

Index Terms—Adder, ALU, microprocessor, RSFQ, SFQ, timing.

I. INTRODUCTION

HIGH-PERFORMANCE COMPUTING (HPC) is one of the fields in which superconductor digital microelectronics is trying to establish its presence, following the path established by IBM’s famous Josephson project [1]. Unfortunately, the requirement of global timing for the superconductor ac-powered latching logic, as well as the high power dissipation of the voltage-generating elements of this logic family, along with some other technical obstacles, made the implementation of high-speed processors impossible at tens-of-gigahertz clock rates.

With the appearance of RSFQ logic [2], the development of a high-performance superconductor processor became more feasible. Non-latching, dc-powered RSFQ logic featuring local and self-timing [3] enabled the design of processing modules operating at tens of gigahertz with very low power dissipation. There were two projects for developing a superconductor computer [4]–[6]. A major part of these projects was the development of an RSFQ microprocessor operating at minimum power while clocking at very high rates.

Only two 8-bit prototypes of such a microprocessor, FLUX [5] and CORE [6], have been developed to date, and only CORE was successfully demonstrated. Neither of them used true 8-bit-wide data processing in its pipelines. The FLUX microprocessor had a novel processing-in-registers microarchitecture that allowed eight ALU operations to proceed simultaneously in its datapath, producing up to eight bits per cycle (albeit belonging to different operations). CORE used a simple bit-serial pipeline generating one bit of result per cycle.

Manuscript received August 04, 2010; accepted December 22, 2010. Date of publication February 10, 2011; date of current version May 27, 2011. This work was supported in part by DoD Contract W911NF-09-C-003.

T. Filippov, A. Sahu, A. Kirichenko, and O. Mukhanov are with HYPRES, Inc., Elmsford, NY 10523 USA (e-mail: [email protected]).

M. Dorojevets and C. Ayala are with the Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASC.2010.2103918

TABLE I. ALU INSTRUCTION SET

The FLUX microarchitecture was able to hide the latency of its eight bit-serial processing pipelines by allowing any instruction to start its execution as soon as the least significant bits of its input operands are calculated. In the bit-serial CORE processor, an instruction needs to wait until all eight bits of its inputs are calculated sequentially. Although these approaches allowed the design of low-complexity execution pipelines in these first microprocessor prototypes, they are not scalable or applicable to future 32-/64-bit RSFQ processors. That is why the development of a wide-datapath microprocessor is crucial for superconductor-based HPC.

Recently, HYPRES and Stony Brook University (SBU) have undertaken a joint project to develop a 20 GHz 8-bit processor datapath. In particular, SBU develops its microarchitecture and complete cell-level design, while HYPRES designs a cell library and physical layout, fabricates chips, and tests them. This is a first attempt to develop a wide-datapath RSFQ microprocessor. Its microarchitecture was reported in [11] and [13]. The microprocessor is designed for HYPRES’ Nb 4.5-kA/cm² fabrication process [7].

In this paper, we describe the design and functionality test results of the major part of the 8-bit microprocessor: an Arithmetic Logic Unit, a digital circuit that performs arithmetic and logic operations on integer operands.

II. ALU ARCHITECTURE AND DESIGN

A. ALU Architecture

In the instruction set of the processor (see Table I), addition (ADD) is the most complex and hardware-consuming arithmetic operation. Because of that, the design of our ALU is based on a parallel adder design [8].

Fig. 1. Block diagram of the 8-bit RSFQ ALU.

The RSFQ logic family is naturally suited to a deeply pipelined architecture because its cells have internal memory. Deep pipeline architectures result in high throughput but, at the same time, inherently have large latency. The simplest ripple-carry adder based ALU [8] has a latency of N clock periods for operations on N-bit wide data. This approach makes the design of a fast general-purpose microprocessor of 16 or more bits impractical. Because of that, we have chosen a Kogge-Stone type [9] of the carry-look-ahead adder family. In contrast to the previously explored adder designs [10], we developed and implemented our adder with an asynchronous wave-pipelined microarchitecture [11]–[13].
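The latency argument can be made concrete with a rough count of carry stages. The sketch below is illustrative only: it counts one stage per bit for a ripple-carry adder, as stated above, and ceil(log2 N) prefix levels for a Kogge-Stone carry network, which is the standard depth of that algorithm [9]. The function names are hypothetical, and the counts ignore the initialization and final-sum stages as well as every circuit-level detail of this ALU.

```python
def ripple_carry_stages(n_bits: int) -> int:
    """Carry stages of a ripple-carry adder: one per bit (as stated in the text)."""
    return n_bits


def kogge_stone_levels(n_bits: int) -> int:
    """Prefix levels of a Kogge-Stone carry network: ceil(log2(n_bits))."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (n_bits - 1).bit_length()


def stage_table(widths=(8, 16, 32, 64)):
    """Return (width, ripple-carry stages, Kogge-Stone levels) for each width."""
    return [(n, ripple_carry_stages(n), kogge_stone_levels(n)) for n in widths]


if __name__ == "__main__":
    for n, rc, ks in stage_table():
        print(f"{n:2d}-bit: ripple-carry {rc:2d} stages, Kogge-Stone {ks} prefix levels")
```

Under these assumptions the carry depth grows from 3 to 6 levels when the width grows from 8 to 64 bits, while the ripple-carry count grows from 8 to 64.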

B. ALU Components

Fig. 1 shows the block diagram of the ALU. It consists of four types of blocks, INIT, ROUT1, ROUT2, and SUM, connected with passive transmission lines (PTLs).

All components were simulated using the physical-level simulator PSCAN [14]. Then, the physical-level simulation timing parameters were extracted and used in the VHDL library. The complete VHDL ALU design and simulation for a 20-GHz clock rate were performed with HYPRES’ 4.5-kA/cm² standard cell library. Each block has PTL receivers (RX) at the input and PTL transmitters (TX) at the output. This makes routing of interconnects easier.

The most important part of the ALU is the INIT block (Fig. 2). It performs all primitive logic functions on the operands. The rest of the ALU circuitry basically comprises a routing part of the Kogge-Stone adder. In Fig. 2, “D” is a D flip-flop [2], “D2” is a dual-port D flip-flop [15], “DC” is a D flip-flop with complementary outputs, “XOR” is an XOR cell [2], and “AND” is a dynamic AND cell [16].

Fig. 2. Schematics (a) and layout (b) of block INIT.

The SUM blocks (Fig. 3) form the last stage of the ALU. They perform the XOR function on the partial sums and carries of the Kogge-Stone algorithm to produce the final result. For any operation other than ADD, i.e., in the absence of carry signals, this block simply passes its input data to the output.

Fig. 3. Schematics (a) and layout (b) of block SUM.

Fig. 4. Schematics (a) and layout (b) of block ROUT1.

Fig. 5. Schematics (a) and layout (b) of block ROUT2.

Blocks ROUT1 (Fig. 4) and ROUT2 (Fig. 5) provide partial sum and carry routing in accordance with the Kogge-Stone algorithm [9], as well as the propagation of bit-logic operation results. The cell C in the schematics designates a resettable Muller C element [17].
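The division of labor described above (primitive logic first, carry routing next, a final XOR last) follows the textbook structure of a Kogge-Stone adder. The following is a minimal bit-level sketch of that algorithm, not a model of the RSFQ circuit. The function names `init_stage`, `prefix_stages`, and `sum_stage` are hypothetical and only loosely mirror the INIT, ROUT, and SUM roles. The sketch assumes a carry-in of zero, and it does not model timing, wave pipelining, or the propagation of bit-logic results.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def init_stage(a, b):
    """Per-bit generate (AND) and propagate (XOR) signals of the operands."""
    g = [(a >> i) & (b >> i) & 1 for i in range(WIDTH)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(WIDTH)]
    return g, p


def prefix_stages(g, p):
    """Kogge-Stone carry network: combine groups at distances 1, 2, 4, ..."""
    g, p = list(g), list(p)
    dist = 1
    while dist < WIDTH:
        new_g, new_p = list(g), list(p)
        for i in range(dist, WIDTH):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2
    return g  # g[i] is now the carry out of bit position i


def sum_stage(p, carries):
    """XOR each partial sum with the carry entering that bit position."""
    result = 0
    for i in range(WIDTH):
        carry_in = carries[i - 1] if i > 0 else 0
        result |= (p[i] ^ carry_in) << i
    return result


def kogge_stone_add(a, b):
    """8-bit addition; the result is truncated to WIDTH bits (modulo 256)."""
    g, p = init_stage(a & MASK, b & MASK)
    return sum_stage(p, prefix_stages(g, p))
```

For example, `kogge_stone_add(200, 100)` returns 44, which is 300 modulo 256. If all carries are zero, `sum_stage` returns the partial sums unchanged, which corresponds to the pass-through behavior described for operations other than ADD.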

These blocks were integrated into a chip with an 8-bit ALU, shown in Fig. 6. The chip has approximately 8,000 Josephson junctions. Note that this chip is a product of HYPRES’ current 1.0-µm lithography. We will soon be able to produce chips with a 0.25-µm stepper, thereby enabling the size of the ALU to shrink at least threefold. That should also reduce the part of the latency caused by propagation time in the PTLs.

The simulated average ALU latency is 390 ps (with fluctuations of 4 ps), which includes 50 ps of signal propagation delay over the PTLs. The real advantage of the Kogge-Stone architecture over the ripple-carry architecture appears in ALUs with a wider datapath (16 bits and more) [13].
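To put the latency figures in scale, the short calculation below relates them to the 20-GHz target clock. It uses only numbers quoted in the text. The constant and function names are hypothetical, and expressing the latency of an asynchronous circuit in clock periods is done here purely for comparison.

```python
CLOCK_GHZ = 20.0      # target clock frequency (from the text)
LATENCY_PS = 390.0    # simulated average ALU latency (from the text)
PTL_DELAY_PS = 50.0   # part of the latency spent in PTLs (from the text)


def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in gigahertz."""
    return 1000.0 / freq_ghz


def latency_in_periods(latency_ps, freq_ghz):
    """Latency expressed as a number of clock periods."""
    return latency_ps / clock_period_ps(freq_ghz)


def ptl_fraction(latency_ps, ptl_ps):
    """Share of the total latency attributed to PTL propagation."""
    return ptl_ps / latency_ps


if __name__ == "__main__":
    print(f"clock period: {clock_period_ps(CLOCK_GHZ):.0f} ps")
    print(f"latency: {latency_in_periods(LATENCY_PS, CLOCK_GHZ):.1f} clock periods")
    print(f"PTL share of latency: {ptl_fraction(LATENCY_PS, PTL_DELAY_PS):.1%}")
```

At 20 GHz the clock period is 50 ps, so the 390-ps latency corresponds to about 7.8 periods, of which the PTLs account for roughly 13%.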

III. FUNCTIONALITY TEST

Extensive low-frequency functionality tests were performed on all parts of the ALU (Figs. 7–12). The experiments showed that the ALU correctly executes all instructions.

Fig. 6. A chip with the 8-bit ALU.

Fig. 7. The 8-bit ALU functionality test for operation ADD.

Functionally, the most complex operation is addition (ADD). The ALU’s instruction set (Table I) includes four variations of this operation. In order to provide subtraction, the ALU can invert one or both operands and add them in a single instruction.

Fig. 7 shows the correct operation of the ALU adding 8-bit numbers (A+B). The bottom trace is the Ready signal, which precedes every instruction execution. The 8-bit operand A, operand B, and the 8-bit outputs are shown in ascending order. The result of the addition is produced modulo 256.

Fig. 8. The 8-bit ALU functionality test for operation “ADD-Invert”. Here, both operands (A and B) are inverted before summing.

Fig. 9. The 8-bit ALU functionality test for operation AND.

The most complex operation in the instruction set is “ADD-Invert A and B”, which is essentially equivalent to the arithmetic operation (-2-A-B). The test result of the ALU performing this operation is shown in Fig. 8. Here, we have preserved the same order of traces and the same operand pattern as in Fig. 7.
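The equivalence with (-2-A-B) follows from 8-bit arithmetic: inverting an operand X gives 255-X, so the sum of two inverted operands is 510-A-B, which equals -2-A-B modulo 256. The sketch below checks this exhaustively. The function and flag names are hypothetical and do not reproduce the instruction encodings of Table I. The single-inversion identity A + NOT(B) = A-B-1 (mod 256) is ordinary two's-complement arithmetic, and how the ALU handles the remaining +1 for a true subtraction is not modeled here.

```python
MASK = 0xFF  # 8-bit datapath


def invert(x):
    """Bitwise inversion of an 8-bit operand."""
    return ~x & MASK


def add_mod256(a, b, invert_a=False, invert_b=False):
    """Add two 8-bit operands, optionally inverting either one first."""
    if invert_a:
        a = invert(a)
    if invert_b:
        b = invert(b)
    return (a + b) & MASK


def check_identities():
    """Exhaustively verify the inversion identities for all 8-bit operands."""
    for a in range(256):
        for b in range(256):
            # Both operands inverted: equivalent to -2 - A - B modulo 256.
            assert add_mod256(a, b, True, True) == (-2 - a - b) & MASK
            # Only B inverted: equivalent to A - B - 1 modulo 256.
            assert add_mod256(a, b, False, True) == (a - b - 1) & MASK
    return True
```

For instance, with A = 3 and B = 5, inverting both operands gives 252 + 250 = 502, which is 246 modulo 256, the same value as -10 modulo 256.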

The functionally simplest operations are the so-called bit-logic operations, such as AND, XOR, NOR, etc. They do not produce a “carry” bit propagating across the ALU. The results of logic operations performed in the INIT blocks of the ALU (Fig. 1) go directly to the output. This property of the bit-logic operations simplifies both their testing and the pattern necessary to perform a complete test of the ALU.

Fig. 10. The 8-bit ALU functionality test for operation NOR.

Fig. 11. The 8-bit ALU functionality test for operation XOR.

The low-speed functionality test results for four bit-logic operations are shown: operation AND in Fig. 9; operation NOR in Fig. 10; XOR in Fig. 11; and XNOR in Fig. 12. For consistency, we placed the traces in the same order as in Fig. 7.
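Because a bit-logic operation produces no carry, each output bit depends only on the two input bits in the same position. A handful of operand pairs can therefore exercise every input combination in every bit position at once. The sketch below illustrates this for the four operations named above. The operand pattern is an illustrative assumption, not the pattern used in the reported measurements, and it covers only the logic function, not circuit-level faults.

```python
MASK = 0xFF  # 8-bit datapath

# Reference models of the four bit-logic operations.
BIT_LOGIC_OPS = {
    "AND": lambda a, b: a & b,
    "NOR": lambda a, b: ~(a | b) & MASK,
    "XOR": lambda a, b: a ^ b,
    "XNOR": lambda a, b: ~(a ^ b) & MASK,
}

# Four operand pairs that present (0,0), (0,1), (1,0), and (1,1)
# to every bit position simultaneously (hypothetical test pattern).
TEST_PATTERN = [(0x00, 0x00), (0x00, 0xFF), (0xFF, 0x00), (0xFF, 0xFF)]


def run_pattern(op_name):
    """Apply the test pattern to one operation and return the 8-bit results."""
    op = BIT_LOGIC_OPS[op_name]
    return [op(a, b) for a, b in TEST_PATTERN]


if __name__ == "__main__":
    for name in BIT_LOGIC_OPS:
        print(name, [f"{r:08b}" for r in run_pattern(name)])
```

An ADD test cannot be reduced this way, since the carry couples each bit position to all lower ones.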

IV. CONCLUSION

We have designed, fabricated, and successfully tested the RSFQ 8-bit ALU. The ALU design is based on a Kogge-Stone adder and employs an asynchronous wave-pipelined approach. This approach reduces the latency and allows us to scale the ALU to a larger number of bits (up to 64). The ALU has been fabricated with HYPRES’ standard 4.5-kA/cm² Nb process.


Fig. 12. The 8-bit ALU functionality test for operation XNOR.

The targeted clock rate of the ALU is 20 GHz. Comprehensive low-speed functionality tests have been performed for all ALU functions. The ALU functions properly for all instructions from its instruction set and for all operands. As the next step, we are working on high-speed testing to evaluate experimentally the maximum operating clock rate of the ALU.

ACKNOWLEDGMENT

The authors would like to thank D. Donnelly, R. Hunt, J. Vivalda, D. Yohannes, and S. K. Tolpygo of the HYPRES fabrication team. Discussions with and encouragement from M. Manheimer and S. Holmes are appreciated.

REFERENCES

[1] W. Anacker, “Josephson computer technology: An IBM research project,” IBM Journal of Research and Development, vol. 24, no. 2, pp. 107–112, Mar. 1980.

[2] K. Likharev and V. Semenov, “RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz clock-frequency digital systems,” IEEE Trans. Appl. Supercond., vol. 1, pp. 3–28, Mar. 1991.

[3] O. A. Mukhanov, S. V. Rylov, V. K. Semenov, and S. V. Vyshenskii, “RSFQ logic arithmetic,” IEEE Trans. Magn., vol. MAG-25, no. 2, pp. 857–860, Mar. 1989.

[4] T. Sterling, “A design analysis of a hybrid technology multithreaded architecture for petaflops scale computation,” in Proc. of International Conference on Supercomputing, 1999, pp. 386–296.

[5] P. Bunyk, M. Leung, J. Spargo, and M. Dorojevets, “FLUX-1 RSFQ microprocessor,” IEEE Trans. Appl. Supercond., vol. 13, no. 1, p. 433, 2003.

[6] A. Fujimaki, M. Tanaka, T. Yamada, Y. Yamanashi, H. Park, and N. Yoshikawa, “Bit-serial single flux quantum microprocessor CORE,” IEICE Trans. Electron., vol. E91-C, pp. 342–349, Mar. 2008.

[7] HYPRES’ Design Rules [Online]. Available: http://www.hypres.com

[8] J. Y. Kim, S. Kim, and J. Kang, “Construction of an RSFQ 4-bit ALU with half adder cells,” IEEE Trans. Appl. Supercond., vol. 15, no. 1, p. 308, 2005.

[9] P. Kogge and H. S. Stone, “A parallel algorithm for the efficient solution of a general class of recurrence equations,” IEEE Trans. Computers, vol. C-22, no. 8, pp. 786–793, Aug. 1973.

[10] P. Bunyk and P. Litskevitch, “Case study in RSFQ design: Fast pipelined 32-bit adder,” IEEE Trans. Appl. Supercond., pp. 3714–3720, June 1999.

[11] M. Dorojevets, C. Ayala, and A. Kasperek, “Development and evaluation of design techniques for high-performance wave-pipelined wide datapath RSFQ processors,” in Proc. of the 12th Intl Superconductive Electronics Conference (ISEC ’09), Fukuoka, Japan, June 16–19, 2009.

[12] W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, “Wave pipelining: A tutorial and research survey,” IEEE VLSI Syst., vol. 6, pp. 464–474, Sep. 1998.

[13] M. Dorojevets, C. Ayala, and A. Kasperek, “Data-flow microarchitecture for wide datapath RSFQ processors: Design study,” IEEE Trans. Appl. Supercond., submitted for publication.

[14] S. Polonsky, P. Shevchenko, A. Kirichenko, D. Zinoviev, and A. Rylyakov, “PSCAN’96: New software for simulation RSFQ circuits,” IEEE Trans. Appl. Supercond., vol. 7, no. 2, pp. 2685–2689, June 1997.

[15] S. V. Polonsky, V. K. Semenov, and A. F. Kirichenko, “Single flux quantum B flip-flop and its possible applications,” IEEE Trans. Appl. Supercond., vol. 4, no. 1, p. 9, 1994.

[16] S. Kaplan, A. Kirichenko, O. Mukhanov, and S. Sarwana, “A prescaler circuit for a superconductive time-to-digital converter,” IEEE Trans. Appl. Supercond., vol. 11, no. 1, p. 513, 2000.

[17] T. V. Filippov, S. V. Pflyuk, V. K. Semenov, and E. B. Wikborg, “Encoders and decimation filters for superconductor oversampling ADCs,” IEEE Trans. Appl. Supercond., vol. 11, pp. 545–549, Mar. 2001.
