
International Journal of VLSI Design, 2(2), 2011, pp. 113-120

FPGA BASED IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX2 MODIFIED BOOTH ALGORITHM AND SPST ADDER USING VERILOG

Addanki Purna Ramesh¹ & P. Koteswara Rao²

¹Associate Professor, Department of ECE, Sri Vasavi Engg College, Tadepalligudem. ²M.Tech Student, Department of ECE, Sri Vasavi Engg College, Tadepalligudem.

Abstract: In this paper, we propose a new multiplier-and-accumulator (MAC) architecture for high-speed, low-power arithmetic. Multiplication occurs frequently in finite impulse response filters, fast Fourier transforms, discrete cosine transforms, convolution, and other important DSP and multimedia kernels. The objective of a good MAC is to provide a physically compact, fast, and low-power chip. Because dynamic power is the major part of total power dissipation, reducing it is an effective way to save significant power in a VLSI design. We propose a high-speed MAC adopting the new SPST implementing approach. The MAC is designed by equipping the Spurious Power Suppression Technique (SPST) on a modified Booth encoder, which is controlled by a detection unit using an AND gate. The modified Booth encoder reduces the number of partial products generated by a factor of 2. The SPST adder avoids unwanted additions and thus minimizes the switching power dissipation.

Keywords: Booth encoder, computer arithmetic, digital signal processing, spurious power suppression technique, low power.

1. INTRODUCTION

With the recent rapid advances in multimedia and communication systems, real-time signal processing such as audio signal processing, video/image processing, and large-capacity data processing is increasingly in demand.

The multiplier and the multiplier-and-accumulator (MAC) [1] are the essential elements of digital signal processing operations such as filtering, convolution, and inner products. Most digital signal processing methods use transforms such as the discrete cosine transform (DCT) [2] or the discrete wavelet transform (DWT) [3]. Because they are basically accomplished by repetitive application of multiplication and addition, the speed of the multiplication and addition arithmetic determines the execution speed and performance of the entire calculation. Because the multiplier has the longest delay among the basic operational blocks in a digital system, the critical path is, in general, determined by the multiplier. For high-speed multiplication, the modified radix-4 Booth's algorithm (MBA) [4] is commonly used. However, this cannot completely solve the problem because of the long critical path for multiplication [5], [6].

Power dissipation is recognized as a critical parameter in modern VLSI design. To keep pace with Moore's law and to produce consumer electronics goods with longer battery backup and less weight, low-power VLSI design is necessary.

Fast multipliers are essential parts of digital signal processing systems. The speed of the multiply operation is of great importance in digital signal processing as well as in today's general-purpose processors, especially since media processing took off. In the past, multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of the operands in order to preserve the information content. The repeated addition method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes use of positional representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. The basic multiplication principle is thus twofold: evaluation of partial products and accumulation of the shifted partial products. The accumulation is performed by successive additions of the columns of the shifted partial product matrix. The multiplier is successively shifted and gates the appropriate bit of the multiplicand. The delayed, gated instances of the multiplicand must all be in the same column of the shifted partial product matrix. They are then added to form the product bit for that particular column.
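The generate-then-accumulate principle described above can be sketched in a few lines of Python. This is only a behavioural illustration for unsigned operands, not the hardware proposed in this paper; the function name `shift_add_multiply` and the default width are assumptions made for the example.

```python
def shift_add_multiply(multiplicand, multiplier, n_bits=16):
    """Unsigned shift-and-add: one gated, shifted partial product per multiplier bit."""
    product = 0
    for i in range(n_bits):
        bit = (multiplier >> i) & 1                        # multiplier bit i gates the multiplicand
        partial_product = (multiplicand << i) if bit else 0
        product += partial_product                         # accumulate into a 2N-bit result
    return product


print(hex(shift_add_multiply(0x2AC9, 0x006A)))
```

The loop produces one partial product per multiplier bit, which is the count that Booth recoding later reduces.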

Multiplication is therefore a multi-operand operation. To extend the multiplication to both signed and unsigned numbers, a convenient number system is the two's complement representation.

2. OVERVIEW OF MAC

In this section, the basic MAC operation is introduced. A multiplier can be divided into three operational steps. The first is radix-2 Booth encoding, in which a partial product is generated from the multiplier (X) and the multiplicand (Y). The second is the adder array or partial product compression, which adds all partial products and converts them into the form of sum and carry. The last is the final addition, in which the final multiplication result is produced by adding the sum and the carry. If the process of accumulating the multiplied results is included, a MAC consists of four steps, as shown explicitly in Fig. 1.

The general hardware architecture of this MAC is shown in Fig. 2. It executes the multiplication operation by multiplying the input multiplier (X) and the multiplicand (Y). This is added to the previous multiplication result (Z) as the accumulation step.
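The four steps can be illustrated with a small behavioural model. The sketch below is an assumption-laden simplification: it uses plain AND-gated partial products instead of Booth encoding for step 1, treats the operands as unsigned, and the names `csa` and `mac` are hypothetical. It only shows how a carry-save stage keeps the running result as separate sum and carry vectors until the final addition.

```python
N = 16
MASK = (1 << (2 * N)) - 1  # 2N-bit datapath


def csa(a, b, c):
    """3:2 carry-save compressor: three operands in, sum and carry vectors out."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s & MASK, carry & MASK


def mac(x, y, z):
    # Step 1 (simplified): one AND-gated partial product per bit of x.
    rows = [((y << i) & MASK) if (x >> i) & 1 else 0 for i in range(N)]
    # Step 2: compress all rows into a sum vector and a carry vector.
    s, c = 0, 0
    for row in rows:
        s, c = csa(s, c, row)
    # Step 3: final carry-propagate addition gives the product.
    product = (s + c) & MASK
    # Step 4: accumulate with the previous result z.
    return (product + z) & MASK


print(mac(0x2AC9, 0x006A, 1000) == 0x2AC9 * 0x006A + 1000)
```

Only step 3 and step 4 need a carry-propagating adder; the compression loop never waits for a carry to ripple.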

The N-bit 2's complement binary number X can be expressed as

$$X = -2^{N-1} x_{N-1} + \sum_{i=0}^{N-2} x_i 2^i, \qquad x_i \in \{0, 1\} \tag{1}$$
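As a quick numerical check of equation (1), the following sketch evaluates the weighted sum bit by bit. The function name `twos_complement_value` is an assumption made for this illustration.

```python
def twos_complement_value(bits, n=16):
    """Evaluate equation (1): X = -2^(N-1) x_{N-1} + sum_{i=0}^{N-2} x_i 2^i."""
    x = [(bits >> i) & 1 for i in range(n)]
    return -(1 << (n - 1)) * x[n - 1] + sum(x[i] << i for i in range(n - 1))


print(twos_complement_value(0xFFFF))  # -1
print(twos_complement_value(0x8000))  # -32768
```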

Figure 1: Basic Arithmetic Steps of Multiplication and Accumulation

Figure 2. Hardware Architecture of General MAC

If (1) is expressed in base-4 type redundant sign digit form in order to apply the radix-2 Booth's algorithm, it becomes [7]

$$X = \sum_{i=0}^{N/2-1} d_i 4^i \tag{2}$$

$$d_i = -2 x_{2i+1} + x_{2i} + x_{2i-1} \tag{3}$$
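Equations (2) and (3) can be checked numerically. The sketch below assumes the usual convention that the bit below the LSB is zero (x_{-1} = 0), which the text does not state explicitly; the helper names are hypothetical.

```python
def booth_digits(bits, n=16):
    """Digits d_i of equation (3) for an n-bit two's complement pattern (n even)."""
    def x(k):
        return 0 if k < 0 else (bits >> k) & 1  # assumed convention: x_{-1} = 0
    return [-2 * x(2 * i + 1) + x(2 * i) + x(2 * i - 1) for i in range(n // 2)]


def from_digits(digits):
    """Equation (2): X = sum of d_i * 4^i."""
    return sum(d * 4 ** i for i, d in enumerate(digits))


print(booth_digits(0x006A))               # [-2, -1, -1, 2, 0, 0, 0, 0]
print(from_digits(booth_digits(0x006A)))  # 106
```

A 16-bit operand yields eight digits, each in the set {-2, -1, 0, +1, +2}, so the recoded form has half as many terms as the bit-level form in (1).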


If (2) is used, multiplication can be expressed as

$$X \times Y = \sum_{i=0}^{N/2-1} d_i 2^{2i} Y \tag{4}$$

If these equations are used, the aforementioned multiplication-accumulation result can be expressed as

$$P = X \times Y + Z = \sum_{i=0}^{N/2-1} d_i 2^{2i} Y + \sum_{j=0}^{2N-1} z_j 2^j \tag{5}$$

Each of the two terms on the right-hand side of (5) is calculated independently, and the final result is produced by adding the two results. The MAC architecture implemented by (5) is called the standard design [6]. If N-bit data are multiplied, the number of generated partial products is proportional to N. If they are added serially, the execution time is also proportional to N. The fastest multiplier architecture uses radix-2 Booth encoding to generate the partial products.
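The two independent terms of the multiply-accumulate expression can be evaluated separately and then added, as in the following sketch. It assumes x_{-1} = 0 in the digit definition and treats Z as an unsigned 2N-bit pattern, following the form of the second sum; `mac_standard` and `booth_digits` are illustrative names, not part of the paper's design.

```python
def booth_digits(bits, n=16):
    """Radix-4 Booth digits d_i = -2*x_{2i+1} + x_{2i} + x_{2i-1}, with x_{-1} = 0."""
    def x(k):
        return 0 if k < 0 else (bits >> k) & 1
    return [-2 * x(2 * i + 1) + x(2 * i) + x(2 * i - 1) for i in range(n // 2)]


def mac_standard(x_bits, y, z_bits, n=16):
    """P = sum_i d_i 2^(2i) Y  +  sum_j z_j 2^j, each term computed on its own."""
    product_term = sum(d * (1 << (2 * i)) * y
                       for i, d in enumerate(booth_digits(x_bits, n)))
    accum_term = sum(((z_bits >> j) & 1) << j for j in range(2 * n))
    return product_term + accum_term


print(mac_standard(0x006A, 0x2AC9, 1000))  # 106 * 10953 + 1000
```

With N = 16 the product term sums eight Booth partial products rather than sixteen bit-level ones.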

3. MODIFIED BOOTH ENCODER

In order to achieve high speed, multiplication algorithms using parallel counters, such as the modified Booth algorithm, have been proposed, and some multipliers based on these algorithms have been implemented for practical use. This type of multiplier operates much faster than an array multiplier for longer operands because its computation time is proportional to the logarithm of the word length of the operands.

Booth multiplication is a technique that allows for smaller, faster multiplication circuits by recoding the numbers that are multiplied. It is possible to reduce the number of partial products by half by using the technique of radix-4 Booth recoding [8]. The basic idea is that, instead of shifting and adding for every column of the multiplier term and multiplying by 1 or 0, we only take every second column and multiply by ±1, ±2, or 0 to obtain the same result. The advantage of this method is the halving of the number of partial products.

To Booth recode the multiplier term, we consider the bits in blocks of three, such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first block uses only two bits of the multiplier, with an implied zero to the right of the LSB. Figure 3 shows the grouping of bits from the multiplier term for use in modified Booth encoding.

Figure 3. Grouping of Bits from the Multiplier Term

Each block is decoded to generate the correct partial product. The encoding of the multiplier Y using the modified Booth algorithm generates the following five signed digits: –2, –1, 0, +1, +2. Each encoded digit in the multiplier performs a certain operation on the multiplicand X, as illustrated in Table 1.

Table 1: Encoded Five Signed Digits

Block    Re-code digit    Operation on X
000       0                0X
001      +1               +1X
010      +1               +1X
011      +2               +2X
100      –2               –2X
101      –1               –1X
110      –1               –1X
111       0                0X
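The block grouping and the lookup in Table 1 can be sketched as a table-driven recoder. This is a behavioural illustration only; the names `BOOTH_TABLE`, `recode`, and `partial_products` are assumptions, and the multiplicand X is taken as an ordinary signed Python integer.

```python
# Table 1 as a lookup: 3-bit block -> recoded digit.
BOOTH_TABLE = {0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def recode(y_bits, n=16):
    """Split the multiplier Y into overlapping 3-bit blocks, starting from the LSB."""
    padded = y_bits << 1  # implied 0 to the right of the LSB
    return [BOOTH_TABLE[(padded >> (2 * i)) & 0b111] for i in range(n // 2)]


def partial_products(x, y_bits, n=16):
    """Apply each recoded digit to the multiplicand X and shift it into place."""
    return [(digit * x) << (2 * i) for i, digit in enumerate(recode(y_bits, n))]


print(recode(0x006A))                               # [-2, -1, -1, 2, 0, 0, 0, 0]
print(sum(partial_products(0x2AC9, 0x006A)) == 0x2AC9 * 0x006A)
```

Each block shares its lowest bit with the highest bit of the previous block, which is why the shift step is two bit positions while the window is three bits wide.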

4. SPURIOUS POWER SUPPRESSION TECHNIQUE

The former SPST has been discussed in [9] and [10]. The data are separated into the Most Significant Part (MSP) and the Least Significant Part (LSP). Figure 4 shows the five cases of a 16-bit addition in which spurious switching activities occur. The 1st case illustrates a transient state in which spurious transitions of carry signals occur in the MSP although the final result of the MSP is unchanged. The 2nd and 3rd cases describe the addition of one negative operand and one positive operand without and with a carry from the LSP, respectively. The 4th and 5th cases demonstrate the addition of two negative operands without and with a carry-in from the LSP, respectively. In those cases, the results of the MSP are predictable; therefore, the computations in the MSP are useless and can be neglected. To know whether the MSP affects the computation results, we need a detection logic unit to detect the effective ranges of the inputs.


The Boolean equations below express the behavioural principles of the detection logic unit in the MSP circuits of the SPST-based adder/subtractor:

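$$A_{MSP} = A[15:8] \tag{6}$$

$$B_{MSP} = B[15:8] \tag{7}$$

$$A_{and} = A[15] \cdot A[14] \cdots A[8] \tag{8}$$

$$B_{and} = B[15] \cdot B[14] \cdots B[8] \tag{9}$$

$$A_{nor} = \overline{A[15] + A[14] + A[13] + \cdots + A[8]} \tag{10}$$

$$B_{nor} = \overline{B[15] + B[14] + B[13] + \cdots + B[8]} \tag{11}$$

$$\mathit{Close} = \overline{(A_{and} + A_{nor}) \cdot (B_{and} + B_{nor})} \tag{12}$$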

We derive the Karnaugh maps, which lead to the Boolean equations (13) and (14) for the Carr_ctrl and sign signals, respectively. In equations (13) and (14), C_LSP denotes the carry propagated from the LSP circuits.

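$$\mathit{Carr\_ctrl} = (C_{LSP} \oplus A_{and} \oplus B_{and}) \cdot (A_{and} + A_{nor}) \cdot (B_{and} + B_{nor}) \tag{13}$$

$$\mathit{Sign} = \overline{C_{LSP}} \cdot (A_{and} + B_{and}) + C_{LSP} \cdot A_{and} \cdot B_{and} \tag{14}$$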
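The detection equations can be exercised with a bit-level model. The sketch below encodes the AND/NOR terms, Close, Carr_ctrl, and Sign as given in the text. How the frozen MSP result is rebuilt from Sign and Carr_ctrl (seven copies of Sign followed by Carr_ctrl) is an interpretation made for this example, not a detail stated in the paper, and the function names are hypothetical.

```python
def detect(a, b, c_lsp):
    """Return (close, carr_ctrl, sign) for 16-bit a, b and the LSP carry c_lsp."""
    a_msp, b_msp = (a >> 8) & 0xFF, (b >> 8) & 0xFF
    a_and, b_and = int(a_msp == 0xFF), int(b_msp == 0xFF)   # all ones
    a_nor, b_nor = int(a_msp == 0x00), int(b_msp == 0x00)   # all zeros
    trivial = (a_and | a_nor) & (b_and | b_nor)
    close = 1 - trivial
    carr_ctrl = (c_lsp ^ a_and ^ b_and) & trivial
    sign = ((1 - c_lsp) & (a_and | b_and)) | (c_lsp & a_and & b_and)
    return close, carr_ctrl, sign


def spst_add(a, b):
    """16-bit addition that skips the MSP adder when both MSPs are sign extension only."""
    lsp_sum = (a & 0xFF) + (b & 0xFF)
    c_lsp, lsp = lsp_sum >> 8, lsp_sum & 0xFF
    close, carr_ctrl, sign = detect(a, b, c_lsp)
    if close:
        # MSP carries real data: compute it normally.
        msp = (((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + c_lsp) & 0xFF
    else:
        # Assumed reconstruction: bits 7..1 follow sign, bit 0 follows carr_ctrl.
        msp = (0xFE if sign else 0x00) | carr_ctrl
    return (msp << 8) | lsp


print(hex(spst_add(0x0005, 0xFFFB)))  # 0x0 : 5 + (-5)
print(hex(spst_add(0xFFFE, 0xFFFD)))  # 0xfffb : (-2) + (-3)
```

Under this reading, the MSP adder is bypassed whenever both upper bytes are all zeros or all ones, and the result still matches ordinary 16-bit addition.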

Figure 4: Spurious Transition Cases in Multimedia/DSP Processing (Cases 1 to 5)

5. PROPOSED SPURIOUS POWER SUPPRESSION TECHNIQUE

The SPST uses a detection logic circuit to detect the effective data range of arithmetic units, e.g., adders or multipliers. When a portion of the data does not affect the final computing results, the data controlling circuits of the SPST latch this portion to avoid useless data transitions inside the arithmetic units. In addition, a data asserting control realized with registers further filters out the useless spurious signals of the arithmetic unit every time the latched portion is turned on. This asserting control brings evident power reduction. Figure 5 shows the design of the low-power adder/subtractor with SPST.
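The effect of latching the unused portion can be illustrated with a toy switching-activity count. This is not a power model of the proposed circuit: it merely counts bit flips at the MSP adder inputs for a sequence of small-magnitude operands, assuming the latch holds its previous value whenever both MSPs are all zeros or all ones. All names are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def msp_is_trivial(v):
    """True when the upper byte of a 16-bit value is all zeros or all ones."""
    return ((v >> 8) & 0xFF) in (0x00, 0xFF)


def msp_input_toggles(pairs, use_spst):
    """Count bit flips seen at the MSP adder inputs over a sequence of operand pairs."""
    held_a = held_b = 0  # values currently presented to the MSP adder
    toggles = 0
    for a, b in pairs:
        a_msp, b_msp = (a >> 8) & 0xFF, (b >> 8) & 0xFF
        frozen = use_spst and msp_is_trivial(a) and msp_is_trivial(b)
        if not frozen:
            toggles += hamming(held_a, a_msp) + hamming(held_b, b_msp)
            held_a, held_b = a_msp, b_msp
    return toggles


# Small positive and negative operands: the MSP is pure sign extension.
stream = [(0x0005, 0xFFFB), (0xFFFE, 0x0003), (0x0007, 0xFFF9)]
print(msp_input_toggles(stream, use_spst=False))  # 40
print(msp_input_toggles(stream, use_spst=True))   # 0
```

Alternating signs flip all eight upper bits of each operand every cycle when the inputs are not latched, which is the kind of useless transition the SPST is meant to suppress.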


The adder/subtractor is divided into two parts, the most significant part (MSP) and the least significant part (LSP). The MSP of the original adder/subtractor is modified to include detection logic circuits, data controlling circuits, sign extension circuits, and logic for calculating the carry-in and carry-out signals. The most important part of this study is the design of the control signal asserting circuits, denoted as asserting circuits in Figure 5. Although this asserting circuit brings evident power reduction, it may induce additional delay. There are two implementing approaches for the control signal assertion circuits. The first approach uses registers, as illustrated in Figure 6. The three output signals of the detection logic unit, close, Carr_ctrl, and sign, are given a certain amount of delay before they assert. The delay Φ used to assert the three output signals must be set in the range Ψ < Φ < ∆, where Ψ denotes the data transient period and ∆ denotes the earliest required time of all the inputs. This filters out the glitch signals while keeping the computation results correct. The restriction that Φ must be greater than Ψ, which guarantees that the registers do not latch wrong control values, usually decreases the overall speed of the applied designs.

Figure 5. Low Power Adder/Subtractor Adopting the SPST

Figure 6. Detection Logic Circuits using Registers

This issue should be noted in high-end applications that demand both high speed and low power. To solve this problem, we adopt the other implementing approach, a control signal assertion circuit using an AND gate.

Figure 7: Detection Logic Circuits using an AND Gate


6. PROPOSED LOW POWER HIGH PERFORMANCE MULTIPLIER AND ACCUMULATOR

The proposed high-speed, low-power multiplier is designed by equipping the SPST on a tree multiplier. There are two distinguishing design considerations in the proposed multiplier, as listed below.

(A) Applying the SPST on the Modified Booth Encoder

Figure 8 shows a computing example of Booth multiplying two numbers, “2AC9” and “006A”. The shadow denotes that the numbers in this part of the Booth multiplication are all zero, so this part of the computation can be neglected. Saving those computations can significantly reduce the power consumption caused by transient signals. According to the analysis of the multiplication shown in Figure 8, we propose the SPST-equipped modified Booth encoder, which is controlled by a detection unit. The detection unit takes one of the two operands as its input to decide whether the Booth encoder calculates redundant computations, as shown in Figure 9. The latches can freeze the inputs of MUX-4 to MUX-7, or only those of MUX-6 to MUX-7, when PP4 to PP7 or PP6 to PP7 are zero, respectively, to reduce the transition power dissipation. Figure 10 shows the Booth partial product generation circuit, which includes AND/OR/EX-OR logic.

Figure 8: Illustration of Multiplication using ModifiedBooth Encoding

Figure 9: SPST Equipped Modified Booth Encoder

Figure 10: Booth Partial Product Selector Logic
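The example with “2AC9” and “006A” can be reproduced in a short behavioural sketch. It assumes that “006A” is the Booth-encoded operand and that the detection unit inspects its upper bits: if bits 15 down to 7 are all equal, PP4 to PP7 are zero, and if bits 15 down to 11 are all equal, PP6 to PP7 are zero. That detection rule is an assumption derived from the digit definition, not a circuit given in the paper, and the function names are hypothetical.

```python
def booth_digits(bits, n=16):
    """Radix-4 Booth digits, with the bit below the LSB taken as 0."""
    def x(k):
        return 0 if k < 0 else (bits >> k) & 1
    return [-2 * x(2 * i + 1) + x(2 * i) + x(2 * i - 1) for i in range(n // 2)]


def first_frozen_pp(encoded_operand):
    """Assumed detection unit: index of the first partial product that can be frozen."""
    top9 = (encoded_operand >> 7) & 0x1FF   # bits 15..7
    top5 = (encoded_operand >> 11) & 0x1F   # bits 15..11
    if top9 in (0, 0x1FF):
        return 4   # PP4 to PP7 are zero
    if top5 in (0, 0x1F):
        return 6   # PP6 to PP7 are zero
    return 8       # nothing can be frozen


def booth_multiply(a, b_bits):
    """Sum only the partial products that the detection unit leaves active."""
    digits = booth_digits(b_bits)
    active = first_frozen_pp(b_bits)
    return sum((digits[i] * a) << (2 * i) for i in range(active))


print(booth_digits(0x006A))       # [-2, -1, -1, 2, 0, 0, 0, 0]
print(first_frozen_pp(0x006A))    # 4
print(booth_multiply(0x2AC9, 0x006A) == 0x2AC9 * 0x006A)
```

For “006A” only four of the eight partial products are non-zero, so half of the selector inputs could be held constant for this operand.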

(B) Applying the SPST on the Compression Tree

The proposed SPST-equipped multiplier is illustrated in Figure 11. The PP generator generates five candidates of the partial products, i.e., {–2A, –A, 0, A, 2A}. These are then selected according to the Booth encoding results of the operand B. When the operand other than the Booth-encoded one has a small absolute value, there are opportunities to reduce the spurious power dissipated in the compression tree.


Figure 11: Proposed High Performance Low Power Equipped Multiplier

8. SIMULATION RESULTS

Figure 12: Schematic View of Booth Encoder

Figure 13: Simulation Waveform of Booth Encoder

Figure 14: Schematic View of MAC

Figure 15: Simulation Waveform of MAC

9. CONCLUSIONS

In this project, we propose a high-speed, low-power multiplier and accumulator (MAC) adopting the new SPST implementing approach. This MAC is designed by equipping the Spurious Power Suppression Technique (SPST) on a modified Booth encoder, which is controlled by a detection unit using an AND gate. The modified Booth encoder reduces the number of partial products generated by a factor of 2. The SPST adder avoids unwanted additions and thus minimizes the switching power dissipation. The SPST MAC implementation with AND gates has extremely high flexibility in adjusting the data asserting time. This facilitates the robustness of the SPST, which can attain a 30% speed improvement and a 22% power reduction in the modified Booth encoder. The design can be verified using ModelSim and Xilinx tools with Verilog.

REFERENCES

[1] G. Lakshmi Narayanan and B. Venkataramani, “Optimization Techniques for FPGA-Based Wave Pipelined DSP Blocks”, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 13, No. 7, pp. 783-792, July 2005.

[2] L. Benini, G. D. Micheli, A. Macii, E. Macii, M. Poncino, and R. Scarsi, “Glitching Power Minimization by Selective Gate Freezing”, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 8, No. 3, pp. 287-297, June 2000.

[3] S. Henzler, G. Georgakos, J. Berthold, and D. Schmitt-Landsiedel, “Fast Power-Efficient Circuit-Block Switch-Off Scheme”, Electron. Lett., 40, No. 2, pp. 103-104, Jan. 2004.

[4] H. Lee, “A Power-Aware Scalable Pipelined Booth Multiplier”, in Proc. IEEE Int. SOC Conf., 2004, pp. 123-126.

[5] J. Choi, J. Jeon, and K. Choi, “Power Minimization of Functional Units by Partially Guarded Computation”, in Proc. IEEE Int. Symp. Low Power Electron. Des., 2000, pp. 131-136.

[6] Z. Huang and M. D. Ercegovac, “High-Performance Low-Power Left-to-Right Array Multiplier Design”, IEEE Trans. Comput., 54, No. 3, pp. 272-283, Mar. 2005.

[7] K. H. Chen, K. C. Chao, J. I. Guo, J. S. Wang, and Y. S. Chu, “An Efficient Spurious Power Suppression Technique (SPST) and its Applications on MPEG-4 AVC/H.264 Transform Coding Design”, in Proc. IEEE Int. Symp. Low Power Electron. Des., 2005, pp. 155-160.

[8] K. H. Chen, Y. M. Chen, and Y. S. Chu, “A Versatile Multimedia Functional Unit Design using the Spurious Power Suppression Technique”, in Proc. IEEE Asian Solid-State Circuits Conf., 2006, pp. 111-114.

[9] Zhijun Huang, “High Level Optimization Techniques for Low Power Multiplier Design”, University of California, Los Angeles, 2003.
