
Navneet Ranjan* et al ISSN: 2250-3676 [IJESAT] [International Journal of Engineering Science & Advanced Technology] Volume-6, Issue-2, 287-299

1 IJESAT | May-Jun 2016 Available online @ http://www.ijesat.org

Design and Analysis of Low Power and High Speed MAC

Designs for Convolution based on FPGA

Navneet Ranjan, Vipul Agrawal*

Department of Electronic and Communication Engineering

Trinity Institute of Technology and Research, Bhopal

Madhya Pradesh, India *[email protected]

Abstract: Multimedia applications demand high-speed computing architectures. Adders and multipliers are key functional blocks in the Arithmetic and Logic Unit (ALU) of such architectures, and high-performance systems almost always require fast multiplication. This work presents implementations of high-speed signed and unsigned multipliers and their comparative analysis. Parallel multipliers widely used in VLSI architectures, namely the Booth, Wallace and Dadda multipliers, are implemented here in order to obtain their design attributes such as speed and area. The acquired design parameters are analyzed to design a speed-optimized Multiply and Accumulate (MAC) unit for multimedia applications such as filters, synthesizers and wireless communication channels. Finally, the designed MAC unit is applied to a DSP technique, namely convolution, and its performance is analyzed in terms of area and speed.

Keywords: MAC, Booth Multiplier, Wallace Multiplier, Dadda Multiplier, Convolution

1. INTRODUCTION

Multiplication is an operation in which the multiplicand is added to itself a number of times specified by the multiplier to give the product. Electronic systems such as FIR filters, digital signal processors and microprocessors require high-speed multipliers for fast arithmetic operations. The multiplication-based Multiply and Accumulate (MAC) operation is frequently used in applications such as multimedia and 3D graphics. Nowadays, the speed of the multiplication operation is an important factor in determining the instruction cycle time of a digital signal processing chip, and the demand for fast processing is increasing with the growth of computing and signal processing applications [1]. System performance depends on multiplier performance because the multiplier is usually the critical element in the system.

Therefore, the development of optimized multipliers is important for achieving high-performance portable multimedia devices and Digital Signal Processing systems. For many applications, it is important to reduce the delay and power consumption of the multipliers. These signal processing applications not only demand high computation speed and capacity but also consume a large amount of energy. Speed and area remain the two important factors in a design, and higher speed leads to more power consumption; thus, low-power architectures will be the choice of the future. As technology develops rapidly, researchers are focusing on designing optimized multipliers with the following objectives: high speed, small area, or both in one multiplier [2]. However, increasing the speed of a multiplier generally results in larger area. It is reported in the literature that optimizing the multiplier architecture enhances the performance of embedded processors used in consumer and industrial electronic products. At the same time, the regular structure needed for each processing element also grows and thus consumes area. Therefore, there is a need to develop an N-bit multiplier architecture that supports high speed and small area. The types of multipliers are shown in Figure 1 and described in detail below.

1. Serial multiplier: In serial multiplication, sequential circuits with feedback are used. The partial products are produced sequentially and then added serially. The serial multiplier is slower than the parallel multiplier because it adds each bit of the multiplicand sequentially, repeating the process for each multiplier bit, and because only one adder is used to add the m × n partial product bits, where m and n are the numbers of bits of the multiplicand and multiplier respectively. Serial multipliers are used where area and power are of utmost importance while delay can be tolerated. The compactness of the design allows the multiplier to run at a higher clock rate, which makes it competitive in speed with much more complex designs.

2. Parallel multiplier: In a first-generation parallel multiplier, partial products are formed by multiplying the multiplicand with each bit of the multiplier. The partial products are then added in parallel to produce the resultant product P. The multiplication is thus divided into two steps: generation of the partial products and addition of these partial products. The delay depends on the number of partial products to be added.

The Wallace tree multiplier is a fast multiplier that compresses the generated partial products as early as possible down to two final rows, which are then added with a carry propagate adder. The compression of the partial products is done by carry save adders. The Booth multiplier is based on Booth's algorithm and is used for signed and unsigned multiplication. The Baugh-Wooley multiplier is also used for signed and unsigned multiplication.

(a) Array multiplier.

(b) Tree multiplier.

3. Serial-parallel multiplier: The serial-parallel multiplier (SPM) operates on each bit of the multiplier serially but uses a parallel adder for partial product accumulation. It is a good trade-off between the time-consuming serial multiplier and the area-consuming parallel multiplier.
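The serial-parallel idea can be illustrated with a minimal behavioural sketch in Python. This is not the hardware design used in the paper; the function name `shift_add_multiply` and the unsigned operand width `n` are assumptions made for illustration only. The multiplier bits are consumed one per iteration (the serial part), while each shifted multiplicand is added to the running sum in a single full-width addition (the parallel part).

```python
def shift_add_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Unsigned n-bit multiply: one multiplier bit per step, one full-width add per step.

    Hypothetical helper for illustration; it models behaviour, not gates.
    """
    acc = 0
    for i in range(n):
        bit = (multiplier >> i) & 1        # multiplier bits are examined serially
        if bit:
            acc += multiplicand << i       # shifted multiplicand added in parallel
    return acc
```

Each loop iteration corresponds to one clock cycle of such a multiplier, so an n-bit multiplication takes n additions at most.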

FIGURE 1. Block Diagram for types of multiplier

The MAC unit consists of a multiplier and an accumulator that holds the sum of the previous consecutive products. The accumulator consists of an adder and a register; the basic function of the MAC operation is given as $F = \sum_i A_i \times B_i$. The MAC unit contributes significantly to Digital Signal Processor performance, so enhancing the MAC unit improves the performance of the Digital Signal Processor.

The multiplier block of DSP processors, microcontrollers and microprocessors causes considerable delay and occupies considerable area. This gives scope to implement and analyze fast multiplier architectures, apply those multipliers to the design of a MAC unit, analyze their performance in terms of delay and area, and make a comparative analysis. The proposed architecture is used for a fixed-point MAC unit. Most DSP applications demand floating-point MAC units, so there is also a need to design multipliers for floating-point arithmetic.

2. BACKGROUND

Prakash et al. [3] presented a 16-bit Wallace multiplier that uses a fast adder at the final stage to sum the final two rows. The partial products are produced by n^2 AND gates in the same manner as for Dadda multipliers. In the Wallace tree multiplier, all bits of all the partial products in each column are added in parallel by groups of compressors without propagating any carry to the next phase. Within each group of three rows, 3:2 compressors are applied to the columns containing three bits. Columns that have only a single bit are transferred to the next level unmodified. The time taken by the multiplication operation is reduced by the proposed Wallace multiplier architecture. According to their work, the designed modified carry save adder is faster and highly suitable for VLSI design. Vasudeva et al. [6] presented a modified Booth algorithm for signed and unsigned multiplication of two numbers along with the spurious power suppression technique (SPST). The SPST uses a detection unit to find the effective data range of the arithmetic units, e.g., adders or multipliers. When a portion of the data will not affect the final result, the data controlling unit of the SPST latches this portion to avoid unnecessary data transitions inside the arithmetic units. Furthermore, the data asserting control, realized with registers, filters out the unwanted spurious signals of the arithmetic unit whenever the latched portion is turned on. This helps reduce the delay caused by the ripple carry adder.



Riazullah and Kishore Kumar [7] presented a MAC using a Booth multiplier and the Spurious Power Suppression Technique (SPST). In this multiplier, the last three significant bits of the multiplier input are analyzed, the partial product is generated accordingly, and a shift of three bits is performed. The SPST uses a detection unit that removes the unwanted data range of the adder. The adder/subtractor is split into two main parts, the most significant part (MSP) and the least significant part (LSP). The SPST adder removes the unnecessary additions, thereby reducing switching power dissipation. Veeramachaneni [8] proposed novel architectures and designs of low-power, high-speed compressors used for addition in the partial product accumulation stage. The 3:2, 4:2 and 5:2 compressors are essential components in many applications where addition is required, most importantly in multiplication. The proposed compressors perform better than conventional compressors in the three important parameters of area, power and speed. Emphasis is laid on how to make use of multiplexers in arithmetic circuits. The new compressors combine low power, low transistor count and lower delay. Kafi et al. [9] presented an efficient design of an FSM multiplier that includes a partial product generator. The partial products are added sequentially using four 40-bit full adders and stored in registers in a pipelined manner. Multiplexers are used so that the additions proceed sequentially, and demultiplexers are used so that the added outputs are stored in the registers sequentially. A finite state machine controls this data path; its outputs are used as select lines for the multiplexers and demultiplexers.

Sharma et al. [10] presented a Booth multiplier and combined it with additional modules, such as logic functions and subtraction, addition, division and squaring modules, to design a calculator. The Booth multiplier performs multiplication by analyzing the last 2 bits of the multiplier and performing the required operation. In the calculator, the Booth multiplier is implemented alongside several other operation blocks such as 4-bit parallel addition, parallel subtraction, division, squaring, cubing and logical functions such as AND, OR and NOT. Li et al. [11] proposed a 32-bit signed and unsigned pipelined multiplier that contains a sign control unit, which produces the MSBs of the multiplicand and multiplier and the select signal required for the line of multiplexers. The Wallace tree compression technique is used to sum the partial product rows. A carry select adder is used in the second phase of the Wallace tree compression to minimize the delay, while a carry propagation adder is used to add the final two partial rows. The conditional sum adder has to save both the conditional sum and the carry, which increases the resources used. To obtain the benefits of both adders, a mixed CSA-CCA architecture is implemented to compute a fast final addition.

Karthick et al. [12] proposed and designed different kinds of compressors. They designed an 8-bit Wallace multiplier using the proposed compressors and compared the design with conventional Wallace multipliers in terms of power. Anandi and Rangarajan [13] focused on large DSP data path operators such as multipliers and MAC circuits, where lowering the energy used per operation is of great importance. These form the heart of the data path units of most commercial DSP processors. They first examined architectures for merged MAC circuits and then implemented a high-speed/low-power MAC architecture.

Jose and Mytheen designed an FIR filter using the Booth multiplier with low area cost. A direct-form filter is designed in such a way that a new data sample and filter coefficient are applied to the inputs of the multiplier with each new clock cycle. x[n] is taken as the input signal, and D flip-flops are used as the delay elements [14]. A modified Booth multiplier block multiplies the input signal with the set of filter coefficients corresponding to the selected filter order and provides the output signal y[n]. The modified Booth multiplier consists of a Booth encoder, a Booth decoder and a Wallace tree compressor (WTC). The Wallace tree compresses the partial product bits and thereby increases the speed of multiplication. The Wallace tree structure is built from compressors, full adders and other techniques. A Wallace tree compressor consists of a set of full adders (FAs); sometimes the FA at the LSB is replaced by a half adder (HA). The HA adds two input bits to produce one sum bit and one carry bit, while each FA adds three input bits at a time to produce one sum bit and one carry bit. The partial products (PPs) are therefore added in parallel by the Wallace tree compressor until two sequences of outputs are generated. Finally, these sequences of sum bits and carry bits are given to a carry look-ahead adder (CLA), which is among the fastest adders and speeds up the system. The CLA consists of a set of full adders.

Kumar et al. proposed a MAC unit that uses a Vedic multiplier and a Wallace tree. Their work combines the Urdhva Triyakbhyam sutra with a unique addition tree structure similar to Wallace's for multiplication. Their design has an n-bit multiplier, a 2n-bit adder and a 2n-bit accumulator [17]. Bordiya and Bandil worked on efficient implementation of high-speed multipliers using the shift-and-add method and the Radix-2 and Radix-4 Booth multiplier algorithms. They compared the multipliers by implementing each of them separately in an FIR filter [18]. For this, they first designed the three types of multiplier and then compared them on the factors that determine multiplier performance (speed, area and power). Another important improvement in the multiplier is reducing the number of partial products that are generated. In a Booth recoding multiplier, the encoder checks the inputs and generates the encoded partial products, based on which the decoder produces exactly half the number of partial products. The checking is done in triplets of the input bits [19]. Two bits come from the present pair, and the third bit is the high-order bit of the adjacent previous pair.

Rashmi et al. proposed a direct method for performing linear convolution of finite-length sequences. Multipliers are the basic blocks of convolvers. As multiplication takes most of the execution time, they used a 4×4 Vedic multiplier to increase the speed [20]. 4-bit samples are applied to the inputs of the Vedic multiplier, which produces 8-bit partial products at its output. For parallel processing, 16 multipliers are used to generate 16 partial products. All the outputs are then latched, and a carry look-ahead adder is used to sum the partial products. For the final stage of addition, a combination of a carry save adder and a ripple carry adder is used. Wallace gave a method for high-speed multiplication centered on reducing the partial products in parallel using a tree of carry save adders; it became known as the Wallace multiplier [21]. Dadda improved on the Wallace method with a compressor placement technique that requires fewer compressors in the partial product reduction phase at the cost of a larger carry propagation adder. Column compression multipliers continue to be studied because of their high-speed operation. The total delay of such a multiplier is proportional to the logarithm of the operand word length [22]. According to T. K. Callaway, column compression multipliers are much better in terms of power than array multipliers [23]. Garg presented a Dynamic Range Detection based multiplier aimed at reducing the power consumption of the multiplier. This method chooses as the multiplier the operand that is more likely to make more partial product rows zero after Booth recoding [24]. The power consumption is further reduced by truncating the output product at the price of output precision. The designed multiplier performs better than the conventional Booth multiplier under truncation. As a consequence, the designed multiplier is well suited to portable multimedia and DSP applications.

Lee carried out a relative performance analysis of different multiplier designs, namely the Array, Wallace, Dadda and Reduced Area multipliers. All the multiplier designs were modelled in Verilog HDL and synthesized. Logic synthesis reports for the Area, Speed and Auto optimization modes indicate that the Dadda multiplier is not always faster than the Wallace multiplier; it depends on the optimization mode used [25]. It was also found that the Wallace multiplier is suited for high-speed applications, while the Dadda and reduced-area designs give the best speed when synthesized to reduce area or logic usage.

Hiung presented a relative performance comparison of Radix-based Booth encoding multiplier designs. Radix-2, Radix-4, Radix-8, Radix-16 and Radix-32 Booth encoding multipliers were compared [26]. Synthesis reports obtained for the Area, Speed and Auto optimization modes indicate that the Radix-2 Booth encoding multiplier has the largest area and the longest delay. As the radix increases, the number of partial product rows is reduced, but at the cost of more complex operations. The results indicate that the Radix-4 Booth encoding multiplier is the best design in terms of speed and area. Seo designed a new parallel multiplier using the radix-2 Booth multiplier. A hybrid type of carry save adder is used that combines multiplication and accumulation [27]. The architecture has a Booth multiplier along with a carry save adder tree and an array for sign extension so that the bit density of the operands can be increased. The carry save adder tree reduces the partial products until only two rows remain, and the final two rows are added using a fast adder such as a carry look-ahead adder. Demori defined two kinds of parallel multipliers. The first kind uses a rectangular array of cells to sum the partial products; multipliers of this type are called array multipliers, and their delay is proportional to the operand word length. The second kind uses compressors to reduce the partial products, and the final two rows are added using a high-speed carry propagate adder [28].

3. EXISTING ARCHITECTURE

This section gives a detailed description of the parallel multiplier architectures that are implemented and used in the design of the MAC unit.

A. Booth Multiplier

Booth's algorithm is one of the important algorithms for signed number multiplication. The operation is performed by repeatedly adding one of two precomputed values, A (addition) or S (subtraction), to a partial product P and then performing an arithmetic right shift on P. Let x and y be the multiplicand and multiplier, and let nx and ny be the number of bits in x and y [4].

FIGURE 2. Booth’s Multiplier Flow Chart

The fast Booth's multiplier procedure obtains the product of x and y by the following rules:

1. First, calculate the values of Addition (A), Subtraction (S) and the initial value of the partial product P. All of these numbers should have a length equal to nx + ny + 1.

2. Addition (A): Fill the most significant (leftmost) bits with the value of the multiplicand x. Fill the remaining (ny + 1) bits with zeros.

3. Subtraction (S): Fill the most significant (leftmost) bits with the value of (−x) in two's complement notation. Fill the remaining (ny + 1) bits with zeros.

4. P: Fill the most significant nx bits with zeros. To the right of these zeros append the value of y. Fill the least significant (rightmost) bit with a zero.

5. Examine the two least significant (rightmost) bits of P.

• If they are 01, calculate P + A and ignore any overflow.

• If they are 10, calculate P + S and ignore any overflow.

• If they are 00 or 11, do nothing and use P directly in the next step.

6. Perform a single arithmetic right shift on the value obtained in the previous step. P now equals this new value.

7. Repeat steps 5 and 6 until they have been done ny times.

8. Drop the least significant (rightmost) bit from P. The result is the product of x and y.
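The A/S/P procedure can be sketched behaviourally in Python. This is a minimal sketch rather than the architecture synthesized in the paper. It assumes that S holds the two's complement of the multiplicand, that the examine-and-shift steps are repeated ny times, and that x is not the most negative nx-bit value (whose negation does not fit in nx bits). The function name `booth_multiply` is hypothetical.

```python
def booth_multiply(x: int, y: int, nx: int, ny: int) -> int:
    """Signed multiply of x (nx bits) and y (ny bits) using Booth's A/S/P registers.

    Assumption: x is not the most negative nx-bit value, since -x must fit in nx bits.
    """
    width = nx + ny + 1
    mask = (1 << width) - 1
    # A: multiplicand in the top nx bits, followed by (ny + 1) zeros.
    a = ((x & ((1 << nx) - 1)) << (ny + 1)) & mask
    # S: two's complement of the multiplicand in the top nx bits, then (ny + 1) zeros.
    s = ((-x & ((1 << nx) - 1)) << (ny + 1)) & mask
    # P: nx zeros, then the multiplier y, then a single zero.
    p = (y & ((1 << ny) - 1)) << 1

    for _ in range(ny):
        low = p & 0b11
        if low == 0b01:
            p = (p + a) & mask             # add, ignoring overflow
        elif low == 0b10:
            p = (p + s) & mask             # subtract, ignoring overflow
        sign = p & (1 << (width - 1))      # keep the sign bit: arithmetic right shift
        p = (p >> 1) | sign

    p >>= 1                                # drop the least significant bit
    bits = nx + ny
    if p & (1 << (bits - 1)):              # reinterpret as a signed (nx + ny)-bit value
        p -= 1 << bits
    return p
```

For example, `booth_multiply(3, -4, 4, 4)` walks through two "do nothing" steps, one subtraction and one further shift before returning -12.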

Andrew Donald Booth proposed a multiplication algorithm that became known as Booth's algorithm [33]. In the Booth multiplier, the number of partial products is reduced by the Booth encoding algorithm, which considers a specific number of multiplier bits at a time instead of just one, thereby achieving a speed advantage over other multiplier architectures. Figure 2 gives the detailed operation of the Booth multiplier; the same flow is followed in designing its architecture, which is in turn used to design the optimized MAC unit.

In the first stage, the last two bits of the partial product are checked and, according to the rules given above, the decision to add, subtract or leave the value unchanged is made; the stage is then concluded by an arithmetic right shift. The same process is followed for the other stages, and the final product is obtained by removing the least significant bit.

B. Tree Multiplier

The best-known high-speed multipliers are those proposed by Wallace and Dadda. Both multipliers consist of three phases. In the first phase, the partial products are generated. In the second phase, the partial products are reduced to a height of two rows. In the final phase, the last two rows are summed using an adder. In the Wallace method, the partial products are reduced as early as possible, whereas Dadda's method does the minimum reduction necessary at each level while still completing the reduction in the same number of levels as a Wallace multiplier [34].



FIGURE 3. Dadda Multiplier Flow Chart

Wallace Multiplier: C. S. Wallace presented a method for multiplication centered on reducing the partial product bits in parallel using carry save adders, which became known as the Wallace tree [21].

Following are the steps of this Multiplier:

1. First, the partial products are formed by analyzing the bits of the multiplier.

2. Next, the N rows of partial products are grouped together in sets of three rows each.

3. Any additional rows that are not a member of a group of three are transferred to the next level without modification.

4. In the Wallace multiplier, all the bits of all the partial products in each column are added together by a set of compressors in parallel without propagating any carries.

5. Within each group of three rows, 3:2 compressors are applied to the columns containing three bits and 2:2 compressors to the columns containing two bits.

6. Columns that contain only a single bit are passed to the next level unchanged. The final two rows that remain at the end are summed using a carry propagation adder.

The main concept of this kind of multiplier is to accumulate the partial products by reducing the number of bits in each column using full adders and half adders. The full adder and half adder are called (3:2) and (2:2) compressors because they add three bits or two bits, respectively, from a single column of the partial product array and produce two output bits: one bit in the same column and one bit in the next higher bit position.
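The column compression idea can be illustrated with a short Python sketch. It is a simplified behavioural model and not the Wallace or Dadda netlist from the paper: it compresses every column greedily with (3:2) and (2:2) compressors instead of grouping rows in threes, and it models the final carry propagate adder with an ordinary integer addition. The names `full_adder`, `half_adder` and `tree_multiply` are hypothetical.

```python
def full_adder(a: int, b: int, c: int):
    """(3:2) compressor: three bits of one column -> sum bit, carry bit."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a: int, b: int):
    """(2:2) compressor: two bits of one column -> sum bit, carry bit."""
    return a ^ b, a & b


def tree_multiply(x: int, y: int, n: int) -> int:
    """Unsigned n-bit multiply by column compression down to two rows."""
    # Phase 1: generate partial product bits; columns[k] holds bits of weight 2**k.
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))

    # Phase 2: compress until no column holds more than two bits.
    while max(len(col) for col in columns) > 2:
        nxt = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            i = 0
            while len(col) - i >= 3:
                s, c = full_adder(col[i], col[i + 1], col[i + 2])
                nxt[k].append(s)           # sum stays in the same column
                nxt[k + 1].append(c)       # carry moves to the next higher column
                i += 3
            if len(col) - i == 2:
                s, c = half_adder(col[i], col[i + 1])
                nxt[k].append(s)
                nxt[k + 1].append(c)
            elif len(col) - i == 1:
                nxt[k].append(col[i])      # single bit passes through unchanged
        columns = nxt

    # Phase 3: add the final two rows (stands in for the carry propagate adder).
    row_a = sum(col[0] << k for k, col in enumerate(columns) if len(col) >= 1)
    row_b = sum(col[1] << k for k, col in enumerate(columns) if len(col) == 2)
    return row_a + row_b
```

Because every compressor preserves the weighted sum of the bits it consumes, the two remaining rows always add up to the product, whatever order the compressors are placed in.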



Dadda Multiplier: Dadda improved on Wallace's method with a compressor placement strategy that requires fewer compressors in the partial product reduction stage at the price of a larger carry propagation adder in the final stage. In both techniques, the total delay is proportional to the logarithm of the operand word length [22].

C. Partial Product Reduction

The efficiency of a digital multiplier depends on the technique used to add the bits of the partial product array. Each partial product is a shifted version of the multiplicand, and the delay of a conventional adder is proportional to the operand width, so the multiplier needs a large amount of time if conventional adders are used for the addition. In a carry save adder, carry propagation is avoided by taking the carries as outputs rather than passing them to the next higher bit position. A carry save adder is a group of full adders without carry chaining. An n-bit carry save adder therefore takes three n-bit inputs, A(n-1)...A(0), B(n-1)...B(0) and Cin(n-1)...Cin(0), and produces two n-bit results, Sum(n-1)...Sum(0) and Cout(n-1)...Cout(0). The most important application of the carry save adder is to decrease the number of partial products in integer multiplication. When the carry save adder is used in the partial product reduction stage of the multiplier, each output carry is passed to the next stage in the next higher column.
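The behaviour of an n-bit carry save adder can be sketched in Python with bitwise operators, where each bit position acts as an independent full adder. This is an illustrative model only; the function name `carry_save_add` is hypothetical.

```python
def carry_save_add(a: int, b: int, cin: int, n: int):
    """Add three n-bit words without carry propagation.

    Returns (sum_word, cout_word). Bit i of cout_word has weight 2**(i + 1),
    so the true total is sum_word + (cout_word << 1).
    """
    mask = (1 << n) - 1
    a, b, cin = a & mask, b & mask, cin & mask
    sum_word = a ^ b ^ cin                          # per-bit sum, no carry chain
    cout_word = (a & b) | (b & cin) | (a & cin)     # per-bit carry out
    return sum_word, cout_word
```

The three operands are reduced to two words in the time of a single full adder, independent of n; only the final addition of `sum_word` and the shifted `cout_word` needs a carry propagate adder.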

3:2 Compressor: A 3:2 compressor operates much like a full adder. It takes the input bits A, B and Cin of rank 0 (bit position 0) and generates the sum at bit position 0 and Cout at bit position 1 [35].

Final Stage Adder: Portable systems have gained great popularity, and their rapid development has increased the circuit density of integrated circuits. Therefore, the delay and area of these systems are important design objectives. As adders are very important and extensively used units in integrated circuits, the effective design of adders has become the aim of much research in VLSI design. The final addition stage is important for multipliers of any kind because it is at this level that large operands are added, and this level determines the overall delay of the multiplier. This stage uses a carry propagate adder such as the ripple carry adder, which is discussed below.

4. PROPOSED IMPLEMENTATION OF MAC

A MAC block includes a multiplier and an accumulator that holds the sum of the previous consecutive products. The MAC unit is one of the key blocks of a digital signal processing system and plays an important role in determining its delay and area. The MAC unit is implemented using all three multipliers. The main focus in MAC design has been to increase its speed, as speed and throughput have always been important concerns of digital signal processing systems. The MAC unit is one of the important data path operators of DSP processors. With recent developments, DSP processors have become important parts of biomedical devices such as hearing aids, which are required to be small, so it is important for the DSP processor to occupy a small area and produce little delay. The major block of the MAC unit is the multiplier, and the multiplication scheme used should be efficient enough to produce little delay.



FIGURE 4. Block Diagram of MAC unit using all three multipliers

The operation of the MAC unit is expressed as:

$F = \sum_i A_i \times B_i$

Operation: The inputs are given to the multiplier, and after the multiplication the 16-bit output is given to the adder; the other input to the adder comes from the temporary register. Initially the value from the temporary register is zero. When the next 8-bit inputs are taken, the value from the output register is fed to the temporary register, and when the next output from the multiplier arrives, it is added to the previous sum. The MAC unit fetches its inputs from memory and feeds them to its multiplier component; a register is used as the accumulator. The construction of the multiplier plays a vital role in the MAC unit, so the selected multipliers are the Booth, Wallace and Dadda multipliers, and a comparative analysis is then done in terms of area and delay. The MAC unit is implemented using all three multipliers. The designed MAC is clocked: on consecutive clock cycles the inputs are taken into the multiplier from the register, and the product is output to the accumulator, which consists of an adder and a register that work in coordination with the clock.
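The clocked multiply-accumulate behaviour can be sketched in Python. This is a behavioural model, not the synthesized design. The class name `MacUnit`, the `clock` method and the pluggable `multiply` argument are hypothetical, and the accumulator width is left unbounded here because the text does not state it. Inputs are treated as unsigned 8-bit values and the product as 16 bits.

```python
class MacUnit:
    """Behavioural model of a MAC: 8-bit inputs, 16-bit product, running sum."""

    def __init__(self, multiply=lambda a, b: a * b):
        self.multiply = multiply   # stands in for a Booth, Wallace or Dadda multiplier
        self.acc = 0               # accumulator register starts at zero

    def clock(self, a: int, b: int) -> int:
        """One clock cycle: multiply the new inputs and add to the previous sum."""
        product = self.multiply(a & 0xFF, b & 0xFF) & 0xFFFF
        self.acc += product        # accumulator width is an assumption (unbounded here)
        return self.acc


mac = MacUnit()
pairs = [(7, 1), (6, 2), (5, 3), (4, 4), (3, 5), (2, 6), (1, 7)]
for m1, h1 in pairs:
    dout = mac.clock(m1, h1)
```

With the seven input pairs used in the simulations of Section 5, the running sum ends at 7 + 12 + 15 + 16 + 15 + 12 + 7 = 84, which matches the reported dout.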

Linear Convolution Application Using designed MAC unit

In this application, five MAC units are used; they perform a process similar to accumulation while the sequences of samples are multiplied. With each clock pulse an input X sample enters the line of MAC units, and with each new clock cycle the X samples shift along the line of MAC units and are multiplied by the H sample values.

FIGURE 5. Block Diagram Linear Convolution Operation

In effect, the line of MAC units together performs the accumulation. Once all the samples of the X input have entered the MAC units, they gradually leave with each new clock cycle. In this operation the H sample values are parameterized; as a consequence, we obtain all the samples of the output sequence Y. The MAC unit is thus used to perform the convolution operation.

Figure 5 shows all the MAC units that are active in each clock cycle. As the input X samples are multiplied by the predefined H samples, each sample of the output sequence Y is the accumulated sum over all MAC units, which are interconnected to produce this sum. These MAC units also work in coordination with the clock: in each new clock cycle the samples of the X sequence are shifted along the line of MAC units and multiplied by the H sequence samples. The adders in the MAC units also work synchronously with the clock.
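The shifting line of MAC units can be modelled with a short Python sketch. It is a behavioural illustration under stated assumptions, not the FPGA implementation: one MAC unit per H coefficient, X samples shifted one position per clock cycle, and zeros fed in after the last X sample so that the line empties. The function name `linear_convolution` is hypothetical.

```python
def linear_convolution(x, h):
    """Linear convolution modelled as a line of len(h) multiply-accumulate units."""
    taps = [0] * len(h)                    # X samples currently held in the MAC line
    y = []
    # Feed every X sample, then zeros so the last samples shift out of the line.
    for sample in list(x) + [0] * (len(h) - 1):
        taps = [sample] + taps[:-1]        # new clock cycle: shift along the line
        acc = 0
        for coeff, value in zip(h, taps):
            acc += coeff * value           # one multiply-accumulate per MAC unit
        y.append(acc)
    return y


X = [3, 5, 4, 1, 2, 1, 2, 3, 4, 6, 7]
H = [1, 2, 3, 4, 5]
Y = linear_convolution(X, H)
```

With the five-sample h sequence used in Section 5 the model has five MAC units, and the 11-sample X sequence yields 11 + 5 - 1 = 15 output samples.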

5. SIMULATION RESULTS

This section contains the simulation results and synthesis reports of all the implemented multipliers, the MAC unit and the convolution application using the MAC unit. The simulation results and synthesis reports were obtained using the Xilinx 13.2 tool, targeting the Virtex-5 XC5VLX110T-FF1136 device with a speed grade of -1.

1. MAC using Booth Multiplier simulation result: Figure 6 shows the simulation result for the MAC using the Booth multiplier. The inputs, in decimal, are 1) (M1=7,H1=1) 2) (M1=6,H1=2) 3) (M1=5,H1=3) 4) (M1=4,H1=4) 5) (M1=3,H1=5) 6) (M1=2,H1=6) 7) (M1=1,H1=7), and the result, also in decimal, is dout=84. Once the sequence is given, the simulation run operation has to be invoked repeatedly, once for each accumulated output, until all the inputs have been multiplied and accumulated.

2. MAC using Dadda Multiplier simulation result: Figure 7 shows the simulation result for the MAC designed using the Dadda multiplier. The inputs, in decimal, are 1) (M1=7,H1=1) 2) (M1=6,H1=2) 3) (M1=5,H1=3) 4) (M1=4,H1=4) 5) (M1=3,H1=5) 6) (M1=2,H1=6) 7) (M1=1,H1=7), and the result, also in decimal, is dout=84. Once the sequence is given, the simulation run operation has to be invoked repeatedly, once for each accumulated output, until all the inputs have been multiplied and accumulated.

FIGURE 6: Simulation result for MAC using Booth’s Multiplier



FIGURE 7: Simulation result for MAC using Dadda Multiplier

3. MAC using Wallace Multiplier simulation result: Figure 8 shows the simulation result for the MAC designed using the Wallace multiplier. The inputs, in decimal, are 1) (M1=7,H1=1) 2) (M1=6,H1=2) 3) (M1=5,H1=3) 4) (M1=4,H1=4) 5) (M1=3,H1=5) 6) (M1=2,H1=6) 7) (M1=1,H1=7), and the result, also in decimal, is 84. Once the sequence is given, the simulation run operation has to be invoked repeatedly, once for each accumulated output, until all the inputs have been multiplied and accumulated.

FIGURE 8: Simulation result for MAC using Wallace Multiplier

4. Linear convolution application using MAC unit with Booth Multiplier: Figure 9 shows the simulation result of linear convolution using the designed MAC unit with the Booth multiplier. The inputs are in decimal; the two sequences are X=[3,5,4,1,2,1,2,3,4,6,7] and h=[1,2,3,4,5], and the convolved result in decimal is Y=[3,11,23,36,51,49,34,23,30,36,53,63,65,58,35]. Once the inputs are given, the simulation run operation has to be invoked repeatedly so that the clocks are generated. With each clock all the MAC units are active, the corresponding inputs are multiplied as they pass through the line of MAC units, and the result is obtained.

5. Linear convolution application using MAC unit with Wallace Multiplier: Figure 10 shows the simulation result of linear convolution using the designed MAC unit with the Wallace multiplier. The inputs are in decimal; the two sequences are X=[3,5,4,1,2,1,2,3,4,6,7] and h=[1,2,3,4,5], and the convolved result in decimal is reported as Y=[3,11,23,36,51,49,34,23,30,35,50,58,58,49,30]. Once the inputs are given, the simulation run operation has to be invoked repeatedly so that the clocks are generated. With each clock all the MAC units are active, the corresponding inputs are multiplied as they pass through the line of MAC units, and the result is obtained.

6. Linear convolution application using MAC unit with Dadda Multiplier: Figure 11 shows the simulation result of linear convolution using the designed MAC unit with the Dadda multiplier. The inputs are in decimal; the two sequences are X=[3,5,4,1,2,1,2,3,4,6,7] and h=[1,2,3,4,5], and the convolved result in decimal is reported as Y=[3,11,23,36,51,49,34,23,30,35,50,58,58,49,30]. Once the inputs are given, the simulation run operation has to be invoked repeatedly so that the clocks are generated. With each clock all the MAC units are active, the corresponding inputs are multiplied as they pass through the line of MAC units, and the result is obtained.
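The convolution results listed in items 4 to 6 can be cross-checked against the defining sum y[n] = Σ h[k]·x[n−k]. The sketch below is a plain software reference, not the hardware design; the function and constant names are hypothetical. Evaluating the stated X and h reproduces the Y listed for the Booth-based design. The Y listed for the Wallace- and Dadda-based designs differs in its last six samples and would instead match an X sequence ending in 4, 5, 6; that is an observation from the arithmetic only, not something the source states.

```python
def linear_convolution(x, h):
    """Direct evaluation of y[n] = sum_k h[k] * x[n - k]."""
    y = [0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


X = [3, 5, 4, 1, 2, 1, 2, 3, 4, 6, 7]
H = [1, 2, 3, 4, 5]
Y_BOOTH = [3, 11, 23, 36, 51, 49, 34, 23, 30, 36, 53, 63, 65, 58, 35]
Y_WALLACE_DADDA = [3, 11, 23, 36, 51, 49, 34, 23, 30, 35, 50, 58, 58, 49, 30]

# Hypothetical alternative input, used only to explain the differing listing.
X_ALT = [3, 5, 4, 1, 2, 1, 2, 3, 4, 5, 6]

print(linear_convolution(X, H) == Y_BOOTH)              # True
print(linear_convolution(X, H) == Y_WALLACE_DADDA)      # False
print(linear_convolution(X_ALT, H) == Y_WALLACE_DADDA)  # True
```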



FIGURE 9: Simulation result for linear convolution application using MAC Unit with Booth Multiplier

FIGURE 10: Simulation result for linear convolution application using MAC Unit with Wallace Multiplier

FIGURE 11: Simulation result for linear convolution application using MAC Unit with Dadda Multiplier

Synthesis Report

1. Synthesis report for all three multipliers

TABLE 1: Comparative analysis of the implemented multipliers in terms of delay and area

Multiplier    Delay (ns)    Slices
Booth         3.29          95
Wallace       14.66         107
Dadda         12.83         111

2. Synthesis report for the MAC using all three multipliers

TABLE 2: Comparative analysis of the designed MAC unit in terms of delay and area

MAC (With Multiplier)    Delay (ns)    Slices
Booth                    3.296         111
Wallace                  7.014         130
Dadda                    6.95          136

3. Synthesis report for the linear convolution application using the MAC unit with all three multipliers

TABLE 3: Synthesis report of the linear convolution application using the MAC unit, in terms of delay and area

MAC (With Multiplier)    Delay (ns)    Slices
MAC with Booth           3.296         92
MAC with Wallace         2.87          88
MAC with Dadda           3.357         99

Comparative Study

This part compares the 8 × 8 bit Wallace and Dadda multipliers. The Wallace multiplier uses 38 full adders and 15 half adders, whereas the Dadda multiplier uses 35 full adders and 7 half adders. The carry propagation adder is 10 bits wide in the Wallace multiplier and 14 bits wide in the Dadda multiplier. The Dadda multiplier is less regular than the Wallace multiplier, which makes it more difficult to realize in a VLSI design.
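The Dadda figures quoted above can be reproduced by simulating the column-height reduction on an unsigned 8 × 8 partial-product array. The sketch below assumes the usual Dadda stage heights (2, 3, 4, 6, ...) and the usual rule of placing a half adder only when a column exceeds the stage target by one; the function name `dadda_adder_counts` is hypothetical, and the model counts adders only, without building the multiplier. It says nothing about the Wallace counts.

```python
def dadda_adder_counts(n):
    """Count adders used by a Dadda reduction of an n x n unsigned partial-product array.

    Returns (full_adders, half_adders, cpa_width).
    """
    # Column heights of the partial-product matrix: 1, 2, ..., n, ..., 2, 1.
    heights = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]

    # Stage targets: d_1 = 2, d_{j+1} = floor(1.5 * d_j), kept while below n.
    targets = [2]
    while targets[-1] * 3 // 2 < n:
        targets.append(targets[-1] * 3 // 2)

    full_adders = half_adders = 0
    for d in reversed(targets):
        carries_in = 0
        for col in range(len(heights)):
            h = heights[col] + carries_in  # include carries from the previous column
            carries_out = 0
            while h > d:
                if h == d + 1:
                    half_adders += 1  # half adder removes one bit from the column
                    h -= 1
                else:
                    full_adders += 1  # full adder removes two bits from the column
                    h -= 2
                carries_out += 1
            heights[col] = h
            carries_in = carries_out

    # Columns left with two bits must be summed by the carry propagation adder.
    cpa_width = sum(1 for h in heights if h == 2)
    return full_adders, half_adders, cpa_width


print(dadda_adder_counts(8))  # (35, 7, 14)
```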

The synthesis reports were obtained using Xilinx 13.2, targeting a Virtex-5 XC5VLX110T-FF1136 device with speed grade -1. The implemented multipliers are Booth, Dadda and Wallace. The Booth multiplier has lower delay and area than the other two multipliers (Table 1). The Wallace multiplier has more delay than the Dadda multiplier but occupies less area. Among the MAC units, the one using the Booth multiplier has the least delay (Table 2). When the MAC units are used for the linear convolution application, Table 3 reports the least area (88 slices) for the MAC with the Wallace multiplier and the largest (99 slices) for the MAC with the Dadda multiplier.
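The comparisons drawn from Tables 1 to 3 can be checked mechanically. The snippet below is only a convenience sketch: the dictionary layout and the helper name `best` are assumptions, and the numbers are transcribed from the tables as (delay in ns, slices).

```python
# (delay in ns, slices) transcribed from Tables 1-3.
TABLES = {
    "multiplier": {"Booth": (3.29, 95), "Wallace": (14.66, 107), "Dadda": (12.83, 111)},
    "mac": {"Booth": (3.296, 111), "Wallace": (7.014, 130), "Dadda": (6.95, 136)},
    "convolution": {"Booth": (3.296, 92), "Wallace": (2.87, 88), "Dadda": (3.357, 99)},
}


def best(table):
    """Return (design with the lowest delay, design with the fewest slices)."""
    fastest = min(table, key=lambda name: table[name][0])
    smallest = min(table, key=lambda name: table[name][1])
    return fastest, smallest


for label, table in TABLES.items():
    print(label, best(table))
```

For the tabulated values, the Booth-based design is both fastest and smallest for the standalone multiplier and the MAC, while the Wallace-based design is fastest and smallest for the convolution application.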

6. CONCLUSION

Architectures for the Booth, Wallace and Dadda multipliers have been proposed and implemented on a Virtex-5 FPGA. The synthesis reports of the multipliers are tabulated and compared in terms of area and delay. Compared to the Wallace and Dadda multipliers, the Booth multiplier gives better performance in terms of both speed and area. These parallel multipliers have been used to design MAC units, and their synthesis reports are tabulated. The MAC with the Booth multiplier has the least delay and area. The MAC with the Wallace multiplier has more delay but less area than the MAC with the Dadda multiplier. Finally, the designed MAC units have been used for a DSP application, convolution, and the synthesis reports have been tabulated. According to Table 3, the convolution design using the MAC with the Wallace multiplier has the lowest delay (2.87 ns), while the Booth- and Dadda-based designs report 3.296 ns and 3.357 ns respectively.

