Navneet Ranjan* et al ISSN: 2250-3676 [IJESAT] [International Journal of Engineering Science & Advanced Technology] Volume-6, Issue-2, 287-299
1 IJESAT | May-Jun 2016 Available online @ http://www.ijesat.org
Design and Analysis of Low Power and High Speed MAC
Designs for Convolution based on FPGA
Navneet Ranjan, Vipul Agrawal*
Department of Electronic and Communication Engineering
Trinity Institute of Technology and Research, Bhopal
Madhya Pradesh, India *[email protected]
Abstract: Multimedia applications increasingly demand high-speed computing architectures. Adders and
multipliers are key functional blocks in the Arithmetic and Logic Unit (ALU) of such architectures, and
high-performance systems in almost all cases require fast multiplication. This work presents
implementations of high-speed signed and unsigned multipliers and their comparative analysis. Parallel
multipliers widely used in VLSI architectures, namely the Booth, Wallace and Dadda multipliers, are
implemented here in order to obtain design attributes such as speed and area. The acquired design
parameters are analyzed to design an optimum-speed Multiply and Accumulate (MAC) unit for multimedia
applications such as filters, synthesizers and wireless communication channels. Finally, the designed
MAC unit is applied to a DSP technique, convolution, and its performance is analyzed in terms of area
and speed.
Keywords: MAC, Booth Multiplier, Wallace Multiplier, Dadda Multiplier, Convolution
1. INTRODUCTION
Multiplication is an operation in which the multiplicand is added to itself the number of times specified by
the multiplier to give the product. Electronic systems such as FIR filters, digital signal processors and
microprocessors require high-speed multipliers for fast arithmetic. The multiplication-based Multiply and
Accumulate (MAC) operation is frequently used in multimedia and 3D graphics applications. Nowadays the
speed of multiplication is an important factor in determining the instruction cycle time of a digital signal
processing chip, and the demand for fast processing is increasing with the growth of computing and signal
processing applications [1]. System performance depends on multiplier performance because the multiplier
is usually the critical element in the system.
Therefore, developing optimized multipliers is important for achieving high-performance portable
multimedia devices and digital signal processing systems. For many applications it is important to reduce
both the delay and the power consumption of the multipliers. These signal processing applications not only
demand high computation speed and capacity but also consume a large amount of energy. Speed and area
remain the two main factors in a design, and because higher speed leads to higher power consumption,
low-power architectures will be the choice of the future. As technology develops, researchers are focusing
on designing optimized multipliers with the following objectives: high speed, small area, or both in one
multiplier [2]. Increasing multiplier speed, however, generally results in larger area. The literature reports
that optimizing the multiplier architecture enhances the performance of embedded processors used in
consumer and industrial electronic products. The regular structure needed for each processing element
also grows and thus consumes area. Therefore, there is a need to develop an N-bit multiplier architecture
that supports high speed and small area. The types of multipliers are shown in Figure 1 and described in
detail below.
1. Serial multiplier: Serial multiplication uses sequential circuits with feedback. The partial
products are produced sequentially and added serially. A serial multiplier is slower than a parallel
multiplier because it adds each bit of the multiplicand sequentially and repeats the process for each
multiplier bit, and because only one adder is used to add the m × n partial products, where m and n
are the numbers of bits of the multiplicand and multiplier respectively. Serial multipliers are used
where area and power are of utmost importance and delay can be tolerated. The compactness of the
design allows the multiplier to run at a higher clock rate, which makes it competitive in speed with
much more complex designs.
2. Parallel multiplier: In a first-generation parallel multiplier, partial products are formed by multiplying the multiplicand by each bit of the multiplier. The partial products are then added in parallel
to produce the resulting product P. Multiplication is thus divided into two steps: generation of the
partial products and addition of these partial products. The delay depends on the number of partial
products to be added.
Wallace tree multipliers are fast multipliers that compress the generated partial products as early as
possible down to a final two rows, which are then added with a carry-propagate adder. The compression of
the partial products is done by carry-save adders. The Booth multiplier is based on Booth's algorithm and is
used for signed and unsigned multiplication. The Baugh-Wooley multiplier is also used for signed and
unsigned multiplication.
(a) Array multiplier.
(b) Tree multiplier.
3. Serial-parallel multiplier: The serial-parallel multiplier (SPM) operates on each bit of the multiplier
serially but uses a parallel adder for partial product accumulation, a good trade-off between the
time-consuming serial multiplier and the area-consuming parallel multiplier.
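The bit-serial behaviour described in the list above can be illustrated with a minimal software sketch. It is not the hardware design of this paper: the function name `shift_add_multiply` and the default 8-bit width are assumptions chosen for illustration. One multiplier bit is examined per step, and a full-width addition accumulates the shifted multiplicand, as in the serial-parallel scheme.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, n_bits: int = 8) -> int:
    """Unsigned shift-and-add multiplication, one multiplier bit per step.

    Illustrative model only: n_bits is an assumed operand width.
    """
    mask = (1 << n_bits) - 1
    multiplicand &= mask          # keep operands within the assumed width
    multiplier &= mask
    accumulator = 0
    for step in range(n_bits):    # one "clock" per multiplier bit
        if (multiplier >> step) & 1:
            # Partial product = multiplicand shifted to the bit's weight;
            # it is added with a single full-width (parallel) addition.
            accumulator += multiplicand << step
    return accumulator


if __name__ == "__main__":
    print(shift_add_multiply(13, 11))
```

The loop makes the trade-off visible: n steps are needed for an n-bit multiplier, whereas a parallel multiplier forms all partial products at once.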
FIGURE 1. Block Diagram for types of multiplier
A MAC unit consists of a multiplier and an accumulator that holds the sum of the previous consecutive
products. The accumulator consists of an adder and a register; the basic MAC operation is given as
$F = \sum_i A_i \times B_i$. The MAC unit contributes significantly to the performance of a digital signal
processor, so enhancing the MAC unit optimizes the performance of the processor.
The multiplier block of DSP processors, microcontrollers and microprocessors causes considerable delay
and occupies considerable area. This gives scope to implement and analyze fast multiplier architectures,
apply them to the design of a MAC unit, analyze their performance in terms of delay and area, and make a
comparative analysis. The proposed architecture is a fixed-point MAC unit. Most DSP applications demand
floating-point MAC units, so there is also a need to design multipliers for floating-point arithmetic.
2. BACKGROUND
Prakash et al. [3] presented a 16-bit Wallace multiplier that uses a fast adder at the final stage to sum the
final two rows. The partial products are produced by $n^2$ AND gates in the same manner as for Dadda
multipliers. In the Wallace tree multiplier, all bits of all the partial products in each column are added in
parallel by groups of compressors without propagating any carry to the next phase. Within each group of
three rows, 3:2 compressors are applied to the columns containing three bits. Columns that have only a
single bit are transferred to the next level unchanged. The proposed Wallace multiplier architecture
reduces the time taken by the multiplication. According to their work, the designed modified carry-save
adder is faster and highly suitable for VLSI design. Vasudeva et al. [6] presented a modified Booth
algorithm for signed and unsigned multiplication of two numbers together with the spurious power
suppression technique (SPST). SPST uses a detection unit to find the effective data range of arithmetic
units such as adders or multipliers. When part of the data will not affect the final result, the data
controlling unit of the SPST latches this portion to avoid unnecessary data transitions inside the arithmetic
units. Furthermore, the data asserting control, realized with registers, filters out the unwanted spurious
signals of the arithmetic unit whenever the latched portion is turned on. This helps reduce the delay caused
by the ripple carry adder.
Riazullah and Kishore Kumar [7] presented a MAC using a Booth multiplier and the Spurious Power
Suppression Technique (SPST). In this multiplier, the three least significant bits of the multiplier operand
are examined, the partial product is generated accordingly, and a shift of three bits is performed. The SPST
uses a detection unit that removes the unwanted data range of the adder. The adder/subtractor is split into
two main components, the most significant part (MSP) and the least significant part (LSP). The SPST adder
removes the unnecessary additions and thereby reduces switching power dissipation. Veeramachaneni [8]
proposed novel architectures and designs of low-power, high-speed compressors for the partial product
addition (accumulation) stage. The 3:2, 4:2 and 5:2 compressors are essential components in many
applications that require addition, most importantly multiplication. The proposed compressors perform
better than conventional compressors in the three important parameters of area, power and speed.
Emphasis is placed on how to make use of multiplexers in arithmetic circuits. The new compressors
combine low power, low transistor count and lower delay. Kafi et al. [9] presented an efficient design of
an FSM multiplier that consists of a partial product generator. The partial products are added sequentially
using four 40-bit full adders and stored in registers in a pipelined manner. A mux is used so that the
additions proceed sequentially, and demuxes are used so that the added output is stored in the registers
sequentially. A finite state machine controls this data path; its outputs serve as the select lines of the mux
and demux.
Sharma et al. [10] presented a Booth multiplier and combined it with additional modules, such as logic
functions and subtraction, addition, division and squaring modules, to design a calculator. The Booth
multiplier performs multiplication by examining the last two bits of the multiplier and performing the
required operation. The calculator is implemented by combining this Booth multiplier with the other
operation blocks, such as 4-bit parallel addition, parallel subtraction, division, squaring, cubing and
logical functions such as AND, OR and NOT. Li et al. [11] proposed a 32-bit signed and unsigned pipelined
multiplier that contains a sign control unit, which produces the MSBs of the multiplicand and multiplier as
well as the select signal required for the line of multiplexers. The Wallace tree compression technique is
used to sum the partial product rows. The carry select adder is used in the second phase of the Wallace
tree compression to minimize the delay, while a carry propagation adder is used to add the final two
partial product rows. The conditional sum adder has to save both the conditional sum and the carry. As a
consequence more multipliers are used; to gain the benefits of both adders, a mixed CSA-CCA
architecture is implemented to compute a fast final addition.
Karthick et al. [12] proposed and designed different kinds of compressors. They designed an 8-bit Wallace
multiplier using the proposed compressors and compared the design with conventional Wallace multipliers
in terms of power. Anandi and Rangarajan [13] focused on large DSP data path operators such as
multipliers and MAC circuits, where lowering the energy used per operation is of great importance. These
form the heart of the data path units of most commercial DSP processors. They first examined the
architecture for merged MAC circuits and then implemented a high-speed, low-power MAC architecture.
Jose and Mytheen designed an FIR filter using the Booth multiplier with low area cost. A direct-form filter
is designed in such a way that a new data sample and the filter coefficients are applied to the inputs of the
multiplier with each new clock cycle. x[n] is taken as the input signal, and D flip-flops are used as the
delay elements [14]. A modified Booth multiplier block multiplies the input signal by the set of filter
coefficients corresponding to the selected filter order and provides the output signal y[n]. The modified
Booth multiplier is based on the Booth algorithm and includes a Booth encoder, a Booth decoder and a
Wallace tree compressor (WTC). The Wallace tree reduction compresses the partial product bits and thus
increases the speed of multiplication. The Wallace tree structure is composed of compressors, full adders
and various other techniques. A Wallace tree compressor consists of a set of full adders (FAs); sometimes
the FA at the LSB is replaced by a half adder (HA). The HA adds two input bits to produce one sum bit and
one carry bit, and each FA adds three input bits at a time to produce one sum bit and one carry bit. The
partial products (PPs) are therefore added in parallel by the Wallace tree compressor until two sequences
of outputs are generated. Finally, these sequences of sum bits and carry bits are given to a carry
look-ahead adder (CLA), which provides a speed boost to the system. CLAs are among the fastest adders
and consist of a set of full adders.
Kumar et al. proposed a MAC unit that uses a Vedic multiplier and a Wallace tree. Their paper also
combines the Urdhva Triyakbhyam sutra with a unique addition
tree structure similar to Wallace for multiplication. They use an n-bit multiplier, a 2n-bit adder and a
2n-bit accumulator [17]. Bordiya and Bandil worked on efficient implementation of high-speed multipliers
using the shift-and-add method and the Radix-2 and Radix-4 Booth multiplier algorithms. They compared
the multipliers by implementing each of them separately in an FIR filter [18]. For this they first designed
three types of multiplier: shift-and-add, Radix-2 Booth and Radix-4 Booth. They then compared the
multipliers on the factors that determine multiplier performance (speed, area and power). Another
important improvement in the multiplier is reducing the number of partial products that are generated. In
a Booth recoding multiplier the encoder examines the inputs and generates the encoded partial products,
and on that basis the decoder produces exactly half of the partial products. The examination is done on
triplets of the input bits [19]: two bits come from the present pair, and the third bit is the high-order bit of
the adjacent previous pair.
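The triplet-based recoding described above can be sketched in software. This is a minimal illustration of standard radix-4 Booth recoding, not the encoder of any of the cited works; the function names `booth_radix4_digits` and `booth_radix4_multiply` are hypothetical. Each overlapping triplet (two bits of the present pair plus the high-order bit of the previous pair) yields one signed digit in {-2, -1, 0, 1, 2}, so an n-bit multiplier produces about n/2 partial products.

```python
def booth_radix4_digits(multiplier: int, n_bits: int) -> list[int]:
    """Recode a signed n_bits multiplier into radix-4 Booth digits, LSB first.

    Python's >> on negative ints is an arithmetic shift, so the operand is
    implicitly sign-extended when n_bits is odd.
    """
    digits = []
    prev_bit = 0                                  # implicit bit to the right of the LSB
    for k in range((n_bits + 1) // 2):
        low = (multiplier >> (2 * k)) & 1         # lower bit of the present pair
        high = (multiplier >> (2 * k + 1)) & 1    # upper bit of the present pair
        digits.append(-2 * high + low + prev_bit)
        prev_bit = high                           # becomes the third bit of the next triplet
    return digits


def booth_radix4_multiply(multiplicand: int, multiplier: int, n_bits: int) -> int:
    """Sum the recoded partial products; digit k has weight 4**k."""
    return sum(d * multiplicand * 4 ** k
               for k, d in enumerate(booth_radix4_digits(multiplier, n_bits)))


if __name__ == "__main__":
    print(booth_radix4_digits(7, 4))        # [-1, 2], i.e. -1 + 2*4 = 7
    print(booth_radix4_multiply(5, -3, 4))  # -15
```

A 4-bit multiplier gives two digits and an 8-bit multiplier gives four, which is the halving of partial products mentioned in the text.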
Rashmi et al. proposed a direct method for performing linear convolution of finite-length sequences.
Multipliers are the basic building block of convolvers, and because multiplication takes most of the
execution time, they used a 4×4 Vedic multiplier to increase the speed [20]. Samples 4 bits long are
applied to the inputs of the Vedic multiplier, which produces 8-bit partial products at its output. For
parallel processing, 16 multipliers are used to generate 16 partial products. For the additions, all the
outputs are latched and a carry look-ahead adder is used to sum the partial products. For the final stage of
addition, a combination of carry-save adder and ripple carry adder is used. Wallace gave a method for
high-speed multiplication centered on reducing the partial products in parallel using a tree of carry-save
adders, which became known as the Wallace multiplier [21]. Dadda improved on the Wallace method with a
compressor placement technique that requires fewer compressors in the partial product reduction phase
at the cost of a larger carry propagation adder. Column compression multipliers continue to be studied
because of their high-speed operation. The total delay of such a multiplier is proportional to the logarithm
of the operand word length [22]. According to T. K. Callaway, column compression multipliers are much
better in terms of power than array multipliers [23]. Garg presented a Dynamic Range Detection based
multiplier aimed at reducing the power consumption of the multiplier. This method chooses as the
multiplier the operand that is more likely to make partial product rows zero after Booth recoding [24]. The
power consumption is further reduced by truncating the output product at the price of output precision.
With the truncation method, the designed multiplier is more effective than the conventional Booth
multiplier. As a consequence, the designed multiplier is well suited to portable multimedia and DSP
applications.
Lee worked on a relative performance analysis of different multiplier designs, namely Array, Wallace,
Dadda and Reduced Area multipliers. All the multiplier designs were modelled in Verilog HDL and
synthesized. Logic synthesis report data for the Area, Speed and Auto optimization modes indicate that the
Dadda multiplier may not always be faster than the Wallace multiplier; it depends on the optimization
method used [25]. It was also found that the Wallace multiplier is suited to high-speed applications, while
the Dadda and reduced-area designs give the best speed when synthesized to reduce area or logic usage.
Hiung presented a relative performance comparison of radix-based Booth encoding multiplier designs.
Radix-2, Radix-4, Radix-8, Radix-16 and Radix-32 Booth encoding multipliers were compared [26].
Synthesis report data obtained for the Area, Speed and Auto optimization modes indicate that the largest
area and longest timing delay are observed in the Radix-2 Booth encoding multiplier design. As the radix
is increased, the number of partial product rows is reduced, but at the cost of more complex operations.
The results indicate that the Radix-4 Booth encoding multiplier design is the best in terms of speed and
area. Seo designed a new parallel multiplier using a radix-2 Booth multiplier. A hybrid type of carry-save
adder is used by combining multiplication and accumulation [27]. The architecture consists of a Booth
multiplier along with a carry-save adder tree and an array for sign extension, so that the bit density of the
operand can be increased. The carry-save adder tree is used to reduce the partial products until only two
rows remain, and the final two rows are added using a fast adder such as a carry look-ahead adder. Demori
defined two kinds of parallel multipliers. The first kind uses a rectangular array of cells to sum the partial
products; multipliers of this type are called array multipliers, and their delay is proportional to the size of
the multiplier input. The second kind uses compressors to reduce the partial products, and the final two
rows are added using a high-speed carry-propagate adder [28].
3. EXISTING ARCHITECTURE
This section describes in detail the implemented parallel multiplier architectures that will be used in
the design of the MAC unit.
A. Booth Multiplier
Booth’s algorithm is one of the important algorithms for signed number multiplication. The algorithm
repeatedly adds one of two precomputed values, the addition term A or the subtraction term S, to a
partial product P and then performs an arithmetic right shift on P. Let x and y be the multiplicand and
multiplier, and let nx and ny be the numbers of bits in x and y [4].
FIGURE 2. Booth’s Multiplier Flow Chart
The fast Booth multiplier procedure obtains the product of x and y by the following rules:
1. First, calculate the values of Addition (A) and Subtraction (S) and the initial value of the partial
product P. All of these numbers should have a length equal to nx + ny + 1.
2. Addition (A): Fill the most significant (leftmost) nx bits with the value of the multiplicand x, and fill
the remaining (ny + 1) bits with zeros.
3. Subtraction (S): Fill the most significant (leftmost) nx bits with the value of -x in two's complement
notation, and fill the remaining (ny + 1) bits with zeros.
4. P: Fill the most significant nx bits with zeros. To the right of these zeros append the value of y. Set
the least significant (rightmost) bit to zero.
5. Examine the two least significant (rightmost) bits of P.
• If they are 01, calculate P + A and ignore any overflow.
• If they are 10, calculate P + S and ignore any overflow.
• If they are 00 or 11, do nothing and use P directly in the next step.
6. Arithmetically shift the value obtained in the previous step one place to the right. P now equals this
new value.
7. Repeat steps 5 and 6 until they have been done ny times.
8. Drop the least significant (rightmost) bit from P. The result is the product of x and y.
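The procedure above can be followed step by step in a short software model. This is a sketch under stated assumptions, not the hardware architecture of this paper: the function name `booth_multiply` is hypothetical, operands are signed two's complement values, and the multiplicand must not be the most negative nx-bit value, because its negation does not fit in nx bits (a known limitation of this basic form).

```python
def booth_multiply(x: int, y: int, nx: int, ny: int) -> int:
    """Radix-2 Booth multiplication of signed x (nx bits) by signed y (ny bits).

    Assumes x is not the most negative nx-bit value (-2**(nx-1)).
    """
    total = nx + ny + 1                       # common length of A, S and P
    mask = (1 << total) - 1
    x_mask = (1 << nx) - 1
    # A: multiplicand in the top nx bits, (ny + 1) zeros below.
    a_term = ((x & x_mask) << (ny + 1)) & mask
    # S: two's complement of the multiplicand in the top nx bits.
    s_term = (((-x) & x_mask) << (ny + 1)) & mask
    # P: nx zeros, then the multiplier, then one zero bit.
    p = (y & ((1 << ny) - 1)) << 1
    for _ in range(ny):
        low_two = p & 0b11
        if low_two == 0b01:
            p = (p + a_term) & mask           # overflow is ignored
        elif low_two == 0b10:
            p = (p + s_term) & mask           # overflow is ignored
        # Arithmetic right shift by one: replicate the sign bit.
        sign = p >> (total - 1)
        p = (p >> 1) | (sign << (total - 1))
    p >>= 1                                   # drop the least significant bit
    width = nx + ny
    if p >> (width - 1):                      # reinterpret as a signed value
        p -= 1 << width
    return p


if __name__ == "__main__":
    print(booth_multiply(3, -4, 4, 4))  # -12
```

For 3 × (-4) with 4-bit operands, P passes through 0000 1100 0, 0000 0110 0, 0000 0011 0, 1110 1001 1 and 1111 0100 1, and dropping the last bit leaves 1111 0100, which is -12.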
Andrew Donald Booth proposed a multiplication algorithm that became known as Booth’s algorithm [33]. In
a Booth multiplier, the Booth encoding algorithm reduces the number of partial products by considering a
specific number of multiplier bits at a time instead of just one bit, thereby achieving a speed advantage
over other multiplier architectures. Figure 2 gives the detailed operation of the Booth multiplier; the same
operation is followed in designing its architecture, and the designed architecture is used to design an
optimized MAC unit.
In the first stage, the last two bits of the partial product are checked and, according to the rules given
above, the decision to add, subtract or leave the value unchanged is made; the first stage is then concluded
by the arithmetic right shift. The same process is followed for the other stages, and the final product is
obtained by removing the least significant bit.
B. Tree Multiplier
The best-known high-speed multipliers are those proposed by Wallace and Dadda. Both multipliers consist
of three phases. In the first phase, the partial products are generated. In the second phase, the generated
partial products are reduced to a height of two rows. In the final phase, the last two rows are summed
using an adder. In the Wallace method, the partial products are summed as soon as possible, whereas
Dadda’s method does the minimum reduction necessary at each level to complete the reduction in the same
number of levels as required by a Wallace multiplier [34].
FIGURE 3. Dadda Multiplier Flow Chart
Wallace Multiplier: C.S. Wallace presented a method for multiplication centered on reducing the partial
product bits in parallel using carry-save adders, which became known as the Wallace tree [21].
The steps of this multiplier are as follows:
1. First, the partial products are formed by examining the bits of the multiplier.
2. Next, the N rows of partial products are grouped into sets of three rows each.
3. Any additional rows that are not members of a group of three are transferred to the next level without
modification.
4. In the Wallace multiplier, all the bits of all the partial products in each column are added together by
a set of compressors in parallel without propagating any carries.
5. Within each group of three rows, 3:2 compressors are applied to the columns containing three bits
and 2:2 compressors (half adders) to the columns containing two bits.
6. Columns containing only a single bit are passed to the next level unchanged. The final two rows
remaining at the end are summed using a carry propagation adder.
The main concept of this kind of multiplier is to accumulate the partial products by reducing the number of
bits in each column using full adders and half adders. The full adder and half adder are called (3:2) and
(2:2) compressors respectively, because they add three bits or two bits from a single column of partial
products and produce two output bits: one bit in the same column and one bit in the next higher bit
position.
Dadda Multiplier: Dadda improved Wallace’s method with a compressor placement strategy that requires
fewer compressors in the partial product reduction stage at the price of a larger carry propagation adder
in the final stage. In both techniques, the total delay is proportional to the logarithm of the operand word
length [22].
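The logarithmic growth of the reduction depth can be checked with a small sketch based on the Wallace grouping rule given earlier: every complete group of three rows becomes two rows, and leftover rows pass through unchanged. The function name `wallace_stage_count` is a hypothetical helper introduced here for illustration only.

```python
def wallace_stage_count(rows: int) -> int:
    """Number of reduction levels needed to bring `rows` partial products down to two."""
    stages = 0
    while rows > 2:
        # Each full group of three rows is compressed to two rows;
        # the remaining (rows % 3) rows are carried over unchanged.
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, wallace_stage_count(n))  # 8 -> 4, 16 -> 6, 32 -> 8
```

Doubling the number of partial product rows adds only two levels in these cases, which is the logarithmic behaviour stated above.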
C. Partial Product Reduction
The effectiveness of a digital multiplier depends on the technique used to add the bits of the partial
product array. Each partial product is a shifted version of the multiplicand, and the delay of a conventional
adder is proportional to the operand width, so the multiplier needs a large amount of time if conventional
adders are used to perform the additions. In a carry-save adder, carry propagation is avoided by taking
the carries as outputs rather than passing them to the next higher bit position. A carry-save adder is made
of a group of full adders with no carry chain between them. An n-bit carry-save adder therefore takes
three n-bit inputs, A(n-1)...A(0), B(n-1)...B(0) and Cin(n-1)...Cin(0), and produces two n-bit results,
Sum(n-1)...Sum(0) and Cout(n-1)...Cout(0). A crucial application of the carry-save adder is to reduce the
number of partial products in integer multiplication. When the carry-save adder is used in the partial
product reduction stage of the multiplier, each output carry is passed to the next stage in the next higher
column.
3:2 Compressor: Compressors are similar to full adders in operation. The input bits A, B and Cin are of
rank 0 (bit position 0); the compressor generates a sum at bit position 0 and a cout at bit position 1 [35].
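A minimal software sketch of the 3:2 compressor and the carry-save adder built from it is given below. It is illustrative only: the function names are hypothetical, the reduction works on whole rows (a simplified, Wallace-style grouping in threes rather than a column-by-column hardware netlist), and Python's built-in addition stands in for the final carry-propagate adder.

```python
def compressor_3_2(a: int, b: int, cin: int) -> tuple[int, int]:
    """Full adder on single bits: returns (sum at rank 0, cout at rank 1)."""
    s = a ^ b ^ cin
    cout = (a & b) | (b & cin) | (a & cin)
    return s, cout


def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Word-level carry-save adder: three rows in, two rows out, no carry chain."""
    sum_row = a ^ b ^ c                                # bitwise sums stay in place
    carry_row = ((a & b) | (b & c) | (a & c)) << 1     # carries move one column up
    return sum_row, carry_row


def multiply_with_csa(x: int, y: int, n_bits: int = 8) -> int:
    """Unsigned multiply: generate partial products, reduce to two rows, then add."""
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n_bits)]
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3               # rows covered by groups of three
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(carry_save_add(*rows[i:i + 3]))
        reduced.extend(rows[full:])                    # leftover rows pass unchanged
        rows = reduced
    return sum(rows)                                   # stands in for the final adder


if __name__ == "__main__":
    print(compressor_3_2(1, 1, 0))     # (0, 1)
    print(multiply_with_csa(13, 11))   # 143
```

The invariant that makes the reduction valid is that the two output rows always sum to the same value as the three input rows.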
Final Stage Adder: Portable systems have gained great popularity, and rapid development has increased
the circuit density of integrated circuits. Therefore, the delay and area of these systems are important
design objectives. As adders are very important and extensively used units in integrated circuits, the
effective design of adders has become the aim of much research in VLSI design. The final addition stage is
important for multipliers of any kind because it is at this level that the addition of large operands is done,
and it is this level that determines the overall delay of the multiplier. This stage uses a carry-propagate
adder such as the ripple carry adder, which is discussed below.
4. PROPOSED IMPLEMENTATION OF MAC
A MAC block includes a multiplier and an accumulator that holds the sum of the previous consecutive
products. The MAC unit is one of the key blocks of a digital signal processing system and plays an
important role in determining its delay and area. The MAC unit is implemented using all three multipliers.
The main focus in MAC design has been to increase its speed, as speed and throughput have always been
important concerns of digital signal processing systems. The MAC unit is one of the important data path
operators of DSP processors. With recent developments, DSP processors have become important parts of
biomedical devices such as hearing aids, which are required to be small, so it is important for the DSP
processor to occupy a small area and produce little delay. The major block of the MAC unit is the
multiplier, and the multiplication scheme used should be efficient enough to produce little delay.
FIGURE 4. MAC unit using all three multipliers block Diagram
The operation of the MAC unit is expressed as:
$F = \sum_i A_i \times B_i$
Operation: The inputs are given to the multiplier, and after the multiplication the 16-bit output is given to
the adder; the other input to the adder comes from the temporary register. Initially the value supplied by
the temporary register is zero. When the next 8-bit inputs are taken, the value from the output register is
fed to the temporary register, and when the next output from the multiplier arrives it is added to the
previous sum. The MAC unit fetches its inputs from memory and feeds them to its multiplier component; a
register is used as the accumulator. The construction of the multiplier plays a vital role in the MAC unit, so
the selected multipliers are the Booth, Wallace and Dadda multipliers, and a comparative analysis is done
in terms of area and delay. The designed MAC is clocked: on consecutive clock cycles the inputs are taken
from the register into the multiplier, and the product is output to the accumulator, which consists of an
adder and a register that work in coordination with the clock.
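The clocked behaviour described above can be modelled with a short behavioural sketch. It is not the synthesized design: the class name `MacUnit` and the accumulator width `acc_bits` are assumptions (the text gives 8-bit inputs and a 16-bit product but does not state the accumulator width). The input pairs used in the example are the (M1, H1) values listed for the Booth-based MAC in the simulation results section, for which the reported output is dout=84.

```python
class MacUnit:
    """Behavioural model of a multiply-accumulate unit (illustrative only)."""

    def __init__(self, in_bits: int = 8, acc_bits: int = 16):
        # acc_bits is an assumed width; it is not specified in the text.
        self.in_mask = (1 << in_bits) - 1
        self.acc_mask = (1 << acc_bits) - 1
        self.acc = 0                      # temporary register starts at zero

    def clock(self, a: int, b: int) -> int:
        """One clock cycle: multiply the inputs and add to the stored sum."""
        product = (a & self.in_mask) * (b & self.in_mask)
        self.acc = (self.acc + product) & self.acc_mask
        return self.acc


if __name__ == "__main__":
    mac = MacUnit()
    pairs = [(7, 1), (6, 2), (5, 3), (4, 4), (3, 5), (2, 6), (1, 7)]
    for m1, h1 in pairs:
        dout = mac.clock(m1, h1)
    print(dout)  # 84
```

Any of the multiplier sketches could replace the `*` operator here; the accumulator logic is independent of how the product is formed.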
Linear Convolution Application Using designed MAC unit
In this application five MAC units are used. Together they perform a process similar to accumulation
while the sequences of samples are multiplied. With each clock pulse an input X sample enters the MAC
units, and with each new clock signal the X samples shift along the line of MAC units and are multiplied by
the H sample values.
FIGURE 5. Block diagram of the linear convolution operation
In effect, the line of MAC units together performs the accumulation. Once all the samples of the X input have entered the line, they leave it gradually with each new clock cycle. In this operation the H
sample values are parameterized; as a consequence we obtain all the samples of the output sequence Y. The
MAC unit is thus used to perform the convolution operation.
Figure 5 shows all the MAC units that are active in each clock cycle. As the input X samples are
multiplied by the predefined H samples, each sample of the output sequence Y is the accumulated sum over all MAC
units. The MAC units are interconnected to obtain the accumulated sum. They also work in
coordination with the clock: on each new clock cycle the samples of the X sequence shift along the line of
MAC units and are multiplied by the H sequence samples. The adders in the MAC units likewise operate
synchronously with the clock.
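The shifting behaviour described above can be modelled in a few lines. The sketch below is a behavioural model in Python, not the HDL design. It assumes that each MAC unit holds one X sample and one fixed H value, and that zeros are shifted in after the last X sample so that the samples leave the line gradually. The function name convolve_with_mac_line is illustrative.

```python
def convolve_with_mac_line(x, h):
    """Model a line of len(h) MAC units; returns len(x) + len(h) - 1 outputs."""
    taps = [0] * len(h)  # X sample currently held by each MAC unit
    y = []
    # Zeros are appended so the last X samples can shift out of the line.
    for sample in list(x) + [0] * (len(h) - 1):
        taps = [sample] + taps[:-1]  # one clock: shift X along the line
        acc = 0
        for xk, hk in zip(taps, h):
            acc += xk * hk           # each unit multiplies, the line accumulates
        y.append(acc)
    return y


print(convolve_with_mac_line([1, 2, 3], [1, 1]))  # [1, 3, 5, 3]
```

With five H values the list taps has five entries, one per MAC unit, which matches the five-unit arrangement of Figure 5.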
5. SIMULATION RESULTS
This section presents the simulation results and synthesis reports for all the implemented multipliers, the MAC
unit and the convolution application using the MAC unit. The simulation results and synthesis reports were obtained
using the Xilinx 13.2 tool, targeting a Virtex-5 XC5VLX110T-FF1136 device with a speed grade of -1.
1. MAC using Booth Multiplier, simulation result: Figure 6 shows the simulation result for the MAC using the
Booth multiplier. The inputs, in decimal, are 1) (M1=7,H1=1) 2) (M1=6,H1=2)
3) (M1=5,H1=3) 4) (M1=4,H1=4) 5) (M1=3,H1=5) 6) (M1=2,H1=6) 7) (M1=1,H1=7), and the result, in
decimal, is dout=84. Once the sequence is given, the simulation run operation has to be applied
consecutively, once for each accumulation, until all the inputs have been multiplied and
accumulated.
2. MAC using Dadda Multiplier, simulation result: Figure 7 shows the simulation result for the MAC
designed using the Dadda multiplier. The inputs, in decimal, are 1) (M1=7,H1=1) 2) (M1=6,H1=2)
3) (M1=5,H1=3) 4) (M1=4,H1=4) 5) (M1=3,H1=5) 6) (M1=2,H1=6) 7) (M1=1,H1=7), and the result, in
decimal, is dout=84. Once the sequence is given, the simulation run operation has to be applied
consecutively, once for each accumulation, until all the inputs have been multiplied and
accumulated.
FIGURE 6: Simulation result for MAC using Booth’s Multiplier
FIGURE 7: Simulation result for MAC using Dadda Multiplier
3. MAC using Wallace Multiplier, simulation result: Figure 8 shows the simulation result for the MAC
designed using the Wallace multiplier. The inputs, in decimal, are 1) (M1=7,H1=1) 2)
(M1=6,H1=2) 3) (M1=5,H1=3) 4) (M1=4,H1=4) 5) (M1=3,H1=5) 6) (M1=2,H1=6) 7) (M1=1,H1=7), and
the result, in decimal, is 84. Once the sequence is given, the simulation run operation has to be applied
consecutively, once for each accumulation, until all the inputs have been multiplied and accumulated.
FIGURE 8: Simulation result for MAC using Wallace Multiplier
4. Linear convolution application using MAC unit with Booth Multiplier: Figure 9 shows the simulation
result of linear convolution using the designed MAC unit with the Booth multiplier. The inputs are in
decimal; the two sequences taken are X=[3,5,4,1,2,1,2,3,4,6,7] and h=[1,2,3,4,5]. In decimal form the
convolved result is Y=[3,11,23,36,51,49,34,23,30,36,53,63,65,58,35]. Once the inputs are given, the
run operation of the simulation has to be applied consecutively so that the clocks are generated; with
each clock all the MAC units are active, the corresponding inputs in them are multiplied, and the result
is obtained through the line of MAC units.
5. Linear convolution using MAC Unit with the Wallace Multiplier: Figure 10 shows the simulation result
of linear convolution using the designed MAC unit with the Wallace multiplier. The inputs are in decimal;
the two sequences taken are X=[3,5,4,1,2,1,2,3,4,6,7] and h=[1,2,3,4,5]. In decimal form the convolved
result is Y=[3,11,23,36,51,49,34,23,30,35,50,58,58,49,30]. Once the inputs are given, the run operation
of the simulation has to be applied consecutively so that the clocks are generated; with each clock all
the MAC units are active, the corresponding inputs in them are multiplied, and the result
is obtained through the line of MAC units.
6. Linear convolution application using MAC Unit with the Dadda Multiplier: Figure 11 shows the
simulation result of linear convolution using the designed MAC unit with the Dadda multiplier. The inputs are in
decimal; the two sequences taken are X=[3,5,4,1,2,1,2,3,4,6,7] and h=[1,2,3,4,5]. In decimal form the
convolved result is Y=[3,11,23,36,51,49,34,23,30,35,50,58,58,49,30]. Once the inputs are given,
the run operation of the simulation has to be applied consecutively so that the clocks are generated; with
each clock all the MAC units are active, the corresponding inputs in them are multiplied, and the result
is obtained through the line of MAC units.
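As a cross-check on the sequences listed above, linear convolution can be evaluated directly from its definition, in which every product x[n]*h[k] contributes to output sample y[n+k]. The sketch below is in Python rather than the HDL used for the design, and the function name linear_convolution is illustrative. For the listed X and h it reproduces the 15-sample result reported for the Booth case. The last six values reported for the Wallace and Dadda cases (35, 50, 58, 58, 49, 30) differ from it, and are the values that would be obtained if the last two X samples were 5 and 6 rather than 6 and 7.

```python
def linear_convolution(x, h):
    """Direct-form linear convolution: y[n + k] accumulates x[n] * h[k]."""
    y = [0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


X = [3, 5, 4, 1, 2, 1, 2, 3, 4, 6, 7]
H = [1, 2, 3, 4, 5]
Y = linear_convolution(X, H)
print(Y)
# [3, 11, 23, 36, 51, 49, 34, 23, 30, 36, 53, 63, 65, 58, 35]
```

The output has len(X) + len(H) - 1 = 15 samples, which matches the length of the reported sequences.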
FIGURE 9: Simulation result for linear convolution application using MAC Unit with Booth Multiplier
FIGURE 10: Simulation result for linear convolution application using MAC Unit with Wallace Multiplier
FIGURE 11: Simulation result for linear convolution application using MAC Unit with Dadda Multiplier
Synthesis Report
1. Synthesis report for all three multipliers
TABLE 1: The table below gives the detailed comparative analysis of the implemented multipliers in
terms of delay and area
Multiplier    Delay (ns)    Slices
Booth         3.29          95
Wallace       14.66         107
Dadda         12.83         111
2. Synthesis report for the MAC using all three multipliers
TABLE 2: The table below gives the detailed comparative analysis of the designed MAC units in terms
of delay and area
MAC (With Multiplier)    Delay (ns)    Slices
Booth                    3.296         111
Wallace                  7.014         130
Dadda                    6.95          136
3. Synthesis report for the linear convolution application using the MAC unit with all three multipliers
TABLE 3: Synthesis report for the linear convolution application using the MAC unit, in terms of delay and area
MAC (With Multiplier)    Delay (ns)    Slices
MAC with Booth           3.296         92
MAC with Wallace         2.87          88
MAC with Dadda           3.357         99
Comparative Study
This part compares the 8 x 8-bit Wallace and Dadda multipliers. The Wallace multiplier uses 38 full adders
and 15 half adders, whereas the Dadda multiplier uses 35 full adders and 7 half adders. The carry
propagation adder is 10 bits wide in the Wallace multiplier and 14 bits wide
in the Dadda multiplier. The Dadda multiplier is less regular than the Wallace multiplier, which makes
it more difficult to implement in VLSI.
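The Dadda figures quoted above can be reproduced by simulating Dadda's column-reduction rule on the partial-product array. The sketch below is in Python and rests on stated assumptions: an unsigned N x N array of AND-gate partial products, the usual stage heights 2, 3, 4, 6, 9, ..., and the rule that a column is reduced only as far as the current stage height requires. The function name dadda_adder_counts is illustrative, and the model only counts adders, it does not build the circuit.

```python
def dadda_adder_counts(n):
    """Return (full_adders, half_adders, cpa_width) for an n x n Dadda tree."""
    # Height of each partial-product column, least significant first.
    heights = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]

    # Stage target heights 2, 3, 4, 6, 9, ... that stay below n.
    targets = [2]
    while targets[-1] * 3 // 2 < n:
        targets.append(targets[-1] * 3 // 2)

    full = half = 0
    for d in reversed(targets):          # largest target first
        carry_in = 0
        for i in range(len(heights)):
            total = heights[i] + carry_in
            carry_out = 0
            while total > d:
                if total == d + 1:       # one bit too many: half adder
                    half += 1
                    total -= 1
                else:                    # two or more too many: full adder
                    full += 1
                    total -= 2
                carry_out += 1           # every adder sends a carry leftwards
            heights[i] = total
            carry_in = carry_out

    # Columns left with two bits need the final carry propagation adder.
    cpa_width = sum(1 for h in heights if h == 2)
    return full, half, cpa_width


print(dadda_adder_counts(8))  # (35, 7, 14)
```

For n = 8 the model gives 35 full adders, 7 half adders and a 14-bit final adder, in agreement with the Dadda figures in the text. The Wallace counts depend on how the rows are grouped at each stage and are not modelled here.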
The synthesis report is obtained using Xilinx 13.2, Virtex-5, XC5VLX110T-FF1136 and speed grade of -1.
The implemented multipliers are Booth, Dadda and Wallace. For the Booth multiplier the delay and area are less as
compared to the other multipliers. The Wallace multiplier has more delay than the Dadda multiplier but takes less area.
Among the MAC units, the least delay is in the MAC that uses the Booth multiplier. When the MAC units are
used for the linear convolution application, Table 3 shows the least area (88 slices) for the design using the MAC unit with the Wallace multiplier.
6. CONCLUSION
Architectures for the Booth, Wallace and Dadda multipliers have been proposed and implemented on a Virtex-5
FPGA. The synthesis reports of the multipliers are tabulated and compared in terms of area and delay.
Compared to Wallace and Dadda, the Booth multiplier gives better performance in terms of speed and area.
These parallel multipliers have been used to design MAC units, and their synthesis reports are tabulated. The
MAC with the Booth multiplier has less delay and area. The MAC with the Wallace multiplier has more delay but
less area than the MAC with the Dadda multiplier. Finally, the designed MAC units have been used for
a DSP application, namely convolution, and the synthesis reports have been obtained and tabulated; in Table 3
the design using the MAC with the Wallace multiplier shows the lowest delay (2.87 ns).
References
[1] J. M. Rudagi, V. Amblr, V. Munavalli, R. Patil and V. Sajjan, “Design and im-plementation of efficient
multiplier using Vedic Mathematics,” Proc. IEEE Inter-national Conference on Advances in Recent
Technologies in Communication and Computing, pp. 162-166, November 2011.
[2] T. Arunachalam, S. Kirubaveni, “Analysis of High Speed Multipliers,” Proc. IEEE International
Conference on Communication and Signal Processing, pp. 211-214, April 2013.
[3] E. Prakash, R. Raju and R. Varatharajan, “Effective Method for Implementation of Wallace Multiplier
using Fast Adders,” Journal of Innovative Research and Solutions, vol. 1, no. 1, July 2013.
[4] G. Vasudev, R. Hegadi, “Design and Development of 8-Bits Fast Multiplier for Low Power
Application,” IACSIT International Journal of Engineering and Technology, vol. 4, no. 6, December
2012.
[5] A. Sen, P. Mitra and D.Datta, “Low Power MAC Unit for DSP Processor,” International Journal of Recent Technology and Engineering, vol. 1, no. 6, January 2013.
[6] G. Vasudeva, “Design and Implementation of Radix-2 Modified Booth’s Encoder Using FPGA and
ASIC Methodology,” International Journal of Recent Technology and Engineering, vol. 4, no. 3, July
2015.
[8] MD. Riazullah and K. Kishore, ”VLSI Implementation of Low Power Multiplier and Accumulator Unit
using SPST,” International Journal of Science, Engineering and TechnolgyResearchInternational Journal
of Science, Engineering and Technology Research, vol. 3, no. 12, December 2014.
[9] S. Veeramachaneni, ”Novel Architectures for High-Speed and Low-Power 3-2, 4-2 and 5-2
Compressors,” IEEE International Conference on VLSI Design and Embedded Systems, pp. 1063-9667,
January 2007.
[10] A. A. Kafi, A. Rahman, B. Mahjabeen, M. Rahman, “An efficient design of FSM based 32-bit unsigned high-speed pipelined multiplier using Verilog HDL,” 8th International Conference on Electrical and
Computer Engineering, pp. 164-167 22, December 2014.
[11] A. Sharma, A. Srivastava, A. Agarwal, D. Rana and S. Bansal, “Design and Implementation of Booth
Multiplier and Its Application Using VHDL,” International Journal of Scientific Engineering and
Technology, vol. 3, no. 5, May 2014.
[12] Q. LI, G. Liang, A. Bermak, “A High-speed 32-bit Signed/Unsigned Pipelined Multiplier,” IEEE
International Symposium on Electronic Design, pp. 207-211, 2010.
[13] S. Karthick, S. Karthika and S. Valarmathy, “Design and Analysis of Low Power Compressors,”
International Journal of Advanced Research in Electrical,Electronics and Instrumentation Engineering,
vol. 1, no. 6, December 2012.
Navneet Ranjan* et al ISSN: 2250-3676 [IJESAT] [International Journal of Engineering Science & Advanced Technology] Volume-6, Issue-2, 287-299
14 IJESAT | May-Jun 2016 Available online @ http://www.ijesat.org
[14] V. Anandi and R. Rangarajan, “A Full Custom MAC Using Dadda Tree Multiplier For Digital Hearing
Aid,” Journal of Theoretical and Applied Information Technology, vol. 67, no. 1, September 2014.
[15] S. Jose and S. Mytheen, “Modified Booth Multiplier Based Low-Cost FIR Filter Design,” International
Journal of Engineering Science and Innovative Technology, vol. 3, no. 5, September 2014.
[16] T. Tresa, M. A.Shameem and S. Sreedharan, “High Performance MAC Unit for FFT Implementation,”
International Open Access Journal Of Modern Engineering Research, vol. 4, no. 1, January 2014. [17] K. R. Jayalakshmi and H. S. Jacob, “Modified FSM Based 32-Bit Unsigned High Speed Pipelined
Multiplier Using Carry Look Ahead Adders In Verilog HDL,” International Journal of Innovative
Science, Engineering and Technology, vol. 2, no. 4, April 2015.
[18] Y. Kumar, K. Kumar, S. Yadav, A. Gupta, ”A Novel Approach to design High Speed MAC Unit”,
International Journal of Advanced Research in Computer Engineering and Technology, vol. 3, no. 6,
June 2014.
[19] D. Bordiya and L. Bandil, “Comparative AnalysisOf Multipliers (serial and parallel with radix based on
booth algorithm),” International Journal of Engineering Research and Technology (IJERT), vol. 2, no. 9,
September 2013.
[20] S. Vaidya and D. Dandeka, “Delay-Power Performance Comparison of Multipliers in VLSI Circuit
Design,” International Journal of Computer Networks and Communications (IJCNC), vol. 2, no. 4, July
2010. [21] R. k. Rashmi, “Parallel Hardware Implementation of Convolution using Vedic Mathematics,” IOSR
Journal of VLSI and Signal Processing (IOSR-JVSP), pp. 21-26, Nov-Dec 2012.
[22] C.S Wallace, “A suggestion for a fast Multiplier,” IEE Transaction on Electronic Computer, vol. EC-13,
pp. 14-17, 1964.
[23] L. Dadda, “Some Schemes for Parallel Multiplier,” Alta Frequenza, vol. 34, pp . 349-356, August 1965.
[24] T. K. Callaway, “Optimizing multipliers for WST,” International Conference on Wafer Scale
Integration, pp. 85-94, 1993.
[25] D. Garg and S. Arya, “Design of Configurable Booth Multiplier Using Dynamic Range Detector,”
International Journal of Electronics Engineering(IJEER), vol. 4, no. 3, March 2012.
[26] C. Y. H. Lee, L. H. Hiung, S. W. F. Lee, N. H. Hamid, “A performance comparison study on multiplier
designs,” IEEE International Conference on Embedded Systems, pp. 1-6, June 2010. [27] K. L. S. Swee, L. H. Hiung, “Performance Comparison Review of 32 bit Multiplier Designs,” IEEE
International Conference on Intelligent and Advanced Systems, pp. 854-859, June 2011.
[28] Y. H. Seo and D. W. Kim, “A New VLSI Architecture of Parallel Multiplier–Accumulator Based on
Radix-2 Modified Booth Algorithm,” EEE Transactions on Very Large Scale Integration Systems, vol.
18, no. 2, February 2010.
[29] R. De Mori,“Suggestions for an TC Fast Parallel Multiplier, “Electronics letters, vol. 5, pp. 50-51, 1965.
[30] BehroozParhami, Computer Arithmetic, Algorithms and Hardware Design. London:Oxford University
Press, 2000.
[31] P. Gurjar, R. Solanki, P. Kansliwal P, Dr. M. Vucha, “VLSIimplementation of adders for high speed
ALU” Annual IEEE in India Conference (INDICON), pp. 1-6 16-18, December 2011.
[32] M. Sudeep, M. SharathBimba, Dr. M. Vouch, “Design and FPGA Implementation of High Speed Vedic
Multiplier,” International Journal of Computer Application, vol. 116, no. 20, pp. 6-9, April 2014. [33] A. D. Booth, ”A signed binary multiplication technique, “Quarterly Journal of Mechanics and Applied
Mathematics, vol. 4, pp. 236-240, 1951.
[34] J. W. Townsend, A. J. Abraham and E. Swartzlander, “A comparison of Wallace and Dadda multiplier
delays,”International Society for Optical Engineering (SPIE)., vol. 52, no. 5, December 2003.
[35] J. Kaur, N.K. Gahlan and P. Shukla, “Delay Power Performance Comparison of Array Multiplier in
VLSI Design,” International Journal of Advanced Re-search in Computer Science and Electronics
Engineering, vol. 1, no. 3, May 2012.
[36] Y. H. Seo and D. W. Kim, “A New VLSI Architecture of Parallel Multiplier–Accumulator Based on
Radix-2 Modified Booth Algorithm,” IEEE Transactions on Very Large Scale Integration Systems, vol.
18, no. 2, February 2010