Chapter 2
Literature Review
2.1 ADDER TOPOLOGIES
Many different adder architectures have been proposed for binary addition since the
1950s to improve various aspects of speed, area and power. The Ripple Carry Adder has the
simplest architecture, but performs the slowest addition because of its long carry propagation
delay (R.Uma et al (2012)). The Carry Save Adder improves the speed of addition by using N
additional half adders alongside the Ripple Carry Adder to reduce the long carry propagation
delay, but it consumes more area and power than the Ripple Carry Adder (Chakib Alaoui (2011)). A
carry-lookahead adder performs fast addition by reducing the amount of time required to
determine carry bits (Yu-Ting Pai and Yu-Kumg Chen (2004)). It finds, in advance for each bit
position, whether that position will propagate a carry if a 1 arrives from the nearest LSB.
On the other hand, the Carry Skip Adder and Carry Select Adder speed up the
addition by splitting the adder into blocks of N bits. In the Carry Skip Adder, each block
calculates the carry bit to propagate to the next block based on the MSB carry-out, each bit's
sum output and the LSB carry-in (Yu Pang et al (2012)), so the next block towards the MSB
need not wait until the previous block completes its addition. The Carry Select Adder
performs two additions in parallel, one with a carry-in of 0 and one with a carry-in of 1
(Sudhanshu Shekhar et al (2013)). Each block of adders generates the final sum with only a
multiplexer delay, so the Carry Select Adder is faster than all the other adders above.
2.1.1 RIPPLE CARRY ADDER (RCA)
The RCA is constructed by cascading a series of full adders as shown in Figure
2.1. The carry-out of each full adder is fed directly to the carry-in of the next full adder.
Each full adder adds three bits and generates the carry bit that the next full adder needs to
start its computation: until the carry bit is received from the previous adder, the next adder
cannot begin. This causes the longest delay in the RCA, and the delay increases linearly
with the bit size.
The delay of the RCA is defined as
t = O(N) (2.1)
where N is the operand size in bits. Even though the RCA incurs the largest delay, its
regular structure makes it take less area and consume less power. This makes the RCA the best
choice for low-power applications. An Equally Shared Block Scheme (ESBS)
based 16-bit RCA is shown in Figure 2.1.
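The ripple behaviour described above can be modelled at bit level. The following is a minimal Python sketch (function names and the LSB-first bit-list representation are illustrative, not from the source):

```python
def full_adder(a, b, cin):
    # One-bit full adder: sum and carry-out from three input bits
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin=0):
    # Add two equal-length LSB-first bit lists; each stage must wait
    # for the carry of the previous stage, hence the O(N) delay
    sums, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry
```

For example, adding 11 (LSB-first bits [1,1,0,1]) and 6 ([0,1,1,0]) yields sum bits [1,0,0,0] with carry-out 1, i.e. 17.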
Figure 2.1: Schematic of a 16-bit RCA(C-Carry bit)
2.1.2 CARRY SAVE ADDER (CSA)
A 16-bit CSA structure is shown in Figure 2.2. It consists of N+1 half adders in
the first stage and N-1 full adders in the second stage. In the first stage, unlike the
sequential 3-bit additions in the RCA, two N-bit additions happen in parallel to generate the
partial sum. The partial sum values are held by the second-stage full adders, and the final
sum is then computed by shifting the carry sequence from LSB to MSB through the partial sum
values.
The delay of the CSA is defined as
t = O(log N) (2.2)
Even though the CSA is faster than the RCA, it increases area and power because of
its N additional half adders. Since the CSA has regular connectivity for propagating the sum and
carry to the next stage, it is mostly used in multiplier designs to propagate the partial sum and
partial carry from each stage.
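The carry-save principle used inside multipliers is the 3:2 reduction: three operands are reduced to a sum word and a carry word with no carry propagation at all. A minimal Python sketch of this step (word-level, names illustrative):

```python
def carry_save(a, b, c):
    # Reduce three operands to a partial-sum word and a carry word;
    # every bit position is independent, so no carry ripples
    ps = a ^ b ^ c                               # bitwise sum
    pc = ((a & b) | (b & c) | (a & c)) << 1      # carries, moved one column left
    return ps, pc
```

For example, carry_save(5, 6, 7) returns (4, 14), and 4 + 14 = 18 = 5 + 6 + 7; the final propagation is deferred to a single carry-propagate addition.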
Figure 2.2: Schematic of a 16-bit CSA (H-Half Adder, F-Full Adder)
2.1.3 CARRY LOOK-AHEAD ADDER (CLA)
A 4-bit CLA structure is shown in Figure 2.3. It speeds up the addition by
reducing the amount of time required to determine carry bits. It uses two blocks, a carry
generator (Gi) and a carry propagator (Pi), which find the carry bit in advance for each bit
position from the nearest LSB; if the carry is 1, that position propagates a
carry to the next adder.
The generate block can be realized using the expression
Gi = Ai . Bi for i = 0, 1, 2, 3 (2.3)
Similarly, the propagate block can be realized using the expression
Pi = Ai ⊕ Bi for i = 0, 1, 2, 3 (2.4)
The carry output of the ith stage is obtained from
Ci = Gi + Pi Ci-1 for i = 0, 1, 2, 3 (2.5)
The sum output can be obtained using
Si = Ai ⊕ Bi ⊕ Ci-1 for i = 0, 1, 2, 3 (2.6)
Figure 2.3: Schematic of a 4-bit CLA
Even though the CLA is faster than RCA, it increases area and power due to its carry
generator and propagator logic.
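Equations (2.3) to (2.6) can be exercised directly. Below is a small Python sketch of a 4-bit lookahead computation (LSB-first bit lists; the carry list is re-indexed so that C[i+1] = Gi + Pi·C[i], which is Eq. (2.5) with shifted subscripts):

```python
def cla4(A, B, c0=0):
    # 4-bit carry-lookahead over LSB-first bit lists
    G = [a & b for a, b in zip(A, B)]     # generate,  Eq. (2.3)
    P = [a ^ b for a, b in zip(A, B)]     # propagate, Eq. (2.4)
    C = [c0]
    for i in range(4):
        C.append(G[i] | (P[i] & C[i]))    # carry recurrence, Eq. (2.5)
    S = [P[i] ^ C[i] for i in range(4)]   # sum, Eq. (2.6)
    return S, C[4]
```

In hardware all four carries are flattened into two-level G/P logic rather than computed sequentially; the loop here only mirrors the recurrence. For example, adding 9 ([1,0,0,1]) and 7 ([1,1,1,0]) gives sum bits [0,0,0,0] with carry-out 1, i.e. 16.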
2.1.4 CARRY SKIP ADDER (CSKA)
A CSKA performs fast addition because the adder is split into blocks of N bits. It
greatly reduces the delay through the critical path, since the carry bit can bypass (skip over)
each block. It consists of a simple RCA with AND-OR
skip logic, as shown in Figure 2.4. Each block generates its carry-out depending on the
MSB full adder carry-out, the LSB full adder carry-in and the sum bit of each full adder. If the
AND-OR skip logic output is 1, the current block is bypassed and the next block can
start its computation.
The delay of the CSKA is defined as
t = O(√N) (2.7)
The additional skip logic introduces a slight area overhead in the CSKA, but it is
smaller than that of the CSA and CLA. The design schematic of a 16-bit CSKA is shown in Figure
2.4.
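The skip decision can be sketched in a few lines of Python. This model (names and integer-operand representation are illustrative) computes the carry-out of one block: if every bit position propagates, the incoming carry bypasses the block; otherwise it ripples through:

```python
def carry_skip_out(a, b, cin, width=4):
    # Carry-out of one width-bit carry-skip block
    propagates = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    if all(propagates):                  # AND-OR skip path: forward cin directly
        return cin
    carry = cin                          # otherwise ripple through the block
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry = (ai & bi) | (carry & (ai ^ bi))
    return carry
```

For example, with a = 0b1111 and b = 0b0000 every position propagates, so the block's carry-out equals cin with no ripple delay.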
Figure 2.4: Schematic of a 16-bit CSKA
2.1.5 CARRY SELECT ADDER (CSLA)
A CSLA generally consists of two RCAs and a Multiplexer (Mux). It performs two
additions in parallel, one assuming a carry-in of 0 and the other a carry-in of 1. A CSLA
performs fast addition because the adder is split into blocks of N bits. A K-bit CSLA is
shown in Figure 2.5. It contains two groups of adders, one for the lower N/2 bits and
another for the higher N/2 bits. The higher N/2-bit group computes the partial sum and
partial carry for carry-in 1 and for carry-in 0 in parallel with the lower N/2 bits, and
generates the final sum and carry based on the Mux selection input. Hence the delay of the
CSLA can be defined as
Tselect-add(N) = Tadd(N/2) + 1 (2.8)
The CSLA is widely used in high-performance applications, but it consumes
large area and power due to its increased hardware resources. Many research articles have
proposed various techniques to reduce area and power in the CSLA structure.
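The select mechanism can be sketched at word level in Python (a behavioral model under the assumption of a single split at N/2; the function name is illustrative):

```python
def carry_select_add(a, b, n):
    # n-bit carry-select add: the lower half ripples normally while the
    # upper half is computed twice (cin = 0 and cin = 1); a mux picks one
    h = n // 2
    mask = (1 << h) - 1
    lo = (a & mask) + (b & mask)                  # lower half, cin = 0
    carry = lo >> h                               # mux select signal
    lo &= mask
    hi0 = ((a >> h) + (b >> h)) & ((1 << (h + 1)) - 1)      # assumes cin = 0
    hi1 = ((a >> h) + (b >> h) + 1) & ((1 << (h + 1)) - 1)  # assumes cin = 1
    hi = hi1 if carry else hi0                    # the "mux"
    return (hi << h) | lo
```

Both hi0 and hi1 are available before the lower half's carry arrives, so only the mux delay is added, matching Eq. (2.8).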
Figure 2.5: Schematic of a K-bit CSLA
2.1.6 OTHER ADDERS
The Kogge-Stone adder is a parallel-prefix form of the carry look-ahead adder (Kogge P and
Stone H (1973)). It has a regular layout and minimum logic depth (fan-out), which makes it a
fast adder, but it has a large area. The delay of the Kogge-Stone adder is log2 N and its
area is (N·log2 N) − N + 1, where N is the number of input bits (N Zamhari et al (2012)).
Another parallel-prefix adder is the Brent-Kung adder (R. P Brent and H. T. Kung
(1982)). It has more logic depth (fan-out) with minimum-area characteristics, so its
addition is slower, but it is power efficient (N Zamhari et al (2012)). The delay of the
Brent-Kung adder is (2·log2 N) − 2 and its area is 2N − 2 − log2 N.
Ladner Fischer adder is another parallel prefix adder (R.E. Ladner and M.J.
Fischer (1980)). Its delay and area are asymptotically optimal (i.e., logarithmic delay and
linear area). It has an additional type of “recursive step” for constructing a parallel prefix
circuit. This additional recursive step reduces the delay, but increases area. It has delay of
O(log N) and area of O(N).
An improvement that can be made to CLA design is the use of a pseudo-carry as
proposed by Ling, and is called Ling adder (H. Ling (1981)). This method allows a single
local propagate signal to be removed from the critical path.
The Han-Carlson adder is a hybrid design that mixes the Kogge-Stone and Brent-Kung
structures (T. Han and D. Carlson (1987)). It has log2 N + 1 stages. The logic performs a
Kogge-Stone computation on the odd-numbered bits, and then uses one more stage to ripple into
the even positions.
The Sklansky or divide-and-conquer adder reduces the delay to log2N stages by
computing intermediate prefixes along with the large group prefixes (J. Sklansky (1960)).
This comes at the expense of fanouts that double at each level (8, 4, 2, 1). These high
fanouts cause poor performance on wide adders unless the high fanout gates are
appropriately sized or the critical signals are buffered before being used for the
intermediate prefixes. Transistor sizing can cut into the regularity of the layout because
multiple sizes of each cell are required, although the larger gates can spread into adjacent
columns.
2.2 MULTIPLIER TOPOLOGIES
Two classes of parallel multipliers were defined in the 1960s. The first class of
parallel multiplier uses a rectangular array of identical cells, each containing an AND gate
and addition logic, to generate and sum the partial product bits (J. C. Majithia and
R. Kitai (1964)). These multipliers are called array multipliers, and their
delay is proportional to the multiplier input word size, i.e. O(N). Since array
multipliers have a regular structure and regular wiring connectivity, they are easy to
implement at the layout level (R P Pal Singh et al (2009)).
The second class of parallel multipliers, termed column compression multipliers,
uses counters or compressors to reduce the matrix of partial product bits to two words.
Finally, a carry propagate adder sums these two words to produce the final product.
A column compression multiplier has delay proportional to the logarithm of the
multiplier word length, i.e. O(log N), so it is faster than an array multiplier, but its
irregular structure and interconnections make it difficult to lay out.
2.2.1 ARRAY MULTIPLIERS
A 4 by 4 array multiplier structure is shown in Figure 2.6. Each cell performs the
two basic functions of partial product generation and summation. Half adders and full
adders are used to perform the addition. An unsigned N by N array multiplier
requires N² + N cells, where N² cells contain an AND gate for partial product generation,
and 2N full adders and N half adders complete the multiplier. The worst-case delay is (2N − 2)∆c
(Bickerstaff K.C (2007)), where ∆c is the worst-case adder delay. All the partial products are
generated in parallel and collected through an array of full adders and half adders, and finally
they are summed using a CPA. Because of its regular structure, the array multiplier takes a small
amount of area, but it is the slowest in terms of latency.
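The behaviour of the array, AND-gate partial product generation followed by row-by-row accumulation, can be sketched in Python (a functional model only; the physical array computes all rows concurrently):

```python
def array_multiply(x, y, n=4):
    # Unsigned n-by-n multiply: generate the AND-gate partial product
    # bits, then accumulate them one shifted row at a time, as the
    # adder rows of the array do
    product = 0
    for i in range(n):              # row i (multiplier bit b_i)
        row = 0
        for j in range(n):          # partial product bit a_j & b_i
            row |= (((x >> j) & 1) & ((y >> i) & 1)) << j
        product += row << i         # adder row sums the shifted row
    return product
```

For example, array_multiply(13, 11, 4) returns 143.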
In the 1950s, the Booth algorithm was used in array multipliers to perform two's
complement multiplication (Andrew D. Booth (1951)). It computes the partial
products by examining two multiplier bits at a time. Later, the higher-radix modified
Booth algorithm was introduced to improve the latency of the regular Booth
array multiplier.
Figure 2.6: Schematic of a 4 by 4 Array Multiplier
The Booth Radix-4 algorithm (O. L. MacSorley (1961)) reduces the number of
partial products by half while keeping the circuit's complexity to a minimum. This
results in a faster, lower-power multiplication operation. Booth recoding makes these
advantages possible by skipping clock cycles that add nothing new in the way of product
terms. The hardware implementation of the Radix-4 Booth recoding technique uses a simple
mux that selects the correct shift-and-add operation based on the groupings of bits found
in the product register, which holds the multiplier. The multiplicand and
the two's complement of the multiplicand are added based on the recoding value. The
rules for the radix-4 modified Booth recoding technique are shown in Table 2.1.
The three-bit groups decode into five possible operations: add 2×multiplicand, add the
multiplicand, add 0, subtract the multiplicand, or subtract 2×multiplicand. This increases the
hardware complexity, but incurs only half the delay of the regular Booth multiplier.
It is possible to use higher radices, such as radix-8 or radix-16, but the additional
complexity, due to non-power-of-two multiples of the multiplicand, compromises the
delay and area improvements.
Table 2.1: Radix-4 Modified Booth Recoding
Another method was proposed by Baugh and Wooley (Charles R. Baugh and
Bruce A. Wooley (1973)) to handle signed operands. This technique was developed in
order to design regular multipliers suited to 2's complement numbers. Due to two
additional rows, it increases the maximum column height by two, and because of the
resulting two additional stages of partial product reduction, it increases the overall delay. A
modified form of the Baugh-Wooley method (Shiann-Rong Kuang et al (2009)) is more
commonly used because it does not increase the maximum column height.
The partial product organization of the modified Baugh-Wooley method is
shown in Figure 2.7. The organization strategy is as follows:
1) Invert the MSB bit of each row except the bottom row.
2) Invert all the bits in the bottom row, except the MSB bit.
3) Add a single 1 to the (N+1)th and 2Nth columns.
The negative partial product bits can be generated using a NAND gate instead of an
AND gate, which may reduce the area slightly in CMOS.
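The three-step organization above can be checked numerically. The Python sketch below (function name and representation are illustrative; operands are given as N-bit two's-complement patterns) builds every partial product bit, applies the inversions, adds the two correction 1s, and reduces modulo 2^(2N):

```python
def baugh_wooley_multiply(x, y, n=8):
    # Modified Baugh-Wooley two's-complement multiply.
    # x, y are n-bit two's-complement bit patterns (0 .. 2**n - 1).
    xb = [(x >> j) & 1 for j in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            bit = xb[j] & yb[i]
            # invert the MSB bit of each row but the last, and every
            # bit of the last row but its MSB (exactly one index is n-1)
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    total += (1 << n) + (1 << (2 * n - 1))   # the two correction 1s
    return total % (1 << (2 * n))            # 2n-bit result
```

For example, with x = 253 (i.e. −3) and y = 5, the result equals (−3 × 5) mod 2^16, the correct two's-complement product.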
  bi  bi-1  bi-2   Operation
  0   0     0      Add 0
  0   0     1      Add multiplicand
  0   1     0      Add multiplicand
  0   1     1      Add 2 × multiplicand
  1   0     0      Subtract 2 × multiplicand
  1   0     1      Subtract multiplicand
  1   1     0      Subtract multiplicand
  1   1     1      Add 0
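The recoding rules can be sketched compactly: each overlapping 3-bit group maps to a digit in {−2, −1, 0, 1, 2} via the classic formula −2b_hi + b_mid + b_lo, which reproduces the table above. A minimal Python model (names illustrative; the grouping here scans from the LSB with an implicit 0 to its right):

```python
def booth_radix4_digits(y, n=8):
    # Recode an n-bit two's-complement multiplier into n/2 radix-4 digits
    digits = []
    for i in range(0, n, 2):
        b_hi = (y >> (i + 1)) & 1
        b_mid = (y >> i) & 1
        b_lo = (y >> (i - 1)) & 1 if i > 0 else 0   # implicit 0 below LSB
        digits.append(-2 * b_hi + b_mid + b_lo)     # digit in {-2..2}
    return digits

def booth_multiply(x, y, n=8):
    # Each digit selects 0, +-multiplicand or +-2*multiplicand
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))
```

With y given as an n-bit two's-complement pattern, the digit sum reproduces the signed value of the multiplier, e.g. booth_multiply(7, 250) gives 7 × (−6) = −42.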
For operands x7 ... x0 and y7 ... y0, the partial product rows (row i shifted left by i columns, # marking an inverted bit position, and the two correction 1s in the 9th and 16th columns) are:
  row 0: 1 #p70 p60 p50 p40 p30 p20 p10 p00
  row 1:   #p71 p61 p51 p41 p31 p21 p11 p01
  row 2:   #p72 p62 p52 p42 p32 p22 p12 p02
  row 3:   #p73 p63 p53 p43 p33 p23 p13 p03
  row 4:   #p74 p64 p54 p44 p34 p24 p14 p04
  row 5:   #p75 p65 p55 p45 p35 p25 p15 p05
  row 6:   #p76 p66 p56 p46 p36 p26 p16 p06
  row 7: 1 p77 #p67 #p57 #p47 #p37 #p27 #p17 #p07
The sum bits s15 ... s0 form the final product.
Figure 2.7: Two's Complement Multiplication by Modified Baugh-Wooley Method (# marks inverted bit positions)
2.2.2 COLUMN COMPRESSION MULTIPLIERS
In 1964, Wallace (C.S.Wallace (1964)) introduced a scheme for fast
multiplication based on using array of full adders and half adders. He used full adders for
all three bits and half adders for all two bits in the partial products array of multiplier to
speed up the multiplication.
Later the Wallace’s approach was modified by Dadda (Luigi Dadda (1965)) using
counter placement strategy in the partial product array. Here the placement of counters
starts from the critical path in the partial product array. This placement repeats until we
get final two rows and they are summed using a carry propagate adder to get final
product. In both Wallace and Dadda methods, the delay of the multiplier is proportional
to the logarithm of the operand word-length.
Reduced area approach is an another type of partial product reduction method
proposed for area optimization in column compression multipliers (K’Andrea et al
(1993)). Another area reduction approach is proposed by Wang’s (Z. Wang et al (1995)).
These methods are based on strategic utilization of full adders and half adders to
improve area reduction and layout, while maintaining the fast speed of the Wallace
and Dadda designs.
Another partial product reduction algorithm, based on unequal delay paths, was
proposed by Oklobdzija (V. G. Oklobdzija (1995)). He defined a connectivity strategy
that assigns slow inputs/outputs and fast inputs/outputs so that the critical delay paths
can tolerate an increase in delay.
A new organization of the reduction tree, based on partial-product
compression similar to the Dadda approach, was proposed by Eriksson (H. Eriksson
(2006)). The connectivity of the adding cells in the triangle-shaped High-Performance
Multiplier (HPM) reduction tree is completely regular.
2.2.2.1 PARTIAL PRODUCTS REDUCTION SCHEMES
As shown in Figure 2.8, the multiplier starts by generating partial products
using an AND gate array and then reduces them to two rows using counters or compressors. It is
worth understanding the difference between counters and compressors (V. G. Oklobdzija
and D.Villeger (1995)): a counter counts the number of active inputs, whereas a (q:r)
compressor reduces q inputs to r outputs based on its compression ratio. In this
research, counters are used to design the multipliers; compressors are not used. The column
compression tree consists of an array of counters or compressors. Finally, the two remaining
rows are added using a carry propagate adder to obtain the final product; a hybrid adder
structure can be used as this carry propagate adder to make the final addition fast.
Figure 2.8: Basic N by N unsigned parallel multiplier
A dot diagram is a notation for describing multiplier column compression
algorithms. The symbols used in dot diagrams are:
  dot - a partial product bit
  plain diagonal line - a full adder output
  crossed diagonal line - a half adder output
The dot diagram for an 8 by 8 Wallace multiplier is shown in Figure 2.9. It
was constructed using the following Wallace algorithm:
1) Take every three bits in each column and add them using a full adder.
2) If two bits are left in any column, add them using a half adder.
3) If just one bit is left in any column, connect it to the next level.
4) Repeat steps 1 to 3 until only two rows remain.
5) Add the final two numbers using a carry propagate adder to get the final
product.
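Steps 1 to 4 can be simulated on column bit-counts alone. The Python sketch below follows the greedy column rule as listed (the actual Wallace grouping by row triples can differ slightly in adder counts, so this is an approximation; names are illustrative):

```python
def wallace_stage(cols):
    # One reduction stage over column heights (LSB-first): each group of
    # 3 bits feeds a full adder (sum stays, carry moves one column left),
    # a leftover pair feeds a half adder, a single bit passes through
    out = [0] * (len(cols) + 1)
    for i, h in enumerate(cols):
        fa, rest = divmod(h, 3)
        ha = 1 if rest == 2 else 0
        out[i] += fa + (1 if rest else 0)  # sum bits staying in column i
        out[i + 1] += fa + ha              # carry bits entering column i+1
    while out and out[-1] == 0:
        out.pop()
    return out

def reduction_stages(n):
    # Stages needed to compress an n-by-n partial product matrix to 2 rows
    cols = [min(i + 1, 2 * n - 1 - i, n) for i in range(2 * n - 1)]
    stages = 0
    while max(cols) > 2:
        cols = wallace_stage(cols)
        stages += 1
    return stages
```

For an 8 by 8 matrix this model yields four reduction stages, matching the count quoted below for the 8 by 8 Wallace multiplier.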
In each stage of the reduction, Wallace performs a preliminary grouping of
partial product rows into sets of three. Full adders and half adders are then employed
within each three row set. In the 8 by 8 example, the counters shown in Stage 1 of the
reduction are placed in four sections as determined by the preliminary grouping of partial
product bits out of the AND array into sets of three. If due to the preliminary grouping
there is only one partial product bit, then that bit is directly moved down to the next
stage. The reduction of the partial product bits in Stage 1 by the counters shown in Stage
2 demonstrates that rows which are not part of a three row set are moved down
into the next stage without modification.
Figure 2.9: Dot Diagram of 8 by 8 Wallace Multiplier
The complete partial product reduction of an 8 by 8 Wallace multiplier requires
four stages (intermediate matrix heights of 6, 4, 3, and 2) and uses 38 full adders and 15
half adders. To complete the multiplication, an 11-bit carry-propagate adder forms the
final product by adding the final two rows of partial product bits shown in Stage 4.
As mentioned earlier, Dadda later modified Wallace's approach using
a counter placement strategy. Table 2.2 indicates the number of reduction stages as a
function of the number of bits in the Dadda multiplier. The reduction stages are determined
from the bottom (final two rows) to the top: in each reduction stage, the height of the matrix
is no more than 1.5 times the height of its successor matrix. For example, a 12 by 12 Dadda
multiplier requires five reduction stages, with intermediate heights of 9, 6, 4, 3 and 2.
Table 2.2: Reduction Strategy for Dadda Multiplier
The algorithm used for a Dadda multiplier is as follows:
1) Let d1 = 2 and dj+1 = ⌊1.5 · dj⌋, where dj is the target matrix height for the jth
stage counted from the final two rows. This generates the sequence d1 = 2, d2 = 3, d3 = 4, etc.
2) For every column, use half adders and full adders to ensure that the number
of elements in each column is at most dj.
3) Let j = j − 1 and repeat step 2 until the maximum column height is 2.
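Step 1's recurrence is easy to reproduce. A small Python sketch (function name illustrative) that lists the target heights from the first reduction stage down to the final two rows:

```python
def dadda_targets(n):
    # d1 = 2, d_{j+1} = floor(1.5 * d_j); keep only targets below the
    # initial matrix height n, listed from first stage to last
    d = [2]
    while int(1.5 * d[-1]) < n:
        d.append(int(1.5 * d[-1]))
    return d[::-1]
```

For example, dadda_targets(8) returns [6, 4, 3, 2] (four stages) and dadda_targets(12) returns [9, 6, 4, 3, 2], matching the five-stage 12 by 12 example above.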
The dot diagram for an 8 by 8 Dadda multiplier is shown in Figure 2.10. The
first five matrix heights calculated using the recursive algorithm are 2, 3, 4, 6 and 9. Since
this is an 8 by 8 multiplier, the matrix height of 9 is unnecessary, so the first
matrix height to target is 6. Stage 1 of the partial product reduction applies full adders and
half adders only to the columns whose total height is greater than 6. In Stage 2,
full adders and half adders are used only in columns whose total height is greater than 4.
Note that when evaluating a column's height it is important to account for carries
from the previous column. The 8 by 8 Dadda multiplier requires four reduction stages
(matrix heights of 6, 4, 3, and 2) and uses 35 full adders, 7 half adders, and a 14-bit
carry-propagate adder.
  Multiplier (N)     Reduction stages
  3                  1
  4                  2
  5 <= N <= 6        3
  7 <= N <= 9        4
  10 <= N <= 13      5
  14 <= N <= 19      6
  20 <= N <= 28      7
  29 <= N <= 42      8
  43 <= N <= 63      9
  64 <= N <= 94      10
Figure 2.10: Dot Diagram of 8 by 8 Dadda Multiplier
The Reduced Area multiplier (K'Andrea et al (1993), K'Andrea et al (1995),
K'Andrea et al (2001)) is another reduction scheme that optimizes area relative to the Wallace
and Dadda schemes. The dot diagram for an 8 by 8 Reduced Area multiplier is shown in
Figure 2.11. This multiplier requires four stages (matrix heights of 6, 4, 3, and 2) and
uses 35 full adders, 7 half adders, and a 10-bit carry-propagate adder. The reduction
method for the Reduced Area multiplier is:
1) For each reduction stage, the number of full adders used in column i is
⌊bi / 3⌋, where bi is the number of bits in column i. This provides the maximum
reduction in the number of bits entering the next stage.
2) Half adders are used only under the two conditions below:
(i) When required to reduce the number of bits in a column to the number of bits
specified by the Dadda sequence (or)
(ii) To reduce the rightmost column containing only two bits.
Reduced Area multiplier reduction scheme is especially useful for pipelined
multipliers, because it reduces the required latches in the partial product reduction stages.
This scheme can be applied for both signed and unsigned numbers.
Figure 2.11: Dot Diagram of 8 by 8 Reduced Area Multiplier
A fourth type of reduction scheme, which uses full adders and half adders, is
the High Performance Multiplier (HPM) (H. Eriksson et al (2006)).
The dot diagram for an 8 by 8 High Performance Multiplier is shown in Figure 2.12.
This multiplier requires six stages (matrix heights of 7, 6, 5, 4, 3, and 2) and uses
35 full adders, 7 half adders, and a 14-bit carry-propagate adder. The reduction at each
stage of the High Performance Multiplier is N − 1, where N is the matrix height of the
previous stage.
Figure 2.12: Dot Diagram of 8 by 8 HPM Multiplier
A fifth type of partial product reduction scheme was proposed by
Wang, et al. (Z. Wang et al (1995)) to design column compression multipliers that are
more area efficient and have shorter interconnections. First, the lower
bounds on the number of adders required by a column compression multiplier are
determined. Then the constraints on the distribution of adders across the different stages
are analyzed. Finally, a technique is proposed that attempts to maximize area efficiency
while reducing the number of cross-stage interconnections. Under the analyzed constraints
for half adder and full adder allocation, there is considerable flexibility in implementing
the column compression multiplier and in choosing the length of the final fast adder,
which yields higher area efficiency.
In Wang's research, the area efficiency of the column compression part of the
multiplier is defined as:
  area efficiency = N / (K · max_k N(k)) × 100% (2.9)
where N is the total number of half adders and full adders used in the reduction stages, K
is the required number of stages, and N(k) is the number of half adders and full
adders in stage k.
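Eq. (2.9) is straightforward to evaluate. A one-function Python sketch (name illustrative):

```python
def area_efficiency(adders_per_stage):
    # Eq. (2.9): N / (K * max_k N(k)) * 100, where adders_per_stage[k]
    # is the number of half and full adders used in stage k
    N = sum(adders_per_stage)            # total adder count
    K = len(adders_per_stage)            # number of stages
    return 100.0 * N / (K * max(adders_per_stage))
```

A perfectly balanced distribution (every stage using the same adder count) scores 100%, while an imbalanced one scores lower, e.g. area_efficiency([20, 10, 10]) ≈ 66.7%.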
The performance of any of these five multipliers, Wallace, Dadda, Reduced
Area, HPM and Wang, can be improved by the design techniques proposed
in this research.
2.2.2.2 THE FINAL CARRY-PROPAGATE ADDER
All the fast adder structures above were developed under the assumption that the input
signals arrive at the same time. This assumption is not realistic in many situations, such as
the input arrival profile from the multiplier partial product summation tree to the carry
propagate adder. Therefore this research is concerned with which addition scheme is most
adequate as the carry propagate adder for the multiplier.
The literature deals with several types of carry-propagate adders, including the CSA, CLA,
CSLA and CSKA (R.Uma et al (2012)). These adder structures have been evaluated and
rated based on delay, area and number of logic transitions (Thomas K. Callaway
and Earl E. Swartzlander, Jr (1992)); more specifically, work has been done to
evaluate the power consumption of adders (Thomas K. Callaway and Earl E.
Swartzlander, Jr (1993)).
It is well known that, of the signals applied from the column compression tree to the
carry propagate adder, those at the ends of the adder arrive first and those in the
middle arrive last. So determining the exact arrival times at the carry propagate adder is of
prime importance in designing the optimal final adder. To better select and design adders
for column compression multipliers, Oklobdzija analyzed the input arrival times to the
final adder (V. G. Oklobdzija and D.Villeger (1995), V. G. Oklobdzija (1995)) and
suggested using either a variable block adder or an RCA to sum the early LSB values, a
CLA to sum the middle region of bits, and either a conditional sum adder or a CSLA to sum
the early MSB values.
Since the RCA has a simple and regular structure, it consumes less power and is more
area efficient than all the other existing adders. But each stage in the RCA generates its sum
only after receiving the carry bit from the preceding stage's bit pair, which leads to a large
carry propagation delay. The arrival profile, as shown by Oklobdzija (V. G. Oklobdzija and
D.Villeger (1995)) and Balasubrahmanyam (Balasubrahmanyam et al (2012)), has a
positive slope from the LSB region to the middle region of the partial products. Even though
the carry bit arrives quickly from the preceding stage of the final addition, the true input
values from the partial products arrive more slowly in the positive-slope region. So a fast
adder is not the best choice there, and the RCA is the best choice in the positive-slope region.
But this slope is not positive over the entire multiplier width: it is constant
in the middle region and negative on the MSB side of the partial products. So
determining the suitable adder for each region leads to optimal performance of
the multipliers.
Based on the different arrival-profile regions of the partial products, this research
proposes a hybrid carry propagate adder structure for parallel multipliers which
consumes less power, is more area efficient than the regular CSLA, and is faster than the CSA.
This enables optimal performance in the final addition for the multipliers proposed in this
research.