
Chapter 2

Literature Review

2.1 ADDER TOPOLOGIES

Many different adder architectures have been proposed for binary addition since the 1950s to improve various aspects of speed, area and power. The Ripple Carry Adder has the simplest architecture, but it performs the slowest addition because of its long carry propagation delay (R. Uma et al (2012)). The Carry Save Adder improves the speed of addition by adding N half adders to the Ripple Carry Adder to reduce the long carry propagation delay, but it consumes more area and power than the Ripple Carry Adder (Chakib Alaoui (2011)). A carry-lookahead adder performs fast addition by reducing the time required to determine the carry bits (Yu-Ting Pai and Yu-Kumg Chen (2004)). It determines in advance, for each bit position, whether that position will propagate a carry if a 1 arrives from the nearest LSB.

On the other hand, the Carry Skip Adder and the Carry Select Adder speed up addition by splitting the adders into blocks of N bits. In the Carry Skip Adder, each block calculates the carry bit to propagate to the next block based on the MSB carry-out, the sum output of each bit and the LSB carry-in (Yu Pang et al (2012)), so the next block towards the MSB need not wait until the previous block completes its addition. The Carry Select Adder performs two additions in parallel, one with a carry-in of 0 and one with a carry-in of 1 (Sudhanshu Shekhar et al (2013)). Each block of adders then generates the final sum with only a multiplexer delay, so the Carry Select Adder is the fastest of these adders.

2.1.1 RIPPLE CARRY ADDER (RCA)

The RCA is constructed by cascading a series of full adders as shown in Figure 2.1. The carry-out of each full adder is fed directly to the carry-in of the next full adder. Each full adder adds three bits and generates a carry bit that allows the next full adder to start its computation. Until the carry bit is received from the previous adder, the next adder cannot start its computation. This causes the longest delay in the RCA, and the delay increases linearly with the bit size.


The delay of the RCA is defined as

t = O(N) (2.1)

where N is the operand size in bits. Even though the RCA incurs more delay, its regular structure means it takes less area and consumes less power. This makes the RCA the best choice for low-power applications. An Equally Shared Block Scheme (ESBS) based 16-bit RCA is shown in Figure 2.1.

Figure 2.1: Schematic of a 16-bit RCA (C - carry bit)
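The ripple behaviour described above can be modelled in a few lines of Python (an illustrative sketch of the generic RCA idea, not of any specific referenced design; bit lists are LSB-first):

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin=0):
    """N-bit RCA over LSB-first bit lists: the carry ripples through
    all N stages, giving the O(N) delay of Equation (2.1)."""
    sums, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry
```

For example, `ripple_carry_add([1, 0, 1, 1], [0, 1, 0, 1])` adds 13 and 10 (LSB first) and returns `([1, 1, 1, 0], 1)`, i.e. 23.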

2.1.2 CARRY SAVE ADDER (CSA)

A 16-bit CSA structure is shown in Figure 2.2. It consists of N+1 half adders in the first stage and N-1 full adders in the second stage. In the first stage, unlike the sequential three-bit additions of the RCA, two N-bit additions happen in parallel to generate a partial sum. The partial sum values are held by the second-stage full adders, and the final sum is then computed by shifting the carry sequence from LSB to MSB through the partial sum values.



The delay of the CSA is defined as

t = O(log N) (2.2)

Even though the CSA performs faster than the RCA, it increases area and power because of its N additional half adders. Since the CSA has regular connectivity to propagate the sum and carry to the next stage, it is mostly used in multiplier designs to propagate the partial sum and partial carry from each stage.

Figure 2.2: Schematic of a 16-bit CSA (H-Half Adder, F-Full Adder)
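The essential carry-save idea, reducing three operands to a sum vector and a carry vector without propagating any carry along the word, can be sketched as follows (an illustrative 3:2 reduction in Python, not the exact cell arrangement of Figure 2.2):

```python
def carry_save_stage(a_bits, b_bits, c_bits):
    """3:2 carry-save stage over LSB-first bit lists: each bit
    position computes a sum bit and a carry bit independently,
    so no carry ripples along the word."""
    sums = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, c_bits)]
    carries = [(a & b) | (b & c) | (a & c)
               for a, b, c in zip(a_bits, b_bits, c_bits)]
    return sums, [0] + carries  # the carry vector is shifted left one place
```

Adding the two output vectors with any carry-propagate adder yields a + b + c; in a multiplier, this final addition is deferred until all partial products have been compressed.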

2.1.3 CARRY LOOK-AHEAD ADDER (CLA)

A 4-bit CLA structure is shown in Figure 2.3. It speeds up addition by reducing the time required to determine the carry bits. It uses two blocks, a carry generator (Gi) and a carry propagator (Pi), which determine in advance, for each bit position, whether a carry of 1 arriving from the nearest LSB will be propagated to the next adder.


The generate block can be realized using the expression

Gi = Ai · Bi for i = 0, 1, 2, 3 (2.3)

Similarly, the propagate block can be realized using the expression

Pi = Ai ⊕ Bi for i = 0, 1, 2, 3 (2.4)

The carry output of the ith stage is obtained from

Ci = Gi + Pi Ci-1 for i = 0, 1, 2, 3 (2.5)

The sum output can be obtained using

Si = Ai ⊕ Bi ⊕ Ci-1 for i = 0, 1, 2, 3 (2.6)

Figure 2.3: Schematic of a 4-bit CLA

Even though the CLA is faster than the RCA, it increases area and power because of its carry generator and propagator logic.
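Equations (2.3)-(2.6) can be checked with a minimal functional model (illustrative Python only; a hardware CLA expands the carry recurrence into parallel sum-of-product terms rather than iterating):

```python
def cla_4bit(a_bits, b_bits, cin=0):
    """4-bit carry-lookahead addition over LSB-first bit lists.
    G_i = A_i.B_i, P_i = A_i xor B_i, C_i = G_i + P_i.C_{i-1},
    S_i = A_i xor B_i xor C_{i-1}  (Equations 2.3-2.6)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    c = [cin]                      # c[i] is the carry into bit i
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))
    sums = [p[i] ^ c[i] for i in range(4)]
    return sums, c[4]
```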


2.1.4 CARRY SKIP ADDER (CSKA)

A CSKA performs fast addition because the adders are split into blocks of N bits. It greatly reduces the delay through the critical path, since the carry bit can be bypassed (skipped) over the blocks. It consists of a simple RCA with AND-OR skip logic, as shown in Figure 2.4. Each block generates its carry-out from the MSB full adder carry-out, the LSB full adder carry-in and the sum bit of each full adder. If the AND-OR skip logic output is 1, the current block is bypassed and the next block can start its computation.

The delay of the CSKA is defined as

t = O(√N) (2.7)

The additional skip logic adds a slight area overhead to the CSKA, but it remains smaller than the CSA and CLA. The design schematic of a 16-bit CSKA is shown in Figure 2.4.

Figure 2.4: Schematic of a 16-bit CSKA
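The skip behaviour can be modelled as below (an illustrative Python sketch; the block width of 4 matches Figure 2.4). When every bit of a block propagates, the block's carry-out simply equals its carry-in, which is what the skip multiplexer exploits:

```python
def carry_skip_add(a_bits, b_bits, cin=0, block=4):
    """Carry-skip addition over LSB-first bit lists: each block
    computes a block-propagate signal (AND of the bitwise propagates);
    when it is 1, the carry-in can bypass the block's ripple chain."""
    sums, carry = [], cin
    for start in range(0, len(a_bits), block):
        a_blk = a_bits[start:start + block]
        b_blk = b_bits[start:start + block]
        block_propagate = all(a ^ b for a, b in zip(a_blk, b_blk))
        carry_in = carry
        for a, b in zip(a_blk, b_blk):       # ripple within the block
            sums.append(a ^ b ^ carry)
            carry = (a & b) | (carry & (a ^ b))
        if block_propagate:
            # the skip mux may forward carry_in directly; the rippled
            # carry-out is guaranteed to agree with it
            assert carry == carry_in
    return sums, carry
```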


2.1.5 CARRY SELECT ADDER (CSLA)

A CSLA generally consists of two RCAs and a multiplexer (Mux). It performs two additions in parallel, one assuming a carry-in of 0 and the other a carry-in of 1. A CSLA performs fast addition because the adders are split into blocks of N bits. A K-bit CSLA is shown in Figure 2.5. It contains two groups of adders, one for the lower N/2 bits and another for the higher N/2 bits. The higher N/2-bit group computes the partial sum and partial carry for carry-in 1 and carry-in 0 in parallel with the lower N/2 bits, and the final sum and carry are generated based on the Mux selection input. Hence the delay of the CSLA can be defined as

Tselect-add(N) = Tadd(N/2) + 1 (2.8)

The CSLA is widely used in high-performance applications, but it consumes large area and power because of its increased hardware resources. Many research articles have proposed various schemes to reduce area and power in the CSLA structure.

Figure 2.5: Schematic of a K-bit CSLA
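A functional sketch of the select mechanism (illustrative Python; the internal `ripple` helper stands in for the k/2-bit RCAs of Figure 2.5):

```python
def carry_select_add(a_bits, b_bits, cin=0):
    """Carry-select addition over LSB-first bit lists: the upper half
    is computed twice (carry-in 0 and carry-in 1) in parallel with the
    lower half; the lower half's true carry-out drives the mux."""
    def ripple(a, b, c):
        out = []
        for x, y in zip(a, b):
            out.append(x ^ y ^ c)
            c = (x & y) | (c & (x ^ y))
        return out, c

    n = len(a_bits)
    lo_sum, lo_carry = ripple(a_bits[:n // 2], b_bits[:n // 2], cin)
    hi_sum0, hi_c0 = ripple(a_bits[n // 2:], b_bits[n // 2:], 0)
    hi_sum1, hi_c1 = ripple(a_bits[n // 2:], b_bits[n // 2:], 1)
    if lo_carry:                   # mux select
        return lo_sum + hi_sum1, hi_c1
    return lo_sum + hi_sum0, hi_c0
```

Only the multiplexer delay is added after the two half-width additions complete, which is the content of Equation (2.8).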


2.1.6 OTHER ADDERS

The Kogge-Stone adder is a parallel-prefix form of carry-lookahead adder (Kogge P and Stone H (1973)). It has a regular layout and minimum logic depth (fan-out), which makes it a fast adder, but it has a large area. The delay of the Kogge-Stone adder is log2 N, and its area is (N · log2 N) - N + 1, where N is the number of input bits (N Zamhari et al (2012)).

Another parallel-prefix adder is the Brent-Kung adder (R. P. Brent and H. T. Kung (1982)). It has more logic depth (fan-out) with minimum-area characteristics, so its addition is slower, but it is power efficient (N Zamhari et al (2012)). The delay of the Brent-Kung adder is (2 · log2 N) - 2, and its area is 2N - 2 - log2 N.
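The prefix computation shared by these adders can be sketched functionally (illustrative Python with Kogge-Stone spacing; real hardware evaluates each level fully in parallel, giving the log2 N delay quoted above):

```python
def kogge_stone_add(a_bits, b_bits, cin=0):
    """Parallel-prefix addition over LSB-first bit lists. Each level
    combines (g, p) pairs a distance d apart with the prefix operator
    (g2, p2) o (g1, p1) = (g2 | p2 & g1, p2 & p1); after log2(N)
    levels, G[i]/P[i] cover the whole bit range [0..i]."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]   # bitwise generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # bitwise propagate
    G, P = g[:], p[:]
    d = 1
    while d < n:
        for i in range(n - 1, d - 1, -1):  # high to low: reads old level values
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d *= 2
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n)]
    sums = [p[i] ^ carries[i] for i in range(n)]
    return sums, carries[n]
```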

The Ladner-Fischer adder is another parallel-prefix adder (R. E. Ladner and M. J. Fischer (1980)). Its delay and area are asymptotically optimal (i.e., logarithmic delay and linear area). It introduces an additional type of "recursive step" for constructing a parallel prefix circuit; this additional recursive step reduces the delay but increases the area. It has a delay of O(log N) and an area of O(N).

An improvement that can be made to the CLA design is the use of a pseudo-carry, as proposed by Ling, giving the Ling adder (H. Ling (1981)). This method allows a single local propagate signal to be removed from the critical path.

The Han-Carlson adder is a hybrid design that mixes the Kogge-Stone and Brent-Kung adders (T. Han and D. Carlson (1987)). It has log N + 1 stages: the logic performs Kogge-Stone on the odd-numbered bits and then uses one more stage to ripple into the even positions.

The Sklansky or divide-and-conquer adder reduces the delay to log2N stages by

computing intermediate prefixes along with the large group prefixes (J. Sklansky (1960)).

This comes at the expense of fanouts that double at each level (8, 4, 2, 1). These high

fanouts cause poor performance on wide adders unless the high fanout gates are

appropriately sized or the critical signals are buffered before being used for the

intermediate prefixes. Transistor sizing can cut into the regularity of the layout because

multiple sizes of each cell are required, although the larger gates can spread into adjacent

columns.


2.2 MULTIPLIER TOPOLOGIES

Two classes of parallel multipliers were defined in the 1960s. The first class of parallel multipliers uses a rectangular array of identical cells, each containing an AND gate and addition logic, to generate and sum the partial product bits (J. C. Majithia and R. Kitai (1964)). These multipliers are called array multipliers, and their delay is proportional to the multiplier input word size, i.e. O(N). Since array multipliers have a regular structure and regular wiring connectivity, they are easier to implement at the layout level (R P Pal Singh et al (2009)).

The second class of parallel multipliers, termed column compression multipliers, uses counters or compressors to reduce the matrix of partial products to two words. Finally, a carry propagate adder sums these two words to produce the final product. The column compression multiplier has a delay proportional to the logarithm of the multiplier word length, i.e. O(log N), so it is faster than the array multiplier, but its irregular structure and interconnections make it difficult to lay out.

2.2.1 ARRAY MULTIPLIERS

A 4 by 4 array multiplier structure is shown in Figure 2.6. Each cell performs the two basic functions of partial product generation and summation. Half adders and full adders perform the addition. An unsigned N by N array multiplier requires N² + N cells, where N² cells contain an AND gate for partial product generation, along with 2N full adders and N half adders to produce the product. The worst-case delay is (2N - 2)Δc (Bickerstaff K.C (2007)), where Δc is the worst-case adder delay. All the partial products are generated in parallel and collected through an array of full adders and half adders, and they are finally summed using a CPA. Because of its regular structure, the array multiplier takes a small amount of area, but it is the slowest in terms of latency.

In the 1950s, the Booth algorithm was used in array multipliers to perform two's complement multiplication (Andrew D. Booth (1951)). It computes the partial products by examining two multiplier bits at a time. Later, the higher-radix modified Booth algorithm was introduced to improve the latency of the regular Booth array multiplier.


Figure 2.6: Schematic of a 4 by 4 Array Multiplier

The Booth radix-4 algorithm (O. L. MacSorley (1961)) reduces the number of partial products by half while keeping the circuit complexity to a minimum. This results in a faster, lower-power multiplication operation. Booth recoding makes these advantages possible by skipping clock cycles that add nothing new to the product terms. The hardware implementation of the radix-4 Booth recoding technique uses a simple mux that selects the correct shift-and-add operation based on the groupings of bits found in the product register. The product register holds the multiplier. The multiplicand or the two's complement of the multiplicand is added based on the recoding value. The rules for the radix-4 modified Booth recoding technique are shown in Table 2.1.


The three-bit groups decode into five possible operations: add 2×multiplicand, add the multiplicand, add 0, subtract the multiplicand, or subtract 2×multiplicand. This increases the hardware complexity, but it incurs only half the delay of the regular Booth multiplier. It is possible to use higher radices, such as radix-8 or radix-16, but the additional complexity, due to non-power-of-two multiples of the multiplicand, compromises the delay and area improvements.
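The recoding and the resulting multiplication can be modelled in a few lines (an illustrative Python sketch of the radix-4 rule, digit = b(i-1) + b(i) - 2·b(i+1), which reproduces the operations of Table 2.1):

```python
def booth_radix4_digits(x, n):
    """Recode an n-bit two's-complement multiplier x (n even) into
    radix-4 Booth digits in {-2, -1, 0, +1, +2}, LSB digit first.
    Each digit comes from an overlapping 3-bit group, with an
    implicit 0 appended below the LSB."""
    bits = [0] + [(x >> i) & 1 for i in range(n)]   # bits[k] = b_{k-1}
    return [bits[2 * j] + bits[2 * j + 1] - 2 * bits[2 * j + 2]
            for j in range(n // 2)]

def booth_radix4_multiply(multiplicand, multiplier, n):
    """Each digit selects 0, +/-multiplicand or +/-2*multiplicand,
    shifted two bit positions per digit (one digit per bit pair)."""
    return sum((d * multiplicand) << (2 * j)
               for j, d in enumerate(booth_radix4_digits(multiplier, n)))
```

With half as many digits as multiplier bits, only half as many partial products are generated, which is the source of the speedup claimed above.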

Table 2.1: Radix-4 Modified Booth Recoding

Another method was proposed by Baugh and Wooley (Charles R. Baugh and Bruce A. Wooley (1973)) to handle signed operands. The technique was developed to give regular multipliers suited to two's complement numbers. Because of two additional rows, it increases the maximum column height by two, and the two extra stages of partial product reduction increase the overall delay. A modified form of the Baugh-Wooley method (Shiann-Rong Kuang et al (2009)) is more commonly used because it does not increase the maximum column height.

The partial product organization of the modified Baugh-Wooley method is shown in Figure 2.7. The organization strategy is as follows:

1) Invert the MSB bit of each row except the bottom row.

2) Invert all the bits in the bottom row except the MSB bit.

3) Add a single 1 to the (N+1)th and 2Nth columns.

The negative partial product bits can be generated using a NAND gate instead of an

AND gate, which may reduce the area slightly in CMOS.
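The three steps can be checked numerically (an illustrative Python model, written under the assumption of an n-by-n two's-complement multiply; column indices are 0-based, so the two correction ones of step 3 land in columns n and 2n-1):

```python
def baugh_wooley_multiply(a, b, n):
    """Signed n-by-n multiply via the modified Baugh-Wooley scheme:
    complement the MSB partial product of every row but the last,
    complement all but the MSB of the last row, add the two
    correction ones, and sum modulo 2^(2n)."""
    ai = [(a >> i) & 1 for i in range(n)]
    bj = [(b >> j) & 1 for j in range(n)]
    total = 0
    for j in range(n):                  # row j uses multiplier bit b_j
        for i in range(n):
            p = ai[i] & bj[j]
            # invert: MSB of a non-bottom row, or non-MSB of the bottom row
            invert = (i == n - 1) != (j == n - 1)
            total += (p ^ invert) << (i + j)
    total += (1 << n) + (1 << (2 * n - 1))   # step 3 correction ones
    total %= 1 << (2 * n)
    # interpret the 2n-bit result as two's complement
    return total - (1 << (2 * n)) if total >> (2 * n - 1) else total
```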

bi  bi-1  bi-2   Operation
0   0     0      Add 0
0   0     1      Add multiplicand
0   1     0      Add multiplicand
0   1     1      Add 2 × multiplicand
1   0     0      Subtract 2 × multiplicand
1   0     1      Subtract multiplicand
1   1     0      Subtract multiplicand
1   1     1      Subtract 0



Figure 2.7: Two’s Complement Multiplication by Modified Baugh-Wooley Method

2.2.2 COLUMN COMPRESSION MULTIPLIERS

In 1964, Wallace (C. S. Wallace (1964)) introduced a scheme for fast multiplication based on an array of full adders and half adders: full adders are applied wherever three bits remain in a column of the partial product array and half adders wherever two bits remain, to speed up the multiplication.

Wallace's approach was later modified by Dadda (Luigi Dadda (1965)) using a counter placement strategy in the partial product array, in which the placement of counters starts from the critical path. This placement repeats until the final two rows are obtained, and they are summed using a carry propagate adder to produce the final product. In both the Wallace and Dadda methods, the delay of the multiplier is proportional to the logarithm of the operand word length.

The reduced-area approach is another partial product reduction method, proposed for area optimization in column compression multipliers (K'Andrea et al (1993)). A further area reduction approach was proposed by Wang (Z. Wang et al (1995)). These methods are based on the strategic use of full adders and half adders to improve area and layout while maintaining the speed of the Wallace and Dadda designs.


Another partial product reduction algorithm, based on unequal delay paths, was proposed by Oklobdzija (V. G. Oklobdzija (1995)). He defined a connectivity strategy that assigns slow inputs/outputs and fast inputs/outputs to the critical delay paths that can tolerate an increase in delay.

A new organization of the reduction tree, based on partial product compression similar to the Dadda approach, was proposed by Eriksson (H. Eriksson (2006)). The connectivity of the adding cells in the triangle-shaped High Performance Multiplier (HPM) reduction tree is completely regular.

2.2.2.1 PARTIAL PRODUCTS REDUCTION SCHEMES

As shown in Figure 2.8, the multiplier starts by generating partial products using an AND gate array and reduces them to two rows using counters or compressors. It is useful to understand the difference between counters and compressors (V. G. Oklobdzija and D. Villeger (1995)): a counter counts the number of active inputs, while a (q:r) compressor reduces q inputs to r outputs according to its compression ratio. In this research, counters are used to design the multipliers; compressors are not used. The column compression tree consists of an array of counters or compressors. Finally, the two remaining rows are added using a carry propagate adder to obtain the final product. A hybrid adder structure can be used as the carry propagate adder to perform this final addition quickly.

Figure 2.8: Basic N by N unsigned parallel multiplier


A dot diagram is a notation for describing multiplication column compression algorithms. The symbols used in dot diagrams are listed below:

Each dot - a partial product bit
Plain diagonal line - a full adder output
Crossed diagonal line - a half adder output

The dot diagram for an 8 by 8 Wallace multiplier is shown in Figure 2.9. It was constructed using the following Wallace algorithm:

1) Take any three bits in each column and add them using a full adder.

2) If two bits are left in a column, add them using a half adder.

3) If just one bit is left in a column, connect it to the next level.

4) Repeat steps 1 to 3 until only two rows remain.

5) Add the final two numbers using a carry propagate adder to obtain the final product.
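Steps 1-4 can be simulated on a list of column heights (an illustrative Python sketch; true Wallace reduction groups rows in threes, so exact adder counts differ from this per-column view, but the stage counts for small multipliers come out the same):

```python
def reduce_stage(heights):
    """Apply steps 1-3 once: every 3 bits in a column feed a full
    adder (sum stays in the column, carry moves one column left);
    a leftover pair feeds a half adder; a single bit passes through."""
    nxt = [0] * (len(heights) + 1)
    for col, h in enumerate(heights):
        full, rem = divmod(h, 3)
        nxt[col] += full + (1 if rem else 0)            # sums + passed bit
        nxt[col + 1] += full + (1 if rem == 2 else 0)   # carries
    while nxt and nxt[-1] == 0:
        nxt.pop()
    return nxt

def reduction_stages(n):
    """Stages needed to compress an n-by-n partial product matrix
    (column heights 1, 2, ..., n, ..., 2, 1) down to two rows."""
    heights = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]
    stages = 0
    while max(heights) > 2:
        heights = reduce_stage(heights)
        stages += 1
    return stages
```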

In each stage of the reduction, Wallace performs a preliminary grouping of

partial product rows into sets of three. Full adders and half adders are then employed

within each three row set. In the 8 by 8 example, the counters shown in Stage 1 of the

reduction are placed in four sections as determined by the preliminary grouping of partial

product bits out of the AND array into sets of three. If due to the preliminary grouping

there is only one partial product bit, then that bit is directly moved down to the next

stage. The reduction of the partial product bits in Stage 1 by the counters shown in Stage

2 demonstrates that rows which are not part of a three row set are moved down

into the next stage without modification.


Figure 2.9: Dot Diagram of 8 by 8 Wallace Multiplier

The complete partial product reduction of an 8 by 8 Wallace multiplier requires four stages (intermediate matrix heights of 6, 4, 3, and 2) and uses 38 full adders and 15 half adders. To complete the multiplication, an 11-bit carry-propagate adder forms the final product by adding the final two rows of partial product bits shown in Stage 4.

As mentioned earlier, Dadda later modified Wallace's approach using a counter placement strategy. Table 2.2 indicates the number of reduction stages as a function of the number of bits in the Dadda multiplier. The reduction stages are determined from the bottom (the final two rows) to the top: in each reduction stage, the height of the matrix is no more than 1.5 times the height of its successor matrix. For example, a 12 by 12 Dadda multiplier requires five reduction stages with intermediate heights of 9, 6, 4, 3 and 2.


Table 2.2: Reduction Strategy for Dadda Multiplier

The algorithm used for a Dadda multiplier is as follows:

1) Let d1 = 2 and dj+1 = ⌊1.5 · dj⌋, where dj is the target matrix height for the jth stage counting from the final two rows. This generates the sequence d1 = 2, d2 = 3, d3 = 4, etc.

2) For every column, use half adders and full adders to ensure that the number of elements in each column is at most dj.

3) Let j = j - 1 and repeat step 2 until the matrix is reduced to its final two-row form.
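The target-height sequence of step 1 is easy to generate (an illustrative Python sketch; only targets smaller than the initial matrix height n are needed):

```python
def dadda_heights(n):
    """Target matrix heights d_1 = 2, d_{j+1} = floor(1.5 * d_j),
    listed from the final two rows upward, for an n-by-n multiplier."""
    d = [2]
    while 3 * d[-1] // 2 < n:        # floor(1.5 * d) without floats
        d.append(3 * d[-1] // 2)
    return d
```

For example, `dadda_heights(8)` returns `[2, 3, 4, 6]`, matching the four-stage 8 by 8 reduction, and `len(dadda_heights(n))` reproduces the stage counts of Table 2.2.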

The dot diagram for an 8 by 8 Dadda multiplier is shown in Figure 2.10. The first five matrix heights calculated using the recursive algorithm are 2, 3, 4, 6 and 9. Since this is an 8 by 8 multiplier, the matrix height of 9 is unnecessary, so the first height to target is 6. Stage 1 of the partial product reduction applies full adders and half adders only to the columns whose total height is greater than 6. In Stage 2, full adders and half adders are used only in the columns whose total height is greater than 4. Note that when evaluating a column's height it is important to account for carries from the previous column. The 8 by 8 Dadda multiplier requires four reduction stages (matrix heights of 6, 4, 3, and 2) and uses 35 full adders, 7 half adders, and a 14-bit carry-propagate adder.

Multiplier size (N)     Reduction stages
N = 3                   Stage 1
N = 4                   Stage 2
5 <= N <= 6             Stage 3
7 <= N <= 9             Stage 4
10 <= N <= 13           Stage 5
14 <= N <= 19           Stage 6
20 <= N <= 28           Stage 7
29 <= N <= 42           Stage 8
43 <= N <= 63           Stage 9
64 <= N <= 94           Stage 10


Figure 2.10: Dot Diagram of 8 by 8 Dadda Multiplier

The Reduced Area multiplier (K'Andrea et al (1993), K'Andrea et al (1995), K'Andrea et al (2001)) is another reduction scheme, which optimizes area better than the Wallace and Dadda schemes. The dot diagram for an 8 by 8 Reduced Area multiplier is shown in Figure 2.11. This multiplier requires four stages (matrix heights of 6, 4, 3, and 2) and uses 35 full adders, 7 half adders, and a 10-bit carry-propagate adder. The reduction method for the Reduced Area multiplier is:

1) In each reduction stage, the number of full adders used in column i is ⌊bi / 3⌋, where bi is the number of bits in the column. This provides the maximum reduction in the number of bits entering the next stage.

2) Half adders are used only under the two conditions below:


(i) when required to reduce the number of bits in a column to the number specified by the Dadda sequence, or

(ii) to reduce the rightmost column containing exactly two bits.

The Reduced Area reduction scheme is especially useful for pipelined multipliers, because it reduces the number of latches required in the partial product reduction stages. The scheme can be applied to both signed and unsigned numbers.

Figure 2.11: Dot Diagram of 8 by 8 Reduced Area Multiplier


A fourth type of reduction scheme, which uses full adders and half adders, is the High Performance Multiplier (HPM) (H. Eriksson et al (2006)). The dot diagram for an 8 by 8 HPM is shown in Figure 2.12. This multiplier requires six stages (matrix heights of 7, 6, 5, 4, 3, and 2) and uses 35 full adders, 7 half adders, and a 14-bit carry-propagate adder. The reduction at each stage of the HPM is N - 1, where N is the matrix height of the previous stage.

Figure 2.12: Dot Diagram of 8 by 8 HPM Multiplier


A fifth type of partial product reduction scheme was proposed by Wang et al. (Z. Wang et al (1995)) to design more area-efficient column compression multipliers with shorter interconnections. First, the lower bounds on the number of adders required by a column compression multiplier are determined. Then the constraints on the distribution of adders across the different stages are analyzed. Finally, a technique is proposed that attempts to maximize area efficiency while reducing the number of cross-stage interconnections. Under the analyzed constraints on half adder and full adder allocation, there is considerable flexibility in implementing the column compression multiplier and in choosing the length of the final fast adder, which yields higher area efficiency.

In Wang’s research, area efficiency of the column compression part of the

multiplier is defined as:

(2.9)

where N is the total number of half adders and full adders used in the reduction stages, K

is the required number of stages, and N(k) is the number of half adders and full

adders in stage k.

The performance of any of these five multipliers Wallace, Dadda, Reduced

Area, HPM and Wang can be improved by the proposed design techniques proposed

in this research.

K.max (N (k))

N x 100%


2.2.2.2 THE FINAL CARRY-PROPAGATE ADDER

All of the fast adder structures above were developed under the assumption that the input signals arrive at the same time. This assumption is not realistic in many situations, such as the input arrival profile from the multiplier's partial product summation tree to the carry propagate adder. This research is therefore concerned with which addition scheme is most suitable as the carry propagate adder of a multiplier.

The literature covers several types of carry-propagate adders, including the CSA, CLA, CSLA and CSKA (R. Uma et al (2012)). These adder structures have been evaluated and rated based on delay, area and the number of logic transitions (Thomas K. Callaway and Earl E. Swartzlander, Jr (1992)). More specifically, work has been done to evaluate the power consumption of adders (Thomas K. Callaway and Earl E. Swartzlander, Jr (1993)).

It is well known that the signals applied from the column compression tree to the inputs of the carry propagate adder arrive first at the ends of the adder, while the last arrivals are those in the middle. Determining the exact arrival times at the carry propagate adder is therefore of prime importance in the design of the optimal final adder. To better select and design adders for column compression multipliers, Oklobdzija analyzed the input arrival times to the final adder (V. G. Oklobdzija and D. Villeger (1995), V. G. Oklobdzija (1995)) and suggested using either a variable block adder or an RCA to sum the early LSB values, a CLA to sum the middle region of bits, and either a conditional sum adder or a CSLA to sum the early MSB values.

Since the RCA has a simple and regular structure, it consumes less power and is more area-efficient than the other existing adders, but each stage of the RCA generates its sum only after receiving the carry bit from the preceding bit pair, leading to a large carry propagation delay. The arrival profile, as shown by Oklobdzija (V. G. Oklobdzija and D. Villeger (1995)) and Balasubrahmanyam (Balasubrahmanyam et al (2012)), has a positive slope from the LSB region to the middle region of the partial products. Even though the carry bit arrives quickly from the preceding stage of the final addition, the true values from the partial products arrive slowly in the positive-slope region. A fast adder is therefore not the best choice there; the RCA is the best choice in the positive-slope region.


But this slope is not positive over the entire multiplier width: it is roughly constant in the middle region and negative on the MSB side of the partial products. Determining the most suitable adder for each region therefore leads to optimal multiplier performance.

Based on the different arrival-profile regions of the partial products, this research proposes a hybrid carry propagate adder structure for parallel multipliers that consumes less power and area than the regular CSLA and is faster than the CSA. This enables optimal performance in the final addition for the multipliers proposed in this research.