Time Constrained Multiple-Constant-Multiplication Structures for Real-Time Applications
Luís Miguel Seabra Ribau Lopes do Rosário
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor: Doutor Nuno Filipe Valentim Roma
Examination Committee
Chairperson: Doutor Nuno Cavaco Gomes Horta
Supervisor: Doutor Nuno Filipe Valentim Roma
Member of the Committee: Doutor Paulo Ferreira Godinho Flores
May 2014
Acknowledgments
This thesis was performed in the scope of the project "HELIX: Heterogeneous Multi-Core Architecture for Biological Sequence Analysis", funded by the Portuguese Foundation for Science and Technology (FCT) under reference PTDC/EEA-ELC/113999/2009.
I would like to express my gratitude for the support and encouragement of the people who helped me throughout my dissertation.
First, to my family, who encouraged me all the way. To my friends, who indulged me while I was working on the dissertation and encouraged me throughout. Special thanks to Filipa, Valter, André, João and both Marianas, who were there most of the time. To Constança, who helped me through the final push.
To my supervisor, Dr. Nuno Filipe Valentim Roma, who helped me manage my expectations, encouraged me as much as he possibly could, and endured the hard reviews he had to go through.
To T.A. Tiago Miguel Braga da Silva Dias for his helpful comments, ideas and contributions.
Also to Dr. Levent Aksoy for his insightful comments, his help in key parts of the thesis and his friendly attitude.
Last, but not least, to my friends and my mother, who are no longer among us: Marco, Bernardo and Liliete.
Abstract
To implement multiplication-by-constant operations, Multiple Constant Multiplication (MCM)
structures are often the de facto alternative in Application Specific Integrated Circuits (ASIC). These
structures use a series of additions, subtractions and shifts to multiply a given variable
by a set of constants. However, defining the optimized MCM hardware structure for a given
constraint set is an NP-complete problem, which has motivated the proposal of many algorithms
over the last decade, mainly focused on reducing the number of additions needed to jointly
implement a set of constant coefficients. The presented research focuses on a less explored
approach, which optimizes the MCM multiplier structure in order to minimize the propagation
time based on gate-level metrics. For this purpose, a modular and structural adder is proposed
for integration in each MCM node, optimally implementing each addition operation according
to the particular requisites and characteristics of the operands considered at that node.
As such, not only does the proposed approach handle the same problem as the
other algorithms by reducing the number of adders, but it also starts by minimizing the
propagation time and only then optimizes the area. From the conducted simulations, it was
observed that the proposed improvement provides an average speed-up of the MCM performance
as high as 1.5, at the cost of a consequent increase in the used silicon area.
Keywords
Multiple Constant Multiplication, binary adder structure, propagation time, silicon area, design
optimization
Resumo
To implement multiplication-by-constant operations, MCM structures are used in ASIC. To multiply a variable by a constant, these structures use a sequence of additions, subtractions and bit shifts. The problem with MCM structures is that their optimization is NP-complete, which has motivated the proposal of several algorithms over the last decade, mostly focused on reducing the number of additions needed to jointly implement a set of constant coefficients. This research focused on a different approach: based on a gate-level metric, the MCM multiplier structure was optimized to minimize the propagation time. To this end, a modular and structural adder was proposed for integration in each MCM node, where each addition operation is implemented optimally according to the requisites and characteristics of the operands being considered at each node. The presented approach handles the same problems as the other algorithms by reducing the number of adders, but it also goes further, first trying to reduce the propagation time and only then optimizing the area. From the conducted simulations, it was observed that the proposed improvements achieve an average speed-up of the MCM performance of up to 1.5, at the cost of increased silicon area occupation.
Palavras Chave
Multiple constant multiplication, binary adder structure, propagation time, silicon area, circuit optimization
Contents

Acronyms

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Original contributions
  1.4 Thesis organization

2 State of the art
  2.1 Binary Arithmetic Structures
    2.1.1 Ripple-Carry or Ripple-Adder structure
    2.1.2 Carry Look-Ahead Adder
    2.1.3 Sklansky Adder
    2.1.4 Modified Full Adder
    2.1.5 Reconfigurable architecture for Increment and Decrement
  2.2 Single Constant Multiplication
  2.3 Multiple Constant Multiplication
    2.3.1 Bull-Horrocks Algorithm
    2.3.2 Bull-Horrocks Modified algorithm
    2.3.3 N-dimensional Reduced Adder Graph
    2.3.4 Cumulative Benefit Heuristic
    2.3.5 Unconstrained Area optimization - ASSUME-A
  2.4 Technology Comparison
    2.4.1 Arithmetic Binary Structures
      2.4.1.A Adder
      2.4.1.B Increment and Decrement
    2.4.2 Multiple Constant Multiplication

3 Optimized Adder Structures for MCM
  3.1 Addition
    3.1.1 Mathematical formulation
    3.1.2 Example configurations
  3.2 Subtraction
    3.2.1 Mathematical formulation
    3.2.2 Example configuration
  3.3 Topologic description of the adder/subtracter module

4 Time model of the proposed adder/subtracter structure
  4.1 Simplest Model
  4.2 LSB Model
  4.3 Proposed Model
    4.3.1 Formalization
    4.3.2 Example

5 Time delay minimization through gate level metrics
  5.1 Functions
  5.2 Data Structure and classes
  5.3 Proposed optimization algorithm
    5.3.1 Partial term finding
    5.3.2 Minimizing the time
    5.3.3 Minimize area function

6 Results
  6.1 Optimized Adder Structure Evaluation
    6.1.1 Adder Structures
    6.1.2 Increment and Decrement Structures
  6.2 Structured Adder Block Evaluation
    6.2.1 Tyagi time and area evaluation
    6.2.2 Standard Cell time and area evaluation
  6.3 Multiple Constant Multiplication Structure

7 Conclusions

A Appendix A - Background
  A.1 Number Representations Systems
    A.1.1 Unsigned Binary Representation
    A.1.2 Signed Binary Representations
      A.1.2.A One's Complement
      A.1.2.B Two's Complement
      A.1.2.C Canonical Signed Digit
      A.1.2.D Minimal Signed Digit

B Appendix B - Used set of coefficients

C Appendix C - Comparison of filter fir10
List of Figures

2.1 Full Adder structure.
2.2 Ripple Carry structure for w bits.
2.3 Carry look-ahead adder structure for 8 bits.
2.4 Carry look-ahead adder blocks.
2.5 Sklansky prefix-adder structure for 8 bits.
2.6 Sklansky prefix-adder blocks.
2.7 Original scheme of the RID [20].
2.8 Modified scheme of the Reconfigurable architecture for Increment and Decrement (RID) input selection block.
2.9 Scheme of the RID decision block.
2.10 Modified scheme of the RID output selection block.
2.11 Single Constant Multiplication (SCM) structure to compute 23x.
2.12 Multiple constant multiplication with the terms 7 and 11.
2.13 Bull-Horrocks Algorithm (BHA) graph representation.
2.14 N-dimensional Reduced Adder Graph (RAG-N) graph representation.
2.15 Distance cases handled by the algorithm Cumulative Benefit Heuristic (Hcub).
2.16 Graph topologies for optimal and exact distance tests.
2.17 Examples of cost function in a Boolean network.
2.18 Adder area and time delay comparison under Tyagi's metric.
2.19 Comparison between Half Adder (HA)+Modified Full Adder (MFA) and [20].
2.20 Implementation characteristics of MCM structures.
2.21 Bit precision study for the Inverse Quantization (QI) coefficients set.
2.22 Bit precision study for the Forward Quantization (QF) coefficients set.
3.1 Adder block with its inputs left shifted and the resulting output right shifted.
3.2 Mathematical formulation for the addition operation.
3.3 Zone 3 divided into 2 sub-zones.
3.4 Zone 4 divided into 2 sub-zones.
3.5 Mathematical formulation for the subtracter.
3.6 Zone 2 divided into 2 sub-zones.
3.7 Complete adder configuration.
4.1 Simplest model: critical path estimation.
4.2 Least Significant Bit (LSB) model.
4.3 Definition of the critical path for the proposed model.
5.1 Main algorithm flowchart.
5.2 Find partials algorithm flowchart.
5.3 Minimize path algorithm flowchart.
5.4 Minimize area algorithm flowchart.
5.5 Implementation of the term 89.
6.1 Area and propagation time for the Standard Cell library.
6.2 Model implementation of the Carry Look-Ahead Adder (CLA).
6.3 Comparison between HA+MFA and RID [20].
6.4 Model implementation of the HA+MFA and RID.
6.5 Adder implementations with Tyagi metrics.
6.6 Adder implementations with Standard cell metrics.
C.1 Graph for a test set fir10 with the proposed adder and algorithm.
C.2 Graph for a test set fir10 with the Levent adder and algorithm [3].
List of Tables

2.1 Original control signals.
2.2 Binary representation of 45 and its possible covers.
2.3 Characteristics of MCM structures for the QF coefficients set.
2.4 Characteristics of MCM structures for the QI coefficients set.
3.1 Zone 4 truth table.
3.2 Table summarizing the addition operation with r > 0.
3.3 Eight different cases of the addition operation.
3.4 Table summarizing the addition operation with r = 0.
3.5 Table summarizing the subtraction with right shift operation.
3.6 Eight different cases of the subtraction operation.
3.7 Definition and implementation conditions of zone 1.
3.8 Definition and implementation conditions of zone 2 a).
3.9 Definition and implementation conditions of zone 2 b).
3.10 Definition and implementation conditions of zone 3 a).
3.11 Definition and implementation conditions of zone 3 b).
3.12 Definition and implementation conditions of zone 4 a).
3.13 Zone 4 truth table for the select signal.
3.14 Definition and implementation conditions of zone 4 b).
3.15 Table summarizing both operations.
5.1 Data structure and classes.
5.2 Minimal Signed Digit (MSD) representations and respective paths and implementations of the term 45.
6.1 Propagation time and area obtained with Tyagi's model for the structured adder used in each MCM node.
6.2 Propagation time and area obtained with a standard cell implementation of the proposed adder used in each MCM node.
6.3 Experimental results of the MCM implementation after synthesis.
6.4 Experimental results of the MCM implementation after place and route.
B.1 Coefficient sets.
Acronyms

Hcub Cumulative Benefit Heuristic
ASIC Application Specific Integrated Circuits
ASSUME-A Unconstrained Area optimization
BHA Bull-Horrocks Algorithm
BHM Bull-Horrocks Modified algorithm
CLA Carry Look-Ahead Adder
CNF Conjunctive Normal Form
CSD Canonical Signed Digit
CSE Common Sub-Expression sharing
DCT Discrete Cosine Transform
DFS Depth-First Search
FA Full Adder
FIR Finite Impulse Response
HA Half Adder
ILP Integer Linear Programming
LSB Least Significant Bit
LSOB Least Significant One Bit
MAG Minimised Adder Graph
MCM Multiple Constant Multiplication
MFA Modified Full Adder
MSB Most Significant Bit
MSD Minimal Signed Digit
OC One's Complement
QF Forward Quantization
QI Inverse Quantization
QP Quantization Parameters
RAG-N N-dimensional Reduced Adder Graph
RC Ripple-Carry or Ripple-Adder structure
RID Reconfigurable architecture for Increment and Decrement
SA Sklansky Adder
SAT Satisfiability
SB Signed Binary
SCM Single Constant Multiplication
STL Standard Template Library
TC Two's Complement
VHDL VHSIC Hardware Description Language
VHSIC Very-High-Speed Integrated Circuits
1 Introduction
1.1 Motivation
High-speed multiplication is used in everything around us, from the simple processors used in
embedded systems to the general purpose processors present in high-performance clusters of servers.
As a consequence, it has been a subject of research for a long time and, even though there are
many different algorithms for number multiplication, we will focus on Multiple Constant
Multiplication (MCM) systems composed of add and shift operations. In fact, it is well known that,
by conveniently mapping a series of adds and shifts on a processing path, we can compute the
multiplication of a given variable by a set of constant coefficients. This multi-constant mapping can
be efficiently obtained by sharing intermediate results among the adders, which reduces the amount
of hardware required to implement a given set of target coefficients. The main difficulty that characterizes
the design of MCM systems is the fact that it is an NP-complete problem, as shown in [10].
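As a minimal illustration of this mapping (a Python sketch written for this text, with a hypothetical function name, not code from the thesis), the products 7x and 11x used later in figure 2.12 can be obtained with one subtraction, one addition and two shifts, sharing the intermediate term 7x:

```python
def mcm_7_11(x: int) -> tuple[int, int]:
    """Compute 7*x and 11*x with shifts and adds only,
    sharing the intermediate result 7x between the two outputs."""
    t7 = (x << 3) - x    # 8x - x = 7x
    t11 = t7 + (x << 2)  # 7x + 4x = 11x, reusing the 7x adder output
    return t7, t11
```

Sharing 7x saves one operation compared to building 11x = 8x + 2x + x from scratch; finding such sharings for a whole set of coefficients is precisely the NP-complete MCM problem.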
In recent years, the MCM research trend has focused on reducing the area of the multiplier
block, while keeping the number of additions to a minimum. When processing a set of coefficients,
there are many possible ways to connect the adders. Different decisions lead to different approaches
to solve the MCM problem, ranging from exhaustive search and the sharing of common partial terms
to the graph-based coefficient search presented next.
The graph-based approach is one of the most prevalent techniques, represented by several
contributions in the literature. As an example, the RAG-N [12] technique uses a pre-computed
table containing, for each coefficient, the optimal cost of structures up to two adder-steps deep
(two adders in series), and synthesizes the circuit based on it. However, the computational
requirements of graph-based methods are quite high; as a consequence, only coefficients
up to 19 bits are feasible. The Hcub [29] technique augmented the search space by adding a
heuristic algorithm capable of handling greater depths (more than three adder-steps). The technique used
by Unconstrained Area optimization - ASSUME-A (ASSUME-A) [14] takes a different approach:
the coefficients are represented in Canonical Signed Digit (CSD) or Minimal Signed Digit (MSD).
Both representations have the property of a minimal number of non-zero digits. The aim of
using them is to reduce the complexity of the algorithm, by giving direct access to a minimal number
of partial terms. The solution is obtained by translating the problem into a Boolean network and
solving it as a 0-1 Integer Linear Programming (ILP) problem with a Satisfiability (SAT) solver.
The main problem of the previous algorithms [12, 14, 29] is the absence of a direct relation
with the targeted implementation technology and, in particular, the lack of gate-level metrics, i.e.
all three approaches assign unitary weights to the nodes of the MCM. Recently, the work developed
by Aksoy et al. [2] brought some optimizations at the gate-level metric, in order to achieve
a smaller area. Their proposal is based on a custom adder structure made of full adders and half
adders that optimizes the resulting MCM circuit area. For this purpose, the area model of the
custom adder is introduced in a graph-generation algorithm [14], in order to achieve a smaller
area than the original approach. However, it still lacks a time delay model. In [19], the author
achieved an architecture that greatly improves the latency by using a sum-of-products architecture
in conjunction with a column-compression algorithm. However, it still loses in terms of the attained
throughput.
While [2] introduced gate-level metrics and [19] managed to optimize the latency with a
new technique applied to the MCM, to the best of the authors' knowledge there is
still no approach that optimizes the latency at the gate level. We aim to achieve
low latency and high throughput while maintaining the shift-and-add method. For this purpose, a
custom adder structure will be proposed, defined as a modular architecture that is able to adapt
to the needs of different MCM specifications, with a consequent improvement of the latency when
compared to the existing state of the art.
In the presented work, the adder block is based on a hybrid Carry Look-Ahead Adder (CLA)
structure and on the Reconfigurable architecture for Increment and Decrement (RID) presented in
[20]. The cost function for the adder block provides both area and time delay measures, according
to a given implementation technology, based on synthesis values. Accordingly, the cost varies
with the size of the operation implemented in each module of the block. Upon an accurate
modelling of the obtained adder structure, we developed an algorithm that uses this
model to implement the fastest possible multiplier structure.
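To make the idea of a cost-driven node choice concrete, such a selection can be sketched as follows; every name and cost figure below is a hypothetical placeholder standing in for the synthesis-derived models developed later in the thesis:

```python
# Illustrative sketch only: cost values are hypothetical placeholders,
# not the synthesized figures used in the thesis.
from typing import Callable

# Per-implementation cost model: bit-width -> (delay, area)
COST_MODELS: dict[str, Callable[[int], tuple[float, float]]] = {
    "ripple": lambda w: (3.0 * w, 9.0 * w),
    # (w - 1).bit_length() equals ceil(log2 w) for w >= 1
    "cla": lambda w: (4.0 * (w - 1).bit_length() + 1, 11.0 * w - 3),
}

def pick_fastest(width: int) -> str:
    """Select the node implementation with the smallest delay,
    breaking ties by the smaller area (tuples compare lexicographically)."""
    return min(COST_MODELS, key=lambda name: COST_MODELS[name](width))
```

The actual algorithm additionally constrains the area once the minimum critical path is fixed, as described in chapter 5.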
1.2 Objectives
The main problems that were studied in the scope of this thesis can be stated by the following
questions:
• How can the speed of a multiplier block be improved?
• Is it enough to change the MCM nodes implementation?
• Can we improve the MCM design for low-latency and high throughput?
Hence, the overall objective of this work is to improve the latency of the MCM structure,
while still using the area optimizations based on the work by Aksoy et al. [2]. This objective is pursued through:
1. The development of an MCM node that minimizes the time-delay while still having an
acceptable area occupation. Even though there is already research on many types
of latency-optimized adders, to the best of our knowledge there is no adder
specifically optimized for the MCM structure. Hence, we propose an adder structure for
the MCM problem. This structure can be adapted to the needs of the designer, either for
time-delay or for area improvement.
2. The proposal of an algorithm capable of finding the MCM structure with the smallest
critical path, given a node implementation. This objective arises as a natural extension,
through gate-level metrics, of the time-delay optimization in the work of [3]. While still
considering the area optimization, we extend the algorithm to introduce the
time optimization, so that it finds a solution with a given critical path as the main
constraint.
1.3 Original contributions
The original contributions of this thesis are the following:
• New framework for MCM-optimized adder structures: optimized adder structures for
MCM architectures already exist [2]. Nonetheless, no generic structure had been defined.
Through a careful analysis of the MCM requirements, a new framework has been developed:
the addition operation is divided into different modules that compute the output
bits in groups, making the adder scalable to the designer's needs.
• Formal model of the critical path in a chain of adders: based on the proposed frame-
work, we built a model capable of defining the critical path of a chain of adders with better
accuracy than the existing ones. We essentially grouped the bits of the result according to
their corresponding critical path.
• Adder structure optimized for time-delay minimization in MCMs: with the use of the
above-mentioned framework, we were able to minimize the time-delay of the multiplier block.
By considering different increment, decrement and adder implementations, we found a
combination that improved the time-delay without significantly increasing the resulting
area.
• Algorithm to define the optimized MCM structure with the lowest time-delay: based on
the devised model and the implemented adder structure, an algorithm was developed that
finds the lowest possible time-delay of an MCM and sets it as a constraint; it then optimizes
the area without exceeding this constraint.
1.4 Thesis organization
To address these challenges, the theory and technology behind the MCM problem are
briefly revised in chapter 2. Chapter 3 proposes an improvement of the adder structures inside
the MCM, by designing a custom adder based on the shifts and on the bit-width disparity between
operands. Chapter 4 presents the definition of an accurate time delay model of the MCM adder
structure. This model is extensively used in chapter 5 to define an algorithm that builds the
whole MCM circuit, first by checking the minimum delays for each target coefficient and then
by optimizing the area with the critical path as a constraint. Finally, the results of these
modifications are shown and discussed in chapter 6. Chapter 7 draws the final conclusions of this thesis.
Before proceeding to the following chapters, readers who need to revisit some basic concepts
may consult the background material in appendix A, which covers the basic concepts needed
for the understanding of this thesis.
2 State of the art
To improve the performance of the multiplication operation, two main classes of MCM
techniques have been proposed in recent years: the parallel-MCM and the multiplexed-MCM (mux-
MCM). The first class is a single-input/multiple-output system, while the second is a
single-input/single-output system. In the latter, the multiplexing makes it possible to further
reduce the silicon area with respect to the parallel-MCM, at the expense of the time delay introduced by
the multiplexers. As a consequence, the mux-MCM is not treated here, since it has a
greater time-delay than the parallel-MCM. Accordingly, from here on, we will refer to the parallel-MCM
simply as MCM.
First, the binary arithmetic structures that will be used throughout the thesis are presented in the
next section. Then, to better understand how constant multiplication works on an MCM graph, a
brief introduction to Single Constant Multiplication (SCM) will be presented, followed by a revision
of the main concepts of MCM.
2.1 Binary Arithmetic Structures
In the presented research, the evaluation of the proposed MCM circuits will be conducted
by considering the silicon area and the propagation time of the components in the system. Since
the adder is the main expense in the MCM circuit, this section provides a brief overview of some
of the existing adder structures. On one side, there are adders that minimize the area; on the other,
those that minimize the time delay, at the expense of an increase in area. Hence, the
best compromise in this area-time trade-off depends on the specific requirements of the circuit
to be implemented. In this section, the CLA, the Sklansky Adder (SA) and the Ripple-Carry or
Ripple-Adder structure (RC) are studied in more depth.
Throughout this section, the presented overview heavily relies on Roma et al. [27]. In the
following, the symbol w denotes the bit-width of the input operands. Since the main aim is
speed optimization, we will look in particular at parallel structures (CLA and SA),
which will be compared with the straightforward serial structure (RC).
2.1.1 Ripple-Carry or Ripple-Adder structure
First, let us start by describing the Full Adder (FA). This circuit has three inputs
(the operands a and b, plus the carry-in cin) and two outputs (the sum output s and the carry-out cout).
The two outputs are given by the following expressions:
s = a ⊕ b ⊕ cin (2.1)
cout = a · b + a · cin + b · cin (2.2)
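Equations (2.1) and (2.2) can be checked bit-by-bit in software (an illustrative sketch, not a hardware description):

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder implementing equations (2.1) and (2.2)."""
    s = a ^ b ^ cin                          # s = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)   # majority of the three inputs
    return s, cout
```

For any input combination, 2·cout + s equals the arithmetic sum a + b + cin.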
The internal structure of the FA is given by the logic circuit in figure 2.1.
Figure 2.1: Full Adder structure (figure from [27]).
According to Tyagi's model [28], the time delays and area occupation are as follows:
T_FA^cout = T_AND2 + T_OR3 = 3
T_FA^s = 2 · T_XOR2 = 4
A_FA = 3 · A_AND2 + A_OR3 + 2 · A_XOR2 = 3 + 2 + 4 = 9
The above design only works with 1-bit operands. For wider operands, the RC reuses the
above circuit by cascading w FAs for a w-bit addition. This topology connects the carry-out of
one adder to the carry-in of the next, thus obtaining a cascaded serial structure. As will
be seen, when compared to the other adder structures this one is the slowest, since each FA must wait
for the carry-out bit of the previous FA. For this reason, this topology is usually referred to as
the Ripple-Carry or Ripple-Adder structure. The circuit corresponding to a w-bit RC is shown in figure 2.2.
Figure 2.2: Ripple Carry structure for w bits (figure from [27]).
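The same cascade can be mirrored in software (an illustrative bit-level sketch of the serial carry chain, not a hardware description):

```python
def ripple_carry(a: int, b: int, w: int, cin: int = 0) -> tuple[int, int]:
    """w-bit ripple-carry adder: w cascaded full adders, the carry-out
    of each stage feeding the carry-in of the next."""
    s, c = 0, cin
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i                  # sum bit of stage i
        c = (ai & bi) | (ai & c) | (bi & c)      # carry into stage i + 1
    return s, c  # w-bit sum and final carry-out
```

For example, adding 13 and 10 on 4 bits yields the truncated sum 7 with carry-out 1, since 13 + 10 = 23 = 16 + 7.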
According to Tyagi’s model, the time delays and area occupations are as follows:
2.1.2 Carry Look-Ahead Adder
The CLA adder is composed of a binary tree structure with ⌈log2 w⌉ hierarchical logic levels,
as seen in figure 2.3. It is generally faster than a RC, which has w hierarchical logic levels. In contrast
to the RC, where the main propagation delay comes from the carry bits that need to be serially
generated and propagated, in this topology the carry is calculated in parallel, by propagating the
signal through the tree. As figure 2.4 illustrates, the carry bits (ci) are computed according to the
following equation:
cj+1 = Gi,j + Pi,j · ci (2.3)
2. State of the art
T^cout_RC = w × (T_AND2 + T_OR3) = 3w
T^s_RC = (w − 1) × (T_AND2 + T_OR3) + T_XOR2 = 3(w − 1) + 2
A_RC = w × A_FA = 9w
Where:
G_i,k = G_j+1,k + P_j+1,k · G_i,j (2.4)
P_i,k = P_i,j · P_j+1,k (2.5)
G_i,k is the carry signal generated from inputs i to k and P_i,k is the propagation signal generated
from the same inputs, for i ≤ j < k. These two signals can be computed from G_i,i = g_i = a_i · b_i and
P_i,i = p_i = a_i + b_i, which are defined in figure 2.4a.
According to figure 2.4b, the sum vector can be obtained with:
si = ai ⊕ bi ⊕ ci (2.6)
The carry-out signal can be calculated with:
c_w = G_0,w−1 + P_0,w−1 · c_0 (2.7)
Figure 2.3: Carry look-ahead adder structure for 8 bits (figure from [27]).
According to Tyagi's model, the time delay and area occupation models are as follows:
T^cout_CLA = 2 · ⌈log2 w⌉ + 3
T^s_CLA = 4 · ⌈log2 w⌉ + 1
A_CLA = 11 · w̄ − 3
In these equations, w̄ = 2^⌈log2 w⌉ denotes the smallest integer power of two such that w ≤ w̄.
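The recursive construction of the group generate/propagate signals (eqs. 2.3 to 2.7) can be sketched as follows. This is an illustrative Python model with names of our own choosing, not the thesis' implementation:

```python
def group_gp(a, b, i, k):
    """(G_{i,k}, P_{i,k}) for the bit range [i, k], per eqs. (2.4)-(2.5)."""
    if i == k:  # leaf: G_{i,i} = a_i * b_i, P_{i,i} = a_i + b_i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        return ai & bi, ai | bi
    j = (i + k) // 2  # split the range, as in the binary tree of figure 2.3
    g_lo, p_lo = group_gp(a, b, i, j)
    g_hi, p_hi = group_gp(a, b, j + 1, k)
    return g_hi | (p_hi & g_lo), p_lo & p_hi

def cla_add(a, b, w, c0=0):
    """w-bit CLA: c_{i+1} = G_{0,i} + P_{0,i}*c0 (eq. 2.7), s_i per eq. (2.6)."""
    s, c = 0, c0
    for i in range(w):
        s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ c) << i
        g, p = group_gp(a, b, 0, i)
        c = g | (p & c0)
    return s, c
```

The recursion depth is ⌈log2 w⌉, mirroring the logarithmic delay of the CLA tree.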
Due to the binary tree structure adopted by this CLA, under certain circumstances this
structure can lead to a waste of hardware resources. Another fact to notice is that the
slowest signal to be processed is the sum output, while the carry-out takes roughly half of this
time. In the next section, we will present a faster adder, the SA.
Figure 2.4: Carry look-ahead adder blocks: (a) Block A descending; (b) Block A ascending; (c) Block B descending; (d) Block B ascending (figures from [27]).
2.1.3 Sklansky Adder
The use of prefix-adder structures [30] is common in the majority of today's processors.
One such case is the topology proposed by Sklansky [26]. An example of this struc-
ture is shown in figure 2.5. The prefix architectures are based on the fact that the outputs
(yw−1, yw−2, . . . , y0) are computed from the w-bits inputs (xw−1, xw−2, . . . , x0) by means of an
associative generic binary operator ?:
yi = xi ? yi−1 ; i = 1, 2, . . . , w − 1 (2.8)
with y0 = x0.
Figure 2.5: Sklansky prefix-adder structure for 8 bits (figure from [27]).
Since the outputs depend on all the prior inputs, any change on them will influence the final
value. Moreover, the associative property of the ⋆ operator enables the operations to be executed
in any arbitrary order. Hence, by grouping together subsets of operands, it becomes possible to
compute parallel partial solutions corresponding to subsets of input bits. We will denote
these groups as Y_i,k. Each parcel Y_i,k will be processed at a different level, giving rise to
m = log2(w) + 2 intermediate levels needed to compute the full solution. The variable Y^l_i,k stands
for the output of the operation that takes a subset of input bits (x_k−1, x_k−2, . . . , x_i) at level l. The last
level m will comprise the entire range of input bits, from 0 to i (Y^m_0,i), leading to the final result.
Y^0_i,i = x_i (2.9)
Y^l_i,k = Y^l−1_i,j ⋆ Y^l−1_j+1,k , i ≤ j < k ; l = 1, 2, . . . , m (2.10)
y_i = Y^m_0,i , i = 0, 1, . . . , (w − 1) (2.11)
Figure 2.6: Sklansky prefix-adder blocks: (a) Pre-processing block; (b) Prefix computational unit; (c) Prefix neutral unit; (d) Post-processing block; (e) Pre-processing block of the least significant bit (figures from [27]).
Sklansky improved this architecture by adopting a tree-type model that minimizes the pro-
cessing time of the output bits. This allows all intermediate signals to propagate through the tree
in parallel, feeding the higher-level bits that need those signals. One characteristic in common with
the CLA is the use of intermediate generate and propagate signals, which assume three different
meanings:
• Generation of the carry signal at logic level ‘0’ (or deactivation of the carry signal at logic
level ‘1’);
• Generation of the carry signal at logic level ‘1’;
• Propagation of the carry signal.
We name the sets of these signals as the generation group (G^l_i,k) and the propagation group (P^l_i,k).
They are used to calculate Y^l+1_i,k = (G^l+1_i,k, P^l+1_i,k).
The first level signal pair corresponds to the generation bit gi and the propagation bit pi, which are
determined from the input operands on a pre-processing block (see figure 2.6a) by the following
equations: gi = aibi and pi = ai + bi. This signal pair is propagated to the higher levels as follows
(figure 2.6b):
(G^l_i,k, P^l_i,k) = (G^l−1_j+1,k + P^l−1_j+1,k · G^l−1_i,j , P^l−1_i,j · P^l−1_j+1,k), (2.12)
i ≤ j < k, l = 1, 2, . . . , m
Meanwhile, the carry bits of the carry vector can be computed from the last level signals
(G^m_0,i, P^m_0,i) according to eq. 2.13. The sum bits are computed in a post-processing operation (see
figure 2.6d), by using eq. 2.14:
c_i+1 = G^m_0,i + P^m_0,i · c_i, i = 0, 1, . . . , (w − 1) (2.13)
s_i = p_i ⊕ g_i ⊕ c_i, with c_0 = c_in (2.14)
Since the first element receives a carry signal that comes from the input of the adder, we need to
include it in the pre-processing block. In this case, we can implement it as in figure 2.6e (notice
the similarities with the RC seen in section 2.1.1):
g^0_0 = a_0 · b_0 + a_0 · c_0 + b_0 · c_0 (2.15)
Finally, the block in figure 2.6c represents the direct connection, taking a signal as operand and
transmitting it to the next level without any processing.
According to Tyagi's model [28], the time delays and area occupations are as follows:
T^cout_SA = 2 · log2 w̄ + 5
T^s_SA = 2 · log2 w̄ + 5
A_SA = (3/2) · w̄ · log2 w̄ + 4 · w̄ + 5
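The Sklansky prefix network of figure 2.5 can be sketched as below: the pre-processing block computes (g_i, p_i), eq. (2.12) is applied at each of the log2 w tree levels, and the post-processing block produces the sum bits. This is an illustrative Python model assuming w is a power of two; the names are ours:

```python
import math

def sklansky_add(a, b, w, cin=0):
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(w)]  # g_i = a_i * b_i
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(w)]  # p_i = a_i + b_i
    g[0] |= p[0] & cin  # LSB pre-processing block, eq. (2.15)
    G, P = g[:], p[:]
    for l in range(int(math.log2(w))):   # Sklansky tree levels
        for i in range(w):
            if (i >> l) & 1:             # nodes that combine at this level
                j = ((i >> l) << l) - 1  # boundary node of the lower block
                G[i] |= P[i] & G[j]      # eq. (2.12), generate part
                P[i] &= P[j]             # eq. (2.12), propagate part
    s = 0
    for i in range(w):                   # post-processing: s_i = a_i ^ b_i ^ c_i
        ci = cin if i == 0 else G[i - 1] # after the scan, G[i-1] equals c_i
        s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ ci) << i
    return s, G[w - 1]                   # sum and carry-out
```

Unlike the recursive CLA sketch, all combinations at a given level are independent, which is what allows the hardware to evaluate them in parallel.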
2.1.4 Modified Full Adder
A special implementation of the FA, the Modified Full Adder (MFA), was developed in [2] with
the purpose of area minimization in a hybrid adder. The idea behind this work is to minimize area
in the subtraction part of an MCM, by using a decrementer.
In a FA block, given that one of the inputs is constant and equal to one, the sum and the carry-out (cout)
become functions of the remaining input ui and the carry input (cin), given by: sum = cin ⊕ ui and cout = cin + ui.
This implementation will be used later in this thesis.
2.1.5 Reconfigurable architecture for Increment and Decrement
The increment/decrement circuit is a structure that can add/subtract one to/from a given number. In
this work, it is a common building block in many architectures, due to its simplicity when compared to a
full adder. In this section, the work done by Kumar et al. [20] will be studied. The architecture will be
called RID. Note that the full architecture of [20] implements four functions: increment, decrement,
Two's Complement (TC) and priority encoding. Of these four, we will only use the increment and
the decrement. Thus, the architecture used for our purpose is simpler, faster and uses less silicon
area than the original one.
The possibility of merging multiple architectures into one comes from:
1. similarities between the implementation of TC and decrementer circuit;
2. the TC circuit can be used as an incrementer by complementing the input.
For example, to find the decremented output of the binary number 11011100, we complement
the input bits up to and including the occurrence of the Least Significant One Bit (LSOB). After the
occurrence of the LSOB, all the remaining output bits are kept unchanged, thus giving the binary number
11011011, where the underlined bits are the complemented ones.
To find the incremented value of 11011100, we first complement the value, resulting in 00100011.
Next, the TC is calculated, giving 11011101 and thus the incremented value of the initial
number.
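The two bit-manipulation rules above can be sketched in Python. This is an illustrative model of the behaviour, not of the circuit; the function names are ours:

```python
def decrement_lsob(x):
    """Decrement x > 0 by complementing every bit up to and including the LSOB."""
    lsob = x & -x            # isolates the Least Significant One Bit
    mask = (lsob << 1) - 1   # covers the LSOB and all bits below it
    return x ^ mask          # complemented low bits; higher bits unchanged

def increment_via_tc(x, width=8):
    """Increment x by complementing it and taking its Two's Complement (TC)."""
    full = (1 << width) - 1
    comp = x ^ full          # bitwise complement of the input
    return -comp & full      # TC of the complement = x + 1 (mod 2**width)
```

Running decrement_lsob(0b11011100) gives 0b11011011 and increment_via_tc(0b11011100) gives 0b11011101, matching the worked examples above.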
In both of these examples, the authors of [20] identified the common characteristics between
the different functions needed. In our specific case, the increment and decrement cases differ in
the input complement and in finding the LSOB, respectively. Both of these circuits have the TC in
common. Figure 2.7 represents the original structure that implements the increment, decrement,
TC and priority encoder functions. The inputs of the system are Z, Cnt1 and Cnt0. The output is O.
The signals I and D are intermediate calculation signals. The function is selected with the help of
the control signals Cnt1 and Cnt0, as given by table 2.1.
Figure 2.7: Original scheme of the RID [20] (Input Selection Block, Decision Block and Output Selection Block, with input Z, output O, control signals Cnt1/Cnt0 and intermediate signals I and D).
Table 2.1: Original control signals
Cnt1 Cnt0 | Operation performed
0    0    | Increment
0    1    | Decrement
1    0    | TC
1    1    | Priority Encode
To adapt the structure to our needs, some simplifications are possible. These simplifications
are associated with the functions needed later in this thesis. For this purpose, we will only keep
the increment/decrement functions, associated with the control signal Cnt1 = 0. The modifications
are:
• in the input block, the NOR gate is changed into a NOT gate with input Cnt0 only;
• in the output block, the multiplexer with Cnt1 as select signal is not needed;
• still in the output block, the NAND gate that has inputs Cnt0 and the negated Cnt1 disap-
pears: Cnt0 · ¬Cnt1 = Cnt0 · ¬0 = Cnt0 · 1 = Cnt0.
Figures 2.8 and 2.10 present the schemes of the modified input and output selection blocks.
Moreover, the structure used here is the proposed prefix-based type II, also known as Sklansky,
shown in figure 2.9. This decision architecture was selected due to its low latency when compared
to the other architectures presented in [20].
Figure 2.8: Modified scheme of the RID input selection block.
Figure 2.9: Scheme of the RID decision block.
Figure 2.10: Modified scheme of the RID output selection block.
With these modifications, the circuit has less latency and occupies less area. Further ahead
in this chapter, a benchmark will compare this architecture with other existing ones.
Discussion
We presented an overview of different adder implementations: one serial and two parallel
topologies. From the theoretical perspective, the parallel adders are faster than the serial one,
but they also occupy more silicon area. From the point of view of this thesis, low delay is more
important than silicon area. Therefore, we will focus on the faster parallel topologies and choose
whichever is best suited for the hybrid architecture presented in this thesis.
2.2 Single Constant Multiplication
The implementation of the SCM operation is based on a series of additions and shifts that together
multiply a given number by a constant coefficient. For example, if we want to compute x × 8 =
x × 2^3, this operation can be trivially implemented as a left shift of three bits:
00001₂ (1₁₀) << 3 = 01000₂ (8₁₀)
In the following, we will adopt the subscripts 2 and 10 to distinguish between the binary and decimal
representation systems, respectively. By default, we use the decimal representation. If we replace
the input operand x by the value 3, we obtain: 00011₂ (3₁₀) << 3 = 11000₂ (24₁₀).
By extending this trivial example with more adders and shifts, we can implement any multiplication
operation. As another simple example, the multiplication of x by the constant 10 can be attained
by adding x << 2 with x, followed by one final left shift:
x << 2 = 4x
x << 0 = x
[4x + x] << 1 = 10x
In this notation, x represents the primary input. The middle terms are called partial terms or partial
products. In this case, 4x is the only one. 10x is called the constant coefficient which we want to
generate.
The decomposition of a coefficient may lead to different ways of implementing the SCM. For
example, we can also obtain the above multiplication as:
x << 3 = 8x
x << 1 = 2x
8x + 2x = 10x
or even:
x << 1 = 2x ; x << 0 = x ; x << 1 = 2x
[2x + x + 2x] << 1 = 10x
From the above example, we observe that the implementation 8x + 2x will use one adder and
two shifts, while 2x + x + 2x will use two adders and three shifts. The task of choosing which
implementation to use is simple for small constants. However, as the magnitude of the constants
grows, so does the complexity. Furthermore, the choice of the best implementation
also depends on the specifications of the hardware that will be used for the adders.
Another aspect that we will take into account is the number of operations in series, which
we call adder-steps. The maximum number of adder-steps defines the delay of the coefficient
computation, i.e. as the number of adder-steps increases, so does the time delay of that particular
coefficient. An instance of this is shown in figure 2.11, where the implementation 23x = 2^4·x +
[2^2·x + [2^1·x + x]] has three adder-steps, whereas 23x = [2^4·x + 2^2·x] + [2^1·x + x] has only two
adder-steps [3].
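The two decompositions of 23x in figure 2.11 can be written as shift-and-add code. This is an illustrative Python sketch; the function names are ours:

```python
def mul23_three_steps(x):
    """23x = 2^4 x + [2^2 x + [2^1 x + x]]: three adders in series."""
    t1 = (x << 1) + x     # 3x   (adder-step 1)
    t2 = (x << 2) + t1    # 7x   (adder-step 2)
    return (x << 4) + t2  # 23x  (adder-step 3)

def mul23_two_steps(x):
    """23x = [2^4 x + 2^2 x] + [2^1 x + x]: two parallel adders, then one more."""
    a = (x << 4) + (x << 2)  # 20x (adder-step 1)
    b = (x << 1) + x         # 3x  (adder-step 1, in parallel with the above)
    return a + b             # 23x (adder-step 2)
```

Both versions use three adders, but the second arranges them in only two serial adder-steps, hence a shorter critical path.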
Figure 2.11: SCM structure to compute 23x: (a) 3 adder-steps; (b) 2 adder-steps.
To solve the problem of optimizing the number of adder-steps in the implementation, the
MSD or CSD representations have frequently been used to provide implementations with a
minimum number of adders/subtracters. As an example, the value 55 can be represented as:
0110111₂ → (1 0 0 −1 0 0 −1)_CSD (55₁₀). In this case, we observe that the binary representation has 5
ones, while the CSD representation has only 3 non-zero digits. What this means is that if the SCM is
designed by considering the binary representation, then at least 3 adder-steps are needed:
0110111₂ = (32₁₀ + 16₁₀) + ((4₁₀ + 2₁₀) + 1₁₀).
In contrast, if the implementation is based on the CSD representation, only 2 subtracters are
needed, with two adder-steps: (1 0 0 −1 0 0 −1)₂ = (64₁₀ − 8₁₀) − 1₁₀.
When there is more than one constant multiplying the input, the resulting circuit is
called an MCM. The next section will review different MCM implementation techniques. In general,
the resolution of the MCM problem does not reduce to the resolution of multiple SCM problems.
2.3 Multiple Constant Multiplication
The main principle that has driven the MCM problem is the reduction of silicon area
through partial term sharing. It is not a simple problem: it was even proved to be NP-complete
[10]. This section will focus on the state-of-the-art methods to decrease the needed hardware. Nat-
urally, the implementations that share more partial terms are usually the ones that implement the
MCM more efficiently.
To illustrate this principle, consider the implementation of the multiplication of a given input x
by two multiplicand terms 7₁₀ = 00111₂ and 11₁₀ = 01011₂, trying to share as many adders
Figure 2.12: Multiple constant multiplication with the terms 7 and 11: (a) computation of 7x; (b) computation of 11x; (c) simultaneous computation of 7x and 11x.
as possible. In this case, we can identify a common pattern in the binary representations of these
multiplicands: the sequence of two ones, 11₂ = 3₁₀. Once identified, the desired operations are
easily implemented by using an adder that computes 3x. The complete MCM will integrate three
adders instead of four, as seen in figure 2.12. This method is called Common Sub-Expression
sharing (CSE) and its principle consists in identifying bit-patterns across different implementations,
thus reducing the number of adders.
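The sharing of the pattern 11₂ = 3₁₀ between the two coefficients can be sketched as follows. This is an illustrative Python sketch; the function name is ours:

```python
def mcm_7_11(x):
    """Compute 7x and 11x while sharing the partial term 3x (pattern 11_2)."""
    t3 = (x << 1) + x                    # shared adder: 3x
    return (x << 2) + t3, (x << 3) + t3  # 7x = 4x + 3x, 11x = 8x + 3x
```

Three adders are used in total, instead of the four needed by two independent SCM blocks.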
Another way to reach this result is by looking at the partial products, as seen in section 2.2,
with the use of the CSD and MSD representations. However, other methods also exist.
Hence, the main aim of this section is to review the most used methods optimized for MCM
circuit design and to describe how the area reduction can be achieved. Some algorithms that
derive MCM graphs will be examined. In particular, the work proposed by Aksoy et al. [3] revealed
some performance improvements on the automation of the selection of the best partial terms.
Other possible algorithms are the Bull-Horrocks Algorithm (BHA) [8], the Bull-Horrocks Modified
algorithm (BHM) [12], the N-dimensional Reduced Adder Graph (RAG-N) [12], the Cumulative
Benefit Heuristic (Hcub) [29] and ASSUME-A (delay constraint) [14].
2.3.1 Bull-Horrocks Algorithm
The Bull-Horrocks Algorithm (BHA) was developed as a graph-based algorithm for the design
of MCM architectures for Finite Impulse Response (FIR) filters. The technique, developed by Bull
and Horrocks [8], represents the output coefficients in a graph, by building it from the primary
input to the target coefficients.
The algorithm works as follows. In a pre-processing phase, the target coefficients are sorted into
ascending magnitude order. Then, they are scaled such that all coefficients are non-negative numbers.
The processing phase consists of the generation of partial terms and their incorporation in the
graph, until all the target coefficients are part of it. To keep track of the approximate distance
between the target coefficients and the currently implemented graph, the algorithm defines a positive
error variable, given by e = h(m)/L, where h(M) is the target set of coefficients, h(m)
is coefficient number m ∈ M and L represents the largest power of two which divides h(m)
to give an integer result. As the algorithm proceeds, it creates sub-graphs from each vertex of
the graph, forming the powers of two of these vertex values, called the partial sum set. Note that the
maximum value of the partial sum set never exceeds max(h(m)). Its main cycle works as follows:
1. For each target coefficient not yet implemented, the algorithm tries to find appropriate partial
terms, so that the error variable is minimized. The partial terms are built with the terms
already present in the graph.
2. If the error is zero, the coefficient has been reached and is added to the graph.
(a) If all coefficients are implemented, the algorithm finishes.
(b) Else, change to another coefficient and go back to 1.
3. If the error is not zero, then add the partial term to the graph and go back to 1.
Figure 2.13: BHA graph representation of the set {1, 7, 16, 21, 33} (figure from [12]).
For example, consider the design of a multiplier by the set of terms w[n] = {1, 7, 16, 21, 33}.
The construction of the graph is shown in figure 2.13. The vertexes represent adders, while the
values over the vertexes are equal to the weight of the primary input at that vertex (i.e. a vertex
with value k represents the operation k × x[n]). The values over the edges represent the shift
operation on the vertex it originates from (i.e. an edge originating from y with value l represents the
operation y × l ⇔ y << n, with l = 2^n and n ≥ 0). Take as example the vertex 7 from figure 2.13; it
is obtained as follows: 7x = 4 × 1x + 1 × 3x, with ltop = 4, lbot = 1, ytop = 1 and ybot = 3.
The graph starts with the lowest valued coefficient, 1. Its processing is trivial, since the vertex x[n]
is equal to the vertex 1. From there, the algorithm processes coefficient 7: it finds out that the
term 3 minimizes the error. Therefore, it adds 3 to the graph as 3 = 1 + (1 << 1). Once 3 is in the
graph, the algorithm can implement 7. The processing of the coefficient 16 is also trivial, since it
is a simple power of two. The rest of the terms follow the same logic.
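One shift-and-add decomposition consistent with the target set of figure 2.13 can be evaluated as below. This is an illustrative Python sketch; the decompositions of 21 and 33 are our assumption, not necessarily the exact edges of the thesis' graph:

```python
def bha_example(x):
    """Evaluate one possible BHA-style graph for the set {1, 7, 16, 21, 33}."""
    t1 = x                 # coefficient 1: the primary input itself
    t3 = (x << 1) + x      # partial term 3 = 1 + (1 << 1), added to reach 7
    t7 = (x << 2) + t3     # 7 = 4 + 3
    t16 = x << 4           # 16: a trivial power of two, no adder needed
    t21 = t7 + (t7 << 1)   # 21 = 7 + 14   (assumed decomposition)
    t33 = (x << 5) + x     # 33 = 32 + 1   (assumed decomposition)
    return t1, t7, t16, t21, t33
```

With x = 1 the function returns the target set itself, confirming that every coefficient is reachable from previously built terms.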
Notice one relevant property of the BHA: the terms can only be implemented with
smaller valued terms. The next algorithm solves this limitation.
2.3.2 Bull-Horrocks Modified algorithm
The BHA [8] involves the creation of a set of possible partial sums from the existing
vertex values of the graph, from which new vertex values are synthesized. New vertexes are
then created until the set of constant coefficients is fully synthesized. An improved version of the
algorithm, the Bull-Horrocks Modified algorithm (BHM), was developed in [9] to alleviate the BHA
limitations. Some of its distinguishing characteristics with respect to the BHA are enumerated next:
1. In the BHA, the partial sums are generated with values only up to, but not exceeding, the
coefficient under consideration. Hence, the error is always positive. On the other hand, the
BHM may generate partial sum pairs above the maximum coefficient value. Hence, the error
can have both positive and negative values. At the implementation level, this leads to the
usage of subtracters, besides the usual adders.
2. In the BHA algorithm, even-valued partial sums are allowed in the partial sum set. In con-
trast, the BHM reduces each partial sum by factors of 2 until it becomes odd. Only then does it
generate its power-of-2 multiples. Hence, all the target coefficients are represented as pos-
itive and odd numbers. This reduces the MCM problem formulation to the processing of
positive and odd terms, which will be referred to as fundamentals from now on.
3. In the BHA, the coefficients are processed in ascending magnitude order. As with the partial
sums, the BHM algorithm first transforms all the coefficients into fundamentals. Only then does it
process them in ascending order.
Due to the first enumerated property, the wider range of allowed values makes it possible to take full
advantage of CSD-like features (e.g., by using 7 = 8 − 1 rather than 7 = 4 + 2 + 1) to achieve
more efficient implementations. Due to the second and third properties, even terms are simplified
into fundamentals, thus maximizing the number of partial sums available without increasing the
error variable. Take 48 as an example: with this approach, it can be simplified as 3 × 2^4, and the
partial sum set is then only composed of the fundamental 3 and all its power-of-two multiples.
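The reduction of a coefficient to its fundamental (properties 2 and 3 above) can be sketched as follows. This is an illustrative Python sketch; the function name is ours:

```python
def to_fundamental(c):
    """Reduce a coefficient to its fundamental: a positive, odd number."""
    c = abs(c)         # fundamentals are positive
    while c % 2 == 0:  # strip factors of two (they become output shifts)
        c //= 2
    return c
```

For instance, to_fundamental(48) returns 3, matching the 48 = 3 × 2^4 example above.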
Even though this algorithm usually leads to optimal graphs, it is essentially a heuristic
based on the CSD representation. The following algorithm tackles this problem in further detail.
2.3.3 N-dimensional Reduced Adder Graph
The N-dimensional Reduced Adder Graph (RAG-N) algorithm, proposed by Dempster et al.
[12], is a graph-based algorithm that integrates both optimal and suboptimal methods. It consists
of three decision phases that aim at the area optimization of the MCM: the first is an exact
method, the second a heuristic and the third an arbitrary choice.
In the first and exact part, the algorithm makes use of two pre-computed lookup tables from
the Minimised Adder Graph (MAG) algorithm [11]: one for the coefficient implementation cost,
covering 1 to 4096 (the upper limit is due to computational complexity), and another
containing the sets of fundamental pairs that map the optimum SCM implementation of
each of the coefficients in the cost table. If the target coefficient set is already fully synthesized
in the exact part of the algorithm, then this is the optimal solution with the lowest cost for that
set of coefficients (for a proof of this statement, refer to [12]). For the heuristic part, this algorithm
makes use of an improved cost metric (described in detail in [12]), which can be briefly explained
as follows: the adder distance of a new vertex from an existing graph is defined as the number of
extra adders (not in the graph) needed to reach a target vertex. The heuristic part aims to find the
minimal adder distance between the graph created by the optimal part of the algorithm and the
remaining terminating vertexes. Both the adder distance and the cost will influence the outcome
of the graph. Finally, the arbitrary part is used when both the optimal and heuristic parts fail, and
consists of an arbitrary choice. Each time a coefficient is added to the graph, it is removed from
the incomplete set. The complete procedure is described as follows:
1. In a pre-processing phase, the target coefficients are reduced to fundamentals and are
added to the incomplete set. After evaluating the coefficients' costs using the cost lookup
table mentioned above, all cost-0 fundamentals (powers of two, which reduce to
1) and repeated fundamentals are removed from the incomplete set. All the fundamentals
that are already implemented are added to the graph set and removed from the incomplete set.
2. The optimal part essentially synthesizes all target coefficients within the range of 1 to 4096,
together with their power-of-two multiples. The cost-1 coefficients in the incomplete set are im-
plemented using the lookup table mentioned above. As new fundamentals are implemented,
they are inserted in the graph set and the algorithm processes the remaining unimplemented
coefficients by examining pairwise multiples of the fundamentals in the graph set. If all the
targets are synthesized this way, then the solution is optimal, since all the fundamentals have
minimum cost.
3. If the optimal part cannot reach any more coefficients, then the heuristic part tries to find the
minimum adder distance needed to reach the terminating vertexes or target coeffi-
cients. Distance-2 vertexes are considered but, since the adder distance calculation is not an
exhaustive search, the heuristic part is suboptimal.
4. If the heuristic part cannot reach any fundamental in the incomplete set, then an arbitrary
choice is made: the fundamental with the lowest magnitude is selected and added to the
graph set. Once a new term is added, the algorithm goes back to the optimal part.
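The pre-processing phase (step 1 above) can be sketched as follows. This is an illustrative Python sketch; the cost evaluation against the MAG lookup table is omitted, and the function name is ours:

```python
def ragn_preprocess(targets):
    """Reduce targets to fundamentals; drop cost-0 (powers of two) and repeats."""
    incomplete = set()
    for t in targets:
        t = abs(t)
        while t % 2 == 0:  # powers of two reduce to the fundamental 1
            t //= 2
        incomplete.add(t)  # a set discards repeated fundamentals for free
    incomplete.discard(1)  # cost-0 fundamentals need no adder
    return sorted(incomplete)
```

For the running example {1, 7, 16, 21, 33}, this leaves [7, 21, 33] in the incomplete set.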
The presented algorithm is suboptimal for coefficients outside the lookup table. In practice,
there is an arbitrary choice of the synthesized fundamental and not all optimal graphs are consid-
ered (the lookup tables are precomputed and can only be computed up to 19 bits [16], due to the
involved computational complexity). Even so, the RAG-N algorithm provides better solutions than
the BHM and BHA algorithms. Figure 2.14, compared to figure 2.13, illustrates this: RAG-N uses
three adders while BHM and BHA use four.
Figure 2.14: RAG-N graph representation of the constraint set {1, 7, 16, 21, 33} (figure from [12]).
2.3.4 Cumulative Benefit Heuristic
The Cumulative Benefit Heuristic (Hcub), presented by Voronenko in [29], is a graph-based
algorithm heavily influenced by the RAG-N described in section 2.3.3. The main difference in the
heuristic part is the usage of improved distance estimators and the removal of the arbitrary choice
part. Another change is observed in the algorithm organization: the fundamentals are added one
at a time to the graph.
To better frame the competing algorithms, the author of [29] defines a framework for
MCM algorithms: the target set T ⊂ N, composed of the coefficients t ∈ T to implement;
the ready set R ⊂ N, which contains the implemented terms (at the termination of the algorithm,
it is the solution); the successor set S ⊂ N, which contains all terms at distance 1 from the set R;
and finally the working set W ⊂ N, which contains the newly synthesized terms at each iteration.
Furthermore, some definitions are presented to generalize some concepts introduced by the
author. A brief overview of these definitions is exposed here; for further detail, please refer to [29].
The A-operation is defined as an operation on fundamentals, either an addition or a subtraction,
combined with an arbitrary number of shifts which do not truncate non-zero bits of the fundamental.
The A-operation takes two integer inputs u, v (fundamentals) and produces one output fundamental,
according to the following parameters: let l1, l2 ∈ N* be the left shifts of the two input operands,
r ∈ N* the right shift at the output, and s ∈ {0, 1} the sign that dictates whether the
operation is an addition or a subtraction:
Ap(u, v) = |(u << l1) + (−1)^s (v << l2)| >> r (2.16)
= |2^l1 · u + (−1)^s · 2^l2 · v| · 2^−r (2.17)
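Eq. (2.16) translates directly into code. This is an illustrative Python sketch; the function name is ours:

```python
def a_op(u, v, l1, l2, r, s):
    """A-operation A_p(u, v): shifted addition/subtraction of two fundamentals."""
    return abs((u << l1) + (-1) ** s * (v << l2)) >> r
```

For instance, a_op(1, 1, 3, 0, 0, 1) computes |8 − 1| = 7, the CSD-style implementation of 7 mentioned earlier.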
Derived directly from the A-operation comes the A-configuration. It is the set of all possible
outputs (excluding the inputs themselves) of an A-operation with fixed inputs u and v under different
configurations p:
A*(u, v) = {Ap(u, v) | p is a valid configuration} − {u, v}
This definition can be extended to sets of inputs:
A*(U, V) = ( ∪_{u∈U, v∈V} A*(u, v) ) − U − V
Moreover, all constants can be divided into complexity classes. For that purpose, we denote by Cn the
set of all constants with complexity n; in other words, an optimal SCM solution of a constant in Cn
requires exactly n A-operations. For example, C0 = {2^a | a ≥ 0}, because precisely the power-of-two
constants require a single left shift and no additions/subtractions.
If cn ∈ Cn is a constant, the A-distance is defined as the minimum number of extra A-operations
required to obtain cn, given R. Computing the A-distance is an NP-complete problem; therefore,
the use of an estimation is adequate for higher-order A-distance computations. This notion corre-
sponds to the ”adder distance” of [12], described in section 2.3.3.
The different parts of the Hcub algorithm can be summarized as follows:
Figure 2.15: Distance cases handled by the Hcub algorithm: (a) distance 1, optimal; (b) distance 2, exact heuristic; (c) distance 3, exact heuristic; (d) distance ≥ 4, heuristic estimation. Solid disks denote available sets and dashed circles denote the sets that are not computed yet (figures from [29]).
• In a pre-processing phase the Hcub algorithm transforms all input coefficients into funda-
mentals.
• For each target in T not yet synthesized, the Hcub algorithm incrementally builds the successor set S
and the ready set R at each iteration, by adding the constants from set W.
– The optimal part of the algorithm synthesizes the targets found in the successor set,
as portrayed in figure 2.16a. In that figure, the equation t = Ap(r0, r1) means that, to
reach target coefficient t, an operation with operands r0 and r1 is needed. Figure 2.15a
demonstrates this procedure. If no more targets are found in S, the optimal part cannot
synthesize any more constants. This also means that the targets are more than one
A-distance away.
– Then, the heuristic part is executed. As can be seen in figures 2.15b, 2.15c and 2.15d,
the heuristic starts at distance 2, with Sd denoting the successor set at distance d. It
is separated in two sub-parts: for 1 < A-distance ≤ 3, exact tests are used; for
A-distance > 3, a distance estimator is used.
Figure 2.16: Graph topologies for optimal and exact distance tests: (a) distance-1 topology; (b) distance-2 topologies; (c) distance-3 topologies (figures from [29]).
1. For the exact calculation of the A-distance, the algorithm considers all possible
topologies that synthesize t using exactly d A-operations (each A-operation is
equivalent to one unit of A-distance). These situations are illustrated in figures 2.16b
and 2.16c. In figure 2.16b, the equation t = c1·s means that target coefficient t is
reached by multiplying a single input s by a complexity-1 constant c1. The explanation of
figure 2.16c follows the same logic. The exact test intersects A* with the
successor set S (refer to [29] for further details about the intersection that takes
place in each of the cases of figure 2.16). This is done up to A-distance ≤ 3, due
to the complexity involved for higher distances.
2. When the target is more than 3 A-distances away (see figure 2.15d), a distance estimation
is done: the successors are added to R based on a cumulative benefit estimation
(hence the ”cub” in Hcub). To achieve its goal, it bases the estimation on a ben-
efit function B that quantifies to what extent adding a successor s to the ready set
R improves the distance measure to an arbitrary target t. The benefit function fun-
damentally finds the cheapest intermediate value based on a CSD cost estimation.
Such a function is used to calculate the cumulative benefit over all targets in T. From
[29]:
Hcub(R, S, T) = arg max_{s∈S} ( Σ_{t∈T} B(R, s, t) )
where B is the weighted benefit function. This form of calculating the next suc-
cessor jointly optimizes all targets without actually synthesizing any target. However,
intermediate terms are synthesized, making the set R ”closer” to the solution.
• The final solution is found when all targets from set T are in set R.
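To make the cumulative-benefit selection concrete, the following Python sketch implements one greedy Hcub-style successor choice. The CSD cost function is the classical one, but `distance_estimate` is only a crude stand-in for the A-distance estimator of [29]; the function names, the shift bound and the benefit definition are our own illustrative choices, not part of the original algorithm.

```python
def csd_cost(n):
    """Number of non-zero digits in the canonical signed-digit (CSD)
    representation of n -- a classical estimate of shift-and-add cost."""
    n, cost = abs(n), 0
    while n:
        if n & 1:
            cost += 1
            # CSD rule: a ...11 tail becomes ...10(-1), so round up
            n = n + 1 if (n & 3) == 3 else n - 1
        n >>= 1
    return cost

def distance_estimate(ready, t, max_shift=8):
    """Crude proxy for Hcub's distance estimator (NOT the estimator of
    [29]): cost of t alone in CSD, or one adder combining a ready value
    (possibly shifted) with the cheapest CSD remainder."""
    best = csd_cost(t)
    for rv in ready:
        for k in range(max_shift + 1):
            best = min(best, 1 + csd_cost(abs(t - (rv << k))))
    return best

def pick_successor(ready, successors, targets):
    """One greedy step: choose the successor maximizing the cumulative
    benefit over all targets (the arg-max expression above), where the
    benefit of s for target t is the drop in estimated distance when s
    joins the ready set."""
    def benefit(s, t):
        return distance_estimate(ready, t) - distance_estimate(ready | {s}, t)
    return max(successors, key=lambda s: sum(benefit(s, t) for t in targets))
```

For example, with ready set {1} and target 45, the successor 45 itself yields a larger cumulative benefit than the intermediate 3, so it is picked first by this simplified rule.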
One limitation of this algorithm is the impossibility of customizing it in terms of the weight of the implementation cost (i.e., silicon area): each operation is considered to have a fixed unitary cost. The algorithm presented next addresses this problem for area estimation.
2.3.5 Unconstrained Area optimization - ASSUME-A
Flores et al. [14] proposed a new approach to the problem, using a SAT-based 0-1 ILP engine to solve a set of constraints and optimization functions. Later, Aksoy et al. [2] extended the formalization by adding the design of a custom adder/subtracter structure with gate-metric weights, in order to decrease the area occupation. The significant number of shifts present in a MCM structure, combined with the optimization of the adder structures, enabled the improvement of the existing frameworks. Furthermore, this algorithm incorporated the usage of the MSD representation for partial term generation, as proposed by Park et al. [22], enabling the consideration of a greater set of possible implementations for a given coefficient. In practice, the usage of the MSD representation allows a more sensible exploration of the search space. On the one hand, it restricts the search to the minimum number of adder-steps, thanks to the MSD property of minimizing the number of non-zero symbols. On the other hand, it generates one or more full decompositions of each term.
The general principles of the Unconstrained Area optimization (ASSUME-A) algorithm by Aksoy [3] can be described as follows:
1. After transforming all target coefficients into unique fundamentals, all the MSD representa-
tions of each term are determined and inserted in a set Cset;
2. The main loop of the algorithm processes the fundamental elements c ∈ Cset, as follows:
(a) Compute all partial term pairs that cover each element c. Note that a cover is regarded as a possible decomposition of a representation. Table 2.2 exemplifies this decomposition for element 45;
Representation   Implementation        Cover
0101101          (32 + 8) + (4 + 1)    40 + 5
                 (32 + 4) + (8 + 1)    36 + 9
                 (32 + 1) + (8 + 4)    33 + 12
Table 2.2: Binary representation of 45 and its possible covers.
(b) Convert each element of the cover pair into fundamentals;
(c) Add each cover pair to the corresponding set of covers of the element being processed;
(d) If the MSD representations have not yet been processed and are not already in Cset, add the representations of each cover to Cset. Covers with only one non-zero digit are skipped.
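The cover computation of step 2 (a) can be illustrated with a small sketch. For simplicity, it splits the non-zero bits of the plain binary representation into two non-empty groups; the actual algorithm enumerates covers over all MSD representations, which also contain negative digits, so this is only a simplified stand-in.

```python
def covers(c):
    """Enumerate two-operand covers of c by splitting the non-zero bits
    of its plain binary representation into two non-empty groups (a
    simplified stand-in for the MSD-based decomposition of step 2a)."""
    bits = [1 << i for i in range(c.bit_length()) if (c >> i) & 1]
    pairs = set()
    # Every proper, non-empty subset of the bits gives one cover pair.
    for mask in range(1, (1 << len(bits)) - 1):
        a = sum(b for i, b in enumerate(bits) if (mask >> i) & 1)
        pairs.add((min(a, c - a), max(a, c - a)))
    return sorted(pairs)
```

Applied to 45, this yields, among others, the covers 40 + 5, 36 + 9 and 33 + 12 listed in table 2.2.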
3. Following this procedure, the algorithm builds a Boolean network with AND and OR gates. Each AND gate represents an adder or subtracter that produces some partial term value and has two inputs, one for each operand. Each OR gate combines all the partial terms that yield the same value in the chosen numbering representation (possible options: MSD, CSD or binary). This means that the OR gate has a number of inputs equal to the number of possible implementations in the selected numbering representation. For each operation, a third input is needed: the optimization variable, used by the cost function to iterate through all the possible implementations. There are two ways to introduce the optimization variable to finalize the structure of the Boolean network, both leading to the same optimum solution:
• The optimization variable is included at the output of the OR gate with the addition of
an AND gate, as illustrated in figure 2.17a. This placement of the optimization variable
is used to minimize the overall number of partial terms.
• The optimization variable can also be included directly at the AND gates, as a third input. The aim of this placement is to minimize the cost of each individual operation, as exemplified in figure 2.17b.
Figure 2.17: Examples of cost function implementations in the Boolean network (figures from [3]): (a) inclusion of an AND gate that creates an optimization variable used to minimize the overall number of partial terms; (b) addition of an extra input per AND gate to create an optimization variable associated with each possible operation.
Both these functions achieve optimal solutions and yield the same cost. Under a given delay constraint (such as the maximum number of adder-steps or combinational levels) and after the Boolean network has been created, the algorithm then computes the different delays by traversing the network from the inputs to the outputs. These values are subsequently incorporated in the ILP solver as a constraint.
4. A set of basic rules is used to simplify the model:
• Multiple shifted versions of each input are freely available.
• If the requirements of a given operation are more stringent than those of another operation that generates the same partial term, we may remove it. For example, 15 = 9 + (3 << 1) requires partial terms 9 and 3, whereas 15 = (3 << 2) + 3 only requires partial term 3; thus, we may eliminate the former.
• If a coefficient can be implemented with a single operation whose inputs are the primary input or other coefficients, then this coefficient does not need to be represented in the Boolean network.
5. After this simplification, the Boolean network is translated into Conjunctive Normal Form
(CNF) and the cost function to be minimized is constructed as a linear function of the opti-
mization variables. Each clause is converted into a 0-1 ILP constraint. Finally, the obtained
model is fed to a generic SAT-based 0-1 ILP solver.
6. The final solution is the output of the solver.
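The clause-to-constraint translation of step 5 follows a standard pattern: each CNF clause becomes a 0-1 ILP constraint requiring at least one satisfied literal, with each negated literal x contributing (1 − x) and thus shifting the right-hand side. A minimal sketch, using DIMACS-style signed-integer literals; the helper name is ours, not code from [14] or [2].

```python
def clause_to_ilp(clause):
    """Convert a CNF clause (list of signed ints, DIMACS-style) into a
    0-1 ILP constraint 'sum(coeffs) >= rhs'.  A positive literal x_i
    contributes +x_i; a negated literal contributes (1 - x_i), which
    moves a constant -1 to the right-hand side."""
    coeffs, rhs = {}, 1
    for lit in clause:
        v = abs(lit)
        if lit > 0:
            coeffs[v] = coeffs.get(v, 0) + 1
        else:
            coeffs[v] = coeffs.get(v, 0) - 1
            rhs -= 1
    return coeffs, rhs
```

For instance, the clause (x1 ∨ ¬x2) becomes x1 − x2 ≥ 0, which is equivalent to x1 + (1 − x2) ≥ 1.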
2.4 Technology Comparison
After reviewing and summarizing the different binary arithmetic structures and algorithms, we now present a brief comparison to find the most suitable technologies for our purpose, aiming at the minimization of the propagation time and area occupation. The technologies were evaluated with a metric used by [27], called the Tyagi metric. The goal of this metric is the use of technology-independent costs; the use of generic metrics generalizes the comparison and keeps it valid if an update to newer gate technologies becomes necessary.
2.4.1 Arithmetic Binary Structures
2.4.1.A Adder
Aiming for a low latency adder in the MCM, we will focus on fast adders, preferably with parallel computation. The FA, the CLA and the SA are considered for comparison; the FA is expected to be slower than both the CLA and the SA. Based on the Tyagi model, we obtain figures 2.18a, 2.18b and 2.18c, which model the area occupation and latency of each adder as a function of the bit-width.
From figure 2.18b, the SA has the lowest latency, as expected, while the CLA is slower than the SA. However, the carry out is also worth analysing: figure 2.18c shows that, for the carry out computation, the CLA is quite similar to the SA. The FA uses less area, but its time-delay is higher than the other two. Thus, for time considerations, we will further study both the CLA and the SA.
Figure 2.18: Adders' area and time-delay comparison using Tyagi's metric: (a) area comparison between FA, CLA and SA; (b) output time-delay comparison; (c) carry-out time-delay comparison.
2.4.1.B Increment and Decrement
As in the comparison of section 2.4.1.A, the same is done for the RID proposed by [20] and a dual architecture that joins an increment and a decrement in parallel, implemented with a Half Adder (HA) and a MFA, respectively, as used by [3]. Figures 2.19a and 2.19b show the area occupation and the output time delay.
Figure 2.19: Comparison between the HA+MFA architecture and the RID of [20]: (a) area comparison; (b) output time-delay comparison.
From figure 2.19b, the theoretical model suggests that the use of the RID only pays off for bit-widths larger than 9 bits; for bit-widths of 9 bits or less, we can use the HA+MFA architecture. There is no need to compute the carry out, because in this architecture the carry out is generated at the same time as the other bits.
From the previous results, either the SA or the CLA was chosen for the adder, while for the increment/decrement the dual architecture was chosen up to 8 bits and the RID for larger bit-widths.
2.4.2 Multiple Constant Multiplication
For benchmarking purposes, we considered the Quantization Parameters (QP) values defined in the H.264 video standard [25]. In particular, we will use the set of coefficients used in Forward Quantization (QF), composed of the terms 13107, 11916, 10082, 9362, 8192, 8066, 7490, 7282, 6554, 5825, 5243, 4660, 4559, 4194, 3647, 3355 and 2893. A set with the smaller coefficients of the Inverse Quantization (QI), 10, 11, 13, 14, 16, 18, 20, 23, 25 and 29, is also provided and will be used at the end of this section to compare the involved complexity. The multiplier circuit receives a 16-bit input vector and outputs a 32-bit vector. The QF coefficients have bit-widths between 12 (2893) and 14 (13107), which adds some complexity to the MCM problem.
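These bit-widths can be checked directly with a small sketch in Python:

```python
QF = [13107, 11916, 10082, 9362, 8192, 8066, 7490, 7282, 6554,
      5825, 5243, 4660, 4559, 4194, 3647, 3355, 2893]
QI = [10, 11, 13, 14, 16, 18, 20, 23, 25, 29]

# The QF coefficients span 12 (2893) to 14 (13107) bits, while the QI
# coefficients fit in 5 bits or less.
assert max(c.bit_length() for c in QF) == 14
assert min(c.bit_length() for c in QF) == 12
assert max(c.bit_length() for c in QI) == 5
```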
Based on the algorithms previously presented, we plotted the area-time graph of the MCM circuit obtained with each of them: Hcub, BHM, RAG-N and ASSUME-A. The MCM nodes were also implemented with different structures, by considering the following adders: CLA, SA and FA. The area and time measures were estimated using Tyagi's model [28] and computed using MATLAB and Simulink for a basic simulation of the circuit. Next, the obtained implementations will be compared.
Algorithm   # adders   adder-steps
BHM         28         8
RAG-N       25         8
Hcub        26         3
ASSUME-A    27         3
Table 2.3: Characteristics of the MCM structures derived with the considered algorithms for the QF coefficients set.
From an analysis of table 2.3, we note that the fastest implementations are provided by the Hcub and ASSUME-A algorithms, since they both have the smallest number of adder-steps. The implementation with the smallest hardware requirements is the one from RAG-N, with 25 adders in total. The BHM should be the worst performing algorithm.
Figure 2.20 presents the previously mentioned algorithms implemented with different adders. The metrics used are taken from Tyagi's generic model [28]. Due to the proximity of the Hcub and ASSUME-A results, we grouped them together. The same conclusions as in table 2.3 can be drawn from figure 2.20a: Hcub and ASSUME-A present both the lowest time delay and the smallest silicon area occupation. Regarding the adders, the RC is the slowest, as expected from section 2.1, but also the one with the least silicon area. The CLA and the SA are very similar in terms of area and time-delay, with the SA being slightly faster but occupying a larger silicon area. Figure 2.20b presents both Hcub and ASSUME-A at a smaller scale. From it, we can see that the two algorithms produced very similar results in terms of generic metrics. For further comparison, we will look at a detailed bit precision study, i.e., the study of the dynamic range at the input and output of every adder of the MCM.
Figure 2.20: Implementation characteristics of the MCM structures derived with the considered algorithms, when implemented with different adder structures (RC, SA and CLA): (a) comparison between Hcub/ASSUME-A, BHM, RAG-N and a generic multiplier; (b) zoomed-in comparison between ASSUME-A and Hcub.
Figures 2.21 and 2.22 present the bit precision studies for the QI and QF coefficient sets, respectively, for the Hcub and ASSUME-A algorithms. The brown boxes are the adders, the gray boxes are the shifts, the green box is the primary input and the red boxes are the outputs. The numbers on the lines are the dynamic ranges of the routing wires. We added the QI coefficients to this group to compare the complexity of the implementations as the bit-width of the target coefficients becomes larger. Note that, even though both algorithms have the same characteristics in table 2.4, they actually have different implementations, as can be seen in figures 2.21a and 2.21b. These differences are so small that, in the implementation of the QF coefficients, the obtained time-delay is the same; the area presents only a slight difference, as shown in figure 2.20b.
Algorithm   # adders   adder-steps
Hcub        8          2
ASSUME-A    8          2
Table 2.4: Characteristics of the MCM structures derived with the considered algorithms for the QI coefficients set.
From figures 2.21 and 2.22, it is observed that a maximum of 30 bits is needed at any given node of the MCM graph, as opposed to the 32 bits considered in [21]. This envisages the possibility of obtaining an implementation with a reduced area, by reducing the bit-width representation of the operands in each adder.
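This 30-bit bound can be cross-checked with a simple worst-case computation, assuming a 16-bit two's-complement input (a sanity check, not the detailed per-node study of the figures):

```python
QF = [13107, 11916, 10082, 9362, 8192, 8066, 7490, 7282, 6554,
      5825, 5243, 4660, 4559, 4194, 3647, 3355, 2893]

def product_width(c, n_in=16):
    """Two's-complement bits needed for c*x with an n_in-bit signed x:
    the worst-case magnitude |c| * 2**(n_in - 1) occurs at the most
    negative input, plus one sign bit."""
    return (abs(c) * (1 << (n_in - 1))).bit_length() + 1

# The widest product over the QF set fits in 30 bits, matching the
# observation that no node needs the full 32 bits.
assert max(product_width(c) for c in QF) == 30
```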
Summary
In this chapter, arithmetic binary structures and MCM structures were presented as a viable way to solve multiple constant multiplications with low area occupation and low time delay. Low latency arithmetic binary structures were identified, and the relevant state-of-the-art algorithms
Figure 2.21: Bit precision study of each operation in the obtained MCM structures for the QI coefficients set: (a) MCM obtained with Hcub; (b) MCM obtained with ASSUME-A.
for the derivation of MCM structures were presented and compared. Furthermore, some basic implementation examples were carried out with different adder structures, in order to understand their influence on the characteristics of the obtained MCM circuit. In the next chapter, we will propose highly optimized implementations of adder structures, particularly adapted to the MCM problem, with both strict area and time models.
Figure 2.22: Bit precision study of each operation in the obtained MCM structures for the QF coefficients set: (a) MCM obtained with Hcub; (b) MCM obtained with ASSUME-A.
3 Optimized Adder Structures for MCM
In this chapter, we will propose a modular adder structure that includes left shifts at the inputs and a final right shift at the output, as seen in figure 3.1. The main goal is to design a custom adder that is particularly adapted to the shift-and-add operations present in typical MCM problems. First, we will define a modular adder structure for a generalized shift-and-add multiplication operation. Then, we will define the differences imposed by the specifications of common MCM problems in particular.
Figure 3.1: Adder block with its inputs left shifted and the resulting output right shifted.
To make the optimization possible, we will assume a fully modular structure. First, we will analyse the different zones into which the operation can be divided. The main motivation behind this segmentation of the addition operation is to facilitate the definition of a valid and generic model, and also to understand what kind of hardware is used in each part of the block. This customization will allow us to improve the latency of the adder without making the area cost too high. We will describe the implementation of both the addition and the subtraction operations. A possible configuration for the addition and the subtraction will also be provided as an example. To ensure as much generality as possible, this example contains the right shift at its output, even though most MCM algorithms do not make use of it.
3.1 Addition
The first operation addressed is the addition. We separate the modular adder into four zones and then provide a small example. The division into four zones comes directly from the properties of the addition in the shift-and-add multiplication. Two of these zones contain gate logic, while the other two do not need any circuitry.
3.1.1 Mathematical formulation
To clarify the adopted segmentation, a mathematical model will be set up. The definition of
each term and operation will be given next.
Let u and v be two signed binary operands, with nu and nv bits, respectively. The non-negative integers l1 and l2 are their respective left shift parameters, while the non-negative integer r is the right shift that is applied at the output. Finally, s, with ns bits, is the signed binary output of the addition. We also define ul1 = u × 2^l1 and vl2 = v × 2^l2 and make sure that both operands have the same size, by extending the smaller operand to match the larger one. We can define 4 groups of bits from the mathematical formulation of this operation, which we will call zones from now on. In each zone, we will define which operation will be used.
Figure 3.2: Mathematical formulation for the addition operation (zones 1 to 4).
Note the extension of operand v from bit nm = min{nu + l1, nv + l2} to bit nM = max{nu + l1, nv + l2}. We will also define lm = min{l1, l2} and lM = max{l1, l2}. Thus, with this signal extension, we get u′ = ul1<0:nM> and v′ = vl2<0:nM>.
According to this definition, the decomposition of this operation is as follows:

s = s′ × 2^(−r) = [u′ + v′] × 2^(−r)
  = [ A1                                                  (Zone 1)
    + A2 × 2^lm                                           (Zone 2)
    + (A3 + B3) × 2^lM,  with γ = A3 + B3                 (Zone 3)
    + (A4 + B4 + γ<nm>) × 2^nm ] × 2^(−r)                 (Zone 4)    (3.1)
where γ<nm> is the carry out from the addition γ and:

A1 = 0<0:lm−1>
A2 = u′<lm:lM−1> if l1 < l2, or v′<lm:lM−1> if l1 > l2
A3 = u′<lM:nm−1>
B3 = v′<lM:nm−1>
A4 = u′<nm:nM−1>
B4 = v′<nm:nM−1>
If r > 0, it is still possible to restrict the optimization to the paths affected by the right shift. Therefore, we can define several sub-zones, as follows:
Zone 1 If r = 0, zone 1 generates logic zeros at the output s, from bit 0 to bit lm − 1. Otherwise, if 0 < r < lm, we can eliminate the zeros up to bit r − 1;
Zone 2 b) If r < lm, zone 2 b) copies the operand's bits (bit bypass) from lm to lM − 1 directly to the output s. Otherwise, if lm ≤ r < lM, we can eliminate the bypass up to bit r − 1 and all the zeros from zone 1;
Zone 3 If r < lM, zone 3 adds the two operands. Otherwise, if lM ≤ r < nm, zone 2 b) and zone 1 are eliminated and zone 3 is divided into two parts:
Zone 3 a) From bit lM to bit r − 1, we only compute the carry chain;
Zone 3 b) From bit r to bit nm − 1, we perform the addition as normal, taking into account the carry coming from bit r;
Figure 3.3: Zone 3 divided into two sub-zones, when lM < r < nm − 1.
Zone 4 If r < nm, zone 4 implements one of three operations: increment, decrement or bypass of the larger operand. The choice of the operation depends on the carry bit from the previous zone and on the sign extension of the operand; table 3.1 presents the criterion for each operation. Otherwise, if nm ≤ r < nM, we still need to calculate the carry chain from zone 3, but we can eliminate the bypass and the zeros from zone 2 b) and zone 1. Dividing this operation into two sub-zones, we get:
sign extension   cin   Type
0                0     Byp
0                1     Inc
1                0     Dec
1                1     Byp
Table 3.1: Zone 4 truth table.
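As a sketch, the selection of table 3.1 is just a two-bit lookup (the helper name is ours, purely illustrative):

```python
def zone4_type(sign_extension, cin):
    """Operation selected in zone 4 according to table 3.1: the sign
    extension bit of the wider operand and the carry coming from zone 3
    decide between bypass (Byp), increment (Inc) and decrement (Dec)."""
    return {(0, 0): "Byp", (0, 1): "Inc",
            (1, 0): "Dec", (1, 1): "Byp"}[(sign_extension, cin)]
```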
Zone 4 a) from bit nm to bit r − 1 we will only compute the carry chain, i.e. we need to
calculate all the carries from zone 3 and part of the carries from zone 4;
Zone 4 b) from bit r to bit nM − 1 we will use an increment, decrement or bypass;
Figure 3.4: Zone 4 divided into two sub-zones, when nm < r < nM − 1.
We highlight three particular cases. The first is when l1 = l2: zone 2 disappears, enlarging zone 1. Another notable case is when l1 = l2 = 0: zones 1 and 2 disappear, because there are no left shifts to be done. Finally, when nM = nm, zone 4 is not needed, because all the computation is done in zone 3.
It is also worth noting that, if r ≥ nM, we have s = 0 and no computation needs to be done. Moreover, when r = 0, the number of defined zones is reduced: there are no a) sub-zones, since the b) parts of zones 3 and 4 are the only defined zones. Table 3.2 summarizes the different cases, where the symbol ∅ denotes the absence of the operand in the corresponding zone. The bit-widths in each zone are divided into 3 columns: Length contains the overall length of each zone, while columns Operand u′ and Operand v′ contain the bit ranges of operands u′ and v′, subject to the existence of the operand in a given zone. As an example: if l1 > l2, then zone 2 b) will only have operand v′; therefore, for zone 2 b), operand u′ will have length 0 and operand v′ will span the range max{r, lm} : lM − 1, where lM = l1 and lm = l2.
Zone   Type                        Right shift        Length             Operand u′                  Operand v′
1      Zeros generation            r < lm − 1         lm − max{r, 0}     ∅                           ∅
2 b)   Bypass                      r < lm             lM − max{r, lm}    ∅ or max{r, lm} : lM − 1    max{r, lm} : lM − 1 or ∅
3 a)   Carry generation            lM < r < nm − 1    min{r, nm} − lM    lM : min{r, nm} − 1         lM : min{r, nm} − 1
3 b)   Adder                       r < lM             nm − max{r, lM}    max{r, lM} : nm − 1         max{r, lM} : nm − 1
4 a)   Carry generation            nm < r < nM − 1    min{r, nM} − nm    nm : min{r, nM} − 1         nm : min{r, nM} − 1
4 b)   Increment/Bypass/Decrement  r < nm             nM − max{r, nm}    max{r, nm} : nM − 1 or ∅    ∅ or max{r, nm} : nM − 1
Table 3.2: Summary of the addition operation with r > 0.
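The zone boundaries above can be sketched behaviourally as follows. This is an illustration of the segmentation with our own naming (keys and the function name are not from the thesis), not a gate-level model; all ranges are half-open [lo, hi).

```python
def zone_bounds(nu, l1, nv, l2, r=0):
    """Bit ranges of the zones of s = ((u << l1) + (v << l2)) >> r,
    following the segmentation of section 3.1 (zones 1, 2 b), 3 a/b),
    4 a/b)).  Returns only the zones that actually exist."""
    lm, lM = min(l1, l2), max(l1, l2)
    nm, nM = min(nu + l1, nv + l2), max(nu + l1, nv + l2)
    z = {}
    if r < lm:                       # Zone 1: zeros up to lm
        z["1:zeros"] = (max(r, 0), lm)
    if r < lM and lm < lM:           # Zone 2 b): bypass of one operand
        z["2b:bypass"] = (max(r, lm), lM)
    if lM < r < nm:                  # Zone 3 a): carry chain only
        z["3a:carry"] = (lM, r)
    if r < nm and lM < nm:           # Zone 3 b): the actual addition
        z["3b:add"] = (max(r, lM), nm)
    if nm < r < nM:                  # Zone 4 a): carry chain only
        z["4a:carry"] = (nm, r)
    if r < nM and nm < nM:           # Zone 4 b): inc/dec/bypass
        z["4b:inc_dec_byp"] = (max(r, nm), nM)
    return z
```

For nu = nv = 4, l1 = 3, l2 = 1 and r = 0, this gives zone 1 on bits [0, 1), the bypass on [1, 3), the adder on [3, 5) and zone 4 on [5, 7); raising r to 4 removes zones 1 and 2 b) and splits zone 3 into its a) and b) parts.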
3.1.2 Example configurations
To understand the mechanics of the modular adder, we will see how we can generate different
cases by varying the length and signal of the operands. In table 3.3 we assume r > 0. In this
example, either zone 3 will have the a) part (if lM < r < nm − 1) or zone 4 will have the a) part (if
nm < r < nM−1). Because this is an addition configuration, no computation is required in zone 2.
In the following, for the strict purpose of providing a practical example, it will be considered the
effect of varying the adder parametrization in zone 4. If the operand with less bits is negative (v′),
then we have either to decrement or bypass the other operand (u′) in zone 4, depending on the
carry coming from zone 3 (see table 3.3, cases 2 and 4). When the signal extension is done with
ones, we have to add the extension of v′ with u′. However, considering that a series of ones is
equivalent to the TC representation of (−1), then we only have to decrease the operand u′ by one
in zone 4. On the other hand, if the carry from zone 3 is one, then it will cancel this decrement
(11112 + 12 = 00002), making it a simple bypass.
The same reasoning applies for v′ > 0 (see table 3.3, cases 1 and 3). If the carry out from zone 3 is one and the sign extension is made of zeros, then the operation in zone 4 b) is simplified into an increment. If the carry out is zero, then the zone 4 b) operation is a bypass.
When the final right shift affects zone 4 (i.e., r > nm), the output will depend on the carry coming from zone 3. Since there is no output in zone 4 a) (except for the carry out), the operation only needs to implement the carry logic. If zone 4 a) was assigned an increment operation, only the increment carry logic needs to be implemented, without its output part. Similarly, for the decrement, only the decrement carry logic is needed. As an example, if a HA is used for the increment, only its AND gate is necessary to compute the carry, leaving out the XOR gate that computes the sum output.
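The carry-only simplification can be sketched as follows: in a half-adder-based incrementer, dropping the XOR outputs leaves only the AND chain, so the carry out of the zone is simply the AND of the carry in with all operand bits (an illustrative helper of ours, not the thesis' hardware description):

```python
def carry_only_increment(bits, cin):
    """Carry chain of an incrementer built from half adders, from LSB
    to MSB: each stage keeps only the HA carry (a AND b) and drops the
    XOR (sum) output, as in a zone 4 a) carry-generation block."""
    c = cin
    for b in bits:
        c = c & b
    return c
```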
Table 3.3: Eight different cases of the addition operation. Cases 1–4 have nu + l1 > nv + l2 (zone 4 b) operates on u′: u + γ or u − 1 + γ), while cases 5–8 have nu + l1 < nv + l2 (zone 4 b) operates on v′: v + γ or v − 1 + γ); in all cases, zone 3 b) computes u + v.
Note that not all possible cases are shown in table 3.3. Cases 1, 2, 3 and 4 have l1 > l2, while cases 5, 6, 7 and 8 have l1 < l2. As can be seen in cases 1, 3, 5 and 6, we have an increment in zone 4 b) if γ<nm> = 1, or a bypass if γ<nm> = 0. In cases 2, 4, 7 and 8, we have either a bypass (γ<nm> = 1) or a decrement (γ<nm> = 0). Moreover, zone 4 disappears in the trivial case where nu + l1 = nv + l2. Another consequence of l1 = l2 is that zone 2 disappears.
Summing up: in zone 1, zeros are generated; in zone 2 b), the operand with the smallest left shift is bypassed; in zone 3 b), both operands are added; and in zone 4 b), we have either an increment, a decrement or a bypass. Zones 3 a) and 4 a) only exist when lM < r < nm or nm < r < nM, respectively, corresponding to hardware structures where only the carry bit has to be generated (no output is produced).
Zone   Type                        Length    Operand u′                          Operand v′
1      Zeros generation            lm        ∅                                   ∅
2 b)   Bypass                      lM − lm   lm : lM − 1                         lm : lM − 1
3 b)   Adder                       nm − lM   lM : nm − 1                         lM : nm − 1
4 b)   Increment/Bypass/Decrement  nM − nm   cases 1–4: nv + l2 : nu + l1 − 1    cases 1–4: ∅
                                             cases 5–8: ∅                        cases 5–8: nu + l1 : nv + l2 − 1
Table 3.4: Summary of the addition operation with r = 0.
Table 3.4: Table resuming the addition operation with r = 0.
Table 3.4 summarizes the several operands involved in this operation when r = 0, where the symbol ∅ denotes the absence of the operand in the corresponding zone. Having covered the addition, the subtraction still remains to be analysed, which will be done in the following section.
3.2 Subtraction
The subtraction case is similar to the addition one. The main difference arises from the TC representation of one of the operands:
• we need to complement the operand that is going to be subtracted;
• we must add '1' to the least significant bit of the complemented operand.
Because of this, we must slightly change the adder structure described in section 3.1. The main difference in the implementation is observed inside zone 2, as described in the following.
3.2.1 Mathematical formulation
By following the same approach adopted for the addition operation, we will start by introducing the required adjustments in order to apply the TC to the operand to be subtracted. Accordingly, the operand used in this replacement will be left shifted and complemented. Figure 3.5 presents an example, assuming that v is positive. An entirely similar description could be derived if v were a negative number.
The subtracted value is given by:

t′ = NOT(v′) + 1
   = NOT(v × 2^l2) + 1
   = NOT(v) × 2^l2 + 1 × 2^l2    (3.2)
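Equation 3.2 can be sanity-checked numerically: two's-complementing the already-shifted operand equals shifting the complemented operand and adding 1 at bit position l2, because the l2 trailing ones produced by the complement absorb the +1. A small check in Python, with a hypothetical helper name:

```python
def shifted_negate(v, l2, width):
    """Check of equation 3.2: the TC of v' = v << l2, computed as a
    whole, equals complementing v, shifting it, and adding 1 << l2.
    All arithmetic is reduced modulo 2**width (TC wrap-around)."""
    mask = (1 << width) - 1
    direct = (~(v << l2) + 1) & mask            # TC of the shifted operand
    zonewise = ((~v << l2) + (1 << l2)) & mask  # complement, shift, +1·2^l2
    assert direct == zonewise
    return direct
```

For instance, with v = 5, l2 = 2 and an 8-bit word, both forms yield 236, the 8-bit TC encoding of −20.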
Figure 3.5: Mathematical formulation for the subtracter.
By using equations 3.1 and 3.2, and by replacing v′ with t′, we have:

s = s′ × 2^(−r) = [u′ + t′] × 2^(−r)
  = [ A1                                                         (Zone 1)
    + A2 × 2^lm,  with λ = A2                                    (Zone 2)
    + (A3 + B3 + λ<lM>) × 2^lM,  with γ = A3 + B3 + λ<lM>        (Zone 3)
    + (A4 + B4 + γ<nm>) × 2^nm ] × 2^(−r)                        (Zone 4)    (3.3)
where λ<lM> is the carry out bit from λ and γ<nm> is the carry out bit from the addition γ. A carry-out protection that includes the carry out in the signal makes the extra bit unnecessary. Defining the operands for each zone, where 1lm is the bit added for the TC conversion, we get:

A1 = 0<0:lm−1>
A2 = u′<lm:lM−1> if l1 < l2, or NOT(v′)<lm:lM−1> + 1lm if l1 > l2
A3 = u′<lM:nm−1>
B3 = NOT(v′)<lM:nm−1>
A4 = u′<nm:nM−1>
B4 = NOT(v′)<nm:nM−1>
In this operation, one extra sub-zone is defined inside zone 2, in addition to the sub-zones defined in section 3.1:
Zone 1 Fill with zeros, from bit 0 to bit lm − 1;
Zone 2 If lm ≤ r < lM − 1, the previous zone is eliminated and zone 2 is divided into two parts:
Zone 2 a) From bit lm to bit r − 1, we only compute the carry chain if l1 > l2, or implement nothing if l1 ≤ l2;
Zone 2 b) From bit r to bit lM − 1, we have either a bypass or an increment. If l1 > l2, the operand that passes through zone 2 is t′ and, because of the TC, we need to increment it by 1, as in equation 3.2. If l1 < l2, we have a u′ bypass.
(Figure content: zone 2, spanning bits <lm:lM-1>, split at bit r into sub-zone 2 a) <lm:r-1> and sub-zone 2 b) <r:lM-1>.)
Figure 3.6: Zone 2 divided into 2 sub-zones, when lm < r < lM − 1.
Zone 3 If lM ≤ r < nm − 1, zone 1 is eliminated, zone 2 comprises a carry chain computation
(when l2 ≤ l1) and zone 3 is divided into two parts:
Zone 3 a) From bit lM to bit r − 1 we compute the carry chain;
Zone 3 b) From bit r to bit nm − 1 we do the addition, as normal, taking into account
the carry coming from the r bit;
Zone 4 When nm ≤ r < nM − 1 we need to calculate the carry chain from zone 3. Zone 2
depends on the two cases seen before, but we can eliminate the zeros from zone 1. Zone 4
is then divided in two sub-zones:
Zone 4 a) from bit nm to bit r− 1 we compute the carry chain, i.e. we need to calculate
the carry from zone 2, zone 3 and zone 4 up to r − 1;
Zone 4 b) from bit r to bit nM−1 we use an increment, decrement or bypass, depending
on the case, taking into account the carry coming from zone 4 a).
Note that zone 4 only exists if nu + l1 ≠ nv + l2; therefore, zone 4 does not contemplate
nu + l1 = nv + l2. The same applies to zone 2 when l1 = l2, in which case zone 1 will occupy
zone 2. Finally, both zones 1 and 2 disappear if l1 = l2 = 0. Table 3.5 summarizes this operation;
the symbol ∅ denotes the absence of the operand in the corresponding zone.
Zone | Type | Right shift | Length | Operand u′ (bit dynamic range) | Operand v′ (bit dynamic range)
1    | Zeros generation | r < lm − 1 | lm − max(r, 0) | ∅ | ∅
2 a) | Carry generation | lm < r < lM − 1 | min(r, lM) − lm | ∅ or lm : min(r, lM) − 1 | lm : min(r, lM) − 1 or ∅
2 b) | Bypass/Increment | r < lm | lM − max(r, lm) | ∅ or max(r, lm) : lM − 1 | max(r, lm) : lM − 1 or ∅
3 a) | Carry generation | lM < r < nm − 1 | min(r, nm) − lM | lM : min(r, nm) − 1 | lM : min(r, nm) − 1
3 b) | Adder | r < lM | nm − max(r, lM) | max(r, lM) : nm − 1 | max(r, lM) : nm − 1
4 a) | Carry generation | nm < r < nM − 1 | min(r, nM) − nm | nm : min(r, nM) − 1 | nm : min(r, nM) − 1
4 b) | Increment/Bypass/Decrement | r < nm | nM − max(r, nm) | max(r, nm) : nM − 1 or ∅ | ∅ or max(r, nm) : nM − 1
Table 3.5: Summary of the subtraction with right shift operation.
3.2.2 Example configuration
In this configuration, r = 0, therefore there are no sub-zones a). Since it is a subtraction there
is an increment in zone 2 and consequently this zone produces a carry-out (λ). Moreover, zone
3 also has a carry-out (γ). It can be seen that in cases 1, 3, 5 and 6 we use either an increment
or a bypass to implement zone 4 b). In these cases, the selection of either increment or bypass
depends on the carry-out γ coming from zone 3 b). In cases 2, 4, 7 and 8 we use a decrement or
bypass, depending on the γ value coming from zone 3 to zone 4.
Table 3.6 presents eight notable cases out of 16. As can be seen, the sizes of l1 and l2 could
have been permuted in each of the cases presented; those permutations were left out, restricting
the analysis to the cases shown. The right column presents the cases where l2 > l1, while the
left column presents the remaining cases.
Summing up: in zone 1, we generate zeros; in zone 2 b), we increment or bypass the operand
with the smallest left shift; in zone 3 b), we add both operands, not forgetting the carry λ from zone
2 b); and in zone 4 b), we have either an increment, a decrement or a bypass, depending on the carry γ.
With all the operations defined, we can now proceed with the formal definition of each zone
described above either for the addition or the subtraction operations. The main goal of the next
section is to define a framework to manage the modularity of the envisaged component. From
here, it is always possible to change the modules of each section, while still maintaining the
structure of the described block.
3.3 Topologic description of the adder/subtracter module
In this section, we will formalize the description of each zone, the instantiation conditions and
the corresponding logic function. A set of boolean variables will help us to implement each block
activation condition. For the sake of notation simplicity, we will omit the max(r, x) and min(x, r)
expressions seen in tables 3.4 and 3.5, replacing them simply with r.
The adoption of these particular implementation conditions will be particularly useful to inte-
grate the several blocks that compose the whole adder/subtracter, by using the ”generate” primitive
made available by the VHSIC Hardware Description Language (VHDL). Once the block is synthe-
sized, it does not change anymore; the conditions presented hereafter are only used at the
generation of the proposed adder.
Throughout this section, a new notation will be used to simplify the reference to each block.
The notation is composed of a number representing the zone, a letter representing the sub-zone,
a letter representing the operand used and, finally, a contraction of the implemented function. As
an example, for zone 2 b) with operand v in an increment, we will have the notation block 2bv inc. At the end
(Table content: bit-level diagrams of the eight subtraction cases. Left column: nu + l1 > nv + l2 (cases 1 to 4); right column: nu + l1 < nv + l2 (cases 5 to 8). Rows: u > 0 and v > 0 (cases 1 and 5); u > 0 and v < 0 (cases 2 and 6); u < 0 and v > 0 (cases 3 and 7); u < 0 and v < 0 (cases 4 and 8). Each diagram shows the zone 2 b), 3 b) and 4 b) contributions and the carries λ and γ.)
Table 3.6: Eight different cases of the subtraction operation.
of this section there is a table that gathers every combination used.
OP = 1 if addition; 0 if subtraction
RZ = 1 if r ≠ 0; 0 in other cases
O  = 1 if l1 ≠ 0 ∧ l2 ≠ 0; 0 in other cases
W  = 1 if nm ≠ nM; 0 in other cases
P  = 1 if r ≥ nM; 0 in other cases
X  = 1 if l1 > l2; 0 if l1 < l2
Q  = 1 if r ≥ nm; 0 in other cases
Y  = 1 if l1 = l2; 0 if l1 ≠ l2
R  = 1 if r ≥ lM; 0 in other cases
Z  = 1 if nm > lM; 0 in other cases
S  = 1 if r ≥ lm; 0 in other cases
M  = 1 if nu + l1 > nv + l2; 0 if nu + l1 < nv + l2
Zone 1
This first zone, depicted in table 3.7, is where the zeros are generated, due to the shifts of both
operands. It outputs zeros and does not need any input. As an example, if we have l1 = 4 and
l2 = 6, then zone 1 will have a length of lm = l1 = 4, as can be seen in table 3.15.
Block 1 (zero generation): no inputs; output s over bits r : lm−1.
Conditions: OP = 1 ∨ 0, S = 0, O = 1.
Table 3.7: Definition and implementation conditions of zone 1.
Zone 2 a)
This zone, depicted in table 3.8, is used when we have an increment in zone 2 and its only
function is to generate the carry bit when we have a right shift up to bit r. We need only to compute
the carry chain and it will serve zone 2 b) or zone 3, depending on r. Note that this block only
outputs the carry out, because it is not intended to compute anything else.
Zone 2 b)
Depending on the type of operation to be implemented, this zone will take a different imple-
mentation: while an MCM addition will result in a bypass, an MCM subtraction
can result in either a bypass or an increment. The increment has two outputs and two inputs: a
Block 2a (carry generation): input v′ over bits lm : r−1, with cin = 1; output cout.
Conditions: OP = 0, P = 0 ∧ S = 1, Y = 0, X = 1, RZ = 1.
Table 3.8: Definition and implementation conditions of zone 2 a).
carry cout to zone 3, the incremented operand s, the carry bit (cin) coming from zone 2 a) and the
operand to be bypassed or incremented (see table 3.9).
Block 2bu byp (bypass): input u′ over bits r : lM−1; output s over bits r : lM−1. Conditions: OP = 0 ∨ 1, Y = 0, R = 0, X = 0.
Block 2bv byp (bypass): input v′ over bits r : lM−1; output s over bits r : lM−1. Conditions: OP = 1, Y = 0, R = 0, X = 1.
Block 2bv inc (increment): inputs v′ over bits r : lM−1 and cin; outputs s over bits r : lM−1 and cout. Conditions: OP = 0, Y = 0, R = 0, X = 1.
Table 3.9: Definition and implementation conditions of zone 2 b).
Zone 3 a)
This zone only exists when the final right shift at the output affects zone 3 and the operands
overlap. The corresponding hardware is used to compute the carry chain obtained from the
addition of the two input operands. Hence, since the output is ignored up to bit r, there is no need
to include the logic for the sum result; it is sufficient to consider the logic for the carry. It has one
output, the carry-out bit (cout), and three inputs: the two operands u′ and v′ and the carry-in bit (cin).
Zone 3 b)
In this zone we have the main operation of the block: the addition. Independently of the
considered MCM operation, we compute the addition of the two overlapping operands with the
aid of an adder. It has two outputs, the sum s and the carry-out bit cout, and three inputs: the two
operands u′ and v′ and the carry cin (see table 3.11).
Block 3a (carry generation): inputs u′ and v′ over bits lM : r−1, plus cin; output cout.
Conditions: OP = 0 ∨ 1, P = 0 ∧ R = 1, Z = 1, RZ = 1.
Table 3.10: Definition and implementation conditions of zone 3 a).
Block 3b (adder): inputs u′ and v′ over bits r : nm−1, plus cin; outputs s over bits r : nm−1 and cout.
Conditions: OP = 1 ∨ 0, Q = 0, Z = 1.
Table 3.11: Definition and implementation conditions of zone 3 b).
Zone 4 a)
This zone only exists when the final right shift at the output affects zone 4 and when the
number of bits of one operand exceeds the other. It is the extension of the addition that is initiated
in the previous zones but, since the output is ignored up to bit r, there is no need to include the
logic for the increment; it is sufficient to implement only the logic for the carry chain. In the case of
the increment/decrement, it has one output, the carry-out bit cout to zone 4 b), and two inputs: the
operand u′ or v′ and the select signal slt for the multiplexer (see table 3.12).
For the slt signal, we will use the truth tables 3.13a and 3.13b. Note that for the bypass we do not
need any implementation in zone 4 a).
Block 4au (inc/dec): inputs u′ over bits nm : r−1 and select slt; output cout. Conditions: OP = 1 ∨ 0, P = 0 ∧ Q = 1, RZ = 1, Z = 1, W = 1, M = 1.
Block 4av (inc/dec): inputs v′ over bits nm : r−1 and select slt; output cout. Conditions: OP = 1 ∨ 0, P = 0 ∧ Q = 1, RZ = 1, Z = 1, W = 1, M = 0.
Table 3.12: Definition and implementation conditions of zone 4 a).
(a) M = 1:
v′[nM] cin | Type | slt
0 0 | Byp | 00
0 1 | Inc | 01
1 0 | Dec | 10
1 1 | Byp | 00

(b) M = 0:
u′[nM] cin | Type | slt
0 0 | Byp | 00
0 1 | Inc | 01
1 0 | Dec | 10
1 1 | Byp | 00

Table 3.13: Zone 4 truth table for the select signal.
Zone 4 b)
This zone processes the most significant bits of the operands. Its function is to compute the
values of the most significant bits of the result, by considering the carry coming from zone 4 a)
or zone 3. This is a singular zone, because it has three instances inside it: the incrementer, the
decrementer and the bypass. Block 4 b) has one output, the computed value s, and two inputs:
the operand u′ or v′ and the select signal slt.
The select signal depends on the operand's most significant bit x′ (either from u′ or v′) and the
carry bit: slt[1 : 0] = x′[nM] | Cin, where | is the concatenation operation (see table 3.13). If
the boolean variable M = 1, then operand u′ goes through this block and the select signal is
formed as slt[1 : 0] = v′[nM] | Cin (see table 3.13a). For M = 0, operand v′ goes through the block and the
select signal is formed as slt[1 : 0] = u′[nM] | Cin (see table 3.13b).
Block 4bu (inc/dec/byp): inputs u′ over bits r : nM−1 and select slt; output s over bits r : nM−1. Conditions: OP = 1 ∨ 0, Z = 1, W = 1, P = 0, M = 1.
Block 4bv (inc/dec/byp): inputs v′ over bits r : nM−1 and select slt; output s over bits r : nM−1. Conditions: OP = 1 ∨ 0, Z = 1, W = 1, P = 0, M = 0.
Table 3.14: Definition and implementation conditions of zone 4 b).
Summary
With the presented framework, defined with 7 different modules, summarized in table 3.15, the
symbol ∅ is the absence of the operand in the corresponding zone. The required logic structures
are easily identified and instantiated for each zone. In figure 3.7 an example of a subtraction
configuration is given, with r = 0, lm = l1, lM = l2, nm = n1 and nM = n2. The next chapter will
Addition:
Blocks | Type | Length | Operand u′ | Operand v′
1 | Zeros gen | lm − r | ∅ | ∅
2a | ∅ | ∅ | ∅ | ∅
2bu byp | Bypass | lM − r | r : lM − 1 | ∅
2bv byp | Bypass | lM − r | ∅ | r : lM − 1
2bv inc | ∅ | ∅ | ∅ | ∅
3a | Carry gen | r − lM | lM : r − 1 | lM : r − 1
3b | Adder | nm − r | r : nm − 1 | r : nm − 1
4au | Carry gen | r − nm | nm : r − 1 | ∅
4av | Carry gen | r − nm | ∅ | nm : r − 1
4bu | Inc/Dec/Byp | nM − r | r : nM − 1 | ∅
4bv | Inc/Dec/Byp | nM − r | ∅ | r : nM − 1

Subtraction:
Blocks | Type | Length | Operand u′ | Operand v′
1 | Zeros gen | lm − r | ∅ | ∅
2a | Carry gen | r − lm | ∅ | lm : r − 1
2bu byp | Bypass | lM − r | r : lM − 1 | ∅
2bv byp | ∅ | ∅ | ∅ | ∅
2bv inc | Inc | lM − r | ∅ | r : lM − 1
3a | Carry gen | r − lM | lM : r − 1 | lM : r − 1
3b | Adder | nm − r | r : nm − 1 | r : nm − 1
4au | Carry gen | r − nm | nm : r − 1 | ∅
4av | Carry gen | r − nm | ∅ | nm : r − 1
4bu | Inc/Dec/Byp | nM − r | r : nM − 1 | ∅
4bv | Inc/Dec/Byp | nM − r | ∅ | r : nM − 1

Table 3.15: Summary of both operations.
address the discussion about the possible different implementations that can be considered for
the adder, decrementer and incrementer logic.
(Figure content: subtraction configuration chaining a zero generation block over bits 0:l1-1, an increment block over bits l1:l2-1, an adder over bits l2:n1-1 with inputs u′ and v′, and an inc/dec/byp block over bits n1:n2-1 selected by slt, with the carries cout/cin passed between them to form the output s.)
Figure 3.7: Complete adder configuration.
Before finishing this chapter, it is important to note that zone 1 should be eliminated when the
structure is used in conventional MCMs. The reason for this simplification is the positive and odd
characteristic of the terms generated in MCMs. As a consequence, some boolean variables from
the beginning of this section can be eliminated: S = 1 can be substituted by R = 0 ∧ RZ = 1 and
S = 0 is equivalent to RZ = 0.
4. Time model of the proposed adder/subtracter structure
In this chapter, some models are presented to calculate the time delay path of the
adder/subtracter structures proposed in the previous chapter. This is necessary for the estimation
of the critical path in the MCM solver. Three types of models will be described, with some more
emphasis on the proposed one.
At this point, it is important to emphasize that the models usually adopted by current
MCM decompositions are not always accurate for time constrained applications. However,
the currently available computational resources already allow the usage of better models. Hence,
we will present three different models: the simplest model, which corresponds to a worst-case
scenario; the Least Significant Bit (LSB) model, which corresponds to a best-case scenario; and
finally the proposed model, which is more realistic than the previous two.
To help with the characterization of those models, some variables are defined: n denotes the
number of adder-steps; t^K_f is the absolute time delay (time delay in the MCM structure) at adder
K; t^K is the relative time delay (time delay of the adder structure in isolation) at adder K; t^K_Z is the
relative time delay in adder K for zone Z; t^K_MSBout is the latency of the output's Most Significant
Bit (MSB) in adder K; and t^K_LSBin is the latency of the input's LSB in adder K. Note that
1 ≤ K ≤ n. To simplify the notation, we use t^n_f ⇔ t_f.
4.1 Simplest Model
The simplest model that can be defined is the one where the delay coming from a given node
is assumed to be the maximum time a signal takes to pass through all circuitry inside the node.
What this means is that the signal is only ready when all the previous output signals are computed.
Hence, the critical path goes ”horizontally” before going to the next adder. This model has the
advantage of being fast to compute, but it is very inaccurate for the proposed structure, since
it essentially assumes a worst-case scenario for the majority of adders (including the proposed adder).
The model’s time delay for n cascaded adders is:

tf = Σ_{K=1}^{n} t^K = Σ_{K=1}^{n} (t^K_MSBout − t^K_LSBin)

Figure 4.1 exemplifies the critical path of a structure with two adders, with t^1 = t^1_MSBout − t^1_LSBin
and t^2 = t^2_MSBout − t^2_LSBin. The total time delay with n = 2 is:
tf = t^1 + t^2 = (t^1_MSBout − t^1_LSBin) + (t^2_MSBout − t^2_LSBin).
(Figure content: two cascaded adders (zones Z2, Z4, Z3); the critical path crosses each adder horizontally, accumulating t1 and t2 into tf.)
Figure 4.1: Simplest model: critical path estimation.
4.2 LSB Model
An improved and more accurate model was proposed by [5]. For K adder-steps, it uses the
assumption that the critical path goes through the LSB bits from the first adder down to the (K − 1)th
adder and then from the LSB to the MSB on the Kth adder. Essentially, the critical path progresses
”vertically” through the first K − 1 adders and ”horizontally” through the last adder. This model requires
little computation and represents the most optimistic scenario in an adder chain. The time delay
for such a model is given by:

tf = Σ_{K=1}^{n} t^K = (t^n_MSBout − t^n_LSBin) + Σ_{K=1}^{n−1} (t^K_LSBout − t^K_LSBin)

Figure 4.2 exemplifies the critical path with three adders, where t^1 = t^1_LSBout − t^1_LSBin,
t^2 = t^2_LSBout − t^2_LSBin and t^3 = t^3_MSBout − t^3_LSBin. The total time delay is:
tf = t^1 + t^2 + t^3 = (t^1_LSBout − t^1_LSBin) + (t^2_LSBout − t^2_LSBin) + (t^3_MSBout − t^3_LSBin).
(Figure content: three cascaded adders; the critical path goes through the LSBs of the first two adders (t1, t2) and horizontally through the last one (t3), accumulating into tf.)
Figure 4.2: LSB model.
Due to the irregularity of the proposed adder structure, the LSB model does not accurately
characterize it. The same happens with the RC, CLA or SA chains. Therefore, we need a more
refined model that includes the irregular characteristics of the adder; it will be presented
in the next section.
4.3 Proposed Model
In this section, an adapted model that is capable of coping with the complexity of the proposed
adder structure is considered. The aim of this model is to achieve a more realistic description of
the adder. Ideally, the model would be refined down to the individual wire (or bit) level,
knowing exactly where each signal goes. In the proposed approach, instead of modelling
each bit, the bits are grouped together by zone and the critical path of each zone is taken. We
start with the model formulation and then present some examples.
4.3.1 Formalization
Based on the proposed adder structure, three zones with non-zero delay are identified: zones
4, 3 and 2, in the case of the subtraction. For the addition, only zones 4 and 3 have delay.
Figure 4.3 presents an example of how we defined the model and how the critical path (bold
arrow) is obtained. As before, we define a structure with n adder-steps, with three valid zones
Z ∈ {2, 3, 4} and 1 ≤ K ≤ n.
(Figure content: three cascaded adders with the critical path (bold arrow) crossing different zones in each adder, accumulating t1, t2 and t3 into tf.)
Figure 4.3: Definition of the critical path for the proposed model.
Due to the higher granularity of this model, we need to define some extra variables. ts^K_Z is the
accumulated time delay up to the sum output of adder K in zone Z. tc^K_Z defines the accumulated
time delay of the carry-out up to adder K in zone Z. S^K_Z represents the set of bits that define zone
Z in adder K. For example, S^1_3 = {4, 5, 6, 7, 8, 9, 10} = <4, 10> means that zone 3 in adder 1 is
defined from bit 4 to bit 10. The path taken between each zone or adder is denoted by appending
to the subscript either ”in→S”, ”cin→S”, ”in→cout” or ”cin→cout”. For example, if there is a path
inside adder K in zone 3 from the operands input to the carry-out, the notation t^K_{3 in→cout} is used.
Note that all the ”in→S”, ”cin→S”, ”in→cout” and ”cin→cout” values are known and are calculated
based on the adder size.
The proposed model also defines a function tp(K, Z) that returns the highest accumulated time
delay for a given zone Z at the beginning of adder K. To evaluate if there is any connection
between zone Za in adder K and zone Zb in adder K − 1, the boolean function Ωb(K, Za, Zb) is
defined as follows:

Ωb(K, Za, Zb) = 1 if S^K_{Za} ∩ S^{K−1}_{Zb} ≠ ∅; 0 otherwise
Hence, the propagation paths that can be followed are enumerated next:

• The time delay at the beginning of a given adder K is given by:

tp(K, Z) = max(Ωb(K, Z, 2) × ts^{K−1}_2, Ωb(K, Z, 3) × ts^{K−1}_3, Ωb(K, Z, 4) × ts^{K−1}_4)

where ts^0_4 = ts^0_3 = ts^0_2 = 0 and t^K_Z is defined for every Z ∈ {2, 3, 4} and 1 ≤ K ≤ n.
52
4.3 Proposed Model
• The time delay at the end of zone 4 of a given adder K is given by:

ts^K_4 = max[t^K_{4 in→S} + tp(K, 4), t^K_{4 cin→S} + tc^K_3]    (4.1)

with tc^K_3 = max[t^K_{3 in→cout} + tp(K, 3), t^K_{3 cin→cout} + tc^K_2]

• The time delay at the end of zone 3 of a given adder K is given by:

ts^K_3 = max[t^K_{3 in→S} + tp(K, 3), t^K_{3 cin→S} + tc^K_2]    (4.2)

with tc^K_2 = t^K_{2 in→cout} + tp(K, 2)

• The time delay at the end of zone 2, at a given adder K, is given by:

ts^K_2 = t^K_{2 in→S} + tp(K, 2)    (4.3)

The final time delay tf for the proposed model with n adder-steps is given by:

tf = max(ts^n_2, ts^n_3, ts^n_4)    (4.4)
From the above description, it is easy to understand that this is a more computationally intensive
model, particularly due to the calculation of the intersection S^K_{Za} ∩ S^{K−1}_{Zb}. Nevertheless, this approach is
closer to the wire model, while still maintaining the computational independence of the number
of bits in the adders.
4.3.2 Example
To better understand how the model works, an example is provided in figure 4.3. Suppose the
time delays t^K_Z at each individual zone are known. With the function tp(K, Z) and the bit-widths of the
zones Z ∈ {2, 3, 4} in adders K ∈ {1, 2, 3}, it is easy to analyse the connections between zones
from adder K − 1 to adder K. In the considered example, the connections at zones 2, 3 and 4 of
adder 2 are defined as:
tp(2, 2) = max(Ωb(2, 2, 2) × ts^1_2, Ωb(2, 2, 3) × ts^1_3, Ωb(2, 2, 4) × ts^1_4) = max(1 × ts^1_2, 1 × ts^1_3, 0 × ts^1_4)
tp(2, 3) = max(Ωb(2, 3, 2) × ts^1_2, Ωb(2, 3, 3) × ts^1_3, Ωb(2, 3, 4) × ts^1_4) = max(0 × ts^1_2, 1 × ts^1_3, 1 × ts^1_4)
tp(2, 4) = max(Ωb(2, 4, 2) × ts^1_2, Ωb(2, 4, 3) × ts^1_3, Ωb(2, 4, 4) × ts^1_4) = max(0 × ts^1_2, 0 × ts^1_3, 0 × ts^1_4)

Note that tp(1, 2) = tp(1, 3) = tp(1, 4) = 0. The first adder time delay is given by:

t^1_f = t^1 = ts^1_4 = t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}

Now applying the same to adder 3:
53
4. Time model of the proposed adder/subtracter structure
tp(3, 2) = max(Ωb(3, 2, 2) × ts^2_2, Ωb(3, 2, 3) × ts^2_3, Ωb(3, 2, 4) × ts^2_4) = max(0 × ts^2_2, 0 × ts^2_3, 0 × ts^2_4)
tp(3, 3) = max(Ωb(3, 3, 2) × ts^2_2, Ωb(3, 3, 3) × ts^2_3, Ωb(3, 3, 4) × ts^2_4) = max(1 × ts^2_2, 1 × ts^2_3, 0 × ts^2_4)
tp(3, 4) = max(Ωb(3, 4, 2) × ts^2_2, Ωb(3, 4, 3) × ts^2_3, Ωb(3, 4, 4) × ts^2_4) = max(0 × ts^2_2, 1 × ts^2_3, 1 × ts^2_4)
In this example, we assumed that t^1_{4 in→S} + tp(1, 4) < t^1_{4 cin→S} + tc^1_3 in equation 4.1, with the carries
defined as tc^1_3 = t^1_{3 cin→cout} + tc^1_2 and tc^1_2 = t^1_{2 in→cout} + tp(1, 2) = t^1_{2 in→cout}.
The second adder time delay is given by:

t^2_f = ts^2_3 = t^2_{3 in→S} + tp(2, 3)

assuming t^2_{3 in→S} + tp(2, 3) > t^2_{3 cin→S} + tc^2_2 in equation 4.2.
The third adder time delay is given by:

tf = t^3_f = ts^3_4 = t^3_{4 in→S} + tp(3, 4)

assuming t^3_{4 in→S} + tp(3, 4) > t^3_{4 cin→S} + tc^3_3 in equation 4.1.
Now substituting the function tp:

t^1_f = t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}
t^2_f = t^2_{3 in→S} + ts^1_4 = t^2_{3 in→S} + t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}
tf = t^3_{4 in→S} + ts^2_3 = t^3_{4 in→S} + t^2_{3 in→S} + t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}
Since all variables are known, it is possible to calculate the delay of the whole system. These
steps have to be calculated for every combination of the adders for a realistic description of the
structure delay.
Summary
In this chapter, a model for the time delay of the proposed adder structure was defined. The aim
of this model is to provide a more realistic description of the proposed structure in terms of signal
propagation through multiple adder steps. Two other simpler models that have been adopted in the
current literature were also reviewed, and the new model was presented based on the specifications
of the proposed adder. A mathematical formulation of the proposed model was also presented for
an easier integration with the optimization algorithm that will be presented in the next chapter.
5. Time delay minimization through gate level metrics
Even though improving the hardware of the MCM circuit reduces its latency, such a reduction
can be further improved if the MCM organization is defined by taking the internal structure of the
proposed adder into account. Hence, the optimization algorithm that is now proposed aims at finding
the minimum delay multiplication for a given set of coefficients. Such an algorithm is based on the
results of [14] and [3]. In particular, it can be seen as an optimization, in the time domain, of the
algorithm detailed in [3].
The motivation for the development of this algorithm is to provide circuit designers with alter-
native ways to build and improve their MCMs. Although the main function of this algorithm is to
find the smallest time delay that ensures the implementation of all the coefficients, it still achieves
an efficient area occupation.
Most of the algorithms described in section 2.3 use time and area as unitary weights, without
looking at the particular technology that is used or at the possibility of changing those weights. The
presented work aims to extend the results of [3], which introduced gate-level metrics for the area
occupation, by taking into account not only the adder's area but also the adder's time delay. This
algorithm works in three parts: first, it carries out an exhaustive search of all MSD implementations
of the coefficients and their partial terms; then it finds the critical path among the set of coefficients;
and finally, it uses the same 0-1 ILP solver used in [3] to find the best area minimization with the
critical path as a constraint.
The rest of this chapter is organized as follows. First, it presents a small overview of the
functionalities of the program. Then, the data structures are summarized. Finally, each core
function of the algorithm is described. The explanation of the algorithm is divided into four
parts: the main function; the decomposition of the terms; the critical path search; and finally, the
strategy for area minimization.
5.1 Functions
The first idea behind the proposed algorithm is to minimize the critical path of the MCM. To
achieve this objective, the algorithm was divided into three main functions, briefly described next:
the find partials function builds the implementation tree that will be used by the other functions;
the minimize path function finds the minimal time delay corresponding to the considered
coefficients; and finally, the minimize area function minimizes the area of the MCM while taking
the time constraint into account.
The vocabulary used herein approximates the one that is usually used in this research domain:
a term is a generic designation for a node value of the MCM. The first input corresponds to a
term at the beginning of the MCM; the coefficients are terms located at the end of the MCM; the
partial terms are specific terms used as inputs to other nodes. Finally, an implementation is an
operation and a path is a series of operations. A fundamental is defined as a positive and odd
term and will be used throughout this chapter.
A – find partials Given a set of constant coefficients, the algorithm uses the MSD represen-
tation to implement the coefficients and their partial terms. For such purpose, the path delay
and area occupation of an adder are calculated based on the gate-level model for each of the
different representations of a term. For the specific case of the custom adder that was proposed
in chapter 3, a new function was developed to compute the metric corresponding to the gate-
level time delay and area occupation. This function requires the following input parameters: the
two operands of the adder, the result of the operation, the shifts, and the type of operation (ad-
dition/subtraction). With these inputs, it calculates the values of area and time-delay from the
elementary block. The calculated values are then stored along with the implementation. The con-
struction of the tree is approached from the bottom to the top, i.e. from the coefficients to the first
input.
B – minimize path Once the tree with all possible implementations is built, the algorithm tra-
verses it again in a bottom-up approach. The search starts by looking for a path, with the minimum
time delay variable initialized to infinity. The first path is evaluated and its time delay stored. From
there, the search algorithm iterates through all the possible implementations, looking for a smaller
time delay. To guarantee that the search is exhaustive, it uses a recursive Depth First Search
(DFS) that prunes sections of the tree which have a higher delay than the current minimal path. For
each target coefficient, its implementation, time delay and area occupation are stored. Then, the
coefficients whose implementations have the highest time delay are chosen as the critical path. It
is important to note that at this point there is no area optimization, only time optimization. Here the
graph is complete and it is guaranteed to have the smallest time delay possible.
C – minimize area This part focuses on area minimization, and is based on both the work
developed by Flores et al. [14] and the extension of Aksoy [3] for custom time-weights. The
main difference lies in the construction of the set of implementations that is sent to the ILP solver.
The proposed algorithm tries to minimize the area while still taking the time delay into account.
With the critical path that was found in the previous function, the algorithm identifies all possible
implementations of the target coefficients that do not exceed the critical time. Once this set of
implementations is selected, the algorithm builds the boolean network and the constraints for
the 0-1 ILP problem. From there, the ILP solver outputs the solution. The resulting solution
is characterized by the minimum area for the given boolean network, subject to the considered
time constraints. It can be argued that the time constraints could also have been included in the
constraints that are fed to the ILP solver. However, it was not in the scope of this work to implement
a new model, but to improve the existing one.
5.2 Data Structure and classes
One main data structure and four classes were used for the implementation of the algorithm.
The data structure is static and is used to store the information needed by many functions of
the program. It stores the fundamentals in a binary tree, by using a map container. The used
containers are based on the C++ Standard Template Library (STL) [24] (revision C++98), and are
represented in italic. It also stores a vector container, with the target coefficients, and a vector
with the available fundamentals. The bit width of the first input is also stored in this structure. As
for the classes, one class is used for the generation of the MSD representation. It contains three
values for each partial sum: two operands and one fundamental. Another class is used to manage
the adder costs, containing for each zone of the adder the corresponding time-delays and silicon
area. The third class contains the bit width of the operands, the shift to be executed at each single
operation, the operands of the adder, the fundamental, and a variable that encodes the size of
each zone. This variable is later used for accessing the propagation time and area of the adder
by a map container. Finally, the fourth class stores the fundamental and the list container of the
corresponding MSD implementations.
Type name      Variable         Data
data t         coef             vector with the target coefficients
               unimp            vector with unimplemented coefficients
               imp              vector with implemented coefficients
               x bitwidth       variable with the main input bit width
               partials         pointer to the partials structure
map area t     first            term value
               second           implementation matrix with time and area cost
CAdder         tz2, tz3, tz4    time delay for zones 2, 3 and 4
               area             area occupation of the adder
CImp           coef             fundamental
               op u, op v       operands
               OP               flag for the type of operation (addition/subtraction)
               z                variable with the encoded zone bit widths
               n u, n v         operand bit widths
               l1, l2           operand left shifts
               r                output right shift
CFundamentals  coef             fundamental
               CImplemented     number of possible implementations
               list<CImp>       container list with all MSD implementations

Table 5.1: Data structure and classes
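Assuming C++98 and the member names from table 5.1 (with spaces written as underscores), the structures could be declared roughly as follows. This is an illustrative sketch, not the thesis code; the exact types are assumptions.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <vector>

// Hypothetical C++98 declarations mirroring Table 5.1.
class CImp {
public:
    int coef;          // the fundamental being implemented
    int op_u, op_v;    // operands of the adder
    bool OP;           // type of operation: addition (false) / subtraction (true)
    unsigned z;        // encoded zone bit widths (key into the adder-cost map)
    int n_u, n_v;      // operand bit widths
    int l1, l2;        // operand left shifts
    int r;             // output right shift
};

class CAdder {
public:
    int tz2, tz3, tz4; // time delay of zones 2, 3 and 4
    int area;          // area occupation of the adder
};

class CFundamentals {
public:
    int coef;                // the fundamental
    int CImplemented;        // number of possible implementations
    std::list<CImp> imps;    // all MSD implementations of the fundamental
};

struct data_t {
    std::vector<int> coef;                  // target coefficients
    std::vector<int> unimp;                 // coefficients still to be processed
    std::vector<int> imp;                   // coefficients already processed
    int x_bitwidth;                         // bit width of the main input
    std::map<int, CFundamentals*> partials; // fundamental -> its implementations
};
```

Since std::map keeps its keys ordered (it is typically implemented as a red-black tree), the partials container realizes the binary tree of fundamentals mentioned above.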
As described in table 5.1, the data structure data t refers to general data concerning the MCM,
with information about the input, the target coefficients and the implemented terms. It is used
throughout the code and is the core structure around which the program works, thus representing
a global structure. It stores the input bit-width that was passed as argument to the main function,
as well as the target coefficients after they were read from the input file. It also stores the terms
that were determined during the execution of the find partials function, using the unimp vector
as the list of terms still to be processed by that function. As soon as these
terms are processed, their implementations are stored in a map container, where the key is the
term and the value is a pointer to an object of the class CFundamentals.
The CFundamentals class is used for accessing a specific term. It is accessed through
the data t.partials map and stores all the information for each term. It keeps all the possible
implementations of each fundamental in a list container of CImp. For example, the term 83
(1010011₂, whose CSD representation is 1010101̄, i.e. 64 + 16 + 4 − 1) has 5 different
implementations: 80 + 3, 66 + 17, 65 + 18, 68 + 15 and 63 + 20. Hence, the implementation list
for the term 83 will have five entries, one for each unique implementation. Finally, the
CImplemented variable keeps track of the number of implementations.
The CImp class is used to store each unique implementation of a given fundamental. It can
only be accessed through the CFundamentals class. From this class, we can access the time and
area cost of each implementation. The costs are obtained from the CAdder class, located in a
map container. The key is the 32-bit unsigned integer z, which encodes the bit widths of the
several zones as z = z2 + z3 × 10³ + z4 × 10⁶.
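A possible encoding and decoding of this key can be sketched as follows; the function names are illustrative, only the formula comes from the text.

```cpp
#include <cassert>

// Sketch of the zone-width encoding used as the adder-cost map key:
// z = z2 + z3*10^3 + z4*10^6, so each zone width occupies its own group
// of decimal digits.
unsigned encode_z(unsigned z2, unsigned z3, unsigned z4) {
    return z2 + z3 * 1000u + z4 * 1000000u;
}

unsigned zone2(unsigned z) { return z % 1000u; }            // units group
unsigned zone3(unsigned z) { return (z / 1000u) % 1000u; }  // thousands group
unsigned zone4(unsigned z) { return z / 1000000u; }         // millions group
```

For instance, a configuration with 2 bits in zone 4, 2 bits in zone 3 and 4 bits in zone 2 maps to z = 2002004.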
The class CAdder is used to store the time and area values of a specific adder. It holds four
32-bit integer variables. The motivation behind this is to have a pre-calculated set of adder
parameters, thus speeding up the process of calculating the area and time of each individual
adder.
The map area t structure is mainly used in the minimize area function as a memory that accom-
modates the terms that can be used in the area optimization. This means that all the terms whose
implementations do not exceed the critical time are kept in this structure. For every term, there
is an entry pointing to a matrix with as many columns as the number of implementations for that
term. The first row is used to check the state of each implementation: 0 for implementations that
do not meet the time constraint or have not yet been analyzed; 1 for future possible implemen-
tations; and 2 for implementations that will be passed to the ILP solver. If an index of the first
row holds the value 1 or 2, the corresponding entry of the second row stores the time needed
until the end of the critical path; otherwise, it holds the value 0. This structure is of critical
importance for the ILP solver, because it is from this data that the Boolean network is built.
Taking as example the term 83, the map key would be 83 and the map value would be a matrix with
two rows, one for the flag value and the other for the time needed, and five columns, one for each
implementation of 83 under the MSD representation. At initialization time, the first row is filled
with zeros and the second row with infinite values. As the program runs, the values in both rows
are updated according to the evolution of the MCM optimization algorithm.
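One entry of this structure can be modelled as follows. This is an illustrative sketch; the make_entry helper and the use of double for the times are assumptions, only the two-row layout and the initialization come from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <vector>

// Illustrative model of one map_area_t entry: row 0 holds the state flag of
// each implementation (0 = rejected or not yet analyzed, 1 = future possible,
// 2 = passed to the ILP solver) and row 1 the time needed until the end of
// the critical path. Initialization: zero flags and infinite times.
typedef std::vector<std::vector<double> > ImpMatrix;

ImpMatrix make_entry(std::size_t n_imps) {
    ImpMatrix m(2, std::vector<double>(n_imps));
    for (std::size_t i = 0; i < n_imps; ++i) {
        m[0][i] = 0.0;                                      // flag: not analyzed
        m[1][i] = std::numeric_limits<double>::infinity();  // time: infinite
    }
    return m;
}
```

For the term 83 of the example above, the entry would be created with five columns, one per MSD implementation.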
In the next sections, the main functions of the algorithm will be presented in detail: the main
function, the find partials function, the minimize path function, and the minimize area function.
5.3 Proposed optimization algorithm
The main function is responsible for the invocation of the auxiliary functions and the core func-
tions of the algorithm. The auxiliary functions take care of the ordering of the terms, the reading,
parsing and writing of files, and the calling of the ILP solver. The core functions do all the com-
putational work. The main function can be invoked with up to 4 arguments. The first is accessed
through the "-f" flag and takes a file with the coefficients separated by newlines. A coefficient
may appear twice in the file, in which case the algorithm filters out the duplicate. The second
argument is the bit-width of the input, passed through the "-b" flag, with a default value of 16 bits.
The third argument is the type of adder structure to be used, passed through the "-a" flag. Three
types of adders are supported in the conducted implementation: the structural adder structure
proposed in chapter 3 (the default), the custom adder developed by Aksoy in [3], and a simple
full adder. Finally, the last argument is the optimization switch: if "-opt" is present, the area opti-
mization is on; if "-opt" is absent, the optimization is off and only the time optimization is
done.
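The argument handling described above could look roughly like this. Only the flags and defaults come from the text; the Options struct, the function name and the parsing details are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical command-line handling for the flags "-f", "-b", "-a", "-opt".
struct Options {
    std::string coef_file;   // "-f": file with one coefficient per line
    int bitwidth;            // "-b": input bit width (default 16)
    std::string adder;       // "-a": adder structure (default: structural)
    bool optimize_area;      // "-opt": enable the area optimization
};

Options parse_args(int argc, char** argv) {
    Options o;
    o.bitwidth = 16;            // default input bit width
    o.adder = "structural";     // default adder structure from chapter 3
    o.optimize_area = false;    // area optimization off unless "-opt" given
    for (int i = 1; i < argc; ++i) {
        if (!std::strcmp(argv[i], "-f") && i + 1 < argc)
            o.coef_file = argv[++i];
        else if (!std::strcmp(argv[i], "-b") && i + 1 < argc)
            o.bitwidth = std::atoi(argv[++i]);
        else if (!std::strcmp(argv[i], "-a") && i + 1 < argc)
            o.adder = argv[++i];
        else if (!std::strcmp(argv[i], "-opt"))
            o.optimize_area = true;
    }
    return o;
}
```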
A flowchart for the main function is detailed in figure 5.1. The first sub-function to be invoked
reads the target coefficients from the input file that was passed as parameter of the main function
and adds them to the target coefficient list: data.coef from the data t structure. Not only does it
remove any duplicate coefficients found in the file, but it also initializes the solution file with the
target coefficients. It then loads the pre-computed MSD tables in memory. The coefficients are
loaded into the data.unimp vector from the data t structure and, for every term in data.unimp,
the first core function find partials is called. Once data.unimp is empty, the implementations
are ordered by ascending time delay. A list of all implemented terms is available in the vector
data.imp. For every coefficient in data.coef, the minimize path is run. The path containing the
implementation and the propagation time needed are returned from the function and are saved in
two vectors. From this point on, the algorithm follows a modification of the work developed in [14] and [3]. The main
function iterates again through all the coefficients while it feeds them to the minimize area core
function. The map area t is built here. Finally, it builds the constraint file in a sub-function and
writes the boolean network file in another sub-function to feed the ILP solver. Once the ILP solver
is finished, the main function parses the solution, presents some statistics, writes the results to
a file and finishes the execution.
5.3.1 Partial term finding
The goal of the find partials function is to build the implementation tree with all possible
implementations. This function is called from the main function and receives, as arguments, the term to
be implemented, the data t structure, the pre-computed tables with MSD representations, and the
type of adder structure to use. Figure 5.2 depicts the flowchart of this function. For a given term, it
Figure 5.1: Main algorithm flowchart (read the coefficient file; load the MSD tables; copy the
coefficients to data.unimp; run find partials for each term in data.unimp; order the
implementations by time delay; run minimize path for each coefficient in data.coef, saving path
and time; run minimize area for each coefficient; write the constraint and Boolean network files;
run the ILP solver; read the solution from its output file; write the final solution to the output
file)
Figure 5.2: Find partials algorithm flowchart (check the bit-width of the coefficient and filter
the MSD table; generate the implementations and calculate their area and time costs; erase the
coefficient from data.unimp; add the coefficient to data.imp; add its operands to data.unimp)
MSD representation   MSD path                 Implementation
101̄01̄01              (64 − 16) + (−4 + 1)     48 − 3
                     (64 − 4) + (−16 + 1)     60 − 15
                     (64 + 1) + (−16 − 4)     65 − 20
101̄001̄1̄              (64 − 16) + (−2 − 1)     48 − 3
                     (64 − 2) + (−16 − 1)     62 − 17
                     (64 − 1) + (−16 − 2)     63 − 18
01101̄01              (32 + 16) + (−4 + 1)     48 − 3
                     (32 − 4) + (16 + 1)      28 + 17
                     (32 + 1) + (16 − 4)      33 + 12
011001̄1̄              (32 + 16) + (−2 − 1)     48 − 3
                     (32 − 2) + (16 − 1)      30 + 15
                     (32 − 1) + (16 − 2)      31 + 14
0101101              (32 + 8) + (4 + 1)       40 + 5
                     (32 + 4) + (8 + 1)       36 + 9
                     (32 + 1) + (8 + 4)       33 + 12

Table 5.2: MSD representations and respective paths and implementations of the term 45.
will determine and list all positive and odd operands used in each of the different implementations. Each
time a new operand is found, it is added to the data.unimp vector from the data t structure. To be
considered new, an operand must not be present in either data.unimp or data.imp vectors. For
instance, the term 35 (100011₂) has five different implementations based on two MSD represen-
tations: the binary 100011₂ and the CSD 100101̄₂. The implementations derived from these two
representations are obtained by combining the non-zero digits in groups of two. From the binary
representation, three implementations are derived: 32 + 3, 34 + 1 and 33 + 2, while the CSD
representation yields another three: 36 − 1, 31 + 4 and 3 + 32. In this case, the operands 3, 34,
33, 36, and 31 will be introduced in data.unimp. Notice that although there is one repeated
operand (the term 3 in the first and the last of 35's implementations), it is introduced only once.
The find partials function considers two operations to be the same if the operands and their
respective shifts (by powers of two) merely appear in a different order, therefore avoiding
duplicated implementations. For instance, the term 35 has the implementation 32 + 3 and the
reverse operation 3 + 32, but only one is processed. The other will not be added to the structure.
To better understand how the MSD representation is used here, consider for example the term
45 (0101101₂). It has 5 MSD representations (101̄01̄01, 101̄001̄1̄, 01101̄01, 011001̄1̄ and 0101101)
that spawn 15 implementations, from which the algorithm identifies 11 unique implementations.
Table 5.2 details these MSD representations and their derived implementations. The proposed
algorithm only uses the last column for the implementations. From there, it transforms the
numbers into fundamentals (positive and odd) and stores them in the CFundamentals
implementation list. The motivation behind this choice is to further minimize the number of
adders. Under the MSD path column, every path needs 3 adders, although some implementations can
be further optimized to use only two adders. For instance, the paths 40 + 5 and 36 + 9 only need an additional adder that
implements 5 and 9, respectively. The terms 40 and 36 can be obtained with left shifts of
the terms 5 and 9, respectively: 40 = 5 ≪ 3 and 36 = 9 ≪ 2. This type of MCM definition comes from
the work done by Dempster [12] and Flores [14].
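The reduction of an operand to its fundamental can be sketched as follows; the function name is ours, the thesis only states that fundamentals are positive and odd.

```cpp
#include <cassert>

// Reduce a term to its fundamental: drop the sign and divide out factors of
// two, since even terms are obtained from odd ones by left shifts
// (e.g. 40 = 5 << 3 and 36 = 9 << 2).
int fundamental(int n) {
    if (n < 0) n = -n;                    // fundamentals are positive
    while (n != 0 && n % 2 == 0) n /= 2;  // ...and odd
    return n;
}
```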
5.3.2 Minimizing the time
Figure 5.3: Minimize path algorithm flowchart (load data from data.partials[coef]; if there are
implementations with ones, return the fastest of them with its time; otherwise iterate over the
implementations, recursively calling minimize path on the operands, discarding candidates whose
cur_time exceeds critical_time and keeping best_path/best_time; if no implementation is found,
return empty_path and INF)
The goal of the minimize path recursive function is to minimize the latency of the processing
structure for each of the targeted coefficients. The minimize path function receives as arguments
the current term to be minimized, the current chosen path, and a reference to the current time
delay. If successful, i.e. if no explored path has a smaller time delay, it returns the chosen
path and the resulting time delay by reference. If unsuccessful, i.e. if there is at least one
explored path with a smaller or equal time delay, it returns an empty path and an infinite time.
The motivation behind the recursive formulation of the function is to keep track of the path being
used, as it can be easily modified with this architecture.
A flowchart for this function is presented in Figure 5.3. When the function is called to minimize
the propagation time of a term, it iterates through all the implementations of the term. At each
iteration, the algorithm compares the current time needed to implement the term with the
critical time at that point. The first call from the main function invokes minimize path with
an infinite critical time. Hence, the first iteration of each targeted coefficient is always
successful. From the second iteration on, the outcome depends on the new critical time: if
smaller, the new path is stored; if greater or equal, the path is discarded. To assess the time
needed for the operands of each implementation, the function calls itself recursively for each
operand, and the process repeats itself for all subsequent operands until an implementation with
ones is found. This is the base case of the recursion, and the returned value is a critical time
of zero and a path-vector with a single entry: one.
As can be inferred from its description, the minimize path function is computationally heavy.
To reduce its computational load, some optimizations are introduced. One of them is the
short-circuiting of the base case, i.e. of the implementations with ones: the function starts the
processing of each term by looking for implementations that have the first input as operand and
saves them. If more than one such implementation is found, the fastest one is chosen.
By following this procedure, the minimal time delay is guaranteed to be found, as the algorithm
performs an exhaustive search over the whole search space. The critical time delay of the MCM,
which will be used in the minimize area function, is taken as the highest delay among the target
coefficients' minimum time delays.
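A stripped-down model of this recursion is sketched below: each term has a list of implementations given as pairs of odd fundamental operands, the delay of a term is the adder delay plus the slower of its operands, and the base case is the input itself (term 1). The names, the unit adder delay and the table layout are assumptions of this sketch; the pruning against the critical time and the path bookkeeping of the real minimize path are omitted.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Each term maps to its implementations, stored as pairs of fundamental
// operands (shifts are implicit, e.g. 45 = (5 << 3) + 5 is stored as (5, 5)).
typedef std::map<int, std::vector<std::pair<int, int> > > ImpTable;

int min_time(const ImpTable& imps, int term, int adder_delay) {
    if (term == 1) return 0;               // base case: the input x itself
    ImpTable::const_iterator it = imps.find(term);
    if (it == imps.end()) return INT_MAX;  // term cannot be built
    int best = INT_MAX;
    const std::vector<std::pair<int, int> >& list = it->second;
    for (std::size_t i = 0; i < list.size(); ++i) {
        int tu = min_time(imps, list[i].first, adder_delay);
        int tv = min_time(imps, list[i].second, adder_delay);
        if (tu == INT_MAX || tv == INT_MAX) continue;
        int t = adder_delay + (tu > tv ? tu : tv);
        if (t < best) best = t;            // keep the fastest path
    }
    return best;
}
```

With the implementations 45 = (5 ≪ 3) + 5 and 45 = (9 ≪ 2) + 9, where 5 and 9 are built directly from ones, the minimum depth of 45 is two adder steps.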
5.3.3 Minimize area function
Whereas the previous function sets the minimum time delay for each individual coefficient,
this function relaxes the minimum time delay of every coefficient whose time delay is smaller
than the MCM critical time, in order to include the maximum number of implementations with the
potential to minimize the area in the Boolean network. With more alternative implementations of a
term, there are more ways to jointly optimize the paths of the target coefficients so as
to minimize the resulting area occupation. Therefore, the time constraint of this function is the
highest time delay among all the time delays of the target coefficients, i.e. the critical time
delay of the MCM. The critical path depends on the maximum adder-step and on the type of adder
structure to be used. It is possible to build different MCM graphs by using different adder types.
This function takes as arguments the data t structure, the current term to be implemented, a
vector with the path, the current critical time, and the map containing the area implementations.
Figure 5.4: Minimize area algorithm flowchart (load data from data.partials[coef]; if there are
implementations with ones, lock them in area_imps and return path and time; otherwise iterate
over the implementations, recursively calling minimize area, discarding those whose cur_time
exceeds critical_time and locking the usable ones in area_imps; if none is found, return
empty_path and INF)
Figure 5.5: Implementation of the term 89 (adder nodes: 2 + 1, 1 + 8, 9 + 48, 32 + 57)
This part of the code performs all the analysis for the construction of the Boolean network and
is based on the work of [14] and [3]. A flowchart explaining the recursive function is presented in
figure 5.4. In a manner similar to minimize path, while iterating through the implementations
of a term, the recursive function checks whether the time needed to implement it exceeds
the critical time. If it exceeds, the implementation is discarded. Otherwise, the function locks it
against future changes in the map area t structure, saves the remaining time, and continues with
another implementation. If another implementation tries to use an already locked term, the time
needed by the locked term is checked. If it can meet the remaining time, the program continues
and the same check is applied to the operands. On the other hand, if the remaining time cannot
be met, the program moves on to the next implementation. Another situation is observed when a
prior path has locked several implementations. In this case, the implementations are checked one
at a time for usability, i.e. for the possibility of unlocking an implementation by checking the
saved remaining time. This function terminates when, for each target coefficient, there is at
least one implementation with a time delay smaller than or equal to the critical time delay. The
function is always guaranteed to finish, because it can find at least one implementation (if the
minimum time delay of the coefficient is equal to the time constraint) or more (if the minimum
time delay of the coefficient is smaller than the time constraint) for each coefficient.
To clarify this, consider the following example. We want to implement the coefficients 89 and 57
with time delays of 72 (arbitrary time units) and 49, respectively. From the previous discussion, it
is clear that the critical time is given by the path taken by the coefficient 89, with a value
of 72 time units, and that the coefficient 57 has a time slack of 13 time units. Figure 5.5 depicts
the final implementation. Notice that this implementation uses the term 57, which is also a target
coefficient. However, the coefficient 57 has several implementations with different time delays,
ranging from 49 to 55 time units. The algorithm always starts with the smallest coefficient,
therefore 57 is processed first, taking into account all implementations that do not exceed 72
time units. The implementations of 57 will be added to the map area t and locked, since their
maximum time delay of 55 time units does not exceed the critical time. After the algorithm is
finished with the coefficient 57, 89 is processed. At this point, all implementations of the term
57 that exceed the remaining time are removed. In this example, only the implementation
57 = 48 + 9 is able to meet the critical time; its time delay is the smallest, with a value of 49
time units. Hence, the area t structure will have one implementation of 57 and one of 89.
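The filtering applied to the implementations of 57 in this example can be sketched as follows. The delay values used below are illustrative, and the real function also updates the lock flags in the map area t structure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep only the implementations whose delay does not exceed the remaining
// (critical) time; the rest are removed from the candidate set.
std::vector<int> meet_time(const std::vector<int>& delays, int remaining) {
    std::vector<int> kept;
    for (std::size_t i = 0; i < delays.size(); ++i)
        if (delays[i] <= remaining)
            kept.push_back(delays[i]);
    return kept;
}
```

For the term 57 above, filtering delays between 49 and 55 time units against a remaining time of 49 leaves only the 48 + 9 implementation.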
Summary
In this chapter, a new algorithm based on the minimization of the MCM propagation time is
proposed. Although its main goal is the minimization of the MCM critical path, area optimization
was also addressed. The work detailed in [14] and [3] was used as the basis for the area optimization,
by adapting it to the time constraints.
6 Results
To evaluate the proposed approach to improve the performance of the MCM units, by using the
developed structured adder block, optimized in terms of the resulting propagation time, together
with the proposed adaptations of the optimization algorithms, a comprehensive experimental
procedure was conducted. All the structures implemented in this chapter were described with
structural VHDL, synthesized with Synopsys Design Vision® and placed-and-routed with Ca-
dence Encounter®, by using the CMOS process from the UMC 90 nm logic gate technology library
[1], called standard cell throughout this chapter, and by considering the typical operating condi-
tions (Vdd = 1.2 V, T = 25 °C). The optimization algorithms that build the MCM structure were
implemented in C++98.
This chapter is divided into three parts: i) evaluation of the optimized adder and increment im-
plementations; ii) evaluation of the proposed structure to implement each node of the MCM; and
iii) evaluation of the MCM structures based on the developed structural adder and the adapted
optimization algorithm. For this purpose, a comprehensive set of benchmarks was adopted, which
allowed a fair comparison with other state of the art approaches.
6.1 Optimized Adder Structure Evaluation
This section evaluates, in standard cells, the set of adder/subtracter and increment/decrement
structures that were considered for the implementation of the several zones of the proposed
structured adder block. In particular, the adopted model for the area and propagation time of each
structure, based on Tyagi's model from section 2.4.1, will be validated with a real library of
standard cells. This section is organized in the same way as section 2.4.1: first we study the
adders and then the increment/decrement architectures.
Each of the analysed blocks was synthesized using the incremental flag. This option syn-
thesizes the circuit from the low level blocks to the high level blocks. The goal of not optimizing
everything at the top level is to keep the original structure of each module. If the tool had the
freedom to optimize everything at the top level, then the area cost and the propagation time
would be altered, so that the block meets the constraints at any possible cost. In the case of a
propagation time constraint, the tool would use logic gates that take more area but have a smaller
latency. The inverse constraint would behave analogously.
6.1.1 Adder Structures
The results corresponding to each adder implementation in standard cell are presented in
figures 6.1a, 6.1b and 6.1c. The FA is kept for comparison.
As can be seen in figure 6.1b, the SA is faster than both the FA and the CLA when it
comes to the output of the sum. The parallel structure propagates the signal faster than the other
two architectures. However, if we look at the carry out delay in figure 6.1c, it is shown that the
Figure 6.1: Area and propagation time of several adder structures when implemented with a
Standard Cell library: (a) area comparison between FA, CLA and SA; (b) output propagation time
comparison between FA, CLA and SA; (c) carry out propagation time comparison between FA,
CLA and SA.
CLA and the SA structures have the same time delay up to 8 bits. Nevertheless, for bit widths
greater than 8 bits, the CLA is faster than the SA. As a result, for the modular adder structure
presented in section 3.3, we get better results with the CLA than with the SA when zone 4
is present, which happens in the majority of the cases. Hence, the CLA is the wiser choice. With
the results of the synthesis, we have modelled the CLA architecture in figures 6.2a and 6.2b. It is
the only architecture used in the implementation of the adder.
Figure 6.2: Comparison of the area and time delay of the CLA adder structure, model versus
Standard Cell implementation: (a) CLA area; (b) CLA propagation time to the sum output.
6.1.2 Increment and Decrement Structures
For the increment and decrement structures, it was observed that the serial architectures can
be faster than the parallel architectures for lower bit-widths, as predicted by the Tyagi model
used in section 2.4.1.B. To evaluate this model, we synthesized the circuits and obtained the
results shown in figures 6.3a and 6.3b.
Figure 6.3: Comparison between HA+MFA and RID [20]: (a) area comparison; (b) output
propagation time comparison.
We can see that, for the same area, the RID structures provide improvements in propagation time
for all bit-widths. Hence, the RID has proved to be a good choice for the increment/decrement
implementation. With the results of the synthesis, we have defined a model of the RID and the
HA+MFA architectures, as presented in figures 6.4a and 6.4b.
Figure 6.4: Area and propagation time modelling and Standard Cell implementation of the
HA+MFA and RID: (a) implementation area; (b) propagation time.
According to the obtained results, we will use the RID structure proposed by [20] for zones 2
and 4, and the CLA to implement the additions in zone 3.
6.2 Structured Adder Block Evaluation
In this section, the area and time models built in chapter 4 will be analysed. The results of the
adder are presented both for the Tyagi model and for the standard cell implementation, allowing a
direct comparison of these two models.
6.2.1 Tyagi time and area evaluation
Based on the model that was built in chapter 4, we obtained the values in table 6.1 by varying
the zone bit widths of the proposed adder. Whenever Z2 > 0, we take into account the logic used
in the case of an increment in zone 2, as seen in section 3.2.1.
Adder size  Z4  Z3  Z2   Time  Area    #
                        (arbitrary units)
  2          0   0   2      3     6    1
  2          0   2   0      5    18    2
  4          0   0   4      5    12    3
  4          0   2   2      7    44    4
  4          0   4   0      9    41    5
  4          2   2   0      8    31    6
  8          0   0   8      9    24    7
  8          0   2   4      5    30    8
  8          0   4   2      9    66    9
  8          0   4   4      9    93   10
  8          0   8   0     13    85   11
  8          2   2   2      8    13   12
  8          2   2   4     10    43   13
  8          2   4   2     10    59   14
  8          4   2   2     10    43   15
  8          4   4   0     12    65   16
 16          0   0  16     10   224   17
 16          0   2   8      9    42   18
 16          0   4   8      9    60   19
 16          0   8   2     13   110   20
 16          0   8   4     13   137   21
 16          0   8   8     13   193   22
 16          0  16   0     17   173   23
 16          2   2   8     14    55   24
 16          2   4   8     14    77   25
 16          4   2   4     12    55   26
 16          4   2   8     16    67   27
 16          4   4   8     16    89   28
 16          4   8   2     14   115   29
 16          4   8   4     14   121   30
 16          8   2   4     16    79   31
 16          8   4   2     16    95   32
 16          8   4   4     16   101   33
 16          8   8   0     18   133   34
 32          0   0  32     11   464   35
 32          0   2  16     10   242   36
 32          0   4  16     10   260   37
 32          0   8  16     13   309   38
 32          0  16   2     17   198   39
 32          0  16   4     17   225   40
 32          0  16   8     17   281   41
 32          0  16  16     17   397   42
 32          0  32   0     21   349   43
 32          2   2  16     16   255   44
 32          2   4  16     16   277   45
 32          2   8  16     16   321   46
 32          2  16   2     17   210   47
 32          2  16   4     17   237   48
 32          2  16   8     17   293   49
 32          4   2  16     18   267   50
 32          4   4  16     18   289   51
 32          4   8  16     18   333   52
 32          4  16   2     17   222   53
 32          4  16   4     17   249   54
 32          4  16   8     17   305   55
 32          8   2   8     20    91   56
 32          8   4   8     20   113   57
 32          8   8   8     20   157   58
 32          8  16   8     20   245   59
 32          8   2  16     22   291   60
 32          8   4  16     22   313   61
 32          8   8  16     22   357   62
 32         16   2   8     21   267   63
 32         16   4   8     21   289   64
 32         16   8   8     21   333   65
 32         16  16   0     21   397   66

Table 6.1: Propagation time and area obtained with Tyagi's model for the structured adder used in each MCM node.
Figure 6.5: Distribution of propagation time and area occupation of the structured adder
implementation with the Tyagi metric, when considering different implementations.
As expected, both the propagation time and the implementation area grow with the size of
the adder. It is worth noting a particular situation that occurs when Z2 ≥ Z3 + Z4:
in this case, the propagation time is higher than in the other increment case (Z4 ≥ Z3 + Z2),
because zone 4 receives an anticipated carry out from zone 3 due to the usage of a CLA in zone 3,
as described in section 2.4.1.A and section 6.1.1. Note also the pure adders: they take more
time than the equivalent hybrid structures and, in some cases, they take even more area. In
fact, when comparing the configurations in rows 5 and 6 of table 6.1, i.e. the pure adder structure
with 4 bits in zone 3 against a structural adder with 2 bits in zone 4 and 2 bits in zone 3, the
hybrid structure is faster thanks to the anticipated carry in zone 4. Moreover, row 5 has a
greater area. The same is not true for the row 4 configuration, whose propagation time is even
smaller than that of the row 6 configuration. Nevertheless, its area is greater.
The chart in figure 6.5 was built with the data from table 6.1 (in the same order). The evolution
of the adder propagation time and area occupation is not linear, but it can achieve the desired
result, i.e. time minimization while still having an acceptable area. When comparing with the
time and area of the simpler CLA in figures 2.18b and 2.18a, we can see that the proposed adder
has a smaller propagation time for a similar area.
6.2.2 Standard Cell time and area evaluation
Table 6.2 presents the results concerning the propagation time and area occupation after the
synthesis of the proposed structural adder block with the standard cell library.
Adder size  Z4  Z3  Z2   Time (ns)  Area (µm²)    #
  2          0   0   2     0.14          7        1
  2          0   2   0     0.25         61        2
  4          0   0   4     0.16         18        3
  4          0   2   2     0.26         87        4
  4          0   4   0     0.27        206        5
  4          2   2   0     0.28        157        6
  8          0   0   8     0.35        217        7
  8          0   2   4     0.27        140        8
  8          0   4   2     0.31        325        9
  8          0   4   4     0.32        287       10
  8          0   8   0     0.35        501       11
  8          2   2   2     0.35        257       12
  8          2   2   4     0.31        317       13
  8          2   4   2     0.35        435       14
  8          4   2   2     0.38        361       15
  8          4   4   0     0.37        463       16
 16          0   0  16     0.46        645       17
 16          0   2   8     0.37        308       18
 16          0   4   8     0.40        548       19
 16          0   8   2     0.38        574       20
 16          0   8   4     0.37        682       21
 16          0   8   8     0.44        716       22
 16          0  16   0     0.43        991       23
 16          2   2   8     0.40        350       24
 16          2   4   8     0.43        529       25
 16          4   2   4     0.37        415       26
 16          4   2   8     0.47        531       27
 16          4   4   8     0.48        606       28
 16          4   8   2     0.45        835       29
 16          4   8   4     0.44        951       30
 16          8   2   4     0.47        544       31
 16          8   4   2     0.49        727       32
 16          8   4   4     0.49        780       33
 16          8   8   0     0.50        843       34
 32          0   0  32     0.50       1779       35
 32          0   2  16     0.36        762       36
 32          0   4  16     0.39        821       37
 32          0   8  16     0.43       1087       38
 32          0  16   2     0.44       1222       39
 32          0  16   4     0.45       1161       40
 32          0  16   8     0.48       1214       41
 32          0  16  16     0.46       1650       42
 32          0  32   0     0.57       2361       43
 32          2   2  16     0.40        807       44
 32          2   4  16     0.40       1061       45
 32          2   8  16     0.42       1324       46
 32          2  16   2     0.44       1542       47
 32          2  16   4     0.45       1534       48
 32          2  16   8     0.48       1444       49
 32          4   2  16     0.46        855       50
 32          4   4  16     0.45       1062       51
 32          4   8  16     0.46       1459       52
 32          4  16   2     0.48       1606       53
 32          4  16   4     0.48       1655       54
 32          4  16   8     0.49       1858       55
 32          8   2   8     0.56        617       56
 32          8   4   8     0.56        759       57
 32          8   8   8     0.57       1023       58
 32          8  16   8     0.56       2003       59
 32          8   2  16     0.54       1133       60
 32          8   4  16     0.54       1195       61
 32          8   8  16     0.54       1547       62
 32         16   2   8     0.67       1142       63
 32         16   4   8     0.66       1259       64
 32         16   8   8     0.62       1680       65
 32         16  16   0     0.61       1847       66

Table 6.2: Propagation time and area obtained with a standard cell implementation of the proposed adder used in each MCM node.
According to the presented results, it can be observed that the previously presented Tyagi
model of the propagation time and implementation area follows with close accuracy the respec-
tive trend of the real (measured) results, when the circuit is implemented with a standard cell
library. This aspect is very important, since this same model was used as input of the optimization
procedures that were presented in chapter 5.
The chart in figure 6.6 was built with the data from table 6.2 (in the same order). The evolution
of the adder propagation time and area is not linear, but it achieves the desired result, i.e. time
minimization while still having an acceptable area. When comparing these values with the time
and area of the CLA in figures 6.1b and 6.1a, we can see that the proposed adder structure has a
smaller propagation time for a similar area. This result is in line with what was previously seen
in section 6.2.1.
[Chart: time (ns, left axis, 0 to 0.8) and area (µm², right axis, 0 to 2500) for the 66 adder configurations of Table 6.2.]
Figure 6.6: Distribution of time delay and area occupation for the different adder implementations, with standard cell metrics.
6.3 Multiple Constant Multiplication Structure
To understand the impact of the proposed adder block on the performance (propagation time
and implementation area) of a MCM structure, we compared an implementation based on [2] against
the implementation with our adder. As experimental sets, we used the coefficient sets from [2]
and [20]. The first sets were computed with MATLAB® using the Remez algorithm, while the
second set is a difficult-to-obtain minimal set from [18]. Finally, the last set (the tenth set) is
composed of large prime numbers (seven coefficients from 0 to 4k, seven coefficients from 4k to
32k and seven coefficients from 32k to 64k), corresponding to a particularly difficult factorisation.
In table B.1 of Appendix B we enumerate each of the sets that were used to compute the MCMs.
                    ASSUME-A algorithm                                Proposed algorithm
                    Adder from [2]               Proposed adder       Proposed adder
Filter  Operations  Delay (ns)  Area (µm²)   Delay (ns)  Area (µm²)   Operations  Delay (ns)  Area (µm²)
  0         18        2.93       14100         1.54       22224          23         1.45       26065
  1         10        2.08       10156         0.95       19124          10         1.01       17587
  2         17        2.53       14138         1.25       22106          18         1.19       23946
  3         15        2.90       12681         1.40       19930          18         1.19       24047
  4         28        2.78       30712         1.33       45472          30         1.36       44882
  5         34        2.73       33584         1.34       48520          37         1.47       47448
  6         21        2.74       16602         1.47       25859          25         1.49       27970
  7         30        2.94       21755         1.42       36805          39         1.46       39730
  8         46        3.05       41852         1.68       58046          51         1.68       64919
  9         32        2.76       22784         1.57       34689          38         1.51       39777
 10         41        2.31       31913         1.73       44403          49         1.69       52901
Avg.         -           -           -         1.92         55%           -         1.95        68%
Min.         -           -           -         1.34         39%           -         1.37        41%
Max.         -           -           -         2.19         88%           -         2.44        90%

Table 6.3: Experimental results of the MCM implementation after synthesis.
The procedure for this test was: i) the algorithm was run with no time limit for the given co-
efficient sets; ii) once the MCM solutions were found, we translated the MCM decompositions
into VHDL according to the adopted adder structure. The code then went through the high-level
synthesis and through the place-and-route process. The analysis results after synthesis and
after place and route are presented in tables 6.3 and 6.4.
                    ASSUME-A algorithm                                Proposed algorithm
                    Adder from [2]               Proposed adder       Proposed adder
Filter  Operations  Delay (ns)  Area (µm²)   Delay (ns)  Area (µm²)   Operations  Delay (ns)  Area (µm²)
  0         18        2.838      13869         1.823      23698          23        1.866       26044
  1         10        2.024      11172         1.207      18167          10        1.435       14558
  2         17        2.392      14981         1.555      22689          18        1.533       23980
  3         15        2.833      12293         1.676      21025          18        1.557       23664
  4         28        2.678      27918         1.815      40300          30        1.811       40967
  5         34        2.634      31297         1.859      43593          37        1.976       44895
  6         21        2.655      17542         1.875      26895          25        1.920       28781
  7         30        2.856      22509         1.831      37084          39        1.878       41555
  8         46        3.012      39843         2.231      54025          51        2.281       56858
  9         32        2.662      24270         2.000      37087          38        1.866       43305
 10         41        2.985      39830         2.077      47534          49        2.235       52941
Avg.         -           -           -         1.50         51%           -        1.46         60%
Min.         -           -           -         1.33         19%           -        1.32         30%
Max.         -           -           -         1.69         71%           -        1.82         92%

Table 6.4: Experimental results of the MCM implementation after place and route.
Tables 6.3 and 6.4 show the results obtained with the synthesis and with the place-and-route
tools. For the synthesis, the adder used in [2] is compared with the proposed adder. The first
eleven rows contain the data from each implementation. The last three rows present the speedup
and the area penalization of each implementation. The average, minimum and maximum speedups
quantify how much faster an implementation is when compared to the reference (ASSUME-A with
the adder from [2]); a speedup of one means the implementation has the same speed as the
reference. The formula used for each individual speedup is t_ref,i / t_algo,i. The average, minimum
and maximum area penalizations quantify how much more area, in percentage, the implementation
requires when compared to the reference. The formula used for each individual area penalization
is A_algo,i / A_ref,i − 1, a percentage that estimates how much bigger the implementation is than
the reference.
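As a cross-check of these formulas, the sketch below recomputes the summary rows of Table 6.3 from its per-filter data (reference: ASSUME-A with the adder from [2]; candidate: the proposed algorithm with the proposed adder). The variable names are illustrative only.

```python
# Per-filter (delay ns, area um^2) pairs from Table 6.3, filters 0..10.
ref = [(2.93, 14100), (2.08, 10156), (2.53, 14138), (2.90, 12681),
       (2.78, 30712), (2.73, 33584), (2.74, 16602), (2.94, 21755),
       (3.05, 41852), (2.76, 22784), (2.31, 31913)]
new = [(1.45, 26065), (1.01, 17587), (1.19, 23946), (1.19, 24047),
       (1.36, 44882), (1.47, 47448), (1.49, 27970), (1.46, 39730),
       (1.68, 64919), (1.51, 39777), (1.69, 52901)]

# speedup_i = t_ref,i / t_algo,i ; penalty_i = A_algo,i / A_ref,i - 1
speedups = [tr / tn for (tr, _), (tn, _) in zip(ref, new)]
penalties = [an / ar - 1 for (_, ar), (_, an) in zip(ref, new)]

avg_speedup = sum(speedups) / len(speedups)    # ~1.95
max_speedup = max(speedups)                    # ~2.44 (filter 3)
min_speedup = min(speedups)                    # ~1.37 (filter 10)
avg_penalty = sum(penalties) / len(penalties)  # ~0.68 -> 68%
```

Running this reproduces the Avg./Min./Max. rows of the table's last columns.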
As can be observed, by simply replacing the conventional adder structures with the proposed
adder in the MCM structure obtained with [3], there is an average speedup of 1.92 and an average
silicon area augmentation of 55%. Clearly, the proposed adder introduces an improvement in
latency over the existing technology. This performance can still be improved by applying the
developed optimization algorithm, which takes into account a time minimization procedure using
the time/area model that was previously defined and evaluated for the proposed adder structure.
When comparing the proposed algorithm's performance with the previous technology from [3],
there is an average speedup of 1.95 and an average silicon area augmentation of 68%. Hence, the
proposed algorithm obtains better results across the board. Note the maximum speedup of 2.44,
versus the 2.19 obtained by the ASSUME-A algorithm. The proposed algorithm obtains gains at
least equal to or better than those of ASSUME-A: the minimum gain is 1.37 for the proposed
algorithm and 1.34 for the ASSUME-A algorithm.
After repeating the same evaluation after place and route, we observed an average speedup
of 1.50 and an average silicon area augmentation of 51%. Again, the proposed adder brings an
improvement in latency over the existing technology, even though it is slightly less significant than
in the synthesis results. Comparing the proposed algorithm's performance with the previous
technology from [3], there is an average speedup of 1.46 and an average silicon area augmentation
of 60%. This time the results are not as good as in the synthesis: the average speedup is smaller.
Note, however, the maximum speedup of 1.82, versus the 1.69 obtained by the ASSUME-A
algorithm. As in the synthesis, the proposed algorithm obtains gains at least equal to or better
than those of ASSUME-A: the minimum gain is 1.32 for the proposed algorithm and 1.33 for the
ASSUME-A algorithm.
Summary
In this section, we presented a comprehensive evaluation of the hardware structures and
optimization algorithms that were described in the previous chapters. First, an analysis of the
adder block with different operator resolutions was presented. Then, the optimization algorithms
were compared using different filters, demonstrating the advantages of using the proposed
structural adder in each node of the MCM multiplier.
7 Conclusions
This thesis addresses the problem of maximizing the processing speed of a MCM circuit. We
resorted to current state-of-the-art arithmetic circuits and to the previous work presented in [3] to
build a hybrid adder structure and a new algorithm for time minimization, respectively.
Based on the shift-and-add operations of the MCM structures, we identified several optimiza-
tions at the implementation level to build a modular and structural adder. The proposed adder
structure is a modular circuit, scalable by zones, where each zone is defined by a simple arith-
metic operation. In each zone, several different implementations were studied with the aim of
minimizing the latency. While the complexity of this hybrid adder slightly increases, its latency
is lower than that of conventional adders of the same bit-width. This was achieved by separating
the hybrid adder into four main zones: zone 1 does not perform any operation and is a simple
zero generation; zone 2 is either an increment or a bypass; zone 3 is an addition; finally, zone 4
is an increment/decrement/bypass.
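As an illustration only, the behavioral sketch below models one MCM shift-and-add node; the zone mapping in the comments is our own interpretation of the description above, and the function name and parameters are hypothetical, not taken from the implementation.

```python
def mcm_node(x, y, shift, subtract=False, width=16):
    """Behavioral model of one MCM shift-and-add node: x +/- (y << shift).

    In the hybrid adder described above, the output word is split into
    four zones (this model reproduces only the arithmetic result):
      zone 1: bits below both operands -> simple zero generation;
      zone 2: bits covered only by the unshifted operand -> increment
              or bypass;
      zone 3: bits where both operands overlap -> a true addition;
      zone 4: upper/sign-extension bits -> increment/decrement/bypass.
    """
    mask = (1 << width) - 1            # truncate to the modeled bit-width
    t = (y << shift) & mask            # shifted operand
    return (x - t if subtract else x + t) & mask
```

For example, `mcm_node(x, x, 3)` behaviorally computes the partial term 9x as x + (x << 3), matching the shift-and-add decompositions used throughout the MCM structures.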
According to the obtained experimental results, we achieved speedups of up to 2.5 times when
compared to the hardware proposed in [2]. Even though it occupies more silicon area than some
state-of-the-art implementations, we noted that the proposed structural adder has a fairly linear
growth in terms of area. Moreover, the organization of this modular adder facilitates the usage of
other possible technologies in each zone.
As for the algorithm based on the work proposed in [3], we developed a latency-minimization
oriented algorithm, weighted by an area minimization criterion and based on the CSD representation.
The algorithm was built around the time constraint of a given MCM. This time constraint is de-
fined as the longest minimal time to compute a coefficient in the coefficient set of the MCM.
After computing the critical time corresponding to all possible alternatives, the algorithm
chooses the most convenient implementation for each coefficient, by identifying the implementa-
tions that satisfy the imposed time constraint and taking the ones that jointly achieve a minimal
area. Once the algorithm has terminated, the result is an MCM structure with minimal propagation
time at the lowest cost in terms of area. Hence, the novelty introduced by the present approach
is the extension of this optimization to the logic-cell level and the usage of time, instead of area,
as the optimization constraint.
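The selection step described above can be sketched as follows. The function name, data layout and toy candidate values are hypothetical; the real algorithm operates on CSD decompositions and shared partial terms rather than on independent (time, area) lists.

```python
def pick_implementations(candidates):
    """Sketch of the described selection: `candidates` maps each
    coefficient to a list of (time, area) implementation alternatives.

    The time constraint is the longest minimal time over all
    coefficients; for each coefficient, among the alternatives that
    satisfy it, the smallest-area one is chosen."""
    t_constraint = max(min(t for t, _ in alts)
                       for alts in candidates.values())
    chosen = {}
    for coef, alts in candidates.items():
        feasible = [(t, a) for t, a in alts if t <= t_constraint]
        chosen[coef] = min(feasible, key=lambda ta: ta[1])
    return t_constraint, chosen

# Hypothetical toy data: (time ns, area um^2) alternatives per coefficient.
candidates = {3: [(1.0, 10), (0.8, 15)], 7: [(1.5, 12), (2.0, 9)]}
t_constraint, chosen = pick_implementations(candidates)
```

Here coefficient 7 forces the constraint to 1.5 ns, so coefficient 3 may take its slower but smaller 1.0 ns alternative without degrading the overall MCM propagation time.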
As the experimental results show, the conducted synthesis proves that the MCM structures
obtained with our algorithm have performance at least equal to or better than that of other state-
of-the-art algorithms. This is mainly due to the proposed refined model adopted in the construction
of the MCM. As for the place and route, the gains were more marginal than those obtained with
the synthesis, when comparing the achieved performance with the other state-of-the-art algorithms.
In terms of future work, the application of the proposed approach in real applications would
prove its usability, namely in time-critical operations (e.g. filtering, optimization, etc.) used by
real-time Digital Signal Processing applications.
Bibliography
[1] (2009). Faraday ASIC Cell Library FSD0A A 90nm Standard Cell. Faraday Technology Cor-
poration.
[2] Aksoy, L., Costa, E., Flores, P., and Monteiro, J. (2007a). Optimization of area in digital FIR
filters using gate-level metrics. In Design Automation Conference, 2007. DAC '07. 44th ACM/IEEE,
pages 420–423.
[3] Aksoy, L., da Costa, E., Flores, P., and Monteiro, J. (2008). Exact and approximate algorithms
for the optimization of area and delay in multiple constant multiplications. Computer-Aided
Design of Integrated Circuits and Systems, IEEE Transactions on, 27(6):1013 –1026.
[4] Aksoy, L., Gunes, E. O., Costa, E., Flores, P., and Monteiro, J. (2007b). Effect of number repre-
sentation on the achievable minimum number of operations in multiple constant multiplications.
In Signal Processing Systems, 2007 IEEE Workshop on, pages 424 –429.
[5] Aktan, M., Yurdakul, A., and Dundar, G. (2008). An algorithm for the design of low-power
hardware-efficient FIR filters. Circuits and Systems I: Regular Papers, IEEE Transactions on,
55(6):1536–1545.
[6] Arroz, G. (2009). Arquitectura de Computadores, 2ª edição, chapter 5.2.3 Operações com
Números em Complemento para 2. IST Press.
[7] Avizienis, A. (1961). Signed-digit number representations for fast parallel arithmetic. Electronic
Computers, IRE Transactions on, EC-10(3):389–400.
[8] Bull, D. and Horrocks, D. (1988). Primitive operator digital filter synthesis using a shift biased
algorithm. In Circuits and Systems, 1988., IEEE International Symposium on, pages 1529
–1532 vol.2.
[9] Bull, D. and Horrocks, D. (1991). Primitive operator digital filters. Circuits, Devices and
Systems, IEE Proceedings G, 138(3):401 –412.
[10] Cappello, P. and Steiglitz, K. (1984). Some complexity issues in digital signal processing.
Acoustics, Speech and Signal Processing, IEEE Transactions on, 32(5):1037 – 1041.
[11] Dempster, A. and Macleod, M. (1994). Constant integer multiplication using minimum adders.
Circuits, Devices and Systems, IEE Proceedings -, 141(5):407 –413.
[12] Dempster, A. and Macleod, M. (1995). Use of minimum-adder multiplier blocks in FIR digital
filters. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on,
42(9):569–577.
[13] Dempster, A. and Macleod, M. (2004). Using all signed-digit representations to design single
integer multipliers using subexpression elimination. In Circuits and Systems, 2004. ISCAS ’04.
Proceedings of the 2004 International Symposium on, volume 3, pages III – 165–8 Vol.3.
[14] Flores, P., Monteiro, J., and Costa, E. (2005). An exact algorithm for the maximal shar-
ing of partial terms in multiple constant multiplications. In Computer-Aided Design, 2005.
ICCAD-2005. IEEE/ACM International Conference on, pages 13 – 16.
[15] Garner, H. L. (1966). Number systems and arithmetic. volume 6 of Advances in Computers,
pages 131 – 194. Elsevier.
[16] Gustafsson, O., Dempster, A., and Wanhammar, L. (2002). Extended results for minimum-
adder constant integer multipliers. In Circuits and Systems, 2002. ISCAS 2002. IEEE
International Symposium on, volume 1, pages I–73 – I–76 vol.1.
[17] Hartley, R. (1996). Subexpression sharing in filters using canonic signed digit multipli-
ers. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on,
43(10):677 –688.
[18] Johansson, K., Gustafsson, O., DeBrunner, L., and Wanhammar, L. (2011). Minimum adder
depth multiple constant multiplication algorithm for low power FIR filters. In Circuits and Systems
(ISCAS), 2011 IEEE International Symposium on, pages 1439–1442.
[19] Kumar, R., Mandal, A., and Khatri, S. (2012). An efficient arithmetic sum-of-product (SOP)
based multiplication approach for FIR filters and DFT. In Computer Design (ICCD), 2012 IEEE 30th
International Conference on, pages 195–200.
[20] Kumar, V., Phaneendra, P., Ahmed, S., Sreehari, V., Muthukrishnan, N., and Srinivas, M.
(2011). A reconfigurable inc/dec/2’s complement/priority encoder circuit with improved decision
block. In Electronic System Design (ISED), 2011 International Symposium on, pages 100 –105.
[21] Malvar, H., Hallapuro, A., Karczewicz, M., and Kerofsky, L. (2003). Low-complexity transform
and quantization in H.264/AVC. Circuits and Systems for Video Technology, IEEE Transactions
on, 13(7):598–603.
[22] Park, I.-C. and Kang, H.-J. (2001). Digital filter synthesis based on minimal signed digit
representation. In Design Automation Conference, 2001. Proceedings, pages 468 – 473.
[23] Pasko, R., Schaumont, P., Derudder, V., Vernalde, S., and Durackova, D. (1999). A new
algorithm for elimination of common subexpressions. Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, 18(1):58 –68.
[24] Plauger, P., Lee, M., Musser, D., and Stepanov, A. A. (2000). C++ Standard Template Library.
Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition.
[25] Richardson, I. E. (2010). The H.264 Advanced Video Compression Standard, Second
Edition, chapter 7 - H.264 transform and coding, page 194. John Wiley and Sons.
[26] Sklansky, J. (1960). Conditional-sum addition logic. Electronic Computers, IRE Transactions
on, EC-9(2):226 –231.
[27] Sousa, L., Roma, N., and Dias, T. (2003). Efficient adder architectures for high-performance
vlsi design. Technical report, INESC-ID Lisbon. Technical Report No. RT/001/2003-CDIL.
[28] Tyagi, A. (1993). A reduced-area scheme for carry-select adders. Computers, IEEE
Transactions on, 42(10):1163–1170.
[29] Voronenko, Y. and Puschel, M. (2007). Multiplierless multiple constant multiplication. ACM
Trans. Algorithms, 3(2).
[30] Zimmermann, R. (1997). Binary Adder Architectures for Cell-Based VLSI and their
Synthesis. PhD thesis, Swiss Federal Institute of Technology, Zurich.
Appendix A - Background
We present a brief review of number representation systems, explaining the representations
commonly used in MCM design and how to implement conversions between them.
A.1 Number Representations Systems
In this section, the most prevalent number representation systems are briefly reviewed,
with particular attention to Signed Binary (SB) representations. With this in mind, the TC [6]
will be considered as the most usual representation in electronic circuits. Next, the CSD [7] and
the MSD [22] will also be reviewed, due to their relevance in MCM design, since they offer
representations of the operated numbers with a minimal number of non-zero digits.
A.1.1 Unsigned Binary Representation
The binary system [6] is generally used in electronics as the default representation of unsigned
digits. This system uses two symbols: 0 and 1. Any unsigned number can thus be represented
by a well defined sequence of bits. Its conversion to the base-10 system (decimal) is done by the
following equation:

y₁₀ = Σ_x α_x · 2^x    (A.1)

where α_x is the binary digit in position x. Note that the subscripts 2 and 10 denote a binary and
a decimal number, respectively.
As an example, given the binary number 100110₂, we apply the above formula to obtain:
y₁₀ = 1 × 2⁵ + 0 × 2⁴ + 0 × 2³ + 1 × 2² + 1 × 2¹ + 0 × 2⁰ = 38₁₀.
The main limitation of this representation is the impossibility of representing negative values.
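Equation (A.1) can be checked with a one-line sketch (the variable names are ours):

```python
# Equation (A.1) in code: weight each binary digit by its power of two.
bits = "100110"
value = sum(int(b) << i for i, b in enumerate(reversed(bits)))  # -> 38
```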
A.1.2 Signed Binary Representations
Several different representations have been proposed to represent signed numbers. In this
subsection we review four commonly used representations: One's Complement (OC), TC, CSD
and MSD. In particular, the latter two are frequently used as useful decompositions for constants
in MCMs.
A.1.2.A One’s Complement
Just as the unsigned binary representation, the OC representation also uses the 0 and 1
symbols. However, it is able to extend the representation to negative values. The OC is defined
as the value that is obtained by inverting all bits of a number in the unsigned binary
representation. In an N-bit one's complement numbering system, one can represent numbers
within the range −(2^(N−1) − 1) to 2^(N−1) − 1. For example, by taking the number 0011₂ (3₁₀), we can
convert it to negative by simply taking the complement of all bits: 1100₂ (−3₁₀). Note that the most
significant bit represents the sign of the number: 1 for negative and 0 for positive.
The main disadvantage of this representation is the existence of two representations for the zero
value: the positive 0000₂ (+0₁₀) and the negative 1111₂ (−0₁₀).
A.1.2.B Two’s Complement
The TC representation is very similar to the unsigned binary representation, as it also uses
both 0 and 1 symbols. However, it offers a slightly larger representation range than the OC:
numbers within the range −2^(N−1) to 2^(N−1) − 1 can be represented. As in OC, the most significant
bit carries the sign of the number. To convert a positive number into a negative value, we simply
take the OC of the positive representation and add one. Furthermore, the addition operation is
implemented just as in the unsigned binary representation. The subtraction is also straightforward
and is implemented just like the addition, i.e. subtracting 5 from 15 is the same as adding −5
and 15. As an example, using an 8-bit representation, the number −5 is represented by:
00000101 → (complement) → 11111010 → (+1) → 11111011.
To convert a negative number to a positive one, the procedure is the same as in the above
example: 11111011 → (complement) → 00000100 → (+1) → 00000101.
Now, if both are added, the result of the operation is 0 (the carry out of the most significant bit
is discarded): 11111011 + 00000101 = 00000000.
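The complement-and-add-one procedure above can be sketched as a minimal model; the helper name and the 8-bit default are our choices:

```python
def tc_negate(value, width=8):
    """Two's-complement negation as described above: invert all bits,
    then add one, within the given bit-width."""
    mask = (1 << width) - 1
    return ((value ^ mask) + 1) & mask
```

For example, `tc_negate(0b00000101)` yields `0b11111011` (−5), and adding the two patterns modulo 2⁸ gives 0, reproducing the worked example.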
The possibility of representing negative numbers is of great interest in MCM. Next, two different
signed representations are presented.
A.1.2.C Canonical Signed Digit
Contrasting with the previous binary representations, the CSD representation [7] has three
symbols instead of just two: 1, 0 and 1̄, where the last symbol (the negated one) is equivalent
to −1. This representation system is essentially a re-encoding of the binary representation,
converting every run of consecutive 1's into a single 1 followed by a run of 0's and a
−1 (...001110... → ...0101̄0... in CSD, where 16 − 2 = 14), thus achieving the minimum number of
non-zero symbols. According to [15], on average the number of non-zero digits is reduced by 33%.
For the sake of illustration, let us consider the value 14, represented in TC as 01110 ⇔ 14 = 8 + 4 + 2.
When converted into CSD, the new representation becomes 1001̄0 ⇔ 14 = 16 − 2. In this example,
we decreased the number of non-zero symbols from 3 to 2.
Due to its particular characteristics, this representation has been shown to be very convenient
to minimize the number of adders. However, while there is only one possible representation in
CSD, the MSD representation (described below) offers the possibility to use several alternative
representations.
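A common way to obtain the CSD digits is the modulo-4 recoding sketched below. This is a standard textbook procedure for non-negative integers, not code from this thesis, and the function name is ours:

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer n, least
    significant digit first, with digits in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d           # removing the digit leaves an even number
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For 14 this produces the digits of 1001̄0 (16 − 2), with only two non-zero digits, matching the example above.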
A.1.2.D Minimal Signed Digit
The MSD [22] is very similar to the CSD. However, while CSD has only one possible represen-
tation for each value, MSD can have several. This is mainly due to the way an MSD number is
constructed. The conversion is realized by taking the CSD representation and subsequently
applying the following two transformation rules to the obtained numbers:
• 101̄ → 011, a reorganization of the digits: 1 × 4 + 0 × 2 − 1 × 1 = 4 − 1 = 3 becomes
0 × 4 + 1 × 2 + 1 × 1 = 2 + 1 = 3;
• 1̄01 → 01̄1̄, as in the previous rule: −1 × 4 + 0 × 2 + 1 × 1 = −4 + 1 = −3 becomes
0 × 4 − 1 × 2 − 1 × 1 = −2 − 1 = −3.
To illustrate this procedure, let us consider the CSD representation of the number 715:
101̄01̄0101̄01̄. By applying the first rule to the leftmost digits and the second rule to the middle
digits, we obtain the distinct representation 011001̄1̄01̄01̄, which also equals 715 and keeps the
same number of non-zero digits. Successively applying the same rules to the resulting
representations yields further alternatives.
Hence, by using the MSD numbering system, we produced additional representations of the
number 715. In general, while the CSD only allows one representation, MSD
can provide several, enabling a more diverse decomposition of the term.
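The two rewrite rules can be applied exhaustively in a short sketch; the digit lists are stored least-significant digit first, and the function name and data layout are our own:

```python
def msd_variants(digits):
    """Enumerate the signed-digit representations reachable from a CSD
    digit list (LSB first, digits in {-1, 0, 1}) via the two rewrite
    rules above; each rewrite preserves the non-zero digit count."""
    seen = {tuple(digits)}
    stack = [list(digits)]
    while stack:
        cur = stack.pop()
        for i in range(len(cur) - 2):
            if cur[i:i + 3] == [-1, 0, 1]:     # 1 0 1̄ -> 0 1 1, MSB first
                new = cur[:i] + [1, 1, 0] + cur[i + 3:]
            elif cur[i:i + 3] == [1, 0, -1]:   # 1̄ 0 1 -> 0 1̄ 1̄, MSB first
                new = cur[:i] + [-1, -1, 0] + cur[i + 3:]
            else:
                continue
            if tuple(new) not in seen:
                seen.add(tuple(new))
                stack.append(new)
    return seen

# CSD digits of 715, LSB first: -1 - 4 + 16 - 64 - 256 + 1024 = 715
csd_715 = [-1, 0, -1, 0, 1, 0, -1, 0, -1, 0, 1]
variants = msd_variants(csd_715)
```

Every element of `variants` evaluates to 715 with the same number of non-zero digits, confirming that the rules generate alternative minimal decompositions.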
Discussion
The subset of number representations reviewed in this section will be used throughout this
document. The binary TC representation is paramount in electronic circuits and will be used to
represent the input and output operands of the MCM circuits debated in this thesis. For the partial
term decomposition, the CSD numbering system will be used to build the MCM structures. This
decomposition was first used by Avizienis [7], Hartley [17], Bull et al. [9], Dempster et al. [11] [12]
[13] and Pasko et al. [23]. On the other hand, the MSD representation seems to have been first
used as a decomposition technique for the MCM problem by Park et al. [22], and later by Flores
et al. [14] and Aksoy et al. [4] [3].
Appendix B - Used set of coefficients
Set name   # coefficients   Max coefficients' bitwidth   Coefficients
fir00      14               12   -710;-662;-499;-266;-35;327;398;505;582;699;710;1943;2987;3395
fir01      61                7   -1;0;0;0;0;0;0;0;0;-1;-1;0;1;1;1;0;0;-1;-1;-1;0;1;2;2;1;-1;-2;-3;-2;0;3;4;3;1;-2;-5;-5;-3;1;5;7;6;1;-5;-9;-10;-5;3;11;15;12;2;-13;-24;-26;-14;14;51;89;117;128
fir02      51                9   0;0;0;0;0;0;0;0;0;0;0;0;1;1;1;1;-1;-2;-3;-3;-1;2;5;8;8;6;1;-6;-14;-19;-19;-12;2;19;35;43;38;18;-15;-53;-84;-96;-78;-23;67;182;303;411;485;512
fir03      20               12   -1;-6;0;15;13;-20;-44;1;80;63;-85;-173;1;283;223;-297;-650;2;1628;3058
fir04      41               11   -23;40;32;25;10;-10;-29;-35;-23;3;33;51;45;14;-30;-66;-74;-44;14;75;107;88;20;-71;-142;-151;-81;46;175;240;190;24;-203;-388;-419;-218;221;818;1426;1880;2048
fir05      61               11   70;546;-39;4;-34;-44;-30;2;34;48;35;1;-35;-52;-40;-4;36;58;47;9;-37;-64;-55;-14;38;71;64;20;-39;-79;-76;-29;39;90;91;40;-40;-103;-112;-55;40;122;140;77;-41;-150;-184;-112;42;197;262;177;-42;-296;-441;-345;42;656;1329;1851;2048
fir06      30               14   0;-1;0;4;4;-7;-17;0;36;29;-42;-87;2;148;113;-148;-293;4;450;335;-419;-819;6;1246;958;-1242;-2670;7;6544;12240
fir07      30               14   33;-14;-51;-51;13;86;71;-50;-156;-98;111;250;113;-217;-377;-106;387;543;55;-661;-768;79;1134;1114;-411;-2155;-1887;1571;6898;10896
fir08      50               16   -5;1;10;12;-3;-24;-21;15;51;33;-39;-92;-43;83;150;44;-159;-227;-24;276;319;-30;-444;-418;140;676;512;-332;-981;-579;637;1368;591;-1104;-1854;-503;1806;2470;243;-2893;-3298;339;4743;4603;-1688;-8759;-7615;6322;27640;43590
fir09      30               14   -59;-11;34;89;110;59;-57;-179;-217;-108;120;347;406;198;-216;-614;-711;-343;380;1074;1248;607;-697;-2023;-2463;-1281;1657;5700;9592;11976
fir10      21               16   233;293;379;769;2693;3499;3917;6247;7307;19753;20269;28279;29147;31189;37879;39313;40127;44563;50147;59063;63463

Table B.1: Coefficient sets.
Appendix C - Comparison of filter fir10
[Figure: shift-and-add graph for filter fir10, from the input through partial-term adder nodes (+3x, +5x, −7x, +9x, ...) and shift operations (<<k) to the outputs out1–out21.]
Figure C.1: Graph for the test set fir10 with the proposed adder and algorithm.
[Figure: shift-and-add graph for filter fir10, from the input through partial-term adder nodes (+3x, +5x, −7x, +9x, ...) and shift operations (<<k) to the outputs out1–out21.]
Figure C.2: Graph for the test set fir10 with the Levent adder and algorithm [3].