Time Constrained Multiple-Constant-Multiplication Structures for Real-Time Applications
Luís Miguel Seabra Ribau Lopes do Rosário
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor: Doutor Nuno Filipe Valentim Roma
Examination Committee
Chairperson: Doutor Nuno Cavaco Gomes Horta
Supervisor: Doutor Nuno Filipe Valentim Roma
Member of the Committee: Doutor Paulo Ferreira Godinho Flores
May 2014
Acknowledgments
This thesis was performed in the scope of the project "HELIX: Heterogeneous Multi-Core Architecture for Biological Sequence Analysis", funded by the Portuguese Foundation for Science and Technology (FCT) under reference PTDC/EEA-ELC/113999/2009.
I would like to express my gratitude for the support and encouragement of the people who helped me throughout my dissertation.
First, to my family, who encouraged me all the way. To my friends, who indulged me while I was working on the dissertation and encouraged me throughout. Special thanks to Filipa, Valter, André, João and both Marianas, who were there most of the time. To Constança, who helped me through the final push.
To my supervisor, Dr. Nuno Filipe Valentim Roma, who helped me manage my expectations, encouraged me as much as he possibly could, and endured the hard reviews he had to go through.
To T.A. Tiago Miguel Braga da Silva Dias for his helpful comments, ideas and contributions.
Also to Dr. Levent Aksoy for his insightful comments, his help in key parts of the thesis and his friendly attitude.
Last, but not least, to my friends and my mother, who are no longer among us: Marco, Bernardo and Liliete.
Abstract
To implement multiplication-by-constant operations, Multiple Constant Multiplication (MCM)
structures are often the de facto alternative in Application Specific Integrated Circuits (ASIC). These
structures use a series of additions, subtractions and shifts to multiply a given variable
by a set of constants. However, defining the optimized MCM hardware structure for a given
constraint set is an NP-complete problem, which has motivated the proposal of many algorithms
over the last decade, mainly focused on reducing the number of additions needed to jointly
implement a set of constant coefficients. The presented research focuses on a less explored
approach, which optimizes the MCM multiplier structure in order to minimize the propagation
time based on gate-level metrics. For this purpose, a modular and structural adder is proposed
for integration in each MCM node, optimally implementing each addition operation according
to the particular requisites and characteristics of the operands considered at that node.
As such, not only does the proposed approach handle the same problem as the
other algorithms by reducing the number of adders, but it also starts by minimizing the
propagation time and only then optimizes the area. From the conducted simulations, it was
observed that the proposed improvement provides an average speed-up of the MCM performance
as high as 1.5, at the cost of a consequent increase in the used silicon area.
Keywords
Multiple Constant Multiplication, binary adder structure, propagation time, silicon area, design
optimization
Resumo
To implement multiplication-by-constant operations, MCM structures are used in ASIC. To multiply a variable by a constant, these structures use a sequence of additions, subtractions and bit shifts. The problem with MCM structures is that their optimization is NP-complete, which has motivated the proposal of several algorithms over the last decade, mostly focused on reducing the number of additions needed to jointly implement a set of constant coefficients. This research focused on a different approach: based on a gate-level metric, the MCM multiplier structure was optimized to minimize the propagation time. To this end, a modular and structural adder was proposed for integration in each MCM node, where each addition operation is implemented optimally according to the requisites and characteristics of the operands being considered at each node. The presented approach handles the same problems as the other algorithms by reducing the number of adders, but it also goes further, first trying to reduce the propagation time and only then optimizing the area. From the conducted simulations, it was observed that the proposed improvements achieve an average speed-up of the MCM performance of up to 1.5, at the cost of increased silicon area occupation.
Palavras Chave
Multiple constant multiplication, binary adder structure, propagation time, silicon area, circuit optimization
Contents

Acronyms

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Original contributions
  1.4 Thesis organization

2 State of the art
  2.1 Binary Arithmetic Structures
    2.1.1 Ripple-Carry or Ripple-Adder structure
    2.1.2 Carry Look-Ahead Adder
    2.1.3 Sklansky Adder
    2.1.4 Modified Full Adder
    2.1.5 Reconfigurable architecture for Increment and Decrement
  2.2 Single Constant Multiplication
  2.3 Multiple Constant Multiplication
    2.3.1 Bull-Horrocks Algorithm
    2.3.2 Bull-Horrocks Modified algorithm
    2.3.3 N-dimensional Reduced Adder Graph
    2.3.4 Cumulative Benefit Heuristic
    2.3.5 Unconstrained Area optimization - ASSUME-A
  2.4 Technology Comparison
    2.4.1 Arithmetic Binary Structures
      2.4.1.A Adder
      2.4.1.B Increment and Decrement
    2.4.2 Multiple Constant Multiplication

3 Optimized Adder Structures for MCM
  3.1 Addition
    3.1.1 Mathematical formulation
    3.1.2 Example configurations
  3.2 Subtraction
    3.2.1 Mathematical formulation
    3.2.2 Example configuration
  3.3 Topologic description of the adder/subtracter module

4 Time model of the proposed adder/subtracter structure
  4.1 Simplest Model
  4.2 LSB Model
  4.3 Proposed Model
    4.3.1 Formalization
    4.3.2 Example

5 Time delay minimization through gate level metrics
  5.1 Functions
  5.2 Data Structure and classes
  5.3 Proposed optimization algorithm
    5.3.1 Partial term finding
    5.3.2 Minimizing the time
    5.3.3 Minimize area function

6 Results
  6.1 Optimized Adder Structure Evaluation
    6.1.1 Adder Structures
    6.1.2 Increment and Decrement Structures
  6.2 Structured Adder Block Evaluation
    6.2.1 Tyagi time and area evaluation
    6.2.2 Standard Cell time and area evaluation
  6.3 Multiple Constant Multiplication Structure

7 Conclusions

A Appendix A - Background
  A.1 Number Representations Systems
    A.1.1 Unsigned Binary Representation
    A.1.2 Signed Binary Representations
      A.1.2.A One's Complement
      A.1.2.B Two's Complement
      A.1.2.C Canonical Signed Digit
      A.1.2.D Minimal Signed Digit

B Appendix B - Used set of coefficients

C Appendix C - Comparison of filter fir10
List of Figures

2.1 Full Adder structure.
2.2 Ripple Carry structure for w bits.
2.3 Carry look-ahead adder structure for 8 bits.
2.4 Carry look-ahead adder blocks.
2.5 Sklansky prefix-adder structure for 8 bits.
2.6 Sklansky prefix-adder blocks.
2.7 Original scheme of the RID [20].
2.8 Modified scheme of the Reconfigurable architecture for Increment and Decrement (RID) input selection block.
2.9 Scheme of the RID decision block.
2.10 Modified scheme of the RID output selection block.
2.11 Single Constant Multiplication (SCM) structure to compute 23x.
2.12 Multiple constant multiplication with the terms 7 and 11.
2.13 Bull-Horrocks Algorithm (BHA) graph representation.
2.14 N-dimensional Reduced Adder Graph (RAG-N) graph representation.
2.15 Distance cases handled by the algorithm Cumulative Benefit Heuristic (Hcub).
2.16 Graph topologies for optimal and exact distance tests.
2.17 Examples of cost function in a Boolean network.
2.18 Adder area and time delay comparison under Tyagi's metric.
2.19 Comparison between Half Adder (HA)+Modified Full Adder (MFA) and [20].
2.20 Implementation characteristics of MCM structures.
2.21 Bit precision study for the Inverse Quantization (QI) coefficients set.
2.22 Bit precision study for the Forward Quantization (QF) coefficients set.
3.1 Adder block with its inputs left shifted and the resulting output right shifted.
3.2 Mathematical formulation for the addition operation.
3.3 Zone 3 divided into 2 sub-zones.
3.4 Zone 4 divided into 2 sub-zones.
3.5 Mathematical formulation for the subtracter.
3.6 Zone 2 divided into 2 sub-zones.
3.7 Complete adder configuration.
4.1 Simplest model: critical path estimation.
4.2 Least Significant Bit (LSB) model.
4.3 Definition of the critical path for the proposed model.
5.1 Main algorithm flowchart.
5.2 Find partials algorithm flowchart.
5.3 Minimize path algorithm flowchart.
5.4 Minimize area algorithm flowchart.
5.5 Implementation of the term 89.
6.1 Area and propagation time for the Standard Cell library.
6.2 Model implementation of the Carry Look-Ahead Adder (CLA).
6.3 Comparison between HA+MFA and RID [20].
6.4 Model implementation of the HA+MFA and RID.
6.5 Adder implementations with Tyagi metrics.
6.6 Adder implementations with Standard cell metrics.
C.1 Graph for a test set fir10 with the proposed adder and algorithm.
C.2 Graph for a test set fir10 with the Levent adder and algorithm [3].
List of Tables

2.1 Original control signals.
2.2 Binary representation of 45 and its possible covers.
2.3 Characteristics of MCM structures for the QF coefficients set.
2.4 Characteristics of MCM structures for the QI coefficients set.
3.1 Zone 4 truth table.
3.2 Table summarizing the addition operation with r > 0.
3.3 Eight different cases of the addition operation.
3.4 Table summarizing the addition operation with r = 0.
3.5 Table summarizing the subtraction with right shift operation.
3.6 Eight different cases of the subtraction operation.
3.7 Definition and implementation conditions of zone 1.
3.8 Definition and implementation conditions of zone 2 a).
3.9 Definition and implementation conditions of zone 2 b).
3.10 Definition and implementation conditions of zone 3 a).
3.11 Definition and implementation conditions of zone 3 b).
3.12 Definition and implementation conditions of zone 4 a).
3.13 Zone 4 truth table for the select signal.
3.14 Definition and implementation conditions of zone 4 b).
3.15 Table summarizing both operations.
5.1 Data structure and classes.
5.2 Minimal Signed Digit (MSD) representations and respective paths and implementations of the term 45.
6.1 Propagation time and area obtained with Tyagi's model for the structured adder used in each MCM node.
6.2 Propagation time and area obtained with a standard cell implementation of the proposed adder used in each MCM node.
6.3 Experimental results of the MCM implementation after synthesis.
6.4 Experimental results of the MCM implementation after place and route.
B.1 Coefficient sets.
Acronyms

Hcub Cumulative Benefit Heuristic
ASIC Application Specific Integrated Circuits
ASSUME-A Unconstrained Area optimization
BHA Bull-Horrocks Algorithm
BHM Bull-Horrocks Modified algorithm
CLA Carry Look-Ahead Adder
CNF Conjunctive Normal Form
CSD Canonical Signed Digit
CSE Common Sub-Expression sharing
DCT Discrete Cosine Transform
DFS Depth-First Search
FA Full Adder
FIR Finite Impulse Response
HA Half Adder
ILP Integer Linear Programming
LSB Least Significant Bit
LSOB Least Significant One Bit
MAG Minimised Adder Graph
MCM Multiple Constant Multiplication
MFA Modified Full Adder
MSB Most Significant Bit
MSD Minimal Signed Digit
OC One's Complement
QF Forward Quantization
QI Inverse Quantization
QP Quantization Parameters
RAG-N N-dimensional Reduced Adder Graph
RC Ripple-Carry or Ripple-Adder structure
RID Reconfigurable architecture for Increment and Decrement
SA Sklansky Adder
SAT Satisfiability
SB Signed Binary
SCM Single Constant Multiplication
STL Standard Template Library
TC Two's Complement
VHDL VHSIC Hardware Description Language
VHSIC Very-High-Speed Integrated Circuits
1 Introduction
1.1 Motivation
High-speed multiplication is used in everything around us, from the simple processors used in
embedded systems to the general purpose processors present in high-performance clusters of servers.
As a consequence, it has been a subject of research for a long time and, even though there are
many different algorithms for number multiplication, we will focus on Multiple Constant
Multiplication (MCM) systems composed of add and shift operations. In fact, it is well known that,
by conveniently mapping a series of adds and shifts on a processing path, we can compute the
multiplication of a given variable by a set of constant coefficients. This multi-constant mapping can
be efficiently obtained by sharing intermediate results among the adders, which reduces the amount
of hardware required to implement a given set of target coefficients. The main difficulty that characterizes
the design of MCM systems is the fact that it is an NP-complete problem, as shown in [10].
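As a minimal illustration of this mapping (a Python sketch written for this text, with a hypothetical function name, not code from the thesis), the products 7x and 11x used later in figure 2.12 can be obtained with one subtraction, one addition and two shifts, sharing the intermediate term 7x:

```python
def mcm_7_11(x: int) -> tuple[int, int]:
    """Compute 7*x and 11*x with shifts and adds only,
    sharing the intermediate result 7x between the two outputs."""
    t7 = (x << 3) - x    # 8x - x = 7x
    t11 = t7 + (x << 2)  # 7x + 4x = 11x, reusing the 7x adder output
    return t7, t11
```

Sharing 7x saves one operation compared to building 11x = 8x + 2x + x from scratch; finding such sharings for a whole set of coefficients is precisely the NP-complete MCM problem.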
In recent years, the MCM research trend has focused on reducing the area of the multiplier
block, while keeping the number of additions to a minimum. When processing a set of coefficients,
there are many possible ways to connect the adders. Different decisions lead to different approaches
to solve the MCM problem, ranging from exhaustive search and the sharing of common partial terms
to the graph-based coefficient search presented next.
The graph-based approach is one of the most prevalent techniques, represented by several
contributions in the literature. As an example, the RAG-N [12] technique uses a pre-computed
table containing, for each coefficient, the optimal cost of structures up to two adder-steps deep
(two adders in series), and synthesizes the circuit based on it. However, the computational
requirements of graph-based methods are quite high; as a consequence, only coefficients
up to 19 bits are feasible. The Hcub [29] technique augmented the search space by adding a
heuristic algorithm capable of handling greater depths (more than three adder-steps). The technique used
by Unconstrained Area optimization - ASSUME-A (ASSUME-A) [14] takes a different approach:
the coefficients are represented in Canonical Signed Digit (CSD) or Minimal Signed Digit (MSD).
Both representations have the property of a minimal number of non-zero digits. The aim of
using them is to reduce the complexity of the algorithm, by giving direct access to a minimal number
of partial terms. The solution is obtained by translating the problem into a Boolean network and
solving it as a 0-1 Integer Linear Programming (ILP) problem with a Satisfiability (SAT) solver.
The main problem of the previous algorithms [12, 14, 29] is the absence of a direct relation
with the targeted implementation technology and, in particular, the lack of gate-level metrics, i.e.
all three approaches assign unitary weights to the nodes of the MCM. Recently, the work developed
by Aksoy et al. [2] brought some optimizations at the gate-level metric, in order to achieve
a smaller area. Their proposal is based on a custom adder structure made of full adders and half
adders that optimizes the resulting MCM circuit area. For this purpose, the area model of the
custom adder is introduced in a graph-generation algorithm [14], in order to achieve a smaller
area than the original approach. However, it still lacks a time delay model. In [19], the author
achieved an architecture that greatly improves the latency by using a sum-of-products architecture
in conjunction with a column-compression algorithm. However, it still loses in terms of the attained
throughput.
While [2] introduced gate-level metrics and [19] managed to optimize the latency with a
new technique applied to the MCM, to the best of the authors' knowledge there is
still no approach that optimizes the latency at the gate level. We aim to achieve
low latency and high throughput while maintaining the shift-and-add method. For this purpose, a
custom adder structure will be proposed, defined as a modular architecture that is able to adapt
to the needs of different MCM specifications, with a consequent improvement of the latency when
compared to the existing state of the art.
In the presented work, the adder block is based on a hybrid Carry Look-Ahead Adder (CLA)
structure and on the Reconfigurable architecture for Increment and Decrement (RID) presented in
[20]. The cost function for the adder block provides both area and time delay measures, according
to a given implementation technology, based on synthesis values. Accordingly, the cost varies
with the size of the operation implemented in each module of the block. Upon an accurate
modelling of the obtained adder structure, we developed an algorithm that uses this
model to implement the fastest possible multiplier structure.
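To make the idea of a cost-driven node choice concrete, such a selection can be sketched as follows; every name and cost figure below is a hypothetical placeholder standing in for the synthesis-derived models developed later in the thesis:

```python
# Illustrative sketch only: cost values are hypothetical placeholders,
# not the synthesized figures used in the thesis.
from typing import Callable

# Per-implementation cost model: bit-width -> (delay, area)
COST_MODELS: dict[str, Callable[[int], tuple[float, float]]] = {
    "ripple": lambda w: (3.0 * w, 9.0 * w),
    # (w - 1).bit_length() equals ceil(log2 w) for w >= 1
    "cla": lambda w: (4.0 * (w - 1).bit_length() + 1, 11.0 * w - 3),
}

def pick_fastest(width: int) -> str:
    """Select the node implementation with the smallest delay,
    breaking ties by the smaller area (tuples compare lexicographically)."""
    return min(COST_MODELS, key=lambda name: COST_MODELS[name](width))
```

The actual algorithm additionally constrains the area once the minimum critical path is fixed, as described in chapter 5.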
1.2 Objectives
The main problems that were studied in the scope of this thesis can be stated by the following
questions:
• How can the speed of a multiplier block be improved?
• Is it enough to change the MCM nodes implementation?
• Can we improve the MCM design for low-latency and high throughput?
Hence, the overall objective of this work is to improve the latency of the MCM structure,
while still using the area optimizations based on the work by Aksoy et al. [2]. This objective is pursued through:
1. The development of an MCM node that minimizes the time-delay while still having an
acceptable area occupation. Even though there is already research on many types
of latency-optimized adders, to the best of our knowledge there is no adder
specifically optimized for the MCM structure. Hence, we propose an adder structure for
the MCM problem. This structure can be adapted to the needs of the designer, either for
time-delay or for area improvement.
2. The proposal of an algorithm capable of finding the MCM structure with the smallest
critical path, given a node implementation. This objective arises as a natural extension,
through gate-level metrics, of the time-delay optimization in the work of [3]. While still
considering the area optimization, we extend the algorithm to introduce the
time optimization, so that it finds a solution with a given critical path as the main
constraint.
1.3 Original contributions
The original contributions of this thesis are the following:
• New framework for MCM-optimized adder structures: optimized adder structures for
MCM architectures already exist [2]. Nonetheless, no generic structure had been defined.
Through a careful analysis of the MCM requirements, a new framework has been developed:
the addition operation is divided into different modules that compute the output
bits in groups, making the adder scalable to the designer's needs.
• Formal model of the critical path in a chain of adders: based on the proposed frame-
work, we built a model capable of defining the critical path of a chain of adders with better
accuracy than the existing ones. We essentially grouped the bits of the result according to
their corresponding critical path.
• Adder structure optimized for time-delay minimization in MCMs: with the use of the
above-mentioned framework, we were able to minimize the time-delay of the multiplier block.
By considering different increment, decrement and adder implementations, we found a
combination that improved the time-delay without significantly increasing the resulting
area.
• Algorithm to define the optimized MCM structure with the lowest time-delay: based on
the devised model and the implemented adder structure, an algorithm was developed that
finds the lowest possible time-delay of an MCM and sets it as a constraint; it then optimizes
the area without exceeding this constraint.
1.4 Thesis organization
To address these challenges, the theory and technology behind the MCM problem are
briefly revised in chapter 2. Chapter 3 proposes an improvement of the adder structures inside
the MCM, by designing a custom adder based on the shifts and on the bit-width disparity between
operands. Chapter 4 presents the definition of an accurate time delay model of the MCM adder
structure. This model is extensively used in chapter 5 to define an algorithm that builds the
whole MCM circuit, first by checking the minimum delays for each target coefficient and then
by optimizing the area with the critical path as a constraint. Finally, the results of these
modifications are shown and discussed in chapter 6. Chapter 7 draws the final conclusions of this thesis.
Before proceeding to the following chapters, readers who need to revisit some basic concepts
may consult the background material in appendix A, which covers the basic concepts needed
for the understanding of this thesis.
2 State of the art
To improve the performance of the multiplication operation, two main classes of MCM
techniques have been proposed in recent years: the parallel-MCM and the multiplexed-MCM (mux-
MCM). The first class is a single-input/multiple-output system, while the second is a
single-input/single-output system. In the latter, the multiplexing makes it possible to further
reduce the silicon area with respect to the parallel-MCM, at the expense of the time delay introduced by
the multiplexers. As a consequence, the mux-MCM is not treated here, since it has a
greater time-delay than the parallel-MCM. Accordingly, from here on, we will refer to the parallel-MCM
simply as MCM.
First, the binary arithmetic structures that will be used throughout the thesis are presented in the
next section. Then, to better understand how constant multiplication works on an MCM graph, a
brief introduction to Single Constant Multiplication (SCM) will be presented, followed by a revision
of the main concepts of MCM.
2.1 Binary Arithmetic Structures
In the presented research, the evaluation of the proposed MCM circuits will be conducted
by considering the silicon area and the propagation time of the components in the system. Since
the adder is the main expense in the MCM circuit, this section provides a brief overview of some
of the existing adder structures. On one side, there are adders that minimize the area; on the other,
those that minimize the time delay, at the expense of an increase in area. Hence, the
best compromise in this area-time trade-off depends on the specific requirements of the circuit
to be implemented. In this section, the CLA, the Sklansky Adder (SA) and the Ripple-Carry or
Ripple-Adder structure (RC) are studied in more depth.
Throughout this section, the presented overview heavily relies on Roma et al. [27]. In the
following, the symbol w denotes the bit-width of the input operands. Since the main aim is
speed optimization, we will look in particular at parallel structures (CLA and SA),
which will be compared with the straightforward serial structure (RC).
2.1.1 Ripple-Carry or Ripple-Adder structure
First, let us start by describing the Full Adder (FA). This circuit has three inputs
(the operands a and b, plus the carry-in cin) and two outputs (the sum output s and the carry-out cout).
The two outputs are given by the following expressions:
s = a ⊕ b ⊕ cin (2.1)
cout = a · b + a · cin + b · cin (2.2)
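Equations (2.1) and (2.2) can be checked bit-by-bit in software (an illustrative sketch, not a hardware description):

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder implementing equations (2.1) and (2.2)."""
    s = a ^ b ^ cin                          # s = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)   # majority of the three inputs
    return s, cout
```

For any input combination, 2·cout + s equals the arithmetic sum a + b + cin.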
The internal structure of the FA is given by the logic circuit in figure 2.1.
Figure 2.1: Full Adder structure (figure from [27]).
According to Tyagi's model [28], the time delays and area occupation are as follows:
T_FA^cout = T_AND2 + T_OR3 = 3
T_FA^s = 2 · T_XOR2 = 4
A_FA = 3 · A_AND2 + A_OR3 + 2 · A_XOR2 = 3 + 2 + 4 = 9
The above design only works with 1-bit operands. For wider operands, the RC reuses the
above circuit by cascading w FAs for a w-bit addition. This topology connects the carry-out of
one adder to the carry-in of the next, thus obtaining a cascaded serial structure. As will
be seen, when compared to the other adder structures this one is the slowest, since each FA must wait
for the carry-out bit of the previous FA. For this reason, this topology is usually referred to as
the Ripple-Carry or Ripple-Adder structure. The circuit corresponding to a w-bit RC is shown in figure 2.2.
Figure 2.2: Ripple Carry structure for w bits (figure from [27]).
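The same cascade can be mirrored in software (an illustrative bit-level sketch of the serial carry chain, not a hardware description):

```python
def ripple_carry(a: int, b: int, w: int, cin: int = 0) -> tuple[int, int]:
    """w-bit ripple-carry adder: w cascaded full adders, the carry-out
    of each stage feeding the carry-in of the next."""
    s, c = 0, cin
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i                  # sum bit of stage i
        c = (ai & bi) | (ai & c) | (bi & c)      # carry into stage i + 1
    return s, c  # w-bit sum and final carry-out
```

For example, adding 13 and 10 on 4 bits yields the truncated sum 7 with carry-out 1, since 13 + 10 = 23 = 16 + 7.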
According to Tyagi’s model, the time delays and area occupations are as follows:
2.1.2 Carry Look-Ahead Adder
The CLA adder is composed of a binary tree structure with ⌈log2 w⌉ hierarchical logic levels,
as seen in figure 2.3. It is generally faster than a RC, which has w hierarchical logic levels. In contrast
to the RC, where the main propagation delay comes from the carry bits that need to be serially
generated and propagated, in this topology the carry is calculated in parallel, by propagating the
signal through the tree. As figure 2.4 illustrates, the carry bits (ci) are computed according to the
following equation:
cj+1 = Gi,j + Pi,j · ci (2.3)
2. State of the art
T^cout_RC = w × (T_AND2 + T_OR3) = 3w
T^s_RC = (w − 1) × (T_AND2 + T_OR3) + T_XOR2 = 3(w − 1) + 2
A_RC = w × A_FA = 9w
Where:
G_i,k = G_j+1,k + P_j+1,k · G_i,j (2.4)
P_i,k = P_i,j · P_j+1,k (2.5)
G_i,k is the carry signal generated from inputs i to k and P_i,k is the propagation signal generated
from the same inputs, for i ≤ j < k. These two signals can be computed from G_i,i = g_i = a_i · b_i and
P_i,i = p_i = a_i + b_i, which are defined in figure 2.4a.
According to figure 2.4b, the sum vector can be obtained with:
si = ai ⊕ bi ⊕ ci (2.6)
The carry-out signal can be calculated with:
c_w = G_0,w−1 + P_0,w−1 · c_0 (2.7)
Figure 2.3: Carry look-ahead adder structure for 8 bits (figure from [27]).
According to Tyagi's model, the time delay and area occupation models are as follows:
T^cout_CLA = 2 · ⌈log2 w⌉ + 3
T^s_CLA = 4 · ⌈log2 w⌉ + 1
A_CLA = 11 · w̄ − 3
In these equations, w̄ = 2^⌈log2 w⌉ denotes the smallest integer power of two such that w ≤ w̄.
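The recursive construction of the group generate/propagate signals (eqs. 2.3 to 2.7) can be sketched as follows. This is an illustrative Python model with names of our own choosing, not the thesis' implementation:

```python
def group_gp(a, b, i, k):
    """(G_{i,k}, P_{i,k}) for the bit range [i, k], per eqs. (2.4)-(2.5)."""
    if i == k:  # leaf: G_{i,i} = a_i * b_i, P_{i,i} = a_i + b_i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        return ai & bi, ai | bi
    j = (i + k) // 2  # split the range, as in the binary tree of figure 2.3
    g_lo, p_lo = group_gp(a, b, i, j)
    g_hi, p_hi = group_gp(a, b, j + 1, k)
    return g_hi | (p_hi & g_lo), p_lo & p_hi

def cla_add(a, b, w, c0=0):
    """w-bit CLA: c_{i+1} = G_{0,i} + P_{0,i}*c0 (eq. 2.7), s_i per eq. (2.6)."""
    s, c = 0, c0
    for i in range(w):
        s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ c) << i
        g, p = group_gp(a, b, 0, i)
        c = g | (p & c0)
    return s, c
```

The recursion depth is ⌈log2 w⌉, mirroring the logarithmic delay of the CLA tree.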
Due to the binary tree structure adopted by this CLA, under certain circumstances this
structure can lead to a waste of hardware resources. Another fact to notice is that the
slowest signal to be processed is the sum output, while the carry-out takes roughly half of this
time. In the next section, we will present a faster adder, the SA.
Figure 2.4: Carry look-ahead adder blocks: (a) Block A descending; (b) Block A ascending; (c) Block B descending; (d) Block B ascending (figures from [27]).
2.1.3 Sklansky Adder
The use of prefix-adder structures [30] is common in the majority of today's processors.
One such case is the topology proposed by Sklansky [26]. An example of this struc-
ture is shown in figure 2.5. The prefix architectures are based on the fact that the outputs
(yw−1, yw−2, . . . , y0) are computed from the w-bits inputs (xw−1, xw−2, . . . , x0) by means of an
associative generic binary operator ?:
yi = xi ? yi−1 ; i = 1, 2, . . . , w − 1 (2.8)
with y0 = x0.
Figure 2.5: Sklansky prefix-adder structure for 8 bits (figure from [27]).
Since the outputs depend on all the prior inputs, any change on them will influence the final
value. Moreover, the associative property of the ⋆ operator enables the operations to be executed
in any arbitrary order. Hence, by grouping together subsets of operands, it becomes possible to
compute parallel partial solutions corresponding to subsets of input bits. We will denote
these groups as Y_i,k. Each parcel Y_i,k will be processed at a different level, giving rise to
m = log2(w) + 2 intermediate levels needed to compute the full solution. The variable Y^l_i,k stands
for the output of the operation that takes a subset of input bits (x_k−1, x_k−2, . . . , x_i) at level l. The last
level m will comprise the entire range of input bits, from 0 to i (Y^m_0,i), leading to the final result.
Y^0_i,i = x_i (2.9)
Y^l_i,k = Y^l−1_i,j ⋆ Y^l−1_j+1,k , i ≤ j < k ; l = 1, 2, . . . , m (2.10)
y_i = Y^m_0,i , i = 0, 1, . . . , (w − 1) (2.11)
Figure 2.6: Sklansky prefix-adder blocks: (a) Pre-processing block; (b) Prefix computational unit; (c) Prefix neutral unit; (d) Post-processing block; (e) Pre-processing block of the least significant bit (figures from [27]).
Sklansky improved this architecture by adopting a tree-type model that minimizes the pro-
cessing time of the output bits. This allows all intermediate signals to propagate through the tree
in parallel, feeding the higher-level bits that need those signals. One characteristic in common with
the CLA is the use of intermediate generate and propagate signals, which assume three different
meanings:
• Generation of the carry signal at logic level ‘0’ (or deactivation of the carry signal at logic
level ‘1’);
• Generation of the carry signal at logic level ‘1’;
• Propagation of the carry signal.
We name the sets of these signals as the generation group (G^l_i,k) and the propagation group (P^l_i,k).
They are used to calculate Y^l+1_i,k = (G^l+1_i,k, P^l+1_i,k).
The first level signal pair corresponds to the generation bit gi and the propagation bit pi, which are
determined from the input operands on a pre-processing block (see figure 2.6a) by the following
equations: gi = aibi and pi = ai + bi. This signal pair is propagated to the higher levels as follows
(figure 2.6b):
(G^l_i,k, P^l_i,k) = (G^l−1_j+1,k + P^l−1_j+1,k · G^l−1_i,j , P^l−1_i,j · P^l−1_j+1,k), (2.12)
i ≤ j < k, l = 1, 2, . . . , m
Meanwhile, the carry bits of the carry vector can be computed from the last level signals
(G^m_0,i, P^m_0,i) according to eq. 2.13. The sum bits are computed in a post-processing operation (see
figure 2.6d), by using eq. 2.14:
c_i+1 = G^m_0,i + P^m_0,i · c_i, i = 0, 1, . . . , (w − 1) (2.13)
s_i = p_i ⊕ g_i ⊕ c_i, with c_0 = c_in (2.14)
Since the first element receives a carry signal that comes from the input of the adder, we need to
include it in the pre-processing block. In this case, we can implement it as in figure 2.6e (notice
the similarities with the RC seen in section 2.1.1):
g^0_0 = a_0 · b_0 + a_0 · c_0 + b_0 · c_0 (2.15)
Finally, the block in figure 2.6c represents the direct connection, taking a signal as operand and
transmitting it to the next level without any processing.
According to Tyagi's model [28], the time delays and area occupations are as follows:
T^cout_SA = 2 · log2 w̄ + 5
T^s_SA = 2 · log2 w̄ + 5
A_SA = (3/2) · w̄ · log2 w̄ + 4 · w̄ + 5
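The Sklansky prefix network of figure 2.5 can be sketched as below: the pre-processing block computes (g_i, p_i), eq. (2.12) is applied at each of the log2 w tree levels, and the post-processing block produces the sum bits. This is an illustrative Python model assuming w is a power of two; the names are ours:

```python
import math

def sklansky_add(a, b, w, cin=0):
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(w)]  # g_i = a_i * b_i
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(w)]  # p_i = a_i + b_i
    g[0] |= p[0] & cin  # LSB pre-processing block, eq. (2.15)
    G, P = g[:], p[:]
    for l in range(int(math.log2(w))):   # Sklansky tree levels
        for i in range(w):
            if (i >> l) & 1:             # nodes that combine at this level
                j = ((i >> l) << l) - 1  # boundary node of the lower block
                G[i] |= P[i] & G[j]      # eq. (2.12), generate part
                P[i] &= P[j]             # eq. (2.12), propagate part
    s = 0
    for i in range(w):                   # post-processing: s_i = a_i ^ b_i ^ c_i
        ci = cin if i == 0 else G[i - 1] # after the scan, G[i-1] equals c_i
        s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ ci) << i
    return s, G[w - 1]                   # sum and carry-out
```

Unlike the recursive CLA sketch, all combinations at a given level are independent, which is what allows the hardware to evaluate them in parallel.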
2.1.4 Modified Full Adder
A special implementation of the FA, the Modified Full Adder (MFA), was developed in [2] with
the purpose of area minimization in a hybrid adder. The idea behind this work is to minimize area
in the subtraction part of an MCM, by using a decrementer.
In a FA block, given that one of the inputs is constant and equal to one, the sum and the carry-out (cout)
become functions of the remaining input ui and the carry input (cin), given by: sum = cin ⊕ ui and cout = cin + ui.
This implementation will be used later in this thesis.
2.1.5 Reconfigurable architecture for Increment and Decrement
The increment/decrement circuit is a structure that can add/subtract one to/from a given number. In
this work, it is a common building block in many architectures, due to its simplicity when compared to a
full adder. In this section, the work done by Kumar et al. [20] will be studied. The architecture will be
called RID. Note that the full architecture of [20] implements four functions: increment, decrement,
Two's Complement (TC) and priority encoding. Of these four, we will only use the increment and
the decrement. Thus, the architecture used for our purpose is simpler, faster and uses less silicon
area than the original one.
The possibility of merging multiple architectures into one comes from:
1. similarities between the implementation of TC and decrementer circuit;
2. the TC circuit can be used as an incrementer by complementing the input.
For example, to find the decremented output of the binary number 11011100, we complement
the input bits up to and including the occurrence of the Least Significant One Bit (LSOB). After the
occurrence of the LSOB, all the remaining output bits are kept unchanged, thus giving the binary number
11011011, where the underlined bits are the complemented ones.
To find the incremented value of 11011100, we first complement the value, resulting in 00100011.
Next, the TC is calculated, giving 11011101 and thus the incremented value of the initial
number.
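The two bit-manipulation rules above can be sketched in Python. This is an illustrative model of the behaviour, not of the circuit; the function names are ours:

```python
def decrement_lsob(x):
    """Decrement x > 0 by complementing every bit up to and including the LSOB."""
    lsob = x & -x            # isolates the Least Significant One Bit
    mask = (lsob << 1) - 1   # covers the LSOB and all bits below it
    return x ^ mask          # complemented low bits; higher bits unchanged

def increment_via_tc(x, width=8):
    """Increment x by complementing it and taking its Two's Complement (TC)."""
    full = (1 << width) - 1
    comp = x ^ full          # bitwise complement of the input
    return -comp & full      # TC of the complement = x + 1 (mod 2**width)
```

Running decrement_lsob(0b11011100) gives 0b11011011 and increment_via_tc(0b11011100) gives 0b11011101, matching the worked examples above.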
In both of these examples, the authors of [20] identified the common characteristics between
the different functions needed. In our specific case, the increment and decrement cases differ in
the input complement and in finding the LSOB, respectively. Both of these circuits have the TC in
common. Figure 2.7 represents the original structure that implements the increment, decrement,
TC and priority encoder functions. The inputs of the system are Z, Cnt1 and Cnt0. The output is O.
The signals I and D are intermediate calculation signals. The function is selected with the help of
the control signals Cnt1 and Cnt0, as given by table 2.1.
Figure 2.7: Original scheme of the RID [20] (Input Selection Block, Decision Block and Output Selection Block, with input Z, output O, control signals Cnt1/Cnt0 and intermediate signals I and D).
Table 2.1: Original control signals
Cnt1 Cnt0 | Operation performed
0    0    | Increment
0    1    | Decrement
1    0    | TC
1    1    | Priority Encode
To adapt the structure to our needs, some simplifications are possible. These simplifications
are associated with the functions needed later in this thesis. For this purpose, we will only keep
the increment/decrement functions, associated with the control signal Cnt1 = 0. The modifications
are:
• in the input block, the NOR gate is changed into a NOT gate with input Cnt0 only;
• in the output block, the multiplexer with Cnt1 as select signal is not needed;
• still in the output block, the NAND gate that has inputs Cnt0 and the negated Cnt1 disap-
pears: Cnt0 · ¬Cnt1 = Cnt0 · ¬0 = Cnt0 · 1 = Cnt0.
Figures 2.8 and 2.10 present the schemes of the modified input and output selection blocks.
Moreover, the structure used here is the proposed prefix-based type II, also known as Sklansky,
shown in figure 2.9. This decision architecture was selected due to its low latency when compared
to the other architectures presented in [20].
Figure 2.8: Modified scheme of the RID input selection block.
Figure 2.9: Scheme of the RID decision block.
Figure 2.10: Modified scheme of the RID output selection block.
With these modifications, the circuit has less latency and occupies less area. Further ahead
in this chapter, a benchmark will compare this architecture with other existing ones.
Discussion
We presented an overview of different adder implementations: one serial and two parallel
topologies. From the theoretical perspective, the parallel adders are faster than the serial one,
but they also occupy more silicon area. From the point of view of this thesis, low delay is more
important than silicon area. Therefore, we will focus on the faster parallel topologies and choose
whichever is best suited for the hybrid architecture presented in this thesis.
2.2 Single Constant Multiplication
The implementation of the SCM operation is based on a series of additions and shifts that together
multiply a given number by a constant coefficient. For example, if we want to compute x × 8 =
x × 2^3, this operation can be trivially implemented as a left shift of three bits:
00001₂ (1₁₀) << 3 = 01000₂ (8₁₀)
In the following, we will adopt the subscripts 2 and 10 to distinguish between the binary and decimal
representation systems, respectively. By default, we use the decimal representation. If we replace
the input operand x by the value 3, we obtain: 00011₂ (3₁₀) << 3 = 11000₂ (24₁₀).
By extending this trivial example with more adders and shifts, we can implement any multiplication
operation. As another simple example, the multiplication of x by the constant 10 can be attained
by adding x << 2 with x, followed by one final left shift:
x << 2 = 4x
x << 0 = x
[4x + x] << 1 = 10x
In this notation, x represents the primary input. The middle terms are called partial terms or partial
products. In this case, 4x is the only one. 10x is called the constant coefficient which we want to
generate.
The decomposition of a coefficient may lead to different ways of implementing the SCM. For
example, we can also obtain the above multiplication as:
x << 3 = 8x
x << 1 = 2x
8x + 2x = 10x
or even:
x << 1 = 2x ; x << 0 = x ; x << 1 = 2x
[2x + x + 2x] << 1 = 10x
From the above example, we observe that the implementation 8x + 2x will use one adder and
two shifts, while 2x + x + 2x will use two adders and three shifts. The task of choosing which
implementation to use is simple for small constants. However, as the magnitude of the constants
grows, so does the complexity. Furthermore, the choice of the best implementation
also depends on the specifications of the hardware that will be used for the adders.
Another aspect that we will take into account is the number of operations in series, which
we call adder-steps. The maximum number of adder-steps defines the delay of the coefficient
computation, i.e. as the number of adder-steps increases, so does the time delay of that particular
coefficient. An instance of this is shown in figure 2.11, where the implementation 23x = 2^4·x +
[2^2·x + [2^1·x + x]] has three adder-steps, whereas 23x = [2^4·x + 2^2·x] + [2^1·x + x] has only two
adder-steps [3].
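The two decompositions of 23x in figure 2.11 can be written as shift-and-add code. This is an illustrative Python sketch; the function names are ours:

```python
def mul23_three_steps(x):
    """23x = 2^4 x + [2^2 x + [2^1 x + x]]: three adders in series."""
    t1 = (x << 1) + x     # 3x   (adder-step 1)
    t2 = (x << 2) + t1    # 7x   (adder-step 2)
    return (x << 4) + t2  # 23x  (adder-step 3)

def mul23_two_steps(x):
    """23x = [2^4 x + 2^2 x] + [2^1 x + x]: two parallel adders, then one more."""
    a = (x << 4) + (x << 2)  # 20x (adder-step 1)
    b = (x << 1) + x         # 3x  (adder-step 1, in parallel with the above)
    return a + b             # 23x (adder-step 2)
```

Both versions use three adders, but the second arranges them in only two serial adder-steps, hence a shorter critical path.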
Figure 2.11: SCM structure to compute 23x: (a) 3 adder-steps; (b) 2 adder-steps.
To solve the problem of optimizing the number of adder-steps in the implementation, the
MSD or CSD representations have frequently been used to provide implementations with a
minimum number of adders/subtracters. As an example, the value 55 can be represented as:
0110111₂ → (1 0 0 −1 0 0 −1)_CSD (55₁₀). In this case, we observe that the binary representation has 5
ones, while the CSD representation has only 3 non-zero digits. What this means is that if the SCM is
designed by considering the binary representation, then at least 3 adder-steps are needed:
0110111₂ = (32₁₀ + 16₁₀) + ((4₁₀ + 2₁₀) + 1₁₀).
In contrast, if the implementation is based on the CSD representation, only 2 subtracters are
needed, with two adder-steps: (1 0 0 −1 0 0 −1)₂ = (64₁₀ − 8₁₀) − 1₁₀.
When there is more than one constant multiplying the input, the resulting circuit is
called an MCM. The next section will review different MCM implementation techniques. In general,
the resolution of the MCM problem does not reduce to the resolution of multiple SCM problems.
2.3 Multiple Constant Multiplication
The main principle that has driven the MCM problem is the reduction of silicon area
through partial term sharing. It is not a simple problem: it was even proved to be NP-complete
[10]. This section will focus on the state-of-the-art methods to decrease the needed hardware. Nat-
urally, the implementations that share more partial terms are usually the ones that implement the
MCM more efficiently.
To illustrate this principle, consider the implementation of the multiplication of a given input x
by two multiplicand terms 7₁₀ = 00111₂ and 11₁₀ = 01011₂, trying to share as many adders
Figure 2.12: Multiple constant multiplication with the terms 7 and 11: (a) computation of 7x; (b) computation of 11x; (c) simultaneous computation of 7x and 11x.
as possible. In this case, we can identify a common pattern in the binary representations of these
multiplicands: the sequence of two ones, 11₂ = 3₁₀. Once identified, the desired operations are
easily implemented by using an adder that computes 3x. The complete MCM will integrate three
adders instead of four, as seen in figure 2.12. This method is called Common Sub-Expression
sharing (CSE) and its principle consists in identifying bit-patterns across different implementations,
thus reducing the number of adders.
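The sharing of the pattern 11₂ = 3₁₀ between the two coefficients can be sketched as follows. This is an illustrative Python sketch; the function name is ours:

```python
def mcm_7_11(x):
    """Compute 7x and 11x while sharing the partial term 3x (pattern 11_2)."""
    t3 = (x << 1) + x                    # shared adder: 3x
    return (x << 2) + t3, (x << 3) + t3  # 7x = 4x + 3x, 11x = 8x + 3x
```

Three adders are used in total, instead of the four needed by two independent SCM blocks.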
Another way to reach this result is by looking at the partial products, as seen in section 2.2,
with the use of the CSD and MSD representations. However, other methods also exist.
Hence, the main aim of this section is to review the most used methods optimized for MCM
circuit design and to describe how the area reduction can be achieved. Some algorithms that
derive MCM graphs will be examined. In particular, the work proposed by Aksoy et al. [3] revealed
some performance improvements on the automation of the selection of the best partial terms.
Other possible algorithms are the Bull-Horrocks Algorithm (BHA) [8], the Bull-Horrocks Modified
algorithm (BHM) [12], the N-dimensional Reduced Adder Graph (RAG-N) [12], the Cumulative
Benefit Heuristic (Hcub) [29] and ASSUME-A (delay constraint) [14].
2.3.1 Bull-Horrocks Algorithm
The Bull-Horrocks Algorithm (BHA) was developed as a graph-based algorithm for the design
of MCM architectures for Finite Impulse Response (FIR) filters. The technique, developed by Bull
and Horrocks [8], represents the output coefficients in a graph, by building it from the primary
input to the target coefficients.
The algorithm works as follows. In a pre-processing phase, the target coefficients are sorted into
ascending magnitude order. Then, they are scaled such that all coefficients are non-negative numbers.
The processing phase consists of the generation of partial terms and their incorporation in the
graph, until all the target coefficients are part of it. To keep track of the approximate distance
between the target coefficients and the currently implemented graph, the algorithm defines a positive
error variable, given by e = h(m)/L, where h(M) is the target set of coefficients, h(m)
is coefficient number m ∈ M and L represents the largest power of two which divides h(m)
to give an integer result. As the algorithm proceeds, it creates sub-graphs from each vertex of
the graph, forming the powers of two of these vertex values, called the partial sum set. Note that the
maximum value of the partial sum set never exceeds max(h(m)). Its main cycle works as follows:
1. For each target coefficient not yet implemented, the algorithm tries to find appropriate partial
terms, so that the error variable is minimized. The partial terms are built with the terms
already present in the graph.
2. If the error is zero, the coefficient has been reached and is added to the graph.
(a) If all coefficients are implemented, the algorithm finishes.
(b) Else, change to another coefficient and go back to 1.
3. If the error is not zero, then add the partial term to the graph and go back to 1.
Figure 2.13: BHA graph representation of the set {1, 7, 16, 21, 33} (figure from [12]).
For example, consider the design of a multiplier by the set of terms w[n] = {1, 7, 16, 21, 33}.
The construction of the graph is shown in figure 2.13. The vertexes represent adders, while the
values over the vertexes are equal to the weight of the primary input at that vertex (i.e. a vertex
with value k represents the operation k × x[n]). The values over the edges represent the shift
operation on the vertex it originates from (i.e. an edge originating from y with value l represents the
operation y × l ⇔ y << n, with l = 2^n and n ≥ 0). Take as example the vertex 7 from figure 2.13; it
is obtained as follows: 7x = 4 × 1x + 1 × 3x, with ltop = 4, lbot = 1, ytop = 1 and ybot = 3.
The graph starts with the lowest valued coefficient, 1. Its processing is trivial, since the vertex x[n]
is equal to the vertex 1. From there, the algorithm processes coefficient 7: it finds out that the
term 3 minimizes the error. Therefore, it adds 3 to the graph as 3 = 1 + (1 << 1). Once 3 is in the
graph, the algorithm can implement 7. The processing of the coefficient 16 is also trivial, since it
is a simple power of two. The rest of the terms follow the same logic.
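One shift-and-add decomposition consistent with the target set of figure 2.13 can be evaluated as below. This is an illustrative Python sketch; the decompositions of 21 and 33 are our assumption, not necessarily the exact edges of the thesis' graph:

```python
def bha_example(x):
    """Evaluate one possible BHA-style graph for the set {1, 7, 16, 21, 33}."""
    t1 = x                 # coefficient 1: the primary input itself
    t3 = (x << 1) + x      # partial term 3 = 1 + (1 << 1), added to reach 7
    t7 = (x << 2) + t3     # 7 = 4 + 3
    t16 = x << 4           # 16: a trivial power of two, no adder needed
    t21 = t7 + (t7 << 1)   # 21 = 7 + 14   (assumed decomposition)
    t33 = (x << 5) + x     # 33 = 32 + 1   (assumed decomposition)
    return t1, t7, t16, t21, t33
```

With x = 1 the function returns the target set itself, confirming that every coefficient is reachable from previously built terms.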
Notice one relevant property of the BHA: the terms can only be implemented with
smaller valued terms. The next algorithm solves this limitation.
2.3.2 Bull-Horrocks Modified algorithm
The BHA [8] involves the creation of a set of possible partial sums from the existing
vertex values of the graph, from which new vertex values are synthesized. New vertexes are
then created until the set of constant coefficients is fully synthesized. An improved version of the
algorithm, the Bull-Horrocks Modified algorithm (BHM), was developed in [9] to alleviate the BHA
limitations. Some of its distinguishing characteristics with respect to the BHA are enumerated next:
1. In the BHA, the partial sums are generated with values only up to, but not exceeding, the
coefficient under consideration. Hence, the error is always positive. On the other hand, the
BHM may generate partial sum pairs above the maximum coefficient value. Hence, the error
can have both positive and negative values. At the implementation level, this leads to the
usage of subtracters, besides the usual adders.
2. In the BHA algorithm, even-valued partial sums are allowed in the partial sum set. In con-
trast, the BHM reduces each partial sum by factors of 2 until it becomes odd. Only then does it
generate its power-of-2 multiples. Hence, all the target coefficients are represented as pos-
itive and odd numbers. This reduces the MCM problem formulation to the processing of
positive and odd terms, which will be referred to as fundamentals from now on.
3. In the BHA, the coefficients are processed in ascending magnitude order. As with the partial
sums, the BHM algorithm first transforms all the coefficients into fundamentals. Only then does it
process them in ascending order.
Due to the first enumerated property, the wider range of allowed values makes it possible to take full
advantage of CSD-like features (e.g., by using 7 = 8 − 1 rather than 7 = 4 + 2 + 1) to achieve
more efficient implementations. Due to the second and third properties, even terms are simplified
into fundamentals, thus maximizing the number of partial sums available without increasing the
error variable. Take 48 as an example: with this approach, it can be simplified as 3 × 2^4, and the
partial sum set is then only composed of the fundamental 3 and all its power-of-two multiples.
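The reduction of a coefficient to its fundamental (properties 2 and 3 above) can be sketched as follows. This is an illustrative Python sketch; the function name is ours:

```python
def to_fundamental(c):
    """Reduce a coefficient to its fundamental: a positive, odd number."""
    c = abs(c)         # fundamentals are positive
    while c % 2 == 0:  # strip factors of two (they become output shifts)
        c //= 2
    return c
```

For instance, to_fundamental(48) returns 3, matching the 48 = 3 × 2^4 example above.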
Even though this algorithm usually leads to optimal graphs, it is essentially a heuristic
based on the CSD representation. The following algorithm tackles this problem in further detail.
2.3.3 N-dimensional Reduced Adder Graph
The N-dimensional Reduced Adder Graph (RAG-N) algorithm, proposed by Dempster et al.
[12], is a graph-based algorithm that integrates both optimal and suboptimal methods. It consists
of three decision phases that aim at the area optimization of the MCM: the first is an exact
method, the second a heuristic and the third an arbitrary choice.
In the first and exact part, the algorithm makes use of two pre-computed lookup tables from
the Minimised Adder Graph (MAG) algorithm [11]: one for the coefficient implementation cost,
covering 1 to 4096 (the upper limit is due to computational complexity), and another
containing the sets of fundamental pairs that map the optimum SCM implementation of
each of the coefficients in the cost table. If the target coefficient set is already fully synthesized
in the exact part of the algorithm, then this is the optimal solution with the lowest cost for that
set of coefficients (for a proof of this statement, refer to [12]). For the heuristic part, this algorithm
makes use of an improved cost metric (described in detail in [12]), which can be briefly explained
as follows: the adder distance of a new vertex from an existing graph is defined as the number of
extra adders (not in the graph) needed to reach a target vertex. The heuristic part aims to find the
minimal adder distance between the graph created by the optimal part of the algorithm and the
remaining terminating vertexes. Both the adder distance and the cost will influence the outcome
of the graph. Finally, the arbitrary part is used when both the optimal and heuristic parts fail, and
consists of an arbitrary choice. Each time a coefficient is added to the graph, it is removed from
the incomplete set. The complete procedure is described as follows:
1. In a pre-processing phase, the target coefficients are reduced to fundamentals and are
added to the incomplete set. After evaluating the coefficients' costs using the cost lookup
table mentioned above, all cost-0 fundamentals (powers of two, which reduce to
1) and repeated fundamentals are removed from the incomplete set. All the fundamentals
that are already implemented are added to the graph set and removed from the incomplete set.
2. The optimal part essentially synthesizes all target coefficients within the range of 1 to 4096,
together with their power-of-two multiples. The cost-1 coefficients in the incomplete set are im-
plemented using the lookup table mentioned above. As new fundamentals are implemented,
they are inserted in the graph set and the algorithm processes the remaining unimplemented
coefficients by examining pairwise multiples of the fundamentals in the graph set. If all the
targets are synthesized this way, then the solution is optimal, since all the fundamentals have
minimum cost.
3. If the optimal part cannot reach any more coefficients, then the heuristic part tries to find the
minimum adder distance needed to reach the terminating vertexes or target coeffi-
cients. Distance-2 vertexes are considered but, since the adder distance calculation is not an
exhaustive search, the heuristic part is suboptimal.
4. If the heuristic part cannot reach any fundamental in the incomplete set, then an arbitrary
choice is made: the fundamental with the lowest magnitude is selected and added to the
graph set. Once a new term is added, the algorithm goes back to the optimal part.
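The pre-processing phase (step 1 above) can be sketched as follows. This is an illustrative Python sketch; the cost evaluation against the MAG lookup table is omitted, and the function name is ours:

```python
def ragn_preprocess(targets):
    """Reduce targets to fundamentals; drop cost-0 (powers of two) and repeats."""
    incomplete = set()
    for t in targets:
        t = abs(t)
        while t % 2 == 0:  # powers of two reduce to the fundamental 1
            t //= 2
        incomplete.add(t)  # a set discards repeated fundamentals for free
    incomplete.discard(1)  # cost-0 fundamentals need no adder
    return sorted(incomplete)
```

For the running example {1, 7, 16, 21, 33}, this leaves [7, 21, 33] in the incomplete set.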
The presented algorithm is suboptimal for coefficients outside the lookup table. In practice,
there is an arbitrary choice of the synthesized fundamental and not all optimal graphs are consid-
ered (the lookup tables are precomputed and can only be computed up to 19 bits [16], due to the
involved computational complexity). Even so, the RAG-N algorithm provides better solutions than
the BHM and BHA algorithms. Figure 2.14, compared to figure 2.13, illustrates this: RAG-N uses
three adders while BHM and BHA use four.
Figure 2.14: RAG-N graph representation of the constraint set {1, 7, 16, 21, 33} (figure from [12]).
2.3.4 Cumulative Benefit Heuristic
The Cumulative Benefit Heuristic (Hcub), presented by Voronenko in [29], is a graph-based
algorithm heavily influenced by the RAG-N described in section 2.3.3. The main difference in the
heuristic part is the usage of improved distance estimators and the removal of the arbitrary choice
part. Another change is observed in the algorithm organization: the fundamentals are added one
at a time to the graph.
To better frame the competing algorithms, the author of [29] defines a framework for
MCM algorithms: the target set T ⊂ N, composed of the coefficients t ∈ T to implement;
the ready set R ⊂ N, which contains the implemented terms (at the termination of the algorithm,
it is the solution); the successor set S ⊂ N, which contains all terms at distance 1 from the set R;
and finally the working set W ⊂ N, which contains the newly synthesized terms at each iteration.
Furthermore, some definitions are presented to generalize some concepts introduced by the
author. A brief overview of these definitions is exposed here; for further detail, please refer to [29].
The A-operation is defined as an operation on fundamentals, either an addition or a subtraction,
combined with an arbitrary number of shifts which do not truncate non-zero bits of the fundamental.
The A-operation takes two integer inputs u, v (fundamentals) and produces one output fundamental,
according to the following parameters: let l1, l2 ∈ N* be the left shifts of the two input operands,
r ∈ N* the right shift at the output, and s ∈ {0, 1} the sign that dictates whether the
operation is an addition or a subtraction:
Ap(u, v) = |(u << l1) + (−1)^s (v << l2)| >> r (2.16)
= |2^l1 · u + (−1)^s · 2^l2 · v| · 2^−r (2.17)
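Eq. (2.16) translates directly into code. This is an illustrative Python sketch; the function name is ours:

```python
def a_op(u, v, l1, l2, r, s):
    """A-operation A_p(u, v): shifted addition/subtraction of two fundamentals."""
    return abs((u << l1) + (-1) ** s * (v << l2)) >> r
```

For instance, a_op(1, 1, 3, 0, 0, 1) computes |8 − 1| = 7, the CSD-style implementation of 7 mentioned earlier.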
Derived directly from the A-operation comes the A-configuration. It is the set of all possible
outputs (excluding the inputs themselves) of an A-operation with fixed inputs u and v under different
configurations p:
A*(u, v) = {Ap(u, v) | p is a valid configuration} − {u, v}
This definition can be extended to sets of inputs:
A*(U, V) = ( ∪_{u∈U, v∈V} A*(u, v) ) − U − V
Moreover, all constants can be divided into complexity classes. For that purpose, we denote by Cn the
set of all constants with complexity n; in other words, an optimal SCM solution of a constant in Cn
requires exactly n A-operations. For example, C0 = {2^a | a ≥ 0}, because precisely the power-of-two
constants require a single left shift and no additions/subtractions.
If cn ∈ Cn is a constant, the A-distance is defined as the minimum number of extra A-operations
required to obtain cn, given R. Computing the A-distance is an NP-complete problem; therefore,
the use of an estimation is adequate for higher-order A-distance computations. This notion corre-
sponds to the ”adder distance” of [12], described in section 2.3.3.
The different parts of the Hcub algorithm can be summarized as follows:
Figure 2.15: Distance cases handled by the Hcub algorithm: (a) distance 1, optimal; (b) distance 2, exact heuristic; (c) distance 3, exact heuristic; (d) distance ≥ 4, heuristic estimation. Solid disks denote available sets and dashed circles denote the sets that are not computed yet (figures from [29]).
• In a pre-processing phase the Hcub algorithm transforms all input coefficients into funda-
mentals.
• For each target in T not yet synthesized, the Hcub algorithm incrementally builds the successor set S
and the ready set R at each iteration, by adding the constants from set W.
– The optimal part of the algorithm synthesizes the targets found in the successor set,
as portrayed in figure 2.16a. In that figure, the equation t = Ap(r0, r1) means that, to
reach target coefficient t, an operation with operands r0 and r1 is needed. Figure 2.15a
demonstrates this procedure. If no more targets are found in S, the optimal part cannot
synthesize any more constants. This also means that the targets are more than one
A-distance away.
– Then, the heuristic part is executed. As can be seen in figures 2.15b, 2.15c and 2.15d,
the heuristic starts at distance 2, with Sd denoting the successor set at distance d. It
is separated in two sub-parts: for 1 < A-distance ≤ 3, exact tests are used; for
A-distance > 3, a distance estimator is used.
Figure 2.16: Graph topologies for optimal and exact distance tests: (a) distance-1 topology; (b) distance-2 topologies; (c) distance-3 topologies (figures from [29]).
1. For the exact calculation of the A-distance, the algorithm considers all possible
topologies that synthesize t using exactly d A-operations (each A-operation is
equivalent to one unit of A-distance). These situations are illustrated in figures 2.16b
and 2.16c. In figure 2.16b, the equation t = c1·s means that target coefficient t is
reached by multiplying a single input s by a complexity-1 constant c1. The explanation of
figure 2.16c follows the same logic. The exact test intersects A* with the
successor set S (refer to [29] for further details about the intersection that takes
place in each of the cases of figure 2.16). This is done up to A-distance ≤ 3, due
to the complexity involved for higher distances.
2. When the target is more than 3 A-distances away (see figure 2.15d), a distance estimation
is done: the successors are added to R based on a cumulative benefit estimation
(hence the ”cub” in Hcub). To achieve its goal, it bases the estimation on a ben-
efit function B that quantifies to what extent adding a successor s to the ready set
R improves the distance measure to an arbitrary target t. The benefit function fun-
damentally finds the cheapest intermediate value based on a CSD cost estimation.
Such a function is used to calculate the cumulative benefit over all targets in T. From
[29]:
Hcub(R, S, T) = arg max_{s∈S} ( Σ_{t∈T} B(R, s, t) )
where B is the weighted benefit function. This form of calculating the next suc-
cessor jointly optimizes all targets without actually synthesizing any target. However,
intermediate terms are synthesized, making the set R ”closer” to the solution.
• The final solution is found when all targets from set T are in set R.
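To make the cumulative-benefit selection concrete, the following Python sketch implements one greedy Hcub-style successor choice. The CSD cost function is the classical one, but `distance_estimate` is only a crude stand-in for the A-distance estimator of [29]; the function names, the shift bound and the benefit definition are our own illustrative choices, not part of the original algorithm.

```python
def csd_cost(n):
    """Number of non-zero digits in the canonical signed-digit (CSD)
    representation of n -- a classical estimate of shift-and-add cost."""
    n, cost = abs(n), 0
    while n:
        if n & 1:
            cost += 1
            # CSD rule: a ...11 tail becomes ...10(-1), so round up
            n = n + 1 if (n & 3) == 3 else n - 1
        n >>= 1
    return cost

def distance_estimate(ready, t, max_shift=8):
    """Crude proxy for Hcub's distance estimator (NOT the estimator of
    [29]): cost of t alone in CSD, or one adder combining a ready value
    (possibly shifted) with the cheapest CSD remainder."""
    best = csd_cost(t)
    for rv in ready:
        for k in range(max_shift + 1):
            best = min(best, 1 + csd_cost(abs(t - (rv << k))))
    return best

def pick_successor(ready, successors, targets):
    """One greedy step: choose the successor maximizing the cumulative
    benefit over all targets (the arg-max expression above), where the
    benefit of s for target t is the drop in estimated distance when s
    joins the ready set."""
    def benefit(s, t):
        return distance_estimate(ready, t) - distance_estimate(ready | {s}, t)
    return max(successors, key=lambda s: sum(benefit(s, t) for t in targets))
```

For example, with ready set {1} and target 45, the successor 45 itself yields a larger cumulative benefit than the intermediate 3, so it is picked first by this simplified rule.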
One limitation of this algorithm is the impossibility of customizing it in terms of the weight of the implementation cost (i.e., silicon area): each operation is considered to have a fixed unitary cost. The algorithm presented next addresses this problem for area estimation.
2.3.5 Unconstrained Area optimization - ASSUME-A
Flores et al. [14] proposed a new approach to the problem, using a SAT-based 0-1 ILP engine to solve a set of constraints and optimization functions. Later, Aksoy et al. [2] extended the formalization by adding the design of a custom adder/subtracter structure with gate-metric weights, in order to decrease the area occupation. The significant number of shifts present in a MCM structure, combined with the optimization of the adder structures, enabled the improvement of the existing frameworks. Furthermore, this algorithm incorporated the usage of the MSD representation for partial term generation, as proposed by Park et al. [22], enabling the consideration of a greater set of possible implementations for a given coefficient. In practice, the usage of the MSD representation allows a more sensible exploration of the search space. On the one hand, it restricts the search to the minimum number of adder-steps, thanks to the MSD property of minimizing the number of non-zero symbols. On the other hand, it generates one or more full decompositions of each term.
The general principles of the Unconstrained Area optimization (ASSUME-A) algorithm by Aksoy [3] can be described as follows:
1. After transforming all target coefficients into unique fundamentals, all the MSD representa-
tions of each term are determined and inserted in a set Cset;
2. The main loop of the algorithm processes the fundamental elements c ∈ Cset, as follows:
(a) Compute all partial term pairs that cover each element c. Note that a cover is regarded as a possible decomposition of a representation. Table 2.2 exemplifies this decomposition for element 45;
Representation   Implementation        Cover
0101101          (32 + 8) + (4 + 1)    40 + 5
                 (32 + 4) + (8 + 1)    36 + 9
                 (32 + 1) + (8 + 4)    33 + 12
Table 2.2: Binary representation of 45 and its possible covers.
(b) Convert each element of the cover pair into fundamentals;
(c) Add each cover pair to the corresponding set of covers of the element being processed;
(d) If the MSD representations have not yet been processed and are not already in Cset, add the representations of each cover to Cset. Covers with only one non-zero digit are skipped.
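The cover computation of step 2 (a) can be illustrated with a small sketch. For simplicity, it splits the non-zero bits of the plain binary representation into two non-empty groups; the actual algorithm enumerates covers over all MSD representations, which also contain negative digits, so this is only a simplified stand-in.

```python
def covers(c):
    """Enumerate two-operand covers of c by splitting the non-zero bits
    of its plain binary representation into two non-empty groups (a
    simplified stand-in for the MSD-based decomposition of step 2a)."""
    bits = [1 << i for i in range(c.bit_length()) if (c >> i) & 1]
    pairs = set()
    # Every proper, non-empty subset of the bits gives one cover pair.
    for mask in range(1, (1 << len(bits)) - 1):
        a = sum(b for i, b in enumerate(bits) if (mask >> i) & 1)
        pairs.add((min(a, c - a), max(a, c - a)))
    return sorted(pairs)
```

Applied to 45, this yields, among others, the covers 40 + 5, 36 + 9 and 33 + 12 listed in table 2.2.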
3. Following this procedure, the algorithm builds a Boolean network with AND and OR gates. Each AND gate represents an adder or subtracter that produces some partial term value and has two inputs, one for each operand. Each OR gate combines all the partial terms that yield the same value in the chosen numbering representation (possible options: MSD, CSD or binary). This means that the OR gate has a number of inputs equal to the number of possible implementations in the selected numbering representation. For each operation, a third input is needed: the optimization variable, used by the cost function to iterate through all the possible implementations. There are two ways to introduce the optimization variable to finalize the structure of the Boolean network, both leading to the same optimum solution:
• The optimization variable is included at the output of the OR gate with the addition of
an AND gate, as illustrated in figure 2.17a. This placement of the optimization variable
is used to minimize the overall number of partial terms.
• The optimization variable can also be included directly at the AND gates, as a third input. The aim of this placement is to minimize the cost of each individual operation, as exemplified in figure 2.17b.
Figure 2.17: Examples of cost function implementations in the Boolean network (figures from [3]): (a) inclusion of an AND gate that creates an optimization variable used to minimize the overall number of partial terms; (b) addition of an extra input per AND gate to create an optimization variable associated with each possible operation.
Both these functions achieve optimal solutions and yield the same cost. Under a given delay constraint (such as the maximum number of adder-steps or combinational levels) and after the Boolean network has been created, the algorithm then computes the different delays by traversing the network from the inputs to the outputs. These values are subsequently incorporated in the ILP solver as a constraint.
4. A set of basic rules is used to simplify the model:
• Multiple shifted versions of each input are freely available.
• If the requirements of a given operation are more stringent than those of another operation that generates the same partial term, we may remove it. For example, 15 = 9 + (3 << 1) requires partial terms 9 and 3, whereas 15 = (3 << 2) + 3 only requires partial term 3; thus, we may eliminate the former.
• If a coefficient can be implemented with a single operation whose inputs are the primary input or other coefficients, then this coefficient does not need to be represented in the Boolean network.
5. After this simplification, the Boolean network is translated into Conjunctive Normal Form
(CNF) and the cost function to be minimized is constructed as a linear function of the opti-
mization variables. Each clause is converted into a 0-1 ILP constraint. Finally, the obtained
model is fed to a generic SAT-based 0-1 ILP solver.
6. The final solution is the output of the solver.
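The clause-to-constraint translation of step 5 follows a standard pattern: each CNF clause becomes a 0-1 ILP constraint requiring at least one satisfied literal, with each negated literal x contributing (1 − x) and thus shifting the right-hand side. A minimal sketch, using DIMACS-style signed-integer literals; the helper name is ours, not code from [14] or [2].

```python
def clause_to_ilp(clause):
    """Convert a CNF clause (list of signed ints, DIMACS-style) into a
    0-1 ILP constraint 'sum(coeffs) >= rhs'.  A positive literal x_i
    contributes +x_i; a negated literal contributes (1 - x_i), which
    moves a constant -1 to the right-hand side."""
    coeffs, rhs = {}, 1
    for lit in clause:
        v = abs(lit)
        if lit > 0:
            coeffs[v] = coeffs.get(v, 0) + 1
        else:
            coeffs[v] = coeffs.get(v, 0) - 1
            rhs -= 1
    return coeffs, rhs
```

For instance, the clause (x1 ∨ ¬x2) becomes x1 − x2 ≥ 0, which is equivalent to x1 + (1 − x2) ≥ 1.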
2.4 Technology Comparison
After reviewing and summarizing the different binary arithmetic structures and algorithms, we now present a brief comparison to find the most suitable technologies for our purpose, aiming at the minimization of the propagation time and area occupation. The technologies were evaluated with a metric used by [27], called the Tyagi metric. The goal of this metric is the use of technology-independent costs; the use of generic metrics generalizes the comparison and keeps it valid if an update to newer gate technologies becomes necessary.
2.4.1 Arithmetic Binary Structures
2.4.1.A Adder
Aiming for a low latency adder in the MCM, we will focus on fast adders, preferably with parallel computation. The FA, the CLA and the SA are considered for comparison; the FA is expected to be slower than both the CLA and the SA. Based on the Tyagi model, we obtain figures 2.18a, 2.18b and 2.18c, which model the area occupation and latency of each adder as a function of the bit-width.
From figure 2.18b, the SA has the lowest latency, as expected, while the CLA is slower than the SA. However, the carry out is also worth analysing: figure 2.18c shows that, for the carry out computation, the CLA is quite similar to the SA. The FA uses less area, but its time-delay is higher than the other two. Thus, for time considerations, we will further study both the CLA and the SA.
Figure 2.18: Adders' area and time-delay comparison using Tyagi's metric: (a) area comparison between FA, CLA and SA; (b) output time-delay comparison; (c) carry-out time-delay comparison.
2.4.1.B Increment and Decrement
As in the comparison of section 2.4.1.A, the same is done for the RID proposed by [20] and a dual architecture that joins an increment and a decrement in parallel, implemented with a Half Adder (HA) and a MFA, respectively, as used by [3]. Figures 2.19a and 2.19b show the area occupation and the output time delay.
Figure 2.19: Comparison between the HA+MFA architecture and the RID of [20]: (a) area comparison; (b) output time-delay comparison.
From figure 2.19b, the theoretical model suggests that the use of the RID only pays off for bit-widths larger than 9 bits; for bit-widths of 9 bits or less, we can use the HA+MFA architecture. There is no need to compute the carry out, because in this architecture the carry out is generated at the same time as the other bits.
From the previous results, either the SA or the CLA was chosen for the adder, while for the increment/decrement the dual architecture was chosen up to 8 bits and the RID for larger bit-widths.
2.4.2 Multiple Constant Multiplication
For benchmarking purposes, we considered the Quantization Parameters (QP) values defined in the H.264 video standard [25]. In particular, we will use the set of coefficients used in Forward Quantization (QF), composed of the terms 13107, 11916, 10082, 9362, 8192, 8066, 7490, 7282, 6554, 5825, 5243, 4660, 4559, 4194, 3647, 3355 and 2893. A set with the smaller coefficients of the Inverse Quantization (QI), 10, 11, 13, 14, 16, 18, 20, 23, 25 and 29, is also provided and will be used at the end of this section to compare the involved complexity. The multiplier circuit receives a 16-bit input vector and outputs a 32-bit vector. The QF coefficients have bit-widths between 12 (2893) and 14 (13107), which adds some complexity to the MCM problem.
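These bit-widths can be checked directly with a small sketch in Python:

```python
QF = [13107, 11916, 10082, 9362, 8192, 8066, 7490, 7282, 6554,
      5825, 5243, 4660, 4559, 4194, 3647, 3355, 2893]
QI = [10, 11, 13, 14, 16, 18, 20, 23, 25, 29]

# The QF coefficients span 12 (2893) to 14 (13107) bits, while the QI
# coefficients fit in 5 bits or less.
assert max(c.bit_length() for c in QF) == 14
assert min(c.bit_length() for c in QF) == 12
assert max(c.bit_length() for c in QI) == 5
```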
Based on the algorithms previously presented, we plotted the area-time graph of the MCM circuit obtained with each of them: Hcub, BHM, RAG-N and ASSUME-A. The MCM nodes were also implemented with different structures, by considering the following adders: CLA, SA and FA. The area and time measures were estimated using Tyagi's model [28] and computed using MATLAB and Simulink for a basic simulation of the circuit. Next, the obtained implementations will be compared.
Algorithm   # adders   adder-steps
BHM         28         8
RAG-N       25         8
Hcub        26         3
ASSUME-A    27         3
Table 2.3: Characteristics of the MCM structures derived with the considered algorithms for the QF coefficients set.
From an analysis of table 2.3, we note that the fastest implementations are provided by the Hcub and ASSUME-A algorithms, since they both have the smallest number of adder-steps. The implementation with the smallest hardware requirements is the one from RAG-N, with 25 adders in total. The BHM should be the worst performing algorithm.
Figure 2.20 presents the previously mentioned algorithms implemented with different adders. The metrics used are taken from Tyagi's generic model [28]. Due to the proximity of the Hcub and ASSUME-A results, we grouped them together. The same conclusions as in table 2.3 can be drawn from figure 2.20a: Hcub and ASSUME-A present both the lowest time delay and the smallest silicon area occupation. Regarding the adders, the RC is the slowest, as expected from section 2.1, but also the one with the least silicon area. The CLA and the SA are very similar in terms of area and time-delay, with the SA being slightly faster but occupying a larger silicon area. Figure 2.20b presents both Hcub and ASSUME-A at a smaller scale. From it, we can see that the two algorithms produced very similar results in terms of generic metrics. For further comparison, we will look at a detailed bit precision study, i.e., the study of the dynamic range at the input and output of every adder of the MCM.
Figure 2.20: Implementation characteristics of the MCM structures derived with the considered algorithms, when implemented with different adder structures (RC, SA and CLA): (a) comparison between Hcub/ASSUME-A, BHM, RAG-N and a generic multiplier; (b) zoomed-in comparison between ASSUME-A and Hcub.
Figures 2.21 and 2.22 present the bit precision studies for the QI and QF coefficient sets, respectively, for the Hcub and ASSUME-A algorithms. The brown boxes are the adders, the gray boxes are the shifts, the green box is the primary input and the red boxes are the outputs. The numbers on the lines are the dynamic ranges of the routing wires. We added the QI coefficients to this group to compare the complexity of the implementations as the bit-width of the target coefficients becomes larger. Note that, even though both algorithms have the same characteristics in table 2.4, they actually have different implementations, as can be seen in figures 2.21a and 2.21b. These differences are so small that, in the implementation of the QF coefficients, the obtained time-delay is the same; the area presents only a slight difference, as shown in figure 2.20b.
Algorithm   # adders   adder-steps
Hcub        8          2
ASSUME-A    8          2
Table 2.4: Characteristics of the MCM structures derived with the considered algorithms for the QI coefficients set.
From figures 2.21 and 2.22, it is observed that a maximum of 30 bits is needed at any given node of the MCM graph, as opposed to the 32 bits considered in [21]. This envisages the possibility of obtaining an implementation with a reduced area, by reducing the bit-width representation of the operands in each adder.
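This 30-bit bound can be cross-checked with a simple worst-case computation, assuming a 16-bit two's-complement input (a sanity check, not the detailed per-node study of the figures):

```python
QF = [13107, 11916, 10082, 9362, 8192, 8066, 7490, 7282, 6554,
      5825, 5243, 4660, 4559, 4194, 3647, 3355, 2893]

def product_width(c, n_in=16):
    """Two's-complement bits needed for c*x with an n_in-bit signed x:
    the worst-case magnitude |c| * 2**(n_in - 1) occurs at the most
    negative input, plus one sign bit."""
    return (abs(c) * (1 << (n_in - 1))).bit_length() + 1

# The widest product over the QF set fits in 30 bits, matching the
# observation that no node needs the full 32 bits.
assert max(product_width(c) for c in QF) == 30
```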
Summary
In this chapter, arithmetic binary structures and MCM structures were presented as a viable way to solve multiple constant multiplications with low area occupation and low time delay. Low latency arithmetic binary structures were identified, and the relevant state-of-the-art algorithms
Figure 2.21: Bit precision study of each operation in the obtained MCM structures for the QI coefficients set: (a) MCM obtained with Hcub; (b) MCM obtained with ASSUME-A.
for the derivation of MCM structures were presented and compared. Furthermore, some basic implementation examples were carried out with different adder structures, in order to understand their influence on the characteristics of the obtained MCM circuit. In the next chapter, we will propose highly optimized implementations of adder structures, particularly adapted to the MCM problem, with both strict area and time models.
Figure 2.22: Bit precision study of each operation in the obtained MCM structures for the QF coefficients set: (a) MCM obtained with Hcub; (b) MCM obtained with ASSUME-A.
3 Optimized Adder Structures for MCM
In this chapter, we will propose a modular adder structure that includes left shifts at the inputs and a final right shift at the output, as seen in figure 3.1. The main goal is to design a custom adder that is particularly adapted to the shift-and-add operations present in typical MCM problems. First, we will define a modular adder structure for a generalized shift-and-add multiplication operation. Then, we will define the differences imposed by the specifications of common MCM problems in particular.
Figure 3.1: Adder block with its inputs left shifted and the resulting output right shifted.
To make the optimization possible, we will assume a fully modular structure. First, we will analyse the different zones into which the operation can be divided. The main motivation behind this segmentation of the addition operation is to facilitate the definition of a valid and generic model, and also to understand what kind of hardware is used in each part of the block. This customization will allow us to improve the latency of the adder without making the area cost too high. We will describe the implementation of both the addition and the subtraction operations. A possible configuration for the addition and the subtraction will also be provided as an example. To ensure as much generality as possible, this example contains the right shift at its output, even though most MCM algorithms do not make use of it.
3.1 Addition
The first operation addressed is the addition. We separate the modular adder into four zones and then provide a small example. The division into four zones comes directly from the properties of the addition in the shift-and-add multiplication. Two of these zones contain gate logic, while the other two do not need any circuitry.
3.1.1 Mathematical formulation
To clarify the adopted segmentation, a mathematical model will be set up. The definition of
each term and operation will be given next.
Let u and v be two signed binary operands, with nu and nv bits, respectively. The non-negative integers l1 and l2 are their respective left shift parameters, while the non-negative integer r is the right shift that is applied at the output. Finally, s, with ns bits, is the signed binary output of the addition. We also define ul1 = u × 2^l1 and vl2 = v × 2^l2 and make sure that both operands have the same size, by extending the smaller operand to match the larger one. We can define 4 groups of bits from the mathematical formulation of this operation, which we will call zones from now on. In each zone, we will define which operation will be used.
Figure 3.2: Mathematical formulation for the addition operation (zones 1 to 4).
Note the extension of operand v from bit nm = min{nu + l1, nv + l2} to bit nM = max{nu + l1, nv + l2}. We will also define lm = min{l1, l2} and lM = max{l1, l2}. Thus, with this signal extension, we get u′ = ul1<0:nM> and v′ = vl2<0:nM>.
According to this definition, the decomposition of this operation is as follows:

s = s′ × 2^(−r) = [u′ + v′] × 2^(−r)
  = [ A1                                                  (Zone 1)
    + A2 × 2^lm                                           (Zone 2)
    + (A3 + B3) × 2^lM,  with γ = A3 + B3                 (Zone 3)
    + (A4 + B4 + γ<nm>) × 2^nm ] × 2^(−r)                 (Zone 4)    (3.1)
where γ<nm> is the carry out from the addition γ and:

A1 = 0<0:lm−1>
A2 = u′<lm:lM−1> if l1 < l2, or v′<lm:lM−1> if l1 > l2
A3 = u′<lM:nm−1>
B3 = v′<lM:nm−1>
A4 = u′<nm:nM−1>
B4 = v′<nm:nM−1>
If r > 0, it is still possible to restrict the optimization to the paths affected by the right shift. Therefore, we can define several sub-zones, as follows:
Zone 1 If r = 0, zone 1 generates logic zeros at the output s, from bit 0 to bit lm − 1. Otherwise, if 0 < r < lm, we can eliminate the zeros up to bit r − 1;
Zone 2 b) If r < lm, zone 2 b) copies the operand's bits (bit bypass) from lm to lM − 1 directly to the output s. Otherwise, if lm ≤ r < lM, we can eliminate the bypass up to bit r − 1 and all the zeros from zone 1;
Zone 3 If r < lM, zone 3 adds the two operands. Otherwise, if lM ≤ r < nm, zone 2 b) and zone 1 are eliminated and zone 3 is divided into two parts:
Zone 3 a) From bit lM to bit r − 1, we only compute the carry chain;
Zone 3 b) From bit r to bit nm − 1, we perform the addition as normal, taking into account the carry coming from bit r;
Figure 3.3: Zone 3 divided into two sub-zones, when lM < r < nm − 1.
Zone 4 If r < nm, zone 4 implements one of three operations: increment, decrement or bypass of the larger operand. The choice of the operation depends on the carry bit from the previous zone and on the sign extension of the operand; table 3.1 presents the criterion for each operation. Otherwise, if nm ≤ r < nM, we still need to calculate the carry chain from zone 3, but we can eliminate the bypass and the zeros from zone 2 b) and zone 1. Dividing this operation into two sub-zones, we get:
sign extension   cin   Type
0                0     Byp
0                1     Inc
1                0     Dec
1                1     Byp
Table 3.1: Zone 4 truth table.
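As a sketch, the selection of table 3.1 is just a two-bit lookup (the helper name is ours, purely illustrative):

```python
def zone4_type(sign_extension, cin):
    """Operation selected in zone 4 according to table 3.1: the sign
    extension bit of the wider operand and the carry coming from zone 3
    decide between bypass (Byp), increment (Inc) and decrement (Dec)."""
    return {(0, 0): "Byp", (0, 1): "Inc",
            (1, 0): "Dec", (1, 1): "Byp"}[(sign_extension, cin)]
```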
Zone 4 a) from bit nm to bit r − 1 we will only compute the carry chain, i.e. we need to
calculate all the carries from zone 3 and part of the carries from zone 4;
Zone 4 b) from bit r to bit nM − 1 we will use an increment, decrement or bypass;
Figure 3.4: Zone 4 divided into two sub-zones, when nm < r < nM − 1.
We highlight three particular cases. The first is when l1 = l2: zone 2 disappears, enlarging zone 1. Another notable case is when l1 = l2 = 0: zones 1 and 2 disappear, because there are no left shifts to be done. Finally, when nM = nm, zone 4 is not needed, because all the computation is done in zone 3.
It is also worth noting that, if r ≥ nM, we have s = 0 and no computation needs to be done. Moreover, when r = 0, the number of defined zones is reduced: there are no a) sub-zones, since the b) parts of zones 3 and 4 are the only defined zones. Table 3.2 summarizes the different cases, where the symbol ∅ denotes the absence of the operand in the corresponding zone. The bit-widths in each zone are divided into 3 columns: Length contains the overall length of each zone, while columns Operand u′ and Operand v′ contain the bit ranges of operands u′ and v′, subject to the existence of the operand in a given zone. As an example: if l1 > l2, then zone 2 b) will only have operand v′; therefore, for zone 2 b), operand u′ will have length 0 and operand v′ will span the range max{r, lm} : lM − 1, where lM = l1 and lm = l2.
Zone   Type                        Right shift        Length             Operand u′                  Operand v′
1      Zeros generation            r < lm − 1         lm − max{r, 0}     ∅                           ∅
2 b)   Bypass                      r < lm             lM − max{r, lm}    ∅ or max{r, lm} : lM − 1    max{r, lm} : lM − 1 or ∅
3 a)   Carry generation            lM < r < nm − 1    min{r, nm} − lM    lM : min{r, nm} − 1         lM : min{r, nm} − 1
3 b)   Adder                       r < lM             nm − max{r, lM}    max{r, lM} : nm − 1         max{r, lM} : nm − 1
4 a)   Carry generation            nm < r < nM − 1    min{r, nM} − nm    nm : min{r, nM} − 1         nm : min{r, nM} − 1
4 b)   Increment/Bypass/Decrement  r < nm             nM − max{r, nm}    max{r, nm} : nM − 1 or ∅    ∅ or max{r, nm} : nM − 1
Table 3.2: Summary of the addition operation with r > 0.
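The zone boundaries above can be sketched behaviourally as follows. This is an illustration of the segmentation with our own naming (keys and the function name are not from the thesis), not a gate-level model; all ranges are half-open [lo, hi).

```python
def zone_bounds(nu, l1, nv, l2, r=0):
    """Bit ranges of the zones of s = ((u << l1) + (v << l2)) >> r,
    following the segmentation of section 3.1 (zones 1, 2 b), 3 a/b),
    4 a/b)).  Returns only the zones that actually exist."""
    lm, lM = min(l1, l2), max(l1, l2)
    nm, nM = min(nu + l1, nv + l2), max(nu + l1, nv + l2)
    z = {}
    if r < lm:                       # Zone 1: zeros up to lm
        z["1:zeros"] = (max(r, 0), lm)
    if r < lM and lm < lM:           # Zone 2 b): bypass of one operand
        z["2b:bypass"] = (max(r, lm), lM)
    if lM < r < nm:                  # Zone 3 a): carry chain only
        z["3a:carry"] = (lM, r)
    if r < nm and lM < nm:           # Zone 3 b): the actual addition
        z["3b:add"] = (max(r, lM), nm)
    if nm < r < nM:                  # Zone 4 a): carry chain only
        z["4a:carry"] = (nm, r)
    if r < nM and nm < nM:           # Zone 4 b): inc/dec/bypass
        z["4b:inc_dec_byp"] = (max(r, nm), nM)
    return z
```

For nu = nv = 4, l1 = 3, l2 = 1 and r = 0, this gives zone 1 on bits [0, 1), the bypass on [1, 3), the adder on [3, 5) and zone 4 on [5, 7); raising r to 4 removes zones 1 and 2 b) and splits zone 3 into its a) and b) parts.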
3.1.2 Example configurations
To understand the mechanics of the modular adder, we will see how we can generate different
cases by varying the length and signal of the operands. In table 3.3 we assume r > 0. In this
example, either zone 3 will have the a) part (if lM < r < nm − 1) or zone 4 will have the a) part (if
nm < r < nM−1). Because this is an addition configuration, no computation is required in zone 2.
In the following, for the strict purpose of providing a practical example, it will be considered the
effect of varying the adder parametrization in zone 4. If the operand with less bits is negative (v′),
then we have either to decrement or bypass the other operand (u′) in zone 4, depending on the
carry coming from zone 3 (see table 3.3, cases 2 and 4). When the signal extension is done with
ones, we have to add the extension of v′ with u′. However, considering that a series of ones is
equivalent to the TC representation of (−1), then we only have to decrease the operand u′ by one
in zone 4. On the other hand, if the carry from zone 3 is one, then it will cancel this decrement
(11112 + 12 = 00002), making it a simple bypass.
The same reasoning applies for v′ > 0 (see table 3.3, cases 1 and 3). If the carry out from zone 3 is one and the sign extension is made of zeros, then the operation in zone 4 b) is simplified into an increment. If the carry out is zero, then the zone 4 b) operation is a bypass.
When the final right shift affects zone 4 (i.e., r > nm), the output will depend on the carry coming from zone 3. Since there is no output in zone 4 a) (except for the carry out), the operation only needs to implement the carry logic. If zone 4 a) was assigned an increment operation, only the increment carry logic needs to be implemented, without its output part. Similarly, for the decrement, only the decrement carry logic is needed. As an example, if a HA is used for the increment, only its AND gate is necessary to compute the carry, leaving out the XOR gate that computes the sum output.
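The carry-only simplification can be sketched as follows: in a half-adder-based incrementer, dropping the XOR outputs leaves only the AND chain, so the carry out of the zone is simply the AND of the carry in with all operand bits (an illustrative helper of ours, not the thesis' hardware description):

```python
def carry_only_increment(bits, cin):
    """Carry chain of an incrementer built from half adders, from LSB
    to MSB: each stage keeps only the HA carry (a AND b) and drops the
    XOR (sum) output, as in a zone 4 a) carry-generation block."""
    c = cin
    for b in bits:
        c = c & b
    return c
```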
Table 3.3: Eight different cases of the addition operation. Cases 1–4 have nu + l1 > nv + l2 (zone 4 b) operates on u′: u + γ or u − 1 + γ), while cases 5–8 have nu + l1 < nv + l2 (zone 4 b) operates on v′: v + γ or v − 1 + γ); in all cases, zone 3 b) computes u + v.
Note that not all possible cases are shown in table 3.3. Cases 1, 2, 3 and 4 have l1 > l2, while cases 5, 6, 7 and 8 have l1 < l2. As can be seen in cases 1, 3, 5 and 6, we have an increment in zone 4 b) if γ<nm> = 1, or a bypass if γ<nm> = 0. In cases 2, 4, 7 and 8, we have either a bypass (γ<nm> = 1) or a decrement (γ<nm> = 0). Moreover, zone 4 disappears in the trivial case where nu + l1 = nv + l2. Another consequence of l1 = l2 is that zone 2 disappears.
Summing up: in zone 1, zeros are generated; in zone 2 b), the operand with the smallest left shift is bypassed; in zone 3 b), both operands are added; and in zone 4 b), we have either an increment, a decrement or a bypass. Zones 3 a) and 4 a) only exist when lM < r < nm or nm < r < nM, respectively, corresponding to hardware structures where only the carry bit has to be generated (no output is produced).
Zone   Type                        Length    Operand u′                          Operand v′
1      Zeros generation            lm        ∅                                   ∅
2 b)   Bypass                      lM − lm   lm : lM − 1                         lm : lM − 1
3 b)   Adder                       nm − lM   lM : nm − 1                         lM : nm − 1
4 b)   Increment/Bypass/Decrement  nM − nm   cases 1–4: nv + l2 : nu + l1 − 1    cases 1–4: ∅
                                             cases 5–8: ∅                        cases 5–8: nu + l1 : nv + l2 − 1
Table 3.4: Summary of the addition operation with r = 0.
Table 3.4: Table resuming the addition operation with r = 0.
Table 3.4 summarizes the several operands involved in this operation when r = 0, where the symbol ∅ denotes the absence of the operand in the corresponding zone. Having covered the addition, the subtraction still remains to be analysed, which will be done in the following section.
3.2 Subtraction
The subtraction case is similar to the addition one. The main difference arises from the TC representation of one of the operands:
• we need to complement the operand that is going to be subtracted;
• we must add '1' to the least significant bit of the complemented operand.
Because of this, we must slightly change the adder structure described in section 3.1. The main difference in the implementation is observed inside zone 2, as described in the following.
3.2.1 Mathematical formulation
By following the same approach adopted for the addition operation, we will start by introducing the required adjustments in order to apply the TC to the operand to be subtracted. Accordingly, the operand used in this replacement will be left shifted and complemented. Figure 3.5 presents an example, assuming that v is positive. An entirely similar description could be derived if v were a negative number.
The subtracted value is given by:

t′ = NOT(v′) + 1
   = NOT(v × 2^l2) + 1
   = NOT(v) × 2^l2 + 1 × 2^l2    (3.2)
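Equation 3.2 can be sanity-checked numerically: two's-complementing the already-shifted operand equals shifting the complemented operand and adding 1 at bit position l2, because the l2 trailing ones produced by the complement absorb the +1. A small check in Python, with a hypothetical helper name:

```python
def shifted_negate(v, l2, width):
    """Check of equation 3.2: the TC of v' = v << l2, computed as a
    whole, equals complementing v, shifting it, and adding 1 << l2.
    All arithmetic is reduced modulo 2**width (TC wrap-around)."""
    mask = (1 << width) - 1
    direct = (~(v << l2) + 1) & mask            # TC of the shifted operand
    zonewise = ((~v << l2) + (1 << l2)) & mask  # complement, shift, +1·2^l2
    assert direct == zonewise
    return direct
```

For instance, with v = 5, l2 = 2 and an 8-bit word, both forms yield 236, the 8-bit TC encoding of −20.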
Figure 3.5: Mathematical formulation for the subtracter.
By using equations 3.1 and 3.2, and by replacing v′ with t′, we have:

s = s′ × 2^(−r) = [u′ + t′] × 2^(−r)
  = [ A1                                                         (Zone 1)
    + A2 × 2^lm,  with λ = A2                                    (Zone 2)
    + (A3 + B3 + λ<lM>) × 2^lM,  with γ = A3 + B3 + λ<lM>        (Zone 3)
    + (A4 + B4 + γ<nm>) × 2^nm ] × 2^(−r)                        (Zone 4)    (3.3)
where λ<lM> is the carry out bit from λ and γ<nm> is the carry out bit from the addition γ. A carry-out protection that includes the carry out in the signal makes the extra bit unnecessary. Defining the operands for each zone, where 1lm is the bit added for the TC conversion, we get:

A1 = 0<0:lm−1>
A2 = u′<lm:lM−1> if l1 < l2, or NOT(v′)<lm:lM−1> + 1lm if l1 > l2
A3 = u′<lM:nm−1>
B3 = NOT(v′)<lM:nm−1>
A4 = u′<nm:nM−1>
B4 = NOT(v′)<nm:nM−1>
In this operation, one extra sub-zone is defined inside zone 2, in addition to the sub-zones defined in section 3.1:
Zone 1 Fill with zeros, from bit 0 to bit lm − 1;
Zone 2 If lm ≤ r < lM − 1, the previous zone is eliminated and zone 2 is divided into two parts:
Zone 2 a) From bit lm to bit r − 1, we only compute the carry chain if l1 > l2, or implement nothing if l1 ≤ l2;
Zone 2 b) From bit r to bit lM − 1, we have either a bypass or an increment. If l1 > l2, the operand that passes through zone 2 is t′ and, because of the TC, we need to increment it by 1, as in equation 3.2. If l1 < l2, we have a u′ bypass.
(Figure content: zone 2, spanning bits <lm:lM-1>, split at bit r into sub-zone 2 a) <lm:r-1> and sub-zone 2 b) <r:lM-1>.)
Figure 3.6: Zone 2 divided into 2 sub-zones, when lm < r < lM − 1.
Zone 3 If lM ≤ r < nm − 1, zone 1 is eliminated, zone 2 comprises a carry chain computation
(when l2 ≤ l1) and zone 3 is divided into two parts:
Zone 3 a) From bit lM to bit r − 1 we compute the carry chain;
Zone 3 b) From bit r to bit nm − 1 we do the addition, as normal, taking into account
the carry coming from the r bit;
Zone 4 When nm ≤ r < nM − 1 we need to calculate the carry chain from zone 3. Zone 2
depends on the two cases seen before, but we can eliminate the zeros from zone 1. Zone 4
is then divided in two sub-zones:
Zone 4 a) from bit nm to bit r− 1 we compute the carry chain, i.e. we need to calculate
the carry from zone 2, zone 3 and zone 4 up to r − 1;
Zone 4 b) from bit r to bit nM−1 we use an increment, decrement or bypass, depending
on the case, taking into account the carry coming from zone 4 a).
Note that zone 4 only exists if nu + l1 ≠ nv + l2; therefore, zone 4 does not contemplate
nu + l1 = nv + l2. The same applies to zone 2 when l1 = l2, in which case zone 1 will occupy
zone 2. Finally, both zones 1 and 2 disappear if l1 = l2 = 0. Table 3.5 summarizes this operation;
the symbol ∅ denotes the absence of the operand in the corresponding zone.
Zone | Type | Right shift | Length | Operand u′ (bit dynamic range) | Operand v′ (bit dynamic range)
1    | Zeros generation | r < lm − 1 | lm − max(r, 0) | ∅ | ∅
2 a) | Carry generation | lm < r < lM − 1 | min(r, lM) − lm | ∅ or lm : min(r, lM) − 1 | lm : min(r, lM) − 1 or ∅
2 b) | Bypass/Increment | r < lm | lM − max(r, lm) | ∅ or max(r, lm) : lM − 1 | max(r, lm) : lM − 1 or ∅
3 a) | Carry generation | lM < r < nm − 1 | min(r, nm) − lM | lM : min(r, nm) − 1 | lM : min(r, nm) − 1
3 b) | Adder | r < lM | nm − max(r, lM) | max(r, lM) : nm − 1 | max(r, lM) : nm − 1
4 a) | Carry generation | nm < r < nM − 1 | min(r, nM) − nm | nm : min(r, nM) − 1 | nm : min(r, nM) − 1
4 b) | Increment/Bypass/Decrement | r < nm | nM − max(r, nm) | max(r, nm) : nM − 1 or ∅ | ∅ or max(r, nm) : nM − 1
Table 3.5: Summary of the subtraction with right shift operation.
3.2.2 Example configuration
In this configuration, r = 0, therefore there are no sub-zones a). Since it is a subtraction there
is an increment in zone 2 and consequently this zone produces a carry-out (λ). Moreover, zone
3 also has a carry-out (γ). It can be seen that in cases 1, 3, 5 and 6 we use either an increment
or a bypass to implement zone 4 b). In these cases, the selection of either increment or bypass
depends on the carry-out γ coming from zone 3 b). In cases 2, 4, 7 and 8 we use a decrement or
bypass, depending on the γ value coming from zone 3 to zone 4.
Table 3.6 presents eight notable cases out of 16. As can be seen, the sizes of l1 and l2 could
have been permuted in each of the cases presented; those permutations were left out, restricting
the analysis to the cases shown. The right column presents the cases where l2 > l1, while the
left column presents the remaining cases.
Summing up: in zone 1, we generate zeros; in zone 2 b), we increment or bypass the operand
with the smallest left shift; in zone 3 b), we add both operands, not forgetting the carry λ from zone
2 b); and in zone 4 b), we have either an increment, a decrement or a bypass, depending on the carry γ.
With all the operations defined, we can now proceed with the formal definition of each zone
described above either for the addition or the subtraction operations. The main goal of the next
section is to define a framework to manage the modularity of the envisaged component. From
here, it is always possible to change the modules of each section, while still maintaining the
structure of the described block.
3.3 Topologic description of the adder/subtracter module
In this section, we will formalize the description of each zone, the instantiation conditions and
the corresponding logic function. A set of boolean variables will help us to implement each block
activation condition. For the sake of notation simplicity, we will omit the max(r, x) and min(x, r)
expressions seen in tables 3.4 and 3.5, replacing them simply with r.
The adoption of these particular implementation conditions will be particularly useful to inte-
grate the several blocks that compose the whole adder/subtracter, by using the ”generate” primitive
made available by the VHSIC Hardware Description Language (VHDL). Once the block is synthe-
sized, it does not change anymore; the conditions presented hereafter are only used at the
generation of the proposed adder.
Throughout this section, a new notation will be used to simplify the reference to each block.
The notation is composed of a number representing the zone, a letter representing the sub-zone,
a letter representing the operand used and, finally, a contraction of the implemented function. As
an example, for zone 2 b) with operand v in an increment, we will have the notation block 2bv inc. At the end
(Table content: bit-level diagrams of the eight subtraction cases. Left column: nu + l1 > nv + l2 (cases 1 to 4); right column: nu + l1 < nv + l2 (cases 5 to 8). Rows: u > 0 and v > 0 (cases 1 and 5); u > 0 and v < 0 (cases 2 and 6); u < 0 and v > 0 (cases 3 and 7); u < 0 and v < 0 (cases 4 and 8). Each diagram shows the zone 2 b), 3 b) and 4 b) contributions and the carries λ and γ.)
Table 3.6: Eight different cases of the subtraction operation.
of this section there is a table that gathers every combination used.
OP = 1 if addition; 0 if subtraction
RZ = 1 if r ≠ 0; 0 in other cases
O  = 1 if l1 ≠ 0 ∧ l2 ≠ 0; 0 in other cases
W  = 1 if nm ≠ nM; 0 in other cases
P  = 1 if r ≥ nM; 0 in other cases
X  = 1 if l1 > l2; 0 if l1 < l2
Q  = 1 if r ≥ nm; 0 in other cases
Y  = 1 if l1 = l2; 0 if l1 ≠ l2
R  = 1 if r ≥ lM; 0 in other cases
Z  = 1 if nm > lM; 0 in other cases
S  = 1 if r ≥ lm; 0 in other cases
M  = 1 if nu + l1 > nv + l2; 0 if nu + l1 < nv + l2
Zone 1
This first zone, depicted in table 3.7, is where the zeros are generated, due to the shifts of both
operands. It outputs zeros and does not need any input. As an example, if we have l1 = 4 and
l2 = 6, then zone 1 will have a length of lm = l1 = 4, as can be seen in table 3.15.
Block 1 (zero generation): no inputs; output s over bits r : lm−1.
Conditions: OP = 1 ∨ 0, S = 0, O = 1.
Table 3.7: Definition and implementation conditions of zone 1.
Zone 2 a)
This zone, depicted in table 3.8, is used when we have an increment in zone 2 and its only
function is to generate the carry bit when we have a right shift up to bit r. We need only to compute
the carry chain and it will serve zone 2 b) or zone 3, depending on r. Note that this block only
outputs the carry out, because it is not intended to compute anything else.
Zone 2 b)
Depending on the type of operation to be implemented, this zone will take a different imple-
mentation: while an MCM addition will result in a bypass, an MCM subtraction
can result in either a bypass or an increment. The increment has two outputs and two inputs: a
Block 2a (carry generation): input v′ over bits lm : r−1, with cin = 1; output cout.
Conditions: OP = 0, P = 0 ∧ S = 1, Y = 0, X = 1, RZ = 1.
Table 3.8: Definition and implementation conditions of zone 2 a).
carry cout to zone 3, the incremented operand s, the carry bit (cin) coming from zone 2 a) and the
operand to be bypassed or incremented (see table 3.9).
Block 2bu byp (bypass): input u′ over bits r : lM−1; output s over bits r : lM−1. Conditions: OP = 0 ∨ 1, Y = 0, R = 0, X = 0.
Block 2bv byp (bypass): input v′ over bits r : lM−1; output s over bits r : lM−1. Conditions: OP = 1, Y = 0, R = 0, X = 1.
Block 2bv inc (increment): inputs v′ over bits r : lM−1 and cin; outputs s over bits r : lM−1 and cout. Conditions: OP = 0, Y = 0, R = 0, X = 1.
Table 3.9: Definition and implementation conditions of zone 2 b).
Zone 3 a)
This zone only exists when the final right shift at the output affects zone 3 and the operands
overlap. The corresponding hardware is used to compute the carry chain obtained from the
addition of the two input operands. Hence, since the output is ignored up to bit r, there is no need
to include the logic for the sum result; it is sufficient to consider the logic for the carry. It has one
output, the carry-out bit (cout), and three inputs: the two operands u′ and v′ and the carry-in bit (cin).
Zone 3 b)
In this zone we have the main operation of the block: the addition. Independently of the
considered MCM operation, we compute the addition of the two overlapping operands with the
aid of an adder. It has two outputs, the sum s and the carry-out bit cout, and three inputs: the two
operands u′ and v′ and the carry cin (see table 3.11).
Block 3a (carry generation): inputs u′ and v′ over bits lM : r−1, plus cin; output cout.
Conditions: OP = 0 ∨ 1, P = 0 ∧ R = 1, Z = 1, RZ = 1.
Table 3.10: Definition and implementation conditions of zone 3 a).
Block 3b (adder): inputs u′ and v′ over bits r : nm−1, plus cin; outputs s over bits r : nm−1 and cout.
Conditions: OP = 1 ∨ 0, Q = 0, Z = 1.
Table 3.11: Definition and implementation conditions of zone 3 b).
Zone 4 a)
This zone only exists when the final right shift at the output affects zone 4 and when the
number of bits of one operand exceeds the other. It is the extension of the addition that is initiated
in the previous zones but, since the output is ignored up to bit r, there is no need to include the
logic for the increment; it is sufficient to implement only the logic for the carry chain. In the case of
the increment/decrement, it has one output, the carry-out bit cout to zone 4 b), and two inputs: the
operand u′ or v′ and the select signal slt for the multiplexer (see table 3.12).
For the slt signal, we will use the truth tables 3.13a and 3.13b. Note that for the bypass we do not
need any implementation in zone 4 a).
Block 4au (inc/dec): inputs u′ over bits nm : r−1 and select slt; output cout. Conditions: OP = 1 ∨ 0, P = 0 ∧ Q = 1, RZ = 1, Z = 1, W = 1, M = 1.
Block 4av (inc/dec): inputs v′ over bits nm : r−1 and select slt; output cout. Conditions: OP = 1 ∨ 0, P = 0 ∧ Q = 1, RZ = 1, Z = 1, W = 1, M = 0.
Table 3.12: Definition and implementation conditions of zone 4 a).
(a) M = 1:
v′[nM] cin | Type | slt
0 0 | Byp | 00
0 1 | Inc | 01
1 0 | Dec | 10
1 1 | Byp | 00

(b) M = 0:
u′[nM] cin | Type | slt
0 0 | Byp | 00
0 1 | Inc | 01
1 0 | Dec | 10
1 1 | Byp | 00

Table 3.13: Zone 4 truth table for the select signal.
Zone 4 b)
This zone processes the most significant bits of the operands. Its function is to compute the
values of the most significant bits of the result, by considering the carry coming from zone 4 a)
or zone 3. This is a singular zone, because it has three instances inside it: the incrementer, the
decrementer and the bypass. Block 4 b) has one output, the computed value s, and two inputs:
the operand u′ or v′ and the select signal slt.
The select signal depends on the operand's most significant bit x′ (either from u′ or v′) and the
carry bit: slt[1 : 0] = x′[nM] | Cin, where | is the concatenation operation (see table 3.13). If
the boolean variable M = 1, then operand u′ goes through this block and the select signal is
formed as slt[1 : 0] = v′[nM] | Cin (see table 3.13a). For M = 0, operand v′ goes through the block and the
select signal is formed as slt[1 : 0] = u′[nM] | Cin (see table 3.13b).
Block 4bu (inc/dec/byp): inputs u′ over bits r : nM−1 and select slt; output s over bits r : nM−1. Conditions: OP = 1 ∨ 0, Z = 1, W = 1, P = 0, M = 1.
Block 4bv (inc/dec/byp): inputs v′ over bits r : nM−1 and select slt; output s over bits r : nM−1. Conditions: OP = 1 ∨ 0, Z = 1, W = 1, P = 0, M = 0.
Table 3.14: Definition and implementation conditions of zone 4 b).
Summary
With the presented framework, defined with 7 different modules, summarized in table 3.15, the
symbol ∅ is the absence of the operand in the corresponding zone. The required logic structures
are easily identified and instantiated for each zone. In figure 3.7 an example of a subtraction
configuration is given, with r = 0, lm = l1, lM = l2, nm = n1 and nM = n2. The next chapter will
Addition:
Blocks | Type | Length | Operand u′ | Operand v′
1 | Zeros gen | lm − r | ∅ | ∅
2a | ∅ | ∅ | ∅ | ∅
2bu byp | Bypass | lM − r | r : lM − 1 | ∅
2bv byp | Bypass | lM − r | ∅ | r : lM − 1
2bv inc | ∅ | ∅ | ∅ | ∅
3a | Carry gen | r − lM | lM : r − 1 | lM : r − 1
3b | Adder | nm − r | r : nm − 1 | r : nm − 1
4au | Carry gen | r − nm | nm : r − 1 | ∅
4av | Carry gen | r − nm | ∅ | nm : r − 1
4bu | Inc/Dec/Byp | nM − r | r : nM − 1 | ∅
4bv | Inc/Dec/Byp | nM − r | ∅ | r : nM − 1

Subtraction:
Blocks | Type | Length | Operand u′ | Operand v′
1 | Zeros gen | lm − r | ∅ | ∅
2a | Carry gen | r − lm | ∅ | lm : r − 1
2bu byp | Bypass | lM − r | r : lM − 1 | ∅
2bv byp | ∅ | ∅ | ∅ | ∅
2bv inc | Inc | lM − r | ∅ | r : lM − 1
3a | Carry gen | r − lM | lM : r − 1 | lM : r − 1
3b | Adder | nm − r | r : nm − 1 | r : nm − 1
4au | Carry gen | r − nm | nm : r − 1 | ∅
4av | Carry gen | r − nm | ∅ | nm : r − 1
4bu | Inc/Dec/Byp | nM − r | r : nM − 1 | ∅
4bv | Inc/Dec/Byp | nM − r | ∅ | r : nM − 1

Table 3.15: Summary of both operations.
address the discussion about the possible different implementations that can be considered for
the adder, decrementer and incrementer logic.
(Figure content: subtraction configuration chaining a zero generation block over bits 0:l1-1, an increment block over bits l1:l2-1, an adder over bits l2:n1-1 with inputs u′ and v′, and an inc/dec/byp block over bits n1:n2-1 selected by slt, with the carries cout/cin passed between them to form the output s.)
Figure 3.7: Complete adder configuration.
Before finishing this chapter, it is important to note that zone 1 should be eliminated when the
structure is used in conventional MCMs. The reason for this simplification is the positive and odd
characteristic of the terms generated in MCMs. As a consequence, some boolean variables from
the beginning of this section can be eliminated: S = 1 can be substituted by R = 0 ∧ RZ = 1 and
S = 0 is equivalent to RZ = 0.
4. Time model of the proposed adder/subtracter structure
In this chapter, some models are presented to calculate the time delay path of the
adder/subtracter structures proposed in the previous chapter. This is necessary for the estimation
of the critical path in the MCM solver. Three types of models will be described, with some more
emphasis on the proposed one.
At this point, it is important to emphasize that the models usually adopted by current
MCM decompositions are not always accurate for time constrained applications. However,
the currently available computational resources already allow the usage of better models. Hence,
we will present three different models: the simplest model, which corresponds to a worst-case
scenario; the Least Significant Bit (LSB) model, which corresponds to a best-case scenario; and
finally the proposed model, which is more realistic than the previous two.
To help with the characterization of those models, some variables are defined: n denotes the
number of adder-steps; t^K_f is the absolute time delay (time delay in the MCM structure) at adder
K; t^K is the relative time delay (time delay of the adder structure in isolation) at adder K; t^K_Z is the
relative time delay in adder K for zone Z; t^K_MSBout is the latency of the output's Most Significant
Bit (MSB) in adder K; and t^K_LSBin is the latency of the input's LSB in adder K. Note that
1 ≤ K ≤ n. To simplify the notation, we use t^n_f ⇔ t_f.
4.1 Simplest Model
The simplest model that can be defined is the one where the delay coming from a given node
is assumed to be the maximum time a signal takes to pass through all circuitry inside the node.
What this means is that the signal is only ready when all the previous output signals are computed.
Hence, the critical path goes ”horizontally” before going to the next adder. This model has the
advantage of being fast to compute, but it is very inaccurate for the proposed structure, since
it essentially assumes a worst-case scenario for the majority of adders (including the proposed adder).
The model’s time delay for n cascaded adders is:

tf = Σ_{K=1}^{n} t^K = Σ_{K=1}^{n} (t^K_MSBout − t^K_LSBin)

Figure 4.1 exemplifies the critical path of a structure with two adders, with t^1 = t^1_MSBout − t^1_LSBin
and t^2 = t^2_MSBout − t^2_LSBin. The total time delay with n = 2 is:
tf = t^1 + t^2 = (t^1_MSBout − t^1_LSBin) + (t^2_MSBout − t^2_LSBin).
(Figure content: two cascaded adders (zones Z2, Z4, Z3); the critical path crosses each adder horizontally, accumulating t1 and t2 into tf.)
Figure 4.1: Simplest model: critical path estimation.
4.2 LSB Model
An improved and more accurate model was proposed by [5]. For K adder-steps, it uses the
assumption that the critical path goes through the LSB bits from the first adder down to the (K − 1)th
adder and then from the LSB to the MSB on the Kth adder. Essentially, the critical path progresses
”vertically” through the first K − 1 adders and ”horizontally” through the last adder. This model requires
little computation and represents the most optimistic scenario in an adder chain. The time delay
for such a model is given by:

tf = Σ_{K=1}^{n} t^K = (t^n_MSBout − t^n_LSBin) + Σ_{K=1}^{n−1} (t^K_LSBout − t^K_LSBin)

Figure 4.2 exemplifies the critical path with three adders, where t^1 = t^1_LSBout − t^1_LSBin,
t^2 = t^2_LSBout − t^2_LSBin and t^3 = t^3_MSBout − t^3_LSBin. The total time delay is:
tf = t^1 + t^2 + t^3 = (t^1_LSBout − t^1_LSBin) + (t^2_LSBout − t^2_LSBin) + (t^3_MSBout − t^3_LSBin).
(Figure content: three cascaded adders; the critical path goes through the LSBs of the first two adders (t1, t2) and horizontally through the last one (t3), accumulating into tf.)
Figure 4.2: LSB model.
Due to the irregularity of the proposed adder structure, the LSB model does not accurately
characterize it. The same happens with the RC, CLA or SA chains. Therefore, we need a more
refined model that includes the irregular characteristics of the adder; it will be presented
in the next section.
4.3 Proposed Model
In this section, an adapted model that is capable of coping with the complexity of the proposed
adder structure is considered. The aim of this model is to achieve a more realistic description of
the adder. Ideally, the model would be refined down to the individual wire (or bit) level,
knowing exactly where each signal goes. In the proposed approach, instead of modelling
each bit, the bits are grouped together by zone and the critical path of each zone is taken. We
start with the model formulation and then present some examples.
4.3.1 Formalization
Based on the proposed adder structure, three zones with non-zero delay are identified: zones
4, 3 and 2, in the case of the subtraction. For the addition, only zones 4 and 3 have delay.
Figure 4.3 presents an example of how we defined the model and how the critical path (bold
arrow) is obtained. As before, we define a structure with n adder-steps, with three valid zones
Z ∈ {2, 3, 4} and 1 ≤ K ≤ n.
(Figure content: three cascaded adders with the critical path (bold arrow) crossing different zones in each adder, accumulating t1, t2 and t3 into tf.)
Figure 4.3: Definition of the critical path for the proposed model.
Due to the higher granularity of this model, we need to define some extra variables. ts^K_Z is the
accumulated time delay up to the sum output of adder K in zone Z. tc^K_Z defines the accumulated
time delay of the carry-out up to adder K in zone Z. S^K_Z represents the set of bits that define zone
Z in adder K. For example, S^1_3 = {4, 5, 6, 7, 8, 9, 10} = <4, 10> means that zone 3 in adder 1 is
defined from bit 4 to bit 10. The path taken between each zone or adder is denoted by appending
to the subscript either ”in→S”, ”cin→S”, ”in→cout” or ”cin→cout”. For example, if there is a path
inside adder K in zone 3 from the operands input to the carry-out, the notation t^K_{3 in→cout} is used.
Note that all the ”in→S”, ”cin→S”, ”in→cout” and ”cin→cout” values are known and are calculated
based on the adder size.
The proposed model also defines a function tp(K, Z) that returns the highest accumulated time
delay for a given zone Z at the beginning of adder K. To evaluate if there is any connection
between zone Za in adder K and zone Zb in adder K − 1, the boolean function Ωb(K, Za, Zb) is
defined as follows:

Ωb(K, Za, Zb) = 1 if S^K_{Za} ∩ S^{K−1}_{Zb} ≠ ∅; 0 otherwise
Hence, the propagation paths that can be followed are enumerated next:

• The time delay at the beginning of a given adder K is given by:

tp(K, Z) = max(Ωb(K, Z, 2) × ts^{K−1}_2, Ωb(K, Z, 3) × ts^{K−1}_3, Ωb(K, Z, 4) × ts^{K−1}_4)

where ts^0_4 = ts^0_3 = ts^0_2 = 0 and t^K_Z is defined for every Z ∈ {2, 3, 4} and 1 ≤ K ≤ n.
52
4.3 Proposed Model
• The time delay at the end of zone 4 of a given adder K is given by:

ts^K_4 = max[t^K_{4 in→S} + tp(K, 4), t^K_{4 cin→S} + tc^K_3]    (4.1)

with tc^K_3 = max[t^K_{3 in→cout} + tp(K, 3), t^K_{3 cin→cout} + tc^K_2]

• The time delay at the end of zone 3 of a given adder K is given by:

ts^K_3 = max[t^K_{3 in→S} + tp(K, 3), t^K_{3 cin→S} + tc^K_2]    (4.2)

with tc^K_2 = t^K_{2 in→cout} + tp(K, 2)

• The time delay at the end of zone 2, at a given adder K, is given by:

ts^K_2 = t^K_{2 in→S} + tp(K, 2)    (4.3)

The final time delay tf for the proposed model with n adder-steps is given by:

tf = max(ts^n_2, ts^n_3, ts^n_4)    (4.4)
From the above description, it is easy to understand that this is a more computationally intensive
model, particularly due to the calculation of the intersection S^K_{Za} ∩ S^{K−1}_{Zb}. Nevertheless, this approach is
closer to the wire model, while still maintaining the computational independence of the number
of bits in the adders.
4.3.2 Example
To better understand how the model works, an example is provided in figure 4.3. Suppose the
time delays t^K_Z at each individual zone are known. With the function tp(K, Z) and the bit-widths of the
zones Z ∈ {2, 3, 4} in adders K ∈ {1, 2, 3}, it is easy to analyse the connections between zones
from adder K − 1 to adder K. In the considered example, the connections at zones 2, 3 and 4 of
adder 2 are defined as:
tp(2, 2) = max(Ωb(2, 2, 2) × ts^1_2, Ωb(2, 2, 3) × ts^1_3, Ωb(2, 2, 4) × ts^1_4) = max(1 × ts^1_2, 1 × ts^1_3, 0 × ts^1_4)
tp(2, 3) = max(Ωb(2, 3, 2) × ts^1_2, Ωb(2, 3, 3) × ts^1_3, Ωb(2, 3, 4) × ts^1_4) = max(0 × ts^1_2, 1 × ts^1_3, 1 × ts^1_4)
tp(2, 4) = max(Ωb(2, 4, 2) × ts^1_2, Ωb(2, 4, 3) × ts^1_3, Ωb(2, 4, 4) × ts^1_4) = max(0 × ts^1_2, 0 × ts^1_3, 0 × ts^1_4)

Note that tp(1, 2) = tp(1, 3) = tp(1, 4) = 0. The first adder time delay is given by:

t^1_f = t^1 = ts^1_4 = t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}

Now applying the same to adder 3:
53
4. Time model of the proposed adder/subtracter structure
tp(3, 2) = max(Ωb(3, 2, 2) × ts^2_2, Ωb(3, 2, 3) × ts^2_3, Ωb(3, 2, 4) × ts^2_4) = max(0 × ts^2_2, 0 × ts^2_3, 0 × ts^2_4)
tp(3, 3) = max(Ωb(3, 3, 2) × ts^2_2, Ωb(3, 3, 3) × ts^2_3, Ωb(3, 3, 4) × ts^2_4) = max(1 × ts^2_2, 1 × ts^2_3, 0 × ts^2_4)
tp(3, 4) = max(Ωb(3, 4, 2) × ts^2_2, Ωb(3, 4, 3) × ts^2_3, Ωb(3, 4, 4) × ts^2_4) = max(0 × ts^2_2, 1 × ts^2_3, 1 × ts^2_4)
In this example, we assumed that t^1_{4 in→S} + tp(1, 4) < t^1_{4 cin→S} + tc^1_3 in equation 4.1, with the carries
defined as tc^1_3 = t^1_{3 cin→cout} + tc^1_2 and tc^1_2 = t^1_{2 in→cout} + tp(1, 2) = t^1_{2 in→cout}.
The second adder time delay is given by:

t^2_f = ts^2_3 = t^2_{3 in→S} + tp(2, 3)

assuming t^2_{3 in→S} + tp(2, 3) > t^2_{3 cin→S} + tc^2_2 in equation 4.2.
The third adder time delay is given by:

tf = t^3_f = ts^3_4 = t^3_{4 in→S} + tp(3, 4)

assuming t^3_{4 in→S} + tp(3, 4) > t^3_{4 cin→S} + tc^3_3 in equation 4.1.
Now substituting the function tp:

t^1_f = t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}
t^2_f = t^2_{3 in→S} + ts^1_4 = t^2_{3 in→S} + t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}
tf = t^3_{4 in→S} + ts^2_3 = t^3_{4 in→S} + t^2_{3 in→S} + t^1_{4 cin→S} + t^1_{3 cin→cout} + t^1_{2 in→cout}
Since all variables are known, it is possible to calculate the delay of the whole system. These
steps have to be calculated for every combination of the adders for a realistic description of the
structure delay.
Summary
In this chapter, a model for the time delay of the proposed adder structure was defined. The aim
of this model is to provide a more realistic description of the proposed structure in terms of signal
propagation through multiple adder steps. Two other simpler models that have been adopted in the
current literature were also reviewed, and the new model was presented based on the specifications
of the proposed adder. A mathematical formulation of the proposed model was also presented for
an easier integration with the optimization algorithm that will be presented in the next chapter.
5. Time delay minimization through gate level metrics
Even though improving the hardware of the MCM circuit reduces its latency, such a reduction
can be further improved if the MCM organization is defined by taking the internal structure of the
proposed adder into account. Hence, the optimization algorithm that is now proposed aims at finding
the minimum delay multiplication for a given set of coefficients. Such an algorithm is based on the
results of [14] and [3]. In particular, it can be seen as an optimization, in the time domain, of the
algorithm detailed in [3].
The motivation for the development of this algorithm is to provide circuit designers with alter-
native ways to build and improve their MCMs. Although the main function of this algorithm is to
find the smallest time delay that ensures the implementation of all the coefficients, it still achieves
an efficient area occupation.
Most of the algorithms described in section 2.3 use time and area as unitary weights, without
looking at the particular technology that is used or at the possibility of changing those weights. The
presented work aims to extend the results of [3], which introduced gate-level metrics for the area
occupation, by taking into account not only the adder's area but also the adder's time delay. This
algorithm works in three parts: first, it carries out an exhaustive search of all MSD implementations
of the coefficients and their partial terms; then it finds the critical path among the set of coefficients;
and finally, it uses the same 0-1 ILP solver used in [3] to find the best area minimization with the
critical path as a constraint.
The rest of this chapter is organized as follows. First, it presents a small overview of the
functionalities of the program. Then, the data structures are summarized. Finally, each core
function of the algorithm is described. The explanation of the algorithm is divided into four
parts: the main function; the decomposition of the terms; the critical path search; and finally, the
strategy for area minimization.
5.1 Functions
The first idea behind the proposed algorithm is to minimize the critical path of the MCM. To
achieve this objective, the algorithm was divided into three main functions, briefly described next:
the find partials function builds the implementation tree that will be used by the other functions;
the minimize path function finds the minimal time delay corresponding to the considered
coefficients; and finally, the minimize area function minimizes the area of the MCM while taking
the time constraint into account.
The vocabulary used herein approximates the one that is usually used in this research domain:
a term is a generic designation for a node value of the MCM. The first input corresponds to a
term at the beginning of the MCM; the coefficients are terms located at the end of the MCM; the
partial terms are specific terms used as inputs to other nodes. Finally, an implementation is an
operation and a path is a series of operations. A fundamental is defined as a positive and odd
term and will be used throughout this chapter.
A – find partials Given a set of constant coefficients, the algorithm uses the MSD represen-
tation to implement the coefficients and their partial terms. For such purpose, the path delay
and area occupation of an adder are calculated based on the gate-level model for each of the
different representations of a term. For the specific case of the custom adder that was proposed
in chapter 3, a new function was developed to compute the metric corresponding to the gate-
level time delay and area occupation. This function requires the following input parameters: the
two operands of the adder, the result of the operation, the shifts, and the type of operation (ad-
dition/subtraction). With these inputs, it calculates the values of area and time-delay from the
elementary block. The calculated values are then stored along with the implementation. The con-
struction of the tree is approached from the bottom to the top, i.e. from the coefficients to the first
input.
B – minimize path Once the tree with all possible implementations is built, the algorithm tra-
verses it again in a bottom-up approach. The search starts by looking for a path, with the minimum
time delay variable initialized to infinity. The first path is evaluated and its time delay stored. From
there, the search algorithm iterates through all the possible implementations, looking for a smaller
time delay. To guarantee that the search is exhaustive, it uses a recursive Depth First Search
(DFS) that prunes sections of the tree which have a higher delay than the current minimal path. For
each target coefficient, its implementation, time delay and area occupation are stored. Then, the
coefficients whose implementations have the highest time delay are chosen as the critical path. It
is important to note that at this point there is no area optimization, only time optimization. Here the
graph is complete and it is guaranteed to have the smallest time delay possible.
C – minimize area This part focuses on area minimization, and is based on both the work
developed by Flores et al. [14] and the extension of Aksoy [3] for custom time-weights. The
main difference lies in the construction of the set of implementations that is sent to the ILP solver.
The proposed algorithm tries to minimize the area while still taking the time delay into account.
With the critical path that was found in the previous function, the algorithm identifies all possible
implementations of the target coefficients that do not exceed the critical time. Once this set of
implementations is selected, the algorithm builds the boolean network and the constraints for
the 0-1 ILP problem. From there, the ILP solver outputs the solution. The resulting solution
is characterized by the minimum area for the given boolean network, subject to the considered
time constraints. It can be argued that the time constraints could also have been included in the
constraints that are fed to the ILP solver. However, it was not in the scope of this work to implement
a new model, but to improve the existing one.
5.2 Data Structure and classes
One main data structure and four classes were used for the implementation of the algorithm.
The data structure is static and is used to store the information needed by many functions of
the program. It stores the fundamentals in a binary tree, by using a map container. The used
containers are based on the C++ Standard Template Library (STL) [24] (revision C++98), and are
represented in italic. It also stores a vector container, with the target coefficients, and a vector
with the available fundamentals. The bit width of the first input is also stored in this structure. As
for the classes, one class is used for the generation of the MSD representation. It contains three
values for each partial sum: two operands and one fundamental. Another class is used to manage
the adder costs, containing for each zone of the adder the corresponding time-delays and silicon
area. The third class contains the bit width of the operands, the shift to be executed at each single
operation, the operands of the adder, the fundamental, and a variable that encodes the size of
each zone. This variable is later used for accessing the propagation time and area of the adder
by a map container. Finally, the fourth class stores the fundamental and the list container of the
corresponding MSD implementations.
Type name      Variable         Data
data t         coef             vector with the target coefficients
               unimp            vector with unimplemented coefficients
               imp              vector with implemented coefficients
               x bitwidth       variable with the main input bit width
               partials         pointer to the partials structure
map area t     first            term value
               second           implementation matrix with time and area cost
CAdder         tz2, tz3, tz4    time delay for zones 2, 3 and 4
               area             area occupation of the adder
CImp           coef             fundamental
               op u, op v       operands
               OP               flag for the type of operation (addition/subtraction)
               z                variable with the encoded zone bit widths
               n u, n v         operand bit widths
               l1, l2           operand left shifts
               r                output right shift
CFundamentals  coef             fundamental
               CImplemented     number of possible implementations
               list<CImp>       container list with all MSD implementations

Table 5.1: Data structure and classes
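Assuming C++98 and the member names from table 5.1 (with spaces written as underscores), the structures could be declared roughly as follows. This is an illustrative sketch, not the thesis code; the exact types are assumptions.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <vector>

// Hypothetical C++98 declarations mirroring Table 5.1.
class CImp {
public:
    int coef;          // the fundamental being implemented
    int op_u, op_v;    // operands of the adder
    bool OP;           // type of operation: addition (false) / subtraction (true)
    unsigned z;        // encoded zone bit widths (key into the adder-cost map)
    int n_u, n_v;      // operand bit widths
    int l1, l2;        // operand left shifts
    int r;             // output right shift
};

class CAdder {
public:
    int tz2, tz3, tz4; // time delay of zones 2, 3 and 4
    int area;          // area occupation of the adder
};

class CFundamentals {
public:
    int coef;                // the fundamental
    int CImplemented;        // number of possible implementations
    std::list<CImp> imps;    // all MSD implementations of the fundamental
};

struct data_t {
    std::vector<int> coef;                  // target coefficients
    std::vector<int> unimp;                 // coefficients still to be processed
    std::vector<int> imp;                   // coefficients already processed
    int x_bitwidth;                         // bit width of the main input
    std::map<int, CFundamentals*> partials; // fundamental -> its implementations
};
```

Since std::map keeps its keys ordered (it is typically implemented as a red-black tree), the partials container realizes the binary tree of fundamentals mentioned above.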
As described in table 5.1, the data structure data t refers to general data concerning the MCM,
with information about the input, the target coefficients and the implemented terms. It is used
throughout the code and is the core structure around which the program works, thus representing
a global structure. It stores the input bit-width that was passed as argument to the main function,
as well as the target coefficients after they were read from the input file. It also stores the terms
that were determined during the execution of the find partials function, using the unimp vector
as the list of terms still to be processed by that function. As soon as these
terms are processed, their implementations are stored in a map container, where the key is the
term and the value is a pointer to an object of the class CFundamentals.
The CFundamentals class is used for accessing a specific term. It is accessed through
the data t.partials map and stores all the information for each term. It keeps all the possible
implementations of each fundamental in a list container of CImp. For example, the term 83
(1010011₂, whose CSD representation is 1010101̄, i.e. 64 + 16 + 4 − 1) has 5 different
implementations: 80 + 3, 66 + 17, 65 + 18, 68 + 15 and 63 + 20. Hence, the implementation list
for the term 83 will have five entries, one for each unique implementation. Finally, the
CImplemented variable keeps track of the number of implementations.
The CImp class is used to store each unique implementation of a given fundamental. It can
only be accessed through the CFundamentals class. From this class, we can access the time and
area cost of each implementation. The costs are obtained from the CAdder class, located in a
map container. The key is the 32-bit unsigned integer z, which encodes the bit widths of the
several zones as z = z2 + z3 × 10³ + z4 × 10⁶.
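A possible encoding and decoding of this key can be sketched as follows; the function names are illustrative, only the formula comes from the text.

```cpp
#include <cassert>

// Sketch of the zone-width encoding used as the adder-cost map key:
// z = z2 + z3*10^3 + z4*10^6, so each zone width occupies its own group
// of decimal digits.
unsigned encode_z(unsigned z2, unsigned z3, unsigned z4) {
    return z2 + z3 * 1000u + z4 * 1000000u;
}

unsigned zone2(unsigned z) { return z % 1000u; }            // units group
unsigned zone3(unsigned z) { return (z / 1000u) % 1000u; }  // thousands group
unsigned zone4(unsigned z) { return z / 1000000u; }         // millions group
```

For instance, a configuration with 2 bits in zone 4, 2 bits in zone 3 and 4 bits in zone 2 maps to z = 2002004.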
The class CAdder is used to store the time and area values of a specific adder. It holds four
32-bit integer variables. The motivation behind this is to have a pre-calculated set of adder
parameters, thus speeding up the process of calculating the area and time of each individual
adder.
The map area t structure is mainly used in the minimize area function as a memory that accom-
modates the terms that can be used in the area optimization. This means that all the terms whose
implementations do not exceed the critical time are kept in this structure. For every term, there
is an entry pointing to a matrix with as many columns as the number of implementations for that
term. The first row is used to check the state of each implementation: 0 for implementations that
do not meet the time constraint or have not yet been analyzed; 1 for future possible implemen-
tations; and 2 for implementations that will be passed to the ILP solver. If an index of the first
row holds the value 1 or 2, the corresponding entry of the second row stores the time needed
until the end of the critical path; otherwise, it holds the value 0. This structure is of critical
importance for the ILP solver, because it is from this data that the Boolean network is built.
Taking as example the term 83, the map key would be 83 and the map value would be a matrix with
two rows, one for the flag value and the other for the time needed, and five columns, one for each
implementation of 83 under the MSD representation. At initialization time, the first row is filled
with zeros and the second row with infinite values. As the program runs, the values in both rows
are updated according to the evolution of the MCM optimization algorithm.
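One entry of this structure can be modelled as follows. This is an illustrative sketch; the make_entry helper and the use of double for the times are assumptions, only the two-row layout and the initialization come from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <vector>

// Illustrative model of one map_area_t entry: row 0 holds the state flag of
// each implementation (0 = rejected or not yet analyzed, 1 = future possible,
// 2 = passed to the ILP solver) and row 1 the time needed until the end of
// the critical path. Initialization: zero flags and infinite times.
typedef std::vector<std::vector<double> > ImpMatrix;

ImpMatrix make_entry(std::size_t n_imps) {
    ImpMatrix m(2, std::vector<double>(n_imps));
    for (std::size_t i = 0; i < n_imps; ++i) {
        m[0][i] = 0.0;                                      // flag: not analyzed
        m[1][i] = std::numeric_limits<double>::infinity();  // time: infinite
    }
    return m;
}
```

For the term 83 of the example above, the entry would be created with five columns, one per MSD implementation.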
In the next sections, the main functions of the algorithm will be presented in detail: the main
function, the find partials function, the minimize path function, and the minimize area function.
5.3 Proposed optimization algorithm
The main function is responsible for the invocation of the auxiliary functions and the core func-
tions of the algorithm. The auxiliary functions take care of the ordering of the terms, the reading,
parsing and writing of files, and the calling of the ILP solver. The core functions do all the com-
putational work. The main function can be invoked with up to 4 arguments. The first is accessed
through the "-f" flag and takes a file with the coefficients separated by newlines. A coefficient
may appear twice in the file, in which case the algorithm filters out the duplicate. The second
argument is the bit-width of the input, passed through the "-b" flag, with a default value of 16 bits.
The third argument is the type of adder structure to be used, passed through the "-a" flag. Three
types of adders are supported in the conducted implementation: the structural adder structure
proposed in chapter 3 (the default), the custom adder developed by Aksoy in [3], and a simple
full adder. Finally, the last argument is the optimization switch: if "-opt" is present, the area opti-
mization is on; if "-opt" is absent, the optimization is off and only the time optimization is
done.
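The argument handling described above could look roughly like this. Only the flags and defaults come from the text; the Options struct, the function name and the parsing details are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical command-line handling for the flags "-f", "-b", "-a", "-opt".
struct Options {
    std::string coef_file;   // "-f": file with one coefficient per line
    int bitwidth;            // "-b": input bit width (default 16)
    std::string adder;       // "-a": adder structure (default: structural)
    bool optimize_area;      // "-opt": enable the area optimization
};

Options parse_args(int argc, char** argv) {
    Options o;
    o.bitwidth = 16;            // default input bit width
    o.adder = "structural";     // default adder structure from chapter 3
    o.optimize_area = false;    // area optimization off unless "-opt" given
    for (int i = 1; i < argc; ++i) {
        if (!std::strcmp(argv[i], "-f") && i + 1 < argc)
            o.coef_file = argv[++i];
        else if (!std::strcmp(argv[i], "-b") && i + 1 < argc)
            o.bitwidth = std::atoi(argv[++i]);
        else if (!std::strcmp(argv[i], "-a") && i + 1 < argc)
            o.adder = argv[++i];
        else if (!std::strcmp(argv[i], "-opt"))
            o.optimize_area = true;
    }
    return o;
}
```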
A flowchart for the main function is detailed in figure 5.1. The first sub-function to be invoked
reads the target coefficients from the input file that was passed as parameter of the main function
and adds them to the target coefficient list: data.coef from the data t structure. Not only does it
remove any duplicate coefficients found in the file, but it also initializes the solution file with the
target coefficients. It then loads the pre-computed MSD tables in memory. The coefficients are
loaded into the data.unimp vector from the data t structure and, for every term in data.unimp,
the first core function find partials is called. Once data.unimp is empty, the implementations
are ordered by ascending time delay. A list of all implemented terms is available in the vector
data.imp. For every coefficient in data.coef, the minimize path is run. The path containing the
implementation and the propagation time needed are returned from the function and are saved in
two vectors. From this point on, the algorithm follows a modification of the work developed in [14] and [3]. The main
function iterates again through all the coefficients while it feeds them to the minimize area core
function. The map area t is built here. Finally, it builds the constraint file in a sub-function and
writes the boolean network file in another sub-function to feed the ILP solver. Once the ILP solver
is finished, the main function parses the solution, presents some statistics, writes the results to
a file and finishes the execution.
5.3.1 Partial term finding
The goal of the find partials function is to build the implementation tree with all possible
implementations. This function is called from the main function and receives, as arguments, the term to
be implemented, the data t structure, the pre-computed tables with MSD representations, and the
type of adder structure to use. Figure 5.2 depicts the flowchart of this function. For a given term, it
Figure 5.1: Main algorithm flowchart (read the coefficient file; load the MSD tables; copy the
coefficients to data.unimp; run find partials for each term in data.unimp; order the
implementations by time delay; run minimize path for each coefficient in data.coef, saving path
and time; run minimize area for each coefficient; write the constraint and Boolean network files;
run the ILP solver; read the solution from its output file; write the final solution to the output
file)
Figure 5.2: Find partials algorithm flowchart (check the bit-width of the coefficient and filter
the MSD table; generate the implementations and calculate their area and time costs; erase the
coefficient from data.unimp; add the coefficient to data.imp; add its operands to data.unimp)
MSD representation   MSD path                 Implementation
101̄01̄01              (64 − 16) + (−4 + 1)     48 − 3
                     (64 − 4) + (−16 + 1)     60 − 15
                     (64 + 1) + (−16 − 4)     65 − 20
101̄001̄1̄              (64 − 16) + (−2 − 1)     48 − 3
                     (64 − 2) + (−16 − 1)     62 − 17
                     (64 − 1) + (−16 − 2)     63 − 18
01101̄01              (32 + 16) + (−4 + 1)     48 − 3
                     (32 − 4) + (16 + 1)      28 + 17
                     (32 + 1) + (16 − 4)      33 + 12
011001̄1̄              (32 + 16) + (−2 − 1)     48 − 3
                     (32 − 2) + (16 − 1)      30 + 15
                     (32 − 1) + (16 − 2)      31 + 14
0101101              (32 + 8) + (4 + 1)       40 + 5
                     (32 + 4) + (8 + 1)       36 + 9
                     (32 + 1) + (8 + 4)       33 + 12

Table 5.2: MSD representations and respective paths and implementations of the term 45.
will determine and list all positive and odd operands used in each of the different implementations. Each
time a new operand is found, it is added to the data.unimp vector from the data t structure. To be
considered new, an operand must not be present in either data.unimp or data.imp vectors. For
instance, the term 35 (100011₂) has five different implementations based on two MSD represen-
tations: the binary 100011₂ and the CSD 100101̄₂. The implementations derived from these two
representations are obtained by combining the non-zero digits in groups of two. From the binary
representation, three implementations are derived: 32 + 3, 34 + 1 and 33 + 2, while the CSD
representation yields another three: 36 − 1, 31 + 4 and 3 + 32. In this case, the operands 3, 34,
33, 36, and 31 will be introduced in data.unimp. Notice that although there is one repeated
operand (the term 3 in the first and the last of 35's implementations), it is introduced only once.
The find partials function considers two operations to be the same if the operands and their
respective shifts (by powers of two) merely appear in a different order, therefore avoiding
duplicated implementations. For instance, the term 35 has the implementation 32 + 3 and the
reverse operation 3 + 32, but only one is processed. The other will not be added to the structure.
To better understand how the MSD representation is used here, consider for example the term
45 (0101101₂). It has 5 MSD representations (101̄01̄01, 101̄001̄1̄, 01101̄01, 011001̄1̄ and 0101101)
that spawn 15 implementations, from which the algorithm identifies 11 unique implementations.
Table 5.2 details these MSD representations and their derived implementations. The proposed
algorithm only uses the last column for the implementations. From there, it transforms the
numbers into fundamentals (positive and odd) and stores them in the CFundamentals
implementation list. The motivation behind this choice is to further minimize the number of
adders. Under the MSD path column, every path needs 3 adders, although some implementations can
be further optimized to use only two adders. For instance, the paths 40 + 5 and 36 + 9 only need an additional adder that
implements 5 and 9, respectively. The terms 40 and 36 can be obtained with left shifts of
the terms 5 and 9, respectively: 40 = 5 ≪ 3 and 36 = 9 ≪ 2. This type of MCM definition comes from
the work done by Dempster [12] and Flores [14].
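The reduction of an operand to its fundamental can be sketched as follows; the function name is ours, the thesis only states that fundamentals are positive and odd.

```cpp
#include <cassert>

// Reduce a term to its fundamental: drop the sign and divide out factors of
// two, since even terms are obtained from odd ones by left shifts
// (e.g. 40 = 5 << 3 and 36 = 9 << 2).
int fundamental(int n) {
    if (n < 0) n = -n;                    // fundamentals are positive
    while (n != 0 && n % 2 == 0) n /= 2;  // ...and odd
    return n;
}
```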
5.3.2 Minimizing the time
Figure 5.3: Minimize path algorithm flowchart (load data from data.partials[coef]; if there are
implementations with ones, return the fastest of them with its time; otherwise iterate over the
implementations, recursively calling minimize path on the operands, discarding candidates whose
cur_time exceeds critical_time and keeping best_path/best_time; if no implementation is found,
return empty_path and INF)
The goal of the minimize path recursive function is to minimize the latency of the processing
structure for each of the targeted coefficients. The minimize path function receives as arguments
the current term to be minimized, the current chosen path, and a reference to the current time
delay. If successful, i.e. if no explored path has a smaller time delay, it returns the chosen
path and the resulting time delay by reference. If unsuccessful, i.e. if there is at least one
explored path with a smaller or equal time delay, it returns an empty path and an infinite time.
The motivation behind the recursive formulation of the function is to keep track of the path being
used, as it can be easily modified with this architecture.
A flowchart for this function is presented in Figure 5.3. When the function is called to minimize
the propagation time of a term, it iterates through all the implementations of the term. At each
iteration, the algorithm compares the current time needed to implement the term with the
critical time at that point. The first call from the main function invokes minimize path with
an infinite critical time. Hence, the first iteration of each targeted coefficient is always
successful. From the second iteration on, the outcome depends on the new critical time: if
smaller, the new path is stored; if greater or equal, the path is discarded. To assess the time
needed for the operands of each implementation, the function calls itself recursively for each
operand, and the process repeats itself for all subsequent operands until an implementation with
ones is found. This is the base case of the recursion, and the returned value is a critical time
of zero and a path-vector with a single entry: one.
As can be inferred from its description, the minimize path function is computationally heavy.
To reduce its computational load, some optimizations are introduced. One of them is the
short-circuiting of the base case, i.e. of the implementations with ones: the function starts the
processing of each term by looking for implementations that have the first input as operand and
saves them. If more than one such implementation is found, the fastest one is chosen.
By following this procedure, the minimal time delay is guaranteed to be found, as the algorithm
performs an exhaustive search over the whole search space. The critical time delay of the MCM,
which will be used in the minimize area function, is taken as the highest delay among the target
coefficients' minimum time delays.
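A stripped-down model of this recursion is sketched below: each term has a list of implementations given as pairs of odd fundamental operands, the delay of a term is the adder delay plus the slower of its operands, and the base case is the input itself (term 1). The names, the unit adder delay and the table layout are assumptions of this sketch; the pruning against the critical time and the path bookkeeping of the real minimize path are omitted.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Each term maps to its implementations, stored as pairs of fundamental
// operands (shifts are implicit, e.g. 45 = (5 << 3) + 5 is stored as (5, 5)).
typedef std::map<int, std::vector<std::pair<int, int> > > ImpTable;

int min_time(const ImpTable& imps, int term, int adder_delay) {
    if (term == 1) return 0;               // base case: the input x itself
    ImpTable::const_iterator it = imps.find(term);
    if (it == imps.end()) return INT_MAX;  // term cannot be built
    int best = INT_MAX;
    const std::vector<std::pair<int, int> >& list = it->second;
    for (std::size_t i = 0; i < list.size(); ++i) {
        int tu = min_time(imps, list[i].first, adder_delay);
        int tv = min_time(imps, list[i].second, adder_delay);
        if (tu == INT_MAX || tv == INT_MAX) continue;
        int t = adder_delay + (tu > tv ? tu : tv);
        if (t < best) best = t;            // keep the fastest path
    }
    return best;
}
```

With the implementations 45 = (5 ≪ 3) + 5 and 45 = (9 ≪ 2) + 9, where 5 and 9 are built directly from ones, the minimum depth of 45 is two adder steps.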
5.3.3 Minimize area function
Whereas the previous function sets the minimum time delay for each individual coefficient,
this function relaxes the minimum time delay of every coefficient whose time delay is smaller
than the MCM critical time, in order to include the maximum number of implementations with the
potential to minimize the area in the Boolean network. With more alternative implementations of a
term, there are more ways to jointly optimize the paths of the target coefficients so as
to minimize the resulting area occupation. Therefore, the time constraint of this function is the
highest time delay among all the time delays of the target coefficients, i.e. the critical time
delay of the MCM. The critical path depends on the maximum adder-step and on the type of adder
structure to be used. It is possible to build different MCM graphs by using different adder types.
This function takes as arguments the data t structure, the current term to be implemented, a
vector with the path, the current critical time, and the map containing the area implementations.
Figure 5.4: Minimize area algorithm flowchart (load data from data.partials[coef]; if there are
implementations with ones, lock them in area_imps and return path and time; otherwise iterate
over the implementations, recursively calling minimize area, discarding those whose cur_time
exceeds critical_time and locking the usable ones in area_imps; if none is found, return
empty_path and INF)
Figure 5.5: Implementation of the term 89 (adder nodes: 2 + 1, 1 + 8, 9 + 48, 32 + 57)
This part of the code performs all the analysis for the construction of the Boolean network and
is based on the work of [14] and [3]. A flowchart explaining the recursive function is presented in
figure 5.4. In a manner similar to minimize path, while iterating through the implementations
of a term, the recursive function checks whether the time needed to implement it exceeds
the critical time. If it exceeds, the implementation is discarded. Otherwise, the function locks it
against future changes in the map area t structure, saves the remaining time, and continues with
another implementation. If another implementation tries to use an already locked term, the time
needed by the locked term is checked. If it can meet the remaining time, the program continues
and the same check is applied to the operands. On the other hand, if the remaining time cannot
be met, the program moves on to the next implementation. Another situation is observed when a
prior path has locked several implementations. In this case, the implementations are checked one
at a time for usability, i.e. for the possibility of unlocking an implementation by checking the
saved remaining time. This function terminates when, for each target coefficient, there is at
least one implementation with a time delay smaller than or equal to the critical time delay. The
function is always guaranteed to finish, because it can find at least one implementation (if the
minimum time delay of the coefficient is equal to the time constraint) or more (if the minimum
time delay of the coefficient is smaller than the time constraint) for each coefficient.
To clarify this, consider the following example. We want to implement the coefficients 89 and 57
with time delays of 72 (arbitrary time units) and 49, respectively. From the previous discussion, it
is clear that the critical time is given by the path taken by the coefficient 89, with a value
of 72 time units, and that the coefficient 57 has a time slack of 13 time units. Figure 5.5 depicts
the final implementation. Notice that this implementation uses the term 57, which is also a target
coefficient. However, the coefficient 57 has several implementations with different time delays,
ranging from 49 to 55 time units. The algorithm always starts with the smallest coefficient,
therefore 57 is processed first, taking into account all implementations that do not exceed 72
time units. The implementations of 57 will be added to the map area t and locked, since their
maximum time delay of 55 time units does not exceed the critical time. After the algorithm is
finished with the coefficient 57, 89 is processed. At this point, all implementations of the term
57 that exceed the remaining time are removed. In this example, only the implementation
57 = 48 + 9 is able to meet the critical time; its time delay is the smallest, with a value of 49
time units. Hence, the area t structure will have one implementation of 57 and one of 89.
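The filtering applied to the implementations of 57 in this example can be sketched as follows. The delay values used below are illustrative, and the real function also updates the lock flags in the map area t structure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep only the implementations whose delay does not exceed the remaining
// (critical) time; the rest are removed from the candidate set.
std::vector<int> meet_time(const std::vector<int>& delays, int remaining) {
    std::vector<int> kept;
    for (std::size_t i = 0; i < delays.size(); ++i)
        if (delays[i] <= remaining)
            kept.push_back(delays[i]);
    return kept;
}
```

For the term 57 above, filtering delays between 49 and 55 time units against a remaining time of 49 leaves only the 48 + 9 implementation.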
Summary
In this chapter, a new algorithm based on the minimization of the MCM propagation time is
proposed. Although its main goal is the minimization of the MCM critical path, area optimization
was also addressed. The work detailed in [14] and [3] was used as the basis for the area optimization,
by adapting it to the time constraints.
6 Results
To evaluate the proposed approach to improve the performance of the MCM units, by using the
developed structured adder block, optimized in terms of the resulting propagation time, together
with the proposed adaptations of the optimization algorithms, a comprehensive experimental
procedure was conducted. All the structures implemented in this chapter were described with
structural VHDL, synthesized with Synopsys Design Vision® and placed-and-routed with Ca-
dence Encounter®, by using the CMOS process from the UMC 90 nm logic gate technology library
[1], called standard cell throughout this chapter, and by considering the typical operating condi-
tions (Vdd = 1.2 V, T = 25 °C). The optimization algorithms that build the MCM structure were
implemented in C++98.
This chapter is divided into three parts: i) evaluation of the optimized adder and increment im-
plementations; ii) evaluation of the proposed structure to implement each node of the MCM; and
iii) evaluation of the MCM structures based on the developed structural adder and the adapted
optimization algorithm. For this purpose, a comprehensive set of benchmarks was adopted, which
allowed a fair comparison with other state of the art approaches.
6.1 Optimized Adder Structure Evaluation
This section evaluates, in standard cells, the set of adder/subtracter and increment/decrement
structures that were considered for the implementation of the several zones of the proposed
structured adder block. In particular, the adopted model for the area and propagation time of each
structure, based on Tyagi's model from section 2.4.1, will be validated with a real library of
standard cells. This section is organized in the same way as section 2.4.1: first we study the
adders and then the increment/decrement architectures.
Each of the analysed blocks was synthesized using the incremental flag. This option syn-
thesizes the circuit from the low level blocks to the high level blocks. The goal of not optimizing
everything at the top level is to keep the original structure of each module. If the tool had the
freedom to optimize everything at the top level, then the area cost and the propagation time
would be altered, so that the block meets the constraints at any possible cost. In the case of a
propagation time constraint, the tool would use logic gates that take more area but have a smaller
latency. The inverse constraint would behave analogously.
6.1.1 Adder Structures
The results corresponding to each adder implementation in standard cell are presented in
figures 6.1a, 6.1b and 6.1c. The FA is kept for comparison.
As can be seen in figure 6.1b, the SA is faster than both the FA and the CLA when it
comes to the output of the sum. The parallel structure propagates the signal faster than the other
two architectures. However, if we look at the carry out delay in figure 6.1c, it is shown that the
Figure 6.1: Area and propagation time of several adder structures when implemented with a
Standard Cell library: (a) area comparison between FA, CLA and SA; (b) output propagation time
comparison between FA, CLA and SA; (c) carry out propagation time comparison between FA,
CLA and SA.
CLA and the SA structures have the same time delay up to 8 bits. Nevertheless, for bit widths
greater than 8 bits, the CLA is faster than the SA. As a result, for the modular adder structure
presented in section 3.3, we get better results with the CLA than with the SA when zone 4
is present, which happens in the majority of the cases. Hence, the CLA is the wiser choice. With
the results of the synthesis, we have modelled the CLA architecture in figures 6.2a and 6.2b. It is
the only architecture used in the implementation of the adder.
Figure 6.2: Comparison of the area and time delay of the CLA adder structure, model versus
Standard Cell implementation: (a) CLA area; (b) CLA propagation time to the sum output.
6.1.2 Increment and Decrement Structures
For the increment and decrement structures, it was observed that the serial architectures can
be faster than the parallel architectures for lower bit-widths, as predicted by the Tyagi model
used in section 2.4.1.B. To evaluate this model, we synthesized the circuits and obtained the
results shown in figures 6.3a and 6.3b.
Figure 6.3: Comparison between HA+MFA and RID [20]: (a) area comparison; (b) output
propagation time comparison.
We can see that, for the same area, the RID structures provide improvements in propagation time
for all bit-widths. Hence, the RID has proved to be a good choice for the increment/decrement
implementation. With the results of the synthesis, we have defined a model of the RID and the
HA+MFA architectures, as presented in figures 6.4a and 6.4b.
Figure 6.4: Area and propagation time modelling and Standard Cell implementation of the
HA+MFA and RID: (a) implementation area; (b) propagation time.
According to the obtained results, we will use the RID structure proposed by [20] for zones 2
and 4, and the CLA to implement the additions in zone 3.
6.2 Structured Adder Block Evaluation
In this section, the area and time models built in chapter 4 will be analysed. The results of the
adder are presented both for the Tyagi model and for the standard cell implementation, allowing a
direct comparison of these two models.
6.2.1 Tyagi time and area evaluation
Based on the model that was built in chapter 4, we obtained the values in table 6.1 by varying
the zone bit widths of the proposed adder. Whenever Z2 > 0, we take into account the logic used
in the case of an increment in zone 2, as seen in section 3.2.1.
Adder size  Z4  Z3  Z2   Time  Area    #
                        (arbitrary units)
  2          0   0   2      3     6    1
  2          0   2   0      5    18    2
  4          0   0   4      5    12    3
  4          0   2   2      7    44    4
  4          0   4   0      9    41    5
  4          2   2   0      8    31    6
  8          0   0   8      9    24    7
  8          0   2   4      5    30    8
  8          0   4   2      9    66    9
  8          0   4   4      9    93   10
  8          0   8   0     13    85   11
  8          2   2   2      8    13   12
  8          2   2   4     10    43   13
  8          2   4   2     10    59   14
  8          4   2   2     10    43   15
  8          4   4   0     12    65   16
 16          0   0  16     10   224   17
 16          0   2   8      9    42   18
 16          0   4   8      9    60   19
 16          0   8   2     13   110   20
 16          0   8   4     13   137   21
 16          0   8   8     13   193   22
 16          0  16   0     17   173   23
 16          2   2   8     14    55   24
 16          2   4   8     14    77   25
 16          4   2   4     12    55   26
 16          4   2   8     16    67   27
 16          4   4   8     16    89   28
 16          4   8   2     14   115   29
 16          4   8   4     14   121   30
 16          8   2   4     16    79   31
 16          8   4   2     16    95   32
 16          8   4   4     16   101   33
 16          8   8   0     18   133   34
 32          0   0  32     11   464   35
 32          0   2  16     10   242   36
 32          0   4  16     10   260   37
 32          0   8  16     13   309   38
 32          0  16   2     17   198   39
 32          0  16   4     17   225   40
 32          0  16   8     17   281   41
 32          0  16  16     17   397   42
 32          0  32   0     21   349   43
 32          2   2  16     16   255   44
 32          2   4  16     16   277   45
 32          2   8  16     16   321   46
 32          2  16   2     17   210   47
 32          2  16   4     17   237   48
 32          2  16   8     17   293   49
 32          4   2  16     18   267   50
 32          4   4  16     18   289   51
 32          4   8  16     18   333   52
 32          4  16   2     17   222   53
 32          4  16   4     17   249   54
 32          4  16   8     17   305   55
 32          8   2   8     20    91   56
 32          8   4   8     20   113   57
 32          8   8   8     20   157   58
 32          8  16   8     20   245   59
 32          8   2  16     22   291   60
 32          8   4  16     22   313   61
 32          8   8  16     22   357   62
 32         16   2   8     21   267   63
 32         16   4   8     21   289   64
 32         16   8   8     21   333   65
 32         16  16   0     21   397   66

Table 6.1: Propagation time and area obtained with Tyagi's model for the structured adder used in each MCM node.
Figure 6.5: Distribution of propagation time and area occupation of the structured adder
implementation with the Tyagi metric, when considering different implementations.
As expected, both the propagation time and the implementation area grow with the size of
the adder. It is worth noting a particular situation that occurs when Z2 ≥ Z3 + Z4:
in this case, the propagation time is higher than in the other increment case (Z4 ≥ Z3 + Z2),
because zone 4 receives an anticipated carry out from zone 3 due to the usage of a CLA in zone 3,
as described in section 2.4.1.A and section 6.1.1. Note also the pure adders: they take more
time than the equivalent hybrid structures and, in some cases, they take even more area. In
fact, when comparing the configurations in rows 5 and 6 of table 6.1, i.e. the pure adder structure
with 4 bits in zone 3 against a structural adder with 2 bits in zone 4 and 2 bits in zone 3, the
hybrid structure is faster thanks to the anticipated carry in zone 4. Moreover, row 5 has a
greater area. The same is not true for the row 4 configuration, whose propagation time is even
smaller than that of the row 6 configuration. Nevertheless, its area is greater.
The chart in figure 6.5 was built with the data from table 6.1 (in the same order). The evolution
of the adder propagation time and area occupation is not linear, but it can achieve the desired
result, i.e. time minimization while still having an acceptable area. When comparing with the
time and area of the simpler CLA in figures 2.18b and 2.18a, we can see that the proposed adder
has a smaller propagation time for a similar area.
6.2.2 Standard Cell time and area evaluation
Table 6.2 presents the results concerning the propagation time and area occupation after the
synthesis of the proposed structural adder block with the standard cell library.
Adder size  Z4  Z3  Z2   Time (ns)  Area (µm²)    #
  2          0   0   2     0.14          7        1
  2          0   2   0     0.25         61        2
  4          0   0   4     0.16         18        3
  4          0   2   2     0.26         87        4
  4          0   4   0     0.27        206        5
  4          2   2   0     0.28        157        6
  8          0   0   8     0.35        217        7
  8          0   2   4     0.27        140        8
  8          0   4   2     0.31        325        9
  8          0   4   4     0.32        287       10
  8          0   8   0     0.35        501       11
  8          2   2   2     0.35        257       12
  8          2   2   4     0.31        317       13
  8          2   4   2     0.35        435       14
  8          4   2   2     0.38        361       15
  8          4   4   0     0.37        463       16
 16          0   0  16     0.46        645       17
 16          0   2   8     0.37        308       18
 16          0   4   8     0.40        548       19
 16          0   8   2     0.38        574       20
 16          0   8   4     0.37        682       21
 16          0   8   8     0.44        716       22
 16          0  16   0     0.43        991       23
 16          2   2   8     0.40        350       24
 16          2   4   8     0.43        529       25
 16          4   2   4     0.37        415       26
 16          4   2   8     0.47        531       27
 16          4   4   8     0.48        606       28
 16          4   8   2     0.45        835       29
 16          4   8   4     0.44        951       30
 16          8   2   4     0.47        544       31
 16          8   4   2     0.49        727       32
 16          8   4   4     0.49        780       33
 16          8   8   0     0.50        843       34
 32          0   0  32     0.50       1779       35
 32          0   2  16     0.36        762       36
 32          0   4  16     0.39        821       37
 32          0   8  16     0.43       1087       38
 32          0  16   2     0.44       1222       39
 32          0  16   4     0.45       1161       40
 32          0  16   8     0.48       1214       41
 32          0  16  16     0.46       1650       42
 32          0  32   0     0.57       2361       43
 32          2   2  16     0.40        807       44
 32          2   4  16     0.40       1061       45
 32          2   8  16     0.42       1324       46
 32          2  16   2     0.44       1542       47
 32          2  16   4     0.45       1534       48
 32          2  16   8     0.48       1444       49
 32          4   2  16     0.46        855       50
 32          4   4  16     0.45       1062       51
 32          4   8  16     0.46       1459       52
 32          4  16   2     0.48       1606       53
 32          4  16   4     0.48       1655       54
 32          4  16   8     0.49       1858       55
 32          8   2   8     0.56        617       56
 32          8   4   8     0.56        759       57
 32          8   8   8     0.57       1023       58
 32          8  16   8     0.56       2003       59
 32          8   2  16     0.54       1133       60
 32          8   4  16     0.54       1195       61
 32          8   8  16     0.54       1547       62
 32         16   2   8     0.67       1142       63
 32         16   4   8     0.66       1259       64
 32         16   8   8     0.62       1680       65
 32         16  16   0     0.61       1847       66

Table 6.2: Propagation time and area obtained with a standard cell implementation of the proposed adder used in each MCM node.
According to the presented results, it can be observed that the previously presented Tyagi
model of the propagation time and implementation area follows with close accuracy the respec-
tive trend of the real (measured) results, when the circuit is implemented with a standard cell
library. This aspect is very important, since this same model was used as input of the optimization
procedures that were presented in chapter 5.
The chart in figure 6.6 was built with the data from table 6.2 (in the same order). The evolution
of the adder propagation time and area is not linear, but it achieves the desired result, i.e. time
minimization while still having an acceptable area. When comparing these values with the time
and area of the CLA in figures 6.1b and 6.1a, we can see that the proposed adder structure has a
smaller propagation time for a similar area. This result is in line with what was previously seen
in section 6.2.1.
[Chart: time (ns, left axis, 0 to 0.8) and area (µm², right axis, 0 to 2500) for the 66 adder configurations of Table 6.2.]
Figure 6.6: Distribution of time delay and area occupation for the different adder implementations, with standard cell metrics.
6.3 Multiple Constant Multiplication Structure
To understand the impact of the proposed adder block on the performance (propagation time
and implementation area) of a MCM structure, we compared an implementation based on [2] against
the implementation with our adder. As experimental sets, we used the coefficient sets from [2]
and [20]. The first sets were computed with MATLAB® using the Remez algorithm, while the
second set is a difficult-to-obtain minimal set from [18]. Finally, the last set (the tenth set) is
composed of large prime numbers (seven coefficients from 0 to 4k, seven coefficients from 4k to
32k and seven coefficients from 32k to 64k), corresponding to a particularly difficult factorisation.
In table B.1 of Appendix B we enumerate each of the sets that were used to compute the MCMs.
                    ASSUME-A algorithm                                Proposed algorithm
                    Adder from [2]               Proposed adder       Proposed adder
Filter  Operations  Delay (ns)  Area (µm²)   Delay (ns)  Area (µm²)   Operations  Delay (ns)  Area (µm²)
  0         18        2.93       14100         1.54       22224          23         1.45       26065
  1         10        2.08       10156         0.95       19124          10         1.01       17587
  2         17        2.53       14138         1.25       22106          18         1.19       23946
  3         15        2.90       12681         1.40       19930          18         1.19       24047
  4         28        2.78       30712         1.33       45472          30         1.36       44882
  5         34        2.73       33584         1.34       48520          37         1.47       47448
  6         21        2.74       16602         1.47       25859          25         1.49       27970
  7         30        2.94       21755         1.42       36805          39         1.46       39730
  8         46        3.05       41852         1.68       58046          51         1.68       64919
  9         32        2.76       22784         1.57       34689          38         1.51       39777
 10         41        2.31       31913         1.73       44403          49         1.69       52901
Avg.         -           -           -         1.92         55%           -         1.95        68%
Min.         -           -           -         1.34         39%           -         1.37        41%
Max.         -           -           -         2.19         88%           -         2.44        90%

Table 6.3: Experimental results of the MCM implementation after synthesis.
The procedure for this test was: i) the algorithm was run with no time limit for the given co-
efficient sets; ii) once the MCM solutions were found, we translated the MCM decompositions
into VHDL according to the adopted adder structure. The code then went through the high-level
synthesis and through the place-and-route process. The analysis results after synthesis and
after place and route are presented in tables 6.3 and 6.4.
                    ASSUME-A algorithm                                Proposed algorithm
                    Adder from [2]               Proposed adder       Proposed adder
Filter  Operations  Delay (ns)  Area (µm²)   Delay (ns)  Area (µm²)   Operations  Delay (ns)  Area (µm²)
  0         18        2.838      13869         1.823      23698          23        1.866       26044
  1         10        2.024      11172         1.207      18167          10        1.435       14558
  2         17        2.392      14981         1.555      22689          18        1.533       23980
  3         15        2.833      12293         1.676      21025          18        1.557       23664
  4         28        2.678      27918         1.815      40300          30        1.811       40967
  5         34        2.634      31297         1.859      43593          37        1.976       44895
  6         21        2.655      17542         1.875      26895          25        1.920       28781
  7         30        2.856      22509         1.831      37084          39        1.878       41555
  8         46        3.012      39843         2.231      54025          51        2.281       56858
  9         32        2.662      24270         2.000      37087          38        1.866       43305
 10         41        2.985      39830         2.077      47534          49        2.235       52941
Avg.         -           -           -         1.50         51%           -        1.46         60%
Min.         -           -           -         1.33         19%           -        1.32         30%
Max.         -           -           -         1.69         71%           -        1.82         92%

Table 6.4: Experimental results of the MCM implementation after place and route.
Tables 6.3 and 6.4 show the results obtained with the synthesis and with the place-and-route
tools. For the synthesis, the adder used in [2] is compared with the proposed adder. The first
eleven rows contain the data from each implementation. The last three rows present the speedup
and the area penalization of each implementation. The average, minimum and maximum speedups
quantify how much faster an implementation is when compared to the reference (ASSUME-A with
the adder from [2]); a speedup of one means the implementation has the same speed as the
reference. The formula used for each individual speedup is t_ref,i / t_algo,i. The average, minimum
and maximum area penalizations quantify how much more area, in percentage, the implementation
requires when compared to the reference. The formula used for each individual area penalization
is A_algo,i / A_ref,i − 1, a percentage that estimates how much bigger the implementation is than
the reference.
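As a cross-check of these formulas, the sketch below recomputes the summary rows of Table 6.3 from its per-filter data (reference: ASSUME-A with the adder from [2]; candidate: the proposed algorithm with the proposed adder). The variable names are illustrative only.

```python
# Per-filter (delay ns, area um^2) pairs from Table 6.3, filters 0..10.
ref = [(2.93, 14100), (2.08, 10156), (2.53, 14138), (2.90, 12681),
       (2.78, 30712), (2.73, 33584), (2.74, 16602), (2.94, 21755),
       (3.05, 41852), (2.76, 22784), (2.31, 31913)]
new = [(1.45, 26065), (1.01, 17587), (1.19, 23946), (1.19, 24047),
       (1.36, 44882), (1.47, 47448), (1.49, 27970), (1.46, 39730),
       (1.68, 64919), (1.51, 39777), (1.69, 52901)]

# speedup_i = t_ref,i / t_algo,i ; penalty_i = A_algo,i / A_ref,i - 1
speedups = [tr / tn for (tr, _), (tn, _) in zip(ref, new)]
penalties = [an / ar - 1 for (_, ar), (_, an) in zip(ref, new)]

avg_speedup = sum(speedups) / len(speedups)    # ~1.95
max_speedup = max(speedups)                    # ~2.44 (filter 3)
min_speedup = min(speedups)                    # ~1.37 (filter 10)
avg_penalty = sum(penalties) / len(penalties)  # ~0.68 -> 68%
```

Running this reproduces the Avg./Min./Max. rows of the table's last columns.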
As can be observed, by simply replacing the conventional adder structures with the proposed
adder in the MCM structure obtained with [3], there is an average speedup of 1.92 and an average
silicon area augmentation of 55%. Clearly, the proposed adder introduces an improvement in
latency over the existing technology. This performance can still be improved by applying the
developed optimization algorithm, which takes into account a time minimization procedure using
the time/area model that was previously defined and evaluated for the proposed adder structure.
When comparing the proposed algorithm's performance with the previous technology from [3],
there is an average speedup of 1.95 and an average silicon area augmentation of 68%. Hence, the
proposed algorithm obtains better results across the board. Note the maximum speedup of 2.44,
versus the 2.19 obtained by the ASSUME-A algorithm. The proposed algorithm obtains gains at
least equal to or better than those of ASSUME-A: the minimum gain is 1.37 for the proposed
algorithm and 1.34 for the ASSUME-A algorithm.
After repeating the same evaluation after place and route, we observed an average speedup
of 1.50 and an average silicon area augmentation of 51%. Again, the proposed adder brings an
improvement in latency over the existing technology, even though it is slightly less significant than
in the synthesis results. Comparing the proposed algorithm's performance with the previous
technology from [3], there is an average speedup of 1.46 and an average silicon area augmentation
of 60%. This time the results are not as good as in the synthesis: the average speedup is smaller.
Note, however, the maximum speedup of 1.82, versus the 1.69 obtained by the ASSUME-A
algorithm. As in the synthesis, the proposed algorithm obtains gains at least equal to or better
than those of ASSUME-A: the minimum gain is 1.32 for the proposed algorithm and 1.33 for the
ASSUME-A algorithm.
Summary
In this section, we presented a comprehensive evaluation of the hardware structures and
optimization algorithms that were described in the previous chapters. First, an analysis of the
adder block with different operator resolutions was presented. Then, the optimization algorithms
were compared using different filters, demonstrating the advantages of using the proposed
structural adder in each node of the MCM multiplier.
7 Conclusions
This thesis addresses the problem of maximizing the processing speed of a MCM circuit. We
resorted to current state-of-the-art arithmetic circuits and to the previous work presented in [3] to
build a hybrid adder structure and a new algorithm for time minimization, respectively.
Based on the shift-and-add operations of the MCM structures, we identified several optimiza-
tions at the implementation level to build a modular and structural adder. The proposed adder
structure is a modular circuit, scalable by zones, where each zone is defined by a simple arith-
metic operation. In each zone, several different implementations were studied with the aim of
minimizing the latency. While the complexity of this hybrid adder slightly increases, its latency
is lower than that of conventional adders of the same bit-width. This was achieved by separating
the hybrid adder into four main zones: zone 1 does not perform any operation and is a simple
zero generation; zone 2 is either an increment or a bypass; zone 3 is an addition; finally, zone 4
is an increment/decrement/bypass.
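As an illustration only, the behavioral sketch below models one MCM shift-and-add node; the zone mapping in the comments is our own interpretation of the description above, and the function name and parameters are hypothetical, not taken from the implementation.

```python
def mcm_node(x, y, shift, subtract=False, width=16):
    """Behavioral model of one MCM shift-and-add node: x +/- (y << shift).

    In the hybrid adder described above, the output word is split into
    four zones (this model reproduces only the arithmetic result):
      zone 1: bits below both operands -> simple zero generation;
      zone 2: bits covered only by the unshifted operand -> increment
              or bypass;
      zone 3: bits where both operands overlap -> a true addition;
      zone 4: upper/sign-extension bits -> increment/decrement/bypass.
    """
    mask = (1 << width) - 1            # truncate to the modeled bit-width
    t = (y << shift) & mask            # shifted operand
    return (x - t if subtract else x + t) & mask
```

For example, `mcm_node(x, x, 3)` behaviorally computes the partial term 9x as x + (x << 3), matching the shift-and-add decompositions used throughout the MCM structures.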
According to the obtained experimental results, we achieved speedups of up to 2.5 times when
compared to the hardware proposed in [2]. Even though it occupies more silicon area than some
state-of-the-art implementations, we noted that the proposed structural adder has a fairly linear
growth in terms of area. Moreover, the organization of this modular adder facilitates the usage of
other possible technologies in each zone.
As for the algorithm based on the work proposed in [3], we developed a latency-minimization
oriented algorithm, weighted by an area minimization criterion and based on the CSD representation.
The algorithm was built around the time constraint of a given MCM. This time constraint is de-
fined as the longest minimal time to compute a coefficient in the coefficient set of the MCM.
After computing the critical time corresponding to all possible alternatives, the algorithm
chooses the most convenient implementation for each coefficient, by identifying the implementa-
tions that satisfy the imposed time constraint and taking the ones that jointly achieve a minimal
area. Once the algorithm has terminated, the result is an MCM structure with minimal propagation
time at the lowest cost in terms of area. Hence, the novelty introduced by the present approach
is the extension of this optimization to the logic-cell level and the usage of time, instead of area,
as the optimization constraint.
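The selection step described above can be sketched as follows. The function name, data layout and toy candidate values are hypothetical; the real algorithm operates on CSD decompositions and shared partial terms rather than on independent (time, area) lists.

```python
def pick_implementations(candidates):
    """Sketch of the described selection: `candidates` maps each
    coefficient to a list of (time, area) implementation alternatives.

    The time constraint is the longest minimal time over all
    coefficients; for each coefficient, among the alternatives that
    satisfy it, the smallest-area one is chosen."""
    t_constraint = max(min(t for t, _ in alts)
                       for alts in candidates.values())
    chosen = {}
    for coef, alts in candidates.items():
        feasible = [(t, a) for t, a in alts if t <= t_constraint]
        chosen[coef] = min(feasible, key=lambda ta: ta[1])
    return t_constraint, chosen

# Hypothetical toy data: (time ns, area um^2) alternatives per coefficient.
candidates = {3: [(1.0, 10), (0.8, 15)], 7: [(1.5, 12), (2.0, 9)]}
t_constraint, chosen = pick_implementations(candidates)
```

Here coefficient 7 forces the constraint to 1.5 ns, so coefficient 3 may take its slower but smaller 1.0 ns alternative without degrading the overall MCM propagation time.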
As the experimental results show, the conducted synthesis proves that the MCM structures
obtained with our algorithm have performance at least equal to or better than that of other state-
of-the-art algorithms. This is mainly due to the proposed refined model adopted in the construction
of the MCM. As for the place and route, the gains were more marginal than those obtained with
the synthesis, when comparing the achieved performance with the other state-of-the-art algorithms.
In terms of future work, the application of the proposed approach in real applications would
prove its usability, namely in time-critical operations (e.g. filtering, optimization, etc.) used by
real-time Digital Signal Processing applications.
Bibliography
[1] (2009). Faraday ASIC Cell Library FSD0A A 90nm Standard Cell. Faraday Technology Cor-
poration.
[2] Aksoy, L., Costa, E., Flores, P., and Monteiro, J. (2007a). Optimization of area in digital FIR
filters using gate-level metrics. In Design Automation Conference, 2007. DAC '07. 44th ACM/IEEE,
pages 420–423.
[3] Aksoy, L., da Costa, E., Flores, P., and Monteiro, J. (2008). Exact and approximate algorithms
for the optimization of area and delay in multiple constant multiplications. Computer-Aided
Design of Integrated Circuits and Systems, IEEE Transactions on, 27(6):1013 –1026.
[4] Aksoy, L., Gunes, E. O., Costa, E., Flores, P., and Monteiro, J. (2007b). Effect of number repre-
sentation on the achievable minimum number of operations in multiple constant multiplications.
In Signal Processing Systems, 2007 IEEE Workshop on, pages 424 –429.
[5] Aktan, M., Yurdakul, A., and Dundar, G. (2008). An algorithm for the design of low-power
hardware-efficient FIR filters. Circuits and Systems I: Regular Papers, IEEE Transactions on,
55(6):1536–1545.
[6] Arroz, G. (2009). Arquitectura de Computadores, 2ª edição, chapter 5.2.3 Operações com
Números em Complemento para 2. IST Press.
[7] Avizienis, A. (1961). Signed-digit number representations for fast parallel arithmetic. Electronic
Computers, IRE Transactions on, EC-10(3):389–400.
[8] Bull, D. and Horrocks, D. (1988). Primitive operator digital filter synthesis using a shift biased
algorithm. In Circuits and Systems, 1988., IEEE International Symposium on, pages 1529
–1532 vol.2.
[9] Bull, D. and Horrocks, D. (1991). Primitive operator digital filters. Circuits, Devices and
Systems, IEE Proceedings G, 138(3):401 –412.
[10] Cappello, P. and Steiglitz, K. (1984). Some complexity issues in digital signal processing.
Acoustics, Speech and Signal Processing, IEEE Transactions on, 32(5):1037 – 1041.
[11] Dempster, A. and Macleod, M. (1994). Constant integer multiplication using minimum adders.
Circuits, Devices and Systems, IEE Proceedings -, 141(5):407 –413.
[12] Dempster, A. and Macleod, M. (1995). Use of minimum-adder multiplier blocks in FIR digital
filters. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on,
42(9):569–577.
[13] Dempster, A. and Macleod, M. (2004). Using all signed-digit representations to design single
integer multipliers using subexpression elimination. In Circuits and Systems, 2004. ISCAS ’04.
Proceedings of the 2004 International Symposium on, volume 3, pages III – 165–8 Vol.3.
[14] Flores, P., Monteiro, J., and Costa, E. (2005). An exact algorithm for the maximal shar-
ing of partial terms in multiple constant multiplications. In Computer-Aided Design, 2005.
ICCAD-2005. IEEE/ACM International Conference on, pages 13 – 16.
[15] Garner, H. L. (1966). Number systems and arithmetic. volume 6 of Advances in Computers,
pages 131 – 194. Elsevier.
[16] Gustafsson, O., Dempster, A., and Wanhammar, L. (2002). Extended results for minimum-
adder constant integer multipliers. In Circuits and Systems, 2002. ISCAS 2002. IEEE
International Symposium on, volume 1, pages I–73 – I–76 vol.1.
[17] Hartley, R. (1996). Subexpression sharing in filters using canonic signed digit multipli-
ers. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on,
43(10):677 –688.
[18] Johansson, K., Gustafsson, O., DeBrunner, L., and Wanhammar, L. (2011). Minimum adder
depth multiple constant multiplication algorithm for low power FIR filters. In Circuits and Systems
(ISCAS), 2011 IEEE International Symposium on, pages 1439–1442.
[19] Kumar, R., Mandal, A., and Khatri, S. (2012). An efficient arithmetic sum-of-product (SOP)
based multiplication approach for FIR filters and DFT. In Computer Design (ICCD), 2012 IEEE 30th
International Conference on, pages 195–200.
[20] Kumar, V., Phaneendra, P., Ahmed, S., Sreehari, V., Muthukrishnan, N., and Srinivas, M.
(2011). A reconfigurable inc/dec/2’s complement/priority encoder circuit with improved decision
block. In Electronic System Design (ISED), 2011 International Symposium on, pages 100 –105.
[21] Malvar, H., Hallapuro, A., Karczewicz, M., and Kerofsky, L. (2003). Low-complexity transform
and quantization in H.264/AVC. Circuits and Systems for Video Technology, IEEE Transactions
on, 13(7):598–603.
[22] Park, I.-C. and Kang, H.-J. (2001). Digital filter synthesis based on minimal signed digit
representation. In Design Automation Conference, 2001. Proceedings, pages 468 – 473.
[23] Pasko, R., Schaumont, P., Derudder, V., Vernalde, S., and Durackova, D. (1999). A new
algorithm for elimination of common subexpressions. Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, 18(1):58 –68.
[24] Plauger, P., Lee, M., Musser, D., and Stepanov, A. A. (2000). C++ Standard Template Library.
Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition.
[25] Richardson, I. E. (2010). The H.264 Advanced Video Compression Standard, Second
Edition, chapter 7 - H.264 transform and coding, page 194. John Wiley and Sons.
[26] Sklansky, J. (1960). Conditional-sum addition logic. Electronic Computers, IRE Transactions
on, EC-9(2):226 –231.
[27] Sousa, L., Roma, N., and Dias, T. (2003). Efficient adder architectures for high-performance
vlsi design. Technical report, INESC-ID Lisbon. Technical Report No. RT/001/2003-CDIL.
[28] Tyagi, A. (1993). A reduced-area scheme for carry-select adders. Computers, IEEE
Transactions on, 42(10):1163–1170.
[29] Voronenko, Y. and Puschel, M. (2007). Multiplierless multiple constant multiplication. ACM
Trans. Algorithms, 3(2).
[30] Zimmermann, R. (1997). Binary Adder Architectures for Cell-Based VLSI and their
Synthesis. PhD thesis, Swiss Federal Institute of Technology, Zurich.
Appendix A - Background
We present a brief review of number representation systems, explaining the representations
commonly used in MCM design and how to implement conversions between them.
A.1 Number Representations Systems
In this section, the most prevalent number representation systems are briefly reviewed,
with particular attention to Signed Binary (SB) representations. With this in mind, the TC [6]
will be considered as the most usual representation in electronic circuits. Next, the CSD [7] and
the MSD [22] will also be reviewed, due to their relevance in MCM design, since they offer
representations of the operated numbers with a minimal number of non-zero digits.
A.1.1 Unsigned Binary Representation
The binary system [6] is generally used in electronics as the default representation of unsigned
digits. This system uses two symbols: 0 and 1. Any unsigned number can thus be represented
by a well defined sequence of bits. Its conversion to the base-10 system (decimal) is done by the
following equation:

y₁₀ = Σ_x α_x · 2^x    (A.1)

where α_x is the binary digit in position x. Note that the subscripts 2 and 10 denote a binary and
a decimal number, respectively.
As an example, given the binary number 100110₂, we apply the above formula to obtain:
y₁₀ = 1 × 2⁵ + 0 × 2⁴ + 0 × 2³ + 1 × 2² + 1 × 2¹ + 0 × 2⁰ = 38₁₀.
The main limitation of this representation is the impossibility of representing negative values.
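Equation (A.1) can be checked with a one-line sketch (the variable names are ours):

```python
# Equation (A.1) in code: weight each binary digit by its power of two.
bits = "100110"
value = sum(int(b) << i for i, b in enumerate(reversed(bits)))  # -> 38
```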
A.1.2 Signed Binary Representations
Several different representations have been proposed to represent signed numbers. In this
subsection we review four commonly used representations: One's Complement (OC), TC, CSD
and MSD. In particular, the latter two are frequently used as useful decompositions for constants
in MCMs.
A.1.2.A One’s Complement
Just as the unsigned binary representation, the OC representation also uses the 0 and 1
symbols. However, it is able to extend the representation to negative values. The OC is defined
as the value that is obtained by inverting all bits of a number in the unsigned binary
representation. In an N-bit one's complement numbering system, one can represent numbers
within the range −(2^(N−1) − 1) to 2^(N−1) − 1. For example, by taking the number 0011₂ (3₁₀), we can
convert it to negative by simply taking the complement of all bits: 1100₂ (−3₁₀). Note that the most
significant bit represents the sign of the number: 1 for negative and 0 for positive.
The main disadvantage of this representation is the existence of two representations for the zero
value: the positive 0000₂ (+0₁₀) and the negative 1111₂ (−0₁₀).
A.1.2.B Two’s Complement
The TC representation is very similar to the unsigned binary representation, as it also uses
both 0 and 1 symbols. However, it offers a slightly larger representation range than the OC:
numbers within the range −2^(N−1) to 2^(N−1) − 1 can be represented. As in OC, the most significant
bit carries the sign of the number. To convert a positive number into a negative value, we simply
take the OC of the positive representation and add one. Furthermore, the addition operation is
implemented just as in the unsigned binary representation. The subtraction is also straightforward
and is implemented just like the addition, i.e. subtracting 5 from 15 is the same as adding −5
and 15. As an example, using an 8-bit representation, the number −5 is represented by:
00000101 → (complement) → 11111010 → (+1) → 11111011.
To convert a negative number to a positive one, the procedure is the same as in the above
example: 11111011 → (complement) → 00000100 → (+1) → 00000101.
Now, if both are added, the result of the operation is 0 (the carry out of the most significant bit
is discarded): 11111011 + 00000101 = 00000000.
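The complement-and-add-one procedure above can be sketched as a minimal model; the helper name and the 8-bit default are our choices:

```python
def tc_negate(value, width=8):
    """Two's-complement negation as described above: invert all bits,
    then add one, within the given bit-width."""
    mask = (1 << width) - 1
    return ((value ^ mask) + 1) & mask
```

For example, `tc_negate(0b00000101)` yields `0b11111011` (−5), and adding the two patterns modulo 2⁸ gives 0, reproducing the worked example.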
The possibility of representing negative numbers is of great interest in MCM. Next, two different
signed representations are presented.
A.1.2.C Canonical Signed Digit
Contrasting with the previous binary representations, the CSD representation [7] has three
symbols instead of just two: 1, 0 and 1̄, where the last symbol (the negated one) is equivalent
to −1. This representation system is essentially a re-encoding of the binary representation,
converting every run of consecutive 1's into a single 1 followed by a run of 0's and a
−1 (...001110... → ...0101̄0... in CSD, where 16 − 2 = 14), thus achieving the minimum number of
non-zero symbols. According to [15], on average the number of non-zero digits is reduced by 33%.
For the sake of illustration, let us consider the value 14, represented in TC as 01110 ⇔ 14 = 8 + 4 + 2.
When converted into CSD, the new representation becomes 1001̄0 ⇔ 14 = 16 − 2. In this example,
we decreased the number of non-zero symbols from 3 to 2.
Due to its particular characteristics, this representation has been shown to be very convenient
to minimize the number of adders. However, while there is only one possible representation in
CSD, the MSD representation (described below) offers the possibility to use several alternative
representations.
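A common way to obtain the CSD digits is the modulo-4 recoding sketched below. This is a standard textbook procedure for non-negative integers, not code from this thesis, and the function name is ours:

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer n, least
    significant digit first, with digits in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d           # removing the digit leaves an even number
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For 14 this produces the digits of 1001̄0 (16 − 2), with only two non-zero digits, matching the example above.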
A.1.2.D Minimal Signed Digit
The MSD [22] is very similar to the CSD. However, while CSD has only one possible represen-
tation for each value, MSD can have several. This is mainly due to the way an MSD number is
constructed. The conversion is realized by taking the CSD representation and subsequently
applying the following two transformation rules to the obtained numbers:
• 101̄ → 011, a reorganization of the digits: 1 × 4 + 0 × 2 − 1 × 1 = 4 − 1 = 3 becomes
0 × 4 + 1 × 2 + 1 × 1 = 2 + 1 = 3;
• 1̄01 → 01̄1̄, as in the previous rule: −1 × 4 + 0 × 2 + 1 × 1 = −4 + 1 = −3 becomes
0 × 4 − 1 × 2 − 1 × 1 = −2 − 1 = −3.
To illustrate this procedure, let us consider the CSD representation of the number 715:
101̄01̄0101̄01̄. By applying the first rule to the leftmost digits and the second rule to the middle
digits, we obtain the distinct representation 011001̄1̄01̄01̄, which also equals 715 and keeps the
same number of non-zero digits. Successively applying the same rules to the resulting
representations yields further alternatives.
Hence, by using the MSD numbering system, we produced additional representations of the
number 715. In general, while the CSD only allows one representation, MSD
can provide several, enabling a more diverse decomposition of the term.
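The two rewrite rules can be applied exhaustively in a short sketch; the digit lists are stored least-significant digit first, and the function name and data layout are our own:

```python
def msd_variants(digits):
    """Enumerate the signed-digit representations reachable from a CSD
    digit list (LSB first, digits in {-1, 0, 1}) via the two rewrite
    rules above; each rewrite preserves the non-zero digit count."""
    seen = {tuple(digits)}
    stack = [list(digits)]
    while stack:
        cur = stack.pop()
        for i in range(len(cur) - 2):
            if cur[i:i + 3] == [-1, 0, 1]:     # 1 0 1̄ -> 0 1 1, MSB first
                new = cur[:i] + [1, 1, 0] + cur[i + 3:]
            elif cur[i:i + 3] == [1, 0, -1]:   # 1̄ 0 1 -> 0 1̄ 1̄, MSB first
                new = cur[:i] + [-1, -1, 0] + cur[i + 3:]
            else:
                continue
            if tuple(new) not in seen:
                seen.add(tuple(new))
                stack.append(new)
    return seen

# CSD digits of 715, LSB first: -1 - 4 + 16 - 64 - 256 + 1024 = 715
csd_715 = [-1, 0, -1, 0, 1, 0, -1, 0, -1, 0, 1]
variants = msd_variants(csd_715)
```

Every element of `variants` evaluates to 715 with the same number of non-zero digits, confirming that the rules generate alternative minimal decompositions.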
Discussion
The subset of number representations reviewed in this section will be used throughout this
document. The binary TC representation is paramount in electronic circuits and will be used to
represent the input and output operands of the MCM circuits debated in this thesis. For the partial
term decomposition, the CSD numbering system will be used to build the MCM structures. This
decomposition was first used by Avizienis [7], Hartley [17], Bull et al. [9], Dempster et al. [11] [12]
[13] and Pasko et al. [23]. On the other hand, the MSD representation seems to have been first
used as a decomposition technique for the MCM problem by Park et al. [22], and later by Flores
et al. [14] and Aksoy et al. [4] [3].
Appendix B - Used set of coefficients
Set name   # coefficients   Max coefficients' bitwidth   Coefficients
fir00      14               12   -710;-662;-499;-266;-35;327;398;505;582;699;710;1943;2987;3395
fir01      61                7   -1;0;0;0;0;0;0;0;0;-1;-1;0;1;1;1;0;0;-1;-1;-1;0;1;2;2;1;-1;-2;-3;-2;0;3;4;3;1;-2;-5;-5;-3;1;5;7;6;1;-5;-9;-10;-5;3;11;15;12;2;-13;-24;-26;-14;14;51;89;117;128
fir02      51                9   0;0;0;0;0;0;0;0;0;0;0;0;1;1;1;1;-1;-2;-3;-3;-1;2;5;8;8;6;1;-6;-14;-19;-19;-12;2;19;35;43;38;18;-15;-53;-84;-96;-78;-23;67;182;303;411;485;512
fir03      20               12   -1;-6;0;15;13;-20;-44;1;80;63;-85;-173;1;283;223;-297;-650;2;1628;3058
fir04      41               11   -23;40;32;25;10;-10;-29;-35;-23;3;33;51;45;14;-30;-66;-74;-44;14;75;107;88;20;-71;-142;-151;-81;46;175;240;190;24;-203;-388;-419;-218;221;818;1426;1880;2048
fir05      61               11   70;546;-39;4;-34;-44;-30;2;34;48;35;1;-35;-52;-40;-4;36;58;47;9;-37;-64;-55;-14;38;71;64;20;-39;-79;-76;-29;39;90;91;40;-40;-103;-112;-55;40;122;140;77;-41;-150;-184;-112;42;197;262;177;-42;-296;-441;-345;42;656;1329;1851;2048
fir06      30               14   0;-1;0;4;4;-7;-17;0;36;29;-42;-87;2;148;113;-148;-293;4;450;335;-419;-819;6;1246;958;-1242;-2670;7;6544;12240
fir07      30               14   33;-14;-51;-51;13;86;71;-50;-156;-98;111;250;113;-217;-377;-106;387;543;55;-661;-768;79;1134;1114;-411;-2155;-1887;1571;6898;10896
fir08      50               16   -5;1;10;12;-3;-24;-21;15;51;33;-39;-92;-43;83;150;44;-159;-227;-24;276;319;-30;-444;-418;140;676;512;-332;-981;-579;637;1368;591;-1104;-1854;-503;1806;2470;243;-2893;-3298;339;4743;4603;-1688;-8759;-7615;6322;27640;43590
fir09      30               14   -59;-11;34;89;110;59;-57;-179;-217;-108;120;347;406;198;-216;-614;-711;-343;380;1074;1248;607;-697;-2023;-2463;-1281;1657;5700;9592;11976
fir10      21               16   233;293;379;769;2693;3499;3917;6247;7307;19753;20269;28279;29147;31189;37879;39313;40127;44563;50147;59063;63463

Table B.1: Coefficient sets.
Appendix C - Comparison of filter fir10
[Figure: shift-and-add graph for filter fir10, from the input through partial-term adder nodes (+3x, +5x, −7x, +9x, ...) and shift operations (<<k) to the outputs out1–out21.]
Figure C.1: Graph for the test set fir10 with the proposed adder and algorithm.
[Figure: shift-and-add graph for filter fir10, from the input through partial-term adder nodes (+3x, +5x, −7x, +9x, ...) and shift operations (<<k) to the outputs out1–out21.]
Figure C.2: Graph for the test set fir10 with the Levent adder and algorithm [3].