Challenges in Automatic Optimization of Arithmetic Circuits
Ajay K. Verma, Philip Brisk and Paolo Ienne
Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)

Circuit Performance Depends Heavily on the Description
• “Software” multiplier
• Multiplier with compressor tree
• Multiplier with optimized compressor tree

Pre-Synthesis Optimization of Arithmetic Circuits
Design flow: original circuit description → arithmetic optimizations → logic synthesis → physical design
Arithmetic optimizations: known architectures; automatic architecture exploration

Automation and Computer Arithmetic
Automation
Heuristics to optimize general classes of circuits
• Kernel and co-kernel extraction [Brayton82]
• Decomposition-based approaches for general circuits [Bertacco97, Mishchenko01, Yang02]
Algorithmic approaches for a particular class of circuits
• Variable group size CLA adder [Lee91]
• Irregular partial product compressors [Stelling98]

Logic Synthesis
Synthesis tools have become extremely good at optimizing circuits expressed in sum-of-products form
And when there are plenty of XOR gates?
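[Example 1: R expressed in terms of P and Q, where P is built from the input bits a0 to a4 and Q from the input bits b0 to b4; operators and subscript positions lost in extraction]
Before expansion: 0.37 ns (138.2 μm²); after expansion: 0.26 ns (146.9 μm²)
Before expansion: 0.22 ns (58.8 μm²); after expansion: 0.27 ns (221.2 μm²)
[Example 2: R expressed in terms of P and Q, each built from products of a and b bits with indices 0 to 3; operators and subscript positions lost in extraction]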
?

Outline
• Optimizing at Word-Level: Verma & Ienne, ICCAD 2004; Verma, Brisk, & Ienne, TCAD 2008
• Creating Macroscopic Structure: Verma, Brisk, & Ienne, DAC 2007 (Best Paper Award nominee); Verma, Brisk, & Ienne, IWLS 2008
• Exploring Microscopic Structure: Verma & Ienne, DAC 2006

Outline
• Optimizing at Word-Level
• Creating Macroscopic Structure
• Exploring Microscopic Structure
Complexity: low → high; granularity: high → low

Clustering: Maximization of the Use of Carry-Save Representation
Two addition nodes are separated by a NOT
The two addition nodes are clustered
Goal: swap the adders with other logic operations, while preserving the semantics, to cluster additions
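Why clustered additions are attractive can be seen in a small sketch of carry-save addition. The helper names `carry_save_add` and `sum_clustered` are illustrative assumptions, not taken from the slides, and words are modelled as non-negative Python integers rather than fixed-width buses.

```python
def carry_save_add(x, y, z):
    """3:2 compression: three words in, a (sum, carry) pair out, no carry propagation."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (y & z) | (x & z)) << 1
    return partial_sum, carry


def sum_clustered(operands):
    """Keep the running total in carry-save form; propagate carries only once at the end."""
    partial_sum, carry = 0, 0
    for operand in operands:
        partial_sum, carry = carry_save_add(partial_sum, carry, operand)
    return partial_sum + carry  # the single carry-propagate addition
```

In this model, every adder that joins a cluster costs one layer of bitwise logic; only the final `partial_sum + carry` needs a full carry chain.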

Examples of Transformations
(A << k) = A · 2^k
• Advancing shift left over add (distributivity of multiplication over addition)
• Advancing shift right over addition is more complex
• Advancing SEL over add (existence of the identity element of addition): (C ? A : D) + (C ? B : 0) = C ? (A + B) : D
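The transformations listed above can be checked on small integers. This is a minimal sketch with hypothetical helper names; it models `SEL` as a Python conditional and words as unbounded non-negative integers, which is an assumption rather than the slides' circuit model.

```python
def shl_then_add(a, b, k):
    return (a << k) + (b << k)


def add_then_shl(a, b, k):
    return (a + b) << k


def sel(c, x, y):
    """Word-level select: x when c is true, otherwise y."""
    return x if c else y


def sel_then_add(c, a, b, d):
    return sel(c, a, d) + sel(c, b, 0)


def add_then_sel(c, a, b, d):
    return sel(c, a + b, d)


def shr_mismatches(limit, k):
    """Pairs (a, b) below limit where shifting right before adding loses a carry."""
    return [(a, b) for a in range(limit) for b in range(limit)
            if (a >> k) + (b >> k) != (a + b) >> k]
```

`shr_mismatches(2, 1)` returns the pair `(1, 1)`: the two discarded low bits would have produced a carry, which is one reason the shift-right case needs more care.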

Some Transformations Have a Cost
Advancing PP (partial-product generation) over add (distributive property of multiplication over addition)
This transformation has a significant cost in terms of area!
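One way to read the area cost, assuming PP stands for partial-product generation: computing (A + B) · C as A · C + B · C lets all rows go into a single compressor tree, but it doubles the number of partial-product rows. The function names below are hypothetical and the model ignores bit widths.

```python
def partial_products(x, y, width):
    """Rows of the partial-product array: x shifted by i wherever bit i of y is set."""
    return [(x << i) if (y >> i) & 1 else 0 for i in range(width)]


def rows_before(a, b, c, width):
    """Add first, then generate partial products of (a + b) * c."""
    return partial_products(a + b, c, width)


def rows_after(a, b, c, width):
    """Generate partial products of a * c and b * c, leaving all additions to the end."""
    return partial_products(a, c, width) + partial_products(b, c, width)
```

Both row sets sum to the same product, but `rows_after` returns twice as many rows as `rows_before`, which is where the extra AND gates and compressor cells would come from in this reading.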

Generation of All Pareto-Optimal Implementations
Theorem: The transformations form a persistent and confluent reduction system
Pareto-optimal: an implementation is Pareto-optimal if every other implementation is worse in area or in critical-path delay
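The Pareto-optimality criterion can be sketched as a simple dominance filter over (delay, area) pairs. The function name `pareto_front` is an assumption for illustration; the three sample points used in the usage note are the min/max comparator figures quoted later in this deck.

```python
def pareto_front(points):
    """Keep the (delay, area) points that no other point matches or beats on both axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)
```

For the points (0.46 ns, 1755 μm²), (0.21 ns, 3479 μm²) and (0.22 ns, 1331 μm²), the filter keeps the last two: the first is both slower and larger than the third.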

Example: adpcmdecode Kernel
Compressor tree
AND network
0.85 ns, 5678 μm²
0.51 ns, 4901 μm²

Outline
• Optimizing at Word-Level
• Creating Macroscopic Structure
• Exploring Microscopic Structure
Limited scope for optimizations
Bit-level

Implementation of Subcircuits Corresponding to Contiguous Layers Can Be Improved
Leading Zero Anticipator: ADD (arithmetic) followed by LZD (logic)
A direct implementation of LZA in carry-select fashion [Gerwig99]

Input Condensation
Leader expressions:
• Sufficient to evaluate the whole of an expression
• Once you evaluate them, you can discard the input bits
Compute all leader expressions in parallel
Recursively compute leader expressions again
[Figure: IN → some large circuit → OUT is rewritten as IN → leader expressions L, with |L| < |IN| → smaller circuit → OUT]
[Figure: 8-input parallel counter with its leader expressions, outputs labelled s and c]
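A parallel counter gives a concrete picture of input condensation: three input bits of equal weight can be replaced by two bits (their sum and carry), after which the original three are no longer needed. The sketch below is one hand-written instance of that idea, not the authors' algorithm; `full_adder` and `parallel_counter` are hypothetical names.

```python
def full_adder(x, y, z):
    """Condense three bits of equal weight into a sum bit and a carry bit."""
    return x ^ y ^ z, (x & y) | (y & z) | (x & z)


def parallel_counter(bits):
    """Count the ones in bits by repeatedly condensing each weight column."""
    columns = [list(bits)]  # columns[w] holds the pending bits of weight 2**w
    count = 0
    w = 0
    while w < len(columns):
        col = columns[w]
        while len(col) > 1:
            x = col.pop()
            y = col.pop()
            z = col.pop() if col else 0  # a half adder when only two bits remain
            s, c = full_adder(x, y, z)
            col.append(s)
            if w + 1 == len(columns):
                columns.append([])
            columns[w + 1].append(c)
        if col:
            count |= col[0] << w
        w += 1
    return count
```

Each pass through the inner loop shrinks a column, mirroring the condition |L| < |IN| on the slide.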

Progressive Decomposition: Algorithm Overview
• Choose a subset of input bits (How many bits? Many different combinations?)
• Find leader expressions: optimize via Boolean ring properties; find identities
• Discard dependent expressions: among x, y, z, drop z if z = f(x, y)
• Rewrite circuit in terms of leader expressions
• Recursively process the remaining circuit

Example: 3-Input Adder (s2 Output)
Ripple-carry adder:
X = [a1b1 + (a1 + b1)a0b0] ⊕ [(a1 ⊕ b1 ⊕ a0b0)c1 + c0(a0 ⊕ b0)(c1 + (a1 ⊕ b1 ⊕ a0b0))]
Carry-save adder:
L(X, {a1, b1, c1}) = {a1 ⊕ b1 ⊕ c1, a1b1 ⊕ b1c1 ⊕ a1c1} (sum and carry, respectively)
[Figure: the ripple-carry structure, which adds a and b and then adds c to produce X, next to the 3:2 compressor structure, in which one CSA takes a0, b0, c0 and another takes a1, b1, c1 before a final adder produces X]
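The example can be checked exhaustively. The sketch below assumes that the operators lost from the ripple-carry expression are XORs and that X is bit 2 of a + b + c for 2-bit operands. It then shows that X depends on a1, b1, c1 only through the two leader expressions. All function names are hypothetical.

```python
def x_ripple(a1, a0, b1, b0, c1, c0):
    """Bit 2 of a + b + c, computed as two chained ripple-carry additions."""
    t1 = a1 ^ b1 ^ (a0 & b0)                 # bit 1 of a + b
    t2 = (a1 & b1) | ((a1 | b1) & a0 & b0)   # carry out of a + b
    m0 = c0 & (a0 ^ b0)                      # carry out of bit 0 when adding c
    m1 = (t1 & c1) | (m0 & (c1 | t1))        # carry out of bit 1 when adding c
    return t2 ^ m1


def leaders(a1, b1, c1):
    """Leader expressions of X with respect to {a1, b1, c1}: sum and carry."""
    return a1 ^ b1 ^ c1, (a1 & b1) ^ (b1 & c1) ^ (a1 & c1)


def x_from_leaders(sum1, carry1, a0, b0, c0):
    """X rebuilt from the two leaders and the remaining input bits."""
    carry0 = (a0 & b0) ^ (b0 & c0) ^ (a0 & c0)  # carry of the weight-1 column
    return carry1 ^ (sum1 & carry0)
```

Once `leaders` has been evaluated, `x_from_leaders` never looks at a1, b1 or c1 again, which is the sense in which the input bits can be discarded.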

A Better Division Is Used for Leader Expression Computation
X = ab(c ⊕ d ⊕ e) ⊕ cd(a ⊕ b ⊕ e)
X = (ab ⊕ cd)(a ⊕ b ⊕ c ⊕ d ⊕ e)
Based on the identity: pq(p ⊕ q) = 0
Theorem: An expression of the form (PQ ⊕ RS) can be factored as (P ⊕ R)T if there exist U and V such that 1) PU = RV = 0 and 2) Q ⊕ S = U ⊕ V
The ideal membership problem can be used to determine the existence of such U and V
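The factoring can be verified by brute force, reading every operator lost in extraction as XOR (an assumption). In this sketch U = a ⊕ b and V = c ⊕ d are chosen by hand and the common factor is taken as T = Q ⊕ U. Finding such U and V automatically is the ideal membership step, which is not attempted here.

```python
def unfactored(a, b, c, d, e):
    return (a & b & (c ^ d ^ e)) ^ (c & d & (a ^ b ^ e))


def factored(a, b, c, d, e):
    return ((a & b) ^ (c & d)) & (a ^ b ^ c ^ d ^ e)


def theorem_instance_holds(a, b, c, d, e):
    """Check the theorem's conditions and conclusion for P = ab, R = cd."""
    P, Q = a & b, c ^ d ^ e
    R, S = c & d, a ^ b ^ e
    U, V = a ^ b, c ^ d           # hand-picked witnesses (assumption)
    conditions = (P & U) == 0 and (R & V) == 0 and (Q ^ S) == (U ^ V)
    T = Q ^ U                     # common factor
    return conditions and ((P & Q) ^ (R & S)) == ((P ^ R) & T)
```

The reason T works: PT = PQ ⊕ PU = PQ, and RT = R(S ⊕ V) = RS ⊕ RV = RS, so (P ⊕ R)T = PQ ⊕ RS.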

Progressive Decomposition: Qualitative Analysis
Completely agnostic of the type of circuit to optimize
Automatically infers successful circuit designs from the literature…
• Carry-lookahead adder (beyond minimal sizes)
• Structured LZD/LOD circuit
• Optimized LZA circuit (no sum computation)
• Carry-save addition
• Parallel counter
…and discovers some unknown to us!
• Multi-input comparisons (min/max)

Multi-Input Comparator (Min/max of k n-bit Integers)
Binary tree of comparators
• Number of comparators: k − 1
• Critical-path delay: O(log n log k)
• Hardware area: O(kn)
• Result: 0.46 ns, 1755 μm²
Pairwise comparison of inputs
• Number of comparators: k(k − 1)/2
• Critical-path delay: O(log n + log k)
• Hardware area: O(k²n)
• Result: 0.21 ns, 3479 μm²
With our structuring algorithm: bitwidth reduction using dominators and LODs
• Number of LODs: k log* n
• Critical-path delay: O(log n + log k log* n)
• Hardware area: O(kn)
• Result: 0.22 ns, 1331 μm²
log*() is the number of times the logarithm function must be iteratively applied before the result is ≤ 1, e.g., log*(2^65536) = 5
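The iterated logarithm and the component counts above are easy to tabulate. This sketch assumes base-2 logarithms, which matches the 2^65536 example; the function names are hypothetical.

```python
import math


def log_star(x):
    """Number of times log2 must be applied before the result is <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count


def tree_comparators(k):
    """Comparators in a binary tree over k inputs."""
    return k - 1


def pairwise_comparators(k):
    """Comparators when every pair of the k inputs is compared."""
    return k * (k - 1) // 2


def structured_lods(k, n):
    """LOD count quoted for the structured design: k * log*(n)."""
    return k * log_star(n)
```

Because log* grows so slowly (it is 5 even for 2^65536), the k log* n term stays close to linear in k for any realistic bit width.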

Outline
• Optimizing at Word-Level
• Creating Macroscopic Structure
• Exploring Microscopic Structure
Reed-Muller form can be very inefficient
Efficient implementation of the leader expressions?
Exhaustive exploration

Problem Statement
Given a set of Boolean expressions, generate all their Pareto-optimal implementations
[Figure: three implementation styles: no “reuse”, total “reuse”, selective “reuse”]

Enumerating Common Sub-Expressions
Root: original Reed-Muller form
Either xy or x ⊕ y is replaced by a new variable
The nodes of the DAG correspond to all partial implementations of the two expressions with some sharing between them
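For readers unfamiliar with the Reed-Muller form at the root of the DAG, it is the XOR-of-ANDs (algebraic normal form) of a Boolean function. The sketch below computes it from a truth table with the standard Möbius (butterfly) transform. The function name and the bitmask encoding of monomials are assumptions made for illustration.

```python
def reed_muller(truth_table):
    """Return the monomials of the XOR-of-ANDs form as variable bitmasks.

    truth_table[i] is the function value when input variable j equals bit j of i.
    A mask m in the result stands for the product of all variables j with bit j set in m.
    """
    coeffs = list(truth_table)
    n_vars = len(coeffs).bit_length() - 1
    for j in range(n_vars):
        step = 1 << j
        for idx in range(len(coeffs)):
            if idx & step:
                coeffs[idx] ^= coeffs[idx ^ step]
    return [mask for mask, c in enumerate(coeffs) if c]
```

For the 3-input majority function (the carry of a full adder), with a as bit 0, b as bit 1 and c as bit 2, the result is the masks 3, 5 and 6, that is ab ⊕ ac ⊕ bc.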

Pruning the Enumeration DAG
The size of the DAG can be as large as O((n + m) 2^m), where n is the number of variables and m is the size of the Boolean expressions
Enumerating the whole DAG is computationally infeasible
Pruning criteria:
• Recognizing node equivalence (width reduction)
• Merging some reductions into a single one (height reduction)
• Delaying certain reductions (branch reduction)

There Is Scope for Further Pruning…
[Figure: area and delay for all 6-bit adders generated by our algorithm]
Without any pruning, it would be impossible to handle expressions with more than five variables
Number of possible implementations: > 10^60
Number of explored implementations: 2687
Number of actual Pareto-optimal implementations: 4

…but the Enumeration Algorithm Finds Interesting Non-Trivial Relations!
[Figure: 4×4-bit multiplication of a3a2a1a0 by b3b2b1b0, showing the four partial-product rows a3bj a2bj a1bj a0bj for j = 0 to 3 and shared sub-expressions over the product groups {a0b1, a1b0}, {a1b2, a2b1} and {a0b2, a1b1, a2b0}; the operators joining them were lost in extraction]
4×4-bit multiplier: better than our best manually designed cell-based multiplier?!
The method has been generalized to higher-bitwidth multipliers. It reduced the delay of the best cell-based 8×8-bit multiplier by 10%.
Verma & Ienne; ASP-DAC 2007 (Best Paper Award nominee)

Summary
• Optimizing at Word-Level: Verma & Ienne, ICCAD 2004; Verma, Brisk, & Ienne, TCAD 2008
• Creating Macroscopic Structure: Verma, Brisk, & Ienne, DAC 2007 (Best Paper Award nominee); Verma, Brisk, & Ienne, IWLS 2008
• Exploring Microscopic Structure: Verma & Ienne, DAC 2006

Computer Arithmetic and Automation
Computer arithmetic has long been the domain of extremely ingenious, manually developed architectures
Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit
Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues
It is perhaps high time to try to change all this…
Thanks!