mult logic opt
7/27/2019 Mult Logic Opt
ECE 4514 Digital Design II
Spring 2008
Lecture 18: Optimizing Area
A Tools/Methods Lecture
Patrick Schaumont
Optimization and Backend Verification
Previous 4 lectures (Lecture 13-16): How to get Verilog mapped into hardware
Next 2 lectures: How to optimize Verilog for hardware implementation
  Today (Lecture 17): Optimize for (smaller) area
  Thursday (Lecture 18): Optimize for (higher) performance
Why do we need optimization?
A given algorithm can be implemented in many different ways in digital hardware.
Each implementation is characterized by:
  A certain area (in an FPGA: slices)
  A certain cycle budget (the number of cycles needed to complete one iteration through the algorithm)
The application domain and other external requirements provide constraints for the area or the performance (= cycle_budget^-1):
  'Create a circuit not larger than can fit in a Spartan3S100 FPGA ...' (area)
  'Create an implementation that can complete at least 500 million additions per second ...' (performance)
Area-Delay Product
Assume that a given design can be implemented in 5 different ways, and that all of them have the same area-delay product. All these points lie on a hyperbola, since Area x Delay = Constant.
[Figure: the 5 implementations on a hyperbola. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
Which implementation to use?
By introducing CONSTRAINTS
Area-Delay Product
An Area constraint
[Figure: area-delay plot. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles). Area constraint: 'Smaller than ...']
Area-Delay Product
An Area constraint
The optimal point for a given area constraint is the one with the smallest delay.
[Figure: area-delay plot. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles). Area constraint: 'Smaller than ...']
Area-Delay Product
A Performance Constraint
[Figure: area-delay plot. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles). Performance constraint: 'Faster than ...']
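The effect of a constraint can be sketched in a few lines of Python. The five (area, delay) pairs below are made-up values that share an area-delay product of 240, and `pick_fastest` and `pick_smallest` are hypothetical helper names, not something from the lecture.

```python
# Hypothetical design points (area in slices, delay in cycles).
# All five have the same area-delay product: area * delay == 240.
POINTS = [(240, 1), (120, 2), (60, 4), (30, 8), (15, 16)]


def pick_fastest(points, max_area):
    """Area constraint: among designs no larger than max_area, take the smallest delay."""
    feasible = [p for p in points if p[0] <= max_area]
    return min(feasible, key=lambda p: p[1]) if feasible else None


def pick_smallest(points, max_delay):
    """Performance constraint: among designs no slower than max_delay, take the smallest area."""
    feasible = [p for p in points if p[1] <= max_delay]
    return min(feasible, key=lambda p: p[0]) if feasible else None


if __name__ == "__main__":
    print(pick_fastest(POINTS, max_area=100))   # (60, 4)
    print(pick_smallest(POINTS, max_delay=3))   # (120, 2)
```

Without a constraint, all five points are equally good by the area-delay measure; each constraint singles out exactly one of them.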
Area-Delay Product
If no constraint is given, all points with the same area-delay product are equally optimal
[Figure: area-delay plot with several suboptimal points marked. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
Area-Delay Product
Sub-optimal points
Suboptimal because: for any area constraint, the blue point is faster AND smaller.
[Figure: area-delay plot. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
Area-Delay Product
Sub-optimal points
Suboptimal because: for any speed constraint, the blue point is faster AND smaller.
[Figure: area-delay plot. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
Area-Delay Product
Optimal points dominate sub-optimal points
This point outperforms all points in the rectangle defined with this point as a corner.
[Figure: area-delay plot. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
Area-Delay Product
Finding all the optimal points for a random collection
In a real design, the set of solutions will never lie on a nice hyperbola, but will look more like a cloud of points.
[Figure: cloud of design points. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
Area-Delay Product
Finding all the optimal points for a random collection
You can find the optimal ones by drawing a rectangle for each of them and eliminating suboptimal points.
[Figure: cloud of design points with rectangles. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
Area-Delay Product
Finding all the optimal points for a random collection
Finally, you can draw a 'staircase' curve that reflects all the optimal solutions for your design.
[Figure: staircase curve through the optimal points. Axes: Area (e.g. slices) and Delay = Performance^-1 (e.g. cycles).]
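The rectangle-elimination procedure can be written down as a small dominance check. This is a minimal sketch with a made-up cloud of (area, delay) points; `dominates` and `pareto_front` are hypothetical names chosen for illustration.

```python
def dominates(p, q):
    """True if p is no worse than q in both area and delay, and p differs from q."""
    return p[0] <= q[0] and p[1] <= q[1] and p != q


def pareto_front(points):
    """Keep the points that no other point dominates, sorted by area."""
    front = [q for q in points if not any(dominates(p, q) for p in points)]
    return sorted(set(front))


# Made-up cloud of (area, delay) design points.
CLOUD = [(10, 9), (12, 12), (20, 5), (25, 7), (40, 2), (45, 6), (60, 2)]

if __name__ == "__main__":
    print(pareto_front(CLOUD))  # [(10, 9), (20, 5), (40, 2)]
```

The surviving points, read in order of increasing area, are the corners of the staircase curve.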
Area-Delay Example
Area-delay product for an AES S-box (a component of an encryption algorithm) using different architectures
Where do constraints come from?
Constraints are defined by the application domain or the system that will integrate a given digital design.
[Diagram: constraints flowing into the speech coder digital design]
  Application domain (e.g. speech processing in a mobile phone)
  Real-time constraints (... Freq = 4 kHz)
  Cost (e.g. x $/phone): max die size for package y
  Battery type, capacity: power budget
  Speech coder digital design: delay (cycles), area, clock frequency
Area Optimization - Overview
Resource Sharing
Hardware Sharing Factor
Examples:
  An unshared multiplier
  A shared multiplier
Resource Sharing
Unshared case
[Diagram: in -> f1 (combinational logic) -> f1 (combinational logic) -> out]
Timing (one column per clk cycle):
  in:  in0  in1  in2  in3  in4
  out: out0 out1 out2 out3 out4
Resource Sharing
Shared case
[Diagram: in -> multiplexer (inputs 0 and 1, select c) -> f1 (combinational logic) -> out]
Timing (one column per clk cycle):
  in:  in0  --   in1  --   in2  --
  out: --   out0 --   out1 --   out2
Resource Sharing
Registers are relatively big compared to combinational logic (recall: 6 gates for a D flip-flop), while multiplexers are small (3 gates for a mux).
Therefore, the equation is easy to satisfy:
  A_f1 + A_reg > A_mux
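Plugging the gate counts quoted on this slide into the inequality shows why it is easy to satisfy. This is only a worked check; the function name `condition_holds` and the example values for the area of f1 are assumptions.

```python
# Gate counts quoted on the slide.
A_REG = 6   # D flip-flop
A_MUX = 3   # multiplexer


def condition_holds(a_f1, a_reg=A_REG, a_mux=A_MUX):
    """Evaluate the slide's condition A_f1 + A_reg > A_mux."""
    return a_f1 + a_reg > a_mux


if __name__ == "__main__":
    # Holds even for an empty function f1, because 6 > 3.
    print(condition_holds(a_f1=0))    # True
    print(condition_holds(a_f1=20))   # True
```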
Resource Sharing: trade-off between area and performance
Shared (1x function f1, 1 register, multiplexer with select c):
  THROUGHPUT: 1 input each two cycles
  LATENCY: total compute time per input is two cycles
  Smaller
Unshared (2x function f1):
  THROUGHPUT: 1 input each cycle
  LATENCY: total compute time per input is two cycles
  Faster
Hardware Sharing Factor
The Hardware Sharing Factor (HSF) expresses the potential amount of resource sharing in a digital design.
[Diagram: a digital logic design clocked at f_CLK, with input data rate f_in and output data rate f_out]
  HSF = f_CLK / max(f_in, f_out)
HSF is the number of clock cycles available per data item.
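As a quick worked example of the formula, the sketch below computes the HSF for two cases. The clock and data rates are assumed values picked for illustration, and `hardware_sharing_factor` is a hypothetical helper name.

```python
def hardware_sharing_factor(f_clk, f_in, f_out):
    """HSF = f_CLK / max(f_in, f_out): clock cycles available per data item."""
    return f_clk / max(f_in, f_out)


if __name__ == "__main__":
    # Assumed: 50 MHz clock, 8 kHz sample rate on both input and output.
    print(hardware_sharing_factor(50_000_000, 8_000, 8_000))               # 6250.0
    # Assumed: one input item per clock cycle, output at half that rate.
    print(hardware_sharing_factor(100_000_000, 100_000_000, 50_000_000))   # 1.0
```

An HSF of 1 leaves no room for sharing, while an HSF in the thousands means one hardware unit can be reused thousands of times per data item.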
Hardware Sharing Factor
Different styles of design are reflected in a different HSF:
  HSF (order of magnitude)   Design style
  1                          Unmultiplexed, maximum-throughput hardware
  10                         Lowly multiplexed hardware (simple FSM-based control)
  100                        Medium multiplexed hardware (control based on simple Picoblaze-like sequencers)
  1000                       Highly multiplexed hardware (control based on complex microcoded sequencers)
  'infinite'                 Microprocessors (control is software!)
Spartan3: 70 LUTs, 10.6 ns delay
[Figure: synthesized multiplier, with a shift-expansion stage and an adder tree]
Example: An unshared multiplier
The idea of a shared implementation is to 'chop' the combinational logic into smaller similar pieces, and execute these similar pieces over multiple clock cycles.
[Diagram: a, b -> "f1" -> "f1" -> "f1" -> q = a * b. Stages with similar operation, separated by registers.]
Example: An unshared multiplier
For a multiplication, repeated add-shift is the obvious 'repeating' part to be shared.
          1 1 1 1      15
  x       1 1 0 0      12
  -----------------
          0 0 0 0
        0 0 0 0
      1 1 1 1
    1 1 1 1
  -----------------
  1 0 1 1 0 1 0 0      180
Instead of adding 4 partial products at once, generate only a single one at a time and accumulate the result.
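The 15 x 12 example can be replayed in Python to see the individual partial products and their accumulation. This is a behavioural sketch, not hardware; `partial_products` and `accumulate` are hypothetical helper names.

```python
def partial_products(a, b, width=4):
    """Return the shifted partial products of a * b, one per bit of b."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def accumulate(products):
    """Add the partial products one at a time, as a single shared adder would."""
    acc = 0
    for p in products:
        acc += p
    return acc


if __name__ == "__main__":
    pp = partial_products(15, 12)
    print([format(p, "08b") for p in pp])
    # ['00000000', '00000000', '00111100', '01111000']
    print(accumulate(pp), format(accumulate(pp), "08b"))  # 180 10110100
```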
Example: An unshared multiplier
// 8 * 8 bit multiplication written using add-shift
module syn_ex3(q, a, b);
  output [15:0] q;
  reg    [15:0] q;
  input  [7:0]  a, b;

  reg [15:0] tmp [7:0];   // running sums of the partial products

  always @(a or b) begin
    tmp[0] = b[0] ? a : 16'b0;
    tmp[1] = tmp[0] + (b[1] ? {a, 1'b0} : 16'b0);
    tmp[2] = tmp[1] + (b[2] ? {a, 2'b0} : 16'b0);
    tmp[3] = tmp[2] + (b[3] ? {a, 3'b0} : 16'b0);
    tmp[4] = tmp[3] + (b[4] ? {a, 4'b0} : 16'b0);
    tmp[5] = tmp[4] + (b[5] ? {a, 5'b0} : 16'b0);
    tmp[6] = tmp[5] + (b[6] ? {a, 6'b0} : 16'b0);
    tmp[7] = tmp[6] + (b[7] ? {a, 7'b0} : 16'b0);
    q = tmp[7];
  end
endmodule
Spartan3: 108 LUTs, 20 ns delay ...
Shared multiplier
Share the adder by mapping one statement per clock cycle.
(The slide repeats the syn_ex3 add-shift listing of the unshared multiplier.)
[Diagram: an adder (+), a selection between a and 0, and a tmp/q register]
Shared multiplier
module syn_ex4(q, done, a, b, start, clk);
  output [15:0] q;
  reg    [15:0] q;
  output        done;
  input  [7:0]  a, b;
  input         start;
  input         clk;
  reg    [4:0]  ctr;
  reg    [15:0] shiftA;
  reg : s ;                    // declaration garbled in the source
  always @(posedge clk) begin
    ctr ...                    // remainder of the listing is truncated in the source
The following slides repeat the syn_ex4 listing and highlight three parts of it in turn:
  Controller
  B shifts down, A shifts up
  Accumulator
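Because the syn_ex4 listing is truncated in this transcript, the following is only a behavioural Python sketch of how a controller, two shift registers and an accumulator could cooperate. The names `ctr` and `shift_a` mirror the declarations that survive; `shift_b`, the class `SharedMultiplier`, and the control policy (load on start, eight add-shift steps, done when the counter reaches zero) are assumptions, not the author's code.

```python
class SharedMultiplier:
    """Behavioural model: one add-shift step per clock cycle (assumed control scheme)."""

    def __init__(self):
        self.q = 0        # accumulator (16 bits)
        self.shift_a = 0  # multiplicand, shifts up
        self.shift_b = 0  # multiplier, shifts down
        self.ctr = 0      # remaining add-shift steps
        self.done = False

    def clock(self, a=0, b=0, start=False):
        """Model one rising clock edge."""
        if start:
            # Controller: load the operands and arm the counter.
            self.q, self.shift_a, self.shift_b = 0, a & 0xFF, b & 0xFF
            self.ctr, self.done = 8, False
        elif self.ctr > 0:
            # Accumulator: add the partial product selected by the LSB of B.
            if self.shift_b & 1:
                self.q = (self.q + self.shift_a) & 0xFFFF
            self.shift_a = (self.shift_a << 1) & 0xFFFF  # A shifts up
            self.shift_b >>= 1                           # B shifts down
            self.ctr -= 1
            self.done = self.ctr == 0


def multiply(a, b):
    """Run the model until done; return (product, cycles after start)."""
    m = SharedMultiplier()
    m.clock(a, b, start=True)
    cycles = 0
    while not m.done:
        m.clock()
        cycles += 1
    return m.q, cycles


if __name__ == "__main__":
    print(multiply(15, 12))    # (180, 8)
    print(multiply(255, 255))  # (65025, 8)
```

The single adder is reused in every cycle, which is where the area saving comes from; the price is the eight cycles per product.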
Sharing may reduce area at expense of delay
[Figure: area (LUT) versus delay]
  unshared multiplier: 108 LUT, 20 ns
  shared multiplier:   36 LUT, 32 ns
Other Solutions are possible ...
Create an implementation with two adders: rewrite the code to execute two statements per clock cycle.
(The slide repeats the syn_ex3 add-shift listing of the unshared multiplier.)
Other Solutions are possible ...
Create an implementation with four adders: rewrite the code to execute four statements per clock cycle.
(The slide repeats the syn_ex3 add-shift listing of the unshared multiplier.)
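The effect of executing several add-shift statements per clock cycle can be sketched with a cycle count. This Python model only counts cycles; it says nothing about LUT usage. The function `multiply_k_adders` and its parameters are hypothetical.

```python
def multiply_k_adders(a, b, adders, width=8):
    """Add-shift multiply that executes `adders` statements per clock cycle.

    Returns (product, cycles). `adders` is assumed to divide `width`.
    """
    if width % adders:
        raise ValueError("adders must divide width")
    acc, cycles = 0, 0
    for first_bit in range(0, width, adders):
        # One clock cycle: `adders` chained add-shift statements.
        for i in range(first_bit, first_bit + adders):
            if (b >> i) & 1:
                acc += a << i
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, multiply_k_adders(15, 12, k))
    # 1 (180, 8)
    # 2 (180, 4)
    # 4 (180, 2)
    # 8 (180, 1)
```

With one statement per cycle this corresponds to the fully shared design, and with eight to the unshared one; two and four statements per cycle are the intermediate solutions.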
Sharing may reduce area at expense of delay
[Figure: area (LUT) versus delay]
  unshared multiplier: 108 LUT, 20 ns
  shared multiplier:   36 LUT, 32 ns
Implementations with two or four adders will generate additional intermediate solutions.
Example Full Adder -> Half Adder
Propagate '0' into Ci.
[Diagram: full adder with inputs A, B and Ci = 0, and outputs S and Co]
Half Adder: S = A ^ B, Co = A & B
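The reduction can be checked exhaustively. The sketch below uses the standard full-adder equations and compares them against the half adder for every input pair with Ci tied to 0; the function names are chosen for illustration.

```python
from itertools import product


def full_adder(a, b, ci):
    """Return (s, co) of a 1-bit full adder."""
    s = a ^ b ^ ci
    co = (a & b) | (ci & (a ^ b))
    return s, co


def half_adder(a, b):
    """Return (s, co) of a 1-bit half adder."""
    return a ^ b, a & b


if __name__ == "__main__":
    # With Ci tied to 0 the full adder behaves exactly like a half adder.
    for a, b in product((0, 1), repeat=2):
        assert full_adder(a, b, 0) == half_adder(a, b)
    print("full_adder(a, b, 0) == half_adder(a, b) for all a, b")
```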
Example: constant multiplication
(* mult_style = "lut" *)
module syn_ex6(q, a);
  output [15:0] q;
  input  [7:0]  a;
  assign q = a * 215;
endmodule
215 = 11010111 (binary)
Spartan3: 36 LUT, 13.5 ns
Compare: 70 LUT, 10 ns for the full 8 * 8 bit multiplier
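One way to see why a constant multiplier can be smaller is that only the 1 bits of the constant contribute a partial product. The Python sketch below decomposes 215 accordingly; it illustrates the principle only, and a synthesis tool may well choose a different decomposition. The helper names are hypothetical.

```python
def shift_add_terms(constant):
    """Bit positions of the 1s in the constant, most significant first."""
    return [i for i in reversed(range(constant.bit_length())) if (constant >> i) & 1]


def multiply_by_constant(a, constant):
    """Multiply using only shifts and adds, one term per 1 bit of the constant."""
    return sum(a << i for i in shift_add_terms(constant))


if __name__ == "__main__":
    print(format(215, "b"))                 # 11010111
    print(shift_add_terms(215))             # [7, 6, 4, 2, 1, 0]
    print(multiply_by_constant(200, 215))   # 43000
```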
Example: constant multiplication
(* mult_style = "lut" *)
module syn_ex6(q, a);
  output [15:0] q;
  input  [7:0]  a;
  assign q = a * 64;
endmodule
When the constant is a power of two, multiplication becomes a hardwired shift: no LUTs needed at all!
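The power-of-two case is easy to confirm in software: multiplying by 64 is the same as shifting left by 6 positions. A minimal sketch, with hypothetical helper names:

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def multiply_by_power_of_two(a, constant):
    """Replace the multiplication by a left shift (pure wiring in hardware)."""
    if not is_power_of_two(constant):
        raise ValueError("constant is not a power of two")
    return a << (constant.bit_length() - 1)


if __name__ == "__main__":
    print(is_power_of_two(64), is_power_of_two(215))  # True False
    print(multiply_by_power_of_two(5, 64))            # 320
```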