mult logic opt

Upload: lakshmikanth-meduri

Post on 02-Apr-2018


  • 7/27/2019 Mult Logic Opt

    ECE 4514 Digital Design II

    Spring 2008

    Lecture 18: Optimizing Area

    A Tools/Methods Lecture

    Patrick Schaumont

    Optimization and Backend Verification

    Previous 4 lectures (Lectures 13-16): How to get Verilog mapped into hardware

    Next 2 lectures: How to optimize Verilog for hardware implementation
      Today (Lecture 17): Optimize for (smaller) area
      Thursday (Lecture 18): Optimize for (higher) performance

    Why do we need optimization?

    A given algorithm can be implemented in many different ways in digital hardware.

    Each implementation is characterized by:
      A certain area (in an FPGA: slices)
      A certain cycle budget (the number of cycles needed to complete one iteration through the algorithm)

    The application domain and other external requirements provide constraints for the area or the performance (= cycle_budget^-1):
      'Create a circuit no larger than can fit in a Spartan 3S100 FPGA ...' (area)
      'Create an implementation that can complete at least 500 million additions per second ...' (performance)

    Area-Delay Product

    [Figure: five implementations on a curve. Vertical axis: Area (e.g. slices). Horizontal axis: Delay = Performance^-1 (e.g. cycles).]

    Assume that a given design can be implemented in 5 different ways, and that all of them have the same Area-Delay product. All these points lie on a hyperbola, since Area x Delay = Constant.

    Which implementation to use? Decide by introducing CONSTRAINTS.

    Area-Delay Product

    An Area Constraint

    [Figure: the Area-Delay curve with a line marking the area constraint ('Smaller than ...'). Vertical axis: Area (e.g. slices). Horizontal axis: Delay = Performance^-1 (e.g. cycles).]

    The optimal point for a given area constraint is the one with the smallest delay among the points that satisfy the constraint.

    Area-Delay Product

    A Performance Constraint

    [Figure: the Area-Delay curve with a line marking the performance constraint ('Faster than ...'). Vertical axis: Area (e.g. slices). Horizontal axis: Delay = Performance^-1 (e.g. cycles).]
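    As a worked illustration of how a constraint selects one point, assume five implementations with a constant area-delay product of 1200 slice-cycles. The numbers are hypothetical and not taken from the lecture. The rule for the performance constraint (pick the smallest area among the points that are fast enough) is assumed here by symmetry with the area case.

```latex
\begin{align*}
(\text{Area}, \text{Delay}) &\in \{(1200, 1),\ (600, 2),\ (400, 3),\ (300, 4),\ (240, 5)\},
  \qquad \text{Area} \times \text{Delay} = 1200 \\
\text{Area} \le 450 &\;\Rightarrow\; \text{candidates } (400, 3),\ (300, 4),\ (240, 5)
  \;\Rightarrow\; \text{choose } (400, 3) \text{ (smallest delay)} \\
\text{Delay} \le 2 &\;\Rightarrow\; \text{candidates } (1200, 1),\ (600, 2)
  \;\Rightarrow\; \text{choose } (600, 2) \text{ (smallest area)}
\end{align*}
```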

    Area-Delay Product

    If no constraint is given, all points with the same area-delay product are equally optimal.

    [Figure: Area (e.g. slices) versus Delay = Performance^-1 (e.g. cycles); some points are marked as suboptimal points.]

    Area-Delay Product

    Sub-optimal points

    [Figure: Area (e.g. slices) versus Delay = Performance^-1 (e.g. cycles), with a blue point and suboptimal points.]

    A point is suboptimal because:
      for any area constraint, the blue point is faster AND smaller
      for any speed constraint, the blue point is faster AND smaller

    Area-Delay Product

    Optimal points dominate sub-optimal points

    [Figure: Area (e.g. slices) versus Delay = Performance^-1 (e.g. cycles), with a rectangle drawn from one optimal point.]

    This point outperforms all points in the rectangle defined with this point as a corner.
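    The rectangle rule can be written as a dominance relation. In this sketch, the symbols A_p and D_p (area and delay of implementation p) are notation introduced here for illustration, not taken from the slides.

```latex
p \text{ dominates } q
\;\iff\;
A_p \le A_q \ \text{ and } \ D_p \le D_q \ \text{ and } \ (A_p, D_p) \ne (A_q, D_q)
```

    Under this definition, a point is optimal when no other point in the collection dominates it.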

    Area-Delay Product

    Finding all the optimal points for a random collection

    [Figure: Area (e.g. slices) versus Delay = Performance^-1 (e.g. cycles), with a cloud of points.]

    In a real design, the set of solutions will never lie on a nice hyperbola, but will look more like a cloud of points.

    Area-Delay Product

    Finding all the optimal points for a random collection

    [Figure: Area (e.g. slices) versus Delay = Performance^-1 (e.g. cycles), with a rectangle drawn for each optimal point.]

    You can find the optimal ones by drawing a rectangle for each of them and eliminating the suboptimal points.

    Area-Delay Product

    Finding all the optimal points for a random collection

    [Figure: Area (e.g. slices) versus Delay = Performance^-1 (e.g. cycles), with a staircase curve through the optimal points.]

    Finally, you can draw a 'staircase' curve that reflects all the optimal solutions for your design.

    Area-Delay Example

    Area-delay product for an AES S-box (component of an encryption algorithm) using different architectures

    Where do constraints come from?

    Constraints are defined by the application domain or the system that will integrate a given digital design.

    [Diagram: constraints on a Speech Coder digital design]
      Application domain (e.g. speech processing in a mobile phone)
      Real-time constraints (. . Freq = 4 kHz)
      Cost (e.g. x $/phone), max die size for package y
      Battery type, capacity
      Power budget
      Design quantities: Delay (cycles), Area, Clock frequency

    Area Optimization - Overview

    Resource Sharing

    Hardware Sharing Factor

    Examples:
      An unshared multiplier
      A shared multiplier

    Resource Sharing

    Unshared case

    [Figure: in -> f1 (combinational logic) -> f1 (combinational logic) -> out, clocked by clk]

    Timing:
      in    in0   in1   in2   in3   in4
      out   out0  out1  out2  out3  out4

    Resource Sharing

    Shared case

    [Figure: a single f1 block (combinational logic) between in and out, fed through a 0/1 multiplexer controlled by c, clocked by clk]

    Timing:
      in    in0   --    in1   --    in2   --
      out   --    out0  --    out1  --    out2

    Resource Sharing

    Registers are relatively big compared to combinational logic (recall: 6 gates for a D flip-flop), while multiplexers are small (3 gates for a mux).

    Therefore, the equation is easy to satisfy:

    $A_{f1} + A_{reg} > A_{mux}$

    Resource Sharing: trade-off between area and performance

    [Figure: the shared datapath (one f1 block, a 0/1 multiplexer controlled by c, 1 register) next to the unshared datapath (two f1 blocks between in and out)]

                    Shared                      Unshared
      Logic         1x function f1              2x function f1
      THROUGHPUT    1 input each two cycles     1 input each cycle
      LATENCY       Total compute time per      Total compute time per
                    input is two cycles         input is two cycles
      Result        Smaller                     Faster

    Hardware Sharing Factor

    The Hardware Sharing Factor (HSF) expresses the potential amount of resource sharing in a digital design.

    [Figure: a Digital Logic Design block with input data rate $f_{in}$, output data rate $f_{out}$, and clock $f_{CLK}$]

    $HSF = f_{CLK} / \max(f_{in}, f_{out})$

    HSF is the number of clock cycles available per data item.
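    A short worked example of the HSF formula, using hypothetical rates that are not from the lecture (a 50 MHz clock, 5 MHz input data rate, 1 MHz output data rate):

```latex
\begin{align*}
f_{CLK} &= 50\ \text{MHz}, \qquad f_{in} = 5\ \text{MHz}, \qquad f_{out} = 1\ \text{MHz} \\
HSF &= \frac{f_{CLK}}{\max(f_{in}, f_{out})} = \frac{50\ \text{MHz}}{5\ \text{MHz}} = 10
\end{align*}
```

    With these assumed numbers, ten clock cycles are available per data item, so a single operator could in principle be reused up to ten times per item.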

    Hardware Sharing Factor

    Different styles of design will be reflected with a different HSF:

      HSF          Design style
      1            Unmultiplexed, maximum-throughput hardware
      10           Lowly multiplexed hardware (simple FSM-based control)
      100          Medium multiplexed hardware (control based on simple Picoblaze-like sequencers)
      1000         Highly multiplexed hardware (control based on complex microcoded sequencers)
      'infinite'   Microprocessors (control is software!)

    [Figure: synthesized multiplier structure showing shift-expansion and an adder-tree]

    Spartan3: 70 LUTs, 10.6 ns delay

    Example: An unshared multiplier

    The idea of a shared implementation is to 'chop' the combinational logic into smaller similar pieces, and execute these similar pieces over multiple clock cycles.

    [Figure: a block computing q = a * b from inputs a, b, split into stages "f1", "f1", "f1": stages with similar operation, separated by registers]

    Example: An unshared multiplier

    For a multiplication, repeated add-shift is the obvious 'repeating' part to be shared.

    15            1 1 1 1
    12          x 1 1 0 0
                ---------
                  0 0 0 0
                0 0 0 0
              1 1 1 1
            1 1 1 1
          ---------------
    180   1 0 1 1 0 1 0 0

    Instead of adding 4 partial products at once, generate only a single one at a time and accumulate the result.
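    The same example can be checked arithmetically: only the set bits of the multiplier contribute a shifted copy of the multiplicand.

```latex
\begin{align*}
12 &= 1100_2 = 2^3 + 2^2 \\
15 \times 12 &= 15 \cdot 2^3 + 15 \cdot 2^2 = 120 + 60 = 180 = 10110100_2
\end{align*}
```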

    Example: An unshared multiplier

    8 * 8 bit multiplication written using add-shift

    module syn_ex3(q, a, b);
      output [15:0] q;
      reg    [15:0] q;
      input  [ 7:0] a, b;

      // One running sum per bit of b
      reg [15:0] tmp [7:0];

      always @(a or b) begin
        // Partial product for bit 0 of b
        tmp[0] = b[0] ? a : 16'b0;
        // Each following statement adds a, shifted left by i, when b[i] is set
        tmp[1] = tmp[0] + (b[1] ? {a, 1'b0} : 16'b0);
        tmp[2] = tmp[1] + (b[2] ? {a, 2'b0} : 16'b0);
        tmp[3] = tmp[2] + (b[3] ? {a, 3'b0} : 16'b0);
        tmp[4] = tmp[3] + (b[4] ? {a, 4'b0} : 16'b0);
        tmp[5] = tmp[4] + (b[5] ? {a, 5'b0} : 16'b0);
        tmp[6] = tmp[5] + (b[6] ? {a, 6'b0} : 16'b0);
        tmp[7] = tmp[6] + (b[7] ? {a, 7'b0} : 16'b0);
        q = tmp[7];
      end
    endmodule

    Spartan3: 108 LUTs, 20 ns delay ..

    Shared multiplier

    Starting from the add-shift code of syn_ex3: share the adder by mapping one statement per clock cycle.

    [Figure: a single adder (+) with inputs a and 0, accumulating into tmp/q]

    Shared multiplier

    module syn_ex4(q, done, a, b, start, clk);
      output [15:0] q;
      reg    [15:0] q;
      output        done;
      input  [ 7:0] a, b;
      input         start;
      input         clk;
      reg    [ 4:0] ctr;
      reg    [15:0] shiftA;
      reg : s ;

      always @(posedge clk) begin
        ctr ...

    Successive slides highlight three parts of the listing:
      Controller
      B shifts down, A shifts up
      Accumulator
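    Because the syn_ex4 listing is cut off in this transcript, the following is only a minimal sketch of how a controller, a pair of shift registers, and an accumulator could fit together behind the same port list. It is not the lecture's code: the module name shared_mult_sketch, the register shiftB, the 4-bit counter, and the encoding of done (counter equal to zero) are all assumptions made for illustration.

```verilog
// Hypothetical sketch, not the lecture's syn_ex4 listing.
module shared_mult_sketch(q, done, a, b, start, clk);
  output [15:0] q;
  output        done;
  input  [ 7:0] a, b;
  input         start;
  input         clk;

  reg [15:0] q;        // accumulator
  reg [ 3:0] ctr;      // remaining add-shift steps; 0 means idle/done
  reg [15:0] shiftA;   // multiplicand, shifted up each cycle
  reg [ 7:0] shiftB;   // multiplier, shifted down each cycle (assumed name)

  assign done = (ctr == 4'd0);

  always @(posedge clk) begin
    if (start) begin
      // Controller: load the operands and clear the accumulator.
      ctr    <= 4'd8;
      shiftA <= {8'b0, a};
      shiftB <= b;
      q      <= 16'b0;
    end else if (ctr != 4'd0) begin
      // Controller: one add-shift step per clock cycle.
      ctr    <= ctr - 4'd1;
      // B shifts down, A shifts up.
      shiftA <= shiftA << 1;
      shiftB <= shiftB >> 1;
      // Accumulator: add the current partial product when the LSB of B is set.
      if (shiftB[0])
        q <= q + shiftA;
    end
  end
endmodule
```

    In this sketch, start is pulsed for one cycle and done goes high eight cycles after the operands are loaded; in simulation done is undefined until the first start, since no reset is modeled.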

    Sharing may reduce area at expense of delay

    [Figure: Area (LUT) versus Delay]
      unshared multiplier: 108 LUT, 20 ns
      shared multiplier: 36 LUT, 32 ns

    Other Solutions are possible ...

    Starting from the add-shift code of syn_ex3:
      Create an implementation with two adders: rewrite the code to two statements per clock cycle
      Create an implementation with four adders: rewrite the code to four statements per clock cycle
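    As an illustration of the two-adder variant, the sketch below processes two bits of b per clock cycle, so the multiplication takes four steps instead of eight. It is a hedged example rather than the lecture's solution: the module name shared_mult2_sketch, the signals shiftB, pp0 and pp1, and the done encoding are hypothetical.

```verilog
// Hypothetical sketch: two add-shift statements per clock cycle (two adders).
module shared_mult2_sketch(q, done, a, b, start, clk);
  output [15:0] q;
  output        done;
  input  [ 7:0] a, b;
  input         start;
  input         clk;

  reg [15:0] q;        // accumulator
  reg [ 2:0] ctr;      // remaining cycles; 0 means idle/done
  reg [15:0] shiftA;   // multiplicand, shifted up by two each cycle
  reg [ 7:0] shiftB;   // multiplier, shifted down by two each cycle

  // Two partial products are formed in the same cycle.
  wire [15:0] pp0 = shiftB[0] ? shiftA : 16'b0;
  wire [15:0] pp1 = shiftB[1] ? {shiftA[14:0], 1'b0} : 16'b0;

  assign done = (ctr == 3'd0);

  always @(posedge clk) begin
    if (start) begin
      ctr    <= 3'd4;             // 8 bits of b, 2 bits per cycle
      shiftA <= {8'b0, a};
      shiftB <= b;
      q      <= 16'b0;
    end else if (ctr != 3'd0) begin
      ctr    <= ctr - 3'd1;
      shiftA <= shiftA << 2;
      shiftB <= shiftB >> 2;
      q      <= q + pp0 + pp1;    // two additions per cycle
    end
  end
endmodule
```

    A four-adder version would follow the same pattern with four partial products per cycle and two steps in total.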

    Sharing may reduce area at expense of delay

    [Figure: Area (LUT) versus Delay, showing the unshared multiplier (108 LUT, 20 ns) and the shared multiplier (36 LUT, 32 ns)]

    Using two or four adders will generate additional intermediate solutions.

    Example: Full Adder -> Half Adder

    Propagate '0' into Ci

    [Figure: full adder with inputs A, B and Ci = 0, outputs S and Co]

    Half Adder: S = A ^ B, Co = A & B
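    The reduction can be written out in Verilog. The sketch below uses hypothetical module names (full_adder_sketch, half_adder_by_propagation) and one common form of the full-adder equations. With ci tied to 0, s = a ^ b ^ 0 reduces to a ^ b, and co = (a & b) | (0 & (a ^ b)) reduces to a & b.

```verilog
// Hypothetical sketch of a full adder.
module full_adder_sketch(s, co, a, b, ci);
  output s, co;
  input  a, b, ci;

  assign s  = a ^ b ^ ci;
  assign co = (a & b) | (ci & (a ^ b));
endmodule

// Tying ci to constant 0 leaves the half-adder equations
// s = a ^ b and co = a & b.
module half_adder_by_propagation(s, co, a, b);
  output s, co;
  input  a, b;

  full_adder_sketch fa(.s(s), .co(co), .a(a), .b(b), .ci(1'b0));
endmodule
```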

    Example: constant multiplication

    (* mult_style = "lut" *)
    module syn_ex6(q, a);
      output [15:0] q;
      input  [ 7:0] a;

      assign q = a * 215;
    endmodule

    215 = 11010111

    Spartan3: 36 LUT, 13.5 ns

    Cfr: 70 LUT, 10 ns for full 8*8 bit
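    The binary expansion 215 = 11010111 shows why a constant multiplier needs less logic than a general one: only the six set bits contribute a term. The sketch below spells this out by hand. It illustrates the idea only and is not a claim about what the synthesis tool generates; the module name const_mult215_sketch and the wire x are hypothetical.

```verilog
// Hypothetical sketch: a * 215 as a sum of shifted copies of a.
// 215 = 128 + 64 + 16 + 4 + 2 + 1
module const_mult215_sketch(q, a);
  output [15:0] q;
  input  [ 7:0] a;

  // Zero-extend a so that the shifts do not lose upper bits.
  wire [15:0] x = {8'b0, a};

  assign q = (x << 7) + (x << 6) + (x << 4) + (x << 2) + (x << 1) + x;
endmodule
```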

    Example: constant multiplication

    (* mult_style = "lut" *)
    module syn_ex6(q, a);
      output [15:0] q;
      input  [ 7:0] a;

      assign q = a * 64;
    endmodule

    When the constant is a power of two, multiplication becomes a hardwired shift: no LUTs needed at all!
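    The hardwired shift can be made explicit with a concatenation: since 64 = 2^6, the product is a followed by six zero bits. The module name const_mult64_sketch is hypothetical.

```verilog
// Hypothetical sketch: a * 64 expressed purely as wiring.
module const_mult64_sketch(q, a);
  output [15:0] q;
  input  [ 7:0] a;

  // Two leading zeros, the 8 bits of a, then six trailing zeros.
  assign q = {2'b0, a, 6'b0};
endmodule
```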