h5h4, h5e7 fixed point arithmetic lecture 2 (used to be 5)iverbauw/courses/h05h4/... · 2009. 3....



  • Page 1

    1

    H5H4, H5E7: Fixed point arithmetic

    Lecture 2 (used to be 5)
    I. Verbauwhede
    Acknowledgements: H. DeMan, V. Öwall, D. Hwang
    2008-2009

    K.U.Leuven

    2

    Overview

    Lecture 1: what is a system-on-chip
    Lecture 2: terminology for the different steps
    Lecture 3: models of computation, SDFG
    Lecture 4: control flow
    Lecture 5 (today): fixed point refinement, because we need it for exercises

  • Page 2

    3

    Lecture 3: invited lecture

    Friday Feb. 27, 10:30 to 12:30, room 00.62
    Prof. Çetin Kaya Koç (University of California, Santa Barbara), “A brief history of cryptographic hardware design”

    4

    H5H4 goal: Skiing down a mountain

    [Figure: design flow drawn as a ski slope, from the specification at the top to the implementation targets at the bottom]
    Top: Specification
    Steps: Algorithm Transformations; Memory Transformations and Optimizations; Floating-point to Fixed-point
    Annotations: SPW, Matlab, C++; pipelining, unrolling; loop merging, compaction; 40 bit accumulator
    Targets: ASIC, special purpose, retargetable coprocessor, DSP processor, DSP-RISC, RISC

  • Page 3

    5

    References

    P. Lapsley, et al., “DSP Processor Fundamentals: Architectures and Features,” IEEE Press, 1997, Chapter 3.

    W. Sung, K. Kum, “Simulation-based Word-Length Optimization Method for Fixed-point Digital Signal Processing Systems,” IEEE Trans. on Signal Proc., Vol. 43, No. 12, Dec. 1995.

    Viktor Öwall, Dept. of Electroscience, Lund, Sweden, www.es.lth.se/ugradcourses/DSPDesign/

    M. Ercegovac, T. Lang, “Digital Arithmetic,” Morgan Kaufmann Publishers, 2004.

    Fridge project: http://www.ert.rwth-aachen.de/Projekte/Tools/FRIDGE/fridge.html

    6

    Finite word lengths: a must for DSP

    DSP applications: high speed, minimum area, low power

    Floating-point
    – powerful
    – expensive (storage & ops): 3 bytes (mantissa) + 1 byte (exponent)

    Fixed-point refinement

    [Figure: a multiplier (*) shown for the floating-point and for the fixed-point case; the fixed-point version is annotated with word lengths 8, 6 and 14 bits]

  • Page 4

    7

    Consequences of Bad Use of Approximations

    Example: Failure of Patriot Missile (1991 Feb. 25)

    Source: http://www.math.psu.edu/dna/455.f96/disasters.html

    An American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks, killing 28.

    Cause, per GAO/IMTEC-92-26 report: “software problem” (inaccurate calculation of the time since boot)

    Specifics of the problem: the time in tenths of a second, as measured by the system’s internal clock, was multiplied by 1/10 to get the time in seconds. Internal registers were 24 bits wide.
    1/10 = 0.0001 1001 1001 1001 1001 100 (chopped to 24 b)
    Error ≅ 0.1100 1100 × 2^-23 ≅ 9.5 × 10^-8

    Error in 100-hr operation period ≅ 9.5 × 10^-8 × 100 × 60 × 60 × 10 = 0.34 s
    Distance traveled by Scud = (0.34 s) × (1676 m/s) ≅ 570 m

    This put the Scud outside the Patriot’s “range gate”. Ironically, the fact that the bad time calculation had been improved in some (but not all) code parts contributed to the problem, since it meant that inaccuracies did not cancel out.
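    The Patriot arithmetic can be reproduced in a few lines. This is only a sketch: it assumes that 23 bits are kept behind the binary point (the number of bits written out on the slide) and uses exact fractions so that the sketch itself adds no rounding error. The names `FRAC_BITS` and `SCUD_SPEED` are illustrative.

```python
from fractions import Fraction

FRAC_BITS = 23       # bits kept behind the binary point (assumption, as written on the slide)
SCUD_SPEED = 1676    # m/s, from the slide

tenth = Fraction(1, 10)
# Chop (truncate) 1/10 to FRAC_BITS fractional bits.
chopped = Fraction(int(tenth * 2**FRAC_BITS), 2**FRAC_BITS)
error_per_tick = tenth - chopped

ticks = 100 * 60 * 60 * 10           # tenths of a second in 100 hours
drift_s = float(error_per_tick * ticks)
miss_m = drift_s * SCUD_SPEED

print(f"error per tick:            {float(error_per_tick):.3e}")
print(f"clock drift after 100 h:   {drift_s:.3f} s")
print(f"distance covered by Scud:  {miss_m:.0f} m")
```

    Without intermediate rounding the distance comes out slightly above the slide's 570 m, because the slide rounds the drift to 0.34 s before multiplying.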

    8

    Consequences of Bad Approximations

    Example: Explosion of Ariane Rocket (1996 June 4)

    Source: http://www.math.psu.edu/dna/455.f96/disasters.html

    Unmanned Ariane 5 rocket launched by the European Space Agency veered off its flight path, broke up, and exploded only 30 seconds after lift-off (altitude of 3700 m)

    The $500 million rocket (with cargo) was on its 1st voyage after a decade of development costing $7 billion

    Cause: “software error in the inertial reference system”

    Specifics of the problem: a 64 bit floating point number relating to the horizontal velocity of the rocket was being converted to a 16 bit signed integer

    An SRI (inertial reference system) software exception arose during conversion because the 64-bit floating point number had a value greater than what could be represented by a 16-bit signed integer (max 32 767)
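    The failing conversion can be illustrated with two small helpers. Both function names are hypothetical and this is not the flight code; the sketch only contrasts a conversion that raises on overflow with one that saturates at the 16-bit limits.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16_checked(x: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    value = int(x)  # drops the fraction (truncates toward zero)
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a 16-bit signed integer")
    return value

def to_int16_saturating(x: float) -> int:
    """Convert to a 16-bit signed integer, clamping to the representable range."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))

print(to_int16_checked(1234.7))
print(to_int16_saturating(40000.0))
```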

  • Page 5

    9

    Outline

    • Number representation
    • Location of decimal point
    • Precision
    • Dynamic range
    • Truncation, rounding
    • Overflow

    10

    Binary numbers, unsigned integers

    2^2  2^1  2^0   value
     0    0    0    (0)
     0    0    1    (1)
     0    1    0    (2)
     0    1    1    (3)
     1    0    0    (4)
     1    0    1    (5)
     1    1    0    (6)
     1    1    1    (7)

    MSB = Most Significant Bit (leftmost column)
    LSB = Least Significant Bit (rightmost column)

    N bits give 2^N values

    [V. Öwall]

  • Page 6

    11

    Dynamic range and Resolution

    Nr. of bits   Nr. of levels   Resolution (Vfs = 0.5 V)   Dynamic range (VLSB = 0.03125 V)
    4             16              0.03125 V                  0.5 V
    8             256             2 mV                       8 V
    12            4096            0.12 mV                    128 V
    16            65 536          7.6 μV                     2048 V

    How do we use the bits? Depends on the application!

    [V. Öwall]
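    The two columns follow from simple relations: resolution = Vfs / 2^N for a fixed full-scale voltage, and dynamic range = 2^N × VLSB for a fixed LSB voltage. A minimal sketch (function names are illustrative):

```python
V_FS = 0.5        # full-scale voltage used for the resolution column (from the slide)
V_LSB = 0.03125   # LSB voltage used for the dynamic range column (from the slide)

def levels(bits: int) -> int:
    return 2 ** bits

def resolution(bits: int, v_fs: float = V_FS) -> float:
    """Voltage step per level when the full scale is fixed."""
    return v_fs / levels(bits)

def dynamic_range(bits: int, v_lsb: float = V_LSB) -> float:
    """Voltage span covered when the step size is fixed."""
    return levels(bits) * v_lsb

for bits in (4, 8, 12, 16):
    print(f"{bits:2d} bits: {levels(bits):6d} levels, "
          f"resolution {resolution(bits):.3e} V, "
          f"dynamic range {dynamic_range(bits):g} V")
```

    For 16 bits the dynamic range evaluates to 65 536 × 0.03125 V = 2048 V.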

    12

    Number Representation

    • Unsigned numbers
    • Signed digit numbers
        • Sign magnitude
        • One’s complement
        • Two’s complement

    Notation: W = K + L, with W = word length, K = number of bits before the decimal (or binary) point, L = number of bits behind the decimal (or binary) point

  • Page 7

    13

    Signed-Digit Representations

    Representations
    – 1) Signed-Magnitude: redundant
    – 2) Biased: non-redundant
    – 3) Complement
        » A) Radix Complement (r=2: “two's complement”): non-redundant
        » B) Digit Complement or Diminished-Radix Complement (r=2: “one's complement”): redundant

    Redundant: two representations for the same number
    Non-redundant: each representation is a different number

    14

    Sign Magnitude

    Unsigned numbers with a sign-bit

    - Two Zeros

    + Low Power?

    [Figure: number wheel for 3-bit signed magnitude: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = -0, 101 = -1, 110 = -2, 111 = -3]

    [V. Öwall]

  • Page 8

    15

    One’s Complement

    Signed numbers by inverting (Complement)

    - Two Zeros

    + Easy to convert to Negative

    [Figure: number wheel for 3-bit one's complement: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = -3, 101 = -2, 110 = -1, 111 = -0]

    [V. Öwall]

    16

    Two’s Complement

    Complement + LSB

    + One Zero

    + Easy Addition

    - Not so easy to convert to Neg.

    [Figure: number wheel for 3-bit two's complement: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = -4, 101 = -3, 110 = -2, 111 = -1]

    Most widely used fixed point numbering system

    [V. Öwall]

  • Page 9

    17

    Value   Signed-magnitude   Biased   Two’s complement   One’s complement
     7      0111               1111     0111               0111
     6      0110               1110     0110               0110
     5      0101               1101     0101               0101
     4      0100               1100     0100               0100
     3      0011               1011     0011               0011
     2      0010               1010     0010               0010
     1      0001               1001     0001               0001
     0      0000               1000     0000               0000
    -0      1000               -        -                  1111
    -1      1001               0111     1111               1110
    -2      1010               0110     1110               1101
    -3      1011               0101     1101               1100
    -4      1100               0100     1100               1011
    -5      1101               0011     1011               1010
    -6      1110               0010     1010               1001
    -7      1111               0001     1001               1000
    -8      -                  0000     1000               -
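    The four 4-bit encodings can be cross-checked by decoding every bit pattern. The sketch below assumes a bias of 2^(bits-1) = 8 for the biased code, which is what the entry 0 = 1000 implies; the function and scheme names are illustrative.

```python
SCHEMES = ("signed-magnitude", "biased", "twos-complement", "ones-complement")

def decode(pattern: int, bits: int, scheme: str) -> int:
    """Return the integer value of a bit pattern under the given scheme."""
    sign = pattern >> (bits - 1)                 # most significant bit
    low = pattern & ((1 << (bits - 1)) - 1)      # remaining bits
    if scheme == "signed-magnitude":
        return -low if sign else low             # 1000 decodes to -0 == 0
    if scheme == "biased":
        return pattern - (1 << (bits - 1))       # bias of 2**(bits-1) assumed
    if scheme == "twos-complement":
        return pattern - (1 << bits) if sign else pattern
    if scheme == "ones-complement":
        return pattern - ((1 << bits) - 1) if sign else pattern  # 1111 decodes to -0 == 0
    raise ValueError(f"unknown scheme: {scheme}")

for pattern in range(16):
    values = "  ".join(f"{decode(pattern, 4, s):3d}" for s in SCHEMES)
    print(f"{pattern:04b}  {values}")
```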

    18

    Position of decimal point

    Bit weights (example with W = 7, L = 3):
    i·2^3   2^2   2^1   2^0 . 2^-1   2^-2   2^-3

    Total number of bits W
    Fractional bits L
    MSB = W-L, LSB = L

    • Value representation
    • 2’s complement (i = -1)
    • unsigned (i = 1)

    How do you store this decimal point?
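    One common answer, assumed in this sketch, is that the point is not stored at all: the word is kept as a plain integer and L is a convention known to the designer. The hypothetical helper `fixed_point_value` interprets a W-bit word with L fractional bits, with the MSB weight negative for two's complement (i = -1) and positive for unsigned (i = 1).

```python
def fixed_point_value(raw: int, width: int, frac_bits: int, signed: bool = True) -> float:
    """Interpret the low `width` bits of `raw` as a fixed-point number.

    The binary point is implicit: the stored integer is simply scaled by 2**-frac_bits.
    """
    raw &= (1 << width) - 1                  # keep only W bits
    if signed and raw >> (width - 1):        # MSB set: its weight is negative (i = -1)
        raw -= 1 << width
    return raw / (1 << frac_bits)

# W = 7, L = 3: weights i*2^3, 2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3
print(fixed_point_value(0b0001101, 7, 3))                  # 1.625
print(fixed_point_value(0b1000000, 7, 3))                  # -8.0 (two's complement)
print(fixed_point_value(0b1000000, 7, 3, signed=False))    # 8.0 (unsigned)
```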

  • Page 10

    19

    Fixed point for DSP processors

    Simple binary integer (two’s complement)
    Bit weights (W = 8): -2^7 (sign bit)   2^6   2^5   2^4   2^3   2^2   2^1   2^0
    MSB = W, LSB = 0

    Simple binary fractional representation
    Bit weights (W = 8): -2^0 (sign bit)   2^-1   2^-2   2^-3   2^-4   2^-5   2^-6   2^-7
    MSB = W, LSB = L = W-1

    Values between [-1,1[

    20

    Mantissa representation

    • Mantissa: e.g. 24 bit
        – One sign bit
        – Leading mantissa bit = 1 (always!), so the mantissa lies in [-2, -1] or [+1, +2]
    • Exponent: e.g. 8 bit
    • Value = mantissa × 2^exponent

    Mantissa bit weights (example): -2^1 (sign bit)   2^0 . 2^-1   2^-2   2^-3   2^-4   2^-5   2^-6
    MSB = W, LSB = L
    Exponent bit weights (example): 2^3   2^2   2^1   2^0

  • Page 11

    21

    Precision

    • Quantization error = error when a longer numeric format is converted to a shorter one
    • E.g.: round 1.325 to 1.33, error = 0.005

    Maximum precision (in bits) = log2 (|maximum value| / |max quantization error|)

    • E.g.: 16 bit fractional representation
    • max value = -1, max error = 2^-16 (with rounding)
    • maximum precision = log2 (1 / 2^-16) = 16 bits

    Importance of scaling!!

    22

    Dynamic range

    • Dynamic range = largest number / smallest number in a given data format
    • E.g. 32 bit fractional value: ratio = (1 - 2^-31) / 2^-31 ≅ 2^31 ≅ 2.15 × 10^9 ≅ 187 dB
    • Telecom: 50 dB
    • High End Audio: 90 dB +
    • DSP processors: provide a few more bits than the dynamic range requires

    Scaling !!
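    The dB figure is 20·log10 of the ratio between the largest and the smallest positive value. The sketch below evaluates it for a fractional two's complement word and searches for the smallest word length that reaches a target; `dynamic_range_db` and `bits_for` are illustrative names.

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Dynamic range of a `bits`-wide fractional two's complement word, in dB."""
    smallest = 2.0 ** -(bits - 1)        # one LSB
    largest = 1.0 - smallest             # largest positive value
    return 20 * math.log10(largest / smallest)

def bits_for(target_db: float) -> int:
    """Smallest word length whose dynamic range reaches `target_db`."""
    bits = 2
    while dynamic_range_db(bits) < target_db:
        bits += 1
    return bits

print(f"32 bits: {dynamic_range_db(32):.1f} dB")
print(f"telecom (50 dB): {bits_for(50)} bits, high-end audio (90 dB): {bits_for(90)} bits")
```

    Under this definition each extra bit adds about 6 dB, which is why a processor with a few spare bits leaves headroom for scaling.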

  • Page 12

    23

    Rounding

    24

    How do we quantize?

    [Figure: four quantization characteristics, each plotting the fixed-point value (fxp) against the floating-point value (flp)]
    • floor ⎣x⎦: 2's-complement truncate; cheap, nasty
    • round: best, expensive
    • magnitude truncate: sign-magnitude; unusual
    • ceil ⎡x⎤

  • Page 13

    25

    Rounding

    Rounding occurs when we want to approximate a more precise number (i.e. more fractional bits L) with a less precise number (i.e. fewer fractional bits L')

    Example 1: “down”
    – old: 000110.11010001 (K=6, L=8)
    – new: 000110.11 (K'=6, L'=2)

    Example 2: “up”
    – old: 000110.11010001 (K=6, L=8)
    – new: 000111. (K'=6, L'=0)

    The following slides show rounding from L>0 fractional bits to L'=0 bits, but the mathematics holds true for any L' < L

    Usually, keep the number of integral bits the same: K'=K

    26

    Rounding Equation

    y = round(x)

    x_{k-1} x_{k-2} ... x_1 x_0 . x_{-1} x_{-2} ... x_{-l}   --Round-->   y_{k-1} y_{k-2} ... y_1 y_0
    (whole part . fractional part)

  • Page 14

    27

    Rounding Techniques

    Different rounding techniques:
    – 1) truncation
        » results in round towards zero in signed magnitude
        » results in round towards -∞ in two's complement
    – 2) round to nearest number
    – 3) round to nearest even number (or odd number)
    – 4) round towards +∞

    Other rounding techniques
    – 5) jamming or von Neumann
    – 6) ROM rounding

    Each will differ in its error depending on the representation of numbers, i.e. signed magnitude versus two's complement
    – Error = round(x) – x

    28

    1) Truncation

    The simplest possible rounding scheme: chopping or truncation

    x_{k-1} x_{k-2} ... x_1 x_0 . x_{-1} x_{-2} ... x_{-l}   --trunc-->   x_{k-1} x_{k-2} ... x_1 x_0   (x_0 is the ulp position)

    Truncation in signed-magnitude results in a number chop(x) whose magnitude is never larger than that of x. This is called round towards zero or inward rounding
    – 011.10 (3.5)10 → 011 (3)10
        » Error = -0.5
    – 111.10 (-3.5)10 → 111 (-3)10
        » Error = +0.5

    Truncation in two's complement results in a number chop(x) that is never larger than x. This is called round towards -∞ or downward-directed rounding
    – 011.10 (3.5)10 → 011 (3)10
        » Error = -0.5
    – 100.10 (-3.5)10 → 100 (-4)10
        » Error = -0.5
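    The difference between the two truncation behaviours can be seen by dropping bits from an integer-scaled value. In this sketch the two's complement value is held as a signed Python integer (scaled by 2^drop), and the signed-magnitude value as a separate sign and magnitude; both helper names are hypothetical.

```python
def chop_twos_complement(raw: int, drop: int) -> int:
    """Drop `drop` fractional bits; an arithmetic right shift rounds toward -infinity."""
    return raw >> drop

def chop_sign_magnitude(sign: int, magnitude: int, drop: int) -> int:
    """Drop `drop` bits of the magnitude; the sign is untouched, so this rounds toward zero."""
    chopped = magnitude >> drop
    return -chopped if sign else chopped

# 3.5 and -3.5 with two fractional bits: scaled magnitude 14
print(chop_twos_complement(14, 2), chop_twos_complement(-14, 2))      # 3 -4
print(chop_sign_magnitude(0, 14, 2), chop_sign_magnitude(1, 14, 2))   # 3 -3
```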

  • Page 15

    29

    Truncation Function Graph: chop(x)

    [Figure: staircase plots of chop(x) against x, both axes running from –4 to 4]

    Fig. 17.5 Truncation or chopping of a signed-magnitude number (same as round toward 0).

    Fig. 17.6 Truncation or chopping of a 2’s-complement number (same as round to -∞).

    30

    Bias in two's complement truncation

    X (binary)   X (decimal)   chop(x) (binary)   chop(x) (decimal)   Error (decimal)
    011.00        3            011                 3                    0
    011.01        3.25         011                 3                   -0.25
    011.10        3.5          011                 3                   -0.5
    011.11        3.75         011                 3                   -0.75
    100.01       -3.75         100                -4                   -0.25
    100.10       -3.5          100                -4                   -0.5
    100.11       -3.25         100                -4                   -0.75
    101.00       -3            101                -3                    0

    Assuming all combinations of positive and negative values of x are equally possible, the average error is -0.375

    In general, average error = -(2^-L' - 2^-L)/2, where L = old and L' = new number of fractional bits
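    The average can be checked by brute force over every representable value. The sketch uses exact fractions and assumes 3 integral bits; the error is taken as chop(x) - x, so the bias of two's complement truncation comes out negative.

```python
from fractions import Fraction

def average_truncation_error(old_frac: int, new_frac: int, int_bits: int = 3) -> Fraction:
    """Mean of chop(x) - x over all two's complement values with `old_frac` fractional bits."""
    drop = old_frac - new_frac
    width = int_bits + old_frac
    total, count = Fraction(0), 0
    for raw in range(-(1 << (width - 1)), 1 << (width - 1)):
        chopped = (raw >> drop) << drop               # clear the dropped bits (floor)
        total += Fraction(chopped - raw, 1 << old_frac)
        count += 1
    return total / count

def predicted_bias(old_frac: int, new_frac: int) -> Fraction:
    """Closed form: -(2^-L' - 2^-L) / 2."""
    return -(Fraction(1, 2**new_frac) - Fraction(1, 2**old_frac)) / 2

print(average_truncation_error(2, 0), predicted_bias(2, 0))
```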

  • Page 16

    31

    Implementation of truncation in hardware

    Easy, just ignore (i.e. truncate) the fractional digits from L to L'+1

    x_{k-1} x_{k-2} .. x_1 x_0 . x_{-1} x_{-2} .. x_{-L}
    = y_{k-1} y_{k-2} .. y_1 y_0 .   (ignore, i.e. truncate, the rest)

    32

    2) Round to nearest number

    Rounding to the nearest number is what we normally think of when we say “round”

    rtn in two's complement
    – 010.10 (2.5)10 → 011 (3)10
        » Error = +0.5
    – 101.10 (-2.5)10 → 110 (-2)10
        » Error = +0.5

  • Page 17

    33

    Round to Nearest Function Graph: rtn(x)

    [Figure: staircase plot of rtn(x) against x, both axes running from –4 to 4]

    34

    Bias in two's complement round to nearest

    X (binary)   X (decimal)   rtn(x) (binary)   rtn(x) (decimal)   Error (decimal)
    010.00        2            010                2                   0
    010.01        2.25         010                2                  -0.25
    010.10        2.5          011                3                  +0.5
    010.11        2.75         011                3                  +0.25
    101.01       -2.75         101               -3                  -0.25
    101.10       -2.5          110               -2                  +0.5
    101.11       -2.25         110               -2                  +0.25
    110.00       -2            110               -2                   0

    Assuming all combinations of positive and negative values of x are equally possible, the average error is +0.125
    – Smaller average error than truncation, but the error is still not symmetric
    – We have a problem with the midway value: exactly 2.5 or -2.5 always leads to a positive error bias

    Overflow problem: if only K' = K integral bits are allocated
    – Example: rtn(011.10) overflows
    – This overflow only occurs on positive numbers near the maximum positive value, not on negative numbers

  • Page 18

    35

    Truncation and rounding

    Truncation: cheapest but introduces “bias”
    E.g.: 0011 = 3, 0011.1 = 3.5 truncates to 3
          1100 = -4, 1100.1 = -3.5 truncates to -4
    Always a smaller number

    Rounding: “round to the nearest”
    Simple hardware trick: add 1/2 of the smallest number (half an LSB of the result) and truncate
    E.g.: 0011.1 = 3.5 rounds to 4
          1100.1 = -3.5 rounds to -3

    How in hardware?
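    A software view of the add-half-and-truncate trick, as a sketch: the value is held as a signed integer scaled by 2^drop, half an output LSB is added, and the low bits are discarded with an arithmetic shift. The helper name is illustrative. The same function also shows the overflow case: 011.10 rounds to 4, which no longer fits in 3 integral bits.

```python
def round_to_nearest(raw: int, drop: int) -> int:
    """Round a two's complement value to nearest by adding half an LSB, then truncating."""
    half = 1 << (drop - 1)
    return (raw + half) >> drop

# One fractional bit: 0011.1 = 3.5 and 1100.1 = -3.5
print(round_to_nearest(7, 1), round_to_nearest(-7, 1))     # 4 -3

# Two fractional bits, 3 integral bits: average error over all values
errors = [round_to_nearest(raw, 2) - raw / 4 for raw in range(-16, 16)]
print(sum(errors) / len(errors))                           # 0.125

# 011.10 = 3.5 rounds to 4, outside the 3-bit range [-4, 3]
print(round_to_nearest(0b01110, 2))
```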

    36

    Rounding

    Rounding to the nearest: still bias for numbers exactly half way

    More expensive: “convergent” rounding

    Input:  a7 (sign bit) a6 a5 a4 a3 a2 a1 a0
    Output: b3 (sign bit) b2 b1 b0

    If a3:a0 > 1000 (more than half way, a3 = 1: round up)
        b3:b0 = a7:a4 + a3
    If a3:a0 < 1000 (less than half way, a3 = 0: round down)
        b3:b0 = a7:a4 + a3
    If a3:a0 = 1000 (exactly half way: round to even)
        b3:b0 = a7:a4 + a4
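    The three cases can be written as one small function. This sketch generalises the 8-bit to 4-bit example to an arbitrary number of dropped bits; the input is a signed integer scaled by 2^drop and the function name is hypothetical. Overflow of the result is not handled here.

```python
def round_convergent(raw: int, drop: int) -> int:
    """Round to nearest; exact ties go to the even neighbour (convergent rounding)."""
    kept = raw >> drop                      # a7:a4 in the slide's example
    rest = raw & ((1 << drop) - 1)          # a3:a0, always non-negative
    half = 1 << (drop - 1)                  # the pattern 1000
    if rest > half:                         # more than half way: round up
        kept += 1
    elif rest == half:                      # exactly half way: add the kept LSB (a4)
        kept += kept & 1
    return kept

# 2.5, 3.5, -2.5, -3.5 with four fractional bits
for raw in (0b00101000, 0b00111000, -0b00101000, -0b00111000):
    print(raw / 16, "->", round_convergent(raw, 4))
```

    Ties go alternately up and down, so the systematic positive bias of plain round-to-nearest at the midway values disappears.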

  • Page 19

    37

    Overflow

    38

    What happens on an overflow?

    [Figure: two characteristics plotting the fixed-point value (fxp) against the floating-point value (flp): wrap-around and saturation, with the max. value marked]
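    Both overflow behaviours can be sketched in one conversion routine. The function name `to_fixed` and its parameters are assumptions for illustration; the LSB side is simply truncated so that only the overflow handling differs.

```python
import math

def to_fixed(x: float, width: int, frac_bits: int, overflow: str = "saturate") -> float:
    """Quantize x to a `width`-bit two's complement word with `frac_bits` fractional bits."""
    raw = math.floor(x * (1 << frac_bits))          # truncate on the LSB side
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if overflow == "saturate":
        raw = max(lo, min(hi, raw))                 # clip at the max. / min. value
    elif overflow == "wrap":
        raw = (raw - lo) % (1 << width) + lo        # keep only the low `width` bits
    else:
        raise ValueError(f"unknown overflow mode: {overflow}")
    return raw / (1 << frac_bits)

# 8-bit fractional format, values in [-1, 1[
print(to_fixed(1.25, 8, 7, "saturate"))   # 0.9921875
print(to_fixed(1.25, 8, 7, "wrap"))       # -0.75
```

    Saturation keeps the error bounded at the cost of extra logic; wrap-around costs nothing but turns a slightly too large value into a large negative one.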

  • Page 20

    39

    Adding Two's Complement Numbers: Ignoring Overflow

    Ignoring overflow, adding a K.L two's complement number to a K.L two's complement number results in a K.L number

    Example 1:            Example 2:
      0111.01 +             1000.00 +
      0110.10 =             1001.00 =
     01101.11              10001.00
    Ignore cK             Ignore cK

    Adding 7.25 + 6.5 results in -2.25: must add 2^K = 16 to get correct result (13.75)

    Adding -8 + -7 results in +1: must add -2^K = -16 to get correct result (-15)

    40

    Two's Complement Wraparound Property

    Temporary wraparounds are fine as long as the final value is in the correct dynamic range:
    – Example: add (-8 + -6) + 7 = -7
    – 1000 + 1010 = 0010
        » Should be (-14)10 not (+2)10: wraparound/overflow
    – 0010 + 0111 = 1001
        » Final result is correct: (-7)10
        » If the final result is guaranteed to be in the correct dynamic range [-8,+7] then intermediate wraparounds are fine
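    The example can be replayed with a 4-bit wrap helper. For contrast, the sketch also runs the same sum with saturation after every step, which does not recover the correct result; both helper names are illustrative.

```python
def wrap(value: int, bits: int = 4) -> int:
    """Keep the low `bits` bits and reinterpret them as two's complement."""
    span = 1 << bits
    return (value + span // 2) % span - span // 2

def saturate(value: int, bits: int = 4) -> int:
    """Clip to the representable two's complement range."""
    return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, value))

partial = wrap(-8 + -6)          # 1000 + 1010 = 0010, i.e. +2 instead of -14
final = wrap(partial + 7)        # 0010 + 0111 = 1001, i.e. -7
print(partial, final)

saturated = saturate(saturate(-8 + -6) + 7)
print(saturated)                 # -1: the intermediate clipping is not undone
```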

  • Page 21

    41

    Adding Two's Complement Numbers: Avoiding or Detecting Overflow

    K=4, L=2

    To avoid overflow, adding a K.L two's complement number to a K.L two's complement number results in a (K+1).L number. To compute, sign extend MSB, ignore cK+1
    Example:
      00111.01 +
      00110.10 =
     001101.11
    Ignore cK+1

    If result is confined to a K.L number, need overflow detection, which is cK xor cK-1
    Example:
      0111.01 +
      0110.10 =
     01101.11
    cK XOR cK-1 indicates overflow
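    Overflow detection from the two carries can be modelled on bit patterns. In this sketch `width` is the total word length K+L, the carry into the sign position plays the role of cK-1 and the carry out of it the role of cK; the function name is hypothetical.

```python
def add_with_overflow(a: int, b: int, width: int) -> tuple[int, bool]:
    """Add two `width`-bit two's complement patterns; return (result pattern, overflow flag)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    low_mask = mask >> 1                                           # all bits below the sign bit
    carry_in = ((a & low_mask) + (b & low_mask)) >> (width - 1)    # carry into the sign bit
    carry_out = (a + b) >> width                                   # carry out of the sign bit
    return (a + b) & mask, bool(carry_in ^ carry_out)

# K=4, L=2: 0111.01 + 0110.10 (7.25 + 6.5) overflows
result, overflow = add_with_overflow(0b011101, 0b011010, 6)
print(f"{result:06b}", overflow)     # 110111 True
```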

    42

    Subtracting Two's Complement Numbers: Ignoring Overflow

    Ignoring overflow, subtracting a K.L two's complement number from a K.L two's complement number results in a K.L number
    Example (invert the subtrahend and add with carry-in 1):
      0111 -          0111 +
      1000            0111 + 1 (carry-in) =
                     01111
    Ignore cK

    7 – (-8) resulted in -1
    – A wraparound/overflow occurred
    – Must add 2^K=2^4=16 to get correct value of +15

    Again we see the modulo effect
    – As with addition, temporary wraparounds are okay as long as the final result is in the correct dynamic range

  • Page 22

    43

    Subtracting Two's Complement Numbers: Avoiding or Detecting Overflow

    To avoid overflow, subtracting a K.L two's complement number from a K.L two's complement number results in a (K+1).L number
    Example (sign extend, invert the subtrahend, add with carry-in 1):
      0111.01 -        00111.01 +
      1000.00          00111.11 + 1 (carry-in) =
                      001111.01
    Ignore cK+1

    If result is confined to a K.L number, need overflow detection, which is cK xor cK-1
    Example:
      0111.01 -        0111.01 +
      1000.00          0111.11 + 1 (carry-in) =
                      01111.01
    cK XOR cK-1 indicates overflow

    44

    Negating a Two's Complement Number

    Negating a K.L two's complement number usually only requires a K.L digit result. The only exception is when you negate the most negative number, and you need a (K+1).L digit result.
        » - 0111 = 1001
        » - 1000 = 01000: need an extra bit to negate the most negative number

    Again overflow detection needed

  • Page 23

    45

    Outline

    • Number representation
    • Location of decimal point
    • Precision
    • Dynamic range
    • Truncation, rounding
    • Overflow
    • Now: what to do?

    46

    The Wordlength, i.e. nr of bits

    [Figure: FIR filter with input x(n), three delay elements D, coefficients h0, h1, h2, h3 and output y(n); UMTS-filter, 7 bits compared with float]

    Every extra bit costs
    • energy/power
    • delay
    • area

    • the word length has to be reduced

    [V. Öwall]

  • Page 24

    47

    The Wordlength, i.e. nr of bits

    [Figure: the same FIR filter: input x(n), delay elements D, coefficients h0 to h3, output y(n)]

    The output of
    • an adder needs an extra bit to be sure of no overflow, e.g. 2+2 = 4 ⇒ 10+10=100
    • a multiplier with M- and N-bit inputs needs M+N bits for full precision

    ⇒ Precision has to be limited

    [V. Öwall]
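    The growth rules can be applied mechanically to a small filter. The sketch below assumes a 4-tap FIR with 8-bit data and 7-bit coefficients (illustrative numbers) and a chain of adders, and checks both rules exhaustively for 4-bit two's complement operands. All function names are hypothetical.

```python
def adder_width(a_bits: int, b_bits: int) -> int:
    """Full-precision adder output: one bit more than the wider input."""
    return max(a_bits, b_bits) + 1

def multiplier_width(m_bits: int, n_bits: int) -> int:
    """Full-precision multiplier output: M + N bits."""
    return m_bits + n_bits

def fir_output_width(data_bits: int, coeff_bits: int, taps: int) -> int:
    """Word length at the output of a chain of adders summing `taps` products."""
    product = multiplier_width(data_bits, coeff_bits)
    acc = product
    for _ in range(taps - 1):
        acc = adder_width(acc, product)
    return acc

def fits(value: int, bits: int) -> bool:
    """True if `value` is representable in `bits`-bit two's complement."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

print(fir_output_width(8, 7, 4))   # 18

# Exhaustive check of both rules for 4-bit operands
operands = range(-8, 8)
print(all(fits(a + b, adder_width(4, 4)) for a in operands for b in operands))
print(all(fits(a * b, multiplier_width(4, 4)) for a in operands for b in operands))
```

    Eighteen bits for a 4-tap filter is a worst-case figure; this unchecked growth is what forces the precision to be limited.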

    48

    During design: specify fixed-point formats for signals

    [Figure: floating-point algorithm between two “System context” blocks, fed from an A/D converter; inputs labelled coefficients and data with word lengths 8 and 7; multipliers (*) and an adder (+) whose signals are marked “??”; to be specified for each signal: W, L, Q]

  • Page 25

    49

    Fixed-point refinement: optimization problem

    Minimize overall cost:
    – minimal word lengths
    – truncate and wrap-around

    MSB determination:
    – goal: avoid unwanted overflows
    – method: find min, max signal values
    – result: MSB position, value representation, overflow behaviour

    LSB determination:
    – goal: keep “required” precision
    – method: evaluate difference between flp and fxp behavior
    – result: LSB position, quantization

    [Figure: signal plotted over time t with the safe range and the quantization marked; cost]

    50

    1. MSB determination: range calculations

    [Figure: data flow graph with a multiplier (*) and an adder (+), signals x, c, m, d, y; range info on the inputs, range calc. over the operators]

    Analytical method
    Put range (min, max) on inputs, states
    Propagate range over the operators
    This gives a safe (pessimistic) estimate
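    Range propagation is interval arithmetic. The sketch below assumes the graph computes m = c * x and y = m + d, with made-up input ranges; ranges are (min, max) tuples. The helper `integer_bits` counts the integral bits including the sign bit, a convention chosen here because it reproduces MSB values such as 2 for ±1.9, 3 for ±2.1 and 6 for ±22.

```python
def add_range(a, b):
    """Range of a + b."""
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    """Range of a * b: the extremes lie at the corner products."""
    corners = [x * y for x in a for y in b]
    return (min(corners), max(corners))

def integer_bits(lo, hi):
    """Smallest K such that [lo, hi] fits in [-2^(K-1), 2^(K-1)[ (sign bit included)."""
    k = 1
    while not (-(2 ** (k - 1)) <= lo and hi < 2 ** (k - 1)):
        k += 1
    return k

# Hypothetical input ranges
x = (-2.0, 1.5)
c = (-0.5, 0.5)
d = (-1.0, 1.0)

m = mul_range(c, x)
y = add_range(m, d)
print("m:", m, "needs", integer_bits(*m), "integral bits")
print("y:", y, "needs", integer_bits(*y), "integral bits")
```

    The estimate is pessimistic because it assumes that all inputs hit their extremes at the same time.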

  • Page 26

    51

    Word length propagation

    Range propagation translates to word length growth
    E.g. Two’s complement integer addition A + B

    A and B represented by A + B needs A – B needs

    In general:A is represented by , B by , A + B needs

    Gets more complicated for multiplication

    52

    Range calculation grows unbounded?

    [Figure: feedback loop with a multiplier (*) by coefficient a and an adder (+)]

  • Page 27

    53

    Alternative: Collect signal statistics during simulations

    [Figure: the data flow graph (multiplier *, adder +, signals x, c, m, d, y) driven by stimuli, with min and max recorded on each signal; quantizers q1, q2]

    Perform simulation with realistic stimuli.
    Collect minimum and maximum value on each signal during the simulation
    This gives an optimistic, stimuli dependent estimate

    54

    Combine both methods for accurate MSB determination

                statistic               range propagation
    name        min    max    MSB1      min     max     MSB2
    signal1     -1.5   1.6    2         -1.9    1.9     2
    signal2     -1.3   1.4    2         -2.1    2.1     3
    signal3     -1.2   1.2    2         -22.0   22.0    6

    If MSB1 == MSB2: wrap-around(MSB1)
    If MSB1 < MSB2: choose saturate(MSB1) or wrap-around(MSB2)
    If MSB1

  • Page 28

    55

    Transform DFG for cheaper solution

    [Figure: two versions of a data flow graph with an adder (+) and a 2^-4 scaling; the first is annotated with word lengths of 16 and 20 bits, the second with 16 bits only]

    Scaling by moving multiplications or shifters over operators, use commutativity, associativity, distributivity (check accuracy!)

    Need to verify also LSB behavior

    56

    2. Quantization effects can be modeled as additive noise (LSB)

    [Figure: a quantizer Q (B bits) between input and output, next to its model: an adder (+) that adds noise to the input]

    Quantization noise is approximated by a statistical model with the following assumptions:
    – the noise is uncorrelated to the input.
    – the noise is white.
    – the probability distribution is uniform.

  • Page 29

    57

    Each quantization effect is modeled by a mean and variance

    Rounding: $m_n = 0$ and $\sigma_n^2 = \Delta^2 / 12$

    Truncation: $m_n = -\Delta / 2$ and $\sigma_n^2 = \Delta^2 / 12$

    Magnitude truncation: $m_n = 0$ and $\sigma_n^2 = \Delta^2 / 3$

    Δ is the quantization step
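    These moments can be checked numerically. The sketch takes Δ = 1 and evaluates the quantization error on a fine, evenly spaced grid that is symmetric around zero, as a stand-in for a uniformly distributed input (an assumption of this sketch); with a finite grid the results only approach the formulas.

```python
import math

def error_stats(quantizer, frac_bits: int = 8, span: int = 4):
    """Mean and variance of quantizer(x) - x over a fine grid on [-span, span[."""
    scale = 1 << frac_bits
    errors = [quantizer(k / scale) - k / scale
              for k in range(-span * scale, span * scale)]
    mean = sum(errors) / len(errors)
    variance = sum(e * e for e in errors) / len(errors) - mean * mean
    return mean, variance

QUANTIZERS = {
    "rounding": lambda x: math.floor(x + 0.5),
    "truncation": math.floor,               # two's complement truncation
    "magnitude truncation": math.trunc,     # round toward zero
}

for name, quantizer in QUANTIZERS.items():
    mean, variance = error_stats(quantizer)
    print(f"{name:22s} mean {mean:+.3f}  variance {variance:.4f}")
```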

    58

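    This results in an equivalent linear network

    [Figure: first-order recursive filter with input X(n), output Y(n), delay z^-1, coefficient a and quantizer Q; in the equivalent network the quantizer is replaced by an adder (+) with noise input e(n)]

    But quantization is a non-linear operation!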

  • Page 30

    59

    Limit cycles are an example of non-linear behavior

    [Figure: first-order recursive filter with input X(n), output Y(n), coefficient -0.96, delay z^-1 and quantizer Q (B bits, round to nearest integer); output sequences shown with rounding and without rounding]

    X(0) = 14, x(n) = 0 for n > 0

    60

    Limit cycle example
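    The limit cycle can be reproduced with a short simulation. The sketch assumes the recursion y(n) = Q(-0.96 · y(n-1)) + x(n), with Q rounding to the nearest integer placed after the multiplier, and the impulse x(0) = 14; the exact position of the quantizer in the original figure is an assumption.

```python
import math

def simulate(n_steps: int, quantize, a: float = -0.96, x0: float = 14.0):
    """First-order recursive filter driven by a single impulse of height x0."""
    y_prev, out = 0.0, []
    for n in range(n_steps):
        x = x0 if n == 0 else 0.0
        y = quantize(a * y_prev) + x
        out.append(y)
        y_prev = y
    return out

def round_nearest(v: float) -> int:
    return math.floor(v + 0.5)

ideal = simulate(201, lambda v: v)       # without rounding
rounded = simulate(201, round_nearest)   # with rounding

print("without rounding:", [round(v, 2) for v in ideal[:6]], "...", f"{ideal[-1]:.4f}")
print("with rounding:   ", rounded[:6], "...", rounded[-2:])
```

    Without rounding the output decays towards zero; with rounding it gets stuck oscillating between +12 and -12, which the linear noise model cannot predict.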

  • Page 31

    61

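    a) LSB determination must be based on simulations

    [Figure: flow chart. Make all signals fixed-point, simulate, and check whether the output is ok (yes / no). The fixed-point model (multiplier by 0.6, adder, delay z^-1, quantizers Q, signals x, m, y) and the reference model receive the same stimuli and their outputs are compared]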

    62

    b) Gradual refinement is necessary to keep the problem manageable

    [Figure: flow chart. For each signal S: quantize S only, simulate, and check whether the performance is ok (yes / no); then return. The quantized model (multiplier by 0.6, adder, delay z^-1, quantizer Q, signals x, m, y) is compared with a reference simulation on the same stimuli]
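    One step of such a refinement loop might look as follows. This is only a sketch under several assumptions: the filter is taken to be y(n) = 0.6 · y(n-1) + x(n), only the signal y is quantized (with rounding), the stimuli are a sine wave, and “performance ok” is taken to mean that the signal-to-quantization-noise ratio against the floating-point reference reaches a target. All names are hypothetical.

```python
import math

def quantize(v: float, frac_bits: int) -> float:
    """Round v to `frac_bits` fractional bits."""
    step = 2.0 ** -frac_bits
    return math.floor(v / step + 0.5) * step

def simulate(stimuli, frac_bits=None):
    """Run the filter; quantize the signal y only if frac_bits is given."""
    y, out = 0.0, []
    for x in stimuli:
        y = 0.6 * y + x
        if frac_bits is not None:
            y = quantize(y, frac_bits)
        out.append(y)
    return out

def sqnr_db(reference, test) -> float:
    """Signal-to-quantization-noise ratio of `test` against `reference`, in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

def smallest_frac_bits(stimuli, target_db: float, max_bits: int = 24):
    """Try increasing LSB positions until the performance is ok."""
    reference = simulate(stimuli)                 # floating-point reference simulation
    for bits in range(max_bits + 1):
        if sqnr_db(reference, simulate(stimuli, bits)) >= target_db:
            return bits
    return None

stimuli = [math.sin(0.05 * n) for n in range(500)]
print("fractional bits needed for 40 dB:", smallest_frac_bits(stimuli, 40.0))
```

    In a full refinement the same search would be repeated signal by signal, keeping the formats already chosen, so that each simulation has only one unknown.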

  • Page 32

    63

    Conclusion

    • Number representation
    • Location of decimal point
    • Precision
    • Dynamic range
    • Truncation, rounding
    • Overflow
    • Now: what to do?
    • Why are we doing this?
        – Next exercise: floating point to fixed point
        – Area/time/power optimization
        – Important design optimization for JPEG project