h5h4, h5e7 fixed point arithmetic lecture 2 (used to be 5)iverbauw/courses/h05h4/... · 2009. 3....

1

H5H4, H5E7: Fixed point arithmetic

Lecture 2 (used to be 5)
I. Verbauwhede
Acknowledgements: H. DeMan, V. Öwall, D. Hwang
2008-2009

K.U.Leuven

2

Overview

• Lecture 1: what is a system-on-chip
• Lecture 2: terminology for the different steps
• Lecture 3: models of computations, SDFG
• Lecture 4: control flow
• Lecture 5 (today): fixed point refinement, because we need it for exercises

3

Lecture 3: invited lecture

Friday Feb. 27, 10:30 to 12:30, room 00.62
Prof. Çetin Kaya Koç (University of California, Santa Barbara), “A brief history of cryptographic hardware design”

4

H5H4 goal: Skiing down a mountain

[Figure: design flow descending from Specification through Algorithm Transformations, Memory Transformations and Optimizations, and Floating-point to Fixed-point, down to the target architectures: ASIC, special-purpose retargetable coprocessor, DSP processor, DSP-RISC, RISC. Annotations: SPW, Matlab, C++; pipelining, unrolling; loop merging, compaction; 40 bit accumulator.]

5

References

P. Lapsley, et al., “DSP Processor fundamentals: Architectures and features,” IEEE Press, 1997, Chapter 3.

W. Sung, K. Kum, “Simulation-based Word-Length Optimization Method for Fixed-point Digital Signal processing systems,” IEEE Trans. on Signal Proc., Vol. 43, No. 12, Dec. 1995.

Viktor Öwall, Dept. of Electroscience, Lund, Sweden: www.es.lth.se/ugradcourses/DSPDesign/

M. Ercegovac, T. Lang, “Digital Arithmetic,” Kaufmann Publishers, 2004.

Fridge project: http://www.ert.rwth-aachen.de/Projekte/Tools/FRIDGE/fridge.html

6

Finite word lengths: a must for DSP

DSP applications
– high speed
– minimum area
– low power

Floating-point
– powerful
– expensive (storage & ops): 3 bytes (mantissa) + 1 byte (exponent)

Fixed-point refinement

[Figure: multiplier (*) datapaths annotated with word lengths of 8, 6 and 14 bits.]

7

Consequences of Bad Use of Approximations

Example: Failure of Patriot Missile (1991 Feb. 25)

Source: http://www.math.psu.edu/dna/455.f96/disasters.html

• An American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks, killing 28.
• Cause, per GAO/IMTEC-92-26 report: “software problem” (inaccurate calculation of the time since boot).
• Specifics of the problem: time in tenths of a second, as measured by the system’s internal clock, was multiplied by 1/10 to get the time in seconds. Internal registers were 24 bits wide:
  1/10 = 0.0001 1001 1001 1001 1001 100 (chopped to 24 b)
  Error ≅ 0.1100 1100 × $2^{-23}$ ≅ 9.5 × $10^{-8}$
• Error in 100-hr operation period ≅ 9.5 × $10^{-8}$ × 100 × 60 × 60 × 10 = 0.34 s
• Distance traveled by Scud = (0.34 s) × (1676 m/s) ≅ 570 m
• This put the Scud outside the Patriot’s “range gate”. Ironically, the fact that the bad time calculation had been improved in some (but not all) code parts contributed to the problem, since it meant that the inaccuracies did not cancel out.
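The arithmetic on this slide can be reproduced with exact fractions. The sketch below is only an illustration: it assumes the constant 1/10 was chopped to the 23 fractional bits shown above, and the names `FRACTION_BITS`, `error_per_tick`, `drift_s` and `miss_m` are made up for this example.

```python
from fractions import Fraction

FRACTION_BITS = 23  # bits after the binary point in the chopped constant (assumption from the slide)

# Chop 1/10 to FRACTION_BITS fractional bits (truncation, not rounding).
exact = Fraction(1, 10)
chopped = Fraction((exact.numerator << FRACTION_BITS) // exact.denominator,
                   1 << FRACTION_BITS)
error_per_tick = exact - chopped      # error made on every 0.1 s clock tick

ticks = 100 * 60 * 60 * 10            # number of 0.1 s ticks in 100 hours
drift_s = float(error_per_tick * ticks)
miss_m = drift_s * 1676               # Scud speed from the slide, in m/s

print(f"error per tick = {float(error_per_tick):.3e}")
print(f"clock drift    = {drift_s:.3f} s")
print(f"distance       = {miss_m:.0f} m")
```

The drift comes out slightly above the 0.34 s quoted on the slide, because the slide rounds the intermediate values.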

8

Consequences of Bad Approximations

Example: Explosion of Ariane Rocket (1996 June 4)

Source: http://www.math.psu.edu/dna/455.f96/disasters.html

• The unmanned Ariane 5 rocket launched by the European Space Agency veered off its flight path, broke up, and exploded only 30 seconds after lift-off (altitude of 3700 m).
• The $500 million rocket (with cargo) was on its 1st voyage after a decade of development costing $7 billion.
• Cause: “software error in the inertial reference system”.
• Specifics of the problem: a 64-bit floating point number relating to the horizontal velocity of the rocket was being converted to a 16-bit signed integer.
• An SRI* software exception arose during conversion because the 64-bit floating point number had a value greater than what could be represented by a 16-bit signed integer (max 32 767).

* SRI: inertial reference system
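A narrowing conversion like the one described above can either report the overflow or clamp the value. The following sketch shows both options; it is not the Ariane flight code, and the function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1   # range of a 16-bit signed integer

def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    result = int(value)                    # int() truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result

def to_int16_saturated(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

Which behaviour is appropriate depends on the system: an exception must be handled somewhere, while saturation silently limits the value.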

9

Outline

• Number representation
• Location of decimal point
• Precision
• Dynamic range
• Truncation, rounding
• Overflow

10

Binary numbers, unsigned integers

$2^2$ | $2^1$ | $2^0$ | value
0 | 0 | 0 | (0)
0 | 0 | 1 | (1)
0 | 1 | 0 | (2)
0 | 1 | 1 | (3)
1 | 0 | 0 | (4)
1 | 0 | 1 | (5)
1 | 1 | 0 | (6)
1 | 1 | 1 | (7)

MSB = Most Significant Bit (leftmost column)
LSB = Least Significant Bit (rightmost column)

N bits give $2^N$ levels

[V. Öwall]

11

Dynamic range and Resolution

Nr. of bits | Nr. of levels | Resolution (Vfs = 0.5 V) | Dynamic range (VLSB = 0.03125 V)
4 | 16 | 0.03125 V | 0.5 V
8 | 256 | 2 mV | 8 V
12 | 4096 | 0.12 mV | 128 V
16 | 65 536 | 7.6 μV | 2048 V

How do we use the bits? Depends on the application!

[V. Öwall]
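The two right-hand columns of the table follow from the number of levels $2^N$: with a fixed full-scale voltage the resolution shrinks, and with a fixed LSB voltage the range grows. A small sketch, with hypothetical function names, recomputes them:

```python
def resolution(v_full_scale: float, bits: int) -> float:
    """Step size when a fixed full-scale range is divided into 2**bits levels."""
    return v_full_scale / 2**bits

def full_scale(v_lsb: float, bits: int) -> float:
    """Range covered when every one of the 2**bits levels is v_lsb wide."""
    return v_lsb * 2**bits

for bits in (4, 8, 12, 16):
    print(bits, 2**bits, resolution(0.5, bits), full_scale(0.03125, bits))
```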

12

Number Representation

• Unsigned numbers
• Signed digit numbers
  – Sign magnitude
  – One’s complement
  – Two’s complement

Notation: W = K + L, with
W = word length
K = number of bits in front of the decimal (or binary) point
L = number of bits behind the decimal (or binary) point

13

Signed-Digit Representations

Representations
– 1) Signed-Magnitude: redundant
– 2) Biased: non-redundant
– 3) Complement
  » A) Radix Complement (r=2: “two's complement”): non-redundant
  » B) Digit Complement or Diminished-Radix Complement (r=2: “one's complement”): redundant

Redundant: two representations for the same number
Non-redundant: each representation is a different number

14

Sign Magnitude

Unsigned numbers with a sign-bit

- Two Zeros
+ Low Power?

[Figure: number wheel for 3-bit signed magnitude: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = -0, 101 = -1, 110 = -2, 111 = -3]

[V. Öwall]

15

One’s Complement

Signed numbers by inverting (Complement)

- Two Zeros
+ Easy to convert to Negative

[Figure: number wheel for 3-bit one's complement: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = -3, 101 = -2, 110 = -1, 111 = -0]

[V. Öwall]

16

Two’s Complement

Complement + 1 LSB (invert all bits, then add one LSB)

+ One Zero
+ Easy Addition
- Not so easy to convert to Neg.

[Figure: number wheel for 3-bit two's complement: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = -4, 101 = -3, 110 = -2, 111 = -1]

Most widely used fixed point numbering system

[V. Öwall]

17

Value | Signed-magnitude | Biased | Two’s complement | One’s complement
7 | 0111 | 1111 | 0111 | 0111
6 | 0110 | 1110 | 0110 | 0110
5 | 0101 | 1101 | 0101 | 0101
4 | 0100 | 1100 | 0100 | 0100
3 | 0011 | 1011 | 0011 | 0011
2 | 0010 | 1010 | 0010 | 0010
1 | 0001 | 1001 | 0001 | 0001
0 | 0000 | 1000 | 0000 | 0000
-0 | 1000 | | | 1111
-1 | 1001 | 0111 | 1111 | 1110
-2 | 1010 | 0110 | 1110 | 1101
-3 | 1011 | 0101 | 1101 | 1100
-4 | 1100 | 0100 | 1100 | 1011
-5 | 1101 | 0011 | 1011 | 1010
-6 | 1110 | 0010 | 1010 | 1001
-7 | 1111 | 0001 | 1001 | 1000
-8 | | 0000 | 1000 |
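The four encodings can be generated mechanically. The sketch below uses a hypothetical helper `encode`; it assumes a bias of $2^{W-1}$ for the biased format, which matches the 4-bit values shown here, and it cannot express the negative zero of the redundant formats because Python integers have only one zero.

```python
def encode(value: int, width: int = 4) -> dict:
    """Bit patterns of `value` in four signed formats (None if not representable)."""
    half = 1 << (width - 1)            # 8 for width 4
    mask = (1 << width) - 1
    fmt = f"0{width}b"
    symmetric = abs(value) < half      # range of signed magnitude and one's complement
    asymmetric = -half <= value < half # range of biased and two's complement
    return {
        # sign bit followed by the magnitude
        "signed-magnitude": format((half if value < 0 else 0) | abs(value), fmt)
                            if symmetric else None,
        # value + bias stored as an unsigned number
        "biased": format(value + half, fmt) if asymmetric else None,
        # value modulo 2**width
        "twos-complement": format(value & mask, fmt) if asymmetric else None,
        # negatives: invert every bit of the magnitude
        "ones-complement": format(value if value >= 0 else ~(-value) & mask, fmt)
                           if symmetric else None,
    }

for v in (7, 0, -3, -8):
    print(v, encode(v))
```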

18

Position of decimal point

Bit weights (example with W = 7, L = 3):
$i \cdot 2^{3}$, $2^{2}$, $2^{1}$, $2^{0}$ . $2^{-1}$, $2^{-2}$, $2^{-3}$

Total number of bits W
Fractional bits L
MSB = W-L, LSB = L

• Value representation
  – 2’s complement (i = -1)
  – unsigned (i = 1)

How do you store this decimal point?
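In a fixed-point format the binary point is usually not stored at all: the word holds a plain integer, and W and L are known from the design. A minimal sketch of that interpretation, with a hypothetical function name:

```python
def fixed_to_float(raw: int, W: int, L: int, signed: bool = True) -> float:
    """Interpret the W-bit pattern `raw` as a number with L fractional bits."""
    assert 0 <= raw < (1 << W)
    if signed and raw >> (W - 1):   # MSB set: its weight is negative (i = -1)
        raw -= 1 << W
    return raw / (1 << L)           # the binary point is only a scale factor 2**-L

# Pattern 1010.101 with W = 7, L = 3
print(fixed_to_float(0b1010101, 7, 3, signed=True))
print(fixed_to_float(0b1010101, 7, 3, signed=False))
```

The same seven bits give -5.375 as a two's complement number and 10.625 as an unsigned number.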

19

Fixed point for DSP processors

Simple binary integer (two’s complement), W = 8
Bit weights: $-2^{7}$ (sign bit), $2^{6}$, $2^{5}$, $2^{4}$, $2^{3}$, $2^{2}$, $2^{1}$, $2^{0}$
MSB = W, LSB = 0

Simple binary fractional representation, W = 8
Bit weights: $-2^{0}$ (sign bit), $2^{-1}$, $2^{-2}$, $2^{-3}$, $2^{-4}$, $2^{-5}$, $2^{-6}$, $2^{-7}$
MSB = W, LSB = L = W-1

Values between [-1, 1[

20

Mantissa representation

• Mantissa: e.g. 24 bit
  – One sign bit
  – Leading mantissa bit = 1 (always!), so the mantissa lies in [-2, -1] or [+1, +2]
• Exponent: e.g. 8 bit
• Value = mantissa × $2^{\text{exponent}}$

[Figure: mantissa bit weights $-2^{1}$ (sign bit), $2^{0}$, $2^{-1}$, ..., $2^{-6}$ (MSB = W, LSB = L), followed by exponent bit weights $2^{3}$, $2^{2}$, $2^{1}$, $2^{0}$]

21

Precision

• Quantization error = error when a longer numeric format is converted to a shorter one
• E.g.: round 1.325 to 1.33, error = 0.005

Maximum precision (in bits) = $\log_2(|\text{maximum value}| / |\text{max quantization error}|)$

• E.g.: 16 bit fractional representation
  – max value = -1, max error = $2^{-16}$ (with rounding)
  – maximum precision = $\log_2(1 / 2^{-16})$ = 16 bits

Importance of scaling!!

22

Dynamic range

• Dynamic range = largest number / smallest number in a given data format
• E.g. 32 bit fractional value: ratio = $(1 - 2^{-31}) / 2^{-31} \approx 2^{31} \approx 2.15 \times 10^{9}$ ≈ 187 dB
• Telecom: 50 dB
• High End Audio: 90 dB +
• DSP processors: provide a few more bits than the dynamic range requires

Scaling !!
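The 187 dB figure is 20·log10 of the ratio above. The sketch below recomputes it and inverts the relation to estimate how many bits a given dynamic range needs; the function names are hypothetical and a two's complement fractional format is assumed.

```python
import math

def dynamic_range_db(W: int) -> float:
    """Dynamic range of a W-bit two's complement fractional format, in dB."""
    largest = 1 - 2.0 ** -(W - 1)    # largest positive value
    smallest = 2.0 ** -(W - 1)       # one LSB
    return 20 * math.log10(largest / smallest)

def bits_for_db(db: float) -> int:
    """Smallest word length W whose dynamic range reaches `db`."""
    W = 2
    while dynamic_range_db(W) < db:
        W += 1
    return W

print(round(dynamic_range_db(32)))           # 32 bit fractional value
print(bits_for_db(50), bits_for_db(90))      # telecom and high-end audio targets
```

Under these assumptions the 50 dB and 90 dB targets need at least 10 and 16 bits, before adding the few extra bits mentioned above.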

23

Rounding

24

How do we quantize?

[Figure: four transfer curves of fixed-point (fxp) output versus floating-point (flp) input]
• floor ⌊x⌋: two's complement truncate; cheap, nasty
• round: best, expensive
• magnitude truncate: sign-magnitude
• ceil ⌈x⌉: unusual

25

Rounding

Rounding occurs when we want to approximate a more precise number (i.e. more fractional bits L) with a less precise number (i.e. fewer fractional bits L')

Example 1: “down”
– old: 000110.11010001 (K=6, L=8)
– new: 000110.11 (K'=6, L'=2)

Example 2: “up”
– old: 000110.11010001 (K=6, L=8)
– new: 000111. (K'=6, L'=0)

The following slides show rounding from L>0 fractional bits to L'=0 bits, but the mathematics hold true for any L' < L

Usually, keep the number of integral bits the same: K'=K

26

Rounding Equation

y = round(x)

$x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l} \;\xrightarrow{\text{Round}}\; y_{k-1} y_{k-2} \ldots y_1 y_0$

Whole part: $x_{k-1} \ldots x_0$; fractional part: $x_{-1} \ldots x_{-l}$

27

Rounding Techniques

Different rounding techniques:
– 1) truncation
  » results in round towards zero in signed magnitude
  » results in round towards -∞ in two's complement
– 2) round to nearest number
– 3) round to nearest even number (or odd number)
– 4) round towards +∞

Other rounding techniques
– 5) jamming or von Neumann
– 6) ROM rounding

Each will differ in its error depending on the representation of numbers, i.e. signed magnitude versus two's complement
– Error = round(x) - x

28

1) Truncation

The simplest possible rounding scheme: chopping or truncation

$x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l} \;\xrightarrow{\text{trunc}}\; x_{k-1} x_{k-2} \ldots x_1 x_0$ (the unit in the last place, ulp, is now $x_0$)

Truncation in signed-magnitude results in a number chop(x) whose magnitude is never larger than that of x. This is called round towards zero or inward rounding
– 011.10 (3.5)10 → 011 (3)10
  » Error = -0.5
– 111.10 (-3.5)10 → 111 (-3)10
  » Error = +0.5

Truncation in two's complement results in a number chop(x) that is never larger than x. This is called round towards -∞ or downward-directed rounding
– 011.10 (3.5)10 → 011 (3)10
  » Error = -0.5
– 100.10 (-3.5)10 → 100 (-4)10
  » Error = -0.5
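The difference between the two cases can be checked numerically. In this sketch (function names are hypothetical) `math.floor` models dropping the fraction bits of a two's complement word and `math.trunc` models dropping them from a sign-magnitude word:

```python
import math

def chop_twos_complement(x: float) -> int:
    """Dropping the fraction bits of a two's complement word rounds toward -inf."""
    return math.floor(x)

def chop_sign_magnitude(x: float) -> int:
    """Dropping the fraction bits of a sign-magnitude word rounds toward zero."""
    return math.trunc(x)

for x in (3.5, -3.5):
    for chop in (chop_twos_complement, chop_sign_magnitude):
        print(f"{chop.__name__}({x}) = {chop(x)}, error = {chop(x) - x}")

# On integers, Python's >> is an arithmetic shift, so it behaves like the
# two's complement case: -3.5 with L = 2 is stored as -14.
print(-14 >> 2)
```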

29

Truncation Function Graph: chop(x)

[Figure: two staircase plots of chop(x) versus x, for x from -4 to 4]

Fig. 17.5 Truncation or chopping of a signed-magnitude number (same as round toward 0).

Fig. 17.6 Truncation or chopping of a 2’s-complement number (same as round to -∞).

30

Bias in two's complement truncation

X (binary) | X (decimal) | chop(x) (binary) | chop(x) (decimal) | Error (decimal)
011.00 | 3 | 011 | 3 | 0
011.01 | 3.25 | 011 | 3 | -0.25
011.10 | 3.5 | 011 | 3 | -0.5
011.11 | 3.75 | 011 | 3 | -0.75
100.01 | -3.75 | 100 | -4 | -0.25
100.10 | -3.5 | 100 | -4 | -0.5
100.11 | -3.25 | 100 | -4 | -0.75
101.00 | -3 | 101 | -3 | 0

Assuming all combinations of positive and negative values of x are equally possible, the average error is -0.375

In general, average error = $-(2^{-L'} - 2^{-L})/2$, where L = old and L' = new number of fractional bits
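The closed form can be cross-checked by enumerating every pattern of the dropped bits. The sketch below assumes two's complement truncation with all patterns equally likely; the function names are hypothetical.

```python
from fractions import Fraction

def mean_truncation_error(L: int, L_new: int) -> Fraction:
    """Average of chop(x) - x over all patterns of the (L - L_new) dropped bits."""
    dropped = L - L_new
    # In two's complement the dropped bits j always carry a positive weight,
    # so truncation subtracts j * 2**-L regardless of the sign of x.
    errors = [-Fraction(j, 2**L) for j in range(2**dropped)]
    return sum(errors) / len(errors)

def closed_form(L: int, L_new: int) -> Fraction:
    """-(2**-L_new - 2**-L) / 2, the expression given on the slide with its sign."""
    return -(Fraction(1, 2**L_new) - Fraction(1, 2**L)) / 2

print(mean_truncation_error(2, 0))   # the case of the table: L = 2, L' = 0
```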

31

Implementing truncation in hardware

Easy, just ignore (i.e. truncate) the fractional digits from L to L'+1

$x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-L}$
$= y_{k-1} y_{k-2} \ldots y_1 y_0 \,.$ (ignore, i.e. truncate, the rest)

32

2) Round to nearest number

Rounding to the nearest number is what we normally think of when we say “round”

rtn in two's complement
– 010.10 (2.5)10 → 011 (3)10
  » Error = +0.5
– 101.10 (-2.5)10 → 110 (-2)10
  » Error = +0.5

33

Round to Nearest Function Graph: rtn(x)

[Figure: staircase plot of rtn(x) versus x, for x from -4 to 4]

34

Bias in two's complement round to nearest

X (binary) | X (decimal) | rtn(x) (binary) | rtn(x) (decimal) | Error (decimal)
010.00 | 2 | 010 | 2 | 0
010.01 | 2.25 | 010 | 2 | -0.25
010.10 | 2.5 | 011 | 3 | +0.5
010.11 | 2.75 | 011 | 3 | +0.25
101.01 | -2.75 | 101 | -3 | -0.25
101.10 | -2.5 | 110 | -2 | +0.5
101.11 | -2.25 | 110 | -2 | +0.25
110.00 | -2 | 110 | -2 | 0

Assuming all combinations of positive and negative values of x are equally possible, the average error is +0.125
– Smaller average error than truncation, but the error is still not symmetric
– We have a problem with the midway value: exactly at 2.5 or -2.5 the error is always positive, which biases the result

Overflow problem: if we only allocate K' = K integral bits
– Example: rtn(011.10) overflows
– This overflow only occurs on positive numbers near the maximum positive value, not on negative numbers
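Both observations, the +0.125 bias and the overflow near the positive maximum, can be reproduced with a few lines. This is a sketch with hypothetical names; `rtn` resolves ties toward +∞, as in the table.

```python
import math
from fractions import Fraction

def rtn(x: Fraction) -> int:
    """Round to nearest, ties toward +inf: add half an LSB, then truncate."""
    return math.floor(x + Fraction(1, 2))

def fits(value: int, K: int) -> bool:
    """True if `value` is representable as a K-bit two's complement integer."""
    return -(1 << (K - 1)) <= value < (1 << (K - 1))

# The eight rows of the table (L = 2 fractional bits)
xs = [Fraction(n, 4) for n in (8, 9, 10, 11, -11, -10, -9, -8)]
errors = [rtn(x) - x for x in xs]
mean_error = sum(errors) / len(errors)
print(mean_error)

# rtn(011.10) = 4 no longer fits in K = 3 bits, while rtn(100.10) = -3 does
print(fits(rtn(Fraction(7, 2)), 3), fits(rtn(Fraction(-7, 2)), 3))
```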

35

Truncation and rounding

Truncation: cheapest but introduces “bias”
E.g.: 0011 = 3, 0011.1 = 3.5 truncates to 3
      1100 = -4, 1100.1 = -3.5 truncates to -4
Always a smaller number

Rounding: “round to the nearest”
Simple hardware trick: add 1/2 of the smallest number (half an LSB of the result) and truncate
E.g.: 0011.1 = 3.5 rounds to 4
      1100.1 = -3.5 rounds to -3

How in hardware?
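On integers the trick is one addition and one arithmetic shift. The sketch below (hypothetical function names) operates on the raw two's complement integer, so 0011.1 is the integer 7 with one fractional bit and 1100.1 is -7:

```python
def trunc_shift(raw: int, drop: int) -> int:
    """Drop `drop` LSBs by an arithmetic right shift (two's complement truncation)."""
    return raw >> drop

def round_shift(raw: int, drop: int) -> int:
    """Drop `drop` LSBs with round-to-nearest: add half of the new LSB, then truncate."""
    return (raw + (1 << (drop - 1))) >> drop

print(trunc_shift(7, 1), round_shift(7, 1))     # 3.5 truncates to 3, rounds to 4
print(trunc_shift(-7, 1), round_shift(-7, 1))   # -3.5 truncates to -4, rounds to -3
```

Adding half an LSB and truncating gives the same result as adding the most significant dropped bit to the truncated word, which in hardware can be an incrementer or the carry-in of an adder that is already present.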

36

Rounding

Rounding to the nearest: still bias for numbers exactly half way

More expensive: “convergent” rounding

[Figure: 8-bit input a7 (sign bit) a6 a5 a4 a3 a2 a1 a0 reduced to 4-bit output b3 (sign bit) b2 b1 b0]

If a3:a0 > 1000: b3:b0 = a7:a4 + a3
If a3:a0 < 1000: b3:b0 = a7:a4 + a3
If a3:a0 = 1000: b3:b0 = a7:a4 + a4
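The three rules translate directly into code. The sketch below works on a signed Python integer in the range of an 8-bit word and uses a hypothetical function name. It returns the unbounded result, so inputs at or above 0111 1000 produce 8, which no longer fits in four bits and would need overflow handling.

```python
def convergent_round(a: int) -> int:
    """Reduce an 8-bit two's complement value a7:a0 to a7:a4, rounding half to even."""
    high = a >> 4                   # a7:a4 (arithmetic shift keeps the sign)
    low = a & 0xF                   # a3:a0
    if low == 0b1000:               # exactly half way: add a4, so the result is even
        return high + (high & 1)
    return high + (low >> 3)        # otherwise add a3 (ordinary round to nearest)

for a in (0x18, 0x28, -0x18, -0x28):    # 1.5, 2.5, -1.5, -2.5 in units of 16
    print(a / 16, convergent_round(a))
```

Python's built-in `round` also rounds halves to even, which gives a convenient reference for checking the rule.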

37

Overflow

38

What happens on an overflow?

[Figure: two transfer curves of fixed-point (fxp) output versus floating-point (flp) input, one for wrap-around and one for saturation at the max. value]
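The two overflow behaviours can be modelled on integers as follows. This is a sketch with hypothetical function names, for a W-bit two's complement word:

```python
def wrap(value: int, W: int) -> int:
    """Keep the W low bits and reinterpret them as two's complement (wrap-around)."""
    half = 1 << (W - 1)
    return (value + half) % (1 << W) - half

def saturate(value: int, W: int) -> int:
    """Clamp to the most negative or most positive W-bit value (saturation)."""
    half = 1 << (W - 1)
    return max(-half, min(half - 1, value))

for v in (7, 8, 9, -9):
    print(v, wrap(v, 4), saturate(v, 4))
```

Wrap-around costs nothing in hardware but turns a slightly too large value into a large negative one; saturation needs extra logic but keeps the error small.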

39

Adding Two's Complement Numbers: Ignoring Overflow

Ignoring overflow, adding a K.L two's complement number to a K.L two's complement number results in a K.L number

Example (K=4, L=2):
   0111.01 +
   0110.10 =
  01101.11   (ignore $c_K$)
Adding 7.25 + 6.5 results in -2.25: must add 2^K = 16 to get the correct result (13.75)

Example (K=4, L=2):
   1000.00 +
   1001.00 =
  10001.00   (ignore $c_K$)
Adding -8 + -7 results in +1: must add -2^K = -16 to get the correct result (-15)

40

Two's Complement Wraparound Property

Temporary wraparounds are fine as long as the final value is in the correct dynamic range:
– Example: add (-8 + -6) + 7 = -7
– 1000 + 1010 = 0010
  » Should be (-14)10, not (+2)10: wraparound/overflow
– 0010 + 0111 = 1001
  » Final result is correct: (-7)10
  » If the final result is guaranteed to be in the correct dynamic range [-8,+7], then intermediate wraparounds are fine
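The property holds because two's complement addition is arithmetic modulo 2^K. The sketch below replays the example with a hypothetical 4-bit helper `wrap4` and then checks every three-term sum whose true result fits:

```python
def wrap4(value: int) -> int:
    """Keep only 4 bits and reinterpret them as a two's complement integer."""
    return (value + 8) % 16 - 8

partial = wrap4(-8 + -6)       # the true sum -14 is out of range and wraps
final = wrap4(partial + 7)     # the final sum is back in range
print(partial, final)

# Every a + b + c whose true value lies in [-8, +7] comes out right,
# whatever happens to the intermediate sum.
values = range(-8, 8)
all_correct = all(wrap4(wrap4(a + b) + c) == a + b + c
                  for a in values for b in values for c in values
                  if -8 <= a + b + c <= 7)
print(all_correct)
```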

41

Adding Two's Complement Numbers: Avoiding or Detecting Overflow

To avoid overflow, adding a K.L two's complement number to a K.L two's complement number results in a (K+1).L number. To compute, sign extend the MSB and ignore $c_{K+1}$

Example (K=4, L=2):
   00111.01 +
   00110.10 =
  001101.11   (ignore $c_{K+1}$)

If the result is confined to a K.L number, overflow detection is needed, which is $c_K$ xor $c_{K-1}$

Example (K=4, L=2):
   0111.01 +
   0110.10 =
  01101.11   ($c_K$ XOR $c_{K-1}$ indicates overflow)
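The detection rule compares the carry out of the sign bit with the carry into it. A bit-level sketch follows; the function name is hypothetical, W = K + L is the total word length, and the two carries correspond to $c_K$ and $c_{K-1}$ in the slide's notation.

```python
def add_twos_complement(a_bits: int, b_bits: int, W: int):
    """Add two W-bit patterns.

    Returns (W-bit sum pattern, carry out of the MSB, carry into the MSB, overflow).
    """
    mask = (1 << W) - 1
    low = mask >> 1                                   # all bits below the sign bit
    carry_into_msb = ((a_bits & low) + (b_bits & low)) >> (W - 1)
    total = a_bits + b_bits
    carry_out = total >> W
    overflow = bool(carry_out ^ carry_into_msb)
    return total & mask, carry_out, carry_into_msb, overflow

# 0111.01 + 0110.10 with K = 4, L = 2, so W = 6
print(add_twos_complement(0b011101, 0b011010, 6))
# 0001.00 + 0010.00: no overflow
print(add_twos_complement(0b000100, 0b001000, 6))
```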

42

Subtracting Two's Complement Numbers: Ignoring Overflow

Ignoring overflow, subtracting a K.L two's complement number from a K.L two's complement number results in a K.L number

Example: 0111 - 1000 is computed as 0111 + 0111 (the inverted subtrahend) with a carry-in of 1:
       1     (carry-in)
   0111 +
   0111 =
  01111      (ignore $c_K$)

7 - (-8) resulted in -1
– A wraparound/overflow occurred
– Must add 2^K = 2^4 = 16 to get the correct value of +15

Again we see the modulo effect
– As with addition, temporary wraparounds are okay as long as the final result is in the correct dynamic range

43

Subtracting Two's Complement Numbers: Avoiding or Detecting Overflow

If the result is confined to a K.L number, overflow detection is needed, which is $c_K$ xor $c_{K-1}$

Example: 0111.01 - 1000.00, computed as 0111.01 + 0111.11 with a carry-in of 1 at the LSB:
          1     (carry-in)
   0111.01 +
   0111.11 =
  01111.01      ($c_K$ XOR $c_{K-1}$ indicates overflow)

To avoid overflow, subtracting a K.L two's complement number from a K.L two's complement number results in a (K+1).L number

Example: the same subtraction with sign-extended operands:
           1     (carry-in)
   00111.01 +
   00111.11 =
  001111.01      (ignore $c_{K+1}$)
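Subtraction reuses the adder: invert the subtrahend and set the carry-in to 1. The sketch below (hypothetical function name, integer operands) also derives the overflow flag from the two carries around the sign bit:

```python
def subtract(a: int, b: int, W: int):
    """Return (a - b wrapped to W bits, overflow flag), computed as a + ~b + 1."""
    mask = (1 << W) - 1
    low = mask >> 1                              # all bits below the sign bit
    a_u, nb_u = a & mask, ~b & mask              # W-bit patterns of a and inverted b
    total = a_u + nb_u + 1                       # the +1 is the carry-in
    carry_out = total >> W
    carry_into_msb = ((a_u & low) + (nb_u & low) + 1) >> (W - 1)
    result = total & mask
    if result >> (W - 1):                        # reinterpret as a signed value
        result -= 1 << W
    return result, bool(carry_out ^ carry_into_msb)

print(subtract(7, -8, 4))   # confined to 4 bits: wraps and flags overflow
print(subtract(7, -8, 5))   # one extra integer bit: correct result, no overflow
```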

44

Negating a Two's Complement Number

Negating a K.L two's complement number usually only requires a K.L digit result. The only exception is when you negate the largest negative number, and you need a (K+1).L digit result.
  » - 0111 = 1001
  » - 1000 = 01000: need an extra bit to negate the largest negative number

Again overflow detection needed

45

Outline

• Number representation
• Location of decimal point
• Precision
• Dynamic range
• Truncation, rounding
• Overflow
• Now: what to do?

46

The Wordlength, i.e. nr of bits

[Figure: 4-tap FIR filter with input x(n), delay elements D, coefficients h0 to h3 and output y(n); UMTS-filter response with 7 bits compared with float]

Every extra bit costs
• energy/power
• delay
• area

⇒ the word length has to be reduced

[V. Öwall]

47

The Wordlength, i.e. nr of bits

[Figure: 4-tap FIR filter with input x(n), delay elements D, coefficients h0 to h3 and output y(n)]

The output of
• an adder needs an extra bit to be sure of no overflow, e.g. 2+2 = 4 ⇒ 10+10=100
• a multiplier with M×N bits needs M+N bits for full precision

⇒ Precision has to be limited

[V. Öwall]
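Applying the two rules along a datapath shows how quickly the word length grows. The sketch below uses hypothetical helper names and applies the adder rule adder by adder, which is deliberately pessimistic: a tighter bound for a sum of T products is about log2(T) extra bits.

```python
def adder_bits(m: int, n: int) -> int:
    """Bits needed for the sum of an m-bit and an n-bit number without overflow."""
    return max(m, n) + 1

def multiplier_bits(m: int, n: int) -> int:
    """Bits needed for the full-precision product of an m-bit and an n-bit number."""
    return m + n

def fir_output_bits(data_bits: int, coeff_bits: int, taps: int) -> int:
    """Output width of a FIR filter when every adder in the chain adds one bit."""
    width = multiplier_bits(data_bits, coeff_bits)
    for _ in range(taps - 1):
        width = adder_bits(width, multiplier_bits(data_bits, coeff_bits))
    return width

print(adder_bits(2, 2), multiplier_bits(8, 6), fir_output_bits(8, 8, 4))
```

For the 4-tap filter with 8-bit data and 8-bit coefficients this worst-case rule already asks for 19 bits, which is why the precision has to be limited somewhere.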

48

During design: specify fixed-point formats for signals

[Figure: floating-point algorithm (multipliers * and adder +) placed in its system context behind an A/D converter, with word lengths 8 and 7 known at the boundaries; the formats (W, L, Q) of the internal coefficients and data are still unknown (??)]

Fixed-point refinement: optimization problem

Minimize overall cost:
– minimal word lengths
– truncate and wrap-around

MSB determination:
– goal: avoid unwanted overflows
– method: find min, max signal values
– result: MSB position, value representation, overflow behaviour

LSB determination:
– goal: keep “required” precision
– method: evaluate difference between flp and fxp behavior
– result: LSB position, quantization

[Figure: a signal over time t with its safe range and its quantization, and the resulting cost]

50

1. MSB determination: range calculations

[Figure: data-flow graph with signals x, c, m, d, y, a multiplier (*) and an adder (+); range info on the inputs is propagated by range calculation]

Analytical method
• Put range (min, max) on inputs, states
• Propagate range over the operators
• This gives a safe (pessimistic) estimate
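Range propagation is interval arithmetic. The sketch below is a minimal illustration: the class `Range`, the helper `integer_bits` and the input ranges chosen for x, c and d are assumptions made for this example, and the MSB is counted as the number of integer bits including the sign bit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    lo: float
    hi: float

    def __add__(self, other: "Range") -> "Range":
        return Range(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Range") -> "Range":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Range(min(products), max(products))

def integer_bits(r: Range) -> int:
    """Smallest K (sign bit included) with -2**(K-1) <= lo and hi < 2**(K-1)."""
    K = 1
    while not (-(2 ** (K - 1)) <= r.lo and r.hi < 2 ** (K - 1)):
        K += 1
    return K

# Hypothetical input ranges for y = c * x + d
x = Range(-1.0, 1.0)
c = Range(0.5, 0.5)
d = Range(-1.5, 1.5)
m = c * x
y = m + d
print(m, y, integer_bits(y))
```

The estimate is safe because it assumes that the extremes of all inputs can occur together, which is also why it tends to be pessimistic.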

51

Word length propagation

Range propagation translates to word length growth
E.g. two’s complement integer addition A + B
– A and B represented by n bits: A + B needs n + 1 bits, A - B needs n + 1 bits

In general: A is represented by n bits, B by m bits: A + B needs max(n, m) + 1 bits

Gets more complicated for multiplication

52

Range calculations grow unbounded?

[Figure: feedback loop with an adder (+) and a multiplier (*) by coefficient a]

53

Alternative: Collect signal statistics during simulations

[Figure: the data-flow graph with signals x, c, m, d, y, driven by stimuli, with min/max monitors (?) on the signals and quantizers q1, q2]

• Perform simulation with realistic stimuli
• Collect minimum and maximum value on each signal during the simulation
• This gives an optimistic, stimuli dependent estimate
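Collecting the statistics only needs a small monitor on every signal. The sketch below is an illustration under stated assumptions: the first-order recursion y(n) = a·y(n-1) + x(n) with a = 0.9, the sine stimulus and the class `RangeMonitor` are all chosen for this example and do not come from the slides.

```python
import math

class RangeMonitor:
    """Records the minimum and maximum value seen on one signal."""
    def __init__(self):
        self.min, self.max = math.inf, -math.inf

    def __call__(self, value: float) -> float:
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        return value                      # pass the value through unchanged

def simulate(stimuli, a: float = 0.9):
    """Run y(n) = a*y(n-1) + x(n) with a monitor on the signals x, m and y."""
    monitors = {name: RangeMonitor() for name in ("x", "m", "y")}
    y = 0.0
    for x in stimuli:
        monitors["x"](x)
        m = monitors["m"](a * y)
        y = monitors["y"](m + x)
    return monitors

stimuli = [math.sin(0.3 * n) for n in range(500)]
monitors = simulate(stimuli)
for name, mon in monitors.items():
    print(name, round(mon.min, 3), round(mon.max, 3))
```

For |x| ≤ 1 the analytical worst case of this loop is |y| ≤ 1/(1 - 0.9) = 10. The simulated range stays inside that bound, which illustrates why the simulation estimate is optimistic and depends on the stimuli.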

54

Combine both methods for accurate MSB determination

name | signal statistic: min | max | MSB1 | range propagation: min | max | MSB2
signal1 | -1.5 | 1.6 | 2 | -1.9 | 1.9 | 2
signal2 | -1.3 | 1.4 | 2 | -2.1 | 2.1 | 3
signal3 | -1.2 | 1.2 | 2 | -22.0 | 22.0 | 6

If MSB1 == MSB2: wrap-around(MSB1)
If MSB1 < MSB2: choose saturate(MSB1) or wrap-around(MSB2)
If MSB1

55

Transform DFG for cheaper solution

Scaling by moving multiplications or shifters over operators, using commutativity, associativity, distributivity (check accuracy!)

Need to verify also LSB behavior

[Figure: adder data-flow graph before and after moving a $2^{-4}$ scaling across the adder, annotated with 16-bit and 20-bit word lengths]

56

2. Quantization effects can be modeled as additive noise (LSB)

[Figure: a quantizer Q (B bits) between input and output is replaced by an adder (+) that adds noise to the input]

Quantization noise is approximated by a statistical model with the following assumptions:
– the noise is uncorrelated to the input.
– the noise is white.
– the probability distribution is uniform.

57

Each quantization effect is modeled by a mean and variance

Rounding: $m_n = 0$ and $\sigma_n^2 = \Delta^2 / 12$

Truncation: $m_n = -\Delta / 2$ and $\sigma_n^2 = \Delta^2 / 12$

Magnitude truncation: $m_n = 0$ and $\sigma_n^2 = \Delta^2 / 3$

Δ is quantization step
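These values can be checked with a small Monte Carlo experiment. The sketch below makes its own assumptions: inputs drawn uniformly from [-100, 100], a fixed seed, and hypothetical names. It measures the mean and variance of the error q(x) - x for the three quantizers.

```python
import math
import random

def error_stats(quantize, delta: float = 1.0, samples: int = 100_000, seed: int = 0):
    """Mean and variance of quantize(x) - x for uniformly distributed inputs."""
    rng = random.Random(seed)
    errors = []
    for _ in range(samples):
        x = rng.uniform(-100.0, 100.0)
        errors.append(quantize(x / delta) * delta - x)
    mean = sum(errors) / len(errors)
    variance = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, variance

def rounding(v: float) -> int:
    return math.floor(v + 0.5)

quantizers = {
    "rounding": rounding,
    "truncation": math.floor,            # two's complement truncation
    "magnitude truncation": math.trunc,  # round toward zero
}
for name, q in quantizers.items():
    mean, variance = error_stats(q)
    print(f"{name}: mean = {mean:+.3f}, variance = {variance:.3f}")
```

With Δ = 1 the measured values should come out close to 0 and 1/12, -1/2 and 1/12, and 0 and 1/3 respectively.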

58

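This results in an equivalent linear network

[Figure: first-order recursive filter (adder, multiplier by a, delay $z^{-1}$, input X(n), output Y(n)) with quantizer Q, next to its equivalent linear network in which Q is replaced by an added noise source e(n)]

But quantization is a non-linear operation!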

59

Limit cycles are an example of non-linear behavior

[Figure: first-order recursive filter with coefficient -0.96, delay $z^{-1}$, input X(n), output Y(n) and a B-bit quantizer Q (round to nearest integer); output sequences with rounding and without rounding]

X(0) = 14, x(n) = 0 for n > 0
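The effect is easy to reproduce in simulation. The sketch below assumes that the quantizer sits directly after the multiplier, i.e. y(n) = Q(-0.96·y(n-1)) + x(n); the exact position of Q in the original figure is an assumption here, and the function name is hypothetical.

```python
import math

def impulse_response(steps: int, quantize):
    """y(n) = quantize(-0.96 * y(n-1)) + x(n), with x(0) = 14 and x(n) = 0 afterwards."""
    y, out = 0.0, []
    for n in range(steps):
        x = 14 if n == 0 else 0
        y = quantize(-0.96 * y) + x
        out.append(y)
    return out

with_rounding = impulse_response(100, lambda v: math.floor(v + 0.5))
without_rounding = impulse_response(100, lambda v: v)

print(with_rounding[:8])
print([round(v, 2) for v in without_rounding[:8]])
print(with_rounding[-1], round(without_rounding[-1], 3))
```

Without rounding the response decays towards zero. With rounding it gets stuck alternating between +12 and -12, because round(0.96 × 12) is again 12: the filter never settles even though the input is zero.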

60

Limit cycle example

61

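a) LSB determination must be based on simulations

[Figure: flow chart: make all signals fixed-point → simulate → output ok? (yes / no). The fixed-point filter (coefficient 0.6, delay $z^{-1}$, quantizers Q, signals x, m, y) and a reference copy are driven by the same stimuli and their outputs are compared.]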

62

b) Gradual refinement is necessary to keep the problem manageable

[Figure: flow chart: for each signal S: quantize S only → simulate → perf. ok? (yes / no) → return. The filter (coefficient 0.6, delay $z^{-1}$, quantizer Q, signals x, m, y) is driven by stimuli and its output is compared with a reference simulation.]
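One possible reading of this flow chart is a greedy loop that fixes one signal at a time. Everything in the sketch below is an assumption made for illustration: the filter y(n) = Q_y(Q_m(0.6·y(n-1)) + x(n)), the sine stimulus, the signal-to-quantization-noise ratio (SQNR) as performance measure, the 40 dB target, and the choice to keep earlier signals quantized while the next one is refined.

```python
import math

def quantize(v: float, frac_bits):
    """Round v to a step of 2**-frac_bits; None keeps the floating-point value."""
    if frac_bits is None:
        return v
    step = 2.0 ** -frac_bits
    return math.floor(v / step + 0.5) * step

def run_filter(stimuli, bits):
    """y(n) = Q_y(Q_m(0.6 * y(n-1)) + x(n)); `bits` maps signal name to L."""
    y_prev, out = 0.0, []
    for x in stimuli:
        m = quantize(0.6 * y_prev, bits.get("m"))
        y_prev = quantize(m + x, bits.get("y"))
        out.append(y_prev)
    return out

def sqnr_db(reference, test) -> float:
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    signal = sum(r * r for r in reference)
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

def refine(stimuli, signals, target_db: float, max_bits: int = 24):
    """For each signal in turn, pick the smallest L that still meets the target."""
    reference = run_filter(stimuli, {})          # floating-point reference simulation
    chosen = {}
    for name in signals:
        chosen[name] = max_bits                  # fallback if the target is never met
        for frac_bits in range(max_bits + 1):
            trial = {**chosen, name: frac_bits}
            if sqnr_db(reference, run_filter(stimuli, trial)) >= target_db:
                chosen[name] = frac_bits
                break
    return chosen

stimuli = [math.sin(0.1 * n) for n in range(200)]
result = refine(stimuli, ["m", "y"], target_db=40.0)
print(result)
```

Refining one signal per step keeps the search linear in the number of signals, at the price of a result that depends on the order in which the signals are visited.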

63

Conclusion

• Number representation
• Location of decimal point
• Precision
• Dynamic range
• Truncation, rounding
• Overflow
• Now: what to do?
• Why are we doing this?
  – Next exercise: floating point to fixed point
  – Area/time/power optimization
  – Important design optimization for JPEG project
