7. quantization noise in dsp - simon fraser · pdf file7. quantization noise in dsp ... the...


7. Quantization Noise in DSP

• Reference: Sections 6.6 to 6.9, 9.7 of Text

• In our discussions of the filtering and FFT operations, it was implicitly assumed that all the computations can be performed with infinite precision. This is quite true with a floating-point implementation.

• With fixed-point implementation,

- the filter parameters will be quantized (finite word length representation) and

- any intermediate result arising from the multiplication and/or addition of numbers will be quantized (finite precision computation).

Some effects of quantization are:

- it changes the locations of the poles and zeros of a filter,

- it creates the equivalent of an additive noise term at the filter/FFT output.

7.1 Number Representation and Quantization

• In some DSP chips or special purpose hardware, numbers/signals are represented and manipulated using fixed-point binary arithmetic.

• Examples of fixed-point binary representations are sign and magnitude, one’s complement, and two’s complement.

The two’s complement representation is the most common format.

• In the two’s complement numbering system, a real number $x$ within the range

$$-X_m \le x \le X_m$$

is quantized to the number

$$\hat{x} = X_m\left(-b_0 + \sum_{i=1}^{B} b_i 2^{-i}\right) = X_m \hat{x}_B,$$

where $b_i$, $i = 0, 1, \ldots, B$, are binary (0,1) numbers, with $b_0$ being the sign bit.

With this numbering scheme, the smallest difference between numbers is

$$\Delta = X_m 2^{-B}$$

and the fractional part of a number, i.e. the term

$$\hat{x}_B = -b_0 + \sum_{i=1}^{B} b_i 2^{-i},$$

can be represented by the binary pattern

$$\hat{x}_B \equiv b_0 \diamond b_1 b_2 \ldots b_B,$$

where $\diamond$ is the binary point.
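As a small illustration of the two’s complement formula above, the sketch below evaluates $X_m(-b_0 + \sum_i b_i 2^{-i})$ for a given bit pattern. The function name `twos_complement_value` and the choice $X_m = 1$, $B = 2$ are assumptions made only for this example.

```python
def twos_complement_value(bits, Xm=1.0):
    """Value of the pattern bits = [b_0, b_1, ..., b_B], with b_0 the sign bit."""
    b0, rest = bits[0], bits[1:]
    # fractional part: sum of b_i * 2^-i for i = 1..B
    frac = sum(b * 2.0 ** -i for i, b in enumerate(rest, start=1))
    return Xm * (-b0 + frac)

# B = 2 (3-bit words); with Xm = 1 the step size is 2**-2 = 0.25
largest = twos_complement_value([0, 1, 1])      # 0.75
most_negative = twos_complement_value([1, 0, 0])  # -1.0
just_below_zero = twos_complement_value([1, 1, 1])  # -0.25
```

Note that the largest representable value is $X_m - \Delta$, one step below $X_m$.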


The number $X_m$ must be large enough that the risk of overflow is small, and small enough that the quantization noise is kept to an acceptable level.

When $B \to \infty$, $\hat{x}$ becomes the real number $x$.

• The quantization of $x$ to $\hat{x}$ can be done through

1. rounding, or
2. truncation.

The figures below illustrate these two operations for the case of $B = 2$. Clearly the mapping from $x$ to $\hat{x}$ is nonlinear in both cases.

[Figure: quantizer characteristic for rounding]

[Figure: quantizer characteristic for truncation]

• Let $Q_B[\cdot]$ denote a $(B+1)$-bit quantizer and let the quantization error be

$$e = \hat{x} - x = Q_B[x] - x.$$

Then it can be easily deduced from the above figures that for rounding,

$$-\Delta/2 < e \le \Delta/2,$$

and for truncation,

$$-\Delta < e \le 0.$$
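The two error ranges can be checked numerically. The following is a minimal sketch, not a reference implementation: the function `quantize`, its arguments, and the saturation at the ends of the range are assumptions introduced for illustration. Rounding is done with ties going upward and truncation is toward minus infinity, which is what two’s complement truncation does.

```python
import math

def quantize(x, B, Xm=1.0, mode="round"):
    """Quantize x to a (B+1)-bit two's complement grid with step Xm * 2**-B."""
    delta = Xm * 2.0 ** -B
    if mode == "round":
        k = math.floor(x / delta + 0.5)   # ties go up: -delta/2 < e <= delta/2
    elif mode == "truncate":
        k = math.floor(x / delta)         # toward -infinity: -delta < e <= 0
    else:
        raise ValueError("mode must be 'round' or 'truncate'")
    # saturate to the representable levels -2^B ... 2^B - 1 (an assumption here)
    k = max(-2 ** B, min(2 ** B - 1, k))
    return k * delta

B = 2
delta = 2.0 ** -B
xs = [i / 1000.0 - 1.0 for i in range(1700)]          # -1.0 ... 0.699
e_round = [quantize(x, B) - x for x in xs]
e_trunc = [quantize(x, B, mode="truncate") - x for x in xs]
```

The test range stops short of $X_m$ on purpose: once the quantizer saturates, the error is no longer confined to these intervals.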


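• In studying the effect of quantization, the quantization error is usually modelled as a uniform random variable. For rounding, the probability density function (pdf) of this random variable is

$$p(e) = \begin{cases} 1/\Delta, & -\Delta/2 < e \le \Delta/2 \\ 0, & \text{otherwise,} \end{cases}$$

and for truncation, the pdf is

$$p(e) = \begin{cases} 1/\Delta, & -\Delta < e \le 0 \\ 0, & \text{otherwise.} \end{cases}$$

• The mean square quantization error for rounding is

$$\sigma_e^2 = \int_{-\Delta/2}^{\Delta/2} e^2 p(e)\, de = \frac{1}{\Delta}\int_{-\Delta/2}^{\Delta/2} e^2\, de = \frac{\Delta^2}{12},$$

and for truncation is

$$\sigma_e^2 = \int_{-\Delta}^{0} e^2 p(e)\, de = \frac{1}{\Delta}\int_{-\Delta}^{0} e^2\, de = \frac{\Delta^2}{3}.$$

Thus rounding is clearly more desirable than truncation: its mean square error is four times smaller, and the truncation error also has a non-zero mean of $-\Delta/2$.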
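The two mean square values can be estimated by simulation. This is a sketch under assumptions of our own choosing: $B = 8$, $X_m = 1$, inputs drawn uniformly from $(-0.9, 0.9)$, and a fixed random seed. It does not prove the uniform-error model; it only shows that the model is a good fit for such inputs.

```python
import math
import random

B = 8
delta = 2.0 ** -B                 # step size for Xm = 1
rng = random.Random(0)            # fixed seed so the run is repeatable
samples = [rng.uniform(-0.9, 0.9) for _ in range(100_000)]

# quantization errors e = Q[x] - x for rounding and truncation
e_round = [math.floor(x / delta + 0.5) * delta - x for x in samples]
e_trunc = [math.floor(x / delta) * delta - x for x in samples]

ms_round = sum(e * e for e in e_round) / len(e_round)   # close to delta^2 / 12
ms_trunc = sum(e * e for e in e_trunc) / len(e_trunc)   # close to delta^2 / 3
mean_trunc = sum(e_trunc) / len(e_trunc)                # close to -delta / 2
```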

• Consider the multiplication of the quantized numbers $\hat{x}$ and $\hat{a}$ during a DSP operation. The product,

$$y = \hat{x}\hat{a},$$

itself will be quantized to a $(B+1)$-bit number $\hat{y} = Q_B[y]$.

The quantized product can be written in terms of the unquantized product $\hat{x}\hat{a}$ and the quantization error $e_y$ as

$$\hat{y} = Q_B[y] = y + e_y = \hat{x}\hat{a} + e_y.$$

If the quantization error is modelled as a uniform random variable, the model for a fixed-point multiplier becomes

[Figure: a fixed-point multiplier followed by the quantizer $Q_B[\cdot]$, and its equivalent model consisting of an ideal multiplier followed by the additive noise $e_y$]

This linearized model will be adopted in our study of the effect of rounding error in IIR filters later on.

7.2 The Effects of Coefficient Quantization

• The transfer function of an IIR filter can be written as

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 - \sum_{k=1}^{N} a_k z^{-k}},$$

where

$$A(z) = 1 - \sum_{k=1}^{N} a_k z^{-k}$$

and

$$B(z) = \sum_{k=0}^{M} b_k z^{-k}.$$

The Direct form implementation of this filter is shown below. All the filter coefficients are at their original unquantized values.

We want to address in this section the issue of poles and zeros relocation when the filter coefficients are quantized.


• Consider the denominator polynomial $A(z)$. This polynomial can be written in product form as

$$A(z) = 1 - \sum_{k=1}^{N} a_k z^{-k} = \prod_{k=1}^{N}\left(1 - p_k z^{-1}\right),$$

where the $p_k$’s are the roots of $A(z)$, or equivalently the poles of $H(z)$. It is understood that the $p_k$’s are nonlinear functions of the $a_k$’s. For example when $N = 2$, then $a_1 = p_1 + p_2$ and $a_2 = -p_1 p_2$.

So how would the $p_k$’s be affected when the $a_k$’s are quantized?

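• Let us take the partial derivative of $A(z)$ with respect to $a_i$. From the summation form, we obtain

$$\frac{\partial A(z)}{\partial a_i} = -z^{-i}. \qquad (1)$$

From the product form we obtain

$$\frac{\partial A(z)}{\partial a_i} = \sum_{n=1}^{N}\left(-\frac{\partial p_n}{\partial a_i} z^{-1}\right)\prod_{\substack{k=1 \\ k \ne n}}^{N}\left(1 - p_k z^{-1}\right). \qquad (2)$$

Suppose we want to determine $\partial p_j/\partial a_i$. What we can do is to evaluate both (1) and (2) at $z = p_j$ and equate them. Only the $n = j$ term of (2) survives, since every other term contains the factor $(1 - p_j z^{-1})$, which vanishes at $z = p_j$. The end result is

$$\frac{\partial p_j}{\partial a_i} = \frac{p_j^{N-i}}{\prod_{\substack{k=1 \\ k \ne j}}^{N}\left(p_j - p_k\right)}.$$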

• Example: When $N = 3$, $A(z)$ can be written as

$$A(z) = 1 - a_1 z^{-1} - a_2 z^{-2} - a_3 z^{-3}$$

or

$$A(z) = \left(1 - p_1 z^{-1}\right)\left(1 - p_2 z^{-1}\right)\left(1 - p_3 z^{-1}\right).$$

The partial derivative of the first expression with respect to $a_1$ yields

$$\frac{\partial A(z)}{\partial a_1} = -z^{-1}.$$

On the other hand, the partial derivative of the second expression with respect to $a_1$ yields

$$\begin{aligned} \frac{\partial A(z)}{\partial a_1} = {} & -\frac{\partial p_1}{\partial a_1} z^{-1}\left(1 - p_2 z^{-1}\right)\left(1 - p_3 z^{-1}\right) - \frac{\partial p_2}{\partial a_1} z^{-1}\left(1 - p_1 z^{-1}\right)\left(1 - p_3 z^{-1}\right) \\ & - \frac{\partial p_3}{\partial a_1} z^{-1}\left(1 - p_1 z^{-1}\right)\left(1 - p_2 z^{-1}\right). \end{aligned}$$

If we evaluate the two expressions at $z = p_3$ and equate them, we obtain

$$-p_3^{-1} = -\frac{\partial p_3}{\partial a_1}\, p_3^{-1}\left(1 - p_1 p_3^{-1}\right)\left(1 - p_2 p_3^{-1}\right),$$

which can be simplified to

$$\frac{\partial p_3}{\partial a_1} = \frac{p_3^2}{(p_3 - p_1)(p_3 - p_2)}.$$
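The closed-form sensitivity for the $N = 3$ example can be compared against a finite-difference estimate. This is only a sketch: the pole values 0.5, −0.3 and 0.8, the step `h`, and the helper names are assumptions chosen for illustration. The poles are obtained as roots of $z^3 - a_1 z^2 - a_2 z - a_3$ using `numpy.roots`.

```python
import numpy as np

def poles_from_coeffs(a):
    """Poles of 1/A(z): roots of z^N - a_1 z^(N-1) - ... - a_N."""
    return np.roots(np.concatenate(([1.0], -np.asarray(a, dtype=float))))

def sensitivity(poles, j, i):
    """dp_j/da_i = p_j^(N-i) / prod_{k != j} (p_j - p_k); j is 0-based, i is 1-based."""
    N = len(poles)
    denom = np.prod([poles[j] - poles[k] for k in range(N) if k != j])
    return poles[j] ** (N - i) / denom

p = np.array([0.5, -0.3, 0.8])        # hypothetical poles p_1, p_2, p_3
a = -np.poly(p)[1:]                   # a_1, a_2, a_3 from the product form

formula = sensitivity(p, j=2, i=1)    # dp_3/da_1 = p_3^2 / ((p_3-p_1)(p_3-p_2))

# finite-difference estimate: perturb a_1 slightly and track the pole near p_3
h = 1e-6
a_pert = a.copy()
a_pert[0] += h
new_poles = poles_from_coeffs(a_pert)
p3_new = new_poles[np.argmin(np.abs(new_poles - p[2]))]
finite_diff = float(np.real(p3_new - p[2])) / h
```

The two numbers agree to several digits, which is what one expects from a first-order derivative evaluated with a small perturbation.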

• Let $\Delta a_k$, $k = 1, 2, \ldots, N$, be the quantization errors in the $a_k$’s when the IIR filter is implemented using the Direct form computational structure with fixed-point arithmetic. Then, to first order, the poles will be shifted by the amounts

$$\Delta p_j = \sum_{i=1}^{N}\frac{\partial p_j}{\partial a_i}\,\Delta a_i; \quad j = 1, 2, \ldots, N.$$

Because of the term

$$\prod_{\substack{k=1 \\ k \ne j}}^{N}\left(p_j - p_k\right)$$

in the denominator of $\partial p_j/\partial a_i$, we can deduce that if the poles are clustered together, i.e. when

$$|p_j - p_k| \ll 1,$$

there could be big changes to the poles’ locations. Consequently, the direct form implementation structure is quite sensitive to quantization errors in the filter coefficients.
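To get a feel for how strongly clustering matters, the sketch below evaluates the sensitivity formula for two hypothetical third-order pole sets, one spread out and one tightly clustered. The pole values and the function name `pole_sensitivity` are assumptions made for this example only.

```python
def pole_sensitivity(poles, j, i):
    """dp_j/da_i = p_j^(N-i) / prod_{k != j} (p_j - p_k); j is 0-based, i is 1-based."""
    N = len(poles)
    denom = 1.0
    for k, pk in enumerate(poles):
        if k != j:
            denom *= poles[j] - pk
    return poles[j] ** (N - i) / denom

spread = [0.2, 0.5, 0.9]          # well separated poles
clustered = [0.90, 0.92, 0.94]    # poles within 0.04 of each other

# sensitivity of the last pole to a_1 in each case
s_spread = abs(pole_sensitivity(spread, 2, 1))        # roughly 2.9
s_clustered = abs(pole_sensitivity(clustered, 2, 1))  # roughly 1100
```

With these values the same coefficient error moves the clustered pole several hundred times further than the well separated one.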

• As shown in Section 6.4, the cascade form of an IIR filter is made up of a serial concatenation of second order direct form subsystems. The poles (zeros) of all the subsystems together form the poles (zeros) of the IIR filter.

Quantization of the filter coefficients in a subsystem will only affect the two poles (and 2 zeros) of that subsystem. In other words, quantization errors are localized.

Thus the cascade form is generally much less sensitive to coefficient quantization than the direct form.

• Let us focus on a 2nd order IIR filter of the form

$$H(z) = \frac{1}{\left(1 - p_1 z^{-1}\right)\left(1 - p_2 z^{-1}\right)},$$

where

$$p_1 = r e^{j\theta}$$

and

$$p_2 = r e^{-j\theta}$$

are the conjugate pole-pair of the filter. The two poles are located on a circle with radius $r$, with one of them at a phase angle of $\theta$ and the other at an angle of $-\theta$.

The denominator polynomial can be written as

$$\begin{aligned} A(z) &= \left(1 - p_1 z^{-1}\right)\left(1 - p_2 z^{-1}\right) \\ &= 1 - (p_1 + p_2) z^{-1} + p_1 p_2 z^{-2} \\ &= 1 - 2r\cos\theta\, z^{-1} + r^2 z^{-2}. \end{aligned}$$

Consequently the coefficients in the feedback portion of the IIR filter are

$$a_1 = 2r\cos\theta$$

and

$$a_2 = -r^2.$$

The implementation structure of this filter is as shown below.

Suppose we assume 4-bit quantization of $a_1$ and $a_2$ in the interval $[-1, +1]$. Then $-r^2$ and $2r\cos\theta$ are numbers from the set:

$$-1, -\tfrac{7}{8}, -\tfrac{3}{4}, -\tfrac{5}{8}, -\tfrac{1}{2}, -\tfrac{3}{8}, -\tfrac{1}{4}, -\tfrac{1}{8}, 0, \tfrac{1}{8}, \tfrac{1}{4}, \tfrac{3}{8}, \tfrac{1}{2}, \tfrac{5}{8}, \tfrac{3}{4}, \tfrac{7}{8}.$$

This means that after quantization, the poles can only take on values from the set shown in the accompanying diagram (only the first quadrant of the z-plane is shown).

From the diagram, we can deduce that poles that are originally around $\theta = 0$ or $\theta = \pi$ are more affected by quantization than those around $\theta = \pi/2$.
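The set of realizable pole positions can be enumerated directly. The sketch below assumes the 16 levels $-1, -7/8, \ldots, 7/8$ for both $a_1$ and $a_2$ and keeps only the pairs that give complex-conjugate poles, i.e. those with $a_1^2 + 4a_2 < 0$ for the polynomial $z^2 - a_1 z - a_2$. The function name and the sector width of 0.2 rad are assumptions for illustration.

```python
import math

def realizable_poles(bits=4):
    """Upper half-plane poles (r, theta) reachable with quantized a_1, a_2."""
    step = 2.0 ** -(bits - 1)                       # 1/8 for 4 bits
    levels = [-1.0 + i * step for i in range(2 ** bits)]
    poles = []
    for a1 in levels:
        for a2 in levels:
            if a1 * a1 + 4.0 * a2 < 0.0:            # complex-conjugate pair
                r = math.sqrt(-a2)                  # a_2 = -r^2
                theta = math.acos(a1 / (2.0 * r))   # a_1 = 2 r cos(theta)
                poles.append((r, theta))
    return poles

poles = realizable_poles()
near_zero = [p for p in poles if p[1] < 0.2]
near_half_pi = [p for p in poles if abs(p[1] - math.pi / 2) < 0.2]
```

Comparing `len(near_zero)` with `len(near_half_pi)` shows that the grid is far denser around $\theta = \pi/2$ than around $\theta = 0$, in line with the observation above.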


• Exercise: Determine $\dfrac{\partial p_1}{\partial a_1}$, $\dfrac{\partial p_1}{\partial a_2}$, $\dfrac{\partial p_2}{\partial a_1}$, and $\dfrac{\partial p_2}{\partial a_2}$ for the above second order IIR filter. Assume the original poles are on a circle of radius $r = 0.96$ with phases of $\theta = \pm\pi/36$. Where are the new poles located after 4-bit quantization?


7.3 Zero Input Limit Cycle

• For a stable digital filter, if the input is set to zero at some point in time, the output should decay to zero.

• For finite precision implementations, the output may decay to a non-zero amplitude and then oscillate. This phenomenon is known as the zero-input limit cycle; see for example the following results for a first order IIR filter with 4-bit quantization between $-1$ and $+1$.

• Zero-input limit cycles can be very annoying for applications such as speech, as tones will be generated during the silence periods.

• Zero-input limit cycles are unique to IIR filters because of the feedback mechanism in these filters. There is no need to worry about this phenomenon for FIR filters.

• As an example, consider a first order IIR filter

$$y[n] = a\,y[n-1] + x[n]; \quad |a| < 1.$$

With $(B+1)$-bit fixed-point implementation, the output becomes

$$\hat{y}[n] = Q_B\big[a\,\hat{y}[n-1]\big] + x[n]; \quad |a| < 1.$$

Here, we assume rounding with a quantization step size of $\Delta$.

Suppose the input becomes zero when $n \ge n_o$. Then the unquantized output at time $n_o$ is $a\,\hat{y}[n_o - 1]$. Assuming that both $\hat{y}[n_o - 1]$ and the filter parameter $a$ are positive, then $a\,\hat{y}[n_o - 1]$ will be quantized back to $\hat{y}[n_o - 1]$ if

$$a\,\hat{y}[n_o - 1] - \hat{y}[n_o - 1] \ge -\Delta/2,$$

or when

$$\hat{y}[n_o - 1] \le \frac{\Delta}{2(1 - a)}; \quad (\hat{y}[n_o - 1] > 0,\ a > 0).$$

In general, it can be shown that for a 1st order IIR filter, the zero-input limit cycle exists when

$$\big|\hat{y}[n_o - 1]\big| \le \frac{\Delta}{2\left(1 - |a|\right)}.$$

This is known as the dead band of the first order IIR filter.
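The dead band can be seen in a short simulation. The sketch below assumes a 4-bit word between $-1$ and $+1$ (so $\Delta = 1/8$), a hypothetical coefficient $a = \pm 4/5$, and an initial state of $7/8$; exact rational arithmetic from the standard library is used so that rounding ties are handled without floating-point ambiguity. The helper names are our own.

```python
from fractions import Fraction
import math

def round_to_grid(v, delta):
    """Round v to the nearest multiple of delta (ties go up)."""
    return math.floor(v / delta + Fraction(1, 2)) * delta

def zero_input_response(a, y0, delta, n_steps=20):
    """Iterate y[n] = Q[a * y[n-1]] with zero input, starting from y0."""
    y = [y0]
    for _ in range(n_steps):
        y.append(round_to_grid(a * y[-1], delta))
    return y

delta = Fraction(1, 8)                       # 4-bit quantization in [-1, 1)
a = Fraction(4, 5)
dead_band = delta / (2 * (1 - abs(a)))       # = 5/16

trace_pos = zero_input_response(a, Fraction(7, 8), delta)    # sticks at 1/4
trace_neg = zero_input_response(-a, Fraction(7, 8), delta)   # alternates +-1/4
```

For the positive coefficient the output stops decaying once it falls inside the dead band and stays at a constant non-zero value; for the negative coefficient it keeps alternating in sign with the same amplitude.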

• Limit cycles can also be caused by overflow. In this case, the oscillation is between large limits.

• Limit cycles can be eliminated by adopting computation structures that do not support them. But these structures are usually more computationally intensive than the cascade form we discussed.

More bits in the quantization process will reduce the chance of limit cycles.

7.4 Effects of Round-off Noise in IIR Filters

• In Section 7.1, we showed that a fixed-point multiplier can be replaced by an unconstrained (real) multiplier in concatenation with an additive, uniform noise source that models the round-off error.

PDF of round-off noise in a $(B+1)$-bit quantizer


• We want to analyze in this section the effect of the round-off noise on the output of an IIR filter. With infinite precision implementation, the input-output relationship of such a filter is given by

$$y[n] = \sum_{k=1}^{N} a_k\, y[n-k] + \sum_{k=0}^{M} b_k\, x[n-k].$$

With rounding, this becomes

$$\hat{y}[n] = \sum_{k=1}^{N} Q_B\big[a_k\, \hat{y}[n-k]\big] + \sum_{k=0}^{M} Q_B\big[b_k\, x[n-k]\big],$$

where we assume that the input and the filter coefficients are already quantized.

• The figures below show the additive noise model for the fixed-point implementation of a 2nd-order IIR filter using the Direct Form I structure.

(a) Direct Form I (floating point) implementation of a 2nd order IIR filter

(b) Fixed point implementation of a 2nd order IIR filter using the Direct Form I structure

(c) Additive noise model for the fixed-point 2nd order IIR filter in (b).

The terms $e_0[n]$, $e_1[n]$, $e_2[n]$, $e_3[n]$, and $e_4[n]$ in Diagram (c) represent the round-off noises in the five fixed-point multipliers. Each of these noise terms has zero mean and a variance of

$$\sigma^2 = \frac{\Delta^2}{12} = \frac{2^{-2B} X_m^2}{12}.$$

Furthermore, we assume that each $e_i[n]$ is white (i.e. $E\big[e_i[n]\, e_i[n+m]\big] = 0$ for $m \ne 0$) and that the noise terms are statistically independent of one another.

• Since all the nodes in Diagram (c) represent adders, and since the output of the first stage is the input to the second stage, we can lump all the noise sources together into a single noise source

$$e[n] = e_0[n] + e_1[n] + e_2[n] + e_3[n] + e_4[n]$$

and place it at the input to the second stage; see diagram below.

Like the individual $e_i[n]$’s, the combined noise term $e[n]$ has zero mean. Its variance, however, is $\sigma_e^2 = 5\sigma^2$.

Since each $e_i[n]$ is white, $e[n]$ is also white. Consequently, the power spectral density of $e[n]$ is $\Phi_{ee}(e^{j\omega}) = 5\sigma^2$.


The output $\hat{y}[n]$ of the linearized model has two components, the desired (i.e. real value) output $y[n]$ and the noise term $f[n]$. The output noise term is obtained by disabling the input line and feeding the combined noise term to the feedback filter; see diagram below.

Since the frequency response of the feedback filter is

$$H_{ef}(e^{j\omega}) = \frac{1}{1 - a_1 e^{-j\omega} - a_2 e^{-j2\omega}},$$

the PSD of the output noise is

$$\Phi_{ff}(e^{j\omega}) = \Phi_{ee}(e^{j\omega})\left|H_{ef}(e^{j\omega})\right|^2 = \frac{5\sigma^2}{\left|1 - a_1 e^{-j\omega} - a_2 e^{-j2\omega}\right|^2}.$$

• In general, for an IIR filter with $N$ feedback taps and $M + 1$ feedforward taps, the variance of the combined noise term $e[n]$ in a Direct Form I fixed-point implementation is

$$\sigma_e^2 = (N + M + 1)\,\sigma^2 = (N + M + 1)\,\frac{2^{-2B} X_m^2}{12},$$

and its power spectral density is

$$\Phi_{ee}(e^{j\omega}) = \sigma_e^2 = (N + M + 1)\,\frac{2^{-2B} X_m^2}{12}.$$

Since this combined noise term is injected into the input of the feedback filter, the PSD of the output quantization noise, $f[n]$, is

$$\begin{aligned} \Phi_{ff}(e^{j\omega}) &= \Phi_{ee}(e^{j\omega})\left|H_{ef}(e^{j\omega})\right|^2 \\ &= (N + M + 1)\,\sigma^2\,\frac{1}{\left|1 - \sum_{k=1}^{N} a_k e^{-j\omega k}\right|^2} \\ &= (N + M + 1)\,\frac{2^{-2B} X_m^2}{12}\,\frac{1}{\left|1 - \sum_{k=1}^{N} a_k e^{-j\omega k}\right|^2}, \end{aligned}$$

and the output noise power is

$$\sigma_f^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi}\Phi_{ff}(e^{j\omega})\, d\omega = (N + M + 1)\,\frac{2^{-2B} X_m^2}{12}\,\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{d\omega}{\left|1 - \sum_{k=1}^{N} a_k e^{-j\omega k}\right|^2}.$$
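The output noise power integral is easy to evaluate numerically. The sketch below is one way to do it, under assumptions of our own: a 2nd-order Direct Form I filter with poles at $r = 0.9$, $\theta = \pi/4$, five multipliers, $B = 15$ and $X_m = 1$. The integral is approximated by averaging over a uniform frequency grid, and the result is cross-checked against the sum of squares of the impulse response of the feedback filter, which must give the same noise gain by Parseval's relation. The function names are hypothetical.

```python
import numpy as np

def noise_gain_freq(a, n_freq=4096):
    """(1/2pi) * integral of 1/|1 - sum_k a_k e^{-jwk}|^2 over one period."""
    w = 2.0 * np.pi * np.arange(n_freq) / n_freq
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(a, dtype=float)
    return float(np.mean(1.0 / np.abs(A) ** 2))

def noise_gain_time(a, n_samples=2000):
    """Sum of h[n]^2, where h[n] is the impulse response of 1/A(z)."""
    h = np.zeros(n_samples)
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * h[n - k]
        h[n] = acc
    return float(np.sum(h ** 2))

r, theta = 0.9, np.pi / 4
a = [2.0 * r * np.cos(theta), -r ** 2]        # a_1 = 2 r cos(theta), a_2 = -r^2

B, Xm, n_multipliers = 15, 1.0, 5             # N + M + 1 = 5 for this filter
sigma2 = (Xm * 2.0 ** -B) ** 2 / 12.0
gain_f = noise_gain_freq(a)
gain_t = noise_gain_time(a)
output_noise_power = n_multipliers * sigma2 * gain_f
```

Moving the poles closer to the unit circle increases the noise gain, so the same word length produces more output noise.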

• While the Cascade Form is preferred over the Direct Forms in terms of poles/zeros relocation caused by quantization, there are NO similar rules for output quantization error. A lot actually depends on the coefficients of the feedforward and feedback filters.


The diagrams below show the noise model for a fixed-point implementation of a 2nd order IIR filter using the Direct Form II structure. It is clear that round-off error affects Direct Forms I and II differently.

(a) Individual quantization noise sources in 2nd order Direct Form II IIR filter

(b) Combined quantization noise sources in 2nd order Direct Form II IIR filter


7.5 Quantization Noise in FFT

• The DFT coefficients of an N-sample long signal $x[n]$ are:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}; \quad k = 0, 1, \ldots, N-1,$$

where

$$W_N = e^{-j2\pi/N}.$$

The DFT coefficients are actually uniformly spaced samples of $X(e^{j\omega})$, the spectrum of $x[n]$, over $0 \le \omega < 2\pi$.

• The DFT coefficients can be computed efficiently using an FFT algorithm. The signal flow graph for the decimation-in-time FFT algorithm, with $N = 8$, is shown below.


• For $N = 8$, the computation of each DFT coefficient involves 7 butterfly computational structures of the form shown below,

where $X_m[p]$, $X_m[q]$ represent the p-th and q-th intermediate output coefficients of the m-th stage. For example in the case of $X[0]$ we have

and for $X[2]$ we have

• In general, the computation of a DFT coefficient always involves $N - 1$ butterfly structures.

• With fixed-point implementation, the complex multiplication of $X_{m-1}[q]$ by $W_N^r$ in each butterfly introduces a complex quantization noise term $e[m, q]$; see diagram below.


• The expected value of $|e[m, q]|^2$ represents the average noise power due to quantization and it can be determined as follows. Let

$$c_1 = a_1 + j b_1$$

and

$$c_2 = a_2 + j b_2$$

be two complex numbers with real components $a_1$, $a_2$ and imaginary components $b_1$, $b_2$. Their product, with infinite precision implementation, is

$$c = (a_1 a_2 - b_1 b_2) + j(a_1 b_2 + a_2 b_1).$$

With fixed-point implementation though, the product of $c_1$ and $c_2$ becomes

$$\hat{c} = \{Q_B(a_1 a_2) - Q_B(b_1 b_2)\} + j\{Q_B(a_1 b_2) + Q_B(a_2 b_1)\}.$$

The quantization error is thus

$$\begin{aligned} \hat{c} - c &= \{[Q_B(a_1 a_2) - a_1 a_2] - [Q_B(b_1 b_2) - b_1 b_2]\} + j\{[Q_B(a_1 b_2) - a_1 b_2] + [Q_B(a_2 b_1) - a_2 b_1]\} \\ &= (e_1 - e_2) + j(e_3 + e_4), \end{aligned}$$

where

$$\begin{aligned} e_1 &= Q_B(a_1 a_2) - a_1 a_2, \\ e_2 &= Q_B(b_1 b_2) - b_1 b_2, \\ e_3 &= Q_B(a_1 b_2) - a_1 b_2, \\ e_4 &= Q_B(a_2 b_1) - a_2 b_1 \end{aligned}$$

are the individual real-valued quantization errors. It is assumed that these individual error terms are statistically independent. Consequently, since the rounding errors have zero mean, the cross terms vanish and the mean magnitude square of the complex quantization error is

$$\begin{aligned} E\big[|\hat{c} - c|^2\big] &= E\big[(e_1 - e_2)^2 + (e_3 + e_4)^2\big] \\ &= E\big[e_1^2 + e_2^2 - 2e_1 e_2 + e_3^2 + e_4^2 + 2e_3 e_4\big] \\ &= E\big[e_1^2\big] + E\big[e_2^2\big] + E\big[e_3^2\big] + E\big[e_4^2\big] \\ &= 4\sigma^2, \end{aligned}$$

where

$$\sigma^2 = \Delta^2/12$$

is the mean-square quantization error of a $(B+1)$-bit real multiplier with a step size of $\Delta$. From this analysis, we can conclude that the mean magnitude square of $e[m, q]$ is

$$E\big[|e[m, q]|^2\big] = 4\,\frac{\Delta^2}{12} = \frac{\Delta^2}{3} = \sigma_B^2.$$
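The value $\Delta^2/3$ for the complex multiplier can be checked by simulation. The sketch below makes several simplifying assumptions: the four real products are rounded individually with $B = 8$ and $X_m = 1$, the operands are drawn uniformly from $(-1, 1)$ without being quantized first, and a fixed seed is used. It only illustrates that the four-error model predicts the measured noise power well.

```python
import math
import random

B = 8
delta = 2.0 ** -B

def q(v):
    """Round v to the nearest multiple of delta (ties go up)."""
    return math.floor(v / delta + 0.5) * delta

rng = random.Random(1)            # fixed seed so the run is repeatable
n_trials = 50_000
total = 0.0
for _ in range(n_trials):
    a1, b1, a2, b2 = (rng.uniform(-1.0, 1.0) for _ in range(4))
    exact = complex(a1, b1) * complex(a2, b2)
    # each of the four real products is rounded separately
    fixed = complex(q(a1 * a2) - q(b1 * b2), q(a1 * b2) + q(a2 * b1))
    total += abs(fixed - exact) ** 2

noise_power = total / n_trials    # measured E|c_hat - c|^2
predicted = delta ** 2 / 3        # 4 * delta^2 / 12
```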

• It should be pointed out that when the term $W_N^r$ in the butterfly is either $+1$, $-1$, $+j$, or $-j$, then there is actually no quantization error.

For simplicity in our analysis though, we assume that each butterfly introduces a complex quantization noise term whose average power is $\sigma_B^2$.

• From the signal flow graphs associated with the computation of $X[0]$ and $X[2]$ in the $N = 8$ case, we see that a quantization noise term introduced in an intermediate butterfly will be multiplied by a sequence of complex exponential terms of the form $W_N^r$. Since $|W_N^r| = 1$, this operation will not change the average power of that noise source.

Consequently, an intermediate noise source can be replaced by an equivalent noise source with the same power at the output node.

So for the $N = 8$ example, there will be altogether 7 equivalent noise sources, all with a power of $\sigma_B^2$, being added to the output node.

It is reasonable to assume that all the 7 equivalent noise sources are statistically independent. Consequently, they together form a single combined noise source with a power of $7\sigma_B^2$.

• In general, quantization in an N-point FFT algorithm produces the equivalent of a noise term of power $(N - 1)\sigma_B^2$ in each of the DFT coefficients. In other words, we can express the quantized DFT coefficient $\hat{X}[k]$ as

$$\hat{X}[k] = X[k] + F[k],$$

where $F[k]$ is the effective quantization noise and

$$E\big[|F[k]|^2\big] = (N - 1)\,\sigma_B^2 \approx N\sigma_B^2.$$


• Recall that

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}; \quad k = 0, 1, \ldots, N-1.$$

Suppose the signal range for our fixed-point implementation is between $-1$ and $+1$. So to avoid overflow, we need

$$|X[k]| < 1.$$

This can be achieved by scaling the time-domain signal $x[n]$ in such a way that

$$|x[n]| < \frac{1}{N}.$$

This stems from the fact that

$$|X[k]| = \left|\sum_{n=0}^{N-1} x[n]\, W_N^{kn}\right| \le \sum_{n=0}^{N-1} |x[n]|.$$

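• Assume that each $x[n]$ is a zero-mean uniform random variable in $[-1/N, 1/N]$, so that its magnitude is uniform in $[0, 1/N]$. Then

$$E\big[x^2[n]\big] = N\int_0^{1/N} y^2\, dy = \frac{1}{3N^2}.$$

Furthermore, if we assume that all the $x[n]$’s are statistically independent, then the cross terms below vanish and

$$\begin{aligned} E\big[|X[k]|^2\big] &= E\left[\left|\sum_{n=0}^{N-1} x[n]\, W_N^{kn}\right|^2\right] \\ &= E\left[\sum_{n=0}^{N-1} x[n]\, W_N^{kn} \sum_{m=0}^{N-1} x^*[m]\, W_N^{-km}\right] \\ &= \sum_{n=0}^{N-1} E\big[|x[n]|^2\big] + \sum_{n=0}^{N-1}\sum_{\substack{m=0 \\ m \ne n}}^{N-1} E\big[x[n]\, x^*[m]\big]\, W_N^{kn} W_N^{-km} \\ &= N \cdot \frac{1}{3N^2} \\ &= \frac{1}{3N}. \end{aligned}$$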
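The value $1/(3N)$ can be checked by simulation. The sketch below assumes independent, zero-mean samples drawn uniformly from $[-1/N, 1/N]$ (zero mean is what makes the cross terms vanish), with $N = 8$, 20000 trials, and a fixed seed; `numpy.fft.fft` is used to compute the DFT.

```python
import numpy as np

N = 8
n_trials = 20_000
rng = np.random.default_rng(0)            # fixed seed so the run is repeatable

# each row is one realization of x[0], ..., x[N-1], scaled so |x[n]| <= 1/N
x = rng.uniform(-1.0 / N, 1.0 / N, size=(n_trials, N))
X = np.fft.fft(x, axis=1)

mean_power = np.mean(np.abs(X) ** 2, axis=0)   # estimate of E|X[k]|^2 per bin
predicted = 1.0 / (3.0 * N)
largest_magnitude = float(np.max(np.abs(X)))   # never reaches 1 with this scaling
```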

• The signal-to-noise ratio (SNR) in $\hat{X}[k]$ is thus

$$\gamma = \frac{E\big[|X[k]|^2\big]}{E\big[|F[k]|^2\big]} = \frac{1/(3N)}{N\sigma_B^2} = \frac{1}{3N^2\sigma_B^2} = \frac{1}{N^2\Delta^2} = \frac{2^{2B}}{N^2},$$

where $\Delta = 2^{-B}$ is the quantization step size of a $(B+1)$-bit quantizer when the signal range is between $-1$ and $+1$.

The result indicates that if $N$ is doubled (equivalent to adding 1 more FFT stage), then we must use 1 more bit in the quantization process in order to maintain the same SNR.
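The one-bit-per-stage rule follows directly from $\gamma = 2^{2B}/N^2$, as the short sketch below illustrates. The function names and the example values $B = 12$, $N = 1024$ are assumptions chosen only for the illustration.

```python
import math

def fft_snr(B, N):
    """SNR of an N-point fixed-point FFT with (B+1)-bit words: 2^(2B) / N^2."""
    return 2.0 ** (2 * B) / N ** 2

def fft_snr_db(B, N):
    return 10.0 * math.log10(fft_snr(B, N))

base = fft_snr(12, 1024)            # 2^24 / 2^20 = 16
doubled_n = fft_snr(12, 2048)       # doubling N costs a factor of 4 (about 6 dB)
one_more_bit = fft_snr(13, 2048)    # one extra bit restores the original SNR
```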

• It is possible to reduce the above requirement of 1 extra bit per extra stage to ½ bit per stage if the butterfly structure below is adopted instead.

Here, the signal $x[n]$ (i.e. the input to the FFT) is scaled in such a way that

$$|x[n]| < 1$$

(instead of $|x[n]| < 1/N$). To avoid overflow, the input to any intermediate butterfly is first scaled by ½ before being multiplied and added. The effective scaling factor for $x[n]$ is thus $(1/2)^v = 1/N$, where $v = \log_2 N$ is the total number of FFT stages.

A noise source in the m-th stage, i.e. $e[m, p]$ or $e[m, q]$, will be scaled in magnitude by the factor $2^{-(v - m - 1)}$. This is in contrast to the original butterfly structure, where all noise sources introduced by quantization have a scaling factor of 1 in magnitude.

This exponential weighting function on noise sources further back in the overall computational structure leads to the improvement in SNR.
