
Exploiting Signal and Noise Statistics for Fixed-Point FFT Design Optimization in OFDM Systems

Sthanunathan Ramakrishnan, Jaiganesh Balakrishnan and Karthik Ramasubramanian
email: {sthanu,jai,karthikr}@ti.com

Abstract: Scaling the outputs of the different stages of an FFT appropriately is crucial for achieving a good signal-to-quantization-noise ratio (SQNR) in fixed-point FFT design. Traditional designs have handled this either through the convergent block floating point (CBFP) technique, which incurs memory, computation and latency penalties, or through time-consuming simulations. In this paper, we consider the special case of FFT design for OFDM transceivers. We exploit the Gaussian nature of OFDM signals to predict the bit growth of the signal through the various stages of the FFT and propose a technique to scale the signal appropriately. Additionally, we investigate the quantization error profile and propose a technique to improve SQNR by exploiting the presence of null tones at the band edges. With the proposed techniques, the performance comes close to that of the CBFP design, with no increase in complexity compared to existing static designs. Simulation results illustrating the performance improvements of the proposed techniques are presented.

I. INTRODUCTION

Fast Fourier transforms (FFT) have been widely used in various fields for computing the discrete Fourier transform (DFT) of a signal. A lot of work has gone into designing efficient architectures such as radix-2, radix-$2^2$ [1], radix-$2^3$ [2] and the split-radix algorithm. Fixed-point FFT designs have largely used the complex but optimal convergent block floating point (CBFP) method [3] or its variants [4]. This method essentially scales the output of every stage appropriately to maximally utilise the dynamic range and hence gets the best signal-to-quantization-noise ratio (SQNR) for a given number of bits. The disadvantage of this method is its higher complexity and latency. Designers often use simulations to decide the scaling factors instead, at the expense of increased design time.

Orthogonal frequency division multiplexing (OFDM) is now used in a number of wireless systems such as WLAN, WiMAX, DVB, DAB and UWB, and has also been proposed for future cellular technologies such as LTE. In OFDM, the data is sent on orthogonal tones; an IFFT at the transmitter multiplexes the data and an FFT at the receiver demultiplexes it. Many FFT designs for OFDM systems have also used the CBFP technique, as reported in [5], [3].

In this paper we exploit the statistical properties of OFDM signals and use them to derive a near-optimal scaling scheme. We also discuss the effects of truncation and rounding and show ways of increasing SQNR even with the low-complexity truncation operation. In section II we present a short introduction to OFDM. Section III discusses the traditional techniques for fixed-point design and points out their advantages and disadvantages. In section IV, we derive the statistics of the OFDM signal as it passes through the different stages of the FFT and derive optimal and near-optimal scaling schemes. In section V, we obtain the nature of the quantization noise affecting the FFT outputs and exploit it to get higher SQNR. Finally, simulation results are presented in section VI to quantify the gains of the proposed technique.

II. ORTHOGONAL FREQUENCY DIVISION MULTIPLEXING

Orthogonal frequency division multiplexing is a method in which data tones are modulated on orthogonal subcarriers and transmitted through a channel. The time domain signal x(n) for one symbol is generated from the frequency domain tones X(k) using the IFFT relationship

$$x(n) = \sum_{k=0}^{N-1} X(k) \exp\left(\frac{j 2 \pi k n}{N}\right), \qquad n = 0, \ldots, N-1$$

At the receiver, the transmitted tones, each multiplied by the channel coefficient at its frequency, are recovered using an FFT on the time domain signal.

$$X(k) = \sum_{n=0}^{N-1} x(n) \exp\left(-\frac{j 2 \pi k n}{N}\right), \qquad k = 0, \ldots, N-1$$

For reasonably large N, the time domain samples x(n) are Gaussian distributed due to the central limit theorem. This implies that the peak-to-average ratio (PAR) of OFDM signals is high. OFDM has become hugely popular due to its inherent ability to combat multipath and to scale to higher data rates. Higher data rates typically imply higher-order constellations for a given channel bandwidth and hence a higher SNR requirement from the system. This makes the problem of fixed-point FFT design more challenging, as a higher SQNR is needed to realise these systems while keeping the implementation complexity and latency low. Design cycle times also need to be kept low to ensure that systems can be productized quickly to keep pace with market dynamics.
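The IFFT/FFT pair above can be illustrated with a short sketch. It uses a direct O(N²) DFT from the Python standard library rather than a fast algorithm, and the QPSK tones, the symbol size of 64 and the fixed random seed are illustrative assumptions, not values from the paper.

```python
import cmath
import math
import random


def idft(tones):
    """x(n) = sum_k X(k) exp(+j 2 pi k n / N), unnormalized as in the text."""
    n_pts = len(tones)
    return [
        sum(tones[k] * cmath.exp(2j * math.pi * k * n / n_pts) for k in range(n_pts))
        for n in range(n_pts)
    ]


def dft(samples):
    """X(k) = sum_n x(n) exp(-j 2 pi k n / N)."""
    n_pts = len(samples)
    return [
        sum(samples[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def peak_to_average_db(samples):
    """Peak sample power divided by mean sample power, in dB."""
    powers = [abs(s) ** 2 for s in samples]
    return 10 * math.log10(max(powers) / (sum(powers) / len(powers)))


rng = random.Random(0)  # fixed seed, for repeatability only
N = 64                  # assumed symbol size
qpsk = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(N)]

x = idft(qpsk)                        # transmitter: tones -> time samples
recovered = [v / N for v in dft(x)]   # receiver: the unnormalized pair has gain N

max_err = max(abs(a - b) for a, b in zip(recovered, qpsk))
par_db = peak_to_average_db(x)
print(f"round-trip error: {max_err:.2e}, PAR of this symbol: {par_db:.1f} dB")
```

Every tone here has the same magnitude, yet the time domain samples are sums of many independently rotated terms, which is why their peak power can sit well above the mean.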

III. TRADITIONAL FFT DESIGN

Many architectures such as radix-2, radix-4, radix-$2^2$ or radix-$2^3$, or a mix thereof, can be used for FFT design in OFDM systems. An example signal-flow graph representation for an 8-point FFT with a radix-2 implementation is shown in figure 1. Pipelined architectures are preferred for their low latency, high processing element utilization and low memory requirements. A radix-r pipelined FFT stage using single-path delay feedback and operating on an N-point FFT needs N/r memory elements, r adders and one multiplier, and generates r independent streams of length N/r, which are then passed on to the next stage (an N/r-point FFT) for further processing [1].

Fig. 1. Signal flow graph of a radix-2 8 point FFT

In fixed-point design, the usual technique to optimally scale the outputs at different stages is the convergent block floating point (CBFP) method [3]. In this method, each of the N/r output streams is assigned a scale factor so as to optimally utilise the bit-widths before passing on to the next stage. At the end of all the stages, each FFT output has an associated scale factor, as in a floating-point representation. To get back to fixed point, the common exponent is extracted and all the outputs are quantized to the desired bit-widths. The advantage of this method is that it gives the best SQNR for the bit-widths allocated. However, it has a penalty in terms of memory and latency. Each stage needs N/r memory elements in full precision to store the outputs of the stage before the scale factor for that block can be computed and the outputs scaled. For example, in a 128-point radix-4 FFT design, this would introduce an extra overhead of 32 + 8 + 2 = 42 memory elements. This reduces the effectiveness of a pipelined design and increases the latency through the FFT. There have been some variants where this memory requirement is traded off for more complicated butterfly stages that equalise the scale factors on the input data on the fly [4]. Due to the increased complexity of CBFP, traditional fixed-point designs have relied on simulations to fix the scaling requirement at each stage.
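The memory overhead quoted above follows from summing the N/r buffers of successive stages. A minimal sketch, assuming that every stage whose input block is at least r points long buffers one full-precision block; the function name is hypothetical:

```python
def cbfp_extra_memory(fft_size, radix):
    """Sum the N/r full-precision buffers needed by successive radix-r stages.

    Assumption: a stage is counted as long as its input block holds at
    least `radix` points, as in the 128-point radix-4 example.
    """
    total = 0
    size = fft_size
    while size >= radix:
        size //= radix   # block length handed on to the next stage
        total += size    # one full-precision buffer of that length
    return total


# 128-point radix-4 example from the text: 32 + 8 + 2
print(cbfp_extra_memory(128, 4))
```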

IV. STATISTICS OF THE INTERMEDIATE SIGNALS

To design a static scaling scheme, we need to know how the signal behaves as it passes through multiple stages of the FFT. In a Nyquist-sampled OFDM system, the time domain samples at the receiver are independent and Gaussian distributed, as discussed in section II. Let the signal at the output of butterfly stage m be $S_m(n)$, with $S_0(n) = x(n)$, and let $S_{final}(n)$ be the FFT output. Any FFT architecture using a radix-r structure takes in r inputs and gives out N/r sets of r data values that are then processed independently. In each butterfly itself, the operation is performed on r inputs to give r outputs. The output of stage m can be expressed in terms of the input $S_{m-1}(n)$, a transformation matrix $T$, which performs some operation on the inputs, and a twiddle factor matrix $T_{tw}$, a diagonal matrix containing the appropriate twiddle factors for the input set, as $S_m = T_{tw} T S_{m-1}$. Note that since $T_{tw}$ is a diagonal matrix containing complex exponentials, it is unitary. The matrix $T$ depends on the radix r used to implement the FFT. For example, for a 4-point FFT implemented using radix 2, the matrix is given by

$$T = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}$$

For a radix-$2^2$ FFT, each stage has two butterflies; let the corresponding matrices be $T_1$ and $T_2$. The output is given by $S_m = T_{tw} T_2 T_1 S_{m-1}$. It can be seen that

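$$T_1 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \qquad T_2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -j \\ 0 & 0 & 1 & j \end{bmatrix}$$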

These transformations are orthogonal (up to a scale factor), and given that the input to the FFT is zero-mean i.i.d. Gaussian, the output of every stage is also zero-mean i.i.d. Gaussian. To show this, it is enough to show that if the input has a correlation matrix of the form $\sigma^2 I$, then the output also has a similar form. Assume that the input $S_{m-1}$ is i.i.d. with variance $\sigma_{m-1}^2$. Then the correlation matrix of the output is given by

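$$E[S_m S_m^H] = T_{tw} T \, E[S_{m-1} S_{m-1}^H] \, T^H T_{tw}^H \tag{1}$$
$$= \sigma_{m-1}^2 \, T_{tw} T T^H T_{tw}^H \tag{2}$$
$$= r \, \sigma_{m-1}^2 \, I_r \tag{3}$$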

where r is the radix of the FFT implementation and $I_r$ is the identity matrix of order r. So the rms value of the output of a stage increases by $\sqrt{r}$. A radix-$2^2$ stage can be considered as a cascade of two stages where, at each stage, the rms value of the signal increases by $\sqrt{2}$. The analysis shown here for one butterfly stage can be simply extended to all the outputs of a particular stage. In other words, given that the output of one stage of the FFT is i.i.d. Gaussian, the output of the next stage is also i.i.d., with the variance increasing by a factor of r. Note that the twiddle factor multiplication does not affect the scaling, as it is a unitary transformation.
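The step from equation (2) to equation (3) rests on $T T^H = r I$. The sketch below checks this numerically for the butterfly matrices of this section. The signs of the entries follow the standard decimation-in-frequency butterflies and are an assumption here, as are the example twiddle values; all helper names are hypothetical.

```python
import cmath
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hermitian(a):
    """Conjugate transpose."""
    return [[complex(a[j][i]).conjugate() for j in range(len(a))] for i in range(len(a[0]))]


def gram(t):
    """T T^H, the factor that appears in equation (2)."""
    return matmul(t, hermitian(t))


def is_scaled_identity(m, scale, tol=1e-12):
    n = len(m)
    return all(
        abs(m[i][j] - (scale if i == j else 0)) <= tol
        for i in range(n)
        for j in range(n)
    )


# Radix-2 butterfly on a block of 4 samples (sum and difference of halves).
T1 = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]]
# Second butterfly of a radix-2^2 stage, with the trivial -j rotation folded in.
T2 = [[1, 1, 0, 0], [1, -1, 0, 0], [0, 0, 1, -1j], [0, 0, 1, 1j]]
# Example diagonal twiddle matrix; the angles are arbitrary illustrative values.
angles = [0.0, math.pi / 8, math.pi / 4, 3 * math.pi / 8]
Ttw = [[cmath.exp(-1j * angles[i]) if i == j else 0 for j in range(4)] for i in range(4)]

print(is_scaled_identity(gram(T1), 2))               # one radix-2 butterfly: r = 2
print(is_scaled_identity(gram(T2), 2))               # second butterfly: r = 2
print(is_scaled_identity(gram(matmul(T2, T1)), 4))   # full radix-2^2 stage: r = 4
print(is_scaled_identity(gram(matmul(Ttw, T1)), 2))  # twiddles leave the scale unchanged
```

Each radix-2 butterfly therefore doubles the variance, and the cascade of two doubles it twice, which matches the per-stage rms growth of $\sqrt{2}$ described above.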

In the appendix (section VIII), it is shown that for any given number of bits, there is a certain rms value that gives the best SQNR. The exact rms value at which the SQNR is maximised depends on the peak-to-average ratio (PAR) of the signal. The PAR of an OFDM signal in turn typically depends on the number of tones used to generate the OFDM signal. At the input of the FFT, the best SQNR for the given number of bits is typically assured by the use of an automatic gain control (AGC) block. To retain the same rms value at different stages of the FFT, all we need to do is to scale the signal down by $\sqrt{r}$ at every stage. Note that this scaling need not be precise, as the SQNR degrades gradually from the peak. For a radix-4 architecture, the optimal scaling factor is 2 and hence is trivial to implement. By choosing the same number of bits at every stage and using the above scaling before quantizing, the SQNR contribution from each stage is made equal. This is essentially due to the fact that the above scaling factor makes the operation in each stage a unitary transformation. When each stage gives the same SQNR, the implementation complexity is minimized for a given total SQNR requirement.¹

Fig. 2. Impact of quantization noise on FFT output

For a radix-2 stage or the individual butterflies of a radix-$2^2$ stage, a lower-complexity method, referred to as the Sub-OPtimally Static scaled FFT (SOPS FFT), can be employed, in which the signal is scaled down by 2 once every two stages. Note that this method is static; it therefore does away with long simulation times and adds no complexity. In section VI, it is shown that the performance of the SOPS FFT is close to that of CBFP.

Many OFDM systems are implemented with 2x or 4x oversampling to relax the filtering requirements on the analog filters. In this case, the independence assumption made earlier in this section does not hold. However, adjacent input samples affect the FFT output only in the last stage. Hence for a 2x FFT the independence assumption is true until the last but one stage, and for a 4x FFT until the last but two stages. So for a 2x FFT, the scaling by 2 should be done once every two stages until the last but one stage, and a further scaling by 2 should be done for the last stage.
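The static schedule described above can be written down as a list of right shifts, one per radix-2 stage. The sketch below is one possible reading and rests on several assumptions that the text does not fix: the shift is placed on the second stage of each pair, a correlated final stage is modelled as growing the rms by a full factor of 2, and the 4x case (two correlated final stages, each scaled by 2) is an extrapolation. Function and parameter names are hypothetical.

```python
import math


def sops_shift_schedule(num_radix2_stages, correlated_tail_stages=0):
    """Right shift (in bits) applied after each radix-2 stage.

    correlated_tail_stages: 0 for Nyquist sampling, 1 for 2x oversampling,
    2 for 4x oversampling (the 4x handling is an assumption).
    """
    independent = num_radix2_stages - correlated_tail_stages
    shifts = []
    for stage in range(1, num_radix2_stages + 1):
        if stage <= independent:
            # rms grows by sqrt(2) per stage, so divide by 2 once per pair
            shifts.append(1 if stage % 2 == 0 else 0)
        else:
            # correlated inputs: scale by 2 at this stage
            shifts.append(1)
    return shifts


def rms_after_each_stage(shifts, correlated_tail_stages=0, rms_in=1.0):
    """Track the rms level, relative to the input, under the growth model above."""
    independent = len(shifts) - correlated_tail_stages
    levels = []
    rms = rms_in
    for stage, shift in enumerate(shifts, start=1):
        growth = math.sqrt(2.0) if stage <= independent else 2.0
        rms = rms * growth / (2 ** shift)
        levels.append(rms)
    return levels


# 128-point FFT = 7 radix-2 stages, 2x oversampled
schedule = sops_shift_schedule(7, correlated_tail_stages=1)
print(schedule)
print([round(v, 3) for v in rms_after_each_stage(schedule, correlated_tail_stages=1)])
```

Under this model the rms level never strays more than a factor of $\sqrt{2}$ from its input value, which is why an imprecise per-stage scaling costs little SQNR.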

V. EFFECT OF QUANTIZATION NOISE

Quantization noise gets added at different points in the FFT. For simplicity, we assume that the noise gets added at the end of every stage. We illustrate the case of a radix-2 FFT, but the analysis can be extended to other cases with some modifications. In a radix-2 FFT, as shown in figure 2, the quantization noise at the output of every stage passes through a lower-order FFT before affecting the output. In the example shown, the quantization noise at the input affects the output through an 8-point FFT. The quantization noise at the second stage passes through a 4-point FFT before affecting the output. The quantization noise in subsequent stages goes through lower-order FFTs, and finally the quantization noise at the output affects the signal directly.

¹ The output of the last few stages of the FFT may not be Gaussian, as the FFT outputs would be close to the constellation points. This means that the PAR of those outputs would be lower. So the SQNR vs. signal rms curve would increase further beyond 0.2 and would fall at a higher rms value (as against the SQNR of a Gaussian signal shown in figure 8). So the proposed method does not get the best SQNR for the last stages, but it does not degrade the performance. In practice this is insignificant, as our simulation results indicate.

Let $w_i$ denote the vector of quantization noise affecting the ith stage output (i = 0 corresponds to the quantization noise at the input) and let $W^i_m$ denote an m-point FFT of the noise affecting the ith stage. Let $Q(k)$ denote the quantization noise affecting the kth tone at the FFT output. Then

$$Q(k) = \sum_{i=0}^{\log_2 N} W^{i}_{N/2^i}\left(\left\lfloor \frac{k}{2^i} \right\rfloor\right) \tag{4}$$

The equation represents the fact that the output is computed using the decimation-in-frequency method. The quantization noise at the DC tone in the FFT output is the sum of the DC values of the quantization noises added in the different stages. The quantization noise at the first tone (k = 1) in the FFT output is the sum of the first tone of the input stage (i = 0) quantization noise and the DC tones of the quantization noise of the other stages (i > 0), and so on. To obtain the quantization noise statistics, let us assume that the quantization noise at any stage and index is independent of the quantization noise at any other index or stage. For simplicity we assume that the noise process is also identically distributed, although the results can easily be extended to cases where the quantization noise sources at different stages have different means and variances. Let the mean and variance of the individual quantization noise samples in $w_i$ for any stage i be $\mu$ and $\sigma^2$ respectively. Then

$$E[W^i_M(k)] = M \mu \, \delta(k) \tag{5}$$

The above equation just reflects the fact that the mean of the FFT output is the same as the FFT of the mean of the input, which is non-zero at DC and zero elsewhere.

$$E\left[\left|W^i_M(k) - E[W^i_M(k)]\right|^2\right] = M \sigma^2 \tag{6}$$

The above equation shows that the variance of the FFT output is identical for all tones, due to the i.i.d. assumption and the fact that the FFT is an orthogonal transformation. M represents the processing gain of the FFT. The quantization noise power is given by

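$$E[Q^2(k)] = \left( \sum_{i=0}^{\log_2 N} \frac{N}{2^i} \, \mu \, \delta\left(\left\lfloor \frac{k}{2^i} \right\rfloor\right) \right)^{2} + \sum_{i=0}^{\log_2 N} \frac{N}{2^i} \, \sigma^2 \tag{7}$$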

The equation shows that the variance of the quantization noise is the same for all output tones, but the mean differs from tone to tone, thereby causing a non-uniform noise power profile. Now let us consider the two most common quantization schemes, truncation and rounding. Both have the same variance, but their mean values are different. Suppose b fractional bits are either truncated or rounded to the nearest integer. It can be shown that the mean error, in magnitude, is $0.5 - 0.5/2^b$ with truncation and $0.5/2^b$ with rounding.
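The two mean values can be checked by enumerating every pattern of the b discarded fractional bits. The sketch below assumes that all $2^b$ patterns are equally likely, that rounding is round-half-up, and that the error is defined as the input minus the quantized value; under that sign convention the rounding mean comes out negative, and its magnitude is what matches the expression above. The function name is hypothetical.

```python
from fractions import Fraction


def error_stats(frac_bits, mode):
    """Exact mean and variance of (input - quantized) when frac_bits bits are dropped.

    Only the fractional part matters, so it is enough to enumerate
    the values j / 2**frac_bits for j = 0 .. 2**frac_bits - 1.
    """
    steps = 2 ** frac_bits
    errors = []
    for j in range(steps):
        frac = Fraction(j, steps)
        if mode == "truncate":
            quantized = 0                                   # floor of a value in [0, 1)
        elif mode == "round":
            quantized = 1 if frac >= Fraction(1, 2) else 0  # round half up
        else:
            raise ValueError(f"unknown mode: {mode}")
        errors.append(frac - quantized)
    mean = sum(errors) / steps
    variance = sum((e - mean) ** 2 for e in errors) / steps
    return mean, variance


for b in (1, 2, 4):
    trunc_mean, trunc_var = error_stats(b, "truncate")
    round_mean, round_var = error_stats(b, "round")
    print(b, trunc_mean, abs(round_mean), trunc_var == round_var)
```

Using `Fraction` keeps the arithmetic exact, so the equality of the two variances is tested without any floating-point tolerance.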

The quantization noise power profile at the FFT output for a 16-point FFT is plotted in figure 3 for both truncation and rounding, based on equation 7, assuming 2 fractional bits at every stage before rounding or truncating to an integer. The plot shows that the quantization noise is high near the DC tone, while it is significantly smaller near the edge tones. The trend is the same for rounding, but the difference between the edge tones and the DC tone is smaller due to the lower mean of the rounding noise. This analysis disregards quantization within a stage as well as quantization of the twiddle factors. To get a more realistic view of the quantization noise impact at the receiver, we took a 128-point FFT design as in figure 5 and plotted the quantization noise at the output with both truncation and rounding. The plot in figure 4 shows a high quantization noise near the DC tone, as expected. But the quantization noise increases again near the highest tones, giving a near U-shape to the spectrum. This is due to twiddle factor quantization and other factors that we have disregarded in our simple analysis.

Fig. 3. Theoretical plot of quantization noise power over tones at the output of an FFT (panel title: "Effect of quantization noise on 16 pt FFT"; x-axis: tone index; y-axis: noise power; curves: truncation, rounding)

Fig. 4. Quantization noise over tones at the output of an FFT, simulation (panel title: "Plot of mean absolute value of error vs tone index"; x-axis: tone index; y-axis: mean absolute value of error; curves: truncation, rounding)
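The theoretical profile of equation 7 is simple enough to tabulate directly. The sketch below does so for a 16-point radix-2 FFT with 2 fractional bits per stage. The per-sample variance $(1 - 2^{-2b})/12$ is an assumption (a uniform distribution over the $2^b$ discarded patterns), and the function name is hypothetical.

```python
def noise_power_profile(fft_size, mean, variance):
    """E[Q^2(k)] per equation 7, for every output tone k of a radix-2 FFT."""
    num_stages = fft_size.bit_length() - 1       # log2(N) for a power of two
    profile = []
    for k in range(fft_size):
        mean_sum = 0.0
        var_sum = 0.0
        for i in range(num_stages + 1):
            gain = fft_size // (2 ** i)          # N / 2^i, the FFT size seen by stage i noise
            if k // (2 ** i) == 0:               # delta(floor(k / 2^i)): DC bin of that FFT
                mean_sum += gain * mean
            var_sum += gain * variance
        profile.append(mean_sum ** 2 + var_sum)
    return profile


b = 2                                             # fractional bits dropped per stage
variance = (1.0 - 2.0 ** (-2 * b)) / 12.0         # assumed: uniform over 2^b patterns
trunc = noise_power_profile(16, 0.5 - 0.5 / 2 ** b, variance)
rnd = noise_power_profile(16, 0.5 / 2 ** b, variance)

for k in range(16):
    print(f"tone {k:2d}: truncation {trunc[k]:8.3f}   rounding {rnd[k]:8.3f}")
```

The mean term dominates at DC because the noise of every stage lands in the DC bin of its own sub-FFT, while tones in the upper half of the spectrum only pick up the mean of the final stage.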

A. Rotated FFT

In almost all OFDM systems, there are a number of zero tones at the band edges to prevent spectral leakage into adjacent bands. For example, WLAN has 12 guard tones, 6 on either side of the spectrum. Oversampling at the receiver also introduces null tones at high frequencies. At the receiver, these zero tones fall in the low quantization noise region of the U-shaped curve, while the desired tones fall in the high quantization noise region. This degrades the SQNR of the desired tones. So for systems with oversampling or with guard tones, an easy way to increase SQNR is to cyclically shift the desired tones to the highest-SQNR region. This is easily accomplished by multiplying the inputs by $(-1)^n$, so that the FFT outputs are cyclically shifted by N/2 tones. We call this method the Rotated FFT (R-FFT). It can be used to improve the performance of any FFT design that uses truncation, achieving performance gains without the additional complexity of rounding.² In section VI, we demonstrate the gains due to this method when applied in tandem with the earlier proposed methods.

Fig. 5. Legacy 128 point FFT (block diagram: radix-$2^2$ stages 1, 3 and 5, each made up of butterflies BF 1 and BF 2 followed by a twiddle factor multiplication, then one radix-2 stage; 14-bit output)
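The cyclic shift produced by the $(-1)^n$ multiplication can be verified numerically with a direct DFT. In this sketch the input is random complex Gaussian data of length 16 with a fixed seed, which is purely an illustrative choice.

```python
import cmath
import math
import random


def dft(samples):
    """X(k) = sum_n x(n) exp(-j 2 pi k n / N), computed directly."""
    n_pts = len(samples)
    return [
        sum(samples[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


rng = random.Random(1)  # fixed seed, for repeatability only
N = 16
x = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(N)]

# Sign flip on odd-indexed samples: multiplication by (-1)^n
rotated_input = [(-x[n] if n % 2 else x[n]) for n in range(N)]

X = dft(x)
Y = dft(rotated_input)

# Tone k of the rotated FFT holds tone (k + N/2) mod N of the plain FFT.
shift_err = max(abs(Y[k] - X[(k + N // 2) % N]) for k in range(N))
print(f"largest mismatch after accounting for the N/2 shift: {shift_err:.2e}")
```

Since $(-1)^n = \exp(j 2 \pi (N/2) n / N)$, the multiplication is a frequency shift by exactly N/2 bins, so tones near DC and tones near the band edge trade places.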

VI. SIMULATION RESULTS

The simulations were performed for an 802.11a/g system, oversampled by a factor of 2. There are a total of 128 tones, with 52 data tones, 12 guard tones which are zero and 64 zero tones due to oversampling. The modulation mode used is 54 Mbps, where the transmitted symbols are chosen from a 64-QAM constellation. A legacy 128-point FFT design, based on a radix-$2^2$ architecture for all stages but the last, is chosen as the starting point for evaluating our proposals. The input for each stage is 13 bits and the final FFT output is 14 bits. The legacy FFT, shown in figure 5, increases the number of integer bits by 2 for every radix-$2^2$ stage, which is the worst-case bit growth possible.³ The SOPS technique is now applied to this design by modifying the scale factor (or equivalently the number of integer bits) at the different stages, while retaining the same complexity. The block diagram of the new design is shown in figure 6. The Rotated-SOPS (R-SOPS) FFT is obtained by modifying the SOPS FFT to include the trivial multiplication of the inputs by $(-1)^n$. The CBFP FFT uses 13 output bits at every stage and uses the CBFP algorithm to obtain the scaling factors for each output. The final FFT outputs are again converted to fixed point by extracting a common scale factor for all outputs. The comparison is based on the cdf of the SQNR, which was obtained from 1000 simulation runs.

Figure 7 compares the performance of the fixed-point-input floating-point-output, CBFP (rounded), SOPS (truncated), SOPS (rounded), R-SOPS (truncated) and legacy FFT designs. Rounding was employed in CBFP to compare the proposed designs against the best possible design. The figure shows a significant performance gain of about 16 dB with the R-SOPS FFT compared to the legacy design. Note that the rotated SOPS (R-SOPS) design is 1.5 dB better than the rounded SOPS design, although only truncation is employed in the R-SOPS design. This can be explained using figure 4, where the error profile for the rounding case near the DC tones is higher than the error profile for the truncation case at the edge tones. The rotated SOPS design also performs within about 0.5 dB of the CBFP design. It must be borne in mind that rotation gives such large gains because the signal is oversampled by a factor of 2. For Nyquist-sampled systems, the gains due to rotation would be smaller. The SQNR curves were similar for all modulation modes and in the presence of multipath.

² Rounding requires an extra adder. Multiplication by $(-1)^n$ can be absorbed into the first stage of the FFT and will only cause a control overhead. Even if a separate adder is needed for multiplying the input by -1, it still needs to be done only at the input and not at every stage.

³ The format for the numbers is as follows: 13,1,tc means a two's complement number with 13 total bits and 1 integer bit.

Fig. 6. 128 point FFT design using sub optimal scaling method (block diagram: radix-$2^2$ stages i = 1, 3 and 5, each made up of butterflies BF 1 and BF 2 followed by a twiddle factor multiplication, then one radix-2 stage; 14-bit output)

Fig. 7. CDF of SQNRs for different FFT designs (panel title: "CDF of SQNR for different ffts"; x-axis: SQNR in dB; y-axis: CDF of SQNR; curves: legacy fft, fixed in float out, SOPS trunc, SOPS round, Rotated SOPS, CBFP)

VII. CONCLUSION

In this paper we have presented a technique to design a fixed-point FFT that does not require a highly complex block floating point implementation. We have investigated the quantization error profile at the output of the FFT and, based on this, have proposed a technique to improve the SQNR. A legacy 128-point FFT design has been modified according to these principles and shown to achieve near-CBFP performance.

Fig. 8. SQNR vs. signal RMS level (panel title: "SQNR as a function of signal RMS level and bitwidth"; x-axis: signal RMS level $\sigma_s$; y-axis: SQNR in dB; one curve per bit-width from 8 bits to 16 bits, SQNR increasing with bit-width)

VIII. APPENDIX

For a reasonably large number of subcarriers, the OFDM signal has a probability density function (pdf) that is approximately complex Gaussian. Let us assume that for both I and Q, the full-scale fixed-point range is [-1, 1) and that I and Q are each quantized to an N-bit signed two's complement representation. In the following, we derive the expression for the SQNR that results from the quantization and clipping, as a function of the RMS signal level $\sigma_s$ and the number of bits N.

The mean-squared error due to loss of precision at the LSB is given by $\sigma_q^2 = \Delta^2/12 = 2^{-2N}/3$, where $\Delta = 2^{-N+1}$ is the bin-width of each of the $2^N$ possible quantized levels.

Using the Gaussian assumption for the pdf of the signal, the mean-squared error due to clipping at +/-1 is given by

$$\sigma_c^2 = \frac{2}{\sqrt{2 \pi \sigma_s^2}} \int_{v=1}^{\infty} (v - 1)^2 \, e^{-\frac{v^2}{2 \sigma_s^2}} \, dv \tag{8}$$

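The above integral can be simplified using the complementary error function, and the composite SQNR can be written as

$$10 \log_{10} \left( \frac{\sigma_s^2}{(\sigma_s^2 + 1) \, \mathrm{erfc}\left(\frac{1}{\sqrt{2} \, \sigma_s}\right) - \sigma_s \sqrt{\frac{2}{\pi}} \, e^{-\frac{1}{2 \sigma_s^2}} + \frac{2^{-2N}}{3}} \right)$$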

Figure 8 shows the plot of SQNR as a function of $\sigma_s$ for different bit-widths. From the figure, it can be observed that for a given bit-width, as the signal RMS level increases, the SQNR improves up to a certain point, but beyond that it starts decreasing steeply. Therefore, the signal RMS level has to be chosen appropriately to get the best SQNR.
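The trade-off between quantization and clipping can be explored with a few lines of standard-library Python. The sketch below evaluates the closed-form clipping error obtained by integrating equation 8, cross-checks it against a trapezoidal evaluation of the same integral, and scans a grid of RMS levels for the best SQNR. The grid limits, the step counts and all function names are illustrative assumptions.

```python
import math


def quantization_mse(num_bits):
    """sigma_q^2 = Delta^2 / 12 with Delta = 2^(-N+1)."""
    return 2.0 ** (-2 * num_bits) / 3.0


def clipping_mse(sigma):
    """Closed form of equation 8 in terms of erfc."""
    return ((sigma ** 2 + 1.0) * math.erfc(1.0 / (math.sqrt(2.0) * sigma))
            - sigma * math.sqrt(2.0 / math.pi) * math.exp(-1.0 / (2.0 * sigma ** 2)))


def clipping_mse_numeric(sigma, steps=20000):
    """Trapezoidal evaluation of equation 8, used only as a cross-check."""
    upper = 1.0 + 12.0 * sigma            # the integrand is negligible beyond this
    h = (upper - 1.0) / steps

    def integrand(v):
        return (v - 1.0) ** 2 * math.exp(-v * v / (2.0 * sigma ** 2))

    total = 0.5 * (integrand(1.0) + integrand(upper))
    total += sum(integrand(1.0 + i * h) for i in range(1, steps))
    return 2.0 / math.sqrt(2.0 * math.pi * sigma ** 2) * h * total


def sqnr_db(sigma, num_bits):
    """Composite SQNR: signal power over clipping plus quantization error."""
    noise = clipping_mse(sigma) + quantization_mse(num_bits)
    return 10.0 * math.log10(sigma ** 2 / noise)


def best_rms(num_bits, grid_points=500):
    """Grid search for the RMS level in [0.01, 0.5] that maximises the SQNR."""
    grid = [0.01 + 0.49 * i / (grid_points - 1) for i in range(grid_points)]
    return max(grid, key=lambda s: sqnr_db(s, num_bits))


for bits in (8, 12, 16):
    s = best_rms(bits)
    print(f"{bits} bits: best rms = {s:.3f}, SQNR = {sqnr_db(s, bits):.1f} dB")
```

Below the optimum the noise is dominated by the fixed quantization floor, so the SQNR rises with signal power; above it the clipping term grows very quickly, which produces the steep fall-off seen in figure 8.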

REFERENCES

[1] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. IEEE IPPS, 1996.

[2] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," in Proc. IEEE International Symposium on Signals and Systems, 1998.

[3] E. Bidet, C. Joanblanq, and P. Senn, "A fast single chip implementation of 8192 complex points FFT," in Proc. IEEE Custom Integrated Circuits Conference, 1994.

[4] T. Lenart and V. Owall, "A 2048 complex point FFT processor using a novel data scaling approach," in Proc. IEEE International Symposium on Circuits and Systems, 2003.

[5] S. H. Park, D. H. Kim, D. S. Han et al., "Sequential design of a complex 8192 point FFT in OFDM receiver," in IEEE AP-ASIC, 1999.
