fft design methodology - shodhgangashodhganga.inflibnet.ac.in/bitstream/10603/35415/13/13... ·...


5 FFT DESIGN METHODOLOGY

5.1 General

The fast Fourier transform (FFT) provides a fast way of processing data in wireless transmission. It converts time-domain data to frequency-domain data with a low hardware requirement and a short computation time.

5.2 Fast Fourier Transform

Conventional signal and image processing applications based on the Fast Fourier Transform (FFT) require high computational power, as well as the ability to choose the algorithm and architecture. When comparing alternative FFT implementations, the criteria to consider are execution speed, programming effort, hardware design effort, system cost, flexibility and precision. For real-time signal processing, however, the main concern is execution speed. The implementation has been made on a Field Programmable Gate Array (FPGA) as a way of obtaining high performance at an economical price and with a short realization time. It can be used with segmented arithmetic at any level of pipelining in order to increase the operating frequency.

5.3 Fixed-Radix FFT Algorithms

In this section we will introduce several fixed-radix FFT algorithms such as radix-2,

radix-4, mixed radix-4-2, R2MDC, Proposed Modified R2MDC etc.


5.3.1 Radix-2 FFT Algorithm

The radix-2 FFT algorithm is obtained with the divide-and-conquer approach. The output sequence X(k) is split into two summations [87], one over the first N/2 data points and the other over the last N/2 data points. Thus we obtain,

(5.1)

Now, let us restrict the discussion to N a power of 2 and consider computing the even-numbered and the odd-numbered frequency samples separately. Thus we obtain the even-numbered frequency samples as

(5.2)


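The even-numbered samples of Equation (5.2) are the N/2-point DFT of the sequence obtained by adding the two halves of the input sequence. The odd-numbered samples are the N/2-point DFT of the N/2-point sequence obtained by subtracting the bottom half of the input sequence from the upper half and multiplying the resulting sequence by W_N^n. If we define the N/2-point sequences g(n) and h(n) as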

(5.3)
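For reference, the decomposition described in this subsection can be written in the following standard textbook form. This is a sketch that assumes the usual convention W_N = e^{-j2\pi/N}; it is not meant to fix how the individual expressions map onto Equations (5.1) to (5.3).

```latex
% Split of the N-point DFT into sums over the first and the last N/2 inputs
X(k) = \sum_{n=0}^{N/2-1} x(n)\, W_N^{kn}
     + W_N^{kN/2} \sum_{n=0}^{N/2-1} x\!\left(n + \tfrac{N}{2}\right) W_N^{kn},
\qquad W_N^{kN/2} = (-1)^k

% Even- and odd-numbered frequency samples, k = 0, 1, \ldots, N/2 - 1
X(2k)   = \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \tfrac{N}{2}\right) \right] W_{N/2}^{kn}
X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{n}\, W_{N/2}^{kn}

% The two N/2-point sequences fed to the next stage
g(n) = x(n) + x\!\left(n + \tfrac{N}{2}\right), \qquad
h(n) = \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{n}
```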

Figure 5.1 Signal flow graph of a typical 2-point DFT

The computation of the sequences g(n) and h(n) according to Equation (5.3), and the subsequent use of these sequences to compute the N/2-point DFTs, are depicted in Figure 5.1. For the 64-point DFT [1], repeating this decomposition reduces the computation to a computation of 2-point DFTs. With the 2-point DFT computation inserted at each stage of the signal flow graph, we obtain the complete signal flow graph for computation of the 64-point DFT, as shown in Figure 5.2.

In Figure 5.2, proceeding from one stage to the next, the basic computation has the form of Figure 5.1: it obtains a pair of values in one stage from a pair of values in the preceding stage, where the coefficients are always powers of W_N and the exponents are separated by N/2. Because of the shape of the signal flow graph, this elementary computation is called a butterfly [84]. The number of butterflies is also regular: each stage contains N/2 of them. The basic butterfly of Figure 5.1 can be redrawn as in Figure 5.2, which requires only one complex multiplication and two complex additions.
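The operation count of the butterfly can be made concrete with a minimal Python sketch. The function names `dif_butterfly` and `twiddle` are illustrative only, and the convention W_N = e^{-j2π/N} is assumed.

```python
import cmath


def twiddle(n, k):
    """Twiddle factor W_n^k, assuming the convention W_n = exp(-j*2*pi/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def dif_butterfly(a, b, w):
    """Radix-2 DIF butterfly: two complex additions, one complex multiplication."""
    upper = a + b          # first complex addition
    lower = (a - b) * w    # second complex addition (a subtraction), then the multiply
    return upper, lower


# A 2-point DFT is a single butterfly whose twiddle factor is W_2^0 = 1.
print(dif_butterfly(1 + 0j, 2 + 0j, twiddle(2, 0)))
```

In the last stage of a radix-2 DIF graph every twiddle factor equals 1, so the multiplication in those butterflies is trivial.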

Figure 5.2 Radix-2 DIF FFT signal flow graph of 64-point (Stage 1 to Stage 6)

In Figure 5.2 the time-domain input data x(n) occur in natural order, but the frequency-domain output X(k) occurs in bit-reversed order. The computations are also performed in place, meaning that the memory read and the memory write of each butterfly use the same memory locations. In this way the required memory space is minimized. Figure 5.2 also shows the relationship between the input data and the output data: the output with index [k_0 k_1 ... k_(log2N-2) k_(log2N-1)]_2 is mapped to index [k_(log2N-1) k_(log2N-2) ... k_1 k_0]_2 in a one-dimensional memory array.

For example, with 4-bit indices (the 16-point radix-2 DIF FFT signal flow graph), the output index 1011 is mapped to index 1101 of the memory array. For the radix-2 16-point DIF FFT signal flow graph, the relationship between normal order and bit-reversed order is shown in Figure 5.3.

Figure 5.3 Bit-reversed order
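The index mapping can be checked with a short Python sketch. The helper name `bit_reverse` is an assumption made for illustration; it reverses the lowest `bits` bits of an index.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits written in reverse order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current LSB into the result
        index >>= 1
    return result


# 16-point case: 4-bit indices, natural order mapped to bit-reversed order.
order = [bit_reverse(i, 4) for i in range(16)]
print(order)

# The example from the text: index 1011 maps to 1101.
print(format(bit_reverse(0b1011, 4), "04b"))
```

Applying the mapping twice returns the original index, so the same routine converts in either direction.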

However, it is possible to reconfigure the decimation-in-frequency algorithm so that

the input sequence occurs in bit-reversed order while the output DFT occurs in

normal order. Furthermore, if we abandon the requirement that the computations be

done in place, it is also possible to have both the input data and the output DFT in

normal order. This case is called out-of-place mode.
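The difference between the in-place and out-of-place modes can be sketched in Python as follows. This is an illustrative software model, not the hardware design; the function names are assumptions and W_N = e^{-j2π/N} is assumed.

```python
import cmath


def fft_dif_inplace(x):
    """In-place radix-2 DIF FFT: x is overwritten, result is in bit-reversed order."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    span = n // 2                      # distance between the two butterfly inputs
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                a, b = x[start + j], x[start + j + span]
                x[start + j] = a + b               # written back to the same locations
                x[start + j + span] = (a - b) * w
        span //= 2
    return x


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_natural_order(x):
    """Out-of-place variant: natural-order input and natural-order output."""
    n = len(x)
    bits = n.bit_length() - 1
    y = fft_dif_inplace(list(x))       # work on a copy, leaving the input untouched
    return [y[bit_reverse(k, bits)] for k in range(n)]
```

The in-place routine needs only the N storage locations of the input, while the natural-order wrapper pays for the reordering with a second buffer.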

5.3.2 Radix 4 FFT

A radix-4 common-factor FFT algorithm can be employed when N = 4^k by recursively reorganizing sequences into 4 × N/4 arrays. The development of a radix-4 algorithm is similar to the development of a radix-2 FFT, and both DIT and DIF versions are possible. Rabiner and Gold (1975) provide more details on radix-4 algorithms [89].

The radix-4 decimation-in-time butterfly is shown in Figure 5.4. As with the radix-2 butterfly, the radix-4 butterfly is formed by merging a 4-point DFT with the associated twiddle factors that normally sit between DFT stages. The four inputs A, B, C and D are on the left side of the butterfly diagram, and the latter three are multiplied by the complex coefficients Wb, Wc and Wd respectively. These coefficients all have the same form but are shown with different subscripts to distinguish the three that appear in a single butterfly.

Figure 5.4 Radix-4 DIT butterfly

When the number of data points N in the DFT is a power of 4 (i.e., N = 4^v), one can always use a radix-2 algorithm for the computation. However, it is computationally more efficient to employ a radix-4 FFT algorithm. As with the radix-2 FFT algorithm, we use the divide-and-conquer approach to decimate the N-point DFT into four N/4-point DFTs. We have

(5.4)

From the definition of the twiddle factors, we have


(5.5)

The relation is not an N/4-point DFT because the twiddle factor [90] depends on N and not on N/4. To convert it into an N/4-point DFT, we subdivide the DFT sequence into four N/4-point subsequences, X(4k), X(4k+1), X(4k+2) and X(4k+3), k = 0, 1, ..., N/4 - 1. Thus we obtain the radix-4 decimation-in-frequency DFT as

(5.6)

where we have used the property W_N^(4kn) = W_(N/4)^(kn). Note that the input to each N/4-point DFT is a linear combination of four signal samples scaled by a twiddle factor. This procedure is repeated v times, where v = log_4 N.
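The radix-4 decimation-in-frequency step can be sketched in Python as a recursive routine. This is a minimal illustration, assuming W_N = e^{-j2π/N}; the name `fft_radix4_dif` is hypothetical, and the four linear combinations follow the standard radix-4 DIF form rather than a specific equation of this chapter.

```python
import cmath


def fft_radix4_dif(x):
    """Recursive radix-4 DIF FFT for len(x) a power of 4; natural-order output."""
    n = len(x)
    if n == 1:
        return list(x)
    assert n % 4 == 0, "length must be a power of 4"
    q = n // 4
    subs = [[0j] * q for _ in range(4)]        # inputs of the four N/4-point DFTs
    for i in range(q):
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        w1 = cmath.exp(-2j * cmath.pi * i / n)  # W_N^i
        # Multiplications by +/-1 and +/-j are trivial in hardware (swap and negate).
        subs[0][i] = a + b + c + d
        subs[1][i] = (a - 1j * b - c + 1j * d) * w1
        subs[2][i] = (a - b + c - d) * w1 ** 2
        subs[3][i] = (a + 1j * b - c - 1j * d) * w1 ** 3
    out = [0j] * n
    for r in range(4):
        out[r::4] = fft_radix4_dif(subs[r])     # these are the samples X(4k + r)
    return out
```

Each call performs one of the v = log_4 N decomposition levels; for N = 64 the recursion depth is three.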

5.3.3 Radix 8 FFT

A radix-8 common-factor FFT algorithm can be employed in the same way as radix-4 when N = 8^k, by recursively reorganizing sequences into 8 × N/8 arrays. The development of a radix-8 algorithm is also similar to the development of a radix-4 FFT. Since the radix-8 FFT is beyond the scope of this thesis, it is not described further.


5.3.4 Split Radix FFT

After studying the fixed-radix (radix-2 and radix-4) algorithms, it is interesting to note that for radix-2 the even-numbered points of the DFT [91, 92], [20] can be computed independently of the odd-numbered points. This suggests the possibility of using different computational methods for independent parts of the algorithm, with the objective of reducing the number of computations.

The split-radix FFT (SRFFT) algorithms exploit this idea by using different fixed-radix decompositions in the same FFT algorithm. The split-radix approach was first proposed by Duhamel and Hollmann in 1984 [52]. Such an algorithm can be developed by mixing two or more fixed-radix decomposition methods, giving split-radix 2/4, split-radix 2/8, split-radix 2/4/8, etc. Only split-radix 2/4 is considered here. In a mixed decomposition, radix-2 is the basic component because radix-2 can compute a DFT of any power-of-2 length.

We illustrate this approach with a DIF SRFFT algorithm. First, we recall that in the

radix-2 DIF FFT algorithm, the even-numbered samples of N-point DFT are given

as

(5.7)

A radix-2 decomposition suffices for this computation. The odd-numbered samples {X(2k+1)} of the DFT require pre-multiplication of the input sequence by the twiddle factors W_N^n. For these samples a radix-4 decomposition gives some computational efficiency because the four-point DFT has the largest multiplication-free butterfly. Indeed, it can be shown that using a radix greater than 4 does not result in a significant reduction in computational complexity.

If we use a radix-4 decimation-in-frequency FFT algorithm for the odd-numbered samples of the N-point DFT, we obtain the following N/4-point DFTs:

(5.8)

Thus the N-point DFT is decomposed into one N/2-point DFT [93] without additional twiddle factors and two N/4-point DFTs with twiddle factors. The N-point DFT is obtained by successive use of these decompositions up to the last stage. Thus we obtain a DIF split-radix 2/4 algorithm [6]. The signal flow graph of the basic butterfly cell of the split-radix 2/4 DIF FFT algorithm is shown in Figure 5.5.

Figure 5.5 Signal flow graph of basic butterfly cell of split-radix 2/4 DIF FFT
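The decomposition into one N/2-point DFT and two N/4-point DFTs can be sketched in Python as a recursive routine. This is a minimal illustration of the standard DIF split-radix 2/4 recursion, assuming W_N = e^{-j2π/N}; the name `fft_split_radix` and the variable names are hypothetical.

```python
import cmath


def fft_split_radix(x):
    """Recursive DIF split-radix 2/4 FFT for len(x) a power of 2; natural-order output."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    h, q = n // 2, n // 4
    # Even samples X(2k): one N/2-point DFT, no additional twiddle factors.
    even_in = [x[i] + x[i + h] for i in range(h)]
    # Odd samples X(4k+1) and X(4k+3): two N/4-point DFTs with twiddle factors.
    odd1_in, odd3_in = [], []
    for i in range(q):
        u = x[i] - x[i + h]
        v = x[i + q] - x[i + 3 * q]
        w1 = cmath.exp(-2j * cmath.pi * i / n)        # W_N^i
        w3 = cmath.exp(-2j * cmath.pi * 3 * i / n)    # W_N^(3i)
        odd1_in.append((u - 1j * v) * w1)
        odd3_in.append((u + 1j * v) * w3)
    out = [0j] * n
    out[0::2] = fft_split_radix(even_in)
    out[1::4] = fft_split_radix(odd1_in)
    out[3::4] = fft_split_radix(odd3_in)
    return out
```

The even branch recurses on half the data while the odd branches recurse on a quarter, so results of one block finish at different depths of the recursion.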

We have,

(5.9)


As a result, the even and odd frequency samples of each basic processing block are not produced in the same stage of the complete signal flow graph. This makes the signal flow graph irregular, because it has an “L”-shaped topology.

The number of butterflies is therefore not the same in every stage, as it is in the radix-2 or radix-4 FFT algorithms, and the arrangement of the coefficients is also very irregular, so the split-radix algorithm requires more implementation effort than the other FFT algorithms.

5.3.5 Mixed Radix 4-2 FFT

The mixed-radix 4/2 butterfly unit is shown in Figure 5.6. It uses both the radix-2^2 and the radix-2 algorithms and can process FFTs whose length is not a power of four. The mixed-radix 4/2 unit [2], [3], [4] calculates four butterfly outputs based on X(0)~X(3). The proposed butterfly unit has three complex multipliers and eight complex adders. Four multiplexers, represented by the solid boxes, select either the radix-4 calculation or the radix-2 calculation.

Figure 5.6 The basic butterfly for mixed-radix 4/2 DIF FFT algorithm
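The idea of choosing between a radix-4 and a radix-2 calculation can be illustrated in software. The Python sketch below is not the butterfly unit of Figure 5.6; it is a generic recursive DIF FFT, under the assumption W_N = e^{-j2π/N}, that takes radix-4 steps while the length is divisible by 4 and one radix-2 step otherwise. The name `fft_mixed_radix_4_2` is hypothetical.

```python
import cmath


def fft_mixed_radix_4_2(x):
    """Recursive DIF FFT for len(x) a power of 2, mixing radix-4 and radix-2 steps."""
    n = len(x)
    if n == 1:
        return list(x)
    assert n % 2 == 0, "length must be a power of 2"
    radix = 4 if n % 4 == 0 else 2     # plays the role of the radix select signal
    m = n // radix
    out = [0j] * n
    for r in range(radix):
        sub = []
        for i in range(m):
            # Output r of the small radix-point DFT across the input segments.
            # Its coefficients are only +/-1 and +/-j, so no real multiplier is needed.
            acc = sum(x[i + s * m] * cmath.exp(-2j * cmath.pi * r * s / radix)
                      for s in range(radix))
            # Twiddle factor W_N^(r*i) applied between the stages.
            sub.append(acc * cmath.exp(-2j * cmath.pi * r * i / n))
        out[r::radix] = fft_mixed_radix_4_2(sub)   # samples X(radix*k + r)
    return out
```

For N = 8 or N = 32, which are not powers of four, the recursion ends with a single radix-2 step; for N = 64 only radix-4 steps are used.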

To verify the proposed scheme, a 64-point FFT based on the proposed Mixed-Radix 4-2 butterfly with simple bit reversing for ordering the output sequences is taken as an example. As shown in Figure 5.7, the block diagram of the 64-point FFT is composed of a total of sixteen Mixed-Radix 4-2 butterflies. In the first stage, the 64-point input sequence is divided into 8 groups, which correspond to n3 = 0, 1, 2, 3, 4, 5, 6 and 7 respectively. Each group is the input sequence of one Mixed-Radix 4-2 butterfly. After the input sequences pass the first Mixed-Radix 4-2 butterfly stage, the order of the output values is indicated by the small numbers below each butterfly output line in Figure 5.7. The proposed Mixed-Radix 4-2 butterfly is composed of two radix-4 butterflies and four radix-2 butterflies [98], [99].

In the first stage, the input data of the two radix-4 butterflies, which are expressed in Equation (5.9), are grouped as x(n2), x(N/2±n2), x(N/4±n2), x(3N/4±n2) and x(N/8±n2), x(5N/8±n2), x(3N/8±n2), x(7N/8±n2) respectively. After each input group passes the first radix-4 butterflies, the output data are multiplied by the special twiddle factors. These sequences are then fed into the second stage, which is composed of the radix-2 butterflies. After the second-stage radix-2 butterflies, the output data are multiplied by the twiddle factors WQ (1+k), which form the only multiplier unit in the proposed Mixed-Radix 4-2 butterfly [99] with simple bit reversing of the output sequences. Finally, the order of the output sequences is shown in Figure 5.6 above.

The order of the output sequence is 0, 4, 2, 6, 1, 5, 3 and 7, which is exactly the simple binary bit-reversed order of the pure radix butterfly structure. Consequently, the proposed Mixed-Radix 4-2 butterfly with simple bit-reversed output sequence includes two radix-4 butterflies, four radix-2 butterflies, one multiplier unit and an additional shift unit for the special twiddle factors [98], [99], [100].

The Mixed-Radix 4-2 butterfly structure with simple bit reversing for ordering the output sequences is derived by index decomposition techniques. It uses the same number of multipliers as the Radix-2^3 and the Split-Radix 2/4/8 algorithms. However, the Split-Radix 2/4/8 butterfly [88] does not have a regular shape, so its realization is very complicated.

Figure 5.7 The Mixed-Radix 4-2 butterfly structure

5.3.6 R2MDC FFT Algorithm

This section investigates a new architecture for the pipelined radix-2 FFT used in MIMO-OFDM. The Radix-2 Multi-path Delay Commutator (R2MDC) is one of the commutated architectures of the radix-2 FFT algorithm: it reorders (commutates) the data between stages so that each butterfly receives its input values as quickly as possible. It is one of the most straightforward approaches to a pipelined implementation of the radix-2 FFT algorithm [94], and it is the simplest way to rearrange data for the FFT/IFFT algorithm. The input data sequence is broken into two parallel data streams flowing forward, with the correct distance between the data elements entering each butterfly set by proper delays. At each stage of this architecture, half of the data flow is delayed via the memory (Reg) and processed together with the second half of the data stream. The delay for each stage is 4, 2 and 1 respectively.

In this R2MDC architecture, both the butterflies (BF) and the multipliers are idle half the time, waiting for new inputs. The 8-point FFT/IFFT processor has one multiplier, 3 radix-2 butterflies, 10 registers (R) (delay elements) and 2 switches (S).

Figure 5.8 R2MDC architecture
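The data movement of the two streams can be illustrated with a behavioral Python model. This is a sketch only: it is not cycle-accurate, it ignores the idle cycles and register counts, and it models each delay-plus-switch pair as a block exchange between the streams. The names `commutate` and `r2mdc_fft` are hypothetical and W_N = e^{-j2π/N} is assumed.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def commutate(top, bottom, d):
    """Exchange blocks of d samples between the streams (delay of d plus switch)."""
    new_top, new_bottom = [], []
    for start in range(0, len(top), 2 * d):
        new_top += top[start:start + d] + bottom[start:start + d]
        new_bottom += top[start + d:start + 2 * d] + bottom[start + d:start + 2 * d]
    return new_top, new_bottom


def r2mdc_fft(x):
    """Behavioral two-stream radix-2 DIF pipeline; returns X(k) in natural order."""
    n = len(x)
    assert n >= 2 and n & (n - 1) == 0, "length must be a power of 2"
    # Input split: the first half is delayed by n/2 so it meets the second half.
    top, bottom = list(x[:n // 2]), list(x[n // 2:])
    m = n                                   # size of the DFT handled at this stage
    while m >= 2:
        half = m // 2
        if m < n:                           # commutator in front of every later stage
            top, bottom = commutate(top, bottom, half)
        next_top, next_bottom = [], []
        for i, (a, b) in enumerate(zip(top, bottom)):
            w = cmath.exp(-2j * cmath.pi * (i % half) / m)
            next_top.append(a + b)          # butterfly sum path
            next_bottom.append((a - b) * w) # butterfly difference path with twiddle
        top, bottom = next_top, next_bottom
        m = half
    # The interleaved output streams are in bit-reversed order.
    bits = n.bit_length() - 1
    stream = [v for pair in zip(top, bottom) for v in pair]
    out = [0j] * n
    for p, v in enumerate(stream):
        out[bit_reverse(p, bits)] = v
    return out
```

For an 8-point input the model pairs samples at distances 4, 2 and 1 in the three stages, which corresponds to the stage delays quoted above.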

The A input comes from the previous component, the twiddle factor multiplier (TFM). The B output is fed to the next component, normally BFII. In the first cycles, the multiplexers direct the input data to the feedback registers until they are filled (position “0”). In the following cycles, the multiplexers select the output of the adders/subtractors (position “1”), and the butterfly computes a 2-point DFT with the incoming data and the data stored in the feedback registers [94]. The detailed structure of BFI is shown in Figure 5.9 (a).


Figure 5.9 (a) BF I Structure (b) BF II Structure

The B input comes from the previous component, BFI. The Z output is fed to the next component, normally the TFM. In the first cycles, the multiplexers direct the input data to the feedback registers until they are filled (position “0”). In the following cycles, the multiplexers select the output of the adders/subtractors (position “1”), and the butterfly computes a 2-point DFT with the incoming data and the data stored in the feedback registers. The multiplication by -j involves real-imaginary swapping and sign inversion. The real-imaginary swapping is handled efficiently by the multiplexers (MUX), and the sign inversion is handled by switching the adding and subtracting operations by means of the MUX. When multiplication by -j is needed, all multiplexers switch to position “1”, the real and imaginary data are swapped and the adding and subtracting operations are exchanged. The detailed structures of BFI and BFII are shown in Figure 5.9 (a) and (b). The adders and subtractors in BFI and BFII are fully pipelined and followed by divide-by-2 and rounding [94]. The approach used here is to commutate the radix-2 algorithm in the IFFT architecture and to replace it with the R2MDC architecture in order to obtain a lower area than the existing system.
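The swap-and-negate rule for multiplication by -j follows directly from (a + jb)(-j) = b - ja. A minimal Python sketch, with the hypothetical helper name `mul_neg_j` and the real and imaginary parts kept as separate integers as they would be in fixed-point hardware:

```python
def mul_neg_j(re, im):
    """Multiply (re + j*im) by -j without a multiplier.

    (re + j*im) * (-j) = im - j*re, so the parts are swapped and the
    new imaginary part changes sign.
    """
    return im, -re


# (3 + 4j) * (-j) = 4 - 3j
print(mul_neg_j(3, 4))
```

In the butterfly the sign change does not need a separate negation: it can be absorbed by exchanging the addition and the subtraction that follow, which is the switching of operations described above.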

5.3.7 Proposed Modified R2MDC FFT

The radix-2 butterfly processor consists of a complex adder and a complex subtractor. In addition, a complex multiplier for the twiddle factors W_N is implemented. The complex multiplication with the twiddle factor requires four real multiplications and two add/subtract operations.
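The operation count of the complex multiplication can be seen by writing it out on real and imaginary parts. A minimal Python sketch, with the hypothetical name `complex_multiply`:

```python
def complex_multiply(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) with four real multiplications and two add/subtracts."""
    real = ar * br - ai * bi   # two multiplications, one subtraction
    imag = ar * bi + ai * br   # two multiplications, one addition
    return real, imag


# (1 + 2j) * (3 + 4j) = -5 + 10j
print(complex_multiply(1, 2, 3, 4))
```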


The A input comes from the previous component, the twiddle factor multiplier (TFM). The B output is fed to the next component, normally BFII. In the first cycles, the multiplexers direct the input data to the feedback registers until they are filled (position “0”). In the following cycles, the multiplexers select the output of the adders/subtractors (position “1”), and the butterfly computes a 2-point DFT with the incoming data and the data stored in the feedback registers. The detailed structure of BFI is shown in Figure 5.10 (a).

The architecture of BFI and BFII supporting two receive chains is shown in Figure 5.10 (a) and (b). In the BFI structure, the sample-routing MUXs and DEMUXs at the input and output of the BF_RAMs are controlled by the c2 and c3 control signals, while the computation unit is controlled by the c1 control signal. The control signals are issued by the BFI controller. Depending on the programmed number of receive chains, the extra BF_RAMs are enabled. WiMAX supports 1Rx and 2Rx; LTE supports 1Rx, 2Rx and 4Rx. Extra buffers can be added to the existing BF structure as required.

Figure 5.10 (a) BF I Structure (b) BF II Structure

Since the multiplications by -1, +j and -j are handled inside the BFII structure, two control signals, c1 and c2, are used in the basic computation unit. The MUXs and DEMUXs are controlled by the c3 and c4 control signals. The product with the ‘-j’ term is implemented by swapping the real and imaginary parts, taking the sign of the sample into account. The approach used here is to commutate the radix-2 algorithm in the IFFT architecture [94].

To optimize the processor, the proposed shift-and-add method eliminates the non-trivial complex multiplications with the twiddle factors (W_8^1, W_8^3) and implements the processor without complex multiplication. The proposed butterfly processor performs the multiplication by the trivial factor W_8^2 = -j by exchanging the real and imaginary parts, and the multiplication by the factor W_8^0 with a simple wire. For the non-trivial factors W_8^1 = e^(-jπ/4) and W_8^3 = e^(-j3π/4), the processor realizes the multiplication by the factor 1/√2 using hardwired shift-and-add operations, as shown in Figure 5.11.

Figure 5.11 MOD-R2MDC Butterfly FFT with no complex multiplication.
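A multiplier-free multiplication by W_8^1 and W_8^3 can be sketched in Python on integer samples. The constant used here, 181/256 = 0.70703125 (binary 0.10110101), is an assumed 8-bit approximation of 1/√2 chosen for illustration; the shift pattern of the actual design in Figure 5.11 may differ. The function names are hypothetical.

```python
def scale_inv_sqrt2(v):
    """Approximate v / sqrt(2) for an integer v using shifts and adds only.

    181 = 128 + 32 + 16 + 4 + 1, so v * 181 is a sum of shifted copies of v,
    and the final right shift by 8 divides by 256 (rounding toward minus infinity).
    """
    return ((v << 7) + (v << 5) + (v << 4) + (v << 2) + v) >> 8


def mul_w8_1(re, im):
    """(re + j*im) * e^(-j*pi/4) = ((re + im) + j*(im - re)) / sqrt(2)."""
    return scale_inv_sqrt2(re + im), scale_inv_sqrt2(im - re)


def mul_w8_3(re, im):
    """(re + j*im) * e^(-j*3*pi/4) = ((im - re) - j*(re + im)) / sqrt(2)."""
    return scale_inv_sqrt2(im - re), -scale_inv_sqrt2(re + im)


# Integer sample 1000 - 500j rotated by the two non-trivial twiddle factors.
print(mul_w8_1(1000, -500))
print(mul_w8_3(1000, -500))
```

Each rotation costs two additions to form re + im and im - re plus two constant scalings, so no general-purpose multiplier is required; the accuracy is limited by the number of terms kept in the constant.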

5.4 Summary

This chapter described the different FFT design methodologies in detail: Radix-2, Radix-4, Radix-8, Split Radix, Mixed Radix 4-2, R2MDC and Modified R2MDC FFT.
