
CHAPTER - 4

IMPLEMENTATION OF 128 POINT FFT PROCESSOR USING R2SDF ARCHITECTURE

Table of Contents

4.0 IMPLEMENTATION OF 128 POINT FFT PROCESSOR USING R2SDF ARCHITECTURE
4.1 Introduction
4.2 Architecture of Pipelined FFT
4.2.1 Architecture of Radix-2 FFT
4.2.2 Architecture of Radix-2/4/8 FFT
4.2.3 Architecture of Proposed FFT of variable length
4.3 Considerations of Architecture
4.3.1 Architectures of FFT – A comparison
4.3.2 CORDIC Vs Complex Multiplier
4.4 Circuit Design
4.4.1 Minimization of Word-length
4.4.2 Delay Line based on RAM
4.4.3 Current-mode technique of SRAM
4.4.4 Twiddle-factor ROM and Complex Multiplications
4.5 Results and Discussion
4.6 Summary


4.1 INTRODUCTION

This chapter describes the implementation of a 128-point Fast Fourier Transform (FFT) processor that is efficient in both speed and power for OFDM applications. The design incorporates a twiddle-factor ROM and supports variable lengths from 128 to 2048 points, which suits applications such as digital transmission of video or audio signals and high-speed or asymmetric digital subscriber loops. The 128-point architecture is built from Radix-2 stages in the single delay feedback (SDF) style. Current-mode SRAMs replace the shift registers in the delay lines to reduce power consumption and area. The chip consumes 176 mW at 2.3 V when clocked at 17.8 MHz.

Low-power, high-speed FFT processors are in great demand for OFDM applications. The architectures used to implement an FFT processor are commonly divided into three types. The first is the single-memory architecture, which has a single butterfly and a single memory element. The second is the dual-memory architecture, which has two memories and offers higher throughput than the single-memory architecture. Recent improvements in semiconductor technology have paved the way for FFT implementations in areas such as wideband communications, signal processing and biomedical instrumentation. OFDM has become popular in various communication systems because of its ability to counter spectrum and channel impairments, so an appropriate FFT implementation is essential for OFDM applications. The FFT sizes required by the standards of different communication systems are shown in Table 4.1. As the table shows, a variable-length FFT processor is an essential component in providing a common solution for these communication systems.

Table 4.1 Various Communication Systems Vs Size of FFT

S. No.   System   Size of FFT
1        ADSL     512
2        DAB      2048, 1024, 512, 256
3        VDSL     8192, 4096, 2048, 1024, 512
4        DVB      8192, 2048

A hardware implementation of an N-point FFT algorithm is intensive in both arithmetic operations and data swapping. In general, O(log N) arithmetic operations are needed per clock cycle to process the FFT. High-speed FFT processing can be obtained in two ways. In a general-purpose processor, a single processor performs the computation at a clock frequency greater than the sample frequency of the data. In a special-purpose processor, the computation is performed at the sample frequency itself, which consumes less power. In this chapter, various optimization methods are applied to the architecture to obtain a power- and area-efficient FFT processor.
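The O(log N) figure can be checked with a short calculation. The sketch below assumes a radix-2 decomposition, in which an N-point FFT needs (N/2) log2 N butterflies, and one input sample per clock cycle. The function names are illustrative only and are not part of the design.

```python
import math

def radix2_butterflies(n):
    """Total radix-2 butterflies in an n-point FFT (n must be a power of two)."""
    stages = int(math.log2(n))
    assert 2 ** stages == n, "n must be a power of two"
    return (n // 2) * stages

def butterflies_per_sample(n):
    """Butterflies that must complete per clock when one sample arrives per clock."""
    return radix2_butterflies(n) / n

if __name__ == "__main__":
    for n in (128, 512, 2048):
        print(n, radix2_butterflies(n), butterflies_per_sample(n))
```

Under these assumptions the per-sample workload is (log2 N)/2 butterflies, which is why a pipeline with one butterfly unit per stage can keep up with the sample rate.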

4.2 ARCHITECTURE OF PIPELINED FFT

4.2.1 Architecture of Radix-2 FFT

The Radix-2 multiple delay commutator (R2MDC), shown in Figure 4.1, is a pipelined architecture for the Radix-2 FFT. A commutator splits the input data sequence into two streams. The butterfly operation and the twiddle-factor multiplication are then executed with appropriate scheduling of the two data streams. This architecture needs log2 N Radix-2 butterfly elements, (log2 N) − 2 multipliers and (3N/2) − 2 delay elements. The Radix-2 single delay feedback (R2SDF) architecture, shown in Figure 4.2, uses the delay elements more efficiently by sharing the same storage between the inputs and outputs of the butterfly. The multiplier sees a single data stream at each stage. This scheme needs the same number of butterfly elements and multipliers as the R2MDC, but only N − 1 delay elements. The multipliers and butterfly elements are utilized only 50 % of the time because they are bypassed during half of each period.
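The resource counts quoted above can be tabulated for a given length. The following sketch simply evaluates those expressions; the function names and the dictionary layout are illustrative assumptions.

```python
import math

def r2mdc_resources(n):
    """Resource counts of the Radix-2 multiple delay commutator for an n-point FFT."""
    stages = int(math.log2(n))
    return {"butterflies": stages, "multipliers": stages - 2, "delays": 3 * n // 2 - 2}

def r2sdf_resources(n):
    """Resource counts of the Radix-2 single delay feedback architecture."""
    stages = int(math.log2(n))
    return {"butterflies": stages, "multipliers": stages - 2, "delays": n - 1}

if __name__ == "__main__":
    print("R2MDC, N = 128:", r2mdc_resources(128))
    print("R2SDF, N = 128:", r2sdf_resources(128))
```

For N = 128 both architectures use 7 butterflies and 5 multipliers, while the delay storage drops from 190 words (R2MDC) to 127 words (R2SDF).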

Fig. 4.1 Radix-2 Multiple Delay Commutator


Fig. 4.2 Radix-2 Single Delay Feedback
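The data flow of the single delay feedback style can be illustrated with a small behavioural model. The sketch below is a floating-point, decimation-in-frequency model written for illustration only; it is not the hardware description of the chip, and the names `sdf_stage`, `r2sdf_fft` and `direct_dft` are assumptions. Each stage keeps one feedback delay line: during the first half of a period it stores the input and emits the stored differences (multiplied by their twiddle factors), and during the second half it computes the butterfly, passes the sum on and feeds the difference back.

```python
import cmath

def sdf_stage(samples, depth):
    """One radix-2 SDF stage with a feedback delay line of `depth` words."""
    period = 2 * depth
    fifo = [0j] * depth                       # feedback delay line
    out = []
    # `depth` extra zeros flush the last differences out of the delay line.
    for t, x in enumerate(list(samples) + [0j] * depth):
        head = fifo.pop(0)                    # oldest word leaves the delay line
        phase = t % period
        if phase < depth:
            # First half: store the input, emit a stored difference times its twiddle.
            fifo.append(x)
            out.append(head * cmath.exp(-2j * cmath.pi * phase / period))
        else:
            # Second half: butterfly; the sum moves on, the difference is fed back.
            fifo.append(head - x)
            out.append(head + x)
    return out[depth:]                        # remove the stage latency

def r2sdf_fft(x):
    """N-point FFT built from log2(N) SDF stages; returns natural-order output."""
    n = len(x)
    bits = n.bit_length() - 1
    assert n >= 2 and n == 1 << bits, "length must be a power of two"
    data, depth = list(x), n // 2
    while depth >= 1:
        data = sdf_stage(data, depth)
        depth //= 2
    # The pipeline delivers results in bit-reversed order.
    rev = lambda i: int(format(i, "0{}b".format(bits))[::-1], 2)
    return [data[rev(k)] for k in range(n)]

def direct_dft(x):
    """Reference O(N^2) DFT used only to check the model."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

if __name__ == "__main__":
    x = [complex(i, -0.5 * i) for i in range(8)]
    err = max(abs(a - b) for a, b in zip(r2sdf_fft(x), direct_dft(x)))
    print("max error:", err)
```

The delay-line depths N/2, N/4, ..., 1 add up to N − 1 words, which matches the delay count stated for the R2SDF architecture.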

4.2.2 Architecture of Radix-2 / 4 / 8 FFT

The Discrete Fourier Transform of an N-point data sequence is defined by

$$X(z) = \sum_{n=0}^{N-1} x(n)\, W_N^{nz}, \qquad z = 0, 1, 2, \ldots, N-1 \tag{4.1}$$

where $W_N^{nz} = e^{-j 2\pi nz / N}$. The basic idea of the Radix-2 FFT is to exploit the symmetry between the twiddle factors $W_N^{nz}$ and $W_N^{nz+N/2}$ for simplification. Multiplication by the twiddle factors $W_N^{N/8}$, $W_N^{3N/8}$, $W_N^{5N/8}$ and $W_N^{7N/8}$ can also be simplified because their real and imaginary parts have equal magnitude. These multiplications are given by equations (4.2) to (4.4).


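$$(a + jb)\, W_N^{N/8} = -(a + jb)\, W_N^{5N/8} \tag{4.2}$$

$$= \frac{\sqrt{2}}{2}\,\bigl[(a + b) + j(b - a)\bigr] \tag{4.3}$$

$$(a + jb)\, W_N^{3N/8} = -(a + jb)\, W_N^{7N/8} = \frac{\sqrt{2}}{2}\,\bigl[(b - a) - j(a + b)\bigr] \tag{4.4}$$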
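These identities can be verified numerically. The sketch below assumes the usual convention $W_N = e^{-j2\pi/N}$ and compares the simplified forms, which need only one real constant $\sqrt{2}/2$, against a full complex multiplication. The function names are illustrative.

```python
import cmath
import math

HALF_SQRT2 = math.sqrt(2) / 2

def mul_w_n8(a, b):
    """(a + jb) * W_N^(N/8) as (sqrt(2)/2) * [(a + b) + j(b - a)]."""
    return complex(HALF_SQRT2 * (a + b), HALF_SQRT2 * (b - a))

def mul_w_3n8(a, b):
    """(a + jb) * W_N^(3N/8) as (sqrt(2)/2) * [(b - a) - j(a + b)]."""
    return complex(HALF_SQRT2 * (b - a), -HALF_SQRT2 * (a + b))

def twiddle(eighths):
    """W_N^(eighths * N / 8), which does not depend on N."""
    return cmath.exp(-2j * cmath.pi * eighths / 8)

if __name__ == "__main__":
    a, b = 3.0, -2.0
    v = complex(a, b)
    print(abs(mul_w_n8(a, b) - v * twiddle(1)))    # close to 0
    print(abs(mul_w_n8(a, b) + v * twiddle(5)))    # W^(5N/8) = -W^(N/8)
    print(abs(mul_w_3n8(a, b) - v * twiddle(3)))   # close to 0
    print(abs(mul_w_3n8(a, b) + v * twiddle(7)))   # W^(7N/8) = -W^(3N/8)
```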

Fig. 4.3 Signal flow graph of Radix-2 / 4 / 8 FFT

The signal flow graph of the Radix-2/4/8 FFT is shown in Figure 4.3 [5]. This algorithm implements the Radix-8 butterfly as three Radix-2 stages instead of a single butterfly element, so its signal flow graph is the same as that of the Radix-$2^3$ FFT [3]. A Radix-2/4/8 architecture can be obtained by suitably modifying the Radix-2 single delay feedback architecture. The processing elements PE1, PE2 and PE3 process the successive stages of the FFT. The architecture consists of repeated groups, each combining the processing elements with a multiplier for the twiddle-factor multiplication. The number of delay elements required halves from each stage to the next. The block diagrams of the processing elements are shown in Figure 4.4.

Fig. 4.4 Processing Elements in Radix-2/4/8 FFT

4.2.3 Architecture of proposed FFT of variable length

To reduce the power consumption and area of the chip, it is important to select an FFT algorithm with low computational complexity and an architecture with low hardware complexity. The block diagram of the variable-length FFT processor based on the Radix-2/4/8 SDF architecture is shown in Figure 4.5. This processor can perform FFT operations of various lengths. To accommodate the different numbers of stages, the first two stages are Radix-2 processing elements with a structure similar to that of processing element PE3. Each block consists of a group of processing elements PE1, PE2 and PE3 together with a multiplier, as shown in Figure 4.5. For instance, if a 512-point FFT is to be performed, the input signals bypass the first two stages through the multiplexer MUX2. If a 128-point FFT is to be executed, the first stage is skipped by means of the multiplexer MUX1.

Fig. 4.5 Block diagram of FFT processor of variable length
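The stage selection described above can be sketched as a decomposition of the FFT length into leading Radix-2 stages and Radix-8 groups (PE1, PE2, PE3). The sketch assumes that the full 2048-point pipeline consists of two Radix-2 stages followed by three Radix-8 groups, since 2048 = 2^2 × 8^3; the number of groups and the way unused groups are bypassed for shorter lengths are not stated in the text and are assumptions here, as is the function name.

```python
def stage_plan(n, max_radix8_groups=3):
    """Return (radix2_stages, radix8_groups) needed for an n-point FFT.

    Assumption: the pipeline offers at most two leading radix-2 stages and
    `max_radix8_groups` radix-8 groups; stages that are not needed are bypassed.
    """
    bits = n.bit_length() - 1
    if n < 2 or n != 1 << bits:
        raise ValueError("FFT length must be a power of two")
    radix8_groups, radix2_stages = divmod(bits, 3)
    if radix8_groups > max_radix8_groups:
        raise ValueError("FFT length exceeds the assumed pipeline")
    return radix2_stages, radix8_groups

if __name__ == "__main__":
    for n in (128, 256, 512, 1024, 2048):
        print(n, stage_plan(n))
```

Under this assumption a 512-point FFT uses no leading Radix-2 stage (both are bypassed) and a 128-point FFT uses one (the first is skipped), which agrees with the MUX2 and MUX1 behaviour described above.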


4.3 CONSIDERATIONS OF ARCHITECTURE

4.3.1 Architectures of FFT – a comparison

Over the past decade, various FFT architectures have been proposed with the aim of achieving high-speed FFT implementations. The important features of the Radix-2/4/8 FFT and of other architectures are listed in Table 4.2 [2], which compares their memory requirements and computational complexity [3]. It is clear that the number of multiplications decreases as the radix increases. Furthermore, in bit-parallel operation, higher-radix algorithms make better use of the multiplier hardware.

Table 4.2 Comparison of various architectures of FFT

Proposed Chip / Parameters    Bit-Parallel Architectures           Digit-Serial Architectures
Radix number                  Radix-2/4/8       Radix-4            Radix-2/4/8       Radix-4
Type of style                 Feedback          Feedback           Feed forward      Feed forward
Complex adders                2 log2 N          4 log4 N           12 log4 N         8 log4 (N+1)
  Utilization                 50 %              50 %               100 %             100 %
Complex multipliers           log8 N − 1        log4 N − 1         3 (log4 N − 1)    3 (log4 N − 1)
  Utilization                 87.5 %            75 %               100 %             100 %
Size of memory                N − 1             N − 1              2.5 N             1.18 N
ROM for twiddle factors       0.25 N            N                  N                 0.5 N

In digit-serial architectures, the data word length of the adders and multipliers is reduced to one digit, so less adder hardware is required. The digit size of both digit-serial architectures should match the throughput of the Radix-4 commutator to achieve full utilization of the adders and multipliers. Because a complex multiplier occupies far more area than a complex adder, a considerable hardware saving can be obtained by using a small number of multipliers. The feedback architecture requires fewer memory elements, whereas the feed-forward architecture needs many more. A large number of ROMs is not needed if the number of multiplications is reduced appropriately. The ROM of the first multiplier stores the twiddle factors with a phase spacing of 2π/N, and the spacing increases in the later stages. If the symmetry of the sine and cosine functions is exploited efficiently, the ROM size can be reduced further.

In the suggested architecture, the ROMs store only one eighth of a cycle of the sine and cosine waveforms, and the symmetry of the twiddle factors is exploited within each group, which results in a smaller ROM table. The architecture can therefore be employed to implement a variable-length FFT processor with low hardware complexity. Compared with the other architectures, it provides a trade-off between hardware complexity and performance.
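The one-eighth-cycle storage can be illustrated as follows. The sketch assumes $W_N^k = \cos(2\pi k/N) - j\sin(2\pi k/N)$ and a table holding cosine and sine samples for k = 0 to N/8 only; every other twiddle factor is recovered by sign changes and by swapping the sine and cosine entries. The table layout and the function names are illustrative assumptions, not the ROM organization of the chip.

```python
import cmath
import math

def build_octant_rom(n):
    """(cos, sin) samples for k = 0 .. n/8, i.e. one eighth of a cycle."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n // 8 + 1)]

def twiddle_from_rom(k, n, rom):
    """Reconstruct W_n^k from the one-eighth-cycle table using symmetry."""
    k %= n
    factor = 1 + 0j
    if k >= n // 2:            # W^(k + n/2) = -W^k
        factor = -factor
        k -= n // 2
    if k >= n // 4:            # W^(k + n/4) = -j * W^k
        factor *= -1j
        k -= n // 4
    if k > n // 8:             # mirror about pi/4: cosine and sine swap roles
        c, s = rom[n // 4 - k]
        w = complex(s, -c)
    else:
        c, s = rom[k]
        w = complex(c, -s)
    return factor * w

if __name__ == "__main__":
    n = 128
    rom = build_octant_rom(n)
    worst = max(abs(twiddle_from_rom(k, n, rom) - cmath.exp(-2j * cmath.pi * k / n))
                for k in range(n))
    print(len(rom), "table entries, max error", worst)
```

For N = 128 the table holds 17 entries instead of 128 complex values.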

4.3.2 CORDIC Vs Complex multiplier

Since the Coordinate Rotation Digital Computer (CORDIC) algorithm is efficient at vector rotation, it can be employed for the twiddle-factor multiplications in FFT processors [4]. This sub-section evaluates and compares the performance of a complex multiplier and the CORDIC algorithm. The precision is set to 16 bits in both cases, with 19 bits allocated to the data path of the CORDIC design to limit the propagation of rounding errors. In the CORDIC design, a ROM table of N/4 words with 16 bits per word generally stores the rotation sequences. In each stage, two 19-bit adders perform the micro-rotation, so the architecture requires 2 × 16 × 19 full adders for the 16 micro-rotation stages. The scaling factor of 0.100110110111010 (binary) is applied by additional multiplications in the final stage, which requires a further 2 × 9 × 19 full adders. Without pipelining, the critical-path delay is 19 times the full-adder delay in every micro-rotation stage. The ROM table shrinks to 8 words (1 word = 23 bits) in the subsequent stages because of the higher radix used there. The CORDIC outputs are represented in the arithmetic format and are converted to binary format once the butterfly operation has been performed by carry-lookahead adders [7].
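The micro-rotation scheme can be sketched in a few lines. The following floating-point model of rotation-mode CORDIC uses 16 micro-rotations and the scaling factor quoted above, read as a binary fraction. It is an illustration only: it chooses the rotation directions on the fly instead of reading stored rotation sequences from a ROM, and it does not model the 19-bit data path. The names are assumptions.

```python
import math

ITERATIONS = 16
# Scaling factor quoted in the text, interpreted as the binary fraction 0.100110110111010
SCALE = int("100110110111010", 2) / 2 ** 15
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

def cordic_rotate(x, y, angle):
    """Rotate the vector (x, y) by `angle` radians (|angle| below about 1.74)."""
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0           # direction of this micro-rotation
        # Shift-and-add micro-rotation; both updates use the old x and y.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x * SCALE, y * SCALE               # compensate the CORDIC gain

if __name__ == "__main__":
    # Multiplying 1 + j0 by the twiddle factor e^(-j*pi/8) is a rotation by -pi/8.
    print(cordic_rotate(1.0, 0.0, -math.pi / 8))
    print(math.cos(math.pi / 8), -math.sin(math.pi / 8))
```

Each micro-rotation needs only two additions and two shifts, which corresponds to the two adders per stage mentioned above; the final multiplication by the constant scaling factor can itself be built from shifts and adds.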

In the suggested chip, a complex multiplication involves three multiplications and five additions. The additions are performed by carry-select adders with a maximum delay of about 8 $T_{FA}$, where $T_{FA}$ denotes the delay of one full adder. The carry-select adders use 30 full adders for the initial 16-bit addition and 63 full adders for the final 33-bit addition. Wallace tree multipliers reduce the critical-path delay to 7 $T_{FA}$. A Wallace tree multiplier requires nearly 280 full adders, and two pipeline stages are inserted, one at the start and one at the end of the multiplication. Without pipelining, the CORDIC algorithm is slower, whereas the Wallace tree multiplication shortens the critical path of the complex multiplier. Considering the trade-off between speed and area, the FFT processor can achieve lower power consumption if a complex multiplier is used for the twiddle-factor multiplications.
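A complex multiplication with three real multiplications and five real additions can be arranged as shown below. This is one common arrangement; the text does not state which arrangement the chip uses, so the function should be read as an illustrative sketch.

```python
def complex_mul_3m5a(a, b, c, d):
    """Return (real, imag) of (a + jb)(c + jd) using 3 multiplications and 5 additions."""
    k1 = c * (a + b)           # addition 1, multiplication 1
    k2 = a * (d - c)           # addition 2, multiplication 2
    k3 = b * (c + d)           # addition 3, multiplication 3
    real = k1 - k3             # addition 4: ac - bd
    imag = k1 + k2             # addition 5: ad + bc
    return real, imag

if __name__ == "__main__":
    print(complex_mul_3m5a(3, 4, 5, 6))
    print((3 + 4j) * (5 + 6j))
```

Compared with the direct form (four multiplications and two additions), one multiplier is traded for three adders, which is attractive because a multiplier occupies far more area than an adder.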

4.4 CIRCUIT DESIGN

Because the variable-length FFT processor is an essential component of OFDM systems, its circuits must be designed to keep power consumption low.

4.4.1 Minimization of Word-length

In this FFT processor, the word lengths of the different signals are reduced according to their signal-to-noise ratio (SNR) requirements. To determine the optimum word length, input signals with Gaussian noise, represented in fixed-point arithmetic, are fed into the FFT. The frequency-domain output signals are collected and their SNR is calculated. Figure 4.6 (a) shows the output SNR as a function of the FFT input word length for various input SNR conditions. On this basis, the input word length is set to 9 bits. Figure 4.6 (b) shows the output SNR as a function of the twiddle-factor word length, which determines the accuracy of the sine and cosine tables, when the input SNR is 30 dB. A 9-bit word length is therefore also selected for the twiddle factors. The word-length reduction proceeds component by component until all word lengths inside the processor have been determined. To limit power consumption, true-single-phase-clock (TSPC) flip-flops are used in the ring counters.
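The kind of experiment described above can be sketched in simplified form. The model below quantizes only the input samples to a given word length, transforms both the exact and the quantized sequence with a floating-point DFT, and reports the resulting output SNR. It does not model the Gaussian input noise, the internal fixed-point arithmetic or the twiddle-factor word length of the actual study; the 64-point length, the uniform test signal and all names are assumptions made for illustration.

```python
import cmath
import math
import random

def dft(x):
    """Reference O(N^2) DFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def quantize(value, bits):
    """Round to a signed fixed-point grid with `bits` bits covering [-1, 1)."""
    scale = 2 ** (bits - 1)
    q = max(-scale, min(scale - 1, round(value * scale)))
    return q / scale

def output_snr_db(bits, n=64, seed=1):
    """Output SNR (dB) when only the FFT input is quantized to `bits` bits."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-0.9, 0.9), rng.uniform(-0.9, 0.9)) for _ in range(n)]
    xq = [complex(quantize(v.real, bits), quantize(v.imag, bits)) for v in x]
    exact, approx = dft(x), dft(xq)
    signal = sum(abs(v) ** 2 for v in exact)
    noise = sum(abs(a - b) ** 2 for a, b in zip(exact, approx))
    return 10 * math.log10(signal / noise)

if __name__ == "__main__":
    for bits in (6, 8, 9, 10, 12):
        print(bits, "bits:", round(output_snr_db(bits), 1), "dB")
```

In this simplified model each additional bit improves the output SNR by roughly 6 dB, so the chosen word length only needs to be large enough for the quantization noise to stay below the noise already present in the input signal.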

Fig. 4.6 (a) Output SNR with respect to input word length of FFT


Fig. 4.6 (b) Output SNR with respect to word length of Twiddle factors

4.4.2 Delay Line based on RAM

An FFT processor based on the single delay feedback (SDF) architecture requires many delay lines, which are conventionally implemented with shift registers, as shown in Figure 4.7 (a). The data move forward at every clock edge and 50 % of the registers change state, which wastes a large amount of power. SRAM is therefore used in place of the shift registers to reduce power consumption and area. A dual-port memory would be needed to perform the read and write accesses in a single clock cycle. Instead, two single-port SRAMs are implemented, since a single-port memory saves 33 % of the area of a dual-port memory. The use of a single-port SRAM is illustrated in Figure 4.7 (b). The read and write operations take place in the first and second halves of the clock cycle, respectively. Two registers are inserted, one before and one after the processing element, to prevent the SRAM data access from becoming the critical path.
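The difference between the two delay-line styles can be illustrated with a behavioural sketch. In the shift-register model every stored word moves on each clock, whereas in the RAM model only one location is read and then overwritten while an address counter advances. The class names and the activity counters are illustrative assumptions and do not model the actual circuit or its power.

```python
class ShiftRegisterDelay:
    """Delay line in which every word shifts by one position per clock."""

    def __init__(self, depth):
        self.regs = [0] * depth
        self.word_moves = 0                   # crude activity measure

    def step(self, x):
        out = self.regs[-1]
        self.regs = [x] + self.regs[:-1]      # all words move
        self.word_moves += len(self.regs)
        return out


class RamDelay:
    """Delay line built from a memory with a circulating address pointer."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.addr = 0
        self.accesses = 0                     # crude activity measure

    def step(self, x):
        out = self.mem[self.addr]             # read in the first half of the cycle
        self.mem[self.addr] = x               # write in the second half of the cycle
        self.addr = (self.addr + 1) % len(self.mem)
        self.accesses += 2
        return out


if __name__ == "__main__":
    sr, ram = ShiftRegisterDelay(4), RamDelay(4)
    data = list(range(1, 11))
    print([sr.step(v) for v in data])
    print([ram.step(v) for v in data])
    print(sr.word_moves, "word moves versus", ram.accesses, "memory accesses")
```

Both models produce the same delayed sequence, but the activity of the RAM version stays at two accesses per clock regardless of the depth, while the shift-register activity grows with the depth.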


Fig. 4.7 (a) Delay line based on shift-register (b) Delay line based on SRAM

4.4.3 Current-mode technique of SRAM

This technique can be used for reading and writing the contents of an SRAM cell in order to reduce power consumption [8]. The voltage levels on the bit lines and data lines of the SRAM are kept very small, which significantly reduces the dynamic power dissipation. The current-mode SRAM cell comprises seven transistors, one more than the traditional six-transistor cell, as shown in Figure 4.8 (a). The additional transistor, Meq, equalizes the output voltages of the two inverters, so that a small current difference can be detected through the access transistors activated by the word enable signal.


Fig. 4.8 (a) 7 T SRAM cell based on current-mode (b) SRAM write circuit

When the additional transistor is off, the SRAM cell functions as a traditional six-transistor memory cell. During a write access, the current difference ΔI is applied to the write data lines ‘wdp’ and ‘wdn’. The N-type current conveyor shown in Figure 4.8 (b) is activated by the signal ‘WY’, and the currents are passed to the bit lines ‘blp’ and ‘bln’ without attenuation. While the control signal ‘WY’ is active, the circuit between ‘wdp’ and ‘wdn’ is closed. The voltages on the data lines ‘wdp’ and ‘wdn’ are equal to $V_{DD} - (V_1 + V_2)$, so the voltage levels on these data lines remain very low. The read operation is performed by a sense amplifier and a column decoder in this SRAM.


4.4.4 Twiddle-factor ROM and Complex Multiplications

In this FFT processor, each multiplication by the twiddle factors $W_N^{N/8}$, $W_N^{3N/8}$, $W_N^{5N/8}$ and $W_N^{7N/8}$ is reduced to two real multiplications by the constant $\sqrt{2}/2$, which can be further reduced to shift-and-add operations. The block diagram of the twiddle-factor ROM is shown in Figure 4.9.
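A shift-and-add multiplication by $\sqrt{2}/2$ can be sketched as follows. The sketch assumes an 8-bit approximation of the constant, 181/256 = 0.10110101 in binary (about 0.70703 against the exact 0.70711); the coefficient width actually used in the chip is not stated, and the function name is illustrative.

```python
def mul_half_sqrt2(x):
    """Approximate x * sqrt(2)/2 for an integer x using shifts and adds only.

    Assumed coefficient: 181/256 = 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8.
    """
    return (x >> 1) + (x >> 3) + (x >> 4) + (x >> 6) + (x >> 8)

if __name__ == "__main__":
    for x in (256, 1000, 25600):
        print(x, mul_half_sqrt2(x), round(x * 2 ** 0.5 / 2, 2))
```

Five shifted terms and four additions replace a general multiplier. Because each right shift truncates, small inputs lose a few least significant bits, so extra guard bits would be kept in a real data path.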

4.5 RESULTS AND DISCUSSION

The complete architecture, excluding the SRAM components, is described in a hardware description language (VHDL). The SRAM layout comprises ring counters, control units and SRAM cells. The FFT processor is implemented in a 0.35 µm technology. The die photo of the chip is shown in Figure 4.10. The multipliers and the processing elements are marked ‘MUT’ and ‘Ux’ respectively, along with their twiddle-factor ROMs. Taking the circuit overheads into account, the longer delay lines are realized with SRAM and the shorter ones with registers. The complete details of the chip are given in Table 4.3. The FFT processor dissipates 176 mW at operating frequencies up to 17.8 MHz at 2.3 V, and 640 mW at operating frequencies up to 45 MHz at 3.3 V. It is difficult to make a fair comparison between designs that differ in FFT size, algorithm, supply voltage, power consumption, clock rate and CMOS technology [1]. Three parameters are therefore used for comparison, under the assumption that a 128-point FFT is performed. They are defined by

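$$\text{Normalized area} = \frac{\text{Area of 128-point FFT}}{\text{Technology}} \tag{4.5}$$

$$\text{FFT/Energy} = \frac{\text{Technology}}{\text{Power of 128-point FFT} \times \text{Execution time} \times 10^{-6}} \tag{4.6}$$

$$\text{Energy} \times \text{Time} = \frac{\text{Execution time}}{\text{FFT/Energy}} \tag{4.7}$$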


Fig. 4.9 Block diagram of Twiddle-factor ROM

Fig. 4.10 Die photo of suggested FFT processor


Table 4.3 Complete Summary of Chip

Process                               TSMC 0.35 µm 1P4M
Area                                  3.9 mm × 5.5 mm
Transistor count                      598,078
Maximum frequency                     45 MHz at 3.3 V
Power consumption (at high speed)     640 mW (at 45 MHz, 3.3 V)
Power consumption (at low voltage)    176 mW (at 17.8 MHz, 2.3 V)
Package                               68 PGA

It is inferred from Table 4.3 that the suggested chip achieves both a reduction in area and a good energy-time product as given by equation (4.7). At 1.1 V the FFT processor executes more slowly, and this voltage rules out high-speed applications in spite of the good energy efficiency.

4.6 SUMMARY

This chapter has discussed the architecture of an FFT processor suitable for communication systems such as digital transmission of audio or video signals and digital subscriber loops, which performs complex FFTs of variable length from 128 points upwards. The suggested FFT processor achieves efficient hardware utilization and low power consumption. The single delay feedback (SDF) FFT architecture needs the least number of delay elements, and the Radix-2 FFT uses shift-and-add operations in place of complex multipliers. The processor was realized in a 0.35 µm technology. The experimental results indicate that the processor consumes 640 mW at 3.3 V at up to 45 MHz, but only 176 mW at 17.8 MHz when the supply voltage is reduced to 2.3 V.
