
International Journal of Computing and Information Technology 3 (1) (2011): 21–32

A 64 POINT I/FFT PROCESSOR IMPLEMENTATION OVER FPGA

W. Saad1, N. El-Fishawy2, S. El-Rabaie3, and M. Shokair4

Abstract

Field Programmable Gate Array (FPGA) devices have become the technology of choice in small-volume modern systems, and their capabilities will continue to improve. Digital Signal Processing (DSP) is one of the most important applications of FPGAs, and the Fast Fourier Transform (FFT) is one of the most important tools in DSP. In this paper, a circuit that computes the FFT and its inverse is proposed and implemented on an FPGA. The circuit has unique implementation ideas. Furthermore, it is portable among different Electronic Design Automation (EDA) tools and is technology independent. In addition, it is pipelined and gives results that agree with MATLAB. Moreover, the computation time of the I/FFT operation is 17 µs.

1. INTRODUCTION

Field programmable gate arrays are digital integrated circuits (ICs) that contain programmable blocks of logic along with configurable interconnects between these blocks. Design engineers can program such devices to perform a tremendous variety of tasks. VHDL, which stands for VHSIC (very high-speed integrated circuit) Hardware Description Language, is commonly used as a design entry language for FPGA and ASIC devices. A recent trend has been to take the coarse-grained architectural approach a step further by combining the logic blocks and interconnects of traditional FPGAs with embedded microprocessors and related peripherals to form a complete system on a programmable chip. High-performance and scientific computing applications are typically compute-intensive and require high throughput rates. Applications of FPGAs include DSP, software-defined radio, aerospace and defense systems, Application-Specific Integrated Circuit (ASIC) prototyping, medical imaging, computer vision, speech recognition, computer hardware emulation, and a growing range of other areas [1].

Dept. of Electronic and Communication Eng., Faculty of Electronic Engineering, El-Menufiya University, Egypt.


The Discrete Fourier Transform (DFT) is one of the most important tools in DSP. It has many applications, such as signal analysis, Orthogonal Frequency Division Multiplexing (OFDM), and WLAN.

The DFT is a specific kind of Fourier transform used in Fourier analysis. It transforms one function into another, which is called the frequency domain representation.

The FFT is a fast implementation of the DFT that relies on mathematical simplification and classification of the input sequence to achieve its performance gain. Computing a DFT of N points takes O(N²) arithmetic operations, while an FFT can compute the same result in only O(N log N) operations.

In this paper, the FFT algorithm and its inverse are designed in one circuit. The circuit is portable, which means that it can be implemented on any FPGA technology. In addition, the circuit is pipelined and has unique implementation ideas.

This paper is organized as follows. Section 2 gives an overview of the FFT and its inverse, with their definitions and algorithms. Section 3 presents a comprehensive discussion of the designed I/FFT processor architecture. The circuit implementation results and the timing simulation tests are given in Section 4. Finally, conclusions are drawn in Section 5.

2. FAST FOURIER TRANSFORM

The N-point DFT of a finite-duration sequence x(n) and its inverse are defined as follows [2]:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (1)$$

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad n = 0, 1, \ldots, N-1 \qquad (2)$$

where $W_N = e^{-j(2\pi/N)}$ is referred to as the twiddle factor, N is the transform size, and $j = \sqrt{-1}$.

From equations (1) and (2), the DFT and the IDFT look very similar except for two differences. First, the division by N appears only in the IDFT. Second, the sign of the exponent is inverted. Consequently, an IDFT circuit can be built from an existing DFT circuit. This is done by dividing the output by N and, additionally, swapping the real and imaginary parts, as shown in Figure 1.

Fig. 1: IDFT Circuit Made by DFT Module
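The reason the swap works can be sketched in one line. The notation swap(a + jb) = b + ja is introduced here only for illustration; it is not used in the paper. Since swap(z) equals j times the complex conjugate of z, applying the DFT of equation (1) to the swapped spectrum gives

$$\operatorname{swap}(z) = j\,z^{*}, \qquad \sum_{k=0}^{N-1} \operatorname{swap}\big(X(k)\big)\, W_N^{nk} = j \Big( \sum_{k=0}^{N-1} X(k)\, W_N^{-nk} \Big)^{*} = j\,\big(N\,x(n)\big)^{*} = \operatorname{swap}\big(N\,x(n)\big)$$

so swapping the real and imaginary parts of the DFT output once more and dividing by N recovers x(n) as defined in equation (2).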


Fig. 2: Radix-4 DIF DFT Equations

The FFT is an efficient algorithm to compute the DFT and its inverse [3, 4]. There are two families of algorithms: Decimation In Time (DIT) and Decimation In Frequency (DIF). The basic idea of these algorithms is to break up the N-point DFT into successively smaller DFTs, each computed by a unit known as a butterfly. The smallest transform used is a 2-point DFT, known as the radix-2 butterfly. In this paper, the radix-4 DIF butterfly (that is, a 4-point DFT) is used, because radix-4 is more efficient than radix-2 for N ≥ 16 points [2].

To decompose X(k) into 4-point DFTs, the DFT sequence is subdivided into four N/4-point subsequences, X(4k), X(4k + 1), X(4k + 2), and X(4k + 3), k = 0, 1, ..., N/4 − 1. Thus the radix-4 decimation-in-frequency DFT is obtained as shown in Figure 2.
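Figure 2 is not reproduced in this text. For reference, the standard textbook form of the radix-4 DIF decomposition, written in the notation of equations (1) and (2), is given below. It is a general statement of the algorithm and is not copied from the figure.

$$
\begin{aligned}
X(4k)   &= \sum_{n=0}^{N/4-1} \big[\, x(n) + x(n + \tfrac{N}{4}) + x(n + \tfrac{N}{2}) + x(n + \tfrac{3N}{4}) \,\big]\; W_{N/4}^{nk} \\
X(4k+1) &= \sum_{n=0}^{N/4-1} \big[\, x(n) - j\,x(n + \tfrac{N}{4}) - x(n + \tfrac{N}{2}) + j\,x(n + \tfrac{3N}{4}) \,\big]\; W_N^{n}\, W_{N/4}^{nk} \\
X(4k+2) &= \sum_{n=0}^{N/4-1} \big[\, x(n) - x(n + \tfrac{N}{4}) + x(n + \tfrac{N}{2}) - x(n + \tfrac{3N}{4}) \,\big]\; W_N^{2n}\, W_{N/4}^{nk} \\
X(4k+3) &= \sum_{n=0}^{N/4-1} \big[\, x(n) + j\,x(n + \tfrac{N}{4}) - x(n + \tfrac{N}{2}) - j\,x(n + \tfrac{3N}{4}) \,\big]\; W_N^{3n}\, W_{N/4}^{nk}
\end{aligned}
$$

Each bracket is one radix-4 butterfly output, the factor $W_N^{pn}$ is the twiddle factor applied after the butterfly, and the remaining sum is an N/4-point DFT that is decomposed again in the next stage.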

3. PROPOSED ARCHITECTURE DESCRIPTION

There are two methods for implementing circuits on FPGAs. One is the fully spread method, and the other is the reuse method. The first method has the advantages of simple control and high speed, but it requires a very large area. In contrast, the second method is slow and requires very complicated control, but it saves a large amount of hardware area. In this paper, the proposed circuit merges the two methods to combine the advantages of both.


Fig. 3: I/FFT Core Interface

Table 1: I/O Pins Functionality Description

| Pin name | Direction | Description |
|---|---|---|
| clk | input | Master clock (active on the rising edge). |
| rst | input | Master asynchronous reset (active high). |
| en | input | I/FFT start signal (active high). “en” is asserted to begin the data loading and transform calculations. |
| sel | input | Switches between the FFT and IFFT circuits. The FFT processor is selected when sel is ‘0’; otherwise, the IFFT processor is chosen. |
| xi(11:0) | input | The real part of the input data, in 12-bit floating point format. |
| xq(11:0) | input | The imaginary part of the input data, in 12-bit floating point format. |
| out_head | output | I/FFT calculation finished (active high). |
| I(11:0) | output | The real part of the output data, in 12-bit floating point format. |
| Q(11:0) | output | The imaginary part of the output data, in 12-bit floating point format. |

3.1. General Description

Unlike the previous implementations of the 8-point FFT shown in [5, 6], the pipelined 16-point FFT discussed in [7], and the 64-point FFT hardware explained in [8, 9], a pipelined 64-point I/FFT processor is implemented in this paper using DSP multiplier blocks.

The interface of the proposed I/FFT circuit is shown in Figure 3. The functionality of its pins is described in Table 1.


The input data to the I/FFT processor is a vector of 64 complex values, with 12 bits for the real part (xi) and 12 bits for the imaginary part (xq), represented in a floating point format. This bit width gives an acceptable accuracy. All 64 points are input to the I/FFT processor serially, as shown in Figure 3. In this design, the first input point (xi and xq) is applied one clock cycle after the “en” signal goes high.

The I/FFT output is also serial. The head of the series is indicated by the “out_head” signal: when the first output of the 64-point vector comes out, “out_head” goes high. The output is reset when the “rst” signal goes high. The output data is the FFT of the input data when the “sel” signal is ‘0’. Conversely, when “sel” is high, the output data is the IFFT of the input data.

As described in the previous section, the IFFT circuit can be obtained from the FFT circuit in two steps, as shown in Figure 1:

• The real part (xi) and the imaginary part (xq) of the input data are swapped.

• The result is divided by N = 64, which is equivalent to multiplication by 1/64.

Fig. 4: General Structure of I/FFT Processor

Fig. 5: FFT Circuit Architecture


Therefore, the I/FFT processor can be implemented using only the FFT circuit, two floating point multipliers, and four multiplexers, as shown in Figure 4.
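A behavioural sketch of this wrapper is given below in Python. It is only an illustration under stated assumptions: a direct DFT stands in for the FFT core, the function names `dft`, `swap`, and `i_fft_processor` are hypothetical, and the placement of the multiplexers (two on the inputs, two on the outputs) is a guess consistent with the component count above. Note that the identity only holds if the real and imaginary parts are swapped at the output as well as at the input.

```python
import cmath


def dft(x):
    """Direct N-point DFT, used here as a stand-in for the FFT core."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def swap(z):
    """Exchange the real and imaginary parts (the job of a multiplexer pair)."""
    return complex(z.imag, z.real)


def i_fft_processor(x, sel):
    """sel == 0 selects the FFT; any other value selects the IFFT."""
    if sel == 0:
        return dft(x)
    n = len(x)
    swapped_in = [swap(z) for z in x]           # assumed input multiplexers
    y = dft(swapped_in)                         # the same FFT core is reused
    return [swap(z) * (1.0 / n) for z in y]     # assumed output muxes and 1/N multipliers


# Round trip on a small test vector: IFFT(FFT(x)) should return x.
x = [complex(n, -n) for n in range(8)]
round_trip = i_fft_processor(i_fft_processor(x, 0), 1)
```

Scaling the real and imaginary outputs by 1/N separately is what would account for two real multipliers in hardware.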

3.2. FFT Circuit Architecture

The architecture of the FFT circuit is shown in Figure 5. It consists of three stages. The input data is stored in a Random Access Memory (RAM) with 64 locations so that it can be appropriately reordered for the stage 1 circuit. In the same way, the outputs of stage 1 and stage 2 are written to RAMs to be reordered for stage 2 and stage 3, respectively. To let the FFT circuit operate in a continuous (pipelined) manner, the RAMs are doubled: while one RAM stores the incoming data vector, the next stage reads its input from the other one. This is done with a simple multiplexer that switches once per data vector. In addition, a control unit controls all RAM addressing, the stages, and the multiplexer switching.
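The doubled-RAM (ping-pong) arrangement can be modelled in a few lines of Python. This is a minimal behavioural sketch, not the paper's VHDL; the class name `PingPongRam` and its methods are hypothetical.

```python
class PingPongRam:
    """Two 64-location banks: one is written while the other is read."""

    def __init__(self, depth=64):
        self.banks = [[0] * depth, [0] * depth]
        self.write_bank = 0  # multiplexer select, toggled once per data vector

    def write(self, addr, value):
        self.banks[self.write_bank][addr] = value

    def read(self, addr):
        # The next stage always reads from the bank that is not being written.
        return self.banks[1 - self.write_bank][addr]

    def switch(self):
        self.write_bank = 1 - self.write_bank


ram = PingPongRam()
for a in range(64):
    ram.write(a, a)          # store the first vector
ram.switch()
for a in range(64):
    ram.write(a, 100 + a)    # store the second vector ...
first_vector = [ram.read(a) for a in range(64)]  # ... while the first is read
```

Because reads and writes never touch the same bank, a new vector can be loaded while the previous one is still being consumed, which is what keeps the pipeline full.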

3.3. Stages Architecture

Stages 1 and 2 are similar to each other, except for the Read Only Memory (ROM) contents. Stage 1 consists of a serial-to-parallel converter, a radix-4 butterfly circuit, a complex multiplier, and a ROM, as shown in Figure 6. The reordered input data is applied to stage 1 serially. The serial-to-parallel circuit then converts each group of four serial input points into four parallel points, which are applied to the radix-4 butterfly circuit. The butterfly circuit outputs four points in series. Each butterfly output is then multiplied by the appropriate twiddle factor, which is stored in the ROM. The ROM addressing is controlled by the control unit. Stage 2 has the same structure as stage 1, except for the twiddle factor values stored in the ROM.

Stage 3 is simpler than stages 1 and 2 because all the twiddle factors are unity in this stage. Therefore, both the ROM and the complex multiplier can be removed to reduce the circuit area. Thus, stage 3 consists of only the serial-to-parallel and butterfly circuits. The output of stage 3 must be reordered before it is delivered, which is the function of the control unit.
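The three-stage data flow can be checked with a small software model. The sketch below is an assumption-laden illustration in double-precision Python rather than the paper's 12-bit hardware: it runs an in-place radix-4 DIF FFT with butterfly spans of 16, 4, and 1, applies twiddle factors after the butterflies of the first two stages, and reorders the result at the end. The function names are hypothetical.

```python
import cmath

N = 64  # transform size used in the paper


def butterfly4(a, b, c, d):
    """4-point DFT: out[p] is the sum over m of in[m] * (-j)**(m*p)."""
    return (
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    )


def digit_reverse(k):
    """Exchange the top two and bottom two bits of a 6-bit index."""
    return ((k & 0b11) << 4) | (k & 0b001100) | (k >> 4)


def fft64_radix4_dif(x):
    buf = list(x)
    span = N // 4  # 16 in stage 1, 4 in stage 2, 1 in stage 3
    while span >= 1:
        size = 4 * span  # length of the sub-transform handled at this stage
        for base in range(0, N, size):
            for n in range(span):
                idx = [base + n + m * span for m in range(4)]
                outs = butterfly4(*(buf[i] for i in idx))
                for p, i in enumerate(idx):
                    # Twiddle W_size^(n*p); it equals 1 for every point in stage 3.
                    buf[i] = outs[p] * cmath.exp(-2j * cmath.pi * n * p / size)
        span //= 4
    # In-place results come out in base-4 digit-reversed order.
    return [buf[digit_reverse(k)] for k in range(N)]


# Constant input 1 + j: all the energy should land in X(0) = 64 + 64j.
spectrum = fft64_radix4_dif([complex(1, 1)] * N)
```

In this model the stage 3 twiddle factors are all 1, which is consistent with the statement above that the ROM and complex multiplier can be dropped from that stage.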

3.4. Radix-4 Butterfly Circuit Architecture

Each stage in the FFT circuit consists of one butterfly using the radix-4 decimation-in-frequency algorithm, with the basic structure shown in Figure 7. The butterfly circuit has to multiply the inputs by 1, j, –1, and –j. However, these multiplications only exchange the real and imaginary components and change signs, so the actual implementation can be done with multiplexers and floating point adders/subtractors. In this paper, the butterfly consists of four floating point adders and two floating point adder/subtractors, in addition to a state machine that switches the butterfly inputs to the floating point adders and controls the selection of the adder/subtractors.
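The multiplier-free property can be illustrated with a short Python sketch in which complex values are (real, imaginary) pairs and only additions, subtractions, swaps, and sign changes are used. The helper names are hypothetical, and the two-level grouping of the additions is one common arrangement; it is not claimed to match the adder allocation of the paper's circuit.

```python
def add(p, q):
    return (p[0] + q[0], p[1] + q[1])


def sub(p, q):
    return (p[0] - q[0], p[1] - q[1])


def times_minus_j(p):
    """-j * (re + j*im) = im - j*re: a swap plus one sign change."""
    return (p[1], -p[0])


def times_plus_j(p):
    """+j * (re + j*im) = -im + j*re: a swap plus one sign change."""
    return (-p[1], p[0])


def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) without any multiplier."""
    t0 = add(a, c)
    t1 = sub(a, c)
    t2 = add(b, d)
    t3 = sub(b, d)
    return (
        add(t0, t2),                  # a + b + c + d
        add(t1, times_minus_j(t3)),   # a - jb - c + jd
        sub(t0, t2),                  # a - b + c - d
        add(t1, times_plus_j(t3)),    # a + jb - c - jd
    )


outputs = radix4_butterfly((0, 0), (1, 0), (0, 0), (0, 0))
```

With only the second input set to 1, the four outputs are 1, –j, –1, and j, which are exactly the factors listed above.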


Fig. 6: Stage 1 Architecture

Fig. 7: Radix-4 Butterfly Circuit

3.5. Memories

The memories can be categorized into RAMs and ROMs. The pipelined FFT circuit contains eight RAMs, as shown in Figure 5. Each RAM is 24 bits wide and 64 locations deep. If the RAMs are implemented using D flip-flop (DFF) elements, 12,288 DFFs are needed. Half of this number can be saved if pipelining is not required. All the RAMs used are dual-port RAMs, in which the read address is independent of the write address.
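The flip-flop count follows directly from the RAM dimensions, with each pair of RAMs reduced to a single RAM when the double buffering is dropped:

$$8 \times 64 \times 24 = 12{,}288 \ \text{DFFs (pipelined)}, \qquad 4 \times 64 \times 24 = 6{,}144 \ \text{DFFs (not pipelined)}$$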

The other storage memories are ROMs. The FFT circuit contains only two ROMs, each with the same size as a RAM. They are used to store the values of the twiddle factors. The addressing of all RAMs and ROMs and the write enable control are determined by the control unit circuit.


3.6. The Control Unit

The control unit circuit is the most important unit in the FFT circuit because it governs nearly all the other units. It controls the RAM write enables, the read and write addressing, the ROM addressing, the indication signal “out_head”, and the multiplexer switching between RAMs.

The first pair of RAMs successively stores the input words (64 points each) to be processed. To do the I/FFT calculations correctly, the read addresses for these RAMs should be i, i + 16, i + 32, and i + 48, where “i” runs from 0 to 15.

The next pair of RAMs alternately stores the output words from stage 1, which are written using the same addresses as those used for reading the first RAMs. However, the read addresses must suit the next stage: they have to be i + ii, i + ii + 4, i + ii + 8, and i + ii + 12, where “ii” runs from 0 to 3 for each value of “i”, which varies from 0 to 48 in steps of 16.

The next pair of RAMs holds the outputs from stage 2, written with the same addresses as those used for reading the previous pair of RAMs. The read addresses should be i, i + 1, i + 2, and i + 3 to suit stage 3, where “i” runs from 0 to 60 in steps of 4.

The last pair of RAMs stores the outputs from stage 3 and produces the output words serially. The write addresses are the same as the read addresses of the previous pair of RAMs. At stage 3, the output X(iii + 4ii + 16i) is computed, whereas the original index should be X(i + 4ii + 16iii), where i, ii, and iii run from 0 to 3. Therefore, the read addresses must be reordered to restore the correct index. This is done very simply: the first two bits and the last two bits of the read address (6 bits, to cover the 64 locations) are exchanged.
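The address sequences described above can be generated with a few lines of Python. This is a sketch of the addressing arithmetic only; the function names are hypothetical, and the output reordering is interpreted here as exchanging the two most significant bits of the 6-bit address with the two least significant bits.

```python
def stage1_reads():
    """Groups i, i+16, i+32, i+48 for i = 0..15."""
    return [[i + 16 * m for m in range(4)] for i in range(16)]


def stage2_reads():
    """Groups i+ii, i+ii+4, i+ii+8, i+ii+12 for i = 0, 16, 32, 48 and ii = 0..3."""
    return [[i + ii + 4 * m for m in range(4)]
            for i in range(0, 64, 16) for ii in range(4)]


def stage3_reads():
    """Groups i, i+1, i+2, i+3 for i = 0, 4, ..., 60."""
    return [[i + m for m in range(4)] for i in range(0, 64, 4)]


def reorder(addr):
    """Exchange the top two and bottom two bits of a 6-bit address."""
    hi = (addr >> 4) & 0b11
    mid = (addr >> 2) & 0b11
    lo = addr & 0b11
    return (lo << 4) | (mid << 2) | hi


# Read order for the last pair of RAMs so that the outputs appear as X(0), X(1), ...
output_read_order = [reorder(k) for k in range(64)]
```

Each stage touches every one of the 64 locations exactly once, and applying `reorder` twice returns the original address, so the same bit exchange serves for both directions of the mapping.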

The alternation between the RAMs in one pair is done by changing the multiplexer selection signal for each word. In addition, the read addresses of the two ROMs are simply incremented on each clock, because the ROMs are filled with the appropriate twiddle factors in order. Furthermore, when the first point of each word is produced, the “out_head” signal goes high.

The control unit is implemented in this paper with only 169 configurable logic block (CLB) slices, i.e., only 1.23% utilization of the FPGA device used.

4. RESULTS

The VHDL code has been written following the recommendations of [10, 11, 12]. No pre-designed components from the libraries of any EDA vendor have been used. Therefore, this design can be used with any synthesis tool and is portable among different technologies.


The system has been synthesized using the “Precision” EDA tool (provided with Mentor Graphics FPGA Advantage 7.0 PS). The synthesis results are reported in Table 2. Most of the used area is occupied by the memories. If block RAMs and ROMs are used, the number of occupied DFFs is reduced substantially, from 13614 to 1326. Only 198 DFFs are occupied by the control unit block, i.e., 0.68% device utilization. Subsequently, the system has been placed and routed using the “ISE 10.1” EDA tool to fit the xc2vp30 device, which belongs to the Virtex2p family from Xilinx. The mapping results are displayed in Table 3. Afterwards, post-routing timing simulation has been executed using the “Modelsim SE” EDA tool, as shown in Figure 9.

Several modules have been designed to test the circuit reliability. In the method used in this paper, the FFT and IFFT modules are each tested independently, and the results are compared with the results computed by MATLAB, as shown in Figure 8.

An example has been taken to test the circuit. The complex input is a vector of 64 points, as shown in Table 4. The MATLAB output is compared with the output obtained from the designed processor. The results of the designed processor shown in Figure 9 match the MATLAB output, which confirms the reliability of the I/FFT processor. The point error rate between the MATLAB results and the designed processor results gives quantitative evidence for the performance of the proposed processor. The point error rate is defined as the number of error points divided by the total number of points (64 points). For the proposed I/FFT processor, the point error rate is zero for this test vector, which means that the processor output agrees with MATLAB at every point.
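The point error rate as defined above can be written as a short Python function. The name `point_error_rate` and the tolerance parameter `tol` are assumptions added for illustration; the paper does not state how equality between a MATLAB point and a processor point is decided.

```python
def point_error_rate(reference, measured, tol=0.0):
    """Fraction of points whose difference from the reference exceeds tol."""
    if len(reference) != len(measured):
        raise ValueError("both vectors must have the same number of points")
    errors = sum(1 for r, m in zip(reference, measured) if abs(r - m) > tol)
    return errors / len(reference)


# FFT rows of Table 4: 64 + 64j at index 0 and zeros elsewhere, for both columns.
matlab_out = [complex(64, 64)] + [0j] * 63
circuit_out = [complex(64, 64)] + [0j] * 63
rate = point_error_rate(matlab_out, circuit_out)
```

A single matching test vector gives a rate of zero, as reported; a stricter assessment would use inputs whose exact results are not representable in the 12-bit format and a nonzero tolerance.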

Table 2: Synthesis Report

Device Utilization for 2VP30 896

| Resource | Used | Available | Utilization |
|---|---|---|---|
| IOs | 52 | 556 | 9.35% |
| Global Buffers | 1 | 16 | 6.25% |
| Function Generators | 13112 | 27392 | 47.87% |
| CLB Slices | 6807 | 13696 | 49.70% |
| Dffs or Latches | 13614 | 29060 | 46.85% |
| Block RAMs | 0 | 136 | 0.00% |
| Block Multipliers | 8 | 136 | 5.88% |
| Block Multiplier Dffs | 0 | 4896 | 0.00% |


Table 3: Mapping Report

Device Utilization for 2VP30 896

| Resource | Used | Available | Utilization |
|---|---|---|---|
| Number of bonded IOBs | 53 | 556 | 9% |
| Number of BUFGMUXs | 1 | 16 | 6% |
| Number of Slice Flip Flops | 13,611 | 27,392 | 49% |
| Number of occupied Slices | 12,883 | 13,696 | 94% |
| Number of MULT18X18s | 8 | 136 | 5% |


Fig. 8: (a) MATLAB I/FFT Module. (b) Virtex2p I/FFT Module

Fig. 9: Timing Simulation


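Table 4: Comparison Between MATLAB and I/FFT Processor Results

| Point | Input (real) | Input (imaginary) | MATLAB output (real) | MATLAB output (imaginary) | FFT circuit output (real) | FFT circuit output (imaginary) |
|---|---|---|---|---|---|---|
| x(0) | 1 | 1 | 64 | 64 | 64 | 64 |
| x(1) | 1 | 1 | 0 | 0 | 0 | 0 |
| x(2) | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| x(63) | 1 | 1 | 0 | 0 | 0 | 0 |

| Point | Input (real) | Input (imaginary) | MATLAB output (real) | MATLAB output (imaginary) | IFFT circuit output (real) | IFFT circuit output (imaginary) |
|---|---|---|---|---|---|---|
| x(0) | 64 | 64 | 64 | 64 | 64 | 64 |
| x(1) | 64 | 64 | 0 | 0 | 0 | 0 |
| x(2) | 64 | 64 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| x(63) | 64 | 64 | 0 | 0 | 0 | 0 |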

1 = “001111000000” and 64 = “010101000000” in floating point.
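The paper does not spell out the field layout of its 12-bit floating point format. The two encodings quoted above are consistent with 1 sign bit, 5 exponent bits with a bias of 15, and 6 fraction bits with an implicit leading one. The Python sketch below decodes a bit string under that assumed layout; the function name `decode_float12` is hypothetical, and zero or other special encodings are not handled because the source gives no information about them.

```python
def decode_float12(bits):
    """Decode a 12-bit string under an ASSUMED 1/5/6 layout with exponent bias 15."""
    if len(bits) != 12 or set(bits) - {"0", "1"}:
        raise ValueError("expected a 12-character string of 0s and 1s")
    sign = int(bits[0], 2)
    exponent = int(bits[1:6], 2)    # assumed 5-bit biased exponent
    fraction = int(bits[6:], 2)     # assumed 6-bit fraction, implicit leading 1
    magnitude = (1 + fraction / 64) * 2.0 ** (exponent - 15)
    return -magnitude if sign else magnitude


one = decode_float12("001111000000")
sixty_four = decode_float12("010101000000")
```

Under this reading the first string has exponent 15 (2 to the power 0) and the second has exponent 21 (2 to the power 6), which reproduces the values 1 and 64 stated above.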

Fig. 10: Task Waveforms

In this paper, the first output of the 64-point I/FFT processor appears at the fourth “out_head” pulse, as shown in Figure 10. The total conversion time of the 64-point I/FFT is 276 clock cycles. The duration of a clock cycle depends on the chip used; in this paper, the clock frequency is 16.234 MHz, giving a conversion time of 17 µs.

In this work, the I/FFT computation time is shorter than that of comparable work in the same field: in [8], the execution time for the 64-point FFT was 177 µs at an operating frequency of 24.87 MHz.
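The conversion time follows from the cycle count and the clock frequency quoted above:

$$t = \frac{276 \ \text{cycles}}{16.234 \times 10^{6} \ \text{cycles/s}} \approx 17.0 \times 10^{-6} \ \text{s}$$

that is, about 17 microseconds per 64-point transform.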


5. CONCLUSIONS

In this paper, a pipelined, portable, and technology-independent I/FFT processor has been implemented on an FPGA. A radix-4 decimation-in-frequency butterfly has been used in the FFT circuit design. The values of the twiddle factors are stored in ROMs. The most complicated part of the design is the control unit circuit. The designed processor is reliable, pipelined, and portable.

REFERENCES

[1] S. Brown and J. Rose, “FPGA and CPLD Architectures: A Tutorial”, IEEE Design and Test of Computers, 13, No. 2, pp. 42-57, 1996.

[2] (2009, August). [Online]. Available: http://www.cmlab.csie.ntu.edu.tw/ cml/dsp/training/coding/transform/t.html

[3] J. Cooley and J. Tukey, “An Algorithm for the Machine Calculation of the Complex Fourier Series”, Math. of Computation, 19, pp. 297-301, 1965.

[4] E. Brigham, The Fast Fourier Transform and Its Applications. Prentice Hall, 1998.

[5] K. A. B. Kadiran, “Design and Implementation of OFDM Transmitter and Receiver on FPGA Hardware”, Master’s thesis, University Teknology, Malaysia, 2005.

[6] U. M. Baese, Digital Signal Processing With Field Programmable Gate Array. Springer, 2007.

[7] S. Kishk, A. Mansour, and M. Eldin, “Implementation of an OFDM System Using FPGA”, in 26th National Radio Science Conference, Egypt, 2009.

[8] E. E. Fabris, G. A. Hoffmann, A. Susin, and L. Carro, “A Bit-serial FFT Processor”, in VII Workshop IBERCHIP, Uruguay, 2001.

[9] C. G. Concejero, V. Rodellar, A. A. Marquina, E. M. de Icaya, and P. Gomez-Vilda, “A Portable Hardware Design of a FFT Algorithm”, Latin American Applied Research, 37, pp. 79-82, 2007.

[10] M. Keating and P. Bricaud, Reuse Methodology Manual: For System on a Chip Designs. Kluwer Academic Publishers, 2002.

[11] V. A. Pedroni, Circuit Design With VHDL. Massachusetts Institute of Technology, 2004.

[12] P. P. Chu, FPGA Prototyping By VHDL Examples. John Wiley and Sons, Inc., 2008.