low-complexity multi-mode memory-based fft …soc.inha.ac.kr/images/ieice-fft-pub201111.pdf2376...

IEICE TRANS. FUNDAMENTALS, VOL.E94–A, NO.11 NOVEMBER 2011

PAPER

Low-Complexity Multi-Mode Memory-Based FFT Processor for DVB-T2 Applications

Kisun JUNG†, Nonmember and Hanho LEE†a), Member

SUMMARY This paper presents a low-complexity multi-mode fast Fourier transform (FFT) processor for Digital Video Broadcasting-Terrestrial 2 (DVB-T2) systems. DVB-T2 operation requires a multi-mode FFT processor supporting 1K/2K/4K/8K/16K/32K points. The proposed architecture employs a pipelined shared-memory architecture in which radix-2/2^2/2^3/2^4 FFT algorithms, a multi-path delay commutator (MDC), and a novel data scaling approach are exploited. Based on this architecture, a novel low-cost data scaling unit is proposed to increase area efficiency, and an elaborate memory configuration scheme is designed to allow the use of single-port SRAM without degrading the throughput rate. Also, a new scheduling method for the twiddle factors is proposed to reduce the area. The SQNR performance of the 32K-point FFT mode is about 45.3 dB at an 11-bit internal word length for 256QAM modulation. The proposed FFT processor has lower hardware complexity and a smaller memory size than conventional FFT processors.
key words: DVB-T2, FFT, multi-mode, scaling, pipelined, shared memory

1. Introduction

Digital Video Broadcasting-Terrestrial 2 (DVB-T2) is the next development of the DVB-T standard [1]. Building on the success and technology of DVB-T, it provides additional facilities and features in line with the developing Digital Terrestrial Television (DTT) market. DVB-T2 promises a 30% to 50% increase in capacity under conditions equivalent to those already used for DVB-T. It also allows additional features and services such as High Definition Television (HDTV). In the physical (PHY) layer design of DVB-T2, orthogonal frequency division multiplexing (OFDM) is one of the main technologies, and the fast Fourier transform (FFT) is the key component. Previous DVB-T/H systems use a 2048/4096/8192-point FFT processor with a specified symbol duration of 224 μs/448 μs/896 μs, together with QPSK/16QAM/64QAM modulation [2], [3]. DVB-T2 adds three modes, 1K/16K/32K (112 μs/1792 μs/3584 μs), to the three DVB-T/H modes. Therefore, an FFT processor for DVB-T2 systems must be able to work in 1K/2K/4K/8K/16K/32K-point multiple modes. DVB-T2 also adds 256QAM modulation.
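The six symbol durations quoted above all follow from a single sample period. The short sketch below checks this, assuming the sample period implied by the 2K figures (224 μs / 2048 points); the names `QUOTED_US`, `SAMPLE_PERIOD_US`, and `symbol_duration_us` are introduced here for illustration only.

```python
from fractions import Fraction

# Symbol durations quoted in the text, in microseconds, keyed by FFT size.
QUOTED_US = {1024: 112, 2048: 224, 4096: 448, 8192: 896, 16384: 1792, 32768: 3584}

# Assumption: sample period implied by the 2K mode (224 us over 2048 points).
SAMPLE_PERIOD_US = Fraction(224, 2048)

def symbol_duration_us(n_points):
    """Useful symbol duration for an n_points FFT at the assumed sample period."""
    return n_points * SAMPLE_PERIOD_US

for n_points, quoted in QUOTED_US.items():
    print(n_points, symbol_duration_us(n_points), quoted)
```

Because the duration grows linearly with the FFT size, the 32K mode leaves the processor 32 times as long per symbol as the 1K mode, which is what makes a memory-based architecture with several passes over the data feasible for the long modes.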

The design of efficient FFT architectures has been actively investigated over the last few decades. Generally, FFT architectures can be divided into two categories: pipelined structures (such as the multi-path delay commutator (MDC), multi-path delay feedback (MDF), or single-path delay feedback (SDF) architectures) [4]–[6], and memory-based structures [7], [8]. Pipelined architectures have the advantage of high throughput, but demand a high area cost, especially for long FFTs. Memory-based FFT architectures usually contain one butterfly processing element (PE), memory banks, and control logic. Although they have low area costs, their throughput is often limited by the available number of PEs and the memory access bandwidth. Memory-based architectures can be further divided into two categories: the cached-memory architecture [7] and the pipelined shared-memory architecture [8].

Manuscript received January 14, 2011.
Manuscript revised June 14, 2011.
†The authors are with the School of Information and Communication Engineering, Inha University, Incheon, 402-751, Korea.
a) E-mail: [email protected]
DOI: 10.1587/transfun.E94.A.2376

When designing a long FFT processor, one still has to consider its power consumption and hardware cost. Data access in memory and the operation of the complex multipliers together account for more than 75% of the total power consumption of the FFT processor [9]. A familiar and effective way to reduce the power consumption of the FFT memories and complex multipliers is to reduce the number of memory accesses, the internal word length, and the number of complex multiplier operations.

In this paper, we propose a new multi-mode memory-based FFT processor to minimize hardware complexity and to improve clock speed. It locally employs a pipelined architecture to realize the Butterfly Unit (BU) and globally utilizes the pipeline in a memory-based architecture to save area. This work uses a multi-path delay commutator (MDC) architecture. By doubling the pipelined processing path, the execution speed can be balanced efficiently against hardware cost. Also, the use of the radix-2^4 algorithm reduces the number of memory accesses. However, a longer word length is needed to maintain a sufficient signal-to-quantization noise ratio (SQNR) in a fixed-point long FFT processor. To overcome this problem, we propose a new scaling approach, called the indexed block scaling approach, which maintains a sufficient SQNR without increasing the word length. The memory occupies a large share of the chip area and increases power consumption, especially in long FFTs. The area occupied by the memory module is proportional not only to the amount of stored data and the word length but also to the number of ports. Thus, single-port memory is used in the FFT processor to reduce area significantly [10].

The rest of this paper is organized as follows. Section 2 describes the design considerations for the FFT processor and proposes a new data scaling approach. Section 3 presents the proposed pipelined shared-memory architecture and describes several techniques to reduce area in practical implementation. In Sect. 4, the implementation and comparison are presented. Finally, conclusions are drawn in Sect. 5.

Copyright © 2011 The Institute of Electronics, Information and Communication Engineers

2. Design Consideration for the FFT Processor

2.1 FFT Algorithms

In designing an FFT processor, one can first determine the required specifications of the target FFT processor as a function of the FFT radix r, and then decide on the most suitable radix. Table 1 shows the computational complexity of the complex multipliers for several FFT algorithms in 32K mode, in which the radix-2^4 algorithm has the lowest computational complexity. Thus, the proposed FFT processor employs the radix-2^4 decimation in frequency (DIF) algorithm for most operations. Depending on the FFT size, however, the radix-2^4 DIF FFT algorithm cannot be used in some operations, and the radix-2/2^2/2^3 DIF FFT algorithms are used in those parts of the operation. The radix-2/2^2/2^3/2^4 algorithms are explained in detail in [4], [5], [11], [12].
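As a reference point for the DIF family discussed above, the following is a minimal floating-point sketch of the plain radix-2 DIF FFT. It is not the radix-2^4 hardware algorithm of the proposed processor, which regroups the same butterflies so that fewer non-trivial twiddle multiplications are needed; the function names are introduced here for illustration.

```python
import cmath

def fft_dif_radix2(x):
    """Recursive radix-2 decimation-in-frequency FFT with natural-order output.

    The input length must be a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # DIF butterfly: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    sums = [x[i] + x[i + half] for i in range(half)]
    diffs = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
             for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif_radix2(sums)
    out[1::2] = fft_dif_radix2(diffs)
    return out

def dft(x):
    """Direct O(N^2) DFT, used only to check the FFT above."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

In the DIF form the twiddle multiplication follows the butterfly subtraction, which is why the multipliers in a DIF pipeline sit between butterfly stages.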

2.2 Scaling Approach

In order to maintain data accuracy in a fixed-point FFT processor, the internal word length is usually made larger than the word length of the input data to achieve a higher SQNR, especially in a long FFT processor. To avoid this growth in word length, a scaling approach is used in long FFT processors to minimize the quantization error.

The block floating point (BFP) approach is one of the scaling approaches commonly used in FFT implementations. In traditional BFP, the largest value is detected and all computational results of stage N are scaled by a common scale factor before the calculations of stage N + 1 start [13]. The BFP approach has the advantage of the lowest scaling overhead. However, it has the lowest SQNR performance of the scaling approaches considered here.

The hybrid floating point (HFP) approach is also a well-known scaling approach. It is a hybrid and simplified scheme for the floating point representation of a complex number, using a single exponent for the real and imaginary parts. Supporting HFP requires pre- and post-processing units in the arithmetic building blocks [14]. The HFP approach has the highest SQNR performance short of full floating point. However, it has the disadvantage of a high scaling overhead.

Table 1 Computational complexity of the complex multiplication for FFT algorithm (32K mode).

The block scaling (BS) approach is a co-optimized approach that combines the HFP approach with the BFP approach. It divides the data into blocks of an arbitrary size, each of which has a single exponent. The values in each block are scaled according to the corresponding exponent. The scale factor (exponent) is determined when the operation on each block is finished, and the data in the block are scaled before the operation on the next block starts. All scale factors need to be stored in a table, and the processed data in a block are stored in a cache [7]. The BS approach has an SQNR performance close to that of the HFP approach and a scaling overhead between those of the BFP and HFP approaches.
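The trade-off between BFP and BS can be illustrated with a small integer sketch. It assumes signed two's-complement words and treats BFP as the special case of a single block covering the whole stage; the function `block_scale` and its parameters are hypothetical and do not model the exponent table or cache of [7].

```python
def block_scale(values, block_size, word_length):
    """Scale each block by its own exponent so that it fits in word_length bits.

    Returns (scaled_values, exponents); an original value is approximately
    scaled_value * 2**exponent for the block it belongs to.
    """
    limit = 1 << (word_length - 1)          # signed range is [-limit, limit - 1]
    scaled, exponents = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block)
        exponent = 0
        while (peak >> exponent) >= limit:  # shift until the largest magnitude fits
            exponent += 1
        exponents.append(exponent)
        scaled.extend(v >> exponent for v in block)
    return scaled, exponents

data = [3, -5, 700, 1500]
print(block_scale(data, block_size=4, word_length=11))  # one exponent for all (BFP-like)
print(block_scale(data, block_size=2, word_length=11))  # one exponent per block (BS-like)
```

With a single shared exponent the small values 3 and -5 lose their least significant bit because of the large value 1500 elsewhere in the stage. With one exponent per block they are kept exactly, at the cost of storing more exponents.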

To improve on the SQNR of the BS approach, we propose the indexed block scaling (IBS) approach, which is based on an indexing technique. It improves the SQNR by increasing the number of scale factors and indexes. The IBS approach has different characteristics from the BS approach. When the scaling block size is increased in the BS approach, the prefetch buffer size increases proportionally and the number of exponents decreases. The IBS approach, in contrast, uses an index table instead of a prefetch buffer. This approach is effective in terms of area cost if the block size is larger than 128, as shown in Fig. 1.

In the BS approach, the elements inside a block are determined sequentially. With the IBS approach, however, the sequence of elements inside a block can be determined flexibly. Thus, the IBS approach can achieve a higher SQNR than the BS approach with the same block size. Figure 2 shows an example of the difference between the scaling blocks of the BS and IBS approaches (block size = 4, radix-2^2, 16-point FFT). In the proposed architecture, the PE block consists of four butterfly units connected in series. Thus, when the BS approach is applied to this structure, the block size can only be proportional to 2^(4n). The IBS approach achieves almost the same SQNR as the BS approach with 2^4 times as many exponents. In the proposed approach, we use a scaling block size of 256. As a result, the 32K-point FFT achieves an SQNR of 45.3 dB for 256QAM signals, which is better than that of the BS approach.

Fig. 1 Memory usage of IBS and BS.

Fig. 2 Block size = 4, radix-2^2, 16-point FFT. (a) BS approach, (b) IBS approach.

Fig. 3 SQNR performance of several scaling approaches.

Figure 3 shows the SQNR performance of each scaling approach as a function of the internal word length (IWL). The BFP approach features the lowest scaling overhead but the worst SQNR performance. In contrast, the HFP approach has the highest scaling overhead but the best SQNR performance. The proposed IBS approach has a better SQNR performance and less scaling overhead than the conventional BS approach. Thus, the IBS approach uses memory elements more efficiently than the BS approach.
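For readers who want to reproduce this kind of comparison, the SQNR is conventionally measured as the ratio of the reference signal power to the power of the quantization error, in decibels. The sketch below assumes that definition and a simple rounding quantizer; the helper names are illustrative, and it does not model the simulation setup behind Fig. 3.

```python
import math

def sqnr_db(reference, quantized):
    """SQNR in dB: reference signal power over quantization-error power."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    return 10.0 * math.log10(signal / noise)

def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

reference = [math.sin(0.1 * i) for i in range(1000)]
for bits in (10, 11):
    approx = [quantize(v, bits) for v in reference]
    print(bits, round(sqnr_db(reference, approx), 1))
```

Each extra bit of word length improves the SQNR of such a quantizer by roughly 6 dB, which is why a scaling scheme that preserves SQNR at a fixed word length translates directly into memory savings.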

3. Proposed FFT Architecture

The pipelined shared-memory FFT processor is described in this section. Figure 4(a) shows the top-level block diagram of the proposed architecture. It mainly consists of a processing element (PE), 8K-word (22-bit) single-port SRAM, and a control unit. The proposed FFT processor repeatedly performs an input/output (I/O) operation and a PE computation operation. The FFT mode is determined by a mode selection signal. In this section, we describe how each block can be implemented to reduce area.

3.1 Processing Element

Figure 4(b) shows the proposed PE based on the MDC architecture. By controlling the input sequence of the PE, the buffer size is halved compared with the conventional MDC architecture. The proposed architecture is based on a radix-2^4 MDC folding architecture. The number of PE computation iterations differs between modes: the 1K/2K/4K-point FFT modes require three iterations of the PE computation operation, and the other modes require four. Figure 5 shows the operation of each radix; the proposed PE can execute different operations according to the radix. Figure 6 shows the operation of each mode. In all modes, two iterations of the radix-2^4 DIF algorithm are performed during the final 8 stages. In addition, the 1K-point FFT mode performs the radix-2^2 operation during the first 2 BU stages, the 2K-point mode performs the radix-2^3 operation during the first 3 BU stages, and the 4K-point mode performs the radix-2^4 operation during the first 4 BU stages. The 8K-point mode performs radix-2 and radix-2^4 operations during the first 5 BU stages, the 16K-point mode performs radix-2^2 and radix-2^4 operations during the first 6 BU stages, and the 32K-point mode performs radix-2^3 and radix-2^4 operations during the first 7 BU stages.
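The per-mode schedule described above follows a simple rule: the log2(N) butterfly stages are split into as many radix-2^4 iterations as possible, preceded by one shorter iteration that absorbs the remainder. The sketch below encodes that reading of the text; `stage_plan` is a name introduced here and is not taken from the paper.

```python
def stage_plan(n_points):
    """Return the radix exponent of each PE iteration, e.g. [3, 4, 4, 4] for 32K.

    An entry k means one iteration that performs a radix-2^k operation
    (k butterfly stages).
    """
    stages = n_points.bit_length() - 1       # log2(N) butterfly stages in total
    assert 1 << stages == n_points, "N must be a power of two"
    lead = stages % 4                        # stages left over before the radix-2^4 iterations
    plan = [lead] if lead else []
    plan += [4] * (stages // 4)
    return plan

for n_points in (1024, 2048, 4096, 8192, 16384, 32768):
    plan = stage_plan(n_points)
    print(n_points, plan, "iterations:", len(plan))
```

The length of each plan reproduces the iteration counts stated in the text: three iterations for the 1K/2K/4K modes and four for the 8K/16K/32K modes.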

3.2 Complex Constant Multiplier

Figure 7 shows the proposed complex constant multiplier, which consists of six constant multipliers, adders, and MUXs. It uses three types of constant multipliers, which compute multiplications by the twiddle factor coefficients cos(π/8) = 0.9239, cos(2π/8) = 0.7071, and cos(3π/8) = 0.3827. The coefficients are represented as Canonic Signed Digit (CSD) numbers. A CSD representation contains the fewest nonzero digits, so it reduces the area and power consumption of the constant multiplier [4], [15]. The area cost of the proposed multiplier is only 70% of that of the complex Booth multiplier without the ROM table. The conventional MDC structure using the radix-2^4 algorithm needs a larger constant multiplier than the SDF and MDF structures. However, by adjusting the scheduling of the twiddle factors, the hardware complexity can be reduced.
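To make the CSD argument concrete, the sketch below converts the three W_16 coefficient magnitudes to CSD digits and counts the nonzero digits, each of which costs one adder or subtractor in a shift-and-add constant multiplier. The 10-bit fractional precision (`FRAC_BITS`) and the helper `to_csd` are assumptions made for illustration, not the coefficient format of the proposed design.

```python
import math

def to_csd(value):
    """Return the CSD digits (-1, 0, +1) of a non-negative integer, LSB first."""
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)   # value mod 4 == 1 -> +1, value mod 4 == 3 -> -1
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits

FRAC_BITS = 10  # assumed coefficient precision for this illustration
coefficients = {k: round(math.cos(k * math.pi / 8) * (1 << FRAC_BITS))
                for k in (1, 2, 3)}

for k, value in coefficients.items():
    digits = to_csd(value)
    nonzero = sum(1 for d in digits if d)
    print(f"cos({k}pi/8): binary ones = {bin(value).count('1')}, CSD nonzeros = {nonzero}")
```

No two adjacent CSD digits are nonzero, so the number of nonzero digits never exceeds the number of ones in the plain binary form and is usually smaller.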


Fig. 4 (a) Proposed FFT processor, (b) processing element.

Table 2 shows the scheduling of the twiddle factor W_16 under the proposed and conventional scheduling schemes. To avoid extra cycles and further reduce the hardware complexity, the following simplification scheme is proposed: 1) The proposed scheduling scheme reduces conflicts by using the idle Time Slot 0 in Table 2 and the feedback loop shown in Fig. 5. 2) The twiddle factors of the 2nd data-path in Time Slots 5 and 6 are shifted to Time Slots 6 and 7, and the twiddle factor of the 2nd data-path in Time Slot 7 is shifted to Time Slot 0. That is, in Time Slots 5, 6, and 7, the maximum number of complex constant multipliers used in the proposed scheduling scheme is 2, whereas the conventional scheduling method needs 4. Thus, the proposed scheduling method requires fewer complex constant multipliers than the conventional method.

3.3 Complex Booth Multiplier

Figure 8 shows a block diagram of the complex Booth multiplier. The proposed complex Booth multiplier is a non-pipelined structure because low area and low power are more important than clock speed in a memory-based design. In the multiplication block, the word length of the twiddle factor influences the SQNR performance, so an appropriate twiddle factor word length must be chosen. Figure 9 shows the SQNR as a function of the internal word length (IWL) and twiddle factor word length (TWL). If the internal word length is 11 bits, twiddle factor word lengths of 10 bits or more give the same SQNR performance. For this reason, the twiddle factor word length is set to 10 bits.

Fig. 5 Operation of each radix. (a) 16-point radix-2^4 FFT, (b) 8-point radix-2^3 FFT, (c) 4-point radix-2^2 FFT, (d) 2-point radix-2 FFT.

Fig. 6 Operation of each mode.

The complex Booth multiplier needs a look-up table (LUT) in read-only memory (ROM) to store the twiddle factor values. Figure 10 shows the twiddle factors for the ROM table. Only a 1/8th period of the cosine and sine waveforms is stored in the ROM, and the rest of the period can be reconstructed from these stored values. Also, if the word length of the twiddle factor is set to 10 bits, an FFT processor using 1/4 of the twiddle factors has almost the same SQNR performance as previous FFT processors using all twiddle factors. For this reason, this architecture uses only a 1/32nd period of the cosine and sine waveforms. As a result, the hardware complexity of the ROM table is reduced.
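The 1/8-period storage mentioned above relies on the standard symmetries of the sine and cosine functions. The following sketch shows one way to rebuild any twiddle factor W_N^k from an eighth-period table; it uses a small N and floating-point values for illustration, and it does not attempt to model the further reduction to a 1/32nd period described in the text.

```python
import math

N = 64                      # small FFT size assumed for illustration
EIGHTH = N // 8
# ROM contents: cosine and sine over the first eighth of a period (inclusive).
COS_ROM = [math.cos(2 * math.pi * i / N) for i in range(EIGHTH + 1)]
SIN_ROM = [math.sin(2 * math.pi * i / N) for i in range(EIGHTH + 1)]

def cos_sin(k):
    """Return (cos, sin) of 2*pi*k/N using only the eighth-period tables."""
    quadrant, r = divmod(k % N, N // 4)
    if r <= EIGHTH:
        c, s = COS_ROM[r], SIN_ROM[r]
    else:
        # cos(pi/2 - x) = sin(x) and sin(pi/2 - x) = cos(x)
        c, s = SIN_ROM[N // 4 - r], COS_ROM[N // 4 - r]
    for _ in range(quadrant):
        c, s = -s, c        # advance the angle by 90 degrees
    return c, s

def twiddle(k):
    """W_N^k = exp(-j*2*pi*k/N) = cos(theta) - j*sin(theta)."""
    c, s = cos_sin(k)
    return complex(c, -s)
```

In hardware the swaps and sign changes reduce to multiplexers and negations controlled by the upper address bits, so the table shrinks by a factor of eight at very little logic cost.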

3.4 Overflow Detection and Scaling Unit

The overflow detection and scaling unit (ODSU) detects overflow in the input data and converts the overflow detection result into exponent form. Figure 11 shows the structures of ODSU1 and ODSU2. The exponent value of ODSU1 is used as the scaling factor in the scaling unit of ODSU1 and is also propagated to ODSU2. ODSU2 performs the index table update operation. That is, ODSU2 obtains the final exponent by adding the exponent value propagated from ODSU1 to the exponent value output by the overflow detection unit in ODSU2. Using this information, ODSU2 decides whether or not the index table is to be updated, and performs the index update operation. The location of the first overflow generated in each scaling block is recorded in the index table. The index table data are used to scale the input values at the equalizer block of the next PE operation.

Fig. 7 Complex constant multiplier.

Table 2 Scheduling of twiddle factor at each time slot.

3.5 Memory Unit

Fig. 8 Complex Booth multiplier.

Fig. 9 SQNR for IWL and TWL.

Fig. 10 Twiddle factors for the ROM table.

As mentioned in Sect. 1, the major part of the FFT processor's power consumption comes from the main memory blocks. The large number of memory accesses is one of the most significant contributors to power consumption. Also, more than half of the chip area is occupied by the main memory blocks. Generally speaking, the main memory blocks are the most critical blocks in the FFT processor in terms of both power consumption and hardware complexity. So, in order to reduce chip area and power consumption, single-port SRAM and a low internal word length are adopted in the proposed architecture. Single-port SRAM operates with fewer memory accesses and occupies less area than dual-port SRAM. The total memory size in our design is 704 Kbit, with each word consisting of 11-bit real and imaginary parts.
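The 704 Kbit figure can be checked with a few lines of arithmetic. The sketch assumes that each of the four memory banks is one 8K-word, 22-bit SRAM as described at the start of Sect. 3; the bank count per SRAM is inferred from the text rather than stated explicitly, and the variable names are illustrative.

```python
WORD_BITS = 2 * 11          # 11-bit real part + 11-bit imaginary part
WORDS_PER_BANK = 8 * 1024   # 8K-word SRAM (Sect. 3)
BANKS = 4                   # assumed: one 8K-word SRAM per memory bank

total_words = BANKS * WORDS_PER_BANK
total_kbit = total_words * WORD_BITS / 1024
print(total_words, "words,", total_kbit, "Kbit")
```

Under this assumption the main memory holds exactly one 32K-point data set, so the longest FFT mode is computed in place.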

The read/write scheduling scheme of the four memory banks and the prefetch/prewrite buffers is shown in Fig. 12. The detailed operation of the memory block in the FFT processor is as follows. The four memory banks perform read operations during the latency of the PE block, as shown in Fig. 12. Thereafter, the memory banks perform write and read operations alternately, one per clock cycle.

Fig. 11 (a) ODSU1, (b) ODSU2.

Fig. 12 Scheduling of the memory and prefetch/prewrite buffer.

1) Read operation: After the values are read from the four memory banks, two symbols are sent to the PE block and the other two are sent to the prefetch buffer. The symbols in the prefetch buffer are sent to the PE block at the next clock cycle.

2) Write operation: First, the output values of the PE block are sent to the prewrite buffer. At the next clock cycle, the values in the prewrite buffer and the new output values from the PE block are written to the memory at the same time.
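The read side of this schedule can be modeled in a few lines to show why a single-port memory keeps a two-path PE busy: each read fetches four words, which feed the PE for two cycles and leave the port free on every other cycle. The model below is a simplified sketch that ignores the PE latency and bank addressing, and the function name and the "read"/"write" labels are assumptions made for illustration.

```python
def schedule_reads(words):
    """Model the read side: one 4-word memory read feeds the PE for two cycles.

    Returns (pe_inputs, port_use), one entry per clock cycle.
    """
    assert len(words) % 4 == 0
    pe_inputs, port_use = [], []
    prefetch = ()
    for cycle in range(len(words) // 2):
        if cycle % 2 == 0:
            group = words[2 * cycle: 2 * cycle + 4]   # one word from each of the four banks
            pe_inputs.append(tuple(group[:2]))        # two symbols go straight to the PE
            prefetch = tuple(group[2:])               # the other two wait in the prefetch buffer
            port_use.append("read")
        else:
            pe_inputs.append(prefetch)                # buffered symbols reach the PE one clock later
            port_use.append("write")                  # the single port is free for the write-back
    return pe_inputs, port_use

pe_inputs, port_use = schedule_reads(list(range(8)))
print(pe_inputs)
print(port_use)
```

The PE receives two symbols on every cycle while the memory port is used for reading only on alternate cycles; the prewrite buffer plays the mirror role on the output side, collecting two results so that four words can be written in the free cycle.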

4. Results and Comparison

Table 3 compares the storage requirements and SQNR of the proposed IBS approach and previous scaling approaches. The proposed scaling approach uses storage more efficiently than the previous BS and HFP approaches. In terms of SQNR, the proposed approach achieves 45.3 dB at an 11-bit internal word length for 256QAM modulation, which is comparable to the SQNR of the BS and HFP approaches and much better than that of the BFP approach.

Table 3 Storage requirement and SQNR for different scaling approaches (N = 32768, W = 11, S = 4, B = 256).

Table 4 Performance of the proposed FFT processor compared with previous implementations.

The proposed FFT architecture was modeled in Verilog HDL and simulated to verify its functionality. After complete verification of the design functionality, it was synthesized using appropriate timing and area constraints. Both the simulation and synthesis steps were carried out using the Synopsys synthesis tool and a 0.18-μm CMOS standard cell library. According to the synthesis results, the total gate count is 41,000 gates, excluding memories. In pre-layout simulation, the proposed architecture operates at a maximum clock frequency of 110 MHz. The execution time for a 32K-point FFT is 595.5 μs at 110 MHz, which is sufficient to meet the specification of the DVB-T2 standard (3584 μs).
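The timing margin implied by these figures can be worked out directly. The sketch below uses only the numbers quoted above; the derived cycle count and cycles-per-point ratio are back-of-the-envelope values, not figures reported in the paper.

```python
CLOCK_MHZ = 110.0       # maximum clock frequency from pre-layout simulation
EXEC_TIME_US = 595.5    # reported 32K-point execution time
SYMBOL_US = 3584.0      # DVB-T2 32K symbol duration
N_POINTS = 32768

cycles = EXEC_TIME_US * CLOCK_MHZ        # MHz * us = clock cycles
cycles_per_point = cycles / N_POINTS
margin = SYMBOL_US / EXEC_TIME_US

print(f"{cycles:.0f} cycles, {cycles_per_point:.2f} cycles/point, margin {margin:.2f}x")
```

The result is about two clock cycles per point, which is consistent with four PE iterations over 32K points on a two-path datapath, and it leaves roughly a sixfold margin against the symbol duration.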

In Table 4, the performance characteristics of the proposed processor are summarized and compared with previous works. For the comparison of hardware complexity, the numbers of complex adders, complex multipliers, and memories are used because the area is dominated by adders, multipliers, and memories. The results show that the proposed processor uses fewer complex adders, complex Booth multipliers, and complex CSD multipliers than the other FFT processors. For a fair comparison with the previous works, the memory elements are normalized to the 8K-point and 32K-point FFT modes. The proposed processor supporting the 32K-point FFT mode has a smaller memory size than the other processors supporting the 8K-point FFT mode. The SQNR and execution time are sufficient to meet the specification of the DVB-T2 standard. As a result, the proposed processor achieves lower hardware complexity than the other FFT processors. Moreover, the proposed processor achieves a much higher operating frequency than the other memory-based FFT processor.

5. Conclusion

In this paper, a new pipelined shared-memory FFT architecture has been proposed for DVB-T2 applications. By adjusting the scheduling of the twiddle factors, the size of the complex constant multiplier was reduced. In order to reduce the internal word length, a new indexed block scaling approach has been proposed to preserve SQNR at a low scaling overhead. Also, single-port SRAM with a minimal word length is adopted to further reduce hardware complexity. Therefore, the proposed processor achieves a shorter word length, less memory usage, and higher SQNR performance, which results in low hardware complexity and a small memory size. The proposed architecture has potential applications in DVB-T2 systems.

Acknowledgement

This work was supported by the MKE, Korea, under the ITRC support program supervised by the NIPA (NIPA-2011-C1090-1111-0007).

References

[1] DVB Document A133, “Implementation guidelines for a second generation digital terrestrial television broadcasting system (DVB-T2),” Feb. 2009.

[2] ETSI EN 300 744, “Digital video broadcasting (DVB); Framing structure, channel coding and modulation for digital terrestrial television,” Jan. 2009.

[3] DVB Document A092r3, “Digital video broadcasting (DVB); DVB-H implementation guidelines,” April 2009.

[4] J. Lee and H. Lee, “A high-speed two-parallel radix-2^4 FFT/IFFT processor for MB-OFDM UWB systems,” IEICE Trans. Fundamentals, vol.E91-A, no.4, pp.1206–1211, April 2008.

[5] H.-Y. Lee and I.-C. Park, “Balanced binary-tree decomposition for area-efficient pipelined FFT processing,” IEEE Trans. Circuits Syst. I, vol.54, no.4, pp.889–900, April 2007.

[6] M. Turrillas, A. Cortes, I. Velez, J.F. Sevillano, and A. Irizar, “An FFT core for DVB-T2 receivers,” Proc. 16th IEEE International Conference Electronics, Circuits, Syst., pp.120–123, Dec. 2009.

[7] Y. Lin, H. Liu, and C. Lee, “A dynamic scaling FFT processor for DVB-T applications,” IEEE J. Solid-State Circuits, vol.39, no.11, pp.2005–2013, Nov. 2004.

[8] H. Xiao, A. Pan, Y. Chen, and X. Zeng, “Low-cost reconfigurable VLSI architecture for fast Fourier transform,” IEEE Trans. Consum. Electron., vol.54, no.4, pp.1617–1622, Nov. 2008.

[9] W. Li and L. Wanhammar, “A pipeline FFT processor,” Proc. IEEE Workshop Signal Process. Syst., pp.654–662, Nov. 1999.

[10] C.-M. Wu, M.-D. Shieh, H.-F. Lo, and M.-H. Hu, “Implementation of channel demodulator for DAB system,” Proc. 2003 IEEE International Symposium Circuits Syst., vol.2, pp.25–28, May 2003.

[11] S. He and M. Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” Proc. IEEE URSI Int. Symp. Signals, Syst., Electron., pp.257–262, Oct. 1998.

[12] J.-Y. Oh and M.-S. Lim, “Fast Fourier transform algorithm for low-power and area-efficient algorithm,” IEICE Trans. Commun., vol.E89-B, no.4, pp.1425–1429, April 2006.

[13] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Upper Saddle River, NJ, 1999.

[14] T. Lenart and V. Owall, “Architectures for dynamic data scaling in 2/4/8K pipeline FFT cores,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.14, no.11, pp.1286–1290, April 2006.

[15] S.-M. Kim, J.-G. Chung, and K.K. Parhi, “Low error fixed-width CSD multiplier with efficient sign extension,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol.50, no.12, pp.984–993, Dec. 2003.

Kisun Jung received the B.S. degree in electronic engineering from Hankuk University of Foreign Studies in 2009 and the M.S. degree in information & communication engineering from Inha University, Incheon, Korea, in 2011. His research interests include VLSI design and implementation for communication systems.

Hanho Lee received the Ph.D. and M.S. degrees, both in Electrical & Computer Engineering, from the University of Minnesota, Minneapolis, in 2000 and 1996, respectively. In 1999, he was a Member of Technical-Staff-1 at Lucent Technologies, Bell Labs, Holmdel, NJ. From April 2000 to August 2002, he was a Member of Technical Staff at Lucent Technologies (Bell Labs Innovations), Allentown, where he was responsible for the development of VLSI architectures and implementation of high-performance DSP multiprocessor SoC for wireless infrastructure systems. From August 2002 to August 2004, he was an assistant professor at the Department of Electrical & Computer Engineering, University of Connecticut, Storrs. Since August 2004, he has been with the School of Information and Communication Engineering, Inha University, Incheon, Korea, where he is presently a Professor. His research interests include design of VLSI circuits and systems for communications, System-on-a-Chip (SoC) design, reconfigurable architecture, and forward error correction coding.
