
Design Automation for a 3DIC FFT Processor for Synthetic Aperture Radar: A Case Study

Thorlindur [email protected]

Kiran [email protected]

Paul D. [email protected]

Department of Electrical and Computer Engineering, North Carolina State University, Box 7911

Raleigh, NC 27695

ABSTRACT
This work discusses a 1024-point, memory-on-logic 3DIC FFT processor for synthetic aperture radar (SAR), sent to fabrication in the 180 nm MIT Lincoln Labs 3D FDSOI 1.5 V process[12], along with the design flow required to realize it with off-the-shelf commercial 2D tools. The work shows how the vertical dimension can be exploited for novel memory architecture tradeoffs that are not feasible in 2D, reducing the energy consumed per memory operation in the FFT by 60.3%. In comparison to its 2D counterpart, the SAR FFT processor exhibits a 53.0% decrease in average wire length, a 24.6% increase in maximum operating frequency and a 25.3% decrease in total silicon area.

Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles; C.4 [Performance of Systems]: Design studies

General Terms
Design

Keywords
SAR, 3DIC, TSV, FFT

1. INTRODUCTION
New developments in fabrication technology allow vertical integration using 3D thru-silicon vias. Vertical integration has the potential to cut wire length drastically for standard cell designs, as reported by Davis et al.[5]. This is important because, as designs move to smaller feature sizes, wires will increasingly dominate the delay and power budgets of digital logic circuits[7]. 3D integration faces several major obstacles, including increased thermal densities[10], increased test costs, and the lack of commercial EDA tool support for 3DICs[2]. It has been shown that custom 3D placement and routing tools can achieve a 28-51% reduction in total wire length[4]. In this paper we demonstrate one way in which an application-specific processor can be re-architected to take advantage of 3DIC technology, using an FFT engine designed for a Synthetic Aperture Radar (SAR) processor as a case study. We re-architected the baseline design to reduce power and total area simultaneously by taking advantage of 3DIC with through-silicon via (TSV) technology. Furthermore, by separating the logic and memory layers we simplify the tool flow so that the design can be executed easily using extensions of current 2D CAD tools. 3DIC stacking is used to interconnect the power-optimized memory to the logic tier while reducing overall area. The design has been sent to fabrication at MIT Lincoln Laboratory, but fabrication of the chip has not yet been completed.

The paper is organized in the following manner. Section 2 describes the algorithm on which the SAR FFT processor is based. Section 3 describes the architecture of the SAR FFT processor. Section 4 describes the tool flow and manufacturing process. Finally, Section 5 compares the metrics of the 3D implementation to a 2D equivalent.

2. SAR ALGORITHM
Synthetic aperture radar, unlike most radar, is used for imaging. While conventional images are formed using the visible spectrum, SAR images are formed using the radio region of the spectrum. A tremendous amount of digital signal processing and memory bandwidth is required to form a SAR image. The required digital signal processing and memory bandwidth increase exponentially with the desired image resolution. This makes a SAR FFT processor an excellent candidate to demonstrate the memory bandwidth benefits that 3D integrated circuits can provide. The image-forming algorithm used for the SAR FFT processor is derived from the one used in the RASSP[6] project and is based on the Range Doppler Algorithm[8]. The steps required to form the SAR image, along with the portion of the floating point operations performed by each step (for 30 cm imaging resolution), are shown in Table 1. It is important to note that the majority of the floating point operations are FFT/IFFT operations, which occur in steps 2, 3 and 5.

Table 1: The Steps in the SAR Algorithm.

Step                                          % of floating point operations
1) Range Low Pass FIR Filtering               35.6%
2) Range Fast Fourier Transform               12.3%
3) Azimuth Fast Fourier Transform             22.3%
4) Azimuth Complex Multiply                   3.8%
5) Azimuth Inverse Fast Fourier Transform     26.0%
FFT Steps Combined (Steps 2, 3 & 5)           60.6%

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC'09, July 26-31, 2009, San Francisco, California, USA. Copyright 2009 ACM 978-1-60558-497-3/09/07....10.00

3. ARCHITECTURE
As we have shown in Table 1 and discussed in Section 2, the majority of SAR processing involves computing FFTs. As a result, the main objective of the design is to efficiently calculate the FFTs used in the SAR algorithm. We use a radix-2 Cooley-Tukey FFT[3] for all the FFT calculations in the processor. A radix-2 FFT has a data dependency that resembles a hypercube. This hypercube data dependency can be exploited in two ways. First, a radix-2 FFT will process two memory locations every cycle, one of which will have odd address parity (an odd number of 1 bits in its address) while the other will have even parity. As a result we can split the processing memory into two independent memory groups that never need to be accessed at the same time. Second, we can sub-divide the even and odd groups into smaller subgroups where each processing element is only connected to the absolute minimum number of memory locations required to successfully compute the FFT. Furthermore, in this subdivision no memory subgroup is accessed by more than one processing element at the same time.

The benefit of splitting the memories into smaller subgroups is that smaller memories are faster, and since the memory subgroups can all be accessed simultaneously, the system can perform a greater number of reads and writes per cycle. Conversely, a single memory will require less area, as only one set of peripheral logic (write driver and sense amp) is required. We use Cacti 4.1[13] to assess the architectural tradeoff by comparing the properties of a single 8 kByte memory to sixteen 512 Byte memories. The memory-core area savings of using a single memory would have been 67.6%. By using multiple smaller memories, the energy per read is reduced by 60.8% (from 68.205 to 26.718 pJ), the energy per write is reduced by 57.6% (from 14.48 to 6.142 pJ) and the memory bandwidth is increased by 854.9% (13.4 to 128.4 GBps). The number of wires interconnecting the memory to logic is increased from 150 to 2272. 3DIC stacking is used to minimize the area impact of the added wires. Furthermore, the single memory will have a shorter and simpler interconnect structure between the logic and the memory. This tradeoff is illustrated in Figure 1.
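The even/odd split described above can be checked with a short sketch. It assumes an in-place radix-2 FFT in which the two operands of every butterfly differ in exactly one address bit (the hypercube dependency), and it takes "parity" to mean the parity of the number of 1 bits in the address. The function names are illustrative and do not come from the design.

```python
def parity(addr: int) -> int:
    """Return 1 if addr has an odd number of 1 bits, else 0."""
    return bin(addr).count("1") & 1


def butterfly_pairs(n: int):
    """Yield (stage, a, b) for every butterfly of an in-place radix-2 FFT.

    n must be a power of two; a and b differ only in bit `stage`.
    """
    stages = n.bit_length() - 1
    for stage in range(stages):
        span = 1 << stage
        for a in range(n):
            if a & span == 0:
                yield stage, a, a | span


def parity_split_is_conflict_free(n: int) -> bool:
    """True if every butterfly touches one even- and one odd-parity address."""
    return all(parity(a) != parity(b) for _, a, b in butterfly_pairs(n))


if __name__ == "__main__":
    print(parity_split_is_conflict_free(1024))
```

Because flipping a single address bit always flips the bit-count parity, the two operands of a butterfly can never fall in the same parity group, which is what allows the processing memory to be split in two.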


Figure 1: The memory design tradeoffs.

Overall, the architecture we fabricated can process a 1024-pixel wide image using an FFT of the same width. Each pixel/data point in the FFT has a precision of 32 imaginary and 32 real bits. Computing an N-point FFT requires N/2 FFT twiddle factors of the same precision. To minimize the storage of the twiddle factors we use two optimizations. First, we use trigonometric properties to reduce the number of twiddle factors stored[11] from N/2 to N/8+1. For the 1024-point FFT this reduces the number of twiddle factors stored from 512 to 129. Second, if any bit is the same for all the words, that bit is hard-coded into the processing element rather than being stored in the ROM. This optimization reduces the number of bits stored per twiddle factor from 64 to 52. We use the memory-dividing scheme described above to divide the processing memory into 32 smaller memories (16 even and 16 odd). Furthermore, every memory is dual-ported (one read and one write port). Overall, this allows the system to perform 32 memory accesses per cycle (16 reads and 16 writes), completing a 1024-point FFT in 653 cycles assuming five pipeline stages. Each component of the architecture is described below and illustrated in Figure 3.
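The reduction from N/2 to N/8+1 stored twiddle factors can be illustrated with a generic octant-symmetry sketch. This is not necessarily the exact scheme of [11] or of the fabricated ROMs: it assumes twiddle factors W_N^k = exp(-2πik/N) for k in [0, N/2), uses double-precision floats rather than the 32-bit words of the design, and the function names are hypothetical.

```python
import cmath
import math


def build_octant_table(n: int):
    """Store (cos, sin) of 2*pi*j/n for j = 0 .. n/8 only (n/8 + 1 entries)."""
    return [(math.cos(2 * math.pi * j / n), math.sin(2 * math.pi * j / n))
            for j in range(n // 8 + 1)]


def twiddle(k: int, n: int, table) -> complex:
    """Rebuild W_n^k = cos(t) - i*sin(t), t = 2*pi*k/n, from the octant table."""
    eighth, quarter, half = n // 8, n // 4, n // 2
    if not 0 <= k < half:
        raise ValueError("k must satisfy 0 <= k < n/2")
    if k <= eighth:                      # t in [0, pi/4]: direct lookup
        c, s = table[k]
    elif k <= quarter:                   # t = pi/2 - p: swap cos and sin
        cj, sj = table[quarter - k]
        c, s = sj, cj
    elif k <= 3 * eighth:                # t = pi/2 + p: cos = -sin p, sin = cos p
        cj, sj = table[k - quarter]
        c, s = -sj, cj
    else:                                # t = pi - p: cos = -cos p, sin = sin p
        cj, sj = table[half - k]
        c, s = -cj, sj
    return complex(c, -s)


def max_error(n: int) -> float:
    """Largest deviation from directly computed twiddle factors."""
    table = build_octant_table(n)
    return max(abs(twiddle(k, n, table) - cmath.exp(-2j * cmath.pi * k / n))
               for k in range(n // 2))
```

For N = 1024 the table holds 129 entries, matching the count given in the text; the remaining factors are recovered by swapping and negating the stored sine and cosine values.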

The system consists of four kinds of components: eight processing elements, one controller, thirty-two SRAMs, and eight ROMs. The processing elements are the core of the system, implementing the FFT butterfly with four floating point multipliers and six addition/subtraction units. The internal structure of the processing element is shown in Figure 2. The controller orchestrates the overall operation of the system by setting the addresses and read enables of the memories. The controller requires very little communication with the processing elements, only three signals per processing element. The SRAMs implement the main processing memory using 8-transistor dual-ported SRAMs. The ROMs store the FFT twiddle factors and are implemented as single-ported NOR-type ROMs.
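The operation count of the processing element (four multipliers and six adders/subtractors) matches a standard radix-2 butterfly written out in real arithmetic. The sketch below assumes a decimation-in-time form, in which the second operand is multiplied by the twiddle factor before the sum and difference are taken; the text does not state which form the hardware uses, and Python floats stand in for the hardware floating point units.

```python
def butterfly(a: complex, b: complex, w: complex):
    """Radix-2 butterfly: return (a + b*w, a - b*w) using real operations only."""
    # Four real multiplications for the complex product t = b * w.
    m1 = b.real * w.real
    m2 = b.imag * w.imag
    m3 = b.real * w.imag
    m4 = b.imag * w.real
    # Additions/subtractions 1 and 2 finish the complex product.
    t_re = m1 - m2
    t_im = m3 + m4
    # Additions/subtractions 3 to 6 form the two outputs.
    top = complex(a.real + t_re, a.imag + t_im)
    bottom = complex(a.real - t_re, a.imag - t_im)
    return top, bottom
```

Counting the operations in the body gives four multiplications and six additions or subtractions per butterfly.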


Figure 2: The internal structure of a processing element.

The SAR FFT processor architecture is a good example of a design that has a significant number of heavily shared and interconnected resources. For this reason it can be expected to benefit significantly from 3D integration, due to the long wires in the interconnect between these resources (memories and processing elements).

Figure 3: The SAR FFT processor architecture.

4. IMPLEMENTATION AND TOOL FLOW
In this section we discuss the design, implementation, tool flow and manufacturing process used to create the system. Before the design flow is explained, it is important to understand the manufacturing process. The MIT Lincoln Labs manufacturing process is a three-tier, 180 nm wafer-scale 3D integration process[9, 1]. It features a 1.5 V low-power fully depleted silicon-on-insulator CMOS technology with one layer of polysilicon, three metal layers per tier and a back-metal layer between the top two tiers, with an additional metal layer on top of the entire stack. The bottom tier is named A, the middle tier B and the top tier C. Tier A is closest to the heat sink. Tier C is the only tier which has off-chip inputs and outputs. Tiers B and C face down, while tier A faces up. Figure 4 shows a side view of the process with the thru-silicon vias and the orientation of the tiers. In this process a single thru-silicon via measures 2.5 × 2.5 μm, and the smallest pitch on which the vias can be placed is 3.9 μm.

Overall, the design is a mix of standard cell and full custom design. The processing elements and controller are coded in Verilog, while the memories (SRAMs and ROMs) are implemented using full custom design. One benefit of implementing the memories in full custom, rather than using an off-the-shelf memory generator, is that the thru-silicon vias can be placed on the outside edges of the memories. This simplifies the flow, as the thru-silicon vias are placed along with the memory. This is not the case, however, for the 24 logic-to-logic vias, whose positions must be predetermined and which are placed in the final assembly stage.

Figure 4: A side view of the MIT Lincoln Labs process with the thru-silicon vias and tier orientation shown.

Figure 5 shows the complete design flow. The first step in the design flow is 3D floorplanning: partitioning the design and selecting the locations for the memories. In the 3D floorplanning phase, the main objective is to get the memories as close as possible to the processing elements that use them. We define PE0, PE1, PE2 and PE3 to be the lower numbered processing elements and PE4, PE5, PE6 and PE7 to be the upper numbered processing elements. Figure 3 shows that every memory is connected to one lower numbered PE and one upper numbered PE. To exploit this connectivity we partition the system so that the controller and the memories are on the middle tier (tier B), with the upper numbered PEs and their respective twiddle factor ROMs placed on tier A and the lower numbered PEs along with their ROMs placed on tier C. This partitioning scheme guarantees that a memory is never more than one tier away from the processing elements that are connected to it. It also places each memory on the same tier as the controller that sets its address lines.

On the middle tier we have thirty-two memories and one controller to place. To accomplish this, we use an 11×3 grid. We place the controller in the center location of the grid. For the memories we use a Python constraints package to generate an optimal memory placement based on the distance from a given memory to the two processing elements that use it. The resulting floorplan is shown in Figure 8. In the system there are a total of 8280 thru-silicon vias: 4128 connect the logic on tier A to the memories on tier B, another 4128 connect the logic on tier C to the memories on tier B, and the remaining 24 connect the controller to the processing elements.
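The placement objective described above (keep each memory close to the two processing elements that use it) can be sketched as a toy exhaustive search. This is not the constraints package or the cost function used for the fabricated floorplan: the Manhattan distance metric, the slot and PE coordinates, and all names below are assumptions chosen for illustration, and brute force is only practical for a handful of memories.

```python
from itertools import permutations


def manhattan(p, q):
    """Manhattan distance between two (x, y) grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def best_placement(slots, mem_pes, pe_xy):
    """Assign each memory to a distinct slot, minimising total distance.

    slots   : list of free (x, y) grid locations on the memory tier
    mem_pes : {memory name: (lower PE name, upper PE name)}
    pe_xy   : {PE name: (x, y)} projected onto the memory tier grid
    Returns (total cost, {memory name: slot}).
    """
    mems = sorted(mem_pes)
    best = None
    for perm in permutations(slots, len(mems)):
        cost = sum(manhattan(slot, pe_xy[mem_pes[m][0]]) +
                   manhattan(slot, pe_xy[mem_pes[m][1]])
                   for m, slot in zip(mems, perm))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(mems, perm)))
    return best


# Hypothetical three-slot example with two memories.
slots = [(0, 0), (1, 0), (2, 0)]
mem_pes = {"m0": ("PE0", "PE4"), "m1": ("PE1", "PE5")}
pe_xy = {"PE0": (0, 0), "PE4": (0, 0), "PE1": (2, 0), "PE5": (2, 0)}
cost, placement = best_placement(slots, mem_pes, pe_xy)
```

In this toy instance each memory lands directly above or below its two processing elements. For the full 11×3 grid with thirty-two memories, exhaustive search is infeasible, which is why a constraint solver or an assignment algorithm would be needed.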

The next step in the design flow is synthesis, which was accomplished using a standard cell library based on the IIT-SoC library from the Illinois Institute of Technology. Each tier is synthesized separately in Synopsys Design Compiler. After synthesis, we perform static timing analysis and add pipeline stages to the processing elements one at a time, until adding another stage no longer increases the overall speed. The optimal pre-place and route pipeline depth for the system was found to be five stages for this manufacturing process and these standard cells, yielding a maximum operating frequency of 196 MHz (without parasitics).

Figure 5: The design flow.

After synthesis, we perform place and route. This stage deviates the most from a conventional 2D flow. In order to successfully complete place and route, global information about the placement of the memories and the thru-silicon vias is required. Using the standard string and file manipulation functions built into the TCL interpreter in Encounter, the thru-via and pin locations can easily be extracted by parsing the DEF files of the custom memory designs. Using the information from the DEF files, routing and placement are blocked over the areas of the memories and the inter-tier thru-silicon via locations. Normal placement is then performed, followed by clock tree synthesis. Because the process only allows three metal layers per tier, the clock tree is not routed before regular routing, as is common for processes with a greater number of metal layers. Instead the clock tree is routed along with all other routing. This causes more clock skew than would have occurred if a greater number of metal layers had been available. After clock tree synthesis, the "preassignPin" command is used to place virtual input/output pins directly on top of the thru-vias on the edges of the memories. Encounter then performs routing as normal, connecting the standard cells, clock tree and virtual pins (effectively performing 3D routing).

After place and route, the design along with its parasitics is imported into PrimeTime and post-place and route timing analysis is performed. In this step it is important to make sure that each tier has no setup or hold violations, and that signals that travel between tiers have no setup or hold violations either. This step is greatly simplified by the fact that there are very few logic-to-logic vias (24) and the remaining signals are either data pins to the SRAMs or address pins to the twiddle factor ROMs.
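The pin-extraction step can be sketched as follows. The flow described above uses TCL inside Encounter; this Python version is only a stand-in, and it handles a simplified subset of the DEF PINS section (one `+ PLACED ( x y )` or `+ FIXED ( x y )` clause per pin). The sample text, pin names and coordinates are hypothetical.

```python
import re

# Matches "- <pin> + NET <net> ... + PLACED ( x y )" within one pin statement.
PIN_RE = re.compile(
    r"-\s+(\S+)\s+\+\s+NET\b.*?\+\s+(?:PLACED|FIXED)\s+\(\s*(-?\d+)\s+(-?\d+)\s*\)",
    re.DOTALL,
)


def extract_pins(def_text: str) -> dict:
    """Return {pin name: (x, y)} in DEF database units from the PINS section."""
    section = re.search(r"^PINS\s+\d+\s*;(.*?)^END PINS", def_text,
                        re.DOTALL | re.MULTILINE)
    if section is None:
        return {}
    pins = {}
    # Each pin statement is terminated by a semicolon.
    for statement in section.group(1).split(";"):
        match = PIN_RE.search(statement)
        if match:
            pins[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return pins


# Hypothetical DEF fragment for a memory with two data pins.
SAMPLE_DEF = """\
PINS 2 ;
- data[0] + NET data[0] + DIRECTION INPUT + USE SIGNAL
  + LAYER metal3 ( -125 0 ) ( 125 250 )
  + PLACED ( 3900 0 ) N ;
- data[1] + NET data[1] + DIRECTION INPUT + USE SIGNAL
  + LAYER metal3 ( -125 0 ) ( 125 250 )
  + PLACED ( 7800 0 ) N ;
END PINS
"""

pin_locations = extract_pins(SAMPLE_DEF)
```

The resulting coordinates could then be turned into blockages and pin pre-assignments in the place and route tool; that tool-specific step is not shown here.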

Finally, all the tiers are imported separately into Virtuoso. In Virtuoso the three tiers and the full-custom memories are combined. For the 24 signals that connect the controller to the processing elements, the through-silicon vias are placed by hand; the rest are placed automatically as part of the memories. The 24 TSVs were placed by hand because, with so few of them, it was quicker to place them manually than to write a script to do so. However, this process can easily be scripted using Skill code in Virtuoso, as the locations of all through-silicon vias are known. Furthermore, scripting this process would be necessary for other 3D designs that contain a greater number of logic-to-logic through-silicon vias. The power and ground rings of the three tiers are then combined into 3D meshes by placing thru-silicon vias all along the perimeter. Because Encounter routed over the power and ground rings in some areas, a few thru-silicon vias along the perimeter had to be removed to avoid shorts. All in all, we managed to fit 4554 power and ground vias between tiers A and B and 4800 vias between tiers B and C. The next step was to place the input and output pads and perform a final DRC and LVS. Furthermore, since the process being used is an SOI process, we added extra power and ground decoupling capacitors wherever there was room left over, to compensate for the limited native decoupling in SOI. Figure 6 shows the three tiers stacked, along with the thru-silicon vias.

Figure 6: The 3D SAR FFT processor with thru-silicon vias drawn in.

5. RESULTS
To quantify the improvements of the 3D circuit over its 2D counterpart, we place and route the design in 2D. To ensure a fair comparison between the two circuits, the circuit is not resynthesized; instead the same synthesis output is used. For the comparison, a nearly identical floorplan is used: essentially the floorplan of tier B, expanded, with the ROMs placed in locations similar to those in the 3D version, as shown in Figure 9. Due to increased congestion, the 2D design does not route successfully with the same area as its 3D counterpart (4.8 × 4.8 mm). To remedy this, the area used for place and route is grown until the design routes without any design rule violations. The total area must be expanded significantly, from 3 × 2.6 × 3 mm (three tiers of 2.6 × 3 mm each) for the 3D circuit to 5.6 × 5.6 mm for its 2D counterpart, so the 3D circuit uses 25.3% less total area. To get just the core placement area, we exclude the power and ground rings (0.1 mm on every side) from the total area, and the comparison becomes 3 × 2.8 × 2.4 mm versus 5.4 × 5.4 mm. The discrepancy between the total area and the core area illustrates an interesting point: given the same total area and the same power and ground ring width, a 3D design will devote more area to the power and ground rings.

The next metric examined was net length. We extract all net information directly from Encounter, combining the information from all the tiers. As expected, the average wire length decreased drastically, from 836.0 μm down to 392.9 μm, a 53.0% decrease. Similarly, the total wire length decreased from 19.107 m to 8.238 m, a 56.9% decrease. A histogram of the wire lengths is shown in Figure 7.
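The area and wire-length figures above can be re-derived with a few lines of arithmetic. The sketch assumes the 3D footprint is three tiers of 2.6 × 3.0 mm each and that the 0.1 mm power and ground ring is removed from every side of every tier; the helper names are illustrative.

```python
RING_MM = 0.1  # power/ground ring width on every side, from the text


def areas(tiers: int, width_mm: float, height_mm: float, ring_mm: float = RING_MM):
    """Return (total area, core area) in mm^2, summed over all tiers."""
    total = tiers * width_mm * height_mm
    core = tiers * (width_mm - 2 * ring_mm) * (height_mm - 2 * ring_mm)
    return total, core


def percent_reduction(before: float, after: float) -> float:
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


total_3d, core_3d = areas(3, 2.6, 3.0)
total_2d, core_2d = areas(1, 5.6, 5.6)

total_area_cut = percent_reduction(total_2d, total_3d)
core_area_cut = percent_reduction(core_2d, core_3d)

# Fraction of the silicon spent on power/ground rings in each version.
ring_share_3d = percent_reduction(total_3d, core_3d)
ring_share_2d = percent_reduction(total_2d, core_2d)

mean_wire_cut = percent_reduction(836.0, 392.9)     # micrometres
total_wire_cut = percent_reduction(19.107, 8.238)   # metres
```

Under these assumptions the computed reductions agree with the reported values to within rounding, and the ring share comes out roughly twice as large for the 3D design as for the 2D design, which is the effect noted in the text.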

Figure 7: Histogram of wire lengths of the SAR FFT processor for both the 2D and 3D versions (bin size = 250 μm).

In order to gather the speed and power metrics of the design, we have to extract the parasitics and characterize the switching activity of the design. The parasitics are extracted into a SPEF file using Encounter. The switching activity is generated by simulating an FFT test bench in Mentor Graphics ModelSim and exporting the resulting activity to a SAIF file. Both files are then read into Synopsys PrimeTime. In PrimeTime the clock frequency is increased to the fastest clock that does not cause any setup violations, to determine the maximum operating frequency. The 3D design simulated correctly at 79.4 MHz (12.6 ns), whereas the 2D design simulated correctly at 63.7 MHz (15.7 ns). This is a 24.6% increase in maximum operating frequency, or equivalently a 19.7% reduction in critical path delay. As these numbers may seem a bit slow for the given technology node, it is important to keep two points in mind. First, the process only has three metal layers, which limits clock tree routing and causes more skew than would occur if more metal layers were available. Second, the standard cell library does not have the multi-adder cells that many commercial libraries have, which would have helped increase the maximum operating frequency.

Finally, using both the SPEF and the SAIF files, power dissipation numbers (excluding power dissipated in the memories) were generated using PrimeTime. For the 3D design the power dissipation is determined at the maximum operating frequencies of both the 2D and the 3D design, while for the 2D design the power dissipation is only determined at its own maximum operating frequency. At an operating frequency of 79.4 MHz the 3D design dissipates 409.2 mW. Operating at 63.7 MHz the 3D design dissipates 324.9 mW and the 2D design dissipates 340.0 mW, a 4.4% improvement. Using the power numbers of both circuits operating at maximum frequency, we compute the energy (excluding memory accesses) required per 1024-point FFT. The energy required to complete the FFT is 3.366 μJ in 3D as opposed to 3.552 μJ in 2D, a 5.2% improvement.

The results are summarized in Table 2, followed by a summary of the memory tradeoffs from Section 3 in Table 3.
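As a cross-check, the relative improvements quoted above can be recomputed from the reported operating points. The last line additionally estimates the 3D energy per FFT as power × cycle count × clock period, using the 653-cycle FFT length from Section 3 and the rounded 12.6 ns period; that formula is an assumption made for this sketch, not a statement of how the reported energy was obtained.

```python
def percent_change(old: float, new: float) -> float:
    """Signed change from `old` to `new`, as a percentage of `old`."""
    return 100.0 * (new - old) / old


# Reported operating points (2D value first, 3D value second).
freq_gain = percent_change(63.7, 79.4)          # MHz, higher is better
path_cut = -percent_change(15.7, 12.6)          # ns, lower is better
power_cut = -percent_change(340.0, 324.9)       # mW at 63.7 MHz
energy_cut = -percent_change(3.552, 3.366)      # microjoules per FFT

# Assumed model: energy = power * cycles * period, converted to microjoules.
energy_3d_uj = 409.2e-3 * 653 * 12.6e-9 * 1e6
```

The four percentages round to the values given in the text, and the estimated 3D energy lands close to the reported 3.366 μJ.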

Table 2: Comparison between the 2D and 3D metrics of the SAR FFT along with read and write energy from Cacti.

Metric                          2D        3D        Improvement
Total Area (mm²)                31.36     23.40     25.3%
Core Area (mm²)                 29.16     20.16     30.9%
Mean Net Length (μm)            836.0     392.9     53.0%
Total Wire Length (m)           19.107    8.238     56.9%
Max Speed (MHz)                 63.7      79.4      24.6%
Critical Path (ns)              15.7      12.6      19.7%
Logic Power @ 63.7 MHz (mW)     340.0     324.9     4.4%
Logic Power @ 79.4 MHz (mW)     n/a       409.2     n/a
FFT Logic Energy (μJ)           3.552     3.366     5.2%

Table 3: Read and write energy from Cacti comparing the un-optimized to the optimized design.

Metric                          Undivided   Divided   Improvement
Bandwidth (GBps)                13.4        128.4     854.9%
Energy Per Write (pJ)           14.48       6.142     57.6%
Energy Per Read (pJ)            68.205      26.718    60.8%
Memory Wires (#)                150         2272      -1414.7%

6. CONCLUSIONS
The main point of this paper is not only to compare the 2D and 3D implementations of the same design but also to show how a system can be re-optimized in 3D in ways that are not available in 2D. Furthermore, the system can be realized with commercial 2D tools; dedicated 3D tools are not necessary. The 3D-optimized design permits a single large memory to be broken down into multiple smaller memories, reducing the energy consumed in memory operations per FFT by 60.3%. This memory re-optimization would not be suited to a 2D design due to the high interconnect cost. In the 2D design, the increase in interconnect area is greater than the increase in memory area. Case in point, the 2D implementation of the architecture on the left side of Figure 1 is significantly worse than the 3D implementation of the architecture on the right side of Figure 1 in all metrics: power, performance and area. Finally, comparing the 2D and 3D implementations of the SAR FFT processor, we show an average wire length reduction of 53.0%, an overall wire length reduction of 56.9%, a 24.6% increase in maximum operating frequency, a 5.2% reduction in energy per FFT and a 30.9% reduction in core area.

7. ACKNOWLEDGMENTS
This project was funded by DARPA under contract FA8650-04-C-7127 and contract FA8650-04-C-7120, both managed by AFRL. Additional funding was provided by Semiconductor Research Corporation. The authors would like to thank MIT Lincoln Labs for providing access to their FD-SOI technology and Magnus Halldorsson at Reykjavik University for help with the memory partitioning approach.

Figure 8: The 3D floorplan.

8. REFERENCES
[1] J. Burns, B. Aull, C. Chen, C.-L. Chen, C. Keast, J. Knecht, V. Suntharalingam, K. Warner, P. Wyatt, and D. Yost. A wafer-scale 3-D circuit integration technology. IEEE Transactions on Electron Devices, 53(10):2507–2516, October 2006.

[2] P. Clarke. EDA's big three unready for 3D chip packaging. EE Times Asia Online, October 2007.

[3] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.

[4] S. Das, A. Chandrakasan, and R. Reif. Design tools for 3-D integrated circuits. In ASPDAC: Proceedings of the 2003 conference on Asia South Pacific design automation, pages 53–56, New York, NY, USA, 2003. ACM.

[5] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and P. D. Franzon. Demystifying 3D ICs: The Pros and Cons of Going Vertical. IEEE Design and Test of Computers, 22(6):498–510, Nov.-Dec. 2005.

[6] C. Hein, J. Pridgen, and W. Kline. RASSP Virtual Prototyping of DSP Systems. Design Automation Conference DAC 97, pages 492–497, 1997.

[7] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, 89(4):490–504, 2001.

[8] M. Jin and C. Wu. SAR correlation algorithm which accommodates large-range migration. IEEE Transactions on Geoscience and Remote Sensing, 22(6):592–597, 1984.

[9] Massachusetts Institute of Technology Lincoln Labs. MITLL Low-Power FDSOI CMOS Process Design Guide, revision 2008:6 edition, September 2008.

[10] A. Rahman and R. Reif. Thermal analysis of three-dimensional (3-D) integrated circuits (ICs). Interconnect Technology Conference, 2001. Proceedings of the IEEE 2001 International, pages 157–159, 2001.

[11] T. Sansaloni, A. Perez-Pascual, V. Torres, and J. Valls. Scheme for Reducing the Storage Requirements of FFT Twiddle Factors on FPGAs. The Journal of VLSI Signal Processing, 47(2):183–187, 2007.

[12] V. Suntharalingam, R. Berger, J. Burns, C. Chen, C. Keast, J. Knecht, R. Lambert, K. Newcomb, D. O'Mara, D. Rathman, D. Shaver, A. Soares, C. Stevenson, B. Tyrrell, K. Warner, B. Wheeler, D.-R. Yost, and D. Young. Megapixel CMOS image sensor fabricated in three-dimensional integrated circuit technology. Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International, pages 356–357 Vol. 1, Feb. 2005.

[13] S. Wilton and N. Jouppi. CACTI: an enhanced cache access and cycle time model. Solid-State Circuits, IEEE Journal of, 31(5):677–688, 1996.

Figure 9: The 2D floorplan for comparison.
