180mV low voltage FFT processor paper on IEEE



  • 2004 IEEE International Solid-State Circuits Conference 0-7803-8267-6/04 ©2004 IEEE

    ISSCC 2004 / SESSION 16 / TD: EMERGING TECHNOLOGIES AND CIRCUITS / 16.4

    16.4 A 180mV FFT Processor Using Subthreshold Circuit Techniques

    Alice Wang, Anantha Chandrakasan

    Massachusetts Institute of Technology, Cambridge, MA

    The key design metric in emerging applications such as wireless sensor networks is the energy dissipated per function rather than clock speed or silicon area. The authors' previous energy-scalable FFT ASIC uses an off-the-shelf standard-cell logic library and memory and scales down only to 1V operation [1]. This paper describes a custom real-valued FFT processor that operates over a variety of operating scenarios (programmable FFT length and bit precision) and employs circuit techniques that allow the supply voltage to be scaled deep into the subthreshold regime for minimal energy dissipation.

    As processing speed requirements are relaxed, the supply voltage can be scaled down well below the threshold voltage to minimize switching energy. However, at low clock frequencies, leakage energy dissipation can exceed active energy, leading to an optimal operating frequency and voltage that minimizes energy consumption. To investigate the optimal operating point for the FFT, logic and memory design techniques allowing subthreshold operation are needed. Previous research demonstrates the functionality of logic circuits at 200mV using low-threshold devices [2]. This FFT processor operates at 180mV using a standard CMOS 0.18µm logic process with threshold voltages of around 450mV.
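
The trade-off between switching and leakage energy can be illustrated with a first-order model. The sketch below is not taken from the paper and is not fitted to the chip: it assumes the energy per operation has the form E(V) = V²·(c_sw + k_leak·exp(−V/(n·vT))), where the exponential term stands in for leakage integrated over a cycle time that grows exponentially as the supply drops. The names `c_sw`, `k_leak`, and `n` and all numeric values are hypothetical, chosen only to make the shape of the curve visible.

```python
import math

THERMAL_VOLTAGE = 0.02585  # kT/q in volts near 300 K


def energy_per_op(vdd, c_sw=1.0, k_leak=2000.0, n=1.5):
    """Normalized energy per operation at supply voltage vdd (volts).

    Assumed first-order model with hypothetical parameters:
    switching energy ~ c_sw * vdd^2, and leakage energy ~
    k_leak * vdd^2 * exp(-vdd / (n * vT)), because the cycle time
    grows exponentially as vdd falls below threshold.
    """
    switching = c_sw * vdd ** 2
    leakage = k_leak * vdd ** 2 * math.exp(-vdd / (n * THERMAL_VOLTAGE))
    return switching + leakage


def sweep_minimum(v_lo=0.15, v_hi=0.90, steps=751):
    """Return (vdd, energy) at the lowest-energy point of a linear sweep."""
    best_v, best_e = v_lo, energy_per_op(v_lo)
    for i in range(1, steps):
        v = v_lo + (v_hi - v_lo) * i / (steps - 1)
        e = energy_per_op(v)
        if e < best_e:
            best_v, best_e = v, e
    return best_v, best_e


if __name__ == "__main__":
    v_opt, e_opt = sweep_minimum()
    print(f"toy-model minimum: VDD = {v_opt * 1000:.0f} mV, E = {e_opt:.3f} (normalized)")
```

In this toy model the minimum falls strictly inside the swept range: below it the leakage term dominates, above it the switching term does. Where the minimum lands depends entirely on the assumed parameters.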

    The 16b architecture of the FFT is shown in Figure 16.4.1. After the input data is reordered and clocked into the data memory, one N-point real-valued FFT is performed. In one clock cycle, two 32b complex values (A,B) are read from the data memory, and the datapath outputs (X,Y) are written back to the memory. In addition, one 32b complex twiddle factor (W) is read from the ROM. The 512-word, 32b memory bank for the FFT is segmented by address parity and MSB to avoid read/write memory hazards. Additionally, the memory is configured to allow for variable FFT lengths. The 16b hardware for both memory and datapath logic is reused for 8b processing. In Fig. 16.4.1, the LSB inputs to the 16b Baugh-Wooley multiplier are gated to configure the multiplier for energy-efficient 8b processing.
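
The per-cycle dataflow described above (read A, B, and W; write back X and Y) has the shape of a radix-2 butterfly. As a minimal sketch, assuming a decimation-in-time form X = A + W·B, Y = A − W·B (the text does not state which butterfly form the datapath uses) and ignoring the 8b/16b fixed-point arithmetic, one butterfly can be written as follows. The function names are illustrative.

```python
import cmath


def butterfly(a, b, w):
    """One radix-2 butterfly: returns (x, y) = (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t


def twiddle(k, n):
    """Twiddle factor W_n^k = exp(-2*pi*j*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


if __name__ == "__main__":
    # A 2-point DFT is a single butterfly with W = 1.
    x, y = butterfly(3 + 0j, 1 + 0j, twiddle(0, 2))
    print(x, y)
```

In hardware terms, `a` and `b` correspond to the two values read from the data memory, `w` to the value read from the twiddle ROM, and the returned pair to the values written back.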

    For ultra-low voltage operation, there are new circuit design considerations. As the supply voltage decreases, a CMOS inverter may not achieve rail-to-rail output voltage swing due to reduced Ion/Ioff. An increase in Wp/Wn causes larger PMOS drive currents and improves the output-high swing but degrades the output-low voltage level by increasing PMOS leakage currents. This effect is further compounded by process variations. At the FS corner, the fast NMOS is leakier than the slow PMOS, which raises Wp(min). Similarly, the SF corner sets Wp(max). Figure 16.4.2 shows Wp(min) and Wp(max) of the inverter at process corners, assuming a 10-90% output voltage swing and Wn=0.44µm. The worst-case minimum supply voltage for this cell is estimated at 195mV, at Wp=4.8µm, where the two curves intersect.
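
To see why reduced Ion/Ioff limits the output swing, a commonly used first-order subthreshold model puts the on/off current ratio of a single device at roughly exp(VDD/(n·vT)). The sketch below evaluates that ratio at a few supply voltages. The slope factor `n = 1.5` is an assumed, illustrative value rather than a parameter from the paper, and effects such as DIBL and process variation are ignored.

```python
import math

THERMAL_VOLTAGE = 0.02585  # kT/q in volts near 300 K


def on_off_ratio(vdd, n=1.5):
    """Approximate Ion/Ioff for one device with its gate at vdd versus 0 V.

    Assumes the subthreshold current scales as exp(Vgs / (n * vT));
    n is a hypothetical slope factor.
    """
    return math.exp(vdd / (n * THERMAL_VOLTAGE))


if __name__ == "__main__":
    for vdd_mv in (100, 180, 350):
        ratio = on_off_ratio(vdd_mv / 1000)
        print(f"VDD = {vdd_mv} mV -> Ion/Ioff ~ {ratio:.0f}")
```

Under these assumptions the ratio at 100mV is only on the order of ten, so a handful of leaking devices can noticeably pull an output away from the rail.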

    Parallel leakage, sneak leakage paths, and stacked devices all create problems for traditional logic circuits in deep subthreshold operation. For example, Fig. 16.4.3 shows the operation of a standard tiny XOR logic gate. When operating at normal voltage levels, for A=1, B=0, the output node, Z, is high. However, when the voltage is scaled down to 100mV, the output voltage is degraded by three leaking devices and reduced-swing input voltages (due to imperfect inverters). Alternatively, a transmission-gate XOR has fewer parallel devices, which improves subthreshold performance at worst-case input vectors. Additionally, having both NMOS and PMOS in the pull-up and pull-down reduces the effects of process variations on minimum voltage operation. Sneak leakage paths between standard cells are minimized by introducing inverters and buffers and by carefully analyzing interfaces between standard cells. In multiple stacked devices, the drive current is significantly reduced in subthreshold operation, so subthreshold transmission-gate MUXes cannot be directly cascaded. Datapath and control circuits for the subthreshold FFT processor are developed by minimizing stacked devices, reducing parallel leakage, and avoiding sneak leakage paths.

    Memory design using subthreshold operation is challenging. Conventional SRAM designs will not function at low voltage due to reduced Ion/Ioff and bitline leakage that depends on the values stored in memory. For read access in deep subthreshold operation, the bitline is segmented by using a MUX-based hierarchical approach (Fig. 16.4.4). The selectors to the MUXes are the read-address inputs, and the data from the memories is hierarchically passed through the MUXes to the output. The MUXes are designed to ensure a high Ion/Ioff at each level of hierarchy by avoiding parallel leakage and stack effects. The simulation in Fig. 16.4.4 contrasts operation of the hierarchical read bitline with a conventional read bitline. The MUXes can be daisy-chained and arrayed for compact layout. The same hierarchical design is used to create subthreshold twiddle ROMs. A latch-based circuit is used for reliable write access at very low voltages and process corners (Fig. 16.4.4).
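
The hierarchical read path can be modeled behaviorally as a tree of MUXes in which each read-address bit selects between two candidates at one level. The sketch below assumes 2:1 MUXes and LSB-first selection, neither of which the text specifies, and the function name `hierarchical_read` is illustrative. It models only the selection logic, not the electrical behavior.

```python
def hierarchical_read(words, address):
    """Read words[address] through a tree of 2:1 MUXes.

    Level 0 pairs adjacent words and selects with address bit 0; each
    later level halves the candidates using the next address bit, so
    every node sees only two inputs instead of one long shared bitline.
    Assumes len(words) is a power of two.
    """
    n = len(words)
    if n == 0 or n & (n - 1):
        raise ValueError("number of words must be a power of two")
    if not 0 <= address < n:
        raise ValueError("address out of range")

    level = list(words)
    bit = 0
    while len(level) > 1:
        select = (address >> bit) & 1
        # One 2:1 MUX per adjacent pair at this level of the hierarchy.
        level = [level[i + select] for i in range(0, len(level), 2)]
        bit += 1
    return level[0]


if __name__ == "__main__":
    memory = [10 * i for i in range(8)]
    print(hierarchical_read(memory, 5))
```

Because each MUX output is driven by one of only two inputs, the number of leaking devices seen at any node stays fixed regardless of memory depth, which is the property the text relies on.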

    The low-voltage FFT, containing 627k transistors, is fabricated in a standard 0.18µm 6M CMOS process. It is fully functional at 128 to 1024 FFT lengths, 8b and 16b precision, for voltage supplies of 180 to 900mV and for clock frequencies of 164Hz to 6MHz. The minimum supply voltage is 180mV, where it dissipates 90nW. Figure 16.4.5 is an oscilloscope plot of outputs from the FFT chip functioning at 180mV. The optimal operating point is where energy is minimized and is a function of activity factor and process technology. The optimal operating point is at 350mV with a clock frequency of 9.6kHz and is shown in Fig. 16.4.6. This figure is a plot of the energy and the performance for a 16b, 1024-point FFT as a function of VDD. As previously reported, a low-power FFT processor implemented in a 0.7µm process dissipates 3.4µJ when performing one 1024-point CVFFT at 1.1V [3]. The energy used by this FFT processor to compute one 16b, 1024-point RVFFT at the optimal operating point is 155nJ. Figure 16.4.7 shows a die photo of the IC, which occupies 2.6mm x 2.1mm.
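
The energy figures quoted above can be put side by side with a little arithmetic. The sketch below uses only values stated in the text. Note that the comparison spans different process nodes and a complex-valued versus a real-valued FFT, so the ratio is indicative rather than like-for-like, and the V² scaling figure assumes E ∝ C·V² with capacitance held equal, which does not hold across the two designs.

```python
# Values quoted in the text above.
REFERENCE_ENERGY_J = 3.4e-6   # 1024-point CVFFT at 1.1 V, 0.7 um process [3]
THIS_WORK_ENERGY_J = 155e-9   # 16b, 1024-point RVFFT at the optimal point
REFERENCE_VDD = 1.1           # volts
OPTIMAL_VDD = 0.35            # volts


def energy_ratio(reference_j, this_work_j):
    """How many times less energy this work uses per FFT."""
    return reference_j / this_work_j


def cv2_scaling(v_high, v_low):
    """Switching-energy reduction from supply scaling alone (E ~ C * V^2)."""
    return (v_high / v_low) ** 2


if __name__ == "__main__":
    ratio = energy_ratio(REFERENCE_ENERGY_J, THIS_WORK_ENERGY_J)
    scaling = cv2_scaling(REFERENCE_VDD, OPTIMAL_VDD)
    print(f"energy ratio: {ratio:.1f}x")
    print(f"V^2 scaling alone: {scaling:.1f}x")
```

The energy ratio comes to roughly 22x, while quadratic supply scaling from 1.1V to 350mV would by itself give about 9.9x under the stated assumption.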

    Acknowledgments:
    The authors thank J. Cline for her help with the multiplier design. We also thank B. Calhoun and Prof. K.C. Smith for valuable feedback on the paper. This effort is sponsored by DARPA Power Aware Computing and Communications (PAC/C) and the Air Force Research Laboratory, under agreement number F33615-02-2-4005. A. Wang is supported by an Intel PhD Fellowship.

    References:
    [1] A. Wang and A. Chandrakasan, “Energy-Aware Architectures for a Real-Valued FFT Implementation,” ISLPED 2003, pp. 360-365, August 2003.
    [2] J. Burr and J. Shott, “A 200mV Self-Testing Encoder/Decoder Using Stanford Ultra-Low-Power CMOS,” ISSCC Dig. Tech. Papers, pp. 84-85, Feb. 1994.
    [3] B. Baas, “A Low-Power, High-Performance, 1024-Point FFT Processor,” IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380-387, March 1999.


    ISSCC 2004 / February 17, 2004 / Salon 10-15 / 3:15 PM


    Figure 16.4.1: RVFFT architecture that enables scalability in bit precision and FFT length, and includes circuits which can scale down to 180mV operation.

    Figure 16.4.2: Sizing trade-off for an inverter at the minimum operating voltage with process variation considerations given Wn=0.44µm (simulation).

    Figure 16.4.3: The effects of parallel leakage are compounded at ultra-low voltages, as shown by the standard-cell tiny XOR gate for the inputs A=1 and B=0 at VDD=100mV. Parallel leakage is reduced in the subthreshold XOR gate, which functions better at 100mV.

    Figure 16.4.4: The MUX-based hierarchical-read access works reliably at 100mV in simulation compared to a conventional read bitline (RBL).

    Figure 16.4.5: Oscilloscope plot showing outputs from the RVFFT chip at 180mV operation.

    Figure 16.4.6: Energy and FFT clock frequency for 16b, 1024-point RVFFT as a function of VDD.

    Figure 16.4.7: Die photograph of the 180mV real-valued FFT chip.
