Analysis and Circuit Design
for Low Power Programmable
Logic Modules
by
Eric A. Kusse
Master of Science in
Electrical Engineering and Computer Science
University of California at Berkeley
Abstract
This thesis presents research in low power programmable logic. In particular, programmable gate array (PGA) structures are examined, emphasizing their strengths and weaknesses from a power perspective. A detailed analysis of power consumption for a Xilinx XC4003A was conducted to gain insight into which blocks of a PGA consume the most power and how those elements contribute to a mapped design's total power. In addition, several circuit design issues pertaining to PGAs were studied. Several problems surrounding low voltage pass transistor design were identified and solutions were evaluated. Based on the results from the power analysis and the circuit studies, a new PGA architecture has been defined. Finally, a 2x4 mini-array implementation of the design in a 3-layer metal, 0.6um CMOS process has been completed. Extracted simulations show order of magnitude energy improvements over the 4003A with only a factor of 2 speed penalty using dual supply voltages of 1.5 and 2 volts.
Table of Contents
Introduction ..... 1
1.1 What Are Programmable Gate Arrays (PGAs)? ..... 1
1.2 Why Use Them? ..... 3
1.3 What about performance? ..... 4
1.4 PGA Power? ..... 6
1.5 Research Contributions ..... 7
1.6 Thesis Organization ..... 8
PGA Architecture Overview ..... 10
2.1 Basic Components and Blocks ..... 10
2.1.1 Logic Blocks ..... 10
2.1.2 Programmable Interconnect ..... 12
2.1.3 I/O Ring ..... 14
2.2 Architectural Styles ..... 14
2.2.1 Island ..... 14
2.2.2 Cellular ..... 15
2.3 The Xilinx XC4000 Family ..... 15
2.3.1 Composition of a configurable logic block (CLB) ..... 15
2.3.2 Interconnect Resources ..... 16
2.4 The CLAy Architecture ..... 17
2.4.1 Logic Cells ..... 18
2.4.2 Interconnect Resources ..... 19
2.5 Triptych ..... 20
2.6 Conclusions ..... 21
2.6.1 Architectural Design Insights ..... 22
PGA Power Analysis ..... 23
3.1 Sources of Power Consumption ..... 23
3.1.1 Dynamic ..... 23
3.1.2 Short Circuit Current ..... 24
3.1.3 Leakage ..... 24
3.1.4 Static Current ..... 25
3.2 Xilinx Component Power Measurements (opening the black box) ..... 26
3.2.1 Method ..... 26
3.2.2 Components Measured ..... 27
3.2.3 Results ..... 28
3.3 Xilinx Profiler Description ..... 31
3.3.1 General Idea ..... 31
3.3.2 Flow ..... 32
3.3.3 Summary of Profiler Outputs ..... 38
3.4 Results & Recommendations ..... 38
3.4.1 Benchmark Data ..... 38
3.4.2 Conclusions ..... 40
3.4.3 Error Sources ..... 45
3.4.4 Periphery Effects ..... 45
3.4.5 Activity Estimation ..... 45
3.4.6 Interconnect Cap Values ..... 46
3.5 Triptych Analysis ..... 46
Application of Low Power Techniques to PGA Design ..... 52
4.1 Hierarchy of Low Power Techniques ..... 53
4.1.1 Algorithmic ..... 53
4.1.2 Architectural ..... 54
4.1.3 Logic ..... 55
4.1.4 Circuit ..... 57
4.1.5 Layout ..... 59
4.2 Reducing Dynamic Power in PGAs ..... 60
4.2.1 Capacitance Reduction ..... 60
4.2.2 Frequency ..... 68
4.2.3 Voltage Scaling ..... 69
4.2.4 Low Swing ..... 69
Low Voltage Pass Transistor Design ..... 73
5.1 Design Issues with Pass Transistors at Low Supply ..... 73
5.1.1 Signal Loss Effects on Performance ..... 74
5.1.2 Multiple Threshold Losses ..... 77
5.2 Level Restoration ..... 80
5.3 Vt Scaling and Leakage ..... 83
5.3.1 Subthreshold Leakage Currents ..... 83
5.3.2 Static Current Dissipation in Pass Transistor Fed Inverters ..... 85
Cell Design and Array Architecture ..... 87
6.1 Logic Cell Circuit Options ..... 87
6.1.1 Pass Transistor ..... 89
6.1.2 Cross-Coupled PMOS Restorer (CCP) ..... 90
6.1.3 Boosted Capacitor Restoration (BCR) ..... 92
6.1.4 Buffered Pass Transistor ..... 93
6.1.5 Transmission Gate ..... 94
6.1.6 Decoder-Based ..... 95
6.1.7 Static Complex Gate ..... 96
6.1.8 Current Mode Logic ..... 97
6.1.9 Low Threshold Devices ..... 98
6.1.10 Final Cell Results ..... 99
6.2 Logic Cell Output Circuitry ..... 101
6.3 Interconnect Circuitry ..... 102
6.4 Config Memory/FFs for CLBs ..... 107
6.5 Logic Cell Array Architecture ..... 108
6.6 Detailed Architectural Discussion ..... 109
6.6.1 Level 0 Direct Interconnect ..... 112
6.6.2 Level 1 Local Bus Interconnect ..... 116
6.6.3 4x Interconnect Layer ..... 118
6.7 Mapping ..... 120
6.7.1 Correlator Mapping ..... 121
6.7.2 Add-Compare-Select Mapping ..... 121
6.7.3 Mapping Guidelines ..... 124
Low Power PGA Results ..... 127
7.1 Overview ..... 127
7.2 Layout Discussion ..... 127
7.3 Capacitance Data ..... 129
7.4 Performance Data ..... 133
7.5 Area ..... 137
Conclusions ..... 139
8.1 Conclusions ..... 139
8.2 Future Work ..... 141
References ..... 142
Appendix A: Xilinx XC4003A Resource Measurements
Appendix B: Power Profiling Data
Appendix C: Detailed Breakdown of Low Power PGA Energy and Capacitance
Appendix D: PGA Configuration Encodings and Example Initialization File
Appendix E: PGA Memory Cell Programming Polarities and Memory Cell Layout Locations.
List of Figures
FIGURE 1-1: Design Implementation Spectrum ..... 1
FIGURE 1-2: PGA User Design Flow ..... 2
FIGURE 1-3: PGA Application Domains: Stand-Alone Commodity Parts and Embedded Section on a 'System on a Chip' ..... 3
FIGURE 2-1: Logic Block Designs ..... 11
FIGURE 2-2: Types of PGA Interconnect Resources ..... 12
FIGURE 2-3: Two Programming Methods for PGA Interconnect ..... 13
FIGURE 2-4: Various Levels of Switch Population and Examples of Segmentation ..... 13
FIGURE 2-5: Representative Architecture of Island and Cellular PGAs ..... 14
FIGURE 2-6: Xilinx XC4000 CLB ..... 16
FIGURE 2-7: Switch Matrix Detail ..... 17
FIGURE 2-8: Xilinx 4003A Interconnect Diagram ..... 18
FIGURE 2-9: CLAy Architecture Logic Block ..... 19
FIGURE 2-10: CLAy Architecture Interconnect Scheme ..... 20
FIGURE 2-11: Triptych Logic Cell and Interconnect Scheme ..... 21
FIGURE 3-1: MOS Parasitic Capacitances ..... 24
FIGURE 3-2: Heavily Loaded Interconnect due to Switch Diffusion Capacitance ..... 30
FIGURE 3-3: Xilinx Power Analysis Flow ..... 32
FIGURE 3-4: Excerpt from .lca file (contains routing and configuration data for a mapped design) ..... 33
FIGURE 3-5: Output of xpx.pm in Initial Non-Activity Weighted Analysis of Correl4b ..... 35
FIGURE 3-6: Output from xpx.pm on Post-Processing Pass of Correl4b ..... 37
FIGURE 3-7: Profiler Outputs ..... 38
FIGURE 3-8: Power Breakdown for Xilinx XC4003A ..... 39
FIGURE 3-9: Interconnect Power Breakdown by Resource for XC4003A ..... 39
FIGURE 3-10: Interconnect Wiring Resource Usage Breakdown for XC4003A ..... 40
FIGURE 3-11: Correlator Schematic ..... 43
FIGURE 3-12: Triptych Routing Scheme and Cell Routing Paths ..... 47
FIGURE 3-13: Triptych Logic Cell ..... 47
FIGURE 3-14: Triptych Center Output Driver Chain ..... 50
FIGURE 4-1: Hierarchy of Power Reduction Domains ..... 54
FIGURE 4-2: Wiring Capacitance Parasitics ..... 62
FIGURE 4-3: Pass Transistor Diffusion Capacitance vs. Fanout and Switch Size (W), L=0.6um ..... 63
FIGURE 4-4: Model for Programmable Interconnect ..... 63
FIGURE 4-5: Simulation Model for Switch Size Experiments ..... 65
FIGURE 4-6: Delay and Energy of Interconnect Chains vs. Switch Size ..... 65
FIGURE 4-7: Energy-Delay of Interconnect Chains vs. Switch Size ..... 66
FIGURE 4-8: Possible Logic Cell Input Structures ..... 67
FIGURE 4-9: Delay and Capacitance of Logic Cell Input Structures ..... 68
FIGURE 4-10: Low Swing Latched Sense Amp Receiver ..... 70
FIGURE 4-11: Pseudo-Differential Receiver for Low Swing Signals ..... 71
FIGURE 5-1: NMOS Pass Transistor Feeding an Inverter ..... 74
FIGURE 5-2: Pass Transistor Threshold's Impact on Delay ..... 76
FIGURE 5-3: Supply Voltage Impact on Delay ..... 77
FIGURE 5-4: Logic Cell Design with Multiple Drain to Gate Pass Transistor Connections ..... 78
FIGURE 5-5: Signal Loss Levels at Various Supply Voltages ..... 79
FIGURE 5-6: Level Restoration Schemes ..... 80
FIGURE 5-7: Level Restorer Delay and Energy Comparison with Single Threshold Loss ..... 81
FIGURE 5-8: Level Restorer Delay Comparison with Multiple Threshold Loss (low Vt pass transistors) ..... 82
FIGURE 5-9: Restorer Delay and Energy for Different Supply Voltages (low Vt pass transistors) ..... 83
FIGURE 5-10: Measured Sub-threshold Id-Vt Characteristics for Various Vgs [19] ..... 84
FIGURE 5-11: Sneak Leakage Path Between Pass Transistor Connected LUTs ..... 84
FIGURE 5-12: Static Current Dissipation for Low Vt Inverter Fed by Pass Transistor ..... 85
FIGURE 6-1: Pass Transistor Logic Cell ..... 90
FIGURE 6-2: Cross-Coupled PMOS Restored Pass Transistor Logic Cell ..... 91
FIGURE 6-3: Boosted Capacitor Restoration Input Circuitry ..... 92
FIGURE 6-4: Buffered Pass Transistor Logic Cell ..... 93
FIGURE 6-5: Impact of Transmission Gate Sizing ..... 94
FIGURE 6-6: Transmission Gate Logic Cell ..... 95
FIGURE 6-7: NAND Decoder Style LUT ..... 96
FIGURE 6-8: Static Branch-Based Logic Cell (only 2-LUT) ..... 97
FIGURE 6-9: Current Mode Logic Cell Implementation (only 2-LUT for simplicity) ..... 98
FIGURE 6-10: Delay of Potential Cell Designs ..... 100
FIGURE 6-11: Energy of Potential Cell Designs ..... 100
FIGURE 6-12: Energy-Delay Product of Potential Cell Designs ..... 101
FIGURE 6-13: Programmable Output Circuitry ..... 102
FIGURE 6-14: Delay and Energy for Interconnect Chains ..... 103
FIGURE 6-15: Energy-Delay Product of Interconnect Chains ..... 104
FIGURE 6-16: Typical Path from Interconnect Through Logic Cell and Back to Interconnect ..... 105
FIGURE 6-17: Fully Optimized Typical Path from Interconnect to Logic Cell and Back to Interconnect ..... 106
FIGURE 6-18: Configuration Memory Options ..... 107
FIGURE 6-19: Basic Logic Cell Pair Tile ..... 110
FIGURE 6-20: Simplified Diagram of Logic Cell Pair with Input Breakdown ..... 111
FIGURE 6-21: Level 0 Direct Interconnect (pattern spans the entire array) ..... 112
FIGURE 6-22: Level 1 Vertical and Horizontal Interconnect (showing cell output connections and the 4 versions of switch matrices) ..... 113
FIGURE 6-23: Diagrams Depicting the Directly Connected Cells from Input and Output Perspective ..... 113
FIGURE 6-24: Ripple Carry Adder Mapping Making Use of Direct Connects ..... 114
FIGURE 6-25: FSM Mapping of ISCAS S27 (3 states) ..... 115
FIGURE 6-26: Level 1 Switch Matrix Detail ..... 117
FIGURE 6-27: 4x Horizontal and Vertical Interconnect Layers ..... 119
FIGURE 6-28: Level 1 and Full 4x Switch Matrix Connectivity and Buffering Scheme (only 4x connections shown, rest same as 1x switch matrix) ..... 120
FIGURE 6-29: Correlator Mapping ..... 122
FIGURE 6-30: Add-Compare-Select Block of a 32 State, Radix-4 Viterbi Decoder (4-way) ..... 123
FIGURE 6-31: Floorplan of ACS Mapping in Figure 6-32 ..... 124
FIGURE 6-32: Add-Compare-Select Partial Mapping ..... 125
FIGURE 7-1: 2x4 Mini-Array of Logic Cells with Block Diagram Representation (4 pairs of cells) ..... 128
FIGURE 7-2: Logic Cell Layout ..... 130
FIGURE 7-3: Typical Signal Path from Level 1 Horizontal Track Input to Diagonal Output ..... 135
FIGURE 7-4: HSPICE Waveforms for Multi-Path Simulation ..... 136
FIGURE 7-5: Multi-Path Example ..... 137
List of Tables
TABLE 1-1 Energy Metrics for Various Designs ..... 7
TABLE 1-2 Process Parameters ..... 8
TABLE 2-1 XC4003A Delays ..... 17
TABLE 3-1 Xilinx XC4003A Architecture Components ..... 27
TABLE 3-2 Measured Component Energies and Capacitances ..... 28
TABLE 3-3 Estimated Correlator Capacitance Breakdown ..... 43
TABLE 3-4 Net Capacitance Statistics ..... 44
TABLE 3-5 Triptych Cell Area ..... 48
TABLE 3-6 Triptych Delay Performance ..... 48
TABLE 3-7 Triptych Cell Capacitances ..... 49
TABLE 5-1 HP Process Parameters ..... 73
TABLE 5-2 Comparison of Nominal Vt and Low Vt NMOS Devices (W=1.8u) ..... 75
TABLE 5-3 Vt Variation with Well Bias ..... 75
TABLE 7-1 Energy and Extracted Capacitance Data ..... 130
TABLE 7-2 Performance Data ..... 133
TABLE 7-3 Delay Breakdown for Path in Figure 7-3 ..... 135
TABLE 7-4 Delays from Multi-Path Example ..... 136
TABLE 7-5 Area Breakdown ..... 137
CHAPTER 1
Introduction
1.1 What Are Programmable Gate Arrays (PGAs)?
A number of choices exist when implementing a digital integrated circuit design. The
options available traverse a range which can be thought of as an increasing investment in
design time and customization. Figure 1-1 shows programmable gate arrays (PGAs) at
one end of the implementation spectrum and a full custom design at the other.
Full custom design involves time-intensive optimization of circuits and layout resulting in
the maximal performance. By giving up a small amount of performance, a standard cell
library methodology (typically used in ASIC design) can be applied which allows the use
of several automated tools to more quickly bring a design to completion. On the far end
of the spectrum lies the programmable gate array approach. This method has been
growing in popularity as time to market pressures force reductions in design time. As the
[Figure: a spectrum running from Full Custom through Standard Cell and Gate Array to PGA, with design time and cost increasing toward the full-custom end and implementation flexibility increasing toward the PGA end.]
FIGURE 1-1: Design Implementation Spectrum
cost of custom IC manufacturing necessitates very high volume production, amortization of those costs by fabricating a general purpose IC has become more attractive.
The programmable flexibility of a PGA is what gives rise to its benefits over custom
and semi-custom designs. Once a design has been specified as a set of schematics or as an HDL description, it is mapped to the hardware defined by the PGA architecture. A two-step mapping process is typically used to transform the logic to fit the physical constraints
of the PGA. The first step is termed technology independent mapping where the logic
equations are optimized and redundant terms are removed. Next, the technology
dependent mapping is performed where logic is allocated to programmable logic blocks
whose functionality is determined by the PGA architecture. After mapping, a router
defines the connectivity of the design using the interconnection resources available in the
PGA (Figure 1-2). At this point, the design is complete and can be verified by lab tests or
back-annotated simulation.
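The technology-dependent step of this flow can be sketched concretely for the common case where the logic block is a look-up table (LUT): each k-input logic function is packed into 2^k configuration bits. The function name and the choice of k=4 below are illustrative assumptions, not details of any particular architecture discussed here.

```python
# Sketch of technology-dependent mapping onto a k-input LUT: the LUT's
# configuration bits are just the function's truth table, one bit per
# input pattern. k=4 and the helper name are illustrative assumptions.

def lut_bits(func, k=4):
    """Return the 2**k configuration bits for a k-input LUT.

    `func` maps a tuple of k input bits to one output bit; the LUT bit
    at address i is the function's value for that input pattern.
    """
    bits = []
    for i in range(2 ** k):
        inputs = tuple((i >> b) & 1 for b in range(k))  # LSB-first decode
        bits.append(func(inputs))
    return bits

# Example: a 4-input AND maps to a LUT with a single '1' entry,
# at the all-ones address.
and4 = lut_bits(lambda x: int(all(x)))
assert and4.count(1) == 1 and and4[-1] == 1
```

Any 4-input function maps this way, which is exactly why gate-count equivalents for LUT-based blocks are somewhat subjective: one LUT can hold anything from a single inversion to a dense 4-input function.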
[Figure: flow from Schematic/HDL Description through Logic Mapping (technology independent optimization) and Placement & Routing to the PGA Programming Bitstream.]
FIGURE 1-2: PGA User Design Flow
1.2 Why Use Them?
As mentioned above, PGAs offer a unique alternative in the IC design implementation
space; however, there are many other reasons to choose PGAs. Since the programming of
a PGA is determined by a configuration memory, the design can be quickly and easily
changed after an initial design version. This allows fixing bugs with minimal effort
whereas a custom design requires a new mask set resulting in a significant added design
cost. In a relatively low volume design, the ability to fix bugs cheaply is crucial.
In addition, reprogrammability offers further design advantages. Reconfigurable
computing architectures have generated much interest recently as a viable trade-off for
absolute performance in return for more flexible logic utilization. Rather than dedicate
large amounts of custom logic to performing one function, a programmable body of logic
can offer acceptable performance over a wider range of functions. In fact, the presence of
a programmable body of logic embedded within a larger custom chip design can allow a
greater measure of programmability and flexibility to the whole design [27]. One can
envision several different programming templates which can be used to reconfigure the
PGA for any one of a number of desired functions depending on the current application
being executed. Depending on the task, a PGA may provide the best trade-off in performance, power, and flexibility among the resources available. Specifically, a PGA may
provide an advantage over purely software and ASIC approaches for certain functions.
[Figure: a stand-alone PGA array alongside a 'system on a chip' containing an embedded PGA, memory, control, and specialized datapath blocks.]
FIGURE 1-3: PGA Application Domains: Stand-Alone Commodity Parts and Embedded Section on a 'System on a Chip'.
Thus, PGAs can be seen as stand-alone commodity parts as well as modules in a larger
chip design. Along these lines, some environments suitable to PGAs are:
• multi-modal test structure for a system on a chip.
• prototyping and testing of ASIC designs.
• hardware accelerators (decoders, encoders, error correction, bit- and byte-wise operations).
• flexible logic resource (address generation, encryption).
Lastly, not unlike traditional CMOS designs, FPGAs have benefited from shrinking
feature sizes resulting in higher capacity arrays. This increase in gate counts has further
expanded the range of possible PGA implementations opening the doors to an even greater
body of applications.
1.3 What about performance?
Whenever performance is to be examined, a number of metrics must be used. Within
the PGA industry, the most important have been:
• Density or Equivalent Gates
• Logic Utilization
• Routability
• Speed
Density is often given in terms of the ability to realize an equivalent number of gates.
Today, FPGA parts typically range from a thousand to one hundred thousand gates. This
represents the ideal number of gates that can be used towards implementing a design. Two
words of caution are in order here. First, the number of gates that can be derived from a
logic block is somewhat subjective due to the programmable nature of the block. In
general, a reasonable approximation of the gate capacity can be derived from the number
of inputs and outputs the logic block offers. Secondly, the achievable logic utilization is
likely to be far lower (50% to 10%) than the equivalent number of gates that can be implemented. This inefficiency arises from the difficulties encountered when routing a densely packed design and from cases where logic blocks are used to implement very simple functionality (i.e., a single inversion).
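As a rough sanity check on these utilization figures, the logic actually available from a part is its marketed equivalent-gate count scaled by the achievable utilization. The 10,000-gate part below is a made-up round number for illustration, not a specific device.

```python
# Usable logic = marketed equivalent gates x achievable utilization.
# The part size is an illustrative assumption; the 10%-50% range
# restates the utilization figures quoted in the text.

def usable_gates(equivalent_gates, utilization):
    return int(equivalent_gates * utilization)

low = usable_gates(10_000, 0.10)   # pessimistic end of the quoted range
high = usable_gates(10_000, 0.50)  # optimistic end of the quoted range
assert (low, high) == (1000, 5000)
```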
In terms of physical area, overhead from logic and interconnect greatly impacts packing
density. Instead of the simple implementation of a desired gate, a PGA logic block must
support several different logic configurations. As a result, more transistors are necessary
and configuration memory must be added. However, active area is not the largest source
of area overhead. The large amounts of interconnect lead to large channels dedicated to
only wiring. An empirical review conducted in [8] shows that the interconnect area
(including switches) is nearly 100 times greater than the active area devoted to the
functional block. As will be discussed in a later section, the interconnect area penalty can
be alleviated somewhat by using more advanced processes which offer more layers of
metallization. In general though, configuration memory and interconnect account for
nearly 90% of a PGA’s die area.
As previously alluded to, routability influences effective density, but it is also an
important metric on its own. Routability is not a concrete metric like delay. Instead,
routability reflects the amount of interconnection resources provided for connecting logic
blocks together and also the efficiency with which those resources can be utilized.
Routability cannot be guaranteed for all possible designs to be implemented, but is heuris-
tically determined for a representative subset of potential designs. The flexibility and
amount of interconnect provided in a PGA architecture strongly influences its routability.
However, routability’s heuristic nature complicates an analytical determination of the
necessary quantity and make-up of interconnection resources. Thus, a PGA designer must
remain mindful of the logic intended to be implemented on the PGA.
After density, logic utilization, and routability, speed is the most touted architecture
performance metric in the PGA industry. The obvious drawback of the general purpose
design offered by a PGA is sub-optimal performance. The speed of programmable logic
considerably lags behind full-custom and standard cell designs. Typical design clock
speeds are less than 50 MHz, far below the hundreds of MHz of recent full-custom
microprocessor designs.
Speed, or conversely delay, consists of two basic components: the logic delay and the
interconnection delay. The logic delay is the propagation time through a series of
configured logic blocks from their inputs to their outputs. In typical ASIC and custom IC
designs, logic delay is the dominant component of overall delay although shrinking
feature sizes have begun to place more importance on interconnect delays. A programma-
ble gate array, on the other hand, experiences much greater interconnection delays and in
many cases, the interconnect delays can be the performance limiter for a design. To
support the flexibility necessary to map nearly any logic design to the hardware, FPGAs
contain an abundance of interconnect. Since the interconnect resources must support
general connection patterns through switching, rather than dedicated wiring, routing is
significantly more resistive and capacitive. In a typical PGA implementation, the intercon-
nect accounts for 40%-80% of the overall design delay [8],[47]. Thus, PGA wiring delays
are substantially worse than in a traditional CMOS design.
The above discussion of performance mentions nothing about power consumption.
This is because power consumption has been a long-neglected metric of programmable
gate array design. Xilinx and Altera, the leading FPGA manufacturers, do not provide any
tools for estimating power consumption for their parts, and commercial databooks only
give rough power consumption guidelines suitable for first-order hand calculations [43],[46].
1.4 PGA Power?
Power consumption has become a major concern in IC design. Higher clock speeds
coupled with ever increasing integration levels have combined to increase power
consumption of high performance chips into the tens of watts range. PGAs, as could be
expected from their general purpose architecture, pay a high price in power consumption.
Table 1-1 shows some energy metrics for a variety of designs. A comparison between the
XC4003A and the static CMOS full adder implementation shows a 100x difference in
energy consumption for an 8-bit adder. Entire processors consume less power than a
design consisting of 50 XC4000XL logic functions.
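The 100x figure can be sanity-checked from Table 1-1, assuming the static full-adder energy scales linearly to 8 bits (a rough sketch, not a measurement):

```python
# Rough check of the ~100x energy gap cited above, using Table 1-1 numbers.
xc4003a_adder = 4.2e-3   # W/MHz, measured 8-bit adder on the XC4003A
full_adder    = 5.5e-6   # W/MHz, one static CMOS full adder [17]
bits = 8
static_adder = bits * full_adder   # assume linear scaling with bit width
ratio = xc4003a_adder / static_adder
print(f"{ratio:.0f}x")             # 95x, i.e. on the order of 100x
```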
Furthermore, as the capacity of these arrays has increased due to technology scaling,
so has their power dissipation. In some sense, the maximum clock frequency and size of
future designs will become limited by power dissipation and heat removal requirements.
This is becoming apparent in some of the recent commercial literature where packaging
constraints have become a concern [45],[46]. In addition, many of today's applications
revolve around embedded solutions and portable devices where battery life is at a
premium. An example of such a design is the InfoPad developed for wireless multimedia
[15]. Clearly, power and energy consumption in PGAs must be examined if their utility is
to be preserved in future applications.
1.5 Research Contributions
This work represents a significant inroad into the design of low energy programmable
logic modules. In many chip designs, power saving solutions and heat management are
treated as costly downstream afterthoughts, whereas an up-front approach can have a
much greater impact. With this in mind, a firm understanding of PGA power consumption
was built up first. Then, through several physical measurements, an explicit breakdown of PGA
capacitance was achieved which allowed a detailed power analysis to be performed.
Based on the intuitions generated from the power analysis and a series of circuit studies, a
low energy programmable gate array architecture has been defined.

TABLE 1-1 Energy Metrics for Various Designs
Design Example                           Vdd          Energy
Xilinx XC4003A 8-bit Adder (measured)    5 V          4.2 mW/MHz
Xilinx XC4000XL Series [46]              3.3 V        92 uW/MHz/Logic Function Output
0.8 um Gate Array [40]                   unspecified  7.5 uW/gate/MHz
Static CMOS Full Adder [17]              3.3 V        5.5 uW/MHz
54x54 Multiplier [13]                    2.5 V        2.23 mW/MHz
DSP Processor [21]                       1 V          0.21 mW/MHz
StrongARM Microprocessor [24]            1.5 V        2.1 mW/MHz
Alpha Microprocessor [4]                 2 V          60 mW/MHz

Several insights into
PGA circuit design and low voltage pass transistor design are also discussed. A 0.5um
effective channel length HP process was the target technology for the design (Table 1-2).
A supply voltage of 1.5 volts was originally intended, but during the course of the research,
2 volts was determined to be preferable from a circuit design and energy-delay product
standpoint. Finally, the layout for the main blocks of a PGA array has been extracted and
simulated to demonstrate the significant improvements in energy that have been attained.
As previously stated, power has been neglected within the PGA industry resulting in
virtually no information relating to power consumption in PGAs. In fact, circuit details
are regarded as highly proprietary so little is known about the internal structure of
commercial PGA designs, unlike memories and common datapath blocks. This
complicates any examination of power since a design’s power consumption implemented
on a PGA is inherently defined at the circuit level. Therefore, significant effort was spent
in understanding where the PGA power problem lies and in developing a methodology
from which to study PGA power.
As a final note, the goal of defining a low power PGA design is constrained by the
intent to preserve the PGA’s most attractive asset, namely its flexibility. As will be seen
throughout this thesis, this goal influences nearly every design decision.
1.6 Thesis Organization
This work is divided into 8 chapters. Chapter 2 provides an
overview of PGA architecture and illustrates some of the trade-offs involved. An
extensive discussion of PGA power analysis is in Chapter 3. A presentation of low power
techniques and their applicability to the PGA design environment follows in Chapter 4.

TABLE 1-2 Process Parameters
Drawn Channel Length   0.6 um
N-channel Vt           650 mV
P-channel Vt           -850 mV
Poly pitch/spacing     0.6 um / 0.9 um
M1 pitch/spacing       0.9 um / 0.9 um
M2 pitch/spacing       0.9 um / 0.9 um
M3 pitch/spacing       1.5 um / 0.9 um
Well type              N-well process
Chapter 5 considers issues surrounding low voltage design and Chapter 6 details circuit
design options and the architectural description of the proposed low power array. A
description of the layout appears in Chapter 7 along with simulated results. Conclusions
and thoughts on future work appear in Chapter 8. A set of Appendices contain the power
analysis data discussed in Chapter 3, a detailed capacitance breakdown for the newly
proposed design, and information relating to programming of the array.
CHAPTER 2
PGA Architecture Overview
2.1 Basic Components and Blocks
As alluded to previously, a programmable gate array consists of an array of logic
blocks and a body of programmable interconnect resources. The entire array is then
surrounded by a set of I/O blocks providing the interface to the chip’s pins. Once the user
has specified a design, the logic block functionality is programmed into the configuration
memory of the logic blocks and the interconnect resources are configured to support the
necessary communication paths among logic and I/O blocks. The next few sections give a
brief description of the three basic PGA components.
2.1.1 Logic Blocks
The logic blocks represent the sole computational resources of a PGA. Depending on
how the internal connectivity and gate structure of the logic cell is programmed, a variety
of logic functionality can be realized. Often, the logic capacity of a block is thought of in
terms of gate equivalents. Typical gates that can be implemented are NANDs, NORs,
XORs and their complements, but more complex combinations can also be made. In some
cases, multiple levels of logic can be implemented within one cell. Logic blocks can vary
widely in logic capability, depending on their design. The number of unique inputs to a
logic cell can range anywhere from two to upwards of 16. Similarly, the number of
outputs provided can be as few as one and as many as 4 or 5. As a result, high fan-in
functions are possible, as is the implementation of multiple independent functions. A
common way of expressing the utility of a logic block is the number of 2-, 3-, 4-, etc.
input functions it can form. AT&T's ORCA FPGA provides an example of one of the most
complex logic blocks in use. It can compute any 4 independent functions of 4 distinct sets
of inputs [22].
The actual implementation of logic blocks also varies, but can be grouped into two
styles. The first is the discrete gate method. In this case, a small number of primitive
static gates are provided and then the inputs to those gates are switched through multiplex-
ers giving rise to different functions. The state of the switches is usually held in a static
configuration memory. This style typically has less total coverage of the possible two- and
three-input functions than the second style, decoder-based logic blocks. In decoder-
based logic blocks, the inputs select various storage elements which are programmed with
the 1’s and 0’s corresponding to the logic function that is to be realized. This technique
can be thought of as assigning 1’s to the desired minterms. Frequently, the decoder style is
referred to as a look-up table, or k-LUT where k is the number of inputs. Decoder style
LUTs offer the advantage that any function of the inputs can be formed. Figure 2-1
illustrates the difference between the two basic logic cell implementation styles.
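The decoder/LUT style can be modeled in a few lines — the configuration is just a truth table, and the inputs select one stored minterm. This is a schematic model for illustration, not any vendor's circuit:

```python
# Schematic model of a decoder-style k-LUT: the configuration memory holds
# a truth table, and the inputs simply select one stored bit (one minterm).
def make_lut(truth_table):
    """truth_table[i] is the output for input pattern i (LSB = first input)."""
    def lut(*inputs):
        index = sum(bit << i for i, bit in enumerate(inputs))
        return truth_table[index]
    return lut

# Program a 2-LUT as XOR: minterms 01 and 10 are set to 1.
xor2 = make_lut([0, 1, 1, 0])
print(xor2(0, 1), xor2(1, 1))  # 1 0
```

Any function of k inputs is realizable simply by changing the stored bits, which is exactly the advantage claimed for LUTs above.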
So far, only combinational logic has been discussed. In order to perform retiming and
data storage, flip-flops are often included within the logic blocks to provide a registered
output if desired. As a final note, the design of the logic block is crucial to the PGA's
overall utility, and since it can be repeated thousands of times in the array, significant
effort is expended on optimizing its design and layout.
FIGURE 2-1: Logic Block Designs
2.1.2 Programmable Interconnect
Once the circuit is mapped to the logic blocks, the interconnection or routing resources
are configured to provide the required connectivity. Just as custom IC design requires
global signal distribution, busses, and local wiring, so too does a PGA architecture. The
key difference, however, is that the PGA must allocate wiring resources before the target
design is known. As a result, the interconnection resources provided by a PGA must be
designed to support a variety of possible patterns, and so a variety of different interconnect
is usually included in a PGA architecture. In general, long length wires spanning the
entire array, medium length wires spanning a few logic blocks, short length wires for cell
to cell connections, and globally distributed wires are available to make the necessary
connections among the logic and I/O blocks.
In order to allow flexibility in the interconnect, two schemes are in popular use (Figure
2-3). The first employs antifuses, which, when blown, provide permanent connections
between the various interconnect segments and logic block pins. Antifuse programming
introduces much less resistance and capacitance into an interconnect network [38]. The
family of parts offered by Actel [1],[5] utilizes this method, but most PGA designs have
gone with an SRAM based configuration technique allowing multiple programming
iterations. In the SRAM style, programmable switches are turned on or off based on the
value stored in the configuration bit that controls that particular switch.
FIGURE 2-2: Types of PGA Interconnect Resources (single-length, double-length, and array-length wires)
An NMOS pass
transistor, allowing bi-directional signal flow, forms the basis for all the switchable
connections in SRAM based interconnection networks. As will be seen later, the
pervasive use of pass transistors in interconnect and logic resources presents some serious
challenges for low power design.
Determining the number and placement of switches within the PGA interconnect
encompasses several design issues and greatly affects the performance of the mapped
circuit. The switches form the basis of the interface between the logic block and the
routing resources. Switches at the logic block outputs offer a larger fanout capability just
as more switches on the inputs allow a greater number of options from which inputs may
be derived. The number of switches relative to the number of interconnect lines is often
referred to as the population of the network. In addition to cell interfacing, switches also
make connections between different types of interconnect and allow the construction of
longer segments from shorter ones, a process called segmentation (Figure 2-4).
FIGURE 2-3: Two Programming Methods for PGA Interconnect (antifuse vs. SRAM-controlled switch)
FIGURE 2-4: Various Levels of Switch Population and Examples of Segmentation
2.1.3 I/O Ring
A ring of I/O blocks around the logic block array provides the PGA’s interface to the
pins of the chip. Only a few differences separate it from a standard pad frame. Instead of
having a collection of dedicated input or output pins, all PGA pins have circuitry in their
I/O blocks to support either an input or an output path. The configuration of the I/O block
determines which path is active. In addition, flip-flops may be present in the I/O block.
2.2 Architectural Styles
The architecture of a programmable gate array can be loosely classified into one of
two styles: 1) island-based or 2) cellular-based (Figure 2-5).
2.2.1 Island
Several examples of PGAs that can be grouped into the island-based category include
designs by Xilinx [43], Lucent Technologies [22], and Altera [23],[36]. An island-based
array typically employs larger and more complex logic blocks allowing more computation
per logic block. The interconnect structure of such arrays consists of a set of general
purpose wiring used to connect logic ‘islands’ located throughout the array.
FIGURE 2-5: Representative Architecture of Island and Cellular PGAs
2.2.2 Cellular
In contrast to the island-based approach is the cellular class of architectures. Repre-
sentatives of this style include the CLAy [26], Atmel [11], Actel [5], CAL [5], and
XC6200 [44]. Logic cells in these architectures are much simpler than their island-based
counterparts. In many cases, only a single two input function is performed in each logic
cell yielding a more fine grain approach to computation. The interconnect of cellular
arrays relies heavily on neighbor to neighbor communication. More specifically, logic
cells have dedicated wires connecting the outputs of an adjacent cell to its inputs. This
mechanism of neighbor to neighbor connectivity is intended as the primary means of
routing signals among the logic cells.
2.3 The Xilinx XC4000 Family
The Xilinx XC4000 family of programmable gate arrays will now be discussed in
greater detail. A thorough understanding of the architecture was necessary in
order to perform the power analysis discussed in Section 3.3. Many of the architectural
resources referred to in Section 3.2 are explained here as well.
2.3.1 Composition of a configurable logic block (CLB)
The logic blocks in a Xilinx PGA are called configurable logic blocks or CLBs. Two 4
input look-up tables (LUTs) form the basis of the cells. The use of look-up tables ensures
that any function of 4 inputs can be realized. Each LUT receives four independent inputs
(F1..4 and G1..4). A second level 3-LUT (H function generator) allows any function of
the previous outputs of the two 4-LUTs and a special extra input H1. Two CLB outputs (X
and Y) can be configured such that the LUT outputs are passed to the programmable
interconnect network outside the CLB. Each output can also be registered by a flip-flop
before leaving the CLB, making a total of 4 outputs. The other signals shown in Figure 2-6
pertain to the use of the CLB as a RAM and will not be discussed any further.
Configuring the logic block is performed by setting the memory bits that control the
multiplexers and programming the look-up tables for the desired functions.
In order to speed up the carry ripple path for adders, dedicated carry circuitry is included
with the F and G function generators. This allows the computation of 2 sum and carry bits
within one CLB. Dedicated carry-in and carry-out paths feed the carry logic, isolating it
from the rest of the general interconnect resources and resulting in a significant
decrease in carry propagation delay.
2.3.2 Interconnect Resources
The interconnect structure of the Xilinx XC4000 series appears in Figure 2-8. The
array of switch boxes performs two functions: connecting shorter segments into longer
ones and allowing corner turns in the routing. Each connection dot of the switch matrix
consists of six pass transistors, as shown in Figure 2-7. The shortest interconnect
resources, called single lines, span one CLB pitch and are bounded
by switch boxes on each end. Double lines allow a switch matrix to be bypassed when
going a distance of two CLBs. Depending on the size of the array, other intermediate
lengths like “quad lines”, and “octal lines” may be included and are similar in construction
to the double lines.
FIGURE 2-6: Xilinx XC4000 CLB
For cross-chip connections, longlines traverse the CLB array. If the full length of a
longline is not required, it may be split into two independent segments by disabling the
splitter transistor (segmentation). All the interconnect
described allows bidirectional signal flow. As mentioned earlier, a vertically oriented
dedicated carry chain connects CLBs independent from the general purpose interconnect.
Lastly, each cross point that indicates a possible connection between the interconnect and
the CLBs represents a single pass transistor between the horizontal and vertical wires.
2.4 The CLAy Architecture
Next, the details of the Configurable Logic Array (CLAy) from National Semiconduc-
tor [26] will be briefly discussed to provide an example of the cellular type of architecture.
The CLAy design strongly contrasts with the Xilinx 4000 style and thus illustrates some of
the different directions PGA architecture can take. A representative section of a CLAy
array is shown in Figure 2-10.
TABLE 2-1 XC4003A Delays (data from Xilinx Databook and XACT Xdelay estimates)
Path               Delay (ns)
LUT Combinational  4
CLK to Q           3
Setup Time         4.5
Single Line        1.3
Double Line        1.3
Long Line          ~1.1
Carry Chain        1.5

FIGURE 2-7: Switch Matrix Detail (each connection dot is a 6-pass-transistor switch)
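To illustrate how the entries in Table 2-1 compose, a hypothetical register-to-register path (two LUT levels and three single-line hops; the path itself is invented for illustration, not a timing-analyzer result) can be summed:

```python
# Hypothetical XC4003A register-to-register path assembled from Table 2-1:
# CLK-to-Q, two levels of LUT logic, three single-line hops, then setup.
delays_ns = {"clk_to_q": 3.0, "lut": 4.0, "single": 1.3, "setup": 4.5}
path = (delays_ns["clk_to_q"] + 2 * delays_ns["lut"]
        + 3 * delays_ns["single"] + delays_ns["setup"])
print(f"path delay: {path:.1f} ns -> max clock ~{1000 / path:.0f} MHz")
```

Even this modest path lands near the sub-50 MHz clock rates noted in Chapter 1.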
2.4.1 Logic Cells
The logic cells of the array are simpler than the Xilinx CLB. An XOR and a set of
AND gates form the heart of the function unit (Figure 2-9). Unlike the Xilinx CLB, the
configuration of the block is determined solely by multiplexers and hardwired gates. The
range of logic functions supported by this cell is not as rich as that of a look-up table
structure, but commonly used combinations are included. Three unique inputs are routed
to the logic cell via the interfaces to the interconnect network. Two of these inputs come
from nearest neighbor direct interconnect (A, B) and one from the local bussing network
adjoining the cell.
FIGURE 2-8: Xilinx 4003A Interconnect Diagram
2.4.2 Interconnect Resources
Two types of interconnect surround the cell array in the CLAy architecture. Figure 2-
10 shows the bussing interconnect that connects cells in 8x8 blocks. Local busses have
connections to all 8 cells in a row or column and express busses simply bypass the entire
group of 8 cells. At the block boundary, a set of interface boxes allow connections
between the express and local busses of the neighboring block. The interface boxes also
provide signal regeneration since a signal may pass through several highly resistive
switches and will need amplification for reasonable performance.
The bulk of the interconnect fabric lies at the inter-cell level. Figure 2-10 depicts the
cell-to-cell connectivity. Two distinct inputs arrive from each neighboring cell, and the two
cell outputs are connected to all four Cartesian neighbors. By providing fast, direct, cell-
to-cell connections, logic cells can be combined into larger blocks without much
overhead. Clearly, the cellular architecture emphasizes the use of local interconnect.
However, mapping experiments have shown that sharing a local bus amongst 8 cells
produces inefficient mappings [34]. The key problem is that when the local busses are
used, the entire row or column of 8 cells can only be sourced by neighboring cells. As a
result, many cells become used only for routing, and many irregular paths develop. A
possible solution involves the addition of more tracks or the sharing of busses across
fewer cells.
FIGURE 2-9: CLAy Architecture Logic Block
2.5 Triptych
A summary of the Triptych [14] architecture follows because it differs significantly
from the other architectures. In addition, Triptych was studied in detail due to the avail-
ability of layout.
The basic logic cell appears in Figure 2-11. It implements a 3-LUT allowing any 3
input function to be realized and includes a flip-flop to register the output. The unique
property of this cell is its ability to route 3 independent paths from its inputs to its outputs.
Thus, the problem of a fixed allocation of routing and logic resources is alleviated by
allowing logic cells to serve as short-distance routing structures.
Interconnection of logic cells takes place along two levels. Horizontal flow is achieved
by routing through the logic cells (actually in a diagonal fashion) and is depicted in Figure
2-11. Two of the three cell outputs connect to the downstream diagonal neighbor.
Likewise, two of the three inputs to the cell are derived from the diagonal connections.
FIGURE 2-10: CLAy Architecture Interconnect Scheme
Vertical connections are supported from a channel of variable-length bus segments, also
depicted in Figure 2-11. The complete interconnection network is actually two checker-
boards superimposed on each other giving a left to right and right to left diagonal flow.
Connections between the two grids are through loopback connections. Similar to the
CLAy architecture, the Triptych interconnect attempts to rely on more dedicated, localized
routing resources.
2.6 Conclusions
From the above discussion of a few architectures, one can see that a PGA design can
take on many different forms. By learning where power consumption is concentrated in
such architectures, decisions can be made towards the development of lower power PGAs.
In addition, several general architectural conclusions have been reported. Although they
do not focus on power, the following statements are still applicable to the development of
a low power PGA.
FIGURE 2-11: Triptych Logic Cell and Interconnect Scheme
2.6.1 Architectural Design Insights
• There is a trade-off in block functionality: too little leads to higher interconnect
demands; too much leads to wasted active area and underutilization [29]. With very
simple logic blocks, a large amount of high-cost, general purpose interconnect becomes
necessary to form relatively simple functions, whereas an overly complex logic block
may not be efficiently used.
• Logic block inputs should possess high functionality per pin for area efficiency [29]. In
other words, once a signal is brought to the input of a logic block, it is best to perform as
much of the necessary computation involving that signal there, rather than routing it to
several blocks and using several blocks for computation.
• The optimal number of LUT inputs is 3-4 for lowest total area [29].
• A flip-flop should be included within the logic block [29]. In this case, the work
showed that directly implementing a flip-flop in the architecture is much better than
constructing one from a group of logic blocks even though it won’t always be needed.
• Avoiding repeated paths through general purpose interconnect can increase PGA per-
formance on the critical path [29].
• Connection delays often limit the system clock, demonstrating the importance of
optimizing the interconnect to improve overall design speed, as was mentioned in the
introduction to this work [29].
• The efficiency and quality of software mapping is impacted by how well the underlying
PGA architecture can support the intended applications. Symmetry in the interconnect
and logic block interfaces along with input permutability facilitate mapping designs to
a PGA.
CHAPTER 3
PGA Power Analysis
In order to begin the design of a low power PGA, a good understanding of the sources
of power consumption for existing designs is essential. As mentioned earlier, very little
detailed information on power is readily available and so the first part of this research
focused on gaining some knowledge about PGA power.
3.1 Sources of Power Consumption
The sources of power consumption in CMOS integrated circuits have been studied in
great detail over the last several years [6]. The predominant contributions to CMOS power in
relative order of their magnitudes are: 1) dynamic switching power, 2) short circuit power,
3) leakage power, and 4) static power. Each of these will be explained below.
Pav = Pdyn + Psc + Pleak + Pstatic    (EQ 1)
3.1.1 Dynamic
The dynamic part of power consumption usually accounts for upwards of 90% of total
power. It arises from the charging and discharging of the parasitic capacitances shown in
Figure 3-1. The components of dynamic power appear in Equation 2 and a derivation can
be found in [28]. From the equation, one can see that the degrees of freedom offered are
capacitance, supply voltage, signal swing, frequency and activity. The manipulation of
these variables will be closely examined in Chapter 4.
DynamicPower = C × Vdd × Vswing × Fclk × ActivityFactor    (EQ 2)
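Plugging representative numbers into Equation 2 gives a feel for the magnitudes involved (the node values below are illustrative, not taken from the measurements in this work):

```python
# Dynamic power per Equation 2: P = C * Vdd * Vswing * Fclk * activity.
def dynamic_power(c_farads, vdd, vswing, f_hz, activity):
    return c_farads * vdd * vswing * f_hz * activity

# Illustrative node: 1 pF of switched capacitance, full-rail 5 V swing,
# 10 MHz clock, 25% activity factor.
p = dynamic_power(1e-12, 5.0, 5.0, 10e6, 0.25)
print(f"{p * 1e6:.1f} uW")  # 62.5 uW
```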
3.1.2 Short Circuit Current
The next largest part of CMOS power consumption is generally due to short circuit
current flow during the switching transient. A slow input slope relative to the output
node’s transition causes both PMOS and NMOS devices to conduct. As a result, a low
resistance path momentarily exists between the supply and ground leading to wasted
charge. Keeping the input and output rise and fall times approximately equal insures that
short circuit power will be kept to under 10% of total design power. An important note to
make is that virtually no short circuit current flows if the supply voltage is lowered to the
sum of the PMOS and NMOS device thresholds. This fact plays an important role in the
later discussion of low voltage pass transistor design.
3.1.3 Leakage
Even if a circuit does not switch, it will still consume power due to the
reverse-biased drain/source diodes and the non-idealities of the MOSFET device.
FIGURE 3-1: MOS Parasitic Capacitances
Leakage of the drain/source diodes is governed by Equation 3, in which Is is the reverse
saturation current, V is the voltage across the junction, and Vth is the thermal voltage (kT/q).
Ileakage = Is (e^(V/Vth) − 1)    (EQ 3)
A simple estimate of the leakage from all the drain area of a 1 million transistor chip
reveals that the power is on the order of 50-100uW and can therefore be considered
negligible in the case of any PGA.
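That order-of-magnitude estimate is easy to reproduce; the per-device junction leakage assumed below (~10 pA) is an illustrative figure, not a measured one:

```python
# Back-of-the-envelope diode leakage for a 1-million-transistor chip.
# The assumed per-device reverse junction leakage is illustrative only.
transistors = 1_000_000
i_leak_per_device = 10e-12   # amps, assumed
vdd = 5.0                    # volts
p_leak = transistors * i_leak_per_device * vdd
print(f"{p_leak * 1e6:.0f} uW")  # 50 uW -- negligible, as stated above
```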
The other source of leakage current is the non-ideal I-V characteristic of a
MOSFET, which gives rise to subthreshold conduction. When the applied gate voltage is
below the threshold voltage, a small current still flows due to weak inversion of the channel.
This current is an exponential function of the device threshold as shown in Equation 4.
For a 3.3v process, threshold voltages remain fairly high (600-800mV) and so the
aggregate current from subthreshold conduction is minimal. However, attempts to lower
device thresholds for improved performance will cause subthreshold currents to quickly
rise. This fact must be carefully weighed when designing for low voltage operation so that
leakage currents do not dominate a design’s power consumption.
Isub ∝ e^((Vgs − Vt)/s),   where s is the subthreshold slope    (EQ 4)
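The exponential sensitivity in Equation 4 can be quantified directly. Expressing the subthreshold slope s in volts per decade (the 90 mV/decade value below is a typical assumed figure, not from this work):

```python
# Relative subthreshold leakage vs. threshold reduction, per Equation 4.
# s is the subthreshold swing in volts/decade (assumed 90 mV/decade here).
def leakage_multiplier(delta_vt, s=0.090):
    return 10 ** (delta_vt / s)

for dv in (0.1, 0.2, 0.3):  # lower Vt by 100, 200, 300 mV
    print(f"Vt - {dv * 1000:.0f} mV: leakage x{leakage_multiplier(dv):.0f}")
```

Lowering Vt by a few hundred millivolts multiplies leakage by orders of magnitude, which is exactly why threshold reduction must be weighed carefully in low voltage design.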
3.1.4 Static Current
The magnitude of the last component of power consumption depends heavily on the
type of design. In a standard CMOS design with rail to rail swings, static power should be
zero. However, other circuit styles can introduce static current components. Possible
sources of static power include pseudo-NMOS gates and circuits requiring bias currents
(e.g. amplifiers). Unlike dynamic power, static power is independent of frequency.
3.2 Xilinx Component Power Measurements (opening the black box)
The first step in gaining some insight into PGA power consumption was to open the
black box. However, without the luxury of detailed schematics or layout of a complete
PGA, the only means to study an array was by physical lab measurements. As a result, a
systematic procedure was developed to obtain a detailed picture of the internal capaci-
tances of the chip. These capacitance values were then combined with architectural
knowledge to form some crucial insights about where power goes in PGAs. A Xilinx
XC4003A became the analysis target because of the existence of a test board and the
necessary Xilinx design software. The XC4003A was fabricated in a 0.6um, 2-layer metal
process. Although the XC4000 series has been improved upon in recent years, the
underlying architectural design has remained consistent allowing useful conclusions to
still be drawn. The structural details of the Xilinx device were presented in Section 2.3.
3.2.1 Method
As with any experimental measurement, the method used to obtain the data plays an
important role in governing the utility of the information. In addition, the procedure
described here can be reused for future analysis of more recent Xilinx designs. The lab
measurements were aimed at obtaining capacitance information, but capacitance
values are not directly observable. Thus, current measurements were performed on
mapped designs under carefully controlled operating conditions. Using the current values,
average power can be calculated as can average energy in terms of mW/MHz. Assuming
that the overwhelming majority of power comes from the dynamic component, the
capacitance of the structure under examination can be found from Equation 5.
C = Energy / (Vdd × Vswing) = (DynamicPower / ToggleFrequency) / (Vdd × Vswing) (EQ 5)

The first step in achieving the measurements was the generation of test designs.
However, the normal synthesis and automatic partition, place and route (PPR) flow does
not allow sufficient control over the resulting mapped design. Instead, the Xilinx XACT
tool, which allows very detailed hand routing of a design, was used extensively [47]. The
ability to enable the exact interconnect paths desired and to set the configurable logic
block (CLB) and I/O structure functionality allowed a high level of control over what
circuit structures were toggling. This degree of control ensured the proper isolation and
orthogonality among measured components so that an accurate measurement of only the
component under examination was performed. A thorough understanding of how the
PGA architecture works was essential to manipulating the PGA at such an intricate level.
Lastly, each measurement was performed in a relative fashion. That is to say, one
measurement was recorded without the circuit component toggling and one with the
component toggling. Thus, all extraneous sources of current draw (power) are eliminated
from the desired measurement. In addition, the accuracy of measurements was improved
by enabling several of the same components at one time to give a larger current differen-
tial. A larger current differential gets around the problem of finite measurement resolution
and also provides an averaging effect to improve accuracy.
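The arithmetic behind these differential measurements is straightforward. The sketch below is a Python illustration with invented current readings, not part of the thesis tooling; it converts a pair of supply-current measurements into per-component energy and capacitance via Equation 5:

```python
def component_capacitance(i_on_mA, i_off_mA, f_MHz, n_copies,
                          vdd=5.0, vswing=5.0):
    """Convert a differential supply-current measurement into the energy
    (mW/MHz = nJ) and capacitance (pF) of a single toggling component.

    i_on_mA / i_off_mA : supply current with / without the components toggling
    f_MHz              : toggle frequency of the test signal
    n_copies           : identical components enabled in parallel
    """
    # Dynamic power attributable to the enabled components alone
    delta_p_mW = (i_on_mA - i_off_mA) * vdd
    # Energy per toggle for one component (EQ 5, first equality; mW/MHz == nJ)
    energy_nJ = delta_p_mW / f_MHz / n_copies
    # Capacitance from C = Energy / (Vdd * Vswing) (EQ 5, second equality)
    cap_pF = energy_nJ / (vdd * vswing) * 1000.0  # nJ / V^2 -> pF
    return energy_nJ, cap_pF

# Hypothetical reading: 8 single lines toggling at 4 MHz raise Idd by 0.32 mA
e, c = component_capacitance(i_on_mA=8.72, i_off_mA=8.40, f_MHz=4.0, n_copies=8)
# e ≈ 0.05 nJ per toggle, c ≈ 2.0 pF -- in the range measured for single lines
```

Enabling several copies (n_copies) is exactly the averaging trick described above: the larger current differential is divided back down to a per-component value.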
3.2.2 Components measured
The components analyzed in the lab measurements were chosen based on their archi-
tectural impact and on the ability to accurately isolate them. Table 3-1 shows all the
resources that were measured.
TABLE 3-1 Xilinx XC4003A Architecture Components

Component Class | Type                  | Comments
CLB             | CLB Function Logic    | LUT Tree and Internal Muxes
I/O             | I/O Interface         | Input and Output Buffers
Interconnect    | CLB Output Interface  | Output Buffers and Track Switch Load
                | CLB Input Interface   | LUT Input Multiplexers
                | Full & Half Longlines | Wires span 5 or 10 CLBs
                | Double Lines          | Wires span 2 CLBs
                | Single Lines          | Wires span 1 CLB
                | Clock Lines           | Clock Distribution and Clk Input Loads
                | Carry Chains          | Direct Path between CLBs
The majority of items listed in the table belong to the interconnect resources of the PGA
since much of a PGA is simply wiring. A full longline corresponds to the case of a
longline where the splitter connects the two segments into one. The half longline
represents the case where the splitter switch is off and only half of the longline is toggling.
Overall, a very detailed breakdown of the interconnect structure was performed in order to
achieve high accuracy in later analyses.
The CLB related components combine to cover nearly all the major potential sources
of power within a logic block. Only the carry logic and the RAM configuration areas were
neglected.
The I/O structures were included for the sake of completeness. Although output driver
power can be significant, those currents are largely set by the board capacitance and other
system issues so they were not seen as something to be optimized. In fact, all measure-
ments were taken without any output pins enabled.
Finally, the clock distribution network was also included among the components to be
measured. In the Xilinx architecture, a series of dedicated vertical wires traverse each
column of the chip to provide global signals to a column of CLBs.
3.2.3 Results
TABLE 3-2 Measured Component Energies and Capacitances

Component                     | Energy (mW/MHz = nJ) | Estimated Capacitance (pF)
CLB Function Generator        | 0.025                | 1.10
I/O Input Path (I1, I2)       | 0.062, 0.140         | 2.5, 5.6
CLB Input Interface           | 0.040-0.050          | 1.95-2.4
CLB Output Interface          | 0.041                | 1.64
10 CLB Longline (Horiz, Vert) | 0.054, 0.088         | 2.7, 4.4
5 CLB Longline (Horiz, Vert)  | 0.022, 0.040         | 1.1, 2.2
Double Line                   | 0.06-0.107           | 3-5.35
Single Line                   | 0.048-0.088          | 2.4-4.4
Clock Connection              | 0.030                | 1.5
The data in Table 3-2 shows the results of the board-level measurements on the Xilinx
chip. The complete, detailed breakdown of the lab measurements for each component is
listed in the Appendix. The process used for this chip was two-layer metal with 0.6um
feature size. These values provide the first concrete numbers on internal PGA capaci-
tances for an industry chip. A range of energy and capacitance values appears for some of
the table entries because several components fall into those categories. The range
represents the bounds that were measured among the components. In addition to the
absolute capacitance magnitudes, relative comparisons among the components reveal
some interesting insights.
The most astounding fact about the numbers is their order of magnitude. In conventional
VLSI CMOS designs, the capacitances are typically in the range of tens to hundreds
of femtofarads. Only on long, high fanin/fanout busses would one be likely to encounter
capacitive loads in the picofarad range. However, in the case of the PGA, a single wire
spanning one CLB pitch has a load of 2-4 pF, and the other types of interconnect fall in this range as
well. One might have anticipated a high value for interconnect capacitances given that
interconnect delays are fairly substantial in these devices, but picofarads is quite a
surprise.
The reason for such large capacitances must be related to the fact that the interconnect
is programmable. As a comparison, parasitic wiring capacitance to the substrate and other
metal layers can only account for capacitances on the order of 150 fF (for a 500um line). The
only other source of capacitance is from the devices relating to configuring the intercon-
nect. Although the exact circuit structure of the CLB - interconnect interface is not
known, it is known that each possible connection point has a switch to enable or disable an
electrical connection. Each switch takes the form of a single NMOS pass transistor, and so
each one contributes parasitic drain capacitance to the interconnect it attaches to regardless
TABLE 3-2 (continued) Measured Component Energies and Capacitances

Component                      | Energy (mW/MHz = nJ) | Estimated Capacitance (pF)
Clock Column Distribution Wire | 0.128                | 6.4
Carry Chain                    | 0.050                | 2.5
of whether a signal has been routed through the switch. A single line type of interconnect
is loaded by 15-20 pass transistor switches. The pass transistors have been described as
“very big”, which leads to large parasitic drain capacitances. Very
large pass transistors allow the delay of a multi-segment path to appear linear as opposed
to quadratic because the path becomes completely capacitively dominated. This attempt
to linearize long-distance delay growth by using large pass transistors makes sense
because buffering within the interconnect network was not used in the design. Only in the
most recent chip families has buffering been introduced within the switch matrices [39].
Therefore the large capacitance seen on the interconnect, especially the shorter varieties,
appears to be predominantly due to switch or fanout capacitance as opposed to wiring
capacitance. Lastly, it is important to recognize the fact that fanout capacitance is a
function of the number of switches hanging off an interconnect segment, as well as the
size of those switches. (Figure 3-2)
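A back-of-the-envelope model makes the switch dominance concrete. The sketch below is illustrative only; the wiring and drain capacitance values are round numbers chosen to match the magnitudes quoted above (~150 fF of metal per CLB pitch, 15-20 switches hanging off a single line):

```python
def segment_capacitance_fF(wire_fF_per_pitch, pitches, n_switches, drain_fF):
    """Total load on one interconnect segment: metal wiring capacitance
    plus the drain capacitance of every pass-transistor switch attached
    to it -- the switches load the wire whether or not they are enabled."""
    return wire_fF_per_pitch * pitches + n_switches * drain_fF

# Single line: one CLB pitch of wire, ~18 switches at an assumed ~150 fF each
single = segment_capacitance_fF(wire_fF_per_pitch=150, pitches=1,
                                n_switches=18, drain_fF=150)
# -> 2850 fF, i.e. in the measured 2.4-4.4 pF range, with the switches
#    accounting for roughly 95% of the total load
```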
In addition to absolute capacitance information, the characterization data in Table 3-2
also allows comparisons to be made among the different types of interconnect. The most
interesting result is that the shorter distance interconnect experiences approximately the
same loading as the more global, long distance interconnect. Once again, this implies the
importance of fanout over wiring length in determining the overall interconnect parasitic
FIGURE 3-2: Heavily Loaded Interconnect due to Switch Diffusion Capacitance
load capacitance. However, the longlines which span 10 CLBs are exposed to 3 times
more switches than the shorter interconnect so one would expect a proportionate increase
in capacitance, but that is not the case. A plausible explanation for this seemingly incon-
sistent result is that the switches used to connect to the longlines are smaller since a
longline path is not exposed to as much series resistance as a path using the more local
interconnect resources. Another possible reason is the use of buffers at the CLB inputs
which would contribute significant gate capacitance. Regardless of the exact reason, the
fact that local and global interconnect possess the same capacitive load gives rise to
another crucial difference between conventional IC designs and PGA design.
In most IC design, the local wiring has a minimal effect on the total parasitic node
capacitance, as should be the case for localized connections. Recently, though, shrinking
feature sizes have begun to shift the balance of parasitic capacitance more towards the
interconnect, which does not scale as well due to fringing and interline capacitance
components. Despite this trend, very large capacitances are usually seen only on long-
distance bus connections. In a PGA, any kind of wiring experiences large capacitances
and hence, local wiring is just as expensive from a power standpoint as are cross-chip con-
nections. Thus, an island style PGA does not exploit the benefits of locality since no
“cheap” local interconnect is available. In essence, the large discontinuity between the
CLB and interconnect capacitance numbers causes a serious power penalty to be paid
every time a signal leaves the CLB.
3.3 Xilinx Profiler Description
3.3.1 General Idea
The insights developed from the baseline capacitance measurements shed some light
on the power profile of a PGA, but they do not provide a complete picture. Any given
design will be made up of several of the building blocks in Table 3-2 and the aggregate
design power will be a function of those components. An important question to be
answered is how these building blocks contribute to a design’s overall power consumption.
By generating routed designs for common library functions and other design examples, a
more complete and useful analysis can be performed. In fact, complete power profiles can
be generated for designs, thus revealing general trends as well as providing useful metric
data such as average net capacitance. In addition, a first order activity factor weighting
must be included in the analysis. This is especially important because power is directly
proportional to transition rate and so any attempt to model power must include some
estimation of activity.
In order to perform the analysis discussed above, a methodology and power estimation
tool needed to be developed. The basic idea leveraged the baseline
capacitance data obtained from the detailed lab measurements and the existing Xilinx
software flow. By forming a linear combination of all the characterized components used
in a design, the total capacitance and dynamic power consumption could be calculated. A
first order estimate of the activities of various nets was then included to achieve the proper
weighting of terms. A set of Perl programs was written to perform these various tasks.
3.3.2 Flow
Figure 3-3 shows the methodology used in the power analysis of the Xilinx device.
[Flow diagram: the LCA file, parsed by xpx.pm against the characterization data,
produces .dat, .len, and .net files; distance.pm and avecap.pm post-process these,
net2cmd.pm builds a .cmd file for a ViewLogic simulation whose .trace output feeds
activity.pm, and a final xpx.pm post-processing pass completes the flow.]

FIGURE 3-3: Xilinx Power Analysis Flow
At the top of the figure is the most essential piece of information: the LCA file. A
designer starts with a description of a circuit in either HDL or schematic form and then the
Xilinx tools are used to optimize and transform the design into a placed and routed imple-
mentation on the PGA architecture. The only file that contains information about the fully
mapped design is the LCA file. Without the detailed description of how the specified
design is actually implemented on the PGA, a linkage between the physical capacitances
and the routed design would have been impossible.
Unfortunately, the LCA file syntax is not documented, and to make matters
worse, the file is not a traditional netlist. Rather, the LCA file specifies the set of program-
mable interconnect points that need to be enabled to realize a routed design. CLB and I/O
block functionality also appears in the LCA file. A sample portion of an LCA file appears
in Figure 3-4.
Upon careful examination of the LCA file, it was found that sufficient information was
present to decompose a design into its component building blocks. As a result, the
connection between circuit capacitances and a software mapped design could be
determined.
Addnet B6 PAD2.I1 FD.F2
Netdelay B6 FD.F2 3.4
Program B6 {137G224} {85G226} {79G232} {79G312} {79G319} {79G388} {79G395}
{79G450}
NProgram B6 row.H.local.7:FD.F2 HC.24.1.11 HC.24.1.0 EC.24.1.17 EC.24.1.0
CC.24.1.17 CC.24.1.0 col.C.local.1:PAD2.I1
Addnet B7 PAD1.I2 FD.G1
Netdelay B7 FD.G1 2.9
Program B7 {114G262} {114G429} {85G429} {78G429} {47G429} {43G433} {43G444}
NProgram B7 col.D.long.1:FD.G1 col.D.long.1:row.B.local.4-s BC.24.1.9 BC.24.1.20
BB.24.1.9 BB.24.1.2 col.B.local.3:PAD1.I2
Addnet CLK_1 LL.Y startup.CLK
Netdelay CLK_1 startup.CLK 1.8
Program CLK_1 {428G19} {428G33} {428G40} {428G63}
NProgram CLK_1 col.M.local.4:startup.CLK MM.24.1.14 MM.24.1.3 col.M.local.4:LL.Y
Addnet CLK_2 bufgp_tl.O LL.G3 HE.K IE.K JE.K BE.K CE.K DE.K EE.K FE.K
Netdelay CLK_2 LL.G3 1.7 HE.K 1.7 IE.K 1.7 JE.K 1.7 BE.K 1.7 CE.K 1.7 DE.K 1.7
EE.K 1.7 FE.K 1.7
Program CLK_2 {165G134} {165G172} {165G210} {447G69} {447G238} {165G411} {165G373}
{165G335} {165G297} {165G259} {165G238}
NProgram CLK_2 col.E.long.4:JE.K col.E.long.4:IE.K col.E.long.4:HE.K
col.M.long.4:LL.G3 col.M.long.4:bufgp_tl.O col.E.long.4:BE.K col.E.long.4:CE.K
col.E.long.4:DE.K col.E.long.4:EE.K col.E.long.4:FE.K col.E.long.4:bufgp_tl.O
FIGURE 3-4: Excerpt from .lca file (contains routing and configuration data for a mapped design).
The xpx.pm program forms the heart of the analysis flow. The LCA file serves as the
only input to the script and is parsed in a line by line fashion. Several output files are
generated from the xpx.pm program and all begin with the design name that was specified.
A [name].dat file contains a wealth of intermediate processing information that is useful
for debugging purposes and obtaining more detailed information than is normally
produced in the standard program output. The [name].len file contains source, sink and
routed length information used by the distance.pm script. Net capacitance data resides in
the [name].net file and is used as an input to several other programs.
The processing performed by xpx.pm consists of a number of different tasks. One of
these is global entry translation. In order to construct a list of the interconnect resources
that a design uses, the programmable interconnect point (PIP) representations must be
translated into unique segment names. All nets in a given design are specified by a list of
PIPs which define the enabled path through the programmable interconnect network.
Each PIP defines the presence of a connection between two segments of interconnect.
The variety of syntaxes used to describe PIPs complicates the segment extraction and
translation task. However, once the translation has been finished, each segment entry is
represented by a globally unique name that contains information about that segment’s ori-
entation, array location index (row and column), type (single line, longlines, etc.) and
track value (1,2,3, etc.). As the xpx tool processes each net, the interconnect segments are
recorded allowing the data to be used in later functions.
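As an illustration of the translation step, the following sketch extracts segment names from an NProgram record using the syntax visible in Figure 3-4. This is a hypothetical re-implementation for exposition, not the actual xpx.pm code, and the regular expression covers only the row/col segment forms shown in the excerpt:

```python
import re

def net_segments(nprogram_line):
    """Pull the interconnect segment names out of one NProgram record.

    Entries like 'col.D.long.1:FD.G1' name a segment (before the colon)
    and a pin it connects to; entries like 'BC.24.1.9' are switch-matrix
    coordinates. Returns the unique segment names in order of appearance.
    """
    tokens = nprogram_line.split()[2:]        # skip 'NProgram <netname>'
    segs = []
    for tok in tokens:
        # A segment name looks like col.D.long.1 or row.B.local.4
        for m in re.findall(r'(?:row|col)\.\w+\.\w+\.\d+', tok):
            if m not in segs:
                segs.append(m)
    return segs

# Net B7 from the Figure 3-4 excerpt
line = ("NProgram B7 col.D.long.1:FD.G1 col.D.long.1:row.B.local.4-s "
        "BC.24.1.9 BC.24.1.20 BB.24.1.9 BB.24.1.2 col.B.local.3:PAD1.I2")
# net_segments(line) -> ['col.D.long.1', 'row.B.local.4', 'col.B.local.3']
```

Each extracted name already encodes the orientation (row/col), array index, type (long, local, etc.) and track value described above, so the subsequent energy lookup is a simple table mapping.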
The net length subroutines use the segment data to get an estimate of the length of
routed nets and their fanout. From the net specification, the source and a list of sinks can
be easily identified giving a measure of net fanout. However, the determination of routed
length is a bit trickier since nets can fork off the main net at any PIP point. This makes it
difficult to determine which segments are common among sinks and which are not; one
should remember that nets are stored in the LCA file just as a list of PIPs and thus very
little information about the actual path taken by the net is preserved. Despite this difficulty,
a routed length estimate for each sink was calculated from the combination of the segment
information and the source/sink data.
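The Manhattan lower bound used in this estimate reduces to simple coordinate arithmetic once source and sink CLB positions are known. A minimal sketch (the function name and coordinate convention are illustrative, not from the thesis scripts):

```python
def manhattan_lengths(source, sinks):
    """Per-sink lower bound on path length, in CLB pitches: the ideal
    Manhattan (|dx| + |dy|) distance from the source to each sink.
    The routed length recovered from the segment list can only be
    greater than or equal to this bound."""
    sx, sy = source
    return [abs(sx - x) + abs(sy - y) for x, y in sinks]

# Source at CLB (2, 3) driving sinks at (5, 3) and (2, 7):
bounds = manhattan_lengths((2, 3), [(5, 3), (2, 7)])  # -> [3, 4]
```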
The most important use of the interconnect segment data revolves around energy com-
putations. Each global segment name undergoes a mapping to an index name which
corresponds to one of the measured energies in Appendix A and summarized in Table 3-2
as discussed in Section 3.2.3. The energy contribution from CLB outputs, CLB inputs and
the internal CLB logic is then combined with the accumulated interconnect segment
energies to give the total energy required to charge and discharge each net. A simple
conversion to capacitance is performed using the clock frequency, supply voltage, and
voltage swing parameters. Once all nets are processed, the [name].net file is generated
which contains a list of all the netnames with their associated capacitance. In addition, a
preliminary energy report for the design is sent to the [name].dat file and STDOUT. This
report contains a complete breakdown of the overall design energy in terms of CLB energy
and energies from the various types of interconnect prior to activity weighting (all
components are treated with an activity of 1). General CLB usage and a detailed breakdown
of interconnect statistics are also provided in the listing. A sample output of the xpx.pm
script is shown in Figure 3-5.
After the xpx.pm program completes, the two auxiliary scripts, distance.pm and
avecap.pm, can be executed to give processed distance and net capacitance information.
The distance script examines the [name].len file and returns four values. The first pair of
numbers are the average and maximum fanout for all the nets in the mapped design.
These values give an idea of the typical interconnect fanout requirements that need to be
=================================================================
Results for: correl4b.lca
Number of CLBs used= 76   Number of IOBs used= 7
Number of switch matrix passes = 223
Energy in mW/MHz: Total= 54.31  Total(w/o IO)= 53.63  Total IO= 0.68
Fraction of power in CLBs vs. Interconnect = 6.66%
CLB function power = 3.57 mW/MHz
Local Wires=  222 lines consuming a total of 15.52 mW/MHz
Carry Wires=   15 lines consuming a total of  0.75 mW/MHz
Double Wires= 167 lines consuming a total of 14.50 mW/MHz
Long Wires=    38 lines consuming a total of  2.71 mW/MHz
Input Lines=  209 lines consuming a total of  9.38 mW/MHz
Output Lines= 153 lines consuming a total of  8.00 mW/MHz
Clock Lines=   76 lines consuming a total of  2.77 mW/MHz
F inputs= 71 lines
G inputs= 62 lines
C inputs= 76 lines
=================================================================
FIGURE 3-5: Output of xpx.pm in initial non-activity weighted analysis of Correl4b.
supported. The second pair of numbers are net length measures. Two net length measures
are computed to give an upper and lower bound on the average net length for a design in
terms of CLB pitches. The Manhattan distance gives a lower bound assuming ideal
connections on a Manhattan grid between the list of sources and sinks for the design. A
routed distance value comes from the computations performed while processing each net
in the xpx.pm script. The routed value will always be higher than the Manhattan value
because it takes into account the non-ideal routing patterns that resulted from architectural
and design restrictions during PPR.
The avecap.pm script generates three capacitance metrics. The average net
capacitance value shows how much capacitance a typical net sees. In addition, a
maximum and minimum net capacitance is recorded to give an idea of the bounds that
routed nets see for capacitive load.
Up to this point, none of the capacitance information obtained has been weighted by an
activity factor, and so a crucial component of dynamic power is still missing. To solve this
inadequacy, the Viewlogic Viewsim [41] gate-level simulator was added to the analysis
flow. By performing a simulation of the design using random input data, a first order
approximation of transition activity is obtained for all nets in the mapped design. The
net2cmd.pm script automates some of the functions necessary to generate a cmd file for
Viewsim. The most important task that net2cmd performs is to enable the tracking of all
the nets found from the xpx program. The list of nets to be tracked will be much smaller
than the original netlist for the design because groups of logic become subsumed inside
the CLBs. Often, new nets are generated from the mapping process that were not present
in the original schematic for the design. However, this is always the result of a duplication
and buffering operation by the routing tools. As a result, this case is detected by net2cmd
and handled so that the correct activity factor for the new nets is appropriately assigned.
Lastly, the simulation produces a trace file containing all the transitions for the routed nets.
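The activity weighting can be sketched as follows. The convention below (counting a full charge/discharge cycle as trace transitions divided by two) is one reasonable reading of a first order activity factor, not necessarily the exact formula used by activity.pm:

```python
def activity_factors(transition_counts, n_cycles):
    """First-order activity per net: full charge/discharge cycles per clock
    cycle, taken as (trace transitions / 2) / cycles simulated. Under this
    convention a clock net has activity 1."""
    return {net: t / 2.0 / n_cycles for net, t in transition_counts.items()}

def weighted_power_mW(activities, net_caps_pF, vdd=5.0, vswing=5.0, f_MHz=4.0):
    """Activity-weighted dynamic power: P = sum_i a_i * C_i * Vdd * Vswing * f.
    With C in pF, V in volts and f in MHz the product is in uW, hence /1000."""
    return sum(activities[n] * net_caps_pF[n] * vdd * vswing * f_MHz
               for n in net_caps_pF) / 1000.0

# Hypothetical trace: 100 simulated cycles, a clock net and one data net
acts = activity_factors({'clk': 200, 'netA': 50}, n_cycles=100)
p = weighted_power_mW(acts, {'clk': 4.0, 'netA': 2.0})  # -> 0.45 mW
```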
After computing activity factors from the Viewsim trace file, activity.pm generates an
estimate of the total power consumption for the design being analyzed. The value
produced does not include the power contribution from the logic within the CLBs;
however, the logic power is again factored into the post-processing step described later.
Despite leaving out the logic power, the power estimate given by activity.pm is accurate
because nearly all PGA power is consumed by the interconnect as will be explained in a
later section. The total clock power for the design is also reported. Along with the power
numbers, the average activity of the circuit is given as is the average switched capacitance
of the nets.
In a final post-processing step, the xpx program is re-run to give an activity weighted
breakdown of CLB, I/O, and interconnect power. A sample output after the post-
processing run of xpx appears in Figure 3-6.
============= Activity Weighted Final Results for correl4b.lca =============
Post Processed Results for: correl4b.lca
Number of CLBs used= 76   Number of IOBs used= 7
Number of switch matrix passes = 223
Power in mW: Total= 71.65  Total(w/o IO)= 70.14  Total IO= 1.51
Fraction of power in CLBs vs. Interconnect = 3.65%
CLB function power = 2.56 mW
Local Wires=  consuming a total of 10.92 mW
Carry Wires=  consuming a total of  0.01 mW
Double Wires= consuming a total of 16.60 mW
Long Wires=   consuming a total of  2.90 mW
Input Lines=  consuming a total of 12.12 mW
Output Lines= consuming a total of  6.22 mW
Clock Lines=  consuming a total of 21.36 mW
=================================================================
FIGURE 3-6: Output from xpx.pm on post-processing pass of Correl4b.
Section 3.4: Results & Recommendations 38
3.3.3 Summary of Profiler Outputs

Figure 3-7 summarizes the complete set of outputs produced by the profiling scripts,
along with the script that generates each metric.
3.4 Results & Recommendations
From running several designs through the power analysis flow, a number of interesting
results arose leading to important considerations for low power PGA design. In addition,
error sources in the power analysis flow were carefully examined.
3.4.1 Benchmark Data
The distilled results from running the analysis tools on a set of 36 mapped designs are
presented in the series of graphs below. The 36 designs used for the data are
composed of 27 Xilinx macros for common functions and 9 larger designs. The set of
designs give a good cross-section of the probable make-up of a PGA design. In all cases,
the designs were analyzed without any outputs driving chip pins. This ensures that the
power contribution from driving board capacitances does not get factored into the power
profile of the internal PGA circuitry. All graphs represent average values across all 36
designs. The averages from only the large designs did not show much difference from the
whole group and so no distinction was made in the graphs. Lastly, the complete set of data
for all the analysis metrics can be found in Appendix B.
• Total Design Power (xpx.pm post-process)
• I/O, CLB Function, Interconnect Power Breakdown (xpx.pm post-process)
• Interconnect Power Breakdown (Single, Double, Long, CLB input, CLB output) (xpx.pm post-process)
• Interconnect Resource Usage Breakdown (xpx.pm)
• Clock Power (activity.pm)
• Average # of Inputs Used per Function Block (xpx.pm)
• Max, Min, Ave Net Capacitance (avecap.pm)
• Ave, Max Signal Fanout (distance.pm)
• Ave Manhattan and Routed Net Length (distance.pm)
• Ave Switched Capacitance (activity.pm)
• Various Other Design Metrics (see appendix)

FIGURE 3-7: Profiler Outputs
FIGURE 3-8: Power Breakdown for Xilinx XC4003A
FIGURE 3-9: Interconnect Power Breakdown by Resource for XC4003A
3.4.2 Conclusions
The wealth of data provided by the power analysis scripts described above allowed the
construction of a PGA power profile for designs mapped to Xilinx devices. From the
profile data, various conclusions could be made about PGA power consumption. These
insights proved invaluable in guiding the development of a low power PGA design.
The most important result from the power analysis studies was the overwhelming
dominance of interconnect in a design’s power consumption. In all cases, at least 65% of
a design’s power is dissipated in the collection of interconnect resources and logic cell
interface circuitry that a design utilizes. The category of interconnect resources
encompasses single length, double length, carry chains, and longlines. Interface circuitry
includes the input multiplexers used to select CLB inputs from the wiring tracks (an
essential part of the interconnect structure) and the output buffers used to drive signals
onto the interconnect fabric. Astoundingly, the logic circuitry that actually implements
the desired functionality consumes a minimal amount of power. A good way of thinking
about the situation is to treat the CLBs as islands in a sea of very expensive interconnect
resources. As soon as a signal leaves an island, it immediately drowns in the capacitance
FIGURE 3-10: Interconnect Wiring Resource Usage
discontinuity caused by the general purpose routing network. Therefore, any attempt to
reduce PGA power must focus on how to minimize interconnect power.
Another surprising fact discovered by the power analysis was the relationship between
local wiring and long-distance wiring. On the level of a mapped design, local interconnect
dissipated ten times more power than long-distance interconnect. In making this
comparison, single and double line segments were considered local resources and the
longlines made up the long-distance resources. One can understand such a result from an
analysis of design properties and the baseline component capacitance measurements. As
discussed in an earlier section, the local and long-distance wiring resources have
comparable capacitances. However, in typical designs local wiring usage and activity is
much higher than long-distance wiring. As a result, the aggregate capacitance of the local
wires amounts to about 10 times more than the long-distance wires. Clearly, the property
of spatial locality inherent in designs drives this imbalance in wire usage. One might
wonder how increasing array size would affect the local vs. long-distance relationship.
Although long-distance wiring requirements will grow with increasing array size, so too
will the amount of local wiring to accommodate the increased number of cells. Thus,
from this very simplistic argument, one can be sure that optimization of local connection
resources is and will continue to be a crucial issue in low power PGA design.
In addition to the realizations made about the power consumed in routing resources,
the importance of the logic cell interface circuitry should not be overshadowed. Every
time a logic cell receives an input, the signal must be selected from a number of possible
tracks. In the architecture studied, anywhere from 8 to 16 possible inputs exist per CLB
input. As a result, a signal must pass through a large multiplexer tree and thus switch
significant capacitance. Power from the CLB input paths consistently accounted for 15%-
25% of total interconnect power. Architectural and circuit design decisions can have a
large impact on the composition of the logic cell input interface and thus, PGA power can
be further reduced by paying attention to logic cell interfaces.
Clock power has long been the Achilles’ heel of low power design and PGAs are no
exception to this rule. Except for data signals which may experience excessive glitching,
the clock signal possesses the highest transition activity. As a result, any capacitance due
to clock distribution and clocked transistor gates generally factors much more heavily in
total power consumption than other individual nets. In the PGA design space, the clock
contribution to power was seen to vary over a large range. Many designs are entirely com-
binational whereas some are almost solely state-based. The data for percentage of total
power due to clocks reflects this variation as some designs have essentially zero clock
power and others consume nearly 50% of their power in the clock network. Careful
thought about conditional clocking, the style of flip-flop and the chosen supply voltage
will most likely have the greatest impact on reducing clock power for a low power PGA
design.
Some general statements about the effect of bit width on power consumption can also
be made from the collected data. For datapath operators like adders, comparators, shifters,
and accumulators, power tends to scale linearly with bit width. This effect is a direct
result of the linear scaling in the number of CLBs required, and thus in the number of
interconnections. Often, designs containing a clock component will have some fixed offset
factors from clock distribution resource usage. Because clocks are enabled on a per
column basis, the clock power contribution for distribution resources occurs in discrete
amounts depending on which columns need a clock. Therefore, designs with a large clock
power component do not depend as strongly on the number of CLBs used, but rather on
the clock distribution resources that have been enabled.
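The discrete, per-column behavior can be modeled directly from the Table 3-2 measurements (6.4 pF per clock column distribution wire, 1.5 pF per CLB clock connection). The sketch below is a simplified illustration, assuming the clock completes one full charge/discharge cycle per period:

```python
def clock_power_mW(cols_enabled, clb_clock_pins, f_MHz,
                   col_wire_pF=6.4, pin_pF=1.5, vdd=5.0, vswing=5.0):
    """Discrete clock-power model: each column containing any clocked CLB
    pays for one full distribution wire, and each clocked CLB adds one
    clock-input load. P = C_total * Vdd * Vswing * f_clk, in mW."""
    cap_pF = cols_enabled * col_wire_pF + clb_clock_pins * pin_pF
    return cap_pF * vdd * vswing * f_MHz / 1000.0  # pF*V^2*MHz -> uW -> mW

# 20 flip-flops at 4 MHz: spread over 3 columns vs. packed into 1 column
spread = clock_power_mW(cols_enabled=3, clb_clock_pins=20, f_MHz=4.0)  # ~4.92 mW
packed = clock_power_mW(cols_enabled=1, clb_clock_pins=20, f_MHz=4.0)  # ~3.64 mW
```

Note that the model depends only weakly on the number of clocked CLBs but jumps by a full column-wire increment whenever a new column is enabled, which is the discreteness described above.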
Another interesting result of the analysis was the effect of structured vs. unstructured
placement of logic on power consumption. In datapath-oriented designs, data signals
generally flow in one direction and fanouts are very low (1 or 2). In other words,
connectivity resides mostly on a cell-to-cell basis, leading to high spatial locality.
Two versions of a correlator design [26] were analyzed (Figure 3-11). The first was
placed entirely by the simulated annealing algorithm in the Xilinx software tools. The
second design was manually placed such that dataflow was oriented from column to
column of cells, with each row of cells forming a bitslice. The resulting power
numbers revealed a minimal improvement for the structured mapping over the random
one.
Two conclusions can be drawn from this result. First, although simulated annealing
usually upsets the inherent structure in many designs, cost-function based placement does
do a good job of optimizing overall power for a Xilinx architecture. However, the second
conclusion is that the architecture is not designed to exploit the benefits of locality.
Supporting this conclusion is the fact that the average net capacitance for both designs was
about the same. One would expect that a structured design, given a suitable architecture,
would have lower net lengths and in turn lower power than a design using a more
"random" placement. It turns out that the minor decrease in power for the structured design
was due to a reduction in clock capacitance, which, because of its high activity, weighs heavily
in the correlator's power. In actuality, the average net capacitance increased slightly for
the structured design, but the low average net activity (excluding clocks) of 0.05 kept this fact
from influencing total correlator power.

TABLE 3-3 Estimated Correlator Capacitance Breakdown
Component Class            Structured   Unstructured
Total Power                38.4 mW      42 mW
Single Line Capacitance    712 pF       666 pF
Double Line Capacitance    725 pF       535 pF
Long Line Capacitance      105.5 pF     55.5 pF
Clock Capacitance          231 pF       248 pF
CLB Input Capacitance      470 pF       455 pF
CLB Output Capacitance     400 pF       385 pF
Average Net Capacitance    13.9 pF      12.5 pF

FIGURE 3-11: Correlator Schematic (DATAIN clocked at CLK = 64 MHz feeds gated-clock positive and negative accumulators, POSACC and NEGACC; sum/carry chains produce CORROUT at OCLK = 1 MHz)
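The interplay of capacitance and activity can be checked with a back-of-the-envelope calculation using the Table 3-3 capacitances for the structured design. The activity model is a simplification of mine, not the thesis's: the clock is assumed to toggle every cycle (activity 1) and all other nets to switch at the reported average activity of 0.05.

```python
# Activity-weighted switched capacitance per cycle for the structured
# correlator, using Table 3-3 values. Activity model is a simplification:
# clock at activity 1, all other nets at the reported average of 0.05.

cap_pf = {                 # structured-design capacitances from Table 3-3
    "single": 712, "double": 725, "long": 105.5,
    "clb_in": 470, "clb_out": 400,
}
clock_pf = 231
data_activity = 0.05

data_switched = sum(cap_pf.values()) * data_activity
clock_switched = clock_pf * 1.0
clock_share = clock_switched / (clock_switched + data_switched)

print(f"data: {data_switched:.1f} pF/cycle, clock: {clock_switched:.1f} pF/cycle")
print(f"clock share of switched capacitance: {clock_share:.0%}")
```

Under this simplified model the clock accounts for roughly two-thirds of the switched capacitance per cycle, which is consistent with the clock capacitance reduction dominating the power difference between the two placements.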
Many other insights can be gained from perusing the analysis data. A complete listing
appears in the appendix to this work. However, one additional result that deserves
mention concerns net capacitance. The net capacitance refers to all the capacitance from
the buffered CLB output, along the routed interconnect path, to the CLB input
multiplexers. In the subset
of designs that were analyzed, some extremes of net capacitance were identified. The
highest values measured 30-40 pF and the lowest came in at 4 pF. This huge variation in
net capacitance poses serious problems from a circuit design standpoint. CLB output
buffers must be oversized to drive worst-case loads, resulting in wasted capacitance when
the post-PPR nets carry lighter loads. Clearly, some effort to control the range
of net capacitances, together with a buffering scheme, must be included in a low power PGA design.
As a final note, the average net capacitance of a design provides an excellent metric for
evaluating the quality of an architecture from a power perspective. Both the impact of
mapping efficiency (exploiting locality) and the intrinsic design capacitance are captured
in the average net capacitance metric. For the set of designs analyzed, the net capacitance
statistics appear in Table 3-4. Once again, the internal capacitances of a PGA represent a
substantial increase over typical VLSI designs. Furthermore, the high variance in net
capacitance presents a significant design challenge: because the distribution of high- and
low-capacitance paths is not known a priori, some attempt to bound net capacitances
must clearly be made.
TABLE 3-4 Net Capacitance Statistics
Net Capacitance Metric   Capacitance (pF)
Average                  13.76
Standard Deviation       4.0
Median                   13.73
3.4.3 Error Sources
Overall, the accuracy of the profiler results was sufficient for learning the power
properties of the Xilinx FPGA. However, in some cases, the error between estimation and
measurement became as high as 30%. Thus, a firm understanding of the error sources is
necessary to gain confidence in the validity of the results. Several experiments were run to
determine what the major sources of error were. From these experiments, essentially all
the error could be accounted for by considering three different areas: 1) periphery effects,
2) activity estimation errors, and 3) interconnect capacitance values.
3.4.4 Periphery effects
As a result of the non-regular structure of a PGA at the array boundaries, some
interconnect segments are not properly included in the interconnect estimation. Throughout
most of the array, the interconnect forms a very regular pattern, and because of this, the
types of interconnect can be uniquely identified. However, along the edges of the array
the regular layout breaks down and several ‘special’ cases exist. Other than hardcoding
each periphery case (there are many), there is no way to determine the correct capacitance
contribution from these resources. Fortunately, the contribution of these resources is fairly
small when taken together with an entire design's resources. Typically, the resulting error
is less than a few percent.
3.4.5 Activity estimation
Inaccuracy of net activity values was the predominant source of error for the
estimation. By carefully isolating the output nodes of CLB outputs, and determining their
energy contribution, a measurement of the node activity could be made. Several such
measurements were taken and compared with the activity values given by the gate-level
simulation. Despite the use of back-annotated timing information for the RC based
simulation, the measured activities were found to be higher than those achieved from the
simulation.
Section 3.5: Triptych Analysis 46
The impact of inaccurate net activities is quite severe. First, the CLB energy component
should be slightly higher due to the increased internal activity; second, the
interconnect contribution needs to be increased. Since the interconnect
component dominates the switched capacitance of the design, most of the difference in
measured and estimated power numbers comes from interconnect. Although the range of
activity inaccuracy was highly design dependent, in the cases looked at, an activity
correction could account for all but about 10% of the estimator error. Specifically, in the
case of the adders, the adjusted weighting of interconnect and CLB components to
compensate for the underestimate of activity allowed much closer agreement with
measured power numbers.
3.4.6 Interconnect cap values
After eliminating errors due to activity, the remaining estimates were still low by about
10%. This residual error was traced to the accuracy of the interconnect contribution. One
possible reason is the impact of cross-coupling components, which would make the
baseline capacitance measurements appear slightly lower. Irrespective of the exact source,
the fact that the error lies in the interconnect is all that matters for the purposes of
this study, since the conclusions are unaffected.
3.5 Triptych Analysis
In addition to the detailed analysis of a Xilinx XC4000 PGA architecture, the Triptych
[14] design from the University of Washington offered further insight into power
consumption in PGA designs. There were two main reasons for choosing Triptych as
another design to be studied. First, Triptych represents a much different style of PGA
architecture from the Xilinx approach. As was discussed in Section 2.5, the design makes
extensive use of more dedicated diagonal neighbor connections, instead of using a
completely general purpose sea of interconnect. Thus, a first order examination of the
impact of differing architectural characteristics could be achieved. Secondly, the layout of
a Triptych test chip was available1 for detailed analysis. In fact, this was the only design
that was studied where extracted physical capacitances could be determined. Based on the
capacitance data and circuit structure information from the layout, some important power
considerations were discovered.
1. Special thanks to Scott Hauck for offering Triptych layout for this study.
FIGURE 3-12: Triptych Routing Scheme and Cell Routing Paths (logic cells, loop-back paths, and vertical tracks, not all of which are shown; thru-cell routing paths include center, diagonal, and center-to-diagonal paths)
FIGURE 3-13: Triptych Logic Cell (a 3-LUT feeding a D flip-flop)
Before stating the conclusions about capacitance, some general information about the
Triptych design should be covered. The layout consisted of a 4x4 array of basic cells. A
detailed description of a basic cell appears in Section 2.5, but a diagram is shown here as
Figure 3-13 for convenience. The layout was drawn for a 1.2um, 2-layer metal process
operating at 5 volts. As a result, some area goes to routing the interconnection network,
leading to a lower density than could be achieved in a more advanced three-layer metal
process. Table 3-5 displays the measured areas of important subsections of a basic cell.
The LUT transistors, which perform the desired logic function, take only 2% of the total
cell area. On the other hand, the configuration storage requires about 25% of the cell
area, as do the output vertical track drivers. Typical delay performance among the most
common paths appears in Table 3-6, as obtained from [14].
In terms of the capacitance analysis of the Triptych design, the heaviest loading
appears on the vertical track interconnect. This is not too surprising since these lines are
basically highly shared busses. The worst case for capacitive loading is found on the
longest segments which are fed from one of 16 inputs and can be tapped off to 16 other
cells. Any signal using this resource will see approximately 5pF of capacitance. The
shorter vertical tracks are not as severe although they still see loads of about 1.7pF for the
length 8 line and 330 fF for the length 4 line.

TABLE 3-5 Triptych Cell Area
Cell Component    Area (um²)
Total             ~100,000
3-LUT             2,200
Config Memory     22,000
Output Drivers    24,000

TABLE 3-6 Triptych Delay Performance
Resource                                              Delay (ns)
Routing Path thru Cell                                1.6
Function Computation Path (in addition to routing)    2.2
Channel Wire                                          2.5-3.7

Two observations can be made from these results. First, the large fanout of highly shared
bus structures causes very high capacitances. By supporting the ability of any cell along
the path of the long wires to connect directly to that bus, the exposure to the added
diffusion capacitance of several large output driver transistors is unavoidable. In addition,
the 4, 8, 16 progression in track length and fanout does not result in a linear increase in
net capacitance. This is an important conclusion because it implies that the cost of
supporting highly populated1 busses is much higher, due to the necessity of increasing
driver size (a significant component of the total capacitance) to compensate for some of
the loss of speed on the more heavily loaded bus. Although the actual implementation of
a design can have a significant impact on the actual growth in capacitance, it is safe to
say that the two components of fanout and increased driver size combine to cause at least
a super-linear increase in capacitance with track length. Therefore, a low power design
should endeavor to minimize the population on busses so that such a penalty in
capacitance can be avoided while still preserving an overall efficient routing topology.
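The super-linear growth can be seen directly from the track loads quoted above (330 fF, 1.7 pF, and 5 pF for lengths 4, 8, and 16):

```python
# Capacitance vs. track length for the Triptych vertical tracks, using
# the measured loads quoted in the text. Dividing by track length shows
# the growth is super-linear: doubling the length more than doubles the
# load, because added taps and larger drivers compound.

track_cap_ff = {4: 330, 8: 1700, 16: 5000}

for length, cap in track_cap_ff.items():
    print(f"length {length:2d}: {cap} fF total, {cap / length:.1f} fF per cell spanned")

# A linear model would give a ratio of 2.0 at each doubling.
print(track_cap_ff[8] / track_cap_ff[4])    # ≈ 5.2
print(track_cap_ff[16] / track_cap_ff[8])   # ≈ 2.9
```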
The relationship between the capacitances of the dedicated, diagonal interconnect
paths and the shared vertical tracks provides another insight into low power PGA design.
The diagonal paths feed each cell with two of the three inputs and can only be derived
from two possible sources each. Thus, the capacitance on these paths is lower than the
vertical tracks and consists largely of wiring parasitics. A typical diagonal routing path
will see between 200 fF and 300 fF of load, and would carry significantly less without the
wiring capacitance.

TABLE 3-7 Triptych Cell Capacitancesa
Signal Path            Path Capacitance (pF)
Center to Diagonal     0.51-0.60
Diagonal to Diagonal   0.29-0.46
Center to 4x Track     4.3
Center to 8x Track     5.8
Center to 16x Track    9.0
a. Estimates from layout extraction and transistor capacitance calculations; track paths do not include track wiring capacitance.

1. Population is a term used to indicate the degree to which a long wiring segment is accessible to the cells which it traverses. A highly populated segment offers direct connections to all cells, whereas a minimal population would just have connections at the segment's endpoints.

From a power standpoint, the presence of such low capacitance interconnect resources is
attractive. However, an important trade-off must be observed. For
the direct interconnects to truly yield a benefit, they must be utilized often and the
overhead of supporting them in the design must be minimal. As a side note, the more
dedicated interconnect will also be much faster than the general interconnect because of
its lower capacitance and path resistance.
Another interesting result of the Triptych capacitance analysis concerns the output
path of a cell (Figure 3-14). A large driver chain feeds the heavily loaded vertical track
drivers and supports the fanout to the many possible tracks to which the output can be
directed. The total capacitance of this output chain approaches 4pF. However, although
this is a large capacitance, the critically important point is that all that capacitance is
switched no matter which track is being used. For example, if a length 4 line (~330fF) is
driven, an additional 4pF also toggles resulting in a huge overhead. Triptych was not
optimized for low power, but by studying it, the problem of inefficiency in output fanout
chains in PGA cells has been highlighted. A low power approach to PGA design must pay
attention to minimizing the significant waste associated with logic cell fanout.
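The overhead arithmetic above is worth making explicit; the ~4 pF chain and 330 fF track figures are the ones quoted in the text:

```python
# Overhead of the Triptych output driver chain (Figure 3-14): roughly
# 4 pF of chain capacitance toggles no matter which track is selected,
# so for a short length-4 track the fixed chain dominates the useful load.

chain_pf = 4.0    # fixed output-chain capacitance quoted in the text
track_pf = 0.33   # length-4 vertical track load (~330 fF)

total = chain_pf + track_pf
overhead = chain_pf / track_pf
print(f"total switched: {total:.2f} pF, overhead ratio: {overhead:.1f}x")
```

Roughly twelve units of capacitance toggle for every one unit of useful track load, which is the inefficiency the text highlights.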
Consequently, the unique opportunity to examine the layout of a PGA design led to
many further insights into the issues and areas for improvement regarding low power PGA
design. Due to the architectural differences between the Triptych design and the Xilinx
series designs, many new observations could be made. Most importantly, the cost of
highly shared busses, the relative benefits of dedicated interconnect, and the waste
associated with cell fanout paths were discovered.

FIGURE 3-14: Triptych Center Output Driver Chain (the center output fans out to the 16x, 8x, and 4x track drivers, with loads of roughly 5 pF, 1.7 pF, and 330 fF; equivalent chain capacitance Cequiv = 3.7 pF)
In summary, the PGA power analysis proved valuable from many different aspects.
The measurements of the internal capacitances of a Xilinx chip have begun to open the
black box of PGA power consumption. Both the absolute data and the relative comparisons
of PGA capacitances have allowed several insights to be formulated. More specifically, the
data revealed the importance of fanout capacitance versus wiring parasitics. In the later
chapters that discuss design issues, reduction of fanout capacitance will be crucial. In
addition, an analysis methodology and associated tools for profiling PGA power were
developed, enabling a more comprehensive study. Finally, results from the
Xilinx estimations and the Triptych study led to further insights that guided the
specification and design of a low power PGA device.
CHAPTER 4
Application of Low Power
Techniques to PGA Design
The growing crisis of managing power consumption on VLSI chips has prompted a
large amount of research resulting in the development of a number of techniques effective
at reducing power. Despite the diversity of available power minimization methods, they
all follow three general themes: 1) performance/area trade-offs, 2) reducing waste, and 3)
exploiting locality.
A common example of a performance tradeoff is the reduction of supply voltage to
achieve a quadratic improvement in dynamic power. Unfortunately, the power
improvement comes at the cost of increased delay and thus a linear drop in performance.
Similarly, an area/power trade-off results when moving to a parallel implementation. By
duplicating functional units, slower logic paths can be used at the expense of increased
area for redundant logic.
The second theme for reducing power consumption involves reducing waste.
Although a simple idea, eliminating waste can have a considerable impact. Typical
examples include turning off clocks to unused functional units, minimizing transistor
parasitic capacitance on non-critical paths, and using dedicated rather than programmable
hardware.
Section 4.1: Hierarchy of Low Power Techniques 53
The third recurring theme in low power design is exploiting locality. Locality is a
natural result of complex design and implies that hardware communication is highest
among neighboring modules. As a result, an efficient partitioning of a design can reduce
bus lengths and capacitive load, thereby reducing power consumption.
When attempting to achieve a low power design, the various techniques available must
be weighed against their suitability and applicability to the design problem one is dealing
with. In the case of PGAs, the design constraints are quite different from most IC design
environments. In many cases, the unique considerations of the PGA rule out the use of
often used low power techniques. Thus, some of the challenges in designing a low power
PGA are to determine which low power techniques can be utilized, which techniques will
provide the largest impact, and how they can be efficiently incorporated into the design.
These questions will be explored in the subsequent sections.
4.1 Hierarchy of Low Power Techniques
As mentioned above, several techniques have been developed to reduce power
consumption in IC designs (Figure 4-1). The methods can be organized in a hierarchy
starting with the highest level of design, the algorithmic level, and proceeding downward
through the layers of specification until the physical layout is reached. As each case is
presented, the specific constraints imposed by the PGA environment will be explained and
the applicability of the techniques will be judged.
4.1.1 Algorithmic
The algorithmic level of design offers the largest degree of flexibility and thus
provides the greatest amount of leverage on the resulting power consumption for a design.
Depending on the application, several optimizing transforms can be performed to reduce
power such as parallelizing and constant propagation.
Converting a design to exploit parallelism is one of the most effective tools in low
power design. By replicating functionality N times, a specified throughput can be
delivered at a cycle time N times longer. Although the overall capacitance is increased by
approximately N, the relaxed timing constraint allows the supply voltage to be dropped
by roughly a factor of N, and thus leads to an overall N² decrease in dynamic power.
Unfortunately, parallelizing cannot be employed when designing a PGA because the
application is not yet defined.
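The N² claim follows from a first-order model. The sketch below assumes delay scales inversely with supply voltage, ignoring the velocity-saturation and threshold effects discussed later in this chapter:

```python
# First-order model behind the N^2 claim for parallel implementations:
# duplicate the unit N times, clock each copy N times slower, and
# (assuming delay ~ 1/Vdd, a simplification) lower the supply by a
# factor of N to just meet the original throughput.

def parallel_power_ratio(n):
    cap = n          # total capacitance grows ~N (duplicated units)
    freq = 1.0 / n   # each copy runs N times slower (same throughput)
    vdd = 1.0 / n    # supply scaled down by N under the linear-delay model
    return cap * vdd ** 2 * freq   # P ~ C * V^2 * f, relative to N = 1

print(parallel_power_ratio(1))  # 1.0
print(parallel_power_ratio(2))  # 0.25
```

In practice the voltage cannot be scaled this aggressively once Vt effects appear, so the 1/N² figure is an upper bound on the achievable savings.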
Another common technique to lower power involves converting a multiplication
operation to an add and shift. Designs mapped to PGAs frequently utilize this method
since the user knows the exact algorithm to be implemented. As a result, multiplications
by constants can be simplified by using hard-wired shifts and adds giving a reduction in
hardware complexity leading to power savings over a completely general purpose
multiplier. However, add-shift transformations, like parallelism, can only be applied by
the PGA user during the mapping and implementation process, not by the PGA
designer. When designing a general purpose PGA, one needs to look at the class of
applications to be supported, but cannot rely on any particular application's properties to
optimize against. Thus, the low power tools available to a PGA designer are more restricted
and begin at the architectural level.
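The add-and-shift transformation can be illustrated in software. This is a generic sketch of the decomposition, not tied to any particular mapping tool:

```python
# Multiplication by a constant decomposed into shifts and adds: each
# set bit of the constant becomes one hard-wired shift, and the shifted
# terms are summed.

def const_multiply(x, constant):
    """Compute x * constant using only shifts and adds."""
    result = 0
    shift = 0
    while constant:
        if constant & 1:
            result += x << shift   # one shift + one add per set bit
        constant >>= 1
        shift += 1
    return result

# 10 = 0b1010 -> (x << 3) + (x << 1): two shifts and one add replace a
# general-purpose multiplier.
print(const_multiply(7, 10))   # 70
```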
FIGURE 4-1: Hierarchy of Power Reduction Domains (Algorithmic, Architectural, Logic, Circuit, Layout; leverage increases toward the algorithmic level, general-purpose applicability toward the layout level)

4.1.2 Architectural
The technique of pipelining a datapath is often used to achieve a higher clock
frequency. Nevertheless, pipelining can also be thought of as a way to reduce power. Due
to the increase in achievable operating frequency, the supply voltage can be reduced until
the original cycle time is met again thus realizing a quadratic improvement in power.
Although the PGA designer is once again limited by the late binding of the actual
implementation, as in the discussion of parallelizing, the pipelining case differs in that a
PGA can be architected to allow easy use of pipelining if chosen by the user. Examples of
architectural features that facilitate the incorporation of pipelining are
the ability to register all outputs1 and an underlying interconnect structure that supports a
datapath style of placement.
In many IC designs, busses account for a significant portion of the power consumption.
The high fanout and long wires associated with global bus structures lead to large
capacitances and high activity. PGAs are no exception to this trend. As shown in the
analysis of Section 3.4.1, the interconnect accounts for an overwhelming fraction of total
power consumption. In fact, the PGA case is even worse because even short wires carry
large capacitance, due to the extremely high programmable fanout on all wires. In addition,
the previous analyses have shown that most PGA wiring is local in nature (a likely by-
product of spatial locality in designs). Therefore, the low power technique of replacing
high fanout busses with more local, dedicated busses should definitely be exploited in the
PGA environment. The difficulty, however, is to keep a reasonable measure of flexibility
to preserve the general purpose capability of the device and to avoid an explosive growth
in area, since wiring and programmable switches already dominate array size.
4.1.3 Logic
The next level in the hierarchy of low power techniques resides at the logic layer.
Several important choices such as logic block composition and mapping will impact
overall array power consumption.
1. The Xilinx series contains special carry chain paths which do not allow easy latching of the carry bit, which is often necessary to implement carry-save adders as well as other designs.

The decision of logic block composition has been studied in great detail. In several
papers by Rose [29], [31], the impact of logic block functionality on area and speed
has been examined. However, the metric of power has not been explored. In order to
gain an intuitive understanding of the role logic block composition plays in power
consumption, one can consider the following thought process. From previous studies [29],
as the complexity or size of a logic block increases, the number of inputs and outputs to
that block must increase. From a power standpoint, this means more connections to the
interconnect must be made, resulting in higher per-wire capacitance. Despite the increase in wiring
capacitance, more logic can be subsumed within a logic block. This fact helps to offset
the higher capacitance penalty of the general interconnect network. Finding the optimum
granularity of the logic block rests with the resulting level of utilization that can be
achieved across typical designs. A complete exploration of this trade-off was beyond the
scope of this thesis, but work in [12] has begun to shed some light on the proper logic
block from a power minimization standpoint. The conclusions from the granularity study
reveal that a 5 input LUT with 2 outputs works well for datapath intensive operations. For
this work, a 3 input LUT was chosen as a reasonable middle ground based on the power
results from an earlier fine-grain design [20] which showed that even smaller logic blocks
tend to be at a disadvantage from a mapping perspective.
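The chosen 3-input LUT can be modeled as eight configuration bits addressed by the inputs. The sketch below (my illustration, not the thesis's circuit) shows how one such block covers any Boolean function of three variables:

```python
# A 3-input LUT modeled as eight configuration bits: the inputs select
# one bit of the truth table, so any Boolean function of three
# variables fits in a single logic block.

def make_lut3(truth_table):
    """truth_table: 8 bits, indexed by a*4 + b*2 + c."""
    assert len(truth_table) == 8
    return lambda a, b, c: truth_table[(a << 2) | (b << 1) | c]

# Configure the LUT as a full-adder sum: s = a XOR b XOR c.
xor3 = make_lut3([0, 1, 1, 0, 1, 0, 0, 1])
print(xor3(1, 0, 1))   # 0
print(xor3(1, 1, 1))   # 1
```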
The logic level of design in a PGA is closely related to the mapping process which
determines how the actual hardware is allocated to the PGA resources. Logic optimiza-
tion at this stage allows judicious trimming of unused or redundant functionality in
accordance with the low power theme of reducing waste. Another way in which mapping
impacts power is through the placement process. The degree of locality in a mapped PGA
design is subject to the efficiency of the logic partition, placement, and routing functions.
Some work has been performed in this area and has shown that by weighting these
functions toward power, reductions of 10%-40% are possible [10], [18]. This is a
surprising result considering that designs are known to possess locality. However, the
examination of Xilinx power contributors revealed that local wiring is just as costly from a
power standpoint as is long-distance wiring. This leads one to believe that current archi-
tectures are not designed to exploit locality. Thus, a low power PGA design must pay
stricter attention to the inherent structure in designs by providing a more natural fit to the
underlying PGA architecture. Furthermore, by adding a cheaper local interconnect the
benefits of locality can be translated into greater power savings.
4.1.4 Circuit
At the circuit level of design, many choices can impact power consumption. As was
noted in previous sections, the flexibility offered by a PGA is strongly dependent on archi-
tecture. However, flexibility is one of the PGA’s most attractive assets and thus, should
not be unduly sacrificed. In some sense, the circuit level becomes the most appropriate
point to introduce low power techniques because circuit optimizations are relatively
orthogonal to the architecture of a PGA. Therefore, the impact of circuit design is well
worth looking into.
The choice of logic style often accounts for a 10%-50% difference in power
consumption, and in some cases, such as dynamic logic, a factor of two is common.
Dynamic logic is often thought to offer low power benefits along with an increase in
performance. Although dynamic gates have reduced input and output capacitance, since
they lack all but one PMOS device, the increased activity of the gate due to the precharge
clock outweighs the reduction in capacitance. As a result, dynamic gates are generally not a low
power logic style. More importantly, the incorporation of dynamic circuitry into a PGA
design is non-trivial. As noted in an earlier section, an important parameter left to the
user’s discretion is the clocking in a design. Without a pre-defined clock period to work
from and a known logic depth, the timing of dynamic precharge and evaluate cycles
cannot be designed. In addition, one must remember that a PGA will be used for
implementing many different types of designs. Thus, the user will most likely choose the
clock frequency that is optimal with regard to power and performance.
Clearly, the necessity of supporting operation at a wide range of frequencies coupled with
unknown logic depths precludes the use of any dynamic logic.
Fortunately, static CMOS and pass transistor logic offer a lower power alternative to
dynamic logic. Many studies have attempted to determine whether static CMOS or pass
transistor logic is superior as a low power logic style [50]. Based on these studies,
the answer seems to be very dependent on the design being implemented. The potential
power savings offered by pass transistor logic derive from the sole use of NMOS transis-
tors, thus avoiding the larger PMOS devices in a complementary structure. In addition,
some functions (XOR) map to much simpler implementations with pass transistors as
compared to static CMOS. In the case of a PGA, most designs are known to make
extensive use of pass transistors in the LUTs and as switching elements. The reasoning
behind such a choice is evident because switches and multiplexer structures can be
minimally implemented using just NMOS pass transistors. Thus, significant capacitance
savings have already been realized through smart choice of logic style in current PGAs.
However, the use of pass transistors complicates low voltage design making further
reductions in power much more difficult as will be discussed in the following chapter.
Another key issue in low power circuit design is transistor sizing. In order to minimize
power, all devices should be kept as small as possible. Frequent use of minimum size
transistors will lead to an overall decrease in the switched capacitance of a circuit and in
turn, a decrease in dynamic power consumption. Unfortunately, some devices will require
sizing up to meet timing constraints on paths with heavy loads. In the case of a PGA, the
timing constraint is even more difficult to optimize against because the critical path of a
circuit has not yet been defined. Therefore, circuit paths must be sized to meet some
bounds on average circuit performance. In doing so, the cost of sizing up transistors must
be carefully weighed against the impact that larger parasitic capacitances will have on
power consumption. This is an especially important consideration along high fanout inter-
connect paths and will be examined in detail in a later section.
The last circuit design issue surrounds the choice of supply voltage. The discussion of
parallel and pipelined architectures showed how it is possible to achieve substantial
reductions in power without giving up on throughput. However, when power is the most
critical design constraint, some performance can be given up. The reason why such a
tradeoff appears attractive derives from the quadratic dependence of dynamic power on
supply voltage, while delay, on the other hand, tends to vary only linearly with supply. In
actual practice, the argument is not as simple, and two caveats should be mentioned. First,
the push to smaller feature sizes causes most devices to become velocity saturated fairly
quickly. As a result, the current characteristic of a device more closely resembles a linear
increase with the applied gate voltage. From this, one might think that delay will now
appear independent of supply, but one should also consider threshold voltage. When the
supply voltage falls within a couple of Vt, constants in the Ids equation become
important and the delay degrades further. Thus, lowering the supply
voltage always results in a decrease in performance.

Id = κ · υsat · Cox · W · (Vgs − Vt),  for Vds ≥ Vdsat  (EQ 6)
Evaluating the tradeoff in performance and power as the supply voltage is reduced is
crucial to determining an optimal operating point. The standard measures of delay and
energy give a good indication of performance and circuit efficiency respectively, but tend
to be too biased towards a high supply for speed and a low supply for low energy. A
better metric to represent a design’s overall quality is the energy-delay product. This
metric captures the opposing influences of performance and energy efficiency leading to a
more balanced picture from which to optimize a design. Later studies in this thesis will
use all three metrics when evaluating circuit structures.
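The bias of the individual metrics, and the balance struck by the energy-delay product, can be illustrated with a simple normalized model. The delay expression and the 0.7 V threshold below are illustrative choices of mine, not fitted to the 0.6um process used in this work:

```python
# Delay, energy, and energy-delay product across supply voltages for a
# simple normalized CMOS model: td ~ Vdd / (Vdd - Vt)^2 and E ~ Vdd^2.
# All constants are illustrative placeholders.

VT = 0.7  # threshold voltage (V), illustrative

def delay(vdd):
    return vdd / (vdd - VT) ** 2

def energy(vdd):
    return vdd ** 2

for vdd in (1.5, 2.0, 3.3, 5.0):
    d, e = delay(vdd), energy(vdd)
    print(f"Vdd={vdd:.1f}V  delay={d:.2f}  energy={e:.2f}  EDP={d * e:.2f}")
```

Delay alone favors the highest supply and energy alone the lowest, while the energy-delay product reaches its best value at an intermediate supply, which is the balanced operating point the metric is meant to expose.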
4.1.5 Layout
Once the physical level of design is reached, most parameters affecting the power
consumption of a design have been determined. Despite this, significant differences in
power consumption can arise between a well-planned layout and a haphazard one. Early
floorplanning is especially important for an array based structure like the PGA so that
basic blocks can be connected by abutment. Direct abutment of neighbor to neighbor
interconnect avoids the overhead of routing channels which lead to longer wires and hence
more capacitance. Along similar lines, the effective use of upper level metals for
connecting longer bus-like wires (feedthroughs, over the cell routing) allows a smaller
layout. In general, a compact basic cell layout will result in a more dense PGA array with
smaller wiring capacitances. Considering that a PGA array can have upwards of a
thousand cells, the optimization of the basic cell layout should be leveraged as much as
possible for reducing power consumption.
4.2 Reducing Dynamic Power in PGAs
The previous section provided a broad perspective on the methods available for attacking the PGA power problem. This section focuses on the techniques and trade-offs used to reduce dynamic power, the component that dominates PGA power consumption. Specific issues are divided into attacking the capacitance, frequency and activity, supply voltage, and voltage swing components of dynamic power.
$P_{dynamic} = C \times V_{dd} \times V_{swing} \times F_{clk} \times ActivityFactor$ (EQ 7)
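To make the roles of the components in EQ 7 concrete, the expression can be evaluated for a single net; the capacitance, frequency, and activity numbers below are assumed, not measured values.

```python
# Sketch: evaluate EQ 7 numerically for one net. The capacitance,
# frequency, and activity numbers below are assumed, not measured.

def dynamic_power(c, vdd, vswing, fclk, activity):
    # P = C * Vdd * Vswing * Fclk * ActivityFactor
    return c * vdd * vswing * fclk * activity

# A heavily loaded interconnect line: 1 pF, full 1.5 V swing,
# 20 MHz clock, toggling on 25% of cycles.
p = dynamic_power(1e-12, 1.5, 1.5, 20e6, 0.25)
print(f"{p * 1e6:.2f} uW")  # 11.25 uW
```

Because every factor enters linearly (and supply quadratically when the swing is full rail), halving any one of them halves the power, which is why the following subsections attack each factor in turn.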
4.2.1 Capacitance Reduction
Several architectural and circuit design choices impact the amount of capacitance that
will be charged and discharged within a PGA design. The number of metal layers
available in the selected process technology influences the final area and power of a PGA
array. In addition, the sizing and number of switches a design uses within the interconnect
fabric can also seriously affect power as seen in the Xilinx study of Section 3.2. Lastly,
capacitance is affected by the interface of the basic logic cell to the surrounding intercon-
nect and to the neighboring cells.
On a global level, the choice of a process that allows 3 or more metal layers can
contribute to a decrease in area and power. In earlier designs like the Xilinx 4000 series
and the Triptych design, only a two layer metal process was available. Process advance-
ments since that time have provided three metal layers as a standard with 4 and 5 moving
into the mainstream. The advantage of more layers of metal is twofold. As can be seen from the process data for a 0.5um effective channel length technology shown in Figure 4-2 below, the contributions to capacitance from metal 1 all the way to metal 3 tend to
decrease (Metal 3 with significant interline coupling is an exception). Given that intercon-
nect wiring is a crucial resource in a PGA, the reduction of the intrinsic wiring capacitance
by using upper layers of metal can help lower net capacitances. However, the necessity of
reaching the active layers to insert switches leads to the addition of several contacts and
vias to lower levels (stacked vias are unavailable). The area overhead of going up and
down through the metal layer hierarchy must be considered to truly evaluate the benefits
of more metal layers.
The second advantage of more metal layers comes from the examination of Triptych
and the early cell study in [20]. The ability to use a third level of metal can greatly
increase density. When previously restricted to only two layers of metal, the 10X
overhead in interconnect mentioned in Section 1.3 could not be avoided. Now, much of
the interconnect can be placed over the cells instead of in dedicated routing channels.
Although the exact measure of area improvement will depend on the amount of active area
necessary for switches and configuration memory in the design, the bottom line is that the
cell size will decrease. As a result, the interconnect paths will be shorter and hence wiring
capacitances will be lower.
Figure 4-2 illustrates the growth in wiring capacitance as a function of wire length.
The model used for generating the data was made worst case by assuming two wires
running parallel to the subject wire giving rise to interline coupling on both sides. The
equation describing the components to wire capacitance is shown below.
$C_{wire} = C_{areaTopLayer} + C_{areaBottomLayer} + 2 C_{fringe} + 2 C_{interline}$ (EQ 8)
From the graph, one can see that Metal 3 offers lower overall capacitance across any range
of lengths except when there is significant capacitance to neighboring M3 tracks. In
addition, the graph shows the contribution a PGA interconnect line would experience from
purely wiring parasitics for a range of possible lengths.
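The layer-to-layer comparison can be sketched numerically from EQ 8; the per-micron coefficients below are hypothetical stand-ins, not the thesis's 0.5um process data.

```python
# Sketch: total wire capacitance per EQ 8. The per-micron coefficients
# (in aF/um) are illustrative assumptions, not the thesis's process data.

def wire_cap_af(length_um, c_area_top, c_area_bottom, c_fringe, c_interline):
    # Cwire = C_areaTop + C_areaBottom + 2*C_fringe + 2*C_interline
    per_um = c_area_top + c_area_bottom + 2 * c_fringe + 2 * c_interline
    return per_um * length_um

# Hypothetical coefficients: metal1 couples strongly to the layers below
# it, metal3 mostly to same-layer neighbors when tracks are dense.
m1 = wire_cap_af(100, c_area_top=10, c_area_bottom=30, c_fringe=25, c_interline=10)
m3_isolated = wire_cap_af(100, c_area_top=5, c_area_bottom=10, c_fringe=20, c_interline=0)
m3_coupled = wire_cap_af(100, c_area_top=5, c_area_bottom=10, c_fringe=20, c_interline=40)
print(m1, m3_isolated, m3_coupled)  # 11000 5500 13500 (aF)
```

With these assumed coefficients an isolated metal 3 run is roughly half the capacitance of metal 1, but dense same-layer coupling can push it above metal 1, mirroring the exception noted for the graph.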
Another important design consideration that has an impact on the resulting PGA
capacitance is the size and number of switches present on each interconnect segment
(Figure 4-4). Ideally, the minimum number of switches to support a design would be
chosen in constructing a PGA architecture. Unfortunately, that minimum number is very
difficult to pin down. In general, a mapping study across an interesting set of designs
needs to be performed to determine the necessary level of flexibility an architecture must
support. By combining information about the number of switches needed with the relative impact of drain capacitances on interconnect, a basis for evaluating trade-offs in flexibility and power can be formed. Based on a model of how drain capacitance scales with transistor
size, the chart in Figure 4-3 was generated. In this case, the switches were assumed to be
implemented as just nmos pass transistors. Using this chart, one can quickly evaluate the
impact of changing to larger switch sizes and to increasing the fanout on an interconnect
segment.
In a Xilinx device, 15 switches on a single-length segment is common, but their exact
size is unknown. As a result, the chart’s usefulness really lies in the ability to gauge an
architectural decision’s impact on resulting interconnect capacitance especially when
combined with the wiring parasitics information in Figure 4-2. An interesting thing to
note is when wiring capacitance begins to dominate switch drain capacitance. This point
varies with the chosen switch size and fanout, but as an example, a fanout of 10 with
1.8um pass transistors will equal the wiring capacitance of a 90um metal 3 line.
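The break-even arithmetic behind that example can be sketched directly; the 2 fF per switch and 0.22 fF/um metal-3 figures are assumptions chosen to reproduce the fanout-10 / 90um example, not extracted process values.

```python
# Sketch: wire length at which switch drain capacitance matches wiring
# capacitance. The 2 fF/switch and 0.22 fF/um figures are assumptions
# chosen to reproduce the text's fanout-10 / 90um metal-3 example.

CDRAIN_FF = 2.0        # drain diffusion per 1.8um nmos switch (assumed)
M3_FF_PER_UM = 0.22    # metal-3 capacitance per micron (assumed)

def breakeven_length_um(fanout):
    # wire length where Cwire equals the summed drain loads
    return fanout * CDRAIN_FF / M3_FF_PER_UM

print(f"{breakeven_length_um(10):.0f} um")  # ~91 um for a fanout of 10
```

Below the break-even length the interconnect is fanout dominated and switch count matters most; above it the wire itself dominates and flexibility is comparatively cheap.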
FIGURE 4-2: Wiring Capacitance Parasitics
The sizing and implementation of switches throughout the PGA interconnect is
another factor with a large impact on a PGA’s resulting power consumption. From a
purely analytical approach, a simple model for an interconnect line can be constructed
based on Figure 4-4.
Assuming that all transistors are the same size (reasonable if all paths are considered equivalent), the following models for delay and capacitance can be derived in terms of the size-up factor s, pass transistor resistance R, and drain diffusion capacitance Cdrain, where N represents the fanout on the line:
FIGURE 4-3: Pass Transistor Diffusion Capacitance vs. Fanout and Switch Size (W) L=0.6um
FIGURE 4-4: Model For Programmable Interconnect
$T_d \propto \frac{R}{s}\left[s(N-1)C_{drain} + sC_{drain} + C_{wire}\right] = NRC_{drain} + \frac{R}{s}C_{wire}$ (EQ 9)

$Power \propto sNC_{drain} + C_{wire}$ (EQ 10)
From the above equations, one can see that to first order, sizing up will result in no
change in delay along a single interconnect path. The simple relation shows that the
decrease in resistance from sizing up is offset by a proportional increase in capacitance
yielding no delay improvement. More importantly, sizing up always has a negative impact
on power. The equation shows a linear increase in net capacitance; but when other contributions to capacitance, such as wiring parasitics, are factored in, limited sizing up may be prudent from an energy-delay product point of view.
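That trade-off can be sketched numerically from the EQ 9 / EQ 10 models. R is normalized; the 2.5 fF drain capacitance per minimum switch and the 50 fF of wiring capacitance are assumed values (the latter echoing the 50ff case considered later with Figure 4-7).

```python
# Sketch: first-order delay/power scaling with switch size-up factor s
# (EQ 9 / EQ 10). R is normalized; Cdrain = 2.5 fF per minimum switch
# and Cwire = 50 fF are assumed values, not process data.

R, N, CDRAIN, CWIRE = 1.0, 10, 2.5, 50.0

def delay(s):
    # Td ~ N*R*Cdrain + (R/s)*Cwire: only the wiring term shrinks with s
    return N * R * CDRAIN + (R / s) * CWIRE

def power(s):
    # Power ~ s*N*Cdrain + Cwire: switch loading grows linearly with s
    return s * N * CDRAIN + CWIRE

sizes = [x / 10 for x in range(10, 41)]          # s from 1.0 to 4.0
best = min(sizes, key=lambda s: delay(s) * power(s))
print(f"energy-delay optimum at size-up factor s = {best:.1f}")
```

Delay saturates once the fixed N·R·Cdrain term dominates, while power keeps climbing linearly with s, so the energy-delay product has an interior optimum (s = 2.0 with these assumed numbers).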
A series of simulations were performed to more accurately assess the trade-offs in
switch size and implementation. The three switch implementations range from the
simplest: a single nmos pass transistor, to a complementary transmission gate with equal
sized devices, to a transmission gate with a p device sized 2.5 times as large as the n. The
simulation model for the cases appears in Figure 4-5. Node capacitances were added at
each node to simulate the fanout associated with that point. Fanouts of 2 and 12 were used
to examine switch parasitics. In the case of the transmission gate simulations, more node
capacitance was added because the switches are implemented with both n and p devices
and so the capacitance per switch should be proportionately higher. Thus, the value of
loadCap was 4ff for 1.8um pass transistors and 8ff for the similar transmission gate and
the multiplier was 1 and 2 respectively. In order to account for the wiring capacitance
component, a worst case estimate of wiring parasitics was made and added to each
node. The transmission gate case with pmos sized 2.5 times as large as the nmos
performed worse in delay and energy than the equally sized transmission gate scenario so
it was omitted from the following analysis for clarity. The results are depicted in Figure 4-
6.
Clearly, the transmission gate is at a disadvantage in terms of speed and energy,
especially when large fanouts are required. As the switch size becomes large, there is a
factor of 2 speed penalty and about a factor of 2 energy penalty. Even when the fanout on
the interconnect is only 2, the pass transistor implementation possesses better delay and
energy properties although the gap is less significant. The influence of driver and receiver
overhead and the fixed wiring capacitance is the reason why the transmission gate and
pass transistor methods become more comparable at low fanouts. In general, interconnect
fanout tends to be on the high side, and so the use of nmos pass transistors will result in
superior performance and energy (and area) over transmission gates. Thus, pass
transistors are the preferred implementation method for the switches in a low power PGA.
(Simulation model: a chain of pass transistor switches between in and out, with gates driven at 2v and a 1.5v signal supply; each node carries Cload = loadCap*fanout*multiplier + Cwire, plus a 10ff load at the output. The transmission gate model just adds PMOS devices to the pass transistors.)
FIGURE 4-5: Simulation Model for Switch Size Experiments
FIGURE 4-6: Delay and Energy of Interconnect Chains vs. Switch Size (graphs annotated with wiring-dominated and fanout-dominated regions)
When considering increases in switch size, there is a slow decrease in delay and a
nearly linear increase in energy. From the delay graph, it is evident that sizing up the
switches only pays off when the fixed components of capacitance (wiring parasitics and
driver/receiver loads) dominate performance. Both the delay and energy graphs show
close agreement with the models in Equation 10. Therefore, the results confirm that sizing
up switch devices may be prudent depending on the amount of wiring capacitance but
never proves beneficial from a power perspective.
The graph in Figure 4-7 shows the energy-delay product and offers a means for
reconciling the competing factors of performance and energy consumption. Assuming a
wiring capacitance of 50ff, a switch size of approximately 1.8um is optimal for pass
transistor interconnect with a high fanout. However, for low fanouts sizing up shows a
marginal improvement.
The effect of the interfacing structure between the interconnect resources and the cell
inputs to a PGA should also be considered. The Xilinx analysis showed that the CLB
inputs dissipate a sizable fraction of PGA power. An analysis of two different input
topologies and the variation in the number of inputs revealed other insights towards managing PGA capacitance. The two topologies examined were a linear pass transistor input structure and a traditional logarithmic tree (Figure 4-8).

FIGURE 4-7: Energy-Delay of Interconnect Chains vs. Switch Size
An analysis of delay and capacitance was performed similar to that in the section on inter-
connect above. The resulting equations appear below:
Linear Pass Transistor:

$T_d \propto R \times NC_{drain}$ (EQ 11)

$Power \propto NC_{drain}$ (EQ 12)

Logarithmic Multiplexer Tree:

$T_d \propto \log_2(N) R \times \left[(\log_2(N) - 1) \cdot 3C_{drain} + 2C_{drain}\right]$ (EQ 13)

$MaxPower \propto (N - 2) \cdot 3C_{drain} + 2C_{drain}$ (EQ 14)

$MinPower \propto (\log_2(N) - 1) \cdot 3C_{drain} + 2C_{drain}$ (EQ 15)
To more easily see the relationships between the two topologies for typical numbers of
inputs, the data is graphed in Figure 4-9. In terms of delay, the linear pass transistor
structure shows a linear increase with the number of inputs, while the mux tree delay grows quadratically with the tree depth (log2 of the number of inputs). Thus, for small numbers of inputs,
the two structures do not exhibit as large a difference as they do at higher numbers of
inputs. Capacitance is another story. In the case of the multiplexer, two models were
developed because the amount of capacitance in the logarithmic multiplexer tree that can
be switched depends on whether the inputs to the mux tree are switching. In the worst
FIGURE 4-8: Possible Logic Cell Input Structures (left: a linear chain selecting one of in0-in3 onto out; right: a logarithmic tree)
case, if the inputs to the tree are switching, then even though they are not selected by the control lines, all the intermediate capacitances in the tree will be switched. The best case
occurs when only the selected input is toggling which leads to switching only along the
intended path. Taking this variation into account, one can see that the linear pass transistor
structure is always superior for 4 or less inputs. However, as the number of inputs is
increased past four, the multiplexer under best case conditions can have a lower effective
capacitance than the linear pass transistor. Yet when assuming a simple average case for the multiplexer, the linear structure is once again lower in capacitance. Unfortunately, as
with many PGA design issues, decisions cannot be considered in isolation. The use of a
linear structure requires N programming cells vs. log2N for the multiplexer. This
additional area for memory cells can lead to an increase in array size and cause routing
capacitances to increase as a result. Therefore, the conclusion that linear will lead to
lower power and is thus the best choice should be qualified by the impact such a decision
will have on the overall design.
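The comparison above can be sketched by evaluating the EQ 11-15 models, with R and Cdrain normalized to 1. The tree expressions below follow one plausible reading of the thesis's equations and should be treated as an interpretation, not an authoritative transcription.

```python
# Sketch: the input-structure models of EQ 11-15 with R and Cdrain
# normalized to 1. The tree equations are one interpretation of the
# thesis's formulas, not a verified transcription.
import math

R, CD = 1.0, 1.0

def linear_delay(n):     # EQ 11: Td ~ R * N * Cdrain
    return R * n * CD

def linear_power(n):     # EQ 12: Power ~ N * Cdrain
    return n * CD

def tree_delay(n):       # EQ 13: quadratic in the tree depth log2(N)
    d = math.log2(n)
    return d * R * ((d - 1) * 3 * CD + 2 * CD)

def tree_power_max(n):   # EQ 14: unselected inputs toggle the whole tree
    return (n - 2) * 3 * CD + 2 * CD

def tree_power_min(n):   # EQ 15: only the selected path switches
    return (math.log2(n) - 1) * 3 * CD + 2 * CD

for n in (4, 8, 16):
    print(n, linear_power(n), tree_power_min(n), tree_power_max(n))
```

Under these normalized models the linear structure switches less capacitance at N = 4, the two cross over around N = 8, and the best-case tree wins at N = 16, consistent with the crossover described in the text.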
FIGURE 4-9: Delay and Capacitance of Logic Cell Input Structures

4.2.2 Frequency

Often, the frequency of operation can be relaxed somewhat to achieve a reduction in dynamic power. As previously mentioned, however, design frequency is not under the control of the PGA designer; it is a key variable left to the discretion of the PGA user. Similarly, the activity of nets is impossible to manipulate without knowing what the eventual design will be. Consequently, optimization along the
frequency and activity axes were not a focus of this work.
4.2.3 Voltage Scaling
Voltage scaling can provide significant leverage on a design’s resulting dynamic power
consumption. The quadratic reduction forms the basis of many of the low power
techniques evaluated in Section 4.1. However, the impact of voltage scaling on the circuits
used cannot be overlooked. In the majority of CMOS design, rail to rail voltages are
passed by either static or dynamic circuitry. In this case, voltage levels and margins track
well with lowering supply voltage. However, the same is not true for pass transistor
circuitry and other circuit configurations where a “diode” drop is lost. The severe implica-
tions of diminished margins on PGA circuitry are left to the following chapter.
4.2.4 Low Swing
Another common circuit technique to achieve lower dynamic power is to reduce the
swing on signals to gain a linear power reduction. In general, techniques to support low
swing signaling involve either special timing, voltage references, or charge sharing
techniques. In areas like memory design, low swing techniques have long been used to
reduce the power contribution of heavily capacitive bit-lines. A similar problem exists in
the PGA environment where the interconnect is heavily loaded; therefore, low swing
techniques could potentially offer large power savings.
When considering the incorporation of low swing signalling into a PGA design,
several issues become important. The circuitry necessary to both drive and receive low
swing signals is usually much more involved than a standard inverter would be. Often, the
increased area from a more complex driver and receiver will have a major impact on the
PGA cell area if several of the special drivers and receivers are necessary. This conclusion
was reached based on estimates of cell area for typical designs.
The careful analysis and management of noise presents another problem when dealing
with low swing signals. The PGA noise environment is far more irregular than that found
in an arrayed memory. As a result, worst case noise conditions are difficult to predict and
could disrupt proper electrical operation of circuitry given that low swing signals
inherently have lower noise margins. In fact, the use of pass transistors in the interconnect
of current PGAs already causes interconnect signals to swing only to VDD - Vt (with body effect).
Therefore, a simplistic form of low swing signalling is already performed. Another com-
plication to employing more sophisticated low swing techniques involves the inclusion of
repeaters. In order to break up the large distributed RC lines found in PGAs, repeaters
should be used. However, the construction of a low swing repeater requires significantly
more overhead than a traditional inverter buffer. Lastly, the benefits of low swing
circuitry must be measured against further reductions in supply voltage which allows the
benefits of low swing to be achieved without redesigning drivers and receivers.
Despite the complications involved with low swing design, a more detailed look at the
issues surrounding the use of low swing signalling within a PGA environment is appropri-
ate. In traditional timing based low swing designs, a timing reference is used to enable
circuitry to latch or amplify the low swing signal (Figure 4-10).

FIGURE 4-10: Low Swing Latched Sense Amp Receiver

Unfortunately, an enable signal is most often derived from the clock and as previously stressed, the clock is not specified at design time. Other techniques to generate a timing reference such as a transition triggered scheme proved to be too slow and required
excessive circuit overhead. More importantly, even if a timing signal is available, the
exact triggering of the sensing circuitry proves difficult. The capacitance and therefore
the timing properties associated with any path to be sensed will vary depending on the
mapping that a PGA user performs. Once again, the late-binding of important design
parameters makes incorporation of timing based low swing techniques very difficult.
In addition to sensing, voltage reference methods can also be used to send and receive
low swing signals. In order to provide level conversion, differential structures provide a
larger effective signal and can enable swings on the order of 100mV. However, the
overhead of supporting differential paths must be weighed against the energy savings from
reduced signal voltages. Frequently, static current must be dissipated in differential
topologies. Although static power is generally discouraged in CMOS designs, judicious
use of static power consuming circuitry can result in lower power than more conventional
alternatives. The key trade-off to consider is whether the amount of static power
consumed is less than the dynamic component at typical operating frequencies. Since
PGAs run relatively slowly (tens of MHz range), the power dissipation from static current
outweighs any savings from lower swings.
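A back-of-the-envelope version of this static-versus-dynamic comparison can be sketched as follows; the bias current, capacitance, and activity values are all assumed for illustration.

```python
# Sketch: compare a differential receiver's static bias power against
# the dynamic power of a full-swing net at PGA-like frequencies. All
# values (bias current, capacitance, activity) are assumed.

def dynamic_power(c, vdd, vswing, f, activity):
    return c * vdd * vswing * f * activity

def static_power(i_bias, vdd):
    return i_bias * vdd

f = 20e6  # tens of MHz, typical of mapped PGA designs
p_dyn = dynamic_power(1e-12, 1.5, 1.5, f, 0.25)   # full-swing 1 pF net
p_static = static_power(50e-6, 1.5)               # 50 uA bias current
print(f"dynamic {p_dyn*1e6:.1f} uW vs static {p_static*1e6:.1f} uW")
```

With these assumed numbers the static bias power exceeds even the full-swing dynamic power it would be saving, consistent with the conclusion that slow-running PGAs cannot amortize static current.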
In addition to differential techniques, many ingenious ways of pulsing drivers to
achieve reduced signal swing have been reported [9]. Incorporation of such concepts into
a PGA has proved precarious due to the dynamic nature of such schemes.

FIGURE 4-11: Pseudo-Differential Receiver for Low Swing Signals

PGA interconnect can be directly modeled as a distributed RC line with significant resistance and capacitance. As a result, after a driver is pulsed off, the delivered charge redistributes among the
entire interconnect line and settles to some equilibrium voltage. This voltage is often
lower than the point at which the driver turned off causing inadequate signal levels at the
receiver or worse yet, oscillations as the driver repeatedly pulses on until a stable signal
level is established. Since the RC of PGA interconnect paths is not a fixed parameter,
reliable design of the charge delivery circuitry is extremely difficult. In areas of the PGA
where some tighter measure of control over the variation of interconnect RC can be relied
upon, pulsing techniques may become useful.
Various simulations were performed to evaluate many of the low swing techniques
described above. In general, they did not perform robustly or did not offer a significant
energy advantage. Consequently, low swing circuitry was not included in the final design
as a way to reduce dynamic power consumption in PGAs. However, many of the issues
complicating low swing design in a PGA environment were made clear from this phase of
design exploration and further research in this area may provide new solutions. As an
example, one area where low swing circuitry may prove beneficial is in the upper levels of
an interconnect hierarchy. On the long interconnect lines at this level, wiring parasitics
will dominate and limited fanout will constrain the variations in load capacitances.
CHAPTER 5
Low Voltage Pass Transistor Design
5.1 Design Issues with Pass Transistors at Low Supply
In previous 5 volt designs, the fact that NMOS transistors do not pass a full logic high
was a small concern. However, at an intended supply voltage of 1.5 volts, several circuit
design issues had to be studied in order to evaluate the continued feasibility of pass
transistor switches. A thorough examination of threshold loss effects provided a basis
with which to evaluate circuit performance at low supply voltages, specifically at voltages
nearing the sum of the nmos and pmos thresholds. In order to deal with the problems of
low voltage pass transistor design, lowered device thresholds and level restoration
techniques are discussed.
TABLE 5-1 HP Process Parameters

  Drawn Channel Length    0.6um
  N-channel Vt            650mV
  P-channel Vt            -850mV
  Poly pitch/spacing      0.6um/0.9um
  M1 pitch/spacing        0.9um/0.9um
  M2 pitch/spacing        0.9um/0.9um
  M3 pitch/spacing        1.5um/0.9um
  Well type               N-well process
5.1.1 Signal Loss Effects on Performance
The most serious consequence of designing with pass transistors at a supply voltage of
2Vt is the inability to effectively flip a standard inverter. Figure 5-1 shows a typical sce-
nario where a pass transistor feeds the gate of an inverter which is skewed low from the P/
N ratio of 1/2. Unfortunately, the inverter switching point does not move much lower
since the supply voltage is the sum of the thresholds.
In the case of Figure 5-1, the pass transistor is effective at pulling low so the inverter’s
pmos device is fully turned on giving a solid low-to-high transition. However, when
attempting to drive an inverter from an nmos pass transistor, the inverter experiences only
100mV of gate overdrive with which to make a high-to-low transition. The maximal volt-
age that can be passed through the nmos transistor sits at Vdd-Vtn(body effect). If the
pass transistor has a nominal Vt of 650mV, then it can barely charge Vint to 800mV after a
significant delay (20ns). Even if the inverter is skewed so the nmos is twice minimum size
and the PMOS is minimum sized, the pass transistor/inverter propagation delay is 4.5ns.
One should note that tpHL is significantly longer than the propagation delay (nearly 2x).
In a typical CMOS path, a transmission gate would produce a full-swing input with which
to drive the inverter. As a result, the gate overdrive voltage is approximately a Vt for both
transitions which is enough for reasonable performance at Vdd=1.5 volts. However, in the
configuration of Figure 5-1 an acceptable level of performance cannot be achieved.
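The arithmetic behind the degraded transition can be sketched as follows. The 0.05 V body-effect increment is an assumed value; note that this simple model yields a slightly more optimistic overdrive than the roughly 100 mV figure quoted from simulation.

```python
# Sketch: signal level and gate overdrive when an nmos pass transistor
# feeds an inverter at Vdd = 1.5 V. The 0.05 V body-effect increment is
# an assumed value; the thesis's simulations quote ~800 mV at Vint and
# roughly 100 mV of effective overdrive.

VDD = 1.5
VT_N = 0.65           # nominal nmos threshold
BODY_EFFECT = 0.05    # assumed extra threshold from body effect

vint_max = VDD - (VT_N + BODY_EFFECT)   # best high level at Vint
overdrive = vint_max - VT_N             # inverter nmos Vgs - Vt
print(f"Vint(max) = {vint_max:.2f} V, overdrive = {overdrive*1000:.0f} mV")
```

The point of the sketch is structural: the passed level loses one (body-affected) threshold, and the receiving nmos then loses another, leaving only a sliver of overdrive at a 1.5 V supply.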
Lowering the threshold voltage of the pass transistors can alleviate the above problem
if the process technology supports such an option. Low Vt pass transistors will enable a
higher output voltage to be passed to the input of the inverter giving the nmos more gate
drive.

(Circuit of Figure 5-1: a 0.9u/0.6u nmos pass transistor drives node Vint, the input of an inverter built from 1.8u/0.6u devices; fanout = 2, Vdd = 1.5v, Vt = 650mV.)
FIGURE 5-1: NMOS Pass Transistor Feeding an Inverter

A lower bound on threshold voltage of 200mV was chosen for the following analysis. Although an even lower threshold would have produced better signal levels, the
200mV limit was set because of controllability and leakage constraints. When using a low
Vt transistor, the threshold variations due to process can cause excessive delay variability
[37]. Circuits have been demonstrated to control variation by using a leakage monitoring
feedback loop allowing sufficient Vt stabilization at 200mV [19], [33]. In addition to Vt
variation, excessive leakage currents can develop when using low Vt devices which con-
tributes another lower bound to Vt. The topic of leakage will be discussed in Section 5.3.
In order to simulate circuits with lower thresholds, the model file for the targeted HP pro-
cess was modified to give a Vt of 200mV by adjusting the flat-band voltage. Changing the
threshold in this way should have a minimal effect on the other device parameters.
The threshold voltage's impact on propagation delay was investigated using the bulk bias as a control parameter for the circuit in Figure 5-1. The low Vt transistor was used as a baseline, and by applying a negative bias to the bulk connection the
threshold voltage could be increased. Table 5-3 shows well bias voltages and the resulting
Vt values for a source voltage of 0 volts.
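This well-bias dependence follows the standard body-effect relation, which can be sketched as below; the gamma and 2*phiF values are assumed fitting constants, not HP process parameters, and only roughly track the tabulated points.

```python
# Sketch: the standard body-effect relation used to raise Vt with
# reverse well bias. gamma and 2*phiF are assumed fitting constants,
# not HP process values; they only roughly track Table 5-3.
import math

def vt_with_bias(vbs, vt0=0.2, gamma=0.3, two_phi_f=0.7):
    # Vt = Vt0 + gamma * (sqrt(2*phiF - Vbs) - sqrt(2*phiF))
    return vt0 + gamma * (math.sqrt(two_phi_f - vbs) - math.sqrt(two_phi_f))

for vbs in (0.0, -0.85, -3.0):
    print(f"Vbs = {vbs:+.2f} V -> Vt = {vt_with_bias(vbs)*1000:.0f} mV")
```

The key property used in the experiment is simply that a more negative well bias monotonically raises the threshold, letting one device model sweep the 200-650 mV range.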
TABLE 5-2 Comparison of Nominal Vt and Low Vt Nmos Devices (w=1.8u)

                           Low Vt (Vbs=0v => Vt=200mV)   Nominal Vt (Vbs=-3.3v => Vt=650mV)
  Max Vout                 1.1V                          0.8V
  Ipeak (diode connected)  200uA                         100uA

TABLE 5-3 Vt Variation with Well Bias (for Vsource=0v)

  Threshold Voltage   Well Bias
  200 mV              0v
  350 mV              -0.85v
  500 mV              -3v
  650 mV              (original model file used)

Figure 5-2 depicts the dependence of delay as a function of pass transistor threshold voltage. The nominal Vt inverter curve shows the delay for the case in which a pass transistor feeds an inverter whose device thresholds are fixed at the nominal process Vts
(Table 5-1). The second curve shows the delay when a pass transistor is connected to an
inverter composed of devices fixed at a low Vt (Vtn=200mV,Vtp=250mV). The delays are
normalized to 312ps with a fanout of 2. From the graph, one can see that there is a signif-
icant delay penalty (>5x) as the threshold voltage of the pass transistor feeding a nominal
Vt inverter is raised, whereas the case of driving a low Vt inverter sees little variation in
delay (1x). However, circuit delays experience only a 1x slowdown if 500mV pass tran-
sistor thresholds are available. Clearly, signal loss through a pass transistor is a serious
problem when designing for a process with high thresholds. It should also be noticed that
at a supply voltage of 1.5 volts, the delay through a low Vt pass transistor and inverter is
nearly 3 times larger for the nominal Vt inverter case (790ps) than the low Vt inverter case
(312ps). This scenario demonstrates the strong correlation between threshold voltage and
delay at supply voltages nearing the sum of the device thresholds.
A closer look at the effects of supply voltage on pass transistor performance is shown
in Figure 5-3. In this case, the delay of a pass transistor feeding an inverter composed of
nominal Vt devices was plotted. One can see that if thresholds are not scaled appropri-
ately, pass transistor design is significantly complicated by lowering supply voltages. As
the pass transistor threshold voltage is varied, the delay at a 3.3 volt supply experiences
less than 10% degradation. At a 1.5 volt supply, a 5x slow-down occurs, but by raising the supply voltage to 2 volts, only a 70% decrease in performance is suffered. Thus, either threshold scaling of pass transistors or a slightly increased supply voltage is necessary to make pass transistors viable for ultra low power design.

FIGURE 5-2: Pass Transistor Threshold’s Impact on Delay
5.1.2 Multiple Threshold Losses
In general, pass transistors are best utilized in series chains. However, there are times
when it is desirable to feed a pass transistor output to the gate of another pass transistor
network. Such a scenario is depicted in Figure 5-4 which was an early version of the logic
cell in [20]. During the design of pass transistor based circuitry, one must consider the
detrimental impact of successive losses in signal level. First, the loss of signal presents
less gate overdrive on fanout gates. The use of pass transistors with 200mV thresholds can
alleviate this problem, but only temporarily because after three drain to gate connections
the high output becomes a mere 600mV. Once again, a nominal Vt inverter cannot be
flipped. As a result of the build-up in signal loss, level restoration becomes necessary at
critical points in a circuit to insure proper operation. As will be shown, level restoration is
costly. Consequently, to avoid paying the restoration penalty too often, a designer will
need to trade off circuit flexibility.
FIGURE 5-3: Supply Voltage Impact on Delay
The second important issue involving multiple threshold losses is the observation that
the number of thresholds lost at any point in a circuit sets a lower bound on the supply
voltage. This requirement is similar to the Vt constraints in low voltage analog design.
Obviously, low power design with pass transistors is severely restricted: if too much signal is lost along any one path, a logic high can no longer be distinguished from ground.
The progressive loss in signal level is illustrated in Figure 5-5. Unless low Vt pass
transistors are used, noise margins become extremely small after just one threshold loss
with Vdd=1.5 volts. Even at a Vt of 200mV, two drain to gate connections in series would
result in a high level that is too susceptible to noise with a supply of 1.5 volts. In the case
of the nominal process Vt (650mV), the signal is completely lost after two drops. In order
to remedy this problem, level restoration must be performed.
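The progression can be sketched with a simple per-stage loss model; the 0.3 V and 0.75 V effective per-stage losses are assumed values (thresholds inflated by body effect) chosen to match the 600 mV-after-three-drops and lost-after-two-drops observations.

```python
# Sketch: best-case high level after k drain-to-gate stages. The
# per-stage losses (0.3 V for the low Vt device, 0.75 V for nominal)
# are assumed effective values including body effect, chosen to match
# the observations in the text.

def high_level(vdd, vt_eff, drops):
    # each drain-to-gate connection costs one effective threshold
    return max(0.0, vdd - drops * vt_eff)

VDD = 1.5
print(high_level(VDD, 0.30, 3))   # low Vt: ~0.6 V left after 3 drops
print(high_level(VDD, 0.75, 2))   # nominal Vt: 0.0 V, signal gone
```

The model also makes the supply-voltage bound explicit: the supply must exceed the total accumulated threshold loss along the worst path, or the high level collapses to ground.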
(Differential inputs a/a#, b/b#, and cin/cin# are combined through pass transistor stages to outputs fout/fout#; memory cells configure the network, and successive drain-to-gate stages accumulate 1, 2, and 3 Vt drops.)
FIGURE 5-4: Logic Cell Design with Multiple Drain to Gate Pass Transistor Connections
FIGURE 5-5: Signal Loss Levels at Various Supply Voltages
5.2 Level Restoration
As previously explained, level restoration quickly becomes a necessity if reasonable
delays are to be achieved and proper circuit operation maintained. Several options exist,
but all revolve around ratioed logic. Among the possibilities are 1) Feedback pmos, 2)
Cross-Coupled pmos (CCP), 3) Sense Amplifying Latch (SAL) [32],[31] and 4) the case
of no restoration. The three restoration schemes are depicted in Figure 5-6.
All four cases were simulated using a test circuit where the restorer outputs were
loaded with inverters and fed by 1.8um pass transistors. In the case of CCP and SAL, dif-
ferential inputs are needed. In all circuits, the well bias of the pass transistors was varied
according to Table 5-3 to give thresholds from 200mV to 650mV. The restorer transistors
were all nominal Vt. The following graphs show the propagation delay from the full scale
input through a pass transistor to the output of the inverter. Energy is also examined
because static current can flow during the restoration process as the ratioed devices tempo-
rarily fight each other. It is important to note that the supply voltage of ~2Vt essentially
eliminates the transition currents, but at higher supply voltages such currents may become
substantial. Data are plotted in Figure 5-7. The first graph shows values when the control voltage
on the pass transistor is at Vdd (the case when driven by a full-swing input).
From the results shown, there is a considerable trade-off in delay and energy among
the restoration schemes. When using a low Vt pass transistor, all the techniques experi-
ence similar delays. However, the energy of the differential versions is much higher
because of the necessity to duplicate paths. The best alternative from a delay perspective
is the cross-coupled pmos restorer. In this case, the signals are differential so when one
FIGURE 5-6: Level Restoration Schemes (feedback restorer, cross-coupled pmos, sense amplifying latch)
path gets pulled to ground, the pmos of the other path is fully turned on and the weak pull-
up of the nmos pass transistor is sped up and brought to the supply rail. This type of dual
rail signalling holds much promise for low voltage pass transistor designs where
speed is most critical since the quick pull-down action governs circuit speed. As the
pass transistor thresholds are raised, the non-restored and feedback pmos methods do not
fare as well as the dual rail techniques and experience significant performance degrada-
tions. When moving to a nominal Vt pass transistor (Vt=650mV), the two single-ended
techniques become impractical. Basically, differential signalling must be used in order to
restore signals when designing for supply voltages near 2Vt.
A closer look at the results shows that the sense amplifying latch is just a fraction
slower than CCP. In fact, the SAL will always be slower than the CCP in this capacity.
The possible benefit from the SAL is that the full inverter will enable pull-down and pull-
up paths thus amplifying the transition through feedback in both directions. When closely
examined, it is found that the circuit does not take advantage of this because the pull-up by
the pass transistors is much slower than the pull-down. As a result, the pmos part of the
restorer is always enabled first by a low going transition. The nmos devices are never suf-
ficiently turned on by the degraded pull-up of the pass transistors. Furthermore, the pres-
ence of an extra nmos path to ground on the pull-up side further slows the high going
transition of the pass transistor output until it is completely turned off by the other branch.
FIGURE 5-7: Level Restorer Delay and Energy Comparison with Single Threshold Loss
In terms of energy, the higher currents required to flip a full inverter cause at least a 15%
energy increase over the CCP technique. As a side note, the sense amplifying latch does
offer the possibility of easily incorporating a latch into the output of the logic block. The
addition of the clocked pass transistors to the path results in a 0.5ns slow-down which is
not excessive if the latches are deemed appropriate. Similar conclusions about the sense
amplifying latch were reached in [50].
The next set of data reflects the performance of the restorers when the control voltage
is derived from a circuit where more than one threshold drop is traversed (Figure 5-8). In
all cases, pass transistors with 200mV thresholds were used. The voltage that initially
appears at the restoration node moves from Vdd-Vt to Vdd-2Vt to Vdd-3Vt.
From the above graph, it is evident that the restoration delay increases with increasing
signal loss. When a signal that has lost 3 thresholds is to be restored, a 2x increase in
delay results. It should also be noted that the short-circuit current from the restoring
device to ground during the transition period grows quickly with increasing restoration
delay especially for the feedback pmos technique. Another interesting note is that the
feedback restorer failed to fully restore signals within the 5ns window when fed by an
input that has dropped by more than 1 threshold. As a result, significant design flexibility
is lost when relying on single-ended restoration.
FIGURE 5-8: Level Restorer Delay Comparison with Multiple Threshold Loss (low Vt pass transistors)
Section 5.3: Vt Scaling and Leakage 83
An examination of the variation in restoration delay with supply voltage also yields
some insights into low voltage design with pass transistors. Figure 5-9 shows the delays
of the various techniques as the supply voltage ranges from 3.3 volts to 1.5 volts. At
higher voltages, restoration is fairly quick, but it degrades with a square-law dependence
as the supply is lowered. A possible way around this would be to scale the Vt's on the
restoring devices as well, but this has severe consequences on sizing and leakage as dis-
cussed in Section 5.3. As a final note, the large increase in energy for the non-restored
case at 3.3 volts is a result of static current dissipation from the inadequate turning off of
the pmos device.
5.3 Vt Scaling and Leakage
5.3.1 Subthreshold Leakage Currents
The use of low Vt pass transistors mitigates many of the problems with pass transistor
logic design at low voltages, but they do not come without a price, namely higher sub-
threshold leakage currents. According to the paper by Sakurai [19], leakage currents fol-
low an exponential dependence on Vt where s is defined as the subthreshold slope.
FIGURE 5-9: Restorer Delay and Energy for Different Supply Voltages (low Vt pass transistors)
At room temperature, for a W/L = 7 and Vdd=1.5V, the leakage goes as shown in Figure
5-10 below.
The above values were verified by simulation of a few scenarios. When using a Vt of
200mV, one can expect about 1e-8A or 10nA of leakage current. Although the current
from a single transistor is not significant, the use of hundreds of thousands of these devices
can result in currents in the mA range which would be detrimental in low power applica-
tions.
Isub ∝ exp[(Vgs - Vt)/s]    (EQ 16)
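To put EQ 16 in concrete terms, the sketch below scales the ~10 nA figure quoted for a 200 mV device up to array scale; the 100 mV/decade subthreshold slope is an assumed round number, not the measured process figure:

```python
import math

def subthreshold_leakage(vt, i0=10e-9, vt_ref=0.200, s_mv=100.0):
    """Off-state leakage per device, scaled from the ~10 nA observed at
    Vt = 200 mV using the exponential dependence of EQ 16. s_mv is an
    assumed subthreshold slope in mV/decade."""
    s = s_mv / 1000.0 / math.log(10.0)   # convert mV/decade to the 's' of EQ 16
    return i0 * math.exp(-(vt - vt_ref) / s)

per_device = subthreshold_leakage(0.200)   # ~10 nA for a low-Vt device
array_total = 100_000 * per_device         # hundreds of thousands of devices
print(f"{array_total * 1e3:.1f} mA")       # prints "1.0 mA"
```

A nominal 650 mV device sits 4.5 decades lower on the same curve, which is why leakage only becomes a concern when low Vt devices are used wholesale.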
FIGURE 5-10: Measured Sub-threshold Id-Vt Characteristics for Various Vgs [19]
FIGURE 5-11: Sneak Leakage Path in Between Pass Transistor Connected LUTs
In fact, leakage is the primary reason why low Vt devices cannot be used in all the
gates of a PGA array. In some cases, leakage in pass transistor chains like Figure 5-11 can
be controlled by programming the configuration cells to zero, thus eliminating the voltage
drop across leaky transistors in a sleep mode. The use of a dual Vt process where the
memory cells use high Vt devices could also alleviate most leakage concerns. If neces-
sary, further control of subthreshold leakage can be obtained by increasing the reverse bias
on the well, which causes Vt to increase, resulting in an exponential decrease in the
leakage component. Unfortunately, separate access to the well gives rise to an area over-
head since well contacts will need to be specially placed to connect to a dedicated well
bias line.
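The well-bias knob can be sketched with the standard body-effect relation; the body-effect coefficient and surface potential below are illustrative textbook values, not the parameters of the 0.6um process:

```python
import math

# Illustrative body-effect parameters (assumed, not process-extracted).
GAMMA = 0.5    # body-effect coefficient, V^0.5
PHI_2F = 0.7   # 2*phi_F surface potential, V

def vt_with_body_bias(vt0, vsb):
    """Threshold voltage under reverse source-to-body (well) bias."""
    return vt0 + GAMMA * (math.sqrt(PHI_2F + vsb) - math.sqrt(PHI_2F))

# A 1 V reverse well bias lifts a 200 mV threshold by roughly 230 mV,
# which at ~100 mV/decade (EQ 16) cuts subthreshold leakage by more
# than two orders of magnitude.
print(round(vt_with_body_bias(0.20, 1.0), 3))   # 0.434
```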
5.3.2 Static Current Dissipation in Pass Transistor Fed Inverters
So far, low Vt devices appear to offer superior performance as long as leakage is con-
trolled. One might be tempted to employ low Vt devices for all logic transistors, not just
the pass transistors. Up to this point, static current dissipation caused by insufficient turn-
ing off of the pmos device in an inverter has not been discussed. However, if all devices
are low Vt, the degraded high level will not be able to turn off a pmos device and a static
current path will exist when driving a logic low (Figure 5-12).
A combination of low Vt pass transistors feeding nominal Vt inverters solves the static
current problem associated with inadequate turning off of the pmos device. In effect, as
long as the logic high value produced by a pass transistor differs from Vdd by less than the
threshold voltage of a pmos device, then no static current will flow.
(EQ 17)   Vt,pass-transistor(Vbs) < Vt,pmos
FIGURE 5-12: Static Current Dissipation for Low Vt Inverter fed by Pass Transistor (fanout = 2, Vdd = 1.5v)
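A quick numeric check of EQ 17, using the nominal thresholds of this process (Vtn = 0.65v, Vtp = 0.85v, quoted later in Chapter 6) and the 200 mV low-Vt option:

```python
def pmos_fully_off(vdd, vt_pass, vt_pmos):
    """EQ 17: the degraded high (Vdd - Vt,pass) fully turns off a pmos
    only if the resulting source-gate voltage -- which equals Vt,pass
    itself -- stays below the pmos threshold."""
    v_high = vdd - vt_pass
    return (vdd - v_high) < vt_pmos   # equivalent to vt_pass < vt_pmos

# Low-Vt pass transistor into a nominal-Vt inverter: no static path.
print(pmos_fully_off(1.5, 0.20, 0.85))   # True
# Nominal-Vt pass transistor into a low-Vt inverter: static current flows.
print(pmos_fully_off(1.5, 0.65, 0.20))   # False
```

This is exactly the combination argued for above: low Vt pass transistors feeding nominal Vt inverters.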
One word of caution should be mentioned here. The leakage current through a nomi-
nal Vt device can still be significant when the gate is not completely turned off. Figure 5-
10 shows that there is approximately a 100x increase in leakage currents for every 200
millivolts of additional gate to source voltage for devices that are not completely turned
off regardless of threshold voltage. Thus, if too many devices are driven with a poor logic
high, the aggregate leakage current could become substantial. Fortunately, the pmos
threshold is normally larger than the nmos threshold, which allows some extra margin. This scenario
serves as another example of where early power estimation can be valuable. If the calculated
leakage falls within an acceptable fraction of the total power budget, then special design
measures (well bias, etc.) need not be applied.
In the end, low Vt devices were not used for the final design to be discussed because
they were not available in the target process. Nevertheless, a comparison showing the
effects of low Vt devices is useful to illustrate how design options may change if given a
more advanced process technology.
CHAPTER 6
Cell Design and Array Architecture
The circuit design and architecture for a low power PGA design are the focus of this
chapter. Several circuit options for the logic cell are examined along with the design of
the interconnect circuitry. Throughout the discussion, various trade-offs among supply
voltage, logic style, and performance are evaluated. Minimization of capacitance and
signal swing factored predominantly in the circuit design phase. Architecturally, this
design strives to achieve low power while still maintaining high flexibility and efficient
logic utilization. A direct-path interconnect mesh joins 3-LUT logic cells thus providing
low capacitance connections for short distance routes. In addition, a local bus routing
grid allows datapath style connectivity to be efficiently supported. Lastly, long-distance
routing is achieved through a hierarchy of coarse interconnect grids.
6.1 Logic Cell Circuit Options
Although the interconnect consumes most of a PGA’s power, the logic cell design cannot
be ignored for several reasons. The composition of a logic cell defines the functional-
ity that can be performed by each array element. Simpler cells imply that a greater
number of cells will need to be interconnected to realize a particular overall function.
Conversely, a larger, more capable cell can pack more logic within the cell and thus use
potentially fewer external nets. A complete study of the effects of cell complexity and its
relation to power consumption was beyond the scope of this work, but a couple of
Section 6.1: Logic Cell Circuit Options 88
trade-offs were explored to determine what a suitable granularity would be. The work described
in [20] used a very simple logic cell which could be thought of as offering a bit more
flexibility than a 2-LUT. Results from mappings and circuit simulations revealed that the
fine granularity combined with a restrictive input structure caused a large number of cells
to be needed to implement fairly simple mappings. Since a large number of cells were
needed, a large amount of interconnect was required and despite the fact that the intercon-
nect was low in capacitance, the aggregate energy consumption remained higher than
desired. On the other hand, moving to a more complex LUT would require more configu-
ration cells and interface circuitry. One should keep in mind that increasing the number of
inputs to a LUT causes an exponential increase in LUT configuration cells and will also
require more overhead if intermediate outputs are brought out (i.e., a 2-LUT needs 4
memory cells, a 3-LUT needs 8 memory cells, and so on). More memory cells directly
impact the basic cell area and thus add to the wiring capacitances all the interconnect will
see. In addition, the incomplete utilization of a complex logic cell will indirectly translate
to a waste in energy. Consequently, the cell designs considered were based on a 3-LUT
style as a reasonable balance in efficiency.
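The exponential growth argument is easy to make concrete:

```python
def lut_config_cells(k):
    """A k-input LUT stores one configuration bit per input combination."""
    return 2 ** k

def lut_functions(k):
    """Number of distinct k-input Boolean functions a full LUT covers."""
    return 2 ** (2 ** k)

for k in (2, 3, 4):
    print(k, lut_config_cells(k), lut_functions(k))
# The memory (and the wiring capacitance its area adds) doubles with
# every added input, while functional coverage grows double-exponentially.
```

The 3-LUT sits at the point where eight cells buy complete coverage of all 256 three-input functions without the area of a 4-LUT's sixteen cells.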
Performance is another reason why the logic cell design was carefully evaluated. Even
though speed was not the primary goal of this PGA design, achieving a reasonable level of
performance is critical to building a truly viable design. In fact, the optimization of the
energy-delay product provided a guide for balancing the metrics of energy and speed
throughout the design process. In terms of speed, the logic cell delay represents at least
50% of the total delay for typical mapped designs. Obviously, care must be taken in the
cell design to avoid substantial performance degradation for the entire array.
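As an illustration of how the energy-delay product reconciles the two opposing metrics, the sketch below ranks the logic-cell figures quoted later in this chapter; styles whose delay is not given numerically are omitted, and the two-stage branch-based energy is stated in the text only as "about the same" as one stage:

```python
# (delay ns, energy pJ) pairs quoted in Sections 6.1.6 and 6.1.7.
candidates = {
    "NAND decoder":          (3.4, 0.61),
    "branch-based, 1 stage": (5.7, 0.50),
    "branch-based, 2 stage": (4.5, 0.50),
}

def energy_delay(delay_ns, energy_pj):
    """Energy-delay product in pJ*ns: lower is better on both axes."""
    return delay_ns * energy_pj

for name, (d, e) in sorted(candidates.items(), key=lambda kv: energy_delay(*kv[1])):
    print(f"{name:24s} {energy_delay(d, e):.2f} pJ*ns")
```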
The logic cell design also plays a role in the determination of supply voltage, and
interconnect structures. As mentioned in Section 5.1.2, circuit topology can easily
become the limiting factor in lowering supply voltage and hence power. At the outset of
this research, the target supply voltage for the design was 1.5 volts and so circuit solutions
focused on that specification. In addition to supply voltage, the type of circuitry used in a
logic cell will often dictate the circuit characteristics of the interconnect network (e.g.
interface circuitry). Any impact on the interconnect network will likely have a strong
correlation to the resulting PGA power consumption. Lastly, the amount of energy
dissipated by the logic block should not be ignored. Although the CLB energy was shown
to be minor in comparison to the interconnect of a Xilinx, optimization of only one aspect
in a design will always result in diminishing returns.
The next several sections contain a brief evaluation of various logic cell implementa-
tions. After describing each separately, all the schemes are compared with respect to
delay, energy and energy-delay product in Figure 6-10 and Figure 6-12. The final choice
of logic cell implementation is explained at the end.
6.1.1 Pass Transistor
A simple method for implementing a logic cell is depicted in Figure 6-1. The inputs to
the cell are derived from a pass transistor based interconnect network. The entire 3-LUT
structure is basically a tree decoder where the inputs select which configuration bit
appears at the cell output. The structure is entirely composed of nmos-only pass
transistors which minimizes parasitic capacitances. The fact that all functions of 3 inputs
are compactly available using only nmos devices and a single rail signalling is attractive
from a power perspective. The main problem lies with the two threshold voltage drops
that a signal experiences as it passes from input to output. As discussed in the previous
chapter, the minimum supply voltage for the design will be limited by the 2 Vt drops
(including body effect) plus some margin for resolving a 0 or 1, or about 4Vt in total. This point
occurs at a supply voltage of approximately 3 volts for the process under consideration
(Vtn=0.65v, Vtp=0.85v). Even at 3 volts, the inverters must be skewed low to get
reasonable performance. At lower voltages, the circuit quickly fails making a 1.5 volt
supply completely out of the question. The most important thing to realize in this case is
that the topology of circuits using only pass transistors should be restricted to as few drain
to gate connections as possible to achieve low supply operation.
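Behaviorally, the 3-LUT tree decoder is just an 8-to-1 multiplexer steering one configuration bit to the output; a functional sketch (not the transistor netlist):

```python
def lut3(config_bits, a, b, c):
    """Behavioral model of the 3-LUT tree decoder of Figure 6-1: the
    inputs select which of the eight configuration bits appears at the
    cell output."""
    assert len(config_bits) == 8
    return config_bits[(a << 2) | (b << 1) | c]

# Program the LUT as a 3-input XOR: bit i of the truth table is the
# parity of i's binary representation.
xor3 = [bin(i).count("1") & 1 for i in range(8)]
print(lut3(xor3, 1, 1, 0))   # 0
print(lut3(xor3, 1, 1, 1))   # 1
```

Any of the 256 three-input functions is obtained the same way, by loading the appropriate truth table into the configuration memory.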
6.1.2 Cross-Coupled PMOS Restorer (CCP)
Although the basic pass transistor implementation is inadequate at a supply voltage of
1.5 volts, nmos pass transistors can still be employed as long as differential signalling is
used to allow level restoration (Section 5.2). Figure 6-2 depicts the resulting LUT topology.
The cross-coupled pmos structure belongs to the CPL logic family, and also bears a
striking resemblance to the differential cascode voltage switch logic (DCVSL) family.
In fact, the gate style is identical except that the LUT replaces the DCVSL ground
connections with connections to the configuration memory cells. By using this structure, a
pass transistor signal path will operate at 1.5 volts which was impossible in a single ended
design.
The cross-coupled pmos (CCP) gates possess several beneficial properties for a low
power LUT design. Despite the complete duplication of paths, the energy when compared
to a static CMOS implementation does not double. In fact, the energy is approximately
equal to the transmission gate case in Section 6.1.5. The reason is that the signal swing on
pass transistor driven nodes is reduced to only 750mV versus a full swing of 1.5 volts for
the transmission gates. More importantly, the differential CCP achieves a lower delay
FIGURE 6-1: Pass Transistor Logic Cell
than transmission gates for about the same amount of energy. Most of the delay
improvement comes from the fact that the pull-up is accelerated by the differential action.
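The swing argument can be quantified: the energy drawn from the supply to charge a node is Q·Vdd = C·Vswing·Vdd. The 20 fF node capacitance below is an illustrative assumption; the swings are the 750 mV and 1.5 volt figures from the text:

```python
def switching_energy_pj(c_ff, v_swing, vdd):
    """Supply energy per charging event: Q*Vdd = C*Vswing*Vdd.
    Capacitance in fF, result in pJ."""
    return c_ff * 1e-15 * v_swing * vdd * 1e12

full = switching_energy_pj(20.0, 1.50, 1.5)   # full-swing internal node
ccp  = switching_energy_pj(20.0, 0.75, 1.5)   # CCP pass-transistor node
print(f"{full:.4f} {ccp:.4f}")                # prints "0.0450 0.0225"
```

Halving the swing halves each node's energy, and duplicating the path doubles the node count back up, which is why the differential CCP lands roughly at parity with the full-swing transmission gate cell.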
Sizing complications and the necessity for differential signalling mitigate the superior
energy and delay performance of CCP. The path from the pmos devices through the nmos
chain represents a ratioed gate. In order to operate correctly, the pmos devices cannot be
made too strong or the low going signal will not be able to flip the complement path.
Likewise, an overly weak pull-up will slow the flipping of the inverter for the high going
side. In both cases, careful optimization is crucial to ensure reliable operation and
minimal short circuit current during signal transitions. The other drawback to implement-
ing the logic cells in this fashion is more difficult to deal with. Supporting fully differen-
tial paths can complicate the logic cell’s routing if its architecture has a high fanin and
fanout. In addition, differential paths will require a doubling of the interconnect area,
although careful planning of the layout on the local level may be able to minimize the
impact of differential signalling. Fortunately, a differential LUT does not require more
memory cells since true and complement signals are free; however, the area overhead from
the duplication of interconnect buffers and interface circuitry will certainly cause
undesired increases in cell area. Therefore, a single ended option was considered preferable.
FIGURE 6-2: Cross-coupled pmos Restored Pass Transistor Logic Cell
6.1.3 Boosted Capacitor Restoration (BCR)
Boosted capacitor restoration is another interesting technique to allow near 2Vt
operation of pass transistor based designs. Similar to the cross-coupled pmos style of
LUT, the boosted capacitor technique also relies on differential signals. Instead of using
an active pull-up, the low to high transition receives a charge boost from the coupling
capacitor. As the low to high side slowly charges through the pass transistor, the
complement path quickly flips the inverter causing the output to capacitively couple to
node X. Two benefits result from the bootstrapping effect. First, the edge rate of the low
to high transition becomes accelerated. Without the charge injection, the pass transistor
slowly reaches its final voltage in an RC fashion. More importantly, the coupling allows
the voltage to rise to a point above Vdd-Vt(body effect), thus surpassing the downstream
inverter’s switching point. This allows the output to transition quickly even at a supply
voltage of 1.5 volts. A capacitor in this process would be formed using the gate oxide of a
transistor. Simulations showed that 10fF of capacitance would be sufficient for good per-
formance. Upon close examination of the circuit, one might think that any benefits would
be offset because of the extra capacitance on the pass transistor nodes which would slow
the low going transition. Although there is some degradation to the high to low edge, the
fact that the nmos device is more capable when pulling low makes the slow-down insignif-
icant.
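A first-order charge-sharing estimate of the boost at node X: the 10 fF coupling capacitor is the simulated value from the text, while the 15 fF of fixed node capacitance is an illustrative assumption:

```python
def boosted_level(vdd, vt, c_couple, c_node):
    """Pass-transistor high level (Vdd - Vt) plus the capacitively
    coupled step injected when the complement-path inverter output
    rises by a full Vdd."""
    return (vdd - vt) + vdd * c_couple / (c_couple + c_node)

# Nominal Vt (650 mV) at Vdd = 1.5 V: the degraded 0.85 V high is
# kicked back up well above a mid-rail inverter trip point.
v_x = boosted_level(1.5, 0.65, 10e-15, 15e-15)
print(f"{v_x:.2f} V")   # prints "1.45 V"
```

This is the key effect: the node rises above Vdd - Vt, so the downstream inverter switches quickly even at a 1.5 volt supply.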
The original intention of devising the boosted capacitor restoration technique was to
develop a means of operating pass transistor networks faster and at a lower power than the
cross-coupled pmos method. Based on SPICE analysis, the BCR technique actually
FIGURE 6-3: Boosted Capacitor Restoration Input Circuitry
performs equally with the CCP method with regards to energy and delay. Unfortunately,
the necessity for differential paths could not be avoided and so the same comments made
about CCP apply here as well.
6.1.4 Buffered Pass Transistor
All the techniques for implementing a pass transistor LUT discussed so far have had
drawbacks which discourage their use. Thus, a re-examination of the simple pass
transistor LUT made sense. The main hurdle to using the nmos only pass transistors
concerns the voltage loss in passing a ‘1’ which makes it impossible to effectively flip an
inverter (Section 5.1). In the diagram of Figure 6-1, a signal would experience one
threshold loss after coming through a switch from the interconnect and then another in the
drain to gate connection of the tree decoder. By simply ensuring that a signal path never
encounters more than one threshold loss, the minimum supply voltage can be reduced to
about 2 volts and still achieve reasonable performance. Figure 6-4 shows the maximum
voltage levels that can be propagated along the signal path.
The resulting LUT design, called buffered pass transistor, retains many desirable
properties. Relaxing the supply voltage to 2 volts allows the LUT delay to decrease by
about 30% over the fastest 1.5 volt design. Compared to the other schemes at 2 volts, the
buffered pass transistor performs about the same as a transmission gate design, and about
FIGURE 6-4: Buffered Pass Transistor Logic Cell (Vdd = 2v, Vhigh = 1.1v)
0.5 ns slower than the fastest differential techniques. In terms of energy, the buffered pass
transistor consumes the least among the 2 volt implementations. The inputs to the LUT
see only nmos gates resulting in minimal input loading. In addition, the reduced internal
swings help to further reduce energy consumption in the LUT.
6.1.5 Transmission Gate
The transmission gate style of 3-LUT resembles the pass transistor method except that
the pmos is added. Figure 6-6 shows the schematic of such a circuit. Addition of the com-
plementary transistor allows the circuit to operate at 1.5 volts without the loss in
thresholds that plagued the nmos only pass transistor design. The energy of the design is
quite low at 0.44pJ per output charge and discharge cycle. The optimum sizing of the
pmos device relative to the nmos was determined to be 1. Moving to a P/N ratio of 3 to
compensate for the mobility difference actually increased delay and energy consumption
as shown in Figure 6-5.
Once again though, delay was the problem. At a supply voltage near 2Vt, a transmission
gate’s performance is severely degraded. As the signal proceeds through Vdd/2, a pseudo-
dead zone occurs as both the pmos and nmos devices are barely conducting. As a result,
there is a noticeable leveling of the transient waveform which translates into a poor
propagation delay. In addition, the pmos devices add some load to the inputs, although it
FIGURE 6-5: Impact of Transmission Gate Sizing
is not as severe as the static designs to be discussed next. In general, the transmission gate
method proved to be the best solution from an area, delay and energy perspective among
the static CMOS designs.
6.1.6 Decoder-Based
Another possible implementation of the logic cell was borrowed from static decoder
structures [28]. The circuit is depicted in Figure 6-7. By using complementary static
gates, the supply voltage could be lowered to 1.5 volts without the problems associated
with the pass transistor options. Instead of implementing the decoder as a tree, a set of
NAND gates select which configuration cell’s value is to be passed to the output. In
addition, only one line will be charged at a time resulting in low energy operation.
Simulations showed the circuit consumed about 0.61 pJ of energy.
Unfortunately, the delay and input loading requirements of the circuit made it undesir-
able. Typical delays from the slow input to the output were 3.4ns which was not
competitive with other design options. Another negative aspect of the circuit was the
vastly increased load on the logic cell inputs. In a pass transistor tree, one input
experiences a maximum load of 4 nmos transistor gates and an inverter to generate the
complement input. The static NAND decoder requires each input signal to connect to 8
nmos and 8 pmos gates. Thus, the energy advantage of this structure due to low voltage
operation is offset by the increase in energy necessary to drive the cell inputs. One might
FIGURE 6-6: Transmission Gate Logic Cell
remember that this structure was examined in Section 4.2.1 for implementing the input
multiplexers. The key difference between the two cases is that the inputs driving the nand
gates are essentially static when performing the input multiplexer function whereas they
are toggling in the LUT case. As a result, the circuit in Figure 6-7 is attractive for input
circuitry, but not for the logic cell.
6.1.7 Static Complex Gate
Keeping with the goal of 1.5 volt operation, the static CMOS logic style remains very
attractive. Besides a decoder type of implementation, one could envision a number of
other possibilities. The simplest would involve using a set of static CMOS gates to derive
functions. Some commercial designs employ this technique. The CLAy design described
in Section 2.4 uses an XOR and an AND gate as the basis for generating logic functions.
The main drawback with this ‘discrete’ gate method, however, is that a complete coverage
of all possible functions is not possible. For example, a logic cell with 3 inputs would
require 2^(2^3) = 256 gates, although some combinations are not unique. As mentioned earlier,
coverage of all possible functions is preferred since it greatly simplifies the mapping
process and can lead to more efficient utilization.
FIGURE 6-7: NAND Decoder Style LUT
Branch-based logic is another way to take advantage of the low voltage operation of
static CMOS and still retain complete functionality. Figure 6-8 shows the implementation
of a 2 input LUT. One can think of the gate as a direct translation of the truth table for a 2
input function. The memory cells control the upper and lower transistor in the stacks to
program whether a minterm should be a 1 or a 0. One problem, seen also in the decoder
style implementations, is the large increase in input loading. In addition, the extension of
a two input version to a three input version does not scale well. In the 3 LUT case, each
input (true and complement versions) must connect to 4 nmos and 4 pmos gates. Worse
yet, a total of 4 pmos devices must be placed in series for the pull-up paths leading to poor
delay and excessive capacitance and area. Even with intelligent sizing, the performance
and energy of this gate was inferior to other options (5.7ns & 0.5pJ). One possible way of
avoiding 4 series devices would be to break up the gate into a two stage design. After
doing this, the delay improved to 4.5ns and energy was about the same as the single stage
design. Overall, the static CMOS designs were not promising candidates for the logic cell
implementation despite their ability to operate at 1.5 volts.
6.1.8 Current Mode Logic
Current mode logic was another logic style that was investigated for the logic cell
implementation (Figure 6-9). Often, extremely small signal swings (100mV) can be
FIGURE 6-8: Static Branch-Based Logic Cell (only 2-LUT)
achieved by using this style. In order to operate with such small voltage excursions, a
static bias current flows through the circuit. Simulations showed that currents in the 10uA
range were sufficient for correct operation. However, the power consumed by this current
became excessive unless an operating frequency approaching 100MHz could be
maintained. In the designs considered, 100MHz operation would be very difficult and so
the static power would considerably outweigh a dynamic power dominated alternative.
Despite the static current problem of the design, three other issues proved most troubling
to any incorporation of current mode logic in a PGA. The necessity to level shift signals
to a comfortable common mode voltage depending on where they are located in the tree
leads to further design overhead and power consumption. Secondly, a source follower
output stage would be absolutely necessary for driving the capacitive interconnect lines to
other cells. Once again, more static power would be burned. Lastly, the current mode
technique would require using fully differential signal paths which would double the
amount of interconnect area.
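The static-versus-dynamic trade above can be sketched numerically. The 10 uA bias and 1.5 volt supply are from the text; the switched capacitance and activity factor are illustrative assumptions:

```python
# Static vs. dynamic power for a current mode gate. The 10 uA bias and
# 1.5 V supply come from the text; the 50 fF switched capacitance and
# 0.5 activity factor are assumed for illustration.
def static_power_uw(i_bias_a, vdd):
    """Always-on bias power, in microwatts."""
    return i_bias_a * vdd * 1e6

def dynamic_power_uw(c_f, vdd, f_hz, activity=0.5):
    """CV^2f switching power of a conventional gate, in microwatts."""
    return activity * c_f * vdd ** 2 * f_hz * 1e6

p_static = static_power_uw(10e-6, 1.5)   # ~15 uW regardless of activity
for f in (1e6, 10e6, 100e6):
    p_dyn = dynamic_power_uw(50e-15, 1.5, f)
    print(f"{f/1e6:5.0f} MHz: dynamic {p_dyn:6.3f} uW vs static {p_static:.1f} uW")
# Only as the clock approaches 100 MHz does the switching cost of a
# dynamic-power-dominated alternative begin to approach the fixed bias power.
```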
6.1.9 Low Threshold Devices
The techniques described so far have all been handicapped by the fact that the device
thresholds with body effect become a large portion of the supply voltage (Vdd ~2-3Vt).
The data in Section 5.1 suggest that using devices with lower threshold voltages will
alleviate the problem of inadequate inverter triggering. Three scenarios employing low
threshold devices were examined. The first replaced all pass transistors with low Vt
FIGURE 6-9: Current Mode Logic Cell Implementation (only 2-LUT for simplicity)
devices (Vt~200mV) and used nominal Vt transistors for the inverters. A process that
provides dual Vts makes the leakage problem more manageable than a process with all
low Vt devices because memory cells can be constructed without high leakage currents.
The other two cases used transistors with 200mV and 400mV thresholds respectively
throughout the design.
As one might expect, circuit delays improved dramatically. For all supply voltages,
the low Vt scenarios showed the fastest speed. The designs using all 200mV thresholds
were quickest among the group followed by the version that used low Vt pass transistors.
Unlike delay, low threshold devices did not offer a substantial improvement in energy over
the previously discussed designs. The main reason for this is that the low Vt devices cause
larger signal swings at a given supply voltage. However, the key benefit of low Vt devices
is the ability to operate at a lower supply voltage than allowed in a higher Vt design. A
lower overall supply voltage will save a large amount of energy especially in the pass
transistor-based interconnect network. Unfortunately, one caveat must be observed when
dealing with low Vt devices. The lower threshold will cause the problem of static current
dissipation discussed in Section 5.3.2 and must be dealt with to avoid excessive power
consumption. Lastly, the examination of low Vt scenarios was performed to investigate
the impact a low Vt process would have on PGA design. However, the process that was
intended for this design did not offer optimized Vts and so only the design options using
the higher thresholds were realizable.
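The triggering problem that motivates low Vt devices can be illustrated with a back-of-the-envelope check. The sketch below assumes an inverter trip point of Vdd/2 and treats the effective thresholds (including body effect) as round illustrative numbers, not extracted device data:

```python
# Illustrative sketch: an nmos pass transistor passes a degraded high of
# roughly Vdd - Vt (with body effect), and the receiving inverter trips near
# Vdd / 2.  The threshold values are assumptions for illustration only.
def swing_margin(vdd, vt_eff):
    """Margin between the passed 'high' level and the inverter trip point."""
    v_pass = vdd - vt_eff          # degraded high level out of the pass device
    v_trip = vdd / 2.0             # approximate inverter switching threshold
    return v_pass - v_trip

# Assumed effective thresholds including body effect (volts).
for vdd in (2.0, 1.5, 1.0):
    for vt in (0.9, 0.4, 0.2):     # nominal, 400mV, and 200mV devices
        m = swing_margin(vdd, vt)
        print(f"Vdd={vdd:.1f}V Vt={vt:.1f}V margin={m:+.2f}V "
              f"({'ok' if m > 0 else 'fails'})")
```

Under this crude model, only the 200mV devices retain positive margin at a 1 volt supply, which is consistent with the observation that the 200mV case was the only design functional below 1.5 volts.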
6.1.10 Final Cell Results
The following series of figures show the delay, energy, and energy-delay product
comparisons for the design styles that showed reasonable promise. The traditional pass
transistor data refers to the case described in Section 6.1.1. For each case, data is plotted
from a supply voltage of 5 volts to 1 volt to illustrate the relative degradation in speed and
the improvement in energy. One interesting thing to note is that the 200mV low Vt case
was the only design that could function below 1.5 volts.
The energy-delay product graph in Figure 6-12 provides a useful way to reconcile the
opposing metrics of energy and delay. The buffered pass transistor style of LUT offered
the best overall performance among the high Vt designs and was chosen for this low
power PGA design. However, it should be noted that the transmission gate LUT imple-
mentation was nearly as good. From the data, a supply voltage of 2 volts appears to offer
the optimum operating point. Even though a 1.5 volt supply was originally targeted, by
relaxing the supply by 500mV, both delay and energy could be improved.

FIGURE 6-10: Delay of Potential Cell Designs
FIGURE 6-11: Energy of Potential Cell Designs

Normally, energy and delay tend to trade-off against each other, but in this case, the slightly
increased supply voltage allowed more circuit implementation options to be explored
resulting in a more optimal single-rail design. Although this was an aggressive design,
as technology scales and circuit topologies become more sensitive to supply voltage,
the ability to optimize designs by using supply voltage as a variable will see
increasing importance.
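As a rough illustration of why an intermediate supply can minimize energy-delay product, the sketch below uses a generic first-order model (delay ∝ Vdd/(Vdd−Vt)², energy ∝ Vdd²). The threshold value and the resulting optimum are illustrative assumptions and are not fitted to the simulated data above:

```python
# Generic first-order model of the energy/delay trade-off that motivates
# treating the supply voltage as a design variable.  Vt and the constants
# are assumptions, not extracted device data.
VT = 0.9  # assumed effective threshold, volts

def delay(vdd):
    return vdd / (vdd - VT) ** 2

def energy(vdd):
    return vdd ** 2

supplies = [v / 10.0 for v in range(15, 51, 5)]   # 1.5V .. 5.0V
best = min(supplies, key=lambda v: energy(v) * delay(v))
for v in supplies:
    print(f"Vdd={v:.1f}V  delay={delay(v):6.2f}  energy={energy(v):6.2f}  "
          f"E*D={energy(v) * delay(v):7.2f}")
print(f"minimum energy-delay product near Vdd={best:.1f}V")
```

With these assumed numbers the minimum lands near 2.5 volts (analytically at Vdd = 3Vt for this model); the exact optimum of the real design depends on the simulated device data, not on this sketch.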
6.2 Logic Cell Output Circuitry
Keeping with the low energy goal of this PGA design, logic cell output buffers driving
large loads were sized with a larger size-up factor than the delay-optimal value of e. As
shown in [28], delay varies slowly around the optimum scaling value; therefore, use of a
larger size-up factor such as 4-5 results in energy savings from reduced driver capacitance.
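The sizing argument can be made concrete with a first-order inverter-chain model. The load value and the fanout-of-r delay approximation below are illustrative assumptions, not figures from the design:

```python
import math

# First-order model: driving a load F (in units of a minimum inverter), a
# chain targeting stage ratio f needs N = ceil(ln F / ln f) stages.  Stages
# are then sized geometrically, delay ~ N * r, and driver energy ~ the sum
# of stage sizes (total driver input capacitance).  Parasitics are ignored.
def chain(load, f_target):
    n = max(1, math.ceil(math.log(load) / math.log(f_target)))
    r = load ** (1.0 / n)                   # actual per-stage ratio used
    delay = n * r                           # fanout-of-r delay per stage
    energy = sum(r ** i for i in range(n))  # total driver capacitance
    return n, r, delay, energy

LOAD = 100.0  # assumed load, in minimum-inverter units
for f in (math.e, 4.0, 5.0):
    n, r, d, e = chain(LOAD, f)
    print(f"f={f:.2f}: stages={n} delay~{d:.1f} driver-cap~{e:.1f}")
```

For this assumed load, moving from a target ratio of e to 5 cuts the driver capacitance by more than half while increasing delay only about 10%, mirroring the [28] observation that delay is flat near the optimum.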
In addition, the logic cell output circuitry must be carefully designed to avoid several
pitfalls. As discussed in the power analysis chapter, significant power can be wasted in
output fanout chains. To avoid this, each driver chain was isolated from the fanout node
by inserting a pass transistor switch (Figure 6-13). Thus, only pass transistor drains and
enabled outputs are charged when the logic cell output transitions instead of all the
inverter chains. This technique is worthwhile when, on average, only a small fraction of
the potential output paths are configured at any given time.

FIGURE 6-12: Energy-Delay Product of Potential Cell Designs

A pmos pull-up must be added
to the inverter chain inputs in order to ensure that the node does not float when the input is
not being driven. Although setting the node high may result in an extra transition, this
only occurs during reconfiguration so no additional energy consumption takes place
during normal operation.
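A small sketch of the switched-capacitance saving from the isolation switches, with assumed round capacitance values (not extracted from the layout):

```python
# With isolation switches, a logic cell output transition charges only the
# switch drains plus the enabled driver chains; without them, every fanout
# driver chain toggles.  Capacitance values are assumed round numbers (fF).
C_CHAIN = 20.0   # input capacitance of one inverter driver chain (assumed)
C_DRAIN = 2.0    # drain capacitance of one isolation switch (assumed)

def switched_cap(n_chains, n_enabled, isolated):
    if isolated:
        # every switch drain is seen, but only enabled chains are charged
        return n_chains * C_DRAIN + n_enabled * C_CHAIN
    return n_chains * C_CHAIN          # all chains toggle regardless

for k in (1, 2):
    with_sw = switched_cap(8, k, True)
    without = switched_cap(8, k, False)
    print(f"{k}/8 outputs enabled: {with_sw:.0f}fF vs {without:.0f}fF "
          f"({without / with_sw:.1f}x saving)")
```

The saving is largest exactly in the case the text describes: only a small fraction of the potential output paths configured at once.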
Lastly, a programming signal must be included to disable all outputs that drive shared
busses. During reconfiguration, the state of the configuration memory cells will be inde-
terminate. As a result, several drivers may be enabled on a shared bus and may be
attempting to drive conflicting logic levels. The special programming signal ensures that
all drivers of shared busses will be disabled during reconfiguration. Without the
programming disable, hazards could cause excessive currents to destroy the chip while it
is being programmed.
6.3 Interconnect Circuitry
The interconnect circuit design was considered in conjunction with the cell design
options since the cell often impacts the interconnect implementation. Based on the
decision to use the buffered pass transistor scheme, the following interconnect
methodology was developed. The supply voltage of 2 volts allowed pass transistors to still
be used as the switches throughout the interconnect. At that supply voltage, sufficient
signal could be developed on the input of an inverter to trigger it allowing single ended
signalling to be used with feedback pmos restoration.
The ability to stay with pass transistors was critically important because a significant
power penalty would have to be paid if forced to move to full transmission gates.

FIGURE 6-13: Programmable Output Circuitry

Figure 6-14 shows the relation between pass transistor and transmission gate programmable
interconnect using the model depicted in Figure 4-5. At first glance, two factors contribute to
the increase in power that the transmission gate interconnect experiences. The addition of
the pmos device at least doubles the capacitance per switch. Simulations showed that the
optimal P/N ratio was 1/1 for speed and especially for power. Secondly, the signal swing
through transmission gate interconnect is larger than pass transistor interconnect since a
full rail signal is propagated. As a result, transmission gate interconnect theoretically
experiences a 4x increase in power dissipation.
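The roughly 4x figure follows directly from E = C·Vdd·Vswing. Using the 2 volt supply and the ~0.9 volt effective threshold implied by the 1.1 volt passed level discussed later, with a unit switch capacitance:

```python
# Energy comparison of one nmos pass-transistor switch versus one
# transmission gate, via E = C * Vdd * Vswing.  Vdd and the effective
# threshold follow the text; the unit capacitance C is arbitrary.
VDD, VT = 2.0, 0.9   # supply and effective nmos threshold (volts)
C = 1.0              # capacitance of one nmos switch (unit)

e_pass = C * VDD * (VDD - VT)    # degraded swing through the nmos switch
e_tgate = (2 * C) * VDD * VDD    # pmos added (~2x cap), full-rail swing
print(f"pass transistor: {e_pass:.1f}  transmission gate: {e_tgate:.1f}  "
      f"ratio: {e_tgate / e_pass:.1f}x")
# -> pass transistor: 2.2  transmission gate: 8.0  ratio: 3.6x
```

The doubled capacitance and the near-doubled swing each contribute roughly a factor of two, which is where the theoretical ~4x penalty comes from.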
One word of caution must be made concerning the above comparison of pass transistor
and transmission gate interconnect. The overhead of transmission gates becomes
moderated by the wiring capacitance contribution to total interconnect capacitance. Thus,
one must keep in mind the relative values of wiring and fanout capacitance to determine
the best switch implementation. In the studies performed to evaluate the interconnect
implementation space, estimates of fanout and wiring length were included. The typical
interconnect segment under consideration was estimated to have 50fF of wiring
capacitance and the equivalent of 12 pass transistor drains connected to it (Level 1 inter-
connect in Section 6.6.2). From those estimates, pass transistors were found to offer the
FIGURE 6-14: Delay and Energy for Interconnect Chains
best performance in terms of delay and energy consumption. Figure 6-15 depicts the
Energy-Delay product comparison of the interconnect chains.
A further optimization along the low power focus of the design was the use of a dual
supply voltage. As the design stands, all the circuitry operates from a 2 volt supply which
allows a high voltage of about 1.1 volts to be passed through an nmos pass transistor. In
effect, the pass transistors act as voltage converters allowing no more than 1.1 volts to be
passed in either direction. Although the energy from a 1.1 volt swing on the interconnect
is quite low, the fact that it comes from a 2 volt supply implies an inefficiency. Therefore,
all the circuitry except for the memory cells and the buffers which drive the LUT pass
transistors was moved to a 1.5 volt supply. 1.5 volts was chosen because going to an even
lower supply voltage (< 2Vt) would start incurring a substantial delay penalty. By using
the 1.5 volt supply, all the inverters driving large interconnect and load capacitances see a
25% energy savings. The movement to the dual supply environment was deemed
worthwhile because of the degree to which interconnect capacitance dominates PGA
power. In essence, the dual supply allows an interesting balance between delay and
energy efficiency to be made. By operating the LUT circuitry at a higher voltage, better
performance can be achieved. Since the LUT delay accounts for a sizeable portion of
overall path delay, and a small portion of energy consumption, raising the supply is
beneficial. Similarly, by reducing the supply on the cell interface and interconnect drivers,
which contribute a significant amount to total PGA energy consumption, a greater overall
energy reduction can be realized. Therefore, in designs where some blocks account for
most of the delay and other blocks consume most of the energy, a dual voltage scheme may
be appropriate to gain further leverage with which to optimize energy-delay product. As an
additional note, the efficiency of the circuitry that generates the supplies must also be
factored into the overall evaluation of a dual supply scheme. The other reason to move to a
dual supply is that level conversion was much quicker for a 1.1 volt signal feeding a 1.5
volt inverter. As a result, the restoration energy and delay penalty suffered at every
buffering point was lessened. Figure 6-16 depicts the general scheme.

FIGURE 6-15: Energy-Delay Product of Interconnect Chains
Using the dual supply scheme allows the general interconnect power to scale as
described in the following equations. Assuming that a 5 volt PGA design using pass
transistor interconnect has 3.5 volt interconnect swings, a 10x reduction in power is
achieved by using the dual voltage scheme despite the wealth of complications associated
with low voltage pass transistor design. As one last note, the small signal swings do come
at the cost of some speed, but the primary goal of the design was ultra low power.
Energy = C × Vdd × Vswing                                        (EQ 18)

@ Vdd = 5 volts:    Energy = C × 5 × 3.5 = 17.5C                 (EQ 19)

@ Vdd = 1.5 volts:  Energy = C × 1.5 × 1.1 = 1.65C               (EQ 20)

FIGURE 6-16: Typical path from interconnect through logic cell and back to interconnect
(VddL = 1.5 volts, Vdd = 2 volts, Vswing = 1.1 volts, skewed inverters at each buffering point)
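Equations 18-20 can be restated and checked directly:

```python
# Interconnect energy relation (EQ 18) evaluated for the two supply
# scenarios of EQ 19 and EQ 20.  C is left symbolic as one unit.
def interconnect_energy(c, vdd, vswing):
    return c * vdd * vswing          # EQ 18: E = C * Vdd * Vswing

C = 1.0
e_5v = interconnect_energy(C, 5.0, 3.5)    # EQ 19: 5V design, 3.5V swing
e_dual = interconnect_energy(C, 1.5, 1.1)  # EQ 20: dual-supply design
print(f"{e_5v:.2f}C vs {e_dual:.2f}C -> {e_5v / e_dual:.1f}x reduction")
# -> 17.50C vs 1.65C -> 10.6x reduction
```

The ratio of 17.5C to 1.65C is about 10.6, which is the ~10x reduction claimed above.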
During the final design stages, the cell design depicted in Figure 6-4 was altered to
further improve performance and energy. As mentioned in Section 5.2, level restoration is
costly because it involves ratioed paths. After some simulations, it was observed that
better performance could be achieved by removing the restoration devices. Since the
supply voltage on the pass transistors was 2 volts instead of 1.5 volts, there was sufficient
gate overdrive voltage to effectively flip a 1.5 volt inverter. The elimination of the
restoration devices resulted in lower delay because paths no longer involved ratioed
devices fighting each other. In addition, energy decreased slightly since the short circuit
current of the restorer transition was no longer present. The possibility of static current
dissipation as described in Section 5.3.2 is avoided even though the logic levels are not
full-swing because Equation 18 showing the relation between pmos threshold voltage and
pass transistor output voltage is still satisfied in all cases. The resulting circuit path is the
same as Figure 6-16, except the restoring device has been removed as shown in Figure 6-17.
However, before accepting the optimization using no restorers, a careful examination
of subthreshold leakage current was necessary. Since the pmos devices are no longer
turned off with Vgs=0V (a 1.1 volt high on a 1.5 volt inverter leaves Vgs=-0.4V),
subthreshold currents will still flow. Under worst case process
conditions, an estimate of the leakage current per logic cell was 0.25uA. For comparison,
a logic cell driving one output which toggles at 5MHz will dissipate over 10x more power.
Under nominal conditions, leakage becomes 50-100x less giving a 2000 logic cell array a
total leakage power of ~40uW. Diode leakage for such an array would also fall in the
range of 10s of uW so removing the restoration devices was deemed viable. On the other
hand, if leakage currents must be reduced further for a sleep mode, the supply voltage
could be disabled or the restoration devices could be re-inserted. Another possible
leakage control option is discussed in [16] where separate power-down of logic circuitry is
performed to keep memory values intact.

FIGURE 6-17: Fully optimized typical path from interconnect to logic cell and back to
interconnect (VddL = 1.5 volts, Vdd = 2 volts, Vswing = 1.1 volts)
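The leakage-versus-dynamic comparison can be sanity checked as follows. The per-cell worst-case leakage and the 5MHz toggle rate come from the discussion above; the output load capacitance is an assumed round number, not a value from the design:

```python
# Rough check that restorer-free operation keeps leakage small next to
# dynamic power.  I_LEAK and F_TOGGLE follow the text; C_OUT is assumed.
VDD_L = 1.5          # volts, low supply
VSWING = 1.1         # volts, interconnect swing
I_LEAK = 0.25e-6     # amps per logic cell, worst-case process (from text)
C_OUT = 0.5e-12      # farads, assumed load of one driven output
F_TOGGLE = 5e6       # Hz, toggle rate used in the text's comparison

p_leak = I_LEAK * VDD_L                          # static power per cell
p_dyn = C_OUT * VDD_L * VSWING * F_TOGGLE        # dynamic power of one output
print(f"leakage ~{p_leak * 1e6:.2f}uW  dynamic ~{p_dyn * 1e6:.2f}uW  "
      f"ratio {p_dyn / p_leak:.0f}x")
```

With the assumed half-picofarad load, a single 5MHz output dissipates about 11x the worst-case leakage, consistent with the "over 10x" comparison in the text.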
6.4 Config memory/ FFs for CLBs
Configuration memory is used heavily throughout an FPGA in order to program the
switches and lookup tables. Three general options exist for the design of the configuration
memory: 1) scan latch-based, 2) 5-transistor SRAM-based, and 3) 6-transistor SRAM-based
(see Figure 6-18).
The scan latch design allows minimal programming overhead since decoders and other
addressing hardware need not be included. A single stream of data is simply clocked
through the cells until all memory cells are initialized at which time, the hold mode is acti-
vated and the scan chain is disabled. Two aspects make the distributed memory cell
approach undesirable for this PGA implementation. The first is that the two clock lines
must be global to all scan cells. As a result, difficult skew constraints must be met espe-
cially considering that the data can snake irregularly throughout the chain of back to back
connected memory cells. The more important drawback to the serial programming
approach concerns performance and power dissipation. Serially clocking all configuration
data would take an excessive amount of time and so a parallel approach is preferred. In
addition, the power of serially shifting data through a chain can be quite high since all ele-
ments experience transitions. Instead, a traditional parallel access memory hierarchy will
offer better programming speed and block enables may be used to lower programming
power consumption albeit at the expense of some area.

FIGURE 6-18: Configuration Memory Options (scan latch, 5-T SRAM cell, dual bitline SRAM cell)
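The power drawback of serial shifting can be sketched by counting transitions. The array size and activity factor below are assumptions for illustration:

```python
# Shifting an N-bit configuration stream through an N-cell scan chain clocks
# every cell on every shift, so the chain toggles O(N^2) times in total,
# while a parallel addressed write touches each cell O(1) times.
N = 2048            # configuration bits (assumed array size)
ALPHA = 0.5         # fraction of shifts that toggle a cell (assumed)

serial_toggles = ALPHA * N * N      # every cell sees the whole stream
parallel_toggles = ALPHA * N        # each cell written once
print(f"serial ~{serial_toggles:.0f} toggles, parallel ~{parallel_toggles:.0f} "
      f"({serial_toggles / parallel_toggles:.0f}x)")
```

For any assumed activity factor the ratio is simply N, which is why both programming time and programming power favor the parallel approach as the array grows.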
The 5 transistor SRAM style memory cell serves as the alternative to a scan in pro-
gramming approach and is used in many SRAM-based PGAs. In this case, a single bit
line would be accessed from a wordline transistor to initialize the cell. Correct sizing of
the access transistor is determined by the pull-up case since only one bit line is used. The
ratio of the access transistor's nmos pull-up current to that of the nmos pull-down device
must be large enough for the written node to reach the switching threshold of the storage
inverter, which is approximately Vdd/2. Simulations showed that an nmos-only access
device would need to be greater than 10um wide to
reliably write a 1.8um pmos and 0.9um nmos memory cell at a supply voltage of 2 volts.
The necessity of such a large access device severely affects the cell area since program-
ming cells account for a significant fraction of total array area.
Consequently, the six transistor SRAM cell was chosen for the configuration memory
throughout the chip. By using the dual bitline approach, minimum sized access transistors
could be used while still maintaining good noise margins. During a read operation
necessary for testing, a Vdd/2 precharge should be used to avoid bitline capacitance upset
of the cell contents. The only major drawback of the dual bitline configuration memory is
the added demand on wiring resources. In previous 5-T designs, only one bitline needed
to be routed per column; whereas two are required in this scheme. Therefore, a low
voltage PGA design presents a much more difficult layout challenge. In effect, low
voltage configuration memory serves as another driver for using low Vt devices if
available, thus allowing a return to the preferred single bitline scheme.
6.5 Logic Cell Array Architecture
The basic cell architecture defines a PGA’s mapability and its internal capacitances.
Once the logic cell’s functionality has been defined, the surrounding interconnect network
can be constructed based on the mapping properties of the cell. The cell-interconnect
interfacing resources and the amount of flexibility to support were determined from
several mapping experiments and intuition built up from studying PGAs. The
problem of defining the connective resources to provide in a PGA array is inherently tied
to heuristics since the device is meant for a general purpose environment. Therefore, a
prototype architecture was initially defined to possess particular properties that were
beneficial from a mapping and capacitance perspective. Using the initial design as a
template, a series of real-world design examples and common logic operations were hand
mapped. Performing the mapping exercise allowed the strengths and weaknesses of the
architecture to be quickly revealed along with the relative utility of the proposed architec-
tural features. Accumulators, counters, shift registers, and comparators were some of the
logic building blocks that were test mapped. In addition, the correlator depicted in Figure
3-11 and the add-compare-select block of a Viterbi decoder were mapped to examine the
architecture’s viability given larger functional blocks.
The general motivation for the architecture focused on preserving the locality and
structure present in most designs. By combining the connectivity properties of designs
with an underlying PGA structure which facilitated those required mapping characteris-
tics, low capacitance mappings could be achieved across a broad range of designs. The
resulting structure aims at striking a balance between datapath style design requirements
and random logic mappings. In doing so, highly regular, dense mappings can be produced
yielding very efficient cell utilization. Furthermore, the trade-off between flexibility and
resource capacitance is managed by providing a highly useful, low capacitance, neighbor-
to-neighbor network combined with a datapath oriented local bus structure. Finally, a
hierarchy of stacked grids overlays the lower levels to allow long distance routing.
Careful construction of these interconnect resources creates a gradual progression from
low capacitance local wiring, to more flexible higher capacitance routing, and enables
designs to exhibit much lower average net capacitances.
6.6 Detailed Architectural Discussion
Figure 6-19 depicts the basic cell. In actuality, two logic cells are grouped together so
that they share inputs although they can operate independently from each other through
the direct path interconnect.
Each logic cell is simply represented by a black box since the circuit details inside are
not important. The only important information from an architectural standpoint is the
ability of the logic cells to implement any function of 3 variables with complete permuta-
bility of the inputs A,B, and C. Surrounding the cells are the input interface multiplexers
which may be implemented as a logarithmic tree or as linear growth switches as discussed
in Section 4.2.1. A simplified diagram showing a pair of logic cells and the breakdown of
inputs appears in Figure 6-20 below.
Each multiplexer provides a single input to the logic cell. The top and bottom muxes per
logic cell choose between inputs coming from four locations:
• the cell vertically adjacent
• the cell to the upper/lower left
• the outermost horizontal local bus (track 0 or 3)
• the innermost horizontal local bus (track 1 or 2)
FIGURE 6-19: Basic Logic Cell Pair Tile
The third input is derived from an 8 input center mux. The eight possible sources include:
• any one of the 4 horizontal local busses
• a vertical high fanout length 4 line (4x)
• a direct connection from the cell to the left
• either diagonal from the upper or lower left cell
The single output of each logic cell can be independently routed to 9 destinations. They
include:
• either vertically adjacent cell
• either cell to the upper and lower right
• any of the 4 horizontal local busses in the pair of cells to the right
• the cell directly to the right
Figure 6-21 shows the neighbor-to-neighbor interconnect which will be referred to as the
Level 0 interconnect.
The local horizontal busses which span each pair of logic cells are connected to a grid of
routing bounded by switch matrices. The granularity of the grid is 2 logic cells
horizontally by 1 vertically.

FIGURE 6-20: Simplified Diagram of Logic Cell Pair with Input Breakdown

Figure 6-22 shows the structure of the level 1 interconnect network
which is defined on the 2x1 grid.
The next few sections provide comments about the motivation for each type of intercon-
nect resource and why it was interfaced to the cell in the way it was.
6.6.1 Level 0 Direct Interconnect
The level 0 interconnect layer serves as the primary resource for localized communica-
tion. The segments are designed as point-to-point connections allowing them to have
lower capacitance than the other types of interconnect. Figure 6-23 shows the connectiv-
ity horizons for the inputs and outputs of a cell using only the level 0 interconnect.
Using the direct connects, a logic cell can efficiently support fanouts as high as 5 to
neighboring cells. Although designs commonly see fanouts of only 1 or 2, the architecture
was made symmetrical about the horizontal to facilitate the mapping task. From Figure 6-
23, one can see that most of the direct interconnect flows vertically and diagonally. The
reason for this is that the level 1 interconnect is more aptly suited for horizontal routing
along the dataflow direction.

FIGURE 6-21: Level 0 Direct Interconnect (pattern spans the entire array)

However, by including the vertical and diagonal connections, many paths avoid using the
more general purpose interconnect layer allowing the
general purpose layer of interconnect to be greatly simplified. In addition, many mapping
targets (adders, comparators, etc.) require data to be passed across the width of a datapath
from bitslice to bitslice; this task is extremely well suited to the level 0 interconnect.

FIGURE 6-22: Level 1 Vertical and Horizontal Interconnect (showing cell output connections
and the 4 versions of switch matrices)

FIGURE 6-23: Diagrams depicting the directly connected cells from the input and output
perspectives

In summary, the paths offered by the direct connection network provide a valuable low cost
routing resource which should be leveraged for as much of the local wiring demands as
possible.
A good example of the utility of the level 0 network is the ripple carry full adder
structure depicted in Figure 6-24. When using 3 input LUTs, two logic cells are required
per full adder. The carry fanout requirements are efficiently handled by the local connec-
tivity allowed by the vertical and diagonal direct connections. In addition, these carry
ripple paths will be faster than any other interconnect because of their low fanout and
dedicated nature.
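The two-LUT full adder bit of Figure 6-24 can be modeled directly, treating each 3-LUT as an 8-entry truth table:

```python
# One full-adder bit per pair of 3-LUTs, as in the Figure 6-24 mapping:
# one LUT computes the sum (a XOR b XOR c), the other the carry (majority).
def lut3(table):
    """Build a 3-input lookup table function from an 8-entry truth table."""
    return lambda a, b, c: table[(a << 2) | (b << 1) | c]

SUM_LUT = lut3([0, 1, 1, 0, 1, 0, 0, 1])     # a XOR b XOR c
CARRY_LUT = lut3([0, 0, 0, 1, 0, 1, 1, 1])   # majority(a, b, c)

def ripple_add(a_bits, b_bits, carry_in=0):
    """Ripple-carry addition, LSB first, one pair of 3-LUT cells per bit."""
    carry, sums = carry_in, []
    for a, b in zip(a_bits, b_bits):
        sums.append(SUM_LUT(a, b, carry))
        carry = CARRY_LUT(a, b, carry)     # carry ripples to the next cell
    return sums, carry

# 5-bit example in the spirit of the figure: 13 + 11 = 24, bits LSB first.
print(ripple_add([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
# -> ([0, 0, 0, 1, 1], 0)  i.e. 24
```

The carry variable passed between iterations plays exactly the role of the vertical/diagonal direct connections: a dedicated, low fanout path from one cell pair to the next.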
Clearly, the level 0 interconnect is useful for datapath designs, but since this PGA
design is meant to be general, the interconnect should also be useful in other types of
designs. Overall, all design choices were guided by the intent to only provide resources
that could be highly utilized across a broad range of likely designs while minimizing the
overhead inherently associated with adding any resource. The finite state machine
mapping in Figure 6-25 provides an example of how random logic can also take
advantage of the level 0 network. Other than the primary input paths, all but two
intermediate paths were able to be mapped using the level 0 resources.

FIGURE 6-24: Ripple Carry Adder Mapping making use of Direct Connects

Once again, designs that can make efficient use of the level 0 interconnect will be faster
and have minimal switched capacitance.
Once a high leverage resource like the level 0 interconnect has been added to the PGA
architecture, decisions about how to connect those paths to the logic cell must be
considered. The allocation of the paths to the input multiplexers must be done carefully to
avoid overly restricting the legal input combinations. For example, if all four level 0
inputs were fed to the same multiplexer then only one of the three cell inputs could be
derived from a neighbor path. Such a situation is undesirable from two fronts. First, the
amount of level 0 interconnect that can be used would be severely limited, thus preventing
the low capacitance advantage from being maximally exploited. Secondly, if only one
input can come from the level 0 network, the other two must come from the local busses.
The lack of flexibility that results degrades the mapability of the architecture leading to
non-optimal and irregular designs. An example of a decision that was made based on the
above argument is the ability to source each diagonal input from either a top/bottom mux
or the center mux. Allowing this added flexibility enables the case where inputs come
from both the lower (upper) left cell and the cell directly below (above) the destination cell
which would have been impossible without the duplication of the diagonal input. As a
FIGURE 6-25: FSM Mapping of ISCAS S27 (3 states)
final note, the logic cell output was designed such that it can drive any combination of the
level 0 interconnect wires. Once again, the flexibility here was deemed necessary from
mapping experiments which showed that restricting the output fanout to a subset of
combinations was detrimental. In order to minimize the overhead associated with the output
fanout, the cell output structure was optimized to be efficient from a power perspective.
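The allocation argument can be illustrated by exhaustively counting how many level 0 paths can feed a cell at once under two hypothetical allocations. The source sets below are simplified stand-ins for the real top/bottom/center mux contents, not the exact architecture:

```python
from itertools import product

# Count the most level 0 (neighbor) paths a cell can consume at once,
# given which sources each of its three input muxes can select.
L0 = {"vert_up", "vert_dn", "diag_up", "diag_dn"}   # level 0 neighbor paths
BUS = {"t0", "t1", "t2", "t3"}                      # local bus tracks

def max_l0_inputs(mux_sources):
    best = 0
    for choice in product(*mux_sources):            # one source per mux
        picked = [s for s in choice if s in L0]
        if len(set(picked)) == len(picked):         # no path used twice
            best = max(best, len(picked))
    return best

all_on_center = [BUS, BUS, L0]                      # all level 0 on one mux
spread = [{"vert_up", "diag_up", "t0", "t1"},       # duplicated diagonals
          {"vert_dn", "diag_dn", "t2", "t3"},
          L0 | BUS]
print(max_l0_inputs(all_on_center), max_l0_inputs(spread))  # -> 1 3
```

Concentrating the neighbor paths on one mux caps the cell at a single level 0 input, while spreading (and duplicating) them lets all three inputs come from neighbors, which is the flexibility the mapping experiments demanded.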
6.6.2 Level 1 Local Bus Interconnect
In spite of the level 0 interconnect’s high utility, a PGA interconnect structure would
be insufficient without further resources. Often, PGAs that rely heavily on neighbor-to-
neighbor interconnect suffer from the inability to support fanout in a regular fashion
suitable for datapaths. Therefore, the level 1 grid was conceived under the purpose of
facilitating dataflow and higher fanouts. Grouping pairs of cells together grew out of the
desire to share the local horizontal busses because signals often fanout to at least two cells.
Once again, the full adder provides an example which is frequently used in designs.
Figure 6-24 shows how the two operand inputs are fed to both the sum and carry 3-LUTs.
Sharing busses leads to fewer distinct routing paths and avoids paying the overhead
capacitance of a switch matrix for every cell pitch. In fact, the initial mapping floorplan
should take advantage of the ability to efficiently route a signal to pairs of cells. By doing
this, spatial locality is encouraged and will lead to lower net capacitances. Lastly, for
cases where signals must fanout to many cells, the horizontal busses can be combined to
allow several cells to tap off the same source.
In addition to providing several input paths to the logic cells, the horizontal busses are
part of the first level of granularity of the general routing network. Although many designs
can be almost completely routed using connections made from one pair of cells to the next (i.e.
pipelined designs), many require signals to follow more complicated paths to reach their
destinations. Thus, the horizontal busses are used to move signals laterally along the
datapath while the vertical segments allow one row of cells to interface with other rows. A
total of four local horizontal busses are provided for each row. This number was the result
of the mapping experiments which showed four busses to be the minimum number
necessary. Once again, the reader is reminded that additional resources were added
sparingly in order to keep the internal capacitance contributed by interfacing and switches as low
as possible.
In general, lateral movements are usually sufficient to route most signals as
feedthroughs, but vertical resources were needed to handle connectivity across the width
of the datapath that cannot be supported by the level 0 network. A common example of
such a case is a datapath shift. Another key reason for including the vertical tracks is to
provide another path from which inputs and outputs to a datapath can be routed. The
preferred direction for datapath routing is horizontal, but in some cases, the horizontal
tracks will be used up. As a result, signals can flow vertically and connect to the
horizontal busses where the signals are locally consumed or produced by the logic cells.
Only two vertical tracks were provided in order to minimize capacitance associated with
the switch matrices.
Switch matrices connect the horizontal and vertical segments into a 2x1 grid. The
connectivity allowed by a switch matrix is depicted in Figure 6-26. Not all possible
combinations are supported because excessive diffusion capacitance would quickly result.

FIGURE 6-26: Level 1 Switch Matrix Detail

In particular, only the number 1 and 2 horizontal busses are allowed to interface to the
vertical tracks. The number 0 and 3 busses connect to the 4x grid which is the next higher
level in the interconnect hierarchy and will be discussed later.
The level 1 interconnect structure was designed so that data flows very naturally
among columns of the array. A left to right directional bias is built in because the outputs
of the previous pair of cells feed the horizontal busses of the pair of cells to the right. This
decision was made because cell outputs often flow into downstream cell inputs. Despite
this bias, right to left movement is not impaired.
The interface between the logic cells and the level 1 network was determined from the
mapping experiments. The single length vertical tracks cannot make a direct connection
to the logic cells because the added overhead was considered unnecessary. Instead, only
the horizontal local busses can source cell inputs. Once again, redundancy was supported
in the interface because of mapability constraints. In this case, the ability for the top/
bottom muxes to select one of two lines and the center mux to select any of the lines
allows early global routing decisions to be made nearly independently of the detailed
route. Some limitations exist, but these aid in deciding which signals to allocate among
the tracks thus constraining the overall routing problem.
6.6.3 4x interconnect layer
On top of the level 0 and level 1 interconnect sits a further hierarchy of grids.
Additional interconnect is necessary to provide further wiring capacity and to facilitate
long-distance routing since it is impossible to put all sources and sinks close to each other.
The first layer is aligned along a 4x4 block of cells as depicted in Figure 6-27. In addition
to the 4x4 grid, successive levels of hierarchy could be added as needed (e.g., 16x16). In
order to minimize the capacitance impact of the hierarchy on the lower routing levels,
connections to lower levels are only made at the edges of the grid. Thus, the long distance
routes are reserved for signals which need to skip over some logic cells before reaching
their destinations. As a result, the global wires do not see fanout capacitance due to logic
cell inputs, only from interfacing to other interconnect layers.
The only exception to the edge-based connectivity of the upper layers of the hierarchy
is the vertical 4x lines. One deficiency of the architecture was its limited ability to distribute a
common signal across the width of the datapath. This requirement comes up fairly often
for structures like multiplexers and other elements requiring control signals to be fed to all
bitslices. Technically, a signal could be fed to the cells from the level 1 vertical tracks and
then through a switch matrix to a horizontal bus wire for each row. However, the delay
and capacitance of such a path would become excessive for high fanout signals. Instead,
the vertical wires that span 4 cells are populated so that all logic cells can receive an input.
This ability to tap inputs directly from a high fanout vertical resource proved to be a
worthwhile addition to the architecture.
Finally, the interface between the 4X grid and the level 1 interconnect centers around
the switch matrices. Figure 6-28 shows the possible switch connections that may be
enabled. In addition to the pass transistor switches, the switch matrix also contains programmable buffers.
FIGURE 6-27: 4x Horizontal and Vertical Interconnect Layers
Section 6.7: Mapping 120
Due to the build-up of distributed RC that signals can experience as
they travel on lengthy programmable interconnect, signals require repowering to achieve
reasonable delay performance. Thus, when making any transitions between the 4x layer
and the level 1 layer, and also among 4X lines, a buffer would be programmed in the
proper direction for signal regeneration. Without buffering, the local cell outputs would
need to be oversized to satisfy worst case routing concerns. However, because the archi-
tecture exploits locality, most nets will not see large capacitances. As a result, the
switched capacitance due to cell output drivers can be made smaller.
6.7 Mapping
As previously mentioned, mapping experiments were performed to validate many
architectural decisions and to judge the viability of the architecture as a whole. The
correlator in Figure 3-11 and the add-compare-select block of a Viterbi decoder were two
large design examples that were hand mapped to this architecture. In addition, a variety of
logic building blocks were also mapped.
FIGURE 6-28: Level 1 and Full 4X Switch Matrix Connectivity and Buffering Scheme (only 4X connections shown, rest same as 1X switch matrix; connections and buffering for a 1x to 4x connection and 4x to 4x horizontal shown in detail)
6.7.1 Correlator Mapping
The correlator mapping appears in Figure 6-29. Each column of cells has been labeled
to identify the schematic components more clearly. Excellent logic cell utilization has
been achieved because the datapath fits well into the underlying routing topology. There
are some differences between the schematic depicted in Figure 3-11 and the mapping that
is shown. The most significant is that the dump registers receive the registered output of
the carry-save adders instead of the combinational output a sample earlier as shown in the
schematic. This change was made because only the combinational output or the registered
output can be routed from a logic block. If the timing is such that the dump registers must
receive the combinational output, then an extra pair of logic cells needs to be used purely as
registers for every bit. Thus, there is a trade-off in resource usage between the two implementations, although either can be supported. The only other difference between the
mapping and the schematic concerns the clocking. In the correlator design, two clocks
and a clock enable for the positive and negative halves are required. The architectural
features to support such a scheme can be added, but at the time of this thesis they have not
been included and therefore are not reflected in the mapping diagram.
6.7.2 Add-Compare-Select Mapping
In addition to the correlator, the add-compare-select (ACS) block of a Viterbi decoder
served as another mapping design example. A schematic block diagram of the add-
compare-select appears in Figure 6-30. High signal fanout combined with reconvergent
fanout makes the ACS block a good mapping challenge for any PGA architecture. A
floorplan of the ACS mapping is depicted in Figure 6-31. The main logic functions were
arranged such that a bit-sliced mapping results. In order to satisfy logic cell input
constraints and horizontal wiring constraints, the positions of blocks were exchanged until
a mapping solution was found. Only a partial mapping is shown in Figure 6-32 because
the entire mapping would have made it difficult to see the details in the picture. Only the
final 2:1 multiplexer stage of the updated state metrics and the decision logic has been left
out. In the case of the 2:1 mux, the output of Mux 0,1 in Figure 6-31 would use the level
FIGURE 6-29: Correlator Mapping (positive and negative input registers, carry-save adders, dump registers, ripple adders, pos-neg subtractor; datapath 9 bits wide; inputs in0-in2, outputs p0-p2 and n0-n2)
of interconnect above the 4x layer (16x) to travel across the entire design and feed the final
multiplexer on the right edge. The decision logic conveniently fits below the datapath
where the comparator outputs and the MSBs from the additions are generated. The branch
and state input vectors enter the ACS block at the periphery. The upper levels of the inter-
connect network would be used to efficiently bus the branch and state metrics from where
they are generated to the ACS block inputs. As a reminder, the upper interconnect levels
will have relatively low capacitance because connections are only made at the endpoints.
Therefore, long-distance-feedthrough type movements will not result in a sizeable energy
FIGURE 6-30: Add-Compare-Select Block of a 32 State, Radix-4 Viterbi Decoder (4-way): branch and state metric inputs, XORs, adders, comparators, select logic, 4:1 multiplexer, updated state metric register/buffer, and decision output
penalty. Further planning and optimization of the ACS placement could be done when
considering the mapping for the entire Viterbi decoder, but for the purpose of the test
mapping that was not considered. Lastly, the ACS mapping reflects the high level of
utilization and dense packing that can be achieved on this architecture.
6.7.3 Mapping Guidelines
In order to aid in the future development of automated mapping tools for this PGA
architecture, the following briefly describes some of the basic strategies used to perform
the hand mappings. First, the design must be decomposed into a set of three input
functions which would represent the function to be assigned to each logic cell. After that,
the details of each function can be black-boxed and a netlist connecting the functions
should be constructed. The next step when mapping a design to this architecture is the
most important: the floorplanning process. In order to truly exploit the features in the
architecture, an intelligent placement must be determined which preserves any inherent
structure in the original design. Since the direct path interconnect is the least costly
routing resource, placement should be performed to take advantage of this fact. Signals
with a fanout of 1 or 2 can often be entirely mapped using the direct interconnect. In
datapath-oriented designs, a bitslice placement allows carry chains and other localized
signals to easily map to the direct connection resources. The tight coupling between pairs
of logic cells in the PGA should also be factored into the grouping of logic. For example,
FIGURE 6-31: Floorplan of ACS Mapping in Figure 6-32 (Add BM0,SM0 through Add BM3,SM3; Compare 0,1 / 1,2 / 0,2 / 0,3 / 1,3 / 2,3; Mux 0,1 and Mux 2,3; Final Mux; Decision Logic)
FIGURE 6-32: Add-Compare-Select Partial Mapping (branch metric inputs BM0_0-BM3_4, state metric inputs SM0_0-SM3_4, 2:1 mux controls)
nodes in the netlist that share signals should be assigned to pairs of logic cells such that the
local busses can be efficiently utilized.
Another crucial consideration when doing placement is long distance signal routing.
The number of horizontally flowing interconnect wires represents a constraint that should
factor into the placement of blocks of logic. During the mapping of the add-compare-
select circuit, an initial placement was performed and then the number of horizontal wires
required to pass across each pair of cells was computed. If the number required exceeded
the resource constraints, then columns of logic were interchanged until a legal mapping
was achieved. In most cases a dense mapping results, but if more horizontal wires are
necessary, the design can be expanded to provide more wires at the expense of underutilizing
some logic cells. Even though the add-compare-select block represents a difficult
mapping case because of the large number of high fanout signals, it was still mapped with
excellent density.
The following bullet points summarize the important factors to consider when
mapping on this architecture. In some sense, the placement and routing problem is very
similar to the one encountered when doing layout design for ICs.
• Decompose the design netlist into a netlist of 3-input functions.
• For designs with inherent structure (as opposed to random logic), consider functional blocks as entities to be mapped as a single unit. Each of those units then has a set of inputs and outputs which need to satisfy wiring resource capacity limits.
• Determine which nets can make use of the direct path connections. (Carry chains can be laid out at this step, and other localized signals). This step will put constraints on which signals should be assigned to which logic cell inputs.
• Do an initial placement and determine horizontal and vertical wiring capacity requirements. Manipulate the placement of blocks until a legal placement results. During this step, some signals will need to be routed on the 4x interconnect layer thus determining which signals will need to be on track 0 or 3. Signals that need to be routed through switch matrices onto vertical Level 1 segments will need to be on track 1 or 2. Some horizontal tracks may need to be swapped to get legal input connections. These cases should help to constrain the overall routing problem.
• At this point, the mapping should be nearly complete and final choices about logic cell input connections can be made.
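The wiring-capacity check used in the steps above can be sketched as follows. The data layout and helper names are hypothetical, and the four-track capacity is an assumption based on the level 1 horizontal bus count per pair of cells:

```python
# Rough sketch of the hand-mapping legality check: count how many nets must
# pass horizontally across each column boundary and compare the demand
# against the horizontal tracks available per row of cell pairs.
from collections import defaultdict

TRACKS_PER_ROW = 4  # level 1 horizontal tracks per pair of cells (assumption)

def horizontal_demand(nets):
    """nets: list of (row, source_col, sink_cols). Returns wires needed per (row, boundary)."""
    demand = defaultdict(int)
    for row, src, sinks in nets:
        lo, hi = min([src] + sinks), max([src] + sinks)
        for boundary in range(lo, hi):   # a net spanning cols lo..hi crosses boundaries lo..hi-1
            demand[(row, boundary)] += 1
    return demand

def is_legal(nets):
    """True when no boundary is oversubscribed."""
    return all(d <= TRACKS_PER_ROW for d in horizontal_demand(nets).values())
```

When a boundary is oversubscribed, columns of logic are interchanged (or the design expanded) and the check repeated until a legal placement results, mirroring the procedure used for the add-compare-select mapping.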
CHAPTER 7
Low Power PGA Results
7.1 Overview
In this chapter, the results of simulations from an extracted layout of a mini-array are
presented. The low power design achieved performance only two times slower than the
Xilinx XC4000 series, while saving more than two orders of magnitude in energy consumption. Against current Xilinx designs, the energy savings are about 15 times and the
performance is about 4x slower. A detailed discussion of energy, delay, and area follows.
7.2 Layout Discussion
The layout of a PGA requires a great deal of optimization since it is an array structure
whose basic tile will be repeated thousands of times. As a result, a custom design was
undertaken so that maximum area efficiency could be achieved. For this PGA design, it is
important to minimize cell area since wiring capacitances account for a large fraction of
the total net capacitances. Moreover, in order to produce a manageable design a small
library of optimized cells was created and re-used throughout the layout. By minimizing
the number of unique cells required, significant overhead involved with maintaining many
layout versions of the same cell in the Cadence environment can be avoided. Although
using a small library results in some loss of layout optimality, significant optimization was
performed on the cells such that the area inefficiency was negligible.
Section 7.2: Layout Discussion 128
Although the entire PGA chip layout was beyond the scope of this thesis, the basic tile
for the array was constructed such that a complete array could be fabricated in the future.
Significant effort was made during the floorplanning stages to ensure that the logic cells
could be tiled appropriately. To build the complete array, the upper levels of the interconnect hierarchy (4X to 1X, 4X to 4X switch matrices) need to be added and the peripheral
memory circuitry needs to be included (row decoders, sense amps). For this work, a 2x4
array of logic cells was constructed (Figure 7-1) which allowed the key performance
FIGURE 7-1: 2x4 Mini-Array of Logic Cells with Block Diagram Representation (4 pairs of cells; track output buffers per pair, 1X to 1X switch matrices, level 1 horizontal tracks)
Section 7.3: Capacitance Data 129
metrics for this thesis to be verified. The basic unit of the array is the pair of logic cells
depicted in Figure 6-19.
The basic structure of each pair of logic cells consists of a memory array with the logic
cell and interface circuitry interspersed throughout the array. Each block of cells can be
directly abutted to neighboring cells allowing seamless power, ground, wordlines, and
bitlines. The horizontal pitch was constrained to the height of a memory cell to achieve
dense packing. Preservation of the memory array structure created many layout difficulties due to the large number of signals traversing the entire cell. Therefore, efficient
planning of wiring layers was necessary to ensure that all signals could be routed. Ground
and power lines were routed horizontally along with poly wordlines. Memory cell bitlines
flow vertically in Metal 2. Metal 1 was used for ground and the 2 volt supply while Metal
3 serviced the cells requiring a 1.5 volt supply. The horizontally oriented array signals like
the local busses, clock and reset are also routed in Metal 3. Layout for a logic cell appears
in Figure 7-2.
The use of a dual supply presented some difficulties within the array-based layout
environment. Normally, separate n-wells are used for cells operating off different
supplies. In this case, a significant area penalty would have resulted because of the fairly
tight integration of low supply and high supply circuitry. (All the memory cells operate
off 2 volts and the logic circuitry at 1.5 volts.) To solve this problem, the low supply
circuitry was placed in the same well as the high supply circuitry. Although this results in
a slightly slower pmos device (pmos Vbs=0.5v instead of 0), significant area from well
spacing and connections from metal 3 to the well are saved. In fact, the slightly increased
threshold of the pmos devices helps to reduce the subthreshold leakage discussed in
Section 6.3.
7.3 Capacitance Data
The following table lists the energy and extracted capacitance data for the paths in the
PGA basic cell. When looking at the table, one should remember that the relationship
between the energy and capacitance numbers for a given component cannot be described
by a single scaling factor; instead, the energies depend on the supply voltage and swing for
that particular path. The number in parentheses that appears next to some values indicates
the reduction that was achieved over the corresponding Xilinx component in Table 3-2.
Since the direct path connections did not have a direct analog in the Xilinx XC4003A part
that was studied, the reduction factor is given relative to a single line Xilinx interconnect
wire. A more detailed breakdown of the capacitance and energy for this design appears in
Appendix C.
TABLE 7-1 Energy and Extracted Capacitance Data
Path | Energy (pJ) | Capacitance (fF)
A or B Input | .275 (145x) | 95 (20x)
C Input | .34 (147x) | 140 (17x)
LUT Tree (all paths toggling) | .265 (94x) | 120 (9x)
FIGURE 7-2: Logic Cell Layout (LUT, A/B/C input muxes, track input buffers, D flip-flop)
From the data, one can see that the cell capacitances and energy have been greatly
reduced over the estimated values for the Xilinx part. The average energy reduction for
the various components was over two orders of magnitude. The substantial improvement
in energy consumption comes from a combination of lower voltage and lower capaci-
tances. The XC4003A operated at 5 volts and the internal interconnect swing was
estimated at 3.5 volts. However, not all nodes experience the reduced swing so an average
swing of 4 volts was used for the determination of capacitance values from energy.
Similarly, not all nodes in the PGA discussed in this thesis swing at the same voltage.
Therefore, an average supply voltage and swing of 1.5 volts is used for the following cal-
culations1. Using these voltage numbers, the approximate energy reduction from supply
voltage and swing reduction is calculated in Equation 21 and Equation 22. Clock power
was also reduced by using 1.5 volt circuitry for clock distribution and the static D flip-flop
implementation.
1. Because of the dual supply voltage (2 volts, 1.5 volts) and the various combinations of signal swings (2 volts, 1.5 volts, 1.1 volts), an analytical expression of energy becomes complicated. Based on the proportion of logic supplied by each voltage and the proportion of capacitance experiencing each swing, a rough estimate of overall energy can be found by assuming Vdd = 1.5 volts and an average swing of 1.5 volts.
TABLE 7-1 (continued) Energy and Extracted Capacitance Data
Path | Energy (pJ) | Capacitance (fF)
LUT Intermediate Output Buffers and Fanout Node | .33 (124x) | 180 (9x)
Vertical Direct Path | .115 (417x) | 55 (43x)
Diagonal Direct Path | .135 (355x) | 60 (40x)
Direct Horizontal Path | .14 (343x) | 65 (37x)
Level 1 Horizontal Track 0 or 3 and Buffers | .42 (143x) | 200 (15x)
Level 1 Horizontal Track 1 or 2 and Buffers | .435 (138x) | 220 (13.6x)
Level 1 Vertical Track | 0.066 | 40
Clock Input / Logic Cell | .13 (115x) | 60 (12.5x)
Reset# Signal / Logic Cell | 0.063 | 28
Program# / Logic Cell | .13 | 32.5
Bit Line / Logic Cell | .13 | 33
Word Line / Logic Cell | .08 | 20
Xilinx: Vdd = 5 V, average swing = 4 V → Energy = C × 5 × 4 = 20C (EQ 21)
Low Power PGA: average Vdd = 1.5 V, average swing = 1.5 V → Energy = C × 1.5 × 1.5 = 2.25C (EQ 22)
Voltage-based energy savings ≈ 9x; capacitance-based energy savings ≈ 10x-15x
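As a sanity check on Equations 21 and 22, the voltage-related savings follow directly from the Energy = C × Vdd × Vswing model used throughout the chapter (the helper name here is illustrative only):

```python
# Back-of-envelope check of Equations 21 and 22: switched energy scales as
# C * Vdd * Vswing, so the voltage-related savings alone are
# (5 * 4) / (1.5 * 1.5) ~= 9x, which multiplies with the 10x-15x capacitance
# reduction to give the order-of-magnitude totals seen in Table 7-1.
def switched_energy(c_farads, vdd, vswing):
    return c_farads * vdd * vswing

xilinx_factor = switched_energy(1.0, 5.0, 4.0)      # 20 * C
lowpower_factor = switched_energy(1.0, 1.5, 1.5)    # 2.25 * C
voltage_savings = xilinx_factor / lowpower_factor   # ~8.9x
combined = [voltage_savings * cap for cap in (10, 15)]  # ~89x to ~133x overall
```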
In addition to the voltage-based improvements, the average capacitance of resources
was lowered by 10x-15x. Much of the decrease in capacitance can be attributed to
efficient sizing of drivers, architectural minimization of fanout, appropriate choice of
switch size (1.8um) to reduce switch capacitances, and a compact layout strategy. It
should also be noted that the Xilinx part was fabricated in a 0.6um double metal process,
whereas this design was done in a 0.6um 3 layer process. Therefore, a decrease in wiring
capacitance should also factor into the overall capacitance reduction, although the
discussion in Section 3.2 made clear that wiring capacitances were not the dominant
capacitance contributor in the Xilinx design. On the other hand, the Xilinx device does
offer more functionality in that the logic blocks can be used as RAMs, but the added
features of the Xilinx would not account for the significant energy difference shown here.
Lastly, the energy consumption of an 8-bit adder is compared to the one measured in the
power analysis of Section 3.3. The Xilinx design consumed about 3.5nJ of energy
excluding the I/O and clock components1. By comparison, an 8-bit adder on the low
power design is estimated to burn 50pJ of energy assuming a comparable amount of interconnect resource usage and an activity factor of 1 (the cell mapping appears in Figure 6-24). Thus, the low power design consumes 70 times less energy. If an average activity
factor of 0.3 is used, the difference becomes greater than two orders of magnitude.
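The adder comparison reduces to simple ratios. The sketch below assumes the 0.3 activity factor scales the low power estimate; the text does not spell this out, so treat it as one plausible interpretation:

```python
# Arithmetic behind the 8-bit adder comparison: 3.5nJ vs 50pJ is 70x, and at
# an assumed average activity factor of 0.3 on the low power estimate the
# gap exceeds two orders of magnitude.
xilinx_adder_j = 3.5e-9     # measured, excluding I/O and clock components
lowpower_adder_j = 50e-12   # estimated at an activity factor of 1

ratio_full_activity = xilinx_adder_j / lowpower_adder_j          # 70x
ratio_low_activity = xilinx_adder_j / (0.3 * lowpower_adder_j)   # ~233x
```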
Although a wealth of data and intuition was gathered from the study of the Xilinx
XC4003A, several improvements have been made to the Xilinx family since the 4003A.
From some recently reported data on the latest Xilinx parts [46], a rough comparison of
energy consumption can be made. The best device from an energy perspective listed in
1. All power analysis measurements were done on designs with a small clock distribution path for simplicity even though some circuits were purely combinational.
Section 7.4: Performance Data 133
the Xilinx Brief Report is the XC4000XL family which operates at 3.3 volts and was
fabricated in a 0.35um process. The power factor reported in the document is 28e-12,
which, when multiplied by Vdd = 3.3 volts, gives an energy of 92pJ/logic cell or 46pJ/logic
function1. By comparison, this low power design consumes 1.5pJ/logic function without
interconnect. Assuming that the average cell’s interconnect needs require an equal
amount of energy (reasonable given the numbers listed in Table 7-1), the total energy/logic
cell is about 3pJ. Thus, when compared to a recent Xilinx PGA in a process that is one
generation ahead, this low power design still offers a substantial improvement in energy
(15x).
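The XC4000XL comparison in numbers, assuming (as the footnote does) that a reported "logic cell" means a two-function-generator CLB:

```python
# The reported power factor times Vdd gives energy per logic cell; halving
# it gives energy per function generator, and doubling the 1.5pJ logic
# figure accounts for an assumed equal interconnect share.
power_factor = 28e-12        # from the Xilinx brief [46]
vdd = 3.3
energy_per_cell = power_factor * vdd          # ~92 pJ per logic cell
energy_per_function = energy_per_cell / 2     # ~46 pJ per function generator

lowpower_logic = 1.5e-12                      # pJ per logic function, no interconnect
lowpower_total = lowpower_logic * 2           # ~3 pJ with equal interconnect share
savings = energy_per_function / lowpower_total  # ~15x
```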
7.4 Performance Data
Although low energy was the primary design goal for this project, delay performance
cannot be neglected. The delays for this low power PGA design were characterized by
extracted simulations of the 8 cell mini-array. Results are presented in Table 7-2 below.
1. It was unclear whether a logic cell corresponds to a CLB which contains 2 function generators or just refers to a single function generator. In this comparison, it will be assumed that the logic cell corresponds to a CLB and so each function generator and associated interconnect consumes half the number reported in the brief.
TABLE 7-2 Performance Data
Path | TPLH Delay (ns) | TPHL Delay (ns)
Logic Function Delay | 5.2 | 5.4
Logic Block Output Delay | 3.5 | 1.8
Combinational Delay | 8.7 | 7.2
Setup Time | 1.5 | 1.5
Clock to Q | 4.0 | 4.5
Level 1 Track Input to Direct Output | 10.4 | 10.3
Level 1 Track Input to Track 0,3 Output | 10.3 | 10.5
Level 1 Track Input to Track 1,2 Output | 10.5 | 10.3
Direct Input to Direct Output | 7.9 | 7.3
Level 1 Track Input to Track 1 through switch matrix to Vertical Track through switch matrix to Track 2 | 11.6 | 9.8
Diagonal Input to Track 1 through switch matrix to Track 1 | 9.7 | 10.1
The combinational delay refers to the delay from a logic cell’s inputs to its output fanout
node (the node immediately before the programmable output drivers). The A and B inputs
experience a slightly faster delay than the C input. Even though the C input has a higher
gate load, the A and B inputs have higher wiring parasitics such that all input paths are
nearly equal. The added delay for using the D flip-flop with synchronous reset is the setup
time (1.5ns) + the clock to Q delay (4.5ns) for a total of 6ns. Thus, fully pipelined
operation could be sustained at speeds above 50MHz. The next entries in Table 7-2 show
the path delay for a number of typical logic cell configurations. Three involve an input
coming from one of the Level 1 horizontal tracks and then routed to a direct path output,
another Level 1 track (0 or 3), and Level 1 track (1 or 2) respectively. The next entry
shows the performance numbers for a direct cell-to-cell path which would be typical in the
ripple carry chain of an adder. The last two entries give the delays for a logic cell which
drives its output through multiple segments of the Level 1 interconnect network.
Simulations using the slow process file resulted in some paths slowing by as much as
2x. High process sensitivity is a direct consequence of the reduced swings that are
propagated throughout the interconnect. Therefore, delay variation represents a
significant problem for designs which produce gate overdrive voltages at or below 1 Vt.
Despite the slow-down, all paths remained functional.
Often, significant energy savings come at the cost of reduced performance. Because of
the reduced voltage levels, a slow-down is experienced in this design, but the drop in
performance is not severe. Compared to the Xilinx XC4000 series delays in Table 2-1
which showed a combinational delay of 4ns and interconnect delays slightly above 1 ns,
this low energy design gives up about 2x in speed. For this design, reduced capacitance
resulting in smaller buffering delays probably helped minimize the performance
degradation when operating at a low supply of 1.5 volts. Current Xilinx designs such as
the XC4000XL achieve combinational delays of 1.5-2ns in a 0.35um process.
Figure 7-3 represents the signal path of a typical signal traveling from the logic cell
input to a diagonal output. Most of the circuitry from other paths has been removed from
the figure for clarity. Table 7-3 shows the delay along this path.
TABLE 7-3 Delay Breakdown for Path in Figure 7-3
Path | tplh (ns) | tphl (ns) | cumulative tplh (ns) | cumulative tphl (ns)
Level 1 Track -> Track Buffer Output | 1.5 | 1.15 | 1.5 | 1.15
Track Buffer Output -> A Input MuxOut | 0.25 | 0.25 | 1.75 | 1.4
A Input MuxOut -> A | 1.66 | 1.8 | 3.41 | 3.2
A -> A# | 0.54 | 0.35 | 3.95 | 3.55
A# -> Fout | 1.3 | 1.87 | 5.25 | 5.42
Fout -> Lutout | 3.47 | 1.77 | 8.72 | 7.19
Lutout -> Direct Output | 1.77 | 3.2 | 10.49 | 10.39
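The cumulative columns of Table 7-3 are running sums of the per-stage delays, and the totals agree (to rounding) with the Level 1 Track Input to Direct Output row of Table 7-2:

```python
# Accumulating the per-stage delays of Table 7-3 reproduces its cumulative
# columns, ending at ~10.49/10.39 ns for the full track-input-to-direct-output
# path.
from itertools import accumulate

tplh = [1.5, 0.25, 1.66, 0.54, 1.3, 3.47, 1.77]   # ns, stage by stage
tphl = [1.15, 0.25, 1.8, 0.35, 1.87, 1.77, 3.2]   # ns, stage by stage

cum_tplh = list(accumulate(tplh))  # ends at 10.49 ns
cum_tphl = list(accumulate(tphl))  # ends at 10.39 ns
```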
FIGURE 7-3: Typical Signal Path from Level 1 Horizontal Track Input to Diagonal Output (input track buffer, A input mux, LUT tree, output buffers, flip-flop; internal nodes A, A#, fout, lutout)
Lastly, a simulation was performed for a multi-path interconnection of the 2x4 mini-
array as shown in Figure 7-5. Table 7-4 contains the delays for the paths in bold
measured from the primary inputs to the labeled points. A plot of some of the waveforms
from the simulation appears in Figure 7-4. From the HSPICE plot the three voltage
swings are evident as the logic cell inputs swing at 2 volts, the diagonal outputs swing at
1.5 volts and the general interconnect and internal nodes swing at about 1.1 volts. The two
1.5 volt signals correspond to the outputs of successive direct path vertical connections
rippling through the cells. The last two ~1.1 volt waveforms were taken along interconnect path A.
TABLE 7-4 Delays from Multi-Path Example
Path | Delay (ns)
A | 10.2
B | 18.5
C | 17.7
D | 21.6
FIGURE 7-4: HSPICE Waveforms for Multi-Path Simulation
Section 7.5: Area 137
7.5 Area
The area of the design was kept as small as possible to minimize the impact of wiring
parasitics. On many of the interconnect nets, wiring capacitances made up half the total
load capacitance so a dense layout was crucial. Table 7-5 shows the area breakdown for
major components of the design. For comparison, a Xilinx XC4000 CLB occupies an area
of 600um x 600um = 360,000um2 in a 2-layer metal, 0.6um process.
TABLE 7-5 Area Breakdown
Block | Area
2x4 Mini-Array | 435um x 230um = 100,000um2
Pair of Logic Cells | 205um x 115um = 23,600um2
Switch Matrix | 25um x 115um = 2,875um2
Configuration Memory Cells / Logic Cell Pair | 74 x 150um2 = 11,100um2
Output Driver Circuitry / Logic Cell Pair | 5,800um2
Static D Flip-Flop | 400um2
FIGURE 7-5: Multi-Path Example (paths A-D)
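A quick arithmetic check of the Table 7-5 figures; the per-cell number is derived here for illustration and is not quoted in the text:

```python
# The 2x4 mini-array area divided by its 8 logic cells gives roughly
# 12,500 um^2 per cell, versus the 360,000 um^2 quoted for a Xilinx XC4000
# CLB in a 2-layer metal, 0.6um process.
mini_array_um2 = 435 * 230         # 100,050, quoted as ~100,000 um^2
per_cell_um2 = mini_array_um2 / 8  # ~12,500 um^2 per logic cell
xc4000_clb_um2 = 600 * 600         # 360,000 um^2
```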
CHAPTER 8
Conclusions
8.1 Conclusions
This thesis has shown that low energy programmable logic is possible without giving
up the flexibility that makes programmable logic attractive. In doing so, a detailed power
analysis of a Xilinx PGA was undertaken to gain a solid understanding of the power con-
tributors in a PGA. Based on the intuitions formed from the power profiling, a low power
PGA architecture was specified. Several circuit issues were examined relating to PGA
circuit design and low voltage pass transistor operation during this research. Finally, the
main building blocks of a low power PGA were implemented and extracted simulations
were performed to verify performance.
From the power analysis work, several important conclusions arose. First, the internal
load capacitances of a Xilinx device are 100 times greater than the capacitance levels seen
in a custom integrated circuit design. The principal reason for such high capacitances was
shown to be the parasitic capacitance from large switches and other circuitry to support
high fanout and in turn high flexibility. Wiring capacitance was not a significant factor.
Secondly, the wealth of heavily capacitive interconnect resources present in PGA designs
causes interconnect circuitry to account for over 65% of PGA power. The next 20% of
power goes into clocks. In addition, profiling showed that local interconnect dominates
PGA design power as opposed to long distance interconnect. The reason for this is that
the local level must support the high fanout and complicated routing patterns among logic
cells, resulting in high capacitances; coupled with the fact that local resources
make up over 80% of a design's interconnect, this results in substantial energy consumption.
Overall, the many insights drawn from the power analysis helped to shed light on the
previously neglected area of PGA power consumption allowing the later design efforts to
be clearly focused.
During the design phase of this work, several studies were performed to understand
the constraints and evaluate the trade-offs involved in low power PGA design. Ironically,
the aim to preserve the powerful attribute of PGA flexibility imposed tighter constraints on
the design space, causing several conventional low power techniques to be infeasible. In
the end, reduction of capacitance and voltage levels proved to be the highest leverage low
power tools. Prudent switch sizing, architectural minimization of fanout, and a pass
transistor implementation combined to lower capacitance such that wiring parasitics
became a comparable factor to overall load capacitance. Furthermore, the examination of
low voltage pass transistor design showed that, unless differential restoration techniques
are used, device thresholds must be lowered if pass transistors are to retain their utility as
supply voltages continue to fall.
Lastly, a PGA architecture has been specified and a mini-array designed to
demonstrate the energy improvements that have been achieved. Architecturally, the array
consists of 3-LUT logic cells connected by a hierarchy of interconnect. At the lower
levels of interconnect, the combination of a direct path mesh with a local bus structure
allows locality to be exploited, resulting in energy savings. Aggressive circuit design
techniques were used to enable a single-ended pass transistor implementation at low
voltage. As a result, energy savings of one to two orders of magnitude were realized
over commercial designs at a relatively small compromise in performance. Overall, a
substantial decrease in energy-delay product was achieved through this design.
8.2 Future Work
As in any research project, several avenues for future work exist. Integrating a
complete PGA array will allow the design to be fully verified and its utility to be
exploited. Software mapping tools need to be developed to permit a more complete
evaluation of the architecture's mappability. In addition, characterization data for the
design can be used to assess power consumption during the mapping process. For the
research discussed in this thesis, intuition built up from experimental implementations
and a variety of specialized studies proved essential; however, better ways to efficiently
explore PGA architectures need to be developed. Tools that can specify architectural
features, including different interconnect topologies, in a template fashion would allow a
more quantitative approach in a field that is predominantly driven by heuristics.
9.0 References
[1] Ahrens, M., El Gamal, A., "An FPGA Family Optimized for High Densities and Reduced Routing Delay," Actel Corporation, IEEE Custom Integrated Circuits Conference, 1990.
[2] Atmel, "Configurable Logic Design and Application Handbook", 1995.
[3] Black, P., Meng, T., "A 140 Mb/s, 32-State, Radix-4 Viterbi Decoder," IEEE Journal of Solid State Circuits, Vol. 27, No. 12, December 1992, pp. 1877-1885.
[4] Bowhill, W., et al, “A 433 MHz 64b Quad Issue RISC Microprocessor,” IEEE Journal of Solid State Circuits, Vol 31, No. 11, November 1996, pp ##.
[5] Brebner, G. , “Configurable Array Logic Circuits for Computing Network Error Detection Codes,” Journal of VLSI Signal Processing, 6, 1993, pp. 101-117.
[6] Chandrakasan, A. , “Low Power Digital CMOS Design,” PhD. Thesis, U.C. Berkeley, August 1994.
[7] CLAy Family Introduction Datasheet, National Semiconductor, June 1994.
[8] DeHon, A. , “Reconfigurable Architectures for General Purpose Computing,” M.I.T. PhD Thesis, A.I. Technical Report 1586, October 1996.
[9] Dobbelaere, I., Horowitz, M., El Gamal, A., "Regenerative Feedback Repeaters for Programmable Interconnections", ISSCC Digest of Technical Papers, 1995, pp. 116-117.
[10] Farrahi, A., Sarrafzadeh, M., "FPGA Technology Mapping for Power Minimization," International Workshop on Field-Programmable Logic and Applications, FPL '94 Proceedings, Springer-Verlag, 1994, pp. 66-77.
[11] El Gamal, A., et al., "An Architecture for Electrically Configurable Gate Arrays," IEEE Journal of Solid-State Circuits, Vol. 24, No. 2, April 1989, pp. 394-398.
[12] George, V., "The Effect of Logic Block Granularity on Interconnect Power in a Reconfigurable Logic Array", CS 294 report, May 1997.
[13] Goto, G., et al “A 4.1ns Compact 54x54 Multiplier Utilizing Sign-Select Booth Encoders,” IEEE Journal of Solid State Circuits, Vol 32, No. 11, November 1997, pp 1676-1683.
[14] Hauck, S., Borriello, G., Ebeling, C., "Triptych: An FPGA Architecture with Integrated Logic and Routing", in Advanced Research in VLSI and Parallel Systems: Proceedings of the 1992 Brown/MIT Conference, pp. 26-43, March 1992.
[15] Infopad Project, U.C. Berkeley, http://infopad.EECS.Berkeley.EDU/infopad
[16] Izumikawa, M. , et al., “A 0.25um CMOS 0.9v 100-MHz DSP Core,” IEEE Journal of Solid-State Circuits, Vol. 32, No. 1, January 1997, p. 52-60.
[17] Jou, S., et al, “A Pipelined Multiply-Accumulator using a High-Speed Low-Power Static and Dynamic Full Adder Design,” IEEE Journal of Solid State Circuits, Vol 32, No. 11, November 1997, pp ##.
[18] Roy, K., Prasad, S., "FPGA Technology Mapping for Power Minimization," International Workshop on Field-Programmable Logic and Applications, FPL '94 Proceedings, Springer-Verlag, 1994, pp. 57-65.
[19] Kobayashi, T., Sakurai, T., "Self-Adjusting Threshold-Voltage Scheme (SATS) for Low-Voltage High-Speed Operation," IEEE 1994 Custom Integrated Circuits Conference, pp. 12.3.1-12.3.4.
[20] Kusse, E., Carloni, L., Chong, P., “A 1.5 Volt Fine Grain Pass Transistor FPGA”,EE241Project, http://infopad.EECS.Berkeley.EDU/~icdesign/ee241_97/PROJECTS/kusse.pdf, May 1997.
[21] Lee, W., et al, “A 1-Volt Programmable DSP for Wireless Communications,” Journal of Solid State Circuits, Vol 32, No. 11, November 1997, pp 1766-1777.
[22] Lucent Technologies, ORCA Field Programmable Gate Arrays Datasheet, August 1996.
[23] Lytle, C. , “Altera FLEX Programmable Logic - Largest Density PLD,” Altera Cor-poration, 1993.
[24] Montanaro, J., et al, "A 160 MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE Journal of Solid State Circuits, Vol 31, No. 11, November 1996, pp 1703-1712.
[25] Mutoh, S., Yamada, J., et al., "1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS", IEEE Journal of Solid State Circuits, Vol 30, No. 8, August 1995, pp 847-853.
[26] O'Donnell, I. , “Digital Circuit and Board Design for a Low Power, Wideband CDMA Receiver” Master Thesis, UC Berkeley Dec 1996.
[27] Pleiades Project, U.C. Berkeley, http://infopad.EECS.Berkeley.EDU/research/reconfigurable/
[28] Rabaey, J., Digital Integrated Circuits, Prentice Hall, New Jersey, 1996.
[29] Rose, J., Francis, R.J., Lewis, D., and Chow, P., "Architecture of Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency", IEEE Journal of Solid-State Circuits, Vol. 25, No. 5, October 1990, pp. 1217-1225.
[30] Rose, J. , Hill, D., “Architectural and Physical Design Challenges for One-Million Gate FPGAs and Beyond,” FPGA 97 Conference, February 1997.
[31] Sakurai, T., et al, "A 200 MHz 13 mm2 2-D DCT Macrocell Using Sense-Amplifying Pipeline Flip-Flop Scheme," IEEE Journal of Solid State Circuits, Vol 29, No. 12, December 1994, pp 1482-1489.
[32] Sakurai, T. , et al, “A Swing Restored Pass-Transistor Logic-Based Multiply and Accumulate Circuit for Multimedia Applications,” IEEE Journal of Solid State Circuits, Vol 31, No. 6, June 1996, pp 804-809.
[33] Sakurai, T., et al, "A 0.9 V, 150 MHz, 10-mW, 4mm2, 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage Scheme," IEEE Journal of Solid State Circuits, Vol 31, No. 11, November 1996, pp 1770-1777.
[34] Scalera, S. , Sanders, private communication.
[35] Singh, S., Rose, J., "The Effect of Logic Block Architecture on FPGA Performance," IEEE Journal of Solid-State Circuits, Vol. 27, No. 3, March 1992, pp. 281-287.
[36] Srinivas, R., et al., "A High Density Embedded Array Programmable Logic Architecture", Altera Corporation, IEEE 1996 Custom Integrated Circuits Conference.
[37] Sun, S. ,Tsui, P. , “Limitation of CMOS Supply Voltage Scaling by MOSFET Threshold Variation”, in IEEE Journal of Solid State Circuits, Vol 30, No. 8, August 1995, pp 947-949.
[38] Trimberger, S., "Field Programmable Gate Array Technology," Kluwer Academic Publishers, Boston, Mass., 1994.
[39] Trimberger, S., et al., "Architecture Issues and Solutions for a High-Capacity FPGA," Xilinx, Inc., FPGA '97, February 1997.
[40] Veendrick, H., et al, “An Efficient and Flexible Architecture for High-Density Gate Arrays,” IEEE Journal of Solid State Circuits, October 1990, pp 1153-1157.
[41] ViewLogic Corporation, Viewsim on-line documentation. http://www.view-logic.com/support/.
[42] Weste, N. ,Eshraghian, K. , Principles of CMOS VLSI Design, Addison-Wesley, Massachusetts 1993.
[43] Xilinx Corporation, “XC4000 Field Programmable Gate Arrays: Programmable Logic Databook”, 1996.
[44] Xilinx Corporation, “XC6200 Field Programmable Gate Arrays: Advance Product Specification”, June 1996.
[45] Xilinx Corporation, "Application Brief #2, Low Power Benefits of XC4000E/X Overview", May 1997.
[46] Xilinx Corporation, "Application Brief #14, A Simple Method of Estimating Power in XC4000 XL/EX/E FPGAs", June 1997.
[47] Xilinx Corporation, XACT Reference Guide, April 1994.
[48] Yano, K. , et al, “A 3.8nS CMOS 16x16-b Multiplier Using Complementary Pass-Transistor Logic,” IEEE Journal of Solid State Circuits, Vol 25, No. 2, April 1990, pp 388-394.
[49] Yano, K. , et al, “Top-Down Pass Transistor Logic Design,” IEEE Journal of Solid State Circuits, Vol 31, No. 6, June 1996, pp 792-803.
[50] Zimmerman, R., Fichtner, W., “Low Power Logic Styles: CMOS versus Pass Tran-sistor Logic”, IEEE Journal of Solid State Circuits, Vol. 32, July 1997.
APPENDICES
Appendix A: Xilinx XC4003A Resource Measurements
Appendix B: Power Profiling Data
Appendix C: Detailed Breakdown of Low Power PGA Energy and
Capacitance
Appendix D: PGA Configuration Encodings and Example Initialization File
Appendix E: PGA Memory Cell Programming Polarities and Memory Cell
Layout Locations.
APPENDIX A

POWER MEASUREMENTS — Xilinx XC4003A-PC84C (0.6um, 2 layers of metal)
Energy in mW/MHz, Estimated Capacitance in pF
INTERCONNECT
half vertical longline 0.041 2.05
full length vertical longline 0.088 4.4
half vertical longline + 0.044 2.2
full length vertical longline + 0.094 4.7
half horizontal longline 0.022 1.1
full length horizontal longline 0.054 2.7
half horizontal longline + 0.090 4.5
full length horizontal longline + 0.160 8
vertical right doubleline 0.098 4.9
vertical left doubleline 0.060 3
horizontal down doubleline 0.080 4
horizontal up doubleline 0.107 5.35
clb input interconnect horizontal C 0.044 2.2
clb input interconnect horizontal G 0.039 1.95
clb input interconnect horizontal F 0.047 2.35
clb input interconnect vertical C 0.047 2.35
clb input interconnect vertical G 0.040 2
clb input interconnect vertical F 0.051 2.55
single line X vertical between SM left 0.067 3.35
single line X vertical between SM right 0.056 2.8
single line X horizontal between SM up 0.085 4.25
single line X horizontal between SM down 0.054 2.7
single line Y vertical between SM left 0.096 4.8
single line Y vertical between SM right 0.048 2.4
single line Y horizontal between SM up 0.054 2.7
single line Y horizontal between SM down 0.088 4.4
global clk distribution line (not connected to CLBs) 1.287 64.35
1 global column wire 0.128 6.4
carry chain 0.050 2.5
The '+' sign signifies another longline, one with slightly different properties that give
rise to a different capacitance value than the longlines without a '+' symbol. The CLB
input entries represent the circuitry associated with interfacing to the inputs of a CLB
(usually a multiplexer structure); the letters refer to an F or G function input. The four C
inputs do not feed distinct function generators; they serve as alternative inputs to some of
the other CLB circuitry. The entries for "correl4b" refer to the structured correlator
mapping and "correl4rb" to the simulated annealing based mapping.
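The energy and capacitance columns in these tables appear to follow a simple charge relation; the sketch below is an inference from the tabulated numbers, not a method stated in the thesis:

```python
# The "Estimated Capacitance" column appears to be back-computed from the
# measured energy as C = E / (Vdd * Vswing): a full 5 V swing for CLB and I/O
# entries, and a reduced ~4 V (roughly Vdd - Vt) swing for the pass-transistor
# interconnect entries.  This mapping is an assumption inferred from the data.
def cap_pf(energy_mw_per_mhz, vdd=5.0, vswing=5.0):
    energy_j = energy_mw_per_mhz * 1e-9      # 1 mW/MHz = 1 nJ per cycle
    return energy_j / (vdd * vswing) * 1e12  # farads -> picofarads

print(round(cap_pf(0.041, vswing=4.0), 2))  # half vertical longline -> 2.05 pF
print(round(cap_pf(0.041), 2))              # F function output to X -> 1.64 pF
```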
POWER MEASUREMENTS — I/O
Energy in mW/MHz, Estimated Capacitance in pF
i1 input path 0.062 2.48
i2 input path 0.140 5.6
input clked path 0.151 6.04
output I/O path inc interconnect (slow) 1.157 46.28
output I/O path inc interconnect (fast) 1.269 50.76
output I/O path inverted inc interconnect (slow) 0.989 39.56
output I/O clked path inc interconnect+clk (slow) 1.326 53.04
output I/O clked path inc interconnect+clk (fast) 1.459 58.36
CLBs
F function output to X 0.041 1.64
F function clked output of FFX 0.060 2.4
F function clked output of FFX to XQ is negligible
F,G function output to X,Y 0.082 3.28
F,G function output to H function generator 0.006 0.24
extension of H output to X output negligible
internal muxing connections negligible (< 0.010)
connection of clk to 2 internal CLB FFs 0.030 1.5
F function internal 0.025 1.25
G function internal 0.020 1
APPENDIX B
Design Name, # of CLBs, # of IOBs, # nets, # SM passes, Total Power excl CLB (mW), Meas Power (mW), Meas Energy (nJ), Total Estimated Power (mW), Freq (MHz)

accumy16b 18 18 69 62 76 119.5 8.13 83.08 14.7
accumy8b 10 10 36 31 41.9 65 4.42 45.85 14.7
accumy4b 7 6 19 28 24.6 37.5 2.55 26.64 14.7
addery8b 7 18 34 53 45.7 62.2 4.23 48.25 14.7
addery4b 6 10 20 23 23.3 32.5 2.21 24.72 14.7
barrely8b 25 13 38 51 57.7 77 5.24 63.58 14.7
barrely4b 9 8 17 22 25 35.5 2.41 27 14.7
comparemcy8b 8 18 32 61 43.5 61.5 4.18 44.67 14.7
comparemy8b 16 18 37 73 53.6 71.5 4.86 55.7 14.7
comparemy4b 7 10 18 34 27 37.5 2.55 28.36 14.7
comparemy2b 3 6 9 10 13.7 16.5 1.12 14.51 14.7
comparey8b 3 18 21 23 29.7 34.5 2.35 30.23 14.7
comparey4b 2 10 12 9 17.1 20.25 1.38 17.68 14.7
comparey2b 2 6 8 6 11.4 14 0.95 12.02 14.7
countercy16b 9 2 26 4 12.8 18.5 1.26 13.87 14.7
countercy8b 5 2 14 2 12.2 16 1.09 13.27 14.7
countery16b 19 2 25 44 32.3 35 2.38 33.46 14.7
countery8b 10 2 12 14 19.6 22.5 1.53 20.81 14.7
countery4b 5 2 7 8 15.6 18.5 1.26 16.72 14.7
decodery4x16b 17 6 26 18 30.7 39 2.65 31.76 14.7
muxy8b 4 13 17 16 24.1 28 1.90 24.94 14.7
muxy4b 2 8 10 12 14.9 18.25 1.24 15.46 14.7
muxy2b 2 5 7 5 9.5 11.75 0.80 10.03 14.7
parityy9b 4 11 15 13 22 30.75 2.09 23.31 14.7
shifty16b 17 3 20 15 39.6 42.5 2.89 40.12 14.7
shifty8b 9 3 12 4 21.5 23.5 1.60 23.34 14.7
shifty4b 5 3 8 2 14.9 17 1.16 16.08 14.7

matchb 100 6 181 268 197 204 13.88 203.2 14.7456
txchip 100 13 178 433 133 157 10.68 136.4 14.7456
correl4rb 71 7 176 168 39.2 42.6 5.84 40.5 7.373
correl4b 76 7 171 223 35.8 40 5.48 37.1 7.373
l5b datapath 13 10 33 59 56.6 67 4.56 58.7 14.7456
l2b pipe datapath 17 10 35 66 62.9 73 4.97 64.7 14.7456
l3b random logic 29 6 61 97 45.6 49 3.33 46.2 14.7456
fir5b 52 10 140 199 212 283.6 19.29 231.24 14.7456
rs_mult 57 18 85 163 154 232 15.78 170.2 14.7456

All Designs
Cumulative Ave.
Std. Dev.

Large Designs Only
Cumulative Ave.
Std. Dev.
Design Name, I/O Power, CLB Power, Clock Net Power, Single Wire Power, Double Wire Power, Longline Power, Carry Chain Power, CLB Input Line Power, CLB Output Line Power (all in mW)

accumy16b 7.11 7.08 9.81 26 12.45 2.46 1.67 10.61 8.45
accumy8b 3.27 3.95 9.46 15.3 5.85 0.73 0.75 5.78 4.7
accumy4b 1.5 2.04 7.17 5.98 6.78 0 0.22 3.02 2.53
addery8b 7.7 2.55 5.84 10.57 16.48 1.85 0.8 3.54 2.76
addery4b 3.55 1.42 5.84 5.11 8.21 0.3 0.41 2.22 1.59
barrely8b 5.4 5.88 5.84 14.16 14.76 2.56 0 13.82 5.11
barrely4b 3.1 2 5.84 6.8 5.49 1.27 0 4.62 1.87
comparemcy8b 6.5 1.17 5.84 9.66 18.02 1.19 0.87 3.96 1.35
comparemy8b 7.4 2.1 5.84 10.81 19.79 1.91 0 9.42 2.22
comparemy4b 3.55 1.36 5.84 9.15 6.23 0.69 0 4.16 1.37
comparemy2b 2.06 0.81 5.84 3.61 3.23 0 0 2.01 0.9
comparey8b 7.11 0.53 5.84 9.95 6.52 0.28 0 3.22 0.67
comparey4b 3.3 0.58 5.84 5.76 3.55 0.08 0 1.87 0.71
comparey2b 1.8 0.62 5.84 4.41 1.17 0.08 0 1.29 0.75
countercy16b 0 1.07 9.86 1.95 0.61 0 0.25 1.23 1.48
countercy8b 0 1.07 9.46 3.33 0.45 0 0.25 1.22 1.47
countery16b 0 1.16 24.18 4.45 1.75 0.66 0 3.72 1.52
countery8b 0 1.21 11.22 4.78 1.63 0.77 0 3.69 1.51
countery4b 0 1.12 9.46 3.77 2.2 0 0 2.73 1.42
decodery4x16b 2.01 1.06 30.66 10.08 1.84 2 0 11.3 1.2
muxy8b 5.4 0.84 5.84 7.82 3.42 0.7 0 3.85 0.97
muxy4b 2.81 0.56 5.84 4.38 3.26 0 0 1.87 0.7
muxy2b 1.26 0.53 5.84 3.22 1.33 0 0 1.1 0.68
parityy9b 4.4 1.31 5.84 7.11 3.51 0.2 0 3.44 1.43
shifty16b 3.33 0.52 24.06 5.81 5.07 0.81 0 3.21 4.07
shifty8b 0.52 1.84 14.95 4.52 1.02 0.23 0 1.9 2.37
shifty4b 0.52 1.18 11.33 3.36 0.72 0.23 0 1.23 1.48

matchb 3.4 6.2 63.8 41.2 39 6.2 0 27.1 17.1
txchip 3.7 3.4 49 33.6 15.6 4.8 0 18.8 7.3
correl4rb 0.8 1.3 16.7 5 8.4 0.9 0 6.1 3.2
correl4b 0.8 1.3 12.5 5.5 8.4 1.5 0 6.1 3.1
l5b datapath 4.1 2.1 9.1 14 14.4 1.9 0.3 8.6 4.1
l2b pipe datapath 4.2 1.8 20.8 19.4 14 6.2 0.8 7.6 4.2
l3b random logic 2.1 0.6 29.3 4.4 8.1 0.4 0 3.75 1.5
fir5b 4.81 19.24 26.28 66.86 48.37 10.74 6.24 29.73 21.77
rs_mult 6.8 16.2 5.6 49.3 37.3 12.4 0 34 13.4

All Designs
Cumulative Ave. 3.18 2.71
Std. Dev. 2.38 4.04

Large Designs Only
Cumulative Ave. 3.41 5.79 25.90 26.58 21.51 5.00 0.82 15.75 8.41
Std. Dev. 1.94 7.00 19.26 22.40 15.58 4.34 2.05 11.80 7.24
Design Name, # Single Lines, # Double Lines, # Long Lines, # Carry Chains, # CLB Input Lines, # CLB Output Lines, # Clock Lines, # of F inputs, # of G inputs, # of C inputs

accumy16b 88 42 12 8 74 41 10 41 33 0
accumy8b 43 20 6 4 38 22 6 21 17 0
accumy4b 27 24 2 2 20 11 4 11 9 0
addery8b 27 54 8 5 20 11 1 11 9 0
addery4b 12 26 3 3 12 7 1 7 5 0
barrely8b 43 46 15 0 73 25 1 69 4 0
barrely4b 20 18 9 0 25 9 1 24 1 0
comparemcy8b 29 60 7 5 22 7 1 12 9 1
comparemy8b 40 70 10 0 60 17 1 46 13 1
comparemy4b 31 23 6 0 24 8 1 18 5 1
comparemy2b 8 10 1 0 9 3 1 8 1 0
comparey8b 33 22 3 0 18 3 1 8 9 1
comparey4b 15 12 2 0 9 2 1 4 5 0
comparey2b 9 4 2 0 5 2 1 4 1 0
countercy16b 15 7 1 7 17 17 10 8 9 0
countercy8b 9 3 1 3 9 9 6 4 5 0
countery16b 63 21 5 0 66 21 23 57 5 4
countery8b 23 10 6 0 29 10 10 27 1 1
countery4b 11 4 1 0 11 5 6 10 1 0
decodery4x16b 41 7 7 0 65 17 1 40 25 0
muxy8b 24 12 3 0 21 4 1 10 9 2
muxy4b 10 11 1 0 9 2 1 4 5 0
muxy2b 6 5 1 0 4 2 1 3 1 0
parityy9b 18 10 2 0 16 4 1 12 1 3
shifty16b 17 16 4 0 17 17 23 0 1 16
shifty8b 11 3 3 0 9 9 12 0 1 8
shifty4b 6 2 3 0 5 5 7 0 1 4

matchb 284 205 41 0 350 175 110 80 120 150
txchip 446 243 88 0 555 151 69 297 237 21
correl4rb 207 119 29 15 202 146 77 72 60 70
correl4b 222 167 38 15 209 153 76 71 62 76
l5b datapath 48 42 9 3 51 21 11 25 9 17
l2b pipe datapath 58 48 15 3 64 24 15 32 9 23
l3b random logic 62 98 16 0 119 53 32 51 55 13
fir5b 200 116 42 32 152 82 28 80 61 11
rs_mult 161 118 45 0 197 57 0 174 23 0

All Designs
Cumulative Ave.
Std. Dev.

Large Designs Only
Cumulative Ave. 187.56 128.44 35.89 7.56 211.00 95.78 46.44 98.00 70.67 42.33
Std. Dev. 127.74 66.80 23.63 11.06 156.96 60.61 37.59 86.11 71.16 48.21
Design Name, Ave. Net Cap (pF), Max Net Cap (pF), Min Net Cap (pF), Clock Cap (pF), Ave Fanout, Max Fanout, Ave Man Dist (CLB pitches), Ave Route Dist (CLB pitches), Ave. Net Activity, Ave. Switched Cap. (pF)

accumy16b 12 17.5 2.5 33.6 1.2 2 4 8.2 0.27 3.45
accumy8b 10.85 17.5 2.5 32.4 1.2 1 4.3 7 0.26 3.23
accumy4b 12.1 15 2.5 24.6 1.2 2 4.6 5.5 0.24 3.48
addery8b 15.15 7.2 2.5 20 1.1 2 7.7 9.7 0.27 4.22
addery4b 11.75 6.85 2.5 20 1.2 2 4.6 5.9 0.23 3.28
barrely8b 17.28 72.2 4.75 20 2.7 8 4.4 6.5 0.28 4.89
barrely4b 15.76 34 4.85 20 2.4 4 5.8 8.2 0.24 4.34
comparemcy8b 16.22 26.5 2.5 20 1 1 5.6 7.3 0.21 4.41
comparemy8b 19.5 36.6 4.85 20 2 4 5 6.9 0.19 4.76
comparemy4b 16.9 19.35 4.85 20 1.8 2 6.1 7.7 0.21 4.49
comparemy2b 11.84 4.85 4.85 20 2 2 3.5 5.1 0.21 3.81
comparey8b 15.78 7.05 4.85 20 1 1 3.6 5.4 0.22 4.25
comparey4b 12.84 4.85 4.85 20 1 1 3.8 5.2 0.22 3.82
comparey2b 9.34 5.4 5.4 20 1 1 3.5 4.8 0.21 3.12
countercy16b 6.62 12.35 2.5 33 1 1 0.3 1.4 0.06 0.41
countercy8b 5.72 10.35 2.5 32.4 1 1 0.3 1.3 0.11 0.78
countery16b 21.13 49.8 8.05 82.4 3.2 5 1.8 3.5 0.05 1.26
countery8b 18.64 33.7 10.1 38.4 3.1 5 1.6 4.5 0.1 2.84
countery4b 11.26 28.4 11.65 32.4 2.5 4 1 2.7 0.19 4.15
decodery4x16b 15.37 6.85 4.45 20 12.8 16 6 9.4 0.09 3.65
muxy8b 14.7 12.6 6.3 20 1.5 4 4.6 7 0.23 4.12
muxy4b 12.45 26.3 4.85 20 1.3 2 4 5.4 0.21 3.84
muxy2b 7.35 24.35 4.85 20 1 1 2.7 4.3 0.18 2.46
parityy9b 13.27 14.75 4.85 20 1.5 2 2.5 3.6 0.28 4.2
shifty16b 10.72 29.4 5.7 82.4 1 1 2.2 4.1 0.23 2.92
shifty8b 7.44 12 5.7 51.2 1 1 2.4 2.9 0.22 2.22
shifty4b 6.05 11.25 5.7 38.8 1 1 3.2 4 0.21 2.02

matchb 16.9 41 5 198 2 3 2.6 4.1 0.12 2.3
txchip 23.6 134 6 153 3.5 33 3.4 6.2 0.07 1.6
correl4rb 11.4 76 5 103 1.4 12 2.6 4.7 0.05 0.8
correl4b 12.7 55 5 76 1.4 12 2.8 4.7 0.05 0.8
l5b datapath 15.4 38 4 32.7 1.5 4 3.4 5.7 0.25 4.3
l2b pipe datapath 15.5 36 4 63.4 1.5 4 3.3 4.8 0.23 3.59
l3b random logic 15.4 31 4 115 1.9 4 3.3 5.5 0.03 0.83
fir5b 14.19 51.6 2.5 90 1.6 15 4 6 0.31 4.82
rs_mult 22.3 29 5.4 19.3 2.9 11 6.5 9 0.32 6.43

All Designs
Cumulative Ave. 13.76 26.30 1.96 3.64 5.51 0.19 3.13
Std. Dev. 4.35 1.99 1.66 2.03 0.08 1.43

Large Designs Only
Cumulative Ave. 16.38 54.62 4.54 94.49 1.97 10.89 3.54 5.63 0.16 2.83
Std. Dev. 4.08 33.15 1.03 56.38 0.74 9.44 1.20 1.44 0.12 2.05
Design Name, Error %, CLB Power %, IO Power %, Clock Power %, Interconnect Power % (power percentages relative to total power), Ave Inputs Used/CLB, Ave. # F Inputs, Ave. # G Inputs, Ave. # C Inputs

accumy16b 30.48% 8.52% 8.56% 8.71% 74.21% 4.11 2.28 1.83 0.00
accumy8b 29.46% 8.62% 7.13% 11.97% 72.28% 3.80 2.10 1.70 0.00
accumy4b 28.96% 7.66% 5.63% 17.30% 69.41% 2.86 1.57 1.29 0.00
addery8b 22.43% 5.28% 15.96% 3.88% 74.88% 2.86 1.57 1.29 0.00
addery4b 23.94% 5.74% 14.36% 7.56% 72.33% 2.00 1.17 0.83 0.00
barrely8b 17.43% 9.25% 8.49% 2.94% 79.32% 2.92 2.76 0.16 0.00
barrely4b 23.94% 7.41% 11.48% 6.93% 74.19% 2.78 2.67 0.11 0.00
comparemcy8b 27.37% 2.62% 14.55% 4.19% 78.64% 2.75 1.50 1.13 0.13
comparemy8b 22.10% 3.77% 13.29% 3.36% 79.59% 3.75 2.88 0.81 0.06
comparemy4b 24.37% 4.80% 12.52% 6.59% 76.09% 3.43 2.57 0.71 0.14
comparemy2b 12.06% 5.58% 14.20% 12.89% 67.33% 3.00 2.67 0.33 0.00
comparey8b 12.38% 1.75% 23.52% 6.19% 68.54% 6.00 2.67 3.00 0.33
comparey4b 12.69% 3.28% 18.67% 10.58% 67.48% 4.50 2.00 2.50 0.00
comparey2b 14.14% 5.16% 14.98% 15.56% 64.31% 2.50 2.00 0.50 0.00
countercy16b 25.03% 7.71% 0.00% 52.70% 39.58% 1.89 0.89 1.00 0.00
countercy8b 17.06% 8.06% 0.00% 41.37% 50.57% 1.80 0.80 1.00 0.00
countery16b 4.40% 3.47% 0.00% 60.64% 35.89% 3.47 3.00 0.26 0.21
countery8b 7.51% 5.81% 0.00% 34.79% 59.39% 2.90 2.70 0.10 0.10
countery4b 9.62% 6.70% 0.00% 32.83% 60.47% 2.20 2.00 0.20 0.00
decodery4x16b 18.56% 3.34% 6.33% 5.95% 84.38% 3.82 2.35 1.47 0.00
muxy8b 10.93% 3.37% 21.65% 7.50% 67.48% 5.25 2.50 2.25 0.50
muxy4b 15.29% 3.62% 18.18% 12.10% 66.11% 4.50 2.00 2.50 0.00
muxy2b 14.64% 5.28% 12.56% 18.64% 63.51% 2.00 1.50 0.50 0.00
parityy9b 24.20% 5.62% 18.88% 8.02% 67.48% 4.00 3.00 0.25 0.75
shifty16b 5.60% 1.30% 8.30% 50.07% 40.33% 1.00 0.00 0.06 0.94
shifty8b 0.68% 7.88% 2.23% 47.04% 42.84% 1.00 0.00 0.11 0.89
shifty4b 5.41% 7.34% 3.23% 45.77% 43.66% 1.00 0.00 0.20 0.80

matchb 0.39% 3.05% 1.67% 30.76% 64.52% 3.50 0.80 1.20 1.50
txchip 13.12% 2.49% 2.71% 32.62% 62.17% 5.55 2.97 2.37 0.21
correl4rb 4.93% 3.21% 1.98% 36.54% 58.27% 2.85 1.01 0.85 0.99
correl4b 7.25% 3.50% 2.16% 28.84% 65.50% 2.75 0.93 0.82 1.00
l5b datapath 12.39% 3.58% 6.98% 15.50% 73.94% 3.92 1.92 0.69 1.31
l2b pipe datapath 11.37% 2.78% 6.49% 10.20% 80.53% 3.76 1.88 0.53 1.35
l3b random logic 5.71% 1.30% 4.55% 55.19% 38.96% 4.10 1.76 1.90 0.45
fir5b 18.46% 8.32% 2.08% 10.25% 79.35% 2.92 1.54 1.17 0.21
rs_mult 26.64% 9.52% 4.00% 0.00% 86.49% 3.46 3.05 0.40 0.00

All Designs
Cumulative Ave. 15.58% 5.19% 8.54% 21.00% 65.28% 3.19 1.86 1.00 0.33
Std. Dev. 8.62% 2.40% 6.88% 17.72% 13.69% 1.19 0.88 0.80 0.46

Large Designs Only
Cumulative Ave. 11.14% 4.19% 3.62% 24.44% 67.75% 3.65 1.76 1.10 0.78
Std. Dev. 7.89% 2.78% 2.01% 16.96% 14.37% 0.86 0.82 0.65 0.57
% Contribution of Interconnect Resource to Interconnect Power | % Makeup of Overall Interconnect Resources
Design Name, Local Lines, Double Lines, Long Lines, CLB Input Lines, CLB Output Lines | Local Lines, Double Lines, Long Lines, Carry Lines

accumy16b 42.17% 20.19% 3.99% 17.21% 13.71% 58.67% 28.00% 8.00% 5.33%
accumy8b 46.17% 17.65% 2.20% 17.44% 14.18% 58.90% 27.40% 8.22% 5.48%
accumy4b 32.34% 36.67% 0.00% 16.33% 13.68% 49.09% 43.64% 3.64% 3.64%
addery8b 29.26% 45.61% 5.12% 9.80% 7.64% 28.72% 57.45% 8.51% 5.32%
addery4b 28.58% 45.92% 1.68% 12.42% 8.89% 27.27% 59.09% 6.82% 6.82%
barrely8b 28.08% 29.27% 5.08% 27.40% 10.13% 41.35% 44.23% 14.42% 0.00%
barrely4b 33.95% 27.41% 6.34% 23.07% 9.34% 42.55% 38.30% 19.15% 0.00%
comparemcy8b 27.50% 51.30% 3.39% 11.27% 3.84% 28.71% 59.41% 6.93% 4.95%
comparemy8b 24.39% 44.64% 4.31% 21.25% 5.01% 33.33% 58.33% 8.33% 0.00%
comparemy4b 42.40% 28.87% 3.20% 19.28% 6.35% 51.67% 38.33% 10.00% 0.00%
comparemy2b 36.95% 33.06% 0.00% 20.57% 9.21% 42.11% 52.63% 5.26% 0.00%
comparey8b 48.02% 31.47% 1.35% 15.54% 3.23% 56.90% 37.93% 5.17% 0.00%
comparey4b 48.28% 29.76% 0.67% 15.67% 5.95% 51.72% 41.38% 6.90% 0.00%
comparey2b 57.05% 15.14% 1.03% 16.69% 9.70% 60.00% 26.67% 13.33% 0.00%
countercy16b 35.52% 11.11% 0.00% 22.40% 26.96% 50.00% 23.33% 3.33% 23.33%
countercy8b 49.63% 6.71% 0.00% 18.18% 21.91% 56.25% 18.75% 6.25% 18.75%
countery16b 37.05% 14.57% 5.50% 30.97% 12.66% 70.79% 23.60% 5.62% 0.00%
countery8b 38.67% 13.19% 6.23% 29.85% 12.22% 58.97% 25.64% 15.38% 0.00%
countery4b 37.29% 21.76% 0.00% 27.00% 14.05% 68.75% 25.00% 6.25% 0.00%
decodery4x16b 37.61% 6.87% 7.46% 42.16% 4.48% 74.55% 12.73% 12.73% 0.00%
muxy8b 46.46% 20.32% 4.16% 22.88% 5.76% 61.54% 30.77% 7.69% 0.00%
muxy4b 42.86% 31.90% 0.00% 18.30% 6.85% 45.45% 50.00% 4.55% 0.00%
muxy2b 50.55% 20.88% 0.00% 17.27% 10.68% 50.00% 41.67% 8.33% 0.00%
parityy9b 45.20% 22.31% 1.27% 21.87% 9.09% 60.00% 33.33% 6.67% 0.00%
shifty16b 35.91% 31.33% 5.01% 19.84% 25.15% 45.95% 43.24% 10.81% 0.00%
shifty8b 45.20% 10.20% 2.30% 19.00% 23.70% 64.71% 17.65% 17.65% 0.00%
shifty4b 47.86% 10.26% 3.28% 17.52% 21.08% 54.55% 18.18% 27.27% 0.00%

matchb 31.43% 29.75% 4.73% 20.67% 13.04% 53.58% 38.68% 7.74% 0.00%
txchip 39.62% 18.40% 5.66% 22.17% 8.61% 57.40% 31.27% 11.33% 0.00%
correl4rb 21.19% 35.59% 3.81% 25.85% 13.56% 55.95% 32.16% 7.84% 4.05%
correl4b 22.63% 34.57% 6.17% 25.10% 12.76% 50.23% 37.78% 8.60% 3.39%
l5b datapath 32.26% 33.18% 4.38% 19.82% 9.45% 47.06% 41.18% 8.82% 2.94%
l2b pipe datapath 37.24% 26.87% 11.90% 14.59% 8.06% 46.77% 38.71% 12.10% 2.42%
l3b random logic 24.44% 45.00% 2.22% 20.83% 8.33% 35.23% 55.68% 9.09% 0.00%
fir5b 36.44% 26.36% 5.85% 16.20% 11.87% 51.28% 29.74% 10.77% 8.21%
rs_mult 33.49% 25.34% 8.42% 23.10% 9.10% 49.69% 36.42% 13.89% 0.00%

All Designs
Cumulative Ave. 37.60% 26.48% 3.52% 20.54% 11.40% 51.10% 36.62% 9.65% 2.63%
Std. Dev. 8.75% 11.67% 2.81% 6.06% 5.91% 11.35% 12.60% 4.80% 5.15%

Large Designs Only
Cumulative Ave. 30.97% 30.56% 5.91% 20.93% 10.53% 49.69% 37.96% 10.02% 2.33%
Std. Dev. 6.71% 7.61% 2.83% 3.74% 2.24% 6.53% 7.70% 2.12% 2.75%
Design Name, CLB Energy, Local Wire Energy, Double Wire Energy, Longline Energy, Carry Chain Energy, CLB Input Line Energy, CLB Output Line Energy, Clock Wire Energy

accumy16b 1.71 6.07 3.53 0.81 0.4 3.29 1.99 0.5
accumy8b 0.92 3.2 1.59 0.42 0.2 1.69 1.05 0.38
accumy4b 0.43 1.7 1.94 0.08 0.1 0.89 0.53 0.32
addery8b 0.45 2.03 4.47 0.56 0.25 0.89 0.45 0.13
addery4b 0.27 0.84 2.23 0.14 0.15 0.53 0.29 0.13
barrely8b 1.21 2.98 3.88 0.7 0 3.53 1.03 0.13
barrely4b 0.43 1.3 1.66 0.41 0 1.22 0.37 0.13
comparemcy8b 0.27 2.26 4.85 0.41 0.25 0.99 0.29 0.13
comparemy8b 0.75 2.77 5.96 0.63 0 2.81 0.72 0.13
comparemy4b 0.36 2.15 1.71 0.25 0 1.13 0.33 0.13
comparemy2b 0.13 0.52 0.88 0.05 0 0.43 0.12 0.13
comparey8b 0.13 2.33 1.77 0.13 0 0.79 0.12 0.13
comparey4b 0.08 1.12 0.96 0.08 0 0.39 0.08 0.13
comparey2b 0.08 0.74 0.32 0.08 0 0.24 0.08 0.13
countercy16b 0.67 1.04 0.6 0.05 0.35 0.73 1 0.5
countercy8b 0.35 0.62 0.3 0.05 0.15 0.39 0.52 0.38
countery16b 0.89 4.42 1.96 0.29 0 3.17 1.16 1.38
countery8b 0.48 1.61 1.02 0.32 0 1.4 0.56 0.5
countery4b 0.23 0.75 0.4 0.05 0 0.53 0.28 0.38
decodery4x16b 0.71 3.14 0.5 0.6 0 2.95 0.7 0.13
muxy8b 0.16 1.7 0.93 0.24 0 0.94 0.16 0.13
muxy4b 0.08 0.75 0.89 0.05 0 0.39 0.08 0.13
muxy2b 0.08 0.44 0.36 0.05 0 0.18 0.08 0.13
parityy9b 0.16 1.33 0.87 0.11 0 0.77 0.16 0.13
shifty16b 0.83 1.14 1.4 0.28 0 0.77 1 1.38
shifty8b 0.41 0.77 0.28 0.12 0 0.4 0.52 0.75
shifty4b 0.23 0.45 0.2 0.12 0 0.22 0.28 0.5

matchb 3.9 19.4 18.1 2.9 0 15.5 10.5 4.3
txchip 3.5 31.1 20.6 5.6 0 24.8 7.5 3.1
correl4rb 6.3 13.33 10.7 1.11 0.75 9.1 7.7 4.96
correl4b 6.63 14.24 14.5 2.11 0.75 9.4 8 4.62
l5b datapath 0.5 3.4 3.7 0.5 0.2 2.4 1 0.6
l2b pipe datapath 0.6 4.1 4 0.9 0.2 3 1.3 0.5
l3b random logic 1.2 4.3 8.6 0.8 0 5.2 3.1 1.7
fir5b 3.46 13.52 10.2 2.63 1.6 6.79 4.03 1.62
rs_mult 2.81 11.3 10.04 2.81 0 9.4 2.34 0
APPENDIX C
Detailed Breakdown of PGA Energy and Capacitances
The following list shows the energy breakdown of the major blocks for the mini-array. Numbers are specified for both the high and low supply, indicating the energy drawn from each one. Along with each energy number, the approximate capacitance charged through the given supply is listed. The relationship between energy and capacitance is complicated because of the variety of swings that are possible, so a simple scaling factor does not hold across all blocks.
Block Vdd (2 volts) Vddl (1.5 volts)
Level 1 Track Buffer:
0.87 pJ -> 38 fF
C Input Multiplexer and Drivers:
0.18 pJ -> 45 fF 0.16 pJ -> 95 fF
A/B Input Multiplexer and Drivers:
0.2 pJ -> 50 fF 0.075 pJ -> 45 fF
LUT one path toggling:
0.1 pJ -> 45 fF
LUT all paths toggling:
0.265 pJ -> 120 fF
LUT output buffer path (includes ff input, and output fanout node to programmable drivers):
0.33 pJ -> 180 fF
Vertical (up/down) path incl. output drivers: 0.115 pJ -> 55 fF
Diagonal (up/down) path incl. output drivers: 0.135 pJ -> 60 fF
Direct Horizontal path (left to right) incl. output drivers:
0.14 pJ -> 65 fF
Level 1 Horizontal Track (0 or 3) total: 0.405 pJ -> 200 fF
- Track Buffers (always toggling if track is toggling):
0.18 pJ -> 80 fF
- Programmable Output Inverter 0.126 pJ -> 60 fF
- Track Capacitance 0.1 pJ -> 60 fF
Level 1 Horizontal Track (1 or 2) total: 0.435 pJ -> 220 fF
- Track Buffers (always toggling if track is toggling):
0.18 pJ -> 80 fF
- Programmable Output Inverter 0.126 pJ -> 60 fF
- Track Capacitance 0.129 pJ -> 80 fF
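To estimate the energy of a configured path from this breakdown, the per-block numbers are summed for each supply. A minimal sketch using two of the entries above:

```python
# Illustrative sketch: per-transition energy of a configured path is the sum
# of the block energies drawn from each supply rail.  The two entries below
# are the Level 1 Track Buffer and the C Input Multiplexer/Drivers values
# listed in this appendix (pJ).
blocks_pj = {
    "level1_track_buffer": {"Vdd": 0.87, "Vddl": 0.0},
    "c_input_mux_drivers": {"Vdd": 0.18, "Vddl": 0.16},
}
per_rail = {rail: sum(b[rail] for b in blocks_pj.values()) for rail in ("Vdd", "Vddl")}
print(per_rail, sum(per_rail.values()))  # ~1.21 pJ total for this two-block path
```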
APPENDIX D
PGA Encodings
************************
FORMAT OF THE INITDATA FILE
******************************
Cell Pair Instance Name
Cell Instance Name
keyword = lutmemory
bit vector for lut mem [7:0]
keyword = cinput
encoded selected input [0..7]
keyword = binput
encoded selected input [0..3]
keyword = ainput
encoded selected input [0..3]
keyword = ffmux
bit [0,1]
keyword = outputs
bit vector for output enables in this order vu du dd vd r t0 t1 t2 t3
keyword = endcell
keyword = endcellpair
keyword = endlogiccells
Smatrix Instance Name
keyword = horiz [0,1,2,3]
keyword = left [tv0,tv1,bv0,bv1]
keyword = right [tv0,tv1,bv0,bv1]
keyword = vert [0,1]
keyword = endsmatrix
keyword = endswitchmatrices
*************************************
Encoding Information for the data file
****************************************
Lut Memory Encoding=
bit vector [7:0] inputs are [abc]
abc lutmem
000 bit0
001 bit1
010 bit2
011 bit3
100 bit4
101 bit5
110 bit6
111 bit7
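The indexing in the table above can be restated as a short sketch (an illustrative helper, not part of the thesis's tools):

```python
# The 3-LUT output is simply the stored configuration bit selected by the
# input index abc, with a as the most significant bit (per the table above).
def lut3_eval(lutmem, a, b, c):
    # lutmem[i] holds the configured output value for index i
    return lutmem[(a << 2) | (b << 1) | c]

xor3 = [0, 1, 1, 0, 1, 0, 0, 1]   # truth table for a 3-input XOR
print(lut3_eval(xor3, 1, 0, 0))   # abc = 100 selects bit4 -> 1
```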
Center Mux Encoding (c input)=
numerical digit [0..7] indicating which input is selected.
input digit
duin 0
vertfanin 1
track1in 2
track0in 3
track2in 4
track3in 5
rightin 6
ddin 7
Bottom Mux Encoding (a input)=
numerical digit [0..3] indicating which input is selected (converted to 2-bits of binary)
input digit
vdin 0
ddin 1
track2in 2
track3in 3
Top Mux Encoding (b input)=
numerical digit [0..3] indicating which input is selected (converted to 2-bits of binary)
input digit
vuin 0
duin 1
track1in 2
track0in 3
Output Encoding=
9 bit vector indicating enabled cell outputs in this order:
vu du dd vd r t0 t1 t2 t3
Switch matrix Encoding for 1x1
horiz [lh0rh0 lh1rh1 lh2rh2 lh3rh3]
left [lh1tv0 lh1tv1 lh2bv0 lh2bv1]
right [rh1tv0 rh1tv1 rh2bv0 rh2bv1]
vert [tv0bv0 tv1bv1]
***********************************
usage: initpga <file>
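A minimal sketch of how a tool could read the keyword format above into per-cell settings (a hypothetical helper, not the actual initpga implementation; comment lines are assumed to begin with '*'):

```python
# Parse one cell's configuration from the keyword-per-line format described
# above: each keyword is followed by its value on the next non-comment line.
KEYWORDS = {"lutmemory", "cinput", "binput", "ainput", "ffmux", "outputs"}

def parse_cell(lines):
    cfg = {}
    toks = iter(s for s in (l.strip() for l in lines) if s and not s.startswith("*"))
    for tok in toks:
        if tok == "endcell":
            break
        if tok in KEYWORDS:
            cfg[tok] = next(toks)   # value line follows the keyword line
    return cfg

cell_text = """* enabled outputs are vu du dd vd r t0 t1 t2 t3
lutmemory
11001100
cinput
6
outputs
000000100
endcell"""
print(parse_cell(cell_text.splitlines()))
```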
Example PGA Initialization File
* Main test file for final schems with loadcaps
* cellpair listing
cellpair00
cell0
* lut memory 7:0
lutmemory
* binput passes
11001100
* c input [0..7]
cinput
6
* b input [0..3]
binput
3
* a input [0..3]
ainput
3
* ff mux
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000100
endcell
cell1
lutmemory
* passes c input
10101010
cinput
0
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000010
endcell
endcellpair00
cellpair10
cell0
lutmemory
* passes a input
11110000
cinput
0
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
110000001
endcell
cell1
lutmemory
* passes c input
10101010
cinput
2
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
001100010
endcell
endcellpair10
cellpair01
cell0
* lut memory 7:0
lutmemory
* passes c input
10101010
* c input [0..7]
cinput
6
* b input [0..3]
binput
3
* a input [0..3]
ainput
3
* ff mux
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
001010100
endcell
cell1
lutmemory
* passes c input
10101010
cinput
6
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000010
endcell
endcellpair01
cellpair11
cell0
lutmemory
* passes a input
11110000
cinput
5
binput
3
ainput
0
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
110000001
endcell
cell1
lutmemory
* passes a input
11110000
cinput
5
binput
3
ainput
1
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000010
endcell
endcellpair11
endlogiccells
smatrix110
horiz
1011
left
0000
right
0100
vert
00
endsmatrix
smatrix111
horiz
1101
left
0001
right
0000
vert
00
endsmatrix
endsmatrices
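The example file follows a simple line-oriented layout: lines beginning with '*' are comments, block keywords (cell0, endcell, endlogiccells, and so on) stand alone, and each value-carrying keyword (lutmemory, cinput, binput, ...) takes its value from the following line. A speculative Python reader under those assumptions (none of these names come from the thesis tools, and the keyword set is inferred from the example):

```python
# Speculative reader for the initialization-file layout shown above.
# Assumptions: '*' starts a comment line; keywords in VALUE_KEYWORDS
# take their value from the next line; everything else (cell0,
# endcell, smatrix110, ...) is a standalone marker.

VALUE_KEYWORDS = {"lutmemory", "cinput", "binput", "ainput", "ffmux",
                  "outputs", "horiz", "left", "right", "vert"}

def read_init(lines):
    """Yield (keyword, value) pairs; value is None for standalone markers."""
    it = iter(ln.strip() for ln in lines
              if ln.strip() and not ln.lstrip().startswith("*"))
    for token in it:
        if token in VALUE_KEYWORDS:
            yield token, next(it)   # value is on the following line
        else:
            yield token, None

sample = ["* ff mux", "cell0", "cinput", "6", "outputs",
          "000000100", "endcell"]
print(list(read_init(sample)))
# -> [('cell0', None), ('cinput', '6'), ('outputs', '000000100'), ('endcell', None)]
```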
APPENDIX E
PGA Configuration Memory Programming Polarities

BITLINE PROGRAMMING POLARITIES FOR CONFIGURATION MEMORY
(THE GIVEN BITLINE INDICATES WHICH BITLINE IS HIGH FOR A MEMORY CELL VALUE OF 1.)

memcell      wl    bitline to bring high for memcell=1
=======================================================

----------------------------------
cell0 in a cellpair
----------------------------------
CENTER MUX0
mem0         wl5   bl1#
mem1         wl4   bl1#
mem2         wl4   bl0#
mem3         wl5   bl0#
mem4         wl3   bl0#
mem5         wl2   bl0#
mem6         wl3   bl1#
mem7         wl2   bl1#

TOP MUX0
mem0         wl7   bl0#
mem1         wl6   bl0#

BOT MUX0
mem0         wl0   bl0#
mem1         wl1   bl0#

LUT0
lmem0        wl3   bl2#
lmem1        wl2   bl2
lmem2        wl6   bl2#
lmem3        wl7   bl2
lmem4        wl2   bl3
lmem5        wl3   bl3#
lmem6        wl7   bl3
lmem7        wl6   bl3#

PROG OUT0
m0(ffmem)    wl4   bl4#
poutvu       wl7   bl4#
poutdu       wl6   bl4#
poutr        wl1   bl2
poutdd       wl1   bl4#
poutvd       wl0   bl4#

TOUT CELL0
poutt0       wl5   bl10
poutt1       wl4   bl10
poutt2       wl3   bl10
poutt3       wl1   bl10
----------------------------------
cell1 in a cellpair
----------------------------------
CENTER MUX1
mem0         wl5   bl6#
mem1         wl4   bl6#
mem2         wl4   bl5#
mem3         wl5   bl5#
mem4         wl3   bl5#
mem5         wl2   bl5#
mem6         wl3   bl6#
mem7         wl2   bl6#

TOP MUX1
mem0         wl7   bl5#
mem1         wl6   bl5#

BOT MUX1
mem0         wl0   bl5#
mem1         wl1   bl5#

LUT1
lmem0        wl3   bl7#
lmem1        wl2   bl7
lmem2        wl6   bl7#
lmem3        wl7   bl7
lmem4        wl2   bl8
lmem5        wl3   bl8#
lmem6        wl7   bl8
lmem7        wl6   bl8#

PROG OUT
m0(ffmem)    wl4   bl9#
poutvu       wl7   bl9#
poutdu       wl6   bl9#
poutr        wl1   bl7
poutdd       wl1   bl9#
poutvd       wl0   bl9#

TOUT CELL1
poutt0       wl7   bl10
poutt1       wl5   bl10
poutt2       wl2   bl10
poutt3       wl0   bl10

----------------------------------
switch matrix (bitline labels here are relative, not absolute; consult the schematic for the real ones)
----------------------------------
memlh1tv0    wl5   bl1
memlh1tv1    wl4   bl1
memrh1tv0    wl6   bl0
memrh1tv1    wl0   bl1

memlh2bv0    wl2   bl0
memlh2bv1    wl3   bl1
memrh2bv0    wl0   bl0
memrh2bv1    wl1   bl1

memtv0bv0    wl7   bl0
memtv1bv1    wl6   bl1

memlh0rh0    wl7   bl1
memlh1rh1    wl5   bl0
memlh2rh2    wl1   bl0
memlh3rh3    wl2   bl1
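When generating programming waveforms, the table above can be read programmatically. The sketch below is a hypothetical helper, not from the thesis; in particular it assumes that blN and blN# form a complementary bitline pair, so that storing a 0 drives the line opposite the one listed for a 1. That convention is an assumption, not something the table states.

```python
# Hypothetical helper (not from the thesis): pick the programming
# wordline/bitline for a configuration memory cell. POLARITY holds a
# few CENTER MUX0 entries from the table above. ASSUMPTION: blN and
# blN# are complementary, so storing 0 drives the opposite line.

POLARITY = {
    "mem0": ("wl5", "bl1#"),
    "mem1": ("wl4", "bl1#"),
    "mem2": ("wl4", "bl0#"),
    "mem3": ("wl5", "bl0#"),
}

def complement(bitline):
    """Toggle the '#' suffix: bl1 <-> bl1#."""
    return bitline[:-1] if bitline.endswith("#") else bitline + "#"

def write_vector(memcell, value):
    """Return (wordline to assert, bitline to drive high) for storing value."""
    wl, bl_for_one = POLARITY[memcell]
    return (wl, bl_for_one if value else complement(bl_for_one))

print(write_vector("mem0", 1))  # -> ('wl5', 'bl1#')
print(write_vector("mem0", 0))  # -> ('wl5', 'bl1')
```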
Relative Location of Configuration Memory Cells in Layout
Below is a series of diagrams that show the layout of the memory cells within each block that contains configuration cells. The orientation matches the picture of the mini-array in Figure 7-1. The memory cells are labeled according to the bit position that they represent in the encoding vectors of Appendix D, or, in the switch matrix case, the connection path that they enable.
[Layout diagrams: relative placement of the configuration memory cells.]
[Diagram 1: Bottom Mux (bits 0-1), Top Mux (bits 0-1), and Center Mux (bits 0-7) cell positions.]
[Diagram 2: Lut Memory cell positions (bits 0-7).]
[Diagram 3: 1x-to-1x Switch Matrix cell positions, labeled by enabled connection path: lh0rh0, lh1rh1, lh2rh2, lh3rh3, lh1tv0, lh1tv1, lh2bv0, lh2bv1, rh1tv0, rh1tv1, rh2bv0, rh2bv1, tv0bv0, tv1bv1.]