Analysis and Circuit Design
for Low Power Programmable
Logic Modules
by
Eric A. Kusse
Master of Science in
Electrical Engineering and Computer Science
University of California at Berkeley
Abstract
This thesis presents research in low power programmable logic. In particular, programmable gate array (PGA) structures are examined, emphasizing their strengths and weaknesses from a power perspective. A detailed analysis of power consumption for a Xilinx XC4003A was conducted to gain insight into which blocks of a PGA consume the most power and how those elements contribute to a mapped design's total power. In addition, several circuit design issues pertaining to PGAs were studied. Several problems surrounding low voltage pass transistor design were identified and solutions were evaluated. Based on the results from the power analysis and the circuit studies, a new PGA architecture has been defined. Finally, a 2x4 mini-array implementation of the design in a 3-layer metal, 0.6um CMOS process has been completed. Extracted simulations show order of magnitude energy improvements over the 4003A with only a factor of 2 speed penalty using dual supply voltages of 1.5 and 2 volts.
Table of Contents
Introduction ..... 1
1.1 What Are Programmable Gate Arrays (PGAs)? ..... 1
1.2 Why Use Them? ..... 3
1.3 What about performance? ..... 4
1.4 PGA Power? ..... 6
1.5 Research Contributions ..... 7
1.6 Thesis Organization ..... 8
PGA Architecture Overview ..... 10
2.1 Basic Components and Blocks ..... 10
2.1.1 Logic Blocks ..... 10
2.1.2 Programmable Interconnect ..... 12
2.1.3 I/O Ring ..... 14
2.2 Architectural Styles ..... 14
2.2.1 Island ..... 14
2.2.2 Cellular ..... 15
2.3 The Xilinx XC4000 Family ..... 15
2.3.1 Composition of a configurable logic block (CLB) ..... 15
2.3.2 Interconnect Resources ..... 16
2.4 The CLAy Architecture ..... 17
2.4.1 Logic Cells ..... 18
2.4.2 Interconnect Resources ..... 19
2.5 Triptych ..... 20
2.6 Conclusions ..... 21
2.6.1 Architectural Design Insights ..... 22
PGA Power Analysis ..... 23
3.1 Sources of Power Consumption ..... 23
3.1.1 Dynamic ..... 23
3.1.2 Short Circuit Current ..... 24
3.1.3 Leakage ..... 24
3.1.4 Static Current ..... 25
3.2 Xilinx Component Power Measurements (opening the black box) ..... 26
3.2.1 Method ..... 26
3.2.2 Components Measured ..... 27
3.2.3 Results ..... 28
3.3 Xilinx Profiler Description ..... 31
3.3.1 General Idea ..... 31
3.3.2 Flow ..... 32
3.3.3 Summary of Profiler Outputs ..... 38
3.4 Results & Recommendations ..... 38
3.4.1 Benchmark Data ..... 38
3.4.2 Conclusions ..... 40
3.4.3 Error Sources ..... 45
3.4.4 Periphery Effects ..... 45
3.4.5 Activity Estimation ..... 45
3.4.6 Interconnect Cap Values ..... 46
3.5 Triptych Analysis ..... 46
Application of Low Power Techniques to PGA Design ..... 52
4.1 Hierarchy of Low Power Techniques ..... 53
4.1.1 Algorithmic ..... 53
4.1.2 Architectural ..... 54
4.1.3 Logic ..... 55
4.1.4 Circuit ..... 57
4.1.5 Layout ..... 59
4.2 Reducing Dynamic Power in PGAs ..... 60
4.2.1 Capacitance Reduction ..... 60
4.2.2 Frequency ..... 68
4.2.3 Voltage Scaling ..... 69
4.2.4 Low Swing ..... 69
Low Voltage Pass Transistor Design ..... 73
5.1 Design Issues with Pass Transistors at Low Supply ..... 73
5.1.1 Signal Loss Effects on Performance ..... 74
5.1.2 Multiple Threshold Losses ..... 77
5.2 Level Restoration ..... 80
5.3 Vt Scaling and Leakage ..... 83
5.3.1 Subthreshold Leakage Currents ..... 83
5.3.2 Static Current Dissipation in Pass Transistor Fed Inverters ..... 85
Cell Design and Array Architecture ..... 87
6.1 Logic Cell Circuit Options ..... 87
6.1.1 Pass Transistor ..... 89
6.1.2 Cross-Coupled PMOS Restorer (CCP) ..... 90
6.1.3 Boosted Capacitor Restoration (BCR) ..... 92
6.1.4 Buffered Pass Transistor ..... 93
6.1.5 Transmission Gate ..... 94
6.1.6 Decoder-Based ..... 95
6.1.7 Static Complex Gate ..... 96
6.1.8 Current Mode Logic ..... 97
6.1.9 Low Threshold Devices ..... 98
6.1.10 Final Cell Results ..... 99
6.2 Logic Cell Output Circuitry ..... 101
6.3 Interconnect Circuitry ..... 102
6.4 Config Memory/FFs for CLBs ..... 107
6.5 Logic Cell Array Architecture ..... 108
6.6 Detailed Architectural Discussion ..... 109
6.6.1 Level 0 Direct Interconnect ..... 112
6.6.2 Level 1 Local Bus Interconnect ..... 116
6.6.3 4x Interconnect Layer ..... 118
6.7 Mapping ..... 120
6.7.1 Correlator Mapping ..... 121
6.7.2 Add-Compare-Select Mapping ..... 121
6.7.3 Mapping Guidelines ..... 124
Low Power PGA Results ..... 127
7.1 Overview ..... 127
7.2 Layout Discussion ..... 127
7.3 Capacitance Data ..... 129
7.4 Performance Data ..... 133
7.5 Area ..... 137
Conclusions ..... 139
8.1 Conclusions ..... 139
8.2 Future Work ..... 141
References ..... 142
Appendix A: Xilinx XC4003A Resource Measurements
Appendix B: Power Profiling Data
Appendix C: Detailed Breakdown of Low Power PGA Energy and Capacitance
Appendix D: PGA Configuration Encodings and Example Initialization File
Appendix E: PGA Memory Cell Programming Polarities and Memory Cell Layout Locations.
List of Figures
FIGURE 1-1: Design Implementation Spectrum ..... 1
FIGURE 1-2: PGA User Design Flow ..... 2
FIGURE 1-3: PGA Application Domains: Stand-Alone Commodity Parts and Embedded Section on a 'System on a Chip' ..... 3
FIGURE 2-1: Logic Block Designs ..... 11
FIGURE 2-2: Types of PGA Interconnect Resources ..... 12
FIGURE 2-3: Two Programming Methods for PGA Interconnect ..... 13
FIGURE 2-4: Various Levels of Switch Population and Examples of Segmentation ..... 13
FIGURE 2-5: Representative Architecture of Island and Cellular PGAs ..... 14
FIGURE 2-6: Xilinx XC4000 CLB ..... 16
FIGURE 2-7: Switch Matrix Detail ..... 17
FIGURE 2-8: Xilinx 4003A Interconnect Diagram ..... 18
FIGURE 2-9: CLAy Architecture Logic Block ..... 19
FIGURE 2-10: CLAy Architecture Interconnect Scheme ..... 20
FIGURE 2-11: Triptych Logic Cell and Interconnect Scheme ..... 21
FIGURE 3-1: MOS Parasitic Capacitances ..... 24
FIGURE 3-2: Heavily Loaded Interconnect due to Switch Diffusion Capacitance ..... 30
FIGURE 3-3: Xilinx Power Analysis Flow ..... 32
FIGURE 3-4: Excerpt from .lca file (contains routing and configuration data for a mapped design) ..... 33
FIGURE 3-5: Output of xpx.pm in Initial Non-Activity Weighted Analysis of Correl4b ..... 35
FIGURE 3-6: Output from xpx.pm on Post-Processing Pass of Correl4b ..... 37
FIGURE 3-7: Profiler Outputs ..... 38
FIGURE 3-8: Power Breakdown for Xilinx XC4003A ..... 39
FIGURE 3-9: Interconnect Power Breakdown by Resource for XC4003A ..... 39
FIGURE 3-10: Interconnect Wiring Resource Usage Breakdown for XC4003A ..... 40
FIGURE 3-11: Correlator Schematic ..... 43
FIGURE 3-12: Triptych Routing Scheme and Cell Routing Paths ..... 47
FIGURE 3-13: Triptych Logic Cell ..... 47
FIGURE 3-14: Triptych Center Output Driver Chain ..... 50
FIGURE 4-1: Hierarchy of Power Reduction Domains ..... 54
FIGURE 4-2: Wiring Capacitance Parasitics ..... 62
FIGURE 4-3: Pass Transistor Diffusion Capacitance vs. Fanout and Switch Size (W), L=0.6um ..... 63
FIGURE 4-4: Model for Programmable Interconnect ..... 63
FIGURE 4-5: Simulation Model for Switch Size Experiments ..... 65
FIGURE 4-6: Delay and Energy of Interconnect Chains vs. Switch Size ..... 65
FIGURE 4-7: Energy-Delay of Interconnect Chains vs. Switch Size ..... 66
FIGURE 4-8: Possible Logic Cell Input Structures ..... 67
FIGURE 4-9: Delay and Capacitance of Logic Cell Input Structures ..... 68
FIGURE 4-10: Low Swing Latched Sense Amp Receiver ..... 70
FIGURE 4-11: Pseudo-Differential Receiver for Low Swing Signals ..... 71
FIGURE 5-1: NMOS Pass Transistor Feeding an Inverter ..... 74
FIGURE 5-2: Pass Transistor Threshold's Impact on Delay ..... 76
FIGURE 5-3: Supply Voltage Impact on Delay ..... 77
FIGURE 5-4: Logic Cell Design with Multiple Drain to Gate Pass Transistor Connections ..... 78
FIGURE 5-5: Signal Loss Levels at Various Supply Voltages ..... 79
FIGURE 5-6: Level Restoration Schemes ..... 80
FIGURE 5-7: Level Restorer Delay and Energy Comparison with Single Threshold Loss ..... 81
FIGURE 5-8: Level Restorer Delay Comparison with Multiple Threshold Loss (low Vt pass transistors) ..... 82
FIGURE 5-9: Restorer Delay and Energy for Different Supply Voltages (low Vt pass transistors) ..... 83
FIGURE 5-10: Measured Sub-threshold Id-Vt Characteristics for Various Vgs [19] ..... 84
FIGURE 5-11: Sneak Leakage Path Between Pass Transistor Connected LUTs ..... 84
FIGURE 5-12: Static Current Dissipation for Low Vt Inverter Fed by Pass Transistor ..... 85
FIGURE 6-1: Pass Transistor Logic Cell ..... 90
FIGURE 6-2: Cross-Coupled PMOS Restored Pass Transistor Logic Cell ..... 91
FIGURE 6-3: Boosted Capacitor Restoration Input Circuitry ..... 92
FIGURE 6-4: Buffered Pass Transistor Logic Cell ..... 93
FIGURE 6-5: Impact of Transmission Gate Sizing ..... 94
FIGURE 6-6: Transmission Gate Logic Cell ..... 95
FIGURE 6-7: NAND Decoder Style LUT ..... 96
FIGURE 6-8: Static Branch-Based Logic Cell (only 2-LUT) ..... 97
FIGURE 6-9: Current Mode Logic Cell Implementation (only 2-LUT for simplicity) ..... 98
FIGURE 6-10: Delay of Potential Cell Designs ..... 100
FIGURE 6-11: Energy of Potential Cell Designs ..... 100
FIGURE 6-12: Energy-Delay Product of Potential Cell Designs ..... 101
FIGURE 6-13: Programmable Output Circuitry ..... 102
FIGURE 6-14: Delay and Energy for Interconnect Chains ..... 103
FIGURE 6-15: Energy-Delay Product of Interconnect Chains ..... 104
FIGURE 6-16: Typical Path from Interconnect Through Logic Cell and Back to Interconnect ..... 105
FIGURE 6-17: Fully Optimized Typical Path from Interconnect to Logic Cell and Back to Interconnect ..... 106
FIGURE 6-18: Configuration Memory Options ..... 107
FIGURE 6-19: Basic Logic Cell Pair Tile ..... 110
FIGURE 6-20: Simplified Diagram of Logic Cell Pair with Input Breakdown ..... 111
FIGURE 6-21: Level 0 Direct Interconnect (pattern spans the entire array) ..... 112
FIGURE 6-22: Level 1 Vertical and Horizontal Interconnect (showing cell output connections and the 4 versions of switch matrices) ..... 113
FIGURE 6-23: Diagrams Depicting the Directly Connected Cells from Input and Output Perspective ..... 113
FIGURE 6-24: Ripple Carry Adder Mapping Making Use of Direct Connects ..... 114
FIGURE 6-25: FSM Mapping of ISCAS S27 (3 states) ..... 115
FIGURE 6-26: Level 1 Switch Matrix Detail ..... 117
FIGURE 6-27: 4x Horizontal and Vertical Interconnect Layers ..... 119
FIGURE 6-28: Level 1 and Full 4x Switch Matrix Connectivity and Buffering Scheme (only 4x connections shown, rest same as 1x switch matrix) ..... 120
FIGURE 6-29: Correlator Mapping ..... 122
FIGURE 6-30: Add-Compare-Select Block of a 32 State, Radix-4 Viterbi Decoder (4-way) ..... 123
FIGURE 6-31: Floorplan of ACS Mapping in Figure 6-32 ..... 124
FIGURE 6-32: Add-Compare-Select Partial Mapping ..... 125
FIGURE 7-1: 2x4 Mini-Array of Logic Cells with Block Diagram Representation (4 pairs of cells) ..... 128
FIGURE 7-2: Logic Cell Layout ..... 130
FIGURE 7-3: Typical Signal Path from Level 1 Horizontal Track Input to Diagonal Output ..... 135
FIGURE 7-4: HSPICE Waveforms for Multi-Path Simulation ..... 136
FIGURE 7-5: Multi-Path Example ..... 137
List of Tables
TABLE 1-1 Energy Metrics for Various Designs ..... 7
TABLE 1-2 Process Parameters ..... 8
TABLE 2-1 XC4003A Delays ..... 17
TABLE 3-1 Xilinx XC4003A Architecture Components ..... 27
TABLE 3-2 Measured Component Energies and Capacitances ..... 28
TABLE 3-3 Estimated Correlator Capacitance Breakdown ..... 43
TABLE 3-4 Net Capacitance Statistics ..... 44
TABLE 3-5 Triptych Cell Area ..... 48
TABLE 3-6 Triptych Delay Performance ..... 48
TABLE 3-7 Triptych Cell Capacitances ..... 49
TABLE 5-1 HP Process Parameters ..... 73
TABLE 5-2 Comparison of Nominal Vt and Low Vt NMOS Devices (W=1.8u) ..... 75
TABLE 5-3 Vt Variation with Well Bias ..... 75
TABLE 7-1 Energy and Extracted Capacitance Data ..... 130
TABLE 7-2 Performance Data ..... 133
TABLE 7-3 Delay Breakdown for Path in Figure 7-3 ..... 135
TABLE 7-4 Delays from Multi-Path Example ..... 136
TABLE 7-5 Area Breakdown ..... 137
CHAPTER 1
Introduction
1.1 What Are Programmable Gate Arrays (PGAs)?
A number of choices exist when implementing a digital integrated circuit design. The
options available traverse a range which can be thought of as an increasing investment in
design time and customization. Figure 1-1 shows programmable gate arrays (PGAs) at
one end of the implementation spectrum and a full custom design at the other.
Full custom design involves time-intensive optimization of circuits and layout resulting in
the maximal performance. By giving up a small amount of performance, a standard cell
library methodology (typically used in ASIC design) can be applied which allows the use
of several automated tools to more quickly bring a design to completion. On the far end
of the spectrum lies the programmable gate array approach. This method has been
growing in popularity as time to market pressures force reductions in design time. As the
[Figure: a spectrum running from Full Custom through Standard Cell and Gate Array to PGA, with design time and cost increasing toward the full-custom end and implementation flexibility increasing toward the PGA end.]
FIGURE 1-1: Design Implementation Spectrum
cost of custom IC manufacturing necessitates very high volume production, amortization of those costs by fabricating a general purpose IC has become more attractive.
The programmable flexibility of a PGA is what gives rise to its benefits over custom
and semi-custom designs. Once a design has been specified as a set of schematics or as an HDL description, it is mapped to the hardware defined by the PGA architecture. A two-step mapping process is typically used to transform the logic to fit the physical constraints
of the PGA. The first step is termed technology independent mapping where the logic
equations are optimized and redundant terms are removed. Next, the technology
dependent mapping is performed where logic is allocated to programmable logic blocks
whose functionality is determined by the PGA architecture. After mapping, a router
defines the connectivity of the design using the interconnection resources available in the
PGA (Figure 1-2). At this point, the design is complete and can be verified by lab tests or
back-annotated simulation.
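The technology-dependent step of this flow can be sketched concretely for the common case where the logic block is a look-up table (LUT): each k-input logic function is packed into 2^k configuration bits. The function name and the choice of k=4 below are illustrative assumptions, not details of any particular architecture discussed here.

```python
# Sketch of technology-dependent mapping onto a k-input LUT: the LUT's
# configuration bits are just the function's truth table, one bit per
# input pattern. k=4 and the helper name are illustrative assumptions.

def lut_bits(func, k=4):
    """Return the 2**k configuration bits for a k-input LUT.

    `func` maps a tuple of k input bits to one output bit; the LUT bit
    at address i is the function's value for that input pattern.
    """
    bits = []
    for i in range(2 ** k):
        inputs = tuple((i >> b) & 1 for b in range(k))  # LSB-first decode
        bits.append(func(inputs))
    return bits

# Example: a 4-input AND maps to a LUT with a single '1' entry,
# at the all-ones address.
and4 = lut_bits(lambda x: int(all(x)))
assert and4.count(1) == 1 and and4[-1] == 1
```

Any 4-input function maps this way, which is exactly why gate-count equivalents for LUT-based blocks are somewhat subjective: one LUT can hold anything from a single inversion to a dense 4-input function.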
[Figure: flow from Schematic/HDL Description through Logic Mapping (technology independent optimization) and Placement & Routing to the PGA Programming Bitstream.]
FIGURE 1-2: PGA User Design Flow
1.2 Why Use Them?
As mentioned above, PGAs offer a unique alternative in the IC design implementation
space; however, there are many other reasons to choose PGAs. Since the programming of
a PGA is determined by a configuration memory, the design can be quickly and easily
changed after an initial design version. This allows fixing bugs with minimal effort
whereas a custom design requires a new mask set resulting in a significant added design
cost. In a relatively low volume design, the ability to fix bugs cheaply is crucial.
In addition, reprogrammability offers further design advantages. Reconfigurable
computing architectures have generated much interest recently as a viable trade-off for
absolute performance in return for more flexible logic utilization. Rather than dedicate
large amounts of custom logic to performing one function, a programmable body of logic
can offer acceptable performance over a wider range of functions. In fact, the presence of
a programmable body of logic embedded within a larger custom chip design can allow a
greater measure of programmability and flexibility to the whole design [27]. One can
envision several different programming templates which can be used to reconfigure the
PGA for any one of a number of desired functions depending on the current application
being executed. Depending on the task, a PGA may provide the best trade-off in performance, power, and flexibility among the resources available. Specifically, a PGA may
provide an advantage over purely software and ASIC approaches for certain functions.
[Figure: a stand-alone PGA array alongside a 'system on a chip' containing an embedded PGA, memory, control, and specialized datapath blocks.]
FIGURE 1-3: PGA Application Domains: Stand-Alone Commodity Parts and Embedded Section on a 'System on a Chip'.
Thus, PGAs can be seen as stand-alone commodity parts as well as modules in a larger
chip design. Along these lines, some environments suitable to PGAs are:
• multi-modal test structure for a system on a chip.
• prototyping and testing of ASIC designs.
• hardware accelerators (decoders, encoders, error correction, bit- and byte-wise operations).
• flexible logic resource (address generation, encryption).
Lastly, not unlike traditional CMOS designs, FPGAs have benefited from shrinking
feature sizes resulting in higher capacity arrays. This increase in gate counts has further
expanded the range of possible PGA implementations opening the doors to an even greater
body of applications.
1.3 What about performance?
Whenever performance is to be examined, a number of metrics must be used. Within
the PGA industry, the most important have been:
• Density or Equivalent Gates
• Logic Utilization
• Routability
• Speed
Density is often given in terms of the ability to realize an equivalent number of gates.
Today, FPGA parts typically range from a thousand to one hundred thousand gates. This
represents the ideal number of gates that can be used towards implementing a design. Two
words of caution are in order here. First, the number of gates that can be derived from a
logic block is somewhat subjective due to the programmable nature of the block. In
general, a reasonable approximation of the gate capacity can be derived from the number
of inputs and outputs the logic block offers. Secondly, the achievable logic utilization is
likely to be far lower (50% to 10%) than the equivalent number of gates that can be implemented. This inefficiency arises from the difficulties encountered when routing a densely packed design and from cases where logic blocks are used to implement very simple functionality (i.e., a single inversion).
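As a rough sanity check on these utilization figures, the logic actually available from a part is its marketed equivalent-gate count scaled by the achievable utilization. The 10,000-gate part below is a made-up round number for illustration, not a specific device.

```python
# Usable logic = marketed equivalent gates x achievable utilization.
# The part size is an illustrative assumption; the 10%-50% range
# restates the utilization figures quoted in the text.

def usable_gates(equivalent_gates, utilization):
    return int(equivalent_gates * utilization)

low = usable_gates(10_000, 0.10)   # pessimistic end of the quoted range
high = usable_gates(10_000, 0.50)  # optimistic end of the quoted range
assert (low, high) == (1000, 5000)
```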
In terms of physical area, overhead from logic and interconnect greatly impacts packing
density. Instead of the simple implementation of a desired gate, a PGA logic block must
support several different logic configurations. As a result, more transistors are necessary
and configuration memory must be added. However, active area is not the largest source
of area overhead. The large amounts of interconnect lead to large channels dedicated to
only wiring. An empirical review conducted in [8] shows that the interconnect area
(including switches) is nearly 100 times greater than the active area devoted to the
functional block. As will be discussed in a later section, the interconnect area penalty can
be alleviated somewhat by using more advanced processes which offer more layers of
metallization. In general though, configuration memory and interconnect account for
nearly 90% of a PGA’s die area.
As previously alluded to, routability influences effective density, but it is also an
important metric on its own. Routability is not a concrete metric like delay. Instead,
routability reflects the amount of interconnection resources provided for connecting logic
blocks together and also the efficiency with which those resources can be utilized.
Routability cannot be guaranteed for all possible designs to be implemented, but is heuris-
tically determined for a representative subset of potential designs. The flexibility and
amount of interconnect provided in a PGA architecture strongly influences its routability.
However, routability’s heuristic nature complicates an analytical determination of the
necessary quantity and make-up of interconnection resources. Thus, a PGA designer must
remain mindful of the logic intended to be implemented on the PGA.
After density, logic utilization, and routability, speed is the most touted architecture
performance metric in the PGA industry. The obvious drawback of the general purpose
design offered by a PGA is sub-optimal performance. The speed of programmable logic
considerably lags behind full-custom and standard cell designs. Typical design clock
speeds are less than 50 MHz, far below the hundreds of MHz of recent full-custom
microprocessor designs.
Speed, or conversely delay, consists of two basic components: the logic delay and the
interconnection delay. The logic delay is the propagation time through a series of
configured logic blocks from their inputs to their outputs. In typical ASIC and custom IC
designs, logic delay is the dominant component of overall delay although shrinking
feature sizes have begun to place more importance on interconnect delays. A programma-
ble gate array, on the other hand, experiences much greater interconnection delays and in
many cases, the interconnect delays can be the performance limiter for a design. To
support the flexibility necessary to map nearly any logic design to the hardware, FPGAs
contain an abundance of interconnect. Since the interconnect resources must support
general connection patterns through switching, rather than dedicated wiring, routing is
significantly more resistive and capacitive. In a typical PGA implementation, the intercon-
nect accounts for 40%-80% of the overall design delay [8],[47]. Thus, PGA wiring delays
are substantially worse than in a traditional CMOS design.
The above discussion of performance mentions nothing about power consumption.
This is because power consumption has been a long-neglected metric of programmable
gate array design. Xilinx and Altera, the leading FPGA manufacturers, do not provide any
tools for estimating power consumption for their parts, and commercial databooks only
give rough power consumption guidelines suitable for first-order hand calculations [43],[46].
1.4 PGA Power?
Power consumption has become a major concern in IC design. Higher clock speeds
coupled with ever increasing integration levels have combined to increase power
consumption of high performance chips into the tens of watts range. PGAs, as could be
expected from their general purpose architecture, pay a high price in power consumption.
Table 1-1 shows some energy metrics for a variety of designs. A comparison between the
XC4003A and the static CMOS full adder implementation shows a 100x difference in
energy consumption for an 8-bit adder. Entire processors consume less power than a
design consisting of 50 XC4000XL logic functions.
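The 100x figure can be sanity-checked from Table 1-1, assuming the static full-adder energy scales linearly to 8 bits (a rough sketch, not a measurement):

```python
# Rough check of the ~100x energy gap cited above, using Table 1-1 numbers.
xc4003a_adder = 4.2e-3   # W/MHz, measured 8-bit adder on the XC4003A
full_adder    = 5.5e-6   # W/MHz, one static CMOS full adder [17]
bits = 8
static_adder = bits * full_adder   # assume linear scaling with bit width
ratio = xc4003a_adder / static_adder
print(f"{ratio:.0f}x")             # 95x, i.e. on the order of 100x
```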
Furthermore, as the capacity of these arrays has increased due to technology scaling,
so has their power dissipation. In some sense, the maximum clock frequency and size of
future designs will become limited by power dissipation and heat removal requirements.
This is becoming apparent in some of the recent commercial literature where packaging
constraints have become a concern [45],[46]. In addition, many of today's applications
revolve around embedded solutions and portable devices where battery life is at a
premium. An example of such a design is the InfoPad developed for wireless multimedia
[15]. Clearly, power and energy consumption in PGAs must be examined if their utility is
to be preserved in future applications.
1.5 Research Contributions
This work represents a significant inroad into the design of low energy programmable
logic modules. In many chip designs, power saving solutions and heat management are
treated as costly downstream afterthoughts, whereas an up-front approach can have a
much greater impact. With this in mind, a firm understanding of PGA power consumption
was built up first. Then, through several physical measurements, an explicit breakdown of PGA
capacitance was achieved which allowed a detailed power analysis to be performed.
Based on the intuitions generated from the power analysis and a series of circuit studies, a
low energy programmable gate array architecture has been defined.

TABLE 1-1 Energy Metrics for Various Designs
Design Example                           Vdd          Energy
Xilinx XC4003A 8-bit Adder (measured)    5 V          4.2 mW/MHz
Xilinx XC4000XL Series [46]              3.3 V        92 uW/MHz/Logic Function Output
0.8 um Gate Array [40]                   unspecified  7.5 uW/gate/MHz
Static CMOS Full Adder [17]              3.3 V        5.5 uW/MHz
54x54 Multiplier [13]                    2.5 V        2.23 mW/MHz
DSP Processor [21]                       1 V          0.21 mW/MHz
StrongARM Microprocessor [24]            1.5 V        2.1 mW/MHz
Alpha Microprocessor [4]                 2 V          60 mW/MHz

Several insights into
PGA circuit design and low voltage pass transistor design are also discussed. A 0.5um
effective channel length HP process was the target technology for the design (Table 1-2).
A supply voltage of 1.5 volts was originally intended, but during the course of the research,
2 volts was determined to be preferable from a circuit design and energy-delay product
standpoint. Finally, the layout for the main blocks of a PGA array has been extracted and
simulated to demonstrate the significant improvements in energy that have been attained.
As previously stated, power has been neglected within the PGA industry resulting in
virtually no information relating to power consumption in PGAs. In fact, circuit details
are regarded as highly proprietary so little is known about the internal structure of
commercial PGA designs, unlike memories and common datapath blocks. This
complicates any examination of power since a design’s power consumption implemented
on a PGA is inherently defined at the circuit level. Therefore, significant effort was spent
in understanding where the PGA power problem lies and in developing a methodology
from which to study PGA power.
As a final note, the goal of defining a low power PGA design is constrained by the
intent to preserve the PGA’s most attractive asset, namely its flexibility. As will be seen
throughout this thesis, this goal influences nearly every design decision.
1.6 Thesis Organization
This work is divided into 8 chapters. Chapter 2 provides an
overview of PGA architecture and illustrates some of the trade-offs involved. An
extensive discussion of PGA power analysis is in Chapter 3. A presentation of low power
techniques and their applicability to the PGA design environment follows in Chapter 4.

TABLE 1-2 Process Parameters
Drawn Channel Length   0.6 um
N-channel Vt           650 mV
P-channel Vt           -850 mV
Poly pitch/spacing     0.6 um / 0.9 um
M1 pitch/spacing       0.9 um / 0.9 um
M2 pitch/spacing       0.9 um / 0.9 um
M3 pitch/spacing       1.5 um / 0.9 um
Well type              N-well process
Chapter 5 considers issues surrounding low voltage design and Chapter 6 details circuit
design options and the architectural description of the proposed low power array. A
description of the layout appears in Chapter 7 along with simulated results. Conclusions
and thoughts on future work appear in Chapter 8. A set of Appendices contain the power
analysis data discussed in Chapter 3, a detailed capacitance breakdown for the newly
proposed design, and information relating to programming of the array.
CHAPTER 2
PGA Architecture Overview
2.1 Basic Components and Blocks
As alluded to previously, a programmable gate array consists of an array of logic
blocks and a body of programmable interconnect resources. The entire array is then
surrounded by a set of I/O blocks providing the interface to the chip’s pins. Once the user
has specified a design, the logic block functionality is programmed into the configuration
memory of the logic blocks and the interconnect resources are configured to support the
necessary communication paths among logic and I/O blocks. The next few sections give a
brief description of the three basic PGA components.
2.1.1 Logic Blocks
The logic blocks represent the sole computational resources of a PGA. Depending on
how the internal connectivity and gate structure of the logic cell is programmed, a variety
of logic functionality can be realized. Often, the logic capacity of a block is thought of in
terms of gate equivalents. Typical gates that can be implemented are NANDs, NORs,
XORs and their complements, but more complex combinations can also be made. In some
cases, multiple levels of logic can be implemented within one cell. Logic blocks can vary
widely in logic capability, depending on their design. The number of unique inputs to a
logic cell can range anywhere from two to upwards of 16. Similarly, the number of
outputs provided can be as few as one and as many as 4 or 5. As a result, high fan-in
functions are possible, as is the implementation of multiple independent functions. A
common way of expressing the utility of a logic block is the number of 2-, 3-, 4-, etc.
input functions it can form. AT&T's ORCA FPGA provides an example of one of the most
complex logic blocks in use. It can compute any 4 independent functions of 4 distinct sets
of inputs [22].
The actual implementation of logic blocks also varies, but can be grouped into two
styles. The first is the discrete gate method. In this case, a small number of primitive
static gates are provided and then the inputs to those gates are switched through multiplex-
ers giving rise to different functions. The state of the switches is usually held in a static
configuration memory. This style typically has less total coverage of the possible two- and
three-input functions than the second style, decoder-based logic blocks. In decoder-
based logic blocks, the inputs select various storage elements which are programmed with
the 1’s and 0’s corresponding to the logic function that is to be realized. This technique
can be thought of as assigning 1’s to the desired minterms. Frequently, the decoder style is
referred to as a look-up table, or k-LUT where k is the number of inputs. Decoder style
LUTs offer the advantage that any function of the inputs can be formed. Figure 2-1
illustrates the difference between the two basic logic cell implementation styles.
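The decoder/LUT style can be modeled in a few lines — the configuration is just a truth table, and the inputs select one stored minterm. This is a schematic model for illustration, not any vendor's circuit:

```python
# Schematic model of a decoder-style k-LUT: the configuration memory holds
# a truth table, and the inputs simply select one stored bit (one minterm).
def make_lut(truth_table):
    """truth_table[i] is the output for input pattern i (LSB = first input)."""
    def lut(*inputs):
        index = sum(bit << i for i, bit in enumerate(inputs))
        return truth_table[index]
    return lut

# Program a 2-LUT as XOR: minterms 01 and 10 are set to 1.
xor2 = make_lut([0, 1, 1, 0])
print(xor2(0, 1), xor2(1, 1))  # 1 0
```

Any function of k inputs is realizable simply by changing the stored bits, which is exactly the advantage claimed for LUTs above.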
So far, only combinational logic has been discussed. In order to perform retiming and
data storage, flip-flops are often included within the logic blocks to provide a registered
output if desired. As a final note, the design of the logic block is crucial to the PGA's
overall utility, and since it can be repeated thousands of times in the array, significant
effort is expended on optimizing its design and layout.
FIGURE 2-1: Logic Block Designs
2.1.2 Programmable Interconnect
Once the circuit is mapped to the logic blocks, the interconnection or routing resources
are configured to provide the required connectivity. Just as custom IC design requires
global signal distribution, busses, and local wiring, so too does a PGA architecture. The
key difference, however, is that the PGA must allocate wiring resources before the target
design is known. As a result, the interconnection resources provided by a PGA must be
designed to support a variety of possible patterns, and so a variety of different interconnect
is usually included in a PGA architecture. In general, long length wires spanning the
entire array, medium length wires spanning a few logic blocks, short length wires for cell
to cell connections, and globally distributed wires are available to make the necessary
connections among the logic and I/O blocks.
In order to allow flexibility in the interconnect, two schemes are in popular use (Figure
2-3). The first employs antifuses, which, when blown, provide permanent connections
between the various interconnect segments and logic block pins. Antifuse programming
introduces much less resistance and capacitance into an interconnect network [38]. The
family of parts offered by Actel [1],[5] utilizes this method, but most PGA designs have
gone with an SRAM based configuration technique allowing multiple programming
iterations. In the SRAM style, programmable switches are turned on or off based on the
value stored in the configuration bit that controls that particular switch.
FIGURE 2-2: Types of PGA Interconnect Resources (single-length, double-length, and array-length wires)
An NMOS pass
transistor, allowing bi-directional signal flow, forms the basis for all the switchable
connections in SRAM based interconnection networks. As will be seen later, the
pervasive use of pass transistors in interconnect and logic resources presents some serious
challenges for low power design.
Determining the number and placement of switches within the PGA interconnect
encompasses several design issues and greatly affects the performance of the mapped
circuit. The switches form the basis of the interface between the logic block and the
routing resources. Switches at the logic block outputs offer a larger fanout capability just
as more switches on the inputs allow a greater number of options from which inputs may
be derived. The number of switches relative to the number of interconnect lines is often
referred to as the population of the network. In addition to cell interfacing, switches also
make connections between different types of interconnect and allow the construction of
longer segments from shorter ones, a process called segmentation (Figure 2-4).
FIGURE 2-3: Two Programming Methods for PGA Interconnect (antifuse vs. SRAM-controlled switch)
FIGURE 2-4: Various Levels of Switch Population and Examples of Segmentation
2.1.3 I/O Ring
A ring of I/O blocks around the logic block array provides the PGA’s interface to the
pins of the chip. Only a few differences separate it from a standard pad frame. Instead of
having a collection of dedicated input or output pins, all PGA pins have circuitry in their
I/O blocks to support either an input or an output path. The configuration of the I/O block
determines which path is active. In addition, flip-flops may be present in the I/O block.
2.2 Architectural Styles
The architecture of a programmable gate array can be loosely classified into one of
two styles: 1) island-based or 2) cellular-based (Figure 2-5).
2.2.1 Island
Several examples of PGAs that can be grouped into the island-based category include
designs by Xilinx [43], Lucent Technologies [22], and Altera [23],[36]. An island-based
array typically employs larger and more complex logic blocks allowing more computation
per logic block. The interconnect structure of such arrays consists of a set of general
purpose wiring used to connect logic ‘islands’ located throughout the array.
FIGURE 2-5: Representative Architecture of Island and Cellular PGAs
2.2.2 Cellular
In contrast to the island-based approach is the cellular class of architectures. Repre-
sentatives of this style include the CLAy [26], Atmel [11], Actel [5], CAL [5], and
XC6200 [44]. Logic cells in these architectures are much simpler than their island-based
counterparts. In many cases, only a single two input function is performed in each logic
cell yielding a more fine grain approach to computation. The interconnect of cellular
arrays relies heavily on neighbor to neighbor communication. More specifically, logic
cells have dedicated wires connecting the outputs of an adjacent cell to its inputs. This
mechanism of neighbor to neighbor connectivity is intended as the primary means of
routing signals among the logic cells.
2.3 The Xilinx XC4000 Family
The Xilinx XC4000 family of programmable gate arrays will now be discussed in
greater detail. A thorough understanding of the architecture was necessary in
order to perform the power analysis discussed in Section 3.3. Many of the architectural
resources referred to in Section 3.2 are explained here as well.
2.3.1 Composition of a configurable logic block (CLB)
The logic blocks in a Xilinx PGA are called configurable logic blocks or CLBs. Two 4
input look-up tables (LUTs) form the basis of the cells. The use of look-up tables ensures
that any function of 4 inputs can be realized. Each LUT receives four independent inputs
(F1..4 and G1..4). A second level 3-LUT (H function generator) allows any function of
the previous outputs of the two 4-LUTs and a special extra input H1. Two CLB outputs (X
and Y) can be configured such that the LUT outputs are passed to the programmable
interconnect network outside the CLB. Each output can also be registered by a flip-flop
before leaving the CLB, making a total of 4 outputs. The other signals shown in Figure 2-6
pertain to the use of the CLB as a RAM and will not be discussed any further.
Configuring the logic block is performed by setting the memory bits that control the
multiplexers and programming the look-up tables for the desired functions.
In order to speed up the carry ripple path for adders, dedicated carry circuitry is included
with the F and G function generators. This allows the computation of 2 sum and carry bits
within one CLB. Dedicated carry-in and carry-out paths feed the carry logic, isolating it
from the rest of the general interconnect resources and resulting in a significant
decrease in carry propagation delay.
2.3.2 Interconnect Resources
The interconnect structure of the Xilinx XC4000 series appears in Figure 2-8. The
array of switch boxes performs two functions: connecting shorter segments into longer
ones and allowing corner turns in the routing. Each connection dot of the switch matrix
consists of six pass transistors, as shown in Figure 2-7. The shortest interconnect
resources, called single lines, span one CLB pitch and are bounded
by switch boxes on each end. Double lines allow a switch matrix to be bypassed when
going a distance of two CLBs. Depending on the size of the array, other intermediate
lengths like “quad lines”, and “octal lines” may be included and are similar in construction
to the double lines.
FIGURE 2-6: Xilinx XC4000 CLB
For cross-chip connections, longlines traverse the CLB array. If the full length of a
longline is not required, it may be split into two independent segments by disabling the
splitter transistor (segmentation). All the interconnect
described allows bidirectional signal flow. As mentioned earlier, a vertically oriented
dedicated carry chain connects CLBs independent from the general purpose interconnect.
Lastly, each cross point that indicates a possible connection between the interconnect and
the CLBs represents a single pass transistor between the horizontal and vertical wires.
2.4 The CLAy Architecture
Next, the details of the Configurable Logic Array (CLAy) from National Semiconduc-
tor [26] will be briefly discussed to provide an example of the cellular type of architecture.
The CLAy design strongly contrasts with the Xilinx 4000 style and thus illustrates some of
the different directions PGA architecture can take. A representative section of a CLAy
array is shown in Figure 2-10.
TABLE 2-1 XC4003A Delays (data from Xilinx Databook and XACT Xdelay estimates)
Path               Delay (ns)
LUT Combinational  4
CLK to Q           3
Setup Time         4.5
Single Line        1.3
Double Line        1.3
Long Line          ~1.1
Carry Chain        1.5

FIGURE 2-7: Switch Matrix Detail (each connection dot is a 6-pass-transistor switch)
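To illustrate how the entries in Table 2-1 compose, a hypothetical register-to-register path (two LUT levels and three single-line hops; the path itself is invented for illustration, not a timing-analyzer result) can be summed:

```python
# Hypothetical XC4003A register-to-register path assembled from Table 2-1:
# CLK-to-Q, two levels of LUT logic, three single-line hops, then setup.
delays_ns = {"clk_to_q": 3.0, "lut": 4.0, "single": 1.3, "setup": 4.5}
path = (delays_ns["clk_to_q"] + 2 * delays_ns["lut"]
        + 3 * delays_ns["single"] + delays_ns["setup"])
print(f"path delay: {path:.1f} ns -> max clock ~{1000 / path:.0f} MHz")
```

Even this modest path lands near the sub-50 MHz clock rates noted in Chapter 1.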
2.4.1 Logic Cells
The logic cells of the array are simpler than the Xilinx CLB. An XOR and a set of
AND gates form the heart of the function unit (Figure 2-9). Unlike the Xilinx CLB, the
configuration of the block is determined solely by multiplexers and hardwired gates. The
range of logic functions supported by this cell is not as rich as that of a look-up table
structure, but commonly used combinations are included. Three unique inputs are routed
to the logic cell via the interfaces to the interconnect network. Two of these inputs come
from nearest neighbor direct interconnect (A, B) and one from the local bussing network
adjoining the cell.
FIGURE 2-8: Xilinx 4003A Interconnect Diagram
2.4.2 Interconnect Resources
Two types of interconnect surround the cell array in the CLAy architecture. Figure 2-
10 shows the bussing interconnect that connects cells in 8x8 blocks. Local busses have
connections to all 8 cells in a row or column and express busses simply bypass the entire
group of 8 cells. At the block boundary, a set of interface boxes allow connections
between the express and local busses of the neighboring block. The interface boxes also
provide signal regeneration since a signal may pass through several highly resistive
switches and will need amplification for reasonable performance.
The bulk of the interconnect fabric lies at the inter-cell level. Figure 2-10 depicts the
cell-to-cell connectivity. Two distinct inputs arrive from each neighboring cell, and the two
cell outputs are connected to all four Cartesian neighbors. By providing fast, direct, cell-
to-cell connections, logic cells can be combined into larger blocks without much
overhead. Clearly, the cellular architecture emphasizes the use of local interconnect.
However, mapping experiments have shown that sharing a local bus amongst 8 cells
produces inefficient mappings [34]. The key problem is that when the local busses are
used, the entire row or column of 8 cells can only be sourced by neighboring cells. As a
result, many cells become used only for routing, and many irregular paths develop. A
possible solution involves the addition of more tracks or the sharing of busses across
fewer cells.
FIGURE 2-9: CLAy Architecture Logic Block
2.5 Triptych
A summary of the Triptych [14] architecture follows because it differs significantly
from the other architectures. In addition, Triptych was studied in detail due to the avail-
ability of layout.
The basic logic cell appears in Figure 2-11. It implements a 3-LUT allowing any 3
input function to be realized and includes a flip-flop to register the output. The unique
property of this cell is its ability to route 3 independent paths from its inputs to its outputs.
Thus, the problem of a fixed allocation of routing and logic resources is alleviated by
allowing logic cells to serve as short-distance routing structures.
Interconnection of logic cells takes place along two levels. Horizontal flow is achieved
by routing through the logic cells (actually in a diagonal fashion) and is depicted in Figure
2-11. Two of the three cell outputs connect to the downstream diagonal neighbor.
Likewise, two of the three inputs to the cell are derived from the diagonal connections.
FIGURE 2-10: CLAy Architecture Interconnect Scheme
Vertical connections are supported from a channel of variable-length bus segments, also
depicted in Figure 2-11. The complete interconnection network is actually two checker-
boards superimposed on each other giving a left to right and right to left diagonal flow.
Connections between the two grids are through loopback connections. Similar to the
CLAy architecture, the Triptych interconnect attempts to rely on more dedicated, localized
routing resources.
2.6 Conclusions
From the above discussion of a few architectures, one can see that a PGA design can
take on many different forms. By learning where power consumption is concentrated in
such architectures, decisions can be made towards the development of lower power PGAs.
In addition, several general architectural conclusions have been reported. Although they
do not focus on power, the following statements are still applicable to the development of
a low power PGA.
FIGURE 2-11: Triptych Logic Cell and Interconnect Scheme
2.6.1 Architectural Design Insights
• There is a trade-off in block functionality: too little leads to higher interconnect
demands; too much leads to wasted active area and underutilization [29]. With very
simple logic blocks, a large amount of high-cost, general purpose interconnect becomes
necessary to form relatively simple functions, whereas an overly complex logic block
may not be efficiently used.
• Logic block inputs should possess high functionality per pin for area efficiency [29]. In
other words, once a signal is brought to the input of a logic block, it is best to perform as
much of the necessary computation involving that signal there, rather than routing it to
several blocks and using several blocks for computation.
• The optimal number of LUT inputs is 3-4 for lowest total area [29].
• A flip-flop should be included within the logic block [29]. In this case, the work
showed that directly implementing a flip-flop in the architecture is much better than
constructing one from a group of logic blocks even though it won’t always be needed.
• Avoiding repeated paths through general purpose interconnect can increase PGA per-
formance on the critical path [29].
• Connection delays often limit the system clock, demonstrating the importance of
optimizing the interconnect to improve overall design speed, as was mentioned in the
introduction to this work [29].
• The efficiency and quality of software mapping is impacted by how well the underlying
PGA architecture can support the intended applications. Symmetry in the interconnect
and logic block interfaces along with input permutability facilitate mapping designs to
a PGA.
CHAPTER 3
PGA Power Analysis
In order to begin the design of a low power PGA, a good understanding of the sources
of power consumption for existing designs is essential. As mentioned earlier, very little
detailed information on power is readily available and so the first part of this research
focused on gaining some knowledge about PGA power.
3.1 Sources of Power Consumption
The sources of power consumption in CMOS integrated circuits have been studied in
great detail over the last several years [6]. The predominant contributions to CMOS power in
relative order of their magnitudes are: 1) dynamic switching power, 2) short circuit power,
3) leakage power, and 4) static power. Each of these will be explained below.
Pav = Pdyn + Psc + Pleak + Pstatic    (EQ 1)
3.1.1 Dynamic
The dynamic part of power consumption usually accounts for upwards of 90% of total
power. It arises from the charging and discharging of the parasitic capacitances shown in
Figure 3-1. The components of dynamic power appear in Equation 2 and a derivation can
be found in [28]. From the equation, one can see that the degrees of freedom offered are
capacitance, supply voltage, signal swing, frequency and activity. The manipulation of
these variables will be closely examined in Chapter 4.
DynamicPower = C × Vdd × Vswing × Fclk × ActivityFactor    (EQ 2)
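Plugging representative numbers into Equation 2 gives a feel for the magnitudes involved (the node values below are illustrative, not taken from the measurements in this work):

```python
# Dynamic power per Equation 2: P = C * Vdd * Vswing * Fclk * activity.
def dynamic_power(c_farads, vdd, vswing, f_hz, activity):
    return c_farads * vdd * vswing * f_hz * activity

# Illustrative node: 1 pF of switched capacitance, full-rail 5 V swing,
# 10 MHz clock, 25% activity factor.
p = dynamic_power(1e-12, 5.0, 5.0, 10e6, 0.25)
print(f"{p * 1e6:.1f} uW")  # 62.5 uW
```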
3.1.2 Short Circuit Current
The next largest part of CMOS power consumption is generally due to short circuit
current flow during the switching transient. A slow input slope relative to the output
node’s transition causes both PMOS and NMOS devices to conduct. As a result, a low
resistance path momentarily exists between the supply and ground leading to wasted
charge. Keeping the input and output rise and fall times approximately equal insures that
short circuit power will be kept to under 10% of total design power. An important note to
make is that virtually no short circuit current flows if the supply voltage is lowered to the
sum of the PMOS and NMOS device thresholds. This fact plays an important role in the
later discussion of low voltage pass transistor design.
3.1.3 Leakage
Even if a circuit does not switch, it will still consume power due to the
reverse-biased drain/source diodes and the non-idealities of the MOSFET device.
FIGURE 3-1: MOS Parasitic Capacitances
Leakage of the drain/source diodes is governed by Equation 3, in which Is is the reverse
saturation current, V is the voltage across the junction, and Vth is the thermal voltage (kT/q).
Ileakage = Is (e^(V/Vth) − 1)    (EQ 3)
A simple estimate of the leakage from all the drain area of a 1 million transistor chip
reveals that the power is on the order of 50-100uW and can therefore be considered
negligible in the case of any PGA.
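That order-of-magnitude estimate is easy to reproduce; the per-device junction leakage assumed below (~10 pA) is an illustrative figure, not a measured one:

```python
# Back-of-the-envelope diode leakage for a 1-million-transistor chip.
# The assumed per-device reverse junction leakage is illustrative only.
transistors = 1_000_000
i_leak_per_device = 10e-12   # amps, assumed
vdd = 5.0                    # volts
p_leak = transistors * i_leak_per_device * vdd
print(f"{p_leak * 1e6:.0f} uW")  # 50 uW -- negligible, as stated above
```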
The other source of leakage current is the non-ideal I-V characteristic of a
MOSFET, which gives rise to subthreshold conduction. When the applied gate voltage is
below the threshold voltage, a small current still flows due to weak inversion of the channel.
This current is an exponential function of the device threshold as shown in Equation 4.
For a 3.3v process, threshold voltages remain fairly high (600-800mV) and so the
aggregate current from subthreshold conduction is minimal. However, attempts to lower
device thresholds for improved performance will cause subthreshold currents to quickly
rise. This fact must be carefully weighed when designing for low voltage operation so that
leakage currents do not dominate a design’s power consumption.
Isub ∝ e^((Vgs − Vt)/s),   where s is the subthreshold slope    (EQ 4)
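The exponential sensitivity in Equation 4 can be quantified directly. Expressing the subthreshold slope s in volts per decade (the 90 mV/decade value below is a typical assumed figure, not from this work):

```python
# Relative subthreshold leakage vs. threshold reduction, per Equation 4.
# s is the subthreshold swing in volts/decade (assumed 90 mV/decade here).
def leakage_multiplier(delta_vt, s=0.090):
    return 10 ** (delta_vt / s)

for dv in (0.1, 0.2, 0.3):  # lower Vt by 100, 200, 300 mV
    print(f"Vt - {dv * 1000:.0f} mV: leakage x{leakage_multiplier(dv):.0f}")
```

Lowering Vt by a few hundred millivolts multiplies leakage by orders of magnitude, which is exactly why threshold reduction must be weighed carefully in low voltage design.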
3.1.4 Static Current
The magnitude of the last component of power consumption depends heavily on the
type of design. In a standard CMOS design with rail to rail swings, static power should be
zero. However, other circuit styles can introduce static current components. Possible
sources of static power include pseudo-NMOS gates and circuits requiring bias currents
(e.g. amplifiers). Unlike dynamic power, static power is independent of frequency.
3.2 Xilinx Component Power Measurements (opening the black box)
The first step in gaining some insight into PGA power consumption was to open the
black box. However, without the luxury of detailed schematics or layout of a complete
PGA, the only means to study an array was by physical lab measurements. As a result, a
systematic procedure was developed to obtain a detailed picture of the internal capaci-
tances of the chip. These capacitance values were then combined with architectural
knowledge to form some crucial insights about where power goes in PGAs. A Xilinx
XC4003A became the analysis target because of the existence of a test board and the
necessary Xilinx design software. The XC4003A was fabricated in a 0.6um, 2-layer metal
process. Although the XC4000 series has been improved upon in recent years, the
underlying architectural design has remained consistent allowing useful conclusions to
still be drawn. The structural details of the Xilinx device were presented in Section 2.3.
3.2.1 Method
As with any experimental measurement, the method used to obtain the data plays an
important role in governing the utility of the information. In addition, the procedure
described here can be reused for future analysis of more recent Xilinx designs. The lab
measurements were aimed at obtaining capacitance information, but capacitance
values are not directly observable. Thus, current measurements were performed on
mapped designs under carefully controlled operating conditions. Using the current values,
average power can be calculated as can average energy in terms of mW/MHz. Assuming
that the overwhelming majority of power comes from the dynamic component, the
capacitance of the structure under examination can be found from Equation 5.
C = Energy / (Vdd × Vswing) = (DynamicPower / ToggleFrequency) / (Vdd × Vswing) (EQ 5)

The first step in achieving the measurements was the generation of test designs.
However, the normal synthesis and automatic partition, place and route (PPR) flow does
not allow sufficient control over the resulting mapped design. Instead, the Xilinx XACT
tool, which allows very detailed hand routing of a design, was used extensively [47]. The
ability to enable the exact interconnect paths desired and to set the configurable logic
block (CLB) and I/O structure functionality allowed a high level of control over what
circuit structures were toggling. This degree of control ensured the proper isolation and
orthogonality among measured components so that an accurate measurement of only the
component under examination was performed. A thorough understanding of how the
PGA architecture works was essential to manipulating the PGA at such an intricate level.
Lastly, each measurement was performed in a relative fashion. That is to say, one
measurement was recorded without the circuit component toggling and one with the
component toggling. Thus, all extraneous sources of current draw (power) are eliminated
from the desired measurement. In addition, the accuracy of measurements was improved
by enabling several of the same components at one time to give a larger current differen-
tial. A larger current differential gets around the problem of finite measurement resolution
and also provides an averaging effect to improve accuracy.
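The arithmetic behind these differential measurements is straightforward. The sketch below is a Python illustration with invented current readings, not part of the thesis tooling; it converts a pair of supply-current measurements into per-component energy and capacitance via Equation 5:

```python
def component_capacitance(i_on_mA, i_off_mA, f_MHz, n_copies,
                          vdd=5.0, vswing=5.0):
    """Convert a differential supply-current measurement into the energy
    (mW/MHz = nJ) and capacitance (pF) of a single toggling component.

    i_on_mA / i_off_mA : supply current with / without the components toggling
    f_MHz              : toggle frequency of the test signal
    n_copies           : identical components enabled in parallel
    """
    # Dynamic power attributable to the enabled components alone
    delta_p_mW = (i_on_mA - i_off_mA) * vdd
    # Energy per toggle for one component (EQ 5, first equality; mW/MHz == nJ)
    energy_nJ = delta_p_mW / f_MHz / n_copies
    # Capacitance from C = Energy / (Vdd * Vswing) (EQ 5, second equality)
    cap_pF = energy_nJ / (vdd * vswing) * 1000.0  # nJ / V^2 -> pF
    return energy_nJ, cap_pF

# Hypothetical reading: 8 single lines toggling at 4 MHz raise Idd by 0.32 mA
e, c = component_capacitance(i_on_mA=8.72, i_off_mA=8.40, f_MHz=4.0, n_copies=8)
# e ≈ 0.05 nJ per toggle, c ≈ 2.0 pF -- in the range measured for single lines
```

Enabling several copies (n_copies) is exactly the averaging trick described above: the larger current differential is divided back down to a per-component value.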
3.2.2 Components measured
The components analyzed in the lab measurements were chosen based on their archi-
tectural impact and on the ability to accurately isolate them. Table 3-1 shows all the
resources that were measured.
TABLE 3-1 Xilinx XC4003A Architecture Components

Component Class | Type                  | Comments
CLB             | CLB Function Logic    | LUT Tree and Internal Muxes
I/O             | I/O Interface         | Input and Output Buffers
Interconnect    | CLB Output Interface  | Output Buffers and Track Switch Load
                | CLB Input Interface   | LUT Input Multiplexers
                | Full & Half Longlines | Wires span 5 or 10 CLBs
                | Double Lines          | Wires span 2 CLBs
                | Single Lines          | Wires span 1 CLB
                | Clock Lines           | Clock Distribution and Clk Input Loads
                | Carry Chains          | Direct Path between CLBs
The majority of items listed in the table belong to the interconnect resources of the PGA
since much of a PGA is simply wiring. A full longline corresponds to the case of a
longline where the splitter connects the two segments into one. The half longline
represents the case where the splitter switch is off and only half of the longline is toggling.
Overall, a very detailed breakdown of the interconnect structure was performed in order to
achieve high accuracy in later analyses.
The CLB related components combine to cover nearly all the major potential sources
of power within a logic block. Only the carry logic and the RAM configuration areas were
neglected.
The I/O structures were included for the sake of completeness. Although output driver
power can be significant, those currents are largely set by the board capacitance and other
system issues so they were not seen as something to be optimized. In fact, all measure-
ments were taken without any output pins enabled.
Finally, the clock distribution network was also included among the components to be
measured. In the Xilinx architecture, a series of dedicated vertical wires traverse each
column of the chip to provide global signals to a column of CLBs.
3.2.3 Results
TABLE 3-2 Measured Component Energies and Capacitances

Component                     | Energy (mW/MHz = nJ) | Estimated Capacitance (pF)
CLB Function Generator        | 0.025                | 1.10
I/O Input Path (I1, I2)       | 0.062, 0.140         | 2.5, 5.6
CLB Input Interface           | 0.040-0.050          | 1.95-2.4
CLB Output Interface          | 0.041                | 1.64
10 CLB Longline (Horiz, Vert) | 0.054, 0.088         | 2.7, 4.4
5 CLB Longline (Horiz, Vert)  | 0.022, 0.040         | 1.1, 2.2
Double Line                   | 0.06-0.107           | 3-5.35
Single Line                   | 0.048-0.088          | 2.4-4.4
Clock Connection              | 0.030                | 1.5
The data in Table 3-2 shows the results of the board-level measurements on the Xilinx
chip. The complete, detailed breakdown of the lab measurements for each component is
listed in the Appendix. The process used for this chip was two-layer metal with 0.6um
feature size. These values provide the first concrete numbers on internal PGA capaci-
tances for an industry chip. A range of energy and capacitance values appears for some of
the table entries because several components fall into those categories. The range
represents the bounds that were measured among the components. In addition to the
absolute capacitance magnitudes, relative comparisons among the components reveal
some interesting insights.
The most astounding fact about the numbers is their order of magnitude. In conventional
VLSI CMOS designs, the capacitances are typically in the range of tens to hundreds
of femtofarads. Only on long, high fanin/fanout busses would one be likely to encounter
capacitive loads in the picofarad range. However, in the case of the PGA, a single wire
spanning one CLB pitch has a load of 2-4 pF, and the other types of interconnect fall in this range as
well. One might have anticipated a high value for interconnect capacitances given that
interconnect delays are fairly substantial in these devices, but picofarads is quite a
surprise.
The reason for such large capacitances must be related to the fact that the interconnect
is programmable. As a comparison, parasitic wiring capacitance to the substrate and other
metal layers can only account for capacitances on the order of 150 fF (for a 500um line). The
only other source of capacitance is from the devices relating to configuring the intercon-
nect. Although the exact circuit structure of the CLB - interconnect interface is not
known, it is known that each possible connection point has a switch to enable or disable an
electrical connection. Each switch takes the form of a single NMOS pass transistor, and so
each one contributes parasitic drain capacitance to the interconnect it attaches to regardless
TABLE 3-2 (continued) Measured Component Energies and Capacitances

Component                      | Energy (mW/MHz = nJ) | Estimated Capacitance (pF)
Clock Column Distribution Wire | 0.128                | 6.4
Carry Chain                    | 0.050                | 2.5
of whether a signal has been routed through the switch. A single line type of interconnect
is loaded by 15-20 pass transistor switches. The pass transistors have been described as
“very big”, which leads to large parasitic drain capacitances. Very
large pass transistors allow the delay of a multi-segment path to appear linear as opposed
to quadratic because the path becomes completely capacitively dominated. This attempt
to linearize long-distance delay growth by using large pass transistors makes sense
because buffering within the interconnect network was not used in the design. Only in the
most recent chip families has buffering been introduced within the switch matrices [39].
Therefore the large capacitance seen on the interconnect, especially the shorter varieties,
appears to be predominantly due to switch or fanout capacitance as opposed to wiring
capacitance. Lastly, it is important to recognize the fact that fanout capacitance is a
function of the number of switches hanging off an interconnect segment, as well as the
size of those switches. (Figure 3-2)
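A back-of-the-envelope model makes the switch dominance concrete. The sketch below is illustrative only; the wiring and drain capacitance values are round numbers chosen to match the magnitudes quoted above (~150 fF of metal per CLB pitch, 15-20 switches hanging off a single line):

```python
def segment_capacitance_fF(wire_fF_per_pitch, pitches, n_switches, drain_fF):
    """Total load on one interconnect segment: metal wiring capacitance
    plus the drain capacitance of every pass-transistor switch attached
    to it -- the switches load the wire whether or not they are enabled."""
    return wire_fF_per_pitch * pitches + n_switches * drain_fF

# Single line: one CLB pitch of wire, ~18 switches at an assumed ~150 fF each
single = segment_capacitance_fF(wire_fF_per_pitch=150, pitches=1,
                                n_switches=18, drain_fF=150)
# -> 2850 fF, i.e. in the measured 2.4-4.4 pF range, with the switches
#    accounting for roughly 95% of the total load
```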
In addition to absolute capacitance information, the characterization data in Table 3-2
also allows comparisons to be made among the different types of interconnect. The most
interesting result is that the shorter distance interconnect experiences approximately the
same loading as the more global, long distance interconnect. Once again, this implies the
importance of fanout over wiring length in determining the overall interconnect parasitic
FIGURE 3-2: Heavily Loaded Interconnect due to Switch Diffusion Capacitance
load capacitance. However, the longlines which span 10 CLBs are exposed to 3 times
more switches than the shorter interconnect so one would expect a proportionate increase
in capacitance, but that is not the case. A plausible explanation for this seemingly incon-
sistent result is that the switches used to connect to the longlines are smaller since a
longline path is not exposed to as much series resistance as a path using the more local
interconnect resources. Another possible reason is the use of buffers at the CLB inputs
which would contribute significant gate capacitance. Regardless of the exact reason, the
fact that local and global interconnect possess the same capacitive load gives rise to
another crucial difference between conventional IC designs and PGA design.
In most IC design, the local wiring has a minimal effect on the total parasitic node
capacitance, as should be the case for localized connections. Recently, though, shrinking
feature sizes have begun to shift the balance of parasitic capacitance more towards the
interconnect, which does not scale as well due to fringing and interline capacitance
components. Despite this trend, very large capacitances are usually seen only on long-
distance bus connections. In a PGA, any kind of wiring experiences large capacitances
and hence, local wiring is just as expensive from a power standpoint as are cross-chip con-
nections. Thus, an island style PGA does not exploit the benefits of locality since no
“cheap” local interconnect is available. In essence, the large discontinuity between the
CLB and interconnect capacitance numbers causes a serious power penalty to be paid
every time a signal leaves the CLB.
3.3 Xilinx Profiler Description
3.3.1 General Idea
The insights developed from the baseline capacitance measurements shed some light
on the power profile of a PGA, but they do not provide a complete picture. Any given
design will be made up of several of the building blocks in Table 3-2 and the aggregate
design power will be a function of those components. An important question to be
answered is how these building blocks contribute to a design’s overall power consumption.
By generating routed designs for common library functions and other design examples, a
more complete and useful analysis can be performed. In fact, complete power profiles can
be generated for designs, thus revealing general trends as well as providing useful metric
data such as average net capacitance. In addition, a first order activity factor weighting
must be included in the analysis. This is especially important because power is directly
proportional to transition rate and so any attempt to model power must include some
estimation of activity.
In order to perform the analysis discussed above, a methodology and power estimation
tool needed to be developed. The basic idea leveraged the baseline
capacitance data obtained from the detailed lab measurements and the existing Xilinx
software flow. By forming a linear combination of all the characterized components used
in a design, the total capacitance and dynamic power consumption could be calculated. A
first order estimate of the activities of various nets was then included to achieve the proper
weighting of terms. A set of Perl programs was written to perform these various tasks.
3.3.2 Flow
Figure 3-3 shows the methodology used in the power analysis of the Xilinx device.
[Flow diagram: the LCA file, parsed by xpx.pm against the characterization data,
produces .dat, .len, and .net files; distance.pm and avecap.pm post-process these,
net2cmd.pm builds a .cmd file for a ViewLogic simulation whose .trace output feeds
activity.pm, and a final xpx.pm post-processing pass completes the flow.]

FIGURE 3-3: Xilinx Power Analysis Flow
At the top of the figure is the most essential piece of information: the LCA file. A
designer starts with a description of a circuit in either HDL or schematic form and then the
Xilinx tools are used to optimize and transform the design into a placed and routed imple-
mentation on the PGA architecture. The only file that contains information about the fully
mapped design is the LCA file. Without the detailed description of how the specified
design is actually implemented on the PGA, a linkage between the physical capacitances
and the routed design would have been impossible.
Unfortunately, the LCA file syntax is not documented, and to make matters
worse, the file is not a traditional netlist. Rather, the LCA file specifies the set of program-
mable interconnect points that need to be enabled to realize a routed design. CLB and I/O
block functionality also appears in the LCA file. A sample portion of an LCA file appears
in Figure 3-4.
Upon careful examination of the LCA file, it was found that sufficient information was
present to decompose a design into its component building blocks. As a result, the
connection between circuit capacitances and a software mapped design could be
determined.
Addnet B6 PAD2.I1 FD.F2
Netdelay B6 FD.F2 3.4
Program B6 {137G224} {85G226} {79G232} {79G312} {79G319} {79G388} {79G395}
{79G450}
NProgram B6 row.H.local.7:FD.F2 HC.24.1.11 HC.24.1.0 EC.24.1.17 EC.24.1.0
CC.24.1.17 CC.24.1.0 col.C.local.1:PAD2.I1
Addnet B7 PAD1.I2 FD.G1
Netdelay B7 FD.G1 2.9
Program B7 {114G262} {114G429} {85G429} {78G429} {47G429} {43G433} {43G444}
NProgram B7 col.D.long.1:FD.G1 col.D.long.1:row.B.local.4-s BC.24.1.9 BC.24.1.20
BB.24.1.9 BB.24.1.2 col.B.local.3:PAD1.I2
Addnet CLK_1 LL.Y startup.CLK
Netdelay CLK_1 startup.CLK 1.8
Program CLK_1 {428G19} {428G33} {428G40} {428G63}
NProgram CLK_1 col.M.local.4:startup.CLK MM.24.1.14 MM.24.1.3 col.M.local.4:LL.Y
Addnet CLK_2 bufgp_tl.O LL.G3 HE.K IE.K JE.K BE.K CE.K DE.K EE.K FE.K
Netdelay CLK_2 LL.G3 1.7 HE.K 1.7 IE.K 1.7 JE.K 1.7 BE.K 1.7 CE.K 1.7 DE.K 1.7
EE.K 1.7 FE.K 1.7
Program CLK_2 {165G134} {165G172} {165G210} {447G69} {447G238} {165G411} {165G373}
{165G335} {165G297} {165G259} {165G238}
NProgram CLK_2 col.E.long.4:JE.K col.E.long.4:IE.K col.E.long.4:HE.K
col.M.long.4:LL.G3 col.M.long.4:bufgp_tl.O col.E.long.4:BE.K col.E.long.4:CE.K
col.E.long.4:DE.K col.E.long.4:EE.K col.E.long.4:FE.K col.E.long.4:bufgp_tl.O
FIGURE 3-4: Excerpt from .lca file (contains routing and configuration data for a mapped design).
The xpx.pm program forms the heart of the analysis flow. The LCA file serves as the
only input to the script and is parsed in a line by line fashion. Several output files are
generated from the xpx.pm program and all begin with the design name that was specified.
A [name].dat file contains a wealth of intermediate processing information that is useful
for debugging purposes and obtaining more detailed information than is normally
produced in the standard program output. The [name].len file contains source, sink and
routed length information used by the distance.pm script. Net capacitance data resides in
the [name].net file and is used as an input to several other programs.
The processing performed by xpx.pm consists of a number of different tasks. One of
these is global entry translation. In order to construct a list of the interconnect resources
that a design uses, the programmable interconnect point (PIP) representations must be
translated into unique segment names. All nets in a given design are specified by a list of
PIPs which define the enabled path through the programmable interconnect network.
Each PIP defines the presence of a connection between two segments of interconnect.
The variety of syntaxes used to describe PIPs complicates the segment extraction and
translation task. However, once the translation has been finished, each segment entry is
represented by a globally unique name that contains information about that segment’s ori-
entation, array location index (row and column), type (single line, longlines, etc.) and
track value (1,2,3, etc.). As the xpx tool processes each net, the interconnect segments are
recorded allowing the data to be used in later functions.
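As an illustration of the translation step, the following sketch extracts segment names from an NProgram record using the syntax visible in Figure 3-4. This is a hypothetical re-implementation for exposition, not the actual xpx.pm code, and the regular expression covers only the row/col segment forms shown in the excerpt:

```python
import re

def net_segments(nprogram_line):
    """Pull the interconnect segment names out of one NProgram record.

    Entries like 'col.D.long.1:FD.G1' name a segment (before the colon)
    and a pin it connects to; entries like 'BC.24.1.9' are switch-matrix
    coordinates. Returns the unique segment names in order of appearance.
    """
    tokens = nprogram_line.split()[2:]        # skip 'NProgram <netname>'
    segs = []
    for tok in tokens:
        # A segment name looks like col.D.long.1 or row.B.local.4
        for m in re.findall(r'(?:row|col)\.\w+\.\w+\.\d+', tok):
            if m not in segs:
                segs.append(m)
    return segs

# Net B7 from the Figure 3-4 excerpt
line = ("NProgram B7 col.D.long.1:FD.G1 col.D.long.1:row.B.local.4-s "
        "BC.24.1.9 BC.24.1.20 BB.24.1.9 BB.24.1.2 col.B.local.3:PAD1.I2")
# net_segments(line) -> ['col.D.long.1', 'row.B.local.4', 'col.B.local.3']
```

Each extracted name already encodes the orientation (row/col), array index, type (long, local, etc.) and track value described above, so the subsequent energy lookup is a simple table mapping.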
The net length subroutines use the segment data to get an estimate of the length of
routed nets and their fanout. From the net specification, the source and a list of sinks can
be easily identified giving a measure of net fanout. However, the determination of routed
length is a bit trickier since nets can fork off the main net at any PIP point. This makes it
difficult to determine which segments are common among sinks and which are not; one
should remember that nets are stored in the LCA file just as a list of PIPs and thus very
little information about the actual path taken by the net is preserved. Despite this difficulty,
a routed length estimate for each sink was calculated from the combination of the segment
information and the source/sink data.
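The Manhattan lower bound used in this estimate reduces to simple coordinate arithmetic once source and sink CLB positions are known. A minimal sketch (the function name and coordinate convention are illustrative, not from the thesis scripts):

```python
def manhattan_lengths(source, sinks):
    """Per-sink lower bound on path length, in CLB pitches: the ideal
    Manhattan (|dx| + |dy|) distance from the source to each sink.
    The routed length recovered from the segment list can only be
    greater than or equal to this bound."""
    sx, sy = source
    return [abs(sx - x) + abs(sy - y) for x, y in sinks]

# Source at CLB (2, 3) driving sinks at (5, 3) and (2, 7):
bounds = manhattan_lengths((2, 3), [(5, 3), (2, 7)])  # -> [3, 4]
```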
The most important use of the interconnect segment data revolves around energy com-
putations. Each global segment name undergoes a mapping to an index name which
corresponds to one of the measured energies in Appendix A and summarized in Table 3-2
as discussed in Section 3.2.3. The energy contribution from CLB outputs, CLB inputs and
the internal CLB logic is then combined with the accumulated interconnect segment
energies to give the total energy required to charge and discharge each net. A simple
conversion to capacitance is performed using the clock frequency, supply voltage, and
voltage swing parameters. Once all nets are processed, the [name].net file is generated
which contains a list of all the netnames with their associated capacitance. In addition, a
preliminary energy report for the design is sent to the [name].dat file and STDOUT. This
report contains a complete breakdown of the overall design energy in terms of CLB energy
and energies from the various types of interconnect prior to activity weighting (all
components are treated with an activity of 1). General CLB usage and a detailed breakdown
of interconnect statistics are also provided in the listing. A sample output of the xpx.pm
script is shown in Figure 3-5.
After the xpx.pm program completes, the two auxiliary scripts, distance.pm and
avecap.pm, can be executed to give processed distance and net capacitance information.
The distance script examines the [name].len file and returns four values. The first pair of
numbers are the average and maximum fanout for all the nets in the mapped design.
These values give an idea of the typical interconnect fanout requirements that need to be
=================================================================
Results for: correl4b.lca
Number of CLBs used= 76   Number of IOBs used= 7
Number of switch matrix passes = 223
Energy in mW/MHz: Total= 54.31  Total(w/o IO)= 53.63  Total IO= 0.68
Fraction of power in CLBs vs. Interconnect = 6.66%
CLB function power = 3.57 mW/MHz
Local Wires=  222 lines consuming a total of 15.52 mW/MHz
Carry Wires=   15 lines consuming a total of  0.75 mW/MHz
Double Wires= 167 lines consuming a total of 14.50 mW/MHz
Long Wires=    38 lines consuming a total of  2.71 mW/MHz
Input Lines=  209 lines consuming a total of  9.38 mW/MHz
Output Lines= 153 lines consuming a total of  8.00 mW/MHz
Clock Lines=   76 lines consuming a total of  2.77 mW/MHz
F inputs= 71 lines
G inputs= 62 lines
C inputs= 76 lines
=================================================================
FIGURE 3-5: Output of xpx.pm in initial non-activity weighted analysis of Correl4b.
supported. The second pair of numbers are net length measures. Two net length measures
are computed to give an upper and lower bound on the average net length for a design in
terms of CLB pitches. The Manhattan distance gives a lower bound assuming ideal
connections on a Manhattan grid between the list of sources and sinks for the design. A
routed distance value comes from the computations performed while processing each net
in the xpx.pm script. The routed value will always be higher than the Manhattan value
because it takes into account the non-ideal routing patterns that resulted from architectural
and design restrictions during PPR.
The avecap.pm script generates three capacitance metrics. The average net
capacitance value shows how much capacitance a typical net sees. In addition, a
maximum and minimum net capacitance is recorded to give an idea of the bounds that
routed nets see for capacitive load.
Up to this point, none of the capacitance information obtained has been weighted by an
activity factor, and so a crucial component of dynamic power is still missing. To solve this
inadequacy, the Viewlogic Viewsim [41] gate-level simulator was added to the analysis
flow. By performing a simulation of the design using random input data, a first order
approximation of transition activity is obtained for all nets in the mapped design. The
net2cmd.pm script automates some of the functions necessary to generate a cmd file for
Viewsim. The most important task that net2cmd performs is to enable the tracking of all
the nets found from the xpx program. The list of nets to be tracked will be much smaller
than the original netlist for the design because groups of logic become subsumed inside
the CLBs. Often, new nets are generated from the mapping process that were not present
in the original schematic for the design. However, this is always the result of a duplication
and buffering operation by the routing tools. As a result, this case is detected by net2cmd
and handled so that the correct activity factor for the new nets is appropriately assigned.
Lastly, the simulation produces a trace file containing all the transitions for the routed nets.
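The activity weighting can be sketched as follows. The convention below (counting a full charge/discharge cycle as trace transitions divided by two) is one reasonable reading of a first order activity factor, not necessarily the exact formula used by activity.pm:

```python
def activity_factors(transition_counts, n_cycles):
    """First-order activity per net: full charge/discharge cycles per clock
    cycle, taken as (trace transitions / 2) / cycles simulated. Under this
    convention a clock net has activity 1."""
    return {net: t / 2.0 / n_cycles for net, t in transition_counts.items()}

def weighted_power_mW(activities, net_caps_pF, vdd=5.0, vswing=5.0, f_MHz=4.0):
    """Activity-weighted dynamic power: P = sum_i a_i * C_i * Vdd * Vswing * f.
    With C in pF, V in volts and f in MHz the product is in uW, hence /1000."""
    return sum(activities[n] * net_caps_pF[n] * vdd * vswing * f_MHz
               for n in net_caps_pF) / 1000.0

# Hypothetical trace: 100 simulated cycles, a clock net and one data net
acts = activity_factors({'clk': 200, 'netA': 50}, n_cycles=100)
p = weighted_power_mW(acts, {'clk': 4.0, 'netA': 2.0})  # -> 0.45 mW
```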
After computing activity factors from the Viewsim trace file, activity.pm generates an
estimate of the total power consumption for the design being analyzed. The value
produced does not include the power contribution from the logic within the CLBs;
however, the logic power is again factored into the post-processing step described later.
Despite leaving out the logic power, the power estimate given by activity.pm is accurate
because nearly all PGA power is consumed by the interconnect as will be explained in a
later section. The total clock power for the design is also reported. Along with the power
numbers, the average activity of the circuit is given as is the average switched capacitance
of the nets.
In a final post-processing step, the xpx program is re-run to give an activity weighted
breakdown of CLB, I/O, and interconnect power. A sample output after the post-
processing run of xpx appears in Figure 3-6.
============= Activity Weighted Final Results for correl4b.lca =============
Post Processed Results for: correl4b.lca
Number of CLBs used= 76   Number of IOBs used= 7
Number of switch matrix passes = 223
Power in mW: Total= 71.65  Total(w/o IO)= 70.14  Total IO= 1.51
Fraction of power in CLBs vs. Interconnect = 3.65%
CLB function power = 2.56 mW
Local Wires=  consuming a total of 10.92 mW
Carry Wires=  consuming a total of  0.01 mW
Double Wires= consuming a total of 16.60 mW
Long Wires=   consuming a total of  2.90 mW
Input Lines=  consuming a total of 12.12 mW
Output Lines= consuming a total of  6.22 mW
Clock Lines=  consuming a total of 21.36 mW
=================================================================
FIGURE 3-6: Output from xpx.pm on post-processing pass of Correl4b.
Section 3.4: Results & Recommendations 38
3.3.3 Summary of Profiler Outputs

Figure 3-7 summarizes the complete set of outputs produced by the profiling scripts,
along with the script that generates each metric.
3.4 Results & Recommendations
From running several designs through the power analysis flow, a number of interesting
results arose leading to important considerations for low power PGA design. In addition,
error sources in the power analysis flow were carefully examined.
3.4.1 Benchmark Data
The distilled results from running the analysis tools on a set of 36 mapped designs are
presented in the series of graphs below. The 36 designs used for the data are
composed of 27 Xilinx macros for common functions and 9 larger designs. The set of
designs give a good cross-section of the probable make-up of a PGA design. In all cases,
the designs were analyzed without any outputs driving chip pins. This ensures that the
power contribution from driving board capacitances does not get factored into the power
profile of the internal PGA circuitry. All graphs represent average values across all 36
designs. The averages from only the large designs did not show much difference from the
whole group and so no distinction was made in the graphs. Lastly, the complete set of data
for all the analysis metrics can be found in Appendix B.
• Total Design Power (xpx.pm post-process)
• I/O, CLB Function, Interconnect Power Breakdown (xpx.pm post-process)
• Interconnect Power Breakdown (Single, Double, Long, CLB input, CLB output) (xpx.pm post-process)
• Interconnect Resource Usage Breakdown (xpx.pm)
• Clock Power (activity.pm)
• Average # of Inputs Used per Function Block (xpx.pm)
• Max, Min, Ave Net Capacitance (avecap.pm)
• Ave, Max Signal Fanout (distance.pm)
• Ave Manhattan and Routed Net Length (distance.pm)
• Ave Switched Capacitance (activity.pm)
• Various Other Design Metrics (see appendix)

FIGURE 3-7: Profiler Outputs
FIGURE 3-8: Power Breakdown for Xilinx XC4003A
FIGURE 3-9: Interconnect Power Breakdown by Resource for XC4003A
3.4.2 Conclusions
The wealth of data provided by the power analysis scripts described above allowed the
construction of a PGA power profile for designs mapped to Xilinx devices. From the
profile data, various conclusions could be made about PGA power consumption. These
insights proved invaluable in guiding the development of a low power PGA design.
The most important result from the power analysis studies was the overwhelming
dominance of interconnect in a design’s power consumption. In all cases, at least 65% of
a design’s power is dissipated in the collection of interconnect resources and logic cell
interface circuitry that a design utilizes. The category of interconnect resources
encompasses single length, double length, carry chains, and longlines. Interface circuitry
includes the input multiplexers used to select CLB inputs from the wiring tracks (an
essential part of the interconnect structure) and the output buffers used to drive signals
onto the interconnect fabric. Astoundingly, the logic circuitry that actually implements
the desired functionality consumes a minimal amount of power. A good way of thinking
about the situation is to treat the CLBs as islands in a sea of very expensive interconnect
resources. As soon as a signal leaves an island, it immediately drowns in the capacitance
FIGURE 3-10: Interconnect Wiring Resource Usage
discontinuity caused by the general purpose routing network. Therefore, any attempt to
reduce PGA power must focus on how to minimize interconnect power.
Another surprising fact discovered by the power analysis was the relationship between
local wiring and long-distance wiring. On the level of a mapped design, local interconnect
dissipated ten times more power than long-distance interconnect. In making this
comparison, single and double line segments were considered local resources and the
longlines made up the long-distance resources. One can understand such a result from an
analysis of design properties and the baseline component capacitance measurements. As
discussed in an earlier section, the local and long-distance wiring resources have
comparable capacitances. However, in typical designs local wiring usage and activity is
much higher than long-distance wiring. As a result, the aggregate capacitance of the local
wires amounts to about 10 times more than the long-distance wires. Clearly, the property
of spatial locality inherent in designs drives this imbalance in wire usage. One might
wonder how increasing array size would affect the local vs. long-distance relationship.
Although long-distance wiring requirements will grow with increasing array size, so too
will the amount of local wiring to accommodate the increased number of cells. Thus,
from this very simplistic argument, one can be sure that optimization of local connection
resources is and will continue to be a crucial issue in low power PGA design.
In addition to the realizations made about the power consumed in routing resources,
the importance of the logic cell interface circuitry should not be overshadowed. Every
time a logic cell receives an input, the signal must be selected from a number of possible
tracks. In the architecture studied, anywhere from 8 to 16 possible inputs exist per CLB
input. As a result, a signal must pass through a large multiplexer tree and thus switch
significant capacitance. Power from the CLB input paths consistently accounted for 15%-
25% of total interconnect power. Architectural and circuit design decisions can have a
large impact on the composition of the logic cell input interface and thus, PGA power can
be further reduced by paying attention to logic cell interfaces.
Clock power has long been the Achilles’ heel of low power design and PGAs are no
exception to this rule. Except for data signals which may experience excessive glitching,
the clock signal possesses the highest transition activity. As a result, any capacitance due
to clock distribution and clocked transistor gates generally factors much more heavily in
total power consumption than other individual nets. In the PGA design space, the clock
contribution to power was seen to vary over a large range. Many designs are entirely com-
binational whereas some are almost solely state-based. The data for percentage of total
power due to clocks reflects this variation as some designs have essentially zero clock
power and others consume nearly 50% of their power in the clock network. Careful
thought about conditional clocking, the style of flip-flop and the chosen supply voltage
will most likely have the greatest impact on reducing clock power for a low power PGA
design.
Some general statements about the effect of bit width on power consumption can also
be made from the collected data. For datapath operators like adders, comparators, shifters,
and accumulators, power tends to scale linearly with bit width. This effect is a direct
result of the linear scaling in the number of CLBs required, and thus in the number of
interconnections. Often, designs containing a clock component will have some fixed offset
factors from clock distribution resource usage. Because clocks are enabled on a per
column basis, the clock power contribution for distribution resources occurs in discrete
amounts depending on which columns need a clock. Therefore, designs with a large clock
power component do not depend as strongly on the number of CLBs used, but rather on
the clock distribution resources that have been enabled.
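The discrete, per-column behavior can be modeled directly from the Table 3-2 measurements (6.4 pF per clock column distribution wire, 1.5 pF per CLB clock connection). The sketch below is a simplified illustration, assuming the clock completes one full charge/discharge cycle per period:

```python
def clock_power_mW(cols_enabled, clb_clock_pins, f_MHz,
                   col_wire_pF=6.4, pin_pF=1.5, vdd=5.0, vswing=5.0):
    """Discrete clock-power model: each column containing any clocked CLB
    pays for one full distribution wire, and each clocked CLB adds one
    clock-input load. P = C_total * Vdd * Vswing * f_clk, in mW."""
    cap_pF = cols_enabled * col_wire_pF + clb_clock_pins * pin_pF
    return cap_pF * vdd * vswing * f_MHz / 1000.0  # pF*V^2*MHz -> uW -> mW

# 20 flip-flops at 4 MHz: spread over 3 columns vs. packed into 1 column
spread = clock_power_mW(cols_enabled=3, clb_clock_pins=20, f_MHz=4.0)  # ~4.92 mW
packed = clock_power_mW(cols_enabled=1, clb_clock_pins=20, f_MHz=4.0)  # ~3.64 mW
```

Note that the model depends only weakly on the number of clocked CLBs but jumps by a full column-wire increment whenever a new column is enabled, which is the discreteness described above.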
Another interesting result of the analysis was the effect of structured vs. unstructured
placement of logic on power consumption. In datapath-oriented designs, data signals
generally flow in one direction and fanouts are very low (1 or 2). In other words,
connectivity resides mostly on a cell-to-cell basis, leading to high spatial locality.
Two versions of a correlator design [26] were analyzed (Figure 3-11). The first was
placed entirely by the simulated annealing algorithm in the Xilinx software tools. The
second design was manually placed such that dataflow was oriented from column to
column of cells, with each row of cells forming a bitslice. The resulting power
numbers revealed a minimal improvement for the structured mapping over the random
one.
Two conclusions can be drawn from this result. First, although simulated annealing
usually upsets the inherent structure in many designs, cost-function based placement does
do a good job of optimizing overall power for a Xilinx architecture. However, the second
conclusion is that the architecture is not designed to exploit the benefits of locality.
Supporting this conclusion is the fact that the average net capacitance for both designs was
about the same. One would expect that a structured design, given a suitable architecture,
would have lower net lengths and in turn lower power than a design using a more
"random" placement. It turns out that the minor decrease in power for the structured design
was due to a reduction in clock capacitance, which, because of its high activity, weighs heavily
in the correlator's power. In actuality, the average net capacitance increased slightly for
the structured design, but the low average net activity (excluding clocks) of 0.05 kept this fact
from influencing total correlator power.

TABLE 3-3 Estimated Correlator Capacitance Breakdown
Component Class            Structured   Unstructured
Total Power                38.4 mW      42 mW
Single Line Capacitance    712 pF       666 pF
Double Line Capacitance    725 pF       535 pF
Long Line Capacitance      105.5 pF     55.5 pF
Clock Capacitance          231 pF       248 pF
CLB Input Capacitance      470 pF       455 pF
CLB Output Capacitance     400 pF       385 pF
Average Net Capacitance    13.9 pF      12.5 pF

FIGURE 3-11: Correlator Schematic (DATAIN clocked at CLK = 64 MHz feeds gated-clock positive and negative accumulators, POSACC and NEGACC; sum/carry chains produce CORROUT at OCLK = 1 MHz)
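The interplay of capacitance and activity can be checked with a back-of-the-envelope calculation using the Table 3-3 capacitances for the structured design. The activity model is a simplification of mine, not the thesis's: the clock is assumed to toggle every cycle (activity 1) and all other nets to switch at the reported average activity of 0.05.

```python
# Activity-weighted switched capacitance per cycle for the structured
# correlator, using Table 3-3 values. Activity model is a simplification:
# clock at activity 1, all other nets at the reported average of 0.05.

cap_pf = {                 # structured-design capacitances from Table 3-3
    "single": 712, "double": 725, "long": 105.5,
    "clb_in": 470, "clb_out": 400,
}
clock_pf = 231
data_activity = 0.05

data_switched = sum(cap_pf.values()) * data_activity
clock_switched = clock_pf * 1.0
clock_share = clock_switched / (clock_switched + data_switched)

print(f"data: {data_switched:.1f} pF/cycle, clock: {clock_switched:.1f} pF/cycle")
print(f"clock share of switched capacitance: {clock_share:.0%}")
```

Under this simplified model the clock accounts for roughly two-thirds of the switched capacitance per cycle, which is consistent with the clock capacitance reduction dominating the power difference between the two placements.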
Many other insights can be gained from perusing the analysis data. A complete listing
appears in the appendix to this work. However, one additional result that deserves
mention concerns net capacitance. The net capacitance refers to all the capacitance from
the buffered CLB output, along the routed interconnect path, to the CLB input
multiplexers. In the subset
of designs that were analyzed, some extremes of net capacitance were identified. The
highest values measured 30-40 pF and the lowest came in at 4 pF. This huge variation in
net capacitance poses serious problems from a circuit design standpoint. CLB output
buffers must be oversized to drive worst-case loads, resulting in wasted capacitance when
the post-PPR nets carry lighter loads. Clearly, some effort to control the range
of net capacitances, together with a buffering scheme, must be included in a low power PGA design.
As a final note, the average net capacitance of a design provides an excellent metric for
evaluating the quality of an architecture from a power perspective. Both the impact of
mapping efficiency (exploiting locality) and the intrinsic design capacitance are captured
in the average net capacitance metric. For the set of designs analyzed, the net capacitance
statistics appear in Table 3-4. Once again, the internal capacitances of a PGA represent a
substantial increase over typical VLSI designs. Furthermore, the high variance in net
capacitance presents a significant design challenge: because the distribution of high- and
low-capacitance paths is not known a priori, some attempt to bound net capacitances
must clearly be made.
TABLE 3-4 Net Capacitance Statistics
Net Capacitance Metric   Capacitance (pF)
Average                  13.76
Standard Deviation       4.0
Median                   13.73
3.4.3 Error Sources
Overall, the accuracy of the profiler results was sufficient for learning the power
properties of the Xilinx FPGA. However, in some cases, the error between estimation and
measurement became as high as 30%. Thus, a firm understanding of the error sources is
necessary to gain confidence in the validity of the results. Several experiments were run to
determine what the major sources of error were. From these experiments, essentially all
the error could be accounted for by considering three different areas: 1) periphery effects,
2) activity estimation errors, and 3) interconnect capacitance values.
3.4.4 Periphery effects
As a result of the non-regular structure of a PGA at the array boundaries, some
interconnect segments are not properly included in the interconnect estimation. Throughout
most of the array, the interconnect forms a very regular pattern, and because of this, the
types of interconnect can be uniquely identified. However, along the edges of the array
the regular layout breaks down and several ‘special’ cases exist. Other than hardcoding
each periphery case (there are many), there is no way to determine the correct capacitance
contribution from these resources. Fortunately, the contribution of these resources is fairly
small when taken together with an entire design's resources. Typically, the resulting error
is less than a few percent.
3.4.5 Activity estimation
Inaccuracy of net activity values was the predominant source of error for the
estimation. By carefully isolating the output nodes of CLB outputs, and determining their
energy contribution, a measurement of the node activity could be made. Several such
measurements were taken and compared with the activity values given by the gate-level
simulation. Despite the use of back-annotated timing information for the RC based
simulation, the measured activities were found to be higher than those achieved from the
simulation.
Section 3.5: Triptych Analysis 46
The impact of inaccurate net activities is quite severe. First, the CLB energy component
should be slightly higher due to the increased internal activity; second, the
interconnect contribution needs to be increased. Since the interconnect
component dominates the switched capacitance of the design, most of the difference in
measured and estimated power numbers comes from interconnect. Although the range of
activity inaccuracy was highly design dependent, in the cases looked at, an activity
correction could account for all but about 10% of the estimator error. Specifically, in the
case of the adders, the adjusted weighting of interconnect and CLB components to
compensate for the underestimate of activity allowed much closer agreement with
measured power numbers.
3.4.6 Interconnect cap values
After eliminating errors due to activity, the remaining estimates were still low by about
10%. This residual error was traced to the accuracy of the interconnect contribution. One
possible reason is the impact of cross-coupling components, which would make the
baseline capacitance measurements appear slightly lower. Irrespective of the exact source,
the fact that the error lies in the interconnect is all that matters for the purposes of
this study, since the conclusions are unaffected.
3.5 Triptych Analysis
In addition to the detailed analysis of a Xilinx XC4000 PGA architecture, the Triptych
[14] design from the University of Washington offered further insight into power
consumption in PGA designs. There were two main reasons for choosing Triptych as
another design to be studied. First, Triptych represents a much different style of PGA
architecture from the Xilinx approach. As was discussed in Section 2.5, the design makes
extensive use of more dedicated diagonal neighbor connections, instead of using a
completely general purpose sea of interconnect. Thus, a first order examination of the
impact of differing architectural characteristics could be achieved. Secondly, the layout of
a Triptych test chip was available1 for detailed analysis. In fact, this was the only design
that was studied where extracted physical capacitances could be determined. Based on the
capacitance data and circuit structure information from the layout, some important power
considerations were discovered.
1. Special thanks to Scott Hauck for offering Triptych layout for this study.
FIGURE 3-12: Triptych Routing Scheme and Cell Routing Paths (logic cells, loop-back paths, and vertical tracks, not all of which are shown; thru-cell routing paths include center, diagonal, and center-to-diagonal paths)
FIGURE 3-13: Triptych Logic Cell (a 3-LUT feeding a D flip-flop)
Before stating the conclusions about capacitance, some general information about the
Triptych design should be covered. The layout consisted of a 4x4 array of basic cells. A
detailed description of a basic cell appears in Section 2.5, but a diagram is shown here as
Figure 3-13 for convenience. The layout was drawn for a 1.2um, 2-layer metal process
operating at 5 volts. As a result, some area goes to routing the interconnection network,
leading to a lower density than could be achieved in a more advanced three-layer metal
process. Table 3-5 displays the measured areas of important subsections of a basic cell.
The LUT transistors, which perform the desired logic function, take only 2% of the total
cell area. On the other hand, the configuration storage requires about 25% of the cell
area, as do the output vertical track drivers. Typical delay performance among the most
common paths appears in Table 3-6, as obtained from [14].
In terms of the capacitance analysis of the Triptych design, the heaviest loading
appears on the vertical track interconnect. This is not too surprising since these lines are
basically highly shared busses. The worst case for capacitive loading is found on the
longest segments which are fed from one of 16 inputs and can be tapped off to 16 other
cells. Any signal using this resource will see approximately 5pF of capacitance. The
shorter vertical tracks are not as severe although they still see loads of about 1.7pF for the
length 8 line and 330 fF for the length 4 line.

TABLE 3-5 Triptych Cell Area
Cell Component    Area (um²)
Total             ~100,000
3-LUT             2,200
Config Memory     22,000
Output Drivers    24,000

TABLE 3-6 Triptych Delay Performance
Resource                                              Delay (ns)
Routing Path thru Cell                                1.6
Function Computation Path (in addition to routing)    2.2
Channel Wire                                          2.5-3.7

Two observations can be made from these results. First, the large fanout of highly shared
bus structures causes very high capacitances. By supporting the ability of any cell along
the path of the long wires to connect directly to that bus, the exposure to the added
diffusion capacitance of several large output driver transistors is unavoidable. In addition,
the 4, 8, 16 progression in track length and fanout does not result in a linear increase in
net capacitance. This is an important conclusion because it implies that the cost of
supporting highly populated1 busses is much higher, due to the necessity of increasing
driver size (a significant component of the total capacitance) to compensate for some of
the loss of speed on the more heavily loaded bus. Although the actual implementation of
a design can have a significant impact on the actual growth in capacitance, it is safe to
say that the two components of fanout and increased driver size combine to cause at least
a super-linear increase in capacitance with track length. Therefore, a low power design
should endeavor to minimize the population on busses so that such a penalty in
capacitance can be avoided while still preserving an overall efficient routing topology.
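The super-linear growth can be seen directly from the track loads quoted above (330 fF, 1.7 pF, and 5 pF for lengths 4, 8, and 16):

```python
# Capacitance vs. track length for the Triptych vertical tracks, using
# the measured loads quoted in the text. Dividing by track length shows
# the growth is super-linear: doubling the length more than doubles the
# load, because added taps and larger drivers compound.

track_cap_ff = {4: 330, 8: 1700, 16: 5000}

for length, cap in track_cap_ff.items():
    print(f"length {length:2d}: {cap} fF total, {cap / length:.1f} fF per cell spanned")

# A linear model would give a ratio of 2.0 at each doubling.
print(track_cap_ff[8] / track_cap_ff[4])    # ≈ 5.2
print(track_cap_ff[16] / track_cap_ff[8])   # ≈ 2.9
```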
The relationship between the capacitances of the dedicated, diagonal interconnect
paths and the shared vertical tracks provides another insight into low power PGA design.
The diagonal paths feed each cell with two of the three inputs and can only be derived
from two possible sources each. Thus, the capacitance on these paths is lower than the
vertical tracks and consists largely of wiring parasitics. A typical diagonal routing path
will see between 200 fF and 300 fF of load, and would carry significantly less without the
wiring capacitance.

TABLE 3-7 Triptych Cell Capacitancesa
Signal Path            Path Capacitance (pF)
Center to Diagonal     0.51-0.60
Diagonal to Diagonal   0.29-0.46
Center to 4x Track     4.3
Center to 8x Track     5.8
Center to 16x Track    9.0
a. Estimates from layout extraction and transistor capacitance calculations; track paths do not include track wiring capacitance.

1. Population is a term used to indicate the degree to which a long wiring segment is accessible to the cells which it traverses. A highly populated segment offers direct connections to all cells, whereas a minimal population would just have connections at the segment's endpoints.

From a power standpoint, the presence of such low capacitance interconnect resources is
attractive. However, an important trade-off must be observed. For
the direct interconnects to truly yield a benefit, they must be utilized often and the
overhead of supporting them in the design must be minimal. As a side note, the more
dedicated interconnect will also be much faster than the general interconnect because of
its lower capacitance and path resistance.
Another interesting result of the Triptych capacitance analysis concerns the output
path of a cell (Figure 3-14). A large driver chain feeds the heavily loaded vertical track
drivers and supports the fanout to the many possible tracks to which the output can be
directed. The total capacitance of this output chain approaches 4pF. However, although
this is a large capacitance, the critically important point is that all that capacitance is
switched no matter which track is being used. For example, if a length 4 line (~330fF) is
driven, an additional 4pF also toggles resulting in a huge overhead. Triptych was not
optimized for low power, but by studying it, the problem of inefficiency in output fanout
chains in PGA cells has been highlighted. A low power approach to PGA design must pay
attention to minimizing the significant waste associated with logic cell fanout.
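The overhead arithmetic above is worth making explicit; the ~4 pF chain and 330 fF track figures are the ones quoted in the text:

```python
# Overhead of the Triptych output driver chain (Figure 3-14): roughly
# 4 pF of chain capacitance toggles no matter which track is selected,
# so for a short length-4 track the fixed chain dominates the useful load.

chain_pf = 4.0    # fixed output-chain capacitance quoted in the text
track_pf = 0.33   # length-4 vertical track load (~330 fF)

total = chain_pf + track_pf
overhead = chain_pf / track_pf
print(f"total switched: {total:.2f} pF, overhead ratio: {overhead:.1f}x")
```

Roughly twelve units of capacitance toggle for every one unit of useful track load, which is the inefficiency the text highlights.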
Consequently, the unique opportunity to examine the layout of a PGA design led to
many further insights into the issues and areas for improvement regarding low power PGA
design. Due to the architectural differences between the Triptych design and the Xilinx
series designs, many new observations could be made. Most importantly, the cost of
highly shared busses, the relative benefits of dedicated interconnect, and the waste
associated with cell fanout paths were discovered.

FIGURE 3-14: Triptych Center Output Driver Chain (the center output fans out to the 16x, 8x, and 4x track drivers, with loads of roughly 5 pF, 1.7 pF, and 330 fF; equivalent chain capacitance Cequiv = 3.7 pF)
In summary, the PGA power analysis proved valuable from many different aspects.
The measurements of the internal capacitances of a Xilinx chip have begun to open the
black box of PGA power consumption. Both the absolute data and the relative comparisons
of PGA capacitances have allowed several insights to be formulated. More specifically, the
data revealed the importance of fanout capacitance versus wiring parasitics. In the later
chapters that discuss design issues, reduction of fanout capacitance will be crucial. In
addition, an analysis methodology and associated tools for profiling PGA power were
developed, enabling a more comprehensive study. Finally, results from the
Xilinx estimations and the Triptych study led to further insights that guided the
specification and design of a low power PGA device.
CHAPTER 4
Application of Low Power
Techniques to PGA Design
The growing crisis of managing power consumption on VLSI chips has prompted a
large amount of research resulting in the development of a number of techniques effective
at reducing power. Despite the diversity of available power minimization methods, they
all follow three general themes: 1) performance/area trade-offs, 2) reducing waste, and 3)
exploiting locality.
A common example of a performance tradeoff is the reduction of supply voltage to
achieve a quadratic improvement in dynamic power. Unfortunately, the power
improvement comes at the cost of increased delay and thus a linear drop in performance.
Similarly, an area/power trade-off results when moving to a parallel implementation. By
duplicating functional units, slower logic paths can be used at the expense of increased
area for redundant logic.
The second theme for reducing power consumption involves reducing waste.
Although a simple idea, eliminating waste can have a considerable impact. Typical
examples include turning off clocks to unused functional units, minimizing transistor
parasitic capacitance on non-critical paths, and using dedicated rather than programmable
hardware.
Section 4.1: Hierarchy of Low Power Techniques 53
The third recurring theme in low power design is exploiting locality. Locality is a
natural result of complex design and implies that hardware communication is highest
among neighboring modules. As a result, an efficient partitioning of a design can reduce
bus lengths and capacitive load, thereby reducing power consumption.
When attempting to achieve a low power design, the various techniques available must
be weighed against their suitability and applicability to the design problem one is dealing
with. In the case of PGAs, the design constraints are quite different from most IC design
environments. In many cases, the unique considerations of the PGA rule out the use of
often used low power techniques. Thus, some of the challenges in designing a low power
PGA are to determine which low power techniques can be utilized, which techniques will
provide the largest impact, and how they can be efficiently incorporated into the design.
These questions will be explored in the subsequent sections.
4.1 Hierarchy of Low Power Techniques
As mentioned above, several techniques have been developed to reduce power
consumption in IC designs (Figure 4-1). The methods can be organized in a hierarchy
starting with the highest level of design, the algorithmic level, and proceeding downward
through the layers of specification until the physical layout is reached. As each case is
presented, the specific constraints imposed by the PGA environment will be explained and
the applicability of the techniques will be judged.
4.1.1 Algorithmic
The algorithmic level of design offers the largest degree of flexibility and thus
provides the greatest amount of leverage on the resulting power consumption for a design.
Depending on the application, several optimizing transforms can be performed to reduce
power such as parallelizing and constant propagation.
Converting a design to exploit parallelism is one of the most effective tools in low
power design. By replicating functionality N times, a specified throughput can be
delivered at a cycle time N times longer. Although the overall capacitance is increased by
approximately N, the relaxed timing constraint allows the supply voltage to be dropped
by roughly a factor of N, and thus leads to an overall N² decrease in dynamic power.
Unfortunately, parallelizing cannot be employed when designing a PGA because the
application is not yet defined.
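The N² claim follows from a first-order model. The sketch below assumes delay scales inversely with supply voltage, ignoring the velocity-saturation and threshold effects discussed later in this chapter:

```python
# First-order model behind the N^2 claim for parallel implementations:
# duplicate the unit N times, clock each copy N times slower, and
# (assuming delay ~ 1/Vdd, a simplification) lower the supply by a
# factor of N to just meet the original throughput.

def parallel_power_ratio(n):
    cap = n          # total capacitance grows ~N (duplicated units)
    freq = 1.0 / n   # each copy runs N times slower (same throughput)
    vdd = 1.0 / n    # supply scaled down by N under the linear-delay model
    return cap * vdd ** 2 * freq   # P ~ C * V^2 * f, relative to N = 1

print(parallel_power_ratio(1))  # 1.0
print(parallel_power_ratio(2))  # 0.25
```

In practice the voltage cannot be scaled this aggressively once Vt effects appear, so the 1/N² figure is an upper bound on the achievable savings.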
Another common technique to lower power involves converting a multiplication
operation to an add and shift. Designs mapped to PGAs frequently utilize this method
since the user knows the exact algorithm to be implemented. As a result, multiplications
by constants can be simplified by using hard-wired shifts and adds giving a reduction in
hardware complexity leading to power savings over a completely general purpose
multiplier. However, add-shift transformations, like parallelism, can only be applied by
the PGA user during the mapping and implementation process, not by the PGA
designer. When designing a general purpose PGA, one needs to look at the class of
applications to be supported, but cannot rely on any particular application's properties to
optimize against. Thus, the low power tools available to a PGA designer are more restricted
and begin at the architectural level.
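The add-and-shift transformation can be illustrated in software. This is a generic sketch of the decomposition, not tied to any particular mapping tool:

```python
# Multiplication by a constant decomposed into shifts and adds: each
# set bit of the constant becomes one hard-wired shift, and the shifted
# terms are summed.

def const_multiply(x, constant):
    """Compute x * constant using only shifts and adds."""
    result = 0
    shift = 0
    while constant:
        if constant & 1:
            result += x << shift   # one shift + one add per set bit
        constant >>= 1
        shift += 1
    return result

# 10 = 0b1010 -> (x << 3) + (x << 1): two shifts and one add replace a
# general-purpose multiplier.
print(const_multiply(7, 10))   # 70
```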
FIGURE 4-1: Hierarchy of Power Reduction Domains (Algorithmic, Architectural, Logic, Circuit, Layout; leverage increases toward the algorithmic level, general-purpose applicability toward the layout level)

4.1.2 Architectural
The technique of pipelining a datapath is often used to achieve a higher clock
frequency. Nevertheless, pipelining can also be thought of as a way to reduce power. Due
to the increase in achievable operating frequency, the supply voltage can be reduced until
the original cycle time is met again thus realizing a quadratic improvement in power.
Although the PGA designer is once again limited by the late binding of the actual
implementation, as in the discussion of parallelizing, the pipelining case differs in that a
PGA can be architected to allow easy use of pipelining if chosen by the user. Examples of
architectural features that facilitate the incorporation of pipelining are
the ability to register all outputs1 and an underlying interconnect structure that supports a
datapath style of placement.
In many IC designs, busses account for a significant portion of the power consumption.
The high fanout and long wires associated with global bus structures lead to large
capacitances and high activity. PGAs are no exception to this trend. As shown in the
analysis of Section 3.4.1, the interconnect accounts for an overwhelming fraction of total
power consumption. In fact, the PGA case is even worse because even short wires carry
large capacitance, due to the extremely high programmable fanout on all wires. In addition,
the previous analyses have shown that most PGA wiring is local in nature (a likely by-
product of spatial locality in designs). Therefore, the low power technique of replacing
high fanout busses with more local, dedicated busses should definitely be exploited in the
PGA environment. The difficulty, however, is to keep a reasonable measure of flexibility
to preserve the general purpose capability of the device and to avoid an explosive growth
in area, since wiring and programmable switches already dominate array size.
4.1.3 Logic
The next level in the hierarchy of low power techniques resides at the logic layer.
Several important choices such as logic block composition and mapping will impact
overall array power consumption.
1. The Xilinx series contains special carry chain paths which do not allow easy latching of the carry bit, which is often necessary to implement carry-save adders as well as other designs.

The decision of logic block composition has been studied in great detail. In several
papers by Rose [29], [31], the impact of logic block functionality on area and speed
has been examined. However, the metric of power has not been explored. In order to
gain an intuitive understanding of the role logic block composition plays in power
consumption, one can consider the following thought process. From previous studies [29],
as the complexity or size of a logic block increases, the number of inputs and outputs to
that block must increase. From a power standpoint, this means more connections to the
interconnect must be made, resulting in higher per-wire capacitance. Despite the increase in wiring
capacitance, more logic can be subsumed within a logic block. This fact helps to offset
the higher capacitance penalty of the general interconnect network. Finding the optimum
granularity of the logic block rests with the resulting level of utilization that can be
achieved across typical designs. A complete exploration of this trade-off was beyond the
scope of this thesis, but work in [12] has begun to shed some light on the proper logic
block from a power minimization standpoint. The conclusions from the granularity study
reveal that a 5 input LUT with 2 outputs works well for datapath intensive operations. For
this work, a 3 input LUT was chosen as a reasonable middle ground based on the power
results from an earlier fine-grain design [20] which showed that even smaller logic blocks
tend to be at a disadvantage from a mapping perspective.
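The chosen 3-input LUT can be modeled as eight configuration bits addressed by the inputs. The sketch below (my illustration, not the thesis's circuit) shows how one such block covers any Boolean function of three variables:

```python
# A 3-input LUT modeled as eight configuration bits: the inputs select
# one bit of the truth table, so any Boolean function of three
# variables fits in a single logic block.

def make_lut3(truth_table):
    """truth_table: 8 bits, indexed by a*4 + b*2 + c."""
    assert len(truth_table) == 8
    return lambda a, b, c: truth_table[(a << 2) | (b << 1) | c]

# Configure the LUT as a full-adder sum: s = a XOR b XOR c.
xor3 = make_lut3([0, 1, 1, 0, 1, 0, 0, 1])
print(xor3(1, 0, 1))   # 0
print(xor3(1, 1, 1))   # 1
```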
The logic level of design in a PGA is closely related to the mapping process which
determines how the actual hardware is allocated to the PGA resources. Logic optimiza-
tion at this stage allows judicious trimming of unused or redundant functionality in
accordance with the low power theme of reducing waste. Another way in which mapping
impacts power is through the placement process. The degree of locality in a mapped PGA
design is subject to the efficiency of the logic partition, placement, and routing functions.
Some work has been performed in this area and has shown that by weighting these
functions toward power, reductions of 10%-40% are possible [10], [18]. This is a
surprising result considering that designs are known to possess locality. However, the
examination of Xilinx power contributors revealed that local wiring is just as costly from a
power standpoint as is long-distance wiring. This leads one to believe that current archi-
tectures are not designed to exploit locality. Thus, a low power PGA design must pay
stricter attention to the inherent structure in designs by providing a more natural fit to the
underlying PGA architecture. Furthermore, by adding a cheaper local interconnect the
benefits of locality can be translated into greater power savings.
4.1.4 Circuit
At the circuit level of design, many choices can impact power consumption. As was
noted in previous sections, the flexibility offered by a PGA is strongly dependent on archi-
tecture. However, flexibility is one of the PGA’s most attractive assets and thus, should
not be unduly sacrificed. In some sense, the circuit level becomes the most appropriate
point to introduce low power techniques because circuit optimizations are relatively
orthogonal to the architecture of a PGA. Therefore, the impact of circuit design is well
worth looking into.
The choice of logic style often accounts for a 10%-50% difference in power
consumption, and in some cases, such as dynamic logic, a factor of two is common.
Dynamic logic is often thought to offer low power benefits along with an increase in
performance. Although dynamic gates have reduced input and output capacitance, since
they lack all but one PMOS device, the increased activity of the gate due to the precharge
clock outweighs the reduction in capacitance. As a result, dynamic gates are generally not a low
power logic style. More importantly, the incorporation of dynamic circuitry into a PGA
design is non-trivial. As noted in an earlier section, an important parameter left to the
user’s discretion is the clocking in a design. Without a pre-defined clock period to work
from and a known logic depth, the timing of dynamic precharge and evaluate cycles
cannot be designed. In addition, one must remember that a PGA will be used for
implementing many different types of designs. Thus, the user will most likely choose the
clock frequency that is optimal with regard to power and performance.
Clearly, the necessity of supporting operation at a wide range of frequencies coupled with
unknown logic depths precludes the use of any dynamic logic.
Fortunately, static CMOS and pass transistor logic offer a lower power alternative to
dynamic logic. Many studies have attempted to determine whether static CMOS or pass
transistor logic is superior as a low power logic style [50]. Based on these studies,
the answer seems to be very dependent on the design being implemented. The potential
power savings offered by pass transistor logic derive from the sole use of NMOS transis-
tors, thus avoiding the larger PMOS devices in a complementary structure. In addition,
some functions (XOR) map to much simpler implementations with pass transistors as
compared to static CMOS. In the case of a PGA, most designs are known to make
extensive use of pass transistors in the LUTs and as switching elements. The reasoning
behind such a choice is evident because switches and multiplexer structures can be
minimally implemented using just NMOS pass transistors. Thus, significant capacitance
savings have already been realized through smart choice of logic style in current PGAs.
However, the use of pass transistors complicates low voltage design making further
reductions in power much more difficult as will be discussed in the following chapter.
Another key issue in low power circuit design is transistor sizing. In order to minimize
power, all devices should be kept as small as possible. Frequent use of minimum size
transistors will lead to an overall decrease in the switched capacitance of a circuit and in
turn, a decrease in dynamic power consumption. Unfortunately, some devices will require
sizing up to meet timing constraints on paths with heavy loads. In the case of a PGA, the
timing constraint is even more difficult to optimize against because the critical path of a
circuit has not yet been defined. Therefore, circuit paths must be sized to meet some
bounds on average circuit performance. In doing so, the cost of sizing up transistors must
be carefully weighed against the impact that larger parasitic capacitances will have on
power consumption. This is an especially important consideration along high fanout inter-
connect paths and will be examined in detail in a later section.
The last circuit design issue surrounds the choice of supply voltage. The discussion of
parallel and pipelined architectures showed how it is possible to achieve substantial
reductions in power without giving up on throughput. However, when power is the most
critical design constraint, some performance can be given up. The reason why such a
tradeoff appears attractive derives from the quadratic dependence of dynamic power on
supply voltage, while delay, on the other hand, tends to vary only linearly with supply. In
actual practice, the argument is not as simple, and two caveats should be mentioned. First,
the push to smaller feature sizes causes most devices to become velocity saturated fairly
quickly. As a result, the current characteristic of a device more closely resembles a linear
increase with the applied gate voltage. From this, one might think that delay will now
appear independent of supply, but one should also consider threshold voltage. When the
supply voltage falls within a couple of Vt, constants in the Ids equation become
important and the delay degrades further. Thus, lowering the supply
voltage always results in a decrease in performance.

Id = κ · υsat · Cox · W · (Vgs − Vt),  for Vds ≥ Vdsat  (EQ 6)
Evaluating the tradeoff in performance and power as the supply voltage is reduced is
crucial to determining an optimal operating point. The standard measures of delay and
energy give a good indication of performance and circuit efficiency respectively, but tend
to be too biased towards a high supply for speed and a low supply for low energy. A
better metric to represent a design’s overall quality is the energy-delay product. This
metric captures the opposing influences of performance and energy efficiency leading to a
more balanced picture from which to optimize a design. Later studies in this thesis will
use all three metrics when evaluating circuit structures.
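The bias of the individual metrics, and the balance struck by the energy-delay product, can be illustrated with a simple normalized model. The delay expression and the 0.7 V threshold below are illustrative choices of mine, not fitted to the 0.6um process used in this work:

```python
# Delay, energy, and energy-delay product across supply voltages for a
# simple normalized CMOS model: td ~ Vdd / (Vdd - Vt)^2 and E ~ Vdd^2.
# All constants are illustrative placeholders.

VT = 0.7  # threshold voltage (V), illustrative

def delay(vdd):
    return vdd / (vdd - VT) ** 2

def energy(vdd):
    return vdd ** 2

for vdd in (1.5, 2.0, 3.3, 5.0):
    d, e = delay(vdd), energy(vdd)
    print(f"Vdd={vdd:.1f}V  delay={d:.2f}  energy={e:.2f}  EDP={d * e:.2f}")
```

Delay alone favors the highest supply and energy alone the lowest, while the energy-delay product reaches its best value at an intermediate supply, which is the balanced operating point the metric is meant to expose.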
4.1.5 Layout
Once the physical level of design is reached, most parameters affecting the power
consumption of a design have been determined. Despite this, significant differences in
power consumption can arise between a well-planned layout and a haphazard one. Early
floorplanning is especially important for an array based structure like the PGA so that
basic blocks can be connected by abutment. Direct abutment of neighbor to neighbor
interconnect avoids the overhead of routing channels which lead to longer wires and hence
more capacitance. Along similar lines, the effective use of upper level metals for
connecting longer bus-like wires (feedthroughs, over the cell routing) allows a smaller
layout. In general, a compact basic cell layout will result in a more dense PGA array with
smaller wiring capacitances. Considering that a PGA array can have upwards of a
thousand cells, the optimization of the basic cell layout should be leveraged as much as
possible for reducing power consumption.
4.2 Reducing Dynamic Power in PGAs
The previous section provided a broad perspective on the methods available for attacking the PGA power problem. This section focuses on the techniques and trade-offs used to reduce dynamic power, the component that dominates PGA power consumption. Specific issues are divided into attacking the capacitance, frequency and activity, supply voltage, and voltage swing components of dynamic power.
$P_{dynamic} = C \times V_{dd} \times V_{swing} \times F_{clk} \times ActivityFactor$ (EQ 7)
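To make the roles of the components in EQ 7 concrete, the expression can be evaluated for a single net; the capacitance, frequency, and activity numbers below are assumed, not measured values.

```python
# Sketch: evaluate EQ 7 numerically for one net. The capacitance,
# frequency, and activity numbers below are assumed, not measured.

def dynamic_power(c, vdd, vswing, fclk, activity):
    # P = C * Vdd * Vswing * Fclk * ActivityFactor
    return c * vdd * vswing * fclk * activity

# A heavily loaded interconnect line: 1 pF, full 1.5 V swing,
# 20 MHz clock, toggling on 25% of cycles.
p = dynamic_power(1e-12, 1.5, 1.5, 20e6, 0.25)
print(f"{p * 1e6:.2f} uW")  # 11.25 uW
```

Because every factor enters linearly (and supply quadratically when the swing is full rail), halving any one of them halves the power, which is why the following subsections attack each factor in turn.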
4.2.1 Capacitance Reduction
Several architectural and circuit design choices impact the amount of capacitance that
will be charged and discharged within a PGA design. The number of metal layers
available in the selected process technology influences the final area and power of a PGA
array. In addition, the sizing and number of switches a design uses within the interconnect
fabric can also seriously affect power as seen in the Xilinx study of Section 3.2. Lastly,
capacitance is affected by the interface of the basic logic cell to the surrounding intercon-
nect and to the neighboring cells.
On a global level, the choice of a process that allows 3 or more metal layers can
contribute to a decrease in area and power. In earlier designs like the Xilinx 4000 series
and the Triptych design, only a two layer metal process was available. Process advance-
ments since that time have provided three metal layers as a standard with 4 and 5 moving
into the mainstream. The advantage of more layers of metal is twofold. As can be seen from the process data for a 0.5um effective channel length technology shown in Figure 4-2 below, the contributions to capacitance from metal 1 all the way to metal 3 tend to
decrease (Metal 3 with significant interline coupling is an exception). Given that intercon-
nect wiring is a crucial resource in a PGA, the reduction of the intrinsic wiring capacitance
by using upper layers of metal can help lower net capacitances. However, the necessity of
reaching the active layers to insert switches leads to the addition of several contacts and
vias to lower levels (stacked vias are unavailable). The area overhead of going up and
down through the metal layer hierarchy must be considered to truly evaluate the benefits
of more metal layers.
The second advantage of more metal layers comes from the examination of Triptych
and the early cell study in [20]. The ability to use a third level of metal can greatly
increase density. When previously restricted to only two layers of metal, the 10X
overhead in interconnect mentioned in Section 1.3 could not be avoided. Now, much of
the interconnect can be placed over the cells instead of in dedicated routing channels.
Although the exact measure of area improvement will depend on the amount of active area
necessary for switches and configuration memory in the design, the bottom line is that the
cell size will decrease. As a result, the interconnect paths will be shorter and hence wiring
capacitances will be lower.
Figure 4-2 illustrates the growth in wiring capacitance as a function of wire length.
The model used for generating the data was made worst case by assuming two wires
running parallel to the subject wire giving rise to interline coupling on both sides. The
equation describing the components to wire capacitance is shown below.
$C_{wire} = C_{areaTopLayer} + C_{areaBottomLayer} + 2 C_{fringe} + 2 C_{interline}$ (EQ 8)
From the graph, one can see that Metal 3 offers lower overall capacitance across any range
of lengths except when there is significant capacitance to neighboring M3 tracks. In
addition, the graph shows the contribution a PGA interconnect line would experience from
purely wiring parasitics for a range of possible lengths.
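The layer-to-layer comparison can be sketched numerically from EQ 8; the per-micron coefficients below are hypothetical stand-ins, not the thesis's 0.5um process data.

```python
# Sketch: total wire capacitance per EQ 8. The per-micron coefficients
# (in aF/um) are illustrative assumptions, not the thesis's process data.

def wire_cap_af(length_um, c_area_top, c_area_bottom, c_fringe, c_interline):
    # Cwire = C_areaTop + C_areaBottom + 2*C_fringe + 2*C_interline
    per_um = c_area_top + c_area_bottom + 2 * c_fringe + 2 * c_interline
    return per_um * length_um

# Hypothetical coefficients: metal1 couples strongly to the layers below
# it, metal3 mostly to same-layer neighbors when tracks are dense.
m1 = wire_cap_af(100, c_area_top=10, c_area_bottom=30, c_fringe=25, c_interline=10)
m3_isolated = wire_cap_af(100, c_area_top=5, c_area_bottom=10, c_fringe=20, c_interline=0)
m3_coupled = wire_cap_af(100, c_area_top=5, c_area_bottom=10, c_fringe=20, c_interline=40)
print(m1, m3_isolated, m3_coupled)  # 11000 5500 13500 (aF)
```

With these assumed coefficients an isolated metal 3 run is roughly half the capacitance of metal 1, but dense same-layer coupling can push it above metal 1, mirroring the exception noted for the graph.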
Another important design consideration that has an impact on the resulting PGA
capacitance is the size and number of switches present on each interconnect segment
(Figure 4-4). Ideally, the minimum number of switches to support a design would be
chosen in constructing a PGA architecture. Unfortunately, that minimum number is very
difficult to pin down. In general, a mapping study across an interesting set of designs
needs to be performed to determine the necessary level of flexibility an architecture must
support. By combining information about the number of switches needed with the relative impact of drain capacitances on interconnect, a basis for evaluating trade-offs in flexibility and power can be formed. Based on a model of how drain capacitance scales with transistor
size, the chart in Figure 4-3 was generated. In this case, the switches were assumed to be
implemented as just nmos pass transistors. Using this chart, one can quickly evaluate the
impact of changing to larger switch sizes and to increasing the fanout on an interconnect
segment.
In a Xilinx device, 15 switches on a single-length segment is common, but their exact
size is unknown. As a result, the chart’s usefulness really lies in the ability to gauge an
architectural decision’s impact on resulting interconnect capacitance especially when
combined with the wiring parasitics information in Figure 4-2. An interesting thing to
note is when wiring capacitance begins to dominate switch drain capacitance. This point
varies with the chosen switch size and fanout, but as an example, a fanout of 10 with
1.8um pass transistors will equal the wiring capacitance of a 90um metal 3 line.
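The break-even arithmetic behind that example can be sketched directly; the 2 fF per switch and 0.22 fF/um metal-3 figures are assumptions chosen to reproduce the fanout-10 / 90um example, not extracted process values.

```python
# Sketch: wire length at which switch drain capacitance matches wiring
# capacitance. The 2 fF/switch and 0.22 fF/um figures are assumptions
# chosen to reproduce the text's fanout-10 / 90um metal-3 example.

CDRAIN_FF = 2.0        # drain diffusion per 1.8um nmos switch (assumed)
M3_FF_PER_UM = 0.22    # metal-3 capacitance per micron (assumed)

def breakeven_length_um(fanout):
    # wire length where Cwire equals the summed drain loads
    return fanout * CDRAIN_FF / M3_FF_PER_UM

print(f"{breakeven_length_um(10):.0f} um")  # ~91 um for a fanout of 10
```

Below the break-even length the interconnect is fanout dominated and switch count matters most; above it the wire itself dominates and flexibility is comparatively cheap.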
FIGURE 4-2: Wiring Capacitance Parasitics
The sizing and implementation of switches throughout the PGA interconnect is
another factor with a large impact on a PGA’s resulting power consumption. From a
purely analytical approach, a simple model for an interconnect line can be constructed
based on Figure 4-4.
Assuming that all transistors are the same size (reasonable if all paths are considered equivalent), the following models for delay and capacitance can be derived in terms of the size-up factor s, pass transistor resistance R, and drain diffusion capacitance Cdrain, where N represents the fanout on the line:
FIGURE 4-3: Pass Transistor Diffusion Capacitance vs. Fanout and Switch Size (W) L=0.6um
FIGURE 4-4: Model For Programmable Interconnect
$T_d \propto \frac{R}{s}\left[s(N-1)C_{drain} + sC_{drain} + C_{wire}\right] = NRC_{drain} + \frac{R}{s}C_{wire}$ (EQ 9)

$Power \propto sNC_{drain} + C_{wire}$ (EQ 10)
From the above equations, one can see that to first order, sizing up will result in no
change in delay along a single interconnect path. The simple relation shows that the
decrease in resistance from sizing up is offset by a proportional increase in capacitance
yielding no delay improvement. More importantly, sizing up always has a negative impact
on power. The equation shows a linear increase in net capacitance; but when other contributions to capacitance, such as wiring parasitics, are factored in, limited sizing up may be prudent from an energy-delay product point of view.
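That trade-off can be sketched numerically from the EQ 9 / EQ 10 models. R is normalized; the 2.5 fF drain capacitance per minimum switch and the 50 fF of wiring capacitance are assumed values (the latter echoing the 50ff case considered later with Figure 4-7).

```python
# Sketch: first-order delay/power scaling with switch size-up factor s
# (EQ 9 / EQ 10). R is normalized; Cdrain = 2.5 fF per minimum switch
# and Cwire = 50 fF are assumed values, not process data.

R, N, CDRAIN, CWIRE = 1.0, 10, 2.5, 50.0

def delay(s):
    # Td ~ N*R*Cdrain + (R/s)*Cwire: only the wiring term shrinks with s
    return N * R * CDRAIN + (R / s) * CWIRE

def power(s):
    # Power ~ s*N*Cdrain + Cwire: switch loading grows linearly with s
    return s * N * CDRAIN + CWIRE

sizes = [x / 10 for x in range(10, 41)]          # s from 1.0 to 4.0
best = min(sizes, key=lambda s: delay(s) * power(s))
print(f"energy-delay optimum at size-up factor s = {best:.1f}")
```

Delay saturates once the fixed N·R·Cdrain term dominates, while power keeps climbing linearly with s, so the energy-delay product has an interior optimum (s = 2.0 with these assumed numbers).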
A series of simulations were performed to more accurately assess the trade-offs in
switch size and implementation. The three switch implementations range from the
simplest: a single nmos pass transistor, to a complementary transmission gate with equal
sized devices, to a transmission gate with a p device sized 2.5 times as large as the n. The
simulation model for the cases appears in Figure 4-5. Node capacitances were added at
each node to simulate the fanout associated with that point. Fanouts of 2 and 12 were used
to examine switch parasitics. In the case of the transmission gate simulations, more node
capacitance was added because the switches are implemented with both n and p devices
and so the capacitance per switch should be proportionately higher. Thus, the value of
loadCap was 4ff for 1.8um pass transistors and 8ff for the similar transmission gate and
the multiplier was 1 and 2 respectively. In order to account for the wiring capacitance
component, a worst case estimate of wiring parasitics was made and added to each
node. The transmission gate case with pmos sized 2.5 times as large as the nmos
performed worse in delay and energy than the equally sized transmission gate scenario so
it was omitted from the following analysis for clarity. The results are depicted in Figure 4-
6.
Clearly, the transmission gate is at a disadvantage in terms of speed and energy,
especially when large fanouts are required. As the switch size becomes large, there is a
factor of 2 speed penalty and about a factor of 2 energy penalty. Even when the fanout on
the interconnect is only 2, the pass transistor implementation possesses better delay and
energy properties although the gap is less significant. The influence of driver and receiver
overhead and the fixed wiring capacitance is the reason why the transmission gate and
pass transistor methods become more comparable at low fanouts. In general, interconnect
fanout tends to be on the high side, and so the use of nmos pass transistors will result in
superior performance and energy (and area) over transmission gates. Thus, pass
transistors are the preferred implementation method for the switches in a low power PGA.
(Simulation model: a chain of pass transistor switches between in and out, with gates driven at 2v and a 1.5v signal supply; each node carries Cload = loadCap*fanout*multiplier + Cwire, plus a 10ff load at the output. The transmission gate model just adds PMOS devices to the pass transistors.)
FIGURE 4-5: Simulation Model for Switch Size Experiments
FIGURE 4-6: Delay and Energy of Interconnect Chains vs. Switch Size (graphs annotated with wiring-dominated and fanout-dominated regions)
When considering increases in switch size, there is a slow decrease in delay and a
nearly linear increase in energy. From the delay graph, it is evident that sizing up the
switches only pays off when the fixed components of capacitance (wiring parasitics and
driver/receiver loads) dominate performance. Both the delay and energy graphs show
close agreement with the models in Equation 10. Therefore, the results confirm that sizing
up switch devices may be prudent depending on the amount of wiring capacitance but
never proves beneficial from a power perspective.
The graph in Figure 4-7 shows the energy-delay product and offers a means for
reconciling the competing factors of performance and energy consumption. Assuming a
wiring capacitance of 50ff, a switch size of approximately 1.8um is optimal for pass
transistor interconnect with a high fanout. However, for low fanouts sizing up shows a
marginal improvement.
The effect of the interfacing structure between the interconnect resources and the cell
inputs to a PGA should also be considered. The Xilinx analysis showed that the CLB
inputs dissipate a sizable fraction of PGA power. An analysis of two different input
topologies and the variation in the number of inputs revealed other insights towards managing PGA capacitance. The two topologies examined were a linear pass transistor input structure and a traditional logarithmic tree (Figure 4-8).

FIGURE 4-7: Energy-Delay of Interconnect Chains vs. Switch Size
An analysis of delay and capacitance was performed similar to that in the section on inter-
connect above. The resulting equations appear below:
Linear Pass Transistor:

$T_d \propto R \times NC_{drain}$ (EQ 11)

$Power \propto NC_{drain}$ (EQ 12)

Logarithmic Multiplexer Tree:

$T_d \propto \log_2(N) R \times \left[(\log_2(N) - 1) \cdot 3C_{drain} + 2C_{drain}\right]$ (EQ 13)

$MaxPower \propto (N - 2) \cdot 3C_{drain} + 2C_{drain}$ (EQ 14)

$MinPower \propto (\log_2(N) - 1) \cdot 3C_{drain} + 2C_{drain}$ (EQ 15)
To more easily see the relationships between the two topologies for typical numbers of
inputs, the data is graphed in Figure 4-9. In terms of delay, the linear pass transistor
structure shows a linear increase with the number of inputs, while the mux tree delay grows quadratically with the tree depth (log2 of the number of inputs). Thus, for small numbers of inputs,
the two structures do not exhibit as large a difference as they do at higher numbers of
inputs. Capacitance is another story. In the case of the multiplexer, two models were
developed because the amount of capacitance in the logarithmic multiplexer tree that can
be switched depends on whether the inputs to the mux tree are switching. In the worst
FIGURE 4-8: Possible Logic Cell Input Structures (left: a linear chain selecting one of in0-in3 onto out; right: a logarithmic tree)
case, if the inputs to the tree are switching, then even though they are not selected by the control lines, all the intermediate capacitances in the tree will be switched. The best case
occurs when only the selected input is toggling which leads to switching only along the
intended path. Taking this variation into account, one can see that the linear pass transistor
structure is always superior for 4 or less inputs. However, as the number of inputs is
increased past four, the multiplexer under best case conditions can have a lower effective
capacitance than the linear pass transistor. Yet when assuming a simple average case for the multiplexer, the linear structure is once again lower in capacitance. Unfortunately, as
with many PGA design issues, decisions cannot be considered in isolation. The use of a
linear structure requires N programming cells vs. log2N for the multiplexer. This
additional area for memory cells can lead to an increase in array size and cause routing
capacitances to increase as a result. Therefore, the conclusion that linear will lead to
lower power and is thus the best choice should be qualified by the impact such a decision
will have on the overall design.
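The comparison above can be sketched by evaluating the EQ 11-15 models, with R and Cdrain normalized to 1. The tree expressions below follow one plausible reading of the thesis's equations and should be treated as an interpretation, not an authoritative transcription.

```python
# Sketch: the input-structure models of EQ 11-15 with R and Cdrain
# normalized to 1. The tree equations are one interpretation of the
# thesis's formulas, not a verified transcription.
import math

R, CD = 1.0, 1.0

def linear_delay(n):     # EQ 11: Td ~ R * N * Cdrain
    return R * n * CD

def linear_power(n):     # EQ 12: Power ~ N * Cdrain
    return n * CD

def tree_delay(n):       # EQ 13: quadratic in the tree depth log2(N)
    d = math.log2(n)
    return d * R * ((d - 1) * 3 * CD + 2 * CD)

def tree_power_max(n):   # EQ 14: unselected inputs toggle the whole tree
    return (n - 2) * 3 * CD + 2 * CD

def tree_power_min(n):   # EQ 15: only the selected path switches
    return (math.log2(n) - 1) * 3 * CD + 2 * CD

for n in (4, 8, 16):
    print(n, linear_power(n), tree_power_min(n), tree_power_max(n))
```

Under these normalized models the linear structure switches less capacitance at N = 4, the two cross over around N = 8, and the best-case tree wins at N = 16, consistent with the crossover described in the text.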
FIGURE 4-9: Delay and Capacitance of Logic Cell Input Structures

4.2.2 Frequency

Often, the frequency of operation can be relaxed somewhat to achieve a reduction in dynamic power. As previously mentioned, however, design frequency is not under the control of the PGA designer; it is a key variable left to the discretion of the PGA user. Similarly, the activity of nets is impossible to manipulate without knowing what the eventual design will be. Consequently, optimization along the
frequency and activity axes were not a focus of this work.
4.2.3 Voltage Scaling
Voltage scaling can provide significant leverage on a design’s resulting dynamic power
consumption. The quadratic reduction forms the basis of many of the low power
techniques evaluated in Section 4.1. However, the impact of voltage scaling on the circuits
used cannot be overlooked. In the majority of CMOS design, rail to rail voltages are
passed by either static or dynamic circuitry. In this case, voltage levels and margins track
well with lowering supply voltage. However, the same is not true for pass transistor
circuitry and other circuit configurations where a “diode” drop is lost. The severe implica-
tions of diminished margins on PGA circuitry are left to the following chapter.
4.2.4 Low Swing
Another common circuit technique to achieve lower dynamic power is to reduce the
swing on signals to gain a linear power reduction. In general, techniques to support low
swing signaling involve either special timing, voltage references, or charge sharing
techniques. In areas like memory design, low swing techniques have long been used to
reduce the power contribution of heavily capacitive bit-lines. A similar problem exists in
the PGA environment where the interconnect is heavily loaded; therefore, low swing
techniques could potentially offer large power savings.
When considering the incorporation of low swing signalling into a PGA design,
several issues become important. The circuitry necessary to both drive and receive low
swing signals is usually much more involved than a standard inverter would be. Often, the
increased area from a more complex driver and receiver will have a major impact on the
PGA cell area if several of the special drivers and receivers are necessary. This conclusion
was reached based on estimates of cell area for typical designs.
The careful analysis and management of noise presents another problem when dealing
with low swing signals. The PGA noise environment is far more irregular than that found
in an arrayed memory. As a result, worst case noise conditions are difficult to predict and
could disrupt proper electrical operation of circuitry given that low swing signals
inherently have lower noise margins. In fact, the use of pass transistors in the interconnect
of current PGAs already causes interconnect signals to swing only to VDD - Vt (with body effect).
Therefore, a simplistic form of low swing signalling is already performed. Another com-
plication to employing more sophisticated low swing techniques involves the inclusion of
repeaters. In order to break up the large distributed RC lines found in PGAs, repeaters
should be used. However, the construction of a low swing repeater requires significantly
more overhead than a traditional inverter buffer. Lastly, the benefits of low swing
circuitry must be measured against further reductions in supply voltage which allows the
benefits of low swing to be achieved without redesigning drivers and receivers.
Despite the complications involved with low swing design, a more detailed look at the
issues surrounding the use of low swing signalling within a PGA environment is appropri-
ate. In traditional timing based low swing designs, a timing reference is used to enable
circuitry to latch or amplify the low swing signal (Figure 4-10).

FIGURE 4-10: Low Swing Latched Sense Amp Receiver

Unfortunately, an enable signal is most often derived from the clock and as previously stressed, the clock is not specified at design time. Other techniques to generate a timing reference such as a transition triggered scheme proved to be too slow and required
excessive circuit overhead. More importantly, even if a timing signal is available, the
exact triggering of the sensing circuitry proves difficult. The capacitance and therefore
the timing properties associated with any path to be sensed will vary depending on the
mapping that a PGA user performs. Once again, the late-binding of important design
parameters makes incorporation of timing based low swing techniques very difficult.
In addition to sensing, voltage reference methods can also be used to send and receive
low swing signals. In order to provide level conversion, differential structures provide a
larger effective signal and can enable swings on the order of 100mV. However, the
overhead of supporting differential paths must be weighed against the energy savings from
reduced signal voltages. Frequently, static current must be dissipated in differential
topologies. Although static power is generally discouraged in CMOS designs, judicious
use of static power consuming circuitry can result in lower power than more conventional
alternatives. The key trade-off to consider is whether the amount of static power
consumed is less than the dynamic component at typical operating frequencies. Since
PGAs run relatively slowly (tens of MHz range), the power dissipation from static current
outweighs any savings from lower swings.
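A back-of-the-envelope version of this static-versus-dynamic comparison can be sketched as follows; the bias current, capacitance, and activity values are all assumed for illustration.

```python
# Sketch: compare a differential receiver's static bias power against
# the dynamic power of a full-swing net at PGA-like frequencies. All
# values (bias current, capacitance, activity) are assumed.

def dynamic_power(c, vdd, vswing, f, activity):
    return c * vdd * vswing * f * activity

def static_power(i_bias, vdd):
    return i_bias * vdd

f = 20e6  # tens of MHz, typical of mapped PGA designs
p_dyn = dynamic_power(1e-12, 1.5, 1.5, f, 0.25)   # full-swing 1 pF net
p_static = static_power(50e-6, 1.5)               # 50 uA bias current
print(f"dynamic {p_dyn*1e6:.1f} uW vs static {p_static*1e6:.1f} uW")
```

With these assumed numbers the static bias power exceeds even the full-swing dynamic power it would be saving, consistent with the conclusion that slow-running PGAs cannot amortize static current.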
In addition to differential techniques, many ingenious ways of pulsing drivers to
achieve reduced signal swing have been reported [9]. Incorporation of such concepts into
a PGA has proved precarious due to the dynamic nature of such schemes.

FIGURE 4-11: Pseudo-Differential Receiver for Low Swing Signals

PGA interconnect can be directly modeled as a distributed RC line with significant resistance and capacitance. As a result, after a driver is pulsed off, the delivered charge redistributes among the
entire interconnect line and settles to some equilibrium voltage. This voltage is often
lower than the point at which the driver turned off causing inadequate signal levels at the
receiver or worse yet, oscillations as the driver repeatedly pulses on until a stable signal
level is established. Since the RC of PGA interconnect paths is not a fixed parameter,
reliable design of the charge delivery circuitry is extremely difficult. In areas of the PGA
where some tighter measure of control over the variation of interconnect RC can be relied
upon, pulsing techniques may become useful.
Various simulations were performed to evaluate many of the low swing techniques
described above. In general, they did not perform robustly or did not offer a significant
energy advantage. Consequently, low swing circuitry was not included in the final design
as a way to reduce dynamic power consumption in PGAs. However, many of the issues
complicating low swing design in a PGA environment were made clear from this phase of
design exploration and further research in this area may provide new solutions. As an
example, one area where low swing circuitry may prove beneficial is in the upper levels of
an interconnect hierarchy. On the long interconnect lines at this level, wiring parasitics
will dominate and limited fanout will constrain the variations in load capacitances.
CHAPTER 5
Low Voltage Pass Transistor Design
5.1 Design Issues with Pass Transistors at Low Supply
In previous 5 volt designs, the fact that NMOS transistors do not pass a full logic high
was a small concern. However, at an intended supply voltage of 1.5 volts, several circuit
design issues had to be studied in order to evaluate the continued feasibility of pass
transistor switches. A thorough examination of threshold loss effects provided a basis
with which to evaluate circuit performance at low supply voltages, specifically at voltages
nearing the sum of the nmos and pmos thresholds. In order to deal with the problems of
low voltage pass transistor design, lowered device thresholds and level restoration
techniques are discussed.
TABLE 5-1 HP Process Parameters

  Drawn Channel Length    0.6um
  N-channel Vt            650mV
  P-channel Vt            -850mV
  Poly pitch/spacing      0.6um/0.9um
  M1 pitch/spacing        0.9um/0.9um
  M2 pitch/spacing        0.9um/0.9um
  M3 pitch/spacing        1.5um/0.9um
  Well type               N-well process
5.1.1 Signal Loss Effects on Performance
The most serious consequence of designing with pass transistors at a supply voltage of
2Vt is the inability to effectively flip a standard inverter. Figure 5-1 shows a typical sce-
nario where a pass transistor feeds the gate of an inverter which is skewed low from the P/
N ratio of 1/2. Unfortunately, the inverter switching point does not move much lower
since the supply voltage is the sum of the thresholds.
In the case of Figure 5-1, the pass transistor is effective at pulling low so the inverter’s
pmos device is fully turned on giving a solid low-to-high transition. However, when
attempting to drive an inverter from an nmos pass transistor, the inverter experiences only
100mV of gate overdrive with which to make a high-to-low transition. The maximal volt-
age that can be passed through the nmos transistor sits at Vdd-Vtn(body effect). If the
pass transistor has a nominal Vt of 650mV, then it can barely charge Vint to 800mV after a
significant delay (20ns). Even if the inverter is skewed so the nmos is twice minimum size
and the PMOS is minimum sized, the pass transistor/inverter propagation delay is 4.5ns.
One should note that tpHL is significantly longer than the propagation delay (nearly 2x).
In a typical CMOS path, a transmission gate would produce a full-swing input with which
to drive the inverter. As a result, the gate overdrive voltage is approximately a Vt for both
transitions which is enough for reasonable performance at Vdd=1.5 volts. However, in the
configuration of Figure 5-1 an acceptable level of performance cannot be achieved.
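The arithmetic behind the degraded transition can be sketched as follows. The 0.05 V body-effect increment is an assumed value; note that this simple model yields a slightly more optimistic overdrive than the roughly 100 mV figure quoted from simulation.

```python
# Sketch: signal level and gate overdrive when an nmos pass transistor
# feeds an inverter at Vdd = 1.5 V. The 0.05 V body-effect increment is
# an assumed value; the thesis's simulations quote ~800 mV at Vint and
# roughly 100 mV of effective overdrive.

VDD = 1.5
VT_N = 0.65           # nominal nmos threshold
BODY_EFFECT = 0.05    # assumed extra threshold from body effect

vint_max = VDD - (VT_N + BODY_EFFECT)   # best high level at Vint
overdrive = vint_max - VT_N             # inverter nmos Vgs - Vt
print(f"Vint(max) = {vint_max:.2f} V, overdrive = {overdrive*1000:.0f} mV")
```

The point of the sketch is structural: the passed level loses one (body-affected) threshold, and the receiving nmos then loses another, leaving only a sliver of overdrive at a 1.5 V supply.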
Lowering the threshold voltage of the pass transistors can alleviate the above problem
if the process technology supports such an option. Low Vt pass transistors will enable a
higher output voltage to be passed to the input of the inverter giving the nmos more gate
drive.

(Circuit of Figure 5-1: a 0.9u/0.6u nmos pass transistor drives node Vint, the input of an inverter built from 1.8u/0.6u devices; fanout = 2, Vdd = 1.5v, Vt = 650mV.)
FIGURE 5-1: NMOS Pass Transistor Feeding an Inverter

A lower bound on threshold voltage of 200mV was chosen for the following analysis. Although an even lower threshold would have produced better signal levels, the
200mV limit was set because of controllability and leakage constraints. When using a low
Vt transistor, the threshold variations due to process can cause excessive delay variability
[37]. Circuits have been demonstrated to control variation by using a leakage monitoring
feedback loop allowing sufficient Vt stabilization at 200mV [19], [33]. In addition to Vt
variation, excessive leakage currents can develop when using low Vt devices which con-
tributes another lower bound to Vt. The topic of leakage will be discussed in Section 5.3.
In order to simulate circuits with lower thresholds, the model file for the targeted HP pro-
cess was modified to give a Vt of 200mV by adjusting the flat-band voltage. Changing the
threshold in this way should have a minimal effect on the other device parameters.
The threshold voltage's impact on propagation delay was investigated using the bulk bias as a control parameter for the circuit in Figure 5-1. The low Vt transistor was used as a baseline, and by applying a negative bias to the bulk connection the
threshold voltage could be increased. Table 5-3 shows well bias voltages and the resulting
Vt values for a source voltage of 0 volts.
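This well-bias dependence follows the standard body-effect relation, which can be sketched as below; the gamma and 2*phiF values are assumed fitting constants, not HP process parameters, and only roughly track the tabulated points.

```python
# Sketch: the standard body-effect relation used to raise Vt with
# reverse well bias. gamma and 2*phiF are assumed fitting constants,
# not HP process values; they only roughly track Table 5-3.
import math

def vt_with_bias(vbs, vt0=0.2, gamma=0.3, two_phi_f=0.7):
    # Vt = Vt0 + gamma * (sqrt(2*phiF - Vbs) - sqrt(2*phiF))
    return vt0 + gamma * (math.sqrt(two_phi_f - vbs) - math.sqrt(two_phi_f))

for vbs in (0.0, -0.85, -3.0):
    print(f"Vbs = {vbs:+.2f} V -> Vt = {vt_with_bias(vbs)*1000:.0f} mV")
```

The key property used in the experiment is simply that a more negative well bias monotonically raises the threshold, letting one device model sweep the 200-650 mV range.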
TABLE 5-2 Comparison of Nominal Vt and Low Vt Nmos Devices (w=1.8u)

                           Low Vt (Vbs=0v => Vt=200mV)   Nominal Vt (Vbs=-3.3v => Vt=650mV)
  Max Vout                 1.1V                          0.8V
  Ipeak (diode connected)  200uA                         100uA

TABLE 5-3 Vt Variation with Well Bias (for Vsource=0v)

  Threshold Voltage   Well Bias
  200 mV              0v
  350 mV              -0.85v
  500 mV              -3v
  650 mV              (original model file used)

Figure 5-2 depicts the dependence of delay as a function of pass transistor threshold voltage. The nominal Vt inverter curve shows the delay for the case in which a pass transistor feeds an inverter whose device thresholds are fixed at the nominal process Vts
(Table 5-1). The second curve shows the delay when a pass transistor is connected to an
inverter composed of devices fixed at a low Vt (Vtn=200mV,Vtp=250mV). The delays are
normalized to 312ps with a fanout of 2. From the graph, one can see that there is a signif-
icant delay penalty (>5x) as the threshold voltage of the pass transistor feeding a nominal
Vt inverter is raised, whereas the case of driving a low Vt inverter sees little variation in
delay (1x). However, circuit delays experience only a 1x slowdown if 500mV pass tran-
sistor thresholds are available. Clearly, signal loss through a pass transistor is a serious
problem when designing for a process with high thresholds. It should also be noticed that
at a supply voltage of 1.5 volts, the delay through a low Vt pass transistor and inverter is
nearly 3 times larger for the nominal Vt inverter case (790ps) than the low Vt inverter case
(312ps). This scenario demonstrates the strong correlation between threshold voltage and
delay at supply voltages nearing the sum of the device thresholds.
A closer look at the effects of supply voltage on pass transistor performance is shown
in Figure 5-3. In this case, the delay of a pass transistor feeding an inverter composed of
nominal Vt devices was plotted. One can see that if thresholds are not scaled appropri-
ately, pass transistor design is significantly complicated by lowering supply voltages. As
the pass transistor threshold voltage is varied, the delay at a 3.3 volt supply experiences
less than 10% degradation. At a 1.5 volt supply, a 5x slow-down occurs, but by raising the supply voltage to 2 volts, only a 70% decrease in performance is suffered. Thus, either threshold scaling of pass transistors or a slightly increased supply voltage is necessary to make pass transistors viable for ultra low power design.

FIGURE 5-2: Pass Transistor Threshold’s Impact on Delay
5.1.2 Multiple Threshold Losses
In general, pass transistors are best utilized in series chains. However, there are times
when it is desirable to feed a pass transistor output to the gate of another pass transistor
network. Such a scenario is depicted in Figure 5-4 which was an early version of the logic
cell in [20]. During the design of pass transistor based circuitry, one must consider the
detrimental impact of successive losses in signal level. First, the loss of signal presents
less gate overdrive on fanout gates. The use of pass transistors with 200mV thresholds can
alleviate this problem, but only temporarily because after three drain to gate connections
the high output becomes a mere 600mV. Once again, a nominal Vt inverter cannot be
flipped. As a result of the build-up in signal loss, level restoration becomes necessary at
critical points in a circuit to insure proper operation. As will be shown, level restoration is
costly. Consequently, to avoid paying the restoration penalty too often, a designer will
need to trade off circuit flexibility.
FIGURE 5-3: Supply Voltage Impact on Delay
The second important issue involving multiple threshold losses is the observation that
the number of thresholds lost at any point in a circuit sets a lower bound on the supply
voltage. This requirement is similar to the Vt constraints in low voltage analog design.
Obviously, low power design with pass transistors is severely restricted: if too much signal is lost along any one path, a logic high can no longer be distinguished from ground.
The progressive loss in signal level is illustrated in Figure 5-5. Unless low Vt pass
transistors are used, noise margins become extremely small after just one threshold loss
with Vdd=1.5 volts. Even at a Vt of 200mV, two drain to gate connections in series would
result in a high level that is too susceptible to noise with a supply of 1.5 volts. In the case
of the nominal process Vt (650mV), the signal is completely lost after two drops. In order
to remedy this problem, level restoration must be performed.
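The progression can be sketched with a simple per-stage loss model; the 0.3 V and 0.75 V effective per-stage losses are assumed values (thresholds inflated by body effect) chosen to match the 600 mV-after-three-drops and lost-after-two-drops observations.

```python
# Sketch: best-case high level after k drain-to-gate stages. The
# per-stage losses (0.3 V for the low Vt device, 0.75 V for nominal)
# are assumed effective values including body effect, chosen to match
# the observations in the text.

def high_level(vdd, vt_eff, drops):
    # each drain-to-gate connection costs one effective threshold
    return max(0.0, vdd - drops * vt_eff)

VDD = 1.5
print(high_level(VDD, 0.30, 3))   # low Vt: ~0.6 V left after 3 drops
print(high_level(VDD, 0.75, 2))   # nominal Vt: 0.0 V, signal gone
```

The model also makes the supply-voltage bound explicit: the supply must exceed the total accumulated threshold loss along the worst path, or the high level collapses to ground.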
(Differential inputs a/a#, b/b#, and cin/cin# are combined through pass transistor stages to outputs fout/fout#; memory cells configure the network, and successive drain-to-gate stages accumulate 1, 2, and 3 Vt drops.)
FIGURE 5-4: Logic Cell Design with Multiple Drain to Gate Pass Transistor Connections
FIGURE 5-5: Signal Loss Levels at Various Supply Voltages
5.2 Level Restoration
As previously explained, level restoration quickly becomes a necessity if reasonable
delays are to be achieved and proper circuit operation maintained. Several options exist,
but all revolve around ratioed logic. Among the possibilities are 1) Feedback pmos, 2)
Cross-Coupled pmos (CCP), 3) Sense Amplifying Latch (SAL) [32],[31] and 4) the case
of no restoration. The three restoration schemes are depicted in Figure 5-6.
All four cases were simulated using a test circuit where the restorer outputs were
loaded with inverters and fed by 1.8um pass transistors. In the case of CCP and SAL, dif-
ferential inputs are needed. In all circuits, the well bias of the pass transistors was varied
according to Table 5-3 to give thresholds from 200mV to 650mV. The restorer transistors
were all nominal Vt. The following graphs show the propagation delay from the full scale
input through a pass transistor to the output of the inverter. Energy is also examined
because static current can flow during the restoration process as the ratioed devices tempo-
rarily fight each other. It is important to note that the supply voltage of ~2Vt essentially
eliminates the transition currents, but at higher supply voltages such currents may become
substantial. Data are plotted in Figure 5-7. The first graph shows values when the control voltage
on the pass transistor is at Vdd (the case when driven by a full-swing input).
From the results shown, there is a considerable trade-off in delay and energy among
the restoration schemes. When using a low Vt pass transistor, all the techniques experi-
ence similar delays. However, the energy of the differential versions is much higher
because of the necessity to duplicate paths. The best alternative from a delay perspective
is the cross-coupled pmos restorer. In this case, the signals are differential so when one
FIGURE 5-6: Level Restoration Schemes (feedback restorer, cross-coupled pmos, sense amplifying latch)
path gets pulled to ground, the pmos of the other path is fully turned on and the weak pull-
up of the nmos pass transistor is sped up and brought to the supply rail. This type of dual
rail signalling holds much promise for low voltage pass transistor designs where
speed is most critical since the quick pull-down action governs circuit speed. As the
pass transistor thresholds are raised, the non-restored and feedback pmos methods do not
fare as well as the dual rail techniques and experience significant performance degrada-
tions. When moving to a nominal Vt pass transistor (Vt=650mV), the two single-ended
techniques become impractical. Basically, differential signalling must be used in order to
restore signals when designing for supply voltages near 2Vt.
A closer look at the results shows that the sense amplifying latch is just a fraction
slower than CCP. In fact, the SAL will always be slower than the CCP in this capacity.
The possible benefit from the SAL is that the full inverter will enable pull-down and pull-
up paths thus amplifying the transition through feedback in both directions. When closely
examined, it is found that the circuit does not take advantage of this because the pull-up by
the pass transistors is much slower than the pull-down. As a result, the pmos part of the
restorer is always enabled first by a low going transition. The nmos devices are never suf-
ficiently turned on by the degraded pull-up of the pass transistors. Furthermore, the pres-
ence of an extra nmos path to ground on the pull-up side further slows the high going
transition of the pass transistor output until it is completely turned off by the other branch.
FIGURE 5-7: Level Restorer Delay and Energy Comparison with Single Threshold Loss
In terms of energy, the higher currents required to flip a full inverter cause at least a 15%
energy increase over the CCP technique. As a side note, the sense amplifying latch does
offer the possibility of easily incorporating a latch into the output of the logic block. The
addition of the clocked pass transistors to the path results in a 0.5ns slow-down which is
not excessive if the latches are deemed appropriate. Similar conclusions about the sense
amplifying latch were reached in [50].
The next set of data reflects the performance of the restorers when the control voltage
is derived from a circuit where more than one threshold drop is traversed (Figure 5-8). In
all cases, pass transistors with 200mV thresholds were used. The voltage that initially
appears at the restoration node moves from Vdd-Vt to Vdd-2Vt to Vdd-3Vt.
From the above graph, it is evident that the restoration delay increases with increasing
signal loss. When a signal that has lost 3 thresholds is to be restored, a 2x increase in
delay results. It should also be noted that the short-circuit current from the restoring
device to ground during the transition period grows quickly with increasing restoration
delay especially for the feedback pmos technique. Another interesting note is that the
feedback restorer failed to fully restore signals within the 5ns window when fed by an
input that has dropped by more than 1 threshold. As a result, significant design flexibility
is lost when relying on single-ended restoration.
FIGURE 5-8: Level Restorer Delay Comparison with Multiple Threshold Loss (low Vt pass transistors)
Section 5.3: Vt Scaling and Leakage 83
An examination of the variation in restoration delay with supply voltage also yields
some insights into low voltage design with pass transistors. Figure 5-9 shows the delays
of the various techniques as the supply voltage ranges from 3.3 volts to 1.5 volts. At
higher voltages, restoration is fairly quick, but it degrades with a square-law dependence
as the supply is lowered. A possible way around this would be to scale the Vt's on the
restoring devices as well, but this has severe consequences on sizing and leakage as dis-
cussed in Section 5.3. As a final note, the large increase in energy for the non-restored
case at 3.3 volts is a result of static current dissipation from the inadequate turning off of
the pmos device.
5.3 Vt Scaling and Leakage
5.3.1 Subthreshold Leakage Currents
The use of low Vt pass transistors mitigates many of the problems with pass transistor
logic design at low voltages, but they do not come without a price, namely higher sub-
threshold leakage currents. According to the paper by Sakurai [19], leakage currents fol-
low an exponential dependence on Vt where s is defined as the subthreshold slope.
FIGURE 5-9: Restorer Delay and Energy for Different Supply Voltages (low Vt pass transistors)
At room temperature, for a W/L = 7 and Vdd=1.5V, the leakage goes as shown in Figure
5-10 below.
The above values were verified by simulation of a few scenarios. When using a Vt of
200mV, one can expect about 1e-8A or 10nA of leakage current. Although the current
from a single transistor is not significant, the use of hundreds of thousands of these devices
can result in currents in the mA range which would be detrimental in low power applica-
tions.
Isub ∝ exp[(Vgs - Vt)/s]    (EQ 16)
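To put EQ 16 in concrete terms, the sketch below scales the ~10 nA figure quoted for a 200 mV device up to array scale; the 100 mV/decade subthreshold slope is an assumed round number, not the measured process figure:

```python
import math

def subthreshold_leakage(vt, i0=10e-9, vt_ref=0.200, s_mv=100.0):
    """Off-state leakage per device, scaled from the ~10 nA observed at
    Vt = 200 mV using the exponential dependence of EQ 16. s_mv is an
    assumed subthreshold slope in mV/decade."""
    s = s_mv / 1000.0 / math.log(10.0)   # convert mV/decade to the 's' of EQ 16
    return i0 * math.exp(-(vt - vt_ref) / s)

per_device = subthreshold_leakage(0.200)   # ~10 nA for a low-Vt device
array_total = 100_000 * per_device         # hundreds of thousands of devices
print(f"{array_total * 1e3:.1f} mA")       # prints "1.0 mA"
```

A nominal 650 mV device sits 4.5 decades lower on the same curve, which is why leakage only becomes a concern when low Vt devices are used wholesale.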
FIGURE 5-10: Measured Sub-threshold Id-Vt Characteristics for Various Vgs [19]
FIGURE 5-11: Sneak Leakage Path in Between Pass Transistor Connected LUTs
In fact, leakage is the primary reason why low Vt devices cannot be used in all the
gates of a PGA array. In some cases, leakage in pass transistor chains like Figure 5-11 can
be controlled by programming the configuration cells to zero, thus eliminating the voltage
drop across leaky transistors in a sleep mode. The use of a dual Vt process where the
memory cells use high Vt devices could also alleviate most leakage concerns. If neces-
sary, further control of subthreshold leakage can be obtained by increasing the reverse bias
on the well, which causes Vt to increase, resulting in an exponential decrease in the
leakage component. Unfortunately, separate access to the well gives rise to an area over-
head since well contacts will need to be specially placed to connect to a dedicated well
bias line.
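The well-bias knob can be sketched with the standard body-effect relation; the body-effect coefficient and surface potential below are illustrative textbook values, not the parameters of the 0.6um process:

```python
import math

# Illustrative body-effect parameters (assumed, not process-extracted).
GAMMA = 0.5    # body-effect coefficient, V^0.5
PHI_2F = 0.7   # 2*phi_F surface potential, V

def vt_with_body_bias(vt0, vsb):
    """Threshold voltage under reverse source-to-body (well) bias."""
    return vt0 + GAMMA * (math.sqrt(PHI_2F + vsb) - math.sqrt(PHI_2F))

# A 1 V reverse well bias lifts a 200 mV threshold by roughly 230 mV,
# which at ~100 mV/decade (EQ 16) cuts subthreshold leakage by more
# than two orders of magnitude.
print(round(vt_with_body_bias(0.20, 1.0), 3))   # 0.434
```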
5.3.2 Static Current Dissipation in Pass Transistor Fed Inverters
So far, low Vt devices appear to offer superior performance as long as leakage is con-
trolled. One might be tempted to employ low Vt devices for all logic transistors, not just
the pass transistors. Up to this point, static current dissipation caused by insufficient turn-
ing off of the pmos device in an inverter has not been discussed. However, if all devices
are low Vt, the degraded high level will not be able to turn off a pmos device and a static
current path will exist when driving a logic low (Figure 5-12).
A combination of low Vt pass transistors feeding nominal Vt inverters solves the static
current problem associated with inadequate turning off of the pmos device. In effect, as
long as the logic high value produced by a pass transistor differs from Vdd by less than the
threshold voltage of a pmos device, then no static current will flow.
(EQ 17)   Vt,pass-transistor(Vbs) < Vt,pmos
FIGURE 5-12: Static Current Dissipation for Low Vt Inverter fed by Pass Transistor (fanout = 2, Vdd = 1.5v)
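A quick numeric check of EQ 17, using the nominal thresholds of this process (Vtn = 0.65v, Vtp = 0.85v, quoted later in Chapter 6) and the 200 mV low-Vt option:

```python
def pmos_fully_off(vdd, vt_pass, vt_pmos):
    """EQ 17: the degraded high (Vdd - Vt,pass) fully turns off a pmos
    only if the resulting source-gate voltage -- which equals Vt,pass
    itself -- stays below the pmos threshold."""
    v_high = vdd - vt_pass
    return (vdd - v_high) < vt_pmos   # equivalent to vt_pass < vt_pmos

# Low-Vt pass transistor into a nominal-Vt inverter: no static path.
print(pmos_fully_off(1.5, 0.20, 0.85))   # True
# Nominal-Vt pass transistor into a low-Vt inverter: static current flows.
print(pmos_fully_off(1.5, 0.65, 0.20))   # False
```

This is exactly the combination argued for above: low Vt pass transistors feeding nominal Vt inverters.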
One word of caution should be mentioned here. The leakage current through a nomi-
nal Vt device can still be significant when the gate is not completely turned off. Figure 5-
10 shows that there is approximately a 100x increase in leakage currents for every 200
millivolts of additional gate to source voltage for devices that are not completely turned
off regardless of threshold voltage. Thus, if too many devices are driven with a poor logic
high, the aggregate leakage current could become substantial. Fortunately, the pmos
threshold is normally larger than the nmos threshold, which allows some extra margin. This scenario
serves as another example of where early power estimation can be valuable. If the calculated
leakage falls within an acceptable fraction of the total power budget, then special design
measures (well bias, etc.) need not be applied.
In the end, low Vt devices were not used for the final design to be discussed because
they were not available in the target process. Nevertheless, a comparison showing the
effects of low Vt devices is useful to illustrate how design options may change if given a
more advanced process technology.
CHAPTER 6
Cell Design and Array Architecture
The circuit design and architecture for a low power PGA design are the focus of this
chapter. Several circuit options for the logic cell are examined along with the design of
the interconnect circuitry. Throughout the discussion, various trade-offs among supply
voltage, logic style, and performance are evaluated. Minimization of capacitance and
signal swing factored predominantly in the circuit design phase. Architecturally, this
design strives to achieve low power while still maintaining high flexibility and efficient
logic utilization. A direct-path interconnect mesh joins 3-LUT logic cells thus providing
low capacitance connections for short distance routes. In addition, a local bus routing
grid allows datapath style connectivity to be efficiently supported. Lastly, long-distance
routing is achieved through a hierarchy of coarse interconnect grids.
6.1 Logic Cell Circuit Options
Although the interconnect consumes most of a PGA’s power, the logic cell design cannot
be ignored for several reasons. The composition of a logic cell defines the functional-
ity that can be performed by each array element. Simpler cells imply that a greater
number of cells will need to be interconnected to realize a particular overall function.
Conversely, a larger, more capable cell can pack more logic within the cell and thus use
potentially fewer external nets. A complete study of the effects of cell complexity and its
relation to power consumption was beyond the scope of this work, but a couple of
Section 6.1: Logic Cell Circuit Options 88
trade-offs were explored to determine what a suitable granularity would be. The work described
in [20] used a very simple logic cell which could be thought of as offering a bit more
flexibility than a 2-LUT. Results from mappings and circuit simulations revealed that the
fine granularity combined with a restrictive input structure caused a large number of cells
to be needed to implement fairly simple mappings. Since a large number of cells were
needed, a large amount of interconnect was required and despite the fact that the intercon-
nect was low in capacitance, the aggregate energy consumption remained higher than
desired. On the other hand, moving to a more complex LUT would require more configu-
ration cells and interface circuitry. One should keep in mind that increasing the number of
inputs to a LUT causes an exponential increase in LUT configuration cells and will also
require more overhead if intermediate outputs are brought out (i.e., a 2-LUT needs 4
memory cells, a 3-LUT needs 8 memory cells, and so on). More memory cells directly
impact the basic cell area and thus add to the wiring capacitances all the interconnect will
see. In addition, the incomplete utilization of a complex logic cell will indirectly translate
to a waste in energy. Consequently, the cell designs considered were based on a 3-LUT
style as a reasonable balance in efficiency.
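The exponential growth argument is easy to make concrete:

```python
def lut_config_cells(k):
    """A k-input LUT stores one configuration bit per input combination."""
    return 2 ** k

def lut_functions(k):
    """Number of distinct k-input Boolean functions a full LUT covers."""
    return 2 ** (2 ** k)

for k in (2, 3, 4):
    print(k, lut_config_cells(k), lut_functions(k))
# The memory (and the wiring capacitance its area adds) doubles with
# every added input, while functional coverage grows double-exponentially.
```

The 3-LUT sits at the point where eight cells buy complete coverage of all 256 three-input functions without the area of a 4-LUT's sixteen cells.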
Performance is another reason why the logic cell design was carefully evaluated. Even
though speed was not the primary goal of this PGA design, achieving a reasonable level of
performance is critical to building a truly viable design. In fact, the optimization of the
energy-delay product provided a guide for balancing the metrics of energy and speed
throughout the design process. In terms of speed, the logic cell delay represents at least
50% of the total delay for typical mapped designs. Obviously, care must be taken in the
cell design to avoid substantial performance degradation for the entire array.
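As an illustration of how the energy-delay product reconciles the two opposing metrics, the sketch below ranks the logic-cell figures quoted later in this chapter; styles whose delay is not given numerically are omitted, and the two-stage branch-based energy is stated in the text only as "about the same" as one stage:

```python
# (delay ns, energy pJ) pairs quoted in Sections 6.1.6 and 6.1.7.
candidates = {
    "NAND decoder":          (3.4, 0.61),
    "branch-based, 1 stage": (5.7, 0.50),
    "branch-based, 2 stage": (4.5, 0.50),
}

def energy_delay(delay_ns, energy_pj):
    """Energy-delay product in pJ*ns: lower is better on both axes."""
    return delay_ns * energy_pj

for name, (d, e) in sorted(candidates.items(), key=lambda kv: energy_delay(*kv[1])):
    print(f"{name:24s} {energy_delay(d, e):.2f} pJ*ns")
```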
The logic cell design also plays a role in the determination of supply voltage, and
interconnect structures. As mentioned in Section 5.1.2, circuit topology can easily
become the limiting factor in lowering supply voltage and hence power. At the outset of
this research, the target supply voltage for the design was 1.5 volts and so circuit solutions
focused on that specification. In addition to supply voltage, the type of circuitry used in a
logic cell will often dictate the circuit characteristics of the interconnect network (e.g.
interface circuitry). Any impact on the interconnect network will likely have a strong
correlation to the resulting PGA power consumption. Lastly, the amount of energy
dissipated by the logic block should not be ignored. Although the CLB energy was shown
to be minor in comparison to the interconnect of a Xilinx, optimization of only one aspect
in a design will always result in diminishing returns.
The next several sections contain a brief evaluation of various logic cell implementa-
tions. After describing each separately, all the schemes are compared with respect to
delay, energy and energy-delay product in Figure 6-10 and Figure 6-12. The final choice
of logic cell implementation is explained at the end.
6.1.1 Pass Transistor
A simple method for implementing a logic cell is depicted in Figure 6-1. The inputs to
the cell are derived from a pass transistor based interconnect network. The entire 3-LUT
structure is basically a tree decoder where the inputs select which configuration bit
appears at the cell output. The structure is entirely composed of nmos-only pass
transistors which minimizes parasitic capacitances. The fact that all functions of 3 inputs
are compactly available using only nmos devices and a single rail signalling is attractive
from a power perspective. The main problem lies with the two threshold voltage drops
that a signal experiences as it passes from input to output. As discussed in the previous
chapter, the minimum supply voltage for the design will be limited by the 2 Vt drops
(including body effect) plus some margin for resolving a 0 or 1, or about 4Vt in total. This point
occurs at a supply voltage of approximately 3 volts for the process under consideration
(Vtn=0.65v, Vtp=0.85v). Even at 3 volts, the inverters must be skewed low to get
reasonable performance. At lower voltages, the circuit quickly fails making a 1.5 volt
supply completely out of the question. The most important thing to realize in this case is
that the topology of circuits using only pass transistors should be restricted to as few drain
to gate connections as possible to achieve low supply operation.
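Behaviorally, the 3-LUT tree decoder is just an 8-to-1 multiplexer steering one configuration bit to the output; a functional sketch (not the transistor netlist):

```python
def lut3(config_bits, a, b, c):
    """Behavioral model of the 3-LUT tree decoder of Figure 6-1: the
    inputs select which of the eight configuration bits appears at the
    cell output."""
    assert len(config_bits) == 8
    return config_bits[(a << 2) | (b << 1) | c]

# Program the LUT as a 3-input XOR: bit i of the truth table is the
# parity of i's binary representation.
xor3 = [bin(i).count("1") & 1 for i in range(8)]
print(lut3(xor3, 1, 1, 0))   # 0
print(lut3(xor3, 1, 1, 1))   # 1
```

Any of the 256 three-input functions is obtained the same way, by loading the appropriate truth table into the configuration memory.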
6.1.2 Cross-Coupled PMOS Restorer (CCP)
Although the basic pass transistor implementation is inadequate at a supply voltage of
1.5 volts, nmos pass transistors can still be employed as long as differential signalling is
used to allow level restoration (Section 5.2). Figure 6-2 depicts the resulting LUT topology.
The cross-coupled pmos structure belongs to the CPL logic family, and also bears a
striking resemblance to the differential cascode voltage switch logic (DCVSL) family.
In fact, the gate style is identical except that the LUT replaces the DCVSL ground
connections with connections to the configuration memory cells. By using this structure, a
pass transistor signal path will operate at 1.5 volts which was impossible in a single ended
design.
The cross-coupled pmos (CCP) gates possess several beneficial properties for a low
power LUT design. Despite the complete duplication of paths, the energy when compared
to a static CMOS implementation does not double. In fact, the energy is approximately
equal to the transmission gate case in Section 6.1.5. The reason is that the signal swing on
pass transistor driven nodes is reduced to only 750mV versus a full swing of 1.5 volts for
the transmission gates. More importantly, the differential CCP achieves a lower delay
FIGURE 6-1: Pass Transistor Logic Cell
than transmission gates for about the same amount of energy. Most of the delay
improvement comes from the fact that the pull-up is accelerated by the differential action.
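The swing argument can be quantified: the energy drawn from the supply to charge a node is Q·Vdd = C·Vswing·Vdd. The 20 fF node capacitance below is an illustrative assumption; the swings are the 750 mV and 1.5 volt figures from the text:

```python
def switching_energy_pj(c_ff, v_swing, vdd):
    """Supply energy per charging event: Q*Vdd = C*Vswing*Vdd.
    Capacitance in fF, result in pJ."""
    return c_ff * 1e-15 * v_swing * vdd * 1e12

full = switching_energy_pj(20.0, 1.50, 1.5)   # full-swing internal node
ccp  = switching_energy_pj(20.0, 0.75, 1.5)   # CCP pass-transistor node
print(f"{full:.4f} {ccp:.4f}")                # prints "0.0450 0.0225"
```

Halving the swing halves each node's energy, and duplicating the path doubles the node count back up, which is why the differential CCP lands roughly at parity with the full-swing transmission gate cell.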
Sizing complications and the necessity for differential signalling mitigate the superior
energy and delay performance of CCP. The path from the pmos devices through the nmos
chain represents a ratioed gate. In order to operate correctly, the pmos devices cannot be
made too strong or the low going signal will not be able to flip the complement path.
Likewise, an overly weak pull-up will slow the flipping of the inverter for the high going
side. In both cases, careful optimization is crucial to ensure reliable operation and
minimal short circuit current during signal transitions. The other drawback to implement-
ing the logic cells in this fashion is more difficult to deal with. Supporting fully differen-
tial paths can complicate the logic cell’s routing if its architecture has a high fanin and
fanout. In addition, differential paths will require a doubling of the interconnect area,
although careful planning of the layout on the local level may be able to minimize the
impact of differential signalling. Fortunately, a differential LUT does not require more
memory cells since true and complement signals are free; however, the area overhead from
the duplication of interconnect buffers and interface circuitry will certainly cause
undesired increases in cell area. Therefore, a single ended option was considered preferable.
FIGURE 6-2: Cross-coupled pmos Restored Pass Transistor Logic Cell
6.1.3 Boosted Capacitor Restoration (BCR)
Boosted capacitor restoration is another interesting technique to allow near 2Vt
operation of pass transistor based designs. Similar to the cross-coupled pmos style of
LUT, the boosted capacitor technique also relies on differential signals. Instead of using
an active pull-up, the low to high transition receives a charge boost from the coupling
capacitor. As the low to high side slowly charges through the pass transistor, the
complement path quickly flips the inverter causing the output to capacitively couple to
node X. Two benefits result from the bootstrapping effect. First, the edge rate of the low
to high transition becomes accelerated. Without the charge injection, the pass transistor
slowly reaches its final voltage in an RC fashion. More importantly, the coupling allows
the voltage to rise to a point above Vdd-Vt(body effect), thus surpassing the downstream
inverter’s switching point. This allows the output to transition quickly even at a supply
voltage of 1.5 volts. A capacitor in this process would be formed using the gate oxide of a
transistor. Simulations showed that 10fF of capacitance would be sufficient for good per-
formance. Upon close examination of the circuit, one might think that any benefits would
be offset because of the extra capacitance on the pass transistor nodes which would slow
the low going transition. Although there is some degradation to the high to low edge, the
fact that the nmos device is more capable when pulling low makes the slow-down insignif-
icant.
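A first-order charge-sharing estimate of the boost at node X: the 10 fF coupling capacitor is the simulated value from the text, while the 15 fF of fixed node capacitance is an illustrative assumption:

```python
def boosted_level(vdd, vt, c_couple, c_node):
    """Pass-transistor high level (Vdd - Vt) plus the capacitively
    coupled step injected when the complement-path inverter output
    rises by a full Vdd."""
    return (vdd - vt) + vdd * c_couple / (c_couple + c_node)

# Nominal Vt (650 mV) at Vdd = 1.5 V: the degraded 0.85 V high is
# kicked back up well above a mid-rail inverter trip point.
v_x = boosted_level(1.5, 0.65, 10e-15, 15e-15)
print(f"{v_x:.2f} V")   # prints "1.45 V"
```

This is the key effect: the node rises above Vdd - Vt, so the downstream inverter switches quickly even at a 1.5 volt supply.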
The original intention of devising the boosted capacitor restoration technique was to
develop a means of operating pass transistor networks faster and at a lower power than the
cross-coupled pmos method. Based on SPICE analysis, the BCR technique actually
FIGURE 6-3: Boosted Capacitor Restoration Input Circuitry
performs equally with the CCP method with regards to energy and delay. Unfortunately,
the necessity for differential paths could not be avoided and so the same comments made
about CCP apply here as well.
6.1.4 Buffered Pass Transistor
All the techniques for implementing a pass transistor LUT discussed so far have had
drawbacks which discourage their use. Thus, a re-examination of the simple pass
transistor LUT made sense. The main hurdle to using the nmos only pass transistors
concerns the voltage loss in passing a ‘1’ which makes it impossible to effectively flip an
inverter (Section 5.1). In the diagram of Figure 6-1, a signal would experience one
threshold loss after coming through a switch from the interconnect and then another in the
drain to gate connection of the tree decoder. By simply ensuring that a signal path never
encounters more than one threshold loss, the minimum supply voltage can be reduced to
about 2 volts and still achieve reasonable performance. Figure 6-4 shows the maximum
voltage levels that can be propagated along the signal path.
The resulting LUT design, called buffered pass transistor, retains many desirable
properties. Relaxing the supply voltage to 2 volts allows the LUT delay to decrease by
about 30% over the fastest 1.5 volt design. Compared to the other schemes at 2 volts, the
buffered pass transistor performs about the same as a transmission gate design, and about
FIGURE 6-4: Buffered Pass Transistor Logic Cell (Vdd = 2v, Vhigh = 1.1v)
0.5 ns slower than the fastest differential techniques. In terms of energy, the buffered pass
transistor consumes the least among the 2 volt implementations. The inputs to the LUT
see only nmos gates resulting in minimal input loading. In addition, the reduced internal
swings help to further reduce energy consumption in the LUT.
6.1.5 Transmission Gate
The transmission gate style of 3-LUT resembles the pass transistor method except that
the pmos is added. Figure 6-6 shows the schematic of such a circuit. Addition of the com-
plementary transistor allows the circuit to operate at 1.5 volts without the loss in
thresholds that plagued the nmos only pass transistor design. The energy of the design is
quite low at 0.44pJ per output charge and discharge cycle. The optimum sizing of the
pmos device relative to the nmos was determined to be 1. Moving to a P/N ratio of 3 to
compensate for the mobility difference actually increased delay and energy consumption
as shown in Figure 6-5.
Once again though, delay was the problem. At a supply voltage near 2Vt, a transmission
gate’s performance is severely degraded. As the signal proceeds through Vdd/2, a pseudo-
dead zone occurs as both the pmos and nmos devices are barely conducting. As a result,
there is a noticeable leveling of the transient waveform which translates into a poor
propagation delay. In addition, the pmos devices add some load to the inputs, although it
FIGURE 6-5: Impact of Transmission Gate Sizing
is not as severe as the static designs to be discussed next. In general, the transmission gate
method proved to be the best solution from an area, delay and energy perspective among
the static CMOS designs.
6.1.6 Decoder-Based
Another possible implementation of the logic cell was borrowed from static decoder
structures [28]. The circuit is depicted in Figure 6-7. By using complementary static
gates, the supply voltage could be lowered to 1.5 volts without the problems associated
with the pass transistor options. Instead of implementing the decoder as a tree, a set of
NAND gates select which configuration cell’s value is to be passed to the output. In
addition, only one line will be charged at a time resulting in low energy operation.
Simulations showed the circuit consumed about 0.61 pJ of energy.
Unfortunately, the delay and input loading requirements of the circuit made it undesir-
able. Typical delays from the slow input to the output were 3.4ns which was not
competitive with other design options. Another negative aspect of the circuit was the
vastly increased load on the logic cell inputs. In a pass transistor tree, one input
experiences a maximum load of 4 nmos transistor gates and an inverter to generate the
complement input. The static NAND decoder requires each input signal to connect to 8
nmos and 8 pmos gates. Thus, the energy advantage of this structure due to low voltage
operation is offset by the increase in energy necessary to drive the cell inputs. One might
FIGURE 6-6: Transmission Gate Logic Cell
remember that this structure was examined in Section 4.2.1 for implementing the input
multiplexers. The key difference between the two cases is that the inputs driving the nand
gates are essentially static when performing the input multiplexer function whereas they
are toggling in the LUT case. As a result, the circuit in Figure 6-7 is attractive for input
circuitry, but not for the logic cell.
6.1.7 Static Complex Gate
Keeping with the goal of 1.5 volt operation, the static CMOS logic style remains very
attractive. Besides a decoder type of implementation, one could envision a number of
other possibilities. The simplest would involve using a set of static CMOS gates to derive
functions. Some commercial designs employ this technique. The CLAy design described
in Section 2.4 uses an XOR and an AND gate as the basis for generating logic functions.
The main drawback with this ‘discrete’ gate method, however, is that a complete coverage
of all possible functions is not possible. For example, a logic cell with 3 inputs would
require 2^(2^3) = 256 gates, although some combinations are not unique. As mentioned earlier,
coverage of all possible functions is preferred since it greatly simplifies the mapping
process and can lead to more efficient utilization.
FIGURE 6-7: NAND Decoder Style LUT
Branch-based logic is another way to take advantage of the low voltage operation of
static CMOS and still retain complete functionality. Figure 6-8 shows the implementation
of a 2 input LUT. One can think of the gate as a direct translation of the truth table for a 2
input function. The memory cells control the upper and lower transistor in the stacks to
program whether a minterm should be a 1 or a 0. One problem, seen also in the decoder
style implementations, is the large increase in input loading. In addition, the extension of
a two input version to a three input version does not scale well. In the 3 LUT case, each
input (true and complement versions) must connect to 4 nmos and 4 pmos gates. Worse
yet, a total of 4 pmos devices must be placed in series for the pull-up paths leading to poor
delay and excessive capacitance and area. Even with intelligent sizing, the performance
and energy of this gate was inferior to other options (5.7ns & 0.5pJ). One possible way of
avoiding 4 series devices would be to break up the gate into a two stage design. After
doing this, the delay improved to 4.5ns and energy was about the same as the single stage
design. Overall, the static CMOS designs were not promising candidates for the logic cell
implementation despite their ability to operate at 1.5 volts.
6.1.8 Current Mode Logic
Current mode logic was another logic style that was investigated for the logic cell
implementation (Figure 6-9). Often, extremely small signal swings (100mV) can be
FIGURE 6-8: Static Branch-Based Logic Cell (only 2-LUT)
achieved by using this style. In order to operate with such small voltage excursions, a
static bias current flows through the circuit. Simulations showed that currents in the 10uA
range were sufficient for correct operation. However, the power consumed by this current
became excessive unless an operating frequency approaching 100MHz could be
maintained. In the designs considered, 100MHz operation would be very difficult and so
the static power would considerably outweigh a dynamic power dominated alternative.
Despite the static current problem of the design, three other issues proved most troubling
to any incorporation of current mode logic in a PGA. The necessity to level shift signals
to a comfortable common mode voltage depending on where they are located in the tree
leads to further design overhead and power consumption. Secondly, a source follower
output stage would be absolutely necessary for driving the capacitive interconnect lines to
other cells. Once again, more static power would be burned. Lastly, the current mode
technique would require using fully differential signal paths which would double the
amount of interconnect area.
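The static-versus-dynamic trade above can be sketched numerically. The 10 uA bias and 1.5 volt supply are from the text; the switched capacitance and activity factor are illustrative assumptions:

```python
# Static vs. dynamic power for a current mode gate. The 10 uA bias and
# 1.5 V supply come from the text; the 50 fF switched capacitance and
# 0.5 activity factor are assumed for illustration.
def static_power_uw(i_bias_a, vdd):
    """Always-on bias power, in microwatts."""
    return i_bias_a * vdd * 1e6

def dynamic_power_uw(c_f, vdd, f_hz, activity=0.5):
    """CV^2f switching power of a conventional gate, in microwatts."""
    return activity * c_f * vdd ** 2 * f_hz * 1e6

p_static = static_power_uw(10e-6, 1.5)   # ~15 uW regardless of activity
for f in (1e6, 10e6, 100e6):
    p_dyn = dynamic_power_uw(50e-15, 1.5, f)
    print(f"{f/1e6:5.0f} MHz: dynamic {p_dyn:6.3f} uW vs static {p_static:.1f} uW")
# Only as the clock approaches 100 MHz does the switching cost of a
# dynamic-power-dominated alternative begin to approach the fixed bias power.
```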
6.1.9 Low Threshold Devices
The techniques described so far have all been handicapped by the fact that the device
thresholds with body effect become a large portion of the supply voltage (Vdd ~2-3Vt).
The data in Section 5.1 suggest that using devices with lower threshold voltages will
alleviate the problem of inadequate inverter triggering. Three scenarios employing low
threshold devices were examined. The first replaced all pass transistors with low Vt
FIGURE 6-9: Current Mode Logic Cell Implementation (only 2-LUT for simplicity)
devices (Vt~200mV) and used nominal Vt transistors for the inverters. A process that
provides dual Vts makes the leakage problem more manageable than a process with all
low Vt devices because memory cells can be constructed without high leakage currents.
The other two cases used transistors with 200mV and 400mV thresholds respectively
throughout the design.
As one might expect, circuit delays improved dramatically. For all supply voltages,
the low Vt scenarios showed the fastest speed. The designs using all 200mV thresholds
were quickest among the group followed by the version that used low Vt pass transistors.
Unlike delay, low threshold devices did not offer a substantial improvement in energy over
the previously discussed designs. The main reason for this is that the low Vt devices cause
larger signal swings at a given supply voltage. However, the key benefit of low Vt devices
is the ability to operate at a lower supply voltage than allowed in a higher Vt design. A
lower overall supply voltage will save a large amount of energy especially in the pass
transistor-based interconnect network. Unfortunately, one caveat must be observed when
dealing with low Vt devices. The lower threshold will cause the problem of static current
dissipation discussed in Section 5.3.2 and must be dealt with to avoid excessive power
consumption. Lastly, the examination of low Vt scenarios was performed to investigate
the impact a low Vt process would have on PGA design. However, the process that was
intended for this design did not offer optimized Vts and so only the design options using
the higher thresholds were realizable.
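The triggering problem that motivates low Vt devices can be illustrated with a back-of-the-envelope check. The sketch below assumes an inverter trip point of Vdd/2 and treats the effective thresholds (including body effect) as round illustrative numbers, not extracted device data:

```python
# Illustrative sketch: an nmos pass transistor passes a degraded high of
# roughly Vdd - Vt (with body effect), and the receiving inverter trips near
# Vdd / 2.  The threshold values are assumptions for illustration only.
def swing_margin(vdd, vt_eff):
    """Margin between the passed 'high' level and the inverter trip point."""
    v_pass = vdd - vt_eff          # degraded high level out of the pass device
    v_trip = vdd / 2.0             # approximate inverter switching threshold
    return v_pass - v_trip

# Assumed effective thresholds including body effect (volts).
for vdd in (2.0, 1.5, 1.0):
    for vt in (0.9, 0.4, 0.2):     # nominal, 400mV, and 200mV devices
        m = swing_margin(vdd, vt)
        print(f"Vdd={vdd:.1f}V Vt={vt:.1f}V margin={m:+.2f}V "
              f"({'ok' if m > 0 else 'fails'})")
```

Under this crude model, only the 200mV devices retain positive margin at a 1 volt supply, which is consistent with the observation that the 200mV case was the only design functional below 1.5 volts.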
6.1.10 Final Cell Results
The following series of figures show the delay, energy, and energy-delay product
comparisons for the design styles that showed reasonable promise. The traditional pass
transistor data refers to the case described in Section 6.1.1. For each case, data is plotted
from a supply voltage of 5 volts to 1 volt to illustrate the relative degradation in speed and
the improvement in energy. One interesting thing to note is that the 200mV low Vt case
was the only design that could function below 1.5 volts.
The energy-delay product graph in Figure 6-12 provides a useful way to reconcile the
opposing metrics of energy and delay. The buffered pass transistor style of LUT offered
the best overall performance among the high Vt designs and was chosen for this low
power PGA design. However, it should be noted that the transmission gate LUT imple-
mentation was nearly as good. From the data, a supply voltage of 2 volts appears to offer
the optimum operating point. Even though a 1.5 volt supply was originally targeted, by
relaxing the supply by 500mV, both delay and energy could be improved.

FIGURE 6-10: Delay of Potential Cell Designs
FIGURE 6-11: Energy of Potential Cell Designs

Normally, energy and delay tend to trade-off against each other, but in this case, the slightly
increased supply voltage allowed more circuit implementation options to be explored
resulting in a more optimal single-rail design. Although this was an aggressive design,
as technology scales and circuit topologies become more sensitive to supply voltage,
the ability to optimize designs by using supply voltage as a variable will see
increasing importance.
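As a rough illustration of why an intermediate supply can minimize energy-delay product, the sketch below uses a generic first-order model (delay ∝ Vdd/(Vdd−Vt)², energy ∝ Vdd²). The threshold value and the resulting optimum are illustrative assumptions and are not fitted to the simulated data above:

```python
# Generic first-order model of the energy/delay trade-off that motivates
# treating the supply voltage as a design variable.  Vt and the constants
# are assumptions, not extracted device data.
VT = 0.9  # assumed effective threshold, volts

def delay(vdd):
    return vdd / (vdd - VT) ** 2

def energy(vdd):
    return vdd ** 2

supplies = [v / 10.0 for v in range(15, 51, 5)]   # 1.5V .. 5.0V
best = min(supplies, key=lambda v: energy(v) * delay(v))
for v in supplies:
    print(f"Vdd={v:.1f}V  delay={delay(v):6.2f}  energy={energy(v):6.2f}  "
          f"E*D={energy(v) * delay(v):7.2f}")
print(f"minimum energy-delay product near Vdd={best:.1f}V")
```

With these assumed numbers the minimum lands near 2.5 volts (analytically at Vdd = 3Vt for this model); the exact optimum of the real design depends on the simulated device data, not on this sketch.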
6.2 Logic Cell Output Circuitry
Keeping with the low energy goal of this PGA design, logic cell output buffers driving
large loads were sized with a larger size-up factor than the delay-optimal value of e. As
shown in [28], delay varies slowly around the optimum scaling value; therefore, use of a
larger size-up factor such as 4-5 results in energy savings from reduced driver capacitance.
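The sizing argument can be made concrete with a first-order inverter-chain model. The load value and the fanout-of-r delay approximation below are illustrative assumptions, not figures from the design:

```python
import math

# First-order model: driving a load F (in units of a minimum inverter), a
# chain targeting stage ratio f needs N = ceil(ln F / ln f) stages.  Stages
# are then sized geometrically, delay ~ N * r, and driver energy ~ the sum
# of stage sizes (total driver input capacitance).  Parasitics are ignored.
def chain(load, f_target):
    n = max(1, math.ceil(math.log(load) / math.log(f_target)))
    r = load ** (1.0 / n)                   # actual per-stage ratio used
    delay = n * r                           # fanout-of-r delay per stage
    energy = sum(r ** i for i in range(n))  # total driver capacitance
    return n, r, delay, energy

LOAD = 100.0  # assumed load, in minimum-inverter units
for f in (math.e, 4.0, 5.0):
    n, r, d, e = chain(LOAD, f)
    print(f"f={f:.2f}: stages={n} delay~{d:.1f} driver-cap~{e:.1f}")
```

For this assumed load, moving from a target ratio of e to 5 cuts the driver capacitance by more than half while increasing delay only about 10%, mirroring the [28] observation that delay is flat near the optimum.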
In addition, the logic cell output circuitry must be carefully designed to avoid several
pitfalls. As discussed in the power analysis chapter, significant power can be wasted in
output fanout chains. To avoid this, each driver chain was isolated from the fanout node
by inserting a pass transistor switch (Figure 6-13). Thus, only pass transistor drains and
enabled outputs are charged when the logic cell output transitions instead of all the
inverter chains. This technique is worthwhile when, on average, only a small fraction of
the potential output paths are configured at any given time.

FIGURE 6-12: Energy-Delay Product of Potential Cell Designs

A pmos pull-up must be added
to the inverter chain inputs in order to ensure that the node does not float when the input is
not being driven. Although setting the node high may result in an extra transition, this
only occurs during reconfiguration so no additional energy consumption takes place
during normal operation.
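A small sketch of the switched-capacitance saving from the isolation switches, with assumed round capacitance values (not extracted from the layout):

```python
# With isolation switches, a logic cell output transition charges only the
# switch drains plus the enabled driver chains; without them, every fanout
# driver chain toggles.  Capacitance values are assumed round numbers (fF).
C_CHAIN = 20.0   # input capacitance of one inverter driver chain (assumed)
C_DRAIN = 2.0    # drain capacitance of one isolation switch (assumed)

def switched_cap(n_chains, n_enabled, isolated):
    if isolated:
        # every switch drain is seen, but only enabled chains are charged
        return n_chains * C_DRAIN + n_enabled * C_CHAIN
    return n_chains * C_CHAIN          # all chains toggle regardless

for k in (1, 2):
    with_sw = switched_cap(8, k, True)
    without = switched_cap(8, k, False)
    print(f"{k}/8 outputs enabled: {with_sw:.0f}fF vs {without:.0f}fF "
          f"({without / with_sw:.1f}x saving)")
```

The saving is largest exactly in the case the text describes: only a small fraction of the potential output paths configured at once.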
Lastly, a programming signal must be included to disable all outputs that drive shared
busses. During reconfiguration, the state of the configuration memory cells will be inde-
terminate. As a result, several drivers may be enabled on a shared bus and may be
attempting to drive conflicting logic levels. The special programming signal ensures that
all drivers of shared busses will be disabled during reconfiguration. Without the
programming disable, hazards could cause excessive currents to destroy the chip while it
is being programmed.
6.3 Interconnect Circuitry
The interconnect circuit design was considered in conjunction with the cell design
options since the cell often impacts the interconnect implementation. Based on the
decision to use the buffered pass transistor scheme, the following interconnect
methodology was developed. The supply voltage of 2 volts allowed pass transistors to still
be used as the switches throughout the interconnect. At that supply voltage, sufficient
signal could be developed on the input of an inverter to trigger it allowing single ended
signalling to be used with feedback pmos restoration.
The ability to stay with pass transistors was critically important because a significant
power penalty would have to be paid if forced to move to full transmission gates.

FIGURE 6-13: Programmable Output Circuitry

Figure 6-14 shows the relation between pass transistor and transmission gate programmable
interconnect using the model depicted in Figure 4-5. At first glance, two factors contribute to
the increase in power that the transmission gate interconnect experiences. The addition of
the pmos device at least doubles the capacitance per switch. Simulations showed that the
optimal P/N ratio was 1/1 for speed and especially for power. Secondly, the signal swing
through transmission gate interconnect is larger than pass transistor interconnect since a
full rail signal is propagated. As a result, transmission gate interconnect theoretically
experiences a 4x increase in power dissipation.
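The roughly 4x figure follows directly from E = C·Vdd·Vswing. Using the 2 volt supply and the ~0.9 volt effective threshold implied by the 1.1 volt passed level discussed later, with a unit switch capacitance:

```python
# Energy comparison of one nmos pass-transistor switch versus one
# transmission gate, via E = C * Vdd * Vswing.  Vdd and the effective
# threshold follow the text; the unit capacitance C is arbitrary.
VDD, VT = 2.0, 0.9   # supply and effective nmos threshold (volts)
C = 1.0              # capacitance of one nmos switch (unit)

e_pass = C * VDD * (VDD - VT)    # degraded swing through the nmos switch
e_tgate = (2 * C) * VDD * VDD    # pmos added (~2x cap), full-rail swing
print(f"pass transistor: {e_pass:.1f}  transmission gate: {e_tgate:.1f}  "
      f"ratio: {e_tgate / e_pass:.1f}x")
# -> pass transistor: 2.2  transmission gate: 8.0  ratio: 3.6x
```

The doubled capacitance and the near-doubled swing each contribute roughly a factor of two, which is where the theoretical ~4x penalty comes from.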
One word of caution must be made concerning the above comparison of pass transistor
and transmission gate interconnect. The overhead of transmission gates becomes
moderated by the wiring capacitance contribution to total interconnect capacitance. Thus,
one must keep in mind the relative values of wiring and fanout capacitance to determine
the best switch implementation. In the studies performed to evaluate the interconnect
implementation space, estimates of fanout and wiring length were included. The typical
interconnect segment under consideration was estimated to have 50fF of wiring
capacitance and the equivalent of 12 pass transistor drains connected to it (Level 1 inter-
connect in Section 6.6.2). From those estimates, pass transistors were found to offer the
FIGURE 6-14: Delay and Energy for Interconnect Chains
best performance in terms of delay and energy consumption. Figure 6-15 depicts the
Energy-Delay product comparison of the interconnect chains.
A further optimization along the low power focus of the design was the use of a dual
supply voltage. As the design stands, all the circuitry operates from a 2 volt supply which
allows a high voltage of about 1.1 volts to be passed through an nmos pass transistor. In
effect, the pass transistors act as voltage converters allowing no more than 1.1 volts to be
passed in either direction. Although the energy from a 1.1 volt swing on the interconnect
is quite low, the fact that it comes from a 2 volt supply implies an inefficiency. Therefore,
all the circuitry except for the memory cells and the buffers which drive the LUT pass
transistors was moved to a 1.5 volt supply. 1.5 volts was chosen because going to an even
lower supply voltage (< 2Vt) would start incurring a substantial delay penalty. By using
the 1.5 volt supply, all the inverters driving large interconnect and load capacitances see a
25% energy savings. The movement to the dual supply environment was deemed
worthwhile because of the degree to which interconnect capacitance dominates PGA
power. In essence, the dual supply allows an interesting balance between delay and
energy efficiency to be made. By operating the LUT circuitry at a higher voltage, better
performance can be achieved. Since the LUT delay accounts for a sizeable portion of
overall path delay, and a small portion of energy consumption, raising the supply is
beneficial. Similarly, by reducing the supply on the cell interface and interconnect drivers,
which contribute a significant amount to total PGA energy consumption, a greater overall
energy reduction can be realized. Therefore, in designs where some blocks account for
most of the delay and other blocks consume most of the energy, a dual voltage scheme may
be appropriate to gain further leverage with which to optimize energy-delay product. As an
additional note, the efficiency of the circuitry that generates the supplies must also be
factored into the overall evaluation of a dual supply scheme. The other reason to move to a
dual supply is that level conversion was much quicker for a 1.1 volt signal feeding a 1.5
volt inverter. As a result, the restoration energy and delay penalty suffered at every
buffering point was lessened. Figure 6-16 depicts the general scheme.

FIGURE 6-15: Energy-Delay Product of Interconnect Chains
Using the dual supply scheme allows the general interconnect power to scale as
described in the following equations. Assuming that a 5 volt PGA design using pass
transistor interconnect has 3.5 volt interconnect swings, a 10x reduction in power is
achieved by using the dual voltage scheme despite the wealth of complications associated
with low voltage pass transistor design. As one last note, the small signal swings do come
at the cost of some speed, but the primary goal of the design was ultra low power.
Energy = C × Vdd × Vswing                                        (EQ 18)

@ Vdd = 5 volts:    Energy = C × 5 × 3.5 = 17.5C                 (EQ 19)

@ Vdd = 1.5 volts:  Energy = C × 1.5 × 1.1 = 1.65C               (EQ 20)

FIGURE 6-16: Typical path from interconnect through logic cell and back to interconnect
(VddL = 1.5 volts, Vdd = 2 volts, Vswing = 1.1 volts, skewed inverters at each buffering point)
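Equations 18-20 can be restated and checked directly:

```python
# Interconnect energy relation (EQ 18) evaluated for the two supply
# scenarios of EQ 19 and EQ 20.  C is left symbolic as one unit.
def interconnect_energy(c, vdd, vswing):
    return c * vdd * vswing          # EQ 18: E = C * Vdd * Vswing

C = 1.0
e_5v = interconnect_energy(C, 5.0, 3.5)    # EQ 19: 5V design, 3.5V swing
e_dual = interconnect_energy(C, 1.5, 1.1)  # EQ 20: dual-supply design
print(f"{e_5v:.2f}C vs {e_dual:.2f}C -> {e_5v / e_dual:.1f}x reduction")
# -> 17.50C vs 1.65C -> 10.6x reduction
```

The ratio of 17.5C to 1.65C is about 10.6, which is the ~10x reduction claimed above.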
During the final design stages, the cell design depicted in Figure 6-4 was altered to
further improve performance and energy. As mentioned in Section 5.2, level restoration is
costly because it involves ratioed paths. After some simulations, it was observed that
better performance could be achieved by removing the restoration devices. Since the
supply voltage on the pass transistors was 2 volts instead of 1.5 volts, there was sufficient
gate overdrive voltage to effectively flip a 1.5 volt inverter. The elimination of the
restoration devices resulted in lower delay because paths no longer involved ratioed
devices fighting each other. In addition, energy decreased slightly since the short circuit
current of the restorer transition was no longer present. The possibility of static current
dissipation as described in Section 5.3.2 is avoided even though the logic levels are not
full-swing because Equation 18 showing the relation between pmos threshold voltage and
pass transistor output voltage is still satisfied in all cases. The resulting circuit path is the
same as Figure 6-16, except the restoring device has been removed as shown in Figure 6-17.
However, before accepting the optimization using no restorers, a careful examination
of subthreshold leakage current was necessary. Since the pmos devices are no longer
turned off with Vgs=0V (a 1.1 volt high on a 1.5 volt inverter leaves Vgs=-0.4V),
subthreshold currents will still flow. Under worst case process
conditions, an estimate of the leakage current per logic cell was 0.25uA. For comparison,
a logic cell driving one output which toggles at 5MHz will dissipate over 10x more power.
Under nominal conditions, leakage becomes 50-100x less giving a 2000 logic cell array a
total leakage power of ~40uW. Diode leakage for such an array would also fall in the
range of 10s of uW so removing the restoration devices was deemed viable. On the other
hand, if leakage currents must be reduced further for a sleep mode, the supply voltage
could be disabled or the restoration devices could be re-inserted. Another possible
leakage control option is discussed in [16] where separate power-down of logic circuitry is
performed to keep memory values intact.

FIGURE 6-17: Fully optimized typical path from interconnect to logic cell and back to
interconnect (VddL = 1.5 volts, Vdd = 2 volts, Vswing = 1.1 volts)
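The leakage-versus-dynamic comparison can be sanity checked as follows. The per-cell worst-case leakage and the 5MHz toggle rate come from the discussion above; the output load capacitance is an assumed round number, not a value from the design:

```python
# Rough check that restorer-free operation keeps leakage small next to
# dynamic power.  I_LEAK and F_TOGGLE follow the text; C_OUT is assumed.
VDD_L = 1.5          # volts, low supply
VSWING = 1.1         # volts, interconnect swing
I_LEAK = 0.25e-6     # amps per logic cell, worst-case process (from text)
C_OUT = 0.5e-12      # farads, assumed load of one driven output
F_TOGGLE = 5e6       # Hz, toggle rate used in the text's comparison

p_leak = I_LEAK * VDD_L                          # static power per cell
p_dyn = C_OUT * VDD_L * VSWING * F_TOGGLE        # dynamic power of one output
print(f"leakage ~{p_leak * 1e6:.2f}uW  dynamic ~{p_dyn * 1e6:.2f}uW  "
      f"ratio {p_dyn / p_leak:.0f}x")
```

With the assumed half-picofarad load, a single 5MHz output dissipates about 11x the worst-case leakage, consistent with the "over 10x" comparison in the text.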
6.4 Config memory/ FFs for CLBs
Configuration memory is used heavily throughout an FPGA in order to program the
switches and lookup tables. Three general options exist for the design of the configuration
memory: 1) scan latch-based, 2) 5-transistor SRAM-based, and 3) 6-transistor SRAM-based
(see Figure 6-18).
The scan latch design allows minimal programming overhead since decoders and other
addressing hardware need not be included. A single stream of data is simply clocked
through the cells until all memory cells are initialized at which time, the hold mode is acti-
vated and the scan chain is disabled. Two aspects make the distributed memory cell
approach undesirable for this PGA implementation. The first is that the two clock lines
must be global to all scan cells. As a result, difficult skew constraints must be met espe-
cially considering that the data can snake irregularly throughout the chain of back to back
connected memory cells. The more important drawback to the serial programming
approach concerns performance and power dissipation. Serially clocking all configuration
data would take an excessive amount of time and so a parallel approach is preferred. In
addition, the power of serially shifting data through a chain can be quite high since all ele-
ments experience transitions. Instead, a traditional parallel access memory hierarchy will
offer better programming speed and block enables may be used to lower programming
power consumption albeit at the expense of some area.

FIGURE 6-18: Configuration Memory Options (scan latch, 5-T SRAM cell, dual bitline SRAM cell)
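The power drawback of serial shifting can be sketched by counting transitions. The array size and activity factor below are assumptions for illustration:

```python
# Shifting an N-bit configuration stream through an N-cell scan chain clocks
# every cell on every shift, so the chain toggles O(N^2) times in total,
# while a parallel addressed write touches each cell O(1) times.
N = 2048            # configuration bits (assumed array size)
ALPHA = 0.5         # fraction of shifts that toggle a cell (assumed)

serial_toggles = ALPHA * N * N      # every cell sees the whole stream
parallel_toggles = ALPHA * N        # each cell written once
print(f"serial ~{serial_toggles:.0f} toggles, parallel ~{parallel_toggles:.0f} "
      f"({serial_toggles / parallel_toggles:.0f}x)")
```

For any assumed activity factor the ratio is simply N, which is why both programming time and programming power favor the parallel approach as the array grows.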
The 5 transistor SRAM style memory cell serves as the alternative to a scan in pro-
gramming approach and is used in many SRAM-based PGAs. In this case, a single bit
line would be accessed from a wordline transistor to initialize the cell. Correct sizing of
the access transistor is determined by the pull-up case since only one bit line is used. The
ratio of the access transistor's nmos pull-up current to that of the nmos pull-down device
must be large enough for the written node to reach the switching threshold of the storage
inverter, which is approximately Vdd/2. Simulations showed that an nmos-only access
device would need to be greater than 10um wide to
reliably write a 1.8um pmos and 0.9um nmos memory cell at a supply voltage of 2 volts.
The necessity of such a large access device severely affects the cell area since program-
ming cells account for a significant fraction of total array area.
Consequently, the six transistor SRAM cell was chosen for the configuration memory
throughout the chip. By using the dual bitline approach, minimum sized access transistors
could be used while still maintaining good noise margins. During a read operation
necessary for testing, a Vdd/2 precharge should be used to avoid bitline capacitance upset
of the cell contents. The only major drawback of the dual bitline configuration memory is
the added demand on wiring resources. In previous 5-T designs, only one bitline needed
to be routed per column; whereas two are required in this scheme. Therefore, a low
voltage PGA design presents a much more difficult layout challenge. In effect, low
voltage configuration memory serves as another driver for using low Vt devices if
available, thus allowing a return to the preferred single bitline scheme.
6.5 Logic Cell Array Architecture
The basic cell architecture defines a PGA’s mapability and its internal capacitances.
Once the logic cell’s functionality has been defined, the surrounding interconnect network
can be constructed based on the mapping properties of the cell. The cell-interconnect
interfacing resources and the amount of flexibility to support were determined from
several mapping experiments and intuition built up from studying PGAs. The
problem of defining the connective resources to provide in a PGA array is inherently tied
to heuristics since the device is meant for a general purpose environment. Therefore, a
prototype architecture was initially defined to possess particular properties that were
beneficial from a mapping and capacitance perspective. Using the initial design as a
template, a series of real-world design examples and common logic operations were hand
mapped. Performing the mapping exercise allowed the strengths and weaknesses of the
architecture to be quickly revealed along with the relative utility of the proposed architec-
tural features. Accumulators, counters, shift registers, and comparators were some of the
logic building blocks that were test mapped. In addition, the correlator depicted in Figure
3-11 and the add-compare-select block of a Viterbi decoder were mapped to examine the
architecture’s viability given larger functional blocks.
The general motivation for the architecture focused on preserving the locality and
structure present in most designs. By combining the connectivity properties of designs
with an underlying PGA structure which facilitated those required mapping characteris-
tics, low capacitance mappings could be achieved across a broad range of designs. The
resulting structure aims at striking a balance between datapath style design requirements
and random logic mappings. In doing so, highly regular, dense mappings can be produced
yielding very efficient cell utilization. Furthermore, the trade-off between flexibility and
resource capacitance is managed by providing a highly useful, low capacitance, neighbor-
to-neighbor network combined with a datapath oriented local bus structure. Finally, a
hierarchy of stacked grids overlays the lower levels to allow long distance routing.
Careful construction of these interconnect resources creates a gradual progression from
low capacitance local wiring, to more flexible higher capacitance routing, and enables
designs to exhibit much lower average net capacitances.
6.6 Detailed Architectural Discussion
Figure 6-19 depicts the basic cell. In actuality, two logic cells are grouped together so
that they share inputs although they can operate independently from each other through
the direct path interconnect.
Each logic cell is simply represented by a black box since the circuit details inside are
not important. The only important information from an architectural standpoint is the
ability of the logic cells to implement any function of 3 variables with complete permuta-
bility of the inputs A,B, and C. Surrounding the cells are the input interface multiplexers
which may be implemented as a logarithmic tree or as linear growth switches as discussed
in Section 4.2.1. A simplified diagram showing a pair of logic cells and the breakdown of
inputs appears in Figure 6-20 below.
Each multiplexer provides a single input to the logic cell. The top and bottom muxes per
logic cell choose between inputs coming from four locations:
• the cell vertically adjacent
• the cell to the upper/lower left
• the outermost horizontal local bus (track 0 or 3)
• the innermost horizontal local bus (track 1 or 2)
FIGURE 6-19: Basic Logic Cell Pair Tile
The third input is derived from an 8 input center mux. The eight possible sources include:
• any one of the 4 horizontal local busses
• a vertical high fanout length 4 line (4x)
• a direct connection from the cell to the left
• either diagonal from the upper or lower left cell
The single output of each logic cell can be independently routed to 9 destinations. They
include:
• either vertically adjacent cell
• either cell to the upper and lower right
• any of the 4 horizontal local busses in the pair of cells to the right
• the cell directly to the right
Figure 6-21 shows the neighbor-to-neighbor interconnect which will be referred to as the
Level 0 interconnect.
The local horizontal busses which span each pair of logic cells are connected to a grid of
routing bounded by switch matrices. The granularity of the grid is 2 logic cells
horizontally by 1 vertically.

FIGURE 6-20: Simplified Diagram of Logic Cell Pair with Input Breakdown

Figure 6-22 shows the structure of the level 1 interconnect network
which is defined on the 2x1 grid.
The next few sections provide comments about the motivation for each type of intercon-
nect resource and why it was interfaced to the cell in the way it was.
6.6.1 Level 0 Direct Interconnect
The level 0 interconnect layer serves as the primary resource for localized communica-
tion. The segments are designed as point-to-point connections allowing them to have
lower capacitance than the other types of interconnect. Figure 6-23 shows the connectiv-
ity horizons for the inputs and outputs of a cell using only the level 0 interconnect.
Using the direct connects, a logic cell can efficiently support fanouts as high as 5 to
neighboring cells. Although designs commonly see fanouts of only 1 or 2, the architecture
was made symmetrical about the horizontal to facilitate the mapping task. From Figure 6-
23, one can see that most of the direct interconnect flows vertically and diagonally. The
reason for this is that the level 1 interconnect is more aptly suited for horizontal routing
along the dataflow direction.

FIGURE 6-21: Level 0 Direct Interconnect (pattern spans the entire array)

However, by including the vertical and diagonal connections, many paths avoid using the
more general purpose interconnect layer allowing the
general purpose layer of interconnect to be greatly simplified. In addition, many mapping
targets (adders, comparators, etc.) require data to be passed across the width of a datapath
from bitslice to bitslice; this task is extremely well suited to the level 0 interconnect.

FIGURE 6-22: Level 1 Vertical and Horizontal Interconnect (showing cell output connections
and the 4 versions of switch matrices)

FIGURE 6-23: Diagrams depicting the directly connected cells from the input and output
perspectives

In summary, the paths offered by the direct connection network provide a valuable low cost
routing resource which should be leveraged for as much of the local wiring demands as
possible.
A good example of the utility of the level 0 network is the ripple carry full adder
structure depicted in Figure 6-24. When using 3 input LUTs, two logic cells are required
per full adder. The carry fanout requirements are efficiently handled by the local connec-
tivity allowed by the vertical and diagonal direct connections. In addition, these carry
ripple paths will be faster than any other interconnect because of their low fanout and
dedicated nature.
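The two-LUT full adder bit of Figure 6-24 can be modeled directly, treating each 3-LUT as an 8-entry truth table:

```python
# One full-adder bit per pair of 3-LUTs, as in the Figure 6-24 mapping:
# one LUT computes the sum (a XOR b XOR c), the other the carry (majority).
def lut3(table):
    """Build a 3-input lookup table function from an 8-entry truth table."""
    return lambda a, b, c: table[(a << 2) | (b << 1) | c]

SUM_LUT = lut3([0, 1, 1, 0, 1, 0, 0, 1])     # a XOR b XOR c
CARRY_LUT = lut3([0, 0, 0, 1, 0, 1, 1, 1])   # majority(a, b, c)

def ripple_add(a_bits, b_bits, carry_in=0):
    """Ripple-carry addition, LSB first, one pair of 3-LUT cells per bit."""
    carry, sums = carry_in, []
    for a, b in zip(a_bits, b_bits):
        sums.append(SUM_LUT(a, b, carry))
        carry = CARRY_LUT(a, b, carry)     # carry ripples to the next cell
    return sums, carry

# 5-bit example in the spirit of the figure: 13 + 11 = 24, bits LSB first.
print(ripple_add([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
# -> ([0, 0, 0, 1, 1], 0)  i.e. 24
```

The carry variable passed between iterations plays exactly the role of the vertical/diagonal direct connections: a dedicated, low fanout path from one cell pair to the next.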
Clearly, the level 0 interconnect is useful for datapath designs, but since this PGA
design is meant to be general, the interconnect should also be useful in other types of
designs. Overall, all design choices were guided by the intent to only provide resources
that could be highly utilized across a broad range of likely designs while minimizing the
overhead inherently associated with adding any resource. The finite state machine
mapping in Figure 6-25 provides an example of how random logic can also take
advantage of the level 0 network. Other than the primary input paths, all but two
intermediate paths were able to be mapped using the level 0 resources.

FIGURE 6-24: Ripple Carry Adder Mapping making use of Direct Connects

Once again, designs that can make efficient use of the level 0 interconnect will be faster
and have minimal switched capacitance.
Once a high leverage resource like the level 0 interconnect has been added to the PGA
architecture, decisions about how to connect those paths to the logic cell must be
considered. The allocation of the paths to the input multiplexers must be done carefully to
avoid overly restricting the legal input combinations. For example, if all four level 0
inputs were fed to the same multiplexer then only one of the three cell inputs could be
derived from a neighbor path. Such a situation is undesirable from two fronts. First, the
amount of level 0 interconnect that can be used would be severely limited, thus preventing
the low capacitance advantage from being maximally exploited. Secondly, if only one
input can come from the level 0 network, the other two must come from the local busses.
The lack of flexibility that results degrades the mapability of the architecture leading to
non-optimal and irregular designs. An example of a decision that was made based on the
above argument is the ability to source each diagonal input from either a top/bottom mux
or the center mux. Allowing this added flexibility enables the case where inputs come
from both the lower (upper) left cell and the cell directly below (above) the destination cell
which would have been impossible without the duplication of the diagonal input. As a
FIGURE 6-25: FSM Mapping of ISCAS S27 (3 states)
final note, the logic cell output was designed such that it can drive any combination of the
level 0 interconnect wires. Once again, the flexibility here was deemed necessary from
mapping experiments which showed that restricting the output fanout to a subset of
combinations was detrimental. In order to minimize the overhead associated with the output
fanout, the cell output structure was optimized to be efficient from a power perspective.
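The allocation argument can be illustrated by exhaustively counting how many level 0 paths can feed a cell at once under two hypothetical allocations. The source sets below are simplified stand-ins for the real top/bottom/center mux contents, not the exact architecture:

```python
from itertools import product

# Count the most level 0 (neighbor) paths a cell can consume at once,
# given which sources each of its three input muxes can select.
L0 = {"vert_up", "vert_dn", "diag_up", "diag_dn"}   # level 0 neighbor paths
BUS = {"t0", "t1", "t2", "t3"}                      # local bus tracks

def max_l0_inputs(mux_sources):
    best = 0
    for choice in product(*mux_sources):            # one source per mux
        picked = [s for s in choice if s in L0]
        if len(set(picked)) == len(picked):         # no path used twice
            best = max(best, len(picked))
    return best

all_on_center = [BUS, BUS, L0]                      # all level 0 on one mux
spread = [{"vert_up", "diag_up", "t0", "t1"},       # duplicated diagonals
          {"vert_dn", "diag_dn", "t2", "t3"},
          L0 | BUS]
print(max_l0_inputs(all_on_center), max_l0_inputs(spread))  # -> 1 3
```

Concentrating the neighbor paths on one mux caps the cell at a single level 0 input, while spreading (and duplicating) them lets all three inputs come from neighbors, which is the flexibility the mapping experiments demanded.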
6.6.2 Level 1 Local Bus Interconnect
In spite of the level 0 interconnect’s high utility, a PGA interconnect structure would
be insufficient without further resources. Often, PGAs that rely heavily on neighbor-to-
neighbor interconnect suffer from the inability to support fanout in a regular fashion
suitable for datapaths. Therefore, the level 1 grid was conceived under the purpose of
facilitating dataflow and higher fanouts. Grouping pairs of cells together grew out of the
desire to share the local horizontal busses because signals often fanout to at least two cells.
Once again, the full adder provides an example which is frequently used in designs.
Figure 6-24 shows how the two operand inputs are fed to both the sum and carry 3-LUTs.
Sharing busses leads to fewer distinct routing paths and avoids paying the overhead
capacitance of a switch matrix for every cell pitch. In fact, the initial mapping floorplan
should take advantage of the ability to efficiently route a signal to pairs of cells. By doing
this, spatial locality is encouraged and will lead to lower net capacitances. Lastly, for
cases where signals must fanout to many cells, the horizontal busses can be combined to
allow several cells to tap off the same source.
In addition to providing several input paths to the logic cells, the horizontal busses are
part of the first level of granularity of the general routing network. Although many designs
can be almost completely routed using connections made from one pair of cells to the next (i.e.
pipelined designs), many require signals to follow more complicated paths to reach their
destinations. Thus, the horizontal busses are used to move signals laterally along the
datapath while the vertical segments allow one row of cells to interface with other rows. A
total of four local horizontal busses are provided for each row. This number was the result
of the mapping experiments which showed four busses to be the minimum number
necessary. Once again, the reader is reminded that additional resources were added
sparingly in order to keep the internal capacitance contributed by interfacing and switches as low
as possible.
In general, lateral movements are usually sufficient to route most signals as
feedthroughs, but vertical resources were needed to handle connectivity across the width
of the datapath that cannot be supported by the level 0 network. A common example of
such a case is a datapath shift. Another key reason for including the vertical tracks is to
provide another path from which inputs and outputs to a datapath can be routed. The
preferred direction for datapath routing is horizontal, but in some cases, the horizontal
tracks will be used up. As a result, signals can flow vertically and connect to the
horizontal busses where the signals are locally consumed or produced by the logic cells.
Only two vertical tracks were provided in order to minimize capacitance associated with
the switch matrices.
Switch matrices connect the horizontal and vertical segments into a 2x1 grid. The
connectivity allowed by a switch matrix is depicted in Figure 6-26. Not all possible
combinations are supported because excessive diffusion capacitance would quickly result.

FIGURE 6-26: Level 1 Switch Matrix Detail

In particular, only the number 1 and 2 horizontal busses are allowed to interface to the
vertical tracks. The number 0 and 3 busses connect to the 4x grid which is the next higher
level in the interconnect hierarchy and will be discussed later.
The level 1 interconnect structure was designed so that data flows very naturally
among columns of the array. A left to right directional bias is built in because the outputs
of the previous pair of cells feed the horizontal busses of the pair of cells to the right. This
decision was made because cell outputs often flow into downstream cell inputs. Despite
this bias, right to left movement is not impaired.
The interface between the logic cells and the level 1 network was determined from the
mapping experiments. The single length vertical tracks cannot make a direct connection
to the logic cells because the added overhead was considered unnecessary. Instead, only
the horizontal local busses can source cell inputs. Once again, redundancy was supported
in the interface because of mapability constraints. In this case, the ability for the top/
bottom muxes to select one of two lines and the center mux to select any of the lines
allows early global routing decisions to be made nearly independently of the detailed
route. Some limitations exist, but these aid in deciding which signals to allocate among
the tracks thus constraining the overall routing problem.
6.6.3 4x interconnect layer
On top of the level 0 and level 1 interconnect sits a further hierarchy of grids.
Additional interconnect is necessary to provide further wiring capacity and to facilitate
long-distance routing since it is impossible to put all sources and sinks close to each other.
The first layer is aligned along a 4x4 block of cells as depicted in Figure 6-27. In addition
to the 4x4 grid, successive levels of hierarchy could be added as needed (e.g., 16x16). In
order to minimize the capacitance impact of the hierarchy on the lower routing levels,
connections to lower levels are only made at the edges of the grid. Thus, the long distance
routes are reserved for signals which need to skip over some logic cells before reaching
their destinations. As a result, the global wires do not see fanout capacitance due to logic
cell inputs, only from interfacing to other interconnect layers.
The only exception to the edge-based connectivity of the upper layers of the hierarchy
is the vertical 4x lines. One deficiency of the architecture was its limited ability to distribute a
common signal across the width of the datapath. This requirement comes up fairly often
for structures like multiplexers and other elements requiring control signals to be fed to all
bitslices. Technically, a signal could be fed to the cells from the level 1 vertical tracks and
then through a switch matrix to a horizontal bus wire for each row. However, the delay
and capacitance of such a path would become excessive for high fanout signals. Instead,
the vertical wires that span 4 cells are populated so that all logic cells can receive an input.
This ability to tap inputs directly from a high fanout vertical resource proved to be a
worthwhile addition to the architecture.
Finally, the interface between the 4X grid and the level 1 interconnect centers around
the switch matrices. Figure 6-28 shows the possible switch connections that may be
enabled. In addition to the pass transistor switches, the switch matrix also contains programmable buffers.
FIGURE 6-27: 4x Horizontal and Vertical Interconnect Layers
Section 6.7: Mapping 120
Due to the build-up of distributed RC that signals can experience as
they travel on lengthy programmable interconnect, signals require repowering to achieve
reasonable delay performance. Thus, when making any transitions between the 4x layer
and the level 1 layer, and also among 4X lines, a buffer would be programmed in the
proper direction for signal regeneration. Without buffering, the local cell outputs would
need to be oversized to satisfy worst case routing concerns. However, because the archi-
tecture exploits locality, most nets will not see large capacitances. As a result, the
switched capacitance due to cell output drivers can be made smaller.
6.7 Mapping
As previously mentioned, mapping experiments were performed to validate many
architectural decisions and to judge the viability of the architecture as a whole. The
correlator in Figure 3-11 and the add-compare-select block of a Viterbi decoder were two
large design examples that were hand mapped to this architecture. In addition, a variety of
logic building blocks were also mapped.
FIGURE 6-28: Level 1 and Full 4X Switch Matrix Connectivity and Buffering Scheme (only 4X connections shown, rest same as 1X switch matrix; connections and buffering for a 1x to 4x connection and 4x to 4x horizontal shown in detail)
6.7.1 Correlator Mapping
The correlator mapping appears in Figure 6-29. Each column of cells has been labeled
to identify the schematic components more clearly. Excellent logic cell utilization has
been achieved because the datapath fits well into the underlying routing topology. There
are some differences between the schematic depicted in Figure 3-11 and the mapping that
is shown. The most significant is that the dump registers receive the registered output of
the carry-save adders instead of the combinational output a sample earlier as shown in the
schematic. This change was made because only the combinational output or the registered
output can be routed from a logic block. If the timing is such that the dump registers must
receive the combinational output, then an extra pair of logic cells needs to be used purely as
registers for every bit. Thus, there is a trade-off in resource usage between the two implementations, although either can be supported. The only other difference between the
mapping and the schematic concerns the clocking. In the correlator design, two clocks
and a clock enable for the positive and negative halves are required. The architectural
features to support such a scheme can be added, but at the time of this thesis they have not
been included and therefore are not reflected in the mapping diagram.
6.7.2 Add-Compare-Select Mapping
In addition to the correlator, the add-compare-select (ACS) block of a Viterbi decoder
served as another mapping design example. A schematic block diagram of the add-
compare-select appears in Figure 6-30. High signal fanout combined with reconvergent
fanout makes the ACS block a good mapping challenge for any PGA architecture. A
floorplan of the ACS mapping is depicted in Figure 6-31. The main logic functions were
arranged such that a bit-sliced mapping results. In order to satisfy logic cell input
constraints and horizontal wiring constraints, the positions of blocks were exchanged until
a mapping solution was found. Only a partial mapping is shown in Figure 6-32 because
the entire mapping would have made it difficult to see the details in the picture. Only the
final 2:1 multiplexer stage of the updated state metrics and the decision logic has been left
out. In the case of the 2:1 mux, the output of Mux 0,1 in Figure 6-31 would use the level
FIGURE 6-29: Correlator Mapping (positive and negative input registers, carry-save adders, dump registers, ripple adders, pos-neg subtractor; datapath 9 bits wide; inputs in0-in2, outputs p0-p2 and n0-n2)
of interconnect above the 4x layer (16x) to travel across the entire design and feed the final
multiplexer on the right edge. The decision logic conveniently fits below the datapath
where the comparator outputs and the MSBs from the additions are generated. The branch
and state input vectors enter the ACS block at the periphery. The upper levels of the inter-
connect network would be used to efficiently bus the branch and state metrics from where
they are generated to the ACS block inputs. As a reminder, the upper interconnect levels
will have relatively low capacitance because connections are only made at the endpoints.
Therefore, long-distance-feedthrough type movements will not result in a sizeable energy
FIGURE 6-30: Add-Compare-Select Block of a 32 State, Radix-4 Viterbi Decoder (4-way): branch and state metric inputs, XORs, adders, comparators, select logic, 4:1 multiplexer, updated state metric register/buffer, and decision output
penalty. Further planning and optimization of the ACS placement could be done when
considering the mapping for the entire Viterbi decoder, but for the purpose of the test
mapping that was not considered. Lastly, the ACS mapping reflects the high level of
utilization and dense packing that can be achieved on this architecture.
6.7.3 Mapping Guidelines
In order to aid in the future development of automated mapping tools for this PGA
architecture, the following briefly describes some of the basic strategies used to perform
the hand mappings. First, the design must be decomposed into a set of three input
functions which would represent the function to be assigned to each logic cell. After that,
the details of each function can be black-boxed and a netlist connecting the functions
should be constructed. The next step when mapping a design to this architecture is the
most important: the floorplanning process. In order to truly exploit the features in the
architecture, an intelligent placement must be determined which preserves any inherent
structure in the original design. Since the direct path interconnect is the least costly
routing resource, placement should be performed to take advantage of this fact. Signals
with a fanout of 1 or 2 can often be entirely mapped using the direct interconnect. In
datapath-oriented designs, a bitslice placement allows carry chains and other localized
signals to easily map to the direct connection resources. The tight coupling between pairs
of logic cells in the PGA should also be factored into the grouping of logic. For example,
FIGURE 6-31: Floorplan of ACS Mapping in Figure 6-32 (Add BM0,SM0 through Add BM3,SM3; Compare 0,1 / 1,2 / 0,2 / 0,3 / 1,3 / 2,3; Mux 0,1 and Mux 2,3; Final Mux; Decision Logic)
FIGURE 6-32: Add-Compare-Select Partial Mapping (branch metric inputs BM0_0-BM3_4, state metric inputs SM0_0-SM3_4, 2:1 mux controls)
nodes in the netlist that share signals should be assigned to pairs of logic cells such that the
local busses can be efficiently utilized.
Another crucial consideration when doing placement is long distance signal routing.
The number of horizontally flowing interconnect wires represents a constraint that should
factor into the placement of blocks of logic. During the mapping of the add-compare-
select circuit, an initial placement was performed and then the number of horizontal wires
required to pass across each pair of cells was computed. If the number required exceeded
the resource constraints, then columns of logic were interchanged until a legal mapping
was achieved. In most cases a dense mapping results, but if more horizontal wires are
necessary, the design can be expanded to provide more wires at the expense of underutilizing
some logic cells. Even though the add-compare-select block represents a difficult
mapping case because of the large number of high fanout signals, it was still mapped with
excellent density.
The following bullet points summarize the important factors to consider when
mapping on this architecture. In some sense, the placement and routing problem is very
similar to the one encountered when doing layout design for ICs.
• Decompose the design netlist into a netlist of 3-input functions.
• For designs with inherent structure (as opposed to random logic), consider functional blocks as entities to be mapped as a single unit. Each of those units then has a set of inputs and outputs which need to satisfy wiring resource capacity limits.
• Determine which nets can make use of the direct path connections. (Carry chains can be laid out at this step, and other localized signals). This step will put constraints on which signals should be assigned to which logic cell inputs.
• Do an initial placement and determine horizontal and vertical wiring capacity requirements. Manipulate the placement of blocks until a legal placement results. During this step, some signals will need to be routed on the 4x interconnect layer thus determining which signals will need to be on track 0 or 3. Signals that need to be routed through switch matrices onto vertical Level 1 segments will need to be on track 1 or 2. Some horizontal tracks may need to be swapped to get legal input connections. These cases should help to constrain the overall routing problem.
• At this point, the mapping should be nearly complete and final choices about logic cell input connections can be made.
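The wiring-capacity check used in the steps above can be sketched as follows. The data layout and helper names are hypothetical, and the four-track capacity is an assumption based on the level 1 horizontal bus count per pair of cells:

```python
# Rough sketch of the hand-mapping legality check: count how many nets must
# pass horizontally across each column boundary and compare the demand
# against the horizontal tracks available per row of cell pairs.
from collections import defaultdict

TRACKS_PER_ROW = 4  # level 1 horizontal tracks per pair of cells (assumption)

def horizontal_demand(nets):
    """nets: list of (row, source_col, sink_cols). Returns wires needed per (row, boundary)."""
    demand = defaultdict(int)
    for row, src, sinks in nets:
        lo, hi = min([src] + sinks), max([src] + sinks)
        for boundary in range(lo, hi):   # a net spanning cols lo..hi crosses boundaries lo..hi-1
            demand[(row, boundary)] += 1
    return demand

def is_legal(nets):
    """True when no boundary is oversubscribed."""
    return all(d <= TRACKS_PER_ROW for d in horizontal_demand(nets).values())
```

When a boundary is oversubscribed, columns of logic are interchanged (or the design expanded) and the check repeated until a legal placement results, mirroring the procedure used for the add-compare-select mapping.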
CHAPTER 7
Low Power PGA Results
7.1 Overview
In this chapter, the results of simulations from an extracted layout of a mini-array are
presented. The low power design achieved performance only two times slower than the
Xilinx XC4000 series, while saving more than two orders of magnitude in energy consumption. Against current Xilinx designs, the energy savings are about 15 times and the
performance is about 4x slower. A detailed discussion of energy, delay, and area follows.
7.2 Layout Discussion
The layout of a PGA requires a great deal of optimization since it is an array structure
whose basic tile will be repeated thousands of times. As a result, a custom design was
undertaken so that maximum area efficiency could be achieved. For this PGA design, it is
important to minimize cell area since wiring capacitances account for a large fraction of
the total net capacitances. Moreover, in order to produce a manageable design a small
library of optimized cells was created and re-used throughout the layout. By minimizing
the number of unique cells required, significant overhead involved with maintaining many
layout versions of the same cell in the Cadence environment can be avoided. Although
using a small library results in some loss of layout optimality, significant optimization was
performed on the cells such that the area inefficiency was negligible.
Section 7.2: Layout Discussion 128
Although the entire PGA chip layout was beyond the scope of this thesis, the basic tile
for the array was constructed such that a complete array could be fabricated in the future.
Significant effort was made during the floorplanning stages to ensure that the logic cells
could be tiled appropriately. To build the complete array, the upper levels of the interconnect hierarchy (4X to 1X, 4X to 4X switch matrices) need to be added and the peripheral
memory circuitry needs to be included (row decoders, sense amps). For this work, a 2x4
array of logic cells was constructed (Figure 7-1) which allowed the key performance
FIGURE 7-1: 2x4 Mini-Array of Logic Cells with Block Diagram Representation (4 pairs of cells; track output buffers per pair, 1X to 1X switch matrices, level 1 horizontal tracks)
Section 7.3: Capacitance Data 129
metrics for this thesis to be verified. The basic unit of the array is the pair of logic cells
depicted in Figure 6-19.
The basic structure of each pair of logic cells consists of a memory array with the logic
cell and interface circuitry interspersed throughout the array. Each block of cells can be
directly abutted to neighboring cells allowing seamless power, ground, wordlines, and
bitlines. The horizontal pitch was constrained to the height of a memory cell to achieve
dense packing. Preservation of the memory array structure created many layout difficulties due to the large number of signals traversing the entire cell. Therefore, efficient
planning of wiring layers was necessary to ensure that all signals could be routed. Ground
and power lines were routed horizontally along with poly wordlines. Memory cell bitlines
flow vertically in Metal 2. Metal 1 was used for ground and the 2 volt supply while Metal
3 serviced the cells requiring a 1.5 volt supply. The horizontally oriented array signals like
the local busses, clock and reset are also routed in Metal 3. Layout for a logic cell appears
in Figure 7-2.
The use of a dual supply presented some difficulties within the array-based layout
environment. Normally, separate n-wells are used for cells operating off different
supplies. In this case, a significant area penalty would have resulted because of the fairly
tight integration of low supply and high supply circuitry. (All the memory cells operate
off 2 volts and the logic circuitry at 1.5 volts.) To solve this problem, the low supply
circuitry was placed in the same well as the high supply circuitry. Although this results in
a slightly slower pmos device (pmos Vbs=0.5v instead of 0), significant area from well
spacing and connections from metal 3 to the well are saved. In fact, the slightly increased
threshold of the pmos devices helps to reduce the subthreshold leakage discussed in
Section 6.3.
7.3 Capacitance Data
The following table lists the energy and extracted capacitance data for the paths in the
PGA basic cell. When looking at the table, one should remember that the relationship
between the energy and capacitance numbers for a given component cannot be described
by a single scaling factor; instead, the energies depend on the supply voltage and swing for
that particular path. The number in parentheses that appears next to some values indicates
the reduction that was achieved over the corresponding Xilinx component in Table 3-2.
Since the direct path connections did not have a direct analog in the Xilinx XC4003A part
that was studied, the reduction factor is given relative to a single line Xilinx interconnect
wire. A more detailed breakdown of the capacitance and energy for this design appears in
Appendix C.
TABLE 7-1 Energy and Extracted Capacitance Data
Path | Energy (pJ) | Capacitance (fF)
A or B Input | .275 (145x) | 95 (20x)
C Input | .34 (147x) | 140 (17x)
LUT Tree (all paths toggling) | .265 (94x) | 120 (9x)
FIGURE 7-2: Logic Cell Layout (LUT, A/B/C input muxes, track input buffers, D flip-flop)
From the data, one can see that the cell capacitances and energy have been greatly
reduced over the estimated values for the Xilinx part. The average energy reduction for
the various components was over two orders of magnitude. The substantial improvement
in energy consumption comes from a combination of lower voltage and lower capaci-
tances. The XC4003A operated at 5 volts and the internal interconnect swing was
estimated at 3.5 volts. However, not all nodes experience the reduced swing so an average
swing of 4 volts was used for the determination of capacitance values from energy.
Similarly, not all nodes in the PGA discussed in this thesis swing at the same voltage.
Therefore, an average supply voltage and swing of 1.5 volts is used for the following cal-
culations1. Using these voltage numbers, the approximate energy reduction from supply
voltage and swing reduction is calculated in Equation 21 and Equation 22. Clock power
was also reduced by using 1.5 volt circuitry for clock distribution and the static D flip-flop
implementation.
1. Because of the dual supply voltage (2 volts, 1.5 volts) and the various combinations of signal swings (2 volts, 1.5 volts, 1.1 volts), an analytical expression of energy becomes complicated. Based on the proportion of logic supplied by each voltage and the proportion of capacitance experiencing each swing, a rough estimate of overall energy can be found by assuming Vdd = 1.5 volts and an average swing of 1.5 volts.
TABLE 7-1 (continued) Energy and Extracted Capacitance Data
Path | Energy (pJ) | Capacitance (fF)
LUT Intermediate Output Buffers and Fanout Node | .33 (124x) | 180 (9x)
Vertical Direct Path | .115 (417x) | 55 (43x)
Diagonal Direct Path | .135 (355x) | 60 (40x)
Direct Horizontal Path | .14 (343x) | 65 (37x)
Level 1 Horizontal Track 0 or 3 and Buffers | .42 (143x) | 200 (15x)
Level 1 Horizontal Track 1 or 2 and Buffers | .435 (138x) | 220 (13.6x)
Level 1 Vertical Track | 0.066 | 40
Clock Input / Logic Cell | .13 (115x) | 60 (12.5x)
Reset# Signal / Logic Cell | 0.063 | 28
Program# / Logic Cell | .13 | 32.5
Bit Line / Logic Cell | .13 | 33
Word Line / Logic Cell | .08 | 20
Xilinx: Vdd = 5 V, average swing = 4 V → Energy = C × 5 × 4 = 20C (EQ 21)
Low Power PGA: average Vdd = 1.5 V, average swing = 1.5 V → Energy = C × 1.5 × 1.5 = 2.25C (EQ 22)
Voltage-based energy savings ≈ 9x; capacitance-based energy savings ≈ 10x-15x
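As a sanity check on Equations 21 and 22, the voltage-related savings follow directly from the Energy = C × Vdd × Vswing model used throughout the chapter (the helper name here is illustrative only):

```python
# Back-of-envelope check of Equations 21 and 22: switched energy scales as
# C * Vdd * Vswing, so the voltage-related savings alone are
# (5 * 4) / (1.5 * 1.5) ~= 9x, which multiplies with the 10x-15x capacitance
# reduction to give the order-of-magnitude totals seen in Table 7-1.
def switched_energy(c_farads, vdd, vswing):
    return c_farads * vdd * vswing

xilinx_factor = switched_energy(1.0, 5.0, 4.0)      # 20 * C
lowpower_factor = switched_energy(1.0, 1.5, 1.5)    # 2.25 * C
voltage_savings = xilinx_factor / lowpower_factor   # ~8.9x
combined = [voltage_savings * cap for cap in (10, 15)]  # ~89x to ~133x overall
```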
In addition to the voltage-based improvements, the average capacitance of resources
was lowered by 10x-15x. Much of the decrease in capacitance can be attributed to
efficient sizing of drivers, architectural minimization of fanout, appropriate choice of
switch size (1.8um) to reduce switch capacitances, and a compact layout strategy. It
should also be noted that the Xilinx part was fabricated in a 0.6um double metal process,
whereas this design was done in a 0.6um 3 layer process. Therefore, a decrease in wiring
capacitance should also factor into the overall capacitance reduction, although the
discussion in Section 3.2 made clear that wiring capacitances were not the dominant
capacitance contributor in the Xilinx design. On the other hand, the Xilinx device does
offer more functionality in that the logic blocks can be used as RAMs, but the added
features of the Xilinx would not account for the significant energy difference shown here.
Lastly, the energy consumption of an 8-bit adder is compared to the one measured in the
power analysis of Section 3.3. The Xilinx design consumed about 3.5nJ of energy
excluding the I/O and clock components1. By comparison, an 8-bit adder on the low
power design is estimated to burn 50pJ of energy assuming a comparable amount of interconnect resource usage and an activity factor of 1 (the cell mapping appears in Figure 6-24). Thus, the low power design consumes 70 times less energy. If an average activity
factor of 0.3 is used, the difference becomes greater than two orders of magnitude.
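The adder comparison reduces to simple ratios. The sketch below assumes the 0.3 activity factor scales the low power estimate; the text does not spell this out, so treat it as one plausible interpretation:

```python
# Arithmetic behind the 8-bit adder comparison: 3.5nJ vs 50pJ is 70x, and at
# an assumed average activity factor of 0.3 on the low power estimate the
# gap exceeds two orders of magnitude.
xilinx_adder_j = 3.5e-9     # measured, excluding I/O and clock components
lowpower_adder_j = 50e-12   # estimated at an activity factor of 1

ratio_full_activity = xilinx_adder_j / lowpower_adder_j          # 70x
ratio_low_activity = xilinx_adder_j / (0.3 * lowpower_adder_j)   # ~233x
```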
Although a wealth of data and intuition was gathered from the study of the Xilinx
XC4003A, several improvements have been made to the Xilinx family since the 4003A.
From some recently reported data on the latest Xilinx parts [46], a rough comparison of
energy consumption can be made. The best device from an energy perspective listed in
1. All power analysis measurements were done on designs with a small clock distribution path for simplicity even though some circuits were purely combinational.
Section 7.4: Performance Data 133
the Xilinx Brief Report is the XC4000XL family which operates at 3.3 volts and was
fabricated in a 0.35um process. The power factor reported in the document is 28e-12,
which, when multiplied by Vdd = 3.3 volts, gives an energy of 92pJ/logic cell or 46pJ/logic
function1. By comparison, this low power design consumes 1.5pJ/logic function without
interconnect. Assuming that the average cell’s interconnect needs require an equal
amount of energy (reasonable given the numbers listed in Table 7-1), the total energy/logic
cell is about 3pJ. Thus, when compared to a recent Xilinx PGA in a process that is one
generation ahead, this low power design still offers a substantial improvement in energy
(15x).
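The XC4000XL comparison in numbers, assuming (as the footnote does) that a reported "logic cell" means a two-function-generator CLB:

```python
# The reported power factor times Vdd gives energy per logic cell; halving
# it gives energy per function generator, and doubling the 1.5pJ logic
# figure accounts for an assumed equal interconnect share.
power_factor = 28e-12        # from the Xilinx brief [46]
vdd = 3.3
energy_per_cell = power_factor * vdd          # ~92 pJ per logic cell
energy_per_function = energy_per_cell / 2     # ~46 pJ per function generator

lowpower_logic = 1.5e-12                      # pJ per logic function, no interconnect
lowpower_total = lowpower_logic * 2           # ~3 pJ with equal interconnect share
savings = energy_per_function / lowpower_total  # ~15x
```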
7.4 Performance Data
Although low energy was the primary design goal for this project, delay performance
cannot be neglected. The delays for this low power PGA design were characterized by
extracted simulations of the 8 cell mini-array. Results are presented in Table 7-2 below.
1. It was unclear whether a logic cell corresponds to a CLB which contains 2 function generators or just refers to a single function generator. In this comparison, it will be assumed that the logic cell corresponds to a CLB and so each function generator and associated interconnect consumes half the number reported in the brief.
TABLE 7-2 Performance Data
Path | TPLH Delay (ns) | TPHL Delay (ns)
Logic Function Delay | 5.2 | 5.4
Logic Block Output Delay | 3.5 | 1.8
Combinational Delay | 8.7 | 7.2
Setup Time | 1.5 | 1.5
Clock to Q | 4.0 | 4.5
Level 1 Track Input to Direct Output | 10.4 | 10.3
Level 1 Track Input to Track 0,3 Output | 10.3 | 10.5
Level 1 Track Input to Track 1,2 Output | 10.5 | 10.3
Direct Input to Direct Output | 7.9 | 7.3
Level 1 Track Input to Track 1 through switch matrix to Vertical Track through switch matrix to Track 2 | 11.6 | 9.8
Diagonal Input to Track 1 through switch matrix to Track 1 | 9.7 | 10.1
The combinational delay refers to the delay from a logic cell’s inputs to its output fanout
node (the node immediately before the programmable output drivers). The A and B inputs
experience a slightly faster delay than the C input. Even though the C input has a higher
gate load, the A and B inputs have higher wiring parasitics such that all input paths are
nearly equal. The added delay for using the D flip-flop with synchronous reset is the setup
time (1.5ns) + the clock to Q delay (4.5ns) for a total of 6ns. Thus, fully pipelined
operation could be sustained at speeds above 50MHz. The next entries in Table 7-2 show
the path delay for a number of typical logic cell configurations. Three involve an input
coming from one of the Level 1 horizontal tracks and then routed to a direct path output,
another Level 1 track (0 or 3), and Level 1 track (1 or 2) respectively. The next entry
shows the performance numbers for a direct cell-to-cell path which would be typical in the
ripple carry chain of an adder. The last two entries give the delays for a logic cell which
drives its output through multiple segments of the Level 1 interconnect network.
Simulations using the slow process file resulted in some paths slowing by as much as
2x. High process sensitivity is a direct consequence of the reduced swings that are
propagated throughout the interconnect. Therefore, delay variation represents a
significant problem for designs which produce gate overdrive voltages at or below 1 Vt.
Despite the slow-down, all paths remained functional.
Often, significant energy savings come at the cost of reduced performance. Because of
the reduced voltage levels, a slow-down is experienced in this design, but the drop in
performance is not severe. Compared to the Xilinx XC4000 series delays in Table 2-1
which showed a combinational delay of 4ns and interconnect delays slightly above 1 ns,
this low energy design gives up about 2x in speed. For this design, reduced capacitance
resulting in smaller buffering delays probably helped minimize the performance
degradation when operating at a low supply of 1.5 volts. Current Xilinx designs such as
the XC4000XL achieve combinational delays of 1.5-2ns in a 0.35um process.
Figure 7-3 represents the signal path of a typical signal traveling from the logic cell
input to a diagonal output. Most of the circuitry from other paths has been removed from
the figure for clarity. Table 7-3 shows the delay along this path.
TABLE 7-3 Delay Breakdown for Path in Figure 7-3
Path | tplh (ns) | tphl (ns) | cumulative tplh (ns) | cumulative tphl (ns)
Level 1 Track -> Track Buffer Output | 1.5 | 1.15 | 1.5 | 1.15
Track Buffer Output -> A Input MuxOut | 0.25 | 0.25 | 1.75 | 1.4
A Input MuxOut -> A | 1.66 | 1.8 | 3.41 | 3.2
A -> A# | 0.54 | 0.35 | 3.95 | 3.55
A# -> Fout | 1.3 | 1.87 | 5.25 | 5.42
Fout -> Lutout | 3.47 | 1.77 | 8.72 | 7.19
Lutout -> Direct Output | 1.77 | 3.2 | 10.49 | 10.39
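The cumulative columns of Table 7-3 are running sums of the per-stage delays, and the totals agree (to rounding) with the Level 1 Track Input to Direct Output row of Table 7-2:

```python
# Accumulating the per-stage delays of Table 7-3 reproduces its cumulative
# columns, ending at ~10.49/10.39 ns for the full track-input-to-direct-output
# path.
from itertools import accumulate

tplh = [1.5, 0.25, 1.66, 0.54, 1.3, 3.47, 1.77]   # ns, stage by stage
tphl = [1.15, 0.25, 1.8, 0.35, 1.87, 1.77, 3.2]   # ns, stage by stage

cum_tplh = list(accumulate(tplh))  # ends at 10.49 ns
cum_tphl = list(accumulate(tphl))  # ends at 10.39 ns
```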
FIGURE 7-3: Typical Signal Path from Level 1 Horizontal Track Input to Diagonal Output (input track buffer, A input mux, LUT tree, output buffers, flip-flop; internal nodes A, A#, fout, lutout)
Lastly, a simulation was performed for a multi-path interconnection of the 2x4 mini-
array as shown in Figure 7-5. Table 7-4 contains the delays for the paths in bold
measured from the primary inputs to the labeled points. A plot of some of the waveforms
from the simulation appears in Figure 7-4. From the HSPICE plot the three voltage
swings are evident as the logic cell inputs swing at 2 volts, the diagonal outputs swing at
1.5 volts and the general interconnect and internal nodes swing at about 1.1 volts. The two
1.5 volt signals correspond to the outputs of successive direct path vertical connections
rippling through the cells. The last two ~1.1 volt waveforms were taken along interconnect path A.
TABLE 7-4 Delays from Multi-Path Example
Path | Delay (ns)
A | 10.2
B | 18.5
C | 17.7
D | 21.6
FIGURE 7-4: HSPICE Waveforms for Multi-Path Simulation
Section 7.5: Area 137
7.5 Area
The area of the design was kept as small as possible to minimize the impact of wiring
parasitics. On many of the interconnect nets, wiring capacitances made up half the total
load capacitance so a dense layout was crucial. Table 7-5 shows the area breakdown for
major components of the design. For comparison, a Xilinx XC4000 CLB occupies an area
of 600um x 600um = 360,000um2 in a 2-layer metal, 0.6um process.
TABLE 7-5 Area Breakdown
Block | Area
2x4 Mini-Array | 435um x 230um = 100,000um2
Pair of Logic Cells | 205um x 115um = 23,600um2
Switch Matrix | 25um x 115um = 2,875um2
Configuration Memory Cells / Logic Cell Pair | 74 x 150um2 = 11,100um2
Output Driver Circuitry / Logic Cell Pair | 5,800um2
Static D Flip-Flop | 400um2
FIGURE 7-5: Multi-Path Example (paths A-D)
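A quick arithmetic check of the Table 7-5 figures; the per-cell number is derived here for illustration and is not quoted in the text:

```python
# The 2x4 mini-array area divided by its 8 logic cells gives roughly
# 12,500 um^2 per cell, versus the 360,000 um^2 quoted for a Xilinx XC4000
# CLB in a 2-layer metal, 0.6um process.
mini_array_um2 = 435 * 230         # 100,050, quoted as ~100,000 um^2
per_cell_um2 = mini_array_um2 / 8  # ~12,500 um^2 per logic cell
xc4000_clb_um2 = 600 * 600         # 360,000 um^2
```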
CHAPTER 8
Conclusions
8.1 Conclusions
This thesis has shown that low energy programmable logic is possible without giving
up the flexibility that makes programmable logic attractive. In doing so, a detailed power
analysis of a Xilinx PGA was undertaken to gain a solid understanding of the power con-
tributors in a PGA. Based on the intuitions formed from the power profiling, a low power
PGA architecture was specified. Several circuit issues were examined relating to PGA
circuit design and low voltage pass transistor operation during this research. Finally, the
main building blocks of a low power PGA were implemented and extracted simulations
were performed to verify performance.
From the power analysis work, several important conclusions arose. First, the internal
load capacitances of a Xilinx device are 100 times greater than the capacitance levels seen
in a custom integrated circuit design. The principal reason for such high capacitances was
shown to be the parasitic capacitance from large switches and other circuitry to support
high fanout and in turn high flexibility. Wiring capacitance was not a significant factor.
Secondly, the wealth of heavily capacitive interconnect resources present in PGA designs
causes interconnect circuitry to account for over 65% of PGA power. The next 20% of
power goes into clocks. In addition, profiling showed that local interconnect dominates
PGA design power as opposed to long distance interconnect. The reason for this is that
the local level must support the high fanout and complicated routing patterns among logic
cells, resulting in high capacitances; coupled with the fact that local resources
make up over 80% of a design's interconnect, this results in substantial energy consumption.
Overall, the many insights drawn from the power analysis helped to shed light on the
previously neglected area of PGA power consumption allowing the later design efforts to
be clearly focused.
During the design phase of this work, several studies were performed to understand
the constraints and evaluate the trade-offs involved in low power PGA design. Ironically,
the aim to preserve the powerful attribute of PGA flexibility imposed tighter constraints on
the design space, causing several conventional low power techniques to be infeasible. In
the end, reduction of capacitance and voltage levels proved to be the highest leverage low
power tools. Prudent switch sizing, architectural minimization of fanout, and a pass
transistor implementation combined to lower capacitance such that wiring parasitics
became a comparable factor to overall load capacitance. Furthermore, the examination of
low voltage pass transistor design showed that, unless differential restoration techniques
are used, device thresholds must be lowered if pass transistors are to retain their utility as
supply voltages continue to fall.
Lastly, a PGA architecture has been specified and a mini-array designed to
demonstrate the energy improvements that have been achieved. Architecturally, the array
consists of 3-LUT logic cells connected by a hierarchy of interconnect. At the lower
levels of interconnect, the combination of a direct path mesh with a local bus structure
allows locality to be exploited, resulting in energy savings. Aggressive circuit design
techniques were used to enable a single-ended pass transistor implementation at low
voltage. As a result, energy savings of one to two orders of magnitude were realized
over commercial designs at a relatively small compromise in performance. Overall, a
substantial decrease in energy-delay product was achieved through this design.
8.2 Future Work
As in any research project, several avenues for future work exist. Integrating a
complete PGA array will allow the design to be fully verified and its utility to be
exploited. Software mapping tools need to be developed to permit a more complete
evaluation of the architecture's mappability. In addition, characterization data for the
design can be used to assess power consumption during the mapping process. For the
research discussed in this thesis, intuition built up from experimental implementations
and a variety of specialized studies proved essential; however, better ways to efficiently
explore PGA architectures need to be developed. Tools that can specify architectural
features, including different interconnect topologies, in a template fashion would allow a
more quantitative approach in a field that is predominantly driven by heuristics.
9.0 References
[1] Ahrens, M., El Gamal, A., "An FPGA Family Optimized for High Densities and Reduced Routing Delay," Actel Corporation, IEEE Custom Integrated Circuits Conference, 1990.
[2] Atmel, "Configurable Logic Design and Application Handbook", 1995.
[3] Black, P., Meng, T., "A 140 Mb/s, 32-State, Radix-4 Viterbi Decoder," IEEE Journal of Solid State Circuits, Vol. 27, No. 12, December 1992, pp. 1877-1885.
[4] Bowhill, W., et al, “A 433 MHz 64b Quad Issue RISC Microprocessor,” IEEE Journal of Solid State Circuits, Vol 31, No. 11, November 1996, pp ##.
[5] Brebner, G. , “Configurable Array Logic Circuits for Computing Network Error Detection Codes,” Journal of VLSI Signal Processing, 6, 1993, pp. 101-117.
[6] Chandrakasan, A. , “Low Power Digital CMOS Design,” PhD. Thesis, U.C. Berkeley, August 1994.
[7] CLAy Family Introduction Datasheet, National Semiconductor, June 1994.
[8] DeHon, A. , “Reconfigurable Architectures for General Purpose Computing,” M.I.T. PhD Thesis, A.I. Technical Report 1586, October 1996.
[9] Dobbelaere, I., Horowitz, M., El Gamal, A., "Regenerative Feedback Repeaters for Programmable Interconnections", ISSCC Digest of Technical Papers, 1995, pp. 116-117.
[10] Farrahi, A., Sarrafzadeh, M., "FPGA Technology Mapping for Power Minimization," International Workshop on Field-Programmable Logic and Applications, FPL '94 Proceedings, Springer-Verlag, 1994, pp. 66-77.
[11] El Gamal, A., et al., "An Architecture for Electrically Configurable Gate Arrays," IEEE Journal of Solid-State Circuits, Vol. 24, No. 2, April 1989, pp. 394-398.
[12] George, V., "The Effect of Logic Block Granularity on Interconnect Power in a Reconfigurable Logic Array", CS 294 report, May 1997.
[13] Goto, G., et al “A 4.1ns Compact 54x54 Multiplier Utilizing Sign-Select Booth Encoders,” IEEE Journal of Solid State Circuits, Vol 32, No. 11, November 1997, pp 1676-1683.
[14] Hauck, S., Borriello, G., Ebeling, C., "Triptych: An FPGA Architecture with Integrated Logic and Routing", in Advanced Research in VLSI and Parallel Systems: Proceedings of the 1992 Brown/MIT Conference, pp. 26-43, March 1992.
[15] Infopad Project, U.C. Berkeley, http://infopad.EECS.Berkeley.EDU/infopad
[16] Izumikawa, M. , et al., “A 0.25um CMOS 0.9v 100-MHz DSP Core,” IEEE Journal of Solid-State Circuits, Vol. 32, No. 1, January 1997, p. 52-60.
[17] Jou, S., et al, “A Pipelined Multiply-Accumulator using a High-Speed Low-Power Static and Dynamic Full Adder Design,” IEEE Journal of Solid State Circuits, Vol 32, No. 11, November 1997, pp ##.
[18] Roy, K., Prasad, S., "FPGA Technology Mapping for Power Minimization," International Workshop on Field-Programmable Logic and Applications, FPL '94 Proceedings, Springer-Verlag, 1994, pp. 57-65.
[19] Kobayashi, T., Sakurai, T., "Self-Adjusting Threshold-Voltage Scheme (SATS) for Low-Voltage High-Speed Operation," IEEE 1994 Custom Integrated Circuits Conference, pp. 12.3.1-12.3.4.
[20] Kusse, E., Carloni, L., Chong, P., “A 1.5 Volt Fine Grain Pass Transistor FPGA”,EE241Project, http://infopad.EECS.Berkeley.EDU/~icdesign/ee241_97/PROJECTS/kusse.pdf, May 1997.
[21] Lee, W., et al, “A 1-Volt Programmable DSP for Wireless Communications,” Journal of Solid State Circuits, Vol 32, No. 11, November 1997, pp 1766-1777.
[22] Lucent Technologies, ORCA Field Programmable Gate Arrays Datasheet, August 1996.
[23] Lytle, C. , “Altera FLEX Programmable Logic - Largest Density PLD,” Altera Cor-poration, 1993.
[24] Montanaro, J., et al, "A 160 MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE Journal of Solid State Circuits, Vol 31, No. 11, November 1996, pp 1703-1712.
[25] Mutoh, S., Yamada, J., et al., "1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS", IEEE Journal of Solid State Circuits, Vol 30, No. 8, August 1995, pp 847-853.
[26] O'Donnell, I. , “Digital Circuit and Board Design for a Low Power, Wideband CDMA Receiver” Master Thesis, UC Berkeley Dec 1996.
[27] Pleiades Project, U.C. Berkeley, http://infopad.EECS.Berkeley.EDU/research/reconfigurable/
[28] Rabaey, J., Digital Integrated Circuits, Prentice Hall, New Jersey, 1996.
[29] Rose, J., Francis, R.J., Lewis, D., and Chow, P., "Architecture of Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency", IEEE Journal of Solid-State Circuits, Vol. 25, No. 5, October 1990, pp. 1217-1225.
[30] Rose, J. , Hill, D., “Architectural and Physical Design Challenges for One-Million Gate FPGAs and Beyond,” FPGA 97 Conference, February 1997.
[31] Sakurai, T., et al, "A 200 MHz 13 mm2 2-D DCT Macrocell Using Sense-Amplifying Pipeline Flip-Flop Scheme," IEEE Journal of Solid State Circuits, Vol 29, No. 12, December 1994, pp 1482-1489.
[32] Sakurai, T. , et al, “A Swing Restored Pass-Transistor Logic-Based Multiply and Accumulate Circuit for Multimedia Applications,” IEEE Journal of Solid State Circuits, Vol 31, No. 6, June 1996, pp 804-809.
[33] Sakurai, T., et al, "A 0.9 V, 150 MHz, 10-mW, 4mm2, 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage Scheme," IEEE Journal of Solid State Circuits, Vol 31, No. 11, November 1996, pp 1770-1777.
[34] Scalera, S. , Sanders, private communication.
[35] Singh, S., Rose, J., "The Effect of Logic Block Architecture on FPGA Performance," IEEE Journal of Solid-State Circuits, Vol. 27, No. 3, March 1992, pp. 281-287.
[36] Srinivas, R., et al., "A High Density Embedded Array Programmable Logic Architecture", Altera Corporation, IEEE 1996 Custom Integrated Circuits Conference.
[37] Sun, S. ,Tsui, P. , “Limitation of CMOS Supply Voltage Scaling by MOSFET Threshold Variation”, in IEEE Journal of Solid State Circuits, Vol 30, No. 8, August 1995, pp 947-949.
[38] Trimberger, S., "Field Programmable Gate Array Technology," Kluwer Academic Publishers, Boston, Mass., 1994.
[39] Trimberger, S., et al., "Architecture Issues and Solutions for a High-Capacity FPGA," Xilinx, Inc., FPGA '97, February 1997.
[40] Veendrick, H., et al, “An Efficient and Flexible Architecture for High-Density Gate Arrays,” IEEE Journal of Solid State Circuits, October 1990, pp 1153-1157.
[41] ViewLogic Corporation, Viewsim on-line documentation. http://www.view-logic.com/support/.
[42] Weste, N. ,Eshraghian, K. , Principles of CMOS VLSI Design, Addison-Wesley, Massachusetts 1993.
[43] Xilinx Corporation, “XC4000 Field Programmable Gate Arrays: Programmable Logic Databook”, 1996.
[44] Xilinx Corporation, “XC6200 Field Programmable Gate Arrays: Advance Product Specification”, June 1996.
[45] Xilinx Corporation, "Application Brief #2, Low Power Benefits of XC4000E/X Overview", May 1997.
[46] Xilinx Corporation, "Application Brief #14, A Simple Method of Estimating Power in XC4000 XL/EX/E FPGAs", June 1997.
[47] Xilinx Corporation, XACT Reference Guide, April 1994.
[48] Yano, K. , et al, “A 3.8nS CMOS 16x16-b Multiplier Using Complementary Pass-Transistor Logic,” IEEE Journal of Solid State Circuits, Vol 25, No. 2, April 1990, pp 388-394.
[49] Yano, K. , et al, “Top-Down Pass Transistor Logic Design,” IEEE Journal of Solid State Circuits, Vol 31, No. 6, June 1996, pp 792-803.
[50] Zimmerman, R., Fichtner, W., “Low Power Logic Styles: CMOS versus Pass Tran-sistor Logic”, IEEE Journal of Solid State Circuits, Vol. 32, July 1997.
APPENDICES
Appendix A: Xilinx XC4003A Resource Measurements
Appendix B: Power Profiling Data
Appendix C: Detailed Breakdown of Low Power PGA Energy and
Capacitance
Appendix D: PGA Configuration Encodings and Example Initialization File
Appendix E: PGA Memory Cell Programming Polarities and Memory Cell
Layout Locations.
APPENDIX A

POWER MEASUREMENTS — Xilinx XC4003A-PC84C (0.6um, 2 layers of metal)
Energy in mW/MHz, Estimated Capacitance in pF
INTERCONNECT
half vertical longline 0.041 2.05
full length vertical longline 0.088 4.4
half vertical longline + 0.044 2.2
full length vertical longline + 0.094 4.7
half horizontal longline 0.022 1.1
full length horizontal longline 0.054 2.7
half horizontal longline + 0.090 4.5
full length horizontal longline + 0.160 8
vertical right doubleline 0.098 4.9
vertical left doubleline 0.060 3
horizontal down doubleline 0.080 4
horizontal up doubleline 0.107 5.35
clb input interconnect horizontal C 0.044 2.2
clb input interconnect horizontal G 0.039 1.95
clb input interconnect horizontal F 0.047 2.35
clb input interconnect vertical C 0.047 2.35
clb input interconnect vertical G 0.040 2
clb input interconnect vertical F 0.051 2.55
single line X vertical between SM left 0.067 3.35
single line X vertical between SM right 0.056 2.8
single line X horizontal between SM up 0.085 4.25
single line X horizontal between SM down 0.054 2.7
single line Y vertical between SM left 0.096 4.8
single line Y vertical between SM right 0.048 2.4
single line Y horizontal between SM up 0.054 2.7
single line Y horizontal between SM down 0.088 4.4
global clk distribution line (not connected to CLBs) 1.287 64.35
1 global column wire 0.128 6.4
carry chain 0.050 2.5
The '+' sign signifies another longline, one with slightly different properties that give
rise to a different capacitance value than the longlines without a '+' symbol. The CLB
input entries represent the circuitry associated with interfacing to the inputs of a CLB
(usually a multiplexer structure); the letters refer to an F or G function input. The four C
inputs do not feed distinct function generators; they serve as alternative inputs to some of
the other CLB circuitry. The entries for "correl4b" refer to the structured correlator
mapping and "correl4rb" to the simulated annealing based mapping.
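The energy and capacitance columns in these tables appear to follow a simple charge relation; the sketch below is an inference from the tabulated numbers, not a method stated in the thesis:

```python
# The "Estimated Capacitance" column appears to be back-computed from the
# measured energy as C = E / (Vdd * Vswing): a full 5 V swing for CLB and I/O
# entries, and a reduced ~4 V (roughly Vdd - Vt) swing for the pass-transistor
# interconnect entries.  This mapping is an assumption inferred from the data.
def cap_pf(energy_mw_per_mhz, vdd=5.0, vswing=5.0):
    energy_j = energy_mw_per_mhz * 1e-9      # 1 mW/MHz = 1 nJ per cycle
    return energy_j / (vdd * vswing) * 1e12  # farads -> picofarads

print(round(cap_pf(0.041, vswing=4.0), 2))  # half vertical longline -> 2.05 pF
print(round(cap_pf(0.041), 2))              # F function output to X -> 1.64 pF
```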
POWER MEASUREMENTS — I/O
Energy in mW/MHz, Estimated Capacitance in pF
i1 input path 0.062 2.48
i2 input path 0.140 5.6
input clked path 0.151 6.04
output I/O path inc interconnect (slow) 1.157 46.28
output I/O path inc interconnect (fast) 1.269 50.76
output I/O path inverted inc interconnect (slow) 0.989 39.56
output I/O clked path inc interconnect+clk (slow) 1.326 53.04
output I/O clked path inc interconnect+clk (fast) 1.459 58.36
CLBs
F function output to X 0.041 1.64
F function clked output of FFX 0.060 2.4
F function clked output of FFX to XQ is negligible
F,G function output to X,Y 0.082 3.28
F,G function output to H function generator 0.006 0.24
extension of H output to X output negligible
internal muxing connections negligible (< 0.010)
connection of clk to 2 internal CLB FFs 0.030 1.5
F function internal 0.025 1.25
G function internal 0.020 1
APPENDIX B
Design Name, # of CLBs, # of IOBs, # nets, # SM passes, Total Power excl CLB (mW), Meas Power (mW), Meas Energy (nJ), Total Estimated Power (mW), Freq (MHz)

accumy16b 18 18 69 62 76 119.5 8.13 83.08 14.7
accumy8b 10 10 36 31 41.9 65 4.42 45.85 14.7
accumy4b 7 6 19 28 24.6 37.5 2.55 26.64 14.7
addery8b 7 18 34 53 45.7 62.2 4.23 48.25 14.7
addery4b 6 10 20 23 23.3 32.5 2.21 24.72 14.7
barrely8b 25 13 38 51 57.7 77 5.24 63.58 14.7
barrely4b 9 8 17 22 25 35.5 2.41 27 14.7
comparemcy8b 8 18 32 61 43.5 61.5 4.18 44.67 14.7
comparemy8b 16 18 37 73 53.6 71.5 4.86 55.7 14.7
comparemy4b 7 10 18 34 27 37.5 2.55 28.36 14.7
comparemy2b 3 6 9 10 13.7 16.5 1.12 14.51 14.7
comparey8b 3 18 21 23 29.7 34.5 2.35 30.23 14.7
comparey4b 2 10 12 9 17.1 20.25 1.38 17.68 14.7
comparey2b 2 6 8 6 11.4 14 0.95 12.02 14.7
countercy16b 9 2 26 4 12.8 18.5 1.26 13.87 14.7
countercy8b 5 2 14 2 12.2 16 1.09 13.27 14.7
countery16b 19 2 25 44 32.3 35 2.38 33.46 14.7
countery8b 10 2 12 14 19.6 22.5 1.53 20.81 14.7
countery4b 5 2 7 8 15.6 18.5 1.26 16.72 14.7
decodery4x16b 17 6 26 18 30.7 39 2.65 31.76 14.7
muxy8b 4 13 17 16 24.1 28 1.90 24.94 14.7
muxy4b 2 8 10 12 14.9 18.25 1.24 15.46 14.7
muxy2b 2 5 7 5 9.5 11.75 0.80 10.03 14.7
parityy9b 4 11 15 13 22 30.75 2.09 23.31 14.7
shifty16b 17 3 20 15 39.6 42.5 2.89 40.12 14.7
shifty8b 9 3 12 4 21.5 23.5 1.60 23.34 14.7
shifty4b 5 3 8 2 14.9 17 1.16 16.08 14.7

matchb 100 6 181 268 197 204 13.88 203.2 14.7456
txchip 100 13 178 433 133 157 10.68 136.4 14.7456
correl4rb 71 7 176 168 39.2 42.6 5.84 40.5 7.373
correl4b 76 7 171 223 35.8 40 5.48 37.1 7.373
l5b datapath 13 10 33 59 56.6 67 4.56 58.7 14.7456
l2b pipe datapath 17 10 35 66 62.9 73 4.97 64.7 14.7456
l3b random logic 29 6 61 97 45.6 49 3.33 46.2 14.7456
fir5b 52 10 140 199 212 283.6 19.29 231.24 14.7456
rs_mult 57 18 85 163 154 232 15.78 170.2 14.7456

All Designs
Cumulative Ave.
Std. Dev.

Large Designs Only
Cumulative Ave.
Std. Dev.
Design Name, I/O Power, CLB Power, Clock Net Power, Single Wire Power, Double Wire Power, Longline Power, Carry Chain Power, CLB Input Line Power, CLB Output Line Power (all in mW)

accumy16b 7.11 7.08 9.81 26 12.45 2.46 1.67 10.61 8.45
accumy8b 3.27 3.95 9.46 15.3 5.85 0.73 0.75 5.78 4.7
accumy4b 1.5 2.04 7.17 5.98 6.78 0 0.22 3.02 2.53
addery8b 7.7 2.55 5.84 10.57 16.48 1.85 0.8 3.54 2.76
addery4b 3.55 1.42 5.84 5.11 8.21 0.3 0.41 2.22 1.59
barrely8b 5.4 5.88 5.84 14.16 14.76 2.56 0 13.82 5.11
barrely4b 3.1 2 5.84 6.8 5.49 1.27 0 4.62 1.87
comparemcy8b 6.5 1.17 5.84 9.66 18.02 1.19 0.87 3.96 1.35
comparemy8b 7.4 2.1 5.84 10.81 19.79 1.91 0 9.42 2.22
comparemy4b 3.55 1.36 5.84 9.15 6.23 0.69 0 4.16 1.37
comparemy2b 2.06 0.81 5.84 3.61 3.23 0 0 2.01 0.9
comparey8b 7.11 0.53 5.84 9.95 6.52 0.28 0 3.22 0.67
comparey4b 3.3 0.58 5.84 5.76 3.55 0.08 0 1.87 0.71
comparey2b 1.8 0.62 5.84 4.41 1.17 0.08 0 1.29 0.75
countercy16b 0 1.07 9.86 1.95 0.61 0 0.25 1.23 1.48
countercy8b 0 1.07 9.46 3.33 0.45 0 0.25 1.22 1.47
countery16b 0 1.16 24.18 4.45 1.75 0.66 0 3.72 1.52
countery8b 0 1.21 11.22 4.78 1.63 0.77 0 3.69 1.51
countery4b 0 1.12 9.46 3.77 2.2 0 0 2.73 1.42
decodery4x16b 2.01 1.06 30.66 10.08 1.84 2 0 11.3 1.2
muxy8b 5.4 0.84 5.84 7.82 3.42 0.7 0 3.85 0.97
muxy4b 2.81 0.56 5.84 4.38 3.26 0 0 1.87 0.7
muxy2b 1.26 0.53 5.84 3.22 1.33 0 0 1.1 0.68
parityy9b 4.4 1.31 5.84 7.11 3.51 0.2 0 3.44 1.43
shifty16b 3.33 0.52 24.06 5.81 5.07 0.81 0 3.21 4.07
shifty8b 0.52 1.84 14.95 4.52 1.02 0.23 0 1.9 2.37
shifty4b 0.52 1.18 11.33 3.36 0.72 0.23 0 1.23 1.48

matchb 3.4 6.2 63.8 41.2 39 6.2 0 27.1 17.1
txchip 3.7 3.4 49 33.6 15.6 4.8 0 18.8 7.3
correl4rb 0.8 1.3 16.7 5 8.4 0.9 0 6.1 3.2
correl4b 0.8 1.3 12.5 5.5 8.4 1.5 0 6.1 3.1
l5b datapath 4.1 2.1 9.1 14 14.4 1.9 0.3 8.6 4.1
l2b pipe datapath 4.2 1.8 20.8 19.4 14 6.2 0.8 7.6 4.2
l3b random logic 2.1 0.6 29.3 4.4 8.1 0.4 0 3.75 1.5
fir5b 4.81 19.24 26.28 66.86 48.37 10.74 6.24 29.73 21.77
rs_mult 6.8 16.2 5.6 49.3 37.3 12.4 0 34 13.4

All Designs
Cumulative Ave. 3.18 2.71
Std. Dev. 2.38 4.04

Large Designs Only
Cumulative Ave. 3.41 5.79 25.90 26.58 21.51 5.00 0.82 15.75 8.41
Std. Dev. 1.94 7.00 19.26 22.40 15.58 4.34 2.05 11.80 7.24
Design Name, # Single Lines, # Double Lines, # Long Lines, # Carry Chains, # CLB Input Lines, # CLB Output Lines, # Clock Lines, # of F inputs, # of G inputs, # of C inputs

accumy16b 88 42 12 8 74 41 10 41 33 0
accumy8b 43 20 6 4 38 22 6 21 17 0
accumy4b 27 24 2 2 20 11 4 11 9 0
addery8b 27 54 8 5 20 11 1 11 9 0
addery4b 12 26 3 3 12 7 1 7 5 0
barrely8b 43 46 15 0 73 25 1 69 4 0
barrely4b 20 18 9 0 25 9 1 24 1 0
comparemcy8b 29 60 7 5 22 7 1 12 9 1
comparemy8b 40 70 10 0 60 17 1 46 13 1
comparemy4b 31 23 6 0 24 8 1 18 5 1
comparemy2b 8 10 1 0 9 3 1 8 1 0
comparey8b 33 22 3 0 18 3 1 8 9 1
comparey4b 15 12 2 0 9 2 1 4 5 0
comparey2b 9 4 2 0 5 2 1 4 1 0
countercy16b 15 7 1 7 17 17 10 8 9 0
countercy8b 9 3 1 3 9 9 6 4 5 0
countery16b 63 21 5 0 66 21 23 57 5 4
countery8b 23 10 6 0 29 10 10 27 1 1
countery4b 11 4 1 0 11 5 6 10 1 0
decodery4x16b 41 7 7 0 65 17 1 40 25 0
muxy8b 24 12 3 0 21 4 1 10 9 2
muxy4b 10 11 1 0 9 2 1 4 5 0
muxy2b 6 5 1 0 4 2 1 3 1 0
parityy9b 18 10 2 0 16 4 1 12 1 3
shifty16b 17 16 4 0 17 17 23 0 1 16
shifty8b 11 3 3 0 9 9 12 0 1 8
shifty4b 6 2 3 0 5 5 7 0 1 4

matchb 284 205 41 0 350 175 110 80 120 150
txchip 446 243 88 0 555 151 69 297 237 21
correl4rb 207 119 29 15 202 146 77 72 60 70
correl4b 222 167 38 15 209 153 76 71 62 76
l5b datapath 48 42 9 3 51 21 11 25 9 17
l2b pipe datapath 58 48 15 3 64 24 15 32 9 23
l3b random logic 62 98 16 0 119 53 32 51 55 13
fir5b 200 116 42 32 152 82 28 80 61 11
rs_mult 161 118 45 0 197 57 0 174 23 0

All Designs
Cumulative Ave.
Std. Dev.

Large Designs Only
Cumulative Ave. 187.56 128.44 35.89 7.56 211.00 95.78 46.44 98.00 70.67 42.33
Std. Dev. 127.74 66.80 23.63 11.06 156.96 60.61 37.59 86.11 71.16 48.21
Design Name, Ave. Net Cap (pF), Max Net Cap (pF), Min Net Cap (pF), Clock Cap (pF), Ave Fanout, Max Fanout, Ave Man Dist (CLB pitches), Ave Route Dist (CLB pitches), Ave. Net Activity, Ave. Switched Cap. (pF)

accumy16b 12 17.5 2.5 33.6 1.2 2 4 8.2 0.27 3.45
accumy8b 10.85 17.5 2.5 32.4 1.2 1 4.3 7 0.26 3.23
accumy4b 12.1 15 2.5 24.6 1.2 2 4.6 5.5 0.24 3.48
addery8b 15.15 7.2 2.5 20 1.1 2 7.7 9.7 0.27 4.22
addery4b 11.75 6.85 2.5 20 1.2 2 4.6 5.9 0.23 3.28
barrely8b 17.28 72.2 4.75 20 2.7 8 4.4 6.5 0.28 4.89
barrely4b 15.76 34 4.85 20 2.4 4 5.8 8.2 0.24 4.34
comparemcy8b 16.22 26.5 2.5 20 1 1 5.6 7.3 0.21 4.41
comparemy8b 19.5 36.6 4.85 20 2 4 5 6.9 0.19 4.76
comparemy4b 16.9 19.35 4.85 20 1.8 2 6.1 7.7 0.21 4.49
comparemy2b 11.84 4.85 4.85 20 2 2 3.5 5.1 0.21 3.81
comparey8b 15.78 7.05 4.85 20 1 1 3.6 5.4 0.22 4.25
comparey4b 12.84 4.85 4.85 20 1 1 3.8 5.2 0.22 3.82
comparey2b 9.34 5.4 5.4 20 1 1 3.5 4.8 0.21 3.12
countercy16b 6.62 12.35 2.5 33 1 1 0.3 1.4 0.06 0.41
countercy8b 5.72 10.35 2.5 32.4 1 1 0.3 1.3 0.11 0.78
countery16b 21.13 49.8 8.05 82.4 3.2 5 1.8 3.5 0.05 1.26
countery8b 18.64 33.7 10.1 38.4 3.1 5 1.6 4.5 0.1 2.84
countery4b 11.26 28.4 11.65 32.4 2.5 4 1 2.7 0.19 4.15
decodery4x16b 15.37 6.85 4.45 20 12.8 16 6 9.4 0.09 3.65
muxy8b 14.7 12.6 6.3 20 1.5 4 4.6 7 0.23 4.12
muxy4b 12.45 26.3 4.85 20 1.3 2 4 5.4 0.21 3.84
muxy2b 7.35 24.35 4.85 20 1 1 2.7 4.3 0.18 2.46
parityy9b 13.27 14.75 4.85 20 1.5 2 2.5 3.6 0.28 4.2
shifty16b 10.72 29.4 5.7 82.4 1 1 2.2 4.1 0.23 2.92
shifty8b 7.44 12 5.7 51.2 1 1 2.4 2.9 0.22 2.22
shifty4b 6.05 11.25 5.7 38.8 1 1 3.2 4 0.21 2.02

matchb 16.9 41 5 198 2 3 2.6 4.1 0.12 2.3
txchip 23.6 134 6 153 3.5 33 3.4 6.2 0.07 1.6
correl4rb 11.4 76 5 103 1.4 12 2.6 4.7 0.05 0.8
correl4b 12.7 55 5 76 1.4 12 2.8 4.7 0.05 0.8
l5b datapath 15.4 38 4 32.7 1.5 4 3.4 5.7 0.25 4.3
l2b pipe datapath 15.5 36 4 63.4 1.5 4 3.3 4.8 0.23 3.59
l3b random logic 15.4 31 4 115 1.9 4 3.3 5.5 0.03 0.83
fir5b 14.19 51.6 2.5 90 1.6 15 4 6 0.31 4.82
rs_mult 22.3 29 5.4 19.3 2.9 11 6.5 9 0.32 6.43

All Designs
Cumulative Ave. 13.76 26.30 1.96 3.64 5.51 0.19 3.13
Std. Dev. 4.35 1.99 1.66 2.03 0.08 1.43

Large Designs Only
Cumulative Ave. 16.38 54.62 4.54 94.49 1.97 10.89 3.54 5.63 0.16 2.83
Std. Dev. 4.08 33.15 1.03 56.38 0.74 9.44 1.20 1.44 0.12 2.05
Design Name, Error %, CLB Power %, IO Power %, Clock Power %, Interconnect Power % (power percentages relative to total power), Ave Inputs Used/CLB, Ave. # F Inputs, Ave. # G Inputs, Ave. # C Inputs

accumy16b 30.48% 8.52% 8.56% 8.71% 74.21% 4.11 2.28 1.83 0.00
accumy8b 29.46% 8.62% 7.13% 11.97% 72.28% 3.80 2.10 1.70 0.00
accumy4b 28.96% 7.66% 5.63% 17.30% 69.41% 2.86 1.57 1.29 0.00
addery8b 22.43% 5.28% 15.96% 3.88% 74.88% 2.86 1.57 1.29 0.00
addery4b 23.94% 5.74% 14.36% 7.56% 72.33% 2.00 1.17 0.83 0.00
barrely8b 17.43% 9.25% 8.49% 2.94% 79.32% 2.92 2.76 0.16 0.00
barrely4b 23.94% 7.41% 11.48% 6.93% 74.19% 2.78 2.67 0.11 0.00
comparemcy8b 27.37% 2.62% 14.55% 4.19% 78.64% 2.75 1.50 1.13 0.13
comparemy8b 22.10% 3.77% 13.29% 3.36% 79.59% 3.75 2.88 0.81 0.06
comparemy4b 24.37% 4.80% 12.52% 6.59% 76.09% 3.43 2.57 0.71 0.14
comparemy2b 12.06% 5.58% 14.20% 12.89% 67.33% 3.00 2.67 0.33 0.00
comparey8b 12.38% 1.75% 23.52% 6.19% 68.54% 6.00 2.67 3.00 0.33
comparey4b 12.69% 3.28% 18.67% 10.58% 67.48% 4.50 2.00 2.50 0.00
comparey2b 14.14% 5.16% 14.98% 15.56% 64.31% 2.50 2.00 0.50 0.00
countercy16b 25.03% 7.71% 0.00% 52.70% 39.58% 1.89 0.89 1.00 0.00
countercy8b 17.06% 8.06% 0.00% 41.37% 50.57% 1.80 0.80 1.00 0.00
countery16b 4.40% 3.47% 0.00% 60.64% 35.89% 3.47 3.00 0.26 0.21
countery8b 7.51% 5.81% 0.00% 34.79% 59.39% 2.90 2.70 0.10 0.10
countery4b 9.62% 6.70% 0.00% 32.83% 60.47% 2.20 2.00 0.20 0.00
decodery4x16b 18.56% 3.34% 6.33% 5.95% 84.38% 3.82 2.35 1.47 0.00
muxy8b 10.93% 3.37% 21.65% 7.50% 67.48% 5.25 2.50 2.25 0.50
muxy4b 15.29% 3.62% 18.18% 12.10% 66.11% 4.50 2.00 2.50 0.00
muxy2b 14.64% 5.28% 12.56% 18.64% 63.51% 2.00 1.50 0.50 0.00
parityy9b 24.20% 5.62% 18.88% 8.02% 67.48% 4.00 3.00 0.25 0.75
shifty16b 5.60% 1.30% 8.30% 50.07% 40.33% 1.00 0.00 0.06 0.94
shifty8b 0.68% 7.88% 2.23% 47.04% 42.84% 1.00 0.00 0.11 0.89
shifty4b 5.41% 7.34% 3.23% 45.77% 43.66% 1.00 0.00 0.20 0.80

matchb 0.39% 3.05% 1.67% 30.76% 64.52% 3.50 0.80 1.20 1.50
txchip 13.12% 2.49% 2.71% 32.62% 62.17% 5.55 2.97 2.37 0.21
correl4rb 4.93% 3.21% 1.98% 36.54% 58.27% 2.85 1.01 0.85 0.99
correl4b 7.25% 3.50% 2.16% 28.84% 65.50% 2.75 0.93 0.82 1.00
l5b datapath 12.39% 3.58% 6.98% 15.50% 73.94% 3.92 1.92 0.69 1.31
l2b pipe datapath 11.37% 2.78% 6.49% 10.20% 80.53% 3.76 1.88 0.53 1.35
l3b random logic 5.71% 1.30% 4.55% 55.19% 38.96% 4.10 1.76 1.90 0.45
fir5b 18.46% 8.32% 2.08% 10.25% 79.35% 2.92 1.54 1.17 0.21
rs_mult 26.64% 9.52% 4.00% 0.00% 86.49% 3.46 3.05 0.40 0.00

All Designs
Cumulative Ave. 15.58% 5.19% 8.54% 21.00% 65.28% 3.19 1.86 1.00 0.33
Std. Dev. 8.62% 2.40% 6.88% 17.72% 13.69% 1.19 0.88 0.80 0.46

Large Designs Only
Cumulative Ave. 11.14% 4.19% 3.62% 24.44% 67.75% 3.65 1.76 1.10 0.78
Std. Dev. 7.89% 2.78% 2.01% 16.96% 14.37% 0.86 0.82 0.65 0.57
% Contribution of Interconnect Resource to Interconnect Power | % Makeup of Overall Interconnect Resources
Design Name, Local Lines, Double Lines, Long Lines, CLB Input Lines, CLB Output Lines | Local Lines, Double Lines, Long Lines, Carry Lines

accumy16b 42.17% 20.19% 3.99% 17.21% 13.71% 58.67% 28.00% 8.00% 5.33%
accumy8b 46.17% 17.65% 2.20% 17.44% 14.18% 58.90% 27.40% 8.22% 5.48%
accumy4b 32.34% 36.67% 0.00% 16.33% 13.68% 49.09% 43.64% 3.64% 3.64%
addery8b 29.26% 45.61% 5.12% 9.80% 7.64% 28.72% 57.45% 8.51% 5.32%
addery4b 28.58% 45.92% 1.68% 12.42% 8.89% 27.27% 59.09% 6.82% 6.82%
barrely8b 28.08% 29.27% 5.08% 27.40% 10.13% 41.35% 44.23% 14.42% 0.00%
barrely4b 33.95% 27.41% 6.34% 23.07% 9.34% 42.55% 38.30% 19.15% 0.00%
comparemcy8b 27.50% 51.30% 3.39% 11.27% 3.84% 28.71% 59.41% 6.93% 4.95%
comparemy8b 24.39% 44.64% 4.31% 21.25% 5.01% 33.33% 58.33% 8.33% 0.00%
comparemy4b 42.40% 28.87% 3.20% 19.28% 6.35% 51.67% 38.33% 10.00% 0.00%
comparemy2b 36.95% 33.06% 0.00% 20.57% 9.21% 42.11% 52.63% 5.26% 0.00%
comparey8b 48.02% 31.47% 1.35% 15.54% 3.23% 56.90% 37.93% 5.17% 0.00%
comparey4b 48.28% 29.76% 0.67% 15.67% 5.95% 51.72% 41.38% 6.90% 0.00%
comparey2b 57.05% 15.14% 1.03% 16.69% 9.70% 60.00% 26.67% 13.33% 0.00%
countercy16b 35.52% 11.11% 0.00% 22.40% 26.96% 50.00% 23.33% 3.33% 23.33%
countercy8b 49.63% 6.71% 0.00% 18.18% 21.91% 56.25% 18.75% 6.25% 18.75%
countery16b 37.05% 14.57% 5.50% 30.97% 12.66% 70.79% 23.60% 5.62% 0.00%
countery8b 38.67% 13.19% 6.23% 29.85% 12.22% 58.97% 25.64% 15.38% 0.00%
countery4b 37.29% 21.76% 0.00% 27.00% 14.05% 68.75% 25.00% 6.25% 0.00%
decodery4x16b 37.61% 6.87% 7.46% 42.16% 4.48% 74.55% 12.73% 12.73% 0.00%
muxy8b 46.46% 20.32% 4.16% 22.88% 5.76% 61.54% 30.77% 7.69% 0.00%
muxy4b 42.86% 31.90% 0.00% 18.30% 6.85% 45.45% 50.00% 4.55% 0.00%
muxy2b 50.55% 20.88% 0.00% 17.27% 10.68% 50.00% 41.67% 8.33% 0.00%
parityy9b 45.20% 22.31% 1.27% 21.87% 9.09% 60.00% 33.33% 6.67% 0.00%
shifty16b 35.91% 31.33% 5.01% 19.84% 25.15% 45.95% 43.24% 10.81% 0.00%
shifty8b 45.20% 10.20% 2.30% 19.00% 23.70% 64.71% 17.65% 17.65% 0.00%
shifty4b 47.86% 10.26% 3.28% 17.52% 21.08% 54.55% 18.18% 27.27% 0.00%

matchb 31.43% 29.75% 4.73% 20.67% 13.04% 53.58% 38.68% 7.74% 0.00%
txchip 39.62% 18.40% 5.66% 22.17% 8.61% 57.40% 31.27% 11.33% 0.00%
correl4rb 21.19% 35.59% 3.81% 25.85% 13.56% 55.95% 32.16% 7.84% 4.05%
correl4b 22.63% 34.57% 6.17% 25.10% 12.76% 50.23% 37.78% 8.60% 3.39%
l5b datapath 32.26% 33.18% 4.38% 19.82% 9.45% 47.06% 41.18% 8.82% 2.94%
l2b pipe datapath 37.24% 26.87% 11.90% 14.59% 8.06% 46.77% 38.71% 12.10% 2.42%
l3b random logic 24.44% 45.00% 2.22% 20.83% 8.33% 35.23% 55.68% 9.09% 0.00%
fir5b 36.44% 26.36% 5.85% 16.20% 11.87% 51.28% 29.74% 10.77% 8.21%
rs_mult 33.49% 25.34% 8.42% 23.10% 9.10% 49.69% 36.42% 13.89% 0.00%

All Designs
Cumulative Ave. 37.60% 26.48% 3.52% 20.54% 11.40% 51.10% 36.62% 9.65% 2.63%
Std. Dev. 8.75% 11.67% 2.81% 6.06% 5.91% 11.35% 12.60% 4.80% 5.15%

Large Designs Only
Cumulative Ave. 30.97% 30.56% 5.91% 20.93% 10.53% 49.69% 37.96% 10.02% 2.33%
Std. Dev. 6.71% 7.61% 2.83% 3.74% 2.24% 6.53% 7.70% 2.12% 2.75%
Design Name, CLB Energy, Local Wire Energy, Double Wire Energy, Longline Energy, Carry Chain Energy, CLB Input Line Energy, CLB Output Line Energy, Clock Wire Energy

accumy16b 1.71 6.07 3.53 0.81 0.4 3.29 1.99 0.5
accumy8b 0.92 3.2 1.59 0.42 0.2 1.69 1.05 0.38
accumy4b 0.43 1.7 1.94 0.08 0.1 0.89 0.53 0.32
addery8b 0.45 2.03 4.47 0.56 0.25 0.89 0.45 0.13
addery4b 0.27 0.84 2.23 0.14 0.15 0.53 0.29 0.13
barrely8b 1.21 2.98 3.88 0.7 0 3.53 1.03 0.13
barrely4b 0.43 1.3 1.66 0.41 0 1.22 0.37 0.13
comparemcy8b 0.27 2.26 4.85 0.41 0.25 0.99 0.29 0.13
comparemy8b 0.75 2.77 5.96 0.63 0 2.81 0.72 0.13
comparemy4b 0.36 2.15 1.71 0.25 0 1.13 0.33 0.13
comparemy2b 0.13 0.52 0.88 0.05 0 0.43 0.12 0.13
comparey8b 0.13 2.33 1.77 0.13 0 0.79 0.12 0.13
comparey4b 0.08 1.12 0.96 0.08 0 0.39 0.08 0.13
comparey2b 0.08 0.74 0.32 0.08 0 0.24 0.08 0.13
countercy16b 0.67 1.04 0.6 0.05 0.35 0.73 1 0.5
countercy8b 0.35 0.62 0.3 0.05 0.15 0.39 0.52 0.38
countery16b 0.89 4.42 1.96 0.29 0 3.17 1.16 1.38
countery8b 0.48 1.61 1.02 0.32 0 1.4 0.56 0.5
countery4b 0.23 0.75 0.4 0.05 0 0.53 0.28 0.38
decodery4x16b 0.71 3.14 0.5 0.6 0 2.95 0.7 0.13
muxy8b 0.16 1.7 0.93 0.24 0 0.94 0.16 0.13
muxy4b 0.08 0.75 0.89 0.05 0 0.39 0.08 0.13
muxy2b 0.08 0.44 0.36 0.05 0 0.18 0.08 0.13
parityy9b 0.16 1.33 0.87 0.11 0 0.77 0.16 0.13
shifty16b 0.83 1.14 1.4 0.28 0 0.77 1 1.38
shifty8b 0.41 0.77 0.28 0.12 0 0.4 0.52 0.75
shifty4b 0.23 0.45 0.2 0.12 0 0.22 0.28 0.5

matchb 3.9 19.4 18.1 2.9 0 15.5 10.5 4.3
txchip 3.5 31.1 20.6 5.6 0 24.8 7.5 3.1
correl4rb 6.3 13.33 10.7 1.11 0.75 9.1 7.7 4.96
correl4b 6.63 14.24 14.5 2.11 0.75 9.4 8 4.62
l5b datapath 0.5 3.4 3.7 0.5 0.2 2.4 1 0.6
l2b pipe datapath 0.6 4.1 4 0.9 0.2 3 1.3 0.5
l3b random logic 1.2 4.3 8.6 0.8 0 5.2 3.1 1.7
fir5b 3.46 13.52 10.2 2.63 1.6 6.79 4.03 1.62
rs_mult 2.81 11.3 10.04 2.81 0 9.4 2.34 0
APPENDIX C
Detailed Breakdown of PGA Energy and Capacitances
The following list shows the energy breakdown of the major blocks for the mini-array. Numbers are specified for both the high and low supply, indicating the energy drawn from each one. Along with each energy number, the approximate capacitance charged through the given supply is listed. The relationship between energy and capacitance is complicated because of the variety of swings that are possible, so a simple scaling factor does not hold across all blocks.
Block Vdd (2 volts) Vddl (1.5 volts)
Level 1 Track Buffer:
0.87 pJ -> 38 fF
C Input Multiplexer and Drivers:
0.18 pJ -> 45 fF 0.16 pJ -> 95 fF
A/B Input Multiplexer and Drivers:
0.2 pJ -> 50 fF 0.075 pJ -> 45 fF
LUT one path toggling:
0.1 pJ -> 45 fF
LUT all paths toggling:
0.265 pJ -> 120 fF
LUT output buffer path (includes ff input, and output fanout node to programmable drivers):
0.33 pJ -> 180 fF
Vertical (up/down) path incl. output drivers: 0.115 pJ -> 55 fF
Diagonal (up/down) path incl. output drivers: 0.135 pJ -> 60 fF
Direct Horizontal path (left to right) incl. output drivers:
0.14 pJ -> 65 fF
Level 1 Horizontal Track (0 or 3) total: 0.405 pJ -> 200 fF
- Track Buffers (always toggling if track is toggling):
0.18 pJ -> 80 fF
- Programmable Output Inverter 0.126 pJ -> 60 fF
- Track Capacitance 0.1 pJ -> 60 fF
Level 1 Horizontal Track (1 or 2) total: 0.435 pJ -> 220 fF
- Track Buffers (always toggling if track is toggling):
0.18 pJ -> 80 fF
- Programmable Output Inverter 0.126 pJ -> 60 fF
- Track Capacitance 0.129 pJ -> 80 fF
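To estimate the energy of a configured path from this breakdown, the per-block numbers are summed for each supply. A minimal sketch using two of the entries above:

```python
# Illustrative sketch: per-transition energy of a configured path is the sum
# of the block energies drawn from each supply rail.  The two entries below
# are the Level 1 Track Buffer and the C Input Multiplexer/Drivers values
# listed in this appendix (pJ).
blocks_pj = {
    "level1_track_buffer": {"Vdd": 0.87, "Vddl": 0.0},
    "c_input_mux_drivers": {"Vdd": 0.18, "Vddl": 0.16},
}
per_rail = {rail: sum(b[rail] for b in blocks_pj.values()) for rail in ("Vdd", "Vddl")}
print(per_rail, sum(per_rail.values()))  # ~1.21 pJ total for this two-block path
```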
APPENDIX D
PGA Encodings
************************
FORMAT OF THE INITDATA FILE
******************************
Cell Pair Instance Name
Cell Instance Name
keyword = lutmemory
bit vector for lut mem [7:0]
keyword = cinput
encoded selected input [0..7]
keyword = binput
encoded selected input [0..3]
keyword = ainput
encoded selected input [0..3]
keyword = ffmux
bit [0,1]
keyword = outputs
bit vector for output enables in this order vu du dd vd r t0 t1 t2 t3
keyword = endcell
keyword = endcellpair
keyword = endlogiccells
Smatrix Instance Name
keyword = horiz [0,1,2,3]
keyword = left [tv0,tv1,bv0,bv1]
keyword = right [tv0,tv1,bv0,bv1]
keyword = vert [0,1]
keyword = endsmatrix
keyword = endswitchmatrices
*************************************
Encoding Information for the data file
****************************************
Lut Memory Encoding=
bit vector [7:0] inputs are [abc]
abc lutmem
000 bit0
001 bit1
010 bit2
011 bit3
100 bit4
101 bit5
110 bit6
111 bit7
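The indexing in the table above can be restated as a short sketch (an illustrative helper, not part of the thesis's tools):

```python
# The 3-LUT output is simply the stored configuration bit selected by the
# input index abc, with a as the most significant bit (per the table above).
def lut3_eval(lutmem, a, b, c):
    # lutmem[i] holds the configured output value for index i
    return lutmem[(a << 2) | (b << 1) | c]

xor3 = [0, 1, 1, 0, 1, 0, 0, 1]   # truth table for a 3-input XOR
print(lut3_eval(xor3, 1, 0, 0))   # abc = 100 selects bit4 -> 1
```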
Center Mux Encoding (c input)=
numerical digit [0..7] indicating which input is selected.
input digit
duin 0
vertfanin 1
track1in 2
track0in 3
track2in 4
track3in 5
rightin 6
ddin 7
Bottom Mux Encoding (a input)=
numerical digit [0..3] indicating which input is selected (converted to 2-bits of binary)
input digit
vdin 0
ddin 1
track2in 2
track3in 3
Top Mux Encoding (b input)=
numerical digit [0..3] indicating which input is selected (converted to 2-bits of binary)
input digit
vuin 0
duin 1
track1in 2
track0in 3
Output Encoding=
9 bit vector indicating enabled cell outputs in this order:
vu du dd vd r t0 t1 t2 t3
Switch matrix Encoding for 1x1
horiz [lh0rh0 lh1rh1 lh2rh2 lh3rh3]
left [lh1tv0 lh1tv1 lh2bv0 lh2bv1]
right [rh1tv0 rh1tv1 rh2bv0 rh2bv1]
vert [tv0bv0 tv1bv1]
***********************************
usage: initpga <file>
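A minimal sketch of how a tool could read the keyword format above into per-cell settings (a hypothetical helper, not the actual initpga implementation; comment lines are assumed to begin with '*'):

```python
# Parse one cell's configuration from the keyword-per-line format described
# above: each keyword is followed by its value on the next non-comment line.
KEYWORDS = {"lutmemory", "cinput", "binput", "ainput", "ffmux", "outputs"}

def parse_cell(lines):
    cfg = {}
    toks = iter(s for s in (l.strip() for l in lines) if s and not s.startswith("*"))
    for tok in toks:
        if tok == "endcell":
            break
        if tok in KEYWORDS:
            cfg[tok] = next(toks)   # value line follows the keyword line
    return cfg

cell_text = """* enabled outputs are vu du dd vd r t0 t1 t2 t3
lutmemory
11001100
cinput
6
outputs
000000100
endcell"""
print(parse_cell(cell_text.splitlines()))
```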
Example PGA Initialization File
* Main test file for final schems with loadcaps
* cellpair listing
cellpair00
cell0
* lut memory 7:0
lutmemory
* binput passes
11001100
* c input [0..7]
cinput
6
* b input [0..3]
binput
3
* a input [0..3]
ainput
3
* ff mux
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000100
endcell
cell1
lutmemory
* passes c input
10101010
cinput
0
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000010
endcell
endcellpair00
cellpair10
cell0
lutmemory
* passes a input
11110000
cinput
0
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
110000001
endcell
cell1
lutmemory
* passes c input
10101010
cinput
2
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
001100010
endcell
endcellpair10
cellpair01
cell0
* lut memory 7:0
lutmemory
* passes c input
10101010
* c input [0..7]
cinput
6
* b input [0..3]
binput
3
* a input [0..3]
ainput
3
* ff mux
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
001010100
endcell
cell1
lutmemory
* passes c input
10101010
cinput
6
binput
3
ainput
3
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000010
endcell
endcellpair01
cellpair11
cell0
lutmemory
* passes a input
11110000
cinput
5
binput
3
ainput
0
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
110000001
endcell
cell1
lutmemory
* passes a input
11110000
cinput
5
binput
3
ainput
1
ffmux
1
* enabled outputs
* vu du dd vd r t0 t1 t2 t3
outputs
000000010
endcell
endcellpair11
endlogiccells
smatrix110
horiz
1011
left
0000
right
0100
vert
00
endsmatrix
smatrix111
horiz
1101
left
0001
right
0000
vert
00
endsmatrix
endsmatrices
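The example file follows a simple line-oriented layout: lines beginning with '*' are comments, block keywords (cell0, endcell, endlogiccells, and so on) stand alone, and each value-carrying keyword (lutmemory, cinput, binput, ...) takes its value from the following line. A speculative Python reader under those assumptions (none of these names come from the thesis tools, and the keyword set is inferred from the example):

```python
# Speculative reader for the initialization-file layout shown above.
# Assumptions: '*' starts a comment line; keywords in VALUE_KEYWORDS
# take their value from the next line; everything else (cell0,
# endcell, smatrix110, ...) is a standalone marker.

VALUE_KEYWORDS = {"lutmemory", "cinput", "binput", "ainput", "ffmux",
                  "outputs", "horiz", "left", "right", "vert"}

def read_init(lines):
    """Yield (keyword, value) pairs; value is None for standalone markers."""
    it = iter(ln.strip() for ln in lines
              if ln.strip() and not ln.lstrip().startswith("*"))
    for token in it:
        if token in VALUE_KEYWORDS:
            yield token, next(it)   # value is on the following line
        else:
            yield token, None

sample = ["* ff mux", "cell0", "cinput", "6", "outputs",
          "000000100", "endcell"]
print(list(read_init(sample)))
# -> [('cell0', None), ('cinput', '6'), ('outputs', '000000100'), ('endcell', None)]
```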
APPENDIX E
PGA Configuration Memory Programming Polarities

BITLINE PROGRAMMING POLARITIES FOR CONFIGURATION MEMORY
(THE GIVEN BITLINE INDICATES WHICH BITLINE IS HIGH FOR A MEMORY CELL VALUE OF 1.)

memcell      wl    bitline to bring high for memcell=1
=======================================================

----------------------------------
cell0 in a cellpair
----------------------------------
CENTER MUX0
mem0         wl5   bl1#
mem1         wl4   bl1#
mem2         wl4   bl0#
mem3         wl5   bl0#
mem4         wl3   bl0#
mem5         wl2   bl0#
mem6         wl3   bl1#
mem7         wl2   bl1#

TOP MUX0
mem0         wl7   bl0#
mem1         wl6   bl0#

BOT MUX0
mem0         wl0   bl0#
mem1         wl1   bl0#

LUT0
lmem0        wl3   bl2#
lmem1        wl2   bl2
lmem2        wl6   bl2#
lmem3        wl7   bl2
lmem4        wl2   bl3
lmem5        wl3   bl3#
lmem6        wl7   bl3
lmem7        wl6   bl3#

PROG OUT0
m0(ffmem)    wl4   bl4#
poutvu       wl7   bl4#
poutdu       wl6   bl4#
poutr        wl1   bl2
poutdd       wl1   bl4#
poutvd       wl0   bl4#

TOUT CELL0
poutt0       wl5   bl10
poutt1       wl4   bl10
poutt2       wl3   bl10
poutt3       wl1   bl10
----------------------------------
cell1 in a cellpair
----------------------------------
CENTER MUX1
mem0         wl5   bl6#
mem1         wl4   bl6#
mem2         wl4   bl5#
mem3         wl5   bl5#
mem4         wl3   bl5#
mem5         wl2   bl5#
mem6         wl3   bl6#
mem7         wl2   bl6#

TOP MUX1
mem0         wl7   bl5#
mem1         wl6   bl5#

BOT MUX1
mem0         wl0   bl5#
mem1         wl1   bl5#

LUT1
lmem0        wl3   bl7#
lmem1        wl2   bl7
lmem2        wl6   bl7#
lmem3        wl7   bl7
lmem4        wl2   bl8
lmem5        wl3   bl8#
lmem6        wl7   bl8
lmem7        wl6   bl8#

PROG OUT
m0(ffmem)    wl4   bl9#
poutvu       wl7   bl9#
poutdu       wl6   bl9#
poutr        wl1   bl7
poutdd       wl1   bl9#
poutvd       wl0   bl9#

TOUT CELL1
poutt0       wl7   bl10
poutt1       wl5   bl10
poutt2       wl2   bl10
poutt3       wl0   bl10

----------------------------------
switch matrix (bitline labels here are relative, not absolute; consult the schematic for the real ones)
----------------------------------
memlh1tv0    wl5   bl1
memlh1tv1    wl4   bl1
memrh1tv0    wl6   bl0
memrh1tv1    wl0   bl1

memlh2bv0    wl2   bl0
memlh2bv1    wl3   bl1
memrh2bv0    wl0   bl0
memrh2bv1    wl1   bl1

memtv0bv0    wl7   bl0
memtv1bv1    wl6   bl1

memlh0rh0    wl7   bl1
memlh1rh1    wl5   bl0
memlh2rh2    wl1   bl0
memlh3rh3    wl2   bl1
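When generating programming waveforms, the table above can be read programmatically. The sketch below is a hypothetical helper, not from the thesis; in particular it assumes that blN and blN# form a complementary bitline pair, so that storing a 0 drives the line opposite the one listed for a 1. That convention is an assumption, not something the table states.

```python
# Hypothetical helper (not from the thesis): pick the programming
# wordline/bitline for a configuration memory cell. POLARITY holds a
# few CENTER MUX0 entries from the table above. ASSUMPTION: blN and
# blN# are complementary, so storing 0 drives the opposite line.

POLARITY = {
    "mem0": ("wl5", "bl1#"),
    "mem1": ("wl4", "bl1#"),
    "mem2": ("wl4", "bl0#"),
    "mem3": ("wl5", "bl0#"),
}

def complement(bitline):
    """Toggle the '#' suffix: bl1 <-> bl1#."""
    return bitline[:-1] if bitline.endswith("#") else bitline + "#"

def write_vector(memcell, value):
    """Return (wordline to assert, bitline to drive high) for storing value."""
    wl, bl_for_one = POLARITY[memcell]
    return (wl, bl_for_one if value else complement(bl_for_one))

print(write_vector("mem0", 1))  # -> ('wl5', 'bl1#')
print(write_vector("mem0", 0))  # -> ('wl5', 'bl1')
```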
Relative Location of Configuration Memory Cells in Layout
Below is a series of diagrams that show the layout of the memory cells within each block that contains configuration cells. The orientation matches the picture of the mini-array in Figure 7-1. The memory cells are labeled according to the bit position that they represent in the encoding vectors of Appendix D, or, in the switch matrix case, the connection path that they enable.
[Layout diagrams: relative placement of the configuration memory cells.]
[Diagram 1: Bottom Mux (bits 0-1), Top Mux (bits 0-1), and Center Mux (bits 0-7) cell positions.]
[Diagram 2: Lut Memory cell positions (bits 0-7).]
[Diagram 3: 1x-to-1x Switch Matrix cell positions, labeled by enabled connection path: lh0rh0, lh1rh1, lh2rh2, lh3rh3, lh1tv0, lh1tv1, lh2bv0, lh2bv1, rh1tv0, rh1tv1, rh2bv0, rh2bv1, tv0bv0, tv1bv1.]