


Architectures and Algorithms for Laser-Programmed Gate Arrays with

Foldable Logic Blocks

Jason Helge Anderson

A thesis submitted in conformity with the requirements

for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

© Copyright by Jason Helge Anderson 1997


The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Architectures and Algorithms for Laser-Programmed Gate Arrays with Foldable Logic Blocks

Jason Helge Anderson Master of Applied Science, 1997

Department of Electrical and Computer Engineering University of Toronto

Abstract

Laser-programmed gate arrays (LPGAs) represent a new approach to application specific

integrated circuit implementation. An LPGA consists of an array of programmable logic blocks as

well as a programmable interconnection network.

This thesis proposes two new LPGA logic block architectures: foldable PLA-style logic

blocks and foldable look-up-table-based logic blocks. The proposed logic blocks are similar to those found in commercially available field-programmable devices. The term foldable refers to

the fact that the granularity of the logic blocks can be varied. This is achieved using the LPGA

laser disconnect methodology. Custom CAD tools have been developed to map circuits into the

new architectures.

Experimental studies show that LPGAs with foldable logic blocks are more area-efficient

than those based on normal unfoldable logic blocks. The proposed LPGA architectures possess

more predictable timing than an existing, commercially available LPGA.

Acknowledgements

I would like to take this opportunity to thank my supervisor Professor Stephen Brown for

his advice, direction, and encouragement. It has been a privilege working with him.

I would like to thank my friends and colleagues in the EECG, including: Vincent, Jason, Warren, Ali, Vaughn, Yaska, Khalid, Qiang, Mazen, Jordan, Jeff, Wai, Steve Wilton, Guy, Dan,

Alireza, and Dawi.

I also wish to express my appreciation for the OGS from the Government of Ontario and

for financial support from Chip Express Corporation.

Thanks to Jack Kouloheris for supplying the DDMAP technology mapper for PLA-style

logic blocks. Thanks to Amir Farrahi for providing the C code for the LUT-based technology

mapper, Level-Map.

I would like to thank Mary for her friendship and encouragement and for helping me to explore and enjoy Toronto. I also appreciate the efforts of Mary and Steve in assisting me with the job of editing this thesis. Finally, I would like to thank my mother, father, sister, and grandmother

for their support during my graduate studies.

Table of Contents

Chapter 1 Introduction
1.1 Introduction to Laser-Programmed Gate Arrays
1.2 Motivation for this Research Study
1.3 Research Approach
1.4 Thesis Organization

Chapter 2 Background and Previous Work
2.1 Introduction
2.2 Logic Block Architecture
2.3 PLA-Style Logic Blocks
2.3.1 Previous Research
2.3.2 Synthesis Techniques
2.3.3 Commercially Available CPLDs
2.3.3.1 Altera MAX 9000
2.3.3.2 AMD Mach 4 Family
2.4 Look-Up-Table-Based Logic Blocks
2.4.1 Previous Research
2.4.2 Synthesis Techniques
2.4.2.1 LUT-Based Technology Mappers
2.4.2.2 Level-Map
2.4.3 Commercially Available LUT-Based FPGAs
2.4.3.1 Altera FLEX 10K
2.4.3.2 Xilinx XC4000
2.5 Commercially Available LPGAs
2.5.1 QYH500 LPGA
2.5.2 CX2001 LPGA

Chapter 3 Foldable PLA-Style Logic Blocks
3.1 Introduction
3.2 Foldable PLA-Style Logic Block Architecture
3.2.1 Simple and Multiple Folding
3.2.3 Effect of Bipartite Folding on Combined Folding
3.2.4 Summary of Architectural Parameters
3.3 Synthesis
3.3.1 Overview of CAD Flow
3.3.2 Technology Independent Synthesis
3.3.3 hooPLA: Technology Mapping for Foldable PLA-Style Logic Blocks
3.3.3.1 hooPLA Phase I: Performing an Optimal Tree Mapping
3.3.3.2 hooPLA Phase II: Heuristic Partial Collapsing
3.3.3.3 hooPLA Phase III: Bin Packing
3.3.3.4 Comparison with Existing Technology Mappers
3.3.4 PLA Folding
3.3.4.1 Previous Work
3.3.4.2 Approach Used to Perform Bipartite Folding
3.3.4.3 Integrating PLA Folding into hooPLA
3.4 Summary

Chapter 4 Foldable Look-Up-Table-Based Logic Blocks
4.1 Introduction
4.2 Foldable Look-Up-Table-Based Logic Block Architecture
4.3 Synthesis
4.3.1 Overview of CAD Flow
4.3.2 LUTPack: Technology Mapping for Foldable Look-Up-Table-Based Logic Blocks
4.4 Summary

Chapter 5 Experimental Results
5.1 Introduction and Architectural Questions
5.2 Experimental Procedure
5.2.1 Benchmark Circuits
5.2.2 Area Models
5.2.3 Chip Area of Foldable PLA-Style Logic Blocks
5.2.4 Chip Area of Foldable Look-Up-Table-Based Logic Blocks
5.2.6 Limitations of Area Model
5.3 Area-Efficiency Results for Foldable PLA-Style Logic Blocks
5.3.1 The Benefits of Folding
5.3.2 Area Results
5.4 Area-Efficiency Results for Foldable Look-Up-Table-Based Logic Blocks
5.4.1 The Benefits of Folding
5.4.2 Area Results
5.5 Predictability Benefits of the Coarse-Grained Foldable Architectures
5.6 Summary

Chapter 6 Conclusions
6.1 Thesis Summary
6.2 Thesis Contributions
6.3 Suggestions for Future Work

References

Appendix A List of Benchmark Circuits
Appendix B Pessimistic Area Results
Appendix C PLA Layout
Appendix D Parameterized Benchmark Suite
D.1 Introduction
D.2 Current Benchmarks
D.3 Parameterized Benchmarks
D.4 Synopsys Behavioral Compiler
D.5 Designware Components
D.6 Synthesized Circuit Format
D.7 Description of Benchmarks
Appendix E Comparing with the CX2001 LPGA

List of Tables

Table 3.1: Foldable PLA-Style Logic Block Architectural Parameters
Table 3.2: Heuristic Partial Collapsing Criteria
Table 3.3: Effect of Controlled Partial Collapsing
Table 3.4: Comparison with Existing Technology Mappers
Table 5.1: Average Wire Length and Average Ratios of Maximum to Average Channel Density
Table 5.2: Normalized Area Results for PLA-Based Architectures
Table 5.3: Normalized Area for Foldable Look-Up-Table-Based Architectures
Table 5.4: Normalized Area for Heterogeneous Foldable Look-Up-Table Architectures
Table 5.5: Average Number of Logic Levels on Circuits' Critical Paths for Several Architectures
Table A.1: List of Benchmark Circuits
Table D.1: Solutions to Problems with the MCNC Benchmarks
Table E.1: Comparing Number of Logic Blocks
Table E.2: Comparing Number of Connected Pins

List of Figures

Figure 1.1: LPGA Laser Cutting [CEC96]
Figure 1.2: Packing Additional Logic into Foldable Logic Blocks
Figure 1.3: Abstract View of Utilization/Granularity Trade-Off
Figure 2.1: PLA Structure
Figure 2.2: Altera MAX 9000 Architecture [Alte96]
Figure 2.3: Altera MAX 9000 Logic Array Block [Alte96]
Figure 2.4: Altera MAX 9000 Macrocell and Local LAB Interconnect [Alte96]
Figure 2.5: AMD Mach 4 Architecture [AMD96]
Figure 2.6: Portion of AMD Mach 4 PAL Block [AMD96]
Figure 2.7: Structure of Look-Up-Table
Figure 2.8: Architecture of Altera FLEX 10K [Alte96]
Figure 2.9: Altera FLEX 10K Logic Array Block (LAB) [Alte96]
Figure 2.10: Altera FLEX 10K Logic Element [Alte96]
Figure 2.11: Architecture of Xilinx XC4000
Figure 2.12: Xilinx XC4000 Configurable Logic Block [Xili94]
Figure 2.13: Portion of XC4000 Routing Architecture [Xili94]
Figure 2.14: Architecture and Logic Site of QYH 500 LPGA [CEC96a]
Figure 2.15: Portion of QYH 500 Routing Circuitry [Jana95]
Figure 2.16: CX2001 Logic Block and Example Function [CEC96a]
Figure 3.1: Example PLA Personality Matrix
Figure 3.2: PLA Column Folding
Figure 3.3: PLA Row Folding
Figure 3.4: Combined Folding to Pack Additional Logic into a Foldable PLA-Style Logic Block
Figure 3.5: Foldable PLA-Style Logic Block
Figure 3.6: CAD Flow for Mapping Circuits into Foldable PLA-Style Logic Blocks
Figure 3.7: Partitioning a DAG into a Forest of Fanout-Free Trees
Figure 3.8: Computation of Feasible Subtree Cost
Figure 3.10: Maximum Shared Input Bin Packing Algorithm
Figure 3.11: Mapping a PLA into a Bipartite Graph
Figure 3.12: Partitioned Bipartite Graph with Foldings
Figure 3.13: Pseudo-Code for Folding Algorithm
Figure 3.14: Division of a Folded PLA into Two Smaller PLAs for Subsequent Folding
Figure 3.15: Algorithmic Flow of hooPLA
Figure 4.1: LUT Programming in LPGA Technology
Figure 4.2: Utilization of LUT Inputs
Figure 4.3: Foldable 4-LUT
Figure 4.4: Foldable 4-LUT with Additional Flexibility
Figure 4.5: Output Circuitry for Foldable LUT-Based Logic Block with Parameters K = 4 and L = 3
Figure 4.6: CAD Flow for Mapping Circuits into Foldable LUT-Based Logic Blocks
Figure 4.7: Covering the Multiplexer Tree
Figure 4.8: Pseudo-Code for First-Fit-Decreasing LUT Packing
Figure 5.1: Pessimistic and Optimistic Area Models
Figure 5.2: PLA Layout Floorplan
Figure 5.3: The Benefits of PLA Folding - Percentage Reduction in Number of Logic Blocks
Figure 5.4: Area Results for Unfoldable PLA-Style Logic Block Architectures (Optimistic)
Figure 5.5: Ratio of Row Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Optimistic)
Figure 5.6: Ratio of Column Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Optimistic)
Figure 5.7: Ratio of Combined Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Optimistic)
Figure 5.8: The Benefits of LUT Folding - Percentage Reduction in Number of Logic Blocks
Figure 5.9: Ratio of Foldable to Unfoldable Area for LUT-Based Logic Block Architectures (Optimistic)
Figure B.1: Area Results for Unfoldable PLA-Style Logic Block Architectures (Pessimistic)
Figure B.2: Ratio of Row Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Pessimistic)
Figure B.3: Ratio of Column Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Pessimistic)
Figure B.4: Ratio of Combined Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Pessimistic)
Figure B.5: Ratio of Foldable to Unfoldable Area for LUT-Based Logic Block Architectures (Pessimistic)
Figure C.1: PLA Layout Generated by MPLA [Scot85]

Chapter 1 Introduction

1.1 Introduction to Laser-Programmed Gate Arrays

Laser-programmed gate arrays (LPGAs) represent a new approach to application specific

integrated circuit (ASIC) implementation. An LPGA is a VLSI chip consisting of a two-

dimensional array of logic blocks. Each logic block can be programmed to implement a specific

logic function. A programmable interconnection network allows the LPGA's logic blocks to be

connected together in a general way. All of the mask layers in an LPGA are pre-defined by the manufacturer and an unprogrammed LPGA has all possible metal connections between logic blocks. The device is programmed by using a laser to permanently cut away some of the pre-

defined metal links according to a user's design specifications. This is illustrated in Figure 1.1,

which shows the metalization layers on an LPGA before and after laser cutting. It is possible to

customize metal layers below the topmost metal layer because there are "windows" in the

insulating glass between metal layers, as illustrated in Figure 1.1.

Figure 1.1: LPGA Laser Cutting [CEC96] (left: before laser cutting; right: after laser cutting)

Other semi-custom VLSI design options include standard cell chips and mask-programmed gate arrays (MPGAs). In these technologies, some or all of the mask layers needed to produce an ASIC are fully customizable by the designer, leading to high costs and lengthy manufacturing times. Only the metal layers are customizable in MPGAs and they have a fabrication time of a few weeks [Koul93]. This lengthy fabrication period can be critical in the development of new products since it is essential that they be available on the market as quickly as possible.

Field-programmable gate arrays (FPGAs) and complex programmable logic devices

(CPLDs) are similar to LPGAs in that they consist of an array of uncommitted logic elements and

a programmable routing network that is prefabricated on a VLSI chip. Both FPGAs and CPLDs

belong to a more general class of chips known as field-programmable devices (FPDs) or

programmable logic devices (PLDs). The main difference between FPDs and LPGAs is that FPDs

are programmed electrically instead of using a beam of laser light. Both FPDs and LPGAs have a

short programming time in comparison with the fabrication time for MPGAs: FPDs can be

configured in a matter of seconds and LPGAs can be configured in several hours [CEC96].

Currently, LPGAs are manufactured by Chip Express Corporation (CEC). Designers send design

specifications to CEC, who use specialized laser-programming equipment to configure their

LPGAs¹ [CEC96]. These laser-programmers may eventually be available to customers, thus

permitting LPGAs to be labelled as "field-programmable devices". Some of the advantages of

LPGAs over current FPGAs and CPLDs are:

Faster routing connections. In FPGAs and CPLDs, logic blocks are connected

together using user-programmable routing switches. The routing switches, which

consist of pass transistors or anti-fuses [Brow92], introduce signal propagation delays.

In LPGAs, connections are made using only metal, which results in faster speeds.

Higher logic density. Much of the silicon area of an FPGA or CPLD is dedicated to

user-programmable elements, such as SRAM cells or anti-fuses, which are not needed

in LPGAs.

Some of the disadvantages of LPGAs are:

LPGAs are one-time-programmable, meaning that once they are programmed, they

cannot be re-programmed. Some FPGAs and CPLDs [Xili94][Alte95][Luce96] can be

programmed many times, which has led to their use in applications such as dynamically

reconfigurable systems [DeHo96][Atme97].

Currently, LPGAs are more expensive than FPGAs and CPLDs [Ayuk96].

1. A similar laser programming method has been used to configure simple programmable logic devices (SPLDs) [Stil83].

1.2 Motivation for this Research Study

As already mentioned, an LPGA consists of an array of logic blocks and interconnection

circuitry. The issue of which type, or size, of logic block produces the best area-efficiency in an

LPGA is an open question. Architectures with fine-grained logic blocks need greater amounts of

interconnection circuitry than architectures with coarse-grained logic blocks. However, if the

granularity of logic blocks is too large, they become under-utilized, and this results in wasted area.

The logic blocks in MPGAs have traditionally been very fine-grained¹, typically consisting of a small number of transistors [Veen90][Gall96][Hash92][Khat92]. Recently, MPGAs with larger logic blocks have been proposed [Land95]. One aspect of MPGA technology is that users can sometimes trade off the amount of logic and routing, since the metal layers are fully customizable. A sea-of-gates MPGA [West93] literally consists of a sea of logic blocks

with no space exclusively dedicated to routing. A designer creates space for interconnect by

routing over top of logic blocks, leaving some logic blocks wasted.

State-of-the-art FPGAs have coarse-grained logic blocks in comparison with traditional

MPGAs. Several commercially available FPGAs [Xili94][Alte96][Luce96] have logic blocks

based on look-up-tables (LUTs), which are small memories that are programmed with the truth

tables of boolean functions. Complex programmable logic devices have PLA-style

(programmable logic array) logic blocks [AMD96][Phil97][Latt96][Cypr97] which are good for

implementing the two-level logic congruent with the sum-of-products form of boolean functions.

The architectural issues associated with LPGAs are similar to those for FPGAs and

CPLDs, since in all of these technologies, a fixed amount of interconnect and logic is pre-fabricated on a chip. However, the capability in LPGA technology to cut metal lines with little area overhead introduces new architectural possibilities. The focus of this thesis is to investigate

the benefits of using coarse-grained logic blocks in LPGAs in a way that leverages the ability to

cut metal lines. In particular, two new logic block architectures are introduced: foldable PLA-

style logic blocks and foldable look-up-table-based logic blocks.

The proposed logic blocks were developed by looking at existing logic blocks in the

context of LPGA technology. In particular, the new logic blocks are variations on the blocks

found in FPGAs and CPLDs. The term foldable refers to the fact that the granularity of the logic blocks can be varied using the LPGA laser disconnect methodology. Typically, in commercially available FPGAs and CPLDs, the granularity of logic blocks is fixed and it cannot be modified.

1. The fine-grained logic blocks commonly found in MPGAs are also referred to as logic sites.

A significant advantage of variable logic block granularity is that it facilitates "packing"

additional logic into each logic block. This reduces the number of logic blocks needed to

implement circuits and may increase area-efficiency. As mentioned above, coarse-grained logic

blocks often suffer from under-utilization. For example, when circuits are mapped into traditional

PLA-style or LUT-based logic blocks, a portion of some of the logic blocks is left unused and,

therefore, wasted. In the proposed foldable logic blocks, the unused portion of a logic block may

be separated from the used portion, and logic rnay then be implemented in the unused portion.

Figure 1.2 is an abstract illustration of how additional logic is packed into foldable logic blocks.

The figure shows two implementations of an arbitrary digital circuit. The left side of the figure

depicts the circuit after it has been mapped into normal unfoldable logic blocks. The shaded

portion of each logic block represents the used portion of the logic block, while the unshaded

portion represents wasted area. The right side of the figure shows the same circuit after it has been mapped into foldable logic blocks. In the folded implementation, the logic blocks are better utilized. Furthermore, fewer logic blocks are needed in the folded implementation. Folding

reduces the amount of silicon area needed to implement the circuit if the reduction in the number

of logic blocks attained by folding more than compensates for the additional area required to

make logic blocks foldable. One of the principal objectives of this thesis is to investigate whether

an LPGA architecture based on foldable logic blocks is more area-efficient than an LPGA based

on normal unfoldable blocks.

Figure 1.2: Packing Additional Logic into Foldable Logic Blocks (left: circuit mapped into normal logic blocks; right: the same circuit mapped into foldable logic blocks)

Another advantage of foldable logic blocks is that as their granularity is increased, their utilization

decreases less quickly than it does for normal logic blocks. This notion is illustrated abstractly in

Figure 1.3. The slower decrease in utilization for foldable logic blocks may make it feasible to

implement coarse-grained architectures that would otherwise be too area-inefficient, if logic

blocks were not foldable.

Figure 1.3: Abstract View of Utilization/Granularity Trade-Off (logic block utilization versus logic block granularity, for foldable and normal logic blocks)

Designs implemented using coarse-grained blocks have fewer logic levels on their critical

paths. This is advantageous because logic blocks on the critical path are connected using the

programmable interconnection network, and as feature sizes shrink in VLSI technology,

interconnect delay is becoming a more significant portion of total delay. For this reason, there

may be speed advantages to building architectures with coarse-grained logic blocks. Furthermore,

routing delays must be estimated by synthesis tools when circuits are mapped into an architecture.

Having fewer logic levels means that fewer estimates must be made, and this helps synthesis tools

make better predictions of critical path delay. Currently, inaccurate estimates of routing delay

force designers to iterate the synthesis process, increasing both design time and cost.

Another potential benefit of LPGAs with logic blocks similar to those found in FPGAs

and CPLDs is to ease technology migration. FPD designers wishing to achieve greater speed and

logic density may wish to port their designs to LPGA technology. This can be difficult if the logic

blocks in the LPGA are essentially different from those in the FPD since it can alter the relative

delays in a design [Frak92].

1.3 Research Approach

New CAD tools have been developed to study the proposed logic block architectures. One tool, called hooPLA, performs technology mapping for architectures with foldable PLA-style logic blocks; a second tool, called LUTPack, performs technology mapping for architectures with foldable look-up-table logic blocks. The tools have been designed to work and perform well

for a range of architectural parameters.

These new tools are applied in an empirical study in which experiments consist of

mapping benchmark circuits into the proposed architectures. Results are recorded after each

mapping, including the number of logic blocks needed to implement circuits and the number of

levels of logic on each circuit's critical path. These results are used in conjunction with area

models to study the proposed architectures. A wide range of experimental architectures are

considered in the study.

1.4 Thesis Organization

This thesis is organized as follows: Chapter 2 provides background information on logic

block architecture and technology mapping. Existing technology mapping methods for FPDs with

look-up-tables and PLA-style blocks are reviewed. A few examples of commercial FPGA and

CPLD architectures are presented, with a focus on logic block architecture.

Chapter 3 introduces the foldable PLA-style logic block architecture and outlines its

architectural parameters. The chapter describes the CAD flow used to map circuits into the

proposed architecture and describes a new technology mapper for foldable PLA-style blocks. The

quality of the technology mapping solution produced by the new tool is compared with the results

attained using previously-developed techniques.

The foldable look-up-table logic block architecture is presented in Chapter 4. The chapter

describes a new tool that has been developed to map circuits into foldable LUT-based logic

blocks.

Chapter 5 presents the results of an empirical study in which the synthesis techniques of

Chapters 3 and 4 are applied in a series of experiments. Parameterized models for the logic block

and interconnection area of the proposed architectures are presented.

Conclusions and suggestions for future work are offered in Chapter 6. A list of references

is provided at the end.

A list of the benchmark circuits used in the experimental study is provided in Appendix A.

Both pessimistic and optimistic area models are considered in the empirical study of

Chapter 5; however, only the results obtained by applying the optimistic model are given in the main body of the thesis, while the results obtained with the pessimistic model are presented in Appendix B.

The logic block area models introduced in Chapter 5 were developed by analyzing actual

VLSI layouts. The layout used to develop the area model for foldable PLA-style logic blocks is

included in Appendix C.

Appendix D describes in detail some of the benchmark circuits used to study the proposed

logic block architectures. The circuits were developed by the author through HDL (hardware

description language) synthesis using the Synopsys CAD tools [Syn96].

A preliminary comparison of the proposed architectures with the commercially available

CX2001 LPGA [CEC96a] is included in Appendix E.

Chapter 2 Background and Previous Work

2.1 Introduction

This chapter gives a brief introduction to the notion of logic block architecture. Following

this, a detailed description of PLA-style and LUT-based logic blocks is presented, since they form

the basis for the logic blocks considered in this thesis. Synthesis techniques for PLA-style and

LUT-based logic blocks are summarized, and a description of several commercially available

FPDs is provided. The chapter concludes with a description of the architectures of commercially

available LPGAs.

2.2 Logic Block Architecture

An LPGA consists of an array of logic blocks and a programmable interconnection

network. The type of logic block in an LPGA is referred to as its "logic block architecture". The

type of logic block affects the speed of circuits mapped into the LPGA, as well as the LPGA's

logic density; that is, the amount of logic that can be packed into a given area of the LPGA. The

"routing architecture" of an LPGA refers to the structure of its programmable interconnection

network. The routing architecture defines how the logic blocks in the LPGA may be connected

together. FPGA and CPLD architecture can be defined similarly to LPGA architecture. However,

FPGA and CPLD architecture have an additional dimension, called the "programming technology," which currently consists of either SRAM cells, EPROM/EEPROM transistors, or

anti-fuses [Brow96]. The programming technology is the method through which the FPGA is

configured to implement a specific digital circuit.

In this thesis, the focus is on PLA-style and LUT-based logic blocks. These are referred to

as coarse-grained logic blocks because they can implement a large number of different boolean

functions. Other choices for logic blocks include multiplexer-based logic blocks, such as those used in Actel FPGAs [Acte96]. Multiplexers are also used as logic blocks in Texas Instruments MPGAs [Land95] and the Chip Express CX2001 LPGA [CEC96a] (described later). Fine-grained blocks such as transistor pairs were used in CrossPoint FPGAs¹ [Marp92] and are the basis of

many commercially available MPGAs, including those made by Philips [Veen90] and Texas Instruments.

1. CrossPoint FPGAs are no longer manufactured.

2.3 PLA-Style Logic Blocks

The structure of a PLA is congruent with the sum-of-products representation of boolean

functions. PLAs can be characterized by their number of inputs, product terms, and outputs. An

example of a PLA with 5 inputs, 5 product terms, and 2 outputs is given in Figure 2.1. Rows of the

PLA correspond to product terms and columns correspond to inputs and outputs. The left side of

the figure shows an unconfigured PLA with switches that can be programmed to realize product

terms and logical sums of product terms. Product terms are formed in a PLA's AND-plane; logical sums of product terms are generated in a PLA's OR-plane. The right side of Figure 2.1 depicts an abstract view of a programmed PLA implementing two example logic functions, x and z, each expressed in sum-of-products form. PALs (programmable array logic) are similar to PLAs, except PALs have a fixed OR-plane.

Figure 2.1: PLA Structure (left: unprogrammed PLA; right: programmed PLA, each showing the AND-plane and OR-plane)
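To make the AND-plane/OR-plane view concrete, the sketch below evaluates a tiny PLA in software from its personality matrices. The specific function, array sizes, and encoding (1 for a true literal, -1 for a complemented literal, 0 for an absent input) are illustrative assumptions and do not correspond to the PLA of Figure 2.1.

```c
#include <stdio.h>

#define NI 3   /* inputs        */
#define NP 2   /* product terms */
#define NO 1   /* outputs       */

/* AND-plane personality: and_plane[p][i] is 1 if input i appears in product
 * term p in true form, -1 if complemented, 0 if the input is not used.     */
static const int and_plane[NP][NI] = {
    { 1,  1,  0},    /* p0 = a AND b       */
    { 0, -1,  1},    /* p1 = (NOT b) AND c */
};

/* OR-plane personality: or_plane[o][p] is 1 if product term p feeds output o. */
static const int or_plane[NO][NP] = {
    {1, 1},          /* x = p0 OR p1 */
};

static void eval_pla(const int in[NI], int out[NO]) {
    int pt[NP];
    for (int p = 0; p < NP; p++) {            /* form product terms */
        pt[p] = 1;
        for (int i = 0; i < NI; i++) {
            if (and_plane[p][i] ==  1 && !in[i]) pt[p] = 0;
            if (and_plane[p][i] == -1 &&  in[i]) pt[p] = 0;
        }
    }
    for (int o = 0; o < NO; o++) {            /* form logical sums  */
        out[o] = 0;
        for (int p = 0; p < NP; p++)
            if (or_plane[o][p] && pt[p]) out[o] = 1;
    }
}

int main(void) {
    int in[NI] = {1, 0, 1};                   /* a=1, b=0, c=1 */
    int out[NO];
    eval_pla(in, out);
    printf("x = %d\n", out[0]);               /* prints 1: (NOT b) AND c holds */
    return 0;
}
```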

A basic PLA-based LPGA or FPD architecture would consist of an array of logic blocks,

with each logic block containing a PLA having a fixed number of inputs, product terms, and

outputs. In addition, the logic blocks would have a register associated with each PLA output, and circuitry would exist so that the output could bypass the register, thus being purely combinational. Lastly, each logic block output would have buffer circuitry to drive signals

through the programmable interconnect to other logic blocks, or to chip output pads.

2.3.1 Previous Research

Kouloheris conducted a study of the speed and logic density of FPGAs¹ with PLA-style logic blocks [Koul93]. He mapped benchmark circuits into PLA-based architectures, and placed

and routed the mapped circuits on an experimental FPGA with a routing architecture resembling a

segmented channelled gate array. Kouloheris used layouts of pseudo-NMOS NOR/NOR PLAs

[Mead80] to estimate the area of the logic blocks. His results showed that architectures with PLA-style logic blocks having 8-10 inputs, 12-13 product terms, and 3-4 outputs are as area-efficient as LUT-based FPGAs². His performance study used a delay model that reflected the placement of

logic blocks on the array, the capacitance of metal wires and logic block inputs, and the resistance

and capacitance of the programmable routing switches. Results suggest that the fastest logic block

architecture is the same as the architecture that is most area-efficient, when the programmable

interconnection network contains pass-transistor routing switches.

Research by Kaviani has focused on a hybrid FPGA architecture (HFA) with both PLA-

style and look-up-table-based logic blocks [Kavi97][Kavi96]. Kaviani used an experimental

approach to determine that such heterogeneous FPGAs use significantly less area than

homogeneous LUT-based FPGAs.

Singh studied the speed performance of FPGAs with PLA-style logic blocks using a

simple lumped-delay interconnect model [Sing91]. He considered blocks with between 2 and 32

inputs, and either 3 or 5 product terms. His results indicate that blocks with 5 product terms and 4-

8 inputs have the best speed performance.

2.3.2 Synthesis Techniques

Technology mapping for PLA-style logic blocks is fundamentally different than the library-based mapping algorithms used for MPGA or standard cell design. Library-based mappers transform a circuit into gates that reside in a target library. This type of technology mapping is not

efficient for PLA-based architectures, because of the wide range of functions that may be

implemented in a single PLA-style logic block. For example, consider a PLA with I inputs and P product terms. The number of ways to program the PLA's AND-plane, if all of the inputs are available in both true and complemented form, grows exponentially with I and P¹.

1. In [Koul93], FPDs with PLA-style logic blocks are referred to as FPGAs.
2. This result was determined based on an assumption of SRAM programming technology.
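For a rough sense of scale, here is an illustrative count under a simplifying assumption (it is not necessarily the expression the original text used): if each product term may use each input in true form, in complemented form, or not at all, the AND-plane alone admits

```latex
3^{I \cdot P} \ \text{configurations, e.g. } 3^{10 \cdot 12} \approx 10^{57} \ \text{for } I = 10,\ P = 12,
```

far too many functions for a pre-characterized library to be enumerated and searched during mapping.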

One important synthesis issue for PLA-style logic blocks is fast and effective two-level

logic minimization. A two-level logic minimizer attempts to reduce the number of product terms

needed to express a boolean function in sum-of-products form by finding redundancies in its

representation, and exploiting "don't cares" [Mano91][Bray87]. This is relevant to architectures

with PLA-style logic blocks because each block has only a finite number of product terms. The

Quine-McCluskey algorithm [DiMi94] is an exact two-level minimizer that can represent a function with an optimally minimal number of product terms. Espresso [DiMi94] is a fast

heuristic algorithm that is commonly used to perform two-level minimization.
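As a small, textbook-style illustration of two-level minimization (a hand-worked example, not output from Espresso or Quine-McCluskey), product terms that differ in a single literal can be merged:

```latex
\begin{aligned}
f &= a b c + a b \bar{c} + a \bar{b} c \\
  &= a b (c + \bar{c}) + a \bar{b} c \\
  &= a b + a \bar{b} c = a (b + \bar{b} c) = a b + a c .
\end{aligned}
```

Three product terms reduce to two, which matters because each PLA-style logic block offers only a fixed number of product terms.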

The combinational part of a digital circuit can be represented by a directed acyclic graph

(DAG)². Each node in a circuit's DAG implements a single-output logic function that is part of the

circuit. To map a circuit into a PLA-based architecture containing logic blocks with I inputs, P

product terms, and O outputs, Kouloheris first applied a look-up-table technology mapper. This

created a network of I-bounded nodes³; however, it also produced some nodes with more product

terms than allowable in the target architecture. To deal with this, Kouloheris used logic

decomposition routines inside the logic synthesis tool, SIS [Sent92], to decompose the nodes with too many product terms into feasible nodes⁴. Lastly, Kouloheris used a first-fit-decreasing

algorithm to pack nodes into PLA-style blocks with multiple outputs. Kouloheris refers to this

methodology as DDMAP [Koul93].
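The final packing step can be pictured with the sketch below: a first-fit-decreasing pass that places single-output nodes into multi-output PLA-style blocks, subject to input, product-term, and output limits. The node data and the block limits are hypothetical, and the only input-sharing consideration is the size of the merged input set, so this is simpler than the DDMAP packer itself.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_I 10   /* block input limit (assumed)        */
#define MAX_P 12   /* block product-term limit (assumed) */
#define MAX_O 4    /* block output limit (assumed)       */
#define NNODES 5

typedef struct { unsigned inputs; int pterms; } Node;          /* single-output node */
typedef struct { unsigned inputs; int pterms, outputs; } Block;

static int popcount(unsigned x) { int n = 0; while (x) { n += x & 1u; x >>= 1; } return n; }

static int by_pterms_desc(const void *a, const void *b) {
    return ((const Node *)b)->pterms - ((const Node *)a)->pterms;
}

int main(void) {
    /* Hypothetical mapped nodes: a bit set of used inputs and a product-term count. */
    Node nodes[NNODES] = { {0x0F, 6}, {0x31, 5}, {0x07, 4}, {0x0C, 3}, {0x30, 2} };
    Block blocks[NNODES];                     /* at worst, one block per node */
    int nblocks = 0;

    qsort(nodes, NNODES, sizeof(Node), by_pterms_desc);   /* decreasing order */

    for (int n = 0; n < NNODES; n++) {
        int placed = 0;
        for (int b = 0; b < nblocks && !placed; b++) {    /* first fit */
            unsigned merged = blocks[b].inputs | nodes[n].inputs;
            if (popcount(merged) <= MAX_I &&
                blocks[b].pterms + nodes[n].pterms <= MAX_P &&
                blocks[b].outputs + 1 <= MAX_O) {
                blocks[b].inputs  = merged;
                blocks[b].pterms += nodes[n].pterms;
                blocks[b].outputs++;
                placed = 1;
            }
        }
        if (!placed) {                                    /* open a new block */
            blocks[nblocks].inputs  = nodes[n].inputs;
            blocks[nblocks].pterms  = nodes[n].pterms;
            blocks[nblocks].outputs = 1;
            nblocks++;
        }
    }
    printf("packed %d nodes into %d blocks\n", NNODES, nblocks);
    return 0;
}
```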

Another way to map circuits into PLA-style logic blocks is to use a partial collapsing

function within SIS [Sent92] called eliminate, coupled with an efficient node partitioning

algorithm. Partial collapsing refers to the process of collapsing some DAG nodes into their

successors. The goal of the SIS command eliminate is to minimize area by partially collapsing a network so as to minimize the number of literals⁵ present in the network's boolean equation representation. Applying the partial collapsing function may create infeasible nodes that possess too many inputs or product terms to fit into a logic block of the target architecture. To deal with this, a program developed by Kaviani, called Break-a-Node [Kavi97], may be used to partition the large infeasible nodes into smaller feasible nodes. After partitioning, Break-a-Node uses a maximum-input-sharing, first-fit-decreasing approach to pack nodes into multi-output PLA-style logic blocks. Break-a-Node is used in the CAD flow of the hybrid FPGA architecture [Kavi96].

1. Not all of these AND-plane configurations are useful.
2. A circuit's DAG can also be referred to as a boolean network [Brow92].
3. I-bounded nodes are nodes that have less than or equal to I fanins.
4. A feasible node possesses a number of inputs and a number of product terms that allow it to fit into a logic block of the target architecture.
5. A literal is an instance of a variable in a boolean equation [DeMi94]. For example, z = abc has three literals and x = ab + āc + bc has six literals.
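A one-line example of partial collapsing (an illustrative identity, not taken from the SIS documentation): substituting a node's function into its fanout eliminates the node at the cost of re-expressing its literals there,

```latex
y = a b, \qquad z = y + c \quad \longrightarrow \quad z = a b + c .
```

Whether eliminate actually performs a given collapse depends on its effect on the literal count of the resulting network.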

2.3.3 Commercially Available CPLDs

This section presents the logic and routing architecture of two commercially available

CPLDs: the Altera MAX 9000 [Alte96] and the AMD Mach 4 [AMD96]. Other commercial architectures with PLA-style logic blocks include those made by Lattice Semiconductor [Latt96], Cypress Semiconductor [Cypr97], and Philips Semiconductor [Phil97].

2.3.3.1 Altera MAX 9000

The Altera MAX 9000 has a hierarchical routing architecture, as shown in Figure 2.2. The

logic blocks in the MAX 9000 are called macrocells and sets of 16 macrocells are grouped into

logic array blocks (LABs). Local routing circuitry within each LAB allows for fast connections

between macrocells in the same LAB. Macrocells in different LABs can be connected using rows

and columns of FastTrack interconnect, which consists of long wires spanning the entire width

and height of the device. I/O pins are accessed through the FastTrack interconnect.

Figure 2.2: Altera MAX 9000 Architecture [Alte96]

A more detailed view of the MAX 9000 routing architecture is shown in Figure 2.3. Each

LAB has 33 inputs from the row FastTrack interconnect above the LAB. The output of each

macrocell in the LAB is fed back so it may be used by other macrocells in the same LAB. These

input and feedback signals are available to be used in product terms in their true and

complemented forms. The figure shows that the output of a macrocell may be routed to either the adjacent row or column FastTrack interconnect. Signals on the column FastTrack interconnect may be routed onto the row FastTrack interconnect; however, the reverse is not possible. The macrocells in each LAB have access to two global clock signals and a global clear signal that is fed through high-speed routing to every macrocell on the device.

Figure 2.3: Altera MAX 9000 Logic Array Block [Alte96]

The architecture of the MAX 9000 macrocell is shown in Figure 2.4. Each macrocell has a

nominal allocation of 5 product terms. One of these product terms may be used as a shared

expander product term and fed back into the local LAB interconnect in inverted form. For larger

logic functions requiring more than 5 product terms, it is possible to borrow product terms from adjacent macrocells. These borrowed product terms are called parallel expander product terms. The flip-flop in the macrocell may be configured as either D, T, RS, or JK and it may be

clocked by one of the global clock signals, or one of the product terms allocated to the macrocell.

It is possible to use one of the 5 product terms to implement a register preset and one of the

product terms to implement a clear. Each macrocell has two outputs that may be either registered

or combinational. One of the two outputs feeds back into the local LAB interconnect; the other

output feeds the FastTrack interconnect. An additional feature called register packing allows the

flip-flop to be fed with a single product term while the remaining product terms are available to

realize other independent unregistered logic. This effectively allows a user to implement two

separate logic functions per macrocell.

Figure 2.4: Altera MAX 9000 Macrocell and Local LAB Interconnect [Alte96]

Circuits can be mapped into the MAX 9000 using Altera's Max+Plus II development

system, which allows a user to enter a design via hardware description language or schematic

capture. The software can be used to perform timing analysis and floorplanning. The

programming technology for the MAX 9000 is EEPROM. Devices in the family are available in

sizes ranging from 6000 - 20000 gates [Alte96].

2.3.3.2 AMD Mach 4 Family

Figure 2.5 shows the architecture of the AMD Mach 4 CPLD. It can be viewed as an array

of PALs interconnected by a central switch matrix. Each of the PAL blocks contains 16 macrocells¹, which may be configured as registered or combinational. One benefit of the Mach 4 architecture is that it has completely predictable timing because a signal's path from one macrocell to another macrocell always passes through the central switch matrix. The figure shows that four clock signals are fed directly into the central switch matrix. These clock signals are

available for use in any of the macrocells on the device. The Mach 4 is available with 128 or 256

macrocells; each being equivalent to about 2500 or 5000 gates, respectively [Brow96].

1. In this case, the term 'macrocell' refers to the circuitry driven by one of the OR-gates in Mach 4's PAL blocks [Brow96]. A macrocell contains a bypassable programmable register.

Figure 2.5: AMD Mach 4 Architecture [AMD96]

A portion of the Mach 4 PAL block is shown in Figure 2.6. The PAL blocks in the Mach 4

have 33 inputs from the central switch matrix, which are available in true and complemented

forms. These 66 different signals are used to form 90 product terms, 80 of which are grouped into

16 clusters of 5. These clusters of product terms implement logic and feed macrocells. Eight of

the remaining 10 product terms are used to create output enable signals for the 8 I/O cells connected to each PAL block. The last two product terms are available to form preset and reset signals for the flip-flops in the PAL block's 16 macrocells.

Each macrocell is allocated a cluster of 5 product terms. Clusters may be redirected from a

macrocell to other adjacent macrocells, allowing up to 20 product terms to feed a single

macrocell. The Mach 4 architecture is designed so that all 5 product terms in a cluster may be

diverted from a macrocell (leaving the macrocell unused), or optionally, only 4 of the 5 product

terms can be redirected, allowing a single product term function to be implemented in the

macrocell. This redirection of product term clusters is controlled by the logic allocator. In

essence, the functionality of the PAL block is in between a PAL and a PLA since the clusters of

product terms that feed a particular macrocell are not entirely fixed.

Only 8 of the 16 macrocells in a PAL block may drive an I/O pin, as controlled by the output switch matrix. Each of the 16 macrocell outputs as well as 8 registered input signals and 8 I/O pin signals are sent to an input switch matrix which multiplexes 24 of the 32 signals into the central switch matrix. The programming technology for the Mach 4 is EEPROM.

Figure 2.6: Portion of AMD Mach 4 PAL Block [AMD96]

2.4 Look-Up-Table-Based Logic Blocks

Look-up-tables (LUTs) are memories that are characterized by the number of address

lines they possess. A look-up-table with K address lines has 2^K storage elements and it can

implement any boolean function of up to K inputs. A LUT possessing K inputs is referred to as a

K-LUT. Figure 2.7 shows the basic structure of a 4-input look-up-table (4-LUT). A LUT consists

of a multiplexer decoding tree and storage elements. The storage elements are programmed with

the truth table of the logic function being implemented in the LUT. Inputs to the LUT connect to

the multiplexers and select a particular storage element whose contents is passed to the LUT

output.

Figure 2.7: Structure of Look-Up-Table (multiplexer decoding tree selecting one of the storage elements according to inputs 0-3)
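A K-input LUT can be modelled in software as a 2^K-bit truth table indexed by the input values; the sketch below does this for K = 4. The stored function and the bit ordering are arbitrary illustrative choices, not any vendor's convention.

```c
#include <stdio.h>

#define K 4                               /* number of LUT inputs */

/* Bit n of `table` holds the function value for the input pattern whose binary
 * encoding is n; the loop below plays the role of the multiplexer decoding tree. */
static int lut_eval(unsigned table, const int in[K]) {
    unsigned addr = 0;
    for (int i = 0; i < K; i++)
        addr |= (unsigned)(in[i] != 0) << i;
    return (table >> addr) & 1u;
}

int main(void) {
    /* Program the storage elements with the truth table of f(a,b,c,d) = (a AND b) OR d. */
    unsigned table = 0;
    for (unsigned addr = 0; addr < (1u << K); addr++) {
        int a = addr & 1, b = (addr >> 1) & 1, d = (addr >> 3) & 1;
        if ((a && b) || d) table |= 1u << addr;
    }
    int in[K] = {1, 1, 0, 0};             /* a=1, b=1, c=0, d=0 */
    printf("f = %d\n", lut_eval(table, in));   /* prints 1 */
    return 0;
}
```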

A basic LUT-based LPGA or FPGA architecture would consist of a homogeneous array of LUT-based logic blocks, with each LUT having a fixed number of inputs. The logic blocks

would contain a register and circuitry to allow the LUT output to be either registered or

combinational. Drive circuitry would be present for each logic block output.

2.4.1 Previous Research

The earliest research on LUT architecture was conducted by Rose, Francis, Lewis, and

Chow [Rose89][Rose90]. This work focused on how the area-efficiency of LUT-based FPGAs

changes as the size of the LUT in the logic blocks changes. An experimental study revealed that

LUT architectures with between 3 and 4 inputs are the most area-efficient, when both logic and

routing area are taken into account. The work also showed that it is beneficial for logic blocks to

contain a flip-flop.

As well as studying PLA-based logic blocks, Kouloheris studied LUT-based FPGAs [Koul93]. His work on area-efficiency confirmed the results in [Rose90]. In a study of the speed

of LUT-based FPGAs, he found that LUTs with 4-5 inputs should be used when the switches in

the FPGA interconnection network have a small time constant, as is the case with anti-fuse routing switches; his recommended LUT size differs when pass-transistor switches are used in the interconnection network.

Research by Singh also focused on the speed of FPGAs with LUT-based logic blocks

[Sing91]. Singh suggests that LUTs with 6 inputs provide the best speed performance.

Research by He [He94] focused on heterogeneous FPGA architectures containing LUTs of two different sizes. He developed synthesis tools and studied a wide range of heterogeneous architectures and determined that an architecture with a combination of 2-input and 4-input LUTs was more area-efficient than the best homogeneous architecture¹.

Other research has centered on increasing the speed of LUT-based FPGAs by hard-wiring

some of the logic blocks together, in an attempt to minimize the number of times that time-critical

signals pass through the slow programmable interconnection network [Chun94].

2.4.2 Synthesis Techniques

Similar to the case of PLA-style blocks, technology mapping for LUTs is fundamentally

different than library-based technology mapping. This is because of the wide range of functions

that may be implemented in a single LUT. For example, a 4-LUT may implement up to 2^(2^4) =

65536 different functions (research has shown this number may be reduced somewhat [Zili96]).

Library-based technology mapping is not feasible for LUT architectures because the library

would be too large to be repetitively searched exhaustively during technology mapping. Many

technology mapping algorithms for LUTs have been developed and several are described briefly

below. The goal of each of these algorithms is to map a circuit into a network of K-input LUTs.

Special attention is given to one algorithm, called Level-Map [Farr94], because it is used in this

thesis in the CAD flow for foldable LUT-based logic blocks.

2.4.2.1 LUT-Based Technology Mappers

FlowMap is a technology mapping algorithm for LUTs that produces solutions with

optimal depth [Cong94]. The algorithm translates the problem of finding a minimal depth

implementation for each node in a circuit into the problem of determining the maximum flow in a

network. FlowMap considers minimizing the number of LUTs in the solution as a secondary

1. This conclusion was drawn using an area mode1 based on the total number of SRAM bits in an FPGA's logic biocks and the total number of logic block pins.

depth constraints on non-critical paths.

Chortle-crf [Fran91a][Fran92] is a technology mapper that focuses on minimizing the number of LUTs in the mapping solution. In this algorithm, a circuit's DAG is broken into a forest of trees and a first-fit-decreasing bin packing algorithm is applied to each tree to pack as many nodes as possible into a single LUT. The algorithm makes an effort to consider reconvergent paths and also attempts to eliminate nodes by collapsing multi-fanout nodes into their successors. Chortle-d [Fran91b] is a version of Chortle that minimizes depth rather than area.

Other LUT mappers include mis-pga [Murg95] which attempts to minimize area, RMAP

[Sch194] which focuses on producing routable mapping sofutions, and M.Map [Chen951 which

combines the technology mapping problem with placement on a two-dimensional 'may.

2.4.2.2 Level-Map

Farrahi and Sarrafzadeh proved that the problem of mapping an arbitrary DAG into an optimally minimal number of LUTs is NP-complete for K ≥ 5 [Farr94]. The authors present a heuristic algorithm called Level-Map that produces solutions with fewer LUTs than solutions produced by Chortle-crf, FlowMap, and FlowMap-r1 [Cong94a].

1. FlowMap-r (like CutMap) is a version of FlowMap that allows a user to relax the depth constraints on non-critical paths to help minimize the number of LUTs in the mapping solution.

Level-Map works by traversing a network from its primary inputs towards its primary

outputs. During the traversal, LUTs are assigned to some DAG nodes, meaning that the output

signals of these nodes will become output signals of LUTs in the mapping solution. For each

node, v, in the network, two parameters are computed: the node's dependency, d_v, and the node's contribution, c_v. These parameters are defined as follows: if a node, v, has been assigned a LUT or v is a primary input, c_v is assigned the value 1. Otherwise, the contribution of a node, c_v, is equal to the sum of the contributions of its immediate fanin nodes. The dependency of a node is equal to 1 if the node is a primary input. Otherwise, the dependency of a node is equal to the sum of the contributions of its immediate fanin nodes. Given these definitions, if a LUT is assigned to a node, v, in the mapping solution, that LUT will have d_v inputs (with each of v's immediate fanins contributing a certain amount to d_v). When the algorithm traverses the network and encounters a node, v, for which d_v is greater than K, the algorithm proceeds to assign LUTs to some of v's fanin nodes. Fanin nodes are selected to be assigned LUTs on the basis of their contribution and their fanout1. This assignment of LUTs to v's fanin nodes continues until d_v is less than or equal to K. A feasible K-LUT mapping solution has been found when the dependency value for each node in the network is less than or equal to K. The final step of the algorithm is to assign LUTs to any primary outputs of the network that have not already been assigned LUTs.

1. Here, fanout refers to the out-degree of a node (the number of DAG edges emanating from a node).
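To make the bookkeeping concrete, the following sketch expresses the contribution and dependency computation in Python. It is an illustration only, not Level-Map itself: the network representation, the topological traversal, and the rule used here for choosing which fanins receive LUTs (largest contribution first) are assumptions of the sketch, since the text above describes the selection rule only qualitatively.

# Illustrative sketch of Level-Map-style contribution/dependency bookkeeping.
# Assumptions: 'fanins' maps each node to its immediate fanin nodes, 'nodes'
# is in topological order (primary inputs first), and K is the LUT size.
def level_map_sketch(nodes, fanins, primary_inputs, primary_outputs, K):
    has_lut = set()              # nodes whose output becomes a LUT output
    contribution = {}
    dependency = {}

    def recompute(v):
        if v in primary_inputs:
            contribution[v] = dependency[v] = 1
        else:
            dependency[v] = sum(contribution[u] for u in fanins[v])
            contribution[v] = 1 if v in has_lut else dependency[v]

    for v in nodes:
        recompute(v)
        # If a LUT rooted at v would need more than K inputs, assign LUTs
        # to some of v's fanins (here: the fanin with the largest
        # contribution) until d_v drops to K or below.
        while v not in primary_inputs and dependency[v] > K:
            candidates = [u for u in fanins[v]
                          if u not in has_lut and u not in primary_inputs]
            if not candidates:
                break
            pick = max(candidates, key=lambda u: contribution[u])
            has_lut.add(pick)
            contribution[pick] = 1       # a LUT output contributes 1
            recompute(v)

    has_lut.update(primary_outputs)      # final step: POs become LUT outputs
    return has_lut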

2.4.3 Commercially Available LUT-Based FPGAs

This section presents the architecture of two commercially available LUT-based FPGAs:

the Altera FLEX 10K and the Xilinx XC4000. Other LUT-based FPGAs include the ORCA

FPGAs by Lucent Technologies [Luce96].

2.4.3.1 Altera FLEX 10K

The architecture of the Altera FLEX 10K FPGA is shown in Figure 2.8. Its hierarchical routing architecture is similar to that of the MAX 9000 described previously. The logic blocks, called logic elements (LEs), are 4-input LUTs with programmable registers. LEs are grouped into sets of 8 to form logic array blocks (LABs). Each LAB has local interconnect resources that connect LEs in the same LAB. Connections between LEs in different LABs are made using FastTrack row and column interconnect.

The FLEX 10K contains embedded array blocks (EABs), which are 2048-bit synchronous RAMs that can be used to implement memory within a design or may be used as large LUTs to implement logic functions. The EABs can be used in four different configurations: 2048 x 1, 1024 x 2, 512 x 4, or 256 x 8. In addition, the multiple EABs on a single FLEX 10K device may be combined to create wider RAMs2.

2. For example, two EABs in 256 x 8 mode may be combined to form one 256 x 16 RAM.

Figure 2.8: Architecture of Altera FLEX 10K [Alte96]

A FLEX 10K LAB is shown in Figure 2.9. Each LE in a LAB is provided with four control signals of which two may be used as clocks and two as preset and clear for the register in each LE. The output of each LE in a LAB may drive either row or column FastTrack interconnect; an LE's output may also drive an input on an LE in the same LAB through the local LAB interconnect. Signals enter a LAB from row FastTrack interconnect.

Figure 2.9: Altera FLEX 10K Logic Array Block (LAB) [Alte96]

A FLEX 10K logic element is shown in Figure 2.10. The output of the 4-LUT in the LE may either be registered or combinational. Each LE has carry-in and carry-out signals that travel to neighbouring LEs; the signals can be used to implement fast arithmetic and counter circuitry. Furthermore, each LE has cascade circuitry that allows the output of the 4-LUT to be logically ORed or ANDed with the output of the LUT in the LE above. The register in the LE may be cleared or preset using either of the control lines LABCTRL1 and LABCTRL2, or using the input

SRAM bits are used to configure the LEs and routing in the FLEX 10K. The FLEX 10K is available in sizes ranging from 10000 to 100000 gates [Alte96].

Figure 2.10: Altera FLEX 10K Logic Element [Alte96]

2.4.3.2 Xilinx XC4000

The architecture of the Xilinx XC4000 FPGA is shown in Figure 2.11. It consists of a two-dimensional array of LUT-based logic blocks called configurable logic blocks (CLBs). Each row or column of CLBs is interleaved with routing channels that form the XC4000 interconnection network. Unlike the Altera FLEX 10K, the XC4000 possesses a flat routing architecture.

Figure 2.11: Architecture of Xilinx XC4000

The XC4000 CLB is shown in Figure 2.12 below. The 13 inputs and two levels of LUTs in a CLB allow it to implement any function of 5 variables, any two functions of four variables, and some functions of up to nine variables. Each of the four control inputs C1, C2, C3, and C4 can be mapped onto any of the four internal signals H1, DIN, SR, and EC. The functions of these internal signals are shown in Figure 2.12. The CLB contains two flip-flops and each can be driven by any of the signals F', G', H', or DIN. The CLB has one output for each flip-flop and two additional unregistered outputs.

The CLB has several additional features not shown in Figure 2.12. First, the CLB has

built-in fast carry logic in which the LUTs producing F' and G' are configured as two full adders

with dedicated carry circuitry. This feature can enhance the speed of arithmetic circuits. Another

feature is the option of using the SRAM bits in the F' and G' LUTs as write-able memory

elements. The 32 SRAM bits (there are 16 in each LUT) can be used in a 32 x 1, or a 16 x 2

configuration. In this memory mode, the control bits C1 - C4 act as memory-specific signals like write-enable and data-in; the F1 - F4 and G1 - G4 inputs serve as memory address lines.

Figure 2.12: Xilinx XC4000 Configurable Logic Block [Xili94]

The routing tracks in each routing channel of Figure 2.11 consist of wires of varying

length including single length, double length, quad length, and long lines. Single and double

length lines are shown in Figure 2.13. Single length lines pass through switch matrices every time

a horizontal routing channel intersects with a vertical channel; whereas double length lines pass

through switch matrices half as often, thus offering smaller delays for longer routing connections.

The XC4000 also has long lines that run both vertically and horizontally, spanning the entire

height and width of the device. These long lines are useful for implementing signals that require

low skew or for implementing high-fanout nets.

The XC4000 is available in a variety of sizes ranging from 2000 - 130000 gates. Users

targeting Xilinx FPGAs must synthesize their circuits into a library of primitive gates which are

then mapped into LUTs, placed, and routed using the Xilinx XACT toolset [Xili95].

Figure 2.13: Portion of XC4000 Routing Architecture [Xili94] (single and double length lines; each switch point consists of six pass transistors)

2.5 Commercially Available LPGAs

This section describes two commercially available LPGAs manufactured by Chip

Express: the QYH 500 and the state-of-the-art CX2001 LPGA. Circuits are mapped into these

LPGAs using library-based technology mappers such as the Synopsys Design Compiler [Syn96].

2.5.1 QYH 500 LPGA

The architecture of the Chip Express QYH 500 LPGA is depicted in Figure 2.14. It

consists of rows of logic blocks interleaved by routing channels. I/O cells surround the array of

logic and routing. Its logic blocks are similar to those found in traditional MPGAs [West94] since

each block (logic site) consists of four transistors: two p-type and two n-type. The four transistors

can be linked together in many ways allowing a 2-input NAND, a 2-input NOR, or an inverter to

be implemented in a single site. A D-type flip-flop can be implemented using 7 logic sites. When

latches and flip-flops are implemented in the QYH 500, the clock signals feeding these elements

are routed using the same interconnection circuitry as other signals. This is different than the

architecture of Xilinx or Altera FPGAs [Xili94][Alte96] which have dedicated clock circuitry to

help minimize clock skew. It is possible for users to combine sites on the QYH 500 to form

embedded SRAMs.

Figure 2.15 shows a small portion of a QYH 500 routing channel1 and illustrates how vertical wires are used to connect to logic block pins, or to connect together horizontal tracks in neighbouring routing channels. Initially, each vertical wire is connected to all of the horizontal wires. Figure 2.15 shows the laser cut points that are needed to configure the routing circuitry and gives insight into the laser disconnect concept. Cut points exist on the horizontal routing tracks, allowing them to be cut at any location along a routing channel.

1. An actual QYH 500 routing channel has many more tracks than the 4 shown in Figure 2.15.

Figure 2.14: Architecture and Logic Site of QYH 500 LPGA [CEC96a]

Figure 2.15: Portion of QYH 500 Routing Circuitry [Jana95]

2.5.2 CX2001 LPGA

The CX2001 is also a channelled array and has a routing architecture similar to the QYH

500. The CX2001 logic block is shown on the left side of Figure 2.16. The logic block is coarse-

grained in comparison to that in the QYH 500, and it is similar to the logic blocks in Actel ACT 1 FPGAs [Acte90]. Basic logic gates like NOT, AND, and OR, as well as more complex logic functions, can be implemented in a single block. For example, the function z = ab + ac + bc is implemented by tying some logic block inputs to logic zero or one, as shown

on the right side of the figure. Through the use of feedback, a latch can be implemented in a single

logic block, and therefore, a flip-flop can be implemented using two logic blocks.

Figure 2.16: CX2001 Logic Block [CEC96a] and Example Function

Several other features of the CX2001 logic block include the option of bypassing the second-level multiplexer and passing the output of a first-level multiplexer directly to the logic block output. Timing of the chip is further enhanced by programmable drive on the output of each logic block that enables a block to be used in 1X, 2X, or 3X drive mode.

The CX2001 has embedded 8-Kbit SRAM blocks that reside along the sides of the array

of logic and routing. The memories are synchronous and each may be used as a FIFO, single or

dual port RAM, or as a ROM to implement logic. Like the Altera EABs, the depth and width of

the memory blocks are programmable.


3.1 Introduction

In this chapter, architecture and synthesis techniques for foldable PLA-style logic blocks

are introduced. Section 3.2 defines the proposed logic block architecture and its relevant

parameters. Section 3.3 discusses synthesis algorithms that may be used to map circuits into

foldable PLA-style blocks. These algorithms have been implemented in a set of custom-

developed CAD tools.

3.2 Foldable PLA-Style Logic Block Architecture

Chapter 2 introduced the notion of logic block architecture. A PLA is characterized by its number of inputs, product terms, and outputs. The logic function implemented by a PLA can be described using a personality matrix [Wong87]. A personality matrix for two combinational functions is shown in Figure 3.1. The rows of the personality matrix correspond to product terms, while the columns correspond to inputs and outputs. A '1' in an input column indicates that an input is present in its 'true' form in a product term; a '0' indicates that an input is present in its complemented form; a '-' represents a "don't care" and indicates that an input is not used in a product term. The '1', '0', and '-' have similar meanings when used in an output column, indicating whether or not a product term is present in the sum-of-products form of the function corresponding to the output. Previous research has shown that on average, about 87% of the entries in the personality matrices of large nodes in real circuits are "don't cares" [Kavi97].

Figure 3.1: Example PLA Personality Matrix
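For illustration, a personality matrix of this kind can be represented directly as rows of characters. The toy example below (Python; it is not drawn from Figure 3.1 and is not part of the thesis tools) stores three product terms over four inputs and two outputs and measures the fraction of don't-care entries, the statistic discussed above.

# Toy personality matrix: each row is a product term; the first four entries
# are the inputs a, b, c, d and the last two are the outputs y, z.
# '1' = input in true form, '0' = complemented, '-' = don't care; in an
# output column, '1' means the product term appears in that output.
inputs = ['a', 'b', 'c', 'd']
outputs = ['y', 'z']
rows = [
    ('1', '0', '-', '-', '1', '-'),
    ('-', '1', '1', '-', '1', '1'),
    ('0', '-', '-', '1', '-', '1'),
]

entries = [e for row in rows for e in row]
fraction = entries.count('-') / len(entries)
print(f"{fraction:.0%} of the entries in this small example are don't cares")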

PLA folding was first introduced as a method for reducing the silicon area consumed by

PLAs in custom VLSI. A PLA's area is proportional to the number of columns in its personality matrix multiplied by the number of rows. Folding leverages the high percentage of "don't cares" in personality matrices and reduces area by

allowing two columns of a personality matrix to reside on a single physical column (column folding), or by allowing two rows of a personality matrix to reside on a single physical row (row folding). Column folding is illustrated in Figure 3.2. A normal unfolded PLA is shown on the left side of the figure1. A folded PLA in which three column pairs are folded onto single physical columns is shown on the right side of the figure. Notice the "breaks" that occur on the folded columns. An example of row folding is depicted in Figure 3.3. In the example, four product terms are folded onto two physical rows. The row folded PLA has two OR-planes, one on each side of the AND-plane. Column folding eliminates columns from a PLA; row folding eliminates product term rows from a PLA. It is also possible to combine row and column folding. Combined folding can be applied to eliminate both rows and columns from a PLA; a combined folded PLA has breaks on both its columns and its rows. The amount of folding in a PLA is quantified by a parameter called the size of the folding [Egan84]. This parameter is equal to the number of columns or rows eliminated from the original PLA. The size of the column folding in Figure 3.2 is 3; the size of the row folding in Figure 3.3 is 2.

Figure 3.2: PLA Column Folding

1. In Figure 3.2, a single column is used to represent both the true and complemented versions of each input signal.

Figure 3.3: PLA Row Folding

Figures 3.2 and 3.3 show that PLAs can be folded by cutting either physical input columns

or physical product term rows. These structures can be implemented using the metalization layers

in a VLSI chip. Since metal lines can be cut in LPGA technology, it is possible to build an LPGA

with foldable PLA-style logic blocks. In such an architecture, each PLA-style logic block in the

array has a fixed number of physical input columns, product term rows, and outputs. Folding is

applied to facilitate packing additional logic into each logic block. As an example, consider the

PLA and logic block shown in the top portion of Figure 3.4. Clearly, the PLA shown in the figure

does not fit into the logic block, because it needs 6 product terms and 7 inputs. However, it is easy

to fit the PLA into the block by using folding. The bottom part of Figure 3.4 shows how two

columns and one row can be folded to accommodate the PLA. The notion of an array of foldable

PLA-style logic blocks in an LPGA represents an entirely new application for PLA folding, since

it has previously been applied only for single custom-fabricated PLAs. The empirical study in

Chapter 5 is concerned with evaluating the area-efficiency of architectures with foldable PLA-

style logic blocks and comparing it to the area-efficiency of architectures with normal unfoldable

logic blocks. The rest of this section elaborates on the architectural details of foldable PLA-style

logic blocks.

In simple folding, at most two inputs (or product terms) may share a single physical column (or row); in multiple folding, no such limit is placed on the number of inputs (or product terms) that may share a single physical column (or row). One advantage that simple column folding has over multiple column folding is that input signals connect to the folded PLA from either the top, or the bottom of its AND-plane. This is because there is at most one break in any given column. This simplifies routing signals to the PLA since signals never need to be connected to the middle of a column. Furthermore, multiple column folding may result in many signals being connected to a single logic block which may prove to be unroutable in a PLA-based LPGA. In addition, if row folding is constrained to be simple, then the PLA outputs may always be placed along the left and right sides of the PLA1. Despite the fact that multiple folding may result in larger area reductions for PLAs in custom VLSI [Liu94], simple folding is the most appropriate choice for PLA-style logic blocks in an LPGA.

3.2.2 Bipartite Folding

Bipartite folding2 is a type of constrained folding in which all of the breaks occur at the same level in the PLA [Kuo85]. For example, the column folding of Figure 3.2 is a bipartite folding because the three breaks occur at the same vertical level. In general forms of PLA folding, the breaks may occur at several different levels within the same PLA.

Bipartite folding has two advantages over general folding. Consider the example of

general column folding in which breaks occur at different vertical levels within a PLA. The

different levels of breaks force specific pairs of input signals to share a column. This is not the

case in bipartite folding. For example, in the column folded PLA of Figure 3.2, input signal e was

paired with input signal a. However, since all of the breaks occur at the same vertical level, input signal e could have been paired with any of the signals a, b, or d. This flexibility in pairing allows a greater number of logic block pins to be logically equivalent, and this may make it easier to

route signals to foldable PLA-style logic blocks in an LPGA.

The second advantage of bipartite folding is that it introduces fewer constraints on

subsequent folding. This notion will be explained in the next section.

1. Simple row-folded PLAs have an OR-AND-OR structure as shown in Figure 3.3.
2. Bipartite folding is referred to as block folding in [Kuo85].

Most of the literature on PLA folding considers only column folding; however, a study by

Egan and Liu [Egan84] showed that in many cases, area reductions from row folding were

achievable when it was performed on bipartite column folded PLAs. Bipartite column folding can

be perceived as a partitioning of the product terms of a PLA into two classes: those above the

breaks, and those below the breaks. Egan and Liu point out that in subsequent row folding, only

product terms belonging to the same class can be considered as folding pairs to share a single

physical product term row. In more general non-bipartite forms of column folding, the product

terms will be partitioned into a greater number of classes since breaks can occur at several levels

in the same PLA. This has the effect of limiting the number of combinations of product terms that

may be paired together during subsequent row folding, and it serves as good motivation for using

bipartite folding instead of more general forms of folding.

3.2.4 Summary of Architectural Parameters

The architecture of a foldable PLA-style logic block is shown in Figure 3.5; its

architectural parameters are summarized in Table 3.1. A foldable PLA-style logic block is

characterized by its number of input columns (I), product term rows (P), and outputs (O), along with the type of folding that is permitted for the block. The parameters of a PLA-style logic block are expressed using the tuple (I, P, O). Row and combined foldable logic blocks have two OR-planes; hence, in these blocks, the outputs are divided such that there is an equal number in each OR-plane. Column and combined foldable logic blocks allow signals to enter the PLA from both the top and bottom of the AND-plane. Note that a column foldable PLA-style logic block with I input columns actually has 2 x I inputs, whereas an unfoldable PLA-style logic block with I input columns has I inputs. Figure 3.5 shows the laser cut points that are necessary to make the logic block row, column, or combined foldable. Although not shown in Figure 3.5, each output of the block has an associated register, which can either be used or bypassed.

Figure 3.5: Foldable PLA-Style Logic Block (the block has I input columns and O outputs, where O = O-left + O-right outputs appear on the left and right OR-planes; laser cut points on the rows and columns enable row, column, or combined folding)

Table 3.1: Foldable PLA-Style Logic Block Architectural Parameters

Parameter      Description
I              Number of input columns.
P              Number of product term rows.
O              Number of outputs.
Folding Type   Logic blocks may be unfoldable, row foldable, column foldable, or combined foldable.

To map circuits into the proposed architecture, a technology mapping CAD tool must be

able to transform an arbitrary digital circuit into a network of foldable PLA-style logic blocks.

The CAD tool must use folding effectively to minimize the number of logic blocks, while at the

same time, it must produce feasible mapping solutions, containing logic blocks that do not violate

the constraints on the number of logic block input columns, product term rows, and outputs.

Lastly, when the CAD tool is used to map circuits into logic blocks that are not foldable, it must

do at least as well at minimizing the number of logic blocks as existing technology mappers for

PLA-style blocks, or else, it will be difficult to assess the gains associated with folding.

3.3 Synthesis

This section introduces a CAD flow that may be used to map circuits into architectures with foldable PLA-style logic blocks.

Section 3.3.2 outlines appropriate technology independent synthesis methods. Section 3.3.3

describes a new CAD tool, called hooPLA, that has been designed and implemented to perform

technology mapping for PLA-based architectures. Section 3.3.4 discusses the synthesis

techniques used to perform PLA folding.

3.3.1 Overview of CAD Flow

Figure 3.6 illustrates the CAD flow used to map circuits into architectures with foldable PLA-style blocks. This CAD flow is used in the empirical study of foldable architectures in Chapter 5. As depicted in the figure, circuits may be in any of three different forms: MCNC circuits [Yang91] in EDIF1, HDL circuits written at the behavioural level, or HDL circuits written

in RTL (register transfer-level). Circuits are read into the Synopsys Design Compiler [Syn96]

where they are subjected to technology independent synthesis and mapped into a netlist of gates

from an intermediate target library. The behavioural HDL circuits must be synthesized into an

RTL form using the Synopsys Behavioral Compiler [Knap96][Syn96] before they can be read into

the Design Compiler. Lastly, the intermediate Synopsys generated netlist is read into hooPLA,

where circuits are mapped into foldable PLA-style logic blocks.

1. The MCNC circuits [Yang91] are initially in an EDIF (electronic design interchange format) netlist format composed of gates from an MCNC library.

Figure 3.6: CAD Flow for Mapping Circuits into Foldable PLA-Style Logic Blocks (behavioural HDL, RTL HDL, and MCNC EDIF circuits pass through the Synopsys Design Compiler; the resulting 8-bounded Verilog netlist is translated into BLIF using veriblif and then mapped by hooPLA into a netlist of folded PLA-style blocks)

3.3.2 Technology Independent Synthesis

Logic synthesis can typically be divided into two steps: logic optimization (or technology independent synthesis) and technology mapping (or technology dependent synthesis). The first step manipulates the boolean equation representation of a circuit with goals such as minimizing the number of literals or reducing depth [Brow92][Toua91]. Some frequently used methods include factoring and substitution [Bray87][DeMi94][Murg95]. This step is labelled 'technology independent' because it manipulates a circuit without any concern for the type of logic block available in the target technology. SIS1 [Sent92], the sequential interactive system, is a logic synthesis tool commonly used to perform this step. Technology independent synthesis is followed by technology mapping in which the optimized circuit is mapped into logic blocks resembling

those in the target technology.

The CAD flow of Figure 3.6 attempts to leverage the technology independent synthesis

internal to a commercial CAD tool, Synopsys. The reason for using this methodology is so that the proposed foldable architectures can be fairly compared with existing commercial LPGA architectures, which are targeted using Synopsys.

1. SIS was developed at the University of California at Berkeley.

A drawback of using Synopsys is that the tool does not allow a user to access a circuit after technology independent synthesis but

before technology dependent synthesis. That is, it is not possible to view a circuit in terms of

boolean equations before it is mapped into the gates of a target library. FPGA companies like

Altera [Alte95] and Xilinx [Xili95] who allow their customers to use Synopsys, have dealt with

this by requiring that customers map circuits into a special target library whose elements are

interpretable to the Altera and Xilinx CAD tools. The choice of which specific logic elements

should be in this intermediate target library is a research issue in itself since it has to do with

which primitive logic elements are best to represent the majority of circuits, given certain

optimization criteria (for example, minimum area or maximum speed). For this research, circuits

are mapped into the elements from Altera's MAX 9000 CPLD library and its FLEX 8000 FPGA

library [Alte95]. The gates of the target library are 8-bounded, requiring that the foldable PLA-

style logic block architectures considered have greater than or equal to 8 inputs.

Synopsys allows a user to set constraints to direct the tool to optimize for speed, area, or

some combination of the two. Although the study in Chapter 5 of this thesis considers both the

area consumed by architectures, as well as the number of levels of logic blocks on circuits' critical

paths, Synopsys was directed to optimize for area.

3.3.3 hooPLA: Technology Mapping for Foldable PLA-Style Logic Blocks

The hooPLA algorithm is a technology mapper for architectures with foldable PLA-style

blocks. Technology mapping for PLA-style logic blocks is considerably different than technology

mapping for look-up-tables because PLA-style blocks have a limited number of product terms. In

hooPLA, the technology mapping problem is broken into three phases: performing an optimal tree

mapping, heuristic partial collapsing, and bin packing. The algorithmic flow of hooPLA is

somewhat similar to that of the Chortle technology mapper for LUT-based architectures [Fran92],

and the DAGON algorithm [Keut87] for library-based technology mapping. The hooPLA

algorithm has been implemented in the C language within the SIS [Sent92] framework, allowing

hooPLA to access the I/O routines and two-level logic minimization algorithms within SIS. This

section explains hooPLA in the context of mapping circuits into normal unfoldable logic blocks;

Section 3.3.4 explains how folding is integrated into hooPLA. The first phase of hooPLA uses dynamic programming to optimally map the fanout-free trees contained within a circuit's directed acyclic graph representation.

3.3.3.1 hooPLA Phase I: Performing an Optimal Tree Mapping

The combinational part of a circuit may be represented using a directed acyclic graph (DAG). To begin, assume that the goal is to map a circuit into an architecture with normal unfoldable PLA-style blocks having the parameters (I, P, O). Technology mapping begins by partitioning a circuit's DAG into a forest of fanout-free trees1. This is accomplished by identifying

the nodes within the DAG that have an out-degree greater than one, and using these nodes as

'breaking points'. This is illustrated in Figure 3.7 in which a DAG is broken into three trees. The

reason for breaking a circuit's DAG into a forest of trees is to divide the technology mapping

problem into smaller and simpler sub-problems. Technology mapping for fanout-free trees is

simpler because no node in a fanout-free tree has an out-degree greater than one and therefore, it

is not necessary to consider replication of logic.

Figure 3.7: Partitioning a DAG into a Forest of Fanout-Free Trees
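The breaking-point idea can be sketched in a few lines of Python (illustrative only; hooPLA itself is implemented in C within SIS). The sketch assumes the circuit is given as a dictionary of fanin lists; nodes driving more than one other node, and primary outputs, become tree roots, and every other node is collected into the tree of the root it feeds.

from collections import defaultdict

def split_into_fanout_free_trees(fanins, primary_outputs):
    """fanins: dict mapping each node to its list of fanin nodes (primary
    inputs have an empty list). Returns {tree_root: nodes_in_that_tree}."""
    fanout_count = defaultdict(int)
    for node, ins in fanins.items():
        for u in ins:
            fanout_count[u] += 1

    # Breaking points: multi-fanout nodes and primary outputs become roots.
    roots = {n for n in fanins
             if fanins[n] and (fanout_count[n] > 1 or n in primary_outputs)}

    def collect(node, tree):
        tree.add(node)
        for u in fanins[node]:
            # Stop at other roots and at primary inputs; they are the
            # leaves of this fanout-free tree.
            if u in roots or not fanins[u]:
                continue
            collect(u, tree)
        return tree

    return {r: collect(r, set()) for r in roots}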

Primary input nodes are added to each of the fanout-free trees by modifying them in the

following way: for each leaf vertex, n, in a fanout-free tree, T = (V, E) , a new primary input

node, p, is added to the vertex set V. An edge, e = (p, n), is created and added to the edge set

E. The primary input node p is a dummy node and implements no combinational logic function.

Before explaining the algorithm further, it is necessary to define several terms:

1. Fanout-free trees are trees in which no node has an out-degree greater than one.

feasible node - a node with the property that when simplified, it has less than or equal to I inputs and less than or equal to P product terms. The two-level logic minimizer Espresso [DeMi94] is used to simplify combinational nodes.

feasible subtree - a subtree of a fanout-free tree with the special property that it can be collapsed into a single feasible node. Feasible subtrees are not allowed to possess any of the dummy primary input nodes.

cone at n - a subtree of a fanout-free tree consisting of a node, n, and all of n's predecessors.

size of node n - the size of a node with i inputs and p product terms is equal to p x i.
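These definitions amount to a very small feasibility test once a node's simplified cover is known. The sketch below (Python, illustrative only) assumes that the input and product term counts have already been obtained from two-level minimization, which in hooPLA is done by Espresso and is not reproduced here.

def is_feasible(num_inputs, num_product_terms, I, P):
    """A node is feasible for an (I, P, O) block if, after simplification,
    it needs at most I inputs and at most P product terms."""
    return num_inputs <= I and num_product_terms <= P

def node_size(num_inputs, num_product_terms):
    """Size of a node with i inputs and p product terms is p x i."""
    return num_product_terms * num_inputs

# Example: a node with 9 inputs and 7 product terms fits a (10, 12, 4)
# block and has size 63.
assert is_feasible(9, 7, I=10, P=12)
assert node_size(9, 7) == 63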

After partitioning the DAG, dynamic programming is used to map each fanout-free tree

into a new tree possessing the minimum number of feasible nodes. The trees in the forest of

fanout-free trees can be mapped in any order.

To map a fanout-free tree, T = (V, E) , the algorithm traverses the nodes of the tree in a

bottom-up (leaves to root) manner. As each node, n , is visited in turn, the algorithm proceeds to

find the set, S(n), of all feasible subtrees of T rooted at n. Espresso [DeMi94] is used to determine if a particular subtree rooted at n is a feasible subtree; that is, the subtree can be collapsed into a feasible node with less than or equal to P product terms and less than or equal to I inputs. A cost is computed for each feasible subtree and the feasible subtree of minimum cost is selected and stored at node n. Cost(n) is an integer that refers to this minimum cost.

Primary input nodes implement no combinational logic function and are assigned a cost of zero. All other nodes in V initially have no cost assigned. A set of steps is performed repetitively until all of the nodes in V have been assigned a cost.

Step 1: Select a node, n, from V that has not yet been assigned a cost but whose fanin nodes have been assigned a cost (this implies a bottom-up tree traversal).

Step 2: Determine S(n) - the set of all feasible subtrees rooted at n.

Step 3: Assign a cost to node n using the formula:

Cost(n) = min_{T' ∈ S(n)} [ 1 + Σ_{v ∈ FI(T')} Cost(v) ]    (3.1)

where T' = (V', E') is a feasible subtree rooted at n belonging to the set S(n); FI(T') is the set of nodes in the fanout-free tree, T = (V, E), that are not nodes in the feasible subtree T' but that fan out to nodes in T':

FI(T') = { v | v ∈ V, v ∉ V', (v, w) ∈ E, w ∈ V' }    (3.2)

In equation (3.1), Cost(n) is equal to the minimum number of feasible nodes needed to

implement the cone at n . Each subtree, T', in S(n) can be collapsed into a single feasible node

in the mapping solution; this is the reason for the 1 in the first term inside the brackets of equation

(3.1). The summation term tallies the costs of nodes in T that fan out to nodes in the subtree T'. The min function selects the feasible subtree rooted at n that results in the minimum cost mapping of the cone at n. Figure 3.8 shows a node, n, along with three feasible subtrees rooted at n. The cost of each of n's predecessors is shown internal to each node. Notice that previously computed costs are used in the computation of Cost(n). The last node to be assigned a cost is the root of the fanout-free tree being mapped.

Figure 3.8: Computation of Feasible Subtree Cost (the annotations in the figure give the cost of each feasible subtree; for example, 1 + 1 + 1 + 4 + 6 = 13)
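The bottom-up cost computation of equations (3.1) and (3.2) can be sketched as follows (Python, illustrative only). The helper enumerate_feasible_subtrees(n) is assumed to return, for each feasible subtree rooted at n, the corresponding set FI(T'); in hooPLA this enumeration involves Espresso and is far more involved than the sketch suggests.

def tree_mapping_costs(nodes_bottom_up, primary_inputs,
                       enumerate_feasible_subtrees):
    """Equation (3.1): Cost(n) = min over T' in S(n) of
    ( 1 + sum of Cost(v) for v in FI(T') )."""
    cost = {}
    best_subtree = {}                 # minimum-cost subtree stored at n
    for n in nodes_bottom_up:         # leaves-to-root order
        if n in primary_inputs:
            cost[n] = 0               # dummy nodes implement no logic
            continue
        best = None
        # Assumes every non-primary-input node is by itself a feasible
        # subtree, so at least one candidate always exists.
        for subtree, fi_nodes in enumerate_feasible_subtrees(n):
            c = 1 + sum(cost[v] for v in fi_nodes)
            if best is None or c < best[0]:
                best = (c, subtree)
        cost[n], best_subtree[n] = best
    return cost, best_subtree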

After all of the nodes have been assigned a cost, the final mapping solution for the tree is generated using the minimum cost feasible subtree stored at each node. This is done by considering the root, r, of the original fanout-free tree. The minimum cost feasible subtree stored at the root is implemented as a new feasible node in the mapping solution for the tree. Nodes in T that fan out to this new feasible node are then added to a node set, M. Mapping proceeds by removing a node, m, from M, implementing the subtree stored at m as a new node in the mapping solution, and lastly, identifying nodes in T that fan out to the newly created node and adding them to M. This process continues until M is empty, at which time a network with a minimum number of feasible nodes has been created to implement the function of the original tree. The mapping solution produced by hooPLA for a tree in a real circuit is shown in Figure 3.9. The tree was mapped into three feasible nodes in a target architecture with I = 8 and P = 8.

Figure 3.9: Mapping Solution for a Tree in MCNC Circuit alu4 (feasible nodes with at most 8 inputs and 8 product terms)

A recursive algorithm is used to find the set S(n) - the set of all feasible subtrees rooted at a node, n. The constraint on the number of product terms adds substantial complexity to the enumeration of feasible subtrees. For example, consider the case of finding the set of feasible

subtrees for a node, n , with two fanin nodes, A and B. Assume that the subtree consisting of n

and A is a feasible subtree but that the subtree consisting of n and B is not feasible because it has

more than P product terms. Complexity is introduced because the infeasibility of the n and B

subtree does not imply the infeasibility of the n and A and B subtree. Specifically, the n and A

and B subtree may be feasible because the subtree consisting of n and A may collapse into a

feasible node containing fewer product terms than were originally in node n. Thus, technology

mapping for architectures composed of PLA-style blocks with both product term and input

constraints is significantly different than technology mapping for LUT-based architectures.

The problem of finding an optimal tree mapping for a fanout-free tree possesses the two

elements that make dynamic programming applicable: optimal substructure and overlapping

subproblems [Corm94]. Step 3 finds a feasible subtree, T ' , of minimum cost rooted at each tree

node, n, using the previously computed minimum costs of the predecessors of n (optimal substructure); the minimum cost stored at a node is then used multiple times in the subsequent cost computation of its successors (overlapping subproblems).

After performing technology mapping on all of the fanout-free trees within a circuit's DAG, the mapping solutions for each tree are put back together into a complete circuit. The next phase of hooPLA attempts to eliminate additional nodes from the circuit by collapsing multi-fanout nodes into their successors; that is, by collapsing nodes across tree boundaries.

3.3.3.2 hooPLA Phase II: Heuristic Partial Collapsing

In the circuit created by phase I, any node that can be collapsed into all of its fanouts can be eliminated, provided that all nodes remain feasible after the collapsing. This introduces another optimization problem since collapsing some nodes into their fanouts may preclude the possibility

of collapsing other nodes into their fanouts. This suggests that when given the choice between

collapsing two nodes into their fanouts, choosing one may be better than choosing the other.

Several criteria were identified and studied empirically using 30 benchmark circuits1 to determine

which nodes should be given preference to collapse into their fanouts. These criteria refer to

nodes to be collapsed into their fanouts, and not the new node(s) that would exist after collapsing

was complete. The criteria considered are:

1. Inputs - prefer to collapse nodes with fewer inputs.

2. Product Terms - prefer to collapse nodes with fewer product terms.

3. Node size2 - prefer to collapse small nodes.

4. Fanout - prefer to collapse nodes with low fanout.

To evaluate the criteria, each was applied individually as the selection criteria for partial

collapsing. The number of circuit nodes before and after collapsing was determined and a

percentage reduction was computed for each benchmark circuit. These percentages were then

averaged; hence, each circuit was treated equally in the comparison. The results of this experiment are shown in Table 3.2 for a PLA-based architecture with 10 inputs and 12 product terms. The data show that the selection criteria of inputs, product terms, and node size perform better than selecting nodes on the basis of fanout. Thus, node size was chosen as the primary criteria for selecting nodes to collapse and fanout is used as a secondary criteria.

1. The benchmark circuits used for this experiment are those listed in Appendix A.
2. Recall that the size of a node with i inputs and p product terms is equal to p x i.

Table 3.2: Heuristic Partial Collapsing Criteria

Criteria        Average % Reduction
Inputs          26.8

When performing technology mapping for PLA-style blocks with a single output, each logic block in the final mapped circuit will implement exactly one feasible node. In this case, the goal is to minimize the number of feasible nodes without concern for each node's size (as long as

every node is feasible). However, when the PLA-style logic blocks in the target architecture have

multiple outputs, it may be beneficial to control node size during partial collapsing.

When a node is collapsed into its fanouts, the sum of the sizes of the resultant nodes after

collapsing may be larger than the sum of the sizes of nodes before collapsing. The hooPLA

algorithm allows a user to control this by varying the parameter β in the following relation:

Σ_{t ∈ T} size(t)  ≤  β × ( size(v) + Σ_{s ∈ S} size(s) )    (3.3)

where v is the node to be collapsed into its fanouts; S is the set of v's fanouts before any collapsing; and T is the set of v's fanouts after v has been collapsed into them. The algorithm will not collapse a node into its fanouts if relation (3.3) evaluates false. This allows a user to ensure that collapsing does not overly increase the sum of the sizes of the nodes in the network.
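Assuming relation (3.3) takes the form shown above, the check reduces to a few lines (Python, illustrative only; node sizes would come from the simplified covers, and the real tool re-simplifies the fanouts with Espresso before measuring them).

def collapse_allowed(size_v, fanout_sizes_before, fanout_sizes_after, beta):
    """Permit collapsing node v into its fanouts only if the total size
    after collapsing is at most beta times the total size of v and its
    fanouts before collapsing (relation (3.3))."""
    before = size_v + sum(fanout_sizes_before)
    after = sum(fanout_sizes_after)
    return after <= beta * before

# A very large beta never blocks a collapse (the single-output setting);
# a small beta rejects collapses that inflate the network.
assert collapse_allowed(20, [30, 25], [70, 60], beta=1000)
assert not collapse_allowed(20, [30, 25], [70, 60], beta=1.5)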

When the logic blocks in the target architecture have only one output, β should be set to a large number. This places no restrictions on the size of nodes after collapsing. However, for multi-output logic blocks, it may be advantageous to set β to a smaller value. This was investigated

experimentally in the context of the third phase of hooPLA, and the results are shown in the next

section.

3.3.3.3 hooPLA Phase III: Bin Packing

The final phase of hooPLA packs circuit nodes into the multi-output PLA-style logic

blocks available in the target architecture. This is accomplished using a first-fit-decreasing bin

packing algorithm that attempts to maximize the number of shared inputs between nodes that are

packed into the sarne PLA-style logic block. The bin packing algorithm used is shown in Figure

3.10. A bin packing approach was also used to solve this problem in [Kou1931 and [Kavi96].

Several alternative approaches to the algorithm in Figure 3.10 were also investigated

including, first-fit-decreasing without consideration for shared inputs, and maximally disjoint

input packing that minimized the number of common inputs between nodes packed into the same

PLA-style logic block. The algorithm shown below gave slightly better results than the others that

were considered.

nodeSet ← set of all nodes in the network
while (nodeSet is not empty) {
    plaBlock ← empty block    /* allocate a new PLA-style logic block */
    nodeSel ← largest node in nodeSet (node size = number of inputs x number of product terms)
    Add nodeSel to plaBlock
    Remove nodeSel from nodeSet
    while (nodeSet is not empty and there are nodes in nodeSet that can fit into plaBlock) {
        nodeSel ← node from nodeSet that has the largest number of inputs in common with the
                  nodes already in plaBlock; the node must be able to fit into plaBlock;
                  use node size to break ties
        Add nodeSel to plaBlock
        Remove nodeSel from nodeSet
    }
}

Figure 3.10: Maximum Shared Input Bin Packing Algorithm

To investigate what value of β in relation (3.3) is appropriate for multi-output blocks, β was varied while benchmark circuits were mapped into logic blocks with 10 inputs, 12 product terms, and 4 outputs1. The number of blocks needed to implement each circuit was compared to that attained when β was set to a large value (unrestricted collapsing) and a percentage decrease in the number of logic blocks was computed. These percentages were averaged so that each circuit was treated equally. The results of this experiment are shown in Table 3.3.

1. The (10, 12, 4) architecture was determined to be the most area-efficient PLA-based architecture in [Koul93].

Table 3.3: Effect of Controlled Partial Collapsing

β       Average % Decrease

The results above suggest that it is not always a good idea to pack as much logic as

possible into each feasible node before packing the nodes into multi-output logic blocks. The

results also show that the number of outputs on the PLA-style blocks in the target architecture

should be taken into account when feasible nodes are generated by phases I and II of hooPLA. One direction for future work would involve making modifications to phase I of hooPLA to take this notion into account.

3.3.3.4 Comparison with Existing Technology Mappers

To assess the quality of mapping solutions produced by hooPLA, the tool was compared

with the two technology mapping methods discussed in Chapter 2. In particular, hooPLA was

compared with the tool used by Kouloheris, called DDMAP [Koul93], and also compared with the method of using the SIS [Sent92] partial collapsing function, eliminate1, along with the node partitioning and packing program called Break-a-Node, developed by Kaviani2 [Kavi97]. Note that the first step of DDMAP is to apply a look-up-table technology mapper; Level-Map [Farr94] is used to perform this initial mapping3.

Table 3.4 shows the results when the technology mappers are used to map benchmark circuits into a target architecture containing unfoldable logic blocks with 10 input columns, 12 product term rows, and 4 outputs. The first column of the table lists the benchmark circuits; the second lists the number of logic blocks needed to implement each circuit when hooPLA is used.

1. The partial collapsing routine in SIS is called eliminate. The routine was used in four different ways and the best solution was chosen: 'eliminate -1 24 5', 'eliminate -1 24 2', 'eliminate -1 20 5', and 'eliminate -1 20 2'. The SIS command 'simplify -1' was called after eliminate.
2. Break-a-Node is used in the CAD flow of the hybrid FPGA architecture [Kavi96].
3. Kouloheris conducted his original experiments using the LUT mapper Chortle-crf [Fran92][Koul93].

The third and fourth columns give the results for the eliminate method and DDMAP, respectively.

In these columns, a percentage is given in brackets which represents the amount of additional

logic blocks needed to implement each circuit in comparison with hooPLA. On average, when the

circuits are mapped using the eliminate method, they require 21.5% more logic blocks than when

they are mapped with hooPLA. When circuits are mapped using DDMAP, they require 93.8%

more blocks on average than when hooPLA is used.

Notice that hooPLA performs poorly for the benchmark 'ex5p'. For this circuit, DDMAP

produces a solution with nearly 80% fewer blocks than hooPLA. ex5p is a purely combinational

circuit possessing 8 primary inputs, and 63 primary outputs. Since the number of inputs to the

circuit is less than the number of inputs to the logic blocks in the (10, 12,4) architecture, Level-

Map produces a mapping containing 63 nodes: one node for each primary output. Level-Map

produces such a mapping because it is able to deal effectively with reconvergent paths within

circuits. Furthermore, for this circuit, most of the nodes in the Level-Map solution happen to be

feasible nodes. Many of the nodes have common inputs, allowing several nodes to be packed into

each 4-output logic block. To verify that the exploitation of reconvergent paths was the reason for

the superior mapping, the circuit was mapped with the LUT-based technology mapper, Chortle-crf

[Fran92], which deals with reconvergence in only a limited way. Chortle-crf produced a mapping

containing 363 nodes which is significantly greater than the 63 nodes in the Level-Map solution.

Since hooPLA breaks up a circuit into fanout-free trees and finds a covering for each tree, it is not

able to exploit reconvergent paths effectively, and hence, produces an inferior solution for this

benchmark.

To verify that the results favouring hooPLA were not a side-effect of the (10, 12, 4)

architecture used for comparison, the circuits were also mapped into a (16, 8, 4) architecture and

compared with the eliminate method. In this case, the eliminate method produced solutions with

an average of 37.1 % more blocks than hooPLA.

Table 3.4: Number of Logic Blocks Needed to Implement Each Benchmark Circuit Using hooPLA, the eliminate Method, and DDMAP (the percentages in brackets give the additional blocks required relative to hooPLA)

3.3.4 PLA Folding

So far, hooPLA has been described without any mention of PLA folding; that is, it has been described as a technology mapper for normal unfoldable PLA-style logic blocks. This section reviews previous work on PLA folding and describes the synthesis techniques used to perform simple bipartite PLA folding. This is followed by a discussion of how PLA folding is integrated into the hooPLA algorithm.

3.3.4.1 Previous Work

Egan and Liu showed that the problem of finding an optimal bipartite folding is NP-

complete [Egan84]. This implies that an algorithm with exponential time complexity must be

used to find a folding of optimal (maximum) size for a given PLA. The method of branch and

bound was applied in [Egan84] to find optimal bipartite foldings. Several other heuristic methods

have been proposed.

Simulated annealing is a general algorithmic approach that was applied to the folding

problem by Wong, Leong, and Liu [Wong87]. The authors developed a cost function, an

annealing schedule, and a simple method of moving from one solution to the next which consists

of permuting two rows of the PLA personality matrix. In this, and similar work [Sanc95], the

authors show how their algorithms can be adjusted to deal with specific constraints on the folding

problem such as bounded product term positions or ordered connection line assignment in which

a partial ordering is imposed on the PLA's input signals.

Another method undertaken in several studies is to translate the folding problem into a

graph partitioning problem and then apply heuristic min-cut partitioning [Lakh90][Liu94]. This is

the approach used in this thesis, and it is discussed in the next section.

Other approaches include mapping the folding problem into the problem of maximal

clique identification1 in a graph [Leck89]; a simple greedy algorithm can be used to locate cliques. Kuo, Chen, and Hu reformulate the folding problem as an integer programming problem [Kuo85][Pres95].

1. Maximal clique identification is the problem of finding the largest fully connected subgraph in a graph.

Hsu, Lin, Hsieh, and Chao consider the problem of combining logic minimization and folding for PLAs [Hsu91]. The basic premise of the work is to examine how decisions made

during logic synthesis affect the amount of folding that can be achieved. The authors propose a

type of folding-directed logic synthesis that leads to increased folding sizes for some PLAs.

3.3.4.2 Approach Used to Perform Bipartite Folding

In this thesis, bipartite PLA folding is performed using an algorithm similar to the one

developed by Liu and Wei [Liu94]. Specifically, the algorithm in [Liu94] has been adapted to be able to perform combined folding. The approach used involves mapping a PLA description into a bipartite graph and then applying min-cut graph partitioning to the resulting graph. A bipartite graph is an undirected graph, G = (V, E), in which V can be divided into two sets, V1 and V2, such that each edge, (u, v) ∈ E, indicates that u ∈ V1 and v ∈ V2 [Corm94]. Thus, all of the edges in a bipartite graph connect vertices in the different vertex sets, V1 and V2.

A PLA can be transformed into a bipartite graph by letting each vertex in the first vertex set, V1, correspond to a single product term of the PLA and each vertex in the second vertex set, V2, correspond to one of the PLA inputs or outputs. An edge exists between a node u ∈ V1 and a second node v ∈ V2 if one of the following two conditions is true:

1. v is an input, and it is used in the product term represented by u.
2. v is an output, and the product term represented by u is a term in the sum-of-products function that v implements.

The transformation of a PLA into a bipartite graph is shown in Figure 3.11.

Figure 3.11: Mapping a PLA into a Bipartite Graph
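The transformation is simple to express in code. The sketch below (Python, illustrative only) builds the bipartite graph from a personality matrix given as rows of input and output characters, following the two edge conditions listed above; the data layout matches the toy personality matrix shown earlier in this chapter.

def pla_to_bipartite(rows, inputs, outputs):
    """Vertices in V1 are product terms p0, p1, ...; vertices in V2 are the
    PLA inputs and outputs. An input is connected to every product term
    that uses it (entry '1' or '0'); an output is connected to every
    product term that appears in its sum-of-products form (entry '1')."""
    V1 = [f"p{i}" for i in range(len(rows))]
    V2 = list(inputs) + list(outputs)
    edges = set()
    n_in = len(inputs)
    for i, row in enumerate(rows):
        for j, name in enumerate(inputs):
            if row[j] != '-':
                edges.add((V1[i], name))
        for k, name in enumerate(outputs):
            if row[n_in + k] == '1':
                edges.add((V1[i], name))
    return V1, V2, edges

# Example: a 3-term, 4-input, 2-output personality matrix.
rows = [('1', '0', '-', '-', '1', '-'),
        ('-', '1', '1', '-', '1', '1'),
        ('0', '-', '-', '1', '-', '1')]
V1, V2, E = pla_to_bipartite(rows, ['a', 'b', 'c', 'd'], ['y', 'z'])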

Following the transformation of the PLA into a bipartite graph, the newly created graph,

G, is partitioned into two subgraphs, G1 and G2. A min-cut algorithm, similar to that developed in [Liu94], is applied so that few edges cross between nodes in the two different subgraphs. The following parameters can be determined for any partition, P, of G:

X1 = { x | x ∈ V2, x adjacent to vertices in G1 only, x represents an input }    (3.4)
X2 = { x | x ∈ V2, x adjacent to vertices in G2 only, x represents an input }    (3.5)
X3 = { x | x ∈ V1, x adjacent to vertices in G1 only }    (3.6)
X4 = { x | x ∈ V1, x adjacent to vertices in G2 only }    (3.7)

Once these vertex sets have been identified, the size of the column folding corresponding to P is given by:

C = min(|X1|, |X2|)    (3.8)

Similarly, the size of the row folding corresponding to P is given by:

R = min(|X3|, |X4|)    (3.9)

Only inputs are included in the sets X1 and X2, since outputs are not allowed to be folded1. A

graphical interpretation of the concepts above and a partitioned version of the bipartite graph of

Figure 3.11 is shown in Figure 3.12. In the figure, a single edge crosses between the subgraphs

G1 and G2. The values of C and R are 2 and 1, respectively. The folded PLAs corresponding to the partitioning are displayed beneath the partitioned graph.

1. Outputs are not allowed to be folded because of the register and drive circuitry associated with each output.

Figure 3.12: Partitioned Bipartite Graph with Foldings
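Evaluating a candidate partition is equally direct. The sketch below (Python, illustrative only) computes the sets X1 through X4 of equations (3.4) to (3.7) and the folding sizes C and R of equations (3.8) and (3.9); the partition dictionary, which maps every vertex to subgraph 1 or 2, is an assumption of the example, and the graph can be produced by a helper such as the pla_to_bipartite() sketch above.

def folding_sizes(V1, V2, edges, inputs, part):
    """part maps every vertex to 1 or 2 (its subgraph, G1 or G2).
    Returns (C, R), the column and row folding sizes for this partition."""
    def adjacent_sides(x):
        # Partition sides of all vertices adjacent to x.
        return {part[u] if v == x else part[v]
                for (u, v) in edges if x in (u, v)}

    X1 = [x for x in V2 if x in inputs and adjacent_sides(x) == {1}]
    X2 = [x for x in V2 if x in inputs and adjacent_sides(x) == {2}]
    X3 = [p for p in V1 if adjacent_sides(p) == {1}]
    X4 = [p for p in V1 if adjacent_sides(p) == {2}]
    C = min(len(X1), len(X2))    # equation (3.8)
    R = min(len(X3), len(X4))    # equation (3.9)
    return C, R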

The partitioning algorithm works as follows: First, the vertices of the bipartite graph are randomly partitioned into two subgraphs, G1 and G2, and all vertices are tagged as 'free', meaning they are free to move between subgraphs. Next, the free vertex with the highest fitness (defined shortly) is selected and moved to the opposite subgraph. The selected vertex is then tagged as 'locked' meaning that it may no longer move between subgraphs. After each move, equations (3.8) or (3.9) above are used to determine if the current partitioning is the best found so far; (3.8) is used for column folding, (3.9) for row folding. The best partitioning is saved. This process of selecting, moving, and locking continues until there are no more free vertices, indicating that the 'pass' is complete.

A second pass is initiated by freeing all vertices and setting the initial partition equal to the best partition found during the previous pass. These partitioning passes continue until there is a pass in which no improvement over the best saved partition is found, at which point the algorithm is complete.

When column folding is being performed, the fitness of moving a vertex, v, to the opposite subgraph is computed using:

Fitness(v) = C_{after v is moved} - C_{before v is moved}    (3.10)

When row folding is being performed, the fitness of moving v is computed using:

Fitness(v) = R_{after v is moved} - R_{before v is moved}    (3.11)

Pseudo-code for the folding algorithm is given in Figure 3.13.

Fold(G) {   /* G is the bipartite graph representation of a PLA */
    BestP ← random partition of the vertices in G
    modified ← true
    while (modified) {   /* begin a pass */
        free all vertices in G
        set the initial partition equal to BestP
        modified ← false
        while there are free vertices {
            select the free vertex with the highest fitness
            move the selected vertex to the opposite subgraph and lock it
            if the size of the folding corresponding to the current partition >
                    the size of the folding corresponding to BestP {
                BestP ← the current partition
                modified ← true
            }
        }
    }
}

Figure 3.13: Pseudo-Code for Folding Algorithm

One significant feature of this folding algorithm is that either row or column folding can be performed using the same bipartite graph and partitioning algorithm. Combined folding is achieved by first performing either row or column folding. This primary folding is tantamount to

dividing the original PLA into two smaller PLAs. Subsequent folding can then be applied to these

smaller PLAs by transforming them into bipartite graphs and applying the same folding

algorithm. The division of a folded PLA into two smaller PLAs is shown in Figure 3.14.

Figure 3.14: Division of a Folded PLA into Two Smaller PLAs for Subsequent Folding

One problem that arises in the subsequent folding of the two smaller PLAs is related to the fact that there may be inputs (or product terms) that are present in both of the smaller PLAs. This is true for the case of the small PLAs in Figure 3.14 which share the product terms P4 and P5.

This sharing leads to a situation in which folding one of the smaller PLAs may introduce

constraints on the folding of the second smaller PLA. Column folding can be thought of as a

partial ordering of a PLA's product terms - some product terms above the breaks, and some

product terms below the breaks. Similarly, row folding can be thought of as a partial ordering of a

PLA's inputs and outputs - some of the inputs and outputs to the left of the breaks, some of the

inputs and outputs to the right of the breaks. To understand how the constraints are created,

consider the following example. A large PLA with two outputs is row folded and thus divided into

two smaller PLAs. Assume that the two smaller PLAs have two product terms in common and

that subsequent column folding on the first small PLA results in a partial product term ordering in

which both of the common product terms are located above the breaks. This introduces a

constraint on the column folding of the second small PLA: both of the common product terms are

constrained to be above the breaks in the folded PLA'. The reason for the constraint is that

1 . Equivalently, both of the product tenns may be constrained to be below the breaks in the folded PLA. The constraint is simply that the nodes representing the two shared product terms be in the same partition after the bipartite graph partitioning step.

the two smaller PLAs must eventually be re-assembled into a single combined folded PLA. These additional constraints can be realized within the context of the folding algorithm described above by allowing vertices to be pre-allocated to one of the subgraphs, G1 or G2, and by allowing a permanent lock to be placed on some vertices of the bipartite graph. Permanently locked vertices are never allowed to move

between subgraphs.

3.3.4.3 Integrating PLA Folding into hooPLA

In essence, the goal is to use folding to pack additional logic into each logic block. The

folding algorithm described in the previous section was applied in several different ways. The first

attempt involved trying to maximize the sum of the sizes of the nodes that are packed into a single

foldable PLA-style logic block. That is, attempting to maximize BlockSize in the following

equation:

BlockSize = Σ_{u ∈ B} size(u)

where B is the set of nodes that are packed into a multi-output PLA-style logic block. Folding

was integrated into phase III of hooPLA and nodes were selected to be packed into foldable PLA-

style blocks on the basis of their size. The largest node that could fit into the PLA-style block

being considered was selected and packed, even if folding was necessary to make the node fit into

the multi-output block.

The second method investigated was simpler and gave slightly better results. This method

attempts to maximally utilize all of the outputs on logic blocks. In this approach, phase III of hooPLA is performed as it would be for an architecture with unfoldable blocks (as described in Section 3.3.3.3). This inevitably leads to PLA-style blocks with unused outputs. Folding is then applied to identify additional nodes that can be packed into the multi-output blocks until all

outputs are utilized. Given a situation where several nodes are identified as candidates to pack into

a logic block, the node with the most inputs in common with nodes already in the logic block is

selected and packed.
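As a small illustration of this selection rule (an assumed helper, not code from the thesis), the candidate sharing the most inputs with the partially filled block can be chosen as follows, where candidates maps each candidate node to its set of input signals:

def pick_candidate(block_inputs, candidates):
    # Prefer the node that shares the most inputs with the logic already
    # packed into the block; candidates is a dict: node -> set of inputs.
    return max(candidates, key=lambda n: len(candidates[n] & block_inputs))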

Phase II of hooPLA was designed for unfoldable logic blocks and it attempts to eliminate

nodes from a circuit by collapsing them into their successors. As described previously, when

mapping into an architecture with the parameters (I, P, O), a node is not collapsed into its successors if any node resulting from the collapse would require more than I inputs or more than P product terms. Phase II was modified and incorporated with column folding to allow more

nodes to be collapsed into their fanouts and eliminated from the network. In the modified version,

a node can be collapsed into its successors as long as the nodes that result from collapsing have less than or equal to P product terms, and less than or equal to F inputs. F may be larger than I as long as a column folding can be found such that F minus the size of the folding is less than or equal to I. Clearly, this modification of hooPLA is only appropriate when the target architecture

has either column or combined foldable logic blocks. It gives good results for some circuits, and

therefore, the folding results presented in the empirical study in Chapter 5 reflect the best folding

results achieved with and without using this modification.
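A minimal sketch of the feasibility test implied by this modified collapsing step is shown below; I and P are the logic block parameters, and column_fold_size stands in for the size of the best column folding found for the collapsed node (assumed names, for illustration only):

def collapse_is_feasible(result_pterms, result_inputs, I, P, column_fold_size):
    # A collapsed node is acceptable if it needs at most P product terms and
    # its input count F either fits directly (F <= I) or can be reduced to I
    # physical columns by column folding (F - size of folding <= I).
    if result_pterms > P:
        return False
    return result_inputs <= I or result_inputs - column_fold_size <= I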

One folding method that was not attempted is to incorporate folding into Phase I of hooPLA.

Currently, hooPLA covers fanout-free trees with feasible nodes where each feasible node can fit

into a normal unfolded PLA-style block. Phase I of hooPLA could be modified to cover fanout-

free trees with nodes that possess more inputs than that which could fit into a normal unfolded

PLA-style block, but that could be folded to fit into a column or combined foldable PLA-style

logic block. This would increase the size of the search space in the optimal tree mapping and it

would make hooPLA significantly more complex, but it may give superior results.

3.4 Summary

In this chapter the foldable PLA-style logic block architecture was introduced.

Implementing foldable PLA-style logic blocks is feasible in LPGA technology since it is possible

to cut metal lines. The proposed logic block architecture represents an entirely new application for

PLA folding which has previously only been used in custom VLSI. A brief review of PLA folding

was given along with the rationale for choosing to use simple bipartite folding.

A CAD flow to map circuits into the proposed architecture was presented. The CAD flow includes a new tool, called hooPLA, that was designed and implemented to perform technology

mapping for architectures with foldable PLA-style blocks. The hooPLA algorithm operates in

three phases. Phase I uses dynamic programming to map each fanout-free tree in a circuit's DAG

into a new tree possessing the minimum number of feasible nodes. Phase II attempts to eliminate

nodes by collapsing them into their successors. Phase III is a bin packing step in which the nodes

in a circuit are packed into multi-output logic blocks. Folding was achieved using a min-cut graph partitioning algorithm. The algorithmic flow of hooPLA is summarized in Figure 3.15.

Break circuit's DAG into a forest of fanout-free trees.

Phase I: Map each tree into a new tree possessing the minimum number of feasible nodes.

Phase II: Re-assemble circuit from trees and collapse nodes across tree boundaries. Optionally, perform folding when the target is column or combined foldable logic blocks.

Phase III: Pack nodes into PLA-style logic blocks. Use folding to pack as much logic as possible into each logic block.

Circuit mapped into foldable PLA-style logic blocks.

Figure 3.15: Algorithmic Flow of hooPLA

4.1 Introduction

This chapter introduces the foldable look-up-table logic block architecture. Section 4.2

shows how the proposed architecture is related to LUT-based FPGAs and outlines its relevant

architectural parameters. Synthesis techniques for mapping circuits into foldable LUTs are

presented in Section 4.3. A custom CAD tool has been developed, and it is used in conjunction

with an existing FPGA CAD tool to realize a CAD flow for targeting foldable LUT-based

architectures.

4.2 Foldable Look-Up-Table-Based Logic Block Architecture

Chapter 2 introduced LUT-based logic blocks and reviewed several important research

results and synthesis techniques. A LUT is a multiplexer tree and a set of storage elements1. LUTs

have an area-efficient implementation in LPGA technology. Instead of using SRAM cells for the

LUT's storage elements, each storage element is implemented as a programmable connection to

either logic '0' or '1', as shown in Figure 4.1. The LUT's "storage elements" require no

transistors. During laser programming, either the connection to logic '1', or the connection to

logic '0' is cut away according to the truth table of the logic function being implemented in the

LUT. This causes the programmed LUT to resemble a small ROM. A similar LUT

implementation can be found in the Xilinx XC3300 mask-programmed gate array2. The XC3300

has the same architecture as the SRAM-based XC3000 FPGA, but, in the XC3300, SRAM cells

are replaced by 'programmable vias', resulting in a die size 50% smaller than an equivalent

SRAM-based part [Frak92].

1. An example of a 4-LUT is depicted in Figure 2.7. 2. The XC3300 is an MPGA with two fully-customizable metal layers.


Figure 4.1: LUT Programming in LPGA Technology
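To make the 'small ROM' behaviour concrete, the following is a behavioural Python sketch (an illustration only, not the hardware or any tool from the thesis): once programmed, a K-LUT is simply a 2^K-entry truth table indexed by its select lines.

def program_lut(truth_table):
    # truth_table holds the 2^K stored bits; the LUT inputs act as an address.
    def lut(*inputs):
        address = sum(bit << i for i, bit in enumerate(inputs))
        return truth_table[address]
    return lut

and2 = program_lut([0, 0, 0, 1])   # a 2-LUT programmed as a 2-input AND
print(and2(1, 1))                  # prints 1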

There are 2^K storage elements in a K-LUT; the number of 2-to-1 multiplexers in a K-LUT is 2^K − 1. These structures dominate the silicon area consumed by a LUT [Rose90], making the area of a LUT exponentially related to the parameter K. Therefore, to achieve good area-

efficiency, it is critical to utilize LUT inputs effectively. When a 4-LUT is used to implement a

function of only 3 inputs, half of its logic capacity is wasted. Current technology mappers for

LUTs do a reasonable job of utilizing LUT inputs when K is small; however, as K is increased, a

significant number of inputs are left unused. This effect can be seen in Figure 4.2 which shows the

average number of LUT inputs that were left unused when the Level-Map technology mapper

[Farr94] was applied to the 30 benchmark circuits listed in Appendix A. Although increasing the

number of inputs to LUTs in the target architecture reduces the number of LUTs needed to

implement circuits, the reduction must be traded off against the increase in logic block and routing area.

Figure 4.2: Utilization of LUT Inputs

Foldable LUTs exploit the LPGA's ability to cut metal lines. A 4-input LUT contains two 3-input LUTs within it. Through the

addition of cut points, extra inputs, and outputs, a LUT could optionally be divided in half. Figure

4.3 shows a 4-input LUT with added cut points so that it may be divided into two 3-LUTs. The

multiplexers that implement each of the 3-LUTs are shown with different shading. The term

folding refers to the division of a LUT into smaller LUTs. LUT folding is tantamount to varying

the granularity of the logic blocks in the target architecture. The capability to divide LUTs in this

way increases the amount of logic that may be packed into a single LUT. For example, consider

the case of mapping a circuit into an architecture with 4-LUTs. If the logic blocks are not

foldable, a logic block is needed to implement each 3-input node in the circuit. However, if

folding is permitted, two 3-input nodes can be paired together and implemented in a single logic

block. The notion of foldable LUTs is similar to the notion of 'decomposable look-up-tables' in


Figure 4.3: Foldable 4-LUT

This ability to divide LUTs can be extended. For instance, a 4-LUT can implement one 4-

LUT, two 3-LUTs, five 2-LUTs, and some combinations of 3-LUTs and 2-LUTs. The laser cut-

points necessary to achieve this flexibility are shown in Figure 4.4. The multiplexers that

implement each of the five different 2-LUTs in a 4-LUT are shown with different shading. As

illustrated in the figures, there is some overhead involved in being able to fold LUTs.

Implementing two 3-LUTs in a single 4-LUT means that the logic block must have two outputs.

The multiplexer outputs in Figure 4.4 are labelled "potential" outputs because although 7 outputs

are shown, a maximum of 5 different logic functions can be implemented in the logic block;

therefore, five output drivers would be needed. Furthermore, folding increases the number of

inputs to each logic block. Each input must be present in its true and complemented form for the

multiplexer select lines. Thus, inverters are needed for each of the additional logic block inputs.

Lastly, folding may increase the number of "storage elements" in a LUT as shown in Figure 4.4,

in which four new storage elements were introduced.

An architecture with foldable LUTs can be characterized by two parameters, K and L.

The parameter, K, is equal to the number of inputs to the LUT in its unfolded form. The parameter, L, is referred to as the folding flexibility, and it is equal to the number of inputs to the smallest LUT into which the original LUT may be divided. For example, the foldable LUT shown

in Figure 4.4 has the parameters K = 4 and L = 2, since it can be divided into 2-LUTs.

Normally, in LUT-based FPGAs, each logic block contains a register [Alte96][Xili94].

Clearly, it is not feasible to have a register associated with each output of a foldable LUT, because

it would greatly increase logic block area. In this study, it is assumed that each logic block has a

single register that can optionally be bypassed to implement combinational logic. The register

bypass can be implemented in LPGA technology in a way that requires no multiplexers. It is

further assumed that any of the potential outputs of the combinational portion of a logic block

may connect to the register input. The output circuitry for a foldable LUT-based logic block with

K = 4 and L = 3 is shown in Figure 4.5.


Figure 4.4: Foldable 4-LUT with Additional Flexibility

Figure 4.5: Output Circuitry for Foldable LUT-Based Logic Block with Parameters K = 4 and L = 3

As folding flexibility is increased, the number of logic blocks needed to implement

circuits should decrease because more logic can be packed into each LUT. This decrease must be

traded-off with the area penalties connected with the added flexibility. The empirical study in

Chapter 5 is concerned with whether or not there is an optimal amount of folding flexibility for

LUT-based logic blocks.

4.3 Synthesis

This section discusses a CAD flow for foldable LUT-based architectures. A high-level

overview of the flow is given in Section 4.3.1. Following this, Section 4.3.2 introduces a new tool

that has been developed to perform technology mapping for foldable LUT-based logic blocks.

4.3.1 Overview of CAD Flow

The CAD flow for foldable look-up-table-based logic blocks is shown in Figure 4.6. The

front end of the flow is identical to the front end of the CAD flow for foldable PLA-style logic

blocks discussed in Chapter 3. The issues related to technology independent synthesis that were

presented in Chapter 3 apply equally well to synthesis for foldable LUTs, and they will not be

repeated here. Circuits are mapped into the gates of a 4-bounded intermediate target library using

Synopsys tools [Syn96]. The library consists of elements from Altera's FLEX 8000 FPGA library

[Alte95].

[Flow: behavioural/RTL HDL circuit, state machine, or MCNC circuit (EDIF netlist) → Synopsys Design Compiler with a library of 4-bounded gates → Verilog netlist → transferred to BLIF using ver2blif → Level-Map → netlist of unfolded look-up-tables → netlist of folded look-up-tables.]

Figure 4.6: CAD Flow for Mapping Circuits into Foldable LUT-Based Logic Blocks

Synthesis proceeds in a manner typical for LUT-based FPGAs. The Level-Map [Farr94]

technology mapper is used to map circuits into a network of normal unfolded LUTs. Level-Map

was discussed in Chapter 2.

After technology mapping with Level-Map, some circuit nodes may not use all of their K inputs. Multiple such nodes can be packed together into a single foldable LUT. A tool, called LUTPack, has been developed and integrated into SIS [Sent92] to perform this packing.

4.3.2 LUTPack: Technology Mapping for Foldable Look-Up-Table-Based Logic Blocks

A LUT contains a binary tree of multiplexers. Large LUTs can be decomposed into

smaller LUTs by cutting the multiplexer tree into smaller trees. Circuit nodes with less than K

inputs are referred to as small nodes. LUTPack uses a first-fit-decreasing (FFD) approach to pack

multiple small nodes into a single foldable LUT-based logic block. In essence, the algorithm must

'cover' the multiplexer trees in the logic blocks with small nodes that exist after technology

mapping. LUTPack attempts to minimize the total number of logic blocks needed to implement a

circuit.

First-fit-decreasing bin packing algorithms are commonly employed for problems in which a number of elements must be 'packed' into bins that have a fixed capacity. The reason this

type of algorithm cannot be applied directly to the problem of packing small nodes into LUTs is

that FFD algorithms consider only the size of the elements and the bin capacity. To perform

technology mapping for foldable look-up-tables, it is also necessary to consider the location of

the elements within a bin. That is, it is necessary to consider how the multiplexer trees in the logic blocks are covered with small nodes.

To illustrate the algorithm, it is important to consider what is the maximum number of small nodes with L inputs that can be packed into a LUT with K inputs. This number can be found using equation (4.1):

Number of L-LUTs = Σ_{i=1}^{⌊K/L⌋} 2^{K − iL},   where L ≤ K    (4.1)
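Equation (4.1) is easy to evaluate; the short Python helper below (a hypothetical illustration, not part of the CAD flow) reproduces the 4-LUT examples given earlier.

def max_l_luts(K, L):
    # Maximum number of L-input LUTs that fit in a K-LUT when the multiplexer
    # tree is covered bottom-up: sum of 2^(K - i*L) for i = 1 .. floor(K/L).
    assert L <= K
    return sum(2 ** (K - i * L) for i in range(1, K // L + 1))

print(max_l_luts(4, 3))   # 2 (two 3-LUTs in a 4-LUT)
print(max_l_luts(4, 2))   # 5 (five 2-LUTs in a 4-LUT)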

The maximum described in (4.1) can be achieved only if the smaller nodes with L inputs

are packed into the K-LUT in a 'bottom-up' manner; that is, from the leaf multiplexers of the K-LUT's multiplexer tree towards the root multiplexer. For example, consider the problem of

packing two 3-input nodes into a 4-LUT. This is illustrated in Figure 4.7 where the multiplexers

are shown as nodes in a binary tree. To keep the figure simple, inputs to the LUT are not shown.

In part (a) of the figure, a single 3-input node covers the portion of the multiplexer tree closest to the root multiplexer (the covered portion

of the tree is shaded). This placement precludes the possibility of packing any additional 3-input

nodes into the 4-LUT. Part (b) of the figure shows how it is possible to pack two 3-input nodes

into the 4-LUT, if they are in locations closest to the leaf multiplexers. To achieve the best

utilization of the multiplexers in LUT-based logic blocks, it is best to cover the multiplexer trees

in a bottom-up fashion. Parameter L may be chosen such that there exist circuit nodes with fewer than L inputs. During packing (covering), these nodes will consume a portion of the tree equal to that consumed by a node with exactly L inputs.

[(a) Poor covering. (b) Good covering. Root multiplexers at the top of the tree, leaf multiplexers at the bottom.]

Figure 4.7: Covering the Multiplexer Tree

One additional objective of the algorithm is to attempt to limit the number of distinct

inputs to a single logic block. The reason for this is that the number of connected input pins per

logic block is directly proportional to the average number of routing tracks required to route

circuits [ElGa81]. Consequently, this secondary objective may help improve routability, given that

the number of tracks available is fixed.

Some circuit nodes are registered, and, as stated earlier, it is assumed that there is a

maximum of one register per logic block. Two algorithms were investigated to deal with this. The

first algorithm did not confer any special preference on registered nodes when choosing nodes to

pack into a block. In this algorithm, nodes were selected on the basis of their size1 and the number

of inputs shared with nodes already packed into a block. The second algorithm attached special

1. In this case, size is equal to the number of inputs to a node.

preference to registered nodes during packing. In the case of the second algorithm, node size and minimizing the number of distinct

inputs to a logic block were secondary selection criteria. These two algorithms were compared in

a study in which benchmark circuits1 were mapped into foldable LUTs with the parameters K = 6

and L = 4. The number of foldable blocks needed to implement each circuit was determined and

compared with the number of logic blocks needed when the circuit was mapped into an

unfoldable architecture with K = 6. A percentage reduction in number of logic blocks was

computed for each circuit, and these percentages were averaged over all circuits. The second

algorithm never performed worse than the first algorithm, and it produced better results for a few

circuits; hence, it was chosen as the packing technique. Figure 4.8 provides pseudo-code for the

algorithm used to cover the multiplexer trees.

LUTPack {
    lutSet ← set of all nodes in network
    while (lutSet is not empty) {
        donePacking ← false
        FoldableBlock ← empty block    /* allocate a new logic block */
        LUT ← largest registered node in lutSet; if there are no registered
                nodes in lutSet, select the largest unregistered node
        Remove LUT from lutSet
        Add LUT to FoldableBlock in a position as close as possible to the
                leaf multiplexers of FoldableBlock
        while (donePacking is equal to false and lutSet is not empty) {
            LUT ← largest unregistered node in lutSet that can fit into
                    FoldableBlock - use the number of shared inputs to break ties
            if LUT exists {
                Remove LUT from lutSet
                Add LUT to FoldableBlock in a position as close as possible
                        to the leaf multiplexers of FoldableBlock
            }
            else {
                donePacking ← true
            }
        }
    }
}

Figure 4.8: Pseudo-Code for First-Fit-Decreasing LUT Packing
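The Python sketch below mirrors the structure of Figure 4.8 under a simplified capacity model: the multiplexer budget of a K-LUT is 2^K − 1, and a node with s inputs consumes 2^max(s, L) − 1 multiplexers. The real LUTPack also tracks where each node sits in the multiplexer tree and uses shared inputs to break ties; the names and the capacity approximation here are illustrative assumptions, not the thesis implementation.

def mux_cost(num_inputs, L):
    # A node with fewer than L inputs still consumes as much of the tree
    # as an L-input node.
    return 2 ** max(num_inputs, L) - 1

def lutpack(nodes, K, L):
    # First-fit-decreasing packing; nodes maps name -> (num_inputs, is_registered).
    remaining = dict(nodes)
    blocks = []
    while remaining:
        budget = 2 ** K - 1
        # seed each block with the largest registered node, if one exists
        regs = [n for n, (_, r) in remaining.items() if r]
        pool = regs if regs else list(remaining)
        seed = max(pool, key=lambda n: remaining[n][0])
        block = [seed]
        budget -= mux_cost(remaining.pop(seed)[0], L)
        while True:
            # fill with the largest unregistered node that still fits
            fits = [n for n, (s, r) in remaining.items()
                    if not r and mux_cost(s, L) <= budget]
            if not fits:
                break
            nxt = max(fits, key=lambda n: remaining[n][0])
            block.append(nxt)
            budget -= mux_cost(remaining.pop(nxt)[0], L)
        blocks.append(block)
    return blocks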

1. The benchmark circuits used in the study are those listed in Appendix A.

4.4 Summary

This chapter introduced the foldable look-up-table logic block architecture. The proposed

logic block is characterized by the parameters K and L. K represents the number of inputs to the

logic block in its unfolded form. L is referred to as the folding flexibility, and it is equal to the

number of inputs to the smallest LUT into which the larger K-LUT may be divided.

A tool named LUTPack has been developed to cover the binary tree of multiplexers in a

K-LUT with small nodes having less than K inputs. LUTPack uses a first-fit-decreasing approach

and covers multiplexer trees in a bottom-up fashion.


5.1 Introduction and Architectural Questions

In this chapter, the synthesis techniques described in Chapters 3 and 4 are applied to

investigate the advantages of foldable PLA-style logic blocks and foldable look-up-table-based

logic blocks. Some of the architectural questions addressed are:

Can the number of logic blocks needed to implement circuits be reduced if logic blocks

are foldable?

What are the advantages of row folding, column folding, and combined folding in

PLA-style logic blocks? What are the effects of allowing folding in look-up-tables?

Assuming that folding can reduce the number of logic blocks needed to implement

circuits, can it actually reduce silicon area, when both routing and logic area

are taken into account?

Would an LPGA architecture based on the proposed coarse-grained foldable blocks

exhibit superior predictability compared with the finer-grained, state-of-the-art CX2001 LPGA?

5.2 Experimental Procedure

An empirical approach is used to study the foldable architectures. Experiments consist of

mapping a set of benchmark circuits into the experimental architectures. Architectural parameters

are varied to study the effect they have on the mapping solutions. For the foldable PLA-style

blocks, the number of input columns (I), product term rows (P), and outputs (O) are varied, and

the effects of row folding, column folding, and combined folding are investigated. For the

foldable look-up-table logic blocks, the number of inputs to the LUT in its unfolded form (K), and

the folding flexibility (L), are varied.

5.2.1 Benchmark Circuits

A set of 30 benchmark circuits from three sources are used in this study. The benchmarks,

their sources, and their sizes1 are listed in Appendix A. A total of 19 of the circuits are large

1. The size of each benchmark is given in terms of unfolded 4-LUTs and unfolded (10, 12,4) PLA-style logic blocks.

in Appendix D. The last circuit is a processor benchmark from the PREP synthesis suite

[PREP96].

5.2.2 Area Models

To determine the relative area-efficiencies of the foldable architectures, area models are

used. The models assume that silicon area is consumed by a combination of logic and routing,

with the possibility that some routing may be placed in metalization layers directly on top of

active logic. This is different from the area model that has traditionally been used in FPGA

architecture research [Rose90], where the area consumed by routing is assumed to be separate

from active logic area. Routing on top of active logic is feasible in LPGA technology because the

routing circuitry present in LPGAs is entirely metal, and it contains none of the SRAM bits, pass

transistors, or anti-fuses [Brow92] that are used to create programmable routing connections in

FPGAs.

Since the amount of routing resources that may be placed on top of active logic is limited

by the area of each logic block and the laser cut points needed to configure the logic circuitry, two

area models are used: one pessimistic, the other optimistic. These two models are depicted in

Figure 5.1. The pessimistic model assumes that only vertical routing tracks may be placed on top

of active logic. Depending on the number of vertical routing tracks, the logic blocks may either be

abutted, or some space may exist between adjacent blocks. The optimistic model assumes that it is

possible for both horizontal and vertical routing resources to be located on top of logic. These two

area models serve as upper and lower bounds for the area that will be needed for each

experimental architecture. The exact area needed can be determined only through the detailed

VLSI layout of logic blocks and routing resources.

A basic tile is defined to be the area of a single logic block and its adjacent routing

circuitry. The basic tile structure is shown in Figure 5.1. Using the pessimistic area model, the

area of a basic tile is:

TileArea_pess = height × width = (√LA + W·R_p) × max(√LA, W·R_p)    (6.1)

where W is the number of tracks in a routing channel, R_p is the routing pitch, and LA is the area of a logic block. For simplicity, logic blocks are assumed to be square. The max function reflects the fact that the tile width is set by whichever is wider: the logic block or the vertical routing channel placed on top of it. When the optimistic area model is applied, then the area of a basic tile is:

TileArea_opt = max(LA, (W·R_p)²)    (6.2)

The total area needed to implement a circuit is equal to the number of logic blocks needed

multiplied by the tile area. The models assume that there are equal amounts of horizontal and

vertical routing resources. An empirical study by Betz showed that this routing architecture leads

to the smallest possible routing resource area in FPGAs [Betz96].
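Under the assumption that equations (6.1) and (6.2) are as written above, the two tile-area models and the total-area computation can be sketched in a few lines of Python (illustrative only, with areas in λ²):

import math

def tile_area_pessimistic(LA, W, Rp):
    # Only vertical tracks overlap the logic block; the horizontal channel
    # adds to the tile height (equation 6.1).
    side = math.sqrt(LA)              # logic blocks are assumed square
    return (side + W * Rp) * max(side, W * Rp)

def tile_area_optimistic(LA, W, Rp):
    # Both horizontal and vertical tracks may sit on top of logic (equation 6.2).
    return max(LA, (W * Rp) ** 2)

def total_area(num_blocks, tile_area):
    # Total circuit area: number of logic blocks times the tile area.
    return num_blocks * tile_area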


Figure 5.1: Pessimistic and Optimistic Area Models

Both routing and logic area are measured in terms of the technology independent parameter λ [Mead80], which is equal to half the minimum feature size in a given technology. In a typical LPGA technology, accounting for the overhead needed for laser cut points, R_p is approximately equal to 11 λ.

5.2.3 Chip Area of Foldable PLA-Style Logic Blocks

A chip area model for foldable PLA-style logic blocks was developed using a layout automatically generated by the PLA layout generation program MPLA1 [Scot85]. The generated

layout is given in Appendix C; its floorplan is shown in Figure 5.2. The PLA layout is for a

1. MPLA was developed at the University of California at Berkeley.

models in this section were produced using the generated MPLA layout, and by estimating how

the layout would need to be modified so that it could be configured using the laser disconnect

methodology.


Figure 5.2: PLA Layout Floorplan

The area of an unfoldable PLA-style logic block is estimated as:

LA = (24·I + 19·O + 58) · (16·P + 44) + 1000·I + 13500·O + 1000    λ²    (6.3)

where I, P, and O represent the number of input columns, product term rows, and outputs of the logic block, respectively. The first term in (6.3), (24·I + 19·O + 58), is the combined width of the PLA's AND and OR-planes. The next term, (16·P + 44), is the height of the PLA's AND-plane. The 1000·I term accounts for the area consumed by the input buffers. The 13500·O term includes the area consumed by the latch1 (~3500 λ²), flip-flop (~8000 λ² [Vran97][Rose90]), and output driver (~2000 λ²) that are present for each logic block output. The final constant, 1000, represents the area needed to buffer and invert the signal used to clock the pull-up transistors in the AND and OR-planes. Row foldable PLAs have an OR-AND-OR2

structure and thus, require pull-up transistors on both sides of the AND-plane. The logic area of a

row foldable logic block is estimated as:

LA = (24·I + 19·O + 84) · (16·P + 44) + 1000·I + 13500·O + 1000    λ²    (6.4)
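As an illustration of equations (6.3) and (6.4), the logic-block area estimate can be written as a small Python function (a sketch of the unfoldable and row foldable cases only; the column foldable variant adds further input-buffer area):

def pla_block_area(I, P, O, row_foldable=False):
    # Area in lambda^2: plane width times AND-plane height, plus input buffers,
    # per-output circuitry (latch + flip-flop + driver ~ 13500), and clock buffering.
    width_const = 84 if row_foldable else 58   # extra pull-ups for the OR-AND-OR structure
    return ((24 * I + 19 * O + width_const) * (16 * P + 44)
            + 1000 * I + 13500 * O + 1000)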

Column folded PLAs have extra inputs, and therefore, require additional input buffers. The logic

area of a column foldable PLA-style logic block is estimated as:

1. The latch is needed for the PLA-style block to have zero stand-by power [Wong86]. 2. The OR-AND-OR structure of row foldable PLA-style logic blocks is shown in Figure 3.5.

Low power is an important consideration in many LPGA applications. Low-power PLAs

can be built using the circuit techniques described in [Wong86] and [Frak89]. These PLAs

achieve zero stand-by power through the use of input transition detection circuitry, CMOS

dynamic logic, and sense amplifiers1. The PLA that was used to generate the models above is very

similar to the zero stand-by power PLA described in [Wong86], with the main difference being

that it contains no input transition detection circuitry. However, this circuitry is relatively small in

comparison with the other circuit structures internal to the PLA. The PLA design in [Wong86] is

used in a commercial CPLD architecture.

5.2.4 Chip Area of Foldable Look-Up-Table-Based Logic Blocks

As mentioned in Chapter 4, when LUTs are implemented in LPGAs, they require none of

the SRAM cells that dominate the area of LUT implementations in FPGAs. The chip area of a

foldable look-up-table-based logic block is estimated as:

where I is the number of inputs to the logic block (inputs are assumed to be buffered), N_{4to1} is the number of 4-to-1 multiplexers contained within the LUT's multiplexer tree, O is the number

of output drivers needed for the logic block, and 8000 [Vran97][Rose90] is the area consumed by

the flip-flop present in each logic block. Note that foldable LUTs will have larger values of I and

O than unfoldable LUTs with the same K. The area consumed by an output driver is

approximately 2000 λ². The area consumed by a 4-to-1 multiplexer without input buffers is approximately 1000 λ². The number of 4-to-1 multiplexers in a K-LUT may be computed using equation (4.1) in Chapter 4. To estimate the area of LUTs that do not contain an integral number of 4-to-1 multiplexers, N_{4to1} is not required to be an integer2.

1. Input transition detection circuitry is used along with CMOS dynamic logic to ensure that power is only dissipated when input transitions occur. In addition to this, dynamic power dissipation is reduced and speed is increased by using sense amplifiers on product term lines to eliminate the necessity for wide voltage swings.

2. LUTs do not contain an integral number of 4-to-1 multiplexers when parameter K is odd.

In this study, several important theoretical results are used to predict routing resource area.

In [ElGa81], El Gamal showed that the average number of used tracks in any channel of a gate

array with equal amounts of horizontal and vertical routing resources is given by:

where λ_pins is the average number of connected input pins per logic block for circuits implemented in the gate array, and R̄ is the average Manhattan length of two-point routing connections (measured in the number of blocks)1. The parameter λ_pins in the above equation is

known after technology mapping is complete; however, placement and routing must be completed

if R̄ is to be known exactly. In this study, routing resource area is estimated using equation (6.8) above, and the R̄ value that results from performing placement and global routing for each circuit

in each architecture in its unfolded form. Placement and global routing is done using the CAD

system, VPR [BetzgBa]. During placement and routing, it is assumed that the clock, set, and reset

signals feeding the flip-flops in each logic block are routed on dedicated tracks, and that the pins

on logic blocks are distributed, with some pins being accessible to horizontal routing resources,

and some being accessible to vertical routing resources.

It is assumed that the maximum number of used tracks in any routing channel, W, is greater than the average number of used tracks, W_avg. When each circuit is placed and routed in an unfoldable architecture, the ratio of W to W_avg can be computed. This ratio is then used to compute the number of tracks, W, needed to route the circuit in a foldable architecture; the ratio is used by multiplying it by the value of W_avg that is computed using equation (6.8) after the circuit is mapped into the foldable architecture. Table 5.1 shows the average ratios of W to W_avg and average values of R̄ for several unfoldable architectures. The numbers in the table were computed by averaging across all 30 benchmark circuits.
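The track-count estimate just described amounts to a simple scaling step; the sketch below (assumed helper names, for illustration) takes the average density computed from equation (6.8) for the folded mapping and the maximum-to-average ratio observed for the unfolded mapping:

def estimate_tracks(w_avg_folded, w_max_unfolded, w_avg_unfolded):
    # Scale the average channel density of the foldable mapping by the
    # max-to-average ratio measured for the same circuit in its unfolded form.
    ratio = w_max_unfolded / w_avg_unfolded
    return ratio * w_avg_folded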

1 . In addition to (6.8), El Gamal determined that the number of tracks per channel follows a Poisson distribution. This assertion was verified by Brown in [Brow92a].

Table 5.1: Average Wire Length and Average Ratios of Maximum to Average Channel Density

[Table columns: Unfoldable Architecture | Average R̄ | Average W/W_avg]

The routing model used in this study errs on the side of pessimism since work by Donath and Feuer suggests that R̄ will likely decrease as blocks are folded [Dona79][Dona81][Feue82]. In this previous work, it was shown that the average connection length in a region of C nodes when placed on a square array is:

R̄ ∝ C^(p − 1/2)    (6.9)

where p is the so-called 'Rent exponent1', which, although it depends on the circuit being routed, is typically about 2/3 [Feue82]. Equation (6.9) suggests that R̄ will decrease as the number of blocks needed to implement circuits decreases (given that p does not increase). Circuits implemented in an architecture with large foldable blocks will need fewer blocks than when they are implemented in architectures with smaller, or unfoldable, blocks. This should lead to a decrease in R̄. Table 5.1 verifies that R̄ decreases as logic block granularity increases.
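As a rough worked example of equation (6.9) (an illustration, not a result from the thesis): with p = 2/3, halving the number of blocks scales R̄ by (1/2)^(2/3 − 1/2) ≈ 0.89, so wires become roughly 11% shorter.

p = 2.0 / 3.0                  # typical Rent exponent
scale = 0.5 ** (p - 0.5)       # equation (6.9): R-bar scales as C^(p - 1/2)
print(round(scale, 2))         # 0.89 -> about 11% shorter when block count halves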

5.2.6 Limitations of Area Model

To more precisely compute the chip area needed for routing resources in the foldable

architectures, a new routing CAD tool would be needed. This is because the number of physically

different pins on each logic block varies when folding is used. This notion does not exist for logic

blocks that are not foldable. For instance, consider a column foldable PLA-style logic block that

1. The Rent relationship [Dona79], I = A·C^p, relates the number of external terminals, I, from a group of gates to the number of gates in the group, C, the average number of terminals per gate, A, and the Rent exponent, p. The value of p was originally taken to be 2/3 [Feue82].

is used to implement some logic functions. Some of the input columns in the logic block may be folded, meaning that they contain a break. Two signals are fed to a folded

column, as the break in the column makes the signal at the top of the column physically different

than the signal at the bottom of the column. However, only a single signal is fed to an unfolded

column, and therefore, the signal may enter the logic block from either of two pins, making the

two pins that feed an unfolded column physically equivalent. To make efficient use of the blocks,

a router would need to understand how logic block pins can best be used. A new routing CAD tool

has not been developed for this study; however, it is believed that the area model being used is

sufficiently accurate, and the effort needed to develop a new routing tool is not merited.

5.3 Area-Efficiency Results for Foldable PLA-style Logic Blocks

In this section, the benefits of folding in PLA-style logic blocks are explored. The

advantages of row, column, and combined folding are studied for architectures containing PLA-

style blocks of various sizes, ranging from 8 input columns, 8 product term rows, and 3 outputs to 24 input columns, 24 product term rows, and 5 outputs. The results are presented in two ways: 1)

as a reduction in the number of blocks needed to implement circuits, and 2) an area is presented

for each experimental architecture; the area is computed using the area models described in the

previous section. Only the results obtained by applying the optimistic area model are included in

this chapter; the results obtained by applying the pessimistic model are included in Appendix B.

When numerical data results are quoted in the text of this chapter, the data result corresponding to

the pessimistic area model will be given in parentheses.

5.3.1 The Benefits of Folding

The left-hand column of plots in Figure 5.3 depicts the results for row folding. The top,

centre, and bottom plots in the column give the results for architectures with 3, 4, and 5 outputs,

respectively. The vertical axis indicates the average percentage reduction in the number of logic

blocks needed to implement a circuit when row folding is used. This was computed by

determining a percentage reduction for each of the benchmark circuits and then averaging these

percentages. Thus, each benchmark (whether small or large) was treated equally.

[Plots: average percentage reduction in the number of logic blocks versus number of input columns (I), shown separately for row folding, column folding, and combined folding, with panels for 3, 4, and 5 outputs and curves for P = 8, 16, and 24 product term rows.]

Figure 5.3: The Benefits of PLA Folding - Percentage Reduction in Number of Logic Blocks

The figure above indicates that row folding holds the most benefit for blocks that have

very few product term rows and large numbers of input columns and outputs. That is, row folding

is most beneficial for architectures in which product terms are scarce in comparison with the number of inputs and outputs. Recall that row folding allows two product terms to be placed onto

the same physical product term row. This sharing of physical product term rows is tantamount to

increasing the number of product terms that may be placed into a logic block, which helps

alleviate the effect of having relatively few product term rows. For architectures with 24 input

columns, 8 product term rows, and 5 outputs, row folding can reduce the number of blocks needed to implement circuits by 23.4% on average. There is very little benefit to row folding when the number of product term rows is large relative to the number of inputs and outputs.

The centre column of Figure 5.3 illustrates the benefits of column folding. The figure

shows that in comparison to row folding, column folding is superior at reducing the number of

blocks needed to implement circuits. Opposite to the results observed for row folding, column

folding performs best when logic blocks have relatively few input columns in comparison with

product term rows and outputs; that is, column folding is most useful when inputs are scarce. The

maximum reduction of 43.1 % occurs when column folding is permitted in architectures with 8

input columns, 24 product term rows, and 5 outputs. In this architecture, technology mapping

resulted in more than 80% of the logic blocks being folded for most circuits.

One reason why column folding provides a greater reduction in the number of blocks than

row folding is related to how choices are made with regard to which inputs are permitted to share

a single physical input column, and which product terms are permitted to share a single physical

product term row. In row folding, physical product term rows can only be shared by product terms belonging1 to different outputs. This is due to the OR-AND-OR structure of row foldable PLA-style logic blocks as shown in Figure 3.5. In fact, row folding requires that physical product term rows be shared by product terms belonging to outputs in different OR-planes. This severely restricts the number of pairs of product terms that may share a physical product term row. On the

other hand, in column folding, no restrictions are placed on which inputs may share a physical

column. An input may share a physical column with any other input; thus, it is even possible for

two inputs to the same function to share a physical input column. Clearly, there are more degrees

of freedom available in column folding than row folding, accounting for the more significant

gains of column folding.

The right-hand column of plots in Figure 5.3 depicts the gains of combined folding. The

architecture with the largest gain is the same as that for the case of column folding. The main

difference between the results for column and combined folding is that combined folding provides

more significant gains for architectures with 16 and 24 inputs. For example, column folding alone

provides only small benefits for architectures with 16 input columns, 8 product term rows, and 3

outputs; however, combined folding allows the number of blocks to be reduced by 16.6%.

1. A product term belongs to an output if the product term is in the sum-of-products boolean function corresponding to the output.

Figure 5.4 shows the normalized area results for architectures with unfoldable PLA-style

logic blocks. The three graphs in the figure show the results for architectures with 3, 4, and 5

outputs, respectively. Each benchmark circuit was treated equally in the area measurement. The area consumed by each circuit in each architecture was normalized to the area consumed by the same circuit in an architecture containing logic blocks with 8 input columns, 8 product term rows, and 3 outputs. These normalized area values for each circuit were then averaged and the results

are shown in the figure. The (8, 8, 3) architecture was determined to be the most area-efficient

unfoldable architecture. Other good architectures include the (16, 8, 3) architecture, the (16, 8,4)

architecture, and the (16, 16, 5) architecture. These architectures are reasonably similar to the

architecture with 10-12 inputs, 12-13 product terms, and 3-4 outputs that was identified as the

most area-efficient in [Koul93]. The results in Figure 5.4 also suggest that the appropriate number

of outputs for a logic block is related to the parameters I and P. For example, for small logic

blocks with I = 8 and P = 8, setting O = 3 gives the best area-efficiency. However, for large logic

blocks with I = 24 and P = 24, setting O = 5 is the best choice.

[Plots: normalized area versus number of input columns (I) for logic blocks with 3, 4, and 5 outputs; curves for P = 8, 16, and 24 product term rows.]

Figure 5.4: Area Results for Unfoldable PLA-Style Logic Block Architectures (Optimistic)

Figure 5.5 illustrates the area benefits of row folding. The vertical axis gives the ratio of

the row foldable area to the unfoldable area. Row folding reduces the silicon area needed to implement circuits when this ratio is less than one.

According to the results in the figure, row folding is most beneficial for those architectures that

have a large number of input columns and a small number of product term rows. The greatest area

reduction occurs for the architectures with 24 input columns, 8 product term rows, and 5 outputs.

In this case, an architecture based on row foldable logic blocks consumes 79% (or 82%, using the

pessimistic model) of the area of an architecture with unfoldable blocks with the same

parameters, (24, 8, 5). For most of the architectures with 8 input columns, the area overhead

associated with being able to fold the blocks outweighs any potential area reduction.

[Plots: ratio of row foldable to unfoldable area versus number of input columns (I) for logic blocks with 3, 4, and 5 outputs; curves for P = 8, 16, and 24 product term rows.]

Figure 5.5: Ratio of Row Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Optimistic)

Figure 5.6 depicts the area results for column folding. The vertical axis is again the ratio of

the folded to unfolded area. The figure shows that column folding can provide large area

reductions for many of the architectures considered, including the (8, 8, 3) architecture identified

as area-efficient in Figure 5.4. Both Figures 5.5 and 5.6 reveal that the benefits of folding increase

as the number of logic block outputs increases. The greatest benefits of column folding occur for architectures that have a small number of input columns and many product term rows. For example, the column foldable architecture with 8 input columns, 24 product term rows, and 5 outputs consumes about 59% (64%) of the area of the unfoldable architecture with the

same parameters. Notice that column folding provides either no benefit or very little benefit for

logic blocks with 24 input columns. The reason for this is that such blocks already have a large

number of inputs, and, as shown in Figure 5.3, column folding is most beneficial when inputs are

scarce.

[Plots: ratio of column foldable to unfoldable area versus number of input columns (I) for logic blocks with 3, 4, and 5 outputs; curves for P = 8, 16, and 24 product term rows.]

Figure 5.6: Ratio of Column Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Optimistic)

The area benefits of combined folding are illustrated in Figure 5.7. Results show that

combined folding can result in an area reduction for all of the architectures considered. Note that

the shapes of the curves in the figure appear to be similar to the shapes of the curves in Figure 5.6,

with the difference being that some curves have 'shifted' vertically. This effect is evident in the

curves for architectures with P = 8, which have shifted downward in the results for combined

folding. For some architectures, such as the (8, 8, 4) architecture, the area reductions due to

combined folding are larger than those achievable by either row or column folding alone.

However, for other architectures, such as the (24, 8, 4) architecture, the combined folded area is in between the column folded and row folded area. Results in Figure 5.7 suggest that combined

folding provides the benefits of both row folding and column folding.

[Plots: ratio of combined foldable to unfoldable area versus number of input columns (I) for logic blocks with 3, 4, and 5 outputs; curves for P = 8, 16, and 24 product term rows.]

Figure 5.7: Ratio of Combined Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Optimistic)

Table 5.2 combines the data from the previous figures and shows the best folded area

achievable for each of the architectures, the type of folding used to achieve that area, and the

unfolded area for each of the architectures. The area values in the table were normalized to the area consumed by a combined foldable architecture with the parameters (8, 8, 4), since it was determined that this architecture was the most area-efficient of all the unfoldable and foldable architectures considered. Each cell of the table gives the area that results from applying the optimistic area model, and the area that results from applying the pessimistic area model (shown

in parentheses). In the table cells describing folded area, RF is used to indicate row folding, CF is

used to indicate column folding, and BF is used to indicate combined folding. Other architectures

with good area-efficiencies include the combined foldable architecture with the parameters (8, 8,

3), and the combined foldable architecture with the parameters (8, 8, 5). It is interesting to note

that no single type of folding is best for all architectures; the best type of folding can be any of row, column, or combined folding depending on the parameters, (I, P, O). The best unfoldable logic block architecture, with 8 input columns, 8 product term rows, and 3 outputs consumes 27% (19%) more area than the (8, 8, 4) combined foldable architecture. Thus, the results show that foldable architectures can be more area-efficient than the most area-efficient unfoldable architectures.

Table 5.2: Normalized Area Results for PLA-Based Architectures

[Table: for each (I, P, O) architecture, the best folded area together with the folding type that achieves it (RF = row folding, CF = column folding, BF = combined folding) and the unfolded area, for O = 3, 4, and 5; all values normalized to the (8, 8, 4) combined foldable architecture, with pessimistic-model results in parentheses.]

5.4 Area-Efficiency Results for Foldable Look-Up-Table-Based Logic Blocks

In this section, the advantages of foldable look-up-tables are considered. First, the

effectiveness of being able to fold look-up-tables is examined from the point of view of reducing

the number of blocks needed to implement circuits. Following this, the area models discussed

previously are applied to determine if there are any area benefits associated with look-up-table

folding. Again, only the results obtained by applying the optimistic area model are given in this

chapter; the area results obtained by applying the pessimistic model are given in Appendix B.

Foldable LUTs with K ranging from 4 to 10 are considered with several values of the

folding flexibility, L.

5.4.1 The Benefits of Folding


Figure 5.8 shows how the number of LUT-based logic blocks needed to implement circuits is reduced when folding is permitted. Data points on the graph were computed by recording the number of logic blocks needed to implement each benchmark circuit in both foldable and unfoldable architectures, and determining a percentage reduction. The vertical axis shows the average percentage reduction in the number of logic blocks; the horizontal axis shows the parameter, K. The three curves shown on the graph represent the results for three values of L.

Figure 5.8 shows that there is a significant reduction in the number of blocks needed to implement circuits when L = K-1; that is, when it is possible to divide a LUT with K inputs into two LUTs, each with K-1 inputs. In addition, the graph shows that for large K, setting L = K-2 can

provide significant gains over L = K-1. Reducing L further from K-2 to K-3 results in smaller

additional gains.

Generally, the curves in Figure 5.8 increase with K. This is especially evident when K is

increased from 4 to 5 and for architectures with L = K-2 and L = K-3. As logic blocks get larger, a

greater proportion of them may be eliminated through folding. The reason for this is that larger

blocks are less-utilized than smaller blocks. The LUTPack algorithm described in Chapter 4

leverages this under-utilization. For example, when circuits are mapped into unfoldable logic

blocks with K = 4, an average of 3.4 inputs are used on each logic block; about six-tenths of an input is left unused on average. However, when circuits are mapped into logic blocks with K = 10, an

average of 7.73 inputs are used; about 2.27 inputs are left unused on average.


Figure 5.8: The Benefits of LUT Folding - Percentage Reduction in Number of Logic Blocks

5.4.2 Area Results

Figure 5.9 shows area results for foldable LUT-based logic blocks. The vertical axis is the

ratio of the area of a foldable architecture to the area of an unfoldable architecture with the same K; folding reduces area when this ratio is less than one. The results in the figure show that folding can reduce area for all

of the values of K that were considered. The results also show that if the folding flexibility, L, is

too large, then the area of a foldable architecture can be greater than the area of an unfoldable architecture with the same K. Furthermore, Figure 5.9 shows that the advantages of folding increase with K. When K = 4, there is only a small benefit to being able to fold LUTs. When K = 5, the foldable architecture with L = 4 consumes about 81% (86%) of the area of an unfoldable architecture with the same K. The area improvement due to folding jumps as K is increased, and the foldable architecture with K = 10, L = 8 consumes about 64% (66%) of the area of the unfoldable architecture with K = 10. Lastly, the results show that the amount of folding flexibility should be increased as K is increased. For small K, the best value of L is K-1; however, for larger

K, the best value of L shifts downward towards K-2 and K-3.


Figure 5.9: Ratio of Foldable to Unfoldable Area for LUT-Based Logic Block Architectures (Optimistic)

Table 5.3 gives normalized area results for all of the LUT-based architectures considered.

The areas in the table have been normalized to the area consumed by a foldable architecture with

the parameters, K = 5, L = 4, as this was the architecture determined to be the most area-efficient.

The column of results for unfoldable architectures indicates that the architecture with K = 4 is the

most area-efficient unfoldable architecture1; however, it consumes about 10% (8%) more area than the best foldable architecture. The unfoldable architecture with K = 5 requires about 24%

1. This result, which shows that the architecture with K = 4 is the most area-efficient unfoldable architecture, is in agreement with the results of Rose [Rose90] and Kouloheris [Koul93].

more area than the foldable architecture with K = 5, L = 4, even though circuits mapped into both architectures have the same critical path logic depth.

Table 5.3: Normalized Area for Foldable Look-Up-Table-Based Architectures

[Table columns: Inputs in Unfolded Form (K) | Unfolded | L = K-1 | L = K-2 | L = K-3]

One advantage of folding is that it reduces or eliminates the area penalties associated with

logic blocks with K greater than 4. For example, the data in Table 5.3 show that the best foldable

architecture with K = 6 consumes 24% (18%) more area than the architecture with K = 5, L = 4;

however, the unfoldable architecture with K = 6 consumes 56% (41%) more area. When circuits

are mapped into such higher-fanin blocks they have fewer logic levels on their critical paths. This

may be advantageous for predictability reasons as will be discussed in the next section. In

addition to this, studies have shown that FPGA architectures consisting of higher-fanin LUT-

based logic blocks (i.e. 7 or 8 inputs) exhibit superior speed in comparison with architectures with

low-fanin LUT-based logic blocks (Le. 4 inputs) [SingBI][Koul93]. Recall that the algorithm used

to map circuits into foldable LUT-based architectures (described in Chapter 4) does not affect

combinational depth.

Although the results above suggest that folding provides only modest area gains over the

best unfoldable architecture, it should be kept in mind that the routing area model for the foldable architectures is pessimistic since the number of routing tracks is computed using the R̄ values for the architectures in their unfolded forms. This is pessimistic because R̄ will decrease as blocks are folded, as fewer blocks are needed to implement circuits. This phenomenon can be observed in the data of Table 5.1 which shows that R̄ decreases significantly as LUT size increases.

Thus, being able to fold LUTs may provide additional gains that are not reflected in the data of

Table 5.3.

A final observation regarding the results for foldable LUT-based logic blocks is that although all of the blocks in each architecture are foldable and

possess the necessary extra inputs and output drivers, only a fraction of the logic blocks are

actually folded after technology mapping. For example, in the foldable architecture with the

parameters K = 5, L = 4, used as the basis for normalization in Table 5.3, an average of 48.8% of

blocks were folded in mapped circuits. This means that for about half of the blocks in the target

architecture, the extra area incurred by allowing blocks to be folded is not needed. Even fewer

logic blocks are folded when circuits are mapped into foldable architectures with K = 4. To

evaluate the potential gains of an architecture wherein only a fraction of the logic blocks are

foldable, consider a hypothetical situation in which the percentage of foldable blocks in the

architecture is exactly the percentage needed for a particular benchmark. This situation would

represent a loose upper bound on the area gains that could be achieved by building a

heterogeneous foldable architecture. The normalized area for such heterogeneous architectures is

shown in Table 5.4, and it was modelled by assuming that all blocks in the target architecture have the same height, with the unfoldable blocks being narrower than the foldable ones.

In the data of Table 5.4, an unfoldable architecture with K = 4 now consumes 20% (17%)

more area than the foldable architecture with K = 5 and L = 4. In general, the data trends in Table

5.4 are the same as in the data of Table 5.3; however, the benefits of folding are greater when only

a fraction of the blocks in the target architecture are assumed to be foldable.

Table 5.4: Normalized Area for Heterogeneous Foldable Look-Up-Table Architectures

Columns: Inputs in Unfolded Form (K); Unfolded; L = K-1; L = K-2; L = K-3.

One of the problems associated with a heterogeneous architecture is that it may lead to larger average wire lengths, R, since placement tools may not be able to exploit the locality inherent within circuits as effectively as possible, because certain blocks are forced into certain locations on the array.

5.5 Predictability Benefits of the Coarse-Grained Foldable Architectures

One problem encountered by ASIC designers who target designs to gate arrays composed

of small logic blocks is that interconnect delay is not known until after placement and routing are

complete. With technology improvements, the minimum feature size in modern gate arrays has been shrinking. This trend causes the component of delay associated with active logic to decrease

relative to interconnect delay. Since interconnect delay is becoming a greater proportion of total

delay, it is becoming a significant source of error in pre-layout timing estimates and timing-

directed synthesis.

The fine granularity of the blocks in typical gate arrays leads to circuit implementations

that have a large number of small logic elements in their critical paths. This has a compounding

negative effect on predictability because an unpredictable and highly variable interconnect delay

is incurred between each logic block. Pre-layout synthesis tools use wire load models to predict

the delay of these interconnections [Syn96]. An LPGA with coarse-grained logic blocks would

yield circuit implementations that have relatively few logic levels on each circuit's critical

path. This means that fewer interconnection delay predictions would need to be made by pre-

layout synthesis tools, increasing the accuracy of pre-layout timing estimates.

Table 5.5 shows the average number of logic blocks on the critical path of the benchmark

circuits (averaged over all 30 benchmark circuits) when they are implemented in several architectures: the CX2001¹ (discussed in Chapter 2), architectures with PLA-style blocks, and architectures with LUTs. Notice that when circuits are implemented using the CX2001, the number of levels on their critical path is much greater than when the circuits are implemented using the coarser-grained PLA-style or LUT blocks. The σ values provided in the table show that the variation in the number of logic levels on circuits' critical paths is significantly larger in the CX2001 than in the coarse-grained architectures.

1. Circuits were mapped into the CX2001 using the Synopsys tools and CX2001 cell library that was obtained from Chip Express [CEC96a].

Table 5.5: Average Number of Logic Levels on Circuits' Critical Paths for Several Architectures

Architecture        Average Levels    σ
PLA (8, 8, 4)            13.77       8.48
PLA (8, 24, 4)           13.77
PLA (16, 24, 4)          12.17       7.31
PLA (24, 8, 4)           12.20       8.03
PLA (24, 16, 4)          11.73       7.61
PLA (24, 24, 4)          11.60       7.45
LUT (4)                  16.07       9.09
LUT (5)                  14.20       7.59
LUT (6)                  13.23
LUT (8)                  10.60
LUT (10)                  9.00       6.31

Table 5.5 shows that there is an 11.6% drop in the average number of logic block levels on a circuit's critical path as K is increased from 4 to 5 in LUT-based architectures (σ decreased by 16.5%). This significant depth reduction also comes with an area reduction, as the folding results showed that an architecture based on foldable 5-LUTs consumes less area than an architecture

based on 4-LUTs. Table 5.5 shows that there are smaller variations in logic depth among the PLA-

based architectures. For example, circuits implemented in the (24,24,4) architecture need 1 5.8%

fewer levels on average than when implemented in the (8,8,4) architecture. This 15.8% decrease

in logic depth is small in comparison with the 44% decrease in depth that occurs when K is

increased from 4 to 10 in LUT-based architectures.

It should be pointed out that the data in Table 5.5 is for illustration only since it may be

possible to further reduce the number of logic levels for all architectures considered by using

depth-based synthesis methods.

In this chapter, an experimental approach to study the area-efficiency of the foldable logic

block architectures was presented. An empirical methodology was employed in which benchmark

circuits were mapped into the proposed architectures using the synthesis techniques of Chapters 3

and 4. Pessimistic and optimistic routing area models were introduced to determine area bounds

for realistic architectures. Actual layouts were used to estimate the silicon area necessary to

implement the foldable logic blocks. Some of the key experimental results are:

Folding for PLA-style blocks can significantly reduce the number of logic blocks

needed to implement circuits. Column folding works best for architectures with

fewer input columns than product term rows. Row folding works best for architectures

with fewer product term rows than input columns. Combined folding is able to reap the

benefits of both row and column folding.

A combined foldable architecture with the parameters (8, 8,4) was determined to

be the most area-efficient of all the unfoldable and foldable PLA-based architectures.

Results show that foldable PLA-style logic block architectures use significantly

less area than the most area-efficient unfoldable architectures.

The benefits of folding in LUT-based logic blocks increase with the parameter, K.

A foldable LUT-based architecture with the parameters K = 5, L = 4 was determined

to be the most area-efficient of all the LUT-based architectures considered.

This architecture uses slightly less area than the most area-efficient unfoldable LUT

architecture with K = 4; however, it requires significantly less area than the unfoldable

LUT architecture with K = 5.

Folding reduces or eliminates the area penalties associated with LUT-based logic blocks

with K greater than 4.

Architectures based on the proposed coarse-grained logic blocks exhibit better predictability than those based on fine-grained blocks, like the CX2001.

6.1 Thesis Summary

The objective of this thesis has been to study the advantages of implementing coarse-

grained logic blocks in LPGAs. In particular, two new logic block architectures were introduced:

foldable PLA-style logic blocks and foldable look-up-table-based logic blocks. The new logic

blocks are based on similar logic blocks found in commercially available FPDs with the main

difference being that additional logic may be packed into the proposed logic blocks by leveraging

the ability to cut metal lines in LPGA technology. Custom CAD tools have been developed to map

circuits into the new architectures. The tools were applied in an empirical study in which

benchmark circuits were mapped into experimental architectures. Many different experimental

architectures were considered and they were studied from two points of view: area-efficiency and

logic depth.

6.2 Thesis Contributions

Relevant architectural parameters were identified for the new logic blocks. Foldable PLA-

style logic blocks are characterized by the number of input columns (I), product term rows (P),

and outputs (O) they possess, as well as whether they are unfoldable, row foldable, column

foldable, or combined foldable. A constrained type of folding called simple bipartite folding was

considered in this study. The proposed foldable PLA-style logic blocks represent a new

application for PLA folding, which has previously only been used in custom VLSI.

Foldable look-up-table-based logic blocks are characterized by the parameters K and L. K

represents the number of inputs to the LUT in its unfolded form. L is called the folding flexibility

and is equal to the number of inputs to the smallest granularity LUT into which the larger K-LUT

may be divided.

Chapter 3 discussed a new technology mapping CAD tool for foldable PLA-style blocks

called hooPLA. The tool operates in three phases. Phase I breaks up a circuit's directed acyclic

graph into a forest of trees and then uses a dynamic programming approach to map each tree into a new tree possessing the minimum number of PLA-feasible nodes. Phase II is a collapsing step that collapses circuit nodes into their successors. Phase III is a packing step that packs circuit nodes into the multi-output logic blocks available in the target architecture. Folding was used to pack additional logic into each PLA-style logic block. PLA folding solutions were generated using a method similar to that developed by Liu and Wei [Liu94], which involved transforming the folding problem into an equivalent min-cut graph partitioning problem.

graph partitioning problem. Folding was integrated into phases II and III of hooPLA.
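To make the bipartite constraint concrete, the following Python sketch checks whether two input columns can share one physical column of a PLA once the product-term rows have been split into a top half and a bottom half. It is a simplified illustration of simple bipartite column folding only; it is not the hooPLA code and not the partitioning algorithm of [Liu94], and the function and variable names are hypothetical.

# Simplified model of simple bipartite column folding (illustrative only).
# Each product term is represented by the set of input columns it uses.
# After the rows are partitioned into a top half and a bottom half, two
# inputs may share one physical column if one of them is used only by
# top-half rows and the other only by bottom-half rows, so that a single
# laser cut can separate the two column segments.

def used_in(rows, column):
    # True if any product term in 'rows' uses the given input column.
    return any(column in term for term in rows)

def can_fold(top_rows, bottom_rows, col_a, col_b):
    # Check whether input columns col_a and col_b can be folded together.
    a_top, a_bot = used_in(top_rows, col_a), used_in(bottom_rows, col_a)
    b_top, b_bot = used_in(top_rows, col_b), used_in(bottom_rows, col_b)
    # One input must be confined to the top half, the other to the bottom half.
    return (not a_bot and not b_top) or (not a_top and not b_bot)

if __name__ == "__main__":
    top = [{"x0", "x1"}, {"x0", "x2"}]        # product terms above the fold
    bottom = [{"x3", "x4"}, {"x3", "x2"}]     # product terms below the fold
    print(can_fold(top, bottom, "x1", "x4"))  # True: x1 only on top, x4 only on bottom
    print(can_fold(top, bottom, "x2", "x4"))  # False: x2 appears in both halves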

Chapter 4 presented a technology mapping algorithm for foldable look-up-table logic

blocks called LUTPack. The algorithm packs additional logic into each logic block by taking

advantage of unused LUT inputs, and the fact that LUTs can be divided into smaller LUTs by

using the laser disconnect methodology to cut metal lines. The algorithm uses a first-fit-

decreasing bin packing approach to cover the multiplexer tree in a LUT with the small nodes that

exist after normal LUT-based technology mapping.
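The packing step can be pictured with the following Python sketch. The capacity model used here is an assumption made for illustration: a K-LUT folded with flexibility L is treated as 2^(K-L) independent L-input leaf LUTs, a node with n inputs (L <= n <= K) is assumed to occupy the 2^(n-L) leaves of the multiplexer subtree it covers, and the limits on spare inputs and output drivers are ignored. It is therefore only a sketch of the first-fit-decreasing idea, not the LUTPack cost function.

# Illustrative first-fit-decreasing packing of mapped nodes into foldable K-LUTs.
# Assumed capacity model (not the LUTPack cost function): a folded K-LUT offers
# 2**(K - L) independent L-input leaf LUTs, and a node with n inputs occupies
# the 2**(max(n, L) - L) leaves of the multiplexer subtree it covers.

def leaves_needed(num_inputs, K, L):
    n = max(num_inputs, L)              # very small nodes still occupy one leaf
    assert n <= K, "node does not fit in a K-LUT"
    return 2 ** (n - L)

def first_fit_decreasing(node_inputs, K, L):
    # node_inputs: list of input counts, one per mapped node.
    # Returns a list of blocks, each a list of node indices packed into one K-LUT.
    capacity = 2 ** (K - L)
    order = sorted(range(len(node_inputs)),
                   key=lambda i: node_inputs[i], reverse=True)   # decreasing size
    blocks, free = [], []               # free[j] = leaves still unused in block j
    for i in order:
        need = leaves_needed(node_inputs[i], K, L)
        for j, room in enumerate(free):
            if room >= need:            # first block with enough room
                blocks[j].append(i)
                free[j] -= need
                break
        else:                           # no existing block fits: open a new K-LUT
            blocks.append([i])
            free.append(capacity - need)
    return blocks

if __name__ == "__main__":
    nodes = [5, 4, 4, 3, 2, 2]          # input counts left after 5-LUT mapping
    print(first_fit_decreasing(nodes, K=5, L=4))   # [[0], [1, 2], [3, 4], [5]]

Under this simplified model a foldable 5-LUT with L = 4 holds two 4-input nodes, which is the situation exploited by the folding results reported in Chapter 5.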

An experimental study was presented in Chapter 5. Models were developed to estimate

logic block area and a theoretical model was used to estimate the number of routing tracks that would be needed to route circuits. The study of PLA-style logic blocks considered unfoldable, row foldable, column foldable, and combined foldable logic blocks ranging in size from 8 input columns, 8 product term rows, and 3 outputs to 24 input columns, 24 product term rows, and 5 outputs. The study of foldable look-up-table-based logic blocks considered logic blocks with K ranging from 4 to 10, and L ranging from K-3 to K-1. Several conclusions were drawn from the study:

Folding in PLA-style logic blocks significantly reduces the number of logic blocks

needed to implement circuits. Column folding is best for architectures in which inputs

are scarce. Row folding is best for architectures in which product terms are scarce.

Combined folding reaps the benefits of both row and column folding.

A combined foldable PLA-style logic block with the parameters (8, 8,4) was found to

be the most area-efficient of all the PLA-based architectures considered. The most

area-efficient unfoldable architecture has the parameters (8, 8,3), and it consumes 27%

(19%) more area than the best foldable architecture.

In look-up-tables, the effectiveness of folding increases with the parameter, K.

The foldable LUT architecture with the parameters K = 5, L = 4 was found to be the

most area-efficient LUT architecture. It consumes slightly less area than the most area-efficient unfoldable LUT architecture, which has K = 4.

There may be area advantages to a heterogeneous LUT-based architecture in which only

a fraction of the logic blocks are foldable.

Allowing look-up-tables to be folded reduces or eliminates the area penalties associated

with LUT-based logic blocks with K greater than 4. Such coarse-grained logic blocks

have depth advantages over fine-grained LUT-based blocks. For example, a

foldable architecture with K = 6 consumes about 13% (10%) more area than an

unfoldable architecture with K = 4; however, an unfoldable architecture with K = 6

requires about 42% (31%) more area than the unfoldable K = 4 architecture.

When circuits are mapped into either of the proposed architectures, they possess better logic depth and predictability than when they are implemented in the CX2001

LPGA.

6.3 Suggestions for Future Work

During the development of phase II of hooPLA, it was observed that collapsing a node

into its successors may cause an increase in the sum of the sizes of the nodes in the network. It

was beneficial to limit this increase by adjusting the parameter B in relation (3.3) when targeting

multi-output logic blocks. Phase 1 of hooPLA maps each tree in a circuit's DAG into a tree with

the minimum number of PLA-feasible nodes without concern for the sizes of the nodes in the

covering. A future enhancement of phase 1 could take the notion of node size into account when

mapping circuits into multi-output logic blocks.

Folding was integrated into phases II and III of hooPLA. For column or combined

foldable PLA-style blocks, it may be beneficial if folding were integrated into phase 1. In this

case, the nodes in the mapping solution for each tree in a circuit's DAG would be allowed to

possess an infeasible number of inputs, as long as the nodes could be column folded to fit into the

target logic blocks. This change would increase the number of feasible subtrees rooted at any

particular node within a tree, thus increasing the problem complexity; however, it may give

superior results.

In this work, only bipartite folding was considered, requiring all of the breaks in a folded PLA to occur at the same level (same vertical level for column folding, same horizontal level for

row folding). As discussed in Chapter 3, bipartite folding is useful for the first step of combined folding; however, it may be worthwhile to use a more general type of folding for the second step of combined folding, or when performing

column or row folding alone. This would involve implementing another folding algorithm;

however, it may allow even greater amounts of logic to be packed into each logic block.

Future work could also include generating a more accurate area model for the foldable

logic blocks through detailed VLSI layout. One potential source of inaccuracy in generating area

models for LPGAs is in estimating how the addition of laser cut-points affects the positioning of

transistors and metal interconnect within a layout. For example, many laser cut points are needed

to configure the AND- and OR-planes in a PLA-style logic block. These laser cut points may limit

the amount of programmable interconnect that may be placed directly on top of a logic block.

This uncertainty is precisely the reason for including both pessimistic and optimistic area models

in the empirical study in Chapter 5.

Another direction for future work is to compare the area-efficiency of the proposed

architectures with the area-efficiency of the commercially available CX2001 LPGA [CEC96a].

References

[Acte96] ACT 1 Series FPGAs Data Sheet, Actel Corporation, 1996.

[Alte96] The Altera Data Book, Altera Corporation, 1996.

[Alte95] Altera/Synopsys User Guide, Altera Corporation, 1995.

[Atme97] AT6000LV Series Coprocessor Field Programmable Gate Arrays Data Sheet, Atmel Corporation, 1997.

[Ayuk96] M. Ayukawa, Private Communication, 1996.

[AMD96] The MACH 5 Family Data Sheet, Advanced Micro Devices, 1996.

[Betz96] V. Betz and J. Rose, "Directional Bias and Non-Uniformity in FPGA Global Routing Architectures", IEEE/ACM International Conference on Computer-Aided Design, 1996, pp. 652-659.

[Betz96a] V. Betz and J. Rose, "On Biased and Non-Uniform Global Routing Architectures and CAD Tools for FPGAs", CSRI Technical Report #358, Department of Electrical and Computer Engineering, University of Toronto, 1996.

[Bray87] R. K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli and A. R. Wang, "MIS: A Multiple-Level Logic Optimization System", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, November 1987, pp. 1062-1081.

[Brow92] S. D. Brown, R. J. Francis, J. Rose and Z. Vranesic, Field-Programmable Gate Arrays, Kluwer Academic Publishers, Boston, 1992.

[Brow92a] S. D. Brown, "Routing Algorithms and Architectures for Field-Programmable Gate Arrays", Ph.D. Thesis, Department of Electrical Engineering, University of Toronto, 1992.

[Brow96] S. D. Brown, Field-Programmable Devices - Technology, Applications, Tools, Stan Baker Associates, 1996.

[CEC96] Chip Express Technology Overview, Chip Express Corporation, 1996.

[CEC96a] Chip Express Technology and CALI Tool Workshop Notes, Chip Express Corporation, Santa Clara, California, July 1996.

[Chen95] C. Chen, Y. Tsay, T. Hwang, A. Wu and Y. Lin, "Combining Technology Mapping and ...", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 14, No. 9, September 1995, pp. 1076-1084.

[Chun94] K. C. K. Chung, "Architecture and Synthesis of Field-Programmable Gate Arrays with Hard-wired Connections", Ph.D. Thesis, Department of Electrical and Computer Engineering, University of Toronto, 1994.

[Cong94] J. Cong and Y. Ding, "FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 1, January 1994, pp. 1-11.

[Cong94a] J. Cong and Y. Ding, "On Area/Depth Trade-Off in LUT-Based FPGA Technology Mapping", IEEE Transactions on VLSI Systems, Vol. 13, 1994, pp. 1-12.

[Cong95] J. Cong and Y. Hwang, "Simultaneous Depth and Area Minimization in LUT-based FPGA Mapping", UCLA Department of Computer Science Technical Report, CSD TR-9500001.

[Corm94] T. H. Cormen, C. E. Leiserson and R. L. Rivest, Introduction to Algorithms, McGraw-Hill Book Company, Toronto, 1994.

[Cypr97] UltraLogic High-Performance CPLD Data Sheet, Cypress Semiconductor, 1997.

[DeHo96] A. DeHon, "Dynamically Programmable Gate Arrays: A Step Toward Increased Computational Density", 4th Canadian Workshop on Field-Programmable Devices, 1996, pp. 47-54.

[DeMi94] Giovanni De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill Inc., Toronto, 1994.

[Dona79] W. E. Donath, "Placement and Average Interconnection Lengths of Computer Logic", IEEE Transactions on Circuits and Systems, Vol. CAS-26, No. 4, April 1979, pp. 272-277.

[Dona81] W. E. Donath, "Wire Length Distribution for Placements of Computer Logic", IBM Journal of Research and Development, Vol. 25, No. 3, May 1981, pp. 152-155.

[Egan84] J. R. Egan and C. L. Liu, "Bipartite Folding and Partitioning of a PLA", IEEE Transactions on Computer-Aided Design, Vol. CAD-3, No. 3, July 1984, pp. 191-199.

[ElGa81] A. El Gamal, "Two-Dimensional Stochastic Model for Interconnections in Master Slice Integrated Circuits", IEEE Transactions on Circuits and Systems, Vol. CAS-28, No. 2, February 1981, pp. 127-138.

[Farr94] A. H. Farrahi and M. Sarrafzadeh, "Complexity of the Lookup-Table Minimization Problem for FPGA Technology Mapping", IEEE Transactions on Computer-Aided Design, 1994.

[Feue82] M. Feuer, "Connectivity of Random Logic", IEEE Transactions on Computers, Vol. C-31, No. 1, January 1982, pp. 29-33.

[Fidu82] C. M. Fiduccia and R. M. Mattheyses, "A Linear-Time Heuristic for Improving Network Partitions", 19th Design Automation Conference, 1982, pp. 175-181.

[Frak89] S. Frake, M. Knecht, P. Cacharelis, M. Hart, M. Manley, R. Zeman and R. Ramus, "A 9ns Low Standby Power CMOS PLD with a Single-Poly EPROM Cell", 1989 IEEE International Solid-State Circuits Conference, pp. 230-231.

[Frak92] S. O. Frake, S. G. Lawson and J. E. Mahoney, "A Scan-Testable Mask Programmable Gate Array for Conversion of FPGA Designs", IEEE 1992 Custom Integrated Circuits Conference, pp. 27.3.1-27.3.4.

[Fran91a] R. J. Francis, J. Rose and Z. Vranesic, "Chortle-crf: Fast Technology Mapping for Lookup Table-Based FPGAs", 28th ACM/IEEE Design Automation Conference, June 1991, pp. 227-233.

[Fran91b] R. J. Francis, J. Rose and Z. Vranesic, "Technology Mapping of Lookup Table-Based FPGAs for Performance", 1991 IEEE Conference on Computer-Aided Design, pp. 568-571.

[Fran92] R. J. Francis, "Technology Mapping for Lookup-Table Based Field-Programmable Gate Arrays", Ph.D. Thesis, Department of Electrical and Computer Engineering, University of Toronto, December 1992.

[Gajs94] D. D. Gajski and L. Ramachandran, "Introduction to High-Level Synthesis", IEEE Design and Test of Computers, Winter 1994, pp. 44-54.

[Gall96] J. D. Gallia, R. J. Landers, C. Shaw, T. Blake and W. Banzhaf, "A Flexible Gate Array Architecture for High-Speed and High-Density Applications", IEEE Journal of Solid-State Circuits, Vol. 31, No. 3, March 1996, pp. 430-435.

[Hash92] M. Hashimoto, S. S. Mahant Shetti and J. D. Gallia, "New Base Cell for High Density Gate Array", IEEE 1992 Custom Integrated Circuits Conference, pp. 27.2.1-27.2.4.

[He94] J. He, "Technology Mapping and Architecture of Heterogeneous Field-Programmable Gate Arrays", M.A.Sc. Thesis, Department of Electrical and Computer Engineering, University of Toronto, 1994.

[Hill91] D. Hill and N.-S. Woo, "The Benefits of Flexibility in Look-up Table FPGAs", Oxford 1991 International Workshop on Field-Programmable Logic and Applications, pp. 127-136.

[Hsu91] Y. Hsu, Y. Lin, et al., "Combining Logic Minimization and Folding for PLAs", IEEE Transactions on Computers, Vol. 40, No. 6, June 1991, pp. 706-713.

[Jana95] M. Janai, "Re-Engineering ASIC Design with LPGAs", Proceedings of the Eighth Annual International ASIC Conference, 1995, pp. 60-63.

[Kavi96] A. Kaviani and S. Brown, "Hybrid FPGA Architecture", International Symposium on Field-Programmable Gate Arrays, 1996, pp. 1-7.

[Kavi97] A. Kaviani, Ph.D. Thesis in Progress, Department of Electrical and Computer Engineering, University of Toronto, 1997.

[Kern70] B. W. Kernighan and S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs", Bell System Technical Journal, February 1970, pp. 291-307.

[Keut87] K. Keutzer, "DAGON: Technology Binding and Local Optimization by DAG Matching", 24th ACM/IEEE Design Automation Conference, Paper 21.1, pp. 341-347.

[Khat92] M. Khatakhotan, "Interleaved Channeless Gate Array Architecture", IEEE 1992 Custom Integrated Circuits Conference, pp. 27.1.1-27.1.4.

[Knap96] D. W. Knapp, Behavioral Synthesis - Digital System Design Using the Synopsys Behavioral Compiler, Prentice Hall, New Jersey, 1996.

[Koul92] J. L. Kouloheris and A. El Gamal, "PLA-based FPGA Area versus Cell Granularity", IEEE 1992 Custom Integrated Circuits Conference, pp. 4.3.1-4.3.4.

[Koul93] J. L. Kouloheris, "Empirical Study of the Effect of Cell Granularity on FPGA Density and Performance", Ph.D. Thesis, Department of Electrical Engineering, Stanford University, 1993.

[Kuo85] Y. S. Kuo, C. Chen and T. C. Hu, "A Heuristic Algorithm for PLA Block Folding", 22nd Design Automation Conference, 1985, pp. 744-747.

[Lakh90] G. Lakhani and K. Kannappan, "PLA Folding by Partitioning", 1990 IEEE/ACM Design Automation Conference, pp. 2341-2344.

[Land95] R. J. Landers, S. S. Mahant-Shetti and C. Lemonds, "A Multiplexer-Based Architecture for High-Density, Low-Power Gate Arrays", IEEE Journal of Solid-State Circuits, Vol. 30, No. 4, April 1995, pp. 392-396.

[Latt96] ispLSI and pLSI 6000, 3000 CPLD Data Sheet, Lattice Semiconductor, 1996.

[Leck89] J. E. Lecky, O. J. Murphy and R. G. Absher, "Graph Theoretic Algorithms for the PLA Folding Problem", IEEE Transactions on Computer-Aided Design, Vol. 8, No. 9, September 1989.

[Liu94] B. Liu and K. Wei, "An Efficient Algorithm for Selecting Bipartite Row or Column Folding of Programmable Logic Arrays", IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, Vol. 41, No. 7, July 1994, pp. 494-498.

[Luce96] ORCA OR3C/OR3T Series FPGA Product Brief, Lucent Technologies, 1996.

[Mano91] M. Morris Mano, Digital Design, Prentice Hall, Englewood Cliffs, New Jersey, 1991.

[Marp92] D. Marple and L. Cooke, "An MPGA Compatible FPGA Architecture", ACM/SIGDA Workshop on FPGAs, 1992, pp. 39-44.

[Mead80] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley Publishing Company, Don Mills, Ontario, 1980.

[Murg95] R. Murgai, R. Brayton and A. Sangiovanni-Vincentelli, Logic Synthesis for Field-Programmable Gate Arrays, Kluwer Academic Publishers, Boston, 1995.

[Phil97] CoolRunner CPLD Data Sheet, Philips Semiconductors, 1997.

[PREP96] Programmable Electronics Performance Corporation Test Benches, http://www.prep.org, 1996.

[Pres95] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C - The Art of Scientific Computing, Cambridge University Press, New York, 1995.

[Rose89] J. S. Rose, R. J. Francis, P. Chow and D. Lewis, "The Effect of Logic Block Complexity on Area of Programmable Gate Arrays", Proc. IEEE Custom Integrated Circuits Conference, May 1989, pp. 5.3.1-5.3.5.

[Rose90] J. Rose, R. J. Francis, D. Lewis and P. Chow, "Architecture of Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency", IEEE Journal of Solid-State Circuits, Vol. 25, No. 5, October 1990, pp. 1217-1225.

[Sanc95] J. M. Sanchez and J. Ballesteros, "A method for optimizing programmable logic arrays using the simulated annealing algorithm", Microelectronics Journal, Vol. 26, 1995, pp. 43-54.

[Schl94] M. Schlag, J. Kong and P. Chan, "Routability-Driven Technology Mapping for Lookup Table-Based FPGAs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 1, January 1994, pp. 13-26.

[Scot85] W. S. Scott et al., "1986 VLSI Tools: Still More Works by the Original Artists", Technical Report UCB/CSD 86/272, University of California at Berkeley, 1985.

[Sent92] E. M. Sentovich et al., "SIS: A System for Sequential Circuit Synthesis", Technical Report UCB/ERL M92/41, Electronics Research Laboratory, Department of Electrical Engineering and Computer Science, University of California, Berkeley, 1992.

[Sing91] S. Singh, "The Effect of Logic Block Architecture on the Speed of Field-Programmable Gate Arrays", M.A.Sc. Thesis, Department of Electrical Engineering, University of Toronto, 1991.

[Stil83] D. W. Still, "A 4ns Laser-Customized PLA with Pre-Program Test Capability", 1983 IEEE International Solid-State Circuits Conference, pp. 154-155.

[Syn96] Design Compiler and Behavioral Compiler User's Guide, Synopsys Incorporated, 1996.

[Toua91] H. Touati, H. Savoj and R. Brayton, "Delay Optimization of Combinational Logic Circuits by Clustering and Partial Collapsing", 1991 IEEE Conference on Computer-Aided Design, pp. 188-191.

[Veen90] H. Veendrick, D. van den Elshout, D. Harberts and T. Brand, "An Efficient and Flexible Architecture for High-Density Gate Arrays", 1990 IEEE International Solid-State Circuits Conference, pp. 86-87.

[Vran97] D. Vranesic, Private Communication, 1997.

[West93] Neil H. E. Weste and Kamran Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley Publishing Company, Don Mills, Ontario, 1993.

[Wong86] S. Wong, H. So, C. Hung and J. Ou, "Novel Circuit Techniques for Zero-Power 25-ns CMOS Erasable Programmable Logic Devices (EPLD's)", IEEE Journal of Solid-State Circuits, Vol. SC-21, No. 5, October 1986, pp. 766-773.

[Wong87] D. F. Wong, H. W. Leong and C. L. Liu, "PLA Folding by Simulated Annealing", IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 2, April 1987, pp. 208-215.

[Xili94] The Programmable Logic Data Book, Xilinx Corporation, 1994.

[Xili95] Development System User Guide, Xilinx Corporation, 1995.

[Yang91] S. Yang, "Logic Synthesis and Optimization Benchmarks", Technical Report, Microelectronics Center of North Carolina, 1991.

[Zili96] Z. Zilic and Z. Vranesic, "Using BDDs to Design ULMs for FPGAs", Fourth International Symposium on Field-Programmable Gate Arrays, 1996, pp. 24-30.

Table A.1: List of Benchmark Circuits

Benchmark Circuit    Source    Unfolded 4-LUTs    Unfolded (10, 12, 4) PLA-Style Logic Blocks
alu4                 MCNC       713               155
apex2                MCNC       934               219
des                  MCNC      1232               228
ex5p                 MCNC       584               132
s38417               MCNC      2996               603
spla                 MCNC      3862               593
fsm8-16-13           HDL        552               130
fsm8-8-13            HDL        243                49
no164                HDL       1866               524
mlc                  HDL       1180               376
pmac                 HDL        863               237
psdes                HDL        616               151
r4000-32             PREP       933               206
sort                 HDL        707               138
valu                 HDL       1351               329

The suite also includes the MCNC circuits apex4, bigkey, c5315, clma, cps, dalu, ex1010, i10, misex3, pair, pdc, s38584.1, and seq.

Figure B.1: Area Results for Unfoldable PLA-Style Logic Block Architectures (Pessimistic). Panels for logic blocks with 3, 4, and 5 outputs (O = 3, 4, 5); x-axis: Number of Input Columns (I), from 8.0 to 24.0; curves for P = 8, 16, and 24 product term rows.

Figure B.2: Ratio of Row Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Pessimistic). Same panels, axes, and curves as Figure B.1.

Figure B.3: Ratio of Column Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Pessimistic). Same panels, axes, and curves as Figure B.1.

Figure B.4: Ratio of Combined Foldable to Unfoldable Area for PLA-Style Logic Block Architectures (Pessimistic). Same panels, axes, and curves as Figure B.1.

Figure B.5: Ratio of Foldable to Unfoldable Area for LUT-Based Logic Block Architectures (Pessimistic). x-axis: Inputs in Unfolded Form (K), from 4.0 to 10.0.

Figure C.1: PLA Layout Generated by MPLA [Scot85]


D.1 Introduction

To date, all research on FPD architecture has taken an empirical approach. Researchers use a set of benchmark circuits which are mapped into proposed FPD architectures [Rose90][Sing91][Brow92][Koul93][Betz96][Kavi96]. Analysis of experimental results and models of physical hardware are used to determine which architectural features are 'best' in the context of a particular set of benchmark circuits. The type and structure of the benchmark circuits

used in an architectural study partially define which architectural features are deemed desirable.

For the results of these studies to carry any validity, the benchmarks used must be representative

of real industrial circuits. This document describes a new benchmark circuit suite that may be

synthesized using the Synopsys tools, including the Behavioral Compiler and the Design

Compiler.

D.2 Current Benchmarks

Over the past several years, most architectural research has been conducted using a

standard set of benchmark circuits. These are the circuits provided by the Microelectronics Centre

of North Carolina (MCNC) [Yang91]. The MCNC circuits have been used to investigate all

aspects of FPGA architecture including studies of logic block type and granularity, as well as

routing architecture.

One problem with the MCNC circuits is that the precise function of each of the circuits is not known and is undocumented. Some circuits are known to be either control-type or datapath

circuits; however, this broad categorization is not adequate in many cases. For example, it is

unclear how FPD architects could use these circuits to investigate the speed of FPDs

implementing arithmetic circuitry. Since the function of the circuits is not clearly defined, it is

difficult for architects to select a subset of the circuits that is representative of the universe of real

circuits.

Another problem with the MCNC benchmarks is that they are distributed in a netlist

format. This distribution format is not conducive to studying synthesis styles or determining the impact of different synthesis options. Although it is possible to perform technology-independent optimization on the converted circuits using SIS¹ [Sent92], the results of this optimization depend on the initial form of the circuit.

To properly study FPD architectures, it could be argued that benchmark circuits of a size

comparable to actual circuits should be used. This reveals an additional problem with the MCNC

circuits: although there are about 200 circuits in the suite, most of the circuits are very small

(< 2000 gates). It may not be reasonable to use these circuits to investigate architectures for FPDs

that will need to have capacities in the range of 50000 to 100000 gates.

One additional problem with the MCNC circuits is that because of their distribution

format, the function of each of the benchmarks is fixed, and cannot be readily modified. Thus, it is

difficult to add to the functionality of the circuits, or to combine several of the smaller circuits to

create larger circuits.

The simple solution to these problems would be for commercial FPD users to make their

circuits available to researchers. Unfortunately, commercial designs and industrial benchmarks

are considered proprietary in most cases and are not released to the public.

D.3 Parameterized Benchmarks

A set of parameterized benchmarks has been created that may be synthesized using the

Synopsys tools. The benchmarks are specified in either Verilog, VHDL, or Synopsys state table

format. About half of the circuits are written in behavioural HDL in a form acceptable to the

Synopsys Behavioral Compiler (BC) [Syn96]. After the BC performs scheduling and allocation

on a circuit's behavioural code, the resulting register transfer-level (RTL) specification of the

circuit may be read into the Synopsys Design Compiler and synthesized into the gates of a target

library. The second half of the circuits in this suite may be read directly into the Design Compiler.

Each of the circuits is parameterized in a particular way. For example, the datapath circuits

are parameterized so that a user may vary the datapath width. Similarly, control circuits (state machines) may be created with varying numbers of states, inputs, or transitions per state. Because

the circuits are parameterized and written in a text form, they are fundamentally different than the

MCNC benchmarks. First, the function of each circuit is known to the user because the HDL code

1. SIS is a multi-level sequential logic synthesis optimization system.

is available for inspection. Second, the size of each circuit may be varied through the adjustment of parameters.

the MCNC circuits can be created (> 10000 gates). Third, because the circuits are written in HDL,

they are easy to modify. Lastly, these circuits allow FPD architectural research to proceed in

different directions as architects can study which architectures are best for different sizes of a

certain class of circuit. Table B.1 summarizes how properties of the benchmarks in this suite

mitigate the problems of the MCNC benchmarks.

Table B.1: Solutions to Problems with the MCNC Benchmarks

MCNC Benchmark Problem           Parameterized Benchmark Suite Solution
Unknown/undefined function       Function is transparent in the HDL code
Netlist distribution format      Text-based HDL distribution format
Mostly small circuits            Large or small circuits can be created (adjust parameters)
Fixed function/not modifiable    Parameters can be adjusted and HDL modified

D.4 Synopsys Behavioral Compiler

The HDL for several of the benchmarks in this suite is written at the behavioural-level and

therefore, these benchmarks require the high-level synthesis of the Synopsys Behavioral Compiler

(BC). The BC takes behavioural-level HDL as input and transforms this description into an RTL

(register transfer-level) circuit that consists of functional units (adders, multipliers, etc.), a state

machine for control, and memory elements.

Until very recently, most HDL designs were done at the register transfer-level. At this

level of abstraction, a designer must explicitly specify the cycle-by-cycle behaviour of a circuit in

the HDL. This includes a description of which operations (for example, multiplies, additions, or

shifts) are to occur in which clock cycles (the schedule), and the design of any state machines

used for control. Additionally, RTL designers must consider the number and types of hardware

units to be used (the allocation), and explicitly specify how operations in the HDL code map onto

actual hardware units (the binding). Writing behavioural-level HDL code is significantly different

than writing RTL code. In the behavioural coding style, many of the timing constraints that must

be specified explicitly in RTL code are not required, and the code bears a closer resemblance to high-level programming languages such as C.

The Synopsys BC automatically performs many of the high-level synthesis features

discussed in published papers and text books. In particular, the tool performs scheduling,

allocation, and binding. The BC also has features that allow it to perform operator chaining (for

example, scheduling two operations in a single clock cycle, where one of the operands of the

second operation is the result of the first operation), multicycling (allowing a lengthy operation to

span multiple clock cycles), and the automatic pipelining of functional units. For an introduction

to high-level synthesis refer to [Gajs94]. Some of the benefits of behavioural-level HDL and the

BC are:

* State machines to control functional units are generated automatically by the BC.

* Shorter code. Since state machines for control are generated automatically and

the complete schedule does not have to be encoded into the HDL code, the amount of

code needed to specify large, complex circuits is significantly reduced.

* Automatic hardware unit sharing. The BC maps operations in the HDL onto

hardware units and can automatically share hardware units between operations in

different clock cycles.

* Automatic memory element sharing.

* Automated exploration of circuit implementations with different schedules.

As already mentioned, all of the circuits in this suite have been created with parameters that can be used to make them larger or smaller. Furthermore, constraints can be set within the BC to vary the amount of parallelism between the operations in each circuit. Greater amounts of

parallelism will result in larger circuits and shorter latencies. This implies that each benchmark

can actually be viewed as a large number of benchmarks, because the parameters and synthesis

options provide many degrees of freedom in which each circuit's function and hardware

architecture may be varied.

In addition, most of the circuits remain sensible, even when their size is increased. This is

different than the PREP benchmark suite [PREP96] in which large circuits are generated through

the concatenation of smaller circuits in such a way that the resulting large circuits perform no

useful function.

As a default, the BC performs ASAP scheduling (shortest latency/large area). However, it can also perform resource-constrained scheduling (minimal area/long latency). Furthermore, using the set_cycles command in the BC, it is possible to generate the schedules in between the fastest

and the smallest schedule.

The BC supports three different I/O scheduling modes. Different I/O modes allow varying degrees of freedom for I/O to move with respect to the clock cycle boundaries specified in the HDL description. The different I/O modes are: cycle_fixed, superstate_fixed, and free_float. The first two maintain the order of the I/O given in the HDL description, while free_float allows the BC to re-order I/O operations in order to produce more optimal schedules. In cycle_fixed, the precise cycle-by-cycle I/O behaviour of the HDL description is preserved. In superstate_fixed, clock cycles other than those in the HDL may be introduced during scheduling. Refer to [Syn96] or [Knap96] for a detailed discussion of the I/O scheduling modes supported by the BC. For the benchmarks in this suite that must be synthesized using the BC, the superstate_fixed I/O scheduling mode should be used.

One limitation of the BC relates to the clocking strategies permitted in the synthesis of

sequential circuits. The tool permits sequential designs to operate on positive-edge clocking or

negative-edge clocking. Combinations of the two schemes are not currently supported. For this

reason, the circuits in this suite use exclusively positive-edge clocking. A second limitation of the

Behavioral Compiler is that circuits containing tri-state logic cannot be synthesized. Any tri-state

logic must be resolved using multiplexers.

The following shows the BC script used to synthesize the benchmark 'sort' (described

later in this paper). It is fairly representative of the script that can be used to synthesize any of the

benchmarks that require the Behavioral Compiler. Comments are shown in curly brackets.

{script to compile the circuit 'sort'}
{set up target, link and symbol libraries}
analyze -format verilog sort.v
elaborate -s verilog
create_clock clk -period 200 {set clock period to an appropriate value for target technology}
set_behavioral_async_reset
bc_check_design -io superstate
bc_time_design
write -hier sort_timed.db
schedule -io superstate -effort low {alternately: schedule -io superstate -effort low -area}
report_schedule -summary -abstract_fsm -operations -var
write -hier sort_scheduled.db
{set synthesis constraints for area and delay}
compile
write -f verilog -hier -o sort.vlg

D.5 DesignWare Components

The Synopsys DesignWare library is a technology independent collection of commonly

occurring digital circuit components. The availability and use of these components can

substantially reduce design time. Furthermore, this library allows designers to create sophisticated

circuits that include complicated arithmetic or digital logic components (for example, components

such as booth multipliers) without necessitating that designers have an in-depth understanding of

the specifics of designing and optimizing these sub-circuits.

The heavy leveraging of DesignWare components in this benchmark suite greatly

simplified its creation by substantially reducing code length and design time. Commonly occurring arithmetic components did not need to be designed. This methodology can be considered akin to the inclusion of standard libraries when programs are written in high-level

programming languages such as C.

The libraries containing DesignWare components are termed synthetic libraries. To be able to use the circuits in this benchmark suite, the licenses for the DW01, DW02, and DW03 synthetic libraries must be present. The synthetic_library variable in Synopsys must be set to include these libraries before synthesis is attempted.

DesignWare components can be used in two different ways: inferencing or instantiation.

Instantiation is congruent with a structural HDL coding style wherein different components are

connected together in a netlist fashion. The parameterized benchmark suite circuits rely on

inferencing. In this style, hardware components are inferred from the use of various operators in

the HDL code. For example, when a '+' sign occurs in the HDL code, the Synopsys tools infer a

hardware adder. This hardware adder may have several implementations, including ripple-carry or

carry-look-ahead. Based on the timing and area constraints that a designer provides to Synopsys,

the tools will automatically select the adder implementation that best meets the constraints.

By setting the target_library and link_library variables in Synopsys, the circuits in the parameterized benchmark suite may be synthesized into the gates of any target library. For example, both Altera and Xilinx provide libraries to customers that may be used to

target their technologies and interface with their CAD tools. In essence, these benchmarks may be

mapped into any FPD, MPGA, or LPGA for which there exists a Synopsys library. A tool has

been created to convert mapped circuits into the Berkeley Logic Interchange Format (BLIF) that

is commonly used in research, and is readable by SIS [Sent92].

D.7 Description of Benchmarks

This section describes the benchmarks in the suite. For each circuit, the names of its input

and output ports are given, as well as their widths. Following this, the parameters of the circuit are

presented with a brief description of their meaning. Then, a detailed description of the circuit's

function is provided.

It is impossible to describe the function of these circuits as precisely as is done with

benchmarks such as those in the PREP suite [PREP96]. The reason for this is that the cycle-by-

cycle behaviour of many of these circuits is not known until after scheduling. Recall that there are

many possible schedules for each circuit. This means that providing a single timing diagram for

each circuit would not fully describe each circuit's behaviour. Therefore, in this section, the

function of one possible schedule for each circuit is described. Users of this suite should use the

B C command report-schedule to verify that scheduled circuits meet timing expectations.

sort

Synthesis Tool: Behavioral Compiler    Source: Verilog HDL Code

Input/Output Signals Direction Width

in_data         INPUT     [width-1:0]
in_data_rdy     INPUT
clk             INPUT
reset           INPUT
out_data        OUTPUT    [width-1:0]
out_rdy         OUTPUT

Parameters Meaning

width        Width of each item to be sorted.
num_items    Number of data items to be sorted.

Description of Circuit Function

This circuit reads in num-items width-bit data items, sorts the data, and outputs the data in

sorted order. The circuit is initialized by placing a ' 1' on the reset input. The first data item to be

sorted should be placed on the in-data input port and it will be read when a '1' is placed on the

handshaking signal, in_data_rdy. The remaining num_items-1 data items are read on subsequent

clock cycles.

After all data items have been read, the sorting routine begins. Several clock cycles later, a

' 1' will be placed on the handshaking signal, out-rdy (the latency of the sort will depend on how

the design was scheduled). The data will appear on the port out-data in sorted order. One data

item will be output to out_data in each clock cycle for num_items clock cycles. After all data items have been output in sorted order, out_rdy will be restored to '0' and a new set of data items

may be presented to the circuit.

It is possible to pipeline this circuit and overlap the sorting of two sets of input data. This

can be done using the set_pipeline_cycles command of the Behavioral Compiler.

This circuit contains comparators, adder/subtractor circuitry, and memory elements.

Synthesis Tool: Design Compiler    Source: Verilog HDL Code

Input/Output Signals       Direction    Width

in_bit, start, clk, rst    INPUT
crc_out                    OUTPUT       [crc_len-1:0]
crc_rdy                    OUTPUT

Parameters Meaning

crc_len     Length of the CRC word that is produced.
num_bits    Number of data bits used to produce a single CRC word.
crc_poly    The CRC polynomial. This is a constant with the same number of bits as crc_len. 1021 (hexadecimal) should be used for the CCITT 16-bit standard (represented as 16'h1021 in Verilog HDL). Use 32'h04C11DB7 for the 32-bit AUTODIN-II standard.

Description of Circuit Function

This circuit performs a cyclical redundancy check (CRC) on num-bits bits of input data. A

CRC check is a well-accepted method of detecting errors in data transmission in communications

circuits. A CRC word is produced for a set of data, and this word is typically appended to the data sent over the communications medium. A CRC polynomial is a constant that is used to produce

the CRC word. Several examples of these polynomials are given above in the parameters section.

Refer to [Pres88] for additional information on CRC checks.

This circuit is reset when a ' 1' is placed on the rst port. After reset, when a ' 1 ' is placed on

the handshaking signal in_bit, the first data bit is read on a positive clock edge. Data are read in bit-serial fashion. The remaining num_bits-1 bits are read on successive clock cycles. After all input bits have been read, a '1' is placed on the output signal, crc_rdy, on the next positive clock edge. At this time, all bits of the crc_len-bit CRC word are output on the crc_out port in parallel. The number of bits in the CRC word (crc_len) need not be the same as the number of bits used to produce the CRC word (num_bits), though there are established standards. On the next positive clock edge, the handshaking signal crc_rdy will be restored to '0', and the circuit will be ready to

accept new input data.
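The following Python sketch is a behavioural reference model of the bit-serial CRC computation described above; it is not the Verilog source of the benchmark, and it assumes the conventional MSB-first shift-register formulation with a zero initial remainder, which may differ in detail from the circuit's implementation.

# Bit-serial CRC reference model (an assumption about the formulation, not the
# benchmark's Verilog code). 'bits' is the data stream, most significant bit first.

def crc_serial(bits, poly, crc_len):
    crc = 0
    mask = (1 << crc_len) - 1
    for bit in bits:
        top = (crc >> (crc_len - 1)) & 1         # bit about to be shifted out
        crc = (crc << 1) & mask
        if top ^ bit:                            # feed back the CRC polynomial
            crc ^= poly
    return crc

if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0] * 2          # 16 data bits (num_bits = 16)
    print(hex(crc_serial(data, 0x1021, 16)))     # CCITT 16-bit polynomial
    print(hex(crc_serial(data, 0x04C11DB7, 32))) # 32-bit AUTODIN-II polynomial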

Synthesis Tool: Behavioral Compiler Source: Verilog HDL Code

Input/Output Signals Direction Width

in_data             INPUT     [width-1:0]
in_rdy, clk, rst    INPUT
out_data            OUTPUT    [(2*width)+order-1:0]

Parameters Meaning

width    Width of each data item read into the filter.
order    Order of the filter.

Description of Circuit Function

An FIR filter is a digital filter with no feedback connections. The circuit must first be

initialized by placing a '1' on the input port, rst. Before the filter can be operated, its coefficients

must be read in. The number of filter coefficients is equal to the filter order. Each coefficient, as

well as each data word, is width-bits wide. The coefficients must be supplied to the filter through

the input port, in_data, in successive clock cycles. Placing a '1' on the port in_rdy indicates the beginning of the coefficient stream.

After the coefficients have been read, the circuit will begin to filter the data supplied to the

port in-data. The results of the filtering will appear on the output port out-data. This circuit can

be scheduled so that new data items are read into the filter in successive clock cycles, implying

that filtered output is also available in each clock cycle.

Depending on the amount of parallelism in the schedule, this circuit can contain several

large multipliers and adder circuits. Smaller versions of the FIR filter can be created by increasing

the latency of the circuit (and reducing parallelism) by setting scheduling constraints. For

example, one could schedule the FIR filter so that input data is not read every clock cycle, but

instead read every two clock cycles.
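The filtering arithmetic itself can be summarized by the short Python model below: each output is the inner product of the coefficient vector with the most recent 'order' input samples. This is only the textbook direct-form FIR computation and omits the handshaking and coefficient-loading behaviour of the benchmark.

# Direct-form FIR reference model: y[n] = sum_k c[k] * x[n - k].

def fir_filter(coefficients, samples):
    order = len(coefficients)
    delay = [0] * order                  # delay line, newest sample first
    outputs = []
    for x in samples:
        delay = [x] + delay[:-1]         # shift the new sample in
        outputs.append(sum(c * d for c, d in zip(coefficients, delay)))
    return outputs

if __name__ == "__main__":
    print(fir_filter([1, 2, 1], [4, 0, 0, 4]))   # [4, 8, 4, 4]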

Synthesis Tool: Behavioral Compiler    Source: Verilog HDL Code

Input/Output Signals    Direction    Width

x_in, y_in              INPUT        [width-1:0]
in_rdy, clk, reset      INPUT
x_out, y_out            OUTPUT       [width-1:0]
done                    OUTPUT

Parameters    Meaning

width             Width of each data item read into the MLC circuit.
num_code_words    Number of words to which each input data item is compared.

Description of Circuit Function

This circuit is a hardware implementation of a maximum likelihood classifier (MLC). The

circuit compares input data with a set of code words stored in memory. Each data item has an x

component and a y component similar to a point in two-dimensional Cartesian space. The input

data are compared with the stored data on the basis of Euclidean distance. The code word in

memory that is 'closest' to the input word is output on the ports x-out and y-out.

Before any classification can occur, the circuit must be reset by placing a ' 1 ' on the input

port reset. Following this, the set of code words must be read into the circuit's memory elements. num_code_words are read in successive clock cycles, with the x components being read on the

x_in port, and the y components being read on the y_in port. Placing a '1' on the handshaking

signal in-rdy indicates the beginning of the code word sequence.

Once all the code words have been read, the first input data can be supplied on the next

positive clock edge. This data will be compared to each of the code words, and the components of

the closest code word will be placed on the x_out and y_out ports; the output signal done will be

asserted. On the next clock cycle, done will be restored to '0'. Another data item can be presented

to the circuit on the next clock cycle.

This circuit contains a combination of multipliers, adders, subtractors, and less-than

comparators.
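The classification step amounts to a nearest-neighbour search, as in the Python sketch below; squared Euclidean distance is used here on the assumption that the hardware avoids a square root, which is consistent with the multipliers, subtractors, adders and comparators listed above but is not taken from the benchmark's source.

# Nearest-code-word search by (squared) Euclidean distance.

def classify(code_words, x, y):
    best, best_dist = None, None
    for cx, cy in code_words:
        dist = (x - cx) ** 2 + (y - cy) ** 2     # no square root needed for comparison
        if best_dist is None or dist < best_dist:
            best, best_dist = (cx, cy), dist
    return best

if __name__ == "__main__":
    words = [(0, 0), (10, 10), (3, 7)]
    print(classify(words, 4, 6))                 # (3, 7) is closest to (4, 6)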

Synthesis Tool: Behavioral Compiler    Source: Verilog HDL Code (Adapted from Numerical Recipes in C [Pres88])

Input/Output Signals Direction Width

in_data, clk, reset    INPUT
in_data_rdy            INPUT
out_data               OUTPUT    [width-1:0]
done                   OUTPUT

Parameters Meaning

width    Width of the data item to be encrypted (must be divisible by 4).
c1       Array [3:0] of constants, each with width equal to (width/2).
c2       Array [3:0] of constants, each with width equal to (width/2).

Description of Circuit Function

This circuit was originally a C program that has been translated into behavioural HDL.

The circuit encrypts data in a way similar to the data encryption standard (DES). Input data is

divided into segments, and these segments are permuted. Following this, several arithmetic

operations are performed on the segments. These operations include the multiplication and

addition of some segments with others, and the exclusive-ORing of segments with user-supplied constants in the arrays, c1 and c2.

This circuit is initialized by placing a ' 1' on the reset port. Input data is read in bit-serial

fashion, with the first bit being read when the handshaking signal in-data-rdy is asserted. The

next width-1 bits of the input data word are read on successive clock cycles.

After the input data has been read, encryption begins. Refer to [Pres88] for details on the

encryption process. After encryption, the encrypted form of the input word is placed on the output

port out-data, and the handshaking signal done is set to ' 1 ' . After one clock cycle, done will be

restored to '0', and new data can be presented to the encryption circuit.

The parameters of this circuit include 8 user-defined constants. Each constant must have a

width equal to the width parameter divided by 2. These constants are exclusive-ORed with various

segments of the input data.

Synthesis Tool: Design Compiler    Source: VHDL Code

Input/Output Signals    Direction    Width

in_v         INPUT     [0:N-1]
scan, clk    INPUT
out_v        OUTPUT    [0:N-1]

Parameters    Meaning

N    The cellular automaton will be an N by N array of cells.

Description of Circuit Function

This circuit is a hardware implementation of the 'game of life' cellular automata. The

automata's structure consists of a square array of identical cells, each having connections to its

eight nearest neighbours. In a given clock cycle, a cell can be in the '1' state or the '0' state. A

cell's value in the next state (in the next clock cycle) will depend on its current state and the states

of its neighbours.

The circuit consists of two VHDL modules which must be synthesized separately. The

first describes the basic cell of the array and it must be compiled before the second, which is a structural module instantiating the square array of cells. The dimensions of the array are N by N, and therefore, the number of cells in the array is N².

The states of each of the cells in the array can be initialized in N clock cycles through the input port in_v and the scan input. When scan is set to '1', the value of each cell is simply set to the value of its neighbour to the north. The in_v port is the array of northern connections to the top row of cells in the array. Setting scan to '0' causes each cell in the circuit to perform the game of

life algorithm.

Examining the state of all cells in the array is done in a way similar to the presetting of cells. The out_v output port is a vector representing the states of the bottom row of cells in the array. By setting scan equal to '1', the value of each cell in the array can eventually be passed to the bottom row of the array, and examined through the out_v port.
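For reference, the next-state function of a single cell can be written as the Python sketch below. The thesis does not spell out the update rule, so the standard Conway rule is assumed here (a live cell survives with two or three live neighbours, a dead cell becomes live with exactly three); the actual VHDL cell may implement the rule differently.

# Next state of one cell, assuming the standard Conway 'game of life' rule.
# 'neighbours' is the list of the eight neighbouring cell values (0 or 1).

def next_state(current, neighbours):
    alive = sum(neighbours)
    if current == 1:
        return 1 if alive in (2, 3) else 0       # survival
    return 1 if alive == 3 else 0                # birth

if __name__ == "__main__":
    print(next_state(1, [1, 1, 0, 0, 0, 0, 0, 0]))   # 1: live cell with two live neighbours
    print(next_state(0, [1, 1, 1, 0, 0, 0, 0, 0]))   # 1: dead cell with three live neighbours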

Synthesis Tool: Design Compiler    Source: C program

Input/Output SignaIs Direction

i0, il, ... i(nin-1) clk, reset 00, 00, ..., o(n0ut-1)

INPUT INPUT OUTPUT

Parameters    Meaning

nin           Number of inputs to the finite state machine.
nout          Number of outputs of the finite state machine.
nstate        Number of states in the finite state machine.
prob-trans    Probability of adding another transition from a given state.
prob-dc       Probability that a given input is a 'don't care' in a transition.

Description of Circuit Function

A C program has been written to generate finite state machines randomly with any number

of inputs, states, or outputs. The program produces state tables in Synopsys state table format. The

states in the machine are represented symbolically in the generated state table. This allows a user

to choose the encoding style from within the Design Compiler, for example, binary, one-hot, or

other encoding styles.

The number of transitions from a given state to other states can be adjusted through the

prob-trans parameter. When transitions are being generated from a certain state, this parameter

represents the probability that another state will be added. This implies that the number of

transitions from each state to other states follows a geometric distribution. Furthermore, in each

transition, each input must be either '0', '1', or 'X'. The prob-dc parameter can be adjusted to

control the number of 'X's in the state table. All states in generated machines will be reachable

from the start state¹. A sample DC script to synthesize this benchmark is:

    {set up target and link libraries}
    read -format fsm fsm.st
    set_fsm_encoding_style onehot  {or binary}
    reduce_fsm
    set_fsm_minimize true
    compile
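
The effect of the prob-trans and prob-dc parameters can be illustrated with a C fragment such as the one below. This is only a sketch, not the actual generator: the Synopsys state-table output format, output-value generation, and the bookkeeping that guarantees reachability from the start state are all omitted.

    #include <stdio.h>
    #include <stdlib.h>

    /* Returns 1 with probability p. */
    static int flip(double p)
    {
        return ((double)rand() / RAND_MAX) < p;
    }

    /* Emits the transitions leaving one state.  Adding a further transition
     * with probability prob_trans after each one gives a geometrically
     * distributed transition count; each input bit is replaced by a
     * don't-care with probability prob_dc. */
    static void emit_transitions(int state, int nin, int nstate,
                                 double prob_trans, double prob_dc)
    {
        do {
            printf("s%d -> s%d on ", state, rand() % nstate);
            for (int i = 0; i < nin; i++)
                putchar(flip(prob_dc) ? 'X' : (flip(0.5) ? '1' : '0'));
            putchar('\n');
        } while (flip(prob_trans));
    }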

1. Asserting the reset input forces the state machine into the start state.

Synthesis Tool: Design Compiler                    Source: VHDL Code

Input/Output Signals    Direction    Width

a, b, c                 INPUT        [width-1:0]
clk                     INPUT
z                       OUTPUT       [(2*width):0]

Parameters    Meaning

width         The datapath width of the circuit.
pipe-length   The number of pipeline stages in the multiply-accumulate operation (>2).

Description of Circuit Function

This circuit implements a pipelined form of the simple arithmetic function z = a*b + c. The

width of the datapath and the number of pipeline stages are represented by the parameters width

and pipe-length, respectively.

Upon examination of the HDL code, it appears as though data is being passed between the

circuit's registers in a shifter-like fashion, with all the combinational logic occurring in front of

the first set of pipeline registers. The reason for this is that the HDL for this circuit was written to

take advantage of a special Synopsys synthesis option, called register balancing. Register

balancing can automatically minimize the clock cycle time of a circuit by moving logic between

register boundaries. In essence, the tool will 'balance' the amount of logic in between the sets of

pipeline registers. Using a larger value for the pipe-length parameter will reduce the amount of

logic in each stage of the pipeline.
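
A cycle-level software model of this structure might look like the sketch below (width fixed at 32 bits and pipe-length at 4 for illustration). As written, all of the a*b + c logic sits in front of the first register and the remaining stages only shift; register balancing then redistributes that logic across the register boundaries at synthesis time.

    #include <stdint.h>

    #define PIPE_LENGTH 4   /* the pipe-length parameter (must be > 2) */

    /* Pipeline registers: stage[0] is loaded by the combinational logic,
     * stage[PIPE_LENGTH-1] drives the z output. */
    static uint64_t stage[PIPE_LENGTH];

    /* One rising clock edge.  The result of the inputs presented now
     * appears on z PIPE_LENGTH clock edges later. */
    uint64_t clock_edge(uint32_t a, uint32_t b, uint32_t c)
    {
        uint64_t z = stage[PIPE_LENGTH - 1];

        for (int i = PIPE_LENGTH - 1; i > 0; i--)   /* shift the pipeline */
            stage[i] = stage[i - 1];

        stage[0] = (uint64_t)a * b + c;             /* all logic before stage 0 */

        return z;
    }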

The benchmark can be compiled using the Design Compiler; however, the user should

issue the command 'set_balance_registers true' before synthesizing the design.

All of the registers in this circuit are positive-edge triggered flip-flops. Input data

presented to the circuit on the a, b, and c input ports will be captured on a rising clock edge, and

the output will become available on the z port several clock cycles later, depending on the value of

the pipe-length parameter.

Synthesis Tool: Behavioral Compiler                Source: VHDL Code

Input/Output Signals           Direction    Width

a, b                           INPUT        [width-1:0]
z                              OUTPUT       [width-1:0]
cmd                            INPUT        [2:0]
clk, reset, in-data-ready      INPUT
out-data-ready                 OUTPUT

Parameters Meaning

width Datapath width of the circuit (a power of 2). log-w idth Must be set to log2width. vector-length Number of elements in each of the vectors A and B.

Description of Circuit Function

This circuit implements a vector ALU. Operations on the elements in two vectors, A and

B, may be performed in parallel depending on how the design is scheduled. The value placed on

the cmd port controls which operation is performed by the vector ALU as follows:

cmd    Selected Operation

000    Vector Addition (A + B)
001    Vector Subtract (A - B)
010    Barrel Shift (A is shifted by the lower log-width bits of B)
011    Logical AND (A AND B)
100    Logical OR (A OR B)
101    Logical EXOR (A EXOR B)
110    (NOT A) AND B
111    Logical NAND (A NAND B)
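
In C, the per-element operation select corresponds to a switch such as the one below (width fixed at 32 bits for illustration; the direction of the barrel shift is an assumption, since the table does not specify it):

    #include <stdint.h>

    /* One element of the vector operation, following the cmd encoding above. */
    uint32_t vector_alu_element(uint32_t a, uint32_t b, unsigned cmd, unsigned log_width)
    {
        uint32_t shamt = b & ((1u << log_width) - 1);   /* lower log-width bits of B */

        switch (cmd & 7u) {
        case 0:  return a + b;        /* vector addition      */
        case 1:  return a - b;        /* vector subtraction   */
        case 2:  return a << shamt;   /* barrel shift (direction assumed) */
        case 3:  return a & b;        /* logical AND          */
        case 4:  return a | b;        /* logical OR           */
        case 5:  return a ^ b;        /* logical EXOR         */
        case 6:  return ~a & b;       /* (NOT A) AND B        */
        default: return ~(a & b);     /* logical NAND         */
        }
    }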

The circuit can be reset synchronously on the positive clock edge using the reset input.

After reset, the signal in-data-ready should be asserted when the first elements of the two vectors

A and B are placed on the a and b input ports, and the selected operation code is placed on the

cmd port. The first elements of vectors A and B, as well as the operation code, will be read on the

rising edge of the clock. The subsequent vector-length-1 elements will be read in successive

clock cycles. After the computation is complete, out-data-ready will be asserted for

vector-length clock cycles, with a different element of the result vector Z appearing on output

port z in each clock cycle.

Table E.l compares the number of logic blocks needed to implement circuits in the

CX2001 LPGA [CEC96a] with the number of blocks needed in the proposed architectures. The

numbers in the table were computed by determining a ratio for each benchmark circuit and

averaging the ratios across all benchmarks. Thus, each circuit (whether small or large) was treated

equally. Only the combinational portion of each circuit was considered in the comparison¹. The

table shows that one combined foldable PLA-style logic block with the parameters (8, 8, 4) has a

logic capacity approximately equivalent to ten CX2001 logic blocks. One foldable LUT-based

logic block with K = 5 and L = 4 is approximately equivalent to three CX2001 blocks.

Table E.1: Comparing Number of Logic Blocks

Architecture                              Average Ratio (N_CX2001 / N_arch)

Unfoldable PLA-based (8, 8, 3)            5.93
Combined foldable PLA-based (8, 8, 4)     9.89
Unfoldable LUT-based (K = 4)              1.80
Foldable LUT-based (K = 5, L = 4)         3.14

1. This was necessary because CX2001 logic blocks contain no flip-flops.
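
The ratio column is an equal-weight average of the per-circuit ratios; for Table E.1 the computation amounts to the trivial sketch below (the pin-count ratios of Table E.2 are averaged in the same way, with Pins_arch over Pins_CX2001).

    /* Average of the per-circuit block-count ratios N_CX2001 / N_arch,
     * so that each benchmark circuit is weighted equally regardless of size. */
    double average_ratio(const int n_cx2001[], const int n_arch[], int num_circuits)
    {
        double sum = 0.0;
        for (int i = 0; i < num_circuits; i++)
            sum += (double)n_cx2001[i] / n_arch[i];
        return sum / num_circuits;
    }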

Table E.2 compares the total number of connected logic block pins in circuits

implemented in the proposed architectures with the number of connected pins in circuits

implemented in the CX2001. The metric of total number of connected pins has been shown to

correlate well with routing resource area [Hill91][Brow92][He94]. Both input and output logic

block pins are included in the comparison. The table shows that when circuits are implemented in

a combined foldable (8, 8, 4) PLA-based architecture, they possess 51% fewer connected pins, on

average, than when implemented in the CX2001. This suggests that an LPGA with such combined

foldable PLA-style logic blocks would need significantly less routing resource area than the

CX2001. Circuits implemented in a foldable LUT-based architecture with K = 5 and L = 4 have

28% fewer connected pins than when implemented in the CX2001.


Table E.2: Comparing Number of Connected Pins

Architecture                              Average Ratio (Pins_arch / Pins_CX2001)

Unfoldable PLA-based (8, 8, 3)            0.53
Combined foldable PLA-based (8, 8, 4)     0.49
Unfoldable LUT-based (K = 4)              0.81
Foldable LUT-based (K = 5, L = 4)         0.72
