thesis_v4

7/28/2019 thesis_v4

1/147

i

HIGH SPEED AND LOW POWER DESIGN OF CONTENT ADDRESSABLE MEMORY

A thesis submitted to the

DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING

OF

BANGLADESH UNIVERSITY OF ENGINEERING AND TECHNOLOGY

in partial fulfillment of the requirements for the degree of

Bachelor of Science in Electrical and Electronic Engineering

By

MD. NAIMUL HASAN (0306017)

MD. TAUHIDUR RAHMAN (0306035)

MD. MEHEDI HASAN (0306071)

Supervisor

DR. A. B. M. HARUN-UR RASHID

Professor, Department of EEE,

BUET, Dhaka 1000

BANGLADESH UNIVERSITY OF ENGINEERING AND TECHNOLOGY


Declaration

We hereby declare that the work presented in this thesis entitled "High Speed and Low Power Design of Content Addressable Memory" is the outcome of the investigation carried out by us and neither this thesis nor any part thereof has been submitted or is being currently submitted anywhere else for the award of any degree or diploma.

Md. Naimul Hasan (0306017)

Md. Tauhidur Rahman (0306035)

Md. Mehedi Hasan (0306071)


Our parents.


ACKNOWLEDGEMENTS

First, we would like to convey our gratitude to Almighty Allah, without whose wish nothing is possible.

We would like to express our profound gratitude and appreciation to our thesis supervisor, Professor Dr. A. B. M. Harun-Ur Rashid, for his benign attitude towards us. His supervision gave us the opportunity to get involved in the state-of-the-art and rapidly emerging research on low-power and high-speed circuit design, especially Content Addressable Memory (CAM). His generous help, encouragement and constant guidance accelerated the completion of this thesis.

We would also like to express our special thanks to Mr. Ataur Rahman Patwary, School of Electrical and Computer Engineering, Oregon State University, USA, for his suggestions and the different kinds of help he gave us to complete the work.

We would like to thank the alumni of BUET (also members of the Yahoo! Group BUETian) for their help in providing many of the necessary reference works.

We would also like to acknowledge the help of the Department of Electrical and Electronic Engineering, BUET, for letting us use the VLSI server computer to simulate large HSPICE files.


CONTENTS

Declaration
Dedication
Acknowledgement
Contents
List of Figures
List of Tables
Abstract

CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
1.2 MOTIVATIONS
1.3 OBJECTIVES
1.4 THESIS ORGANIZATION

CHAPTER 2: DIFFERENT TYPES OF LOGIC
2.1 INTRODUCTION
2.2 RATIOED LOGIC
2.2.1 Load Resistance RL
2.2.2 nMOS depletion mode transistor pull up
2.2.3 nMOS enhancement mode pull up
2.2.4 Pseudo-nMOS logic
2.3 COMPLEMENTARY TRANSISTOR PULL-UP (CMOS)
2.3.1 Components of Total Power Dissipation in CMOS Circuits
2.4 DYNAMIC CIRCUITS
2.5 DOMINO LOGIC

CHAPTER 3: CONTENT ADDRESSABLE MEMORY (CAM) REVIEW
3.1 INTRODUCTION
3.2 CORE CELLS AND MATCHLINE STRUCTURE
3.2.1 Structure of NOR Cell
3.2.2 Structure of NAND Cell


3.2.3 Ternary Cells
3.3 MATCHLINE SENSING SCHEMES
3.3.1 Conventional (Precharge-High) Matchline Sensing
3.3.1.1 Basic Operation
3.3.1.2 Matchline power
3.3.1.3 Charge Sharing
3.3.1.4 Power Consumption
3.4 LOW-SWING SCHEMES
3.5 CURRENT-RACE SCHEME
3.6 SELECTIVE-PRECHARGE SCHEME
3.7 PIPELINING SCHEME
3.8 CURRENT-SAVING SCHEME
3.9 CONCLUSION

CHAPTER 4: PROPOSED CHARGING CONTROL SCHEME & SENSE AMPLIFIER
4.1 INTRODUCTION
4.2 PROPOSED CHARGE CONTROLLING SCHEME
4.3 SIMULATION RESULTS AND ANALYSIS
4.4 CORNER SIMULATION OF THE SCHEME
4.5 CONCLUSION

CHAPTER 5: PROPOSED SIMPLIFIED DESIGN OF CHARGING CONTROLLER
5.1 INTRODUCTION
5.2 PROPOSED ML CHARGING TECHNIQUE
5.3 SIMULATION RESULTS AND ANALYSIS
5.4 CONCLUSION

CHAPTER 6: PROPOSED CAM WITH IMPROVED NOISE MARGIN
6.1 INTRODUCTION
6.2 OPERATION OF THE SCHEME
6.2.1 Charging controller
6.2.2 The sense amplifier
6.3 SIMULATION RESULT


6.4 CONCLUSION

CHAPTER 7: CONCLUSION
7.1 CONCLUSION
7.2 FUTURE WORK

Reference
Bibliography
RESEARCH PAPERS FROM THIS THESIS

APPENDIX A
A.1 HSPICE CODE FOR CHARGING CONTROL SCHEME
A.2 HSPICE CODE FOR SIMPLIFIED DESIGN OF CHARGING CONTROLLER
A.3 HSPICE CODE FOR IMPROVED NOISE SCHEME [CHAPTER 6]

APPENDIX B
B.1 INTRODUCTION
B.2 INSTALLATION AND USAGE OF HSPICE 2007
B.3 BASIC RULES OR QUICK MANUAL [2]
B.3.1 Input File
B.3.2 Element Description
B.3.3 Analysis
B.3.4 References


LIST OF FIGURES

Fig. 2. 1: Resistor Pull-up
Fig. 2. 2: nMOS depletion mode transistor pull up
Fig. 2. 3: nMOS enhancement mode pull up
Fig. 2. 4: Pseudo-nMOS Logic
Fig. 2. 5: Complementary transistor pull-up (CMOS)
Fig. 2. 6: CMOS inverter current versus Vin
Fig. 2. 7: Precharge and evaluation of dynamic gates
Fig. 2. 8: Footed dynamic inverter
Fig. 2. 9: Unfooted dynamic gates
Fig. 2. 10: Generalized footed gates
Fig. 2. 11: Logical effort of footed and unfooted dynamic gates
Fig. 2. 12: Monotonicity problem
Fig. 2. 13: Incorrect connection of dynamic gates
Fig. 2. 14: Standard Domino Logic circuit
Fig. 2. 15: Weak keeper implementation
Fig. 3. 1: Simple schematic model of a 4x3 CAM array showing the core memory cells, differential search lines, match lines and encoder
Fig. 3. 2: Profile of CAM capacity (log scale) versus year of publication [33][40]
Fig. 3. 3: CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells
Fig. 3. 4: Structure of Ternary core cells for (a) NOR-type (b) NAND-type CAM [41], [42]
Fig. 3. 5: (a) The schematic with precharge circuitry for matchline sensing using the precharge-high scheme, and (b) the corresponding timing diagram showing relative signal transitions [34]
Fig. 3. 6: Matchline power for NAND and NOR architecture [33]
Fig. 3. 7: Two possible configurations for the NOR cell: (a) the stored bit is connected to the bottom transistors of the pulldown pair, and (b) the stored bit is connected to the top transistors of the pulldown pair


Fig. 3. 8: Low-swing matchline sensing scheme of [33]
Fig. 3. 9: Current-race matchline sensing: (a) circuit implementation including precharge circuitry and (b) a timing diagram for a single search cycle [51]
Fig. 3. 10: Sample implementation of the selective-precharge matchline technique [52]
Fig. 3. 11: Pipelined matchlines reduce power by shutting down after a miss in a stage
Fig. 3. 12: Simulated waveforms in the pipelined match-line architecture for (a) the full-match case consisting of a match in every stage, and (b) a miss case where the third stage results in a miss and turns off the subsequent stages [58]
Fig. 3. 13: Current-saving matchline-sensing scheme
Fig. 4. 1: Structure of the CAM of the proposed scheme: (a) basic architecture and (b) NOR-type TCAM cell used in the scheme
Fig. 4. 2: Internal circuit of the charging controller
Fig. 4. 3: Proposed sense amplifier (SA)
Fig. 4. 4: Simulation results of the proposed CAM showing voltages ML0 (fully matched), ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP
Fig. 4. 5: Corner simulation results for FHL (when threshold voltage = -10%, VDD = +5%, and temperature = 273 K)
Fig. 4. 6: Corner simulation results for SLH (when threshold voltage = +10%, VDD = -5%, and temperature = 343 K)
Fig. 5. 1: Simplified conventional CAM architecture
Fig. 5. 2: Structure of the CAM array with the proposed scheme: (a) the basic architecture and (b) internal circuit of the NOR-type TCAM cell used in this scheme. Here the usual SRAM access transistor and associated bitlines are omitted for simplicity
Fig. 5. 3: The ML charging unit proposed in this work and the sensing unit proposed in [84]
Fig. 5. 4: Simulation results of the proposed CAM showing voltages ML0 (fully matched), ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP
Fig. 6. 1: Structure of the CAM array
Fig. 6. 2: Internal circuitry of improved Charging
Fig. 6. 3: Waveforms for CAM


Fig. 6. 4: Conventional sense amplifier
Fig. 6. 5: Charging in different match-lines
Fig. 6. 6: Controlled signals and output of each match-line


LIST OF TABLES

Table 3. 1: Truth Table for NOR Cell
Table 3. 2: Truth Table for NAND Cell
Table 3. 3: Comparison between the schemes [62]
Table 4. 1: Comparison of Different Schemes
Table 5. 1: Comparison of Different Schemes
Table 6. 1: Comparison of Different Schemes with improved noise margin scheme
Table 6. 2: Comparison of Noise Immunity of different schemes


ABSTRACT

The growing market demand for integrated circuits and the worldwide energy crisis drive researchers to develop new process technologies with smaller transistors and to design circuits that need less power and operate at higher speed. The purpose of this dissertation is the same, i.e. to find a way to design digital circuits, specifically Content-Addressable Memory (CAM), that need low power and operate at higher speed while maintaining noise immunity.

Content-addressable memory (CAM) is an attractive component in network routers for packet forwarding and packet classification and also in other applications that require high-speed searches. This dissertation presents three techniques to increase speed and reduce the energy per bit per search. In the first technique, the charging of a fully matched matchline is reduced from VDD to VDD/3 and this voltage is sensed by our proposed sense amplifier. This reduces the power consumption greatly. In the second scheme, the charging controller is simplified (to obtain a low-cost circuit) at the cost of some energy. In the third scheme, which has a high noise margin, only probable matched matchlines are charged to almost VDD. Most of the matchlines are charged far less, so in most of the matchlines the power consumption is very low.

All the schemes (including the conventional current-saving and current-race schemes) are simulated in TSMC 0.18 µm technology with a 64 x 72 ternary CAM. For the first charging control technique, simulation shows that the match-line energy reduction is 57% and 54% compared to the current-race and current-saving schemes respectively and 55% compared to the conventional current-race scheme, while the speed of operation is increased by over 3 times. For the second, simplified scheme, the energy efficiency is slightly lower. The third scheme provides a very good noise margin while maintaining sufficient energy reduction and speed of operation. More accurate results can be obtained by doing the layout of the proposed schemes.

This dissertation thus provides some good schemes for content-addressable memory. We hope that a proper layout of the schemes would also provide good results, and that fabrication of the chip would yield good alternatives to existing CAMs.


CHAPTER 1

INTRODUCTION


1.1 INTRODUCTION

A content-addressable memory (CAM) compares input search data against a table of stored data, and returns the address of the matching data. CAMs have a single clock cycle throughput, making them faster than other hardware- and software-based search systems. CAMs can be used in a wide variety of applications requiring high search speeds. These applications include cache memory, parametric curve extraction, Hough transformation, Huffman coding/decoding, Lempel-Ziv compression, and image coding [1]. The primary commercial application of CAMs today is to classify and forward Internet protocol (IP) packets in network routers. In networks like the Internet, a message such as an e-mail or a Web page is transferred by first breaking up the message into small data packets of a few hundred bytes, and then sending each data packet individually through the network. These packets are routed from the source, through the intermediate nodes of the network (called routers), and reassembled at the destination to reproduce the original message. The function of a router is to compare the destination address of a packet to all possible routes, in order to choose the appropriate one. A CAM is a good choice for implementing this lookup operation due to its fast search capability.

CAM is also used in neural networks. Two main strategies are available for implementing a CAM with a neural network architecture, feedback networks and two-stage CAMs, and they differ in particular in their ability to retrieve patterns from corrupted input data. The storage capacity of the Hopfield network (a feedback network) is very poor, although it can be improved with the use of an iterative algorithm such as the threshold algorithm. However, the possibility of generating spurious patterns always remains with feedback networks. Two-stage CAMs are much more efficient, provided that an appropriate algorithm is used for the input classification stage. Perceptron and least-mean-squares algorithms need to be modified if they are to cope with corrupted input patterns, but the optimal classifier for the type of problem under consideration is the minimum-distance classifier (or Hamming network for binary patterns).
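
The two lookup behaviours described above can be sketched in a few lines of Python. This is only a behavioural illustration under stated assumptions, not the circuit studied in this thesis: stored words are modelled as bit strings, and the function names `cam_search`, `hamming_distance` and `nearest_address` are hypothetical. A hardware CAM compares every row in parallel, whereas the loop here merely reproduces the result.

```python
from typing import List, Optional


def cam_search(table: List[str], pattern: str) -> Optional[int]:
    """Return the address of the first stored word equal to pattern, or None."""
    # Hardware compares all rows at once; the loop only models the outcome.
    for address, word in enumerate(table):
        if word == pattern:
            return address
    return None


def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_address(table: List[str], pattern: str) -> int:
    """Minimum-distance classifier: address of the stored word closest to pattern."""
    return min(range(len(table)), key=lambda i: hamming_distance(table[i], pattern))


if __name__ == "__main__":
    stored = ["1010", "0110", "1111"]      # illustrative table contents
    print(cam_search(stored, "0110"))       # exact-match lookup
    print(nearest_address(stored, "0100"))  # lookup with one corrupted bit
```

The exact-match function corresponds to the router lookup, while the minimum-distance function corresponds to retrieval from a corrupted input pattern.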

Dynamic CMOS logic in general and domino logic in particular have a number of advantages for designing high-speed and low-power CMOS circuits. However, the main difficulty with domino logic is that it can implement only non-inverted logic. To implement inverted polarity, several circuit parts must be duplicated using inverted polarities of the inputs, which increases area and power dissipation. With static CMOS logic, on the other hand, it is simple to realize gates with both inverted and non-inverted logic [2].

Domino logic is known as a better logic family for implementing high-speed CMOS circuits. However, domino circuits have some inherent problems such as charge sharing, clock routing overhead and clock skew. Another difficulty with domino logic is that it can only implement non-inverting functions. As domino logic cannot implement inverters driving other domino gates, all parts of the gate that follow an inverter have to be implemented again with opposite polarities of the inputs, which increases area and power dissipation. The advantages of domino logic may come into question when a circuit contains a large number of inverters trapped at points where substantial duplication of domino gates is unavoidable.
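
The non-inverting restriction can be illustrated with a small Python check. The sketch below is an illustration under assumptions, not part of the original work: it treats a "non-inverting" gate as a monotone Boolean function (raising an input never lowers the output), and the function names are hypothetical. It shows that XOR is not monotone in its two inputs, but becomes realizable with AND/OR terms only once both polarities of each input are supplied, which is the duplication cost discussed above.

```python
from itertools import product


def is_monotone(f, n):
    """True if raising any input from 0 to 1 never lowers the output of f."""
    points = list(product((0, 1), repeat=n))
    for x in points:
        for y in points:
            # y dominates x when every input of y is >= the same input of x
            if all(a <= b for a, b in zip(x, y)) and f(*x) > f(*y):
                return False
    return True


def xor_single_rail(a, b):
    """XOR of two inputs; needs an inversion somewhere inside the logic."""
    return a ^ b


def xor_dual_rail(a, a_n, b, b_n):
    """XOR built only from AND/OR of true and complemented input rails."""
    return (a & b_n) | (a_n & b)


if __name__ == "__main__":
    print(is_monotone(xor_single_rail, 2))  # not realizable without inversion
    print(is_monotone(xor_dual_rail, 4))    # realizable once both rails exist
```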

1.2 MOTIVATIONS

The speed of a CAM comes at the cost of increased silicon area and power consumption, two design parameters that designers strive to reduce. As CAM applications grow, demanding larger CAM sizes, the power problem is further exacerbated. Reducing power consumption, without sacrificing speed or area, is the main thread of recent research in large-capacity CAMs.

The literature shows that a major portion of the power is consumed in the matchlines (ML) and searchlines (SL). Much research has been done to reduce the ML and SL power consumption. Previous works present schemes such as the low-swing scheme [3], [4], [5], the selective-precharge scheme [6], the current-race scheme [7] and the current-saving scheme [8], [9]. The low-swing scheme reduces the ML power by reducing the ML voltage. The selective-precharge scheme reduces match-line power consumption by breaking the search into two segments and observing that the second segment is rarely activated. The current-race scheme limits the ML voltage swing to VDD/2 and precharges the MLs to ground instead of VDD. In this scheme, as the SL is not precharged, there is almost a 50% reduction in SL power consumption [1] compared to the precharge-high scheme [10], [11]. The current-saving scheme is an improved version of the current-race scheme which allocates less power to match decisions involving a large number of mismatched bits. It is reported in a survey [1] that the current-saving scheme consumes less power than other schemes [10], [6], [7]. Therefore, we have selected the current-race and current-saving schemes as the baselines for comparison with our proposed schemes.

In our proposed schemes, the matchlines are precharged to ground in the precharge stage, unlike the conventional precharge-high scheme, so that power consumption in the matchlines is low. In addition, precharging the matchline to ground eliminates the need for searchline precharge. So in the typical case, where about 50% of the search data bits toggle from cycle to cycle, there is a 50% reduction in searchline power compared to the precharge-high matchline-sensing schemes that have an SL precharge phase. Secondly, in our proposed charging control scheme, the charging of a fully matched matchline is reduced from VDD to VDD/3 and this voltage is sensed by our proposed sense amplifier. This reduces the power consumption greatly. In the scheme with high noise margin, only probable matched matchlines are charged to almost VDD. Most of the matchlines are charged far less, so in most of the matchlines the power consumption is very low.
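
The 50% searchline figure can be checked with a simple activity count. The sketch below is a hypothetical behavioural model, not the HSPICE simulation used in this thesis. It assumes that each search bit drives a differential searchline pair, that an SL precharge phase returns both lines to ground every cycle, and that every low-to-high transition costs the same energy. The function name and the random search data are illustrative assumptions.

```python
import random


def sl_rising_edges(search_words, precharge):
    """Count low-to-high searchline transitions over a sequence of search words."""
    edges = 0
    previous = None
    for word in search_words:
        for i, bit in enumerate(word):
            if precharge or previous is None:
                # Both lines start at ground, so one line of the pair must rise.
                edges += 1
            elif previous[i] != bit:
                # Lines hold their old values; a rise occurs only when the bit toggles.
                edges += 1
        previous = word
    return edges


if __name__ == "__main__":
    rng = random.Random(0)                       # fixed seed for repeatability
    words = [[rng.randint(0, 1) for _ in range(72)] for _ in range(2000)]
    with_precharge = sl_rising_edges(words, precharge=True)
    without_precharge = sl_rising_edges(words, precharge=False)
    print(without_precharge / with_precharge)    # close to one half for random data
```

With uniformly random search data about half of the bits toggle between consecutive searches, so the count without SL precharge is about half of the count with it.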

1.3 OBJECTIVES

The objective of this investigation and research was to find a way or scheme to design high-speed and low-power dynamic circuits. We specifically tried to design some good schemes for Content Addressable Memory (CAM). CAM is a special type of memory array which provides a hardware search system, where the information or data to be searched enters the two-dimensional memory array and the array provides the search result (the address of the memory location where the data is found). To design a very high-speed CAM we have investigated the CAM cells, the charging controller and the sense amplifiers. As satisfactory work already exists on CAM cells, our objective was to design an appropriate scheme of charging controller and sense amplifier which consumes less power and operates at high speed. As most of the power is consumed in the matchlines and the matchlines are charged by the charging controller, the charging controller is responsible for most of the power consumption. So, one of our objectives was to design the charging controller intelligently. On the other hand, the higher the matchline voltage, the higher the power consumption. For that reason, our objective was to charge the matchline to only VDD/3 and design a sense amplifier which can sense this voltage as a high.
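
The relation between matchline voltage and energy can be made concrete with a first-order estimate. The sketch below is only a rough illustration under assumptions: it takes the energy drawn from the VDD rail to raise a capacitance C by a swing V as C x VDD x V, and it ignores the charging controller, the sense amplifier and leakage. The capacitance of 100 fF and the 1.8 V supply are hypothetical values chosen for illustration, not figures from this thesis.

```python
def supply_energy(c_ml, vdd, v_swing):
    """First-order energy drawn from the VDD rail to raise c_ml by v_swing."""
    return c_ml * vdd * v_swing


if __name__ == "__main__":
    C_ML = 100e-15   # hypothetical matchline capacitance in farads
    VDD = 1.8        # assumed supply voltage in volts

    full_swing = supply_energy(C_ML, VDD, VDD)
    reduced_swing = supply_energy(C_ML, VDD, VDD / 3)
    print(reduced_swing / full_swing)   # ratio of reduced-swing to full-swing energy
```

Under these assumptions the matchline energy scales linearly with the swing, so charging to VDD/3 draws about one third of the full-swing energy for the same capacitance.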

1.4 THESIS ORGANIZATION

The continued increase in transistor leakage current with the advancement of process technology impacts the leakage power and noise sensitivity of dynamic circuits more than those of static circuits. This thesis proposes methods and techniques to achieve the goal of power-efficient design of CAM circuits, used commonly in high-performance cache memory, while improving or maintaining their area, performance and noise robustness.

Chapter 2 describes different types of logic circuits, the study of which is needed to understand the significance of dynamic circuits. As CAM is based on dynamic logic, investigation of dynamic logic is very important.

Chapter 3 underscores the significance of previous schemes for reducing power consumption and compares their performance and structure in detail.

Chapter 4 describes one of the proposed schemes, named the charging control scheme. This work was also accepted and presented at the International Conference on Solid State Devices and Materials (SSDM) 2008, Tsukuba, Japan.


Chapter 5 describes another scheme, which contains a simplified version of the charging controller of the scheme described in Chapter 4. It was also accepted at TENCON 2008, Hyderabad, India.

Chapter 6 describes a scheme which has a high noise margin. The principle of this technique is to "carry coal to Newcastle": only the matchlines that are probable matches are charged further. This scheme charges only a small number of matchlines to VDD. For that reason, a large amount of power is saved.

Finally, Chapter 7 presents a summary of the methods and techniques proposed in the thesis for low-power CAM circuits in high-performance memory arrays. It also gives suggestions for extending the current research in possible future work.


CHAPTER 2

DIFFERENT TYPES OF LOGIC


2.1 INTRODUCTION

In this chapter, different types of logic circuits are described. Logic circuits such as ratioed logic, CMOS logic, dynamic logic and domino logic are studied. This study is needed to understand the significance of dynamic circuits. As CAM is based on dynamic logic, investigation of dynamic logic is very important.

2.2 RATIOED LOGIC

Ratioed logic is an attempt to reduce the number of transistors required to implement

a logic function, often at the cost of reduced robustness and extra power dissipation.

2.2.1 Load Resistance RL

The main goal is to reduce the number of transistors at the cost of reduced robustness and extra power dissipation. This arrangement is not often used because of the large space requirements of resistors produced in a silicon substrate. If the pull-down network is off, the static power is zero. If the pull-down network is on, there is some static power dissipation.

Fig. 2. 1: Resistor Pull-up


2.2.2 nMOS depletion mode transistor pull up

Power dissipation is high since rail to rail current flows when input = logical 1.

Switching of output from 1 to 0 begins when input voltage exceeds the Vt of the pull down

device.

Fig. 2. 2: nMOS depletion mode transistor pull up

When switching the output from 1 to 0, the pull up device is non-saturated initially and this

presents lower resistance through which to charge capacitive loads.

2.2.3 nMOS enhancement mode pull up

If the gate of the pull-up transistor is connected to VDD, it is called an nMOS enhancement mode pull up. Power dissipation is high since current flows when the input voltage = logical 1. The output voltage can never reach VDD (logical 1), because the pull-up transistor drops a threshold voltage.

Fig. 2. 3: nMOS enhancement mode pull up


2.2.4 Pseudo-nMOS logic

If we replace the depletion mode pull-up transistor of the standard nMOS circuits with a p-transistor with its gate connected to VSS, we have a structure similar to the nMOS equivalent. This approach to logic design is illustrated in Fig. 2.4.

Fig. 2. 4: Pseudo-nMOS Logic

The circuit arrangements look and behave much like nMOS circuits and appropriate ratio

rules must be applied.

2.3 COMPLEMENTARY TRANSISTOR PULL-UP (CMOS)

In CMOS we use both a pull-up network and a pull-down network. As a result, there is no static power dissipation, because no current flows for either logical 0 or logical 1

inputs. Full logical 1 and 0 levels are presented at the output. For devices of similar

dimensions the p-channel is slower than the n-channel device.

Fig. 2. 5: Complementary transistor pull-up (CMOS)


2.3.1 Components of Total Power Dissipation in CMOS Circuits

One of the major design challenges in high-performance digital integrated circuits is

the minimization of the total power dissipation. Total power consumption in digital CMOS

circuits can be divided into three major components: a) Switching power or dynamic power,

b) Short-circuit power and c) Static or leakage power. Equation 1.1 defines total power and its three components in a simplified form [1.1]

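\[
\begin{aligned}
Power_{TOTAL} &= Power_{DYNAMIC} + Power_{SHORT\text{-}CIRCUIT} + Power_{LEAKAGE} \\
Power_{DYNAMIC} &= AF \cdot F \cdot C \cdot Vdd^{2} \\
Power_{SHORT\text{-}CIRCUIT} &= \left( \frac{T_{rise} + T_{fall}}{2} \right) \cdot Vdd \cdot I_{PEAK} \\
Power_{LEAKAGE} &= I_{LEAKAGE} \cdot Vdd
\end{aligned}
\tag{1.1}
\]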

Dynamic power is a result of the power consumed in charging and discharging

various device and wire capacitances in the circuit. As seen from Equation 1.1, this

component of power depends on the switching activity factor (probability that a power

consuming transition occurs) AF, clock frequency F, the capacitances C being charged or

discharged and square of supply voltage, Vdd. In long channel transistors the dynamic power

is the dominant component of the total power. However, this is not the case in advanced

technologies, as leakage power is becoming a significant component of total power.

Fig. 2. 6: CMOS inverter current versus Vin


Short-circuit power is dissipated when there is a direct conducting path between

power supply (Vdd) and ground (Vss). Since the input signals to the logic gates have a non-

zero/finite slope or edge rate, there is a direct path current between Vdd and Vss for a period

of time during which both the PMOS and NMOS devices conduct simultaneously. The

magnitude of this current is given by the actual transistor widths and the on-state saturation current (IDSAT). The duration for which the current flows depends on the signal rise and fall

times and increases as the signal slopes degrade. The overall short-circuit power can be

obtained by integrating the total current over the duration of short circuit and then

multiplying with Vdd. In a simplified form it can be calculated as shown in Equation 1.1,

where Trise and Tfall are the rise and fall times respectively and IPEAK is the peak short circuit

current.

Static or leakage power dissipation is due to the leakage current, ILEAKAGE, that flows

between power rails in the absence of any switching activity. Three major sources of leakage

current are: a) current flowing through reverse biased P-N diode junctions of the transistors located between the source or drain and substrate, b) subthreshold leakage current between

source and drain when gate-source voltage, Vgs, is smaller than the threshold voltage, Vt of

the transistors and c) gate leakage current via the gate tunneling mechanism.

Because of the quadratic dependence of dynamic power on Vdd, reducing this voltage

is the most effective approach to minimize dynamic power dissipation. However, reducing

the supply voltage necessitates the reduction of threshold voltage to avoid serious degradation

of performance. Unfortunately, reducing threshold voltage causes the sub-threshold leakage

current to increase exponentially.
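The trade-off described above can be illustrated numerically. The following is only a minimal sketch: the dynamic term follows Equation 1.1, while the subthreshold model (leakage growing tenfold for every `swing` volts of threshold reduction) and all numerical values are illustrative assumptions, not data from this work.

```python
def dynamic_power(af, f, c, vdd):
    """Dynamic term of Equation 1.1: AF * F * C * Vdd^2."""
    return af * f * c * vdd ** 2


def subthreshold_current(i0, vt, swing=0.1):
    """Assumed illustrative model: current rises 10x per `swing` volts of Vt reduction."""
    return i0 * 10 ** (-vt / swing)


# Hypothetical operating point: AF = 0.1, F = 100 MHz, C = 1 pF.
p_high = dynamic_power(af=0.1, f=100e6, c=1e-12, vdd=1.8)
p_low = dynamic_power(af=0.1, f=100e6, c=1e-12, vdd=0.9)
print(f"Halving Vdd cuts dynamic power by a factor of {p_high / p_low:.1f}")

# Lowering Vt from 0.4 V to 0.3 V under the assumed 100 mV/decade swing.
ratio = subthreshold_current(1.0, 0.3) / subthreshold_current(1.0, 0.4)
print(f"Subthreshold leakage grows by a factor of about {ratio:.1f}")
```

The quadratic saving in dynamic power is thus paid for by an exponential increase in leakage once the threshold voltage is lowered to recover speed.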

2.4 DYNAMIC CIRCUITS

Ratioed circuits reduce the input capacitance by replacing the pMOS transistors

connected to the inputs with a single resistive pull-up. The drawbacks of ratioed circuits

include slow resistive transitions, contention on the falling transitions, static power

dissipation and a non-zero VOL. Dynamic circuits circumvent these drawbacks by using a

clocked pull-up transistor rather than a pMOS that is always ON.

Fig. 2. 7: Precharge and evaluation of dynamic gates.


Dynamic circuit operation is divided into two modes, shown in Fig. 2.7. During

precharge, the clock (CLK) is 0, so the clocked pMOS is ON and initializes the output Y

high. During evaluation, the clock is 1 and the clocked pMOS turns OFF. The output may

remain high or may be discharged low through the pull-down network.

Fig. 2. 8: Footed dynamic inverter

Dynamic circuits are the fastest commonly used circuit family because they have

lower input capacitance and no contention during switching. They also have zero static power

dissipation. However, they require careful clocking, consume significant dynamic power, and

are sensitive to noise during evaluation.

In Fig. 2.9 if the input is 1 during precharge, contention will take place because both

the pMOS and nMOS transistors will be ON.

Fig. 2. 9: Unfooted dynamic gates.


When the input cannot be guaranteed to be 0 during precharge, an extra clocked

evaluation transistor can be added to the bottom of the nMOS stack to avoid contention as

shown in Fig. 2.10. The extra transistor is sometimes called a foot. Fig. 2.10 shows generic

footed gates.

Fig. 2. 10: Generalized footed gates.

Fig. 2.11 estimates the falling logical effort of both footed and unfooted dynamic

gates. As usual, the pull-down transistors widths are chosen to give unit resistance. Precharge

occurs while the gate is idle and often may take place more slowly. Therefore, the precharge transistor width is chosen for twice unit resistance. This reduces the capacitive load on the

clock and the parasitic capacitance at the expense of greater rising delays.

Fig. 2. 11: Logical effort of footed and unfooted dynamic gates

[Fig. 2.11 labels, dynamic inverter: unfooted gate gd = 1/3, pd = 2/3; footed gate gd = 2/3, pd = 3/3]


Footed gates have higher logical effort than their unfooted counterparts but are still an

improvement over static logic. In practice, the logical effort of footed gates is better than

predicted because velocity saturation means series nMOS transistors have less resistance

than we estimate. The size of the foot can be increased relative to the other nMOS transistors

to reduce logical effort of the other inputs at the expense of greater clock loading. Like pseudo-nMOS gates, dynamic gates are particularly well suited to wide NOR functions or

multiplexers because the logical effort is independent of the number of inputs.

A fundamental difficulty with the dynamic circuits is the monotonicity requirement.

While a dynamic gate is in evaluation, the inputs must be monotonically rising. That is, the

input can start LOW and remain LOW, start LOW and rise HIGH, start HIGH and remain

HIGH, but not start HIGH and fall LOW. Fig. 2.12 shows waveforms for a footed dynamic

inverter in which the input violates monotonicity. During precharge, the output is pulled

HIGH. When the clock rises, the input is HIGH so the output is discharged LOW through the

pull-down network, as would happen in an inverter.

Fig. 2. 12: Monotonicity problem

The input later falls LOW, turning off the pull-down network. However, the

precharge transistor is also OFF, so the output floats, staying LOW rather than rising as it

would in a normal inverter. The output will remain low until the next precharge step. In

summary, the inputs must be monotonically rising for the dynamic gate to compute the

correct function.
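The monotonicity problem can be reproduced with a small behavioural model. This is only a sketch under simplifying assumptions (ideal switches, no leakage, one sample per time step); the function name and the input sequences are hypothetical.

```python
def footed_dynamic_inverter(clk_seq, a_seq):
    """Behavioural model of a footed dynamic inverter; returns Y at each time step."""
    y = 1
    outputs = []
    for clk, a in zip(clk_seq, a_seq):
        if clk == 0:
            y = 1  # precharge: the clocked pMOS pulls Y HIGH
        elif a == 1:
            y = 0  # evaluation: the pull-down network discharges Y
        # otherwise Y floats and keeps its previous value
        outputs.append(y)
    return outputs


clk = [0, 1, 1, 1, 0]  # precharge, three evaluation steps, precharge
a = [1, 1, 0, 0, 0]    # input falls during evaluation, violating monotonicity

dynamic_y = footed_dynamic_inverter(clk, a)
static_y = [1 - bit for bit in a]  # what an ordinary static inverter would output

print(dynamic_y)  # [1, 0, 0, 0, 1]
print(static_y)   # [0, 0, 1, 1, 1]
```

During the second and third evaluation steps the dynamic output stays LOW while a static inverter would have risen, which is the incorrect behaviour described above.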


Unfortunately, the output of a dynamic gate begins HIGH and monotonically falls

LOW during evaluation. This monotonically falling output X is not a suitable input to a

second dynamic gate expecting monotonically rising signals as shown in Fig. 2.13. Dynamic

gates sharing the same clock cannot be directly connected. This problem is often overcome

with domino logic.

Fig. 2. 13: Incorrect connection of dynamic gates.


2.5 DOMINO LOGIC

Domino logic circuits find wider applications in high performance microprocessors

due to their superior speed and area characteristics as compared to static CMOS circuits. But

their noise margins are low, making them more prone to noise. Various leakage reduction techniques are also applied to domino logic circuits to reduce leakage. As the technology

is scaled below 130nm, noise margin becomes a critical issue and hence techniques that

provide high noise immunity become necessary in order to have reliable circuits.

The monotonicity problem can be solved by placing a static CMOS inverter between

dynamic gates as shown in the figure. This converts the monotonically falling output into a

monotonically rising signal suitable for the next gate. The dynamic-static pair together is called a domino gate. A single clock can be used to precharge and evaluate all the logic gates within the chain. Because the inverter output only rises during evaluation, the static inverter is usually a HI-skew gate to favor this rising output. We may observe that precharge occurs in parallel, but evaluation occurs sequentially.

A standard domino logic circuit with a keeper is shown in Fig. 2.14. A standard

domino logic circuit consists of an n-type dynamic logic block followed by a static inverter.

During precharge, the output of the dynamic gate is charged to Vdd and the output of the

inverter is set to 0.

Fig. 2. 14: Standard Domino Logic circuit


During evaluation, the inverter makes conditional transition from 0 to 1. If the output

of the domino gate is fed to other domino gates, then it must be ensured that all inputs are set

to 0 at the end of the precharge phase and the transitions during evaluation are only 0 to 1.

Hence the dynamic node discharges only when the previous stage evaluates to 1 and a high

fan-out is achieved due to the static inverter present at the output. To counteract the leakage issues and to establish a low impedance path, a bleeder transistor (keeper) is connected in the

feedback path.

Fig. 2. 15: Weak keeper implementation

The function of the keeper is to compensate the charge lost due to the pull-down

leakage paths. But the keeper is fully turned on at the beginning of the evaluation phase.

When the pull down network is ON, then there exists a contention between this and keeper

transistor, which degrades the speed of domino circuits. Traditionally, a minimum sized

keeper is used to minimize delay and power degradation caused by the contention current. A small keeper, however, cannot provide the necessary noise immunity for reliable operation in an

increasingly noisy and noise-sensitive on-chip environment. Therefore, there is a tradeoff

between the high speed/energy efficient operation and reliability in domino logic. Hence,

keeper sizing is important in deep sub micron circuits.



CHAPTER 3

CONTENT ADDRESSABLE MEMORY (CAM) REVIEW

3.1 INTRODUCTION

Content addressable memories (CAMs) are memories that can search the entire

memory in parallel and output the location of entries that hold a match to the key value.

Today the increased need for faster searches, larger table sizes and wider data widths makes CAMs a more attractive solution than the less expensive RAM-based software solution.

CAMs are now much needed where quick searches of a database, a list, or a pattern are in

order. In fact, CAMs can be a determining factor for a wide range of applications such as

local-area networks, file storage management, artificial intelligence, database management

and pattern recognition.

A Content-Addressable Memory (CAM) searches for data by its content and returns

the address of the matching data. This feature is used extensively in applications such as

internet routers to channel incoming packets towards their destination addresses contained in

the packet header. Energy per search and search speed are two important metrics used to evaluate CAM performance [12]. Content addressable memory (CAM), a high-performance lookup engine in many systems, is so power-consuming that any saving becomes very significant in the whole system. CAM has three major power-sinking sources: evaluation power, input transition power and clocking power, all of which are discussed in this research. After that, a new low-power CAM design is proposed here. Its implementation in a 0.35-µm process operates at 83.3 MHz with a power performance metric of 45.5 fJ/bit/search or

equivalently 372 mJ/bit/search/m for random inputs. Two modified circuit structures for

binary static CAM cells are also proposed. We have proved that under most conditions cell

layout is smaller by this modification.

A Content Addressable Memory (CAM) compares input search data against a table of

stored data, and returns the address of the matching data [13]-[17]. CAMs have a single

clock cycle throughput making them faster than other hardware and software-based search

systems. CAMs can be used in a wide variety of applications requiring high search speeds.

These applications include parametric curve extraction [18], Hough transformation [19],

Huffman coding/decoding [20], [21], Lempel-Ziv compression [22]-[25], and image coding

[26]. The primary commercial application of CAMs today is to classify and forward Internet

protocol (IP) packets in network routers [27]-[32]. In networks like the Internet, a message

such as an e-mail or a Web page is transferred by first breaking up the message into small data packets of a few hundred bytes, and, then, sending each data packet individually through


the network. These packets are routed from the source, through the intermediate nodes of the

network (called routers), and reassembled at the destination to reproduce the original

message. The function of a router is to compare the destination address of a packet to all

possible routes, in order to choose the appropriate one. A CAM is a good choice for

implementing this lookup operation due to its fast search capability.

However, the speed of a CAM comes at the cost of increased silicon area and power

consumption, two design parameters that designers strive to reduce. As CAM applications

grow, demanding larger CAM sizes, the power problem is further exacerbated. Reducing

power consumption, without sacrificing speed or area, is the main thread of recent research in

large capacity CAMs. In this research, we survey developments in the CAM area at two

levels: circuits and architectures. Before providing an outline of this research at the end of

this section, we first briefly introduce the operation of CAM and also describe the CAM

application of packet forwarding.

Fig. 3. 1: Simple schematic model of a 4x3 CAM array showing the core memory cells,

differential search lines, match lines and encoder

Fig. 3.1 shows a simplified block diagram of a CAM. The input to the system is the

search word that is broadcast onto the searchlines to the table of stored data. The number of

bits in a CAM word is usually large, with existing implementations ranging from 36 to 144

bits. A typical CAM employs a table size ranging between a few hundred entries to 32K

entries, corresponding to an address space ranging from 7 bits to 15 bits. Each stored word

has a matchline that indicates whether the search word and stored word are identical (the

match case) or are different (a mismatch case, or miss). The matchlines are fed to an encoder

that generates a binary match location corresponding to the matchline that is in the match

state. An encoder is used in systems where only a single match is expected. In CAM

applications where more than one word may match, a priority encoder is used instead of a

simple encoder. A priority encoder selects the highest priority matching location to map to

the match result, with words in lower address locations receiving higher priority. In addition,

there is often a hit signal (not shown in the figure) that flags the case in which there is no matching location in the CAM. The overall function of a CAM is to take a search word and


return the matching memory location. One can think of this operation as a fully

programmable arbitrary mapping of the large space of the input search word to the smaller

space of the output match location.
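The search operation described above can be summarised in a short behavioural model. This is only a sketch: the function name `cam_search`, the representation of words as bit strings and the sample table contents are assumptions made for illustration, and the hit flag here is simply true when at least one word matches.

```python
def cam_search(table, key):
    """Compare `key` against every stored word and priority-encode the result."""
    # One matchline per stored word; in hardware all comparisons occur in parallel.
    matchlines = [word == key for word in table]
    hit = any(matchlines)
    # Priority encoder: the lowest matching address has the highest priority.
    address = matchlines.index(True) if hit else None
    return matchlines, hit, address


# Hypothetical 4-word by 3-bit table, the same size as the array of Fig. 3.1.
table = ["101", "010", "101", "111"]

print(cam_search(table, "101"))  # ([True, False, True, False], True, 0)
print(cam_search(table, "000"))  # ([False, False, False, False], False, None)
```

The first search matches at addresses 0 and 2, and the priority encoder reports address 0; the second search matches nowhere, so no address is returned.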

Fig. 3.2: Profile of CAM capacity (log scale) versus year of publication [33]-[40]. (Axes: memory size in bits, 10k to 8M; year, 1986 to 2005.)

The operation of a CAM is like that of the tag portion of a fully associative cache. The

tag portion of a cache compares its input, which is an address, to all addresses stored in the

tag memory. In the case of match, a single matchline goes high, indicating the location of a

match. Unlike CAMs, caches do not use priority encoders since only a single match occurs;

instead, the matchline directly activates a read of the data portion of the cache associated with

the matching tag. Many circuits are common to both CAMs and caches; however, we focus

on large-capacity CAMs rather than on fully associative caches, which target smaller

capacity and higher speed. Today's largest commercially available single-chip CAMs are 18

Mbit implementations, although the largest CAMs reported in the literature are 9 Mbit in size

[33], [40]. As a rule of thumb, the largest available CAM chip is usually about half the size of

the largest available SRAM chip. This rule of thumb comes from the fact that a typical CAM

cell consists of two SRAM cells, as we will see shortly. Fig. 3.2 plots (on a logarithmic scale)

the capacity of published CAM chips [33]-[40] versus time from 1985 to 2004, revealing an

exponential growth rate typical of semiconductor memory circuits and the factor-of-two

relationship between SRAM and CAM.


3.2 CORE CELLS AND MATCHLINE STRUCTURE

A CAM cell serves two basic functions: bit storage (as in RAM) and bit comparison

(unique to CAM). Fig. 3.3 shows a NOR-type CAM cell [Fig. 3.3(a)] and the NAND-type CAM cell [Fig. 3.3(b)]. The bit storage in both cases is an SRAM cell where cross-coupled

inverters implement the bit-storage nodes D and DB. To simplify the schematic, we omit the

nMOS access transistors and bitlines which are used to read and write the SRAM storage bit.

Although some CAM cell implementations use lower area DRAM cells [3.27], [3.31],

typically, CAM cells use SRAM storage. The bit comparison, which is logically equivalent to

an XOR of the stored bit and the search bit is implemented in a somewhat different fashion in

the NOR and the NAND cells.

3.2.1 Structure of NOR Cell

A NOR cell implements the comparison between the complementary stored bit, D (and DB), and the complementary search data on the complementary searchline, SL (and SLB), using four comparison transistors, M1 through M4, which are all typically minimum-size to maintain high cell density. These transistors implement the pull down path of a dynamic XNOR logic gate with inputs SL and D. Each pair of transistors, M1/M3 and M2/M4, forms a pull down path from the matchline, ML, such that a mismatch of SL and D activates at least one of the pull down paths, connecting ML to ground. A match of SL and D disables both pull down paths, disconnecting ML from ground. The NOR nature of this cell becomes clear when multiple cells are connected in parallel to form a CAM word by shorting the ML of each cell to the ML of adjacent cells. The pull down paths connect in parallel, resembling the pull down path of a CMOS NOR logic gate. There is a match condition on a given ML only if every individual cell in the word has a match.

Fig. 3.3: CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells.
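The behaviour of a NOR word can be sketched at the logic level as follows. This is a simplified model, not a circuit simulation: each cell is assumed to have one pull down path that conducts when D = 1 and SL = 0 and another that conducts when D = 0 and SL = 1, and the function names are hypothetical.

```python
def nor_cell_pulls_down(d, sl):
    """True if either series pull down path of a binary NOR cell conducts."""
    db, slb = 1 - d, 1 - sl
    # One path is gated by (D, SLB), the other by (DB, SL); a mismatch enables one.
    return bool((d and slb) or (db and sl))


def nor_matchline(stored, search):
    """State of a precharged-high matchline after evaluation (True means match)."""
    # The cells are connected in parallel, so a single miss discharges ML.
    return not any(nor_cell_pulls_down(d, s) for d, s in zip(stored, search))


print(nor_matchline([1, 0, 1], [1, 0, 1]))  # True: every bit matches, ML stays high
print(nor_matchline([1, 0, 1], [1, 1, 1]))  # False: one miss pulls ML to ground
```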


3.2.2 Structure of NAND Cell

The NAND cell implements the comparison between the stored bit, D, and corresponding search data on the corresponding searchlines, (SL, SLB), using the three comparison transistors M1, MD and MDB, which are all typically minimum-size to maintain high cell density. We illustrate the bit-comparison operation of a NAND cell through an example. Consider the case of a match when SL=1 and D=1. Pass transistor MD is ON and passes the logic 1 on the SL to node B. Node B is the bit-match node, which is logic 1 if there is a match in the cell. The logic 1 on node B turns ON transistor M1. Note that M1 is also turned ON in the other match case, when SL=0 and D=0. In this case, the transistor MDB passes a logic high to raise node B. The remaining cases, where SL ≠ D, result in a miss condition, and accordingly node B is logic 0 and the transistor M1 is OFF. Node B is a pass-transistor implementation of the XNOR function of SL and D. The NAND nature of this cell becomes clear when multiple NAND cells are serially connected. In this case, the MLn and MLn+1 nodes are joined to form a word. A serial nMOS chain of all the Mi transistors resembles the pull down path of a CMOS NAND logic gate. A match condition for the entire word occurs only if every cell in a word is in the match condition. An important property of the NOR cell is that it provides a full rail voltage at the gates of all comparison transistors. On the other hand, a deficiency of the NAND cell is that it provides only a reduced logic 1 voltage at node B, which can reach only VDD - Vtn when the searchlines are driven to VDD (where VDD is the supply voltage and Vtn is the nMOS threshold voltage).
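A logic-level sketch of the NAND word is given below. It ignores the reduced VDD - Vtn level at node B and treats every transistor as an ideal switch; the function names are hypothetical.

```python
def nand_node_b(d, sl):
    """Logic value at the bit-match node B of one NAND cell."""
    # MD passes SL when D = 1; MDB passes SLB when D = 0, giving XNOR(SL, D).
    return sl if d == 1 else 1 - sl


def nand_chain_conducts(stored, search):
    """True if every series transistor is ON, so the chain conducts (word match)."""
    return all(nand_node_b(d, s) == 1 for d, s in zip(stored, search))


print(nand_chain_conducts([1, 0, 1], [1, 0, 1]))  # True: all cells match
print(nand_chain_conducts([1, 0, 1], [0, 0, 1]))  # False: the first cell breaks the chain
```

Because the transistors are in series, a single mismatching cell is enough to block the path, which is the opposite of the NOR word, where a single mismatching cell is enough to create a path.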

3.2.3 Ternary Cells

Usually two types of ternary cell are used. The NOR and NAND cells that have been presented are binary CAM cells. Such cells store either a logic 0 or a logic 1. Ternary cells, in addition, store an X value. The X value is a don't care that represents both 0 and 1, allowing a wildcard operation. Wildcard operation means that an X value stored in a cell causes a match regardless of the input bit. As discussed earlier, this is a feature used in packet forwarding in Internet routers. A ternary symbol can be encoded into two bits according to Table 3.1. We represent these two bits as D and DB. Note that although D and DB are not necessarily complementary, we maintain the complementary notation for consistency with the binary CAM cell. Since two bits can represent four possible states, but ternary storage requires only three states, we disallow the state where D and DB are both zero. To store a ternary value in a NOR cell, we add a second SRAM cell, as shown in Fig. 3.4(a). One bit, D, connects to the left pull down path and the other bit, DB, connects to the right pull down path, making the pull down paths independently controlled. We store an X by setting both D and DB equal to logic 1, which disables both pull down paths and forces the cell to match regardless of the inputs. We store a logic 1 by setting D=1 and DB=0 and store a logic 0 by setting D=0 and DB=1. In addition to storing an X, the cell allows searching for an X by setting both SL and SLB to logic 0. This is an external don't care that forces a match of a bit regardless of the stored bit.
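The ternary NOR encoding can be checked with a short model. This is a sketch under an inferred assumption: since D = DB = 1 disables both pull down paths, each path is taken to be enabled when its stored bit is 0, with the left path in series with SL and the right path in series with SLB. The dictionary and function names are hypothetical.

```python
# Encodings of Table 3.1: stored symbol -> (D, DB), search symbol -> (SL, SLB).
STORED = {"0": (0, 1), "1": (1, 0), "X": (1, 1)}
SEARCH = {"0": (0, 1), "1": (1, 0), "X": (0, 0)}


def ternary_cell_matches(stored_sym, search_sym):
    """True if neither pull down path of the ternary NOR cell conducts."""
    d, db = STORED[stored_sym]
    sl, slb = SEARCH[search_sym]
    # Assumed wiring: left path ON when D = 0 and SL = 1, right when DB = 0 and SLB = 1.
    pulls_down = (d == 0 and sl == 1) or (db == 0 and slb == 1)
    return not pulls_down


def ternary_word_matches(stored_word, search_word):
    """A NOR word matches only if every cell matches."""
    return all(ternary_cell_matches(s, k) for s, k in zip(stored_word, search_word))


print(ternary_word_matches("10XX", "1011"))  # True: the stored X bits act as wildcards
print(ternary_word_matches("10XX", "1111"))  # False: the second bit misses
print(ternary_word_matches("1010", "1X10"))  # True: searching for X masks that bit
```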


Table 3.1: Truth Table for NOR Cell

Value    Stored        Search
         D    DB       SL   SLB
0        0    1        0    1
1        1    0        1    0
X        1    1        0    0

Table 3.2: Truth Table for NAND Cell

Value    Stored Bit    Search Bit
         D    M        SL   SLB
0        0    0        0    1
1        1    0        1    0
X        0    1        1    1
X        1    1        1    1

Fig. 3. 4: Structure of Ternary core cells for (a) NOR-type (b) NAND-type CAM [41], [42].


Although storing an X is possible only in ternary CAMs, an external X symbol

is possible in both binary and ternary CAMs. In cases where ternary operation is needed but

only binary CAMs are available, it is possible to emulate ternary operation using two binary

cells per ternary symbol.

One proposed modification to the ternary NOR cell of Fig. 3.4(a) implements the pull down transistors M1-M4 using pMOS devices and complements the logic levels of the searchlines and matchlines accordingly. Using pMOS transistors (instead of nMOS

transistors) for the comparison circuitry allows for a more compact layout, due to reducing

the number of spacings of p-diffusions to n-diffusions in the cell. In addition to increased

density, the smaller area of the cell reduces wiring capacitance and therefore reduces power

consumption. The tradeoff that results from using minimum-size pMOS transistors, rather

than minimum-size nMOS transistors, is that the pulldown path will have a higher equivalent

resistance, slowing down the search operation.

A NAND cell can be modified for ternary storage by adding storage for a mask bit at

node M, as depicted in Fig. 3.4(b) [41], [42]. When storing an X, we set this mask bit to

1. This forces transistor Mmask ON, regardless of the value of D, ensuring that the cell

always matches. In addition to storing an X, the cell allows searching for an X by setting

both SL and SLB to logic 1. Table 3.2 lists the stored encoding and search-bit encoding for

the ternary NAND cell. Further minor modifications to CAM cells include mixing parts of the

NAND and NOR cells, using dynamic-threshold techniques in silicon-on-insulator (SOI)

processes, and alternating the logic level of the pull down path to ground in the NOR cell

[44]-[46].

Currently, the NOR cell and the NAND cell are the prevalent core cells for providing

storage and comparison circuitry in CMOS CAMs. For a comprehensive survey of the

precursors of CMOS CAM cells refer to [47].

3.3 MATCHLINE SENSING SCHEMES

This section reviews matchline sensing schemes that generate the match result. First,

we review the conventional precharge high scheme, then introduce several variations that

save power.

3.3.1 Conventional (Precharge-High) Matchline Sensing

We review the basic operation of the conventional precharge-high scheme and look at

sensing speed, charge sharing, timing control and power consumption.


3.3.1.1 Basic Operation

The basic scheme for sensing the state of the NOR matchline is first to precharge the matchline high and then evaluate by allowing the NOR cells to pull down the matchlines in the case of a miss, or leave the matchline high in the case of a match. Fig. 3.5(a) shows, in schematic form, an implementation of this matchline-sensing scheme. Fig. 3.5(b) shows the signal timing, which is divided into three phases: SL precharge, ML precharge, and ML evaluation. The operation begins by asserting slpre to precharge the searchlines low, disconnecting all the pull down paths in the NOR cells. With the pull down paths disconnected, the operation continues by asserting mlpreb to precharge the matchline high. Once the matchline is high, both slpre and mlpreb are de-asserted. The ML evaluate phase begins by placing the search word on the searchlines. If there is at least one single-bit miss on the matchline, a path (or multiple paths) to ground will discharge the matchline, ML, indicating a miss for the entire word, which is output on the MLSA sense-output node, called MLso. If all bits on the matchline match, the matchline will remain high, indicating a match for the entire word. Using this sketch of the precharge-high scheme, we will investigate the performance of matchline sensing in terms of speed, robustness, and power consumption. The matchline power dissipation is one of the major sources of power consumption in CAM.

Fig. 3.5: (a) The schematic with precharge circuitry for matchline sensing using the precharge-high scheme, and (b) the corresponding timing diagram showing relative signal transitions [34].

3.3.1.2 Matchline power

In a typical system, the number of misses is expected to be much greater than the number of matches. A NOR matchline is discharged on every miss, whereas a NAND matchline is discharged only on a match; thus, using a dynamic NAND structure results in a significant reduction in power. Fig. 3.6 demonstrates the power advantage of a NAND architecture over a NOR architecture. However, a NAND matchline architecture is considerably slower than its NOR counterpart, especially for wide words.

Fig. 3. 6: Matchline power for NAND and NOR architecture [33]

3.3.1.3 Charge Sharing

There is a potential charge-sharing problem depending on whether the CAM storage

bits D and DB are connected to the top transistor or the bottom transistor in the pulldown

path.


Fig. 3. 7: Two possible configurations for the NOR cell: (a) the stored bit is connected to the

bottom transistors of the pulldown pair, and (b) the stored bit is connected to the top

transistors of the pulldown pair.

Fig. 3.7 shows these two possible configurations of the NOR cell. In the configuration

of Fig. 3.7(a), there is a charge-sharing problem between the matchline, ML, and nodes X1

and X2. Charge sharing occurs during matchline evaluation, which occurs immediately after

the matchline precharge-high phase. During matchline precharge, SL and SLB are both at

ground. Once the precharge completes, one of the searchlines is activated, depending on the

search data, causing either M1 or M2 to turn ON. This shares the charge at node X1 or node

X2 with that of ML, causing the ML voltage, VML, to drop, even in the case of match, which

may lead to a sensing error. To avoid this problem, designers use the configuration shown in

Fig. 3.7(b), where the stored bit is connected to the top transistors. Since the stored bit is constant during a search operation, charge sharing is eliminated.

3.3.1.4 Power Consumption

The dynamic power consumed by a single matchline that misses is due to the rising

edge during precharge and the falling edge during evaluation, and is given by the equation

$$P_{miss} = f\,C_{ML}\,V_{DD}^{2}$$

where f is the frequency of search operations and C_ML is the matchline capacitance. In the case of a match, the power

consumption associated with a single matchline depends on the previous state of the

matchline; however, since typically there are only a small number of matches we can neglect

this power consumption. Accordingly, the overall matchline power consumption of a CAM

block with w matchlines is

$$P_{ML} = w\,P_{miss}$$
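The two expressions above can be evaluated numerically as follows. This is only a sketch: the function names and the example values (64 matchlines, 100 MHz search rate, 100 fF matchline capacitance, 1.8 V supply) are assumptions chosen for illustration.

```python
def miss_power(f_search, c_ml, v_dd):
    """Dynamic power of one missing matchline: P_miss = f * C_ML * VDD^2."""
    return f_search * c_ml * v_dd ** 2


def matchline_power(w, f_search, c_ml, v_dd):
    """Total matchline power of a block with w matchlines.

    Follows the text in neglecting the few matching lines, so every
    matchline is counted as a miss: P_ML = w * P_miss.
    """
    return w * miss_power(f_search, c_ml, v_dd)


# Hypothetical operating point (not from the thesis).
p_ml = matchline_power(w=64, f_search=100e6, c_ml=100e-15, v_dd=1.8)
print(f"Matchline power: {p_ml * 1e3:.3f} mW")
```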


3.5 CURRENT RACE SCHEME

The current-race scheme is an important matchline-sensing scheme for the CAM architecture. Fig. 3.9(a)

shows a simplified schematic of the current-race scheme [51]. This scheme precharges the

matchline low and evaluates the matchline state by charging the matchline with a current IML

supplied by a current source. The signal timing is shown in Fig. 3.9(b). The precharge signal, mlpre, starts the search cycle by precharging the matchline low. Since the matchline is

precharged low, the scheme concurrently charges the searchlines to their search data values,

eliminating the need for a separate SL precharge phase required by the precharge-high

scheme of Fig. 3.5(b). Instead, there is a single SL/ML precharge phase, as indicated in Fig.

3.9(b). After the SL/ML precharge phase completes, the enable signal, enb, connects the

current source to the matchline. A matchline in the match state charges linearly to a high

voltage, while a matchline in the miss state charges to a voltage of only $I_{ML} R_{ML}/m$, where

m denotes the number of misses in cells connected to the matchline. By setting the maximum

voltage of a miss to be small, a simple matchline sense amplifier easily differentiates between a match state and a miss state and generates the signal MLso. As shown in Fig. 3.9, the

amplifier is the nMOS transistor, Msense, whose output is stored by a half-latch. The nMOS

sense transistor trips the latch with a threshold of Vtn. After some delay, matchlines in the

match state will charge to slightly above Vtn, tripping their latch, whereas matchlines in the miss

state will remain at a much smaller voltage, leaving their latch in the initial state. A simple

replica matchline (not shown) controls the shutoff of the current source and the latching of

the match signal. We derive the power consumption of this scheme by first noting that the

same amount of current is discharged into every matchline, regardless of the state of the

matchline. Looking at the match case for convenience, the power consumed to charge a

matchline to slightly above Vtn is

$$P_{match} = f\,C_{ML}\,V_{DD}\,V_{tn}$$

Since the power consumption of a match and a miss are identical, the overall power

consumption for all w matchlines is

$$P_{ML} = w\,P_{match}$$

This equation is identical to the low-swing scheme (previous equation) with

VMLswing=Vtn. The benefits of this scheme over the precharge-high schemes are the simplicity

of the threshold circuitry and the extra savings in searchline power due to the elimination of

the SL precharge phase.
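The behaviour described in this section can be sketched with a simple first-order model. It assumes an ideal current source, a linear ramp on a matching ML, and treats R_ML as the resistance of a single pulldown path; the function names and every numerical value are illustrative assumptions rather than values from [51] or from this thesis.

```python
def match_charge_time(c_ml, v_tn, i_ml):
    """Time for a matching ML (no pulldown path) to ramp linearly up to v_tn."""
    return c_ml * v_tn / i_ml


def miss_final_voltage(i_ml, r_ml, m):
    """Steady-state voltage of an ML with m missing cells: I_ML * R_ML / m."""
    if m < 1:
        raise ValueError("a miss needs at least one mismatching cell")
    return i_ml * r_ml / m


def current_race_power(w, f_search, c_ml, v_dd, v_tn):
    """Overall matchline power: P_ML = w * f * C_ML * VDD * Vtn."""
    return w * f_search * c_ml * v_dd * v_tn


# Hypothetical values: 100 fF matchline, 20 uA current source,
# 10 kOhm per pulldown path, Vtn = 0.45 V, VDD = 1.8 V.
c_ml, i_ml, r_ml, v_tn, v_dd = 100e-15, 20e-6, 10e3, 0.45, 1.8

t_match = match_charge_time(c_ml, v_tn, i_ml)
v_worst_miss = miss_final_voltage(i_ml, r_ml, m=1)  # a single miss is the hardest case
print(f"match reaches Vtn after {t_match * 1e9:.2f} ns")
print(f"one-bit miss settles at {v_worst_miss:.2f} V (must stay below Vtn)")

# Relative to the precharge-high expression w * f * C_ML * VDD^2,
# the power scales by Vtn / VDD when f, w and C_ML are equal.
ratio = current_race_power(1, 1.0, c_ml, v_dd, v_tn) / (c_ml * v_dd ** 2)
print(f"power relative to precharge-high: {ratio:.2f}")
```

With these assumed numbers the one-bit miss settles well below Vtn, which is the condition the simple nMOS sense transistor relies on.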


Fig. 3.9: Current-race matchline sensing [51]: (a) circuit implementation including precharge circuitry and (b) a timing diagram for a single search cycle.

The current-race scheme also allows a choice of CAM cell configuration, because the matchline is precharged low. With precharge low, there is no charge-sharing problem for either CAM cell configuration of Fig. 3.7, since the ML precharge level is the same as the level of the intermediate nodes X1 and X2. Rather than avoiding charge sharing, the criterion that determines which cell to use in this case is matching parasitic capacitances between

MLs. In the configuration of Fig. 3.7(b), the parasitic load on a matchline depends on the

ON/OFF state of M1 and of M2. Since different cells will have different stored data, there

will be variations in the capacitance CML among the MLs. However, in the configuration of

Fig. 3.7(a), the variation of parasitic capacitance on the matchline depends only on the states

of SL and SLB which are the same for all cells in the same column. Thus, the configuration

of Fig. 3.7(a) maintains good matching between MLs and prevents possible sensing errors

due to parasitic capacitance variations.

3.6 SELECTIVE PRECHARGE SCHEME

The selective precharge scheme arose from the idea of allocating ML power nonuniformly. The matchline-sensing techniques we have seen so far expend approximately

the same amount of energy on every matchline, regardless of the specific data pattern, and

whether there is a match or a miss. We now examine three schemes that allocate power to

matchlines nonuniformly. The first technique, called selective precharge, performs a match operation on the first few bits of a word before activating the search of the remaining bits


[52]. For example, in a 144-bit word, selective precharge initially searches only the first 3 bits

and then searches the remaining 141 bits only for words that matched in the first 3 bits.

Assuming a uniform random data distribution, the initial 3-bit search should allow only 1/2^3 = 1/8 of the words to survive to the second stage, saving about 88% of the matchline power. In practice, there are two sources of overhead that limit the power saving. First, to maintain speed, the initial match implementation may draw a higher power per bit than the search operation on

the remaining bits. Second, an application may have a data distribution that is not uniform,

and, in the worst-case scenario, the initial match bits are identical among all words in the

CAM, eliminating any power saving.
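The arithmetic behind the 144-bit example can be sketched as follows. The sketch assumes binary, uniformly random data (each bit matches with probability 1/2) and, in the second function, an equal energy cost per searched bit; both are simplifying assumptions, and the function names are introduced here for illustration only.

```python
def surviving_fraction(k):
    """Fraction of words expected to match a k-bit initial search."""
    return 0.5 ** k


def ideal_saving(k):
    """Matchline power saving if the k-bit initial search itself were free."""
    return 1.0 - surviving_fraction(k)


def saving_with_initial_cost(n_bits, k):
    """Saving when the k initial bits of every word are always searched.

    Assumes the same energy per bit in both stages (an assumption).
    """
    searched_bits = k + surviving_fraction(k) * (n_bits - k)
    return 1.0 - searched_bits / n_bits


print(f"surviving fraction: {surviving_fraction(3):.3f}")
print(f"ideal saving:       {ideal_saving(3):.1%}")
print(f"with initial cost:  {saving_with_initial_cost(144, 3):.1%}")
```

The idealized figure corresponds to the roughly 88% quoted above; charging the initial 3-bit search to every word lowers the estimate slightly, which illustrates the first source of overhead named in the text.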

Fig. 3.10 is a simplified schematic of an example of selective precharge similar to that

presented in the original paper [52]. The example uses the first bit for the initial search and

the remaining n-1 bits for the remaining search. To maintain speed, the implementation

modifies the precharge part of the precharge-high scheme [of Fig. 3.7(a) and (b)]. The ML is

precharged through the transistor M1, which is controlled by the NAND CAM cell and turned on only if there is a match in the first CAM bit. The remaining cells are NOR cells.

Fig. 3.10: Sample implementation of the selective-precharge matchline technique [52].

Note that the ML of the NOR cells must be pre-discharged (circuitry not shown) to

ground to maintain correct operation in the case that the previous search left the matchline

high due to a match. Thus, one implementation of selective precharge is to use this mixed

NAND/NOR matchline structure. Selective precharge is perhaps the most common method used to save power on matchlines [34], [53]-[57] since it is both simple to implement and can

reduce power by a large amount in many CAM applications.

3.7 PIPELINING SCHEME

More generally, an implementation may divide the matchline into any number of

segments, where a match in a given segment results in a search operation in the next segment

but a miss terminates the match operation for that word. A design that uses multiple

matchline segments in a pipelined fashion is the pipelined matchlines scheme [58], [59]. Fig.

3.11(a) shows a simplified schematic of a conventional NOR matchline structure where all

cells are connected in parallel. Fig. 3.11(b) shows the same set of cells as in Fig. 3.11(a), but

with the matchline broken into four matchline segments that are serially evaluated. If any

stage misses, the subsequent stages are shut off, resulting in power saving. The drawbacks of

this scheme are the increased latency and the area overhead due to the pipeline stages.
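The power saving of a segmented matchline can be estimated by counting how many segments are expected to be evaluated per word. The sketch below assumes binary, uniformly random data, so a segment of b bits matches with probability 2^-b; the function name and the segment widths are illustrative assumptions.

```python
def expected_active_segments(segment_bits):
    """Expected number of matchline segments evaluated for one stored word.

    A segment is evaluated only if every earlier segment matched.
    """
    expected = 0.0
    p_reach = 1.0  # probability that evaluation reaches the current segment
    for bits in segment_bits:
        expected += p_reach
        p_reach *= 0.5 ** bits  # the segment must match for the next to run
    return expected


# Hypothetical split of a word into four equal segments.
segments = [36, 36, 36, 36]
print(f"expected active segments: {expected_active_segments(segments):.6f} "
      f"of {len(segments)}")
```

Under these assumptions almost every word stops after its first segment, so most of the segments stay idle; real data that is not uniformly distributed would reduce this benefit.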


Fig. 3.11: Pipelined matchlines reduce power by shutting down after a miss in a stage.

By itself, a pipelined matchline scheme is not as compelling as basic selective precharge;

however, pipelining enables the use of hierarchical searchlines, thus saving power. Another

approach is to segment the matchline so that each individual bit forms a segment. Thus,

selective precharge operates on a bit-by-bit basis. In this design, the CAM cell is modified so

that the match evaluation ripples through each CAM cell. If at any cell there is a miss, the

subsequent cells do not activate, as there is no need for a comparison operation. The

drawback of this scheme is the extra circuitry required at each cell to gate the comparison

with the result from the previous cell.

Fig. 3.12: Simulated waveforms in the pipelined match-line architecture for (a) the full-match case consisting of a match in every stage, and (b) a miss case where the third stage results in a miss and turns off the subsequent stages [58].


Fig. 3.12 shows the simulated waveforms of the ML segments in the pipelined ML scheme. Fig. 3.12(a) shows a full match, as indicated by the rising ML in every segment along with the corresponding full-rail output of the MLSA. Fig. 3.12(b) shows an example of a word that misses in the third stage, as indicated by the lack of an MLSA output pulse. In this example, the ML sensing circuitry of the fourth and fifth segments is not activated, hence saving power.

3.8 CURRENT SAVING SCHEME

The current-saving scheme [60], [61] is another data-dependent matchline-sensing

scheme which is a modified form of the current-race sensing scheme. Recall that the current-

race scheme uses the same current on each matchline, regardless of whether it has a match or

a miss. The key improvement of the current-saving scheme is to allocate a different amount

of current for a match than for a miss. In the current-saving scheme, matches are allocated a larger current and misses are allocated a lower current. Since almost every matchline has a

miss, overall the scheme saves power.

Fig. 3.13: Current-saving matchline-sensing scheme

Fig. 3.13 shows a simplified schematic of the current-saving scheme. The main

difference from the current-race scheme as depicted in Fig. 3.9 is the addition of the current-

control block. This block is the mechanism by which a different amount of current is

allocated, based on a match or a miss. The input to this current-control block is the matchline

voltage, VML, and the output is a control voltage that determines the current, IML, which

charges the matchline. The current-control block provides positive feedback since higher VML

results in higher IML, which, in turn, results in higher VML.


3.9 CONCLUSION

The above review of previous content-addressable memory (CAM) designs shows that the main concerns have been power and speed. Researchers have given considerable attention to these two factors and improved both significantly. Scheme simplicity and noise immunity are two further factors that have also received attention. Table 3.3 summarizes the simulation results previously reported for these schemes.

Table 3.3: Comparison between the schemes [62]

Scheme               ML energy         Cycle time   Noise      Scheme
                     (fJ/bit/search)   (ns)         immunity   simplicity
Conventional         9.5               3.9          +          ++
Low swing            4.2               3.1          -          -
Current race         5.5               3.7          -          +
Selective precharge  5.6               3.5          +          +
Pipelining           5.8               3.8          +          -
Current saving       4.3               3.7          -          --


CHAPTER 4

PROPOSED CHARGING CONTROL SCHEME & SENSE AMPLIFIER


4.1 INTRODUCTION

Content-addressable memory (CAM) is a storage device that searches for the

matching data by content and returns the address of the matching data. Due to the rising demand for CAM for its high-speed search capability in various applications, the size as well as

power consumption of CAM arrays continue to rise.

Much research has been done to reduce the power consumption and to increase the speed. Previous works present schemes such as the selective-precharge scheme [63], the current-race scheme [64], and the current-saving scheme [65], [66]. A survey [67] reports that the current-saving scheme consumes less power than the other schemes [63], [64], [68]. In this work, we propose a simplified match-line charging control scheme. Comparison with the current-race and current-saving schemes shows a match-line energy reduction of over 50% and a speed improvement of over 3 times.

4.2 PROPOSED CHARGE CONTROLLING SCHEME

Fig. 4.1(a) shows the basic architecture of the proposed charge controlling scheme. Here, the array of CAM cells stores the data entries; the search data register stores the search word; each charge controller block controls the charging and discharging process of the respective match-line (ML); and the sense amplifier (SA) senses the ML voltage and gives the final match/miss decision. Fig. 4.1(b) shows the internal circuit of the conventional NOR-type ternary CAM (TCAM) cell, which is less susceptible to failure due to process variation than the NAND-type cell [69].

At the beginning of each search cycle, all MLs are pre-discharged to ground by the charging controller. During the evaluation stage, the search data register broadcasts the search data to the search-lines (SLs). If SL matches the stored bit D or the stored bit is X (don't care), the ML has no discharging path to ground and the path remains in the high-impedance state. If SL does not match D, the ML has a discharging path to ground (through either the T1-T2 path or the T3-T4 path). So, in an ML, the number of discharging paths is equal to the number of mismatches.
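The relation between mismatches and discharging paths can be expressed as a small behavioural model. This is a logical sketch only, not a circuit simulation; the string encoding of stored words ('0', '1', 'X') and the function names are assumptions made for illustration.

```python
def count_discharge_paths(stored, search):
    """Number of ML discharging paths for one TCAM word.

    stored: string over '0', '1', 'X' (X = don't care)
    search: string over '0', '1' of the same length
    A cell opens a discharging path only if it stores a definite bit
    that differs from the search bit.
    """
    if len(stored) != len(search):
        raise ValueError("stored word and search word must have equal length")
    return sum(1 for s, q in zip(stored, search) if s != "X" and s != q)


def is_match(stored, search):
    """A word matches when its ML has no discharging path."""
    return count_discharge_paths(stored, search) == 0


print(count_discharge_paths("10X1", "1001"))  # full match (ML0 case)
print(count_discharge_paths("10X1", "1000"))  # one-bit miss (ML1 case)
```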

Fig. 4.2 shows the proposed charging controller. Here, at the beginning of the search cycle, ML is pre-discharged to ground by a high MLP. The transistor M2 is turned ON by a low MLP and M5 is OFF by a low MLC; this causes M3 to be turned OFF. Therefore, charging of ML through M3 remains prohibited during the pre-discharging of ML.

During the evaluation stage, MLP is switched to low so that both M1 and M2 turn OFF. At the same time, MLC is switched to high so that charging of ML through M3 begins.


If the CAM cell data of this ML is fully matched with the search-line data, then there is no discharging path through the CAM cell. So, charging of the fully matched match-line (denoted by ML0) is faster than that of a partially matched match-line. MLC is kept high until the voltage of ML0 is enough to turn M4 ON; this causes the gate of the transistor M3 to be low and charging of ML0 continues via M3. As the voltage of the match-line ML0 reaches about VDD/3 and MLC is low, M6 becomes fully ON and M7 becomes partially ON. The end result is no more charging of ML through M3. The achieved voltage (~VDD/3) of ML0 is high enough to be sensed as a high level by the proposed sense amplifier (SA). Now, the match-line which has one miss is denoted by ML1, and this ML is the hardest to detect as a miss. As charging of ML1 is slower than that of ML0, the voltage of ML1 is not enough to be sensed as a high level by the SA.

Fig. 4.1: Structure of the CAM of the proposed scheme: (a) basic architecture and (b) NOR-type TCAM cell used in the scheme.


Fig. 4.2: Internal circuit of the charging controller.

Fig. 4.3 shows the proposed sense amplifier (SA). During the precharge stage, the node

SN is precharged to high through MS1. During the evaluation stage, when the voltage of ML

is slightly above VDD/3, the transistor MS2 turns ON. So, the node SN begins to discharge

and MLS begins to rise. As MLS reaches high level, the transistor MS3 turns ON resulting in

faster discharge of node SN. Transistor MS4 is used to initialize the output MLS to ground at

the beginning of each search cycle.

4.3 SIMULATION RESULTS AND ANALYSIS

The proposed scheme, along with the current-race and current-saving schemes, is simulated using HSpice in the same 64 × 72 TCAM for the TSMC 0.18 µm CMOS process with a supply voltage of 1.8 V.

Fig. 4.4 shows the simulation result for our design. At the beginning of a search cycle,

MLP is high for 0.39 ns so that previously charged MLs can discharge to ground. Then, MLP

is switched to low and MLC is switched to high for 0.33ns. As the voltage of ML0 rises up to

0.65V and MLC is low, the charging is stopped and the voltage of ML0 remains unchanged.

The proposed sense amplifier senses this voltage as a match and output MLS0 turns to a full

high level. On the other hand, the voltage of the ML1 rises up to 0.46V and then, degrades

through the discharging path. The SA senses this voltage as a miss and results in a low MLS1.

As the difference between the maximum ML0 and ML1 (worst-case match-line) is 190 mV,

even for a large process variation, our scheme will produce the correct search results.

Table 4.1 shows the comparative search energy and performance among the schemes.

It shows that the ML energy reduction is 57% and 54% compared to the current-race and current-saving schemes, respectively, and the speed is 3.13 times that of both schemes.
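The percentages and the speed ratio quoted above follow directly from the ML energy and cycle-time entries of Table 4.1. The short sketch below reproduces that arithmetic; the dictionary layout and the function names are assumptions introduced for illustration, while the numbers are those of Table 4.1.

```python
# ML energy in fJ/bit/search and minimum cycle time in ns, from Table 4.1.
table_4_1 = {
    "current-race": {"ml_energy": 3.56, "cycle_ns": 3.60},
    "current-saving": {"ml_energy": 3.32, "cycle_ns": 3.60},
    "this work": {"ml_energy": 1.53, "cycle_ns": 1.15},
}


def energy_reduction_pct(reference, proposed):
    """Percentage reduction of the proposed value relative to the reference."""
    return 100.0 * (reference - proposed) / reference


def speed_mhz(cycle_ns):
    """Search rate 1/T in MHz for a cycle time T given in ns."""
    return 1e3 / cycle_ns


ours = table_4_1["this work"]
for name in ("current-race", "current-saving"):
    ref = table_4_1[name]
    reduction = energy_reduction_pct(ref["ml_energy"], ours["ml_energy"])
    speedup = ref["cycle_ns"] / ours["cycle_ns"]
    print(f"{name}: ML energy reduction {reduction:.0f}%, speed-up {speedup:.2f}x")

print(f"this work: {speed_mhz(ours['cycle_ns']):.0f} MHz")
```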


The major difference between previous schemes [64]-[66] and our scheme is that we

have used only one PMOS for charging the ML while previous schemes use two PMOS in

series. So, the equivalent resistance is half that of the previous schemes. The equivalent capacitance is also reduced. As a result, the speed of the circuit increases. The basic difference between the previous SA designs [64]-[66] and the proposed SA is the switching threshold of the SA. The proposed SA can sense a stable 0.6 V as a high level, whereas the previous SAs use ~1 V as a switching voltage. So, the proposed scheme needs no further charging once 0.6 V (equal to VDD/3) is reached at ML0. This reduces a large amount of dynamic and leakage power, while the larger voltage swing of about VDD/2 in [64]-[66] causes higher power consumption.
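To first order, the effect of the lower sensing threshold and of the single charging PMOS can be estimated with the same form of energy expression used for the current-race scheme in Chapter 3. The sketch below assumes an equal matchline capacitance in both cases and takes the ML swing as about 0.6 V for the proposed SA and about 1 V for the previous SAs, as quoted above; the symbols E_ML, ΔV_ML, τ, R_eq and C_eq are introduced here for illustration only.

```latex
E_{ML} \approx C_{ML}\,V_{DD}\,\Delta V_{ML}
\quad\Longrightarrow\quad
\frac{E_{ML}^{\text{proposed}}}{E_{ML}^{\text{previous}}}
\approx \frac{0.6\ \text{V}}{1\ \text{V}} = 0.6,
\qquad
\tau \approx R_{eq}\,C_{eq}
\quad\Longrightarrow\quad
\frac{\tau^{\text{proposed}}}{\tau^{\text{previous}}}
\approx \frac{(R_{eq}/2)\,C_{eq}}{R_{eq}\,C_{eq}} = \frac{1}{2}
```

Under these assumptions the reduced swing alone would account for roughly a 40% reduction in ML charging energy and the halved resistance for a halved charging time constant. The larger gains reported in Table 4.1 would then have to come from further effects, such as the reduced equivalent capacitance mentioned in the text.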

Fig. 4.3: Proposed sense amplifier (SA). [Schematic labels: signals ML, MLP, SN and MLS; transistors MS1-MS4.]


Fig. 4.4: Simulation results of the proposed CAM showing voltages ML0 (fully matched),

ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP.

Table 4.1: Comparison of Different Schemes

Schemes                    ML energy         SL energy         Minimum cycle   Speed, 1/T
                           (fJ/bit/search)   (fJ/bit/search)   time, T (ns)    (MHz)
Current-race [64]          3.56              0.65              3.60            278
Current-saving [65], [66]  3.32              0.65              3.60            278
This work                  1.53              0.62              1.15            870

4.4 CORNER SIMULATION OF THE SCHEME

To test the reliability of the proposed scheme, we have simulated the proposed

scheme in two extreme corners which we defined as follows: the fast corner with low

threshold voltage, high VDD and low temperature (FHL) and the slow process corner with

high threshold voltage, low VDD and high temperature (SLH). Simulation results reveal that

the proposed scheme works satisfactorily in the fast process (FHL) corner with -10%

threshold voltage, +5% VDD and at a temperature of 273 K as well as in the slow process

(SLH) corner with +10% threshold voltage, -5% VDD and at a temperature of 343 K. The

simulat
