thesis_v4
7/28/2019 thesis_v4
1/147
HIGH SPEED AND LOW POWER DESIGN OF CONTENT ADDRESSABLE MEMORY
A thesis submitted to the
DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING
OF
BANGLADESH UNIVERSITY OF ENGINEERING AND TECHNOLOGY
in partial fulfillment of the requirements for the degree of
Bachelor of Science in Electrical and Electronic Engineering
By
MD. NAIMUL HASAN (0306017)
MD. TAUHIDUR RAHMAN (0306035)
MD. MEHEDI HASAN (0306071)
Supervisor
DR. A. B. M. HARUN-UR RASHID
Professor, Department of EEE,
BUET, Dhaka 1000
BANGLADESH UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Declaration
We hereby declare that the work presented in this thesis entitled
"High Speed and Low Power Design of Content Addressable
Memory" is the outcome of the investigation carried out by us and
neither this thesis nor any part thereof has been submitted or is
being currently submitted anywhere else for the award of any
degree or diploma.
Md. Naimul Hasan (0306017)
Md. Tauhidur Rahman (0306035)
Md. Mehedi Hasan (0306071)
Our parents.
ACKNOWLEDGEMENTS
At first, we would like to convey our gratitude to Almighty Allah, without
whose wish nothing is possible.
We would like to express our profound gratitude and appreciation to our
thesis supervisor, Professor Dr. A. B. M. Harun-Ur Rashid, for his benign
attitude towards us and for his supervision, which gave us the opportunity to get
involved in the state-of-the-art and rapidly emerging research on low-power
and high-speed circuit design, especially Content-Addressable Memory
(CAM). His generous help, encouragement and constant guidance accelerated
the completion of this thesis.
We would also like to express our special thanks to Mr. Ataur Rahman
Patwary, School of Electrical and Computer Engineering, Oregon State
University, USA, for his suggestions and various kinds of help in completing the
work.
We would like to thank the alumni of BUET (also members of the Yahoo!
Group BUETian) for their help in providing many necessary works.
We would also like to acknowledge the help of the Department of
Electrical and Electronic Engineering, BUET, for letting us use the VLSI server
computer to simulate large HSPICE files.
CONTENTS
Declaration
Dedication
Acknowledgement
Contents
List of Figures
List of Tables
Abstract
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
1.2 MOTIVATIONS
1.3 OBJECTIVES
1.4 THESIS ORGANIZATION
CHAPTER 2: DIFFERENT TYPES OF LOGIC
2.1 INTRODUCTION
2.2 RATIOED LOGIC
2.2.1 Load Resistance RL
2.2.2 nMOS depletion mode transistor pull up
2.2.3 nMOS enhancement mode pull up
2.2.4 Pseudo-nMOS logic
2.3 COMPLEMENTARY TRANSISTOR PULL-UP (CMOS)
2.3.1 Components of Total Power Dissipation in CMOS Circuits
2.4 DYNAMIC CIRCUITS
2.5 DOMINO LOGIC
CHAPTER 3: CONTENT ADDRESSABLE MEMORY (CAM) REVIEW
3.1 INTRODUCTION
3.2 CORE CELLS AND MATCHLINE STRUCTURE
3.2.1 Structure of NOR Cell
3.2.2 Structure of NAND Cell
3.2.3 Ternary Cells
3.3 MATCHLINE SENSING SCHEMES
3.3.1 Conventional (Precharge-High) Matchline Sensing
3.3.1.1 Basic Operation
3.3.1.2 Matchline power
3.3.1.3 Charge Sharing
3.3.1.4 Power Consumption
3.4 LOW-SWING SCHEMES
3.5 CURRENT-RACE SCHEME
3.6 SELECTIVE-PRECHARGE SCHEME
3.7 PIPELINING SCHEME
3.8 CURRENT-SAVING SCHEME
3.9 CONCLUSION
CHAPTER 4: PROPOSED CHARGING CONTROL SCHEME & SENSE AMPLIFIER
4.1 INTRODUCTION
4.2 PROPOSED CHARGE CONTROLLING SCHEME
4.3 SIMULATION RESULTS AND ANALYSIS
4.4 CORNER SIMULATION OF THE SCHEME
4.5 CONCLUSION
CHAPTER 5: PROPOSED SIMPLIFIED DESIGN OF CHARGING CONTROLLER
5.1 INTRODUCTION
5.2 PROPOSED ML CHARGING TECHNIQUE
5.3 SIMULATION RESULTS AND ANALYSIS
5.4 CONCLUSION
CHAPTER 6: PROPOSED CAM WITH IMPROVED NOISE MARGIN
6.1 INTRODUCTION
6.2 OPERATION OF THE SCHEME
6.2.1 Charging controller
6.2.2 The sense amplifier
6.3 SIMULATION RESULT
6.4 CONCLUSION
CHAPTER 7: CONCLUSION
7.1 CONCLUSION
7.2 FUTURE WORK
Reference
Bibliography
RESEARCH PAPERS FROM THIS THESIS
APPENDIX A
A.1 HSPICE CODE FOR CHARGING CONTROL SCHEME
A.2 HSPICE CODE FOR SIMPLIFIED DESIGN OF CHARGING CONTROLLER
A.3 HSPICE CODE FOR IMPROVED NOISE SCHEME [CHAPTER 6]
APPENDIX B
B.1 INTRODUCTION
B.2 INSTALLATION AND USAGE OF HSPICE 2007
B.3 BASIC RULES OR QUICK MANUAL [2]
B.3.1 Input File
B.3.2 Element Description
B.3.3 Analysis
B.3.4 References
LIST OF FIGURES
Fig. 2.1: Resistor Pull-up
Fig. 2.2: nMOS depletion mode transistor pull up
Fig. 2.3: nMOS enhancement mode pull up
Fig. 2.4: Pseudo-nMOS Logic
Fig. 2.5: Complementary transistor pull-up (CMOS)
Fig. 2.6: CMOS inverter current versus Vin
Fig. 2.7: Precharge and evaluation of dynamic gates
Fig. 2.8: Footed dynamic inverter
Fig. 2.9: Unfooted dynamic gates
Fig. 2.10: Generalized footed gates
Fig. 2.11: Logical effort of footed and unfooted dynamic gates
Fig. 2.12: Monotonicity problem
Fig. 2.13: Incorrect connection of dynamic gates
Fig. 2.14: Standard Domino Logic circuit
Fig. 2.15: Weak keeper implementation
Fig. 3.1: Simple schematic model of a 4x3 CAM array showing the core memory cells, differential search lines, match lines and encoder
Fig. 3.2: Profile of CAM capacity (log scale) versus year of publication [33][40]
Fig. 3.3: CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells
Fig. 3.4: Structure of Ternary core cells for (a) NOR-type (b) NAND-type CAM [41], [42]
Fig. 3.5: (a) The schematic with precharge circuitry for matchline sensing using the precharge-high scheme, and (b) the corresponding timing diagram showing relative signal transitions [34]
Fig. 3.6: Matchline power for NAND and NOR architecture [33]
Fig. 3.7: Two possible configurations for the NOR cell: (a) the stored bit is connected to the bottom transistors of the pulldown pair, and (b) the stored bit is connected to the top transistors of the pulldown pair
Fig. 3.8: Low-swing matchline sensing scheme of [33]
Fig. 3.9: (a) Circuit implementation including precharge circuitry and (b) a timing diagram for a single search cycle, for current-race matchline sensing [51]
Fig. 3.10: Sample implementation of the selective-precharge matchline technique [52]
Fig. 3.11: Pipelined matchlines reduce power by shutting down after a miss in a stage
Fig. 3.12: Simulated waveforms in the pipelined match-line architecture for (a) the full-match case consisting of a match in every stage, and (b) a miss case where the third stage results in a miss and turns off the subsequent stages [58]
Fig. 3.13: Current-saving matchline-sensing scheme
Fig. 4.1: Structure of the CAM of the proposed scheme: (a) basic architecture and (b) NOR-type TCAM cell used in the scheme
Fig. 4.2: Internal circuit of the charging controller
Fig. 4.3: Proposed sense amplifier (SA)
Fig. 4.4: Simulation results of the proposed CAM showing voltages ML0 (fully matched), ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP
Fig. 4.5: Corner simulation results for FHL (when threshold voltage = -10%, VDD = +5%, and temperature = 273 K)
Fig. 4.6: Corner simulation results for SLH (when threshold voltage = +10%, VDD = -5%, and temperature = 343 K)
Fig. 5.1: Simplified conventional CAM architecture
Fig. 5.2: Structure of the CAM array with the proposed scheme: (a) the basic architecture and (b) internal circuit of the NOR-type TCAM cell used in this scheme. Here the usual SRAM access transistor and associated bitlines are omitted for simplicity
Fig. 5.3: The ML charging unit proposed in this work and the sensing unit proposed in [84]
Fig. 5.4: Simulation results of the proposed CAM showing voltages ML0 (fully matched), ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP
Fig. 6.1: Structure of the CAM array
Fig. 6.2: Internal circuitry of improved Charging
Fig. 6.3: Waveforms for CAM
Fig. 6.4: Conventional sense amplifier
Fig. 6.5: Charging in different match-lines
Fig. 6.6: Controlled signals and output of each match-line
LIST OF TABLES
Table 3.1: Truth Table for NOR Cell
Table 3.2: Truth Table for NAND Cell
Table 3.3: Comparison between the schemes [62]
Table 4.1: Comparison of Different Schemes
Table 5.1: Comparison of Different Schemes
Table 6.1: Comparison of Different Schemes with improved noise margin scheme
Table 6.2: Comparison of Noise Immunity of different schemes
ABSTRACT
The growing market demand for integrated circuits and the worldwide energy crisis
drive researchers to develop new process technologies with smaller transistors and
to design circuits that need less power and operate at higher speed.
The purpose of this dissertation is the same: to find a way to design digital circuits,
specifically Content-Addressable Memory (CAM), that need low power and operate at
higher speed while maintaining noise immunity.
Content-addressable memory (CAM) is an attractive component in network routers
for packet forwarding and packet classification, and also in other applications that require
high-speed searches. This dissertation presents three techniques to increase speed and reduce
the energy per bit per search. In the first technique, the charging of a fully matched matchline is
reduced from VDD to VDD/3, and this voltage is sensed by our proposed sense amplifier. This
greatly reduces the power consumption. In the second scheme, the charging controller is
simplified (to obtain a low-cost circuit) at the cost of some energy. In the third scheme, which
has a high noise margin, only probable matched matchlines are charged to almost VDD. Most of the
matchlines are charged far less, so the power consumption in most of the matchlines is
very low.
All the schemes (including the conventional current-saving and current-race schemes) are
simulated in TSMC 0.18 µm technology with a 64 x 72 ternary CAM. For the first charging
control technique, simulation shows that the match-line energy reduction is 57% and 54%
compared to the current-race and current-saving schemes respectively and 55% compared to
the conventional current-race scheme, while the speed of operation is increased by over 3 times.
For the second, simplified scheme, the energy efficiency is slightly lower. The third scheme
provides a very good noise margin while maintaining sufficient energy reduction and speed of
operation. More accurate results can be obtained by doing the layout of the proposed
schemes.
This dissertation provides some good schemes for content-addressable
memory. We hope that a proper layout of the schemes would also provide good
results, and that fabrication of the chip would yield good alternatives to existing CAMs.
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
A CONTENT-ADDRESSABLE memory (CAM) compares input search data against
a table of stored data, and returns the address of the matching data. CAMs have a single clock
cycle throughput, making them faster than other hardware- and software-based search
systems. CAMs can be used in a wide variety of applications requiring high search speeds.
These applications include cache memory, parametric curve extraction, Hough
transformation, Huffman coding/decoding, Lempel-Ziv compression, and image coding [1].
The primary commercial application of CAMs today is to classify and forward Internet
protocol (IP) packets in network routers. In networks like the Internet, a message such as an
e-mail or a Web page is transferred by first breaking up the message into small data packets
of a few hundred bytes and then sending each data packet individually through the network.
These packets are routed from the source, through the intermediate nodes of the network
(called routers), and reassembled at the destination to reproduce the original message. The
function of a router is to compare the destination address of a packet to all possible routes, in
order to choose the appropriate one. A CAM is a good choice for implementing this lookup
operation due to its fast search capability.
CAM is also used in neural networks. Two main strategies are available for implementing
a CAM with a neural network architecture: feedback networks and two-stage CAMs. They differ
in particular in their ability to retrieve patterns from corrupted input data. The storage
capacity of the Hopfield network is very poor, although it can be improved with the use of an
iterative algorithm, such as the threshold algorithm. However, the possibility of generating
spurious patterns always remains with feedback networks. Two-stage CAMs are much more
efficient, provided that an appropriate algorithm is used for the input classification stage.
Perceptron and least-mean-squares algorithms need to be modified if they are to cope with
corrupted input patterns, but the optimal classifier for the type of problem under
consideration is the minimum-distance classifier (or Hamming network for binary patterns).
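The two lookup behaviours described above can be illustrated with a small software model. The following is only a sketch under stated assumptions: the function names `cam_search` and `hamming_nearest`, the string encoding of stored words, and the use of 'X' for a don't-care bit are all hypothetical choices for illustration. A hardware CAM compares every row in parallel in one cycle; the loop below models only the result, not the timing.

```python
from typing import List, Optional


def cam_search(table: List[str], query: str) -> Optional[int]:
    """Return the address (row index) of the first stored word matching query.

    Stored words may contain 'X' (don't care), as in a ternary CAM.
    Returns None when no row matches.
    """
    for address, word in enumerate(table):
        if len(word) == len(query) and all(
            stored in ("X", bit) for stored, bit in zip(word, query)
        ):
            return address
    return None


def hamming_nearest(table: List[str], query: str) -> int:
    """Minimum-distance classifier: address of the stored word closest to query."""
    def distance(word: str) -> int:
        return sum(a != b for a, b in zip(word, query))

    return min(range(len(table)), key=lambda address: distance(table[address]))


routes = ["1010", "0111", "11XX"]           # hypothetical stored table
print(cam_search(routes, "0111"))           # exact match in row 1
print(cam_search(routes, "1101"))           # matches the ternary row 2
print(cam_search(routes, "0000"))           # no match
print(hamming_nearest(["1010", "0111", "1100"], "0110"))  # closest row
```

The exact search corresponds to the router lookup, while `hamming_nearest` corresponds to retrieving a stored pattern from a corrupted binary input.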
Dynamic CMOS logic in general, and domino logic in particular, has a number of
advantages for designing high-speed and low-power CMOS circuits. However, the main difficulty
with domino logic is that it can implement only non-inverted logic. To implement inverted
polarity, several circuit parts must be duplicated using inverted polarities of the inputs,
which increases area and power dissipation. With static CMOS logic, on the other hand, it is
simple to realize gates with both inverted and non-inverted logic [2].
Domino logic is known as a better logic for implementing high-speed CMOS circuits.
However, domino circuits have some inherent problems such as charge sharing, clock routing
overhead and clock skew. Another difficulty with domino logic is that it can only
implement non-inverting functions. As domino logic cannot implement inverters driving
other domino gates, all parts of the circuit that follow an inverter have to be implemented again
with opposite polarities of inputs, which increases area and power dissipation. The
advantages of domino logic may come into question when a circuit contains a large number of
inverters located at points where substantial duplication of domino gates is unavoidable.
1.2 MOTIVATIONS
The speed of a CAM comes at the cost of increased silicon area and power
consumption, two design parameters that designers strive to reduce. As CAM applications
grow, demanding larger CAM sizes, the power problem is further exacerbated. Reducing
power consumption, without sacrificing speed or area, is the main thread of recent research in
large-capacity CAMs.
The literature shows that a major portion of the power is consumed in the matchlines (ML)
and searchlines (SL). Much research has been done to reduce the ML and SL power
consumption. Previous works present schemes such as the low-swing scheme [3], [4], [5],
the selective-precharge scheme [6], the current-race scheme [7], and the current-saving scheme [8], [9].
The low-swing scheme reduces the ML power by reducing the ML voltage. The selective-
precharge scheme reduces match-line power consumption by breaking the search into two
segments and observing that the second segment is rarely activated. The current-race scheme
limits the ML voltage swing to VDD/2 and precharges the MLs to ground instead of VDD. In
this scheme, as the SL is not precharged, there is almost a 50% reduction in SL power
consumption [1], compared to the precharge-high scheme [10], [11]. The current-saving
scheme is an improved version of the current-race scheme which allocates less power to match
decisions involving a large number of mismatched bits. A survey [1] reports that the
current-saving scheme consumes less power than other schemes [10], [6], [7]. We have therefore
selected the current-race and current-saving schemes as the baselines for our proposed schemes.
In our proposed schemes, the matchlines are precharged to ground in the precharge stage,
unlike the conventional precharge-high scheme, so that power consumption in the matchlines
is low. In addition, precharging the matchline to ground eliminates the need for searchline
precharge. In the typical case, where about 50% of the search data bits toggle from cycle to
cycle, there is a 50% reduction in searchline power compared to the precharge-high
matchline-sensing schemes that have an SL precharge phase. Secondly, in our proposed
charging control scheme, the charging of a fully matched matchline is reduced from VDD to
VDD/3, and this voltage is sensed by our proposed sense amplifier. This greatly reduces the power
consumption. In the scheme with high noise margin, only probable matched
matchlines are charged to almost VDD. Most of the matchlines are charged far less, so in
most of the matchlines the power consumption is very low.
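The effect of a reduced matchline swing and of searchline toggling can be estimated with a first-order model. This is a rough sketch, not the thesis's simulation: it assumes each matchline is a lumped capacitance charged from the VDD rail, so that the energy drawn from the supply is C_ML x VDD x V_swing. The supply voltage of 1.8 V and the 100 fF capacitance are illustrative assumptions, and the function names are hypothetical.

```python
def ml_supply_energy(c_ml: float, vdd: float, v_swing: float) -> float:
    """Energy drawn from the supply to raise one matchline by v_swing.

    First-order model (an assumption of this sketch): the matchline is a
    lumped capacitor c_ml charged from the VDD rail, so the charge
    c_ml * v_swing is delivered at voltage vdd.
    """
    return c_ml * vdd * v_swing


def toggle_fraction(previous: str, current: str) -> float:
    """Fraction of search bits that change between two consecutive search words."""
    changed = sum(a != b for a, b in zip(previous, current))
    return changed / len(current)


VDD = 1.8        # volts; assumed nominal supply for a 0.18 um process
C_ML = 100e-15   # farads; hypothetical matchline capacitance

full_swing = ml_supply_energy(C_ML, VDD, VDD)
third_swing = ml_supply_energy(C_ML, VDD, VDD / 3)
print(f"full swing : {full_swing * 1e15:.0f} fJ")
print(f"VDD/3 swing: {third_swing * 1e15:.0f} fJ")
print(f"ratio      : {third_swing / full_swing:.3f}")
print(f"toggling   : {toggle_fraction('1010', '1001'):.2f}")
```

Under these assumptions a matched matchline charged to VDD/3 draws one third of the full-swing energy. The reductions reported in the thesis come from circuit simulation of complete schemes, so this simple model is not expected to reproduce them exactly.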
1.3 OBJECTIVES
The objective of this investigation and research was to find a scheme for
designing high-speed and low-power dynamic circuits. We specifically tried to design good
schemes for Content Addressable Memory (CAM). CAM is a special type of memory array
which provides a hardware search system: the data to be searched enters
the two-dimensional memory array, which provides the search result (the address of the
memory location where the data is found). To design a very high speed CAM we have investigated the
CAM cells, the charging controller and the sense amplifiers. As satisfactory work already exists on
CAM cells, our objective was to design an appropriate scheme of charging controllers and
sense amplifiers which consume less power and operate at high speed. As most of the power is
consumed in the matchlines and the matchlines are charged by the charging controller, the charging
controller is responsible for most of the power consumption. So, one of our objectives was to
design the charging controller intelligently. On the other hand, the higher the matchline
voltage, the higher the power consumption. For that reason, our objective was to charge the
matchline to VDD/3 and design a sense amplifier which can sense this voltage as a high.
1.4 THESIS ORGANIZATION
The continued increase in transistor leakage current with the advancement of
process technology impacts the leakage power and noise sensitivity of dynamic circuits more
than those of static circuits. This thesis proposes methods and techniques to achieve the goal
of power-efficient design of CAM circuits used commonly in high-performance cache
memory while improving or maintaining their area, performance and noise robustness.
Chapter 2 describes different types of logic circuits, the study of which is needed to
understand the significance of dynamic circuits. As CAM is based on dynamic logic,
investigation of dynamic logic is very important.
Chapter 3 reviews previous schemes for reducing power consumption and compares their
performance and structure in detail.
Chapter 4 describes one of the proposed schemes, named the
charging control scheme. This work was accepted and presented at the
International Conference on Solid State Devices and Materials (SSDM) 2008, Tsukuba, Japan.
Chapter 5 describes another scheme which contains a simplified version of the charging
controller of the scheme described in Chapter 4. It was accepted at TENCON 2008,
Hyderabad, India.
Chapter 6 presents a scheme which has a high noise margin. The principle of this
technique is "carry coal to Newcastle". This scheme charges only a negligible number of
matchlines to VDD. For that reason, a large amount of power is saved.
Finally, Chapter 7 presents a summary of the proposed methods and techniques
mentioned in the thesis for low-power CAM circuits in high-performance memory arrays. It
also gives suggestions for extending the current research in possible future work.
CHAPTER 2
DIFFERENT TYPES OF LOGIC
2.1 INTRODUCTION
In this chapter, different types of logic circuits are described. Logic circuits such as
ratioed logic, CMOS logic, dynamic logic and domino logic are studied. This study is needed
to realize the significance of dynamic circuits. As CAM is based on dynamic logic,
investigation of dynamic logic is very important.
2.2 RATIOED LOGIC
Ratioed logic is an attempt to reduce the number of transistors required to implement
a logic function, often at the cost of reduced robustness and extra power dissipation.
2.2.1 Load Resistance RL
The main goal is to reduce the number of transistors, at the cost of reduced robustness
and extra power dissipation. This arrangement is not often used because of the large space
requirements of resistors produced in a silicon substrate. If the pull-down network (PDN) is off,
the static power is zero. If the pull-down network is on, there is some static power dissipation.
Fig. 2.1: Resistor Pull-up
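The static behaviour of the resistor pull-up gate can be checked with a simple voltage-divider model. This is a minimal sketch, assuming the conducting pull-down network behaves like a linear resistor `r_on` and the non-conducting one like an open circuit; the function name and all component values are hypothetical and chosen only for illustration.

```python
def resistive_load_inverter(vdd: float, r_load: float, r_on: float, pdn_on: bool):
    """Return (output voltage, static power) of a resistor pull-up gate.

    Assumption of this sketch: a conducting pull-down network behaves like
    a linear resistor r_on, and a non-conducting one like an open circuit.
    """
    if not pdn_on:
        # No path from VDD to VSS: output is pulled to VDD, no static current.
        return vdd, 0.0
    current = vdd / (r_load + r_on)   # rail-to-rail static current
    return current * r_on, vdd * current


VDD = 1.8        # volts, assumed supply
R_LOAD = 40e3    # ohms, hypothetical load resistor RL
R_ON = 10e3      # ohms, hypothetical on-resistance of the pull-down network

for pdn_on in (False, True):
    v_out, power = resistive_load_inverter(VDD, R_LOAD, R_ON, pdn_on)
    print(f"PDN on={pdn_on}: Vout = {v_out:.2f} V, static power = {power * 1e6:.1f} uW")
```

With the pull-down network on, the output low level depends on the ratio of `r_on` to `r_load`, which is why such gates are called ratioed: a larger load resistor lowers both the output low voltage and the static power, but occupies more area and slows the rising transition.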
2.2.2nMOSdepletionmodetransistorpullup
Power dissipation is high since rail to rail current flows when input = logical 1.
Switching of output from 1 to 0 begins when input voltage exceeds the Vt of the pull down
device.
Fig. 2.2: nMOS depletion mode transistor pull up
When switching the output from 1 to 0, the pull up device is non-saturated initially and this
presents lower resistance through which to charge capacitive loads.
2.2.3 nMOS enhancement mode pull up
If the gate of the pull-up transistor is connected to VDD, then it is called nMOS
enhancement mode pull up. Power dissipation is high since current flows when input
voltage = logical 1. Output voltage can never reach VDD (logical 1), because the pull-up
transistor turns off once the output rises to within one threshold voltage of VDD.
Fig. 2.3: nMOS enhancement mode pull up
2.2.4 Pseudo-nMOS logic
If we replace the depletion mode pull-up transistor of the standard nMOS circuits with
a p-transistor with its gate connected to VSS, we have a structure similar to the nMOS
equivalent. This approach to logic design is illustrated in Fig. 2.4.
Fig. 2.4: Pseudo-nMOS Logic
The circuit arrangements look and behave much like nMOS circuits and appropriate ratio
rules must be applied.
2.3 COMPLEMENTARY TRANSISTOR PULL UP (CMOS)
In CMOS we use both a pull-up network and a pull-down network. There is no static
power dissipation, because no current flows for either logical 0 or logical 1
inputs. Full logical 1 and 0 levels are presented at the output. For devices of similar
dimensions, the p-channel device is slower than the n-channel device.
Fig. 2.5: Complementary transistor pull-up (CMOS)
2.3.1 Components of Total Power Dissipation in CMOS Circuits
One of the major design challenges in high-performance digital integrated circuits is
the minimization of the total power dissipation. Total power consumption in digital CMOS
circuits can be divided into three major components: a) Switching power or dynamic power,
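b) Short-circuit power and c) Static or leakage power. Equation 1.1 defines total power and
its three components in a simplified form [1.1]:
\[
\begin{aligned}
Power_{TOTAL} &= Power_{DYNAMIC} + Power_{SHORT\,CIRCUIT} + Power_{LEAKAGE} \\
Power_{DYNAMIC} &= AF \cdot F \cdot C \cdot V_{dd}^{2} \\
Power_{SHORT\,CIRCUIT} &= \left( \frac{T_{rise} + T_{fall}}{2} \right) \cdot I_{PEAK} \cdot V_{dd} \\
Power_{LEAKAGE} &= I_{LEAKAGE} \cdot V_{dd}
\end{aligned}
\qquad (1.1)
\]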
Dynamic power is a result of the power consumed in charging and discharging
various device and wire capacitances in the circuit. As seen from Equation 1.1, this
component of power depends on the switching activity factor (probability that a power
consuming transition occurs) AF, clock frequency F, the capacitances C being charged or
discharged and square of supply voltage, Vdd. In long channel transistors the dynamic power
is the dominant component of the total power. However, this is not the case in advanced
technologies, as leakage power is becoming a significant component of total power.
Fig. 2.6: CMOS inverter current (between rails) versus Vin
Short-circuit power is dissipated when there is a direct conducting path between
power supply (Vdd) and ground (Vss). Since the input signals to the logic gates have a
non-zero/finite slope or edge rate, there is a direct path current between Vdd and Vss for a
period of time during which both the PMOS and NMOS devices conduct simultaneously. The
magnitude of this current is given by the actual transistor widths and the on-state saturation
current (IDSAT). The duration for which the current flows depends on the signal rise and fall
times and increases as the signal slopes degrade. The overall short-circuit power can be
obtained by integrating the total current over the duration of short circuit and then
multiplying with Vdd. In a simplified form it can be calculated as shown in Equation 1.1,
where Trise and Tfall are the rise and fall times respectively and IPEAK is the peak short circuit
current.
Static or leakage power dissipation is due to the leakage current, ILEAKAGE, that flows
between power rails in the absence of any switching activity. Three major sources of leakage
current are: a) current flowing through reverse biased P-N diode junctions of the transistors
located between the source or drain and substrate, b) subthreshold leakage current between
source and drain when gate-source voltage, Vgs, is smaller than the threshold voltage, Vt, of
the transistors and c) gate leakage current via the gate tunneling mechanism.
Because of the quadratic dependence of dynamic power on Vdd, reducing this voltage
is the most effective approach to minimize dynamic power dissipation. However, reducing
the supply voltage necessitates the reduction of threshold voltage to avoid serious degradation
of performance. Unfortunately, reducing threshold voltage causes the sub-threshold leakage
current to increase exponentially.
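The quadratic dependence of dynamic power on Vdd can be illustrated with a short numerical sketch. The function names and all operating-point values below are hypothetical, chosen only for illustration; the formulas follow the dynamic and leakage terms of Equation 1.1.

```python
def dynamic_power(af, f, c, vdd):
    """Switching power in watts: AF * F * C * Vdd^2 (Equation 1.1)."""
    return af * f * c * vdd ** 2


def leakage_power(i_leakage, vdd):
    """Static power in watts: I_LEAKAGE * Vdd (Equation 1.1)."""
    return i_leakage * vdd


# Hypothetical operating point (illustrative values, not taken from the thesis).
AF = 0.1        # switching activity factor
F = 100e6       # clock frequency in Hz
C = 10e-12      # switched capacitance in farads
I_LEAK = 1e-6   # leakage current in amperes, held fixed here for simplicity

p_full = dynamic_power(AF, F, C, vdd=3.3)
p_half = dynamic_power(AF, F, C, vdd=1.65)

# Halving Vdd cuts dynamic power to about one quarter ...
print(p_half / p_full)
# ... while leakage power at a fixed leakage current only halves.
print(leakage_power(I_LEAK, 1.65) / leakage_power(I_LEAK, 3.3))
```

The sketch keeps the leakage current constant, so it does not capture the exponential growth of subthreshold leakage that accompanies a reduced threshold voltage.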
2.4 DYNAMIC CIRCUITS
Ratioed circuits reduce the input capacitance by replacing the pMOS transistors
connected to the inputs with a single resistive pull-up. The drawbacks of ratioed circuits
include slow resistive transitions, contention on the falling transitions, static power
dissipation and a non-zero VOL. Dynamic circuits circumvent these drawbacks by using a
clocked pull-up transistor rather than a pMOS that is always ON.
Fig. 2.7: Precharge and evaluation of dynamic gates.
Dynamic circuit operation is divided into two modes, shown in Fig. 2.7. During
precharge, the clock (CLK) is 0, so the clocked pMOS is ON and initializes the output Y
high. During evaluation, the clock is 1 and the clocked pMOS turns OFF. The output may
remain high or may be discharged low through the pull-down network.
Fig. 2.8: Footed dynamic inverter
Dynamic circuits are the fastest commonly used circuit family because they have
lower input capacitance and no contention during switching. They also have zero static power
dissipation. However, they require careful clocking, consume significant dynamic power, and
are sensitive to noise during evaluation.
In Fig. 2.9, if the input is 1 during precharge, contention will take place because both
the pMOS and nMOS transistors will be ON.
Fig. 2.9: Unfooted dynamic gates.
When the input cannot be guaranteed to be 0 during precharge, an extra clocked
evaluation transistor can be added to the bottom of the nMOS stack to avoid contention as
shown in Fig. 2.10. The extra transistor is sometimes called a foot. Fig. 2.10 shows generic
footed gates.
Fig. 2.10: Generalized footed gates.
Fig. 2.11 estimates the falling logical effort of both footed and unfooted dynamic
gates. As usual, the pull-down transistor widths are chosen to give unit resistance. Precharge
occurs while the gate is idle and often may take place more slowly. Therefore, the precharge
transistor width is chosen for twice unit resistance. This reduces the capacitive load on the
clock and the parasitic capacitance at the expense of greater rising delays.
Fig. 2.11: Logical effort of footed and unfooted dynamic gates. Inverter values shown in
the figure: unfooted gd = 1/3, pd = 2/3; footed gd = 2/3, pd = 3/3.
Footed gates have higher logical effort than their unfooted counterparts but are still an
improvement over static logic. In practice, the logical effort of footed gates is better than
predicted because velocity saturation means series nMOS transistors have less resistance
than we estimate. The size of the foot can be increased relative to the other nMOS transistors
to reduce logical effort of the other inputs at the expense of greater clock loading. Like
pseudo-nMOS gates, dynamic gates are particularly well suited to wide NOR functions or
multiplexers because the logical effort is independent of the number of inputs.
A fundamental difficulty with the dynamic circuits is the monotonicity requirement.
While a dynamic gate is in evaluation, the inputs must be monotonically rising. That is, the
input can start LOW and remain LOW, start LOW and rise HIGH, start HIGH and remain
HIGH, but not start HIGH and fall LOW. Fig. 2.12 shows waveforms for a footed dynamic
inverter in which the input violates monotonicity. During precharge, the output is pulled
HIGH. When the clock rises, the input is HIGH so the output is discharged LOW through the
pull-down network, as happens in an inverter.
Fig. 2.12: Monotonicity problem
The input later falls LOW, turning off the pull-down network. However, the
precharge transistor is also OFF, so the output floats, staying LOW rather than rising as it
would in a normal inverter. The output will remain low until the next precharge step. In
summary, the inputs must be monotonically rising for the dynamic gate to compute the
correct function.
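The monotonicity problem can be reproduced with a small behavioral sketch of a footed dynamic inverter. The class name and the sampled waveform are assumptions made for illustration; the model only tracks logic levels and ignores leakage, noise and timing.

```python
class FootedDynamicInverter:
    """Logic-level model of a footed dynamic inverter (illustrative only)."""

    def __init__(self):
        self.y = 1  # assume the output node starts precharged

    def step(self, clk, a):
        if clk == 0:
            # Precharge: the clocked pMOS is ON and the foot is OFF,
            # so Y is pulled HIGH regardless of the input.
            self.y = 1
        elif a == 1:
            # Evaluate with the pull-down network ON: Y discharges LOW.
            self.y = 0
        # Evaluate with the input LOW: both paths are OFF, so Y floats
        # and simply holds its previous value.
        return self.y


# (CLK, A) samples: precharge, evaluate with A HIGH, A falls LOW, precharge.
waveform = [(0, 0), (1, 1), (1, 0), (0, 0)]
gate = FootedDynamicInverter()
outputs = [gate.step(clk, a) for clk, a in waveform]
print(outputs)
```

In the third sample the input has fallen LOW during evaluation. A static inverter would drive its output HIGH at that point, whereas the modeled dynamic node stays LOW until the next precharge.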
Unfortunately, the output of a dynamic gate begins HIGH and monotonically falls
LOW during evaluation. This monotonically falling output X is not a suitable input to a
second dynamic gate expecting monotonically rising signals, as shown in Fig. 2.13. Dynamic
gates sharing the same clock cannot be directly connected. This problem is often overcome
with domino logic.
Fig. 2.13: Incorrect connection of dynamic gates.
2.5 DOMINO LOGIC
Domino logic circuits find wide application in high performance microprocessors
due to their superior speed and area characteristics as compared to static CMOS circuits. But
their noise margins are low, making them more prone to noise. Various leakage reduction
techniques are also applied to domino logic circuits to reduce the leakage. As the technology
is scaled below 130nm, noise margin becomes a critical issue and hence techniques that
provide high noise immunity become necessary in order to have reliable circuits.
The monotonicity problem can be solved by placing a static CMOS inverter between
dynamic gates as shown in the figure. This converts the monotonically falling output into a
monotonically rising signal suitable for the next gate. Because the inverter output only rises
during evaluation, the static inverter is usually a HI-skew gate to favor this rising
output. The dynamic-static pair together is called a domino gate. A single clock can be used
to precharge and evaluate all the logic gates within the chain. We may observe that precharge
occurs in parallel, but evaluation occurs sequentially.
A standard domino logic circuit with a keeper is shown in Fig. 2.14. A standard
domino logic circuit consists of an n-type dynamic logic block followed by a static inverter.
During precharge, the output of the dynamic gate is charged to Vdd and the output of the
inverter is set to 0.
Fig. 2.14: Standard Domino Logic circuit
During evaluation, the inverter makes a conditional transition from 0 to 1. If the output
of the domino gate is fed to other domino gates, then it must be ensured that all inputs are set
to 0 at the end of the precharge phase and the transitions during evaluation are only 0 to 1.
Hence the dynamic node discharges only when the previous stage evaluates to 1 and a high
fan-out is achieved due to the static inverter present at the output. To counteract the leakage
issues and to establish a low impedance path, a bleeder transistor (keeper) is connected in the
feedback path.
Fig. 2.15: Weak keeper implementation
The function of the keeper is to compensate the charge lost due to the pull-down
leakage paths. But the keeper is fully turned on at the beginning of the evaluation phase.
When the pull down network is ON, then there exists a contention between this and keeper
transistor, which degrades the speed of domino circuits. Traditionally, a minimum sized
keeper is used to minimize delay and power degradation caused by the contention current. A
small keeper, however, cannot provide necessary noise immunity for reliable operation in an
increasingly noisy and noise-sensitive on-chip environment. Therefore, there is a tradeoff
between the high speed/energy efficient operation and reliability in domino logic. Hence,
keeper sizing is important in deep sub micron circuits.
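The observation in this section that domino gates precharge in parallel but evaluate sequentially can be sketched at the logic level. The function name and the choice of a chain of two-input AND stages are assumptions made for illustration; keepers, leakage and delay are not modeled.

```python
def domino_chain(side_inputs):
    """Evaluate a chain of domino AND stages (illustrative model).

    Stage i discharges its dynamic node only if the output of stage i-1
    and its own side input are both 1. The chain input of stage 0 is 1.
    Returns the stage outputs after precharge and after each evaluation step.
    """
    n = len(side_inputs)
    # Precharge: all dynamic nodes go HIGH in parallel, so all outputs are 0.
    dynamic_nodes = [1] * n
    outputs = [0] * n
    trace = [outputs.copy()]
    previous = 1
    # Evaluate: stages resolve one after another, like falling dominoes.
    for i, a in enumerate(side_inputs):
        if previous == 1 and a == 1:
            dynamic_nodes[i] = 0            # pull-down network conducts
        outputs[i] = 1 - dynamic_nodes[i]   # static inverter after the node
        previous = outputs[i]
        trace.append(outputs.copy())
    return trace


for step in domino_chain([1, 1, 0, 1]):
    print(step)
```

Every output in the trace either stays at 0 or makes a single 0 to 1 transition, which is the monotonically rising behavior the next dynamic stage requires.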
CHAPTER 3
CONTENT ADDRESSABLE MEMORY (CAM) REVIEW
3.1 INTRODUCTION
Content addressable memories (CAMs) are memories that can search the entire
memory in parallel and output the location of entries that hold a match to the key value.
Today the increased need for faster searches, larger table sizes and wider data widths
makes CAMs a more attractive solution than the less expensive RAM-based software solution.
CAMs are now much needed wherever quick searches of a database, a list, or a pattern are
required. In fact, CAMs can be a determining factor for a wide range of applications such as
local-area networks, file storage management, artificial intelligence, database management
and pattern recognition.
A Content-Addressable Memory (CAM) searches for data by its content and returns
the address of the matching data. This feature is used extensively in applications such as
internet routers to channel incoming packets towards their destination addresses contained in
the packet header. Energy per search and search speed are two important metrics used to
evaluate CAM performance [12]. Content addressable memory (CAM), a high-performance
lookup engine in many systems, is so power-consuming that any saving becomes very
significant in the whole system. CAM has three major power-sinking sources: evaluation
power, input transition power and clocking power, all of which are discussed in this research.
After that, a new low-power CAM design is proposed here. Its implementation under a 0.35-µm
process operates at 83.3 MHz with a power performance metric of 45.5 fJ/bit/search or
equivalently 372 mJ/bit/search/m for random inputs. Two modified circuit structures for
binary static CAM cells are also proposed. We have proved that under most conditions cell
layout is smaller by this modification.
A Content Addressable Memory (CAM) compares input search data against a table of
stored data, and returns the address of the matching data [13]-[17]. CAMs have a single
clock cycle throughput, making them faster than other hardware and software-based search
systems. CAMs can be used in a wide variety of applications requiring high search speeds.
These applications include parametric curve extraction [18], Hough transformation [19],
Huffman coding/decoding [20], [21], Lempel-Ziv compression [22]-[25], and image coding
[26]. The primary commercial application of CAMs today is to classify and forward Internet
protocol (IP) packets in network routers [27]-[32]. In networks like the Internet, a message
such as an e-mail or a Web page is transferred by first breaking up the message into small
data packets of a few hundred bytes, and, then, sending each data packet individually through
the network. These packets are routed from the source, through the intermediate nodes of the
network (called routers), and reassembled at the destination to reproduce the original
message. The function of a router is to compare the destination address of a packet to all
possible routes, in order to choose the appropriate one. A CAM is a good choice for
implementing this lookup operation due to its fast search capability.
However, the speed of a CAM comes at the cost of increased silicon area and power
consumption, two design parameters that designers strive to reduce. As CAM applications
grow, demanding larger CAM sizes, the power problem is further exacerbated. Reducing
power consumption, without sacrificing speed or area, is the main thread of recent research in
large capacity CAMs. In this research, we survey developments in the CAM area at two
levels: circuits and architectures. Before providing an outline of this research at the end of
this section, we first briefly introduce the operation of CAM and also describe the CAM
application of packet forwarding.
Fig. 3.1: Simple schematic model of a 4x3 CAM array showing the core memory cells,
differential search lines, match lines and encoder
Fig. 3.1 shows a simplified block diagram of a CAM. The input to the system is the
search word that is broadcast onto the searchlines to the table of stored data. The number of
bits in a CAM word is usually large, with existing implementations ranging from 36 to 144
bits. A typical CAM employs a table size ranging between a few hundred entries to 32K
entries, corresponding to an address space ranging from 7 bits to 15 bits. Each stored word
has a matchline that indicates whether the search word and stored word are identical (the
match case) or are different (a mismatch case, or miss). The matchlines are fed to an encoder
that generates a binary match location corresponding to the matchline that is in the match
state. An encoder is used in systems where only a single match is expected. In CAM
applications where more than one word may match, a priority encoder is used instead of a
simple encoder. A priority encoder selects the highest priority matching location to map to
the match result, with words in lower address locations receiving higher priority. In addition,
there is often a hit signal (not shown in the figure) that flags the case in which there is no
matching location in the CAM. The overall function of a CAM is to take a search word and
return the matching memory location. One can think of this operation as a fully
programmable arbitrary mapping of the large space of the input search word to the smaller
space of the output match location.
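The search operation described above can be summarized in a short behavioral sketch. The function name, the use of bit strings for words and the table contents are assumptions made for illustration; the priority rule (lower address wins) follows the text.

```python
def cam_search(table, key):
    """Return (address, hit) for a binary CAM holding equal-width bit strings."""
    # Each stored word drives one matchline; in hardware all comparisons
    # happen in parallel, here they are simply listed.
    matchlines = [word == key for word in table]
    # The hit signal distinguishes "some word matched" from "no match".
    hit = any(matchlines)
    # Priority encoder: words in lower address locations have higher priority.
    address = matchlines.index(True) if hit else None
    return address, hit


# Hypothetical contents of a 4x3 CAM, address 0 first.
table = ["101", "010", "101", "111"]
print(cam_search(table, "101"))   # two words match; address 0 has priority
print(cam_search(table, "000"))   # no stored word matches
```

The model returns a single address, as a priority encoder would; a plain encoder would only be valid when at most one matchline is active.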
Fig. 3.2: Profile of CAM capacity in bits (log scale) versus year of publication [33]-[40]
The operation of a CAM is like that of the tag portion of a fully associative cache. The
tag portion of a cache compares its input, which is an address, to all addresses stored in the
tag memory. In the case of match, a single matchline goes high, indicating the location of a
match. Unlike CAMs, caches do not use priority encoders since only a single match occurs;
instead, the matchline directly activates a read of the data portion of the cache associated with
the matching tag. Many circuits are common to both CAMs and caches; however, we focus
on large-capacity CAMs rather than on fully associative caches, which target smaller
capacity and higher speed. Today's largest commercially available single-chip CAMs are 18
Mbit implementations, although the largest CAMs reported in the literature are 9 Mbit in size
[33], [40]. As a rule of thumb, the largest available CAM chip is usually about half the size of
the largest available SRAM chip. This rule of thumb comes from the fact that a typical CAM
cell consists of two SRAM cells, as we will see shortly. Fig. 3.2 plots (on a logarithmic scale)
the capacity of published CAM chips [33]-[40] versus time from 1985 to 2004, revealing an
exponential growth rate typical of semiconductor memory circuits and the factor-of-two
relationship between SRAM and CAM.
3.2 CORE CELLS AND MATCHLINE STRUCTURE
A CAM cell serves two basic functions: bit storage (as in RAM) and bit comparison
(unique to CAM). Fig. 3.3 shows a NOR-type CAM cell [Fig. 3.3(a)] and the NAND-type
CAM cell [Fig. 3.3(b)]. The bit storage in both cases is an SRAM cell where cross-coupled
inverters implement the bit-storage nodes D and DB. To simplify the schematic, we omit the
nMOS access transistors and bitlines which are used to read and write the SRAM storage bit.
Although some CAM cell implementations use lower area DRAM cells [3.27], [3.31],
typically, CAM cells use SRAM storage. The bit comparison, which is logically equivalent to
an XOR of the stored bit and the search bit, is implemented in a somewhat different fashion in
the NOR and the NAND cells.
3.2.1 Structure of NOR Cell
A NOR cell implements the comparison between the complementary stored bit, D
(and DB), and the complementary search data on the complementary searchline, SL (and
SLB), using four comparison transistors, M1 through M4, which are all typically
minimum-size to maintain high cell density. These transistors implement the pull down path
of a dynamic XNOR logic gate with inputs SL and D. Each pair of transistors, M1/M3 and
M2/M4, forms a pull down path from the matchline, ML, such that a mismatch of SL and D
activates at least one of the pull down paths, connecting ML to ground. A match of SL and D
disables both pull down paths, disconnecting ML from ground. The NOR nature of this cell
becomes clear when multiple cells are connected in parallel to form a CAM word by shorting
the ML of each cell to the ML of adjacent cells. The pull down paths connect in parallel,
resembling the pull down path of a CMOS NOR logic gate. There is a match condition on a
given ML only if every individual cell in the word has a match.
Fig. 3.3: CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The
cells are shown using SRAM-based data-storage cells.
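The behavior of a NOR-type word can be sketched at the logic level as follows. The function names are hypothetical, and the pairing of gate signals (one path gated by SL and DB, the other by SLB and D) is an assumption chosen so that a path conducts exactly on a mismatch, as the text requires.

```python
def nor_cell_pulls_down(d, sl):
    """True if either series pull-down path of a binary NOR cell conducts."""
    db, slb = 1 - d, 1 - sl
    # Assumed wiring: one path is gated by SL and DB, the other by SLB and D.
    # A path conducts only when both of its transistors are ON.
    return bool((sl and db) or (slb and d))


def nor_matchline(stored, search):
    """Return the matchline level after evaluation (1 = match, 0 = miss)."""
    ml = 1  # the matchline is precharged HIGH before evaluation
    for d, sl in zip(stored, search):
        if nor_cell_pulls_down(d, sl):
            ml = 0  # any single mismatching cell discharges the shared line
    return ml


print(nor_matchline([1, 0, 1], [1, 0, 1]))   # all bits agree
print(nor_matchline([1, 0, 1], [1, 1, 1]))   # one bit differs
```

Because the cells share one matchline in parallel, a single mismatching bit is enough to discharge it, which is the wired-NOR behavior described above.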
3.2.2 Structure of NAND Cell
The NAND cell implements the comparison between the stored bit, D, and
corresponding search data on the corresponding searchlines, (SL, SLB), using the three
comparison transistors M1, MD and MDB, which are all typically minimum-size to maintain
high cell density. We illustrate the bit-comparison operation of a NAND cell through an
example. Consider the case of a match when SL=1 and D=1. Pass transistor MD is ON and
passes the logic 1 on the SL to node B. Node B is the bit-match node, which is logic 1 if
there is a match in the cell. The logic 1 on node B turns ON transistor M1. Note that M1 is also
turned ON in the other match case, when SL=0 and D=0. In this case, the transistor MDB
passes a logic high to raise node B. The remaining cases, where SL and D differ, result in a miss
condition, and accordingly node B is logic 0 and the transistor M1 is OFF. Node B is a
pass-transistor implementation of the XNOR function of SL and D. The NAND nature of this cell
becomes clear when multiple NAND cells are serially connected. In this case, the MLn and
MLn+1 nodes are joined to form a word. A serial nMOS chain of all the M1 transistors
resembles the pull down path of a CMOS NAND logic gate. A match condition for the entire
word occurs only if every cell in a word is in the match condition. An important property of
the NOR cell is that it provides a full rail voltage at the gates of all comparison transistors.
On the other hand, a deficiency of the NAND cell is that it provides only a reduced logic 1
voltage at node B, which can reach only VDD - Vtn when the searchlines are driven to VDD
(where VDD is the supply voltage and Vtn is the nMOS threshold voltage).
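The pass-transistor comparison in the NAND cell can be sketched in the same logic-level style. The function names are hypothetical; the model treats node B as an ideal logic value and therefore ignores the reduced VDD - Vtn level mentioned above.

```python
def nand_node_b(d, sl):
    """Logic value of the bit-match node B in a binary NAND cell."""
    slb = 1 - sl
    # MD (gated by D) passes SL to node B; MDB (gated by DB) passes SLB.
    # Exactly one of the two pass transistors is ON for a binary stored bit.
    return sl if d == 1 else slb


def nand_word_matches(stored, search):
    """True if every series transistor M1 is ON, so the chain conducts."""
    return all(nand_node_b(d, sl) == 1 for d, sl in zip(stored, search))


print(nand_word_matches([1, 0, 1], [1, 0, 1]))   # every cell matches
print(nand_word_matches([1, 0, 1], [1, 0, 0]))   # last cell breaks the chain
```

Node B evaluates to the XNOR of SL and D, and the series chain conducts only when all cells match, mirroring the pull down path of a NAND gate.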
3.2.3 Ternary Cells
The NOR and NAND cells that have been presented are binary CAM cells. Such cells
store either a logic 0 or a logic 1. Ternary cells, in addition, store an X value; both
NOR-type and NAND-type ternary cells are commonly used. The X value is a don't care that
represents both 0 and 1, allowing a wildcard operation. Wildcard operation means that an X
value stored in a cell causes a match regardless of the input bit. As discussed earlier, this is
a feature used in packet forwarding in Internet routers. A ternary symbol can be encoded into
two bits according to Table 3.1. We represent these two bits as D and DB. Note that although D
and DB are not necessarily complementary, we maintain the complementary notation for
consistency with the binary CAM cell. Since two bits can represent four possible states, but
ternary storage requires only three states, we disallow the state where D and DB are both
zero. To store a ternary value in a NOR cell, we add a second SRAM cell, as shown in Fig.
3.4(a). One bit, D, connects to the left pull down path and the other bit, DB, connects to the
right pull down path, making the pull down paths independently controlled. We store an X by
setting both D and DB equal to logic 1, which disables both pull down paths and forces the
cell to match regardless of the inputs. We store a logic 1 by setting D=1 and DB=0 and
store a logic 0 by setting D=0 and DB=1. In addition to storing an X, the cell allows
searching for an X by setting both SL and SLB to logic 0. This is an external don't care
that forces a match of a bit regardless of the stored bit.
Table 3.1: Truth Table for NOR Cell
Value    Stored D    Stored DB    Search SL    Search SLB
0        0           1            0            1
1        1           0            1            0
X        1           1            0            0
Table 3.2: Truth Table for NAND Cell
Value    Stored D    Stored M     Search SL    Search SLB
0        0           0            0            1
1        1           0            1            0
X        0           1            1            1
X        1           1            1            1
Fig. 3.4: Structure of Ternary core cells for (a) NOR-type (b) NAND-type CAM [41], [42].
Although storing an X is possible only in ternary CAMs, an external X symbol is
possible in both binary and ternary CAMs. In cases where ternary operation is needed but
only binary CAMs are available, it is possible to emulate ternary operation using two binary
cells per ternary symbol.
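The ternary NOR encoding can be exercised with a small logic-level sketch. The dictionary and function names are hypothetical, and the conduction rule (a path conducts only when its searchline is 1 and its stored bit is 0) is an assumption chosen to agree with the statement that D = DB = 1 disables both pull down paths.

```python
# Two-bit encodings of the ternary NOR cell:
# value -> (D, DB) for storage and value -> (SL, SLB) for search.
STORED = {"0": (0, 1), "1": (1, 0), "X": (1, 1)}
SEARCH = {"0": (0, 1), "1": (1, 0), "X": (0, 0)}


def ternary_cell_matches(stored_symbol, search_symbol):
    """True if neither pull-down path of the ternary NOR cell conducts."""
    d, db = STORED[stored_symbol]
    sl, slb = SEARCH[search_symbol]
    # Assumed behavior: each path needs its searchline HIGH and its stored
    # bit LOW in order to conduct.
    left_path = sl == 1 and d == 0
    right_path = slb == 1 and db == 0
    return not (left_path or right_path)


def ternary_word_matches(stored_word, search_word):
    """True if every cell of the word matches, so the matchline stays HIGH."""
    return all(ternary_cell_matches(s, k)
               for s, k in zip(stored_word, search_word))


print(ternary_word_matches("1X0", "110"))   # stored X matches a searched 1
print(ternary_word_matches("1X0", "100"))   # stored X matches a searched 0
print(ternary_word_matches("101", "X0X"))   # searched X masks bits 0 and 2
print(ternary_word_matches("1X0", "010"))   # first bit mismatches
```

A stored X and a searched X both work by keeping the two pull down paths off, the first through the stored bits and the second through the searchlines.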
As a modification to the ternary NOR cell of Fig. 3.4(a), it has been proposed to implement
the pull down transistors M1-M4 using pMOS devices and to complement the logic levels of the
searchlines and matchlines accordingly. Using pMOS transistors (instead of nMOS
transistors) for the comparison circuitry allows for a more compact layout, due to reducing
the number of spacings of p-diffusions to n-diffusions in the cell. In addition to increased
density, the smaller area of the cell reduces wiring capacitance and therefore reduces power
consumption. The tradeoff that results from using minimum-size pMOS transistors, rather
than minimum-size nMOS transistors, is that the pulldown path will have a higher equivalent
resistance, slowing down the search operation.
A NAND cell can be modified for ternary storage by adding storage for a mask bit at
node M, as depicted in Fig. 3.4(b) [41], [42]. When storing an X, we set this mask bit to
1. This forces transistor Mmask ON, regardless of the value of D, ensuring that the cell
always matches. In addition to storing an X, the cell allows searching for an X by setting
both SL and SLB to logic 1. Table 3.2 lists the stored encoding and search-bit encoding for
the ternary NAND cell. Further minor modifications to CAM cells include mixing parts of the
NAND and NOR cells, using dynamic-threshold techniques in silicon-on-insulator (SOI)
processes, and alternating the logic level of the pull down path to ground in the NOR cell
[44]-[46].
Currently, the NOR cell and the NAND cell are the prevalent core cells for providing
storage and comparison circuitry in CMOS CAMs. For a comprehensive survey of the
precursors of CMOS CAM cells refer to [47].
3.3 MATCHLINE SENSING SCHEMES
This section reviews matchline sensing schemes that generate the match result. First,
we review the conventional precharge-high scheme, then introduce several variations that
save power.
3.3.1 Conventional (Precharge-High) Matchline Sensing
We review the basic operation of the conventional precharge-high scheme and look at
sensing speed, charge sharing, timing control and power consumption.
3.3.1.1 Basic Operation
The basic scheme for sensing the state of the NOR matchline is first to precharge the
matchline high and then evaluate by allowing the NOR cells to pull down the matchlines in
the case of a miss, or leave the matchline high in the case of a match. Fig. 3.5(a) shows, in
schematic form, an implementation of this matchline-sensing scheme. Fig. 3.5(b) shows the
signal timing, which is divided into three phases: SL precharge, ML precharge, and ML
evaluation. The operation begins by asserting slpre to precharge the searchlines low,
disconnecting all the pull down paths in the NOR cells. With the pull down paths
disconnected, the operation continues by asserting mlpreb to precharge the matchline high.
Once the matchline is high, both slpre and mlpreb are de-asserted. The ML evaluate phase
begins by placing the search word on the searchlines. If there is at least one single-bit miss on
the matchline, a path (or multiple paths) to ground will discharge the matchline, ML,
indicating a miss for the entire word, which is output on the MLSA sense-output node, called
MLso. If all bits on the matchline match, the matchline will remain high, indicating a match
for the entire word. Using this sketch of the precharge-high scheme, we will investigate the
performance of the matchline in terms of speed, robustness, and power consumption. The
matchline power dissipation is one of the major sources of power consumption in CAM.
Fig. 3.5: (a) The schematic with precharge circuitry for matchline sensing using the
precharge-high scheme, and (b) the corresponding timing diagram showing relative signal
transitions. [34]
3.3.1.2 Matchline power
In a typical system, the number of misses is expected to be much greater than the
number of matches; thus, using a dynamic NAND structure results in a significant reduction
in power. Fig. 3.6 demonstrates the power advantages of a NAND architecture versus a NOR.
However, a NAND matchline architecture is considerably slower than its NOR counterpart,
especially for wide words.
Fig. 3.6: Matchline power for NAND and NOR architecture [33]
3.3.1.3 Charge Sharing
There is a potential charge-sharing problem depending on whether the CAM storage
bits D and DB are connected to the top transistor or the bottom transistor in the pulldown
path.
Fig. 3.7: Two possible configurations for the NOR cell: (a) the stored bit is connected to the
bottom transistors of the pulldown pair, and (b) the stored bit is connected to the top
transistors of the pulldown pair.
Fig. 3.7 shows these two possible configurations of the NOR cell. In the configuration
of Fig. 3.7(a), there is a charge-sharing problem between the matchline, ML, and nodes X1
and X2. Charge sharing occurs during matchline evaluation, which occurs immediately after
the matchline precharge-high phase. During matchline precharge, SL and SLB are both at
ground. Once the precharge completes, one of the searchlines is activated, depending on the
search data, causing either M1 or M2 to turn ON. This shares the charge at node X1 or node
X2 with that of ML, causing the ML voltage, VML, to drop, even in the case of match, which
may lead to a sensing error. To avoid this problem, designers use the configuration shown in
Fig. 3.7(b), where the stored bit is connected to the top transistors. Since the stored bit is
constant during a search operation, charge sharing is eliminated.
3.3.1.4 Power Consumption
The dynamic power consumed by a single matchline that misses is due to the rising
edge during precharge and the falling edge during evaluation, and is given by the equation
\[ P_{miss} = f \, C_{ML} \, V_{DD}^{2} \]
where f is the frequency of search operations and CML is the matchline capacitance. In the
case of a match, the power consumption associated with a single matchline depends on the
previous state of the matchline; however, since typically there are only a small number of
matches we can neglect this power consumption. Accordingly, the overall matchline power
consumption of a CAM block with w matchlines is
\[ P_{ML} = w \, P_{miss} \]
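As a rough numerical illustration of the two expressions above, the following sketch evaluates the precharge-high matchline power. The function names and all parameter values are hypothetical and are not taken from the thesis.

```python
def miss_power(f, c_ml, vdd):
    """Dynamic power of one missing matchline: f * C_ML * VDD^2 (watts)."""
    return f * c_ml * vdd ** 2


def matchline_power(w, f, c_ml, vdd):
    """Overall matchline power for w matchlines, neglecting the few matches."""
    return w * miss_power(f, c_ml, vdd)


# Hypothetical CAM block (illustrative values only).
W = 1000          # number of matchlines (stored words)
F = 100e6         # search frequency in Hz
C_ML = 100e-15    # matchline capacitance in farads
VDD = 2.0         # supply voltage in volts

print(miss_power(F, C_ML, VDD))          # about 4e-05 W per matchline
print(matchline_power(W, F, C_ML, VDD))  # about 0.04 W for the block
```

The estimate grows linearly with the number of words and the search rate, and quadratically with the supply voltage.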
3.5CURRENTRACESCHEME
Current-Race saving is an important scheme for the CAM architecture. Fig. 3.9(a)
shows a simplified schematic of the current-race scheme [51]. This scheme precharges the
matchline low and evaluates the matchline state by charging the matchline with a current IML
supplied by a current source. The signal timing is shown in Fig. 3.9(b). The precharge signal,mlpre, starts the search cycle by precharging thematchline low. Since the matchline is
precharged low, the scheme concurrently charges the searchlines to their search data values,
eliminating the need for a separate SL precharge phase required by the precharge-high
scheme of Fig. 3.5(b). Instead, there is a single SL/ML precharge phase, as indicated in Fig.
3.9(b). After the SL/ML precharge phase completes, the enable signal, enb, connects the
current source to the matchline. A matchline in the match state charges linearly to a high
voltage, while a matchline in the miss state charges to a voltage of only IML*RML /m, where
m denotes the number of misses in cells connected to the matchline. By setting the maximum
voltage of a miss to be small, a simple matchline sense amplifier easily differentiates between
a match state and a miss state and generates the signal MLso. As shown in Fig. 3.9, the
amplifier is the nMOS transistor, Msense, whose output is stored by a half-latch. The nMOS
sense transistor trips the latch with a threshold of Vtn. After some delay, matchlines in the
match state will charge to slightly above Vtn, tripping their latch, whereas matchlines in the miss
state will remain at a much smaller voltage, leaving their latch in the initial state. A simple
replica matchline (not shown) controls the shutoff of the current source and the latching of
the match signal. We derive the power consumption of this scheme by first noting that the
same amount of current is discharged into every matchline, regardless of the state of the
matchline. Looking at the match case for convenience, the power consumed to charge a
matchline to slightly above Vtn is
$P_{match} = f C_{ML} V_{DD} V_{tn}$
Since the power consumption of a match and a miss are identical, the overall power
consumption for all w matchlines is
$P_{ML} = w P_{match}$
This equation is identical to that of the low-swing scheme (previous equation) with
$V_{MLswing} = V_{tn}$. The benefits of this scheme over the precharge-high schemes are the simplicity
of the threshold circuitry and the extra savings in searchline power due to the elimination of
the SL precharge phase.
Fig. 3.9: Current-race matchline sensing [51]: (a) circuit implementation including precharge
circuitry and (b) a timing diagram for a single search cycle.
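The charging behaviour and the power expression of the current-race scheme can be sketched numerically. The Python model below treats each missing cell as an ideal resistor RML to ground, so a matchline with m misses settles at IML*RML/m while a matching line ramps linearly. This first-order RC model and all component values (20 uA, 50 fF, 5 kOhm, Vtn = 0.5 V) are assumptions made for illustration, not data from [51] or from this thesis.

```python
import math


def ml_voltage(t, i_ml, c_ml, r_ml, m):
    """Matchline voltage t seconds after the current source is enabled.

    m is the number of missing cells; each miss adds one pulldown path of
    resistance r_ml, so the total pulldown resistance is r_ml / m.
    """
    if m == 0:
        # Match: no pulldown path, so the capacitance charges linearly.
        return i_ml * t / c_ml
    r_eq = r_ml / m
    # Miss: first-order RC response that settles at i_ml * r_ml / m.
    return i_ml * r_eq * (1.0 - math.exp(-t / (r_eq * c_ml)))


def match_power(f, c_ml, vdd, vtn):
    """P_match = f * C_ML * VDD * Vtn (charge C_ML * Vtn drawn from VDD)."""
    return f * c_ml * vdd * vtn


# Assumed illustrative values (not from the thesis).
I_ML, C_ML, R_ML, VTN, VDD = 20e-6, 50e-15, 5e3, 0.5, 1.8
t_trip = C_ML * VTN / I_ML  # time for a matching line to reach Vtn
print(f"match reaches Vtn after {t_trip * 1e9:.2f} ns")
print(f"one-miss line at that time: {ml_voltage(t_trip, I_ML, C_ML, R_ML, 1):.3f} V")
```

In this model the ratio of current-race power to precharge-high miss power is simply Vtn/VDD, which is where the saving over a full-swing matchline comes from.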
An important feature of the current-race scheme is that it allows the CAM cell configuration
to be changed, because the matchline is precharged low. With precharge low, there is no charge-sharing problem for
either CAM cell configuration of Fig. 3.7, since the ML precharge level is the same as the
level of the intermediate nodes X1 and X2. Rather than avoiding charge sharing, the criterion
that determines which cell to use in this case is matching parasitic capacitances between
MLs. In the configuration of Fig. 3.7(b), the parasitic load on a matchline depends on the
ON/OFF state of M1 and of M2. Since different cells will have different stored data, there
will be variations in the capacitance CML among the MLs. However, in the configuration of
Fig. 3.7(a), the variation of parasitic capacitance on the matchline depends only on the states
of SL and SLB which are the same for all cells in the same column. Thus, the configuration
of Fig. 3.7(a) maintains good matching between MLs and prevents possible sensing errors
due to parasitic capacitance variations.
3.6 SELECTIVE PRECHARGE SCHEME
The selective-precharge scheme arose from considering nonuniform ML power
consumption. The matchline-sensing techniques we have seen so far expend approximately
the same amount of energy on every matchline, regardless of the specific data pattern and
whether there is a match or a miss. We now examine three schemes that allocate power to
matchlines nonuniformly. The first technique, called selective precharge, performs a match
operation on the first few bits of a word before activating the search of the remaining bits
[52]. For example, in a 144-bit word, selective precharge initially searches only the first 3 bits
and then searches the remaining 141 bits only for words that matched in the first 3 bits.
Assuming a uniform random data distribution, the initial 3-bit search should allow only 1/2^3 = 1/8
of the words to survive to the second stage, saving about 88% of the matchline power. In practice,
there are two sources of overhead that limit the power saving. First, to maintain speed, the
initial match implementation may draw a higher power per bit than the search operation on
the remaining bits. Second, an application may have a data distribution that is not uniform,
and, in the worst-case scenario, the initial match bits are identical among all words in the
CAM, eliminating any power saving.
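The arithmetic behind the 144-bit example can be reproduced with a short Python sketch. For a k-bit initial search on uniformly random data, a fraction 2^-k of the words is expected to survive. The second function additionally assumes that every bit costs the same matchline energy, which is an idealisation introduced here and ignores the first overhead mentioned above.

```python
def surviving_fraction(k):
    """Fraction of words expected to match k uniformly random initial bits."""
    return 0.5 ** k


def relative_ml_energy(n_bits, k):
    """Matchline energy relative to searching all n_bits of every word.

    Assumes every bit costs the same energy (an idealisation): the first k
    bits are always searched, the other n_bits - k only for surviving words.
    """
    return (k + (n_bits - k) * surviving_fraction(k)) / n_bits


filtered = 1.0 - surviving_fraction(3)  # words stopped after the 3-bit stage
print(f"filtered words: {filtered:.1%}")
print(f"relative energy, 144-bit word: {relative_ml_energy(144, 3):.3f}")
```

The 88% figure quoted above corresponds to the fraction of words filtered out (7/8). Under the assumed equal-cost-per-bit model the energy saving is slightly lower, because the 3 initial bits are still searched in every word.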
Fig. 3.10 is a simplified schematic of an example of selective precharge similar to that
presented in the original paper [52]. The example uses the first bit for the initial search and
the remaining n-1 bits for the remaining search. To maintain speed, the implementation
modifies the precharge part of the precharge-high scheme [of Fig. 3.7(a) and (b)]. The ML is
precharged through the transistor M1, which is controlled by the NAND CAM cell and turned on only if there is a match in the first CAM bit. The remaining cells are NOR cells.
Fig. 3.10: Sample implementation of the selective-precharge matchline technique [52].
Note that the ML of the NOR cells must be pre-discharged (circuitry not shown) to
ground to maintain correct operation in the case that the previous search left the matchline
high due to a match. Thus, one implementation of selective precharge is to use this mixed
NAND/NOR matchline structure. Selective precharge is perhaps the most common method
used to save power on matchlines [34], [53]-[57] since it is both simple to implement and can
reduce power by a large amount in many CAM applications.
3.7 PIPELINING SCHEME
More generally, an implementation may divide the matchline into any number of
segments, where a match in a given segment results in a search operation in the next segment
but a miss terminates the match operation for that word. A design that uses multiple
matchline segments in a pipelined fashion is the pipelined matchlines scheme [58], [59]. Fig.
3.11(a) shows a simplified schematic of a conventional NOR matchline structure where all
cells are connected in parallel. Fig. 3.11(b) shows the same set of cells as in Fig. 3.11(a), but
with the matchline broken into four matchline segments that are serially evaluated. If any
stage misses, the subsequent stages are shut off, resulting in power saving. The drawbacks of
this scheme are the increased latency and the area overhead due to the pipeline stages.
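The serial, early-terminating evaluation described above can be illustrated with a small behavioural model in Python. This is only a sketch of the control flow, not a circuit model: the function name, the bit-string representation, and the equal-length segments are assumptions made for this example.

```python
def pipelined_search(stored, search, n_segments):
    """Evaluate one word segment by segment, stopping at the first miss.

    Returns (is_match, segments_evaluated). Both words are equal-length bit
    strings whose length is assumed to be a multiple of n_segments.
    """
    seg_len = len(stored) // n_segments
    for i in range(n_segments):
        lo, hi = i * seg_len, (i + 1) * seg_len
        if stored[lo:hi] != search[lo:hi]:
            # A miss shuts off all later segments for this word.
            return False, i + 1
    return True, n_segments


print(pipelined_search("10110100", "10110100", 4))  # full match, all segments used
print(pipelined_search("10110100", "10010100", 4))  # miss in the second segment
```

The number of segments evaluated is a rough proxy for matchline energy in this model: a word that misses early activates only a fraction of its matchline, at the cost of one pipeline stage of latency per segment.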
Fig. 3.11: Pipelined matchlines reduce power by shutting down after a miss in a stage.
By itself, a pipelined matchline scheme is not as compelling as basic selective precharge;
however, pipelining enables the use of hierarchical searchlines, thus saving power. Another
approach is to segment the matchline so that each individual bit forms a segment. Thus,
selective precharge operates on a bit-by-bit basis. In this design, the CAM cell is modified so
that the match evaluation ripples through each CAM cell. If at any cell there is a miss, the
subsequent cells do not activate, as there is no need for a comparison operation. The
drawback of this scheme is the extra circuitry required at each cell to gate the comparison
with the result from the previous cell.
Fig. 3.12: Simulated waveforms in the pipelined match-line architecture for (a) the full-match
case consisting of a match in every stage, and (b) a miss case where the third stage results in a miss and turns off the subsequent stages [58].
Fig. 3.12 shows the simulated waveforms of the ML segments in the pipelined ML
scheme. Fig. 3.12(a) shows a full match, as indicated by the rising ML in every segment along
with the corresponding full-rail output of the MLSA. Fig. 3.12(b) shows an example of a word
that misses in the third stage, as indicated by the lack of an MLSA output pulse. In this
example, the ML sensing circuitry of the fourth and fifth segments is not activated, hence
saving power.
3.8 CURRENT SAVING SCHEME
The current-saving scheme [60], [61] is another data-dependent matchline-sensing
scheme which is a modified form of the current-race sensing scheme. Recall that the current-
race scheme uses the same current on each matchline, regardless of whether it has a match or
a miss. The key improvement of the current-saving scheme is to allocate a different amount
of current for a match than for a miss. In the current-saving scheme, matches are allocated a
larger current and misses are allocated a lower current. Since almost every matchline has a
miss, overall the scheme saves power.
Fig. 3.13: Current-saving matchline-sensing scheme.
Fig. 3.13 shows a simplified schematic of the current-saving scheme. The main
difference from the current-race scheme as depicted in Fig. 3.9 is the addition of the current-
control block. This block is the mechanism by which a different amount of current is
allocated, based on a match or a miss. The input to this current-control block is the matchline
voltage, VML, and the output is a control voltage that determines the current, IML, which
charges the matchline. The current-control block provides positive feedback since higher VML
results in higher IML, which, in turn, results in higher VML.
3.9 CONCLUSION
The above discussion of previous content-addressable memory (CAM) designs shows that
the main concerns were power and speed. Researchers gave considerable
importance to these two factors and improved both significantly. Scheme
simplicity and noise immunity are two other factors that researchers also took into account. The previously reported simulation results are
summarized in the table below.
Table 3.3: Comparison between the schemes [62]
Scheme               ML energy (fJ/bit/search)   Cycle time (ns)   Noise immunity   Scheme simplicity
Conventional         9.5                         3.9               +                ++
Low swing            4.2                         3.1               -                -
Current race         5.5                         3.7               -                +
Selective precharge  5.6                         3.5               +                +
Pipelining           5.8                         3.8               +                -
Current saving       4.3                         3.7               -                --
CHAPTER 4
PROPOSED CHARGING CONTROL SCHEME & SENSE AMPLIFIER
4.1 INTRODUCTION
Content-addressable memory (CAM) is a storage device that searches for the
matching data by content and returns the address of the matching data. Due to the rising demand for CAM with high-speed search capability in various applications, the size as well as the
power consumption of CAM arrays continues to rise.
Considerable research has been done to reduce the power consumption and to increase the speed. Previous works present schemes such as the selective-precharge scheme [63], the current-race scheme [64], and the current-saving scheme [65], [66]. A survey [67] reports that the current-saving scheme consumes less power than other schemes [63], [64], [68]. In this work, we propose a simplified match-line charging control scheme. Comparison with the current-race and current-saving schemes shows a match-line energy reduction of over 50% and a speed improvement of over 3 times.
4.2 PROPOSED CHARGE CONTROLLING SCHEME
Fig. 4.1(a) shows the basic architecture of the proposed charge controlling scheme. Here, the array of CAM cells stores the data entries; the search data register stores the search word; each charge controller block controls the charging and discharging process of the respective match-line (ML); and the sense amplifier (SA) senses the ML voltage and gives the final match/miss decision. Fig. 4.1(b) shows the internal circuit of a conventional NOR-type ternary CAM (TCAM) cell, which is less susceptible to failure due to process variation than a NAND-type cell [69].
At the beginning of each search cycle, all MLs are pre-discharged to ground by the charging controller. During the evaluation stage, the search data register broadcasts the search data to the search-lines (SLs). If SL matches the stored bit D or the stored bit is X (don't care), the ML has no discharging path to ground and the path remains in the high-impedance state. If SL does not match D, the ML has a discharging path to ground (through either the T1-T2 path or the T3-T4 path). So, in an ML, the number of discharging paths is equal to the number of mismatches.
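The relation between mismatches and discharging paths can be captured in a short behavioural model. The Python sketch below is an illustration only: the function names and the string encoding of ternary words ('0', '1', 'X') are assumptions for this example and do not describe the transistor-level cell.

```python
def discharge_paths(stored, search):
    """Count the pulldown paths on a matchline of NOR-type TCAM cells.

    stored is a string over '0', '1', 'X' (don't care); search is a string
    over '0', '1'. Each cell whose stored bit differs from the search bit
    opens one discharge path; an 'X' cell never opens one.
    """
    if len(stored) != len(search):
        raise ValueError("stored and search words must have equal length")
    return sum(1 for d, s in zip(stored, search) if d != "X" and d != s)


def is_match(stored, search):
    """A word matches when its matchline has no discharge path."""
    return discharge_paths(stored, search) == 0


print(discharge_paths("10X1", "1011"), is_match("10X1", "1011"))
print(discharge_paths("10X1", "0010"), is_match("10X1", "0010"))
```

A word with exactly one discharge path corresponds to the one-miss case, which has the weakest pulldown and is therefore the hardest miss to distinguish from a match.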
Fig. 4.2 shows the proposed charging controller. Here, at the beginning of the search cycle, ML is pre-discharged to ground by a high MLP. The transistor M2 is turned ON by a low MLP and M5 is OFF by a low MLC; this causes M3 to be turned OFF. Therefore, charging of ML through M3 remains prohibited during the pre-discharging of ML.
During the evaluation stage, MLP is switched to low so that both M1 and M2 turn OFF. At the same time, MLC is switched to high so that charging of ML through M3 begins.
If the CAM cell data of this ML is fully matched with the search-line data, there is no discharging path through the CAM cell. So, charging of the fully matched match-line (denoted by ML0) is faster than that of a partially matched match-line. MLC is kept high until the voltage of ML0 is enough to turn M4 ON; this causes the gate of the transistor M3 to be low and charging of ML0 continues via M3. As the voltage of the match-line ML0 reaches about VDD/3 and MLC is low, M6 becomes fully ON and M7 becomes partially ON. The end result is no more charging of ML through M3. The achieved voltage (~VDD/3) of ML0 is high enough to be sensed as a high level by the proposed sense amplifier (SA). Now, the match-line which has one miss is denoted by ML1, and this ML is the hardest to detect as a miss. As charging of ML1 is slower than that of ML0, the voltage of ML1 is not enough to be sensed as a high level by the SA.
Fig. 4.1: Structure of the CAM of the proposed scheme: (a) basic architecture and (b) NOR-type
TCAM cell used in the scheme.
Fig. 4.2: Internal circuit of the charging controller.
Fig. 4.3 shows the proposed sense amplifier (SA). During the precharge stage, the node
SN is precharged to high through MS1. During the evaluation stage, when the voltage of ML
is slightly above VDD/3, the transistor MS2 turns ON. So, the node SN begins to discharge
and MLS begins to rise. As MLS reaches high level, the transistor MS3 turns ON resulting in
faster discharge of node SN. Transistor MS4 is used to initialize the output MLS to ground at
the beginning of each search cycle.
4.3 SIMULATION RESULTS AND ANALYSIS
The proposed scheme, along with the current-race and current-saving schemes, is
simulated using HSpice in the same 64 × 72 TCAM in the TSMC 0.18 µm CMOS process with
a supply voltage of 1.8 V.
Fig. 4.4 shows the simulation result for our design. At the beginning of a search cycle,
MLP is high for 0.39 ns so that previously charged MLs can discharge to ground. Then, MLP
is switched to low and MLC is switched to high for 0.33 ns. Once the voltage of ML0 has risen to
0.65 V and MLC is low, the charging stops and the voltage of ML0 remains unchanged.
The proposed sense amplifier senses this voltage as a match and the output MLS0 turns to a full
high level. On the other hand, the voltage of ML1 rises up to 0.46 V and then degrades
through the discharging path. The SA senses this voltage as a miss and produces a low MLS1.
As the difference between the maximum voltages of ML0 and ML1 (the worst-case match-line) is 190 mV,
our scheme is expected to produce the correct search results even for a large process variation.
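The match/miss decision for the reported voltages can be illustrated with an idealised threshold model. The Python sketch below treats the sense amplifier as an ideal comparator switching at VDD/3; the real circuit has a finite transition region, so this is a simplifying assumption made only for this example.

```python
VDD = 1.8          # supply voltage from the simulation setup (V)
V_SENSE = VDD / 3  # assumed ideal switching threshold of the SA (V)


def sense(v_ml, threshold=V_SENSE):
    """Idealised sense amplifier: report a match when v_ml exceeds threshold."""
    return v_ml > threshold


v_ml0, v_ml1 = 0.65, 0.46  # peak ML0 and ML1 voltages reported above (V)
margin = round(v_ml0 - v_ml1, 2)
print(sense(v_ml0), sense(v_ml1), margin)
```

With these numbers ML0 sits above the assumed threshold and ML1 below it, and the 0.19 V separation between them is the margin available to absorb process variation.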
Table 4.1 shows the comparative search energy and performance among the schemes.
It shows that the ML energy reduction is 57% and 54% compared to the current-race and current-saving schemes, respectively, and the speed is 3.13 times that of both schemes.
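The quoted percentages follow directly from the entries of Table 4.1. The short Python check below recomputes them; the helper name `compare` is introduced here only for illustration, and the numbers are copied from the table.

```python
def compare(baseline, proposed):
    """Return (ML energy reduction in %, speed ratio) of proposed vs baseline.

    Each argument is (ML energy in fJ/bit/search, minimum cycle time in ns).
    """
    e_old, t_old = baseline
    e_new, t_new = proposed
    return 100.0 * (1.0 - e_new / e_old), t_old / t_new


# Values copied from Table 4.1.
current_race = (3.56, 3.60)
current_saving = (3.32, 3.60)
this_work = (1.53, 1.15)

for name, ref in (("current-race", current_race), ("current-saving", current_saving)):
    reduction, speedup = compare(ref, this_work)
    print(f"{name}: {reduction:.0f}% lower ML energy, {speedup:.2f}x speed")
```

The speed ratio is the same for both baselines because they share the same minimum cycle time of 3.60 ns.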
The major difference between previous schemes [64]-[66] and our scheme is that we
have used only one PMOS for charging the ML, while previous schemes use two PMOS transistors in
series. So, the equivalent resistance is half that of the previous schemes. The equivalent capacitance is
also reduced. As a result, the speed of the circuit increases. The basic difference between
the previous SA designs [64]-[66] and the proposed SA is the switching threshold of the SA. The
proposed SA can sense a stable 0.6 V as a high level, whereas the previous SAs use ~1 V as a
switching voltage. So, the proposed scheme needs no further charging once 0.6 V (equal to
VDD/3) is reached at ML0. This reduces a large amount of dynamic and leakage
power, while the larger voltage swing of about VDD/2 in [64]-[66] causes higher power
consumption.
Fig. 4.3: Proposed sense amplifier (SA). Labelled signals: ML, MLP, SN, MLS; transistors MS1 to MS4.
Fig. 4.4: Simulation results of the proposed CAM showing voltages ML0 (fully matched),
ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP.
Table 4.1: Comparison of Different Schemes
Schemes                     ML Energy         SL Energy         Minimum Cycle    Speed, 1/T
                            (fJ/bit/search)   (fJ/bit/search)   Time, T (ns)     (MHz)
Current-race [64]           3.56              0.65              3.60             278
Current-saving [65], [66]   3.32              0.65              3.60             278
This work                   1.53              0.62              1.15             870
4.4 CORNER SIMULATION OF THE SCHEME
To test the reliability of the proposed scheme, we have simulated the proposed
scheme in two extreme corners which we defined as follows: the fast corner with low
threshold voltage, high VDD and low temperature (FHL) and the slow process corner with
high threshold voltage, low VDD and high temperature (SLH). Simulation results reveal that
the proposed scheme works satisfactorily in the fast process (FHL) corner with -10%
threshold voltage, +5% VDD and at a temperature of 273 K as well as in the slow process
(SLH) corner with +10% threshold voltage, -5% VDD and at a temperature of 343 K. The
simulat