thesis_v4


  • 7/28/2019 thesis_v4

    1/147


    HIGH SPEED AND LOW POWER DESIGN OF CONTENT ADDRESSABLE MEMORY

    A thesis submitted to the

    DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING

    OF

    BANGLADESH UNIVERSITY OF ENGINEERING AND TECHNOLOGY

    in partial fulfillment of the requirements for the degree of

    Bachelor of Science in Electrical and Electronic Engineering

    By

    MD. NAIMUL HASAN (0306017)

    MD. TAUHIDUR RAHMAN (0306035)

    MD. MEHEDI HASAN (0306071)

    Supervisor

    DR. A. B. M. HARUN UR RASHID

    Professor, Department of EEE,

    BUET, Dhaka 1000

    BANGLADESH UNIVERSITY OF ENGINEERING AND TECHNOLOGY


    Declaration

    We hereby declare that the work presented in this thesis, entitled "High Speed and Low Power Design of Content Addressable Memory", is the outcome of the investigation carried out by us, and that neither this thesis nor any part thereof has been submitted, or is currently being submitted, anywhere else for the award of any degree or diploma.

    Md. Naimul Hasan (0306017)

    Md. Tauhidur Rahman (0306035)

    Md. Mehedi Hasan (0306071)


    Our parents.


    ACKNOWLEDGEMENTS

    First of all, we would like to convey our gratitude to Almighty Allah, without whose wish nothing is possible.

    We would like to express our profound gratitude and appreciation to our thesis supervisor, Professor Dr. A. B. M. Harun Ur Rashid, for his benign attitude towards us. His supervision gave us the opportunity to get involved in this state-of-the-art and rapidly emerging research area of low-power and high-speed circuit design, especially Content Addressable Memory (CAM). His generous help, encouragement and constant guidance accelerated the completion of this thesis.

    We would also like to express our special thanks to Mr. Ataur Rahman Patwary, School of Electrical and Computer Engineering, Oregon State University, USA, for his suggestions and many kinds of help in completing the work.

    We would like to thank the alumni of BUET (also members of the Yahoo! Group BUETian) for their help in providing a lot of necessary material.

    We would also like to acknowledge the Department of Electrical and Electronic Engineering, BUET, for allowing us to use the VLSI server computer to simulate large HSpice files.


    CONTENTS

    Declaration

    Dedication

    Acknowledgement

    Contents

    List of Figures

    List of Tables

    Abstract

    CHAPTER 1: INTRODUCTION

    1.1 INTRODUCTION

    1.2 MOTIVATIONS

    1.3 OBJECTIVES

    1.4 THESIS ORGANIZATION

    CHAPTER 2: DIFFERENT TYPES OF LOGIC

    2.1 INTRODUCTION

    2.2 RATIOED LOGIC

    2.2.1 Load Resistance RL

    2.2.2 nMOS depletion mode transistor pull-up

    2.2.3 nMOS enhancement mode pull-up

    2.2.4 Pseudo-nMOS logic

    2.3 COMPLEMENTARY TRANSISTOR PULL-UP (CMOS)

    2.3.1 Components of Total Power Dissipation in CMOS Circuits

    2.4 DYNAMIC CIRCUITS

    2.5 DOMINO LOGIC

    CHAPTER 3: CONTENT ADDRESSABLE MEMORY (CAM) REVIEW

    3.1 INTRODUCTION

    3.2 CORE CELLS AND MATCHLINE STRUCTURE

    3.2.1 Structure of NOR Cell

    3.2.2 Structure of NAND Cell


    3.2.3 Ternary Cells

    3.3 MATCHLINE SENSING SCHEMES

    3.3.1 Conventional (Precharge-High) Matchline Sensing

    3.3.1.1 Basic Operation

    3.3.1.2 Matchline Power

    3.3.1.3 Charge Sharing

    3.3.1.4 Power Consumption

    3.4 LOW-SWING SCHEMES

    3.5 CURRENT-RACE SCHEME

    3.6 SELECTIVE-PRECHARGE SCHEME

    3.7 PIPELINING SCHEME

    3.8 CURRENT-SAVING SCHEME

    3.9 CONCLUSION

    CHAPTER 4: PROPOSED CHARGING CONTROL SCHEME & SENSE AMPLIFIER

    4.1 INTRODUCTION

    4.2 PROPOSED CHARGE CONTROLLING SCHEME

    4.3 SIMULATION RESULTS AND ANALYSIS

    4.4 CORNER SIMULATION OF THE SCHEME

    4.5 CONCLUSION

    CHAPTER 5: PROPOSED SIMPLIFIED DESIGN OF CHARGING CONTROLLER

    5.1 INTRODUCTION

    5.2 PROPOSED ML CHARGING TECHNIQUE

    5.3 SIMULATION RESULTS AND ANALYSIS

    5.4 CONCLUSION

    CHAPTER 6: PROPOSED CAM WITH IMPROVED NOISE MARGIN

    6.1 INTRODUCTION

    6.2 OPERATION OF THE SCHEME

    6.2.1 Charging controller

    6.2.2 The sense amplifier

    6.3 SIMULATION RESULT


    6.4 CONCLUSION

    CHAPTER 7: CONCLUSION

    7.1 CONCLUSION

    7.2 FUTURE WORK

    Reference

    Bibliography

    RESEARCH PAPERS FROM THIS THESIS

    APPENDIX A

    A.1 HSPICE CODE FOR CHARGING CONTROL SCHEME

    A.2 HSPICE CODE FOR SIMPLIFIED DESIGN OF CHARGING CONTROLLER

    A.3 HSPICE CODE FOR IMPROVED NOISE SCHEME [CHAPTER 6]

    APPENDIX B

    B.1 INTRODUCTION

    B.2 INSTALLATION AND USAGE OF HSPICE 2007

    B.3 BASIC RULES OR QUICK MANUAL [2]

    B.3.1 Input File

    B.3.2 Element Description

    B.3.3 Analysis

    B.3.4 References


    LIST OF FIGURES

    Fig. 2.1: Resistor pull-up

    Fig. 2.2: nMOS depletion mode transistor pull-up

    Fig. 2.3: nMOS enhancement mode pull-up

    Fig. 2.4: Pseudo-nMOS logic

    Fig. 2.5: Complementary transistor pull-up (CMOS)

    Fig. 2.6: CMOS inverter current versus Vin

    Fig. 2.7: Precharge and evaluation of dynamic gates

    Fig. 2.8: Footed dynamic inverter

    Fig. 2.9: Unfooted dynamic gates

    Fig. 2.10: Generalized footed gates

    Fig. 2.11: Logical effort of footed and unfooted dynamic gates

    Fig. 2.12: Monotonicity problem

    Fig. 2.13: Incorrect connection of dynamic gates

    Fig. 2.14: Standard domino logic circuit

    Fig. 2.15: Weak keeper implementation

    Fig. 3.1: Simple schematic model of a 4 x 3 CAM array showing the core memory cells, differential search lines, match lines and encoder

    Fig. 3.2: Profile of CAM capacity (log scale) versus year of publication [33], [40]

    Fig. 3.3: CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells

    Fig. 3.4: Structure of ternary core cells for (a) NOR-type and (b) NAND-type CAM [41], [42]

    Fig. 3.5: (a) Schematic with precharge circuitry for matchline sensing using the precharge-high scheme, and (b) the corresponding timing diagram showing relative signal transitions [34]

    Fig. 3.6: Matchline power for NAND and NOR architectures [33]

    Fig. 3.7: Two possible configurations for the NOR cell: (a) the stored bit is connected to the bottom transistors of the pulldown pair, and (b) the stored bit is connected to the top transistors of the pulldown pair


    Fig. 3.8: Low-swing matchline sensing scheme of [33]

    Fig. 3.9: (a) Circuit implementation including precharge circuitry and (b) timing diagram for a single search cycle, for current-race matchline sensing [51]

    Fig. 3.10: Sample implementation of the selective-precharge matchline technique [52]

    Fig. 3.11: Pipelined matchlines reduce power by shutting down after a miss in a stage

    Fig. 3.12: Simulated waveforms in the pipelined matchline architecture for (a) the full-match case, consisting of a match in every stage, and (b) a miss case, where the third stage results in a miss and turns off the subsequent stages [58]

    Fig. 3.13: Current-saving matchline-sensing scheme

    Fig. 4.1: Structure of the CAM of the proposed scheme: (a) basic architecture and (b) NOR-type TCAM cell used in the scheme

    Fig. 4.2: Internal circuit of the charging controller

    Fig. 4.3: Proposed sense amplifier (SA)

    Fig. 4.4: Simulation results of the proposed CAM showing voltages ML0 (fully matched), ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP

    Fig. 4.5: Corner simulation results for FHL (threshold voltage = -10%, VDD = +5%, temperature = 273 K)

    Fig. 4.6: Corner simulation results for SLH (threshold voltage = +10%, VDD = -5%, temperature = 343 K)

    Fig. 5.1: Simplified conventional CAM architecture

    Fig. 5.2: Structure of the CAM array with the proposed scheme: (a) the basic architecture and (b) internal circuit of the NOR-type TCAM cell used in this scheme. The usual SRAM access transistors and associated bitlines are omitted for simplicity

    Fig. 5.3: The ML charging unit proposed in this work and the sensing unit proposed in [84]

    Fig. 5.4: Simulation results of the proposed CAM showing voltages ML0 (fully matched), ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP

    Fig. 6.1: Structure of the CAM array

    Fig. 6.2: Internal circuitry of improved charging

    Fig. 6.3: Waveforms for CAM


    Fig. 6.4: Conventional sense amplifier

    Fig. 6.5: Charging in different matchlines

    Fig. 6.6: Controlled signals and output of each matchline


    LIST OF TABLES

    Table 3.1: Truth Table for NOR Cell

    Table 3.2: Truth Table for NAND Cell

    Table 3.3: Comparison between the schemes [62]

    Table 4.1: Comparison of Different Schemes

    Table 5.1: Comparison of Different Schemes

    Table 6.1: Comparison of Different Schemes with improved noise margin scheme

    Table 6.2: Comparison of Noise Immunity of different schemes


    ABSTRACT

    The growing market demand for integrated circuits and the worldwide energy crisis have driven researchers to develop new process technologies with smaller transistors and to design circuits more efficiently, so that they need less power and operate at higher speed. The purpose of this dissertation is the same, i.e., to find a way to design digital circuits, specifically Content-Addressable Memory (CAM), that need low power and operate at higher speed while maintaining noise immunity.

    Content-addressable memory (CAM) is an attractive component in network routers for packet forwarding and packet classification, and also in other applications that require high-speed searches. This dissertation presents three techniques to increase speed and reduce the energy per bit per search. In the first technique, the charging voltage of a fully matched matchline is reduced from VDD to VDD/3, and this voltage is sensed by our proposed sense amplifier, which greatly reduces power consumption. In the second scheme, the charging controller is simplified (to obtain a low-cost circuit) at the cost of some energy. In the third scheme, which has a high noise margin, only likely matched matchlines are charged to almost VDD. Most matchlines are charged only slightly, so power consumption in most matchlines is very low.

    All the schemes (including the conventional current-saving and current-race schemes) are simulated in TSMC 0.18 µm technology with a 64 x 72 ternary CAM. For the first charging control technique, simulation shows that the matchline energy is reduced by 57% and 54% compared to the current-race and current-saving schemes, respectively, and by 55% compared to the conventional current-race scheme, while the speed of operation is increased by more than three times. For the second, simplified scheme, the energy efficiency is slightly lower. The third scheme provides a very good noise margin while maintaining sufficient energy reduction and speed of operation. More accurate results can be obtained from a layout of the proposed schemes.

    We believe that this dissertation provides some good schemes for content addressable memory. We hope that a proper layout of the schemes will also give good results, and that after fabrication of the chip the schemes will prove to be good alternatives to existing CAMs.


    CHAPTER 1

    INTRODUCTION


    1.1 INTRODUCTION

    A content-addressable memory (CAM) compares input search data against a table of stored data and returns the address of the matching data. CAMs have a single-clock-cycle throughput, making them faster than other hardware- and software-based search systems. CAMs can be used in a wide variety of applications requiring high search speeds, including cache memory, parametric curve extraction, Hough transformation, Huffman coding/decoding, Lempel-Ziv compression, and image coding [1].

    The primary commercial application of CAMs today is to classify and forward Internet protocol (IP) packets in network routers. In networks like the Internet, a message such as an e-mail or a Web page is transferred by first breaking it up into small data packets of a few hundred bytes and then sending each packet individually through the network. These packets are routed from the source through intermediate nodes of the network (called routers) and reassembled at the destination to reproduce the original message. The function of a router is to compare the destination address of a packet to all possible routes in order to choose the appropriate one. A CAM is a good choice for implementing this lookup operation due to its fast search capability.

    CAMs are also used in neural networks. Two main strategies are available for implementing a CAM with a neural network architecture: feedback networks and two-stage CAMs, which differ in particular in their ability to retrieve patterns from corrupted input data. The storage capacity of the Hopfield network is very poor, although it can be improved with an iterative algorithm such as the threshold algorithm. However, the possibility of generating spurious patterns always remains with feedback networks. Two-stage CAMs are much more efficient, provided that an appropriate algorithm is used for the input classification stage. Perceptron and least-mean-squares algorithms need to be modified if they are to cope with corrupted input patterns, but the optimal classifier for this type of problem is the minimum-distance classifier (or the Hamming network for binary patterns).
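
    The router lookup described above can be modelled behaviourally. The Python sketch below is an illustration, not a circuit from this thesis: it assumes a ternary CAM whose stored bits are 0, 1 or X ("don't care"), and it assumes that the lowest matching address wins, as a priority encoder would choose. The function name `tcam_search` and the sample routing table are hypothetical. Real CAM hardware compares every stored word in the same clock cycle; the loop here only reproduces the result of that parallel comparison.

```python
from typing import List, Optional


def tcam_search(table: List[str], key: str) -> Optional[int]:
    """Return the lowest address whose stored word matches key, or None on a miss.

    Each stored word is a string over {'0', '1', 'X'}, where 'X' is a don't-care bit.
    """
    for address, word in enumerate(table):
        # A word matches only if every stored bit equals the key bit or is a don't-care.
        if len(word) == len(key) and all(s in ("X", k) for s, k in zip(word, key)):
            return address  # lowest address wins, like an (assumed) priority encoder
    return None  # no matchline stayed in the match state


if __name__ == "__main__":
    # Hypothetical 8-bit routing table with the longest prefix stored first.
    routes = ["1010XXXX", "10XXXXXX", "XXXXXXXX"]
    print(tcam_search(routes, "10101111"))  # matches the 4-bit prefix at address 0
    print(tcam_search(routes, "10011111"))  # matches the 2-bit prefix at address 1
```

    Storing longer prefixes at lower addresses makes the first match the longest-prefix match, which is one common way routers use ternary CAMs for forwarding.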

    Dynamic CMOS logic in general, and domino logic in particular, has a number of advantages for designing high-speed and low-power CMOS circuits. However, the main difficulty with domino logic is that it can implement only non-inverting logic. To implement inverting functions, several circuit parts must be duplicated using inverted polarities of the inputs, which increases area and power dissipation. With static CMOS logic, on the other hand, it is simple to realize gates with both inverting and non-inverting logic [2].

    Domino logic is known as a better logic style for implementing high-speed CMOS circuits. However, domino circuits have some inherent problems, such as charge sharing, clock routing overhead and clock skew. Another difficulty with domino logic is that it can implement only non-inverting functions. As domino logic cannot implement inverters driving other domino gates, all parts of the gate that follow an inverter have to be implemented again with opposite input polarities, which increases area and power dissipation. The advantages of domino logic may come into question when there is a large number of inverters and the design reaches points where substantial duplication of domino gates is unavoidable.
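
    The reason an inverter cannot drive another domino gate can be illustrated with a small behavioural model. The Python sketch below is an illustrative assumption, not a circuit from this thesis: it treats a footed domino AND gate as a dynamic node that is precharged high and can only be discharged during evaluation, and the function name `domino_and` is hypothetical. A domino output is low during precharge, so an inverter driven by it starts the evaluation phase high and then falls if the previous stage evaluates to 1. That early high level discharges the next gate, which cannot recover before the next precharge.

```python
from typing import List, Sequence


def domino_and(steps: Sequence[Sequence[int]]) -> List[int]:
    """Behavioural model of one evaluation phase of a footed domino AND gate.

    steps holds the logic level of every input at successive instants of the
    evaluation phase. The returned list is the gate output at each instant.
    """
    node = 1  # dynamic node precharged high before evaluation
    outputs = []
    for inputs in steps:
        if all(inputs):  # series nMOS pull-down network conducts
            node = 0     # discharge cannot be undone until the next precharge
        outputs.append(1 - node)  # static output inverter of the domino stage
    return outputs


if __name__ == "__main__":
    # Monotonically rising input (direct domino-to-domino connection): correct result.
    print(domino_and([(0, 1), (1, 1), (1, 1)]))
    # Falling input (domino output passed through an inverter): the high level at the
    # start of evaluation discharges the node, so the output stays high even though
    # AND(0, 1) = 0 once the input has settled.
    print(domino_and([(1, 1), (0, 1), (0, 1)]))
```

    This is why inverting functions in domino logic are usually obtained by duplicating logic with complemented input polarities rather than by inserting inverters between stages.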

    1.2 MOTIVATIONS

    The speed of a CAM comes at the cost of increased silicon area and power

    consumption, two design parameters that designers strive to reduce. As CAM applications

    grow, demanding larger CAM sizes, the power problem is further exacerbated. Reducing

    power consumption, without sacrificing speed or area, is the main thread of recent research in

    large-capacity CAMs.

    The literature shows that a major portion of the power is consumed in the matchlines (MLs) and searchlines (SLs), and much research has been done to reduce ML and SL power consumption. Previous works present schemes such as the low-swing scheme [3], [4], [5], the selective-precharge scheme [6], the current-race scheme [7] and the current-saving scheme [8], [9]. The low-swing scheme reduces ML power by reducing the ML voltage swing. The selective-precharge scheme reduces matchline power consumption by breaking the search into two segments, exploiting the observation that the second segment is rarely activated. The current-race scheme precharges the MLs to ground instead of VDD and limits the ML voltage swing to VDD/2. In our scheme, as the SLs are not precharged, SL power consumption is reduced by almost 50% [1] compared to the precharge-high scheme [10], [11]. The current-saving scheme is an improved version of the current-race scheme that allocates less power to match decisions involving a large number of mismatched bits. A survey [1] reports that the current-saving scheme consumes less power than the other schemes [10], [6], [7]. Therefore, we have selected the current-race and current-saving schemes for comparison with our proposed schemes.

    In our proposed schemes, the matchlines are precharged to ground during the precharge phase, unlike the conventional precharge-high scheme, so that matchline power consumption is low. In addition, precharging the matchlines to ground eliminates the need for searchline precharge. In the typical case, where about 50% of the search data bits toggle from cycle to cycle, this gives a 50% reduction in searchline power compared to precharge-high matchline-sensing schemes that have an SL precharge phase. Secondly, in our proposed charging control scheme, the charging voltage of a fully matched matchline is reduced from VDD to VDD/3, and this voltage is sensed by our proposed sense amplifier, which greatly reduces power consumption. In the scheme with a high noise margin, only the likely matched matchlines are charged to almost VDD, while most matchlines are charged only slightly, so power consumption in most matchlines is very low.
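
    The 50% searchline figure quoted above can be checked by counting searchline transitions. The Python sketch below is a simplified model under stated assumptions: one differential pair (SL, SL_bar) per search bit, uniformly random search data, and supply energy drawn only on 0-to-1 transitions. The function name `sl_rising_transitions` is hypothetical, and the model ignores driver details and differences in searchline capacitance. With an SL precharge phase, one line of every pair rises in every cycle; without it, only the bits that actually change cause a rising transition.

```python
import random
from typing import List, Sequence


def sl_rising_transitions(words: Sequence[Sequence[int]], precharge: bool) -> int:
    """Count 0-to-1 searchline transitions over a sequence of search words.

    Each search bit drives a differential pair: SL = bit and SL_bar = 1 - bit.
    With precharge, both lines of every pair are reset to 0 before each search.
    """
    width = len(words[0])
    lines = [0] * (2 * width)  # SL values followed by SL_bar values
    rises = 0
    for word in words:
        if precharge:
            lines = [0] * (2 * width)  # SL precharge phase of a precharge-high scheme
        new_lines = list(word) + [1 - b for b in word]
        rises += sum(1 for old, new in zip(lines, new_lines) if old == 0 and new == 1)
        lines = new_lines
    return rises


if __name__ == "__main__":
    rng = random.Random(1)
    # 72-bit random search words, matching the word width of the simulated 64 x 72 TCAM.
    searches: List[List[int]] = [[rng.randint(0, 1) for _ in range(72)] for _ in range(2000)]
    with_pre = sl_rising_transitions(searches, precharge=True)
    without_pre = sl_rising_transitions(searches, precharge=False)
    print(f"SL charging events without precharge: {without_pre / with_pre:.2f} of the precharge case")
```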

    1.3 OBJECTIVES

    The objective of this investigation was to find a way to design high-speed and low-power dynamic circuits. We specifically tried to design good schemes for Content Addressable Memory (CAM). A CAM is a special type of memory array that provides a hardware search system: the data to be searched is applied to the two-dimensional memory array, which returns the search result (the address of the memory location where the data is found). To design a very high-speed CAM, we investigated the CAM cells, the charging controller and the sense amplifiers. As satisfactory work already exists on CAM cells, our objective was to design an appropriate charging controller and sense amplifier that consume less power and operate at high speed. Since most of the power is consumed in the matchlines, and the matchlines are charged by the charging controller, the charging controller is responsible for most of the power consumption. Therefore, one of our objectives was to design the charging controller intelligently. Furthermore, the higher the matchline voltage, the higher the power consumption. For that reason, our objective was to charge the matchline only to VDD/3 and to design a sense amplifier that can sense this voltage as a logic high.
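
    A first-order estimate shows why a lower matchline voltage saves energy. Assume the matchline capacitance C_ML is charged from 0 V through the charging controller from a supply at VDD. The supply then delivers a charge C_ML·V_ML at potential VDD, so the energy drawn is E = C_ML·V_ML·VDD, which is linear in the final matchline voltage V_ML. The Python sketch below evaluates this for a hypothetical 100 fF matchline and an assumed VDD of 1.8 V. These numbers are illustrative, not values from the thesis simulations, and the model ignores the current that flows to ground through mismatched cells while the line is being charged.

```python
def ml_charging_energy(c_ml: float, v_ml: float, vdd: float) -> float:
    """Energy (J) drawn from a supply at vdd to charge c_ml (F) from 0 V to v_ml (V).

    Only c_ml * v_ml**2 / 2 of this energy ends up stored on the line; the rest is
    dissipated in the charging path.
    """
    return c_ml * v_ml * vdd


if __name__ == "__main__":
    C_ML = 100e-15  # hypothetical matchline capacitance: 100 fF
    VDD = 1.8       # assumed supply voltage for a 0.18 um process
    e_full = ml_charging_energy(C_ML, VDD, VDD)
    e_third = ml_charging_energy(C_ML, VDD / 3, VDD)
    print(f"charge to VDD:   {e_full * 1e15:.0f} fJ")
    print(f"charge to VDD/3: {e_third * 1e15:.0f} fJ ({e_third / e_full:.2f} of full swing)")
```

    Under this model, stopping a matched matchline at VDD/3 cuts its charging energy to one third. A quadratic saving (one ninth) would apply only if the line were charged from a separate supply at VDD/3.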

    1.4 THESIS ORGANIZATION

    The continued increase in transistor leakage current with the advancement of process technology affects the leakage power and noise sensitivity of dynamic circuits more than those of static circuits. This thesis proposes methods and techniques for the power-efficient design of CAM circuits, which are commonly used in high-performance cache memory, while improving or maintaining their area, performance and noise robustness.

    Chapter 2 describes different types of logic circuits, the study of which is needed to understand the significance of dynamic circuits. As CAM is based on dynamic logic, the investigation of dynamic logic is very important.

    Chapter 3 reviews the significance of previous power-reduction schemes and compares their performance and structure in detail.

    Chapter 4 describes the first proposed scheme, named the charging control scheme. This work was accepted and presented at the International Conference on Solid State Devices and Materials (SSDM) 2008, Tsukuba, Japan.


    Chapter 5 describes another scheme, which uses a simplified version of the charging controller of the scheme described in Chapter 4. This work was also accepted at TENCON 2008, Hyderabad, India.

    Chapter 6 presents a scheme with a high noise margin. The principle of this technique is to avoid "carrying coal to Newcastle", that is, to avoid unnecessary charging: the scheme charges only a negligible number of matchlines to VDD, and for that reason a large amount of power is saved.

    Finally, Chapter 7 summarizes the proposed methods and techniques for low-power CAM circuits in high-performance memory arrays. It also offers suggestions for extending this research in future work.


    CHAPTER 2

    DIFFERENT TYPES OF LOGIC


    2.1 INTRODUCTION

    In this chapter, different types of logic circuits are described. Ratioed logic, CMOS logic, dynamic logic and domino logic are studied. This study is needed to appreciate the significance of dynamic circuits: as CAM is based on dynamic logic, the investigation of dynamic logic is very important.

    2.2 RATIOED LOGIC

    Ratioed logic is an attempt to reduce the number of transistors required to implement

    a logic function, often at the cost of reduced robustness and extra power dissipation.

    2.2.1 Load Resistance RL

    This arrangement is not often used because resistors fabricated in a silicon substrate occupy a large area. If the pull-down network is off, no current flows and the static power is zero. If the pull-down network is on, a static current flows from VDD through RL and the pull-down network to VSS, and static power is dissipated.
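The static behaviour of the resistive pull-up can be estimated with a simple voltage-divider model. The sketch below makes one simplifying assumption: when the pull-down network conducts, it is represented as a linear on-resistance. Real transistors are nonlinear, so the numbers are only indicative. All component values are hypothetical.

```python
def resistor_pullup_static(vdd: float, r_load: float, r_pdn_on: float, pdn_on: bool):
    """Return (output voltage, static power) of a resistive-load gate.

    The pull-down network is modeled as a linear resistor r_pdn_on when it
    conducts (a simplification; real transistors are nonlinear)."""
    if not pdn_on:
        return vdd, 0.0                   # no DC path: output at VDD, no static power
    i_static = vdd / (r_load + r_pdn_on)  # DC current VDD -> RL -> PDN -> VSS
    v_out = i_static * r_pdn_on           # voltage divider sets the non-zero VOL
    return v_out, vdd * i_static


# Hypothetical values: VDD = 1.8 V, RL = 10 kOhm, PDN on-resistance = 1 kOhm
v_oh, p_off = resistor_pullup_static(1.8, 10e3, 1e3, pdn_on=False)
v_ol, p_on = resistor_pullup_static(1.8, 10e3, 1e3, pdn_on=True)
print(f"PDN off: Vout = {v_oh:.3f} V, P = {p_off * 1e6:.1f} uW")
print(f"PDN on:  Vout = {v_ol:.3f} V, P = {p_on * 1e6:.1f} uW")
```

A larger RL lowers both the static power and VOL, but it also slows the rising transition. This is the basic trade-off behind ratioed logic.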

    Fig. 2.1: Resistor Pull-up


    2.2.2 nMOS Depletion Mode Transistor Pull-up

    Power dissipation is high, since a rail-to-rail current flows when the input is logical 1. The output begins switching from 1 to 0 when the input voltage exceeds the threshold voltage Vt of the pull-down device.

    Fig. 2.2: nMOS depletion mode transistor pull up

    When switching the output from 1 to 0, the pull up device is non-saturated initially and this

    presents lower resistance through which to charge capacitive loads.

    2.2.3 nMOS Enhancement Mode Pull-up

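    If the gate of the pull-up transistor is connected to VDD, the circuit is called an nMOS enhancement mode pull-up. Power dissipation is high, since current flows when the input voltage is logical 1. The output voltage can never reach VDD (logical 1). The pull-up transistor turns off once the output rises to within one threshold voltage of its gate, so the output high level is limited to about VDD - Vt.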

    Fig. 2.3: nMOS enhancement mode pull up


    2.2.4 Pseudo-nMOS Logic

    If we replace the depletion mode pull-up transistor of a standard nMOS circuit with a p-channel transistor whose gate is connected to VSS, we obtain a structure similar to the nMOS equivalent. This approach to logic design is illustrated in Fig. 2.4.

    Fig. 2.4: Pseudo-nMOS Logic

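    The circuit looks and behaves much like its nMOS counterpart, and appropriate ratio rules must be applied. The pull-down network must be sufficiently strong relative to the always-on pMOS load to produce an acceptably low output level VOL.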
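The ratio rule can be made concrete with a first-order, long-channel (square-law) estimate of the output low level of a pseudo-nMOS inverter whose input is at VDD. The sketch assumes that the grounded-gate pMOS load is saturated and that the nMOS driver is in its linear region. Threshold voltages and strength ratios are hypothetical, and velocity saturation and body effect are ignored. The estimate is only meaningful while the resulting VOL stays below |Vtp|, where the saturation assumption for the load holds.

```python
import math


def pseudo_nmos_vol(vdd: float, vtn: float, vtp_abs: float, beta_ratio: float) -> float:
    """Square-law estimate of VOL for a pseudo-nMOS inverter with input = VDD.

    beta_ratio = k_n / k_p (pull-down strength relative to the pMOS load).
    Current balance with the nMOS in the linear region and the pMOS saturated:
        k_n * ((vdd - vtn) * V - V**2 / 2) = (k_p / 2) * (vdd - vtp_abs)**2
    solved for the smaller root V.
    """
    a = vdd - vtn        # nMOS gate overdrive
    b = vdd - vtp_abs    # pMOS gate overdrive (gate tied to VSS)
    disc = a * a - b * b / beta_ratio
    if disc < 0:
        raise ValueError("pull-down too weak: no valid low output level in this model")
    return a - math.sqrt(disc)


for ratio in (3.0, 4.0, 8.0):
    vol = pseudo_nmos_vol(vdd=1.8, vtn=0.4, vtp_abs=0.4, beta_ratio=ratio)
    print(f"k_n/k_p = {ratio:.0f}: VOL = {vol:.3f} V")
```

Making the pull-down network stronger relative to the load lowers VOL. However, it also makes the pull-down larger and the static current higher, which is why the ratio must be chosen rather than simply maximized.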

    2.3 COMPLEMENTARY TRANSISTOR PULL-UP (CMOS)

    CMOS uses both a pMOS pull-up network and an nMOS pull-down network, and for any steady input only one of them conducts. Therefore, apart from leakage, there is no static power dissipation, because no current flows for either logical 0 or logical 1 inputs. Full logical 1 and 0 levels are presented at the output. For devices of similar dimensions, the p-channel device is slower than the n-channel device.
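The complementary behaviour can be checked exhaustively for a small gate. The sketch below uses a 2-input NAND as the example (the choice of gate is ours, not taken from the text). For every steady input, it evaluates whether the pMOS pull-up network and the nMOS pull-down network conduct, and it confirms that exactly one of them does.

```python
from itertools import product
from typing import Tuple


def nand2_networks(a: int, b: int) -> Tuple[bool, bool]:
    """Return (pull-up ON, pull-down ON) for a static CMOS 2-input NAND gate."""
    pun_on = (a == 0) or (b == 0)   # two parallel pMOS: either ON when its input is 0
    pdn_on = (a == 1) and (b == 1)  # two series nMOS: path exists only when both are 1
    return pun_on, pdn_on


def nand2_output(a: int, b: int) -> int:
    pun_on, pdn_on = nand2_networks(a, b)
    # Exactly one network conducts for every steady input, so there is never a
    # DC path from VDD to VSS (leakage ignored) and the output is a full rail.
    assert pun_on != pdn_on
    return 1 if pun_on else 0


for a, b in product((0, 1), repeat=2):
    print(a, b, "->", nand2_output(a, b))
```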

    Fig. 2.5: Complementary transistor pull-up (CMOS)


    2.3.1 Components of Total Power Dissipation in CMOS Circuits

    One of the major design challenges in high-performance digital integrated circuits is minimizing total power dissipation. Total power consumption in digital CMOS circuits can be divided into three major components: a) switching or dynamic power, b) short-circuit power and c) static or leakage power. Equation 1.1 defines the total power and its three components in a simplified form [1.1].

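    $$
    \begin{aligned}
    P_{TOTAL} &= P_{DYNAMIC} + P_{SHORT\text{-}CIRCUIT} + P_{LEAKAGE} \\
    P_{DYNAMIC} &= AF \cdot F \cdot C \cdot V_{dd}^{2} \\
    P_{SHORT\text{-}CIRCUIT} &= \frac{T_{rise} + T_{fall}}{2} \cdot V_{dd} \cdot I_{PEAK} \\
    P_{LEAKAGE} &= I_{LEAKAGE} \cdot V_{dd}
    \end{aligned}
    \tag{1.1}
    $$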

    Dynamic power is the power consumed in charging and discharging the various device and wire capacitances in the circuit. As seen from Equation 1.1, this component depends on the switching activity factor AF (the probability that a power-consuming transition occurs), the clock frequency F, the capacitance C being charged or discharged, and the square of the supply voltage Vdd. In long-channel transistors, dynamic power is the dominant component of the total power. This is not the case in advanced technologies, however, where leakage power is becoming a significant component of the total power.
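The quadratic dependence on Vdd in the dynamic term of Equation 1.1 can be seen numerically. The sketch below evaluates that term directly. The activity factor, switched capacitance and frequency are hypothetical values chosen only for illustration.

```python
def dynamic_power(af: float, c: float, vdd: float, f: float) -> float:
    """Dynamic (switching) power from Equation 1.1: AF * F * C * Vdd**2."""
    return af * f * c * vdd ** 2


# Hypothetical operating point: AF = 0.1, C = 1 nF of switched capacitance, F = 1 GHz
p_nominal = dynamic_power(af=0.1, c=1e-9, vdd=1.2, f=1e9)
p_scaled = dynamic_power(af=0.1, c=1e-9, vdd=0.6, f=1e9)
print(f"Vdd = 1.2 V: {p_nominal:.3f} W")
print(f"Vdd = 0.6 V: {p_scaled:.3f} W  (ratio {p_scaled / p_nominal:.2f})")
```

Halving Vdd cuts the dynamic power to one quarter, independent of the other parameters.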

    Fig. 2.6: CMOS inverter current versus Vin (vertical axis: current between rails; horizontal axis: Vin)


    Short-circuit power is dissipated when there is a direct conducting path between the power supply (Vdd) and ground (Vss). Because the input signals to the logic gates have a finite slope (edge rate), a direct-path current flows between Vdd and Vss for the period during which both the PMOS and NMOS devices conduct simultaneously. The magnitude of this current is set by the transistor widths and the on-state saturation current (IDSAT). The duration for which the current flows depends on the signal rise and fall times and increases as the signal slopes degrade. The overall short-circuit power can be obtained by integrating the total current over the duration of the short circuit and then multiplying by Vdd. In a simplified form, it can be calculated as shown in Equation 1.1, where Trise and Tfall are the rise and fall times, respectively, and IPEAK is the peak short-circuit current.

    Static or leakage power dissipation is due to the leakage current, ILEAKAGE, that flows between the power rails in the absence of any switching activity. The three major sources of leakage current are:
    a) current through the reverse-biased P-N diode junctions of the transistors, located between the source or drain and the substrate;
    b) subthreshold leakage current between source and drain when the gate-source voltage, Vgs, is smaller than the threshold voltage, Vt, of the transistor; and
    c) gate leakage current via the gate tunneling mechanism.

    Because of the quadratic dependence of dynamic power on Vdd, reducing this voltage

    is the most effective approach to minimize dynamic power dissipation. However, reducing

    the supply voltage necessitates the reduction of threshold voltage to avoid serious degradation

    of performance. Unfortunately, reducing threshold voltage causes the sub-threshold leakage

    current to increase exponentially.
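The exponential sensitivity of leakage to Vt can be quantified with the standard first-order subthreshold model. In that model, the drain current at Vgs = 0 is proportional to exp(-Vt / (n·kT/q)). The slope factor n = 1.5 and the temperature of 300 K used below are assumed values.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant (J/K)
Q_E = 1.602176634e-19  # elementary charge (C)


def thermal_voltage(temp_k: float = 300.0) -> float:
    """kT/q in volts."""
    return K_B * temp_k / Q_E


def leakage_increase(delta_vt: float, n: float = 1.5, temp_k: float = 300.0) -> float:
    """Factor by which subthreshold current (Vgs = 0) grows when Vt is lowered by
    delta_vt volts, using I_sub proportional to exp(-Vt / (n * kT/q))."""
    return math.exp(delta_vt / (n * thermal_voltage(temp_k)))


def subthreshold_swing(n: float = 1.5, temp_k: float = 300.0) -> float:
    """Gate-voltage change needed per decade of subthreshold current (V/decade)."""
    return n * thermal_voltage(temp_k) * math.log(10)


print(f"swing: {subthreshold_swing() * 1e3:.0f} mV/decade")
print(f"Vt lowered by 100 mV -> leakage x{leakage_increase(0.100):.1f}")
```

With these assumptions, a 100 mV reduction in Vt raises the subthreshold leakage by more than an order of magnitude. This is why supply scaling cannot simply be matched by threshold scaling.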

    2.4 DYNAMIC CIRCUITS

    Ratioed circuits reduce the input capacitance by replacing the pMOS transistors connected to the inputs with a single resistive pull-up. The drawbacks of ratioed circuits include slow rising transitions through the resistive pull-up, contention on the falling transitions, static power dissipation and a non-zero VOL. Dynamic circuits circumvent these drawbacks by using a clocked pull-up transistor rather than a pMOS that is always ON.

    Fig. 2.7: Precharge and evaluation of dynamic gates.


    Dynamic circuit operation is divided into two modes, shown in Fig. 2.7. During

    precharge, the clock (CLK) is 0, so the clocked pMOS is ON and initializes the output Y

    high. During evaluation, the clock is 1 and the clocked pMOS turns OFF. The output may

    remain high or may be discharged low through the pull-down network.
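The two operating modes can be captured in a cycle-level behavioural model. The sketch below is a simplification: each list entry is one clock phase, and leakage and charge sharing are ignored. It uses a two-input dynamic NOR as the pull-down function, because a NOR-type CAM matchline behaves like a wide dynamic NOR (as described in Chapter 3). The function and parameter names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def footed_dynamic_gate(clk: Sequence[int],
                        inputs: Sequence[Tuple[int, ...]],
                        pdn: Callable[..., bool]) -> List[int]:
    """Cycle-level model of a footed dynamic gate.

    clk = 0 -> precharge: clocked pMOS ON, foot OFF, output Y forced high.
    clk = 1 -> evaluate: pMOS OFF, foot ON; Y is discharged if the pull-down
               network pdn(*inputs) conducts, otherwise Y keeps its stored
               charge (leakage and charge sharing are ignored).
    """
    y = 1
    trace = []
    for c, inp in zip(clk, inputs):
        if c == 0:
            y = 1
        elif pdn(*inp):
            y = 0
        trace.append(y)
    return trace


def nor2_pdn(a: int, b: int) -> bool:
    """Two parallel nMOS transistors: conducts if either input is 1 (dynamic NOR2)."""
    return bool(a or b)


clk = [0, 1, 0, 1]
inputs = [(0, 0), (0, 0), (0, 0), (1, 0)]
print(footed_dynamic_gate(clk, inputs, nor2_pdn))
```

The output stays high through the first evaluation (no input is 1) and is discharged in the second evaluation, when input A is 1.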

    Fig. 2.8: Footed dynamic inverter

    Dynamic circuits are the fastest commonly used circuit family because they have lower input capacitance and no contention during switching. They also have zero static power dissipation. However, they require careful clocking, consume significant dynamic power and are sensitive to noise during evaluation.

    In the unfooted gates of Fig. 2.9, if the input is 1 during precharge, contention takes place because both the pMOS and nMOS transistors are ON.

    Fig. 2.9: Unfooted dynamic gates.


    When the input cannot be guaranteed to be 0 during precharge, an extra clocked evaluation transistor can be added to the bottom of the nMOS stack to avoid contention, as shown in the generic footed gates of Fig. 2.10. The extra transistor is sometimes called a foot.

    Fig. 2.10: Generalized footed gates.

    Fig. 2.11 estimates the falling logical effort of both footed and unfooted dynamic gates. As usual, the pull-down transistor widths are chosen to give unit resistance. Precharge occurs while the gate is idle and often may take place more slowly. The precharge transistor width is therefore chosen for twice unit resistance. This reduces the capacitive load on the clock and the parasitic capacitance, at the expense of greater rising delays.
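The falling logical effort and parasitic delay of the dynamic inverters can be reproduced with exact fractions. The sketch assumes the common logical-effort convention that an nMOS of width 1 and a pMOS of width 2 each have unit resistance. The text itself states only "unit resistance", so this convention is our assumption. Under it, the sketch yields gd = 1/3, pd = 2/3 for the unfooted inverter and gd = 2/3, pd = 3/3 for the footed one.

```python
from fractions import Fraction
from typing import Tuple

# Assumed convention: an nMOS of width 1 or a pMOS of width 2 has unit resistance,
# so a unit static inverter has input capacitance 1 + 2 = 3.
STATIC_INVERTER_CIN = Fraction(3)


def dynamic_inverter_effort(footed: bool) -> Tuple[Fraction, Fraction]:
    """Return (falling logical effort g_d, parasitic delay p_d) of a dynamic inverter."""
    # Pull-down sized for unit resistance: a single nMOS has width 1; a two-high
    # stack (input transistor + foot) needs width 2 per transistor.
    n_width = Fraction(2) if footed else Fraction(1)
    # Precharge pMOS sized for twice unit resistance -> half of width 2 = 1.
    precharge_width = Fraction(1)
    g_d = n_width / STATIC_INVERTER_CIN
    # Diffusion capacitance on the output node: precharge pMOS + input nMOS.
    p_d = (precharge_width + n_width) / STATIC_INVERTER_CIN
    return g_d, p_d


for footed in (False, True):
    g, p = dynamic_inverter_effort(footed)
    print("footed  " if footed else "unfooted", f"g_d = {g}, p_d = {p}")
```

The footed inverter's parasitic delay prints as 1, which is the same value as the 3/3 shown in the figure.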

    Fig. 2.11: Logical effort of footed and unfooted dynamic gates (unfooted dynamic inverter: gd = 1/3, pd = 2/3; footed dynamic inverter: gd = 2/3, pd = 3/3)


    Footed gates have higher logical effort than their unfooted counterparts but are still an improvement over static logic. In practice, the logical effort of footed gates is better than predicted, because velocity saturation means that series nMOS transistors have less resistance than estimated. The size of the foot can be increased relative to the other nMOS transistors to reduce the logical effort of the other inputs, at the expense of greater clock loading. Like pseudo-nMOS gates, dynamic gates are particularly well suited to wide NOR functions or multiplexers, because the logical effort is independent of the number of inputs.

    A fundamental difficulty with dynamic circuits is the monotonicity requirement. While a dynamic gate is in evaluation, its inputs must be monotonically rising. That is, an input can start LOW and remain LOW, start LOW and rise HIGH, or start HIGH and remain HIGH, but it must not start HIGH and fall LOW. Fig. 2.12 shows waveforms for a footed dynamic inverter in which the input violates monotonicity. During precharge, the output is pulled HIGH. When the clock rises, the input is HIGH, so the output is discharged LOW through the pull-down network, as would happen in an inverter.

    Fig. 2.12: Monotonicity problem

    The input later falls LOW, turning off the pull-down network. However, the

    precharge transistor is also OFF, so the output floats, staying LOW rather than rising as it

    would in a normal inverter. The output will remain low until the next precharge step. In

    summary, the inputs must be monotonically rising for the dynamic gate to compute the

    correct function.
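The failure can be reproduced with a step-by-step behavioural model of a footed dynamic inverter, compared against the static inverter it is meant to implement. The time steps and the input waveform are hypothetical and simply mirror the sequence described above: A is high at the start of evaluation and then falls.

```python
from typing import List, Sequence


def footed_dynamic_inverter(clk: Sequence[int], a: Sequence[int]) -> List[int]:
    """Output Y of a footed dynamic inverter, one sample per time step.
    Precharge (clk = 0) forces Y = 1; during evaluation (clk = 1) Y is pulled low
    while A = 1 and otherwise floats, holding its last value."""
    y, out = 1, []
    for c, x in zip(clk, a):
        if c == 0:
            y = 1
        elif x == 1:
            y = 0
        out.append(y)
    return out


clk = [0, 1, 1, 1, 0]
a = [1, 1, 0, 0, 0]                  # A falls during evaluation: violates monotonicity
dynamic = footed_dynamic_inverter(clk, a)
static = [1 - x for x in a]          # what a static inverter would produce
for step, (c, x, yd, ys) in enumerate(zip(clk, a, dynamic, static)):
    flag = "  <- wrong" if c == 1 and yd != ys else ""
    print(f"t={step} CLK={c} A={x} Y_dyn={yd} Y_static={ys}{flag}")
```

After A falls, the dynamic output stays LOW for the rest of the evaluation phase instead of rising, and only recovers at the next precharge.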


    Unfortunately, the output of a dynamic gate begins HIGH and monotonically falls LOW during evaluation. This monotonically falling output X is not a suitable input to a second dynamic gate that expects monotonically rising signals, as shown in Fig. 2.13. Dynamic gates sharing the same clock therefore cannot be directly connected. This problem is often overcome with domino logic.

    Fig. 2.13: Incorrect connection of dynamic gates.


    2.5 DOMINO LOGIC

    Domino logic circuits are widely used in high-performance microprocessors because of their superior speed and area characteristics compared to static CMOS circuits. However, their noise margins are low, making them more prone to noise. Various leakage-reduction techniques are also applied to domino logic circuits. As technology is scaled below 130 nm, noise margin becomes a critical issue. Techniques that provide high noise immunity then become necessary to build reliable circuits.

    The monotonicity problem can be solved by placing a static CMOS inverter between dynamic gates, as shown in Fig. 2.14. This converts the monotonically falling output into a monotonically rising signal suitable for the next gate. The dynamic-static pair together is called a domino gate. A single clock can be used to precharge and evaluate all the logic gates within the chain. Because the dynamic node only falls during evaluation, only the rising transition of the inverter output is critical. For this reason, the static inverter is usually a HI-skew gate that favors this rising output. We may observe that precharge occurs in parallel, but evaluation occurs sequentially.
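The difference between connecting dynamic gates directly and inserting a static inverter can be shown with a unit-delay model. In this model, each stage sees the value its predecessor drove one step earlier. This is a discrete-time abstraction we are assuming for illustration, not a circuit simulation. A discharged dynamic node cannot recover before the next precharge, and that is exactly the effect that breaks the direct connection.

```python
def evaluate_chain(a: int, stages: int, domino: bool, steps: int = 8) -> int:
    """Unit-delay model of cascaded single-input dynamic inverters that share one clock.

    All dynamic nodes start precharged high. In each evaluation step, stage k reads
    the value that stage k-1 drove in the previous step (one gate delay). Once a
    dynamic node is discharged it cannot recover until the next precharge.
    domino=False: dynamic nodes drive the next stage directly (as in Fig. 2.13).
    domino=True: each dynamic node drives the next stage through a static inverter.
    """
    dyn = [1] * stages                        # dynamic nodes after precharge
    out = [0 if domino else 1] * stages       # signal each stage drives forward
    for _ in range(steps):
        prev_out = out[:]
        for k in range(stages):
            gate_in = a if k == 0 else prev_out[k - 1]
            if gate_in == 1:                  # pull-down ON: node discharges
                dyn[k] = 0
            out[k] = 1 - dyn[k] if domino else dyn[k]
    return out[-1]


# Intended logic: two inverting stages should give Y = A, and two
# (non-inverting) domino stages should also give Y = A.
for a in (0, 1):
    print(f"A={a}: direct Y={evaluate_chain(a, 2, domino=False)}, "
          f"domino Y={evaluate_chain(a, 2, domino=True)}")
```

With A = 1, the directly connected second stage sees the precharged-high X at the start of evaluation and discharges wrongly. The domino chain fires one stage per step and gives the correct result, which also illustrates the sequential evaluation noted above.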

    A standard domino logic circuit with a keeper is shown in Fig. 2.14. It consists of an n-type dynamic logic block followed by a static inverter. During precharge, the output of the dynamic gate is charged to Vdd and the output of the inverter is set to 0.

    Fig. 2.14: Standard Domino Logic circuit


    During evaluation, the inverter makes a conditional transition from 0 to 1. If the output of the domino gate is fed to other domino gates, two conditions must hold: all inputs must be set to 0 at the end of the precharge phase, and the only transitions during evaluation must be from 0 to 1. Hence, the dynamic node discharges only when the previous stage evaluates to 1, and a high fan-out is achieved owing to the static inverter at the output. To counteract leakage and to establish a low-impedance path from Vdd to the dynamic node, a bleeder transistor (keeper) is connected in the feedback path.

    Fig. 2.15: Weak keeper implementation

    The function of the keeper is to compensate for the charge lost through the pull-down leakage paths. However, the keeper is fully turned on at the beginning of the evaluation phase. When the pull-down network turns ON, it contends with the keeper transistor, which degrades the speed of domino circuits. Traditionally, a minimum-sized keeper is used to minimize the delay and power degradation caused by the contention current. A small keeper, however, cannot provide the noise immunity necessary for reliable operation in an increasingly noisy and noise-sensitive on-chip environment. There is therefore a trade-off between high-speed, energy-efficient operation and reliability in domino logic, and keeper sizing is important in deep submicron circuits.
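Why a keeper is needed can be estimated from the charge lost by a floating dynamic node. The sketch assumes constant leakage and keeper currents and a lumped node capacitance. All values are hypothetical, and the model deliberately ignores the contention between the keeper and an ON pull-down network.

```python
def dynamic_node_droop(c_node: float, i_leak: float, t_hold: float,
                       i_keeper: float = 0.0) -> float:
    """Voltage lost by a floating, precharged dynamic node during t_hold seconds.

    Leakage through the OFF pull-down network removes charge at i_leak; an ON
    keeper replaces charge at up to i_keeper (both treated as constant currents)."""
    net_current = max(i_leak - i_keeper, 0.0)
    return net_current * t_hold / c_node   # dV = I * t / C


# Hypothetical values: 5 fF dynamic node, 0.5 uA leakage, 2 ns evaluation window
print(dynamic_node_droop(5e-15, 0.5e-6, 2e-9))                 # without a keeper
print(dynamic_node_droop(5e-15, 0.5e-6, 2e-9, i_keeper=1e-6))  # with a keeper
```

The same keeper current opposes the pull-down network when it turns ON, so a keeper much stronger than the leakage requires slows the falling transition. That contention is the speed side of the sizing trade-off and is not captured by this charge-balance model.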


    CHAPTER 3

    CONTENT ADDRESSABLE MEMORY (CAM) REVIEW


    3.1 INTRODUCTION

    Content addressable memories (CAMs) are memories that can search the entire memory in parallel and output the location of entries that hold a match to the key value. Today, the increased need for faster searches, larger table sizes and wider data words makes CAMs a more attractive solution than less expensive RAM-based software solutions. CAMs are now much needed wherever quick searches of a database, a list or a pattern are required. In fact, CAMs can be a determining factor for a wide range of applications such as local-area networks, file storage management, artificial intelligence, database management and pattern recognition.

    A Content-Addressable Memory (CAM) searches for data by its content and returns the address of the matching data. This feature is used extensively in applications such as Internet routers, which channel incoming packets towards the destination addresses contained in the packet headers. Energy per search and search speed are two important metrics used to evaluate CAM performance [12]. Content addressable memory (CAM), a high-performance lookup engine in many systems, is so power-consuming that any saving becomes very significant for the whole system. CAM has three major power-sinking sources: evaluation power, input transition power and clocking power, all of which are discussed in this research. After that, a new low-power CAM design is proposed here. Its implementation in a 0.35-µm process operates at 83.3 MHz with a power performance metric of 45.5 fJ/bit/search, or equivalently 372 mJ/bit/search/m, for random inputs. Two modified circuit structures for binary static CAM cells are also proposed. We have proved that under most conditions the cell layout is smaller with this modification.

    A Content Addressable Memory (CAM) compares input search data against a table of stored data, and returns the address of the matching data [13]-[17]. CAMs have a single-clock-cycle throughput, making them faster than other hardware- and software-based search systems. CAMs can be used in a wide variety of applications requiring high search speeds. These applications include parametric curve extraction [18], Hough transformation [19], Huffman coding/decoding [20], [21], Lempel-Ziv compression [22]-[25], and image coding [26]. The primary commercial application of CAMs today is to classify and forward Internet protocol (IP) packets in network routers [27]-[32]. In networks like the Internet, a message such as an e-mail or a Web page is first broken up into small data packets of a few hundred bytes, and each data packet is then sent individually through the network. These packets are routed from the source, through the intermediate nodes of the network (called routers), and reassembled at the destination to reproduce the original message. The function of a router is to compare the destination address of a packet to all possible routes in order to choose the appropriate one. A CAM is a good choice for implementing this lookup operation due to its fast search capability.

    However, the speed of a CAM comes at the cost of increased silicon area and power

    consumption, two design parameters that designers strive to reduce. As CAM applications

    grow, demanding larger CAM sizes, the power problem is further exacerbated. Reducing

    power consumption, without sacrificing speed or area, is the main thread of recent research in

    large capacity CAMs. In this research, we survey developments in the CAM area at two

    levels: circuits and architectures. Before providing an outline of this research at the end of

    this section, we first briefly introduce the operation of CAM and also describe the CAM

    application of packet forwarding.

    Fig. 3.1: Simple schematic model of a 4x3 CAM array showing the core memory cells, differential searchlines, matchlines and encoder

    Fig. 3.1 shows a simplified block diagram of a CAM. The input to the system is the

    search word that is broadcast onto the searchlines to the table of stored data. The number of

    bits in a CAM word is usually large, with existing implementations ranging from 36 to 144

    bits. A typical CAM employs a table size ranging between a few hundred entries to 32K

    entries, corresponding to an address space ranging from 7 bits to 15 bits. Each stored word

    has a matchline that indicates whether the search word and stored word are identical (the

    match case) or are different (a mismatch case, or miss). The matchlines are fed to an encoder

    that generates a binary match location corresponding to the matchline that is in the match

    state. An encoder is used in systems where only a single match is expected. In CAM

    applications where more than one word may match, a priority encoder is used instead of a

    simple encoder. A priority encoder selects the highest priority matching location to map to

    the match result, with words in lower address locations receiving higher priority. In addition,

    there is often a hit signal (not shown in the figure) that flags the case in which there is no matching location in the CAM. The overall function of a CAM is to take a search word and


    return the matching memory location. One can think of this operation as a fully

    programmable arbitrary mapping of the large space of the input search word to the smaller

    space of the output match location.
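The search-then-encode flow described above can be written as a small behavioural model. The sketch compares the key against every stored word (done in parallel in hardware, sequentially here), produces one matchline value per word, and then applies a priority encoder in which lower addresses win. The table contents and function names are hypothetical; only the 4-word by 3-bit shape follows Fig. 3.1.

```python
from typing import List, Optional, Sequence, Tuple


def cam_search(table: Sequence[int], key: int) -> List[int]:
    """Compare the search word against every stored word (in parallel in
    hardware) and return one matchline state per word: 1 = match, 0 = miss."""
    return [1 if word == key else 0 for word in table]


def priority_encode(matchlines: Sequence[int]) -> Tuple[bool, Optional[int]]:
    """Return (hit, address); when several words match, the lowest address wins."""
    for address, ml in enumerate(matchlines):
        if ml:
            return True, address
    return False, None


# A 4-word x 3-bit table, the same shape as the array in Fig. 3.1 (contents hypothetical)
table = [0b101, 0b011, 0b101, 0b110]
for key in (0b101, 0b110, 0b111):
    mls = cam_search(table, key)
    print(f"key={key:03b} matchlines={mls} (hit, address)={priority_encode(mls)}")
```

Key 101 matches words 0 and 2, and the priority encoder reports address 0. Key 111 matches nothing, so no hit is reported.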

    Fig. 3.2: Profile of CAM capacity (log scale) versus year of publication [33]-[40] (vertical axis: memory size in bits; horizontal axis: year)

    The operation of a CAM is like that of the tag portion of a fully associative cache. The

    tag portion of a cache compares its input, which is an address, to all addresses stored in the

    tag memory. In the case of match, a single matchline goes high, indicating the location of a

    match. Unlike CAMs, caches do not use priority encoders since only a single match occurs;

    instead, the matchline directly activates a read of the data portion of the cache associated with

    the matching tag. Many circuits are common to both CAMs and caches; however, we focus on large-capacity CAMs rather than on fully associative caches, which target smaller capacity and higher speed. Today's largest commercially available single-chip CAMs are 18 Mbit implementations, although the largest CAMs reported in the literature are 9 Mbit in size [33], [40]. As a rule of thumb, the largest available CAM chip is usually about half the size of the largest available SRAM chip. This rule of thumb comes from the fact that a typical CAM cell consists of two SRAM cells, as we will see shortly. Fig. 3.2 plots (on a logarithmic scale) the capacity of published CAM chips [33]-[40] versus time from 1985 to 2004. It reveals an exponential growth rate typical of semiconductor memory circuits, as well as the factor-of-two relationship between SRAM and CAM.


    3.2 CORE CELLS AND MATCHLINE STRUCTURE

    A CAM cell serves two basic functions: bit storage (as in RAM) and bit comparison

    (unique to CAM). Fig. 3.3 shows a NOR-type CAM cell [Fig. 3.3(a)] and a NAND-type CAM cell [Fig. 3.3(b)]. The bit storage in both cases is an SRAM cell where cross-coupled

    inverters implement the bit-storage nodes D and DB. To simplify the schematic, we omit the

    nMOS access transistors and bitlines which are used to read and write the SRAM storage bit.

    Although some CAM cell implementations use lower area DRAM cells [3.27], [3.31],

    typically, CAM cells use SRAM storage. The bit comparison, which is logically equivalent to

    an XOR of the stored bit and the search bit is implemented in a somewhat different fashion in

    the NOR and the NAND cells.

    3.2.1 Structure of NOR Cell

    A NOR cell implements the comparison between the complementary stored bit, D (and DB), and the complementary search data on the complementary searchline, SL (and SLB). It uses four comparison transistors, M1 through M4, which are all typically minimum-size to maintain high cell density. These transistors implement the pull-down path of a dynamic XNOR logic gate with inputs SL and D. Each pair of transistors, M1/M3 and M2/M4, forms a pull-down path from the matchline, ML. A mismatch of SL and D activates at least one of the pull-down paths, connecting ML to ground. A match of SL and D disables both pull-down paths, disconnecting ML from ground. The NOR nature of this cell becomes clear when multiple cells are connected in parallel to form a CAM word by shorting the ML of each cell to the ML of adjacent cells. The pull-down paths then connect in parallel, resembling the pull-down path of a CMOS NOR logic gate. There is a match condition on a given ML only if every individual cell in the word has a match.

    Fig. 3.3: CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells.
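The word-level NOR behaviour can be expressed as a small logical model. The text states only that M1/M3 and M2/M4 form the two pull-down paths. The sketch therefore assumes one common gating, one path controlled by SL and DB and the other by SLB and D, so that a path conducts exactly on a bit mismatch. Function names are hypothetical.

```python
from typing import Sequence


def nor_cell_pulls_down(sl: int, d: int) -> bool:
    """True if one of the cell's two pull-down paths conducts (a bit mismatch).
    Assumed gating: one path uses SL and DB, the other SLB and D."""
    slb, db = 1 - sl, 1 - d
    return bool((sl and db) or (slb and d))


def nor_matchline_after_evaluation(search: Sequence[int], stored: Sequence[int]) -> int:
    """Precharged-high ML shared by all cells of a word (pull-downs in parallel):
    it stays at 1 only if no cell pulls it down, i.e. every bit matches."""
    if len(search) != len(stored):
        raise ValueError("search word and stored word must have the same width")
    return 0 if any(nor_cell_pulls_down(s, d) for s, d in zip(search, stored)) else 1


print(nor_matchline_after_evaluation([1, 0, 1, 1], [1, 0, 1, 1]))  # every bit matches
print(nor_matchline_after_evaluation([1, 0, 1, 1], [1, 1, 1, 1]))  # one-bit miss
```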


    3.2.2 Structure of NAND Cell

    The NAND cell implements the comparison between the stored bit, D, and the corresponding search data on the corresponding searchlines (SL, SLB). It uses three comparison transistors, M1, MD and MDB, which are all typically minimum-size to maintain high cell density. We illustrate the bit-comparison operation of a NAND cell through an example. Consider the case of a match when SL = 1 and D = 1. Pass transistor MD is ON and passes the logic 1 on SL to node B. Node B is the bit-match node, which is logic 1 if there is a match in the cell. The logic 1 on node B turns ON transistor M1. Note that M1 is also turned ON in the other match case, when SL = 0 and D = 0. In this case, transistor MDB passes a logic high (from SLB) to raise node B. The remaining cases, where SL ≠ D, result in a miss condition: node B is logic 0 and transistor M1 is OFF. Node B is therefore a pass-transistor implementation of the XNOR function of SL and D. The NAND nature of this cell becomes clear when multiple NAND cells are serially connected. In this case, the MLn and MLn+1 nodes of adjacent cells are joined to form a word. A serial nMOS chain of the M1 transistors of all cells resembles the pull-down path of a CMOS NAND logic gate. A match condition for the entire word occurs only if every cell in the word is in the match condition. An important property of the NOR cell is that it provides a full rail voltage at the gates of all comparison transistors. A deficiency of the NAND cell, on the other hand, is that it provides only a reduced logic 1 voltage at node B. This voltage can reach only VDD - Vtn when the searchlines are driven to VDD (where VDD is the supply voltage and Vtn is the nMOS threshold voltage).
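The NAND comparison can be modelled at the logic level, following the pass-transistor description above. MD passes SL when D = 1, and MDB passes SLB when D = 0, so node B equals XNOR(SL, D). The word matches only if every M1 in the series chain is ON. The sketch also prints the degraded node-B high level VDD - Vtn, using hypothetical voltages and ignoring body effect. Function names are hypothetical.

```python
from typing import Sequence


def nand_cell_node_b(sl: int, d: int) -> int:
    """Logic value reaching node B: MD (gate D) passes SL when D = 1;
    MDB (gate DB) passes SLB when D = 0. The result is XNOR(SL, D)."""
    slb = 1 - sl
    return sl if d == 1 else slb


def nand_word_chain_conducts(search: Sequence[int], stored: Sequence[int]) -> bool:
    """The M1 transistors of all cells form a series nMOS chain; the chain
    conducts (word match) only if node B is 1 in every cell."""
    if len(search) != len(stored):
        raise ValueError("search word and stored word must have the same width")
    return all(nand_cell_node_b(s, d) == 1 for s, d in zip(search, stored))


VDD, VTN = 1.8, 0.4  # hypothetical supply and nMOS threshold (V), body effect ignored
print(nand_word_chain_conducts([1, 0, 0, 1], [1, 0, 0, 1]))
print(nand_word_chain_conducts([1, 0, 0, 1], [1, 0, 1, 1]))
print(f"node B high level is about VDD - Vtn = {VDD - VTN:.1f} V, not {VDD} V")
```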

    3.2.3 Ternary Cells

    Usually two types of ternary cell are used. The NOR and NAND cells presented so far are binary CAM cells, which store either a logic 0 or a logic 1. Ternary cells can also store an X value. The X value is a "don't care" that represents both 0 and 1, allowing a wildcard operation. Wildcard operation means that an X value stored in a cell causes a match regardless of the input bit. As discussed earlier, this feature is used in packet forwarding in Internet routers. A ternary symbol can be encoded into two bits according to Table 3.1. We represent these two bits as D and DB. Although D and DB are not necessarily complementary, we keep the complementary notation for consistency with the binary CAM cell. Two bits can represent four possible states, but ternary storage requires only three, so we disallow the state where D and DB are both zero. To store a ternary value in a NOR cell, we add a second SRAM cell, as shown in Fig. 3.4(a). One bit, D, connects to the left pull-down path and the other bit, DB, connects to the right pull-down path, making the two pull-down paths independently controlled. We store an X by setting both D and DB to logic 1, which disables both pull-down paths and forces the cell to match regardless of the inputs. We store a logic 1 by setting D = 1 and DB = 0, and a logic 0 by setting D = 0 and DB = 1. In addition to storing an X, the cell allows searching for an X by setting both SL and SLB to logic 0. This is an external "don't care" that forces a match of a bit regardless of the stored bit.
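The ternary NOR encoding can be checked with a behavioural model built directly from the stored and search encodings of Table 3.1. The text does not give the exact transistor polarity of each path. The sketch therefore assumes the gating that is consistent with the table: the left path is enabled when D = 0 and SL = 1, and the right path when DB = 0 and SLB = 1. The prefix-style stored word "10XX" is a hypothetical packet-forwarding example.

```python
# Encodings taken from Table 3.1 of this chapter.
STORED = {"0": (0, 1), "1": (1, 0), "X": (1, 1)}   # symbol -> (D, DB)
SEARCH = {"0": (0, 1), "1": (1, 0), "X": (0, 0)}   # symbol -> (SL, SLB)


def ternary_nor_cell_match(stored: str, search: str) -> bool:
    d, db = STORED[stored]
    sl, slb = SEARCH[search]
    # Behavioral assumption consistent with the table: the left path conducts
    # when D = 0 and SL = 1, the right path when DB = 0 and SLB = 1. Storing X
    # (D = DB = 1) or searching X (SL = SLB = 0) disables both paths.
    left = d == 0 and sl == 1
    right = db == 0 and slb == 1
    return not (left or right)


def ternary_word_match(stored_word: str, search_word: str) -> bool:
    return len(stored_word) == len(search_word) and all(
        ternary_nor_cell_match(s, k) for s, k in zip(stored_word, search_word))


# A stored prefix "10XX" matches any 4-bit search word that starts with "10".
print([ternary_word_match("10XX", key) for key in ("1000", "1011", "1100")])
```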


    Table 3.1: Truth Table for NOR Cell

    Value    Stored Encoding (D, DB)    Search Encoding (SL, SLB)
    0        0, 1                       0, 1
    1        1, 0                       1, 0
    X        1, 1                       0, 0

    Table 3.2: Truth Table for NAND Cell

    Value    Stored Bits (D, M)    Search Bits (SL, SLB)
    0        0, 0                  0, 1
    1        1, 0                  1, 0
    X        0, 1                  1, 1
    X        1, 1                  1, 1

    Fig. 3.4: Structure of ternary core cells for (a) NOR-type and (b) NAND-type CAM [41], [42].


    Although storing an X is possible only in ternary CAMs, an external X symbol is possible in both binary and ternary CAMs. Where ternary operation is needed but only binary CAMs are available, ternary operation can be emulated using two binary cells per ternary symbol.

    As a modification to the ternary NOR cell of Fig. 3.4(a), it has been proposed to implement the pull-down transistors M1-M4 using pMOS devices and to complement the logic levels of the searchlines and matchlines accordingly. Using pMOS transistors (instead of nMOS transistors) for the comparison circuitry allows a more compact layout, because it reduces the number of p-diffusion to n-diffusion spacings in the cell. Besides increasing density, the smaller cell area reduces wiring capacitance and therefore reduces power consumption. The trade-off of using minimum-size pMOS transistors, rather than minimum-size nMOS transistors, is that the pull-down path has a higher equivalent resistance, slowing down the search operation.

    A NAND cell can be modified for ternary storage by adding storage for a mask bit at node M, as depicted in Fig. 3.4(b) [41], [42]. When storing an X, we set this mask bit to 1. This forces transistor Mmask ON regardless of the value of D, ensuring that the cell always matches. In addition to storing an X, the cell allows searching for an X by setting both SL and SLB to logic 1. Table 3.2 lists the stored encoding and search-bit encoding for the ternary NAND cell. Further minor modifications to CAM cells include mixing parts of the NAND and NOR cells, using dynamic-threshold techniques in silicon-on-insulator (SOI) processes, and alternating the logic level of the pull-down path to ground in the NOR cell [44]-[46].

    Currently, the NOR cell and the NAND cell are the prevalent core cells for providing

    storage and comparison circuitry in CMOS CAMs. For a comprehensive survey of the

    precursors of CMOS CAM cells refer to [47].

    3.3 MATCHLINE SENSING SCHEMES

    This section reviews matchline sensing schemes that generate the match result. First,

    we review the conventional precharge high scheme, then introduce several variations that

    save power.

    3.3.1 Conventional (Precharge-High) Matchline Sensing

    We review the basic operation of the conventional precharge-high scheme and look at

    sensing speed, charge sharing, timing control and power consumption.


    3.3.1.1 Basic Operation

    The basic scheme for sensing the state of the NOR matchline is first to precharge the matchline high and then to evaluate it by allowing the NOR cells to pull the matchline down in the case of a miss, or to leave the matchline high in the case of a match. Fig. 3.5(a) shows, in schematic form, an implementation of this matchline-sensing scheme. Fig. 3.5(b) shows the signal timing, which is divided into three phases: SL precharge, ML precharge, and ML evaluation. The operation begins by asserting slpre to precharge the searchlines low, disconnecting all the pull-down paths in the NOR cells. With the pull-down paths disconnected, the operation continues by asserting mlpreb to precharge the matchline high. Once the matchline is high, both slpre and mlpreb are de-asserted. The ML evaluate phase begins by placing the search word on the searchlines. If there is at least one single-bit miss on the matchline, a path (or multiple paths) to ground discharges the matchline, ML, indicating a miss for the entire word, which is output on the MLSA sense-output node, called MLso. If all bits on the matchline match, the matchline remains high, indicating a match for the entire word. Using this sketch of the precharge-high scheme, we investigate the performance of the matchline in terms of speed, robustness, and power consumption. The matchline power dissipation is one of the major sources of power consumption in CAM.

    Fig. 3. 5: (a) The schematic with precharge circuitry for matchline sensing using the precharge-high scheme, and (b) the corresponding timing diagram showing relative signal transitions. [34]

    3.3.1.2 Matchline Power

    In a typical system, the number of misses is expected to be much greater than the number of matches; thus, using a dynamic NAND structure results in a significant reduction in power. Fig. 3.6 demonstrates the power advantage of a NAND architecture over a NOR architecture. However, a NAND matchline architecture is considerably slower than its NOR counterpart, especially for wide words.

    Fig. 3. 6: Matchline power for NAND and NOR architecture [33]

    3.3.1.3 Charge Sharing

    There is a potential charge-sharing problem depending on whether the CAM storage

    bits D and DB are connected to the top transistor or the bottom transistor in the pulldown

    path.


    Fig. 3. 7: Two possible configurations for the NOR cell: (a) the stored bit is connected to the

    bottom transistors of the pulldown pair, and (b) the stored bit is connected to the top

    transistors of the pulldown pair.

    Fig. 3.7 shows these two possible configurations of the NOR cell. In the configuration of Fig. 3.7(a), there is a charge-sharing problem between the matchline, ML, and nodes X1 and X2. Charge sharing occurs during matchline evaluation, which immediately follows the matchline precharge-high phase. During matchline precharge, SL and SLB are both at ground. Once the precharge completes, one of the searchlines is activated, depending on the search data, causing either M1 or M2 to turn ON. This shares the charge at node X1 or node X2 with that of ML, causing the ML voltage, VML, to drop even in the case of a match, which may lead to a sensing error. To avoid this problem, designers use the configuration shown in Fig. 3.7(b), where the stored bit is connected to the top transistors. Since the stored bit is constant during a search operation, charge sharing is eliminated.
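    A worked charge-conservation estimate makes the voltage drop in the configuration of Fig. 3.7(a) concrete. The sketch assumes the worst case, in which every shared internal node starts at 0 V and the word matches (so there is no discharge path and the charge CML × VDD is simply redistributed). All capacitance values and the word width are hypothetical.

```python
def ml_voltage_after_sharing(vdd: float, c_ml: float, c_node: float, n_shared: int) -> float:
    """Matchline voltage after charge sharing on a matching word.

    Assumes ML starts at vdd, each shared internal node (X1 or X2) starts
    at 0 V (worst case) and no pull-down path exists, so the total charge
    c_ml * vdd is conserved over c_ml + n_shared * c_node.
    """
    return vdd * c_ml / (c_ml + n_shared * c_node)


VDD = 1.8         # V (hypothetical)
C_ML = 100e-15    # F, matchline capacitance (hypothetical)
C_X = 0.5e-15     # F, one internal node X1/X2 (hypothetical)
N_BITS = 72       # cells on the matchline (hypothetical)

# Fig. 3.7(a): in each matching cell the activated searchline connects one
# floating internal node to ML.
v_a = ml_voltage_after_sharing(VDD, C_ML, C_X, N_BITS)
# Fig. 3.7(b): the top transistors are gated by the constant stored bit,
# so raising the searchlines connects no new node to ML.
v_b = ml_voltage_after_sharing(VDD, C_ML, C_X, 0)
print(f"Fig. 3.7(a): {v_a:.3f} V, Fig. 3.7(b): {v_b:.3f} V")
```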

    3.3.1.4 Power Consumption

    The dynamic power consumed by a single matchline that misses is due to the rising edge during precharge and the falling edge during evaluation, and is given by the equation

    $$P_{miss} = f\,C_{ML}\,V_{DD}^{2}$$

    where f is the frequency of search operations, CML is the matchline capacitance and VDD is the supply voltage. In the case of a match, the power consumption associated with a single matchline depends on the previous state of the matchline; however, since typically there are only a small number of matches, we can neglect this power consumption. Accordingly, the overall matchline power consumption of a CAM block with w matchlines is

    $$P_{ML} = w\,P_{miss}$$
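    The two equations above can be evaluated directly. The numbers below (search frequency, matchline capacitance and number of words) are hypothetical and only show the orders of magnitude involved.

```python
def precharge_high_ml_power(f: float, c_ml: float, vdd: float, w: int) -> float:
    """Total matchline power of the precharge-high scheme: P_ML = w * f * C_ML * VDD**2.

    Assumes every one of the w matchlines misses in each search; the few
    matching lines are neglected, as in the text.
    """
    p_miss = f * c_ml * vdd ** 2   # one full-swing charge/discharge per search
    return w * p_miss


# Hypothetical example: 100 MHz searches, 100 fF matchlines, 1.8 V supply, 256 words
p = precharge_high_ml_power(f=100e6, c_ml=100e-15, vdd=1.8, w=256)
print(f"P_ML = {p * 1e3:.2f} mW")
```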


    3.5 CURRENT RACE SCHEME

    Current-race sensing is an important matchline-sensing scheme for CAM architectures. Fig. 3.9(a) shows a simplified schematic of the current-race scheme [51]. This scheme precharges the matchline low and evaluates the matchline state by charging the matchline with a current IML supplied by a current source. The signal timing is shown in Fig. 3.9(b). The precharge signal, mlpre, starts the search cycle by precharging the matchline low. Since the matchline is precharged low, the scheme concurrently charges the searchlines to their search data values, eliminating the need for the separate SL precharge phase required by the precharge-high scheme of Fig. 3.5(b). Instead, there is a single SL/ML precharge phase, as indicated in Fig. 3.9(b). After the SL/ML precharge phase completes, the enable signal, enb, connects the current source to the matchline. A matchline in the match state charges linearly to a high voltage, while a matchline in the miss state charges to a voltage of only IML × RML / m, where RML is the equivalent resistance of one pull-down path and m denotes the number of missing cells connected to the matchline. By setting the maximum voltage of a miss to be small, a simple matchline sense amplifier easily differentiates between a match state and a miss state and generates the signal MLso. As shown in Fig. 3.9, the amplifier is the nMOS transistor, Msense, whose output is stored by a half-latch. The nMOS sense transistor trips the latch with a threshold of Vtn. After some delay, matchlines in the match state charge to slightly above Vtn, tripping their latch, whereas matchlines in the miss state remain at a much smaller voltage, leaving their latch in its initial state. A simple replica matchline (not shown) controls the shutoff of the current source and the latching of the match signal. We derive the power consumption of this scheme by first noting that the same amount of current is supplied to every matchline, regardless of the state of the matchline. Looking at the match case for convenience, the power consumed to charge a matchline to slightly above Vtn is

    $$P_{match} = f\,C_{ML}\,V_{DD}\,V_{tn}$$

    Since a match and a miss consume the same power, the overall power consumption for all w matchlines is

    $$P_{ML} = w\,P_{match}$$

    This expression has the same form as the power of the low-swing scheme, with VMLswing = Vtn. The benefits of this scheme over the precharge-high schemes are the simplicity of the threshold circuitry and the extra savings in searchline power due to the elimination of the SL precharge phase.
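    The match/miss behaviour of the current-race matchline can be illustrated with a first-order RC model. This is a sketch under stated assumptions: RML is taken as the resistance of one mismatching cell's pull-down path (so m misses present RML/m), the current source is ideal, clamping at VDD is ignored, and every component value and the sampling time are hypothetical. The last line evaluates the ratio Pmatch/Pmiss = Vtn/VDD implied by the power equations of the two schemes.

```python
import math


def current_race_ml_voltage(t: float, i_ml: float, c_ml: float, r_path: float, m: int) -> float:
    """ML voltage t seconds after the current source is enabled (ML starts at 0 V).

    m == 0 (match): no pull-down path, so ML charges linearly, V = I * t / C.
    m > 0 (miss): m parallel paths of resistance r_path each, so V rises
    toward I * r_path / m with time constant (r_path / m) * C.
    """
    if m == 0:
        return i_ml * t / c_ml
    r_eq = r_path / m
    return i_ml * r_eq * (1.0 - math.exp(-t / (r_eq * c_ml)))


def latch_trips(v_ml: float, vtn: float) -> bool:
    """The nMOS sense transistor trips the half-latch once V_ML exceeds Vtn."""
    return v_ml > vtn


# Hypothetical parameters, chosen only for illustration
I_ML, C_ML, R_PATH, VTN, VDD = 50e-6, 100e-15, 5e3, 0.45, 1.8
T_SAMPLE = 1.0e-9   # replica-matchline shutoff time (hypothetical)

for misses in (0, 1, 2):
    v = current_race_ml_voltage(T_SAMPLE, I_ML, C_ML, R_PATH, misses)
    state = "match" if latch_trips(v, VTN) else "miss"
    print(f"m={misses}: V_ML={v:.3f} V -> {state}")

# Power relative to precharge-high: (f * C * VDD * Vtn) / (f * C * VDD**2) = Vtn / VDD
print(f"P_match / P_miss = {VTN / VDD:.2f}")
```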


    Fig. 3. 9: (a) Circuit implementation including precharge circuitry and (b) a timing diagram for a single search cycle, for current-race matchline sensing [51].

    The current-race scheme also allows the CAM cell configuration to be changed, because the matchline is precharged low. With precharge low, there is no charge-sharing problem for either CAM cell configuration of Fig. 3.7, since the ML precharge level is the same as the level of the intermediate nodes X1 and X2. Rather than charge-sharing avoidance, the criterion that determines which cell to use in this case is matching the parasitic capacitances between MLs. In the configuration of Fig. 3.7(b), the parasitic load on a matchline depends on the ON/OFF state of M1 and M2. Since different rows store different data, the capacitance CML varies among the MLs. However, in the configuration of Fig. 3.7(a), the variation of parasitic capacitance on the matchline depends only on the states of SL and SLB, which are the same for all cells in the same column. Thus, the configuration of Fig. 3.7(a) maintains good matching between MLs and prevents possible sensing errors due to parasitic capacitance variations.
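    The capacitance-matching argument can be checked with a simplified counting model. The assumptions are not taken from the text: each ON top transistor adds a fixed capacitance c_node to the ML, a stored X is written as D = DB = 0, and a searched X drives SL = SLB = 0. The stored words, search word and capacitance values are hypothetical.

```python
def top_devices_on(config: str, stored: str, search: str) -> int:
    """Count cells whose top pull-down transistor is ON, exposing an internal node to ML.

    config "a": top transistors gated by SL/SLB (Fig. 3.7(a)), so the count
                depends only on the search word.
    config "b": top transistors gated by D/DB (Fig. 3.7(b)), so the count
                depends on the stored word.
    """
    if len(stored) != len(search):
        raise ValueError("stored and search words must have the same length")
    if config == "a":
        return sum(1 for q in search if q != "X")
    if config == "b":
        return sum(1 for s in stored if s != "X")
    raise ValueError("config must be 'a' or 'b'")


def ml_capacitance(config: str, stored: str, search: str, c_wire: float, c_node: float) -> float:
    """Wiring capacitance plus one c_node per exposed internal node."""
    return c_wire + c_node * top_devices_on(config, stored, search)


rows = ["1010", "1X1X", "XXXX"]   # hypothetical stored words
key = "1100"                      # hypothetical search word
caps = {cfg: [ml_capacitance(cfg, r, key, c_wire=50.0, c_node=0.5) for r in rows]
        for cfg in ("a", "b")}
print(caps)   # capacitance in fF per matchline
```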

    3.6 SELECTIVE PRECHARGE SCHEME

    The selective-precharge scheme is motivated by allocating matchline power nonuniformly. The matchline-sensing techniques we have seen so far expend approximately the same amount of energy on every matchline, regardless of the specific data pattern and of whether there is a match or a miss. We now examine three schemes that allocate power to matchlines nonuniformly. The first technique, called selective precharge, performs a match operation on the first few bits of a word before activating the search of the remaining bits [52]. For example, in a 144-bit word, selective precharge initially searches only the first 3 bits and then searches the remaining 141 bits only for words that matched in the first 3 bits. Assuming a uniform random data distribution, the initial 3-bit search should allow only 1/2^3 = 1/8 of the words to survive to the second stage, saving about 88% of the matchline power. In practice, there are two sources of overhead that limit the power saving. First, to maintain speed, the initial match implementation may draw a higher power per bit than the search operation on the remaining bits. Second, an application may have a data distribution that is not uniform, and, in the worst-case scenario, the initial match bits are identical among all words in the CAM, eliminating any power saving.
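    The savings estimate can be made explicit. Under the uniform-data assumption, a fraction 2^-k of the words survive the first k bits. The sketch compares the ideal figure 1 - 1/8 = 87.5% (the "about 88%" above, which counts only the remaining 141 bits) with the saving when the energy of the 3-bit first stage, spent on every word, is also counted. The first_stage_cost parameter is a hypothetical knob for the per-bit overhead of the fast first stage mentioned in the text.

```python
from typing import Optional


def relative_ml_energy(n_bits: int, k_first: int,
                       survive: Optional[float] = None,
                       first_stage_cost: float = 1.0) -> float:
    """ML energy of selective precharge relative to searching all n_bits of every word.

    survive: fraction of words matching the first k_first bits; defaults to
             2**-k_first (uniform random data).
    first_stage_cost: per-bit energy of the first stage relative to the rest
             (> 1 models the speed-driven overhead).
    """
    if survive is None:
        survive = 2.0 ** -k_first
    first = first_stage_cost * k_first / n_bits       # spent on every word
    rest = survive * (n_bits - k_first) / n_bits      # spent on surviving words only
    return first + rest


e = relative_ml_energy(144, 3)
print(f"saving incl. first stage: {1 - e:.1%}")                  # 85.7%
print(f"saving of the 141-bit stage alone: {1 - 2 ** -3:.1%}")   # 87.5%
# Worst case: identical first bits in every word, plus a 1.5x first-stage cost
worst = relative_ml_energy(144, 3, survive=1.0, first_stage_cost=1.5)
print(f"worst case saving: {1 - worst:.1%}")
```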

    Fig. 3.10 is a simplified schematic of an example of selective precharge similar to that presented in the original paper [52]. The example uses the first bit for the initial search and the remaining n-1 bits for the remaining search. To maintain speed, the implementation modifies the precharge part of the precharge-high scheme [of Fig. 3.5(a) and (b)]. The ML is precharged through the transistor M1, which is controlled by the NAND CAM cell and turned on only if there is a match in the first CAM bit. The remaining cells are NOR cells.

    Fig. 3. 10: Sample implementation of the selective-precharge matchline technique [52].

    Note that the ML of the NOR cells must be pre-discharged (circuitry not shown) to ground to maintain correct operation in the case that the previous search left the matchline high due to a match. Thus, one implementation of selective precharge is to use this mixed NAND/NOR matchline structure. Selective precharge is perhaps the most common method used to save power on matchlines [34], [53]-[57], since it is both simple to implement and can reduce power by a large amount in many CAM applications.

    3.7 PIPELINING SCHEME

    More generally, an implementation may divide the matchline into any number of

    segments, where a match in a given segment results in a search operation in the next segment

    but a miss terminates the match operation for that word. A design that uses multiple

    matchline segments in a pipelined fashion is the pipelined matchlines scheme [58], [59]. Fig.

    3.11(a) shows a simplified schematic of a conventional NOR matchline structure where all

    cells are connected in parallel. Fig. 3.11(b) shows the same set of cells as in Fig. 3.11(a), but

    with the matchline broken into four matchline segments that are serially evaluated. If any

    stage misses, the subsequent stages are shut off, resulting in power saving. The drawbacks of

    this scheme are the increased latency and the area overhead due to the pipeline stages.


    Fig. 3. 11: Pipelined matchlines reduce power by shutting down after a miss in a stage

    By itself, a pipelined matchline scheme is not as compelling as basic selective precharge; however, pipelining enables the use of hierarchical searchlines, thus saving power. Another approach is to segment the matchline so that each individual bit forms a segment. Thus, selective precharge operates on a bit-by-bit basis. In this design, the CAM cell is modified so that the match evaluation ripples through each CAM cell. If at any cell there is a miss, the subsequent cells do not activate, as there is no need for a comparison operation. The drawback of this scheme is the extra circuitry required at each cell to gate the comparison with the result from the previous cell.
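    The segment-by-segment shutoff can be sketched behaviourally. The function below counts how many bit comparisons are activated for a given stored/search pair; setting the number of segments equal to the word length gives the bit-by-bit variant. The sketch ignores ternary data and pipeline timing, and the expected-activity formula assumes uniform random binary data; the function names are hypothetical.

```python
def active_bits(stored: str, search: str, n_segments: int) -> int:
    """Bits whose comparison is activated when the ML is split into n_segments.

    Segments are evaluated in order; a miss in one segment shuts off all later
    segments. n_segments == len(stored) models the bit-by-bit variant.
    """
    n = len(stored)
    if len(search) != n or n % n_segments:
        raise ValueError("words must have equal length divisible by n_segments")
    seg = n // n_segments
    active = 0
    for start in range(0, n, seg):
        active += seg
        if stored[start:start + seg] != search[start:start + seg]:
            break   # miss: later segments are not activated
    return active


def expected_activity(n_bits: int, n_segments: int) -> float:
    """Expected fraction of bits activated for uniform random binary data.

    Segment j (0-based) is reached only if all j earlier segments match,
    which happens with probability 2**(-seg * j).
    """
    seg = n_bits // n_segments
    return sum(2.0 ** (-seg * j) for j in range(n_segments)) / n_segments


print(active_bits("11110000", "11111000", 4))   # 6: miss in the third 2-bit segment
print(active_bits("11110000", "11111000", 8))   # 5: bit-by-bit, miss at the fifth bit
print(f"{expected_activity(144, 4):.3f}")       # 0.250
```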

    Fig. 3. 12: Simulated waveforms in the pipelined matchline architecture for (a) the full-match case consisting of a match in every stage, and (b) a miss case where the third stage results in a miss and turns off the subsequent stages [58].


    Fig. 3.12 shows the simulated waveforms of the ML segments in the pipelined ML scheme. Fig. 3.12(a) shows a full match, as indicated by the rising ML in every segment along with the corresponding full-rail output of the MLSA. Fig. 3.12(b) shows an example of a word that misses in the third stage, as indicated by the lack of an MLSA output pulse. In this example, the ML sensing circuitry of the fourth and fifth segments is not activated, hence saving power.

    3.8 CURRENT SAVING SCHEME

    The current-saving scheme [60], [61] is another data-dependent matchline-sensing scheme, which is a modified form of the current-race sensing scheme. Recall that the current-race scheme uses the same current on each matchline, regardless of whether it has a match or a miss. The key improvement of the current-saving scheme is to allocate a different amount of current for a match than for a miss. In the current-saving scheme, matches are allocated a larger current and misses are allocated a lower current. Since almost every matchline has a miss, overall the scheme saves power.

    Fig. 3. 13: Current-saving matchline-sensing scheme

    Fig. 3.13 shows a simplified schematic of the current-saving scheme. The main difference from the current-race scheme as depicted in Fig. 3.9 is the addition of the current-control block. This block is the mechanism by which a different amount of current is allocated, based on a match or a miss. The input to this current-control block is the matchline voltage, VML, and the output is a control voltage that determines the current, IML, which charges the matchline. The current-control block provides positive feedback, since higher VML results in higher IML, which, in turn, results in higher VML.
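    The effect of the positive feedback can be illustrated numerically. The current-control law I_ML = I_MIN + G * V_ML used below is a hypothetical linear stand-in for the current-control block, not the circuit of [60], [61], and all values are illustrative. The sketch compares the charge delivered to a miss line with the charge that a constant, current-race-style source would have to deliver to every line to reach the same match voltage in the same time.

```python
# Hypothetical, illustrative parameters
C_ML = 100e-15     # F, matchline capacitance
R_PATH = 2e3       # ohm, one mismatching cell's pull-down path
I_MIN = 20e-6      # A, current at V_ML = 0
G = 200e-6         # A/V, assumed gain of the current-control block
VDD = 1.8          # V


def simulate_ml(m: int, t_end: float = 1e-9, dt: float = 1e-12):
    """Euler integration of C * dV/dt = I_ML(V) - m * V / R_PATH, ML starting at 0 V.

    Uses the hypothetical control law I_ML = I_MIN + G * V (positive feedback).
    Returns (final V_ML, charge delivered by the current source).
    """
    v, q = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        i_src = I_MIN + G * v          # higher V_ML -> higher I_ML
        q += i_src * dt
        v += dt * (i_src - m * v / R_PATH) / C_ML
        v = min(v, VDD)                # the source cannot charge ML above the supply
    return v, q


v_match, q_match = simulate_ml(0)
v_miss, q_miss = simulate_ml(1)
# A constant source reaching v_match in the same time delivers C * v_match to every ML.
q_race = C_ML * v_match
print(f"match: {v_match:.3f} V, {q_match * 1e15:.1f} fC")
print(f"miss:  {v_miss:.3f} V, {q_miss * 1e15:.1f} fC")
print(f"equal-speed constant source: {q_race * 1e15:.1f} fC per ML")
```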


    3.9 CONCLUSION

    The above discussion of previous content-addressable memory (CAM) designs shows that the main concerns were power and speed, and that researchers improved both significantly. Scheme simplicity and noise immunity are two further factors that researchers also took into account. The reported simulation results for these schemes are summarized below in table format.

    Table 3. 3: Comparison between the schemes [62]

    Scheme                ML energy          Cycle time (ns)   Noise   Scheme simplicity
                          (fJ/bit/search)
    Conventional          9.5                3.9               +       ++
    Low swing             4.2                3.1               -       -
    Current race          5.5                3.7               -       +
    Selective precharge   5.6                3.5               +       +
    Pipelining            5.8                3.8               +       -
    Current saving        4.3                3.7               -       --
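    The relative figures behind Table 3.3 can be derived directly from the tabulated values. The energy-delay product computed here is an added comparison metric, not a column of [62].

```python
# Values transcribed from Table 3.3 [62]: (ML energy in fJ/bit/search, cycle time in ns)
SCHEMES = {
    "Conventional": (9.5, 3.9),
    "Low swing": (4.2, 3.1),
    "Current race": (5.5, 3.7),
    "Selective precharge": (5.6, 3.5),
    "Pipelining": (5.8, 3.8),
    "Current saving": (4.3, 3.7),
}

base_energy, _ = SCHEMES["Conventional"]
for name, (energy, cycle) in SCHEMES.items():
    saving = 1 - energy / base_energy      # ML energy saved vs. the conventional scheme
    edp = energy * cycle                   # energy-delay product, fJ*ns per bit
    print(f"{name:20s} saving {saving:6.1%}  EDP {edp:5.2f}")

best_edp = min(SCHEMES, key=lambda s: SCHEMES[s][0] * SCHEMES[s][1])
print("lowest energy-delay product:", best_edp)
```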


    CHAPTER 4

    PROPOSED CHARGING CONTROL SCHEME & SENSE AMPLIFIER


    4.1 INTRODUCTION

    Content-addressable memory (CAM) is a storage device that searches for matching data by content and returns the address of the matching data. Due to the rising demand for CAMs with high-speed search capability in various applications, both the size and the power consumption of CAM arrays continue to rise.

    Much research has been done to reduce the power consumption and to increase the speed. Previous works present schemes such as the selective-precharge scheme [63], the current-race scheme [64] and the current-saving scheme [65], [66]. A survey [67] reports that the current-saving scheme consumes less power than the other schemes [63], [64], [68]. In this work, we propose a simplified match-line charging control scheme. Comparison with the current-race and current-saving schemes shows a match-line energy reduction of over 50% and a speed improvement of over 3 times.

    4.2 PROPOSED CHARGE CONTROLLING SCHEME

    Fig. 4.1(a) shows the basic architecture of the proposed charge controlling scheme. Here, the array of CAM cells stores the data entries; the search data register stores the search word; each charge controller block controls the charging and discharging of the respective match-line (ML); and the sense amplifier (SA) senses the ML voltage and gives the final match/miss decision. Fig. 4.1(b) shows the internal circuit of the conventional NOR-type ternary CAM (TCAM) cell, which is less susceptible to failure due to process variation than the NAND-type cell [69].

    At the beginning of each search cycle, all MLs are pre-discharged to ground by the charging controller. During the evaluation stage, the search data register broadcasts the search data to the search-lines (SLs). If SL matches the stored bit D, or the stored bit is X (don't care), the cell provides no discharging path from the ML to ground and its pull-down path remains in the high-impedance state. If SL does not match D, the cell provides a discharging path from the ML to ground (through either the T1-T2 path or the T3-T4 path). So, on an ML, the number of discharging paths is equal to the number of mismatches.
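    The rule that the number of discharging paths equals the number of mismatches can be written as a short function. This is a behavioural sketch with hypothetical names, assuming a binary search word and ternary stored entries; the ML0/ML1 labels follow the naming used in this chapter.

```python
def discharge_paths(stored: str, search: str) -> int:
    """Number of pull-down paths a NOR-type TCAM word presents to its ML.

    stored: ternary word over '0', '1', 'X' (X = don't care)
    search: binary search word over '0', '1'
    Each cell whose stored bit is not X and differs from the search bit
    contributes one path (T1-T2 or T3-T4) to ground.
    """
    if len(stored) != len(search):
        raise ValueError("stored and search words must have the same length")
    return sum(1 for s, q in zip(stored, search) if s != "X" and s != q)


def ml_label(stored: str, search: str) -> str:
    """Label a matchline ML0 (full match), ML1 (one miss), ML2, ..."""
    return f"ML{discharge_paths(stored, search)}"


entries = ["101X", "1010", "0000"]   # hypothetical TCAM contents
key = "1011"                         # hypothetical search word
print([ml_label(e, key) for e in entries])   # ['ML0', 'ML1', 'ML3']
```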

    Fig. 4.2 shows the proposed charging controller. At the beginning of the search cycle, ML is pre-discharged to ground by a high MLP. The transistor M2 is turned ON by the high MLP and M5 is turned OFF by a low MLC; this causes M3 to be turned OFF. Therefore, charging of ML through M3 is prohibited during the pre-discharging of ML.

    During the evaluation stage, MLP is switched to low so that both M1 and M2 turn OFF. At the same time, MLC is switched to high so that charging of ML through M3 begins.


    If the CAM cell data of this ML fully match the search-line data, there is no discharging path through the CAM cells. So, the fully matched match-line (denoted by ML0) charges faster than a partially matched match-line. MLC is kept high until the voltage of ML0 is high enough to turn M4 ON; this pulls the gate of transistor M3 low, so charging of ML0 continues via M3. When the voltage of ML0 reaches about VDD/3 and MLC is low, M6 becomes fully ON and M7 becomes partially ON. As a result, charging of ML through M3 stops. The achieved voltage (~VDD/3) of ML0 is high enough to be sensed as a high level by the proposed sense amplifier (SA). The match-line with one miss is denoted by ML1; this ML is the hardest to detect as a miss. As ML1 charges more slowly than ML0, the voltage of ML1 does not become high enough to be sensed as a high level by the SA.
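    The charging-controller behaviour can be approximated by a behavioural model: a single charging pMOS acting as a current source while MLC is high, a hold condition standing in for the M4 feedback path, and a hard stop at VDD/3 standing in for M6/M7. Apart from VDD, the 0.33 ns MLC window (Section 4.3) and the VDD/3 stop level, every parameter is hypothetical, so the ML1 peak produced here will not reproduce the 0.46 V reported in Section 4.3; the sketch only shows why ML0 settles near VDD/3 while a one-miss line falls back.

```python
# Hypothetical behavioural parameters (not extracted from the thesis circuit)
VDD = 1.8
C_ML = 100e-15      # F, matchline capacitance
I_CH = 150e-6       # A, current through the single charging pMOS M3
R_PATH = 2e3        # ohm, one mismatching cell's pull-down path
T_MLC = 0.33e-9     # s, MLC high window
V_HOLD = 0.3        # V, assumed ML level that keeps M3 on via M4 after MLC falls
V_STOP = VDD / 3    # V, level at which M6/M7 stop the charging
V_SENSE = 0.5       # V, assumed SA decision level between the ML1 peak and ML0


def evaluate_ml(misses: int, t_end: float = 1.0e-9, dt: float = 1e-12):
    """Return (peak V_ML, final V_ML, SA reports a match?) for an ML with `misses` paths."""
    v = peak = 0.0
    for k in range(int(round(t_end / dt))):
        t = k * dt
        # Charging runs while MLC is high, continues afterwards only if ML has
        # already risen far enough to hold M3 on, and always stops at V_STOP.
        charging = (t < T_MLC or v > V_HOLD) and v < V_STOP
        i_in = I_CH if charging else 0.0
        v += dt * (i_in - misses * v / R_PATH) / C_ML
        peak = max(peak, v)
    return peak, v, v > V_SENSE


for m in (0, 1, 2):
    peak, final, is_match = evaluate_ml(m)
    print(f"ML{m}: peak {peak:.3f} V, final {final:.3f} V ->",
          "match" if is_match else "miss")
```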

    Fig. 4. 1: Structure of the CAM of the proposed scheme: (a) basic architecture and (b) NOR-type TCAM cell used in the scheme.


    Fig. 4. 2: Internal circuit of the charging controller.

    Fig. 4.3 shows the proposed sense amplifier (SA). During the precharge stage, node SN is precharged high through MS1. During the evaluation stage, when the voltage of ML is slightly above VDD/3, transistor MS2 turns ON. So, node SN begins to discharge and MLS begins to rise. As MLS reaches a high level, transistor MS3 turns ON, resulting in faster discharge of node SN. Transistor MS4 is used to initialize the output MLS to ground at the beginning of each search cycle.

    4.3 SIMULATION RESULTS AND ANALYSIS

    The proposed scheme, along with the current-race and current-saving schemes, is simulated using HSPICE on the same 64 × 72 TCAM in a TSMC 0.18 µm CMOS process with a supply voltage of 1.8 V.

    Fig. 4.4 shows the simulation results for our design. At the beginning of a search cycle, MLP is high for 0.39 ns so that previously charged MLs can discharge to ground. Then, MLP is switched to low and MLC is switched to high for 0.33 ns. When the voltage of ML0 rises to 0.65 V and MLC is low, the charging stops and the voltage of ML0 remains unchanged. The proposed sense amplifier senses this voltage as a match, and output MLS0 goes to a full high level. On the other hand, the voltage of ML1 rises to 0.46 V and then degrades through the discharging path. The SA senses this voltage as a miss and produces a low MLS1. Because the difference between the peak voltages of ML0 and ML1 (the worst-case match-line) is 190 mV, the scheme is expected to produce correct search results even under large process variation.

    Table 4.1 compares the search energy and performance of the schemes. The ML energy reduction is 57% and 54% compared to the current-race and current-saving schemes, respectively, and the speed is 3.13 times that of both schemes.


    The major difference between the previous schemes [64]-[66] and our scheme is that we use only one PMOS transistor to charge the ML, whereas the previous schemes use two PMOS transistors in series. So, the equivalent resistance of the charging path is about half that of the previous schemes. The equivalent capacitance is also reduced. As a result, the speed of the circuit increases. The basic difference between the previous SA designs [64]-[66] and the proposed SA is the switching threshold of the SA. The proposed SA can sense a stable 0.6 V as a high level, whereas the previous SAs use ~1 V as the switching voltage. So, the proposed scheme does not need to charge ML0 further once 0.6 V (VDD/3) is reached. This reduces both dynamic and leakage power considerably, whereas the larger voltage swing of about VDD/2 in [64]-[66] causes higher power consumption.

    Fig. 4. 3: Proposed sense amplifier (SA)


    Fig. 4. 4: Simulation results of the proposed CAM showing voltages ML0 (fully matched),

    ML1 (one-bit miss), ML2 (two-bit miss), MLC and MLP.

    Table 4. 1: Comparison of Different Schemes

    Scheme                      ML energy         SL energy         Minimum cycle    Speed, 1/T
                                (fJ/bit/search)   (fJ/bit/search)   time, T (ns)     (MHz)
    Current-race [64]           3.56              0.65              3.60             278
    Current-saving [65], [66]   3.32              0.65              3.60             278
    This work                   1.53              0.62              1.15             870
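    The derived column of Table 4.1 and the percentages quoted in Section 4.3 follow from the tabulated values, as the short check below shows (1/T in MHz is 1000 divided by T in ns).

```python
# Values transcribed from Table 4.1: (ML energy in fJ/bit/search, minimum cycle time in ns)
TABLE_4_1 = {
    "Current-race": (3.56, 3.60),
    "Current-saving": (3.32, 3.60),
    "This work": (1.53, 1.15),
}

e_new, t_new = TABLE_4_1["This work"]
for name in ("Current-race", "Current-saving"):
    e_old, t_old = TABLE_4_1[name]
    print(f"vs {name}: ML energy -{1 - e_new / e_old:.0%}, speed x{t_old / t_new:.2f}")

for name, (_, t_ns) in TABLE_4_1.items():
    print(f"{name}: 1/T = {1e3 / t_ns:.0f} MHz")

# Sensing margin quoted in Section 4.3: peak ML0 (0.65 V) minus peak ML1 (0.46 V)
print(f"margin = {(0.65 - 0.46) * 1e3:.0f} mV")
```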

    4.4 CORNER SIMULATION OF THE SCHEME

    To test the reliability of the proposed scheme, we simulated it at two extreme corners, defined as follows: the fast corner with low threshold voltage, high VDD and low temperature (FHL), and the slow corner with high threshold voltage, low VDD and high temperature (SLH). Simulation results reveal that the proposed scheme works satisfactorily in the fast (FHL) corner with -10% threshold voltage, +5% VDD and a temperature of 273 K, as well as in the slow (SLH) corner with +10% threshold voltage, -5% VDD and a temperature of 343 K. The

    simulat