
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS, VOL. 56, NO. 6, JUNE 2009, p. 479

    A High-Performance and Energy-Efficient TCAM Design for IP-Address Lookup

    Yen-Jen Chang, Member, IEEE

    Abstract: In this brief, we propose a two-level don't-care gating (DCG) scheme that aims to reduce the ternary content-addressable memory (TCAM) power dissipated in the search-line switching activity. By exploiting the vertically continuous don't-care feature, the two-level DCG scheme can largely reduce the average search-line power consumption during a switch pattern. In addition, we also use the search enable technique to eliminate the unnecessary search-line switching activity in the quiet pattern. By reducing both the search-line switching activity and average switching power, the proposed design can minimize the TCAM search-line power consumption. For a 128 × 32 TCAM, the best configuration we examined shows that when the first-level and second-level gating granularities are 16 and 8, respectively, with a 9% search performance improvement, the two-level DCG scheme can achieve 70% search-line energy reduction.

    Index Terms: Forwarding table, low-power design, router, search line (SL), ternary content-addressable memory (TCAM).

    I. INTRODUCTION

    BECAUSE ternary content-addressable memory (TCAM)

    has an additional don't-care (or X) state to perform

    the wild match, it is widely used in the forwarding table of

    the network router. However, the power consumption of TCAM

    is usually considerable due to the parallel comparison feature,

    in which a large number of transistors and wires are active on each lookup. As revealed in [1], there are three major power

    consumers in TCAM, including the clock and control, match

    lines (MLs), and search lines (SLs). The power dissipated by

    the former two components has effectively been reduced [2],

    [3], but the SLs still contribute 54%-82% to the total power

    consumption [1]. The SL power consumption can be reduced by

    using the segmented [1] or hierarchical SL scheme [4]-[6] and

    minimizing the switching activity [7], [8]. However, they suffer

    from either performance penalty or complex control circuitry.

    In contrast, this brief presents a low-power TCAM design that

    consists of the two-level don't-care gating (DCG) scheme and

    the search enable (SE) technique. Without performance penalty

    and complex control circuitry, our design can largely reduce the

    TCAM power dissipated in the SL switching activity.

    The most pronounced features of the proposed low-power

    TCAM design are summarized as follows: 1) The SE technique

    can completely eliminate the unnecessary SL switching activity

    in the quiet pattern, but it will result in a performance penalty.

    Manuscript received November 4, 2008; revised January 23, 2009. Current version published June 17, 2009. This paper was recommended by Associate Editor T. Zhang.

    The author is with the Department of Computer Science and Engineering, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TCSII.2009.2020935

    2) Based on the vertically continuous X feature, the two-level

    DCG scheme uses the additional gating nodes to conditionally

    prevent the search data from being broadcast over the entire SL.

    By decreasing the SL effective capacitance, the two-level DCG

    scheme can largely reduce the average power dissipated in the

    SL switches. 3) Instead of the simple transmission gates (TGs),

    the two-level DCG scheme uses the inverter chain circuitry

    to accelerate the transmission of search data. Because such

    speedup will compensate for the performance loss incurred by

    the SE technique, our design can achieve comparable or even

    better performance than the conventional TCAM design.

    The proposed TCAM design with a size of 128 × 32 was

    implemented with the TSMC 0.18-μm technology, and all

    the results were measured from the HSPICE simulation. By

    examining all possible configurations, the results show that

    our design can achieve the best energy efficiency when the

    first-level (L1) and second-level (L2) gating granularities are

    16 and 8, respectively. Compared to the conventional NOR-type

    TCAM, our design not only reduces the average SL

    energy consumption by 63%-70% but also improves the search

    performance by 9%.

    The rest of this brief is organized as follows. Section II

    reviews the conventional NOR-type TCAM design and the

    previous work on TCAM power reduction. Section III describes the circuitry developed for the two-level DCG scheme in detail.

    In addition to the discussion of the important issues, a

    comparison between our design and the related work is also

    provided. The measurement results and analysis are given in

    Section IV, and Section V offers some brief conclusions.

    II. TCAM

    As shown in Fig. 1, a typical TCAM cell consists of three

    major components: 1) an eight-transistor XOR-type content-addressable

    memory (CAM) cell that not only stores the prefix

    data but also compares the prefix data with the search data;

    2) a six-transistor static random access memory (SRAM) cell

    that stores the mask bit to indicate whether this TCAM cell is

    X or not; and 3) a control logic that can conditionally pull

    down the ML. As shown in Fig. 1, this can simply be implemented

    with two n-type metal-oxide-semiconductor transistors

    placed in series, which are controlled by the mask bit and the

    XOR result of the CAM cell, respectively.

    A. Search Operation

    The search operation is initiated by discharging all SLs and

    then precharging all MLs to VDD. In Fig. 1, the discharge of

    both S and S̄ is to ensure that no short-circuit path exists during

    1549-7747/$25.00 © 2009 IEEE


    Fig. 1. Typical TCAM cell and the corresponding state table.

    the ML precharge. After precharging the ML, the search data

    and its complement are then applied to S and S̄ to perform the search operation. As shown in Fig. 1, M = 1 means that this TCAM cell is in the X state, where it is always a match

    regardless of the comparison result of the CAM cell. Such

    comparison is referred to as a wild match. In contrast, if the mask bit is 0, this TCAM cell is in either the 0 or 1 state. In

    this case, if the search data are equal to the stored data, it is also

    a match that we refer to as a normal match to distinguish it from

    the wild match. In Fig. 1, the ML is discharged to 0 only in case

    of a mismatch, in which M = 0 and the XOR result of the CAM cell is 1. Note that the search operations account for almost

    all the TCAM operations, and the SLs are highly capacitive.

    Thus, the TCAM power dissipated in the high-frequency SL

    switching is substantial [1].
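The match rule described above can be sketched behaviorally. This is a minimal logic-level model, not the circuit itself; the function names `cell_matches` and `word_matches` are illustrative assumptions, and the ML is treated as staying high only when every cell in the word matches.

```python
def cell_matches(stored_bit, mask_bit, search_bit):
    """Return True if one TCAM cell leaves the match line (ML) high."""
    if mask_bit == 1:
        # M = 1: the cell is in the X state, so it is a wild match.
        return True
    # M = 0: normal match only when the XOR of stored and search bit is 0.
    return (stored_bit ^ search_bit) == 0


def word_matches(stored, mask, search):
    """The ML stays precharged only if no cell pulls it down."""
    return all(cell_matches(s, m, d) for s, m, d in zip(stored, mask, search))


if __name__ == "__main__":
    stored = [1, 0, 1, 0]
    mask = [0, 0, 1, 1]          # last two cells are X
    print(word_matches(stored, mask, [1, 0, 0, 1]))  # X cells ignore the search bits
    print(word_matches(stored, mask, [1, 1, 0, 1]))  # second bit mismatches with M = 0
```

In this model, a mismatch requires both M = 0 and an XOR result of 1, which mirrors the two series pull-down transistors controlled by the mask bit and the XOR result.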

    B. Related Work

    Including our previous work [9]-[11], a large number of

    designs have been proposed to reduce TCAM power consumption,

    particularly ML power consumption [2], [3]. Because this

    brief aims at reducing the TCAM SL power, we only focus on

    the work related to the SL power reduction. In [4], Pagiamtzis

    and Sheikholeslami introduced a hierarchical SL design in

    which the SLs are broken into the global SLs (GSLs) and local

    SLs (LSLs). Instead of directly connecting to the CAM cells,

    the GSLs with low-swing signals are fed into the LSLs, and

    then, the local receivers would conditionally amplify the signals

    to drive a subset of CAM cells. Because only a few LSLs

    with reduced capacitance would be activated, the CAM power dissipated in the SLs can effectively be reduced.

    Based on [4], an alternative don't-care-based hierarchical

    (DCBH) SL design [6] was proposed. The DCBH scheme uses

    the X bit stored in the bottom word of each block to control

    whether the search data on GSL should be broadcast to the

    corresponding LSL. The GSLs are active every cycle, but the

    LSLs are active only when the bottom word is not X.

    The only difference between [4] and [6] is the LSL control

    strategy. In [1], the segmented SL design was introduced,

    which uses the segmentation cell (SC) to segment the SL. The

    SC consists of a dummy cell and a path-control switch. In

    particular, the dummy cell is an extra SRAM cell to probe the

    continuous X information. When the TCAM cell above the SC stores the X bit, the SC dummy cell will keep 0 to turn

    Fig. 2. (a) Search data example of five consecutive 0s. Case A/B is the TCAM design without/with SE scheme. (b) Traditional SE scheme.

    off the path-control switch to block the search data from being

    further broadcast. Because the effective capacitance of SL is

    decreased, the SL power consumption can be reduced.

    III. TWO-LEVEL DCG SCHEME

    A. SE Technique

    If we only consider two consecutive search data on a single

    bit, there are four search patterns, i.e., 0→0, 1→1, 0→1, and 1→0. Because they involve no data transition, the 0→0 and 1→1 patterns are classified as quiet patterns. In contrast, 0→1 and 1→0 are classified as switch patterns. In the conventional TCAM design, because the ML has to be charged to high during

    the precharge phase, both S and S̄ must be discharged to 0 to

    avoid a possible short circuit. However, such discharge will increase the unnecessary SL switching activity. For example,

    Fig. 2(a) shows a search data pattern of five consecutive 0s.

    In case A, i.e., the conventional TCAM design, the gray block

    contains the values of S and S̄ during the precharge phase. Clearly, the number of energy-consuming transitions (N0→1) on SL is four, but they are all unnecessary switching activities.

    As illustrated in Fig. 2(b), a straightforward solution to this

    weakness is the introduction of an additional transistor, i.e., N3,

    that is used to disconnect the pull-down path during the ML

    precharge and then enable the search operation. It is referred to

    as the SE technique, whose effect can be observed in case B

    shown in Fig. 2(a), where the unnecessary SL switches are all

    eliminated. Compared to the traditional TCAM without SE,

    whose pull-down path is only N1 and N2, the use of SE will

    result in an increase of one transistor in the length of the pull-

    down path. Based on our simulation, the optimal N3 width

    is three times the N1 (or N2) width, where the performance

    penalty is about 4.2%. Fortunately, this performance penalty

    can be compensated by the two-level DCG scheme proposed

    in the following section.
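The transition count in the five-consecutive-0s example can be reproduced with a small waveform model. This is a sketch under stated assumptions: the waveform is taken to start at the first search (so the initial drive is not counted), the conventional design resets both lines to 0 before every later search, and the SE design simply holds the lines between searches. The helper names are hypothetical.

```python
def count_rising_transitions(waveform):
    """Count 0 -> 1 transitions, the energy-consuming ones on a search line."""
    return sum(1 for prev, cur in zip(waveform, waveform[1:])
               if prev == 0 and cur == 1)


def sl_waveforms(search_bits, search_enable):
    """Build the value sequences seen on S and its complement."""
    s_wave, sbar_wave = [], []
    for i, bit in enumerate(search_bits):
        if i > 0 and not search_enable:
            # Conventional design: discharge both lines during ML precharge.
            s_wave.append(0)
            sbar_wave.append(0)
        # Search phase: apply the search bit and its complement.
        s_wave.append(bit)
        sbar_wave.append(1 - bit)
    return s_wave, sbar_wave


def rising_transitions(search_bits, search_enable):
    s_wave, sbar_wave = sl_waveforms(search_bits, search_enable)
    return count_rising_transitions(s_wave) + count_rising_transitions(sbar_wave)


if __name__ == "__main__":
    zeros = [0, 0, 0, 0, 0]
    print("case A (no SE):", rising_transitions(zeros, search_enable=False))
    print("case B (SE):   ", rising_transitions(zeros, search_enable=True))
```

Under these assumptions, the quiet pattern costs one rising transition per search in the conventional design and none with SE, while a switch pattern costs one rising transition in either design.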

    B. L1 DCG

    Fig. 3 shows the general configuration of a TCAM array

    with N prefixes. The function of the routing table lookup is to find the longest one in all the prefixes that match the incoming


    Fig. 3. General configuration of a prefix array.

    Fig. 4. L1 gating node (GNL1) implementation.

    packet's destination address. This is commonly known as the

    longest prefix match (LPM) problem [12]. To simply resolve

    the LPM problem, the stored prefixes are sorted by decreasing

    length. Clearly, if the ith cell is X, then all the cells below

    the ith cell must be X. It is referred to as the vertically

    continuous don't-care feature. For example, consider the gray

    column shown in Fig. 3. Because the sixth cell is X, the

    sixth through Nth cells are all X. This scenario implies that the

    comparisons between the search data and the prefix data stored in the sixth through Nth cells are redundant. If we can gate the search

    data at the sixth cell, only the capacitances before the sixth

    cell need to be charged/discharged. The capacitance that

    actually needs to be charged is referred to as the effective capacitance. In

    this example, due to the decrease of effective capacitance, the

    SL power consumption can be reduced.
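The vertically continuous don't-care property follows directly from sorting the prefixes by decreasing length, which a short sketch can check. The prefix lengths below are made-up illustration values, and the helper names are assumptions; mask bit 1 denotes an X cell, as in the state table of Fig. 1.

```python
def prefix_to_mask(prefix_len, width=32):
    """Mask bits of one entry: 0 for specified bits, 1 (X) for the rest."""
    return [0] * prefix_len + [1] * (width - prefix_len)


def column(masks, bit):
    """Mask bits of one bit position, read from the top entry downward."""
    return [row[bit] for row in masks]


def first_x_row(mask_column):
    """Row index of the first X cell, or the column length if there is none."""
    for row, m in enumerate(mask_column):
        if m == 1:
            return row
    return len(mask_column)


def is_vertically_continuous(mask_column):
    """True if every cell below the first X cell is also X."""
    return all(m == 1 for m in mask_column[first_x_row(mask_column):])


if __name__ == "__main__":
    lengths = sorted([8, 24, 16], reverse=True)   # hypothetical prefix lengths
    masks = [prefix_to_mask(n) for n in lengths]
    col = column(masks, 20)
    print(col, "cells before the first X:", first_x_row(col))
    print(all(is_vertically_continuous(column(masks, b)) for b in range(32)))
```

In this model, `first_x_row` is the number of cells that still need the search data, so it serves as a rough proxy for the effective capacitance of that column.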

    To gate the search data from being broadcast over the entire

    SL, our design inserts the gating nodes to break the entire SL

    into several segments. As shown in Fig. 4, the L1 gating node

    (GNL1) is implemented as an inverter that is controlled by the corresponding mask bit. For example, suppose a GNL1 is located

    in the ith cell. When the ith cell is X, Mi = 1 will cut off

    both the power and ground sources to disable the inverter from transmitting the search data. Otherwise, the inverter will take

    Fig. 5. L2 DCG example, where the granularity (GL2) is 4, and GNL2 is the L2 gating node.

    effect in the case of Mi = 0. Note that the search data are inverted after the gating node.

    The granularity of L1 DCG (GL1) is defined as the number of cells between two L1 gating nodes. It is critical to both

    search performance and power saving. If GL1 is too large, the probability that the first cell of a segment is X is low, such that

    the amount of effective capacitance reduction is insignificant.

    In contrast, a small GL1 can largely reduce the effective capacitance but suffers from a serious propagation delay. Therefore,

    an adequate GL1 that benefits both power saving and search performance is very critical to our design. The detailed analysis

    will be provided in the experimental results.
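The effect of GL1 on the effective capacitance can be illustrated with a cell-count proxy. This is a simplified sketch, not the circuit model used in the brief: it assumes a gating node at the first cell of every L1 segment, counts only how many cells still receive the search data, and ignores the gating nodes' own load and delay. The function name and arguments are illustrative.

```python
import math


def l1_driven_cells(n_cells, x_count, g1):
    """Cells that still see the search data with L1 gating only.

    x_count is the number of continuous X cells at the bottom of the column.
    A segment is driven when its first cell is not X.
    """
    first_x = n_cells - x_count            # row index of the first X cell
    driven_segments = math.ceil(first_x / g1)
    return min(n_cells, driven_segments * g1)


if __name__ == "__main__":
    for g1 in (8, 16, 32):
        print("GL1 =", g1, "driven cells:", l1_driven_cells(128, 60, g1))
```

With 60 continuous X cells in a 128-cell column, a coarser granularity leaves more X cells connected, because the segment that contains the first X cell is still driven in full.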

    C. L2 DCG

    The L1 DCG is beneficial to reduce the SL power only when

    the first TCAM cell is X in a segment. If the first cell is

    not X, this segment would be driven to perform the search

    operation even though the remainder cells are all X. Because

    decreasing GL1 will worsen the search performance, to improve the power efficiency of the L1 DCG, we propose an L2 gating

    scheme to further exploit the vertically continuous X property

    within an L1 segment.

    Fig. 5 shows an L2 gating example, in which the L2 granularity

    (GL2) is 4. For example, if GL1 and GL2 are 16 and 4,

    respectively, then each L1 segment is divided into four L2 segments, each containing four TCAM cells. Similar to GNL1,

    the function of the L2 gating node (GNL2) is to either gate or transmit the search data from the L1 segment, which depends

    on whether the first cell is X or not. This requirement can

    easily be realized by a TG, but it has a vital defect when the

    input search data is gated, in which the Q and Q̄ nodes are floating, such that the N2 transistor (shown in Fig. 1) is likely

    to be turned on to cause an unexpected pull-down path between

    the ML and the ground, i.e., false mismatch.

    Instead of the TG, GNL2 is implemented as in Fig. 5, which

    is controlled by the mask value of the first cell (M0) in the corresponding L2 segment: 1) If the first cell is not X, then

    M0 = 0 will turn on the P1 transistor to activate GNL2 as an inverter. Assume that the search data from the L1 segment are S.


    GNL2 will drive the Q node to be S̄ to perform the search operation. 2) In the other case, where the first cell is X,

    M0 = 1 will cut off the VDD path and turn on the N2 transistor to force the Q node to be 0 regardless of the input search data.

    Thus, all side effects incurred by the floating Q and Q̄ nodes can completely be eliminated.
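The difference between a transmission gate and the L2 gating node can be summarized at the logic level. The sketch below is a behavioral abstraction only; `None` is used as an assumed stand-in for a floating node, and the function names are hypothetical.

```python
def tg_output(search_in, m0):
    """Transmission gate: passes the input when M0 = 0, floats when gated."""
    return search_in if m0 == 0 else None   # None models a floating node


def gnl2_output(search_in, m0):
    """L2 gating node: inverter when M0 = 0, output forced to 0 when M0 = 1."""
    if m0 == 0:
        return 1 - search_in
    return 0


if __name__ == "__main__":
    for m0 in (0, 1):
        for s in (0, 1):
            print("M0 =", m0, "S =", s,
                  "TG:", tg_output(s, m0), "GNL2:", gnl2_output(s, m0))
```

The gated GNL2 output is always a defined 0, which is the property that removes the floating-node risk described for the TG.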

    D. Compared to Related Work

    Similar to the related work [1], [6], our design also exploits

    the vertically continuous X feature to reduce the TCAM SL

    power, but the strategy and implementation are totally different.

    The major overheads in [6] come from the low-swing receivers

    that are responsible for translating a low-swing GSL signal to

    a full-swing LSL signal. In addition to the complex circuitry,

    the overheads also include the leakage power penalty and the

    increased propagation delay. Because these overheads increase

    with the number of low-swing receivers, the large block size is

    preferable, such that the effect of power saving is reduced. In contrast, the gating nodes used in the two-level DCG are very

    simple, so the above drawbacks are absent in our design.

    Compared to the segmented SL design [1], instead of using

    an extra SRAM cell, our design uses the stored X bit to

    directly control the broadcast of search data. Note that the path-

    control switch used in SC [1] is implemented with the TG that

    has no power source. Thus, the propagation delay of search

    data is increased. In contrast, because the gating node used

    in the two-level DCG is implemented with the controllable

    inverter that has a power source, our design does not incur

    any performance loss. On the contrary, these gating nodes

    will realize an inverter chain that can facilitate the search data

    transmission, i.e., the search performance can be improved.

    IV. EXPERIMENTAL RESULTS

    In this brief, we use TSMC 0.18-μm technology with 1.8-V

    supply voltage to implement two IPv4 routing tables. Both of

    them are with a size of 128 × 32, i.e., 128 entries by 32 bits.

    One is implemented with the proposed SE and two-level DCG

    schemes, and the other is a conventional NOR-type TCAM used

    for comparison.

    A. Search Performance

    The metric used to evaluate the TCAM search performance

    is the match delay (MD). Because our design involves the

    modifications of both the SL architecture and the pull-down

    path of ML, the MD is defined as the elapsed time from the

    search data applied to the SL to the ML discharged to 0 in case

    of a mismatch. Throughout this brief, the MD of our design is

    always the worst-case delay measured from the mismatch of a

    TCAM word located in the last L1 segment.

    Fig. 6 shows the MD for both the conventional and our

    TCAM designs, where X-Y means that the L1 and L2 granularities

    are X and Y, respectively. In our simulation, GL1 is constrained within the range from 8 to 32, and a total of nine

    reasonable two-level DCG configurations are evaluated. They are 8-2, 8-4, 16-2, 16-4, 16-8, 32-2, 32-4, 32-8, and 32-16. Due

    Fig. 6. Match delay for both the conventional and two-level DCG TCAM designs, where X-Y means that GL1 = X and GL2 = Y.

    Fig. 7. Column power dissipated in the switch pattern for both the conventional and two-level DCG TCAM designs.

    to having no segmentation, the MD of the conventional TCAM (MDConv) is fixed at 1.608 ns.

    In Fig. 6, there are two cases in which MDDCG is always

    worse than MDConv. The first case is that GL1 is less than or equal to 8; the second case is that GL2 is less than or equal to 2. In contrast, MDDCG is even shorter than MDConv in

    the cases of 16-4, 16-8, 32-4, 32-8, and 32-16. This is because

    our design decouples most TCAM cells from the SL, such that

    the SL capacitance can largely be reduced. According to the

    RC delay model, the SL propagation delay will decrease with

    the SL capacitance. In addition, the inverter chain introduced

    by the L1 DCG scheme can also reduce the SL propagation

    delay. Therefore, the performance gains from the two-level

    DCG scheme will cover the performance penalty incurred by

    the SE technique. Based on this result, only five configurations,

    i.e., 16-4, 16-8, 32-4, 32-8, and 32-16, are evaluated in the

    following discussion of energy reduction.

    B. Average SL Energy Reduction

    Because the power consumption of a quiet pattern is hardly

    noticeable in our design, Fig. 7 only shows the column power

    dissipated in a switch pattern, in which the continuous X

    number is varied from 0 to 128, e.g., X = 32 means that the continuous X number is 32. For a clear demonstration, only

    the 16-8, 32-4, and 32-8 configurations are displayed. Because the two-level DCG breaks the entire SL into several segments,


    TABLE I. COLUMN POWER CONSUMPTION (IN WATTS) FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS. CONFIGURATION 16-8 IS THE BEST TO REDUCE THE SWITCH POWER CONSUMPTION

    its column power shows a step wave. The power data are

    summarized in Table I, where the worst and best values are for

    the X = 0 and X = 128 cases. In particular, the average value is obtained by averaging the results of all 129 cases, i.e., X =

    0 to 128, assuming that every case has the same occurrence probability.

    In Table I, the key observations are as follows: 1) There are

    two features about the conventional TCAM. First, due to the

    need for presetting both S and S̄ to 0, the power consumption of the quiet pattern is almost equal to that of the switch pattern.

    Second, its column power is independent of the continuous X

    number. As shown in Table I, they are always 3.753E-05 and

    3.776E-05 W for quiet and switch patterns, respectively. 2) Due

    to having no SL switch, in the quiet pattern, our design consumes

    almost no power compared to the conventional TCAM, and

    the difference among the three cases is hardly noticeable. 3) In

    the worst case of the switch pattern, because no X cell can

    help our design reduce the SL power, the additional L1 and L2 gating nodes will result in an absolute power penalty.

    Consequently, for all configurations, the worst switch power

    must be larger than that of the conventional TCAM. 4) Clearly,

    for the switch pattern the best configuration is 16-8, in which

    our design incurs the least power penalty, i.e., 13%, in the

    worst case, while achieving the largest power reduction, i.e.,

    35%, in the average case.
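The averaging over the 129 equally likely cases (X = 0 to 128) can be mirrored with a cell-count proxy for the two-level gating. This is a sketch under strong assumptions: it counts only the cells that still receive the search data, treats every L1 and L2 segment as gated by the mask of its first cell, and ignores the load, leakage, and delay of the gating nodes themselves. It therefore does not reproduce the power values of Table I and cannot rank configurations that share the same GL2. All names are illustrative.

```python
def two_level_driven_cells(n_cells, x_count, g1, g2):
    """Cells that still see the search data with L1 and L2 gating."""
    first_x = n_cells - x_count              # row index of the first X cell
    driven = 0
    for l1_start in range(0, n_cells, g1):
        if l1_start >= first_x:              # first cell is X: GNL1 gates
            break
        for l2_start in range(l1_start, min(l1_start + g1, n_cells), g2):
            if l2_start >= first_x:          # first cell is X: GNL2 gates
                break
            driven += min(g2, n_cells - l2_start)
    return driven


def average_driven_cells(n_cells, g1, g2):
    """Average over x_count = 0..n_cells, each case equally likely."""
    cases = [two_level_driven_cells(n_cells, x, g1, g2)
             for x in range(n_cells + 1)]
    return sum(cases) / len(cases)


if __name__ == "__main__":
    print("conventional:", 128)
    for g1, g2 in [(16, 4), (16, 8), (32, 4), (32, 8), (32, 16)]:
        print(f"{g1}-{g2}:", round(average_driven_cells(128, g1, g2), 2))
```

Plotting `two_level_driven_cells` against the continuous X number gives a step-shaped curve, which is consistent in form with the step wave described for the column power.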

    For a fair comparison, the evaluation metric is the energy,

    which is the product of the MD and the search power. Thus,

    the column energy consumption is summarized in Table II, in

    which only the results of the average case are presented. In

    Table II, the best configuration is still 16-8, which can achieve

    70% average SL energy reduction compared to the conventional

    NOR-type TCAM design.
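As a worked instance of the energy metric (match delay times search power), the conventional TCAM figures quoted in the text can be multiplied out. Only the conventional values stated above are used; the entries of Table II are not reproduced here, and the function name is an assumption.

```python
MD_CONV_NS = 1.608          # match delay of the conventional TCAM, from the text
P_QUIET_W = 3.753e-05       # conventional column power, quiet pattern
P_SWITCH_W = 3.776e-05      # conventional column power, switch pattern


def column_energy_joules(match_delay_ns, power_w):
    """Energy per search: match delay (converted to seconds) times power."""
    return match_delay_ns * 1e-9 * power_w


if __name__ == "__main__":
    for name, p in (("quiet", P_QUIET_W), ("switch", P_SWITCH_W)):
        e = column_energy_joules(MD_CONV_NS, p)
        print(f"conventional {name} pattern: {e * 1e15:.1f} fJ")
```

Because the metric is a product, a design that shortens the match delay gains on energy even before its power reduction is counted.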

    C. Area Overhead and Leakage Power Penalty

    Our design can effectively reduce the SL energy consumption,

    but the additional SE control and L1 and L2 gating nodes

    will increase the transistor count. Because both the area and the

    leakage power consumption are proportional to the transistor

    count, our design will result in a 5.8% area overhead and 1.6%

    leakage power penalty. In addition, the two-level DCG design

    also increases the mask node capacitance of the first cell of

    L1 and L2 segments, and the read/write performance loss is about 0.2%.

    TABLE II. COLUMN ENERGY CONSUMPTION FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS

    V. CONCLUSION

    In this brief, we have proposed a low-power TCAM design that not only reduces the SL power consumption but also

    improves the search performance. We have first used the SE technique to eliminate all unnecessary SL switches in the quiet pattern and then used the proposed two-level DCG scheme to

    reduce the SL switch power in the switch pattern. By using

    the vertically continuous don't-care feature, our design can achieve at least 63% reduction in average SL energy consumption

    with a 5.8% area overhead.

    REFERENCES

    [1] J. S. Wang, C. C. Wang, and C. Yeh, "TCAM for IP-address lookup using tree-style AND-type match lines and segmented search lines," in Int. Solid-State Circuits Conf., 2006, pp. 577-586.

    [2] N. Mohan and M. Sachdev, "Low-capacitance and charge-shared match-lines for low-energy high-performance TCAMs," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 2054-2060, Sep. 2007.

    [3] N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Match line sense amplifiers with positive feedback for low-power content addressable memories," in IEEE Custom Integr. Circuits Conf., 2006, pp. 297-300.

    [4] K. Pagiamtzis and A. Sheikholeslami, "A low power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512-1519, Sep. 2004.

    [5] H. Noda, K. Inoue, M. Kuroiwa, F. Igaue, K. Yamamoto, H. J. Mattausch, T. Koide, A. Amo, A. Hachisuka, S. Soeda, I. Hayashi, F. Morishita, K. Dosaka, K. Arimoto, K. Fujishima, K. Anami, and T. Yoshihara, "A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245-253, Jan. 2005.

    [6] P. T. Huang, S. W. Chang, W. Y. Liu, and W. Hwang, "A 256 × 128 energy-efficient TCAM with novel low power schemes," in Proc. Int. Symp. VLSI-DAT, 2007, pp. 1-4.

    [7] I. Arsovski, T. Chandler, and A. Sheikholeslami, "A ternary content addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme," IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155-158, Jan. 2003.

    [8] I. Arsovski and R. Nadkarni, "Low-noise embedded CAM with reduced slew-rate match-lines and asynchronous search-lines," in IEEE Custom Integr. Circuits Conf., 2005, pp. 447-450.

    [9] Y.-J. Chang, "Two-layer hierarchical matching method for energy-efficient CAM design," Electron. Lett., vol. 43, no. 2, pp. 80-82, Jan. 2007.

    [10] Y.-J. Chang, Y.-H. Liao, and S.-J. Ruan, "Improve CAM power efficiency using decoupled match line scheme," in IEEE/ACM DATE, Apr. 16-20, 2007, pp. 1-6.

    [11] Y.-J. Chang and Y.-H. Liao, "Hybrid-type CAM design for both power and performance efficiency," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 965-974, Aug. 2008.

    [12] D. Shah and P. Gupta, "Fast updating algorithms for TCAMs," IEEE Micro, vol. 21, no. 1, pp. 36-47, Jan./Feb. 2001.