
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS, VOL. 56, NO. 6, JUNE 2009, p. 479

A High-Performance and Energy-Efficient TCAM Design for IP-Address Lookup

Yen-Jen Chang, Member, IEEE

Abstract: In this brief, we propose a two-level don't-care gating (DCG) scheme that aims to reduce the ternary content-addressable memory (TCAM) power dissipated in the search-line switching activity. By exploiting the vertically continuous don't-care feature, the two-level DCG scheme can largely reduce the average search-line power consumption during a switch pattern. In addition, we use the search enable technique to eliminate the unnecessary search-line switching activity in the quiet pattern. By reducing both the search-line switching activity and the average switching power, the proposed design can minimize the TCAM search-line power consumption. For a 128 × 32 TCAM, the best configuration we examined, with first-level and second-level gating granularities of 16 and 8, respectively, achieves a 70% search-line energy reduction together with a 9% search performance improvement.

Index Terms: Forwarding table, low-power design, router, search line (SL), ternary content-addressable memory (TCAM).

I. INTRODUCTION

Because ternary content-addressable memory (TCAM) has an additional don't-care (or "X") state to perform the wild match, it is widely used in the forwarding table of the network router. However, the power consumption of TCAM is usually considerable due to the parallel comparison feature, in which a large number of transistors and wires are active on each lookup. As revealed in [1], there are three major power consumers in TCAM: the clock and control, the match lines (MLs), and the search lines (SLs). The power dissipated by the first two components has effectively been reduced [2], [3], but the SLs still contribute 54%–82% of the total power consumption [1]. The SL power consumption can be reduced by using the segmented [1] or hierarchical SL scheme [4]–[6] and by minimizing the switching activity [7], [8]. However, these approaches suffer from either a performance penalty or complex control circuitry. In contrast, this brief presents a low-power TCAM design that consists of the two-level don't-care gating (DCG) scheme and the search enable (SE) technique. Without a performance penalty or complex control circuitry, our design can largely reduce the TCAM power dissipated in the SL switching activity.

The most pronounced features of the proposed low-power TCAM design are summarized as follows: 1) The SE technique can completely eliminate the unnecessary SL switching activity in the quiet pattern, but it will result in a performance penalty. 2) Based on the vertically continuous "X" feature, the two-level DCG scheme uses additional gating nodes to conditionally prevent the search data from being broadcast over the entire SL. By decreasing the SL effective capacitance, the two-level DCG scheme can largely reduce the average power dissipated in the SL switches. 3) Instead of simple transmission gates (TGs), the two-level DCG scheme uses inverter chain circuitry to accelerate the transmission of search data. Because this speedup compensates for the performance loss incurred by the SE technique, our design can achieve comparable or even better performance than the conventional TCAM design.

The proposed TCAM design with a size of 128 × 32 was implemented with the TSMC 0.18-µm technology, and all the results were measured from HSPICE simulation. By examining all possible configurations, the results show that our design achieves the best energy efficiency when the first-level (L1) and second-level (L2) gating granularities are 16 and 8, respectively. Compared to the conventional NOR-type TCAM, our design not only reduces the average SL energy consumption by 63%–70% but also improves the search performance by 9%.

Manuscript received November 4, 2008; revised January 23, 2009. Current version published June 17, 2009. This paper was recommended by Associate Editor T. Zhang.

The author is with the Department of Computer Science and Engineering, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSII.2009.2020935

The rest of this brief is organized as follows. Section II reviews the conventional NOR-type TCAM design and the previous work on TCAM power reduction. Section III describes the circuitry developed for the two-level DCG scheme in detail. In addition to a discussion of the important issues, a comparison between our design and the related work is also provided. The measurement results and analysis are given in Section IV, and Section V offers some brief conclusions.

II. TCAM

As shown in Fig. 1, a typical TCAM cell consists of three major components: 1) an eight-transistor XOR-type content-addressable memory (CAM) cell that not only stores the prefix data but also compares the prefix data with the search data; 2) a six-transistor static random access memory (SRAM) cell that stores the mask bit to indicate whether this TCAM cell is "X" or not; and 3) a control logic that can conditionally pull down the ML. As shown in Fig. 1, this control logic can simply be implemented with two n-type metal-oxide-semiconductor transistors placed in series, which are controlled by the mask bit and the XOR result of the CAM cell, respectively.

A. Search Operation

Fig. 1. Typical TCAM cell and the corresponding state table.

The search operation is initiated by discharging all SLs and then precharging all MLs to VDD. In Fig. 1, both S and S̄ are discharged to ensure that no short-circuit path exists during the ML precharge. After precharging the ML, the search data and its complement are applied to S and S̄ to perform the search operation. As shown in Fig. 1, M = 1 means that this TCAM cell is in the "X" state, where it is always a match regardless of the comparison result of the CAM cell. Such a comparison is referred to as a wild match. In contrast, if the mask bit is 0, this TCAM cell is in either the "0" or the "1" state. In this case, if the search data are equal to the stored data, it is also a match, which we refer to as a normal match to distinguish it from the wild match. In Fig. 1, the ML is discharged to 0 only in case of a mismatch, in which M = 0 and the XOR result of the CAM cell is 1. Note that the search operations account for almost all the TCAM operations, and the SLs are highly capacitive. Thus, the TCAM power dissipated in the high-frequency SL switching is substantial [1].
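The match rules of the state table can be mimicked in a few lines of software. The following is a behavioral sketch only, not a model of the circuit: the function names `cell_matches` and `word_matches` are hypothetical, a mask bit of 1 encodes the "X" state as in the text, and the ML is modeled as staying high (True) unless at least one cell reports a mismatch.

```python
def cell_matches(stored_bit: int, mask_bit: int, search_bit: int) -> bool:
    """Return True unless this cell would pull the ML down."""
    if mask_bit == 1:
        # "X" state: wild match, the comparison result is ignored.
        return True
    # Mask bit 0: normal match only when stored and search bits agree (XOR = 0).
    return stored_bit == search_bit


def word_matches(stored, mask, search) -> bool:
    """The ML stays precharged only if every cell in the word matches."""
    return all(cell_matches(d, m, s) for d, m, s in zip(stored, mask, search))


# Hypothetical 4-bit word storing the prefix 10XX.
stored = [1, 0, 0, 0]
mask = [0, 0, 1, 1]
print(word_matches(stored, mask, [1, 0, 1, 1]))  # True: the last two bits are wild
print(word_matches(stored, mask, [1, 1, 0, 0]))  # False: the second bit mismatches
```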

B. Related Work

Including our previous work [9]–[11], a large number of designs have been proposed to reduce TCAM power consumption, particularly ML power consumption [2], [3]. Because this brief aims at reducing the TCAM SL power, we only focus on the work related to SL power reduction. In [4], Pagiamtzis and Sheikholeslami introduced a hierarchical SL design in which the SLs are broken into global SLs (GSLs) and local SLs (LSLs). Instead of directly connecting to the CAM cells, the GSLs with low-swing signals are fed into the LSLs, and the local receivers then conditionally amplify the signals to drive a subset of CAM cells. Because only a few LSLs with reduced capacitance are activated, the CAM power dissipated in the SLs can effectively be reduced.

Based on [4], an alternative don't-care-based hierarchical (DCBH) SL design [6] was proposed. The DCBH scheme uses the "X" bit stored in the bottom word of each block to control whether the search data on the GSL should be broadcast to the corresponding LSL. The GSLs are active every cycle, but the LSLs are active only when the bottom word is not "X."

The only difference between [4] and [6] is the LSL control strategy. In [1], the segmented SL design was introduced, which uses the segmentation cell (SC) to segment the SL. The SC consists of a dummy cell and a path-control switch. In particular, the dummy cell is an extra SRAM cell to probe the continuous "X" information. When the TCAM cell above the SC stores the "X" bit, the SC dummy cell will keep 0 to turn off the path-control switch to block the search data from being further broadcast. Because the effective capacitance of the SL is decreased, the SL power consumption can be reduced.

Fig. 2. (a) Search data example of five consecutive 0s. Case A/B is the TCAM design without/with the SE scheme. (b) Traditional SE scheme.

III. TWO-LEVEL DCG SCHEME

A. SE Technique

If we only consider two consecutive search data on a single bit, there are four search patterns, i.e., 0→0, 1→1, 0→1, and 1→0. Because they involve no data transition, the 0→0 and 1→1 patterns are classified as quiet patterns. In contrast, 0→1 and 1→0 are classified as switch patterns. In the conventional TCAM design, because the ML has to be charged to high during the precharge phase, both S and S̄ must be discharged to 0 to avoid a possible short circuit. However, such discharge introduces unnecessary SL switching activity. For example, Fig. 2(a) shows a search data pattern of five consecutive 0s. In case A, i.e., the conventional TCAM design, the gray block contains the values of S and S̄ during the precharge phase. Clearly, the number of energy-consuming transitions (N_0→1) on the SL is four, but they are all unnecessary switching activities. As illustrated in Fig. 2(b), a straightforward solution to this weakness is the introduction of an additional transistor, i.e., N3, that is used to disconnect the pull-down path during the ML precharge and then enable the search operation. It is referred to as the SE technique, whose effect can be observed in case B shown in Fig. 2(a), where the unnecessary SL switches are all eliminated. Compared to the traditional TCAM without SE, whose pull-down path is only N1 and N2, the use of SE lengthens the pull-down path by one transistor. Based on our simulation, the optimal N3 width is three times the N1 (or N2) width, where the performance penalty is about 4.2%. Fortunately, this performance penalty can be compensated by the two-level DCG scheme proposed in the following section.
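The transition count of the Fig. 2(a) example can be reproduced with a small behavioral model. This is a sketch under stated assumptions, not the circuit itself: the helper name `count_rising_transitions` is hypothetical, the S/S̄ pair is assumed to already hold the first search value, and only 0→1 transitions are counted as energy consuming. Without SE, both lines are reset to 0 before every search; with SE, they simply keep their previous values.

```python
def count_rising_transitions(bits, search_enable):
    """Count 0->1 transitions on the (S, S-bar) pair of one column."""
    rising = 0
    prev = (bits[0], 1 - bits[0])  # lines already hold the first search value
    for b in bits[1:]:
        if not search_enable:
            prev = (0, 0)  # conventional design: discharge both lines during ML precharge
        cur = (b, 1 - b)   # search data and its complement
        rising += sum(1 for p, c in zip(prev, cur) if p == 0 and c == 1)
        prev = cur
    return rising


five_zeros = [0, 0, 0, 0, 0]
print(count_rising_transitions(five_zeros, search_enable=False))  # 4 (case A)
print(count_rising_transitions(five_zeros, search_enable=True))   # 0 (case B)
```

In this model a switch pattern such as 0→1 still costs one rising transition with or without SE, which is why the remaining switch-pattern power has to be addressed separately.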

B. L1 DCG

Fig. 3. General configuration of a prefix array.

Fig. 4. L1 gating node (GN_L1) implementation.

Fig. 3 shows the general configuration of a TCAM array with N prefixes. The function of the routing table lookup is to find the longest one among all the prefixes that match the incoming packet's destination address. This is commonly known as the longest prefix match (LPM) problem [12]. To simply resolve the LPM problem, the stored prefixes are sorted by decreasing length. Clearly, if the ith cell is "X," then all the cells below the ith cell must be "X." This is referred to as the vertically continuous don't-care feature. For example, consider the gray column shown in Fig. 3. Because the sixth cell is "X," the sixth through Nth cells are all "X." This scenario implies that the comparisons between the search data and the prefix data stored in the sixth through Nth cells are redundant. If we can gate the search data at the sixth cell, only the capacitances before the sixth cell need to be charged or discharged. The capacitance that actually needs to be charged is referred to as the effective capacitance. In this example, due to the decrease of effective capacitance, the SL power consumption can be reduced.
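The vertically continuous don't-care feature follows directly from sorting the prefixes by decreasing length, which the sketch below illustrates. The helper names and the prefix lengths are hypothetical, and bit positions are counted from 0 at the most significant bit, so a prefix of length L has "X" cells at positions L and above.

```python
def mask_column(prefix_lengths, bit_index):
    """Mask bits (1 = "X") of one TCAM column, listed from top to bottom."""
    return [1 if bit_index >= length else 0 for length in prefix_lengths]


def is_vertically_continuous(column):
    """True if no non-X cell appears below the first X cell."""
    first_x = column.index(1) if 1 in column else len(column)
    return all(m == 1 for m in column[first_x:])


lengths = [24, 8, 16, 32, 16, 20]            # hypothetical prefix lengths
sorted_lengths = sorted(lengths, reverse=True)

print(mask_column(sorted_lengths, 18))       # [0, 0, 0, 1, 1, 1]
print(is_vertically_continuous(mask_column(sorted_lengths, 18)))  # True
print(is_vertically_continuous(mask_column(lengths, 18)))         # False
```

In the sorted table, only the three cells above the first "X" in this column take part in a meaningful comparison, which is the effective-capacitance argument made above.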

To gate the search data from being broadcast over the entire SL, our design inserts gating nodes to break the entire SL into several segments. As shown in Fig. 4, the L1 gating node (GN_L1) is implemented as an inverter that is controlled by the corresponding mask bit. For example, suppose that a GN_L1 is located in the ith cell. When the ith cell is "X," M_i = 1 will cut off both the power and ground sources to disable the inverter from transmitting the search data. Otherwise, the inverter will take effect in the case of M_i = 0. Note that the search data are inverted after the gating node.

Fig. 5. L2 DCG example, where the granularity (G_L2) is 4, and GN_L2 is the L2 gating node.

The granularity of L1 DCG (G_L1) is defined as the number of cells between two L1 gating nodes. It is critical to both search performance and power saving. If G_L1 is too large, the probability that the first cell of a segment is "X" is low, such that the amount of effective capacitance reduction is insignificant. In contrast, a small G_L1 can largely reduce the effective capacitance but suffers from a serious propagation delay. Therefore, an adequate G_L1 that benefits both power saving and search performance is critical to our design. The detailed analysis will be provided in the experimental results.
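The effect of G_L1 on the number of driven cells can be sketched as follows. This is an idealized cell-count model, not a capacitance or delay simulation: the helper `driven_cells_l1` is hypothetical, a gating node is assumed at the first cell of every segment, and the column is assumed to be vertically continuous so that a disabled node implies that all lower segments are also gated.

```python
def driven_cells_l1(column_mask, g_l1):
    """Count the cells whose SL segment is still driven under L1 gating."""
    driven = 0
    for start in range(0, len(column_mask), g_l1):
        if column_mask[start] == 1:
            # First cell of the segment is "X": the inverter is disabled,
            # and every segment below it is "X" as well.
            break
        driven += len(column_mask[start:start + g_l1])
    return driven


# Hypothetical 128-row column: 40 non-X cells on top, 88 "X" cells below.
column = [0] * 40 + [1] * 88
for g in (8, 16, 32):
    print(g, driven_cells_l1(column, g))
# 8 40
# 16 48
# 32 64
```

A smaller granularity tracks the "X" boundary more closely, but it also places more inverters in series on the SL, which is the propagation-delay cost mentioned above.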

C. L2 DCG

The L1 DCG reduces the SL power only when the first TCAM cell in a segment is "X." If the first cell is not "X," this segment would be driven to perform the search operation even though the remaining cells are all "X." Because decreasing G_L1 will worsen the search performance, to improve the power efficiency of the L1 DCG, we propose an L2 gating scheme to further exploit the vertically continuous "X" property within an L1 segment.

Fig. 5 shows an L2 gating example, in which the L2 granularity (G_L2) is 4. For example, if G_L1 and G_L2 are 16 and 4, respectively, then each L1 segment is divided into four L2 segments, each containing four TCAM cells. Similar to GN_L1, the function of the L2 gating node (GN_L2) is to either gate or transmit the search data from the L1 segment, depending on whether the first cell is "X" or not. This requirement can easily be realized by a TG, but a TG has a vital defect: when the input search data is gated, the Q and Q̄ nodes are floating, such that the N2 transistor (shown in Fig. 1) is likely to be turned on and cause an unexpected pull-down path between the ML and the ground, i.e., a false mismatch.

Instead of the TG, GN_L2 is implemented as in Fig. 5 and is controlled by the mask value of the first cell (M_0) in the corresponding L2 segment: 1) If the first cell is not "X," then M_0 = 0 will turn on the P1 transistor to activate GN_L2 as an inverter. Assume that the search data from the L1 segment are S. GN_L2 will drive the Q node to be S̄ to perform the search operation. 2) In the other case, where the first cell is "X," M_0 = 1 will cut off the VDD path and turn on the N2 transistor to force the Q node to be 0 regardless of the input search data. Thus, all side effects incurred by the floating Q and Q̄ nodes can completely be eliminated.
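The two cases of the L2 gating node, and the combined effect of both gating levels, can be summarized in a short behavioral sketch. The function names are hypothetical; the activated node is treated as a plain inverter, as the description above suggests, and the cell count again assumes a vertically continuous column and a G_L2 that divides G_L1.

```python
def gn_l2_output(s_in: int, m0: int) -> int:
    """Value driven onto node Q by the L2 gating node."""
    if m0 == 0:
        return 1 - s_in  # P1 on: the node acts as an inverter
    return 0             # VDD path cut off, N2 on: Q is forced to 0, never floating


def driven_cells_two_level(column_mask, g_l1, g_l2):
    """Count the cells that still see switching search data with both gating levels."""
    driven = 0
    for l1_start in range(0, len(column_mask), g_l1):
        if column_mask[l1_start] == 1:
            break  # L1 node disabled; all cells below are "X" too
        l1_segment = column_mask[l1_start:l1_start + g_l1]
        for l2_start in range(0, len(l1_segment), g_l2):
            if l1_segment[l2_start] == 0:  # L2 node enabled for this sub-segment
                driven += len(l1_segment[l2_start:l2_start + g_l2])
    return driven


# Hypothetical 128-row column: 35 non-X cells on top, 93 "X" cells below.
column = [0] * 35 + [1] * 93
print(driven_cells_two_level(column, 16, 16))  # 48 (equivalent to L1 gating only)
print(driven_cells_two_level(column, 16, 8))   # 40
print(driven_cells_two_level(column, 16, 4))   # 36
```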

D. Compared to Related Work

Similar to the related work [1], [6], our design also exploits the vertically continuous "X" feature to reduce the TCAM SL power, but the strategy and implementation are totally different. The major overheads in [6] come from the low-swing receivers that are responsible for translating a low-swing GSL signal to a full-swing LSL signal. In addition to the complex circuitry, the overheads also include the leakage power penalty and the increased propagation delay. Because these overheads increase with the number of low-swing receivers, a large block size is preferable, which in turn reduces the power saving. In contrast, the gating nodes used in the two-level DCG are very simple, so the above drawbacks are absent in our design.

Compared to the segmented SL design [1], instead of using an extra SRAM cell, our design uses the stored "X" bit to directly control the broadcast of search data. Note that the path-control switch used in the SC [1] is implemented with a TG that has no power source. Thus, the propagation delay of search data is increased. In contrast, because the gating node used in the two-level DCG is implemented with a controllable inverter that has a power source, our design does not incur any performance loss. On the contrary, these gating nodes form an inverter chain that can facilitate the search data transmission, i.e., the search performance can be improved.

IV. EXPERIMENTAL RESULTS

In this brief, we use TSMC 0.18-µm technology with a 1.8-V supply voltage to implement two IPv4 routing tables. Both have a size of 128 × 32, i.e., 128 entries by 32 bits. One is implemented with the proposed SE and two-level DCG schemes, and the other is a conventional NOR-type TCAM used for comparison.

A. Search Performance

The metric used to evaluate the TCAM search performance is the match delay (MD). Because our design involves modifications of both the SL architecture and the pull-down path of the ML, the MD is defined as the elapsed time from the search data being applied to the SL to the ML being discharged to 0 in case of a mismatch. Throughout this brief, the MD of our design is always the worst-case delay measured from the mismatch of a TCAM word located in the last L1 segment.

Fig. 6 shows the MD for both the conventional and our TCAM designs, where X-Y means that the L1 and L2 granularities are X and Y, respectively. In our simulation, G_L1 is constrained within the range from 8 to 32, and a total of nine reasonable two-level DCG configurations are evaluated: 8-2, 8-4, 16-2, 16-4, 16-8, 32-2, 32-4, 32-8, and 32-16. Because it has no segmentation, the MD of the conventional TCAM (MD_Conv) is fixed at 1.608 ns.

Fig. 6. Match delay for both the conventional and two-level DCG TCAM designs, where X-Y means that G_L1 = X and G_L2 = Y.

Fig. 7. Column power dissipated in the switch pattern for both the conventional and two-level DCG TCAM designs.

In Fig. 6, there are two cases in which the MD of our design (MD_DCG) is always worse than MD_Conv. The first case is that G_L1 is less than or equal to 8; the second case is that G_L2 is less than or equal to 2. In the remaining configurations, i.e., 16-4, 16-8, 32-4, 32-8, and 32-16, MD_DCG is even shorter than MD_Conv. This is because our design decouples most TCAM cells from the SL, such that the SL capacitance can largely be reduced. According to the RC delay model, the SL propagation delay decreases with the SL capacitance. In addition, the inverter chain introduced by the L1 DCG scheme can also reduce the SL propagation delay. Therefore, the performance gains from the two-level DCG scheme cover the performance penalty incurred by the SE technique. Based on this result, only these five configurations are evaluated in the following discussion of energy reduction.
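The configuration space and the selection made above can be written down compactly. The generation rule used here (power-of-two granularities with G_L2 smaller than G_L1) is an assumption that happens to reproduce the nine listed configurations; the filter encodes the two slow cases reported for Fig. 6.

```python
from itertools import product

# Assumed rule: G_L1 in {8, 16, 32}, G_L2 a smaller power of two, at least 2.
configs = [(g1, g2) for g1, g2 in product((8, 16, 32), (2, 4, 8, 16)) if g2 < g1]

# Drop the two cases reported as slower than the conventional design:
# G_L1 <= 8 or G_L2 <= 2.
faster = [(g1, g2) for g1, g2 in configs if g1 > 8 and g2 > 2]

print(["%d-%d" % c for c in configs])
# ['8-2', '8-4', '16-2', '16-4', '16-8', '32-2', '32-4', '32-8', '32-16']
print(["%d-%d" % c for c in faster])
# ['16-4', '16-8', '32-4', '32-8', '32-16']
```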

B. Average SL Energy Reduction

Because the power consumption of a quiet pattern is hardly noticeable in our design, Fig. 7 only shows the column power dissipated in a switch pattern, in which the continuous "X" number is varied from 0 to 128, e.g., X = 32 means that the continuous "X" number is 32. For a clear demonstration, only the 16-8, 32-4, and 32-8 configurations are displayed. Because the two-level DCG breaks the entire SL into several segments, its column power shows a step wave. The power data are summarized in Table I, where the worst and best values are for the X = 0 and X = 128 cases, respectively. The average value is obtained by averaging the results of all 129 cases, i.e., X = 0 to 128, assuming that every case has the same occurrence probability.

TABLE I. COLUMN POWER CONSUMPTION (IN WATTS) FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS. CONFIGURATION 16-8 IS THE BEST TO REDUCE THE SWITCH POWER CONSUMPTION

In Table I, the key observations are as follows: 1) There are two features of the conventional TCAM. First, due to the need for presetting both S and S̄ to 0, the power consumption of the quiet pattern is almost equal to that of the switch pattern. Second, its column power is independent of the continuous "X" number. As shown in Table I, they are always 3.753E-05 and 3.776E-05 W for the quiet and switch patterns, respectively. 2) Because it has no SL switching in the quiet pattern, our design consumes almost no power compared to the conventional TCAM, and the difference among the three cases is hardly noticeable. 3) In the worst case of the switch pattern, because there is no "X" cell to help our design reduce the SL power, the additional L1 and L2 gating nodes result in a net power penalty. Consequently, for all configurations, the worst-case switch power must be larger than that of the conventional TCAM. 4) Clearly, for the switch pattern, the best configuration is 16-8, in which our design incurs the least power penalty, i.e., 13%, in the worst case, while achieving the largest power reduction, i.e., 35%, in the average case.
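The step-wave shape and the averaging over the 129 equally likely cases can be illustrated with an idealized model. This is not the HSPICE data of Table I: the sketch assumes that column power is simply proportional to the number of driven cells, ignores the power of the gating nodes themselves (which is why it cannot show the worst-case penalty), and assumes that G_L2 divides both G_L1 and the number of rows. The function names are hypothetical.

```python
def driven_fraction(x_count, n_rows=128, g_l2=8):
    """Fraction of a column still driven when the bottom x_count cells are "X"."""
    non_x = n_rows - x_count
    # Driving stops at the first L2 boundary at or below the last non-X cell.
    segments = -(-non_x // g_l2)  # integer ceiling division
    return min(n_rows, segments * g_l2) / n_rows


def average_driven_fraction(n_rows=128, g_l2=8):
    """Average over X = 0 .. n_rows, every case weighted equally."""
    cases = range(n_rows + 1)
    return sum(driven_fraction(x, n_rows, g_l2) for x in cases) / (n_rows + 1)


print(driven_fraction(32), driven_fraction(33))  # 0.75 0.75 (same step)
print(driven_fraction(40))                       # 0.6875 (next step down)
print(round(average_driven_fraction(g_l2=8), 4))
```

In this simplified model the result depends only on G_L2; the differences among configurations with the same G_L2 in the measured data come from effects the model leaves out, such as gating-node overhead.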

For a fair comparison, the evaluation metric is the energy, which is the product of the MD and the search power. The column energy consumption is summarized in Table II, in which only the results of the average case are presented. In Table II, the best configuration is still 16-8, which achieves a 70% average SL energy reduction compared to the conventional NOR-type TCAM design.
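The energy metric can be checked with the figures quoted in the text. The first part of the sketch below uses only stated values (MD_Conv and the conventional column powers). The second part is a consistency check under assumptions that the text does not state explicitly: quiet and switch patterns are weighted equally, the quiet-pattern power of the proposed design is taken as zero, its average switch power is taken as 65% of the conventional value (the 35% reduction), and the 9% performance improvement is read as a 9% shorter MD.

```python
MD_CONV_NS = 1.608           # conventional match delay, Section IV-A
P_CONV_QUIET_W = 3.753e-05   # conventional column power, quiet pattern
P_CONV_SWITCH_W = 3.776e-05  # conventional column power, switch pattern


def energy_joules(power_w, delay_ns):
    """Energy as the product of match delay and search power."""
    return power_w * delay_ns * 1e-9


# Conventional design, directly from the quoted figures.
e_conv_quiet = energy_joules(P_CONV_QUIET_W, MD_CONV_NS)
e_conv_switch = energy_joules(P_CONV_SWITCH_W, MD_CONV_NS)
e_conv_avg = (e_conv_quiet + e_conv_switch) / 2  # assumed equal weighting

# Proposed 16-8 design under the assumptions listed above.
md_dcg_ns = MD_CONV_NS * (1 - 0.09)
p_dcg_switch_w = P_CONV_SWITCH_W * (1 - 0.35)
e_dcg_avg = (0.0 + energy_joules(p_dcg_switch_w, md_dcg_ns)) / 2

reduction = 1 - e_dcg_avg / e_conv_avg
print(f"conventional switch energy: {e_conv_switch:.3e} J")
print(f"estimated energy reduction: {reduction:.0%}")
```

Under these assumptions the estimate comes out at roughly 70%, in line with the reported figure, but the exact weighting used for Table II is not given here.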

C. Area Overhead and Leakage Power Penalty

Our design can effectively reduce the SL energy consumption, but the additional SE control and the L1 and L2 gating nodes will increase the transistor count. Because both the area and the leakage power consumption are proportional to the transistor count, our design will result in a 5.8% area overhead and a 1.6% leakage power penalty. In addition, the two-level DCG design also increases the mask node capacitance of the first cell of the L1 and L2 segments, and the read/write performance loss is about 0.2%.

TABLE II. COLUMN ENERGY CONSUMPTION FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS

V. CONCLUSION

In this brief, we have proposed a low-power TCAM design that not only reduces the SL power consumption but also improves the search performance. We have first used the SE technique to eliminate all unnecessary SL switches in the quiet pattern and then used the proposed two-level DCG scheme to reduce the SL switch power in the switch pattern. By using the vertically continuous don't-care feature, our design can achieve at least a 63% reduction in average SL energy consumption with a 5.8% area overhead.

REFERENCES

[1] J. S. Wang, C. C. Wang, and C. Yeh, "TCAM for IP-address lookup using tree-style AND-type match lines and segmented search lines," in Int. Solid-State Circuits Conf., 2006, pp. 577–586.

[2] N. Mohan and M. Sachdev, "Low-capacitance and charge-shared match-lines for low-energy high-performance TCAMs," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 2054–2060, Sep. 2007.

[3] N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Match line sense amplifiers with positive feedback for low-power content addressable memories," in IEEE Custom Integr. Circuits Conf., 2006, pp. 297–300.

[4] K. Pagiamtzis and A. Sheikholeslami, "A low power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512–1519, Sep. 2004.

[5] H. Noda, K. Inoue, M. Kuroiwa, F. Igaue, K. Yamamoto, H. J. Mattausch, T. Koide, A. Amo, A. Hachisuka, S. Soeda, I. Hayashi, F. Morishita, K. Dosaka, K. Arimoto, K. Fujishima, K. Anami, and T. Yoshihara, "A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245–253, Jan. 2005.

[6] P. T. Huang, S. W. Chang, W. Y. Liu, and W. Hwang, "A 256x128 energy-efficient TCAM with novel low power schemes," in Proc. Int. Symp. VLSI-DAT, 2007, pp. 1–4.

[7] I. Arsovski, T. Chandler, and A. Sheikholeslami, "A ternary content addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme," IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155–158, Jan. 2003.

[8] I. Arsovski and R. Nadkarni, "Low-noise embedded CAM with reduced slew-rate match-lines and asynchronous search-lines," in IEEE Custom Integr. Circuits Conf., 2005, pp. 447–450.

[9] Y.-J. Chang, "Two-layer hierarchical matching method for energy-efficient CAM design," Electron. Lett., vol. 43, no. 2, pp. 80–82, Jan. 2007.

[10] Y.-J. Chang, Y.-H. Liao, and S.-J. Ruan, "Improve CAM power efficiency using decoupled match line scheme," in IEEE/ACM DATE, Apr. 16–20, 2007, pp. 1–6.

[11] Y.-J. Chang and Y.-H. Liao, "Hybrid-type CAM design for both power and performance efficiency," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 965–974, Aug. 2008.

[12] D. Shah and P. Gupta, "Fast updating algorithms for TCAMs," IEEE Micro, vol. 21, no. 1, pp. 36–47, Jan./Feb. 2001.
