IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS, VOL. 56, NO. 6, JUNE 2009, p. 479
A High-Performance and Energy-Efficient TCAM Design for IP-Address Lookup
Yen-Jen Chang, Member, IEEE
Abstract: In this brief, we propose a two-level don't-care gating (DCG) scheme that aims to reduce the ternary content-addressable memory (TCAM) power dissipated in search-line switching activity. By exploiting the vertically continuous don't-care feature, the two-level DCG scheme can largely reduce the average search-line power consumption during a switch pattern. In addition, we use the search enable technique to eliminate the unnecessary search-line switching activity in the quiet pattern. By reducing both the search-line switching activity and the average switching power, the proposed design can minimize the TCAM search-line power consumption. For a 128 × 32 TCAM, the best configuration we examined, with first-level and second-level gating granularities of 16 and 8, respectively, achieves a 70% search-line energy reduction together with a 9% search performance improvement.
Index Terms: Forwarding table, low-power design, router, search line (SL), ternary content-addressable memory (TCAM).
I. INTRODUCTION
Because ternary content-addressable memory (TCAM) has an additional don't-care (or X) state to perform the wild match, it is widely used in the forwarding tables of network routers. However, the power consumption of TCAM is usually considerable due to the parallel comparison feature, in which a large number of transistors and wires are active on each lookup. As revealed in [1], there are three major power consumers in TCAM: the clock and control, the match lines (MLs), and the search lines (SLs). The power dissipated by the former two components has effectively been reduced [2], [3], but the SLs still contribute 54%-82% of the total power consumption [1]. The SL power consumption can be reduced by using the segmented [1] or hierarchical SL scheme [4]-[6] and by minimizing the switching activity [7], [8]. However, these approaches suffer from either a performance penalty or complex control circuitry. In contrast, this brief presents a low-power TCAM design that consists of the two-level don't-care gating (DCG) scheme and the search enable (SE) technique. Without a performance penalty or complex control circuitry, our design can largely reduce the TCAM power dissipated in the SL switching activity.
The most pronounced features of the proposed low-power TCAM design are summarized as follows: 1) The SE technique can completely eliminate the unnecessary SL switching activity in the quiet pattern, but it incurs a performance penalty. 2) Based on the vertically continuous X feature, the two-level DCG scheme uses additional gating nodes to conditionally prevent the search data from being broadcast over the entire SL. By decreasing the SL effective capacitance, the two-level DCG scheme can largely reduce the average power dissipated in the SL switches. 3) Instead of simple transmission gates (TGs), the two-level DCG scheme uses inverter chain circuitry to accelerate the transmission of search data. Because this speedup compensates for the performance loss incurred by the SE technique, our design can achieve comparable or even better performance than the conventional TCAM design.
Manuscript received November 4, 2008; revised January 23, 2009. Current version published June 17, 2009. This paper was recommended by Associate Editor T. Zhang.
The author is with the Department of Computer Science and Engineering, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSII.2009.2020935
The proposed TCAM design with a size of 128 × 32 was implemented with the TSMC 0.18-μm technology, and all the results were measured from HSPICE simulation. By examining all possible configurations, the results show that our design achieves the best energy efficiency when the first-level (L1) and second-level (L2) gating granularities are 16 and 8, respectively. Compared to the conventional NOR-type TCAM, our design not only reduces the average SL energy consumption by 63%-70% but also improves the search performance by 9%.
The rest of this brief is organized as follows. Section II reviews the conventional NOR-type TCAM design and the previous work on TCAM power reduction. Section III describes the circuitry developed for the two-level DCG scheme in detail. In addition to a discussion of the important issues, a comparison between our design and the related work is also provided. The measurement results and analysis are given in Section IV, and Section V offers some brief conclusions.
II. TCAM
As shown in Fig. 1, a typical TCAM cell consists of three major components: 1) an eight-transistor XOR-type content-addressable memory (CAM) cell that not only stores the prefix data but also compares the prefix data with the search data; 2) a six-transistor static random access memory (SRAM) cell that stores the mask bit to indicate whether this TCAM cell is X or not; and 3) control logic that can conditionally pull down the ML. As shown in Fig. 1, this control logic can simply be implemented with two n-type metal-oxide-semiconductor transistors placed in series, which are controlled by the mask bit and the XOR result of the CAM cell, respectively.
A. Search Operation
The search operation is initiated by discharging all SLs and then precharging all MLs to VDD. In Fig. 1, both S and S̄ are discharged to ensure that no short-circuit path exists during the ML precharge. After precharging the ML, the search data and its complement are then applied to S and S̄ to perform the search operation. As shown in Fig. 1, M = 1 means that this TCAM cell is in the X state, where it is always a match regardless of the comparison result of the CAM cell. Such a comparison is referred to as a wild match. In contrast, if the mask bit is 0, this TCAM cell is in either the 0 or 1 state. In this case, if the search data are equal to the stored data, it is also a match, which we refer to as a normal match to distinguish it from the wild match. In Fig. 1, the ML is discharged to 0 only in case of a mismatch, in which M = 0 and the XOR result of the CAM cell is 1. Note that the search operations account for almost all the TCAM operations, and the SLs are highly capacitive. Thus, the TCAM power dissipated in the high-frequency SL switching is substantial [1].
Fig. 1. Typical TCAM cell and the corresponding state table.
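The match rules described above can be expressed as a small behavioral model. The following Python sketch is illustrative only and is not part of the original design: the function names `cell_pulls_down` and `word_matches` are hypothetical, and each TCAM word is assumed to be given as equal-length lists of stored bits, mask bits (1 meaning X), and search bits.

```python
def cell_pulls_down(stored_bit, mask_bit, search_bit):
    """Return True if this cell discharges the match line (a mismatch).

    A cell mismatches only when it is not X (mask_bit == 0) and the XOR
    of the stored bit and the search bit is 1.
    """
    xor_result = stored_bit ^ search_bit
    return mask_bit == 0 and xor_result == 1


def word_matches(stored, mask, search):
    """Return True if the precharged match line stays high for this word.

    The line stays high only if no cell in the word pulls it down, so X
    cells (wild match) and equal cells (normal match) both count as matches.
    """
    return not any(
        cell_pulls_down(d, m, s) for d, m, s in zip(stored, mask, search)
    )


# Hypothetical 4-bit word "10XX": the last two cells are X.
stored = [1, 0, 1, 0]
mask = [0, 0, 1, 1]
hit = word_matches(stored, mask, [1, 0, 0, 1])   # X cells ignore differences
miss = word_matches(stored, mask, [0, 0, 0, 1])  # first cell mismatches
```

In this sketch, `hit` is true because the only differing bits fall on X cells, whereas `miss` is false because a non-X cell differs from the search data.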
B. Related Work
Including our previous work [9]-[11], a large number of
designs have been proposed to reduce TCAM power consump-
tion, particularly ML power consumption [2], [3]. Because this
brief aims at reducing the TCAM SL power, we only focus on
the work related to the SL power reduction. In [4], Pagiamtzis
and Sheikholeslami introduced a hierarchical SL design in
which the SLs are broken into the global SLs (GSLs) and local
SLs (LSLs). Instead of directly connecting to the CAM cells,
the GSLs with low-swing signals are fed into the LSLs, and
then, the local receivers would conditionally amplify the signals
to drive a subset of CAM cells. Because only a few LSLs
with reduced capacitance would be activated, the CAM power dissipated in the SLs can effectively be reduced.
Based on [4], an alternative don't-care-based hierarchical
(DCBH) SL design [6] was proposed. The DCBH scheme uses
the X bit stored in the bottom word of each block to control
whether the search data on GSL should be broadcast to the
corresponding LSL. The GSLs are active every cycle, but the
LSLs are active only when the bottom word is not X.
The only difference between [4] and [6] is the LSL control
strategy. In [1], the segmented SL design was introduced,
which uses the segmentation cell (SC) to segment the SL. The
SC consists of a dummy cell and a path-control switch. In
particular, the dummy cell is an extra SRAM cell to probe the
continuous X information. When the TCAM cell above the SC stores the X bit, the SC dummy cell will keep 0 to turn off the path-control switch and block the search data from being broadcast further. Because the effective capacitance of the SL is decreased, the SL power consumption can be reduced.
Fig. 2. (a) Search data example of five consecutive 0s. Case A/B is the TCAM design without/with the SE scheme. (b) Traditional SE scheme.
III. TWO-LEVEL DCG SCHEME
A. SE Technique
If we only consider two consecutive search data on a single bit, there are four search patterns, i.e., 0→0, 1→1, 0→1, and 1→0. Because they involve no data transition, the 0→0 and 1→1 patterns are classified as quiet patterns. In contrast, 0→1 and 1→0 are classified as switch patterns. In the conventional TCAM design, because the ML has to be charged to high during the precharge phase, both S and S̄ must be discharged to 0 to avoid a possible short circuit. However, such discharge introduces unnecessary SL switching activity. For example, Fig. 2(a) shows a search data pattern of five consecutive 0s. In case A, i.e., the conventional TCAM design, the gray block contains the values of S and S̄ during the precharge phase. Clearly, the number of energy-consuming transitions (N_{0→1}) on the SL is four, but they are all unnecessary switching activities.
As illustrated in Fig. 2(b), a straightforward solution to this
weakness is the introduction of an additional transistor, i.e., N3,
that is used to disconnect the pull-down path during the ML
precharge and then enable the search operation. It is referred to
as the SE technique, whose effect can be observed in case B
shown in Fig. 2(a), where the unnecessary SL switches are all
eliminated. Compared to the traditional TCAM without SE,
whose pull-down path is only N1 and N2, the use of SE will
result in an increase of one transistor in the length of the pull-
down path. Based on our simulation, the optimal N3 width
is three times the N1 (or N2) width, where the performance
penalty is about 4.2%. Fortunately, this performance penalty
can be compensated by the two-level DCG scheme proposed
in the following section.
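The transition count in the five-consecutive-0s example can be reproduced with a simple behavioral sketch. The Python code below is an illustration under stated assumptions, not the author's simulation: the function name `rising_transitions` is hypothetical, the first search phase is taken as the starting state, and with SE enabled the SL pair is assumed to simply hold its value during the ML precharge.

```python
def rising_transitions(search_bits, search_enable):
    """Count 0->1 transitions on S and S-bar for one column.

    search_bits: sequence of 0/1 search data applied to this column.
    search_enable: False models the conventional design, where both
    lines are discharged to 0 during every ML precharge; True models
    the SE technique, where the lines keep their values (assumption).
    """
    states = []
    for i, bit in enumerate(search_bits):
        if i > 0 and not search_enable:
            states.append((0, 0))       # precharge phase: S = S-bar = 0
        states.append((bit, 1 - bit))   # search phase: S = bit, S-bar = ~bit

    count = 0
    for (s_prev, sb_prev), (s_next, sb_next) in zip(states, states[1:]):
        if s_prev == 0 and s_next == 1:
            count += 1
        if sb_prev == 0 and sb_next == 1:
            count += 1
    return count


five_zeros = [0, 0, 0, 0, 0]
conventional = rising_transitions(five_zeros, search_enable=False)
with_se = rising_transitions(five_zeros, search_enable=True)
```

Under these assumptions, the conventional case yields four rising transitions (all on S̄) and the SE case yields none, which agrees with cases A and B described for Fig. 2(a).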
B. L1 DCG
Fig. 3 shows the general configuration of a TCAM array with N prefixes. The function of the routing table lookup is to find the longest of all the prefixes that match the incoming packet's destination address. This is commonly known as the longest prefix match (LPM) problem [12]. To resolve the LPM problem simply, the stored prefixes are sorted by decreasing length. Clearly, if the ith cell of a column is X, then all the cells below the ith cell must be X. This is referred to as the vertically continuous don't-care feature. For example, consider the gray column shown in Fig. 3. Because the sixth cell is X, the sixth through Nth cells are all X. This scenario implies that the comparisons between the search data and the prefix data stored in the sixth through Nth cells are redundant. If we can gate the search data at the sixth cell, only the capacitances before the sixth cell need to be charged/discharged. The capacitance that actually needs to be charged is referred to as the effective capacitance. In this example, due to the decrease of effective capacitance, the SL power consumption can be reduced.
Fig. 3. General configuration of a prefix array.
Fig. 4. L1 gating node (GN_L1) implementation.
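The vertically continuous don't-care feature follows directly from sorting the prefixes by decreasing length, which the following Python sketch checks on a toy table. It is a minimal illustration, not the paper's tooling: the helper names and the sample prefix lengths are hypothetical, and a mask bit of 1 denotes an X cell as in Fig. 1.

```python
def mask_row(prefix_len, width=32):
    """Mask bits of one TCAM word: 0 for prefix bits, 1 (X) for the rest."""
    return [0] * prefix_len + [1] * (width - prefix_len)


def build_mask_array(prefix_lengths, width=32):
    """Sort prefixes by decreasing length and return their mask rows."""
    return [mask_row(p, width) for p in sorted(prefix_lengths, reverse=True)]


def first_x_row(mask_array, col):
    """Index of the first X cell in a column, or len(mask_array) if none."""
    for row_idx, row in enumerate(mask_array):
        if row[col] == 1:
            return row_idx
    return len(mask_array)


def is_vertically_continuous(mask_array):
    """True if, in every column, no non-X cell appears below an X cell."""
    width = len(mask_array[0])
    for col in range(width):
        start = first_x_row(mask_array, col)
        if any(row[col] == 0 for row in mask_array[start:]):
            return False
    return True


# Hypothetical prefix lengths; sorting gives rows of length 32, 24, 24, 16, 8.
masks = build_mask_array([8, 24, 16, 32, 24])
gate_row = first_x_row(masks, col=20)  # first row whose bit 20 is X
```

For column 20 in this toy table, the first X cell is in row 3, so only rows 0 to 2 would need to see the search data in that column if the line were gated there.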
To gate the search data from being broadcast over the entire SL, our design inserts gating nodes that break the entire SL into several segments. As shown in Fig. 4, the L1 gating node (GN_L1) is implemented as an inverter that is controlled by the corresponding mask bit. For example, suppose a GN_L1 is located in the ith cell. When the ith cell is X, M_i = 1 will cut off both the power and ground sources to disable the inverter from transmitting the search data. Otherwise, when M_i = 0, the inverter takes effect. Note that the search data are inverted after the gating node.
The granularity of L1 DCG (G_L1) is defined as the number of cells between two L1 gating nodes. It is critical to both search performance and power saving. If G_L1 is too large, the probability that the first cell of a segment is X is low, such that the amount of effective capacitance reduction is insignificant. In contrast, a small G_L1 can largely reduce the effective capacitance but suffers from a serious propagation delay. Therefore, a G_L1 that balances power saving and search performance is critical to our design. The detailed analysis is provided in the experimental results.
Fig. 5. L2 DCG example, where the granularity (G_L2) is 4, and GN_L2 is the L2 gating node.
C. L2 DCG
The L1 DCG helps reduce the SL power only when the first TCAM cell in a segment is X. If the first cell is not X, this segment is driven to perform the search operation even though the remaining cells are all X. Because decreasing G_L1 will worsen the search performance, to improve the power efficiency of the L1 DCG, we propose an L2 gating scheme to further exploit the vertically continuous X property within an L1 segment.
Fig. 5 shows an L2 gating example, in which the L2 granularity (G_L2) is 4. For example, if G_L1 and G_L2 are 16 and 4, respectively, then each L1 segment is divided into four L2 segments, each containing four TCAM cells. Similar to GN_L1, the function of the L2 gating node (GN_L2) is to either gate or transmit the search data from the L1 segment, depending on whether the first cell is X or not. This requirement can easily be realized by a TG, but a TG has a vital defect: when the input search data are gated, the Q and Q̄ nodes are floating, such that the N2 transistor (shown in Fig. 1) is likely to be turned on and cause an unexpected pull-down path between the ML and the ground, i.e., a false mismatch.
Instead of the TG, GN_L2 is implemented as in Fig. 5 and is controlled by the mask value of the first cell (M_0) in the corresponding L2 segment: 1) If the first cell is not X, then M_0 = 0 will turn on the P1 transistor to activate GN_L2 as an inverter. Assume that the search data from the L1 segment are S; acting as an inverter, GN_L2 drives the Q node to the complement of this input to perform the search operation. 2) In the other case, where the first cell is X, M_0 = 1 will cut off the VDD path and turn on the N2 transistor to force the Q node to be 0 regardless of the input search data. Thus, all side effects incurred by the floating Q and Q̄ nodes can completely be eliminated.
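The two-level gating behavior described above can be summarized in a short behavioral model. The Python sketch below is a simplification under stated assumptions rather than the circuit itself: `gn_l2_output`, `driven_cells`, and `column` are hypothetical names, the number of driven cells is used as a stand-in for effective capacitance (wire and gating-node capacitance are ignored), and G_L2 is assumed to divide G_L1.

```python
def gn_l2_output(s_in, m0):
    """Logic value at node Q of an L2 gating node.

    With m0 == 0 the node acts as an inverter; with m0 == 1 the output
    is forced to 0, so Q never floats.
    """
    return 0 if m0 == 1 else 1 - s_in


def driven_cells(mask_column, g1, g2):
    """Count the cells in one column that still receive search data.

    mask_column: mask bits from top to bottom (1 means X), assumed to be
    vertically continuous. An L1 node gates its segment, and everything
    below it, when the segment's first cell is X. Inside a driven L1
    segment, an L2 segment is gated when its own first cell is X.
    """
    n = len(mask_column)
    driven = 0
    for l1_start in range(0, n, g1):
        if mask_column[l1_start] == 1:
            break  # chained L1 nodes: nothing below is driven
        l1_end = min(l1_start + g1, n)
        for l2_start in range(l1_start, l1_end, g2):
            if mask_column[l2_start] == 1:
                break  # vertical continuity: later L2 segments are X too
            driven += min(l2_start + g2, l1_end) - l2_start
    return driven


def column(n, x):
    """Column of n cells whose bottom x cells are X."""
    return [0] * (n - x) + [1] * x


# 128-cell column with only the top 5 cells not X.
l1_only = driven_cells(column(128, 123), g1=16, g2=16)
two_level = driven_cells(column(128, 123), g1=16, g2=8)
```

In this example, L1 gating alone still drives the full first segment of 16 cells, while adding L2 gating with a granularity of 8 drives only 8 cells, which is the additional saving the L2 scheme targets.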
D. Compared to Related Work
Similar to the related work [1], [6], our design also exploits
the vertically continuous X feature to reduce the TCAM SL
power, but the strategy and implementation are totally different.
The major overheads in [6] come from the low-swing receivers
that are responsible for translating a low-swing GSL signal to
a full-swing LSL signal. In addition to the complex circuitry,
the overheads also include the leakage power penalty and the
increased propagation delay. Because these overheads increase
with the number of low-swing receivers, the large block size is
preferable, such that the effect of power saving is reduced. In contrast, the gating nodes used in the two-level DCG are very
simple, so the above drawbacks are absent in our design.
Compared to the segmented SL design [1], instead of using
an extra SRAM cell, our design uses the stored X bit to
directly control the broadcast of search data. Note that the path-
control switch used in SC [1] is implemented with the TG that
has no power source. Thus, the propagation delay of search
data is increased. In contrast, because the gating node used
in the two-level DCG is implemented with the controllable
inverter that has a power source, our design does not incur
any performance loss. On the contrary, these gating nodes
will realize an inverter chain that can facilitate the search data
transmission, i.e., the search performance can be improved.
IV. EXPERIMENTAL RESULTS
In this brief, we use TSMC 0.18-μm technology with a 1.8-V supply voltage to implement two IPv4 routing tables. Both of them have a size of 128 × 32, i.e., 128 entries by 32 bits.
One is implemented with the proposed SE and two-level DCG
schemes, and the other is a conventional NOR-type TCAM used
for comparison.
A. Search Performance
The metric used to evaluate the TCAM search performance
is the match delay (MD). Because our design involves the
modifications of both the SL architecture and the pull-down
path of ML, the MD is defined as the elapsed time from the
search data applied to the SL to the ML discharged to 0 in case
of a mismatch. Throughout this brief, the MD of our design is
always the worst-case delay measured from the mismatch of a
TCAM word located in the last L1 segment.
Fig. 6 shows the MD for both the conventional and our TCAM designs, where X-Y means that the L1 and L2 granularities are X and Y, respectively. In our simulation, G_L1 is constrained within the range from 8 to 32, and a total of nine reasonable two-level DCG configurations are evaluated. They are 8-2, 8-4, 16-2, 16-4, 16-8, 32-2, 32-4, 32-8, and 32-16. Because it has no segmentation, the MD of the conventional TCAM (MD_Conv) is fixed at 1.608 ns.
In Fig. 6, there are two cases in which the MD of our design (MD_DCG) is always worse than MD_Conv. The first case is that G_L1 is less than or equal to 8; the second case is that G_L2 is less than or equal to 2. In contrast, MD_DCG is even shorter than MD_Conv in the cases of 16-4, 16-8, 32-4, 32-8, and 32-16. This is because our design decouples most TCAM cells from the SL, such that the SL capacitance can largely be reduced. According to the RC delay model, the SL propagation delay decreases with the SL capacitance. In addition, the inverter chain introduced by the L1 DCG scheme can also reduce the SL propagation delay. Therefore, the performance gains from the two-level DCG scheme cover the performance penalty incurred by the SE technique. Based on this result, only five configurations, i.e., 16-4, 16-8, 32-4, 32-8, and 32-16, are evaluated in the following discussion of energy reduction.
Fig. 6. Match delay for both the conventional and two-level DCG TCAM designs, where X-Y means that G_L1 = X and G_L2 = Y.
Fig. 7. Column power dissipated in the switch pattern for both the conventional and two-level DCG TCAM designs.
B. Average SL Energy Reduction
Because the power consumption of a quiet pattern is hardly noticeable in our design, Fig. 7 only shows the column power dissipated in a switch pattern, in which the continuous X number is varied from 0 to 128, e.g., X = 32 means that the continuous X number is 32. For a clear demonstration, only the 16-8, 32-4, and 32-8 configurations are displayed. Because the two-level DCG breaks the entire SL into several segments, its column power shows a step wave. The power data are summarized in Table I, where the worst and best values are for the X = 0 and X = 128 cases, respectively. The average value is obtained by averaging the results of all 129 cases, i.e., X = 0 to 128, assuming that every case has the same occurrence probability.
TABLE I. COLUMN POWER CONSUMPTION (IN WATTS) FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS. CONFIGURATION 16-8 IS THE BEST TO REDUCE THE SWITCH POWER CONSUMPTION
In Table I, the key observations are as follows: 1) There are two features of the conventional TCAM. First, due to the need for presetting both S and S̄ to 0, the power consumption of the quiet pattern is almost equal to that of the switch pattern. Second, its column power is independent of the continuous X number. As shown in Table I, they are always 3.753E-05 and 3.776E-05 W for quiet and switch patterns, respectively. 2) Because there is no SL switch in the quiet pattern, our design consumes almost no power compared to the conventional TCAM, and the difference among the three cases is hardly noticeable. 3) In the worst case of the switch pattern, because no X cell can help our design reduce the SL power, the additional L1 and L2 gating nodes result in an absolute power penalty. Consequently, for all configurations, the worst switch power must be larger than that of the conventional TCAM. 4) Clearly, for the switch pattern, the best configuration is 16-8, in which our design incurs the least power penalty, i.e., 13%, in the worst case, while achieving the largest power reduction, i.e., 35%, in the average case.
For a fair comparison, the evaluation metric is the energy,
which is the product of the MD and the search power. Thus,
the column energy consumption is summarized in Table II, in
which only the results of the average case are presented. In
Table II, the best configuration is still 16-8, which can achieve
70% average SL energy reduction compared to the conventional
NOR-type TCAM design.
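As a rough consistency check on the energy metric defined above (energy as the product of MD and search power), the following Python sketch combines the figures quoted in the text for the 16-8 configuration. It is a back-of-the-envelope illustration only, and several inputs are assumptions not stated in the source: quiet and switch patterns are assumed to be equally likely, the quiet-pattern power of the proposed design is taken as exactly zero, and the 9% performance improvement is interpreted as an MD of 0.91 times the conventional MD. Table II itself is not reproduced here, so the sketch does not claim to recover its entries.

```python
# Values quoted in the text.
P_CONV_QUIET = 3.753e-5          # W, conventional TCAM, quiet pattern
P_CONV_SWITCH = 3.776e-5         # W, conventional TCAM, switch pattern
SWITCH_POWER_REDUCTION = 0.35    # 16-8 configuration, average case
MD_IMPROVEMENT = 0.09            # search performance improvement

# Assumptions made for this sketch (not given in the source).
P_DCG_QUIET = 0.0                # "almost no power" approximated as zero
QUIET_FRACTION = 0.5             # quiet and switch patterns equally likely


def energy_reduction(quiet_fraction=QUIET_FRACTION):
    """Relative SL energy reduction, with energy = MD x power."""
    switch_fraction = 1.0 - quiet_fraction
    p_conv = quiet_fraction * P_CONV_QUIET + switch_fraction * P_CONV_SWITCH
    p_dcg_switch = P_CONV_SWITCH * (1.0 - SWITCH_POWER_REDUCTION)
    p_dcg = quiet_fraction * P_DCG_QUIET + switch_fraction * p_dcg_switch
    md_ratio = 1.0 - MD_IMPROVEMENT  # MD of proposed design / MD_Conv
    return 1.0 - (p_dcg * md_ratio) / p_conv


mixed = energy_reduction()                         # equal quiet/switch mix
switch_only = energy_reduction(quiet_fraction=0.0)  # switch patterns only
```

Under these assumptions, the equal-mix case gives a reduction of roughly 70%, in line with the reported figure, whereas considering switch patterns alone gives only about 41%. This suggests that the elimination of quiet-pattern switching by the SE technique accounts for a large share of the overall saving.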
C. Area Overhead and Leakage Power Penalty
Our design can effectively reduce the SL energy consump-
tion, but the additional SE control and L1 and L2 gating nodes
will increase the transistor count. Because both the area and the
leakage power consumption are proportional to the transistor
count, our design will result in a 5.8% area overhead and 1.6%
leakage power penalty. In addition, the two-level DCG design
also increases the mask node capacitance of the first cell of
L1 and L2 segments, and the read/write performance loss is about 0.2%.
TABLE II. COLUMN ENERGY CONSUMPTION FOR BOTH THE CONVENTIONAL AND TWO-LEVEL DCG TCAM DESIGNS
V. CONCLUSION
In this brief, we have proposed a low-power TCAM design that not only reduces the SL power consumption but also improves the search performance. We first use the SE technique to eliminate all unnecessary SL switches in the quiet pattern and then use the proposed two-level DCG scheme to reduce the SL switch power in the switch pattern. By exploiting the vertically continuous don't-care feature, our design can achieve at least a 63% reduction in average SL energy consumption with a 5.8% area overhead.
REFERENCES
[1] J. S. Wang, C. C. Wang, and C. Yeh, "TCAM for IP-address lookup using tree-style AND-type match lines and segmented search lines," in Int. Solid-State Circuits Conf., 2006, pp. 577-586.
[2] N. Mohan and M. Sachdev, "Low-capacitance and charge-shared match-lines for low-energy high-performance TCAMs," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 2054-2060, Sep. 2007.
[3] N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Match line sense amplifiers with positive feedback for low-power content addressable memories," in IEEE Custom Integr. Circuits Conf., 2006, pp. 297-300.
[4] K. Pagiamtzis and A. Sheikholeslami, "A low power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512-1519, Sep. 2004.
[5] H. Noda, K. Inoue, M. Kuroiwa, F. Igaue, K. Yamamoto, H. J. Mattausch, T. Koide, A. Amo, A. Hachisuka, S. Soeda, I. Hayashi, F. Morishita, K. Dosaka, K. Arimoto, K. Fujishima, K. Anami, and T. Yoshihara, "A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245-253, Jan. 2005.
[6] P. T. Huang, S. W. Chang, W. Y. Liu, and W. Hwang, "A 256x128 energy-efficient TCAM with novel low power schemes," in Proc. Int. Symp. VLSI-DAT, 2007, pp. 1-4.
[7] I. Arsovski, T. Chandler, and A. Sheikholeslami, "A ternary content addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme," IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155-158, Jan. 2003.
[8] I. Arsovski and R. Nadkarni, "Low-noise embedded CAM with reduced slew-rate match-lines and asynchronous search-lines," in IEEE Custom Integr. Circuits Conf., 2005, pp. 447-450.
[9] Y.-J. Chang, "Two-layer hierarchical matching method for energy-efficient CAM design," Electron. Lett., vol. 43, no. 2, pp. 80-82, Jan. 2007.
[10] Y.-J. Chang, Y.-H. Liao, and S.-J. Ruan, "Improve CAM power efficiency using decoupled match line scheme," in IEEE/ACM DATE, Apr. 16-20, 2007, pp. 1-6.
[11] Y.-J. Chang and Y.-H. Liao, "Hybrid-type CAM design for both power and performance efficiency," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 965-974, Aug. 2008.
[12] D. Shah and P. Gupta, "Fast updating algorithms for TCAMs," IEEE Micro, vol. 21, no. 1, pp. 36-47, Jan./Feb. 2001.