[ieee 2013 ieee/acm international conference on computer-aided design (iccad) - san jose, ca, usa...
LatchPlanner: Latch Placement Algorithm for
Datapath-oriented High-Performance VLSI Designs
Minsik Cho, Hua Xiang, Haoxing Ren, Matthew M. Ziegler, Ruchir Puri
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
{minsikcho,huaxiang,haoxing,zieglerm,ruchir}@us.ibm.com
Abstract—In this paper, we present a novel algorithm for latch
placement, LatchPlanner, which enables a placement engine to deliver
high-quality placement for datapath-oriented designs. Datapath-oriented
VLSI designs are in general hand-crafted by human designers at high cost, as
understanding and capturing the datapath structure is critical for
performance. Conventional placement algorithms by themselves cannot exploit
the underlying datapath due to the lack of logic structure recognition and
inaccurate/approximated wirelength estimation. LatchPlanner addresses
such drawbacks by placing and fixing latches, a key element in datapath
structure, in the datapath context. By taking placed/fixed latches
as constraints, a placer can effectively find a more datapath-friendly
placement, which results in higher-quality hardware. LatchPlanner begins
with latch clustering/sizing/ordering to prepare the following steps: a) global
latch placement based on linear programming to place latch clusters,
and b) local latch placement based on network flow optimization to place
latches within each cluster. Experimental results on eighteen industrial
benchmarks show that LatchPlanner improves total wirelength by 32%,
total negative slack by 25%, and area by 3% without CPU overhead over
a commercial placement engine, and delivers near semi-custom-quality
solutions.
I. INTRODUCTION
In the high-end VLSI design or microprocessor domain, the com-
plexity of chip integration becomes increasingly challenging due to
tight area/power budgets with highly aggressive performance targets.
Therefore, a hierarchical approach has been widely adopted to
reduce the complexity: the entire design is divided into multiple
macros [10] and each macro is optimized separately. Among these
macros, some may be fully synthesized without human intervention,
while others, mostly datapath-oriented macros, are
manually crafted by human designers due to their timing/area/power criticality [7],
[17]. Such datapath-oriented macros commonly found in high-end
VLSI systems (e.g., muxing, buffering, butter-flying, rotating, and so
on) are completed with significant effort by either full-custom or semi-
custom design flows, as conventional synthesis algorithms, especially
placement, are not well-suited for such datapath-oriented macros due
to the inherent gap between the HPWL and the Steiner wirelength
models [5], [14], [23]–[25]. Thus, considering ever-growing design
complexity/functionality, shortened design-turn-around-time, and cost
of custom macros, it is in greater demand than ever to develop more
automated synthesis schemes for datapath-oriented macros.
Many approaches have been proposed to handle datapath-oriented
design. Logic synthesis level optimization of datapath is described
in [16]. Enforcing datapath structure through a logic/physical syn-
thesis flow is proposed in [5]. Traditional analytical placement
techniques are extended to handle datapath structures in [7]. Graph
automorphism-based datapath extraction and ILP-based bitstacking
selection are proposed to amend the existing HPWL-driven place-
ment [23]. However, datapath consideration in logic synthesis phase
alone is not sufficient [16], and restricting timing optimization (e.g.,
sizing/buffering) on datapath can degrade the overall quality of
HW [5], [25]. Also, capturing global dataflow rather than local
isomorphism and analyzing the impact on timing due to datapath
optimization are essential [7], [23], [24].
While previous approaches have aforementioned limitations, latch
placement is an already proven and key technique in a semi-custom
design methodology to handle datapath-oriented designs. For details
on semi-custom design methodology, see Section II-A. Structurally
placed latches will not only encourage regular bitstacks but also
enable compact gate placement between latches, which will result in
better timing and shorter wirelength. Although a semi-custom solution
with fixed latches requires relatively little human effort yet provides
the benefits of a full-custom scheme, manually planning latches can
still be cost-prohibitive and labor-intensive. For example, for a
large and complex datapath macro, the number of latches to place
and fix can be too large, or fully understanding the logical and physical
characteristics of the design for manual latch placement may be too
difficult.
Therefore, we propose an automatic latch placement algorithm,
LatchPlanner to address such challenges in the semi-custom flow.
LatchPlanner relies on the extracted dataflow graph from logic
synthesis where pins and latches become nodes and datapaths become
edges. The key idea is that since a datapath is a bitstream channel
between latches and pins, properly staging latches in advance with
dataflow taken into account will result in more datapath-friendly
placement. With the dataflow graph, LatchPlanner structurally places
and fixes latches using linear programming (LP) and min-cost network
flow optimization, which provides structured guidance to a placement
engine. Comprehensive experiment results demonstrate that Latch-
Planner can improve conventional placement significantly and deliver
highly comparable placement solutions to semi-custom results. The
major contributions of this paper include the following:
• We propose LatchPlanner, which places and fixes latches in
a datapath-friendly fashion. LatchPlanner is highly comple-
mentary to any existing placement engine or design flow.
• We propose to use dataflow graph to optimize datapath
in VLSI designs, in order to optimize overall datapath
wirelength and the locations of the key datapath element,
latches.
• We compare LatchPlanner with the industrially proven and
qualitatively golden solutions from a semi-custom method-
ology, and show that LatchPlanner can be highly effective
for datapath-oriented VLSI designs.
The rest of the paper is organized as follows. Section II provides
preliminaries and introduction to semi-custom design methodology.
Section III presents LatchPlanner. In-depth discussion on complex
datapath is in Section IV. Experimental results are in Section V,
followed by the conclusion in Section VI.
978-1-4799-1071-7/13/$31.00 ©2013 IEEE
[Figure: (a) Datapath-oriented netlist; (b) Extracted dataflow graph (DFG); (c) Input placement]
Fig. 1. Inputs to LatchPlanner: pins are in a diamond and latches are in a box.
II. PRELIMINARIES
A. Semi-Custom Design Methodology
In general, full-custom design practice can potentially deliver the
best quality HW, but it scales poorly with design size and requires
high development cost (e.g., design time) [24]. On the other hand,
an ASIC-style full synthesis approach can deliver very large scale HW in
shorter time at less cost, yet the quality of HW tends to be worse than
that of the full-custom approach. As a middle ground, semi-custom design
methodology is widely used to deliver good quality and large-scale
HW at affordable cost. The key idea in the semi-custom approach
is that a human designer makes critical design decisions/optimizations manually
and leaves the rest of the work to tools. For example, an experienced
designer can place/fix a couple of critical gates/IPs, perform
manual gate-sizing locally, assign some known time-critical signals
to higher metal layers, or even route a few nets in highly congested
regions. Among such techniques, manual latch (or flipflop) placement
in a structured fashion is a popular and proven technique to
achieve high performance and lower clock power dissipation [6],
[10]. Effectively, the placed and fixed latches become placement
anchors or constraints which guide a placement engine to produce
higher quality results [24].
B. Datapath Extraction and Dataflow Graph
Datapath extraction is translating regularity/similarity (inherent
in datapath) in circuits into mathematical information. Many ideas
such as templates, signatures, hashing, and machine-learning have
been proposed in order to find out a set of similar subgraphs from
netlist under isomorphism [3], [8], [14], [15], [20], [22], [25]. Once
datapaths are identified, a dataflow graph (DFG) can be constructed,
which is a graph representation of the flow of data through key circuit
elements including latches (or flipflops) and pins. For example, from
a given netlist as in Fig. 1 (a), we can detect a set of similar logics
between latches/pins and replace them with edges in order to create
a DFG in Fig. 1 (b) where pins/latches become nodes. DFG has only
pins/latches on the datapath as nodes. The advantage of using DFG
is that it captures global view of datapath logic and enables more
comprehensive datapath optimization. Also, DFG is highly suitable
for global scale optimization such as latch placement. Please refer
to [19], [25] for more information.

TABLE I. THE NOTATIONS IN THIS PAPER.
Dw          the width of a design
Dh          the height of a design
C           the set of latch clusters (indexed by c)
P           the set of pin clusters (indexed by p)
Mc          the set of objects in the cluster c
Vc          the virtual block from the cluster c
wc          the width of the Vc
hc          the height of the Vc
W           the width of a latch
H           the height of a latch
(xi, yi)    the input coordinate of an object i
(xi, yi)    the coordinate of an object i
(xc, yc)    the coordinate of the Vc
III. LATCHPLANNER
In this section, we propose LatchPlanner, a datapath-aware latch
placement algorithm to handle datapath-oriented designs. The key
insight behind optimizing latch placement in the context of datapath is
that datapath is essentially structured delivery of bitstreams from a set
of latches to another set of latches (commonly called pipeline stages).
Therefore, smartly anchoring latches in advance with datapath taken
into consideration can accomplish desirable layout structures and
improve the quality of placement results. And, we find that DFG [25]
is a natural way of modeling datapath among pins and latches for
datapath-aware physical design. Fig. 2 illustrates the overview of
LatchPlanner in the context of a modern VLSI synthesis flow where
the steps of LatchPlanner are in solid boxes. LatchPlanner accepts a
DFG (Fig. 1 (b)) extracted from a logic netlist and an input placement
(Fig. 1 (c)), and then plans (places and fixes) datapath latches. It
is possible to run placement again with fixed latches to reflect our
datapath optimization in determining other cell locations.
LatchPlanner starts with latch/pin clustering to define the gran-
ularity of datapath-aware latch placement as in Section III-A. Latch
cluster sizing/ordering translates logical datapath information to phys-
ical information for the placement purpose as in Section III-B. Once
latch clusters are fully defined physically, we secure spaces for the
latches in each cluster during global latch placement in Section III-C.
Finally, the latches are placed and fixed during local latch placement
in Section III-D. The placed/fixed latches will guide the following
placement by anchoring non-datapath gates to planned latches, in
order to generate a datapath-friendly placement.
[Figure: inputs (Dataflow Graph, Input Placement) feed the LatchPlanner steps that place and fix latches (Latch/Pin Clustering, Latch Cluster Sizing/Ordering, Global Latch Placement, Local Latch Placement), followed by Placement with Preplaced Latches]
Fig. 2. LatchPlanner in VLSI design flow.
Algorithm 1 Latch Cluster Ordering and Sizing
Require: Dataflow Graph DFG = (V, E)
 1: An ordered set Q = ∅
 2: for each cluster c ∈ C do
 3:   Create Vc with wc = ⌈√|Mc|⌉ ∗ W, hc = ⌈√|Mc|⌉ ∗ H
 4: end for
 5: Mark all pins in Mp, ∀p ∈ P
 6: repeat
 7:   for each cluster c ∈ C − Q do
 8:     num_edge = 0
 9:     for each object i ∈ Mc do
10:       for each object j ∈ Mx, ∀x ∈ P ∪ Q do
11:         if (i, j) ∈ E and j is marked then
12:           num_edge++
13:         end if
14:       end for
15:     end for
16:     rc = num_edge / |Mc|   // physical certainty
17:   end for
18:   Find a cluster s, rs ≥ rx, x ∈ C − Q
19:   Mark all latches in Ms
20:   for each object i ∈ Ms do
21:     for each cluster c ∈ P ∪ Q do
22:       num_vertical_edge = 0
23:       num_horizontal_edge = 0
24:       for each object j ∈ Mc do
25:         if (i, j) ∈ E and j is marked then
26:           if |xi − xj| / Dw > |yi − yj| / Dh then
27:             num_horizontal_edge++
28:           else
29:             num_vertical_edge++
30:           end if
31:         end if
32:       end for
33:       ws = max(ws, num_vertical_edge ∗ W)
34:       hs = max(hs, num_horizontal_edge ∗ H)
35:     end for
36:   end for
37:   Q = Q ∪ {s}
38: until |C| == |Q|
A. Latch/Pin Clustering
In this section, we will explain latch/pin clustering. The goal of
clustering is to define a set of objects which will form datapath-
aware structures together. We cluster pins and latches separately, as
pins are fixed yet latches are floating. Clustering is driven by their
characteristics, such as physical/logical proximity (based on DFG),
instance names and so on. For example, two sets of latches on the
ends of bus wires can form two clusters, as each cluster needs to
be placed at the end of the bus such that the bus wire routing can
be efficiently done. Another example is in Fig. 1 where the clusters
c1(M1 = {g, h, i}) and c2(M2 = {j, k, l, m}) are shown in the
dotted lines based on logical separation in (b) and physical separation
in (c). Pins are also clustered into ci and co due to their physical
separation (e.g., pin locations are known). In practice, clock domains
also play a critical role in latch clustering as latches in the same
clock domain need to be placed closely. How to perform clustering is
beyond the scope of this paper, but in our implementation the
logical/physical proximity and clock domain drive latch clustering,
while the physical proximity drives pin clustering. Please refer to [1],
[2], [17], [26] for clustering in VLSI placement.

TABLE II. EXAMPLE OF LATCH CLUSTER SIZING/ORDERING.
iter.  s^a   r1^b   r2     r3     r4     h1^c   h2   h3   h4
-      -     0      0      0      0      3      4    7    2
1      c4    4/3    5/4    0/7    3/2    3      4    7    3
2      c1    4/3    5/4    8/7    -      4      4    7    3
3      c3    -      5/4    11/7   -      4      4    8    3
4      c2    -      9/4    -      -      4      5    8    3
a: a selected latch cluster. b: physical certainty. c: assumed W = H = 1 (Table I) and all edges are
horizontal for simplicity.
B. Latch Cluster Sizing and Ordering
Once clustering is completed, we create a virtual block Vc, ∀c ∈ C,
which defines the physical space inside which Mc will
be placed (see Section III-C). The purpose of sizing is to
determine the dimension (wc, hc) of Vc such that the space inside Vc
is big enough to achieve a legal datapath-aware latch placement. For
a pin cluster, (wc, hc) is unnecessary as pins are fixed, but that of
a virtual latch block needs to be determined. At the same time, we
order all latch clusters based on their physical certainty, which
is defined as the ratio of the number of edges to marked nodes in the DFG
to the number of objects in the cluster. The more connections a latch has
to marked (or fixed) nodes, the clearer it is where to place the latch.
Hence, physical certainty prioritizes latch clusters so that we
optimize a latch cluster with more physical certainty (or less
placement flexibility) first, in order to achieve better datapath alignment.
Algorithm 1 details cluster ordering and sizing, and Table II shows
an example of latch sizing/ordering for Fig. 1 (b) where there are 4
latch clusters, c1, c2, c3, and c4. As in the line 3 of Algorithm 1, we
create a virtual block (Vc) such that it is just big enough to have its
latches (Mc) within itself. Also, we mark all the fixed nodes (e.g.,
pins) as in the line 5. The first row of Table II reveals the status after
the line 5 regarding Fig. 1 (b). Then, we compute physical certainty,
rc, ∀c ∈ C, by computing the ratio of the number of edges to the
marked nodes and |Mc| as in the lines 7-16. For example, r4 = 3/2, as
it has 3 edges to x, y (which are already marked in the line 5) and
M4 = {u, v}. r1,2,3 are also shown in the second row of Table II.
Once all rc,∀c ∈ C are obtained, we find a cluster s with the
largest physical certainty as in the line 18, and then mark latches in
Ms as in the line 19. In the first iteration, c4 is selected as in the
second row of Table II. For the selected cluster s, we can update
(ws, hs) by respectively finding the largest number of vertical and
horizontal edges to its marked neighbor as in the lines 24-32. The
reason why we normalize the horizontal and vertical distances with
Dw, Dh in the line 26 is to avoid a wrong size for a cluster when a
design has a highly skewed aspect ratio (e.g., a very tall or thin outline).
If we assume c4 in Fig. 1 (b) has 3 horizontal edges to co whose pins
are already marked, we can set h4 = 3 as in Table II and complete
the first iteration.
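The selection part of Algorithm 1 (lines 6-19 and 37-38) can be sketched at cluster granularity as follows. This is only an illustration: the function name `order_clusters`, the lumped `"pins"` cluster, and the inter-cluster edge counts are assumptions chosen so that the computed ratios agree with the r values listed in Table II; they are not read from Fig. 1 (b).

```python
from fractions import Fraction


def order_clusters(cluster_size, edge_count, fixed):
    """Return (order, history): latch clusters ordered by physical certainty.

    cluster_size: latch cluster name -> |Mc| (number of latches)
    edge_count:   frozenset({a, b}) -> number of DFG edges between a and b
    fixed:        clusters whose objects are marked up front (pin clusters)
    """
    marked = set(fixed)
    order, history = [], []
    while len(order) < len(cluster_size):
        certainty = {}
        for c, size in cluster_size.items():
            if c in marked:
                continue
            # Count edges from c to clusters that are already marked.
            num_edge = sum(n for pair, n in edge_count.items()
                           if c in pair and (pair - {c}) <= marked)
            certainty[c] = Fraction(num_edge, size)
        s = max(certainty, key=certainty.get)  # most certain cluster first
        marked.add(s)
        order.append(s)
        history.append(certainty)
    return order, history


# Hypothetical edge counts, consistent with the ratios in Table II.
sizes = {"c1": 3, "c2": 4, "c3": 7, "c4": 2}
edges = {
    frozenset({"c1", "pins"}): 4,
    frozenset({"c2", "pins"}): 5,
    frozenset({"c4", "pins"}): 3,
    frozenset({"c3", "c4"}): 8,
    frozenset({"c1", "c3"}): 3,
    frozenset({"c2", "c3"}): 4,
}
order, history = order_clusters(sizes, edges, fixed={"pins"})
print(order)  # ['c4', 'c1', 'c3', 'c2']
```

Exact fractions are used so that ratios such as 3/2 and 4/3 compare without rounding; ties, if any, are broken by dictionary order, which is a choice of this sketch.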
At each iteration, we need to update physical certainty as there are
newly marked latches. For example, the third row in Table II shows
that r3 has been updated to 8/7 because its neighbor c4 was selected
in the previous iteration and all latches in M4 are now marked. If
we continue the steps until all the clusters are selected as in the
line 38 of Algorithm 1, we will have an ordered set of clusters
Q which leads to the order of c4 → c1 → c3 → c2. Intuitively,
we are choosing a cluster with less flexibility (or larger physical
certainty) early due to fixed/marked neighbors, so that we can have
better datapath-aware latch placement for such less flexible clusters
during local latch placement.

Algorithm 2 Global Latch Placement
Require: Virtual Dataflow Graph VDFG = (V, E)
 1: A set of overlap-free constraints, T = ∅
 2: loop
 3:   F = LP formulation from Eq. (1, 2, 3, 4)
 4:   for each constraint t ∈ T do
 5:     Add overlap-free constraint t to F
 6:   end for
 7:   Solve F
 8:   if overlap-free global latch placement is found then
 9:     return
10:   else
11:     for each pair of overlapped blocks Vi, Vj do
12:       n = overlap-free constraint between Vi, Vj [18], [21]
13:       T = T ∪ {n}
14:     end for
15:   end if
16: end loop
17: return (xc, yc), ∀c ∈ C
C. Global Latch Placement
We can start global latch placement after latch
clustering/sizing/ordering. Its purpose is to optimize
(xc, yc), ∀c ∈ C, for desirable local latch placement. Global latch
placement balances two conflicting objectives: a) minimizing datapath
wirelength, which enhances structured placement (e.g., latch
alignment), and b) minimizing latch disturbance from the input placement,
which avoids local congestion. Specifically, for the latter objective,
if two clusters are placed too close (yet far apart in the input
placement) for shorter datapath wirelength, the combinational gates
logically between these two clusters must be placed at poor
locations (due to lack of space) in later placements (see Fig. 6), which
can create local congestion.
Fig. 3 illustrates the concept of global latch placement. We first
replace c1,2,3,4 in the input placement (Fig. 1 (c)) with V1,2,3,4, while
keeping the center of gravity unchanged for each cluster. Note that
(wc, hc) for Vc is already obtained by Algorithm 1. Next, we create a
virtual DFG (virtual blocks instead of latches), which results in Fig. 3
(a). The weight on an edge is the number of edges accumulated from
the DFG in Fig. 1 (b). For example, the edge between c2 and c4 has
weight 4, as there are 4 edges between both in Fig. 1 (b). Then, we can
optimize (xc, yc),∀c ∈ C. Fig. 3 (b) shows the result of global latch
placement, where overall datapath wirelength is minimized without
overlapped blocks, yet virtual blocks do not move much in order to
[Figure: (a) Before; (b) After]
Fig. 3. Global latch placement.
preserve the space for combinational gates (V3 and V4 do not get
close in spite of the high weight between two).
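The construction of the virtual DFG described above (one node per cluster, edge weight equal to the number of DFG edges between two clusters, block anchored at the cluster's center of gravity) can be sketched as below. The object names, coordinates, and helper names `build_virtual_dfg` and `center_of_gravity` are hypothetical and serve only as an illustration.

```python
from collections import Counter


def build_virtual_dfg(dfg_edges, cluster_of):
    """Accumulate object-level DFG edges into cluster-level edge weights."""
    weights = Counter()
    for i, j in dfg_edges:
        ci, cj = cluster_of[i], cluster_of[j]
        if ci != cj:  # edges inside one cluster do not appear in the virtual DFG
            weights[frozenset((ci, cj))] += 1
    return weights


def center_of_gravity(objects, pos):
    """Mean input coordinate of a cluster; the virtual block is centered here."""
    n = len(objects)
    return (sum(pos[o][0] for o in objects) / n,
            sum(pos[o][1] for o in objects) / n)


# Hypothetical toy DFG: two pins, cluster A = {l0, l1}, cluster B = {m0, m1}.
cluster_of = {"p0": "pins", "p1": "pins",
              "l0": "A", "l1": "A", "m0": "B", "m1": "B"}
dfg_edges = [("p0", "l0"), ("p1", "l1"),
             ("l0", "m0"), ("l1", "m1"), ("l0", "m1")]
pos = {"l0": (2.0, 1.0), "l1": (4.0, 3.0)}

vdfg = build_virtual_dfg(dfg_edges, cluster_of)
print(vdfg[frozenset(("A", "B"))])           # 3
print(center_of_gravity(["l0", "l1"], pos))  # (3.0, 2.0)
```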
We solve global latch placement using linear programming (LP)
as in Algorithm 2, in order to capture the linear datapath wirelength
as shown below with 0 ≤ α ≤ 1:
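\begin{align}
\min:\quad & \alpha \sum_{c \in C} \left( DEV^c_x + DEV^c_y \right) + (1 - \alpha) \sum_{e \in E} DWL_e \tag{1} \\
\text{s.t.}:\quad & DEV^c_x \ge \left| x_c - \frac{\sum_{i \in M_c} x_i}{|M_c|} \right|, \quad \forall c \in C \tag{2} \\
& DEV^c_y \ge \left| y_c - \frac{\sum_{i \in M_c} y_i}{|M_c|} \right|, \quad \forall c \in C \tag{3} \\
& DWL_e \ge w_e \left( |x_i - x_j| + |y_i - y_j| \right), \quad \forall e(i, j) \in E \tag{4} \\
& \text{No overlap constraints [18], [21]} \tag{5}
\end{align}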
As mentioned earlier, we balance datapath wirelength minimization
and latch location disturbance as in Eq. (1) with (0 < α ≤ 1).
In detail, Eq. (2) and (3) capture latch movement from the cluster
center, which prevents local congestion by preserving the space
for combinational gates. Eq. (4) computes the weighted (we) datapath
wirelength based on the VDFG in Algorithm 2. There is a body of
literature on Eq. (5); refer to [18], [21] for details.
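For completeness, the absolute values in Eq. (2)-(4) can be expressed with purely linear constraints in the standard way. The sketch below is a textbook reformulation rather than a statement of how the formulation was actually encoded; the auxiliary variables d^x_e, d^y_e and the shorthand \bar{x}_c are introduced here only for illustration.

```latex
% \bar{x}_c is the (constant) center of gravity of cluster c in the input placement.
% d^x_e and d^y_e are illustrative auxiliary variables, not notation from Table I.
\begin{align*}
DEV^c_x \ge |x_c - \bar{x}_c|
  &\iff DEV^c_x \ge x_c - \bar{x}_c \ \text{ and } \ DEV^c_x \ge \bar{x}_c - x_c,
  \qquad \bar{x}_c = \frac{1}{|M_c|} \sum_{i \in M_c} x_i, \\
DWL_e \ge w_e \left( |x_i - x_j| + |y_i - y_j| \right)
  &\Longleftarrow
  \begin{cases}
    d^x_e \ge x_i - x_j, \quad d^x_e \ge x_j - x_i, \\
    d^y_e \ge y_i - y_j, \quad d^y_e \ge y_j - y_i, \\
    DWL_e \ge w_e \left( d^x_e + d^y_e \right).
  \end{cases}
\end{align*}
```

Eq. (3) is handled in the same way as Eq. (2). Because the objective in Eq. (1) minimizes these variables with non-negative coefficients, relaxing the bounds from equalities to inequalities does not change the optimal objective value.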
One key characteristic in global latch placement is that the
placement density is not high as we only place Vc,∀c ∈ C which
makes it easy to find an overlap-free solution. Therefore, instead of the
full-blown no-overlapping constraints, we prefer to have the minimum
number of overlap-free constraints so that LP-based optimization in
Eq. (1) can be more effective and faster. The weakness of the overlap-free
constraints in [18], [21] is that either a vertical or a horizontal relationship
between two adjacent blocks must be strictly defined, which may
over-constrain the optimization and take away some potential benefit
in terms of wirelength minimization. Therefore, we choose a lazy
approach: we iteratively solve the LP problem as in the lines 2-3,
and add overlap-free constraints only if indeed necessary as in the
lines 11-14.
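The lazy step in lines 11-14 of Algorithm 2 needs two ingredients: detecting which virtual blocks overlap in the current LP solution, and emitting one separation constraint per overlapping pair. A minimal sketch follows. The block representation (lower-left corner plus width and height) and the rule for picking the separation axis (the axis with the smaller overlap) are assumptions of this sketch; the actual constraint form is defined in [18], [21].

```python
from itertools import combinations


def overlapping_pairs(blocks):
    """blocks: name -> (x, y, w, h), with (x, y) the lower-left corner."""
    pairs = []
    for (a, (ax, ay, aw, ah)), (b, (bx, by, bw, bh)) in combinations(blocks.items(), 2):
        if ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah:
            pairs.append((a, b))
    return pairs


def separation_constraint(blocks, a, b):
    """Return (first, second, axis), meaning: first must end before second starts.

    Assumed heuristic: separate along the axis with the smaller overlap and
    keep the current relative order of the two block centers.
    """
    ax, ay, aw, ah = blocks[a]
    bx, by, bw, bh = blocks[b]
    overlap_x = min(ax + aw, bx + bw) - max(ax, bx)
    overlap_y = min(ay + ah, by + bh) - max(ay, by)
    if overlap_x <= overlap_y:
        first, second = (a, b) if ax + aw / 2 <= bx + bw / 2 else (b, a)
        return (first, second, "x")  # x_first + w_first <= x_second
    first, second = (a, b) if ay + ah / 2 <= by + bh / 2 else (b, a)
    return (first, second, "y")      # y_first + h_first <= y_second


# Hypothetical LP solution with one overlapping pair.
blocks = {"V1": (0.0, 0.0, 2.0, 2.0),
          "V2": (1.0, 0.5, 2.0, 2.0),
          "V3": (5.0, 5.0, 1.0, 1.0)}
T = set()
for a, b in overlapping_pairs(blocks):
    T.add(separation_constraint(blocks, a, b))
print(T)  # {('V1', 'V2', 'x')}
```

In the full loop, the constraints collected in `T` would be added to the LP and the problem solved again until `overlapping_pairs` returns an empty list.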
D. Local Latch Placement
Once global latch placement is done, we start local latch place-
ment to minimize the total datapath wirelength of a DFG such as
Algorithm 3 Local Latch Placement
Require: Dataflow Graph DFG = (V, E), Ordered Set Q
 1: for each cluster c ∈ Q in an ordered way do
 2:   T = Create maximum slots within Vc
 3:   Network-flow Graph N = (O, F)
 4:   for each object i ∈ Mc do
 5:     Add a node ni to O as a source (+1)
 6:   end for
 7:   for each slot t ∈ T do
 8:     Add a node nt to O as a sink (−1)
 9:   end for
10:   for each object i ∈ Mc do
11:     for each slot t ∈ T do
12:       Add an edge (i, t) with c(i, t) = FlowCost(i, t)
13:     end for
14:   end for
15:   Solve Min-cost Network-flow Problem for N
16:   Place/Fix all the latches in Mc
17: end for

Algorithm 4 FlowCost
Require: a latch i, a slot s
 1: cost = 0
 2: (xi, yi) = (x, y) coordinate of s
 3: O = a set of connected objects to i
 4: for each object o ∈ O do
 5:   if o is placed/fixed then
 6:     cost += |xi − xo| + |yi − yo|
 7:   end if
 8: end for
 9: return cost
[Figure: (a) Latch slots within a block; (b) Network flow graph; (c) c(j, 0) = 2.0; (d) c(j, 2) = 3.5; (e) c(k, 2) = 2.5; (f) Final latch placement]
Fig. 4. Local latch placement: example of c2.
Fig. 1 (b). We process one cluster at a time in the order given by
Algorithm 1. For example, c4 and c2 will be processed first and last,
respectively. Each local latch placement is driven by min-cost network
flow optimization.
Algorithm 3 describes our local latch placement which is for-
mulated as an assignment problem for each latch cluster, similarly
to [4]. As in the line 2, we first create as many slots as possible
within a virtual block. Consider the example in Fig. 4 where local
latch placements for c4,1,3 have been completed and the last cluster,
the local placement of c2 is about to begin. Let us assume that V2 is
partitioned into five slots (0− 4 in gray) as in (a) such that each slot
can accommodate one latch. Also, note that the coordinate of each
slot is known, as the (x2, y2) is computed by Algorithm 2. Then, we
start building a network flow graph in the lines 3-14. In detail, every
latch becomes a source and every slot becomes a sink so that a latch
can flow into one of the slots, as in the lines 4-9. Fig. 4 (b) shows
the nodes in the network flow graph for c2. Next, as in the line 12,
we create an edge with a unit capacity from every latch to every slot
with a cost based on Algorithm 4.
[Figure: (a) Turning datapath; (b) Diagonal latches; (c)-(e) latch stacking example]
Fig. 5. Handling complex datapath.

Algorithm 4 essentially computes what-if cost in terms of datapath
wirelength by trying a latch at various slots. Fig. 4 (c) illustrates
that we can compute the cost of assigning the latch j to the slot 0,
c(j, 0) by assuming j at the slot 0 and accumulating the displacements
from j to all other connected and already-planned objects in the DFG
from Fig. 1 (b). In detail, c(j, 0) = 2.0, because the distances from
the slot 0 to d, e, q are 0.5, 1.5 and 0, respectively. Note that we only
use vertical displacement as distance for simplicity without loss of
generality in this example. Also, recall that d, e are fixed pins and
q has already been placed/fixed while processing c3 (see the order in
Table II). By applying the same procedure, we get c(j, 2) = 3.5
as in Fig. 4 (d), which indicates that the slot 2 is worse than the
slot 0 for j. In fact, Fig. 4 (e) shows that k is a better fit for the
slot 2. Once we complete building a network flow graph for every
possible pair of a latch and a slot for a given latch cluster, we can
find latch locations with the minimal datapath wirelength using min-cost
network optimization as in the line 15 of Algorithm 3. Since
Vc is bigger than the total latch footprint of Mc and Vi does not
overlap with Vj (i ≠ j), our min-cost network optimization problem
is always feasible and optimally solvable [9].
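The cost computation of Algorithm 4 and the latch-to-slot assignment can be illustrated with the short sketch below. It is not a min-cost network flow solver: for brevity the assignment is found by exhaustive search over slot permutations, which gives the same optimum on tiny clusters but does not scale. All coordinates are assumptions picked so that the costs for j match c(j, 0) = 2.0 and c(j, 2) = 3.5 from the example; the connectivity of k (to e and f) is hypothetical.

```python
from itertools import permutations


def flow_cost(slot_xy, neighbors, placed):
    """Algorithm 4: Manhattan distance from a slot to placed/fixed neighbors."""
    x, y = slot_xy
    return sum(abs(x - placed[o][0]) + abs(y - placed[o][1])
               for o in neighbors if o in placed)


def assign_latches(latches, slots, neighbors, placed):
    """Give each latch a distinct slot with minimum total cost.

    Exhaustive search stands in for the min-cost flow step of Algorithm 3.
    """
    best = None
    for perm in permutations(slots, len(latches)):
        total = sum(flow_cost(slots[s], neighbors[l], placed)
                    for l, s in zip(latches, perm))
        if best is None or total < best[0]:
            best = (total, dict(zip(latches, perm)))
    return best


# Assumed coordinates (vertical displacement only, as in the example).
placed = {"d": (0.0, -0.5), "e": (0.0, 1.5), "q": (0.0, 0.0), "f": (0.0, 2.0)}
slots = {0: (0.0, 0.0), 1: (0.0, 0.75), 2: (0.0, 1.5)}
neighbors = {"j": ["d", "e", "q"], "k": ["e", "f"]}  # k's edges are hypothetical

print(flow_cost(slots[0], neighbors["j"], placed))  # 2.0
print(flow_cost(slots[2], neighbors["j"], placed))  # 3.5
print(assign_latches(["j", "k"], slots, neighbors, placed))
# (2.5, {'j': 0, 'k': 2})
```

Neighbors that are not yet placed are simply skipped, which mirrors the "placed/fixed" test in line 5 of Algorithm 4.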
IV. COMPLEX DATAPATH
In this section, we will discuss how LatchPlanner handles com-
plex datapath in detail. First, we assumed that all the latches have
identical footprint (e.g., W,H) in Section III, but LatchPlanner
handles various latch sizes simultaneously. For a cluster c, we can find
the largest width and height among Mc, and use them to determine
(wc, hc). In addition, although all the examples in Section III illustrate
the case of horizontal datapath only for simplicity, LatchPlanner
can handle complex datapath by reserving larger flexibility in terms
of latch cluster dimension so that nice datapath alignment can be
accomplished during local latch placement.
Fig. 5 illustrates how LatchPlanner handles complex datapath by
optimizing the dimension of each cluster based on Algorithm 1. Fig. 5
(a) shows the case where a datapath starts from the left-bottom corner
and ends at the top-right corner, making one turn in the middle. Due
to the higher physical certainty, V1 and V3 follow the dimensions of
two pin clusters. And then, we will set w2 ← w3 and h2 ← h1,
as we take the maximum length of all neighbors for each direction.
Such a large dimension of V2 allows the local latch placement in
Section III-D to have great flexibility and Algorithm 3 to find the
ideal location for each latch in terms of datapath alignment. Fig. 5
(b) shows a possible latch placement of V2 when there are no other
placement constraints: latches are diagonally placed in order to adjust
to the turning datapath.
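The sizing rule used above (take, per direction, the largest edge count to any marked neighbor) can be sketched as follows. Two caveats: edges are counted per neighboring cluster, which is one reading of Algorithm 1 that agrees with the h4 = 3 example in Section III-B, and the coordinates, object names, and the function name `resize_block` are hypothetical.

```python
def resize_block(latches, neighbors, edges, pos, Dw, Dh, W=1.0, H=1.0,
                 w0=0.0, h0=0.0):
    """Grow (ws, hs) of the selected cluster from edges to marked neighbors.

    latches:   objects of the selected cluster s
    neighbors: list of object lists, one per already-marked cluster
    edges:     set of frozenset({i, j}) DFG edges
    pos:       object -> (x, y) input coordinates
    """
    ws, hs = w0, h0
    for objs in neighbors:
        n_vert = n_horiz = 0
        for i in latches:
            for j in objs:
                if frozenset((i, j)) not in edges:
                    continue
                # Normalize by the design outline (line 26 of Algorithm 1).
                dx = abs(pos[i][0] - pos[j][0]) / Dw
                dy = abs(pos[i][1] - pos[j][1]) / Dh
                if dx > dy:
                    n_horiz += 1
                else:
                    n_vert += 1
        ws = max(ws, n_vert * W)   # vertical edges need width
        hs = max(hs, n_horiz * H)  # horizontal edges need height
    return ws, hs


# Hypothetical turning datapath: V1 on the left, V3 on top, V2 in between.
v1 = ["a1", "a2", "a3", "a4"]
v3 = ["t1", "t2", "t3", "t4"]
v2 = ["b1", "b2", "b3", "b4"]
pos = {}
for k in range(1, 5):
    pos[f"a{k}"] = (1.0, float(k))
    pos[f"t{k}"] = (5.0 + k, 9.0)
    pos[f"b{k}"] = (6.0, 3.0)
edges = {frozenset((f"b{k}", f"a{k}")) for k in range(1, 5)}
edges |= {frozenset((f"b{k}", f"t{k}")) for k in range(1, 5)}

print(resize_block(v2, [v1, v3], edges, pos, Dw=10.0, Dh=10.0))  # (4.0, 4.0)
```

With four horizontal edges to V1 and four vertical edges to V3, the block for V2 becomes four latches wide and four latches tall, which leaves room for the diagonal arrangement in Fig. 5 (b).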
It is often the case that pins are stacked when there are a
large number of inputs/outputs. For such cases, we can stack latches
accordingly for better datapath-aware placement. Although we do
not detail stacking for the sake of simplicity, LatchPlanner can
automatically compute the required number of stacks during latch
cluster sizing in Section III-B. Fig. 5 shows an example of stacking
in LatchPlanner. If a latch cluster has a neighbor with stacked objects
as in (c), we grow the size of the virtual block such that the slots
for local latch placement can be stacked as well, as in (d). Then,
the network flow optimization will automatically find a better latch
placement across all stacks as in (e).

TABLE III. QUALITY OF FINAL PLACEMENT COMPARISON.
ckt                 | LatchPlanner (Ours)                       | Baseline (BL)                             | Semi-Custom (SC)
id            #lat^a| TWL^b    WNS^c  TNS^d   Area^e  cpu(s)^f  | TWL      WNS    TNS     Area    cpu(s)    | TWL      WNS    TNS     Area    cpu(s)
d1            35    | 18.8K    -8.5   -868    2.4K    1024      | 22.0K    -9.5   -921    2.4K    834       | N/A
d2            42    | 64.5K    -55.5  -1290   3.2K    1209      | 89.3K    -54.2  -1859   3.3K    1310      | N/A
d3            88    | 56.6K    -11.9  -2339   6.3K    1201      | 72.5K    -12.0  -2555   6.5K    1265      | N/A
d4            94    | 66.6K    -12.8  -2452   6.5K    1198      | 83.3K    -15.6  -2647   6.6K    1353      | N/A
d5            119   | 141.7K   -4.8   -91     10.1K   4562      | 138.2K   -4.5   -87     10.4K   3775      | N/A
d6            156   | 141.1K   -11.6  -576    13.0K   2390      | 152.5K   -2.4   -320    12.9K   2373      | N/A
d7            204   | 192.6K   -4.8   -131    18.0K   7912      | 205.5K   -6.3   -139    17.7K   5958      | N/A
d8            349   | 561.2K   -11.2  -747    29.8K   5413      | 635.8K   -12.6  -1268   30.8K   6592      | N/A
d9            360   | 327.0K   -25.4  -4839   30.6K   20047     | 406.8K   -23.3  -4815   31.0K   23559     | N/A
d10           337   | 440.1K   -45.4  -2902   40.1K   20429     | 497.9K   -44.2  -3180   42.0K   17171     | N/A
d11           321   | 372.8K   -16.1  -1136   34.4K   8584      | 507.7K   -15.7  -1449   35.4K   8854      | 363.0K   -13.9  -1052   33.6K   8300
d12           640   | 1347.0K  -26.6  -3798   86.4K   10178     | 1610.5K  -28.8  -5863   87.7K   10411     | 1355.6K  -23.1  -5146   85.5K   9929
d13           302   | 330.5K   -49.8  -3994   31.6K   8578      | 488.2K   -47.3  -4469   32.6K   10254     | 299.2K   -48.4  -3884   30.8K   8725
d14           843   | 1066.3K  -22.7  -5056   73.1K   9266      | 1327.4K  -24.1  -6703   74.6K   9888      | 1068.4K  -23.2  -5156   72.8K   9621
d15           490   | 739.0K   -20.8  -3355   52.8K   9182      | 944.2K   -18.1  -4136   55.4K   9347      | 707.6K   -20.8  -3345   52.5K   10607
d16           442   | 733.9K   -18.5  -4620   52.7K   8797      | 963.0K   -17.6  -6182   55.9K   9564      | 717.8K   -19.6  -4422   51.9K   8998
d17           1947  | 2099.5K  -22.9  -7109   126.5K  13458     | 3208.5K  -25.0  -10356  135.4K  14983     | 2315.4K  -27.2  -7518   131.8K  13684
d18           2155  | 2627.0K  -21.0  -8966   133.2K  15632     | 3577.2K  -36.3  -10697  135.7K  16823     | 2511.5K  -11.8  -7719   129.4K  15563
sum d1–d18          | 11326K   -390.3 -54.2K  750.6K  149.1K    | 14931K   -397.5 -67.7K  776.6K  154.3K    | N/A
ratio d1–d18        | 1        1      1       1       1         | 1.32     1.02   1.25    1.04    1.03      | N/A
sum d11–d18         | 9316K    -198.4 -38.0K  590.6K  83.7K     | 12627K   -212.9 -49.9K  612.8K  90.1K     | 9338K    -188.0 -38.2K  588.4K  85.4K
ratio d11–d18       | 1        1      1       1       1         | 1.36     1.07   1.31    1.04    1.08      | 1.00     0.95   1.01    1.00    1.02
a: the number of latches. b: the total wirelength in tracks. c: the worst negative slack (ps). d: the total negative slack (ps). e: the total gate
area in the smallest inverter size. f: the total runtime of the whole design flow including placement and timing optimization.
V. EXPERIMENTAL RESULTS
We implemented our latch placement algorithm, LatchPlanner,
in C++. All the experiments were performed on a 2.4GHz Linux
machine. We used CLP [13] as an LP solver and implemented a
min-cost network flow solver [9], [11] for LatchPlanner. We used an
in-house placement engine which handles multi-million mixed-size
objects and supports the state-of-the-art analytical techniques [17].
For comparison purpose, we prepared three different design
flows, LatchPlanner, Baseline (BL), and Semi-Custom (SC). Fig. 6
shows the simplified view of each flow. In all three design flows,
we iteratively perform placement and timing optimization such as
gate-sizing/buffering/Vt-assignment to obtain more reliable timing
statistics. While SC starts with latches fixed, both LatchPlanner
[Figure: (a) LatchPlanner: netlist with DFG, placement, LatchPlanner fixes latches, then placement and timing optimization for 1..N; (b) Baseline: netlist, placement, latch clustering fixes latches, then placement and timing optimization for 1..N; (c) Semi-Custom: netlist, manual latch placement, then placement and timing optimization for 1..N]
Fig. 6. Three different flows for experiments.
[Figure: (a) LatchPlanner; (b) Semi-Custom]
Fig. 7. Placement images from LatchPlanner and Semi-Custom: latches are horizontally placed in a structured fashion for vertical datapath.
and BL start with floating latches. Then, after the first placement
(which becomes the initial placement for LatchPlanner),
BL clusters latches and fixes them so as to minimize the
disturbance to the current placement snapshot [1], [6], while
LatchPlanner clusters and fixes latches so as to optimize the
datapath. In addition, BL implicitly optimizes obvious datapaths between
pins and gates by assigning higher net weights to them for better
datapath alignment. To evaluate our approach, we collected
18 industrial datapath-oriented designs in the 32nm node; 8 of
them (d11–d18) came with manual latch placement data created by
highly skilled human designers, so SC results are available for them. We
used the technique in [25] to obtain the DFGs for all the benchmarks.
We wished to use the public benchmarks [12] to compare with prior
art but could not, because they are not suitable for LatchPlanner:
they either carry no information on sequential elements or are too
simple/small to reflect the characteristics of real datapath-oriented
macros [7], [12], [23]. Nonetheless, we believe that the experiments
on 18 benchmark circuits, together with the comparison against SC, which
serves as a practical engineering lower bound, make our evaluation
sufficiently valid and convincing.
Table III demonstrates the effectiveness of LatchPlanner where
two summaries are shown at the bottom: one for all benchmarks (d1–
d18) and the other for d11–d18 with SC results. It clearly shows that
LatchPlanner and SC work better than BL, which implies that carefully
planned latches can be beneficial for datapath-oriented designs.
Compared with BL, LatchPlanner improves TWL by 32% and
TNS by 25%. The improvement
on the larger benchmarks (d11–d18) is even greater, 36% for TWL and 31%
for TNS, which indicates that LatchPlanner works better for larger and
more complex datapath-dominated designs. LatchPlanner is highly
comparable to SC, yet requires no labor-intensive human input. Again,
the numbers from SC are the practical engineering lower bounds, and
LatchPlanner delivers near-lower-bound solutions for d11–d18. We also
observed that well-planned latches can enhance the convergence
of the iterative placement and timing optimization: LatchPlanner
runs about 3% faster than BL. Note that LatchPlanner is one small
procedure (a fraction of the total runtime), as shown in Fig. 6. SC is as
fast as LatchPlanner, but LatchPlanner offers a substantial productivity
enhancement over the entire design cycle, as a designer no
longer needs to create a manual latch placement. We did not include
the time spent in crafting manual latch placement for SC in Table III.
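As a reading aid for the percentages above, the following sketch shows one plausible way to compute such improvements. It is an assumption, not the paper's stated procedure: the text does not say whether the summary averages per-design ratios or compares summed totals, and the helper names and numbers below are hypothetical rather than taken from Table III.

```python
def percent_improvement(baseline, candidate):
    """Relative reduction in the magnitude of a lower-is-better metric, in percent.

    Magnitudes are used so that negative-valued metrics such as TNS
    are handled the same way as TWL.
    """
    base, cand = abs(baseline), abs(candidate)
    if base == 0:
        raise ValueError("baseline metric must be non-zero")
    return 100.0 * (base - cand) / base


def mean_improvement(results):
    """results: {design: (baseline_value, candidate_value)}.

    Averages the per-design improvements (an assumed aggregation rule).
    """
    values = [percent_improvement(b, c) for b, c in results.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical TWL values for two made-up designs (arbitrary units).
    twl = {"dA": (100.0, 70.0), "dB": (200.0, 130.0)}
    print(mean_improvement(twl))
```

Averaging per-design ratios weights every benchmark equally, whereas comparing summed totals would let the largest designs dominate; the two rules can give noticeably different summary numbers.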
Fig. 7 compares the placement images from LatchPlanner and
SC. We can observe that there are differences in placement, but
LatchPlanner places latches in a comparably structured way to SC.
VI. CONCLUSION
We propose a novel algorithm, LatchPlanner, to optimize
datapath-oriented design placement. The latches planned by LatchPlanner
enable a placement engine to deliver highly structured and
datapath-aware placement. We apply LatchPlanner to industrial benchmarks
and demonstrate through comprehensive experiments that LatchPlanner is
efficient and effective in handling datapath-oriented designs.
REFERENCES
[1] C. Alpert et al. Latch Clustering with Proximity to Local Clock Buffers. In U.S. Patent Pending 20120110532, 2012.
[2] C. Alpert, A. Kahng, G.-J. Nam, S. Reda, and P. Villarrubia. A semi-persistent clustering technique for VLSI circuit placement. In Proc. Int. Symp. on Physical Design, 2005.
[3] S. R. Arikati and R. Varadarajan. A Signature based Approach to Regularity Extraction. In Proc. Int. Conf. on Computer Aided Design, 1997.
[4] U. Brenner, A. Pauli, and J. Vygen. Almost optimum placement legalization by minimum cost flow and dynamic programming. In Proc. Int. Symp. on Physical Design, 2004.
[5] T. Chan, A. Chowdhary, B. Krishna, A. Levin, G. Meeker, and N. Sehgal. Challenges of CAD Development of Datapath Design. In Intel Technology Journal, 1999.
[6] M. Cho et al. Converged Large Block and Structured Synthesis for High Performance Microprocessor Designs. In U.S. Patent 8271920, 2012.
[7] S. Chou, M.-K. Hsu, and Y.-W. Chang. Structure-aware placement for datapath-intensive circuit designs. In Proc. Design Automation Conf., 2012.
[8] A. Chowdhary and R. Gupta. A Methodology for Synthesis of Data Path Circuits. In IEEE Design & Test of Computers, 2002.
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.
[10] J. Friedrich et al. Design methodology for the IBM POWER7 microprocessor. IBM J. Res. Dev., 55(3):294–307, May 2011.
[11] A. V. Goldberg. An Efficient Implementation of a Scaling Minimum-Cost Flow Algorithm. Journal of Algorithms, 22:1–29, 1997.
[12] http://www.cerc.utexas.edu/utda/download/DP.
[13] http://www.coin-or.org/projects/Clp.xml.
[14] P. Ienne and A. Grießing. Practical experiences with standard-cell based datapath design tools: do we really need regular layouts? In Proc. Design Automation Conf., 1998.
[15] T. Kutzschebauch and L. Stok. Efficient Logic Optimization Using Regularity Extraction. In Proc. IEEE Int. Conf. on Computer Design, 2000.
[16] T. Kutzschebauch and L. Stok. Regularity Driven Logic Synthesis. In Proc. Int. Conf. on Computer Aided Design, 2000.
[17] I. L. Markov, J. Hu, and M.-C. Kim. Progress and Challenges in VLSI Placement Research. In Proc. Int. Conf. on Computer Aided Design, 2012.
[18] M. D. Moffitt, A. N. Ng, I. L. Markov, and M. E. Pollack. Constraint-driven floorplan repair. In Proc. Design Automation Conf., 2006.
[19] R. Namballa, N. Ranganathan, and A. Ejnioui. Control and data flow graph extraction for high-level synthesis. In Proc. IEEE Annual Symp. on VLSI, 2004.
[20] R. X. T. Nijssen and J. A. G. Jess. Two-dimensional datapath regularity extraction. In IFIP Workshop on Logic and Architecture Synthesis, 1996.
[21] S. Reda and A. Chowdhary. Effective Linear Programming based Placement Methods. In Proc. Int. Symp. on Physical Design, 2006.
[22] A. P. E. Rosiello, F. Ferrandi, D. Pandini, and D. Sciuto. A Hash-based Approach for Functional Regularity Extraction During Logic Synthesis. In IEEE Computer Society Annual Symposium on VLSI, 2007.
[23] S. Ward, D. Ding, and D. Z. Pan. PADE: A High-Performance Placer with Automatic Datapath Extraction and Evaluation through High Dimensional Data Learning. In Proc. Design Automation Conf., 2012.
[24] S. Ward, M.-C. Kim, N. Viswanathan, Z. Li, C. Alpert, E. E. Swartzlander, Jr., and D. Z. Pan. Keep it straight: teaching placement how to better handle designs with datapaths. In Proc. Int. Symp. on Physical Design, 2012.
[25] H. Xiang, M. Cho, H. Ren, M. Ziegler, and R. Puri. Network Flow Based Datapath Bit Slicing. In Proc. Int. Symp. on Physical Design, 2013.
[26] J. Z. Yan, C. Chu, and W.-K. Mak. SafeChoice: a novel clustering algorithm for wirelength-driven placement. In Proc. Int. Symp. on Physical Design, 2010.