[ieee 2013 ieee/acm international conference on computer-aided design (iccad) - san jose, ca, usa...

LatchPlanner: Latch Placement Algorithm for Datapath-oriented High-Performance VLSI Designs

Minsik Cho, Hua Xiang, Haoxing Ren, Matthew M. Ziegler, Ruchir Puri

IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

{minsikcho,huaxiang,haoxing,zieglerm,ruchir}@us.ibm.com

Abstract—In this paper, we present LatchPlanner, a novel latch placement algorithm that enables a placement engine to deliver high-quality placement for datapath-oriented designs. Datapath-oriented VLSI designs are in general hand-crafted by human designers at high cost, as understanding and capturing the datapath structure is critical for performance. Conventional placement algorithms by themselves cannot exploit the underlying datapath due to the lack of logic structure recognition and inaccurate/approximated wirelength estimation. LatchPlanner addresses such drawbacks by placing and fixing latches, a key element of the datapath structure, in the datapath context. By taking placed/fixed latches as constraints, a placer can find a more datapath-friendly placement effectively, which results in higher-quality hardware. LatchPlanner begins with latch clustering/sizing/ordering to prepare the following steps: a) global latch placement based on linear programming to place latch clusters, and b) local latch placement based on network flow optimization to place latches within each cluster. Experimental results on eighteen industrial benchmarks show that LatchPlanner improves total wirelength by 32%, total negative slack by 25%, and area by 3% without CPU overhead over a commercial placement engine, and delivers near semi-custom-quality solutions.

I. INTRODUCTION

In the high-end VLSI design or microprocessor domain, chip integration becomes increasingly challenging due to tight area/power budgets combined with highly aggressive performance targets. Therefore, a hierarchical approach has been widely adopted to reduce the complexity, where the entire design is divided into multiple macros [10] and each macro is optimized separately. Among these macros, one may be fully synthesized without human intervention, while another, most often a datapath-oriented macro, is manually crafted by human designers due to its timing/area/power criticality [7], [17]. Such datapath-oriented macros, commonly found in high-end VLSI systems (e.g., muxing, buffering, butter-flying, rotating, and so on), are completed with significant effort by either full-custom or semi-custom design flows, as conventional synthesis algorithms, especially placement, are not well suited for such macros due to the inherent gap between the HPWL and the Steiner wirelength models [5], [14], [23]–[25]. Thus, considering ever-growing design complexity/functionality, shortened design turn-around time, and the cost of custom macros, more automated synthesis schemes for datapath-oriented macros are in greater demand than ever.

Many approaches have been proposed to handle datapath-oriented

design. Logic synthesis level optimization of datapath is described

in [16]. Enforcing datapath structure through a logic/physical syn-

thesis flow is proposed in [5]. Traditional analytical placement

techniques are extended to handle datapath structures in [7]. Graph

automorphism-based datapath extraction and ILP-based bitstacking

selection are proposed to amend the existing HPWL-driven place-

ment [23]. However, datapath consideration in logic synthesis phase

alone is not sufficient [16], and restricting timing optimization (e.g.,

sizing/buffering) on datapath can degrade the overall quality of

HW [5], [25]. Also, capturing global dataflow rather than local

isomorphism and analyzing the impact on timing due to datapath

optimization are essential [7], [23], [24].

While previous approaches have the aforementioned limitations, latch placement is an already proven, key technique in a semi-custom design methodology for handling datapath-oriented designs (see Section II-A for details on the semi-custom design methodology). Structurally placed latches not only encourage regular bitstacks but also enable compact gate placement between latches, which results in better timing and shorter wirelength. Although a semi-custom solution with fixed latches requires relatively little human effort yet provides the benefits of a full-custom scheme, manually planning latches can still be cost-prohibitive and labor-intensive. For example, for a large and complex datapath macro, the number of latches to place and fix can be too large, or fully understanding the logical and physical characteristics of the design for manual latch placement may be too difficult.

Therefore, we propose an automatic latch placement algorithm, LatchPlanner, to address such challenges in the semi-custom flow. LatchPlanner relies on the dataflow graph extracted from logic synthesis, where pins and latches become nodes and datapaths become edges. The key idea is that, since a datapath is a bitstream channel between latches and pins, properly staging latches in advance with the dataflow taken into account will result in a more datapath-friendly placement. With the dataflow graph, LatchPlanner structurally places and fixes latches using linear programming (LP) and min-cost network flow optimization, which provides structured guidance to a placement engine. Comprehensive experimental results demonstrate that LatchPlanner can improve conventional placement significantly and deliver placement solutions highly comparable to semi-custom results. The major contributions of this paper include the following:

• We propose LatchPlanner, which places and fixes latches in

a datapath-friendly fashion. LatchPlanner is highly comple-

mentary to any existing placement engine or design flow.

• We propose to use dataflow graph to optimize datapath

in VLSI designs, in order to optimize overall datapath

wirelength and the locations of the key datapath element,

latches.

• We compare LatchPlanner with the industrially proven and

qualitatively golden solutions from a semi-custom method-

ology, and show that LatchPlanner can be highly effective

for datapath-oriented VLSI designs.

The rest of the paper is organized as follows. Section II provides

preliminaries and introduction to semi-custom design methodology.

Section III presents LatchPlanner. In-depth discussion on complex

datapath is in Section IV. Experimental results are in Section V,

followed by the conclusion in Section VI.

978-1-4799-1071-7/13/$31.00 ©2013 IEEE

[Figure 1: three panels showing pins a–f (cluster ci) and x–z (cluster co), latches g–v, and latch clusters c1–c4. (a) Datapath-oriented netlist; (b) Extracted dataflow graph (DFG); (c) Input placement.]

Fig. 1. Inputs to LatchPlanner: pins are in a diamond and latches are in a box.

II. PRELIMINARIES

A. Semi-Custom Design Methodology

In general, full-custom design practice can potentially deliver the best-quality HW, but it scales poorly with design size and requires high development cost (e.g., design time) [24]. On the other hand, an ASIC-style full-synthesis approach can deliver very large-scale HW in a shorter time at lower cost, yet the quality of the HW tends to be worse than that of the full-custom approach. As a middle ground, the semi-custom design methodology is widely used to deliver good-quality, large-scale HW at affordable cost. The key idea in the semi-custom approach is that human designers make the critical design decisions/optimizations manually and leave the rest of the work to tools. For example, an experienced designer can place/fix a few critical gates/IPs, perform manual gate sizing locally, assign some known timing-critical signals to higher metal layers, or even route a few nets in highly congested regions. Among such techniques, manual latch (or flip-flop) placement in a structured fashion is a popular and proven technique to achieve high performance and lower clock power dissipation [6], [10]. Effectively, the placed and fixed latches become placement anchors or constraints that guide a placement engine to produce higher-quality results [24].

B. Datapath Extraction and Dataflow Graph

TABLE I. THE NOTATIONS IN THIS PAPER.

Dw        the width of a design
Dh        the height of a design
C         the set of latch clusters (indexed by c)
P         the set of pin clusters (indexed by p)
Mc        the set of objects in the cluster c
Vc        the virtual block from the cluster c
wc        the width of Vc
hc        the height of Vc
W         the width of a latch
H         the height of a latch
(xi, yi)  the input coordinate of an object i
(xi, yi)  the coordinate of an object i
(xc, yc)  the coordinate of Vc

Datapath extraction translates the regularity/similarity inherent in datapath circuits into mathematical information. Many ideas such as templates, signatures, hashing, and machine learning have been proposed to find a set of similar subgraphs in a netlist under isomorphism [3], [8], [14], [15], [20], [22], [25]. Once datapaths are identified, a dataflow graph (DFG) can be constructed, which is a graph representation of the flow of data through key circuit elements including latches (or flip-flops) and pins. For example, from a given netlist as in Fig. 1 (a), we can detect a set of similar logic between latches/pins and replace it with edges to create the DFG in Fig. 1 (b), where pins/latches become nodes. A DFG has only the pins/latches on the datapath as nodes. The advantage of using a DFG is that it captures a global view of the datapath logic and enables more comprehensive datapath optimization. Also, a DFG is highly suitable for global-scale optimization such as latch placement. Please refer to [19], [25] for more information.
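As a rough illustration of how pins and latches become DFG nodes while the logic between them collapses into edges, the following sketch walks a toy netlist. It is a connectivity-only simplification and not the regularity-based extraction of [25]; the `fanout` dictionary format, the function name `extract_dfg`, and the toy netlist are assumptions made for this example.

```python
from collections import deque

def extract_dfg(fanout, endpoints):
    """Collapse combinational logic between pins/latches into DFG edges.

    fanout    -- dict: node name -> list of nodes it drives (hypothetical format)
    endpoints -- set of pin and latch names, i.e. the DFG nodes
    """
    edges = set()
    for src in endpoints:
        seen = set()
        queue = deque(fanout.get(src, []))
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            if node in endpoints:
                # Reached the next pin/latch: record one DFG edge and stop here.
                edges.add((src, node))
            else:
                # Combinational gate: keep walking through its fanout.
                queue.extend(fanout.get(node, []))
    return edges

# Toy netlist (hypothetical): pin a -> gate g1 -> latch g -> gates g2, g3 -> latch n
fanout = {"a": ["g1"], "g1": ["g"], "g": ["g2"], "g2": ["g3"], "g3": ["n"]}
dfg_edges = extract_dfg(fanout, endpoints={"a", "g", "n"})
```

In this toy case the two gates between g and n disappear and only the edges (a, g) and (g, n) remain, which is the global view of the datapath that the later steps operate on.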

III. LATCHPLANNER

In this section, we propose LatchPlanner, a datapath-aware latch

placement algorithm to handle datapath-oriented designs. The key

insight behind optimizing latch placement in the context of datapath is

that datapath is essentially structured delivery of bitstreams from a set

of latches to another set of latches (commonly called pipeline stages).

Therefore, smartly anchoring latches in advance with datapath taken

into consideration can accomplish desirable layout structures and

improve the quality of placement results. And, we find that DFG [25]

is a natural way of modeling datapath among pins and latches for

datapath-aware physical design. Fig. 2 illustrates the overview of

LatchPlanner in the context of a modern VLSI synthesis flow where

the steps of LatchPlanner are in solid boxes. LatchPlanner accepts a

DFG (Fig. 1 (b)) extracted from a logic netlist and an input placement

(Fig. 1 (c)), and then plans (places and fixes) datapath latches. It

is possible to run placement again with fixed latches to reflect our

datapath optimization in determining other cell locations.

LatchPlanner starts with latch/pin clustering to define the gran-

ularity of datapath-aware latch placement as in Section III-A. Latch

cluster sizing/ordering translates logical datapath information to phys-

ical information for the placement purpose as in Section III-B. Once

latch clusters are fully defined physically, we secure spaces for the

latches in each cluster during global latch placement in Section III-C.

Finally, the latches are placed and fixed during local latch placement

in Section III-D. The placed/fixed latches will guide the following

placement by anchoring non-datapath gates to planned latches, in

order to generate a datapath-friendly placement.

[Figure 2: flow diagram. Inputs: Dataflow Graph and Input Placement. Steps (Place and Fix Latches): Latch/Pin Clustering → Latch Cluster Sizing/Ordering → Global Latch Placement → Local Latch Placement, followed by Placement with Preplaced Latches.]

Fig. 2. LatchPlanner in VLSI design flow.

Algorithm 1 Latch Cluster Ordering and Sizing

Require: Dataflow Graph DFG = (V, E)
 1: An ordered set Q = ∅
 2: for each cluster c ∈ C do
 3:     Create Vc with wc = ⌈√|Mc|⌉ ∗ W, hc = ⌈√|Mc|⌉ ∗ H
 4: end for
 5: Mark all pins in Mp, ∀p ∈ P
 6: repeat
 7:     for each cluster c ∈ C − Q do
 8:         num_edge = 0
 9:         for each object i ∈ Mc do
10:             for each object j ∈ Mx, ∀x ∈ P ∪ Q do
11:                 if (i, j) ∈ E and j is marked then
12:                     num_edge++
13:                 end if
14:             end for
15:         end for
16:         rc = num_edge / |Mc|   // physical certainty
17:     end for
18:     Find a cluster s, rs ≥ rx, x ∈ C − Q
19:     Mark all latches in Ms
20:     for each object i ∈ Ms do
21:         for each cluster c ∈ P ∪ Q do
22:             num_vertical_edge = 0
23:             num_horizontal_edge = 0
24:             for each object j ∈ Mc do
25:                 if (i, j) ∈ E and j is marked then
26:                     if |xi − xj| / Dw > |yi − yj| / Dh then
27:                         num_horizontal_edge++
28:                     else
29:                         num_vertical_edge++
30:                     end if
31:                 end if
32:             end for
33:             ws = max(ws, num_vertical_edge ∗ W)
34:             hs = max(hs, num_horizontal_edge ∗ H)
35:         end for
36:     end for
37:     Q = Q ∪ {s}
38: until |C| == |Q|

A. Latch/Pin Clustering

In this section, we explain latch/pin clustering. The goal of clustering is to define sets of objects that will together form datapath-aware structures. We cluster pins and latches separately, as pins are fixed whereas latches are floating. Clustering is driven by object characteristics such as physical/logical proximity (based on the DFG), instance names, and so on. For example, two sets of latches at the ends of bus wires can form two clusters, as each cluster needs to be placed at one end of the bus so that the bus wires can be routed efficiently. Another example is in Fig. 1, where the clusters c1 (M1 = {g, h, i}) and c2 (M2 = {j, k, l, m}) are shown in dotted lines based on logical separation in (b) and physical separation in (c). Pins are also clustered into ci and co according to their physical separation (pin locations are known). In practice, clock domains also play a critical role in latch clustering, as latches in the same clock domain need to be placed close together. How to perform clustering is beyond the scope of this paper, but in our implementation logical/physical proximity and clock domain drive latch clustering, while physical proximity governs pin clustering. Please refer to [1], [2], [17], [26] for clustering in VLSI placement.

TABLE II. EXAMPLE OF LATCH CLUSTER SIZING/ORDERING.

iter.  s(a)  r1(b)  r2    r3     r4    h1(c)  h2  h3  h4
-      -     0      0     0      0     3      4   7   2
1      c4    4/3    5/4   0/7    3/2   3      4   7   3
2      c1    4/3    5/4   8/7    -     4      4   7   3
3      c3    -      5/4   11/7   -     4      4   8   3
4      c2    -      9/4   -      -     4      5   8   3

(a) selected latch cluster. (b) physical certainty. (c) assumed W = H = 1 (Table I) and all edges are horizontal for simplicity.

B. Latch Cluster Sizing and Ordering

Once clustering is completed, we create a virtual block Vc, ∀c ∈ C, which defines the physical space inside which Mc will be placed (see Section III-C). The purpose of sizing is to determine the dimension (wc, hc) of Vc such that the space inside Vc is big enough to achieve a legal datapath-aware latch placement. For a pin cluster, (wc, hc) is unnecessary as pins are fixed, but that of a virtual latch block needs to be determined. At the same time, we order all latch clusters by their physical certainty, which is defined as the ratio of the number of edges from a cluster to marked nodes in the DFG to the number of objects in the cluster. The more connections a latch has to marked (or fixed) nodes, the clearer it is where to place the latch. Hence, physical certainty prioritizes latch clusters so that we first optimize a latch cluster with more physical certainty (or less placement flexibility), in order to achieve better datapath alignment.
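The ordering by physical certainty can be sketched as follows. This is a minimal illustration that works on cluster-level edge counts rather than on individual DFG edges; the function name `order_clusters` and its argument layout are assumptions, and the edge counts are the ones implied by the certainty values listed in Table II.

```python
from fractions import Fraction

def order_clusters(sizes, pin_edges, cluster_edges):
    """Repeatedly pick the unselected cluster with the largest physical certainty.

    sizes         -- dict: latch cluster -> |Mc|
    pin_edges     -- dict: latch cluster -> number of DFG edges to (always marked) pins
    cluster_edges -- dict: (cluster, cluster) -> number of DFG edges between them
    """
    marked = set()   # clusters whose latches are already marked
    order = []       # the ordered set Q
    while len(order) < len(sizes):
        certainty = {}
        for c in sizes:
            if c in marked:
                continue
            n = pin_edges.get(c, 0)
            for (a, b), w in cluster_edges.items():
                # Only edges whose other end is already marked count.
                if (a == c and b in marked) or (b == c and a in marked):
                    n += w
            certainty[c] = Fraction(n, sizes[c])
        s = max(certainty, key=certainty.get)
        marked.add(s)
        order.append(s)
    return order

# Edge counts implied by Table II (cluster sizes 3, 4, 7, 2).
sizes = {"c1": 3, "c2": 4, "c3": 7, "c4": 2}
pin_edges = {"c1": 4, "c2": 5, "c3": 0, "c4": 3}
cluster_edges = {("c3", "c4"): 8, ("c1", "c3"): 3, ("c2", "c3"): 4}
order = order_clusters(sizes, pin_edges, cluster_edges)
```

With these counts the sketch reproduces the order c4, c1, c3, c2 of Table II. Exact fractions are used so that ties and near-ties are not affected by floating-point rounding.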

Algorithm 1 details cluster ordering and sizing, and Table II shows an example of latch sizing/ordering for Fig. 1 (b), where there are 4 latch clusters, c1, c2, c3, and c4. As in line 3 of Algorithm 1, we create a virtual block (Vc) such that it is just big enough to hold its latches (Mc). Also, we mark all the fixed nodes (e.g., pins) as in line 5. The first row of Table II shows the status after line 5 for Fig. 1 (b). Then, we compute the physical certainty rc, ∀c ∈ C, as the ratio of the number of edges to marked nodes to |Mc|, as in lines 7-16. For example, r4 = 3/2, as it has 3 edges to x, y (which are already marked in line 5) and M4 = {u, v}. r1, r2, and r3 are also shown in the second row of Table II.

Once all rc, ∀c ∈ C, are obtained, we find the cluster s with the largest physical certainty as in line 18, and then mark the latches in Ms as in line 19. In the first iteration, c4 is selected, as in the second row of Table II. For the selected cluster s, we can update (ws, hs) by respectively finding the largest number of vertical and horizontal edges to its marked neighbors as in lines 24-32. We normalize the horizontal and vertical distances by Dw and Dh in line 26 to avoid a wrong cluster size when a design has a highly skewed aspect ratio (e.g., a very tall or thin outline). If we assume c4 in Fig. 1 (b) has 3 horizontal edges to co, whose pins are already marked, we can set h4 = 3 as in Table II and complete the first iteration.
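The effect of the normalization in line 26 can be seen in a small sketch of the size update for one latch and one marked neighbor cluster (lines 22-34). The function name `grow_block` and the coordinates are illustrative assumptions; W = H = 1 as in Table II.

```python
def grow_block(ws, hs, latch_xy, neighbours_xy, Dw, Dh, W=1.0, H=1.0):
    """Update (ws, hs) from the edges of one latch to its marked neighbours."""
    n_vertical = n_horizontal = 0
    xi, yi = latch_xy
    for xj, yj in neighbours_xy:
        # Compare distances relative to the design outline, not in absolute units.
        if abs(xi - xj) / Dw > abs(yi - yj) / Dh:
            n_horizontal += 1
        else:
            n_vertical += 1
    return max(ws, n_vertical * W), max(hs, n_horizontal * H)

# A tall, thin design (hypothetical numbers): 100 wide, 1000 high.
new_size = grow_block(2.0, 2.0, latch_xy=(10.0, 500.0),
                      neighbours_xy=[(40.0, 600.0), (40.0, 450.0), (40.0, 520.0)],
                      Dw=100.0, Dh=1000.0)
```

Here the first neighbor is 30 units away horizontally and 100 units vertically, so an unnormalized comparison would call the edge vertical. Relative to the outline, the horizontal offset (0.3) dominates the vertical one (0.1), so all three edges count as horizontal and the block height grows to 3.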

At each iteration, we need to update the physical certainty as there are newly marked latches. For example, the third row in Table II shows that r3 has been updated to 8/7 because its neighbor c4 was selected in the previous iteration and all latches in M4 are now marked. If we continue the steps until all the clusters are selected, as in line 38 of Algorithm 1, we obtain an ordered set of clusters Q, which leads to the order c4 → c1 → c3 → c2. Intuitively, we choose a cluster with less flexibility (or larger physical certainty) early due to its fixed/marked neighbors, so that we can achieve better datapath-aware latch placement for such less flexible clusters during local latch placement.

Algorithm 2 Global Latch Placement

Require: Virtual Dataflow Graph VDFG = (V, E)
 1: A set of overlap-free constraints, T = ∅
 2: loop
 3:     F = LP formulation from Eq. (1,2,3,4)
 4:     for each constraint t ∈ T do
 5:         Add overlap-free constraint t to F
 6:     end for
 7:     Solve F
 8:     if overlap-free global latch placement is found then
 9:         return
10:     else
11:         for each pair of overlapped blocks Vi, Vj do
12:             n = overlap-free constraint between Vi, Vj [18], [21]
13:             T = T ∪ {n}
14:         end for
15:     end if
16: end loop
17: return (xc, yc), ∀c ∈ C

C. Global Latch Placement

We can start global latch placement after latch clustering/sizing/ordering; its purpose is to optimize (xc, yc), ∀c ∈ C, for desirable local latch placement. Global latch placement optimizes two conflicting objectives: a) minimizing datapath wirelength, which enhances structured placement (e.g., latch alignment), and b) minimizing latch disturbance from the input placement, which avoids local congestion. Specifically, for the latter objective, if two clusters are placed too close (yet were far apart in the input placement) for shorter datapath wirelength, the combinational gates logically between these two clusters must be placed at poor locations (due to lack of space) in later placements (see Fig. 6), which can create local congestion.

Fig. 3 illustrates the concept of global latch placement. We first replace c1, c2, c3, c4 in the input placement (Fig. 1 (c)) with V1, V2, V3, V4, while keeping the center of gravity unchanged for each cluster. Note that (wc, hc) for Vc is already obtained by Algorithm 1. Next, we create a virtual DFG (virtual blocks instead of latches), which results in Fig. 3 (a). The weight on an edge is the number of edges accumulated from the DFG in Fig. 1 (b). For example, the edge between c2 and c3 has weight 4, as there are 4 edges between the two in Fig. 1 (b). Then, we can optimize (xc, yc), ∀c ∈ C. Fig. 3 (b) shows the result of global latch placement, where the overall datapath wirelength is minimized without overlapped blocks, yet the virtual blocks do not move much in order to preserve the space for combinational gates (V3 and V4 do not get close in spite of the high weight between the two).

[Figure 3: two panels showing the virtual blocks V1–V4 with the pin clusters ci (a–f) and co (x–z) and edge weights 2, 3, 4, 8. (a) Before; (b) After.]

Fig. 3. Global latch placement.
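The accumulation of DFG edges into weighted virtual-DFG edges can be sketched as below. The function name `virtual_dfg`, the `cluster_of` mapping, and the particular edges are hypothetical; only the idea of counting DFG edges per cluster pair comes from the text.

```python
from collections import Counter

def virtual_dfg(dfg_edges, cluster_of):
    """Accumulate object-level DFG edges into weighted cluster-level edges."""
    weights = Counter()
    for u, v in dfg_edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:
            # Undirected cluster pair; edges inside one cluster are dropped.
            weights[frozenset((cu, cv))] += 1
    return weights

# Hypothetical fragment of a DFG: latches j, k in c2, latches q, r in c3, pin d in ci.
cluster_of = {"j": "c2", "k": "c2", "q": "c3", "r": "c3", "d": "ci"}
dfg_edges = [("j", "q"), ("k", "q"), ("k", "r"), ("d", "j"), ("j", "k")]
weights = virtual_dfg(dfg_edges, cluster_of)
```

The resulting weights play the role of we in the wirelength term of the global placement objective.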

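We solve global latch placement using linear programming (LP) as in Algorithm 2, in order to capture the linear datapath wirelength, as shown below with 0 ≤ α ≤ 1:

$$\min:\ \alpha \sum_{c \in C} \left( DEV_x^c + DEV_y^c \right) + (1 - \alpha) \sum_{e \in E} DWL_e \tag{1}$$

$$\text{s.t.}:\ DEV_x^c \ge \left| x_c - \frac{\sum_{i \in M_c} x_i}{|M_c|} \right|, \quad \forall c \in C \tag{2}$$

$$DEV_y^c \ge \left| y_c - \frac{\sum_{i \in M_c} y_i}{|M_c|} \right|, \quad \forall c \in C \tag{3}$$

$$DWL_e \ge w_e \left( |x_i - x_j| + |y_i - y_j| \right), \quad \forall e(i, j) \in E \tag{4}$$

$$\text{No overlap constraints [18], [21]} \tag{5}$$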

As mentioned earlier, Eq. (1) balances datapath wirelength minimization against latch location disturbance through the weight α. In detail, Eq. (2) and (3) capture the movement of a latch cluster away from the center of gravity of its latches, which prevents local congestion by preserving the space for combinational gates. Eq. (4) computes the weighted (we) datapath wirelength based on the VDFG in Algorithm 2. There is a body of literature on Eq. (5); refer to [18], [21] for details.
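To make the two terms of Eq. (1) concrete, the sketch below evaluates the objective for a given set of block coordinates. It is not an LP solver: it relies on the fact that, at an optimum, the auxiliary variables DEV and DWL equal the absolute-value expressions on the right-hand sides of Eq. (2)–(4). The function name `global_objective` and the data layout are assumptions.

```python
def global_objective(alpha, block_xy, input_members, edges):
    """Evaluate Eq. (1) for fixed block coordinates.

    block_xy      -- dict: VDFG node (latch or pin cluster) -> (x, y)
    input_members -- dict: latch cluster -> list of input (x, y) of its latches
    edges         -- dict: (node_i, node_j) -> weight w_e
    """
    dev = 0.0
    for c, pts in input_members.items():
        gx = sum(x for x, _ in pts) / len(pts)   # center of gravity, Eq. (2)
        gy = sum(y for _, y in pts) / len(pts)   # center of gravity, Eq. (3)
        xc, yc = block_xy[c]
        dev += abs(xc - gx) + abs(yc - gy)
    dwl = 0.0
    for (i, j), w in edges.items():
        (xi, yi), (xj, yj) = block_xy[i], block_xy[j]
        dwl += w * (abs(xi - xj) + abs(yi - yj))  # Eq. (4)
    return alpha * dev + (1 - alpha) * dwl

# Hypothetical case: one latch cluster c1 moved 2 units from its centroid,
# connected to a pin cluster ci with weight 4.
value = global_objective(
    alpha=0.5,
    block_xy={"c1": (3.0, 0.0), "ci": (0.0, 0.0)},
    input_members={"c1": [(0.0, 0.0), (2.0, 0.0)]},
    edges={("ci", "c1"): 4},
)
```

In this case the deviation term is 2 and the wirelength term is 12, so the objective is 0.5 · 2 + 0.5 · 12 = 7. Moving c1 back toward its centroid lowers both terms here, whereas in general the two terms pull in different directions and α sets the trade-off.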

One key characteristic of global latch placement is that the placement density is not high, as we only place Vc, ∀c ∈ C, which makes it easy to find an overlap-free solution. Therefore, instead of the full-blown no-overlap constraints, we prefer to have the minimum number of overlap-free constraints so that the LP-based optimization in Eq. (1) can be more effective and faster. The weakness of the overlap-free constraints in [18], [21] is that either a vertical or a horizontal relationship between two adjacent blocks must be strictly defined, which may over-constrain the optimization and take away some potential benefit in terms of wirelength minimization. Therefore, we choose a lazy approach: we iteratively solve the LP problem as in lines 2-7, and add overlap-free constraints only if indeed necessary as in lines 11-14.
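The lazy loop of Algorithm 2 can be sketched around a caller-supplied solver. Everything concrete here is an assumption for illustration: the `solve` callback stands in for the LP of Eq. (1)–(4), blocks are given as (x, y, w, h) with (x, y) taken as the lower-left corner, a constraint is represented simply by the pair of block names, and `max_rounds` is a safety guard that Algorithm 2 does not have.

```python
from itertools import combinations

def overlapping_pairs(blocks):
    """Return all pairs of blocks whose rectangles overlap with positive area."""
    pairs = []
    for a, b in combinations(sorted(blocks), 2):
        ax, ay, aw, ah = blocks[a]
        bx, by, bw, bh = blocks[b]
        if ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah:
            pairs.append((a, b))
    return pairs

def lazy_global_placement(solve, max_rounds=100):
    """Re-solve, adding overlap-free constraints only for pairs that overlap."""
    constraints = []                      # the set T in Algorithm 2
    for _ in range(max_rounds):
        blocks = solve(constraints)       # stands in for building and solving F
        pairs = overlapping_pairs(blocks)
        if not pairs:
            return blocks, constraints
        constraints.extend(p for p in pairs if p not in constraints)
    raise RuntimeError("no overlap-free placement found")

def toy_solve(constraints):
    """Stub solver: separates V1 and V2 only once a constraint asks for it."""
    if ("V1", "V2") in constraints:
        return {"V1": (0, 0, 2, 2), "V2": (2, 0, 2, 2)}
    return {"V1": (0, 0, 2, 2), "V2": (1, 0, 2, 2)}

placed, used = lazy_global_placement(toy_solve)
```

With the stub solver, the first round yields overlapping blocks, one constraint is added, and the second round returns an abutting, overlap-free placement. Block pairs that never overlap never receive a constraint, which is the point of the lazy approach.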

D. Local Latch Placement

Once global latch placement is done, we start local latch placement to minimize the total datapath wirelength of a DFG such as Fig. 1 (b). We process one cluster at a time in the order given by Algorithm 1. For example, c4 and c2 will be processed first and last, respectively. Each local latch placement is driven by min-cost network flow optimization.

Algorithm 3 Local Latch Placement

Require: Dataflow Graph DFG = (V, E), Ordered Set Q
 1: for each cluster c ∈ Q in an ordered way do
 2:     T = Create maximum slots within Vc
 3:     Network-flow Graph N = (O, F)
 4:     for each object i ∈ Mc do
 5:         Add a node ni to O as a source (+1)
 6:     end for
 7:     for each slot t ∈ T do
 8:         Add a node nt to O as a sink (−1)
 9:     end for
10:     for each object i ∈ Mc do
11:         for each slot t ∈ T do
12:             Add an edge (i, t) with c(i, t) = FlowCost(i, t)
13:         end for
14:     end for
15:     Solve Min-cost Network-flow Problem for N
16:     Place/Fix all the latches in Mc
17: end for

Algorithm 4 FlowCost

Require: a latch i, a slot s
 1: cost = 0
 2: (xi, yi) = (x, y) coordinate of s
 3: O = a set of connected objects to i
 4: for each object o ∈ O do
 5:     if o is placed/fixed then
 6:         cost += |xi − xo| + |yi − yo|
 7:     end if
 8: end for
 9: return cost

[Figure 4: six panels for cluster c2 (latches j, k, l, m). (a) Latch slots within a block (slots 0–4); (b) Network flow graph, with latches j, k, l, m as sources (+1) and slots 0–4 as sinks (−1); (c) c(j, 0) = 2.0; (d) c(j, 2) = 3.5; (e) c(k, 2) = 2.5; (f) Final latch placement.]

Fig. 4. Local latch placement: example of c2.

Algorithm 3 describes our local latch placement, which is formulated as an assignment problem for each latch cluster, similarly to [4]. As in line 2, we first create as many slots as possible within a virtual block. Consider the example in Fig. 4, where the local latch placements for c4, c1, and c3 have been completed and the local placement of the last cluster, c2, is about to begin. Let us assume that V2 is partitioned into five slots (0–4, in gray) as in (a) such that each slot can accommodate one latch. Also, note that the coordinate of each slot is known, as (x2, y2) is computed by Algorithm 2. Then, we start building a network flow graph in lines 3-14. In detail, every latch becomes a source and every slot becomes a sink so that a latch can flow into one of the slots, as in lines 4-9. Fig. 4 (b) shows the nodes in the network flow graph for c2. Next, as in line 12, we create an edge with unit capacity from every latch to every slot with a cost based on Algorithm 4.

[Figure 5: five panels. (a) Turning datapath, with pins a–d and x–z, virtual blocks V1, V2, V3, and edge weights 4 and 4; (b) Diagonal latches; (c), (d), (e) stacking example with objects labeled p and F.]

Fig. 5. Handling complex datapath.

Algorithm 4 essentially computes a what-if cost in terms of datapath wirelength by trying a latch at various slots. Fig. 4 (c) illustrates that we can compute the cost of assigning the latch j to the slot 0, c(j, 0), by assuming j is at the slot 0 and accumulating the displacements from j to all other connected and already-planned objects in the DFG of Fig. 1 (b). In detail, c(j, 0) = 2.0, because the distances from the slot 0 to d, e, and q are 0.5, 1.5, and 0, respectively. Note that we only use vertical displacement as distance in this example, for simplicity and without loss of generality. Also, recall that d and e are fixed pins and q has already been placed/fixed while processing c3 (see the order in Table II). By applying the same procedure, we get c(j, 2) = 3.5 as in Fig. 4 (d), which indicates that the slot 2 is worse than the slot 0 for j. In fact, Fig. 4 (e) shows that k is a better fit for the slot 2. Once we complete building the network flow graph for every possible pair of a latch and a slot for a given latch cluster, we can find the latch locations with the minimal datapath wirelength using min-cost network flow optimization as in line 15 of Algorithm 3. Since Vc is bigger than the total latch footprint of Mc and Vi does not overlap with Vj (i ≠ j), our min-cost network flow optimization problem is always feasible and optimally solvable [9].
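The what-if cost of Algorithm 4 and the resulting assignment can be sketched as follows. The coordinates are assumed values chosen only so that the two costs quoted above, c(j, 0) = 2.0 and c(j, 2) = 3.5, are reproduced; the connection of k to e is hypothetical, and the exhaustive search in `assign_latches` is a small-scale stand-in for the min-cost network flow solver, not the method used in the paper.

```python
from itertools import permutations

def flow_cost(latch, slot_xy, dfg_edges, placed):
    """Manhattan distance from a trial slot to the latch's placed/fixed neighbours."""
    x, y = slot_xy
    cost = 0.0
    for a, b in dfg_edges:
        if latch == a:
            other = b
        elif latch == b:
            other = a
        else:
            continue
        if other in placed:               # unplaced neighbours contribute nothing
            ox, oy = placed[other]
            cost += abs(x - ox) + abs(y - oy)
    return cost

def assign_latches(latches, slots, dfg_edges, placed):
    """Try every latch-to-slot assignment and keep the cheapest one."""
    best, best_cost = None, None
    for chosen in permutations(range(len(slots)), len(latches)):
        total = sum(flow_cost(l, slots[s], dfg_edges, placed)
                    for l, s in zip(latches, chosen))
        if best_cost is None or total < best_cost:
            best, best_cost = dict(zip(latches, chosen)), total
    return best, best_cost

# Assumed coordinates (vertical displacement only, so every x is 0).
placed = {"d": (0.0, -0.5), "e": (0.0, 1.5), "q": (0.0, 0.0)}
slots = [(0.0, 0.0), (0.0, 0.75), (0.0, 1.5)]
edges = [("d", "j"), ("e", "j"), ("j", "q"), ("e", "k")]

cost_j0 = flow_cost("j", slots[0], edges, placed)
cost_j2 = flow_cost("j", slots[2], edges, placed)
assignment, total = assign_latches(["j", "k"], slots, edges, placed)
```

Under these assumptions j prefers the slot at index 0 and k the slot at index 2, in line with the observation that k is a better fit for the slot 2. The exhaustive search grows factorially with cluster size, which is why a min-cost flow formulation is the practical choice for real clusters.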

IV. COMPLEX DATAPATH

In this section, we will discuss how LatchPlanner handles com-

plex datapath in detail. First, we assumed that all the latches have

identical footprint (e.g., W,H) in Section III, but LatchPlanner

handles various latch sizes simultaneously. For a cluster c, we can find

the largest width and height among Mc, and use them to determine

(wc, hc). In addition, although all the examples in Section III illustrate

the case of horizontal datapath only for simplicity, LatchPlanner

can handle complex datapath by reserving larger flexibility in terms

of latch cluster dimension so that nice datapath alignment can be

accomplished during local latch placement.
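The initial sizing with mixed latch footprints can be sketched as below, combining line 3 of Algorithm 1 with the largest width and height found in the cluster. The function name `initial_block_size` and the sample footprints are assumptions.

```python
import math

def initial_block_size(latch_sizes):
    """Size a virtual block as a near-square grid of the largest latch footprint.

    latch_sizes -- non-empty list of (width, height), one entry per latch in Mc
    """
    n = len(latch_sizes)
    side = math.isqrt(n - 1) + 1          # integer form of ceil(sqrt(n)) for n >= 1
    max_w = max(w for w, _ in latch_sizes)
    max_h = max(h for _, h in latch_sizes)
    return side * max_w, side * max_h

# Seven latches, one of them twice as wide as the others (hypothetical sizes).
block = initial_block_size([(1, 1)] * 6 + [(2, 1)])
```

Seven latches need a 3-by-3 grid, and the widest latch sets the column pitch, so the block is 6 wide and 3 high. Using the largest footprint for every grid cell over-allocates space for smaller latches but keeps every slot able to hold any latch in the cluster.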

Fig. 5 illustrates how LatchPlanner handles complex datapath by

optimizing the dimension of each cluster based on Algorithm 1. Fig. 5

(a) shows the case where a datapath starts from the left-bottom corner

and ends at the top-right corner, making one turn in the middle. Due

to the higher physical certainty, V1 and V3 follow the dimensions of

two pin clusters. And then, we will set w2 ← w3 and h2 ← h1,

as we take the maximum length of all neighbors for each direction.

Such a large dimension of V2 allows the local latch placement in

Section III-D to have great flexibility and Algorithm 3 to find the

ideal location for each latch in terms of datapath alignment. Fig. 5

(b) shows a possible latch placement of V2 when there are no other

placement constraints: latches are diagonally placed in order to adjust

to the turning datapath.

Pins are often stacked when there are a large number of inputs/outputs. For such cases, we can stack latches accordingly for better datapath-aware placement. Although we do not detail stacking for the sake of simplicity, LatchPlanner can automatically compute the required number of stacks during latch cluster sizing in Section III-B. Fig. 5 shows an example of stacking in LatchPlanner. If a latch cluster has a neighbor with stacked objects as in (c), we grow the size of the virtual block such that the slots for local latch placement can be stacked as well, as in (d). Then, the network flow optimization will automatically find a better latch placement across all stacks as in (e).

TABLE III. QUALITY OF FINAL PLACEMENT COMPARISON.

                     LatchPlanner (Ours)                      | Baseline (BL)                            | Semi-Custom (SC)
ckt id        #lat^a    TWL^b  WNS^c   TNS^d  Area^e cpu(s)^f |      TWL    WNS     TNS    Area   cpu(s) |      TWL    WNS     TNS    Area   cpu(s)
d1                35    18.8K   -8.5    -868    2.4K     1024 |    22.0K   -9.5    -921    2.4K      834 | N/A
d2                42    64.5K  -55.5   -1290    3.2K     1209 |    89.3K  -54.2   -1859    3.3K     1310 | N/A
d3                88    56.6K  -11.9   -2339    6.3K     1201 |    72.5K  -12.0   -2555    6.5K     1265 | N/A
d4                94    66.6K  -12.8   -2452    6.5K     1198 |    83.3K  -15.6   -2647    6.6K     1353 | N/A
d5               119   141.7K   -4.8     -91   10.1K     4562 |   138.2K   -4.5     -87   10.4K     3775 | N/A
d6               156   141.1K  -11.6    -576   13.0K     2390 |   152.5K   -2.4    -320   12.9K     2373 | N/A
d7               204   192.6K   -4.8    -131   18.0K     7912 |   205.5K   -6.3    -139   17.7K     5958 | N/A
d8               349   561.2K  -11.2    -747   29.8K     5413 |   635.8K  -12.6   -1268   30.8K     6592 | N/A
d9               360   327.0K  -25.4   -4839   30.6K    20047 |   406.8K  -23.3   -4815   31.0K    23559 | N/A
d10              337   440.1K  -45.4   -2902   40.1K    20429 |   497.9K  -44.2   -3180   42.0K    17171 | N/A
d11              321   372.8K  -16.1   -1136   34.4K     8584 |   507.7K  -15.7   -1449   35.4K     8854 |   363.0K  -13.9   -1052   33.6K     8300
d12              640  1347.0K  -26.6   -3798   86.4K    10178 |  1610.5K  -28.8   -5863   87.7K    10411 |  1355.6K  -23.1   -5146   85.5K     9929
d13              302   330.5K  -49.8   -3994   31.6K     8578 |   488.2K  -47.3   -4469   32.6K    10254 |   299.2K  -48.4   -3884   30.8K     8725
d14              843  1066.3K  -22.7   -5056   73.1K     9266 |  1327.4K  -24.1   -6703   74.6K     9888 |  1068.4K  -23.2   -5156   72.8K     9621
d15              490   739.0K  -20.8   -3355   52.8K     9182 |   944.2K  -18.1   -4136   55.4K     9347 |   707.6K  -20.8   -3345   52.5K    10607
d16              442   733.9K  -18.5   -4620   52.7K     8797 |   963.0K  -17.6   -6182   55.9K     9564 |   717.8K  -19.6   -4422   51.9K     8998
d17             1947  2099.5K  -22.9   -7109  126.5K    13458 |  3208.5K  -25.0  -10356  135.4K    14983 |  2315.4K  -27.2   -7518  131.8K    13684
d18             2155  2627.0K  -21.0   -8966  133.2K    15632 |  3577.2K  -36.3  -10697  135.7K    16823 |  2511.5K  -11.8   -7719  129.4K    15563
sum d1–d18             11326K -390.3  -54.2K  750.6K   149.1K |   14931K -397.5  -67.7K  776.6K   154.3K | N/A
ratio d1–d18                1      1       1       1        1 |     1.32   1.02    1.25    1.04     1.03 | N/A
sum d11–d18             9316K -198.4  -38.0K  590.6K    83.7K |   12627K -212.9  -49.9K  612.8K    90.1K |    9338K -188.0  -38.2K  588.4K    85.4K
ratio d11–d18               1      1       1       1        1 |     1.36   1.07    1.31    1.04     1.08 |     1.00   0.95    1.01    1.00     1.02

a: the number of latches. b: the total wirelength in tracks. c: the worst negative slack (ps). d: the total negative slack (ps). e: the total gate area in the smallest inverter size. f: the total runtime of the whole design flow including placement and timing optimization.

V. EXPERIMENTAL RESULTS

We implemented our latch placement algorithm, LatchPlanner, in C++. All the experiments were performed on a 2.4GHz Linux machine. We used CLP [13] as the LP solver and implemented a min-cost network flow solver [9], [11] for LatchPlanner. We used an in-house placement engine which handles multi-million mixed-size objects and supports state-of-the-art analytical techniques [17].

For comparison purposes, we prepared three different design flows: LatchPlanner, Baseline (BL), and Semi-Custom (SC). Fig. 6 shows a simplified view of each flow. In all three design flows, we iteratively perform placement and timing optimization such as gate sizing/buffering/Vt assignment to obtain more reliable timing statistics. While SC starts with latches fixed, both LatchPlanner and BL start with floating latches. Then, after the first placement (which becomes the initial placement for LatchPlanner), BL clusters latches and fixes them in a way that minimizes the disturbance to the current snapshot of the placement result [1], [6], while LatchPlanner does the same job for latches in a way that optimizes the datapath. Also, BL implicitly optimizes obvious datapaths between pins and gates by putting higher net weights on them for better datapath alignment. To evaluate our approach, we collected 18 industrial datapath-oriented designs in the 32nm node, and 8 of them (d11–d18) came with manual latch placement data created by highly skilled human designers (so we can obtain SC results). We used the technique in [25] to obtain the DFGs for all the benchmarks.

[Figure 6: three flow charts. (a) LatchPlanner: Netlist with DFG → Placement → LatchPlanner (fix latches) → Placement and Timing opt., repeated for 1..N. (b) Baseline: Netlist → Placement → Latch Clustering (fix latches) → Placement and Timing opt., repeated for 1..N. (c) Semi-Custom: Netlist → Manual Latch Placement → Placement and Timing opt., repeated for 1..N.]

Fig. 6. Three different flows for experiments.

[Figure 7: placement images. (a) LatchPlanner; (b) Semi-Custom.]

Fig. 7. Placement images from LatchPlanner and Semi-Custom: latches are horizontally placed in a structured fashion for vertical datapath.

We wished to use the public benchmarks [12] to compare with prior

art but could not, because they are not suitable for LatchPlanner:

they either carry no information on sequential elements or are too

simple/small to reflect the characteristics of real datapath-oriented

macros [7], [12], [23]. Nonetheless, we believe that experiments

on 18 industrial benchmark circuits, together with a comparison against

SC, which can serve as the engineering lower bound, make our

evaluation sufficiently valid and convincing.
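The step order of the three flows described above can be summarized in a short sketch. The function and step names are hypothetical labels for the boxes of Fig. 6, not an interface of the in-house placement engine; the sketch only records which steps run in which order.

```python
def run_flow(flow, n_iterations):
    """Return the ordered step names of one design flow (labels are illustrative)."""
    if flow not in ("LatchPlanner", "BL", "SC"):
        raise ValueError(f"unknown flow: {flow}")
    steps = []
    if flow == "SC":
        # Semi-Custom: latches are fixed up front from manual placement data.
        steps.append("manual_latch_placement")
    else:
        # LatchPlanner and BL: latches float during the first placement ...
        steps.append("placement")
        # ... and are then planned (LatchPlanner) or clustered (BL) and fixed.
        steps.append("latchplanner" if flow == "LatchPlanner" else "latch_clustering")
        steps.append("fix_latches")
    for _ in range(n_iterations):
        # Iterated in all three flows to obtain more reliable timing statistics.
        steps.append("placement")
        steps.append("timing_opt")
    return steps

bl_steps = run_flow("BL", 2)
sc_steps = run_flow("SC", 1)
```

The only difference between the LatchPlanner and BL flows in this sketch is the step that decides where latches are fixed, which is what the comparison in Table III isolates.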

Table III demonstrates the effectiveness of LatchPlanner where

two summaries are shown at the bottom: one for all benchmarks (d1–d18)

and the other for d11–d18, which have SC results. It clearly shows that

LatchPlanner and SC work better than BL, which implies that carefully

planned latches are beneficial for datapath-oriented designs.

Comparing LatchPlanner with BL, TWL and TNS are improved by 32% and 25%,

respectively (BL-to-LatchPlanner ratios of 1.32 and 1.25). On the larger

benchmarks (d11–d18), the improvement is 36% for TWL and 31%

for TNS, which indicates that LatchPlanner works better for larger and

more complex datapath-dominated designs. LatchPlanner is highly

comparable to SC, yet without labor-intensive human input. Again,

the numbers from SC are the practical engineering lower bounds, and

LatchPlanner delivers near-lower-bound solutions for d11–d18. Also,

we noticed that well-planned latches can enhance the convergence

of the iterative placement and timing optimization: LatchPlanner

runs about 3% faster than BL. Note that LatchPlanner itself is one small

procedure (a fraction of the total runtime), as shown in Fig. 6. SC is as

fast as LatchPlanner, but LatchPlanner offers a substantial productivity

enhancement over the entire design cycle, as a designer no

longer needs to create a manual latch placement. We did not include

the time spent crafting the manual latch placement for SC in Table III.
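As a quick check of how the ratio rows relate to the sum rows, the sketch below recomputes the d11–d18 ratios from the sums printed in Table III, normalizing each flow to LatchPlanner. The column order (TWL, WNS, TNS, area, runtime) follows the table footnotes; the dictionary layout and function name are illustrative choices.

```python
# Sums over d11-d18 from Table III: (TWL, WNS, TNS, area, runtime).
sums = {
    "LatchPlanner": (9316e3, -198.4, -38.0e3, 590.6e3, 83.7e3),
    "BL": (12627e3, -212.9, -49.9e3, 612.8e3, 90.1e3),
    "SC": (9338e3, -188.0, -38.2e3, 588.4e3, 85.4e3),
}

def ratios(flow, reference="LatchPlanner"):
    """Per-metric ratio of `flow` to `reference`, rounded to two decimals."""
    return [round(v / r, 2) for v, r in zip(sums[flow], sums[reference])]

bl_ratios = ratios("BL")  # [1.36, 1.07, 1.31, 1.04, 1.08]
sc_ratios = ratios("SC")  # [1.0, 0.95, 1.01, 1.0, 1.02]
```

A ratio above 1 means the flow is worse than LatchPlanner on that metric, so the 1.36 and 1.31 entries for BL are the source of the 36% TWL and 31% TNS figures quoted above.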

Fig. 7 compares the placement images from LatchPlanner and

SC. We can observe that there are differences in placement, but

LatchPlanner places latches in a comparably structured way to SC.

VI. CONCLUSION

We propose a novel algorithm, LatchPlanner, to optimize datapath-

oriented design placement. The latches planned by LatchPlanner

enable a placement engine to deliver highly structured, datapath-

aware placement. We apply LatchPlanner to eighteen industrial benchmarks

and show experimentally that it improves wirelength and timing over a

baseline flow without runtime overhead, with quality close to semi-custom placement.

REFERENCES

[1] C. Alpert et al. Latch Clustering with Proximity to Local Clock Buffers. In U.S. Patent Pending 20120110532, 2012.

[2] C. Alpert, A. Kahng, G.-J. Nam, S. Reda, and P. Villarrubia. A semi-persistent clustering technique for VLSI circuit placement. In Proc. Int. Symp. on Physical Design, 2005.

[3] S. R. Arikati and R. Varadarajan. A Signature based Approach to Regularity Extraction. In Proc. Int. Conf. on Computer Aided Design, 1997.

[4] U. Brenner, A. Pauli, and J. Vygen. Almost optimum placement legalization by minimum cost flow and dynamic programming. In Proc. Int. Symp. on Physical Design, 2004.

[5] T. Chan, A. Chowdhary, B. Krishna, A. Levin, G. Meeker, and N. Sehgal. Challenges of CAD Development of Datapath Design. In Intel Technology Journal, 1999.

[6] M. Cho et al. Converged Large Block and Structured Synthesis for High Performance Microprocessor Designs. In U.S. Patent 8271920, 2012.

[7] S. Chou, M.-K. Hsu, and Y.-W. Chang. Structure-aware placement for datapath-intensive circuit designs. In Proc. Design Automation Conf., 2012.

[8] A. Chowdhary and R. Gupta. A Methodology for Synthesis of Data Path Circuits. In IEEE Design & Test of Computers, 2002.

[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.

[10] J. Friedrich et al. Design methodology for the IBM POWER7 microprocessor. IBM J. Res. Dev., 55(3):294–307, May 2011.

[11] A. V. Goldberg. An Efficient Implementation of a Scaling Minimum-Cost Flow Algorithm. Journal of Algorithms, 22:1–29, 1997.

[12] http://www.cerc.utexas.edu/utda/download/DP.

[13] http://www.coin-or.org/projects/Clp.xml.

[14] P. Ienne and A. Grießing. Practical experiences with standard-cell based datapath design tools: do we really need regular layouts? In Proc. Design Automation Conf., 1998.

[15] T. Kutzschebauch and L. Stok. Efficient Logic Optimization Using Regularity Extraction. In Proc. IEEE Int. Conf. on Computer Design, 2000.

[16] T. Kutzschebauch and L. Stok. Regularity Driven Logic Synthesis. In Proc. Int. Conf. on Computer Aided Design, 2000.

[17] I. L. Markov, J. Hu, and M.-C. Kim. Progress and Challenges in VLSI Placement Research. In Proc. Int. Conf. on Computer Aided Design, 2012.

[18] M. D. Moffitt, A. N. Ng, I. L. Markov, and M. E. Pollack. Constraint-driven floorplan repair. In Proc. Design Automation Conf., 2006.

[19] R. Namballa, N. Ranganathan, and A. Ejnioui. Control and data flow graph extraction for high-level synthesis. In Proc. IEEE Annual Symp. on VLSI, 2004.

[20] R. X. T. Nijssen and J. A. G. Jess. Two-dimensional datapath regularity extraction. In IFIP Workshop on Logic and Architecture Synthesis, 1996.

[21] S. Reda and A. Chowdhary. Effective Linear Programming based Placement Methods. In Proc. Int. Symp. on Physical Design, 2006.

[22] A. P. E. Rosiello, F. Ferrandi, D. Pandini, and D. Sciuto. A Hash-based Approach for Functional Regularity Extraction During Logic Synthesis. In IEEE Computer Society Annual Symposium on VLSI, 2007.

[23] S. Ward, D. Ding, and D. Z. Pan. PADE: A High-Performance Placer with Automatic Datapath Extraction and Evaluation through High Dimensional Data Learning. In Proc. Design Automation Conf., 2012.

[24] S. Ward, M.-C. Kim, N. Viswanathan, Z. Li, C. Alpert, E. E. Swartzlander, Jr., and D. Z. Pan. Keep it straight: teaching placement how to better handle designs with datapaths. In Proc. Int. Symp. on Physical Design, 2012.

[25] H. Xiang, M. Cho, H. Ren, M. Ziegler, and R. Puri. Network Flow Based Datapath Bit Slicing. In Proc. Int. Symp. on Physical Design, 2013.

[26] J. Z. Yan, C. Chu, and W.-K. Mak. SafeChoice: a novel clustering algorithm for wirelength-driven placement. In Proc. Int. Symp. on Physical Design, 2010.
