
1928 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 34, NO. 12, DECEMBER 2015

Fast and Memory-Efficient Routing Algorithms for Field Programmable Gate Arrays With Sparse Intracluster Routing Crossbars

Yehdhih Ould Mohammed Moctar, Guy G. F. Lemieux, Senior Member, IEEE, and Philip Brisk, Member, IEEE

Abstract—Field programmable gate array (FPGA) routing is one of the most time-consuming steps in a typical computer-aided design flow. The problem itself is similar to the NP-complete problem of computing a set of disjoint paths in a graph. The routing resource graph (RRG) that represents an FPGA routing network is necessarily large, and becomes even larger when modeling modern FPGAs that integrate sparse intracluster routing crossbars. This paper introduces two scalable heuristics that reduce the runtime and memory footprint of FPGA routing: 1) selective RRG expansion (SERRGE), which employs an application-specific memory manager that stores the RRG in a compressed form and dynamically decompresses it as the router proceeds and 2) partial prerouting (PPR), which locally routes all nets within each logic cluster and then runs a global routing stage to complete the routes. PPR and SERRGE converge faster than a traditional router using a fully expanded RRG. PPR runs faster and uses less memory than SERRGE, while SERRGE yields the highest clock frequencies among the three.

Index Terms—Field programmable gate array (FPGA), routing, routing resource graph (RRG).

I. INTRODUCTION

THE LONG running times of commercial computer-aided design (CAD) software are one impediment to the widespread adoption of field programmable gate array (FPGA) technology. Among the different stages in a typical CAD flow, routing is often the most significant in terms of runtime and performance, since it directly affects the achievable clock frequency. Practically all commercial FPGA routers have their origins in the PathFinder algorithm, introduced in 1995 by McMurchie and Ebeling [1]. PathFinder employs an algorithmic approach called negotiated congestion (NC), in which individual nets in the user circuit are allowed to share FPGA routing resources; as the algorithm proceeds, the

Manuscript received February 7, 2014; revised May 23, 2014 and November 28, 2014; accepted May 20, 2015. Date of publication June 15, 2015; date of current version November 18, 2015. This work was supported by the DoD SMART Fellowship. This paper was recommended by Associate Editor L. He. (Corresponding author: Philip Brisk.)

Y. O. M. Moctar is with IBM, Electronic Design Automation, Austin, TX 78758 USA (e-mail: [email protected]).

G. G. F. Lemieux is with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]).

P. Brisk is with the Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA 92521 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCAD.2015.2445739

negotiation process ensures that at most one net is routed along each resource. This process is often lengthy and memory-intensive. In particular, the routing resource graph (RRG) of a commercial-grade FPGA can be very large, due to the inordinate quantity of uniquely programmable routing resources that are present in the architecture.

One of the significant contributors to overall RRG size is the presence of sparse intracluster routing crossbars within the FPGA routing network [2]–[5]. In early FPGA generations, intracluster routing crossbars were fully connected, which allowed the RRG to implicitly represent them. When the crossbars become sparse, the implicit representation is no longer accurate, so the need to explicitly enumerate their connectivity significantly enlarges the overall RRG size.

This paper reduces the runtime and memory footprint of the PathFinder FPGA routing algorithm for FPGAs with sparse intracluster routing crossbars. Two heuristics are introduced with different characteristics in terms of runtime, memory usage, and quality of solution. Selective RRG expansion (SERRGE) employs a memory manager that compresses the RRG and decompresses relevant portions of it as the router executes, thereby eliminating the need to fully expand it prior to routing. A second heuristic, partial prerouting (PPR), computes routes for each intracluster routing crossbar a priori, and then routes the rest of the circuit using the global routing resources of the FPGA. Between the two, PPR achieves shorter runtimes and consumes less memory, while SERRGE tends to find legal routing solutions with lower critical path delays, equating to higher clock frequencies. Our results demonstrate that SERRGE and PPR address the routing challenge imposed by FPGAs with sparse intracluster routing crossbars, as they offer a clear and unequivocal improvement over the state-of-the-art in FPGA routing algorithms.

This paper is an extension of [6], which was published at the International Conference on Field-Programmable Logic and Applications 2012. New contributions of this paper include: 1) descriptions of modifications to the internal data structures of versatile place and route (VPR) used by the different routing algorithms and 2) more extensive experimentation and analysis across a wider set of target FPGA architectures.

II. PRELIMINARIES

A. FPGA Architecture

Our implementation, experimental results, and analysis use the VPR 5.0 architectural simulator, made publicly available

0278-0070 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. (a) BLE of an FPGA. (b) CLB contains several BLEs with fast local interconnect provided by the intracluster routing crossbar; the C block inputs and outputs interface the CLB with the global routing network. (c) Floorplan of a general island-style FPGA.

by the University of Toronto [7]. This section summarizes the VPR architecture.

The atomic unit of FPGA programmable logic is a K-input lookup table (K-LUT), which can be configured to implement any K-input, one-output logic function. A basic logic element (BLE) is a K-LUT coupled with a bypassable flip-flop, as shown in Fig. 1(a).

As shown in Fig. 1(b), BLEs are clustered in groups called configurable logic blocks (CLBs). Each CLB contains N BLEs, along with an intracluster routing crossbar. In early FPGAs, the intracluster routing crossbar was fully connected; in more recent devices, it has become sparse. A connection block (C block) connects each CLB input pin to a subset of the wires in the adjacent routing channel. The intracluster routing crossbar connects the CLB input pins and local feedbacks (one per BLE) to the BLE inputs.

Fig. 2. (a) Small FPGA fragment and (b) its corresponding RRG [8, Fig. 4]. Observe that switches between wires can be unidirectional or bidirectional, depending on the architecture.

Fig. 1(c) illustrates the FPGA floorplan. Switch blocks (S blocks) are programmable intersections between horizontal and vertical routing channels. The multiplexers, shown on the right-hand side of Fig. 1(b), are implemented in the S blocks, which are shown (without detail) in Fig. 1(c). Fig. 1(b) depicts inputs coming in from the left-hand side of the CLB and outputs leaving to the right; in actuality, inputs and outputs may enter and exit from all four sides.

The user describes an FPGA using VPR's architecture configuration file. VPR reads in the architecture configuration file and algorithmically generates the logic and routing architecture of the FPGA [8]. This alleviates the need for the user to specify every connection within the device. The most important device parameters are as follows.

K: the LUT size (i.e., a K-LUT).

N: the number of LUTs per CLB.

I: the number of CLB input pins.

W: the number of segments per routing channel.

$F_{cin}$ and $F_{cout}$: C block connectivity parameters.

Each C block input multiplexer in Fig. 1(b) selects one of $W \times F_{cin}$ wires, and each BLE drives $W \times F_{cout}$ segments in the adjacent routing channels. Most FPGAs use single-driver routing [9], so the C block output is a conceptual description of the routing topology.

B. Routing Resource Graph

The RRG represents the connectivity between physical resources in an FPGA. Vertices in the RRG represent wires and pins that are internal to the FPGA, and edges represent switches that connect wires; switches may be unidirectional or bidirectional. Fig. 2 provides an example of an RRG that represents a small fragment occurring within a larger FPGA.

When performing routing, sources are FPGA input pins and BLE outputs, and sinks (targets) are FPGA output pins and BLE inputs.

C. Problem Formulation

FPGA routing is a technology-specific variation of the disjoint path problem from graph theory, which is one of Karp's original NP-complete problems [10]. In a graph, two paths are


Fig. 3. Simple instance of the disjoint path problem. (a) Graph $G(V, E)$ with sources $S = \{s_1, s_2\}$ and sinks $T = \{t_1, t_2\}$. (b) Illegal solution, i.e., two nondisjoint paths that share a common vertex. (c) Legal solution, i.e., two disjoint paths that share no common vertices.

Fig. 4. (a) Due to the equivalence of LUT inputs, different source-sink pairs may be legal solutions. (b) However, enforcing specific source-sink pairs may be overly restrictive. (c) Solution is to create a common sink (t) that represents all equivalent LUT inputs.

disjoint if they share no vertices or edges. Fig. 3 provides an example of disjoint and nondisjoint paths.

An instance of the disjoint path problem is a graph $G(V, E)$ and two sets of vertices: 1) a set of sources $S = \{s_1, s_2, \ldots, s_k\}$ and 2) a set of sinks $T = \{t_1, t_2, \ldots, t_k\}$. A legal solution is a set of paths $P = \{p_1, p_2, \ldots, p_k\}$, where $p_i$ is a path from $s_i$ to $t_i$ in G, such that the paths in P are disjoint. The NP-complete decision problem is whether or not a set P of disjoint paths exists, given G, S, and T; corresponding optimization problems may try to minimize the total length of the paths in P or the length of the longest path in P. In the routing problem for FPGAs, the graph G is the RRG, and the sets of sources and corresponding sinks are derived from the placement solution.
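The legality condition above can be checked mechanically once candidate paths are known. The following is a minimal Python sketch, assuming each path is given as a list of vertex labels; the function name `paths_are_disjoint` and the example paths are hypothetical and are not taken from the paper. It only verifies a given solution; it does not search for one, which is the NP-complete part.

```python
def paths_are_disjoint(paths):
    """Return True if no vertex appears in more than one path.

    Each path is a list of hashable vertex labels. Two paths that share
    no vertex cannot share an edge, so only vertices are checked.
    """
    seen = set()
    for path in paths:
        vertices = set(path)
        if seen & vertices:      # some vertex was already used by an earlier path
            return False
        seen |= vertices
    return True


# Hypothetical instance with two source-sink pairs.
illegal = [["s1", "a", "t1"], ["s2", "a", "t2"]]   # both paths use vertex "a"
legal = [["s1", "a", "t1"], ["s2", "b", "t2"]]     # no shared vertex
```

Here `paths_are_disjoint(illegal)` is false because vertex `a` is shared, while `paths_are_disjoint(legal)` is true.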

One important difference is that each path in the FPGA represents a net in a digital circuit, where a source may fan out to drive multiple sinks. Each net has the form $N_i = (s_i, T_i)$, where $s_i$ is the source and $T_i = \{t_i^1, t_i^2, \ldots, t_i^n\}$ is the set of n sinks driven by source $s_i$; thus, $p_i$ is actually a hyper-path (tree) that connects $s_i$ to the sinks in $T_i$.

A second important difference involves the equivalence of sinks. Because LUTs are programmable logic functions, their inputs are equivalent. Without loss of generality, if a two-input LUT is configured to perform a logic function $f(s_1, s_2)$, then there is an equivalent logic function $f'(s_2, s_1) = f(s_1, s_2)$, yielding a symmetric source/sink assignment, shown in Fig. 4(a). Explicitly listing either pair as the one possible legal solution, as shown in Fig. 4(b), is overly restrictive. Thus, it is necessary to introduce a single vertex t to represent a common sink, as shown in Fig. 4(c). Therefore, any legal routing solution must be node disjoint, except at the common sink.

The objective of an FPGA router is twofold: 1) find a legal route, supposing that one exists and 2) minimize the delay of

Fig. 5. CLBs with (a) fully connected and (b) sparsely connected intracluster routing crossbars.

the critical path in the circuit, which may involve the concatenation of several disjoint paths in the RRG. Many aspects of this delay will be technology-specific, including the logic delay through the BLEs on the path, delays relating to fanout, delays through routing multiplexers, wire delays in the routing network, etc. We employ the models VPR provides.

D. Sparse Intracluster Routing Crossbars

VPR (versions 5.0 and before) models FPGAs with full intracluster routing crossbars, as shown in Fig. 5(a). Specifically, a full intracluster routing crossbar means that a programmable routing connection exists between every CLB input and BLE input within the CLB. This means that the router only needs to algorithmically compute routes from sources to CLB inputs, not BLE inputs; with a full crossbar connecting CLB inputs to BLE inputs, it is trivial to complete the route. Thus, the intracluster routing crossbar can be omitted from the RRG; this has been standard in VPR since its inception, although the assumption has been lifted since the release of VPR 6.0 [11]. Now, the intracluster routing crossbar topology is part of the architecture configuration file.

When the intracluster routing crossbar becomes sparse, as shown in Fig. 5(b), CLB inputs are no longer equivalent (in the general case). In order for the router to complete a legal disjoint path routing solution, it is necessary to explicitly represent the intracluster routing crossbar in the RRG. This enlarges the RRG: the set of vertices must include each CLB input and each BLE input [before, the CLB inputs could be represented as a single sink, akin to Fig. 4(c), while BLE inputs were omitted altogether]; and the number of edges that are added to the RRG depends on the population density of the crossbar. Taken in aggregate across the entire FPGA, the RRG size can increase significantly.

E. PathFinder Algorithm

This section summarizes the PathFinder FPGA routing algorithm [1]. PathFinder is based on the paradigm of NC, which computes illegal routing solutions in which several nets may share a single wire (RRG vertex). The negotiation process dynamically adjusts a cost function, which, over time, pushes nets away from congested wires, and yields a globally legal routing solution.

Fig. 6 presents pseudocode for PathFinder. The outer loop, the global router, iterates until a legal solution is found. At the beginning of each iteration, it rips up each routing tree and updates the vertex costs used by the algorithm. The pseudocode assumes that a legal solution can be found. In practice,


Fig. 6. Pseudocode for the PathFinder FPGA routing algorithm.

the global router terminates after a fixed number of iterations if a solution is not found. The signal router (lines 2–26) is oblivious to congestion (i.e., several nets sharing the same RRG vertex); a cost function (described below) is computed for each RRG vertex and is dynamically updated to dissuade the usage of congested vertices during routing. The objective of the signal router is to find a route that minimizes the aggregate cost of the RRG vertices that comprise the route.

The signal router routes one net at a time; routing tree $RT_i$ for net $N_i$ is expanded in search of each sink $t_i^j \in T_i$, one sink at a time, and in order. The routing tree $RT_i$ for net $N_i$, computed during the previous iteration, is discarded, and a new route is computed. Routes may be computed using a priority-driven breadth-first search [1], similar to Lee's maze expansion [12]; more efficient routes can be computed using an A* cost function, which includes an additional term that directs the search toward the target sink $t_i^j$ [1], [13]–[15].

The first search emanates from source $s_i$ of net $N_i$ to the first sink, $t_i^1$, resulting in a routing path. Subsequent searches expand the routing path into a routing tree, $RT_i$. Inductively, let $RT_i$ connect $s_i$ to sinks $\{t_i^1, t_i^2, \ldots, t_i^{j-1}\}$. The next search will find a path that connects some vertex in $RT_i$ to the jth sink, $t_i^j$.

In its original description, PathFinder computes a route from each source to each sink. The search initializes a priority queue PQ to contain the vertices in $RT_i$ at zero cost. Subsequently, PQ contains each vertex that: 1) has at least one neighbor in $RT_i$ and 2) does not belong to $RT_i$ itself. VPR (and Fig. 6), in contrast, maintains the wavefront and continues expansion until all sinks have been discovered [15, Fig. 4.11].

The search works as follows: the lowest cost vertex v is removed from PQ and added to $RT_i$. The vertices adjacent to v are expanded and inserted into PQ accordingly. This process repeats until the current sink $t_i^j$ is found. Initially, PQ must include an adjacent neighbor u of v that belongs to $RT_i$; thus, v and adjacent edge (u, v) are added together to $RT_i$.
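The search for a single sink can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the pseudocode of Fig. 6 or VPR's implementation: the RRG is a plain adjacency dictionary, every vertex has a fixed nonnegative cost, no A* term or congestion update is included, and only the vertices on the discovered path are added to the routing tree. The name `route_sink` and the example graph are hypothetical.

```python
import heapq


def route_sink(adj, cost, tree, sink):
    """Extend routing tree `tree` (a set of vertices) so that it reaches `sink`.

    adj:  dict mapping a vertex to an iterable of its successor vertices
    cost: dict mapping a vertex to the nonnegative cost of using it
    Returns the list of newly added vertices, ending at `sink`.
    """
    best = {v: 0.0 for v in tree}            # tree vertices start at zero cost
    pred = {v: None for v in tree}
    pq = [(0.0, v) for v in tree]
    heapq.heapify(pq)
    while pq:
        f_u, u = heapq.heappop(pq)           # lowest-cost vertex on the wavefront
        if f_u > best[u]:
            continue                         # stale queue entry, already improved
        if u == sink:
            break
        for v in adj.get(u, ()):
            g_uv = f_u + cost[v]             # path cost so far plus cost of v
            if g_uv < best.get(v, float("inf")):
                best[v], pred[v] = g_uv, u   # keep the cheaper predecessor
                heapq.heappush(pq, (g_uv, v))
    else:
        raise ValueError("sink is unreachable from the routing tree")
    path = []
    v = sink
    while v is not None and v not in tree:   # trace back until the tree is reached
        path.append(v)
        v = pred[v]
    path.reverse()
    tree.update(path)
    return path


# Hypothetical net with source "s" and sinks "t1" and "t2".
adj = {"s": ["a", "b"], "a": ["t1"], "b": ["t1", "t2"], "t1": [], "t2": []}
cost = {"s": 0, "a": 1, "b": 3, "t1": 1, "t2": 1}
tree = {"s"}
first_path = route_sink(adj, cost, tree, "t1")
second_path = route_sink(adj, cost, tree, "t2")
```

The second call starts from every vertex already in the tree, which is how a routing path grows into a routing tree one sink at a time.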

1) Cost Function: An important implementation detail is the cost computed for each vertex when it is inserted into PQ. Different PathFinder implementations use different cost functions [1], [13]–[15], with different objectives and strategies. Let v be the vertex, and u be a vertex adjacent to v that has already been added to $RT_i$; in other words, if the search selects v for inclusion in $RT_i$, it will include edge (u, v) as well. Let $f_u$ denote the cost of the path from source i to node u, and $c_v$ denote the cost of adding node v to the route. Then the cost of routing from the source to v is

$$g_{u,v} = f_u + c_v. \tag{1}$$

To accommodate an A* cost function, let $d_v^j$ be an estimate of the cost of completing the route from node v to sink $t_i^j$. Then the cost of the path from source i to sink $t_i^j$ along RRG edge (u, v) is

$$f_v = g_{u,v} + d_v^j. \tag{2}$$

A breadth-first search, i.e., a Lee-style maze expansion [12], then corresponds to the case where $d_v^j = 0$. Several modifications have been proposed to assign relative weights to the breadth-first and A* components of the cost function

$$f_v = g_{u,v} + \alpha d_v^j, \quad \alpha \geq 0 \quad \text{[13]; and} \tag{3}$$

$$f_v = (1 - \beta)\, g_{u,v} + \beta d_v^j, \quad 0 \leq \beta \leq 1 \quad \text{[14]}. \tag{4}$$
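The three forms of the search cost in (2)–(4) differ only in how the A* estimate is weighted. The short Python sketch below restates them as functions; the function names are hypothetical, and the numeric arguments are arbitrary values chosen only to show the limiting cases.

```python
def f_astar(g_uv, d_vj):
    """Eq. (2): path cost so far plus the estimated cost to the sink."""
    return g_uv + d_vj


def f_alpha(g_uv, d_vj, alpha):
    """Eq. (3): alpha = 0 gives breadth-first search, alpha = 1 gives Eq. (2)."""
    if alpha < 0:
        raise ValueError("alpha must be nonnegative")
    return g_uv + alpha * d_vj


def f_beta(g_uv, d_vj, beta):
    """Eq. (4): beta = 0 gives breadth-first search, beta = 1 ignores g_uv."""
    if not 0 <= beta <= 1:
        raise ValueError("beta must lie in [0, 1]")
    return (1 - beta) * g_uv + beta * d_vj
```

With $\beta = 0.5$, (4) is exactly half of (2), so the two order the priority queue identically; the weighting only changes the search when $\beta$ moves away from 0.5.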

When adding a new vertex u into $RT_i$, each neighbor v of u is processed and added to PQ, unless v already belongs to $RT_i$. It is possible that a different neighbor w of v is also part of $RT_i$, so v may already be in the priority queue with some cost function $f_v = g_{w,v} + c_v$.

In principle, it is now possible to add v to $RT_i$ either via edge (u, v) or (w, v). The best choice is the one that minimizes $f_v$. Therefore, the cost and predecessor of v in PQ are changed from w to u if $g_{u,v} < g_{w,v}$, or, equivalently, if $g_{u,v} + c_v$ is less than the current value of $f_v$.

Several different variants of the node cost function $c_v$ have also been proposed

$$c_v = (b_v + h_v)\, p_v \quad \text{[1]; and} \tag{5}$$

$$c_v = b_v h_v p_v \quad \text{[15, Eq. (4.3)]}^{1} \tag{6}$$

where $b_v$ is the base cost of v (typically its intrinsic delay), $h_v$ is the history cost of v, which depends on the number of nets that are routed through v during previous iterations, and $p_v$ is a penalty function associated with the number of nets

¹We ignore the BendCost(...) term from [15, Eq. (4.3)] because we are performing combined global-detailed routing.


routed through v in the current solution. PathFinder dynamically updates $h_v$ and $p_v$ accordingly as routing proceeds. According to [15], the advantage of (6) over (5) is that multiplying the $b_v$ and $h_v$ terms, rather than adding them, eliminates the need to normalize them; one possible drawback, not mentioned by [15], is that $b_v h_v > b_v + h_v$ for $b_v, h_v > 2$, so there is a greater chance of arithmetic overflow if both terms grow significantly as the algorithm iterates.

The difference between $h_v$ and $p_v$ is that $h_v$ permanently increases the cost of using v to ensure that routes through other vertices are attempted, while $p_v$ is based primarily on the current routing solution. Recall that PathFinder routes nets one at a time. Suppose that nets $N_1$ and $N_2$ are being routed in subscript order. The history cost could potentially dissuade PathFinder from routing both $N_1$ and $N_2$ through v during the current iteration, especially if v has a history of congestion. Now, supposing that PathFinder routes $N_1$ through v despite the value $h_v$, then increasing $p_v$ in response would dissuade PathFinder from routing $N_2$ through v, to increase the likelihood of converging to a legal solution.
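The two node cost variants in (5) and (6) can be compared directly. The Python sketch below is illustrative only; the function names and sample values are hypothetical. Python integers do not overflow, so the overflow remark applies to fixed-width arithmetic such as that used in a C implementation.

```python
def node_cost_sum(b_v, h_v, p_v):
    """Eq. (5): base and history costs are added, then scaled by the penalty."""
    return (b_v + h_v) * p_v


def node_cost_product(b_v, h_v, p_v):
    """Eq. (6): base and history costs are multiplied, then scaled by the penalty."""
    return b_v * h_v * p_v


# The two forms coincide at b_v = h_v = 2 and the product dominates above that.
equal_case = (node_cost_sum(2, 2, 1), node_cost_product(2, 2, 1))
larger_case = (node_cost_sum(3, 4, 2), node_cost_product(3, 4, 2))
```

For the sample values, `equal_case` is `(4, 4)` and `larger_case` is `(14, 24)`, consistent with the observation that the product form grows faster once both terms exceed 2.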

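A generalized form of the $c_v$ terms that favors delay minimization for source-sink pairs whose delay is expected to be near-critical is

$$c_v = \mathit{Crit}_{i,j}\, \mathit{delay}_v + (1 - \mathit{Crit}_{i,j})(b_v + h_v)\, p_v, \quad \text{or} \tag{7}$$

$$c_v = \mathit{Crit}_{i,j}\, \mathit{delay}_v + (1 - \mathit{Crit}_{i,j})\, b_v h_v p_v, \quad \text{such that} \tag{8}$$

$$\mathit{Crit}_{i,j} = 1 - \mathit{Slack}_{i,j} / D_{\max} \tag{9}$$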

where $\mathit{delay}_v$ is the intrinsic delay of RRG node v, $\mathit{Slack}_{i,j}$ is the estimated amount of delay that could be added to the source-sink path from i to j before it becomes critical, and $D_{\max}$ is the estimated critical path delay of the placed-and-routed circuit. In VPR's timing-driven router [15, Sec. 4.4], the $\mathit{delay}_v$ term is based on the Elmore delay model, which is derived from the existing routing tree $RT_i$, including the prospective path from i to v; additionally, the $\mathit{Crit}_{i,j}$ term is more complex; details are omitted to conserve space.

The original PathFinder paper did not describe precisely which functions are used for $h_v$ and $p_v$ [1]. In VPR, $p_v$ is reset and recomputed every time a routing tree is ripped up and rerouted, while $h_v$ is defined as a recurrence relation which varies from iteration to iteration of the global router.
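The blend between delay and congestion in (8) and (9) is easiest to see at its two extremes. The following Python sketch uses the simple criticality of (9); as noted above, VPR's actual criticality term is more complex, so this is an illustration of the formula as printed and not of VPR's code. The function names and values are hypothetical.

```python
def criticality(slack, d_max):
    """Eq. (9): 1 for a zero-slack connection, 0 when slack equals D_max."""
    return 1.0 - slack / d_max


def timing_driven_cost(crit, delay_v, b_v, h_v, p_v):
    """Eq. (8): criticality-weighted mix of delay and congestion cost."""
    return crit * delay_v + (1.0 - crit) * b_v * h_v * p_v


critical = timing_driven_cost(criticality(0.0, 10.0), 2.5, 2, 3, 4)
noncritical = timing_driven_cost(criticality(10.0, 10.0), 2.5, 2, 3, 4)
```

A fully critical connection sees only the delay term (2.5 here), while a connection with slack equal to $D_{\max}$ sees only the congestion term (24.0 here).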

Let $h_v^k$ denote the history cost of vertex v during the kth iteration of the global router; for the first iteration, $h_v^1 = 1$. The $p_v$ and $h_v^k$ terms are then defined as follows:

$$p_v = 1 + \mathit{occupancy}_v\, p_{fac} \quad \text{[15, Eq. (4.4)]}^{2}; \text{ and} \tag{10}$$

$$h_v^k = h_v^{k-1} + \mathit{occupancy}_v\, h_{fac}, \quad k > 1 \quad \text{[15, Eq. (4.5)]}^{2} \tag{11}$$

where $\mathit{occupancy}_v$ is the number of nets currently routed through RRG node v, and $p_{fac}$ and $h_{fac}$ are scaling factors. Betz et al. [15, Sec. 4.3.1] suggested that $p_{fac}$ should be at most 0.5 for the first iteration and then increased by a factor of 1.5× to 2× for subsequent iterations, and that $h_{fac}$ should remain constant, with any value between 0.2 and 1 sufficing.
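The recurrence in (11) and the penalty in (10) can be traced for a single vertex. The Python sketch below follows the equations as printed; the function names, the occupancy sequence, and the choice $h_{fac} = 0.5$ are hypothetical, and which occupancy sample feeds which iteration is an assumption of this sketch.

```python
def present_penalty(occupancy, p_fac):
    """Eq. (10): penalty grows linearly with the current occupancy."""
    return 1.0 + occupancy * p_fac


def history_trace(occupancies, h_fac):
    """Eq. (11): accumulate history cost, starting from h = 1 in iteration 1.

    occupancies[k] is taken to be the occupancy that feeds the update
    producing the history cost of the next iteration.
    """
    history = [1.0]
    for occupancy in occupancies:
        history.append(history[-1] + occupancy * h_fac)
    return history


trace = history_trace([2, 2, 1], h_fac=0.5)
```

The history cost never decreases, which matches the description of $h_v$ as a permanent cost increase, whereas `present_penalty` returns to 1.0 as soon as the occupancy drops to zero.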

The enhancements to PathFinder introduced in this paper, PPR and SERRGE, are compatible with any cost function

²In combined global-detailed routing, which is the approach taken by VPR, the capacity of each routing resource is 1, which allows us to eliminate the capacity(...) term from [15, Eqs. (4.4) and (4.5)].

(breadth-first or A* search) described in previous literature. We have implemented PPR and SERRGE in VPR 5.0, and all of our experimental results reported in Section V use VPR's timing-driven router [15, Sec. 4.4].

F. RRG Terminology

We use the term global RRG to refer to the representation of the FPGA's global (intercluster) routing resources. The VPR 5.0 (and earlier) PathFinder implementation performs routing on a global RRG, which does not explicitly represent any local (intracluster crossbar) routing resources.

We use the term local RRG to refer to the representation of the intracluster routing resources for one CLB; if the intracluster routing crossbar contains just one layer of internal multiplexers, the local RRG is bipartite: each vertex is either a CLB input pin (including local feedback arcs) or a BLE input pin; each edge connects a CLB input pin to a BLE input pin.

We use the term complete RRG to refer to the representation of all FPGA routing resources (intercluster and intracluster) in a single graph: a complete RRG combines the global RRG with a local RRG for each CLB in the FPGA.

III. BASELINE ROUTER

The Baseline router, described in this section, contains a minimalist set of algorithmic modifications to extend the PathFinder algorithm to support FPGAs with sparse intracluster routing crossbars. The Baseline router suffers from an enlarged memory footprint, which both SERRGE and PPR, described in the subsequent sections, overcome.

A. Expanded RRG and CLB Input Pin Equivalence

The Baseline router performs routing on a complete RRG, which explicitly represents both intercluster and intracluster routing resources, as discussed in Section II-F. PathFinder now routes nets to BLE input pins, rather than CLB input pins, as shown in Fig. 7, and delineates which BLE is the sink of each net. The input pins of each BLE are logically equivalent.

If the intracluster routing crossbar is fully populated, then all CLB input pins are logically equivalent and do not require explicit representation in the CLB. A legal route is obtained by routing all nets from their respective sources to any input pin of a CLB that contains the sink; a full crossbar guarantees a direct connection from each CLB input pin to the sink.

When the intracluster routing crossbar becomes sparse, CLB input pins are not logically equivalent; whatever equivalency exists depends on the crossbar topology. Let $S_j$ contain the CLB input pins that can be routed to at least one input of BLE j. In general, each CLB input pin may belong to several such sets.

For example, consider Fig. 7: all CLB inputs, except for the feedback emanating from the top BLE, connect to at least one input of both LUTs; all of them belong to subsets $S_1$ and $S_2$; the feedback output belongs to subset $S_1$, but not $S_2$.
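The sets $S_j$ follow directly from the crossbar connectivity. The Python sketch below computes them for a small, hypothetical two-BLE crossbar; the pin names and topology are invented for illustration and are not the topology drawn in Fig. 7, although the outcome has the same shape as the example above (a feedback pin that belongs to $S_1$ only).

```python
def reachable_input_sets(crossbar, ble_of_input):
    """Compute S_j for every BLE j.

    crossbar:     dict mapping a CLB input pin to the set of BLE input
                  pins it can drive through the intracluster crossbar
    ble_of_input: dict mapping a BLE input pin to its BLE index j
    Returns a dict mapping j to the set of CLB input pins in S_j.
    """
    sets = {j: set() for j in ble_of_input.values()}
    for clb_pin, ble_inputs in crossbar.items():
        for ble_input in ble_inputs:
            sets[ble_of_input[ble_input]].add(clb_pin)
    return sets


# Hypothetical sparse crossbar: two BLEs with two inputs each.
ble_of_input = {"b1.a": 1, "b1.b": 1, "b2.a": 2, "b2.b": 2}
crossbar = {
    "in0": {"b1.a", "b2.a"},
    "in1": {"b1.b", "b2.b"},
    "fb1": {"b1.a"},          # local feedback that reaches BLE 1 only
}
S = reachable_input_sets(crossbar, ble_of_input)
```

In this example `S[1]` contains all three pins, while `S[2]` omits the feedback pin `fb1`.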

B. Wire-to-Pin Lookup Map

An important data structure that complements the RRG in VPR is a wire-to-pin lookup map that identifies connections between the channel wires and CLB input pins. In VPR 5.0 and earlier, the lookup map is a 4-D array, as shown in Fig. 8(a).

Fig. 7. In the baseline router, the RRG is extended to include LUT inputs (each of which forms a unique equivalence class) and the sparse intracluster routing crossbar; this expansion is performed for every CLB in the FPGA.

Since the intracluster routing crossbar is fully populated, there is no need to extend the map into the CLB.

When the intracluster routing crossbar becomes sparse, the router needs to know whether a wire in the routing channel has a connection to each LUT input, which goes through both the C block (as before) as well as the intracluster routing crossbar. To accommodate this information, two extra dimensions must be added to the wire-to-pin lookup map, as shown in Fig. 8(b).

Although needed for correctness, the memory overhead of these two extra dimensions is significant. For example, consider an FPGA with parameters K = 10, N = 6, I = 33, W = 100, $F_{cin}$ = 15%, and $F_{cout}$ = 10%. In VPR 5.0, the intracluster routing crossbar implicitly has population density p = 100%, and a matrix of 33 × 4 × 15 = 1980 entries is allocated (assuming height = 1). For a sparse crossbar with population density of p = 50%, the matrix size expands to 1980 × 50 × 60 = 5 940 000 entries. The increase in cost is significant, especially because the wire-to-pin map is relatively sparse, so most of the allocated entries are unused.
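The arithmetic in this example can be reproduced in a few lines of Python. The dimension sizes are taken as quoted in the text; the labels attached to them in the comments (pins, sides, tracks per pin) are a plausible reading and should be treated as assumptions, and the two extra dimension sizes are used exactly as printed without further interpretation.

```python
from math import prod

I, W, fc_in = 33, 100, 0.15
sides = 4                                 # assumed: one entry per side of the CLB
tracks_per_pin = round(W * fc_in)         # assumed: W x Fcin = 15 wires per input pin

base_dims = (I, sides, tracks_per_pin)    # map in VPR 5.0 with height = 1
extended_dims = base_dims + (50, 60)      # two extra dimensions, sizes as quoted

base_entries = prod(base_dims)
extended_entries = prod(extended_dims)
growth = extended_entries // base_entries
```

Under these figures the extended map holds 3000 times as many entries as the original one, which is why the text singles it out as a main contributor to the memory footprint.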

C. Memory Footprint

Empirically, we observed that the baseline router has an excessively large memory footprint. The two main causes are the expanded wire-to-pin lookup map, as described in the preceding section, and the Elmore delay trees computed by VPR's timing-driven router [15, Sec. 4.4], which are used to accurately estimate the delay term used in the cost function $c_v$ shown in (8) and (9); VPR's routability-driven router [13], [15, Sec. 4.3] does not use Elmore delay modeling and sidesteps this overhead. Other relevant data structures (e.g., the expanded RRG, priority queue, traceback list, etc.) become larger, but do not significantly impact the memory footprint.

VPR's timing-driven router builds an Elmore delay tree for each vertex as it is discovered during maze expansion. If the signal router is presently routing net $N_i$, a tree is computed for each vertex that is discovered during the search. The tree is then saved when the vertex is inserted into the priority queue; many of these vertices are never removed from the priority queue during the search, and even fewer are added to the routing tree. The contribution of Elmore delay trees to the memory footprint varies from iteration to iteration.

The impact of the memory footprint on performance depends on the target FPGA size, the placement solution, and the amount of memory available on the system that computes the route. The operating system's memory management policies and background applications and services also

Fig. 8. (a) Pseudocode to allocate and initialize the 4-D wire-to-pin lookup map array in VPR 5.0 (and earlier), under the assumption that the intracluster routing crossbar was fully populated. (b) Pseudocode to allocate and initialize a 6-D wire-to-pin lookup map array, which has been extended to support sparse intracluster routing crossbars, which do not guarantee a connection between each CLB input pin and each LUT input. The map entries that are set to USED (rather than OPEN), i.e., connections that actually exist, are derived from the FPGA routing architecture.

affect the amount of memory made available to the router. To manage this overhead, we set a limit on the memory size of all of the routing resources; in practice, the choice of limit depends on the system configuration and memory demands of the operating system and other persistent applications. When a PathFinder iteration exceeds the memory limit, the baseline router treats that iteration as a failure: it deallocates all data structures and propagates any history cost updates from the failed iteration to the next iteration. The Baseline router does not otherwise modify PathFinder's core algorithmic behavior.

IV. ROUTING WITH SELECTIVE RRG EXPANSION

SERRGE refers to a collection of modifications to the baseline router, which further reduce the memory footprint, yielding significantly faster runtimes. SERRGE features a custom memory manager and garbage collector that are specific to the RRG and other associated data structures used by VPR's implementation of PathFinder.

A. Dynamic RRG

SERRGE begins with a global RRG $G = (V, E)$ and one copy of a local RRG $G_L = (V_L, E_L)$ as a representative of each CLB. When routing each net $N_i$, PathFinder's maze expansion first finds a path in the global RRG from the source $s_i$ to an input pin j of a CLB that contains one of $N_i$'s sinks. SERRGE then refers to the local RRG and identifies the CLB input pin j′ in $G_L$ corresponding to j. Let $v_L(j')$ and $e_L(j')$ denote the sets of neighboring vertices and incident edges in the fanout of j′ in $G_L$. SERRGE expands the global RRG according to the following two rules: 1) for each vertex $v' \in v_L(j')$, add a new vertex v to V and 2) for each edge $e' = (j', v') \in e_L(j')$, allocate a new edge $e = (j, v)$ to E.

With the newly expanded vertices and edges, PathFinder can now complete the route to the sink in the BLE, which will be one of the newly added vertices to V. The costs associated with each newly allocated vertex and edge are initialized and updated appropriately; nets routed during the current and subsequent PathFinder iterations may negotiate to use these newly allocated routing resources.

This dynamic RRG is a supergraph of the global RRG and a subgraph of the complete RRG, by construction. In the worst case, the dynamic RRG will grow until it becomes the complete RRG, but this is impractical. The intuition behind this approach is that the baseline router preemptively allocates portions of the RRG that are never expanded; SERRGE, in contrast, dynamically allocates the portions of the local RRGs that PathFinder explores, on demand.
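The two expansion rules can be sketched with a small adjacency-set structure. This Python illustration is not SERRGE's memory manager: the class name `DynamicRRG`, the tuple encoding `(clb, pin)` for per-cluster vertices, and the example graph are all hypothetical, and costs and auxiliary data (Elmore delay trees, traceback information) are omitted.

```python
class DynamicRRG:
    """Global RRG plus one shared local template, expanded on demand."""

    def __init__(self, global_adj, local_fanout):
        # global_adj:   vertex -> set of successor vertices in the global RRG
        # local_fanout: local CLB input pin j' -> set of local vertices v'
        self.adj = {u: set(vs) for u, vs in global_adj.items()}
        self.local_fanout = local_fanout
        self.dynamic = set()              # vertices allocated on demand

    def expand(self, clb, pin):
        """Allocate the fanout of input `pin` of cluster `clb`; return new vertices."""
        j = (clb, pin)
        self.adj.setdefault(j, set())
        new_vertices = []
        for v_local in sorted(self.local_fanout[pin]):
            v = (clb, v_local)
            if v not in self.adj:         # rule 1: add a new vertex v to V
                self.adj[v] = set()
                self.dynamic.add(v)
                new_vertices.append(v)
            self.adj[j].add(v)            # rule 2: add edge (j, v) to E
        return new_vertices


# Hypothetical fragment: one channel wire reaching input "in0" of cluster "clb3".
global_adj = {"wire0": {("clb3", "in0")}, ("clb3", "in0"): set()}
local_fanout = {"in0": {"ble0.a", "ble1.a"}, "in1": {"ble0.b"}}
rrg = DynamicRRG(global_adj, local_fanout)
first = rrg.expand("clb3", "in0")
second = rrg.expand("clb3", "in0")
```

Only the fanout of the pin that the search actually reached is allocated; the fanout of `in1` stays unallocated, and repeating the expansion allocates nothing new.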

B. Garbage Collection

To limit the memory footprint of the dynamic RRG (andother data structures that are proportional in size), SERRGEincludes a dynamic garbage collector. If the dynamic RRGgrows more than 30% larger than the global RRG, then thegarbage collector deletes all presently unused vertices andedges that were dynamically allocated; this includes all aux-iliary data structures associated with each vertex and edge,including Elmore delay trees (see Section IV-E), tracebackinformation, etc. In other words, RRG growth is used asa proxy for the growth of a much larger set of data struc-tures whose collective memory requirements greatly exceedthat of the RRG (i.e., just the vertices and edges) in isolation.

The garbage collector never deallocates vertices and edgesthat belong to the global RRG. Within each CLB, the garbagecollector identifies candidates for deletion by comparing thenumber of LUT input pins that have been reached thus far withthe number of nets that have sinks in each LUT. If the num-ber is equal, then all RRG resources that are incident on theLUT inputs are deallocated; this ensures that routes computedpreviously during this iteration can be recovered if PathFindersuccessfully converges. Otherwise, the routing resources areleft in-place under the assumption that at least one future netmay use them when searching for its sink.

The garbage collector does not consider history costs when deleting RRG vertices; all costs associated with a deleted vertex are lost, and are reset to zero if the vertex is later reallocated. This may alter the way that PathFinder negotiates under SERRGE, and could yield a different routing result compared to the baseline router (presuming that the latter does not incur failed iterations due to exceeding the memory limit).

The lost history costs are restricted to vertices and edges that represent the final link connecting a routed net to its sink. Even if the history cost is deleted, a net that routes through a reallocated resource will increase the penalty cost, which would serve to dissuade subsequent nets from using those resources during the current PathFinder iteration.

Fig. 9. Illustration of the basic behavior of SERRGE; PathFinder maintains an RRG corresponding to global FPGA routing resources, while selectively expanding the RRG to include a subset of nets that may be used inside of an intracluster routing crossbar. When routing a net, the first step is to compute a route from the source s to an input of the CLB that contains the sink t. (a) Portion of the RRG corresponding to the CLB’s intracluster routing crossbar is initially not allocated. (b) Fanout of the CLB input found by the global router is allocated and added to the RRG. (c) In this example, the route is completed to the target LUT. (d) Later on, the garbage collector may reclaim CLB routing resources that have been allocated, but were not used.

C. Example

Fig. 9 illustrates the preceding discussion. PathFinder first searches the global RRG (not shown) from source s to a CLB that contains sink t. When the search reaches the CLB input pin, the portion of the RRG corresponding to the CLB has not yet been allocated [gray in Fig. 9(a)]. SERRGE consults the local RRG to expand the dynamic RRG to fan out from the CLB input pin, as shown in Fig. 9(b). In Fig. 9(c), the local route completes using a subset of the newly allocated routing resources. In Fig. 9(d), the garbage collector reclaims unused routing resources that SERRGE expanded, but did not use.

D. Compressed Wire-to-Pin Lookup Map

To further reduce the memory footprint of SERRGE, the extended wire-to-pin lookup map (Section III-B) is converted to a 1-D array that exclusively represents routing resources that could possibly be used by the netlist being routed. For example, connections to BLEs within a CLB that are not used (as determined by the placer/packer) are omitted; likewise, CLB I/O pins that interface exclusively with unused BLEs, and CLB sides where all pins connect to unused BLEs, are omitted from the lookup map as well. Fig. 10 provides pseudocode for the map initialization process.
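As a rough illustration of the filtering idea (not a transcription of the pseudocode in Fig. 10), the sketch below flattens one CLB’s pin connectivity into a 1-D list and keeps only pins that reach at least one used BLE. The function name and both argument layouts are assumptions made for this example.

```python
def build_pin_map(pin_to_bles, used_bles):
    """Flatten a CLB's pin connectivity into a 1-D list of usable pins.

    pin_to_bles: CLB pin name -> list of BLEs that the pin can reach
    used_bles:   set of BLEs occupied after packing/placement
    """
    flat = []
    for pin, bles in pin_to_bles.items():
        # Keep a pin only if it reaches at least one BLE that the netlist uses.
        if any(ble in used_bles for ble in bles):
            flat.append(pin)
    return flat
```

Because the result depends on `used_bles`, the map differs from CLB to CLB for the same architecture, which is the property that distinguishes it from a purely architectural lookup table.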

E. Elmore Delay Trees

VPR’s timing-driven router builds an Elmore delay tree for each vertex discovered during maze expansion (Section III-C), which is saved throughout the search. This increases the router’s memory footprint and severely impacts performance.

Fig. 10. Pseudocode to initialize the 1-D wire-to-pin map. Unlike the maps in Fig. 8, the map entries that are marked as being USED (rather than OPEN) depend both on the FPGA architecture and the packing/placement result, and vary from CLB to CLB.

VPR’s timing-driven router computes the Elmore delay tree for each vertex v when v is discovered during the search. It uses the tree to compute the term $\mathit{delay}_v$ that contributes to the vertex’s cost term $c_v$, as per (7) and (8), which, in turn, contributes to the cost function $f_v$ in (1)–(4), i.e., the priority of v when it is inserted into the priority queue. The baseline router saves the Elmore delay tree for v, so that $\mathit{delay}_v$ can be updated quickly when necessary (Fig. 6, lines 13–16), and for quick access when and if the search removes v from the priority queue (i.e., v has the highest priority among all enqueued vertices).

In contrast, SERRGE discards the Elmore delay tree after $\mathit{delay}_v$ is computed, and recomputes the tree on demand, when necessary. The performance benefits accrued by the reduced memory footprint outweigh the overhead of recomputing the Elmore delay trees on the fly. When and if the router completes successfully, all of the Elmore delay trees are recomputed at the very end, in order to facilitate post-route timing analysis based on the Elmore delay model.
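To give a sense of what is recomputed, the sketch below evaluates the standard Elmore delay on an RC tree: each edge contributes its resistance multiplied by the total capacitance downstream of it. This is the textbook model, not VPR’s data structure; the function name and the dictionary-based tree encoding are assumptions for illustration.

```python
def elmore_delays(parent, resistance, capacitance):
    """Elmore delay from the root to every node of an RC tree.

    parent:      node -> parent node (the root maps to None)
    resistance:  node -> resistance of the edge from its parent (root: driver)
    capacitance: node -> capacitance lumped at the node
    """
    children = {node: [] for node in parent}
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    # Top-down visiting order: every parent appears before its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # Accumulate subtree capacitance bottom-up.
    subtree_cap = dict(capacitance)
    for node in reversed(order):
        if parent[node] is not None:
            subtree_cap[parent[node]] += subtree_cap[node]

    # Each edge adds R_edge * (capacitance downstream of that edge).
    delay = {}
    for node in order:
        upstream = delay[parent[node]] if parent[node] is not None else 0.0
        delay[node] = upstream + resistance[node] * subtree_cap[node]
    return delay
```

The computation is linear in the size of the tree, which is consistent with the observation above that recomputing on demand can be cheaper overall than keeping every tree resident in memory.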

F. Memory Limit

Similar to the baseline router, SERRGE sets a limit on the memory consumption per iteration, for all of the routing resources. If this memory limit is exceeded, then the iteration fails, and any adjusted history costs are propagated to the next iteration. Due to the compressed wire-to-pin lookup map and memory-efficient approach to computing the Elmore delays, SERRGE exceeds the memory limit less frequently than the baseline router; despite these efficiencies, SERRGE cannot guarantee that it will always stay within the memory limit, especially when routing large netlists on large FPGAs.

Fig. 11. (a) PPR starts by routing each intracluster routing crossbar individually, finding a set of disjoint paths from the source to the inputs of all BLEs that are used. (b) CLB inputs that are connected to inputs of the same BLEs form equivalence classes that act as new sinks for the global router.

V. ROUTING WITH PARTIAL PREROUTING

PPR starts by locally routing each CLB (having at least one used BLE) by executing PathFinder on the local RRG, as shown in Fig. 11(a). Fig. 11(b) illustrates one of many possible local routing solutions. PPR is then followed by a global routing step that completes each route (i.e., from each source to an appropriate CLB input) using the global RRG. By using one global and one local RRG, PPR avoids the large memory footprint of the baseline router, and the complications associated with a dynamic RRG and garbage collection required by SERRGE.

A. Algorithm

The local RRG for each CLB is bipartite. Local routing involves a subset of the nets in the complete routing problem for the entire FPGA, and involves computation of a partial path for each net. To model the local routing problem, a super-source $s^*$ is allocated and connected to each CLB input.

Let $N^*$ be the set of nets having at least one sink in the CLB. For net $N_i = (s_i, T_i) \in N^*$, let $T_i' \subseteq T_i$ be the subset of sinks in the CLB. If $s_i$ is a source in the CLB, then net $N_i' = (s_i, T_i')$ is added to the local routing problem instance; otherwise, net $N_i' = (s^*, T_i')$ is added to the local routing problem instance. PathFinder then computes the local routes.

If $|T_i'| = 1$ for every net $N_i'$ in the local routing problem instance, then it can be simplified to bipartite matching, which can be solved optimally in polynomial time using a network flow algorithm. We did not implement this option because PathFinder converged quickly enough in practice.
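The bipartite matching special case, which the authors note they did not implement, can be sketched with a simple augmenting-path search (a unit-capacity special case of network flow). The sketch assumes that every net is driven from the super-source and has exactly one sink, so each sink needs its own crossbar input; the function name and input format are hypothetical.

```python
def match_sinks_to_inputs(reachable):
    """Assign each BLE input (sink) a distinct crossbar input.

    reachable: sink -> list of crossbar inputs that can drive it
    Returns a dict sink -> input, or None if no complete assignment exists.
    """
    owner = {}  # crossbar input -> sink currently assigned to it

    def try_assign(sink, visited):
        for inp in reachable[sink]:
            if inp in visited:
                continue
            visited.add(inp)
            # Take a free input, or move its current owner to another input.
            if inp not in owner or try_assign(owner[inp], visited):
                owner[inp] = sink
                return True
        return False

    for sink in reachable:
        if not try_assign(sink, set()):
            return None
    return {sink: inp for inp, sink in owner.items()}
```

A `None` result corresponds to a CLB whose single-sink nets cannot all be routed through the crossbar with disjoint inputs.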

To solve the local routing problem for each CLB, PPR computes all intracluster routes required for each net. A global routing problem instance using the global RRG then computes the intercluster routes, subject to the constraints imposed by the local routing result. The constraints can be expressed as CLB input equivalence classes. As shown in Fig. 11(b), the local routing solution computed two CLB inputs that route to BLE t1. As a consequence, these two CLB inputs become logically equivalent sinks in the global routing problem. Specifically, they become the targets for the two respective nets for which BLE t1 was the original target.

In our experiments, PPR consumed far less memory than either the baseline router or SERRGE, and did not suffer from memory-related performance issues. Consequently, we did not include SERRGE’s memory optimizations for the wire-to-pin lookup maps or Elmore delay trees in PPR; however, they remain fully compatible, in principle, with PPR.

Fig. 12. (a) FPGA routing algorithms that take a global view can route to any one of multiple CLB input pins, allowing for shorter overall routes. (b) PPR precomputes the target CLB input pin for each net, without considering global implications, potentially leading to longer routes.

B. Limitations

The prerouting phase of PPR lacks a global view; thus, PPR may yield suboptimal global routes. Fig. 12 shows an example. In Fig. 12(a), CLB1 needs to route to its neighbor, CLB2, which has two input pins available, in1 (facing CLB1) and in2 (on the opposite side). PathFinder (in its baseline or SERRGE configurations) could route to either in1 or in2, and, absent congestion, could find a short route to in1 in Fig. 12(a). In Fig. 12(b), PPR, being oblivious to global considerations, may produce a partial route within CLB2 using in2. In this case, the global routing phase has no choice other than to route from CLB1 to in2 on the opposite side of CLB2. This example demonstrates one way that PPR sacrifices optimality for efficient runtime.

VI. EXPERIMENTAL SETUP AND METHODOLOGY

A. Experimental Platform

We implemented the routing algorithms in VPR 5.0 [7], which was the most up-to-date version of VPR when we started this project. VPR 5.0 did not support sparse intracluster routing crossbars. The Beta release of VPR 6.0 [11] featured sparse intracluster routing, but did not include a timing-driven router; that feature was added to the official release of VPR 6.0 early in 2012, when the implementation work outlined here was mostly complete. VPR 6.0’s router shares many principal similarities with PPR (Section V).

We used a tool described by Lemieux et al. [2] to generate routable sparse crossbars with a user-specified population density. We used ABC [16] for logic synthesis and technology mapping, T-VPack for packing, and VPR 5.0 for placement and (timing-driven) routing.

All experiments reported here were performed on an Apple iMac featuring a 2.66 GHz Intel Core i5 with 4 GB of DDR3 memory, running OS X 10.9.2. Around 3 GB of memory were taken by the OS, leaving 1 GB for user space programs.

B. Sparse Intracluster Routing Crossbar Modeling

We model the intracluster routing as a 2-D binary matrix B, with I + N columns and KN rows. Each column corresponds to an input (a CLB input pin or a local feedback from a BLE in the cluster), and each row corresponds to a BLE input. B(i, j) = 1 if a signal can route from input i to output j, and 0 otherwise. It is important to note that B simply models the input-to-output connectivity of the crossbar, but does not model its internal architecture.

As an example, we model a CLB with N = 2, K = 2 (e.g., it contains two 2-LUTs); the four BLE inputs are denoted b00, b01, b10, and b11. The CLB has three input pins, I0, I1, and I2, and two local feedbacks from the BLEs, O0 and O1.

An example of the matrix representation (for one input–output connectivity topology) is

$$
B \;=\;
\begin{array}{c|ccccc}
 & I_0 & I_1 & I_2 & O_0 & O_1 \\ \hline
b_{00} & 1 & 1 & 0 & 0 & 0 \\
b_{01} & 1 & 1 & 0 & 0 & 0 \\
b_{10} & 0 & 1 & 1 & 0 & 0 \\
b_{11} & 0 & 1 & 1 & 0 & 0
\end{array}
$$

In this example, there is a connection from CLB input pin I0 to LUT input pins b00 and b01, but not b10 and b11; also, the local feedbacks are not used. The tool that we used to generate routable sparse crossbars [2] creates a matrix B such that each row, each column, and the entire matrix have a population density approximately equal to a user-specified parameter p, that is

$$\sum_{j} B(i, j) = p(I + N) \pm 1 \tag{12}$$

$$\sum_{i} B(i, j) = pKN \pm 1, \text{ and} \tag{13}$$

$$\sum_{i,j} B(i, j) = pKN(I + N) \pm 1. \tag{14}$$
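As a quick numerical check of the population density idea, the sketch below computes row sums, column sums, and the overall density of the small example matrix given earlier (rows are BLE inputs, columns are the inputs I0, I1, I2, O0, O1). The helper name `densities` is an assumption for this example.

```python
# Example connectivity matrix: rows b00, b01, b10, b11; columns I0, I1, I2, O0, O1.
B = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
]


def densities(matrix):
    """Return (row sums, column sums, overall population density)."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    total = sum(row_sums)
    overall = total / (len(matrix) * len(matrix[0]))
    return row_sums, col_sums, overall


row_sums, col_sums, overall = densities(B)
```

Each row of the example has two of five entries populated and the overall density is 0.4, whereas its column sums are uneven; the example should therefore be read as an illustration of the representation rather than as typical output of the crossbar generator.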

C. Experimental Parameters

We modeled an FPGA using an architecture configuration file from the iFAR repository [17], [18] based on the 65-nm Berkeley Predictive Technology Model. Table I lists the architectural parameters that we used. We considered intracluster routing crossbars with population densities p = 40%, 50%, and 75%.

VPR repeatedly routes each benchmark using a binary search to identify the smallest channel width, Wmin, for which a legal route can be found. VPR also allows the user to specify a chosen channel width (W), and then tries to find a legal route, but may fail. Different routing algorithms may yield different Wmin values for a given FPGA architecture, benchmark, and placement/packing result; for each, we ran baseline, PPR, and SERRGE and computed their respective Wmin values, the largest of which we denote as Max(Wmin). We generate an FPGA with channel width W = 1.4 × Max(Wmin). We then reroute each benchmark using all three algorithms on this FPGA and present those results. This prevents architectural differences due to varying Wmin values from skewing the experiments.
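The channel width methodology above can be sketched as follows. This is a simplified illustration rather than VPR’s actual search: `is_routable` is a hypothetical callback standing in for a full routing attempt, the search bounds are arbitrary, and the sketch assumes that routability is monotonic in the channel width, which a heuristic router does not strictly guarantee.

```python
def find_w_min(is_routable, lo=1, hi=256):
    """Smallest channel width in [lo, hi] for which is_routable(width) is True.

    Returns None if even the widest channel cannot be routed.
    """
    if not is_routable(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_routable(mid):
            hi = mid        # mid works, so the minimum is mid or smaller
        else:
            lo = mid + 1    # mid fails, so the minimum is larger
    return lo


def experiment_width(w_mins, margin=1.4):
    """Channel width for the final comparison: margin times the largest Wmin."""
    return margin * max(w_mins)
```

For example, if the three routers reported Wmin values of 40, 44, and 50, `experiment_width([40, 44, 50])` evaluates to approximately 70; whether and how the result is rounded to a legal channel width is not specified in the text and is left out of the sketch.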

PathFinder terminates after a user-specified number of iterations. We set the maximum number of iterations allowed to 300; if PathFinder cannot find a successful route after 300 iterations, then it fails. Since FPGA routing is NP-complete, PathFinder is not guaranteed to find a legal routing solution, even if one exists.

TABLE I. FPGA ARCHITECTURE PARAMETERS

We report the size of the RRG, wire-to-pin lookup maps, and Elmore delay trees for each routing algorithm. The wire-to-pin lookup maps are allocated once and remain static throughout routing; the Elmore delay trees grow and shrink dynamically. The RRG is static under PPR and the baseline router, and dynamic under SERRGE. We measure the memory requirement of the static data structures once and profile the size of the dynamic data structures after each dynamic allocation that increases their size. We report the peak memory consumption of these data structures for each benchmark and architecture during the runtime of the routers.

D. Timing and Area Models

Our timing model was similar to that of VPR 5.0. We added models to account for delays inside of the CLBs. The timing graph is generated such that every CLB or LUT input pin becomes a timing node. Timing edges represent connectivity between pins, and delays are marked on edges, not nodes.

The area model counts the number of minimum-width transistor areas required to place and route a circuit in VPR; we did not modify VPR’s counting method. We added extensions to account for the intracluster routing crossbar area, which depends on its population density.

We employed the basic techniques that were used in VPR to estimate the silicon area occupied by each multiplexer and wire in the CLB. We assume that a minimum-width transistor takes one unit of area. A double-width transistor takes twice the diffusion width, but the same spacing, so we assume it takes 1.5× the area of a minimum-width transistor.

Buffer sizes are calculated based on the drive strength requirements and depend on the fan-out of the buffer. VPR uses 4× the minimum size, which we have adopted for general buffers. We sized the CLB input buffers using the approach used by Lemieux and Lewis [3], where the drive strength is at least 7× and at most 25× the minimum size.

We model an FPGA with single-driver wires; each wire segment begins with a multiplexer followed by a driver. We attempt to judiciously select the multiplexer size depending on the number of inputs. One-level multiplexers are used when there are four or fewer inputs, and more levels are used when the number of multiplexer inputs increases.

E. Benchmarks

We selected ten of the largest International Workshop on Logic and Synthesis (IWLS) benchmarks [19] for use in our experiments; Table II summarizes them.

VPR generates a custom FPGA that is sized for each benchmark. The second column of Table II lists the dimensions of the FPGA generated for each benchmark (e.g., an M×M array of CLBs). The third and fourth columns list the number of nets and CLBs used in each benchmark for an FPGA architecture using the parameters listed in Table I.

TABLE II. BENCHMARK OVERVIEW

Some of the IWLS benchmarks are I/O bound, rather than logic bound. In these cases, the number of I/Os per physical pin on the perimeter of the FPGA dictates the dimensions. When this occurs, VPR generates an FPGA with far more LUTs/CLBs than are necessary to realize each benchmark, and LUT/CLB utilization is relatively low as a result.

For each benchmark and FPGA, we generate ten placements by varying the random number seed used in VPR’s simulated annealing-based placer. For each placement, we then route the benchmark using PPR, SERRGE, and the baseline router. For each data point (benchmark/FPGA/router), the results reported are the averages over the ten placements.

VII. EXPERIMENTAL RESULTS

A. Wmin and Routability

Fig. 13 reports the Wmin values obtained by routing each benchmark/FPGA combination using PPR, SERRGE, and the baseline router. Surprisingly, PPR yields the lowest overall Wmin values across all benchmark/architectures, with the exception of des_area for the FPGA with intracluster routing crossbar population density p = 75%.

These results indicate that PPR is more likely than SERRGE or the baseline router to find a legal routing result. When considering Wmin as a proxy for routability, it is important to note that our experiments use VPR’s timing-driven router; experiments have been published which demonstrate that VPR’s routability-driven router, which does not employ the Elmore delay model, tends to yield lower Wmin values than the timing-driven router [15, Table 4.8].

B. Critical Path Delay

Fig. 14 reports the critical path delays obtained by routing each benchmark/FPGA combination using PPR, SERRGE, and the baseline router. PPR is competitive with SERRGE and the baseline router in many cases; the biggest disparity, expressed as the corresponding clock frequency, is 3.43 MHz (143.88 versus 140.45 MHz) for the pci_bridge32 benchmark for the FPGA with intracluster routing crossbar population density p = 40%.
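Because the disparity is quoted in MHz while Fig. 14 plots delay in ns, a small conversion may help relate the two. The sketch below only applies the identity period = 1 / frequency to the two frequencies quoted above; the helper name is an assumption.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


# Difference in critical path delay implied by the two quoted frequencies.
gap_ns = period_ns(140.45) - period_ns(143.88)
```

The two frequencies correspond to periods of roughly 7.12 and 6.95 ns, so the 3.43 MHz gap amounts to about 0.17 ns of critical path delay.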

PPR’s prerouting phase does constrain the search space for negotiation, which accounts for the cases where SERRGE and the baseline router achieve lower critical path delays. SERRGE and the baseline router permit PathFinder to negotiate for routes at the CLB inputs and within the intracluster routing crossbar, facilitating discovery of faster routes.

Fig. 13. Wmin for the ten largest IWLS benchmarks (Table II) placed-and-routed on an FPGA with parameters specified in Table I. Results are reported for devices with intracluster routing crossbar population densities of p = 75% (top), 50% (middle), and 40% (bottom).

Fig. 14. Critical path delay (ns) for the ten largest IWLS 2005 benchmarks (Table II) placed-and-routed on an FPGA with parameters specified in Table I. Results are reported for devices with intracluster routing crossbar population densities of p = 75% (top), 50% (middle), and 40% (bottom).

C. Runtime and the Number of PathFinder Iterations

Fig. 15 reports the runtimes of PPR, SERRGE, and the baseline router for all benchmark/FPGA combinations. PPR is uniformly the fastest, followed by SERRGE and then the baseline router. All three routing algorithms tend toward faster convergence at lower intracluster routing crossbar population densities.

Fig. 15. Runtime (seconds) for the ten largest IWLS benchmarks (Table II) placed-and-routed on an FPGA with parameters specified in Table I. Results are reported for devices with intracluster routing crossbar population densities of p = 75% (top), 50% (middle), and 40% (bottom).

Fig. 16. Number of PathFinder iterations for the ten largest IWLS benchmarks (Table II) placed-and-routed on an FPGA with parameters specified in Table I. Results are reported for devices with intracluster routing crossbar population densities of p = 75% (top), 50% (middle), and 40% (bottom).


Fig. 17. Peak memory consumption of the ten largest IWLS benchmarks (Table II) placed-and-routed on an FPGA with parameters specified in Table I. Results are reported for devices with intracluster routing crossbar population densities of p = 75% (top), 50% (middle), and 40% (bottom).

Fig. 16 reports the number of PathFinder iterations required for PPR, SERRGE, and the baseline router to converge for each benchmark/FPGA combination. With one exception (wb_conmax for an FPGA with intracluster routing crossbar population density p = 50%), PPR requires the fewest iterations, followed by SERRGE, and then the baseline router. Reducing the intracluster routing crossbar population density marginally reduces the number of iterations.

PPR requires far fewer iterations to converge than SERRGE or the baseline router. This is due primarily to two factors: 1) PPR’s restricted search space and 2) PPR’s more efficient usage of memory, which limits the number of iterations that fail due to exceeding the memory limit. These factors correlate directly to the reduced runtimes reported in Fig. 15.

D. Memory Consumption

Fig. 17 reports the peak memory consumption of PPR, SERRGE, and the baseline router for each benchmark/FPGA combination. The Elmore delay trees, which are specific to VPR’s timing-driven router [15, Sec. 4.4], consume more than twice as much memory as the wire-to-pin lookup maps and RRG combined, and the wire-to-pin lookup maps consume significantly more memory than the RRG.

SERRGE offers a marginal improvement in peak memory consumption compared to the baseline router, due primarily to its compressed wire-to-pin lookup map and recomputation, rather than storage, of the Elmore delay trees. PPR consumes far less memory than either SERRGE or the baseline router, because the global RRG, which is smaller than SERRGE’s dynamic RRG at peak memory consumption and the baseline router’s complete RRG, has fewer vertices and edges, and thus stores fewer Elmore delay trees. Also, PPR’s wire-to-pin lookup maps stop at the CLB inputs, while SERRGE and the baseline router must route all the way to BLE input pins. As shown in Fig. 17, memory savings for the largest benchmarks can be as high as hundreds of MBs (e.g., for pci_bridge32).

VIII. RELATED WORK

A. Sparse Intracluster Routing Crossbars

Lemieux et al. [2] presented an algorithm to generate and evaluate routable sparse crossbars, and later proposed their usage for FPGA intracluster routing; to improve routability they added spare CLB input pins [3]. Later work by Feng and Kaptanoglu [4] used entropy counting to design intracluster routing crossbars that offer greater routability; however, there is concern that CLB inputs and local feedbacks cannot reach fast inputs for LUTs with nonuniform delay [20].

Ye [5] showed how the equivalence of LUT inputs can be leveraged to reduce the population density of the intracluster routing crossbar without compromising routability; however, it is unclear if this approach is compatible with more advanced logic block features such as fracturable LUTs and carry chains, where LUT inputs can no longer be treated as logically equivalent. Chin and Wilton [21] extended Ye’s work to investigate high-capacity hierarchical CLBs with multilayer sparse crossbar interconnects, and showed that this approach reduced the placement and routing problem sizes significantly, thereby yielding faster and more robust CAD algorithms.

In terms of commercial FPGAs, Xilinx employs a C-block [Fig. 1(b)] without an intracluster routing crossbar, while Altera and Microsemi (formerly Actel) employ an intracluster routing crossbar in conjunction with a C-block. We presume that Xilinx’s C-block is much denser than Altera’s or Microsemi’s, although no formal comparative study has been published, to the best of our knowledge.

Altera has disclosed that their Stratix-series devices employ a sparse intracluster routing crossbar [22], although no details regarding the topology were presented. Microsemi disclosed that their FPGAs employ a three-layer Clos network, where the third layer is subsumed by LUTs [20]. PPR and SERRGE are compatible with any sparse intracluster routing crossbar topology and population density, and do not rely on presumed logical equivalence between LUT inputs. This paper has evaluated PPR and SERRGE using sparse crossbars generated by the tool designed and described by Lemieux et al. [2].

B. FPGA Routing Algorithms

Since its introduction in 1995, PathFinder [1] has become the most widely used FPGA routing algorithm; VPR’s routability-driven and timing-driven routers are both based on PathFinder [13], [15]. VPR 6.0 introduced sparse intracluster routing crossbars; VPR uses an approach similar to PPR to perform routing. The main difference with this paper is that VPR integrates the PPR phase into the packer [11], [23], as a legality check. In other words, any packing solution that cannot be routed locally within a CLB is disallowed. The router (which follows packing and placement) is similar to PPR’s global router, described in Section V of this paper.

If a global iteration of PathFinder fails to find a legal solution, all nets are ripped up and rerouted, which partially eliminates the dependence on net ordering when routing. Mulpuri and Hauck [24] modified PathFinder to exclusively rip up nets routed through oversubscribed resources. Although this modification speeds up PathFinder’s convergence time significantly, a non-negligible increase in critical path delay was observed for all benchmarks; for this reason, we do not consider this implementation choice in our experiments.

Gort and Anderson [25] reduce contention for CLB output pins by forcing a multisink net to use the same output pin for all sinks during wavefront expansion while exclusively ripping up and rerouting nets that route through oversubscribed resources, similar to Mulpuri and Hauck; they reported a 3× speedup in router runtime coupled with a 2% increase in critical path delay and wirelength. Subsequently, Gort and Anderson [26] observed that PathFinder spends up to 40% of its runtime resolving congestion among nets that have been routed legally. They allow PathFinder to converge when the routing solution is almost legal on a coarsened RRG, and then legalize the result using an SAT solver. In principle, this approach is also compatible with either PPR or SERRGE.

Chin and Wilton [27] developed an RRG compression scheme that takes advantage of the regular tiled nature of an FPGA. An RRG is instantiated for each tile, and inter-tile connections are represented using “wrap-around” edges. Different tile types are instantiated for programmable logic, embedded blocks, and I/Os. Extensions are presented to handle long wires and sparse intracluster routing crossbars, including the heterogeneous depopulation schemes that vary from tile to tile. Separate storage is maintained for the costs associated with each edge in the fully expanded RRG; this information is not compressed. The additional steps added to the router to enable the compressed RRG representation increase the router’s runtime by a factor of 2.16×, on average. In contrast, PPR and SERRGE reduce the runtime of the router, although the reductions in RRG size reported here are far more modest in comparison to Chin and Wilton’s scheme [27].

So [28] introduced a delay budgeting scheme to reduce the critical path delay of a circuit synthesized on an FPGA; since this is a post-processing step, it could improve the quality of results for all data points reported in Fig. 14 of this paper; however, it significantly increases the router runtime by a factor of at least 7× and also increases the memory footprint.

Rubin and DeHon [29] observed that small perturbations in initial conditions (e.g., the order in which nets are routed; variations in intrinsic delays associated with routing resources) yield significant variations in the critical path delays reported by VPR’s implementation of PathFinder. They introduced a timing constraint and an increased search space that uses additional iterations after meeting routability to meet the timing constraint. Furthermore, the timing-constrained router is placed inside a binary search to find the lowest effective timing constraint. In principle, this approach could be used in conjunction with either PPR or SERRGE, as long as the runtime overhead of repeatedly routing the circuit is tolerable.

IX. CONCLUSION

PPR and SERRGE reduce the runtime and memory footprint of FPGA routing based on the PathFinder NC algorithm [1], as implemented in VPR [13], [15]. The baseline router, which is an extension of VPR’s timing-driven router, was extended with larger data structures that represent intracluster routing crossbars. SERRGE extends the baseline router with compressed data structures and online garbage collection, while PPR divides routing into local and global phases, requiring less memory and attaining rapid convergence, but at the cost of a greatly reduced search space. PPR offers the best overall routability (Wmin values), fastest running times, and smallest memory footprint, while SERRGE tends to find routing solutions with the lowest overall critical path delays. If router runtime is at a premium, then PPR should be used; if critical path delay is more important, then SERRGE is preferable. In the vast majority of our experiments, PPR and/or SERRGE outperformed the baseline router for all metrics of interest, as reported in Figs. 13–17.

REFERENCES

[1] L. McMurchie and C. Ebeling, “PathFinder: A negotiation-basedperformance-driven router for FPGAs,” in Proc. 3rd ACM/SIGDA Int.Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA,Feb. 1995, pp. 111–117.

[2] G. G. Lemieux, P. Leventis, and D. M. Lewis, “Generating highly-routable sparse crossbars for PLDs,” in Proc. 8th ACM/SIGDA Int. Symp.Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2000,pp. 155–164.

[3] G. G. Lemieux and D. M. Lewis, “Using sparse crossbars withinLUT,” in Proc. 9th ACM/SIGDA Int. Symp. Field Program. GateArrays (FPGA), Monterey, CA, USA, Feb. 2001, pp. 59–68.


[4] W. Feng and S. Kaptanoglu, “Designing efficient input interconnectblocks for LUT clusters using counting and entropy,” ACM Trans.Reconfig. Tech. Syst. (TRETS), vol. 1, no. 1, Mar. 2008, Art. ID 6.

[5] A. G. Ye, “Using the minimum set of input combinations to minimizethe area of local routing networks in logic clusters containing logicallyequivalent I/Os in FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI)Syst., vol. 18, no. 1, pp. 95–107, Jan. 2010.

[6] Y. O. M. Moctar, G. G. F. Lemieux, and P. Brisk, “Routing algorithmsfor FPGAs with sparse intra-cluster routing crossbars,” in Proc. 22ndInt. Conf. Field Program. Logic Appl. (FPL), Oslo, Norway, Aug. 2012,pp. 91–98.

[7] J. Luu et al., “VPR 5.0: FPGA CAD and architecture exploration toolswith single-driver routing, heterogeneity, and process scaling,” ACMTrans. Reconfig. Tech. Syst. (TRETS), vol. 4, no. 4, 2011, Art. ID 32.

[8] V. Betz and J. Rose, “Automatic generation of FPGA routing architec-tures from high-level descriptions,” in Proc. 8th ACM/SIGDA Int. Symp.Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2000,pp. 175–184.

[9] G. Lemieux, E. Lee, M. Tom, and A. J. Yu, “Directional andsingle-driver wires in FPGA interconnect,” in Proc. 3rd Int. Conf.Field Program. Technol. (FPT), Brisbane, QLD, Australia, Dec. 2004,pp. 41–48.

[10] R. M. Karp, “Reducibility among combinatorial problems,” in Proc.Symp. Complex. Comput. Comput., Yorktown Heights, NY, USA,Mar. 1972, pp. 85–103.

[11] J. Luu, J. H. Anderson, and J. Rose, “Architecture description and pack-ing for logic blocks with hierarchy, modes, and complex interconnect,” inProc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA),Monterey, CA, USA, Feb. 2011, pp. 227–236.

[12] C. Y. Lee, “An algorithm for path connections and its applications,” IRETrans. Electron. Comput., vol. EC-10, no. 3, pp. 346–365, Sep. 1961.

[13] J. S. Swartz, V. Betz, and J. Rose, “A fast routability-driven routerfor FPGAs,” in Proc. 6th ACM/SIGDA Int. Symp. Field Program. GateArrays (FPGA), Monterey, CA, USA, Feb. 1998, pp. 140–149.

[14] R. Tessier, “Negotiated A* routing for FPGAs,” in Proc. 5th Can.Workshop Field Program. Devices (FPD), Montreal, QC, Canada,Jun. 1998, pp. 1–6.

[15] V. Betz, S. Marquardt, and J. Rose, Architecture and CAD for DeepSubmicron FPGAs. Norwell, MA, USA: Kluwer Academic, 1999.

[16] Berkeley Logic Synthesis and Verification Group, ABC: A sys-tem for Sequential Synthesis and Verification. [Online]. Available:http://www.eecs.berkeley.edu/∼alanmi/abc/

[17] I. Kuon and J. Rose, “Area and delay trade-offs in the circuit andarchitecture design of FPGAs,” in Proc. 16th ACM/SIGDA Int. Symp.Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2008,pp. 149–158.

[18] I. Kuon and J. Rose, “Automated transistor sizing for FPGA architectureexploration,” in Proc. ACM/IEEE Design Autom. Conf. (DAC), Anaheim,CA, USA, Jun. 2008, pp. 792–795.

[19] IWLS 2005 Benchmarks. [Online]. Available: http://iwls.org/iwls2005/benchmarks.html

[20] J. W. Greene et al., “A 65nm flash-based FPGA fabric optimized for low cost and power,” in Proc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2011, pp. 87–96.

[21] S. Y. L. Chin and S. J. E. Wilton, “Towards scalable FPGA CAD through architecture,” in Proc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2011, pp. 143–152.

[22] D. M. Lewis et al., “The Stratix™ routing and logic architecture,” in Proc. 11th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2003, pp. 12–20.

[23] J. Luu, J. Rose, and J. H. Anderson, “Towards interconnect-adaptive packing for FPGAs,” in Proc. 22nd ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2014, pp. 21–30.

[24] C. Mulpuri and S. Hauck, “Runtime and quality tradeoffs in FPGA placement and routing,” in Proc. 9th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2001, pp. 29–36.

[25] M. Gort and J. H. Anderson, “Accelerating FPGA routing through parallelization and engineering enhancements,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. (CAD), vol. 31, no. 1, pp. 61–74, Jan. 2012.

[26] M. Gort and J. H. Anderson, “Combined architecture/algorithm approach to fast FPGA routing,” IEEE Trans. Very Large Scale Integr. Syst. (VLSI), vol. 21, no. 6, pp. 1067–1079, Jun. 2013.

[27] S. Y. L. Chin and S. J. E. Wilton, “Static and dynamic memory footprint reduction for FPGA routing algorithms,” ACM Trans. Reconfig. Technol. Syst. (TRETS), vol. 1, no. 4, Jan. 2009, Art. ID 18.

[28] K. So, “Enforcing long-path timing closure for FPGA routing with path searchers on clamped lexicographic spirals,” in Proc. 16th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2008, pp. 24–34.

[29] R. Rubin and A. DeHon, “Timing-driven pathfinder pathology and remediation: Quantifying and reducing delay noise in VPR-pathfinder,” in Proc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), Monterey, CA, USA, Feb. 2011, pp. 173–176.

Yehdhih Ould Mohammed Moctar received the Maitrise (B.Sc. equivalent) degree in computer engineering from the l’Institut Superieur Scientifique, Nouakchott, Mauritanie, in 1996, the graduate degree in learning algorithms and intelligent systems from the l’Université de Montréal, Montreal, QC, Canada, in 1999, and the M.S. and Ph.D. degrees in computer science from the University of California, Riverside, Riverside, CA, USA, in 2012 and 2014, respectively.

Since 2014, he has been with the Electronic Design Automation Team, IBM Systems and Technology Group, Austin, TX, USA. His current research interests include the optimization and acceleration of electronic design automation tools for IBM power and Z microprocessors, and he is involved in investigating algorithms to accelerate scheduling, partitioning, placement, and routing.

Dr. Moctar was a recipient of the U.S. Department of Defense SMART Fellowship for his Ph.D. degree.

Guy G. F. Lemieux (S’91–M’04–SM’08) received the B.A.Sc. degree in engineering science and the M.A.Sc. and Ph.D. degrees in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada.

In 2003, he joined the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada, where he is currently an Associate Professor. His current research interests include field programmable gate array architectures, computer-aided design algorithms, very large scale integration (VLSI) and system-on-a-chip (SoC) circuit design, and parallel computing. He has co-authored the book entitled Design of Interconnection Networks for Programmable Logic (Kluwer, 2004).

Dr. Lemieux was a recipient of the Best Paper Award from the IEEE International Conference on Field-Programmable Technology in 2004.

Philip Brisk (M’09) received the B.S., M.S., and Ph.D. degrees in computer science from the University of California, Los Angeles, Los Angeles, CA, USA, in 2002, 2003, and 2006, respectively.

From 2006 to 2009, he was a Post-Doctoral Scholar with the Processor Architecture Laboratory, School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. He is currently an Assistant Professor with the Department of Computer Science and Engineering, Bourns College of Engineering, University of California, Riverside, Riverside, CA, USA. His current research interests include field programmable gate arrays, compilers, and design automation and architecture for application-specific processors.

Dr. Brisk was a recipient of the Best Paper Award from the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems in 2007 and the International Conference on Field-Programmable Logic and Applications in 2009. He has served as a Program Committee Member of several international conferences and workshops, including Design Automation Conference, Asia and South Pacific Design Automation Conference, Design Automation and Test in Europe, VLSI-SoC, FPL, and International Conference on Field-Programmable Technology. He has been the General Co-Chair of the IEEE Symposium on Industrial Embedded Systems 2009, the IEEE Symposium on Application-Specific Processors 2010, and the IWLS 2011, and has participated in the organizing committees of several other conferences and symposia as well.
