
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 7, JULY 2014 1067

Test-Delivery Optimization in Manycore SOCs

Mukesh Agrawal, Student Member, IEEE, Michael Richter, Member, IEEE, and Krishnendu Chakrabarty, Fellow, IEEE

Abstract—We present two test-data delivery optimization algorithms for system-on-chip (SoC) designs with hundreds of cores, where a network-on-chip (NoC) is used as the interconnection fabric. We first present an effective algorithm based on a subset-sum formulation to solve the test-delivery problem in NoCs with arbitrary topology that use dedicated routing. We further propose an algorithm for the important class of NoCs with grid topology and XY routing. The proposed algorithm is the first to cooptimize the number of access points, access-point locations, pin distribution to access points, and assignment of cores to access points for optimal test-resource utilization of such NoCs. Test-time minimization is modeled as an NoC partitioning problem and solved with dynamic programming in polynomial time. Both of the proposed methods yield high-quality results and are scalable to large SoCs with many cores. We present results on synthetic grid-topology NoC-based SoCs constructed using cores from the ITC'02 benchmark, and demonstrate the scalability of our approach for two SoCs of the future, one with nearly 1000 cores and the other with 1600 cores. Test scheduling under power constraints is also incorporated in the optimization framework.

Index Terms—Dynamic programming, network-on-chip, SoC test, system-on-chip (SoC), test scheduling.

I. Introduction

RECENT years have seen the large-scale integration of an increasing number of embedded cores in system-on-chip (SoC) design. It has been predicted that the number of cores on a single chip will continue to increase in the future [1], [2]. These predictions are being matched by industry trends. For example, Nvidia has developed a manycore chip, called Fermi, that includes 512 CUDA cores [3]. Intel has announced a coprocessor called Knights Corner that has over 50 cores [4].

To support high data bandwidth in such manycore SoCs, a suitable on-chip communication infrastructure is needed. The continued increase in wire lengths relative to feature size, and the associated problems of interconnect delay and power consumption, have motivated research on scalable communication infrastructures. Furthermore, the continuous reduction in design cycle time requires not only the logic cores, but also the on-chip interconnect fabric, to be reusable.

Studies have shown that, compared to buses, a packet-switched network-on-chip (NoC) is better suited for communication-intensive applications with as few as eight cores [5]. For systems with up to 16 cores, the bus-based interconnection fabric was found to be better only in cases of lighter workloads. For SoCs with a larger number of cores, an NoC offers many benefits, including reduced wire lengths, lower power consumption, and better scalability. An NoC is therefore viewed as a promising alternative to today's bus-based communication fabric.

Manuscript received July 16, 2013; revised November 2, 2013 and December 19, 2013; accepted January 28, 2014. Date of current version June 16, 2014. This paper was recommended by Associate Editor X. Wen.

M. Agrawal and K. Chakrabarty are with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: [email protected]; [email protected]).

M. Richter is with Intel Mobile Communications, Neubiberg 85579, Germany (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCAD.2014.2311394

Automatic test equipment (ATE) is used to deliver test patterns to the logic cores in an SoC and analyze test responses. Since the ATE represents a significant investment, it is necessary to develop a systematic approach for utilizing ATE resources optimally. Several studies have reported reductions in test cost through the effective utilization of test resources, such as tester memory and tester channels, to test an SoC using dedicated test access mechanisms (TAMs) [6]–[8].

Instead of implementing a dedicated TAM for manycore SoCs with an NoC interconnection fabric, the NoC infrastructure itself can be used for test-data delivery to the cores. Access points are used to interface an ATE with the NoC for transporting test data between the ATE and the cores over the NoC, and special test wrappers are implemented around the cores [9]–[12]. To ensure congestion-free test-data delivery and minimize test time, a number of scheduling techniques have been developed [12]–[17].

The minimization of the overall test time using the NoC is a complex problem requiring co-optimization of the number and placement of access points, of the distribution of ATE channels to these access points, and of the assignment of cores to access points for test-data delivery. Most previous approaches neglect one or more of these aspects. To be practical, the optimization technique must be scalable to hundreds of cores.

This paper addresses the above challenges. Its major contributions are as follows.

1) We present a novel algorithm for minimizing test time for NoCs with arbitrary topology and dedicated routing that consistently yields shorter test times than previous approaches. This algorithm can also be used to minimize test times for SoCs using dedicated TAMs.

2) We model the test-delivery optimization problem for an NoC-based manycore SoC with grid topology as a grid partitioning problem and develop a dynamic programming solution.

3) We show that optimal grid partitioning leads to significant reductions in test time. The partitioning solution is optimal in that the minimum test time is obtained for rectangular partitions. The test times obtained using DP for the rectangular NoC topology are close to (no more than 10% above in most cases) TAM-independent provable lower bounds.

4) We demonstrate the scalability of the proposed method by deriving optimization results for realistic SoCs of today and of the future. We present results for a 400-core SoC (20 × 20 NoC) of today, and for even larger SoCs of the future with 992 cores (32 × 31 NoC) and 1600 cores (40 × 40 NoC). The results were obtained in reasonable time using modest computing resources.

5) We show how test scheduling can be carried out under power constraints.

6) The proposed dynamic-programming algorithm computes the Pareto frontier for a range of access-point counts and total test-pin counts in a single run. Therefore, a wealth of data can be obtained to effectively choose the number of access points and test channels for the SoC.

The rest of the paper is organized as follows. Section II discusses prior work. Section III introduces the subset-sum problem and shows how it can be used for solving test scheduling. Section IV provides details about the proposed dynamic programming algorithm. Test scheduling under power constraints is discussed in Section V. Results are presented in Section VI, and Section VII concludes the paper.

II. Related Prior Work

A typical NoC is made up of several network tiles. A network tile, in turn, is composed of a router, a core, and a core-to-network interface [18]. Routers are responsible for routing communication data according to a routing protocol. Routers are interconnected using links that are composed of multiple channels. A network interface translates between a router's and a core's communication protocols.

The topology of the network describes the logical layout of the NoC. In a grid-based layout, a router has multiple input and output ports, e.g., one port each for five different directions (north, south, east, west, and core). A router receives data on one of its input ports and forwards it to one of the output ports according to the routing protocol. In the XY-routing protocol, data is first routed along the x-axis and then along the y-axis. Since this scheme is not affected by on-chip network traffic, routing decisions are deterministic.
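As an illustration of this deterministic behavior, the following minimal Python sketch computes the output port chosen by an XY router for a packet; the coordinate convention and port names are illustrative assumptions and are not taken from the paper.

def xy_next_port(router, dest):
    """Next-hop decision of a (hypothetical) XY router.
    `router` and `dest` are (x, y) tile coordinates; the packet is first
    moved along the x-axis, then along the y-axis, and finally ejected
    to the local core when both coordinates match."""
    rx, ry = router
    dx, dy = dest
    if dx > rx:
        return "east"
    if dx < rx:
        return "west"
    if dy > ry:
        return "north"
    if dy < ry:
        return "south"
    return "core"  # packet has arrived at the destination tile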

The testing of the NoC infrastructure, fault detection, and reconfiguration in the presence of faults have been studied in [19], [20]. A scheduling algorithm that assigns a higher priority to cores requiring longer test time, and finds shorter test-delivery paths to these cores, was proposed in [13]. This approach was extended to take power constraints into account in [14]. An optimization method to identify dedicated routing paths and incorporate precedence constraints and shared BIST resources was provided in [21].

A differential clocking scheme to reduce power consumption in the cores, and to utilize the capability of an NoC to run at a frequency higher than the scan clock, was presented in [15]. Multiple clock speeds can also be leveraged to reuse common links connected to cores on a time-sharing basis [16].

A test-wrapper design for the reuse of functional interconnects was introduced in [9], [12]. TAM width constraints imposed by the NoC flit width, and the associated problem of flit-bit under-utilization, were highlighted in [22]. Other design approaches are presented in [10] and [11], but these methods do not show how test-time reduction can be achieved.

The compression of test input data for reducing the number of test input pins was studied in [23]. Test-output compression to reduce the output pin count was discussed in [17]. Our approach can be extended to incorporate input and output compression as proposed in [23] and [17].

The need to partition the NoC to ensure jitter-free transport of test and response data, and to distribute different bandwidths to these partitions, was introduced in [9]. The TR-Architect test-scheduling algorithm [24] was extended in [12] to minimize test time in an NoC with arbitrary topology and dedicated routing. However, a systematic method for creating valid partitions in an NoC with a grid topology that supports only XY routing was not considered.

In [17], a general test-scheduling problem was formulated and solved using ILP. Furthermore, a heuristic approach was also presented, in which contention for routers and links was avoided between cores assigned to different access points. However, the heuristic is based on ILP, which does not scale well with the number of cores and the number of access points. No methodology was discussed for an appropriate selection of the total number of pins or the number of access points. Moreover, no strategy for locating optimal positions of access points was discussed in [17].

Two methods for delivering test data using an NoC have been studied in previous work. In packet-based scheduling [13], every packet is scheduled independently of the others; a path is reserved for one packet, and packets belonging to the same test set can be assigned different sets of NoC resources. The second approach, based on dedicated routing [21], schedules all test packets of a core using the same path in the NoC. Due to its simplicity, test scheduling using dedicated routing has been further explored in [12] and [25].

III. Test Scheduling for NoCs With Dedicated Routing

A. Problem Description

A graph-theoretic formulation of test-time minimization for NoCs with dedicated routing is presented in [12]. A graph is first constructed from the topology of the NoC, with network tiles represented as nodes and links represented as edges. The goal is to partition the graph into disjoint connected components and distribute pins to each component such that the maximum test time over all components is minimized. The test time of a component is equal to the sum of the test times of the cores in the component. The test time of a core depends on the number of pins allotted to the component to which the core belongs. We use the same formulation for devising a test-scheduling algorithm for an arbitrary topology. We further assume a dedicated routing path for the testing of each core, in line with previous work on test scheduling [12], [21], [25].

We build on the system model presented in [9], where the ATE interfaces (access points) and the core wrappers perform all protocol and width conversion such that both the ATE and the circuit under test (CUT) are unaware of the on-chip routing protocol and the NoC design. We assume a width conversion ratio of one in this paper; any other ratio can easily be accounted for.

B. Outline of Algorithm

If we exclude the initial delay introduced while setting up dedicated routes in the NoC for test-data delivery and response collection, the test-time minimization problem for NoCs with dedicated routing can be considered a special case of the bus-based TAM optimization problem [12]. This is because any partition of the set of cores is a valid solution candidate for the latter, while only those partitions that represent connected components in the graph representation of the given NoC topology are solution candidates for the former. Hence, we first revisit the TAM optimization problem described in [6] and [24] for a bus-based TAM architecture. By limiting the partitions considered by our algorithm to those that form valid connected components, the algorithm is also directly applicable to the NoC test-scheduling problem (Section III-F).

Fig. 1. Pseudocode for the procedure SolveSubsetSum.

The TAM optimization problem, discussed in [6] and [24], can be stated as follows. Given a set of cores with respective test-time information for a range of pin widths, the total TAM width, and the maximum number of TAM partitions, we have to find an assignment of cores to the TAM partitions and the TAM width of each partition such that the overall test time is minimized. We view this problem from a different angle: we partition the set of cores into disjoint subsets, and find an optimal distribution of pins, or TAM wires, to these subsets to minimize test time. A systematic method of enumerating candidate partitions is discussed.

Our discussion consists of three parts: a brief introduction to the subset-sum problem and a method to solve it, a subset-sum formulation to effectively enumerate candidate partitions in reasonable time, and a greedy method to optimally distribute pins for a given partition. We first describe the subset-sum problem (SSP) before elaborating on how we utilize SSP to tackle the TAM-optimization problem.

C. Subset-Sum Problem (SSP)

The SSP is posed in the form of a question. Given a set of n integers, X = {x1, x2, . . . , xn}, does there exist a nonempty subset whose elements sum to zero? This problem is known to be NP-complete [26]. An equivalent version of the problem asks whether there is a subset of the given set whose elements sum to a given integer s. No algorithm is known that solves an arbitrary instance of SSP in polynomial time. However, a method based on dynamic programming solves the problem optimally in pseudo-polynomial time [26]. Fig. 1 outlines this method—we call it SolveSubsetSum—which has runtime complexity O(n(U_SSP − L_SSP)), where L_SSP is the least element of the given set, U_SSP is the sum of all positive elements of the set, and L_SSP ≤ s ≤ U_SSP.

The Boolean element SUM(i, s) of the 2-D array SUM stores true if it is possible to create a subset of {x1, x2, . . . , xi} with sum s, and false otherwise. From the procedure SolveSubsetSum, it is evident that SUM(i, s) is true only under three conditions.

1) The subset consists only of xi, i.e., s = xi.
2) There exists a subset with sum s that does not use the element xi, i.e., SUM(i − 1, s) is true.
3) The subset contains xi in addition to elements from the set {x1, x2, . . . , xi−1}, i.e., SUM(i − 1, s − xi) is true.

This recursive formulation is the foundation for solving SSP in pseudo-polynomial time, and we leverage this idea for TAM optimization. We use repeated invocations of the above method on multiple instances of SSP to enumerate several candidate partitions of the given set of cores into K subsets, and use a greedy approach to distribute pins to these subsets.
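The following sketch fills the SUM table using exactly the three conditions above; it is an illustrative reconstruction of the SolveSubsetSum idea, not the authors' pseudocode, and the function name and interface are assumptions.

def solve_subset_sum(X):
    """Fill SUM[i][s] = True iff some subset of {x_1, ..., x_i} sums to s.
    Assumes all elements of X are positive integers (as in the scaled
    test-time sets used for TAM optimization)."""
    n, U = len(X), sum(X)
    SUM = [[False] * (U + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        xi = X[i - 1]
        for s in range(1, U + 1):
            SUM[i][s] = (
                s == xi                              # condition 1
                or SUM[i - 1][s]                     # condition 2
                or (s >= xi and SUM[i - 1][s - xi])  # condition 3
            )
    return SUM

# Example: for X = {3, 5, 8}, SUM[3][13] is True because 5 + 8 = 13.
table = solve_subset_sum([3, 5, 8])
print(table[3][13])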

D. Optimal Pin Distribution for Partition

Suppose the given set of cores is partitioned into K disjoint subsets, and the cores in a subset are assigned to the same TAM. Moreover, no two cores from different subsets use the same TAM. The number of pins available for distribution is P, where P is the total TAM width.

We present a greedy algorithm for distributing P pins and prove that the greedy approach is optimal for a given partition of the set of cores. Fig. 2 outlines the algorithm for pin distribution. In each iteration, the subset having the maximum test time is assigned an additional pin. The algorithm terminates after every pin is assigned.

Fig. 2. Method for distributing P pins given a partition of the set of cores.
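A minimal sketch of this greedy step is given below. It assumes a helper that maps a subset index and a pin count to a test time (in practice this comes from the per-core test-time tables), and it starts from one pin per subset since each subset needs at least one pin; it is illustrative rather than the authors' implementation.

def distribute_pins(num_subsets, P, subset_test_time):
    """Greedy pin distribution: repeatedly give the next pin to the subset
    with the largest current test time. `subset_test_time(idx, pins)` is an
    assumed helper returning the test time of subset `idx` with `pins` pins."""
    pins = [1] * num_subsets              # every subset needs at least one pin
    for _ in range(P - num_subsets):      # hand out the remaining pins
        times = [subset_test_time(i, p) for i, p in enumerate(pins)]
        bottleneck = times.index(max(times))
        pins[bottleneck] += 1
    return pins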

Let G_P = {g1, g2, . . . , gP} be the solution generated by the greedy approach, where gi denotes the index of the subset to which the ith pin is assigned. Clearly, 1 ≤ gi ≤ K for all i. Let O_P = {o1, o2, . . . , oP} be an arbitrary sequence of pin assignments that leads to an optimal solution.

Let T_g(i) denote the SoC test time after the ith pin is assigned using the greedy approach. Similarly, let T_o(i) denote the SoC test time in the ith step of the given arbitrary sequence. Furthermore, let τ_g(i, b) be the test time of the bth subset after the ith iteration of the greedy approach, and let τ_o(i, b) denote the same quantity for the arbitrary sequence.

Clearly, T_g(i) = max_b {τ_g(i, b)}, which implies τ_g(i, b) ≤ T_g(i) for all b; similarly, T_o(i) = max_b {τ_o(i, b)}, which implies τ_o(i, b) ≤ T_o(i) for all b.

Lemma 1: The test time does not increase in any iteration of the greedy method, i.e., T_g(i + 1) ≤ T_g(i).

Proof: The test time of a core does not increase with the addition of a pin. Since a pin is added at each step, or iteration, to exactly one subset, the test time of that subset can only decrease or remain the same. The test times of the other subsets remain unaffected. This is also true for the arbitrary sequence, i.e., T_o(i + 1) ≤ T_o(i).

Lemma 2: In the partial sequence G_k = {g1, g2, . . . , gk}, suppose that pin k was added to subset b_m during the kth step, i.e., g_k = b_m and T_g(k) = τ_g(k, b_m). If a pin is removed from a subset, say b′, such that b_m ≠ b′, then the updated test time of the SoC is strictly greater than T_g(k).

Proof: According to the greedy procedure, a pin is added to a subset only when that subset has the maximum test time during an iteration. Since a pin is removed from b′, the pin must have been added to b′ during some iteration i < k such that τ_g(i, b′) = T_g(i). From Lemma 1, T_g(k) ≤ T_g(i) = τ_g(i, b′). Therefore, the updated test time of b′ after removing the pin is τ_g(i, b′). Note that the greedy procedure keeps selecting the same subset until its test time becomes strictly lower than that of another subset. Since b_m ≠ b′, τ_g(i, b′) > T_g(k). Since the test time of b′ dominates T_g(k), the SoC test time is τ_g(i, b′) > T_g(k). Note that adding the removed pin to any other subset does not reduce the SoC test time.

If b_m = b′, then τ_g(i, b′) ≥ T_g(k) holds.

Theorem 1: The greedy algorithm of Fig. 2 finds an optimal distribution of pins for a given partition of the set of cores.

Proof: We prove the theorem by contradiction. At step i, if the distribution of pins is the same for both methods, then T_g(i) = T_o(i) holds. If the distribution of pins is not the same, then there exists at least one subset that is assigned at least one more pin in the greedy approach than in the other approach. If we try to rearrange the pins in the greedy sequence to match the pin distribution obtained from the arbitrary approach, then by Lemma 2 the test time can only increase; hence T_g(i) ≤ T_o(i) for all i.

Fig. 3. Pseudocode for the procedure optimizeTAM.

E. Enumeration of Candidate Partitions

The TAM-optimization problem has been shown to be NP-complete using a transformation from the bin-packing problem [27]; therefore, a polynomial-time algorithm for solving the problem optimally is unlikely to exist. The test times of the cores are mapped to the integer elements of the given set in SSP. Assuming that only one pin is available to each TAM, a set consisting of the test times of all cores using just a single pin is created. The test times vary from small to large values depending on the scan-chain lengths and the number of test patterns of the individual cores. If the range is very large, it may be computationally prohibitive to run dynamic programming (Fig. 1), both in terms of time and space. Therefore, the test times are scaled down by the lowest test time and rounded to the nearest integer values. The procedure optimizeTAM of Fig. 3 is called on this set to create K subsets. The greedy method outlined in Fig. 2 is executed on the resulting partition and the test time is noted.
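The scaling step can be summarized in a couple of lines; the snippet below only illustrates the normalization described above, and the numeric test times are hypothetical.

# Scale single-pin test times by the smallest one and round to integers,
# so that the subset-sum table stays small.
test_times = [13900, 42100, 7600, 28800]   # hypothetical single-pin test times
base = min(test_times)
X = [round(t / base) for t in test_times]  # yields [2, 6, 1, 4]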

The element SUM(i, s), for every s, holds the answer to the question of whether the set {x1, x2, . . . , xi} has a subset whose elements sum to s. For every value of s for which SUM(i, s) holds true, we get a unique subset. The procedure findSubset finds such a subset; it searches through the entries of the 2-D array SUM and constructs a solution. This is the backbone for enumerating candidate partitions. Note that optimizeTAM is a recursive procedure, and the maximum recursion depth is equal to the total number of subsets that need to be created. In each recursive step, a subset Xs is created with sum less than or equal to s. Then a new set Xfilter is obtained by removing the elements of Xs from X. While Xs is appended to the list candidate partition, which stores the subsets in a candidate partition, the set Xfilter is fed into the next level of recursion. After returning from the recursive call, Xs is removed from candidate partition.

This methodology can potentially enumerate all partitions, but to keep the computation time low, we do not search through all the entries of the array SUM. First, a lower bound L_TAM = (Σ_{x∈X} x)/(K − depth) is computed. The variable s is varied from L_TAM − Δ to L_TAM + Δ, where Δ is a parameter that controls the size of the solution space to be searched. Since the size of X decreases in every recursive step, we halve the value of Δ in successive recursive steps in our experiments (not shown in Fig. 3). Either a timeout can be set on the overall procedure, or a limit can be placed on the maximum number of candidate partitions to be enumerated.
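The recursive enumeration can be sketched as follows. This is a simplified reconstruction under stated assumptions: the subset search is done by brute force here purely to keep the example short (the paper reconstructs subsets from the SUM table of SolveSubsetSum), and all names are illustrative.

from itertools import combinations

def find_subset(values, target):
    """Stand-in for findSubset: return any subset of `values` summing exactly
    to `target`, or None if no such subset exists."""
    for r in range(1, len(values) + 1):
        for combo in combinations(values, r):
            if sum(combo) == target:
                return list(combo)
    return None

def enumerate_partitions(X, K, delta, depth=0, prefix=None, found=None):
    """optimizeTAM-style enumeration: carve K subsets out of X, trying target
    sums s in a window of half-width `delta` around L_TAM = sum(X)/(K - depth),
    and halving `delta` at each recursion level."""
    prefix = [] if prefix is None else prefix
    found = [] if found is None else found
    if depth == K - 1:                     # the last subset takes whatever is left
        if X:
            found.append(prefix + [list(X)])
        return found
    L_TAM = sum(X) // (K - depth)
    for s in range(max(1, L_TAM - delta), L_TAM + delta + 1):
        subset = find_subset(X, s)
        if subset is None:
            continue
        X_filter = list(X)
        for v in subset:                   # remove the chosen elements from X
            X_filter.remove(v)
        enumerate_partitions(X_filter, K, max(1, delta // 2),
                             depth + 1, prefix + [subset], found)
    return found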

F. Application to NoC

Similar to the extension made for NoCs in [12], the method described in the previous subsection is modified for solving the test-delivery problem in an NoC. Listed below are the required modifications.

1) The findSubset procedure only returns a subset whose elements form a connected graph.

2) In distributePins, the number of pins assigned to each subset is no more than the channel width.

The method described in [12] creates several connected components in its initial step and subsequently attempts to find a partition that minimizes test time; there is no control over how many connected components are created in the final solution. In contrast, our approach takes the number of access points (or connected components) as an input parameter. In Section VI, we show how restricting the number of connected components in [12] adversely impacts test time. Even if the number of connected components is kept unrestricted, we demonstrate that, for larger NoCs with a grid topology, the method in [12] produces a significantly larger number of access points with test times that are worse than those produced by our method with far fewer access points. More access points lead to additional DFT cost and higher power consumption.

IV. Dynamic Programming Approach for Grid Topology

The above approach and the approach from [12] are not applicable to an NoC with mesh topology that only supports the XY-routing protocol for on-chip communication, as they create connected components of arbitrary shapes and do not ensure that packets are delivered without any conflict when the routing mechanism is XY. Moreover, XY routing can be viewed as a special case of dedicated routing, if the paths taken by packets in the former case are encoded in packet headers in the latter; therefore, the method that we propose in this section can still be used for NoCs with grid topology that use dedicated routing. We will demonstrate that, by using XY routing with grid topologies, we obtain better solutions than reported in [12] for larger NoCs.

A. Preliminaries

The basic idea of our approach is to partition a 2-D grid-structured NoC into rectangular regions, and to interface each region with the ATE such that testing in each region can be carried out in parallel without any contention for routers and links. The cores in a region are tested sequentially, and the regions are tested in parallel. This concept is illustrated in Fig. 4. The figure shows a partition of a 4 × 4 NoC into three rectangular regions. These regions are interfaced with an ATE through ATE interfaces. The ATE drives test in all three regions simultaneously.

The goal of partitioning is to minimize the test time of the region with the maximum test time. This method is suitable for NoCs that implement the XY-routing protocol, which is a popular routing algorithm because of its easy hardware implementation and guaranteed deadlock-free routing.

The advantage of optimal partitioning is illustrated in Fig. 5. Each subfigure shows a different partition of the same 4 × 4 2-D grid network, where each cell represents a network tile. The number in each cell is the number of clock cycles needed to test the core in that network tile. Note that the additional time needed to establish a connection to a core from an access point is neglected here because it is very small compared to the test times of the cores. For the purpose of illustration only, we assume that the test pins are evenly distributed to the access points. The quantity T refers to the total number of cycles required to test the chip. Since the regions are tested in parallel, T is the maximum number of clock cycles needed to test any region. Fig. 5(b) shows an optimal partitioning that reduces the test time by 40% compared to the partitions shown in Fig. 5(c). The number of regions is three in this figure, but in general it is an input parameter to the proposed framework.

B. Problem Formulation

An NoC N with a grid topology can be viewed as a rectangle in a 2-D Cartesian plane, with the bottom-left corner of N coinciding with the origin (0, 0) of this plane. Any rectangle R can be uniquely identified by a four-tuple representation [i, j, l, w], where (i, j) is the bottom-left corner of R, and l and w are the length and width of R, respectively. This representation can be captured by the concise expression R ≡ [i, j, l, w]. If the dimension of N is m × n, we can write N ≡ [0, 0, m, n].

Our goal is to optimally partition N into rectangular regions such that testing can be carried out independently in each region, thereby minimizing the overall test time. This particular choice of rectangular regions is motivated by two reasons: first, the popular XY-routing algorithm [28], [29] ensures that there is no conflict in test-data transportation because data will not cross the boundary of a region; second, it is more tractable to store and use results for subproblems in our dynamic programming-based approach.

Fig. 4. Partitioning of a 4 × 4 NoC for parallel testing.

Fig. 5. Illustration of the impact of optimal partitioning on testing time. (a) T = 24200 cycles. (b) T = 22900 cycles. (c) T = 38700 cycles. (d) T = 38700 cycles.

Fig. 6. Examples of inadmissible partitioning using (a) separators and (b) 'partial' separators.

We have to ensure that each region can be accessed by the ATE without requiring a dedicated TAM; hence, not all possible partitions of N are admissible. The two partitions shown in Fig. 6, for example, are inadmissible; region B is enclosed from all sides and cannot be tested independently. We say that a region is located at a boundary if it is not enclosed, i.e., if it has at least one boundary edge. An edge of such a region R is said to be a boundary edge if it is incident on one of the edges of the rectangle [0, 0, m, n]. In Fig. 6(a), region A has three boundary edges, regions C and D have two each, E has one, and B has no boundary edges.
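The admissibility test for a single region is straightforward. The small check below illustrates it under the assumption that i and l run along the side of length m and j and w along the side of length n (the paper does not state this convention explicitly).

def is_boundary_region(region, m, n):
    """Return True if region [i, j, l, w] has at least one boundary edge,
    i.e., one of its sides lies on an edge of the NoC rectangle [0, 0, m, n]."""
    i, j, l, w = region
    return i == 0 or j == 0 or i + l == m or j + w == n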

We create a partition with the help of a sequence of separators. A separator is any line segment that is parallel to one of the coordinate axes and divides a region completely into two subregions. We impose the constraint that a separator has to be parallel to one of the axes; this ensures that only rectangular regions are formed. We do not consider a sequence of line segments as shown in Fig. 6(b) to create a partition; it can easily be shown that such a sequence will always create an enclosed region. A separator is horizontal (vertical) if it is parallel to the axis y = 0 (x = 0).

Fig. 7. Mapping of the partition in (a) to a binary tree (l1, l2, l3, and l4 are separators) in (b).

The partition shown in Fig. 7(a) can easily be mapped to a binary tree, with the region represented by the NoC as the root of the tree and each intermediate node representing the subregion created by placing a separator in its parent region [Fig. 7(b)]. The orientations of the separators are also captured in the tree in terms of how a nonleaf node is intersected. We call a region corresponding to a leaf of this tree a leaf region. Alternatively, a leaf region can be defined as a region with no subdivisions.

We define the cost of a leaf region as the sum of the test times of the cores in that leaf region. The test time of a core is a function of the number of pins assigned to test the core. The number of pins used to test a core equals the number of pins assigned to the leaf region in which the core is present. The cost of an intermediate region R is the maximum over all costs of the leaf regions contained in R. The cost of R also depends on the corresponding partition size, where the size of a partition is the number of leaf regions created by that partition. Let us use π_N(K, P) to denote the cost of N, where K is the given partition size (number of regions) and P is the number of available test pins. We can now formulate our problem as follows. Our goal is to determine an optimal sequence of separators such that:

1) each core is present in exactly one leaf region;
2) each leaf region is located at a boundary (boundary-region constraint);
3) π_N(K, P) is minimized.

For a given K and P, we use π_R to refer to the cost of any region R. Our goal is also to compute the pin assignment to each leaf region (pin-assignment problem), i.e., which of the P pins are used to access the different leaf regions.

C. Characterization of Optimal Solution Structure

Dynamic programming is an effective technique for attacking optimization problems that possess an optimal substructure, i.e., for which an optimal solution can be computed from optimal solutions to smaller subproblems. Dynamic programming is typically applied when the same subproblem appears multiple times as a problem is decomposed; storing the solution to each subproblem avoids recomputation when it arises again. In this section, we show why we use dynamic programming, and how we find an optimal partition and construct the solution to the test-delivery problem.

Fig. 8. (a) Position of a separator l1 on the grid. (b) and (c) Two configurations of a rectangle R after a separator is placed.

In order to obtain an optimal solution, we have to select a sequence of separators that creates a partition of size K. To simplify the following discussion, we do not consider the effect of pin width on test time here, and instead assign a fixed test time to each core. The extension of the algorithm to cover pin distribution is presented in Section V.

Consider an m × n grid. Suppose that this grid has to be partitioned into two regions. In this case, only one separator is needed. If placed vertically, the separator can be placed at n − 1 different positions. Similarly, there are m − 1 possible positions for a horizontal separator, accounting for a total of m + n − 2 different positions. In order to minimize the test time, we compute the overall test time for each position of the separator and report the position resulting in the minimum test time. Increasing the partition size K to three requires an additional separator to be placed. If the first separator l1 is placed as shown in Fig. 8(a), the second separator l2 can be placed either in region A or in region B. There are m + a − 2 ways to place l2 in A, and m + n − a − 2 ways to place it in B, resulting in a total of 2m + n − 4 ways. Thus, for each choice of a position for l1, there are several choices for placing l2. The analysis can be continued in this way for larger values of K.

Suppose we are given an optimal partition for the grid and the position of l1 is as shown in Fig. 8. The cost of N (π_N) is the maximum of the costs of the two regions, A and B. Let us consider three cases: 1) the cost of region A, π_A, is greater than that of region B (π_B), i.e., π_A > π_B; 2) π_B > π_A; and 3) π_A = π_B. In the first case, π_A should be the optimal cost for A. If not, the optimal cost for A can replace π_A to give a better value for π_N, which is a contradiction because we assumed π_N to be optimal. A similar argument holds for the symmetrical second case: if π_B is not optimal, i.e., if there were a better value for π_B, we could always substitute this value to yield a better value of π_N, which would be a contradiction. The third case can be analyzed in the same way to conclude that an optimal solution for this partitioning problem encapsulates optimal solutions for its subproblems. This property indicates optimal substructure, which is a basic requirement for applying dynamic programming. Let us recursively express this property for the above example

π_N = max{π_A, π_B}.  (1)

An optimal value of π_N can be obtained by sweeping l1 across all its possible positions. For each different position ρ of l1, the values of π_A and π_B change. We can rewrite (1) as follows:

π* = min_ρ {max{π_A, π_B}}  (2)

where π* is an optimal solution to the actual problem and ρ varies over all the possible locations for placing l1. Note that if the grid is to be partitioned into K regions, the sum of the partition sizes of regions A and B will be K. The partition sizes of A and B are not accounted for in the above equation. The partition cost is a function of the partition size. Since changing the partition size of a region has an impact on the overall partition cost, the value of π* depends on how many separators we assign to regions A and B. For a given P, let us use π̃_R(k) to denote the cost of R when it is partitioned into k regions. Factoring in the partition sizes for A and B, (2) becomes

π* = π̃_N(K) = min_ρ {min_x {max{π̃_A(K − x), π̃_B(x)}}}  (3)

where 0 < x < K.

Let us formulate a general equation for all possible subproblems. For any region R ≡ [i, j, l, w], we rewrite (3) by splitting it into two cases: the case when a separator is swept from top to bottom, and the case when it is swept from left to right. These two cases and the regions so formed are marked in Fig. 8(b) and (c). The final cost is the minimum cost obtained in these two sweeps

π̃_R(k) = min{ min_ω {min_x {max{π̃_A#(k − x), π̃_B#(x)}}},  min_λ {min_x {max{π̃_A*(k − x), π̃_B*(x)}}} }  (4)

where 0 < ω < w, 0 < λ < l, 0 < x < k, R ≡ [i, j, l, w], A# ≡ [i, j, l, ω], B# ≡ [i + ω, j, l, w − ω], A* ≡ [i, j, λ, w], and B* ≡ [i, j + λ, l − λ, w].

We use (4) in our algorithm. Since a subproblem may need to be solved as part of more than one larger problem, we use dynamic programming. The key idea is to store the results for all subproblems and find the optimal cost π*, which is the same as π̃_N(K). Next, we discuss the algorithm and the data structures used.

D. Algorithm

The algorithm begins by enumerating all possible rectangles in N and, for each such rectangle, computing and storing the optimal cost for all partition sizes k, 0 < k ≤ K. This is shown in Fig. 9. The variables l and w are used to vary the length and width of the current rectangle, respectively. The pair (i, j) denotes the bottom-left coordinate of a rectangle. A 5-D integer array M is used for storing the optimal cost of all rectangles for all partition sizes. The first four dimensions identify a rectangle, and the last dimension denotes the partition size. The variable k iterates over all partition sizes less than or equal to K. The trivial case of k = 1 is not shown in the figure; M[i, j, l, w, 1] is initialized to the sum of the test times of all cores present in the rectangle [i, j, l, w]. The procedure computeAndStore, shown in Fig. 10, describes how the optimal solution is reached.

Fig. 9. Top-level dynamic programming procedure for partitioning.

Fig. 10. computeAndStore procedure.

The element M[i, j, l, w, k] is initialized to a large integer value, and it is updated with any smaller value found while sweeping the separator, first from left to right, and then from top to bottom. The variable v stores the current position of the vertical separator. The optimal cost of the current rectangle depends on the partition costs of the two new subregions formed by the separator. The partition cost of a subregion, in turn, depends on its partition size. We use a variable x for varying the partition size in one of the subregions. If x is the partition size of one of the subregions, then k − x is the partition size of the other subregion.

In addition to producing the cost of an optimal partition, we also have to keep track of the sequence of separators used to construct this optimal solution. The procedure ConstructSolution, described in [30], can be used for this purpose.
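To make the recurrence concrete, the sketch below evaluates (4) with memoization for fixed per-core test times, ignoring pin distribution and the boundary-region constraint (both are added in Section V). It is an illustrative reconstruction, not the authors' implementation, and it assumes that i and l index the side of length m and j and w the side of length n.

from functools import lru_cache

def min_partition_cost(test_time, m, n, K):
    """Minimum, over all partitions of the m x n grid into K rectangles, of the
    maximum per-region test time; test_time[x][y] is the fixed test time of the
    core in tile (x, y). Leaf regions are tested serially, regions in parallel."""
    # 2-D prefix sums so that the cost of any rectangle is obtained in O(1).
    pre = [[0] * (n + 1) for _ in range(m + 1)]
    for x in range(m):
        for y in range(n):
            pre[x + 1][y + 1] = (test_time[x][y] + pre[x][y + 1]
                                 + pre[x + 1][y] - pre[x][y])

    def rect_sum(i, j, l, w):  # total test time of region [i, j, l, w]
        return pre[i + l][j + w] - pre[i][j + w] - pre[i + l][j] + pre[i][j]

    @lru_cache(maxsize=None)
    def cost(i, j, l, w, k):
        if k == 1:                         # leaf region: cores tested one by one
            return rect_sum(i, j, l, w)
        best = float("inf")
        for lam in range(1, l):            # sweep a separator along the length
            for x in range(1, k):
                best = min(best, max(cost(i, j, lam, w, x),
                                     cost(i + lam, j, l - lam, w, k - x)))
        for om in range(1, w):             # sweep a separator along the width
            for x in range(1, k):
                best = min(best, max(cost(i, j, l, om, x),
                                     cost(i, j + om, l, w - om, k - x)))
        return best

    return cost(0, 0, m, n, K)

The pin-aware version described in Section V-B would add a pin count to the memoization key, and the boundary-region check of Section V-A would prune enclosed regions.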

V. Additional Constraints and Enhancements

In this section, we incorporate the boundary-region constraint, the pin-assignment strategy, and power constraints into our solution approach.

A. Boundary-Region Constraint

The boundary-region constraint mandates the inclusion of only boundary regions in the final solution. As described in [30], the procedure computeAndStore can easily be modified to enumerate only admissible partitions.

B. Pin-Assignment Problem

The test time of a core varies with the TAM width assigned to it. In this paper, we do not consider the case in which the TAM width of a core exceeds the channel width. To incorporate the effect of pin count on the partition cost, we need to add one more dimension to the search space. The overall partition cost can now be expressed as π_N(K, P), where P pins have to be distributed among all K regions. The optimal substructure equation can be rewritten as

π* = π_N(K, P) = min_ρ {min_{x,p} {max{π_A(K − x, p), π_B(x, P − p)}}}


where 0 < x < K, 0 < p < P, and ρ varies over all possible positions of the separator being placed. This equation can be further extended to consider the cases of horizontal and vertical separators, as in (4), but this is omitted here for brevity. For solving this problem, we evaluate the optimal solution using the same approach as in Section IV. The dimension of the array M is increased to six, such that M[i, j, l, w, k, p] stores the optimal partition cost of region [i, j, l, w] if the partition size is k and the number of pins assigned to this region is p. For each value of k, we add one more loop that iterates over all possible pin counts. Moreover, when a separator is placed, in addition to varying the partition sizes of the subregions A and B, all possible distributions of the p pins to the two subregions are evaluated. For any region R, π_R(k, p) is not evaluated for p < k, because each region should be allotted at least one pin. We use the equality π_R(k, p) = π_R(k, p − 1) if p takes values greater than k times the flit width. The array B is now also a 6-D structure, with each element being a six-tuple; two more entries are added to specify the pins allotted to each subregion.

The run-time complexity of the proposed method is O(m^3 n^2 K^2 P^2) when m ≥ n (with the roles of m and n interchanged otherwise), as shown in [30]. The space complexity is O(m^3 n^2 KP) [30].

C. Power Constraints

The reduction of test application time by manipulating the power profiles of test sets has been studied in [31]. Given a power constraint, the problem of minimizing the test time is presented in [31]. A similar problem in the context of NoCs has been formulated and solved using ILP techniques in [32]. The scheduling problem under power constraints was shown to be computationally harder than the general test-scheduling problem in [32], and the proposed method does not scale well with the number of cores. This section elaborates on ideas presented in [31] to initially treat power as a design objective, to minimize power consumption, and subsequently as a design constraint, to achieve minimization of the test application time.

1) Profiling of power consumption: Accounting for power during test scheduling entails the modeling of the power consumed by the individual cores. A simplistic approach is to flatten the power profile of a core to the worst-case instantaneous power consumption value, i.e., its peak value [31]. The simplicity and reliability of this model, called the global peak power approximation model (GP-PAM), is achieved at the cost of including significant false power in the model; false power is power that is not actually consumed, but is still being accounted for. This limits the degree of test concurrency that can ideally be achieved. The GP-PAM model is represented by a pair [P_global, L_global], where P_global is the global peak power consumed over a test length of L_global cycles. Another model, proposed in [31], significantly reduces false power and thus enables greater test concurrency. This model, known as 2LP-PAM, creates a profile having two peaks, as opposed to the single global peak in GP-PAM. 2LP-PAM is represented by a vector pair {[P_hi, L_hi], [P_lo, L_lo]}. The entire test length of a core is divided into two segments; P_hi is the peak power for a test length of L_hi, P_lo is the peak power for a test length of L_lo, and P_hi ≥ P_lo. The splitting of the power profile provides more flexibility in scheduling cores under a power constraint because it approximates false power more accurately than the GP-PAM model.

We adapt the benefits of 2LP-PAM in our DP approach, and discuss how to construct the power profile of a region, i.e., a collective profile of all the cores present in the region. We show how this technique helps in approximating power consumption and creating room for test concurrency between cores from different regions under a power constraint. Note that all cores in a given region are still scheduled sequentially, but the order in which they are scheduled affects the ability to simultaneously schedule cores in other regions. The power profile for a leaf region R is created by the procedure createPowerProfile, as outlined in Fig. 11. The procedure creates the two peaks P^R_hi and P^R_lo such that the area under the profile, P^R_hi × L^R_hi + P^R_lo × L^R_lo, is minimized. All the cores are sorted in decreasing order of the power that they consume during testing. Note that we assume uniform power consumption within a core during the time it is tested. The createPowerProfile procedure takes the sorted list of the cores present in a leaf region R, and finds an index i into the list that minimizes the area under the resulting profile. P^R_hi is the power consumption of the first core in the list. The list is sequentially scanned to find the index i; L^R_hi is set to the sum of the test times of the cores indexed one to i − 1 in the list, and L^R_lo takes the value of the sum of the test times of the remaining cores. P^R_lo is the power consumption of the ith core in the list.

Fig. 11. createPowerProfile procedure.
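A compact sketch of this profile-construction step is shown below; the data layout (a list of per-core power/test-time pairs) is an assumption, and the code illustrates the area-minimizing split rather than reproducing the authors' pseudocode.

def create_power_profile(cores):
    """Build a two-peak profile ((P_hi, L_hi), (P_lo, L_lo)) for one leaf
    region. `cores` is a list of (test_power, test_time) pairs; the split
    index is chosen to minimize the area P_hi*L_hi + P_lo*L_lo, assuming
    uniform power within each core."""
    cores = sorted(cores, key=lambda c: c[0], reverse=True)  # decreasing power
    total_time = sum(t for _, t in cores)
    p_hi = cores[0][0]                       # highest per-core power
    best = None
    l_hi = 0                                 # test time covered by the high peak
    for i, (p_i, t_i) in enumerate(cores):   # candidate split before core i
        p_lo, l_lo = p_i, total_time - l_hi  # low peak covers the remaining cores
        area = p_hi * l_hi + p_lo * l_lo
        if best is None or area < best[0]:
            best = (area, (p_hi, l_hi), (p_lo, l_lo))
        l_hi += t_i
    return best[1], best[2]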

2) Merging power profiles: We next explain how we merge the power profiles of two regions to obtain the power profile of the bigger rectangle under a given power constraint, so as to minimize test lengths under different scenarios. Given are two power profiles {[P^A_hi, L^A_hi], [P^A_lo, L^A_lo]} and {[P^B_hi, L^B_hi], [P^B_lo, L^B_lo]} for regions A and B, respectively. To merge these profiles such that the resultant profile does not exceed the power limit P_L, we first select the region with the lower test time (B in this case), and then shift and flip (mirror-image) its power profile; refer to Fig. 12. Shifting the profile ensures that the cores of B are scheduled as late as possible, and flipping it ensures that the addition of P^A_hi and P^B_hi is avoided as much as possible, thereby leading to minimization of the power consumption of the two regions taken together. The creation of schedules for cores is abstracted out with shift and flip operations on profiles, and this helps in minimizing power consumption.

The shape of the resulting profile depends on the values of the peaks of the profiles that are merged, and on the power constraint. For our running example, the merged profile can look like the profile with four peaks (Fig. 12). To be consistent with the two-peak power model, the number of peaks has to be reduced to two. In Fig. 13(a), different areas are marked that have to be collapsed with the original profile to obtain two peaks. Depending on the magnitude of the marked areas, the final profile can look like Fig. 13(b) or Fig. 13(c). The decision on which areas to collapse depends on the area under the profile, and the profile with the least area is chosen.

We next discuss the impact of power constraints on the merging of profiles. When a power violation occurs, the profile for B is shifted until no violation is caused. If the two lower peaks from the two regions exceed the power limit P_L, i.e., P^A_lo + P^B_lo > P_L, then the two power profiles cannot overlap. In such a case, we sort the four peaks in descending order and place them side-by-side to create the profile shown in Fig. 14(a). Different combinations of the marked areas are grouped together to reduce the number of peaks to two. The final profile can be any one of the profiles shown in Fig. 14(b)–(d). We have implemented a dedicated transformation procedure for all kinds of profiles that can result from the merge operation. The enumeration of these cases and the corresponding transformations are straightforward and therefore omitted.

3) Integration with DP: The proposed method for power-profile manipulation has been integrated with our DP approach. When a separator is placed in a region, rather than taking the maximum of the test times of the two subregions as the test time for that region, we compute the new test time as the test length of the power profile obtained after merging the power profiles of the two subregions. Line 6 in Fig. 10 is modified accordingly. Since any two power profiles are merged without violating the power constraint, the power profile of the NoC N does not violate the constraint.

The runtime complexity of the approach under power constraints remains unaffected. The procedure to merge two profiles takes constant time to compute a merged profile. When creating the power profile for a leaf region using the createPowerProfile procedure, the complexity increases by a factor of mn (an efficient implementation using DP increases the complexity by a factor of min(m, n)·log(mn)), but this part only constitutes the initialization step and does not dominate the runtime of the main algorithm.

D. Reducing the Runtime Complexity by a Factor of P

Fig. 12. Merging of power profiles.

Fig. 13. Transformation of the four-peak profile shown in (a) to the two-peak profiles shown in (b) and (c).

Fig. 14. Another example of profile transformation. The profile shown in (a) is transformed to one of the profiles shown in (b), (c), and (d).

The runtime complexity of the proposed approach was found to be a function of P^2. The number of pins P is much larger in magnitude than the other parameters, such as m, n, or k. We next show how the factor of P^2 can be reduced to P. The approach presented in the previous section distributes pins on both sides of the separator such that the sum of the pins on the two sides equals a given pin count p, and selects the distribution that minimizes the test time. All possible combinations are tried, making the search exhaustive. This is repeated for all possible values of p; therefore, a factor of P^2 appears in the computational complexity. The exhaustive search can be reduced to a selective search by observing that, given a position of the separator and an optimal distribution of p pins over the two subregions created by the separator, an optimal distribution for a count of p + 1 pins is obtained by assigning the newly added pin to the subregion having the greater test time.

The modified computeAndStore procedure that implements the selective search is shown in Fig. 15. While computing the minimum test time for a non-leaf region, we maintain an optimal distribution of pins over the two subregions for each position of the separator v, each pin count p, and each partition size x of one of the subregions (the partition size of the other subregion is computed as k − x, where k is the partition size of the parent region for which we are computing the minimum test time). For the base case, when the pin count p equals k, the number of pins assigned to the subregions equals their respective partition sizes, i.e., x and k − x. This is the only possible (and hence optimal) distribution because each leaf region has to be driven by at least one pin. The arrays M and B are updated whenever a better result is found. A similar procedure is used for sweeping horizontal separators.
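The incremental pin-assignment rule can be written in a few lines. In the sketch below, cost_A and cost_B are assumed helpers (in the actual algorithm they are lookups into the array M), so this illustrates only the selective-search idea rather than the full procedure.

def selective_pin_distribution(cost_A, cost_B, x, k, P):
    """For one separator position and one partition split (x regions in A,
    k - x regions in B), build pin splits for every total pin count
    p = k..P incrementally: the next pin always goes to the subregion whose
    test time currently dominates. cost_A(p)/cost_B(p) return the subregion
    test time with p pins."""
    pins_A, pins_B = x, k - x          # base case: one pin per leaf region
    best = {}
    for p in range(k, P + 1):
        best[p] = (pins_A, pins_B)
        if cost_A(pins_A) >= cost_B(pins_B):
            pins_A += 1                # region A dominates; give it the new pin
        else:
            pins_B += 1
    return best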


Fig. 15. Modified computeAndStore for reducing the runtime complexity.

VI. Experimental Results

In this section, we first describe how we created our test cases. The rest of the section includes the following.

1) For NoCs with arbitrary (irregular) topologies, we compare the results obtained from our subset-sum-based approach with an implementation of [12].

2) For a mesh-based NoC, we compare the results obtained using DP with those obtained with ILP [17]. Since the comparison is limited to problem instances that can be solved with ILP in a reasonable time, we consider small problem instances first. These test cases provide a good baseline against which we compare the quality of the solutions produced by our approach.

3) Results showing the effect of power constraints on the total test time are reported and compared with [32].

A. Test Scheduling Without Power Constraints

We used six SOCs from the ITC’02 SoC Test Benchmarks [33], namely, d695, g1023, p34392, p22810, t512505, and p93791, with 10, 14, 19, 28, 31, and 32 cores, respectively. Since we are interested in evaluating performance for a large number of cores, we created additional SoCs by taking cores from the last two SoCs and replicating them. The test times for all TAM widths (constrained by the flit width) for each core were obtained using the design wrapper algorithm [6].

For mesh-based NoCs, we adopted the same approach as outlined in [13], [16], [17] for generating the topology of the NoC. We assumed XY routing, a switching delay of three clock cycles per router for access point-to-core or core-to-access point path establishment, and one cycle each for transmitting the header and tail flits. A flit width of 32 bits was assumed. For each leaf region, we placed an access point at the middle of its longest boundary edge, and computed the cumulative routing delay of the region as the sum of the routing delays for all the cores located in that region. It can be shown, by taking the derivative of the routing delay and equating it to zero, that the routing delay is minimized when the access point is placed at this location. The routing delay for a core during path establishment is three times its Manhattan distance from the access point assigned to that core. We executed the ILP model using the FICO XPress-MP Solver [34] that was also used in [17].
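For reference, the cumulative routing delay used in this setup can be computed as in the following sketch, where the core tiles and the access point are hypothetical (x, y) tile coordinates and three cycles are charged per hop of Manhattan distance:

def cumulative_routing_delay(core_tiles, access_point, cycles_per_hop=3):
    # Sum of path-establishment delays for all cores in a region:
    # each core contributes cycles_per_hop times its Manhattan distance
    # from the region's access point.
    ax, ay = access_point
    return sum(cycles_per_hop * (abs(x - ax) + abs(y - ay)) for x, y in core_tiles)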

Each region contains exactly one access point; hence, the number of regions (partition size) in our work and the number of access points, as reported in [17], are synonymous. All results were obtained on a 16-core Intel(R) Xeon(R) machine with a processor speed of 2.53 GHz, 64 GB of memory, and a cache size of 12288 KB.

TABLE I
Comparison of Test Times (Clock Cycles) Obtained Using the Proposed Subset-Sum-Based Method With [12] for NoCs Having Arbitrary Topology

In the first experiment, we compare our method for an NoC with an arbitrary topology with the method presented in [12]. Since the latter is based on TR-Architect [24], it does not restrict the number of access points (ATE interfaces) that are used; it reports the number of access points that are required to minimize the test time. We modify the ModifiedCreateStartSolution procedure to initialize a solution with a fixed number of access points (K). Table I shows the results for the two methods for two NoC test cases. The column K∗ does not restrict the number of access points for the method of [12]. The column [12], K∗ reports the number of access points required (K∗) along with the minimized test time. We use K∗ to derive the test time for the proposed method, which is shown in the last column. The NoC examples (irreg1 and irreg2) are taken from [12] and they both consist of nine routers. Each router is assigned a core taken from the SoC d695. Since there are nine routers and d695 consists of 10 cores, we ignore one core, namely Module5, for this experiment. Note that the greedy procedure of pin distribution is optimal at every iteration; hence, solutions are obtained for a range of pin counts in a single run. The CPU time was found to be negligible in all cases. Our method consistently reports better solutions than [12].

Next, we present results for a mesh-based NoC. We took a 6 × 6 NoC obtained from the SoC t512505 (31 cores) and compared the results obtained using the two approaches. The ILP proposed in [17] does not address the problem of optimal access-point placement. Therefore, we show two different sets of results for the ILP model: one obtained by a random placement of access points (ILP∗) and the other obtained by placing the access points at the locations determined by DP (ILP∗∗). The results for different values of K and P are shown in Table II.


TABLE II

Results for t512505 (31 Cores) for Various Values of K and P

ILP∗: ILP with random placement of access points. ILP∗∗: ILP with placement of access points guided by the proposed technique.

The column Δ∗ (Δ∗∗) records the relative difference in the test time reported by dynamic programming (DP) over that obtained from ILP∗ (ILP∗∗). If the test time obtained by ILP is T_ILP∗ (T_ILP∗∗) and that obtained from DP is T_DP, then Δ∗ = (T_DP − T_ILP∗)/T_ILP∗ × 100% and Δ∗∗ = (T_DP − T_ILP∗∗)/T_ILP∗∗ × 100%.

Negative values of Δ indicate that the solutions obtained from ILP are worse than those from DP. The same experiment was repeated for another 6 × 6 NoC obtained from the SoC p93791 (32 cores); the corresponding results are tabulated in Table III. The CPU time for DP is less than 1 s.

The ILP solver was allowed to execute until it found an optimal solution. Note that an optimal solution for ILP∗ may be worse than a solution obtained using DP due to the nonoptimal placement of access points. Placing the access points at the locations suggested by the proposed approach almost always resulted in reduced test time. These results demonstrate that the proposed DP approach can produce results that are very close to those obtained by ILP. The DP approach leads to optimal solutions under the constraint of rectangular partitions; the ILP approach can lead to better solutions than DP by allowing nonrectangular partitions.

Table IV compares the test times obtained using various methods with TAM-independent lower bounds on test time that are derived using [24]. (Note that these lower bounds cannot always be achieved due to bottleneck cores [6].) The third column reports the lower of the test times obtained using ILP∗ and ILP∗∗. Since the lower-bound expression in [24] is not tied to any particular TAM design and utilizes only the volume of test data that must be transported, these bounds are also applicable to the test-time minimization problem in NoCs. Moreover, an NoC imposes additional constraints on the classical TAM optimization problem; therefore, the lower bounds of [24] hold in our NoC-based TAM scenario as well, and we expect actual test times to be larger than the lower bounds. Nevertheless, closeness to the lower bounds is a measure of the effectiveness of an optimization method. For the problem instances listed in Table IV, the test-time results are only 5%–13% higher than the provable lower bounds.

We show similar results for a larger 14 × 14 NoC obtained by replicating cores from the SoC benchmark p93791; see Table V. We set a time limit of three hours for ILP for each value of K and P, and report the best intermediate results obtained within that time limit. It was observed that with an increase in the number of regions K, ILP took longer to report the first intermediate solution. We set three hours as the limit because no noticeable improvement was seen in the ILP intermediate solutions after this duration. The CPU time taken by DP is only 4 minutes and 8 seconds, which is negligible in comparison to the cumulative execution time of 1 day and 12 hours taken by ILP for all 12 cases shown in Table V.

TABLE III

Results for p93791 (32 Cores) for Various Values of K and P

ILP∗: ILP with random placement of access points. ILP∗∗: ILP with a placement of access points guided by the proposed technique. #Not an optimal solution; ILP timed out after 3 h.

Our approach also reported better results for some larger values of K. Note that for instances for which ILP∗∗ yields lower test times than DP, a combination of the two methods can be used: an effective partition can first be identified using DP, and then the test-pin assignment problem can be solved using ILP, as in [17]. However, for larger problem instances, ILP is not feasible due to its high computational requirements.

We next show results for a 20 × 20 NoC obtained by replicating cores from the two SoC benchmarks; see Table VI. Considering the size of the NoC, a CPU time limit of 6 h was set for ILP for each of the 12 cases shown in Table VI. No intermediate solutions were reported for larger values of K. While the cumulative CPU time taken by ILP∗ for all the cases was found to be 2 days and 7 h, ILP∗∗ took 3 days. The CPU time taken by ILP is clearly impractical, and the ILP approach does not scale with the number of cores and the size of the partition. In contrast, the CPU time for DP is only 25 min and 10 s.

To further demonstrate the scalability and benefits of the proposed approach, we evaluated the DP method for an SoC of the future that has nearly 1000 cores (a 32 × 31 NoC). The DP procedure completes in 4 h of CPU time. In order to evaluate the quality of the solution (test time obtained), we developed a simple baseline heuristic for generating a partition, sketched below. Among all intermediate regions available for partitioning, the region having the largest number of network tiles was selected and a separator was placed randomly. Pins were distributed in the ratio of the dimensions of the leaf regions. We generated 100 such partitions for each value of K and for P = 150, and report the mean, minimum, and maximum test time for each case, as shown in Table VII. It can be seen from the table that DP provides consistently superior results: two orders of magnitude reduction in test time compared to the mean test time for the baseline case. Compared to the minimum test time for the baseline case, the test-time reduction is in the range of 13% to 48%.
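The baseline heuristic mentioned above can be sketched as follows. The data structures and the proportional rounding are assumptions for illustration: regions are axis-aligned rectangles of tiles, the largest region is split at a random separator until K regions exist, and pins are then distributed roughly in proportion to region size, with at least one pin per region.

import random

def random_baseline_partition(rows, cols, K, P, seed=None):
    # Randomized baseline: repeatedly split the region with the most tiles
    # at a random separator, then distribute pins proportionally (sketch).
    rng = random.Random(seed)
    regions = [(0, 0, rows, cols)]                     # (r0, c0, r1, c1) rectangles
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    while len(regions) < K:
        regions.sort(key=area, reverse=True)
        r0, c0, r1, c1 = regions.pop(0)
        if r1 - r0 >= c1 - c0 and r1 - r0 > 1:         # horizontal cut
            cut = rng.randint(r0 + 1, r1 - 1)
            regions += [(r0, c0, cut, c1), (cut, c0, r1, c1)]
        elif c1 - c0 > 1:                              # vertical cut
            cut = rng.randint(c0 + 1, c1 - 1)
            regions += [(r0, c0, r1, cut), (r0, cut, r1, c1)]
        else:                                          # single tile: cannot split further
            regions.append((r0, c0, r1, c1))
            break
    total = sum(area(r) for r in regions)
    # Rounding may leave the total slightly different from P; this is a sketch.
    pins = [max(1, round(P * area(r) / total)) for r in regions]
    return regions, pins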

We also considered an SoC with 1600 cores (a 40 × 40 NoC). The DP solution (for 2 ≤ K ≤ 5) was obtained in 3 h of CPU time for P = 150. The test times obtained from DP were consistently lower than those for the randomized baseline method for all values of K.


TABLE IV

Comparison of Test Times (in Cycles) for p93791 Obtained From Different Methods With the TAM-Independent Lower Bound Reported in [24]

TABLE V

Test Times for 14 × 14 NoC for Various Values of K and P

ILP∗: ILP with random placement of access points. ILP∗∗: ILP with placement of access points guided by the proposed technique. †Optimal solution obtained from ILP.

Fig. 16. Percentage reduction in test time achieved by the subset-sum-based method over the DP-based method for varying values of Δ for the 32 × 31 NoC, K = 4. The CPU time for each value of Δ is also shown.

For example, for K = 5, we obtained 19.80%, 44%, and 69.33% reductions in test time compared to the minimum, average, and maximum test times, respectively, obtained from the baseline method.

We also examined the scalability of the subset-sum-based method for large SOCs. We ran the procedure optimize TAM on the 32 × 31 NoC for K = 4. Fig. 16 shows the percentage reduction in test time reported by the procedure over the DP-based method for varying values of Δ and P. The figure also shows the CPU time needed by the procedure for each value of Δ. It can be seen that when Δ is high, the subset-sum-based method is capable of producing better results than DP, but it takes more CPU time. The reduction in test time was found to be as high as 3.3% when Δ was set to 20000.

TABLE VI

Test Times for 20 × 20 NoC for Various Values of K and P

ILP∗: ILP with random placement of access points. ILP∗∗: ILP with placement of access points guided by the proposed technique. †Optimal solution obtained from ILP. NR‡: ILP did not find any solution within the time limit (6 h).

TABLE VII

Test Time of Dynamic Programming Versus Randomized Baseline Approach for 32 × 31 NoC, P = 150

The test time reported by the procedure optimize TAM was 1.4% more than that for DP for Δ = 5000. The CPU time varied from 7.2 h to 1.5 h as Δ was swept from high to low values, whereas DP took 4 h to produce the results. It will be seen later that, for the 32 × 31 NoC, the CPU time for DP can be further lowered to 11 min 50 s using the speedup technique discussed in Section V-D. As mentioned in Section III-E, the optimize TAM procedure scales down the test times of individual cores to avoid computational bottlenecks in solving subset-sum problem instances. This approximation step may lead to the elimination of some valid partitions from the solution space; therefore, optimality is not guaranteed with the subset-sum-based approach.

We now analyze our experimental results further. Table VIII reports the test times for a 14 × 14 SoC (196 cores) for partition sizes K varying from 3 to 8, and for three different values of P. The table also shows the results produced by [12] for all these cases. The test-time reduction (TTR) column shows the relative reduction in test time obtained by adding an additional access point (using the DP approach). It can be seen that the magnitude of the reduction in test time gradually decreases.


TABLE VIII

Diminishing Improvement in the Reduction in Test Time as K Increases for a 14 × 14 NoC

†Test-time reduction (TTR): relative reduction in test time obtained by adding an access point.

This is because the test time of a core depends on the number of pins assigned to it, and as the partition size increases, the number of pins available per region decreases. By increasing the pin count, we observe that the effect of a sudden decrease in TTR can be moderated. For example, for P = 80, the TTR rapidly dipped to 0%, but we were able to moderate the sudden decline by allotting 120 pins, and obtained further benefits by increasing the pin count to 160. However, the number of available pins on the ATE is limited; hence, it is natural to ask what a suitable choice for the partition size and the pin count is, and how these values can be calculated. These questions will be addressed in future work.
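For clarity, the TTR column can be read as the relative drop in test time when the partition size grows from K − 1 to K; this formula is our interpretation of the column, which the text defines only in words:

\mathrm{TTR}(K) = \frac{T_{K-1} - T_{K}}{T_{K-1}} \times 100\%

where T_K denotes the test time obtained with K access points.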

The row K∗ in Table VIII shows the result produced by [12] when no restriction is placed on the number of access points to be used. It can be seen that [12] reports an extremely large number of access points, which can be harder to implement in practice. Moreover, a large number of access points can lead to the associated problem of power consumption because of test parallelism. We report lower test times than [12] using fewer access points. For P = 160 (not shown in the table), the improvement achieved by our method over [12] is as high as 37.6% when K is restricted to 8. When K is not restricted, [12] resulted in a test time (using 23 access points) that is worse than the test time reported by DP with only seven access points. Since our approach only creates rectangular partitions, a simple postprocessing step, such as that implemented in the procedure ModifiedReshuffle of [12], can further reduce test time by moving cores from one region to another. We also report lower-bound values for the two values of P in the last row of Table VIII. The test times that we obtained are only 9.4% (12.5%) larger than the provable lower bounds for P = 80 (P = 120).

B. Power-Constrained Test Scheduling

To assess the impact of power constraints on test scheduling, we ran our approach on two NOCs: a 6 × 6 NoC and an NoC with 100 cores (10 × 10), both constructed out of cores from the benchmark circuit p93791. Due to the lack of information on the power consumption of these cores, we assumed that the power consumption of a core is directly proportional to the sum of the number of the core's inputs, outputs, bidirectional pins, and memory elements, the same approach as adopted in [32].

TABLE IX

Test Times of p93791 for Four Access Points Under Power Constraints

1 Power constraint relative to total power consumption in SoC.

All values for power consumption used were relative to the total sum of the power consumption of all cores, which is referred to as the system power consumption in [32]. We therefore refer to power in terms of a normalized value relative to the total system power.
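Under these assumptions, the normalized power of a core can be computed as in the following sketch; the core fields are hypothetical names standing in for the ITC'02 core data:

def normalized_core_power(core, all_cores):
    # Test power of a core as a fraction of the total system power, where a
    # core's power is taken as proportional to the sum of its inputs, outputs,
    # bidirectional pins, and memory elements (illustrative model).
    weight = lambda c: c.inputs + c.outputs + c.bidirs + c.memory_elements
    return weight(core) / sum(weight(c) for c in all_cores)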

Table IX compares the test lengths obtained by our approach with those obtained using the ILP model from [32]. All power constraints are defined as a fraction of the system power. Scheduling with a power constraint of 1.0 is equivalent to scheduling without power constraints, as no schedule can exceed the total system power. Since our approach approximates the power consumption of a set of cores by manipulating power profiles to create an approximate profile, the performance of the approach depends on how tightly the approximation scheme bounds the actual power profile from above. The test lengths were found to match closely with the results obtained using [32] for all values of the power limit.

As our approach is necessitated by the intractability of problem instances involving large NoCs, we present the results for a 10 × 10 NoC in [35, Table X] for different power constraints and partition sizes. Since each core contributes very little to the system power consumption in this case, the power constraint was set to 25% of the system power consumption at first, and the power budget was then reduced by 5% at each step. The runtime complexity remains the same as before, and no appreciable difference in runtime was found for the reported cases.

C. Speedup Technique

We next show the effect of the speedup technique, discussed in Section V, on the computation time for DP. In [35, Table XI], the third column corresponds to the approach taken for reducing the runtime complexity by a factor of P. The speedup is clearly evident for larger NOCs.

VII. Conclusion

We have developed a scalable solution to the problem of optimizing test-data delivery in an NoC-based manycore SoC. A formulation based on the subset-sum problem has been proposed for NoCs with dedicated routing and arbitrary topologies. For grid topologies supporting XY routing, test-time minimization has been solved using DP, which computes optimal solutions for rectangular partitions. Results for NoC-based manycore SOCs constructed from the ITC’02 benchmarks have shown that the proposed method yields high-quality results and scales to large SOCs with many cores. Test scheduling under power constraints and a speedup technique have also been incorporated.


Since dynamic programming solutions are recursively constructed from solutions to underlying subproblems, the proposed method can inherently facilitate design-space exploration for effective test planning.

References

[1] International Technology Roadmap for Semiconductors (ITRS) (2009) [Online]. Available: www.itrs.net/Links/2009ITRS/Home2009.htm

[2] L. Benini and G. D. Micheli, “Networks on chips: A new SoC paradigm,” Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.

[3] Nvidia Fermi: Next Generation CUDA Architecture. (2014, Mar. 21) [Online]. Available: http://www.nvidia.com/object/fermi architecture.html

[4] Intel Corp. (2014, Mar. 21) [Online]. Available: http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html

[5] C. Zeferino et al., “A study on communication issues for systems-on-chip,” in Proc. Symp. Integr. Circuits Syst. Design, Sep. 2002, pp. 121–126.

[6] V. Iyengar et al., “Test wrapper and test access mechanism co-optimization for system-on-chip,” J. Electron. Test., vol. 18, no. 2, pp. 213–230, Apr. 2002.

[7] V. Iyengar et al., “Efficient test access mechanism optimization for system-on-chip,” IEEE Trans. Computer-Aided Design, vol. 22, no. 5, pp. 635–643, May 2003.

[8] A. Sehgal et al., “SoC test planning using virtual test access architectures,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 12, pp. 1263–1276, Dec. 2004.

[9] A. M. Amory et al., “DfT for the reuse of networks-on-chip as test access mechanism,” in Proc. IEEE VLSI Test Symp., 2007, pp. 435–440.

[10] M. Li et al., “An efficient wrapper scan chain configuration method for network-on-chip testing,” in Proc. Symp. Emerging VLSI Technol. Arch., 2006, pp. 135–140.

[11] F. A. Hussin et al., “Optimization of NoC wrapper design under bandwidth and test time constraints,” in Proc. Eur. Test Symp., 2007, pp. 35–42.

[12] A. M. Amory et al., “A new test scheduling algorithm based on networks-on-chip as test access mechanisms,” J. Parallel Distrib. Comput., vol. 71, no. 5, pp. 675–686, 2011.

[13] E. Cota et al., “The impact of NoC reuse on the testing of core-based systems,” in Proc. IEEE VLSI Test Symp., 2003, pp. 128–133.

[14] E. Cota et al., “Power-aware NoC reuse on the testing of core-based systems,” in Proc. Int. Test Conf., 2003, pp. 612–621.

[15] J. M. Nolen and R. Mahapatra, “A TDM test scheduling method for network-on-chip systems,” in Proc. Int. Workshop Microprocess. Test Verification Validation, 2005, pp. 90–98.

[16] C. Liu et al., “Power-aware test scheduling in network-on-chip using variable-rate on-chip clocking,” in Proc. IEEE VLSI Test Symp., 2005, pp. 349–354.

[17] M. Richter and K. Chakrabarty, “Test pin count reduction for NoC-based test delivery in multicore SOCs,” in Proc. Design, Autom. Test Eur., 2012, pp. 787–792.

[18] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surv., vol. 38, no. 1, pp. 1–51, Jun. 2006.

[19] P. P. Pande et al., “Design, synthesis, and test of networks on chips,” IEEE Design Test Comput., vol. 22, no. 5, pp. 404–413, Sep.–Oct. 2005.

[20] A. Dalirsani et al., “Structural test for graceful degradation of NoC switches,” in Proc. Eur. Test Symp., 2011, pp. 183–188.

[21] C. Liu et al., “Test scheduling for network-on-chip with BIST and precedence constraints,” in Proc. Int. Test Conf., 2004, pp. 1369–1378.

[22] A. van den Berg et al., “Bandwidth analysis for reusing functional interconnect as test access mechanism,” in Proc. Eur. Test Symp., 2008, pp. 21–26.

[23] J. Dalmasso et al., “Improving the test of NoC-based SoCs with help of compression schemes,” in Proc. Annu. Symp. VLSI, 2008, pp. 139–144.

[24] S. K. Goel and E. J. Marinissen, “Effective and efficient test architecture design for SoCs,” in Proc. Int. Test Conf., 2002, pp. 529–538.

[25] E. Cota and C. Liu, “Constraint-driven test scheduling for NoC-based systems,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 11, pp. 2465–2478, 2006.

[26] S. Martello and P. Toth, “Subset-sum problem,” in Knapsack Problems: Algorithms and Computer Implementations. New York, NY, USA: Wiley-Interscience, 1990.

[27] Y. Huang et al., “Resource allocation and test scheduling for concurrent test of core-based SoC design,” in Proc. IEEE Asian Test Symp., 2001, pp. 265–270.

[28] D. Wentzlaff et al., “On-chip interconnection architecture of the Tile processor,” IEEE Micro, vol. 27, no. 5, pp. 15–31, Sep. 2007.

[29] P. Gratz et al., “On-chip interconnection networks of the TRIPS chip,” IEEE Micro, vol. 27, no. 5, pp. 41–50, Sep. 2007.

[30] M. Agrawal, M. Richter, and K. Chakrabarty, “A dynamic programming solution for optimizing test delivery in multicore SOCs,” in Proc. Int. Test Conf., 2012, pp. 1–10.

[31] N. Nicolici and B. M. Al-Hashimi, “Power profile manipulation,” in Power-Constrained Testing of VLSI Circuits. Norwell, MA, USA: Kluwer, 2003.

[32] M. Richter and K. Chakrabarty, “Optimization of test pin-count, test scheduling, and test access for NoC-based multicore SOCs,” IEEE Trans. Comput., vol. PP, no. 99, pp. 1–1, 2013, to be published.

[33] E. Marinissen, V. Iyengar, and K. Chakrabarty, “A set of benchmarks for modular testing of SoCs,” in Proc. Int. Test Conf., 2002, pp. 519–528.

[34] FICO Xpress Optimization Suite. (2014, Mar. 21) [Online]. Available: http://www.fico.com/en/Products/DMTools/Pages/FICO-Xpress-Optimization-Suite.aspx

[35] M. Agrawal et al., “Test-delivery optimization in manycore SoCs,” Duke University, Durham, NC, Tech. Rep. 10161/8404 [Online]. Available: http://dukespace.lib.duke.edu/dspace/handle/10161/8404

Mukesh Agrawal (S’13) received the B.Tech. and M.Tech. degrees in computer science and engineering from the Indian Institute of Technology Kharagpur, Kharagpur, India, in 2008. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA.

He was with Microsoft Corporation as a Software Design Engineer for two years. His current research interests include design-for-test, test scheduling, and test-optimization techniques for many-core designs.

Michael Richter (M’11) received the B.Sc. degree in software engineering from the Hasso-Plattner-Institute, University of Potsdam, Potsdam, Germany, in 2004, and the German Diploma and the Ph.D. degree in computer science from the University of Potsdam, in 2007 and 2011, respectively.

He is currently a Design-for-Test Manager with Intel Mobile Communications GmbH, Germany. His current research interests include nonlinear error correction circuitry, testing and design-for-testability of integrated circuits, and test scheduling techniques.

Krishnendu Chakrabarty (F’08) received the B.Tech. degree from the Indian Institute of Technology Kharagpur, Kharagpur, India, in 1990, and the M.S.E. and Ph.D. degrees from the University of Michigan, Ann Arbor, MI, USA, in 1992 and 1995, respectively.

He is currently the William H. Younger Distinguished Professor of Engineering with the Department of Electrical and Computer Engineering and a Professor of Computer Science with Duke University, Durham, NC, USA. His current research interests include testing and design-for-testability of integrated circuits, digital microfluidics, biochips, and cyberphysical systems, optimization of digital print, and enterprise systems.

Dr. Chakrabarty was a recipient of the National Science Foundation Early Faculty Award, the Office of Naval Research Young Investigator Award, the Humboldt Research Award from the Alexander von Humboldt Foundation, Germany, and 10 Best Paper Awards at major IEEE conferences. He is a fellow of ACM and a Golden Core Member of the IEEE Computer Society. He served as the Editor-in-Chief of the IEEE Design and Test of Computers during 2010–2012. Currently, he serves as the Editor-in-Chief of the ACM Journal on Emerging Technologies in Computing Systems. He is also an Associate Editor of the IEEE Transactions on Computers and the IEEE Transactions on Biomedical Circuits and Systems.