
Grant Agreement number: 248972

Project acronym: NaNoC

Project title: “Nanoscale Silicon-Aware Network-on-Chip Design Platform”

Seventh Framework Programme

Funding Scheme: Collaborative project

Theme ICT-2009.3.2 Design of semiconductor components and electronic based miniaturised systems

Start date of project: 01/01/2010 Duration: 36 months

D 1.4 Definition of a system-level testing and diagnosis strategy for NoC architectures

Due date of deliverable: January 2012

Actual submission date: January 2012

Organization name of lead beneficiary for this deliverable: UNIFE, IMC

Work package contributing to the Deliverable: WP1

Dissemination Level
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)

APPROVED BY:

Partners       Date
All partners   11th January 2012


INDEX

1 Introduction

2 Exploiting Network-on-Chip Structural Redundancy for a Cooperative and Scalable Built-In Self-Test Architecture
  2.1 Motivation
  2.2 Target Architecture
  2.3 Built-In Self-Test/Diagnosis Framework
    2.3.1 Testing communication channels
    2.3.2 TPG for communication channels
    2.3.3 Testing Other Internal Switch Modules
    2.3.4 Fault detection and diagnosis
    2.3.5 BIST-enhanced switch architecture
  2.4 Experimental results
    2.4.1 Fault Coverage
  2.5 Conclusions

3 Designing a Built-In Scan Chain-Based Testing Framework for Network-on-Chip Switches
  3.1 Introduction
  3.2 The scan chain tool-flow
  3.3 The baseline implementation
  3.4 Customizations for the NoC setting
  3.5 Experimental results
    3.5.1 Scan chain results: custom vs. baseline implementation
    3.5.2 Comparison with the deterministic test pattern-based framework
  3.6 Conclusions

4 Optimizing Built-In Pseudo-Random Self-Testing for Network-on-Chip Switches
  4.1 Motivation
  4.2 Target Architecture
  4.3 Optimized Pseudo-Random Testing Framework
    4.3.1 Testing communication channels
    4.3.2 Testing multiplexers of the crossbar
    4.3.3 Testing LBDR
    4.3.4 Testing Arbiters
    4.3.5 BIST-enhanced switch architecture
  4.4 Experimental Results
  4.5 Conclusions

5 Built-in Self-Testing and Self-Diagnosis for bisynchronous channels
  5.1 Introduction
  5.2 Target GALS Architecture
  5.3 Built-in self-diagnosis framework
  5.4 Operative principles
  5.5 Synchronization of the handshaking signals
  5.6 Experimental results
    5.6.1 Area
    5.6.2 Fault Coverage
    5.6.3 Latency
  5.7 Conclusions

6 Debug and Trace
  6.1 Introduction
  6.2 Architecture and Topology
  6.3 Monitors and Routers
    6.3.1 Monitors
    6.3.2 Routers
    6.3.3 Arbitration
  6.4 Timestamping
  6.5 Results
  6.6 Conclusions

7 Architecture for Hardware and Software co-debug
  7.1 Introduction
  7.2 Architecture Detail

8 Conclusions


ABSTRACT

Embedded systems have been shifting to multi-core solutions (Multiprocessor Systems-on-Chip, MPSoCs). A clear example of high-end MPSoCs are the products offered by Tilera [4], where multi-core chips provide support to a wide range of computing applications, including high-end digital multimedia, advanced networking, wireless infrastructure and cloud computing.

Main microprocessor manufacturers are also shifting to chip multiprocessors (CMPs) for their latest products. In CMPs many cores are put together in the same chip and, as technology advances, more cores are being included. Recently, Intel announced a chip with 48 cores, under the Tera-scale Computing Research Program [36]. Previously, Intel also developed a chip prototype [37] that included 80 cores (known as the TeraFlops Research Chip).

Current trends indicate that multi-core architectures will be used in most application domains with energy efficiency requirements exceeding 10 GOPS/Watt. However, aggressive CMOS scaling accelerates transistor and interconnect wearout, resulting in shorter and less predictable lifespans for CMPs and MPSoCs [34]. It has been predicted that future designs will consist of hundreds of billions of transistors, with upwards of 10% of them being defective due to wearout and process variation [32]. Consequently, in order to support the current technology trends, we must develop solutions to design reliable systems from unreliable components, managing both design complexity and process uncertainty [33].

The Network-on-Chip (NoC), a high-performance and scalable communication mechanism, is being increasingly investigated by researchers and designers to address the issues of interconnect complexity in both CMPs and MPSoCs [35]. The reliability of NoC designs is threatened by transistor wearout in aggressively scaled technology nodes. Wear-out mechanisms, such as oxide breakdown and electromigration, become more prominent in these nodes as oxides and wires are thinned to their physical limits. These breakdown mechanisms occur over time, so traditional post-burn-in testing will not capture them. NoCs provide inherent structural redundancy and interesting opportunities for fault diagnosis and reconfiguration to address transistor or interconnect wearout.

In this direction, this deliverable will provide an exploration of testing strategies for NoC-based systems and will promote one of them as the reference testing and diagnosis strategy. The optimization of the stuck-at fault coverage and the minimization of the area and latency of the testing and diagnosis strategy will be the guideline of the document.

The deliverable presents four scalable built-in self-test and self-diagnosis infrastructures for NoCs based on four different testing strategies. Although both conventional and non-conventional techniques are considered, all the strategies are optimized and customized for the target NoC environment, taking full advantage of the intrinsic network structural redundancy. In addition to the testing and diagnosis frameworks for the NoC hardware, a novel debug and trace system has also been devised to optimize and debug the software in the system.

This deliverable is the outcome of task 1.8, focused on the development of a system-level testing and diagnosis policy for NoC architectures. The result from this task will influence WP5, where the switch architecture will be enhanced with a BIST/BISD infrastructure.


GLOSSARY

CMP: Chip Multi-Processor
MPSoC: Multi-Processor System-on-Chip
NoC: Network-on-Chip
GPU: Graphics Processing Unit
LBDR: Logic-Based Distributed Routing
TPG: Test Pattern Generator
ATA: Auto-Test Analyzer
FIFO: First In First Out
BIST: Built-In Self-Test
BISD: Built-In Self-Diagnosis
GALS: Globally Asynchronous Locally Synchronous
FSM: Finite-State Machine
DUT: Device Under Test
TRC: Two-Rail Checker
MISR: Multiple Input Signature Register
LFSR: Linear Feedback Shift Register
MTTF: Mean Time To Failure
AHB: Advanced High-performance Bus
DC FIFO: Dual-Clock First In First Out
SoC: System-on-Chip
EOP: End Of Period
WUP: Wake Up Packet
GOPS: Giga Operations Per Second
CMOS: Complementary Metal-Oxide-Semiconductor


1 Introduction

On-chip interconnection networks are rapidly becoming the reference communication fabric for multi-core computing platforms, both in high-performance processors and in many embedded systems [8, 4]. As the integration densities and the uncertainties in the manufacturing process keep increasing, complementing NoCs with efficient test mechanisms becomes a key requirement to cope with high defect rates [7, 12]. Above all, the NoC testing infrastructure should not be conceived in isolation, but should be coherently integrated into a reliability framework taking care of fault detection, diagnosis and network reconfiguration and recovery to preserve yield [10].

Moreover, wear-out mechanisms such as oxide breakdown, electro-migration and mechanical/thermal stress become more prominent in aggressively scaled technology nodes. These breakdown mechanisms occur over time; therefore the methodology and the infrastructure used for production testing should be designed for re-use during the system lifetime as well, thus enabling graceful degradation of the NoC over time.

The detection and identification of failures is the foundation of any reliability framework. Unfortunately, developing such a testing infrastructure for a NoC is a serious challenge. The controllability/observability of NoC links and sub-blocks is relatively reduced, due to the fact that they are deeply embedded and spread across the chip. Also, pin-count limitations restrict the use of I/O pins dedicated to the test of the different NoC components. A number of other concerns were raised in [11] on the use of external testers for nanoscale chip testing. First, the lack of scalability of test data volumes with the number of gates hidden behind each package pin. Second, the need for testing at full clock speed, which is overly expensive, if even possible, to accomplish with external testers. Third, the poor suitability of the latter for lifetime testing (and not just for production testing) and for a test-and-repair approach (beyond the baseline go/no-go philosophy). As an effect, a migration from external testers to built-in self-test (BIST) infrastructures was envisioned in [11], and was later confirmed by the large amount of work in the open literature targeting scalable BIST architectures for NoC testing [2, 22, 17]. At the same time, the limited fault coverage that functional testing can achieve on the control path of NoC switches when the test generators are outside the switch has further pushed the adoption of BIST units at least for such control blocks [9].

In this direction, this document relies on full BIST strategies for NoC testing. A key principle of our approaches consists of exploiting the inherent structural redundancy provided by NoCs. Each switch is comprised of input ports, output ports, arbiters and FIFOs that are duplicated for each channel. This feature is used to develop very effective test strategies which consist of testing multiple identical blocks in parallel and of cutting down on the number of test pattern generators. This is done both at the abstraction level of the switch micro-architecture (e.g., testing of the output port arbiters in parallel) and of the NoC architecture (i.e., testing of all NoC switches in parallel). The inherent parallelism of our BIST procedure makes our testing infrastructure highly scalable and best suited for large network sizes.

Finally, our BIST procedures are suitable both for production and for lifetime testing, and are complemented by a built-in self-diagnosis logic distributed throughout the network architecture, able to pinpoint the location of detected faults in each switch. This diagnosis outcome matches the reconfigurability requirements of logic-based distributed routing.

Considering the regular and modular structure of on-chip networks, test strategies previously proposed for systems with identical cores [15, 23] can be applied to the NoC. However, both approaches incur a significant overhead for DfT structures (full-scan and IEEE 1500-wrapped cores with registered I/O pins).

It is shown in [14] that traditional full-scan and boundary-scan strategies like [18, 21, 38, 17] incur a hardly affordable area overhead. [14] also proposes a partial scan technique in combination with an IEEE 1500-compliant test wrapper. The area overhead is greatly reduced, but test application times amount to tens of thousands of clock cycles and test pattern generation time does not scale.

As opposed to using scan paths and wrappers for test access, [5] considers the case where test patterns are applied at the border I/Os of the network. The method was then extended in [6] to support fault diagnosis, while the DfT infrastructure was developed in [9]. While very high fault coverage was achieved, the time complexity of the test configurations is quadratic in the rank of the NoC matrix. Moreover, in order to apply test patterns from network boundaries at speed, a large number of test pins is necessary.


In [19], it is proposed to add dedicated logic to enable analysis of the response from each FIFO in the switch; however, no test data is presented. In [16] the possibility to repair the NoC during testing is envisioned; however, error information is computed once and for all and thus cannot handle situations where the chip slowly degrades.

[20] proposes a built-in self-test and self-diagnosis architecture to detect and locate faults in the FIFOs and in the MUXes of the switches. Unfortunately, the control path is left out of the framework. In [2] an automatic go/no-go BIST operation is proposed at start-up of a 2D mesh NoC. Low fault coverage is achieved for the switch controller; moreover, the methodology applies only to a 2D mesh. That idea is evolved in [13], where a fault coverage close to 100% is documented within a few thousand clock cycles. However, the area cost of the BIST architecture is the main concern of this work. The pattern-based testing section of the more general reliability framework presented in [10] reports a testing methodology relying on random test pattern generation and signature analysis. Unfortunately, testing takes as many as 200,000 cycles with 10,000 patterns per test.

With respect to previous work, the deliverable presents four scalable built-in self-test and self-diagnosis infrastructures that make a more efficient use of NoC structural redundancy for testing and diagnosis purposes through a cooperative testing framework. In the first section, we take on the challenge pointed out in [2] of exploiting knowledge of the architecture behavior to come up with a set of customized test patterns for NoC components. We thus present a deterministic test pattern-based BIST/BISD framework with low latency and high coverage that at the same time detects TPG faults. In the second section, we present a strategy based on a scan chain mechanism. With respect to conventional scan-based approaches, we build in the test pattern generator and the diagnosis modules, and we reduce the area overhead. In the third section, we design a testing framework based on pseudo-random patterns, where we cut down on the test application time and provide efficient testing of the control path. In the fourth section, we propose one of the first built-in self-testing and diagnosis frameworks for Globally-Asynchronous-Locally-Synchronous Networks-on-Chip based on asynchronous handshaking. Finally, in the last section of the deliverable, a novel debug and trace system is devised for debugging the software in the system.

Overall, the deliverable provides an exploration of built-in testing strategies customized for NoC-based systems, comparing them from an area, coverage and latency standpoint. Two partners contributed to this deliverable, as planned in the Description of Work (task 1.8): UNIFE (hardware testing strategies) and IMC (software debugging strategies). There was a synergistic co-design effort on the dual-network infrastructure used in both cases for the delivery of control information to a global controller. The backbone of this infrastructure was developed in the context of WP2 (deliverable 2.1), while its hierarchical extension is reported in this deliverable. The co-design effort was such that the same infrastructure can be reused in the system for different purposes.


2 Exploiting Network-on-Chip Structural Redundancy for a Cooperative and Scalable Built-In Self-Test Architecture

2.1 Motivation

This section proposes a built-in self-test/self-diagnosis procedure at start-up of an on-chip network (NoC) based on deterministic test patterns. Concurrent BIST operations are carried out after reset at each switch, thus resulting in a test application time that is scalable with network size. The key principle consists of exploiting the inherent structural redundancy of the NoC architecture in a cooperative way, thus detecting faults in test pattern generators too.

Four main features differentiate the testing framework proposed in this section from most previous work. First, we take on the challenge of generating deterministic test vectors on-chip at a limited area overhead. At the same time, this enables us to report much shorter test application times than typical pseudo-random testing frameworks, and larger fault coverage in the control path than most functional testing frameworks for NoCs. Second, we account for the tedious problem of faults affecting test pattern generators (TPGs) and provide large coverage for them. This is done without implementing more hardware redundancy, but by fully exploiting the existing one by means of a cooperative testing framework among switches. Third, our testing framework targets double and triple stuck-at faults from the ground up, and not as an afterthought. Fourth, our framework is not limited to regular 2D meshes, but can be applied to a much wider range of network topologies.

As a result, the coverage for single stuck-at faults closely tracks 100% in both the control and data path of the network. The latter is achieved by means of deterministic test patterns handcrafted for the specific block under test by exploiting knowledge of the architecture behavior. Finally, at-speed testing of stuck-at faults can be performed in less than 1200 cycles regardless of the network size, with a hardware overhead of less than 26%.

2.2 Target Architecture

Without loss of generality, we use the xpipesLite switch architecture [3] to prove the viability of our testing methodology in a realistic NoC setting. The baseline switch architecture is illustrated in Fig.1. It implements both input and output buffering and relies on wormhole switching. The crossing latency is 1 cycle in the link and 1 cycle in the switch itself.

The flit width assumed in this document is 32 bits, but can be easily varied. Without loss of generality, in this report the size of the output buffers is 6 flits, while it is 2 flits for the input buffers.

This switch relies on a stall/go flow control protocol. It requires two control wires: one going forward and flagging data availability ("valid"), and one going backward and signaling either a condition of buffer filled ("stall") or of buffer free ("go").
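The stall/go handshake described above can be captured in a minimal cycle-based sketch. The Channel class and its method names below are illustrative, not taken from the xpipesLite RTL; they only model the two-wire semantics (a flit is transferred when "valid" is high and the receiver is not stalling).

```python
# Minimal sketch of the stall/go protocol (illustrative model, not RTL).
class Channel:
    def __init__(self, depth):
        self.depth = depth
        self.fifo = []

    @property
    def stall(self):
        # Receiver asserts "stall" when its buffer is full, "go" otherwise.
        return len(self.fifo) == self.depth

    def cycle(self, valid, flit):
        # A flit is accepted only when "valid" is asserted and "go" holds.
        if valid and not self.stall:
            self.fifo.append(flit)
            return True   # flit accepted
        return False      # flit held back; the sender must retry

ch = Channel(depth=2)
assert ch.cycle(True, "f0")        # accepted
assert ch.cycle(True, "f1")        # accepted, buffer now full
assert ch.stall                    # backpressure asserted
assert not ch.cycle(True, "f2")    # flit rejected while stalled
```

The sender simply keeps "valid" asserted and resends the flit on the next cycle whenever "stall" was high, which is the behavior the test patterns of Section 2.3.1 exercise.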

The switch architecture is extremely modular and exposes large structural redundancy: a port arbiter, a crossbar multiplexer and an output buffer are instantiated for each output port, while a routing module is cascaded to the buffer stage of each input port. This feature, common to all switch architectures, will be intensively exploited in this work.

We implement distributed routing by means of route selection logic located at each input port. Forwarding tables are usually adopted for this purpose, although they feature poor area and delay scalability with network size [24]. The possibility to implement logic-based distributed routing (LBDR) while retaining the flexibility of forwarding tables has been recently demonstrated in [25]. In practice, LBDR consists of a selection logic for the target switch output port relying on a few switch-specific configuration bits (namely routing bits Rxy, connectivity bits Cz and deroute bits drt). The number of these bits (14 in this case) is orders of magnitude smaller than the size of a forwarding table, yet makes the routing mechanism reconfigurable.

The core of the LBDR logic is illustrated in Fig.2(a), which shows the conditions that select the north output port (UN′) for routing. The pre-processed direction of the packet destination (N′/S′/W′/E′) is an input, together with the routing and the connectivity bits. In some cases (see [25] for details), deroutes are needed to properly route packets; the associated logic is reported in Fig.2(b).
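As a concrete sketch of this selection condition, the function below follows the usual formulation of LBDR for the north port: the request is raised when the destination lies straight north, or north-east/north-west with the corresponding turn permitted by the routing bits (here named Rne and Rnw), and is masked by the north connectivity bit Cn. The exact gate-level structure of Fig.2(a) may differ; this is only an assumed, literature-style rendering.

```python
# Sketch of the LBDR north-port selection (usual LBDR formulation;
# Rne/Rnw/Cn names are assumptions, not taken from Fig.2(a) verbatim).
def north_requested(Np, Ep, Wp, Rne, Rnw, Cn):
    # Np/Ep/Wp are the pre-processed destination-quadrant bits N'/E'/W'.
    un = (Np and not Ep and not Wp) \
         or (Np and Ep and Rne) \
         or (Np and Wp and Rnw)
    # A disconnected north link (Cn == 0) masks the request.
    return un and Cn

# Destination straight north, north link alive:
assert north_requested(True, False, False, Rne=True, Rnw=True, Cn=True)
# North-east destination with the NE turn forbidden by the routing bits:
assert not north_requested(True, True, False, Rne=False, Rnw=True, Cn=True)
```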

LBDR supports the most widely used algorithms for irregular topologies and can be used on a 2D mesh as well as on roughly 60% of the irregular topologies derived from a 2D mesh, like


Figure 1: Modular structure of the baseline switch architecture. Not all connections are shown.

(a) Core logic for route computation. (b) Deroute logic. (c) Connectivity bit setting because of failed links:

Router  Cn[0]  Ce[0]  Cw[0]  Cs[0]
0       0      1      0      1
1       0      1      1      0
2       0      1      1      1
3       0      0      1      1
4       1      1      0      1
5       0      1      1      1
6       1      1      1      0
7       1      0      1      1
8       1      1      0      1
9       1      1      1      1
10      0      1      1      1
11      1      0      1      1
12      1      1      0      0
13      1      1      1      0
14      1      1      1      0
15      1      0      1      0

Figure 2: LBDR logic and requirements on the diagnosis outcome.

in Fig.2(c). Its extension to fully irregular topologies with 12 more bits per switch is an ongoing work. Irregularity of the connectivity pattern can be an effect of manufacturing or wearout faults, but also of power management or thermal control decisions, or of virtualization strategies. Switch configuration bits need to be updated whenever the topology evolves from one connectivity pattern to another (e.g., when a fault is detected).

Our testing and diagnosis framework has been conceived to enable a network reconfiguration strategy leveraging the cost-effective flexibility offered by the LBDR routing mechanism. An algorithm is reported in [25] for the computation of the switch configuration bits given the topology connectivity pattern. As an example, updated connectivity bits are illustrated in Fig.2(c). This algorithm might be executed by a centralized NoC manager and in practice needs the list of failed links to recompute the configuration bits for correct routing with the available communication resources. Failure of a switch input or output port can be viewed as the failure of the connected link. Our diagnosis strategy will therefore target this requirement and will provide an indication of whether the input and output ports of a switch are operational.
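The connectivity-bit portion of such a recomputation can be sketched as follows. The function below is a hypothetical illustration of what a centralized NoC manager might do with the list of failed links on a rows x cols mesh; the actual algorithm of [25] also recomputes the routing and deroute bits, which are omitted here.

```python
# Hypothetical sketch: recompute per-switch connectivity bits (Cn/Ce/Cw/Cs)
# from a list of failed links on a rows x cols 2D mesh.
def connectivity_bits(rows, cols, failed_links):
    # failed_links: set of frozensets {a, b} of adjacent switch ids
    bits = {}
    for s in range(rows * cols):
        r, c = divmod(s, cols)
        # A direction is connected if the neighbor exists in the mesh
        # and the corresponding link is not in the failed-link list.
        cn = r > 0 and frozenset({s, s - cols}) not in failed_links
        cs = r < rows - 1 and frozenset({s, s + cols}) not in failed_links
        cw = c > 0 and frozenset({s, s - 1}) not in failed_links
        ce = c < cols - 1 and frozenset({s, s + 1}) not in failed_links
        bits[s] = {"Cn": int(cn), "Ce": int(ce), "Cw": int(cw), "Cs": int(cs)}
    return bits

# 4x4 mesh with the link between switches 5 and 6 failed:
b = connectivity_bits(4, 4, {frozenset({5, 6})})
assert b[5]["Ce"] == 0 and b[6]["Cw"] == 0   # both endpoints see the cut
assert b[5]["Cn"] == 1                        # unrelated links untouched
```

Note how the failure of a link clears one connectivity bit at each of its two endpoints, which is exactly the per-port diagnosis granularity our strategy provides.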

2.3 Built-In Self-Test/Diagnosis Framework

The key idea of our BIST/BISD framework consists of exploiting the inherent structural redundancy of an on-chip network. We opt for testing the NoC switches in parallel, thus making the test application time independent of network size. Communication channels between switches are tested as part of the switch testing framework.

Each switch can in turn test its multiple internal instances of the same sub-blocks (crossbar muxes, communication channels, port arbiters, routing modules) concurrently. In fact, all the instances are assumed to be identical, therefore they should output the same results if there is no fault. As a consequence, the test responses from these instances are fed to a comparator tree. This makes the successive diagnosis much easier. There is a unique test pattern generator (TPG) for all the instances of the same block, thus cutting down on the number of TPGs. Although the principle is similar to what has been proposed in [15, 22, 14], there is a fundamental difference. If the TPG of a set of block instances is affected by a fault, then the comparison logic will not be able to capture this, since all instances provide the same wrong


response. To avoid this, a cooperative framework is devised, such that each switch tests the block instances of its neighboring switches.

As an example, a switch tests the incoming communication channels from its north/south/west/east neighbors (i.e., it feeds their test responses to its local comparator tree), thus checking the responses to distinct instances of the same TPG. This way, a non-null coverage of TPG faults becomes feasible. Fig.3(a) clearly illustrates the cooperative testing framework for communication channels and the need for a single TPG instance per switch to feed test patterns to all of its output ports. Faults in the TPG, in the output buffer, in the link and in the input buffer will be revealed in the downstream switch. Each switch ends up testing its input links, while its output links will be tested by their respective downstream switches.
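The benefit of comparing responses driven by distinct TPG instances can be shown with a toy model. Here a simple majority vote stands in for the comparator tree of the actual design, and the TPG and its fault model are illustrative placeholders: a faulty upstream TPG produces a stream that disagrees with the other ports and is caught, which would be impossible if one local TPG fed all compared instances.

```python
from collections import Counter

# Toy model of the cooperative check (majority vote stands in for the
# real comparator tree; pattern values and fault model are illustrative).
GOLDEN = [0b1010, 0b0101, 0b1111]

def tpg(faulty=False):
    # A faulty TPG is crudely modeled as one stuck/flipped pattern bit.
    return [(p ^ 0b0001) if faulty else p for p in GOLDEN]

def comparator_tree(responses):
    # Flag every port whose response stream differs from the majority.
    majority, _ = Counter(map(tuple, responses)).most_common(1)[0]
    return [tuple(r) != majority for r in responses]

# Responses reaching one switch from four neighbors, one with a bad TPG:
responses = [tpg(), tpg(), tpg(faulty=True), tpg()]
assert comparator_tree(responses) == [False, False, True, False]
```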

The same principle can be applied to the testing of the switch internal block instances associated with each output port: crossbar muxes and output port arbiters. Fig.3(b) shows the case of port arbiters. The main requirement for testing these instances is that the communication channels bringing test responses to the comparators in the downstream switches are working correctly. Clearly, testing these modules can only occur after the communication channels have been tested. Therefore, the procedures in Fig.3(a) and Fig.3(b) occur sequentially in time. Should one communication channel prove defective, this would not be a problem, since it would not make any sense to test and use a port arbiter when the corresponding port is not operational. Crossbar multiplexers associated with each output port are tested in the same way and are not illustrated in Fig.3 for lack of space.

Finally, the methodology can be extended to test the block instances associated with each switch input port with some modifications. This is the case of the LBDR routing block. The key idea to preserve the benefits of cooperative and concurrent testing is to carry test patterns rather than test responses over the communication channels to the neighboring switches, where the LBDR instances are stimulated and their responses compared (see Fig.3(c)). If the channel is not working properly, then testing and use of the downstream routing block is useless, since it is associated with an input port which will not be used.

A BIST engine is embedded into each switch and regulates the testing procedure. The latter is in fact split into four phases in time:
- testing of communication channels
- testing of the crossbar
- testing of the arbiters
- testing of the LBDR routing blocks
The serial execution of test phases for the switch internal components is dictated primarily by the limited flit width, constraining the amount of test patterns that can be transmitted at the same time over the communication channel, and also by the limited availability of comparators, although in our case the former effect comes into play first. As the flit width increases, we can perform more testing operations in parallel, starting from those components that have a limited number of primary inputs/outputs (e.g., the arbiter together with the LBDR).
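The serial phase sequencing above can be sketched as a small schedule builder; the phase names follow the list above, while the per-phase cycle budgets in the example are illustrative placeholders, not measured values from our experiments.

```python
# Sketch of the BIST engine's serial phase sequencing (cycle budgets
# below are illustrative placeholders, not measured results).
PHASES = ["channels", "crossbar", "arbiters", "lbdr"]

def bist_schedule(cycles_per_phase):
    # Returns (phase, start_cycle, end_cycle) tuples, executed back-to-back.
    t, schedule = 0, []
    for phase in PHASES:
        schedule.append((phase, t, t + cycles_per_phase[phase]))
        t += cycles_per_phase[phase]
    return schedule

sched = bist_schedule({"channels": 400, "crossbar": 300,
                       "arbiters": 250, "lbdr": 200})
assert sched[0] == ("channels", 0, 400)   # channels must be tested first
assert sched[-1][2] == 1150               # total serial test time
```

Because every switch runs the same schedule concurrently, the total test time is the sum of the phase budgets and does not grow with the network size.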

A fundamental difference with respect to a lot of previous work is that we do not rely on pseudo-random testing (like in [10]), which gives rise to large testing times. We use deterministic test patterns instead, which are handcrafted for the specific block under test by exploiting knowledge of the architecture behavior. This way, the reduced number of test patterns enables the serialization of test phases without making the test application time skyrocket (see section 2.4.1).

On a cycle-by-cycle basis, comparator outputs are fed to a diagnosis logic which identifies where exactly the fault occurred. In our diagnosis framework, each switch checks whether the test responses from its input ports are correct or not. As a consequence, the outcome of the diagnosis is coded in only 5 bits, one for each input port of the current switch (they would of course be doubled if a two-rail code is implemented to protect them against stuck-at faults). A '1' indicates that the port is faulty. In practice, the fault may be located in the input buffer or in the LBDR module, in the connected communication link, or even in the output buffer and associated port arbiter and crossbar multiplexer of the upstream switch. This further level of detail is not needed, since in any case the meaning is that the link is unusable, and this is enough for a global controller to recompute the reconfiguration bits for the LBDR mechanism.

In the final implementation, another 5 bits will be needed to code the diagnosis outcome because of practical implementation issues, as discussed in section 2.3.1.

Common to most current NoC testing frameworks, the underlying assumption for correct operation of our BIST/BISD infrastructure is that the reset signal can be synchronously deasserted in all switches of the network at the same time.

Figure 3: The cooperative and concurrent testing framework saving TPG instances and covering their faults. (a) Testing communication channels. (b) Testing output port arbiters. (c) Testing LBDR routing logic.

Figure 4: Practical implementation of communication channel testing.

2.3.1 Testing communication channels

Communication channels include input/output buffers and their intermediate links, as illustrated in Fig.4: all these elements are jointly tested by means of a single TPG, and the test patterns are handcrafted for them based on knowledge of their behavior.

Our approach in this direction was to expand the finite state machine (FSM) of the device under test (DUT) into all its possible states. Therefore, we have defined a sequential test pattern that drives the FSM into each of its states. In this way, we can ensure that if the FSM reaches the expected state for all the test patterns, there are no faults inside the DUT. As an example, the FSM of the buffers defines that if the Stall signal is asserted and the buffer receives a set of valid flits, the buffer has to store the flits that it receives until it becomes full. One test pattern to check this behavior would fill up the buffer by asserting the Stall signal, and would in the end check whether the output buffer correctly asserts the Full signal. The datapath is obviously much easier to test by means of only a few test patterns.
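The fill-under-stall pattern just described can be sketched in software. The following Python model is purely illustrative (the deliverable contains no code): the Fifo class and its member names are hypothetical stand-ins for the actual buffer RTL.

```python
# Illustrative sketch of the "fill under stall" deterministic test pattern.
# The Fifo model and its names are hypothetical, not the deliverable's RTL.

class Fifo:
    """Minimal behavioral model of an output buffer."""
    def __init__(self, depth):
        self.depth = depth
        self.slots = []

    def push(self, flit, stall_downstream):
        # With the downstream stall asserted, incoming valid flits must be
        # retained until the buffer fills up.
        if stall_downstream:
            if len(self.slots) < self.depth:
                self.slots.append(flit)
        else:
            self.slots.append(flit)
            self.slots.pop(0)  # flit forwarded downstream

    @property
    def full(self):
        return len(self.slots) == self.depth

def fill_under_stall_test(depth=6):
    """Drive valid flits while stall is asserted; expect 'full' at the end."""
    dut = Fifo(depth)
    for i in range(depth):
        assert not dut.full          # response checked every cycle
        dut.push(flit=i, stall_downstream=True)
    return dut.full                  # expected test response: True
```

Checking the `full` flag at every cycle, rather than only at the end, mirrors the per-cycle response checking of the BIST framework.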

From an implementation viewpoint, there are several practical issues. On one hand, we had to make the stall input of the output buffer directly controllable by the TPG to raise its stuck-at fault coverage to almost 100% (see Fig.4).

On the other hand, the stall_channel signal of the input buffer, which lies in the downstream switch, should be driven by the TPG as well. This would require an additional wire in the switch-to-switch link. A similar concern is that the stall_out signal from the output buffer should be brought to the comparators in the downstream switch, again requiring an additional wire in the link.

Figure 5: TPG for communication channels.

To avoid the extra wires, we opted for the solution in Fig.4: stall_channel is driven by the TPG of the downstream switch, while stall_out is brought to the comparator tree in the upstream switch. From the testing viewpoint nothing changes, since all channel TPGs inject the same patterns synchronously, and so do the comparators. The only difference lies in the coverage of TPG faults, which is likely to decrease a bit. In fact, those (upstream) TPG faults that can be detected only by monitoring stall_out will not be detected, since all the stall_out signals brought to the local comparators will be driven by the same TPG. Similarly, some faults in the (downstream) TPG will not be detected, since the comparators compare responses to stall_channel signals generated by the same faulty TPG: the responses will look identical. These implementation variants, needed to adapt the conceptual testing scheme to the constraints of the real implementation, will be proven in section 2.4.1 to only marginally decrease fault coverage of the TPGs, while leaving fault coverage for the communication channel obviously unaffected.

The only major implication is that the fault detection framework becomes even more collaborative: some (very few) faults in the channel and/or TPGs are now detected in the upstream switch comparators instead of the downstream ones. Therefore, another 5 diagnosis bits are needed, flagging a fault in the output port of a switch. The global controller will combine these (OR operation) with the faults detected at the input port of the downstream switch to get the complete indication of a fault across the entire channel.

2.3.2 TPG for communication channels

A test pattern can be easily generated in hardware by using a clock cycle counter and some logic to generate the values of the input signals for the DUT. In order to extend this approach to a TPG able to generate all the test patterns for a given DUT, we can include an additional counter, which indicates the current test pattern within the test sequence. Figure 5 depicts the resulting conceptual scheme for the channel TPG. The actual gate-level implementation depends on the logic synthesis tool and on the synthesis constraints. The two counters act as an FSM driving the control signals of two levels of multiplexing: the first one selects the current test pattern, while the second one selects the current clock cycle and associated input vector for the buffer.
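The two-counter scheme of Figure 5 can be mimicked in a few lines of Python. This is a behavioral sketch only; the pattern table below uses placeholder bit vectors, not the actual channel test patterns.

```python
# Behavioral model of the two-counter TPG of Figure 5 (a sketch, not the
# gate-level implementation). The pattern table contents are placeholders.

TEST_PATTERNS = [
    # each pattern is a sequence of per-cycle input vectors for the DUT
    [0b100101, 0b100010, 0b001100],
    [0b111001, 0b100011, 0b000001],
    [0b110101, 0b000111, 0b111110],
]

def tpg_stream():
    """Yield (pattern_idx, cycle_idx, vector): the outer counter selects the
    current test pattern, the inner counter selects the current clock cycle
    and associated input vector."""
    for p, pattern in enumerate(TEST_PATTERNS):      # test pattern counter
        for c, vector in enumerate(pattern):         # clock cycle counter
            yield p, c, vector

# The full test sequence, one input vector per clock cycle.
sequence = [v for _, _, v in tpg_stream()]
```

The nested iteration corresponds to the two levels of multiplexing driven by the counters in the hardware scheme.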

It is however possible to easily compact the combinational logic, because many test patterns include other test patterns. For instance, by checking the response not only at the end of a test pattern, but also somewhere in the middle, it is often possible to detect another fault. This perfectly matches the capability of our BIST framework, which performs response checking at each clock cycle. Therefore, our implementation can compact the test patterns by generating in hardware only those patterns that include a subset of the other ones, thus largely saving test time and TPG area.
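One simple form of the containment described above, assuming per-cycle response checking, is prefix subsumption: a pattern whose vector sequence is a prefix of a longer pattern is implicitly applied (and checked) while the longer one runs. A Python sketch with placeholder vectors:

```python
# Sketch of test-pattern compaction enabled by per-cycle response checking:
# a pattern that is a strict prefix of a longer pattern is subsumed by it
# and need not be generated in hardware. Vectors below are placeholders.

def compact(patterns):
    """Keep only patterns that are not a strict prefix of another pattern."""
    return [p for p in patterns
            if not any(q != p and q[:len(p)] == p for q in patterns)]

patterns = [[1, 2], [1, 2, 3, 4], [5, 6]]
# [1, 2] is checked "in the middle" of [1, 2, 3, 4], so only the longer
# pattern (and the unrelated [5, 6]) must be generated.
compacted = compact(patterns)
```

The actual compaction in the deliverable may use a more general notion of inclusion than prefixes; this sketch only illustrates the principle.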


2.3.3 Testing Other Internal Switch Modules

A similar process is followed to generate deterministic test patterns for the port arbiters, the LBDR modules and the crossbar. The implementation of their TPGs is identical as well, and so are the optimization techniques.

Again, the most relevant practical implementation issue concerns the communication of test patterns or responses across the switch-to-switch links for the crossbar and the LBDR module. The crossbar outputs 34 bits in response to a test vector: 32 data bits, 1 valid bit and 1 stall bit. The communication channel can only carry 32 bits (the valid bit of the channel needs to be permanently set to 1 during test vector transmission, while the stall signal travels in the opposite direction). The two remaining crossbar signals (valid and stall) which do not fit into the link can either be transmitted by means of additional lines used only during testing, or alternatively be checked by local comparators, similarly to what has been done for the communication channel. We took the latter approach, and the results in section 2.4.1 again confirm the marginal coverage reduction on TPG faults. Fault coverage of the crossbar is not affected at all by this choice.

Unlike for the other modules, test vectors for the LBDR modules should be transmitted across the link, and they take 31 lines (the primary inputs of the LBDR module). Thus, they perfectly fit the current flit width, provided the number of network destinations does not exceed 64. From there on, the test vector width starts growing logarithmically with the number of destinations, and additional lines may be required on the link.

In contrast, the use of a larger flit width in the network (e.g., 64 bits) would automatically solve the problem. In that case, the test patterns of the LBDR block and the test responses of the arbiter could even be communicated at the same time over the link. Also, since the LBDR module and the arbiters have only a few outputs, their response checking could be performed at the same time on the available tree of comparators, thus cutting down on the test application time (see section 2.4.1).

2.3.4 Fault detection and diagnosis

The core of the diagnosis unit is given by comparators, which can be implemented in two different ways:
- using a level of XORs and an OR gate to provide a single-output encoding of the equality test;
- using a two-rail checker (TRC), with the second word negated.
We opted for the TRC approach, which achieves the self-testing and fault-secure properties [26], although it leads to a more complex circuit.

In the diagnosis unit we use 10 different comparators to compare data from all the possible pairs of switch input ports. A smaller number of comparators could be used; unless time multiplexing is exploited, this would trade cost for diagnosis capability. The maximum number of usable comparators also depends on the number of switch I/O ports. In what follows, we will focus on the internal switches of a 2D mesh for the sake of simplicity (featuring 5 I/O ports, including the local connection to the network interface); however, all irregular topologies supported by LBDR and making use of switches with at least 3 I/O ports are suitable for our methodology. Obviously, the lower the number of ports, the lower the diagnosis capability.

If we denote two faults in different ports under comparison as equivalent if they produce the same output sequence in response to the same input stimuli, then our comparator and diagnosis logic is able to:
- diagnose the correct position of 1 or 2 faulty channels affected by equivalent or non-equivalent faults;
- diagnose the correct position of 3 faulty channels affected by non-equivalent faults;¹

- detect the presence of 4 and 5 faulty channels. Anyway, since a 5x5 switch affected by 4 or 5 faults has to be discarded, we do not distinguish between these two scenarios.

One might argue that when a communication channel fails, the following testing phases have fewer inputs available and the diagnosis capability reduces. In practice, this effect plays only a minor role: a fault on a communication channel means that (say) the arbiter of that channel should also be considered faulty (unusable). So, the diagnosis capability reduces, but the number of input ports to be checked reduces as well.

¹ The probability that more than two faulty channels produce the same output sequence in response to the same input stimuli is neglected here.
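The diagnosis capabilities listed above can be illustrated with a small model. The decoding below (the largest set of mutually agreeing ports is assumed fault-free) is one possible interpretation consistent with the text, not the deliverable's actual gate-level logic:

```python
# Sketch of a diagnosis decoding over the 10 pairwise comparators of a 5x5
# switch. This is an illustrative interpretation, not the actual circuit.
from itertools import combinations

N_PORTS = 5  # N, S, W, E, local

def diagnose(mismatch):
    """mismatch: set of port pairs (i, j), i < j, whose comparator flags a
    difference. Ports outside the largest mutually agreeing group are
    reported faulty (one bit per input port, as in the text)."""
    def agree(a, b):
        return (min(a, b), max(a, b)) not in mismatch
    for size in range(N_PORTS, 0, -1):
        for group in combinations(range(N_PORTS), size):
            if all(agree(a, b) for a, b in combinations(group, 2)):
                return [0 if p in group else 1 for p in range(N_PORTS)]

# Port 2 faulty with a non-equivalent fault: every comparator touching it fires.
faulty = {(0, 2), (1, 2), (2, 3), (2, 4)}
```

With these mismatches, `diagnose(faulty)` flags only port 2. The same decoding also localizes two equivalent faults: the two faulty ports agree with each other but disagree with the larger group of healthy ports.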


Figure 6: BIST-enhanced switch architecture.

When a switch features only three I/O ports, the detection and diagnosis capabilities change as follows. Single stuck-at faults can be diagnosed, while double faults can be detected, provided they are not equivalent. If they are equivalent, then diagnosis fails. However, when two faults are detected in two ports out of three, the switch should be discarded anyway.

As regards the possible presence of faulty comparators, let us first note that any input vector (to the diagnosis logic) featuring fewer than four ones corresponds to faults in fewer than four comparators (we are neglecting the case where all 5 channels are faulty and 4 of them have equivalent faults, which is very unlikely). In case the number of faulty comparators is larger than 3, some configuration exists which may produce a wrong diagnosis. Let us note, however, that it is sufficient to have a single test vector (not a test sequence) featuring fewer than four ones to immediately recognize the presence of faulty comparators, because no combination of faulty channels may produce such a response.
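This argument can be made concrete: with 5 ports and 10 pairwise comparators, a single faulty channel already fires the 4 comparators it feeds, so no set of faulty channels can yield a non-zero comparator vector with fewer than four ones. A sketch of the resulting check:

```python
# Sketch of the faulty-comparator recognition argument in the text.
# bit ordering of the 10-comparator vector is arbitrary here.

def suspicious_comparator_vector(bits):
    """bits: the 10 comparator outputs (1 = mismatch). A vector with 1 to 3
    ones cannot be produced by any combination of faulty channels (each
    faulty channel fires at least its 4 comparators, equivalent-fault corner
    cases neglected as in the text), so it exposes faults in the
    comparators themselves."""
    ones = sum(bits)
    return 0 < ones < 4

single_fault_vector = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # one faulty channel
```

A vector of all zeros (no fault) or with four or more ones (a plausible faulty-channel configuration) is not flagged; only the impossible 1-to-3-ones responses are.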

2.3.5 BIST-enhanced switch architecture

The switch architecture enriched with the BIST infrastructure is illustrated in Fig.6. Only one section is reported. The figure is necessarily at a high abstraction level, and the signal-level connection details previously illustrated in sections 2.3.1 and 2.3.3 are purposely omitted.

A test wrapper consisting of multiplexers can be clearly seen, which enables test pattern injection by the TPGs into the modules they test. At the output of the input buffer, test patterns are directly fed to the LBDR module, since they are carried by the communication channel as normal network traffic. A multiplexer in front of each output buffer selects between the switch datapath, the test patterns from the LBDR TPG (feeding the LBDR module of the downstream switch), the channel TPG (directly feeding the channel) and the arbiter test responses (checked in the downstream switch). A BIST engine drives the 4 phases of the testing procedure by acting upon the control signals of the test wrapper.

During the first three phases (communication channel, crossbar and arbiter testing), the outputs of the input buffers are selected to feed the comparator tree, while in the last phase (LBDR testing), all LBDR outputs are selected. Test response checking and diagnosis are performed at each clock cycle, and result in the setting of 10 bits, indicating whether each input/output port is faulty or not.
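As a schematic summary, the phase sequencing of the BIST engine and the accumulation of the 10 diagnosis bits could be modeled as below. The phase names follow the text; the wrapper-control encodings are hypothetical.

```python
# Schematic sketch of the BIST engine driving the four test phases through
# the test-wrapper multiplexers. Mux/source labels are hypothetical.

PHASES = [
    # (phase, output-buffer mux selection, comparator-tree source)
    ("channel",  "channel TPG",            "input buffers"),
    ("crossbar", "switch datapath",        "input buffers"),
    ("arbiter",  "arbiter test responses", "input buffers"),
    ("lbdr",     "LBDR TPG",               "LBDR outputs"),
]

def run_bist(apply_phase):
    """Run the four phases in sequence; apply_phase is a callback that
    configures the wrapper for a phase and returns that phase's 10
    diagnosis bits, OR-accumulated into the final diagnosis outcome."""
    diagnosis = [0] * 10
    for phase, mux_sel, cmp_src in PHASES:
        bits = apply_phase(phase, mux_sel, cmp_src)
        diagnosis = [a | b for a, b in zip(diagnosis, bits)]
    return diagnosis
```

A port is flagged faulty in the final outcome as soon as any phase flags it, matching the serialized phase execution described above.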

2.4 Experimental results

We performed logic synthesis of a 5x5 switch on the 40nm Infineon low-power technology library. The baseline switch architecture of Fig.1 is compared with its BIST-enhanced counterpart. Synthesizing for maximum performance gives approximately the same maximum operating speed of 600 MHz for both architectures, thus proving that our BIST-enabled switch is capable of at-speed testing.

Figure 7: Area overhead for BIST implementation as a function of target speed.

Figure 8: Coverage of TPG faults (channel TPG ideal/real, allocator TPG, LBDR TPG, crossbar TPG ideal/real).

Fig.7 shows the area overhead for BIST implementation as a function of the target speed. Area overhead is 25.27%, and peaks at 37.1% when maximum performance is required. In this latter case, the multiplexers on the critical path are primary targets for delay optimization in exchange for more area. It should be pointed out that this overhead is tightly technology-library dependent.

When considering the BIST infrastructure in isolation (at 600 MHz), most of the overhead comes from the on-chip generation of test patterns (almost 31%) and from the multiplexers of the test wrapper (44%). Interestingly, although the arbiters and the LBDR require fewer test vectors than the communication channel, their TPGs are far more complex due to the higher irregularity of their test patterns.

2.4.1 Fault Coverage

Tab.1 reports the total number of deterministic test patterns (and test vectors) generated for each tested module, and the associated coverage. This was derived by means of an in-house gate-level fault simulation framework: (one or more) faults are applied to any or selected gate inputs, then our testing procedure is run on the affected netlist and the diagnosis outcome is compared with the expected one.

It can be seen that in all cases the coverage for single stuck-at faults closely tracks 100%. The number of test vectors provides the test application time (in clock cycles). A network with a flit width of 32 bits, as assumed so far, would therefore take 1104 clock cycles for testing, regardless of the network size. If we assume 64-bit flits, then LBDR testing occurs in parallel with arbiter testing and the total test time reduces to 864 cycles.

Switch sub-block   Test patterns   Test vectors   Coverage
Comm. channel      58              464            99.4%
Arbiter            82              328            97.1%
Crossbar           72              72             99.8%
LBDR               240             240            98.7%

Table 1: Coverage for single stuck-at faults.

Method   Test cycles                  Coverage
Ours     864 - 1104                   99.3%
[20]     3.88 x 10^2 - 2.89 x 10^3    97.79%
[21]     4.05 x 10^5                  95.20%
[13]     2.74 x 10^3                  99.89%
[14]     9.45 x 10^3 - 3.33 x 10^4    98.93%
[22]     5 x 10^4 - 1.24 x 10^8       N.A.
[9]      320                          99.33%
[10]     200 x 10^3                   full (no exact numbers)

Table 2: Test application time and coverage of different testing methods.

Multiplicity of fault injection   2       3       4       5
Coverage                          99.2%   96.4%   96.6%   96.6%

Table 3: Coverage for multiple random stuck-at faults.

These numbers compare favorably with previous work, as Tab.2 shows. Only [20] and [9] in some cases do better. However, [20] does not test the control path, while [9] reports 320 cycles for a 3x3 mesh (made of a simplified switch architecture) which, however, grow linearly with network size. Also, this latter approach makes additional use of BIST logic for the control path, not accounted for in the statistics.

We feel that area overhead is hardly comparable with previous work since, whenever numbers are available, the features of the testing frameworks are very different (e.g., control path not tested [20], test patterns generated externally [21, 14], diagnosis missing [21, 13, 14, 22], lack of similar test time scalability [5, 9], NoC architecture with overly costly links [13]). Moreover, the impact of synthesis constraints is never discussed.

Fig.8 reports the coverage of TPG faults. While single stuck-at faults in the allocator and channel TPGs feature a coverage of roughly 95%, worse results are obtained for the LBDR and especially for the crossbar TPGs. We verified that their lower coverage is a direct consequence of the low number of test patterns they generate. The designer can then choose between increasing the crossbar TPG area and having it generate more patterns, or dedicating a separate test phase to the TPGs. Also, when comparing real vs. ideal coverage of the channel and crossbar TPGs, it is possible to assess the marginal reduction of TPG fault coverage as an effect of the local (instead of remote) check of some signals of these modules in the switch they belong to (see sections 2.3.1 and 2.3.3).

Since our BIST infrastructure targets multiple stuck-at faults from the ground up, we characterized fault coverage for multiple faults as well. We injected multiple faults randomly in the gate-level netlist of the switch and checked the diagnosis response. Fault multiplicity was 2, 3, 4 and 5, and fault injections for a given multiplicity were repeated 1000 times, as in [10]. As Tab.3 shows, the proposed BIST framework provides higher than 96% coverage in every scenario. Interestingly, the coverage saturates with 4 and 5 faults, since the probability of injecting faults into a module already affected by a fault becomes high.

2.5 Conclusions

This section presented a scalable built-in self-test and self-diagnosis infrastructure for NoCs taking full advantage of their structural redundancy through a cooperative testing and diagnosis framework. Table-less logic-based distributed routing is the foundation of our approach, and enables network reconfiguration with only 10 diagnosis bits per switch. We prove the achievement of standard fault coverage targets at an affordable area overhead. However, we do more than that: we quantify coverage for multiple faults and target the coverage of faults affecting the TPGs as well. The high coverage of this latter infrastructure is a key step forward in proving the efficiency of deterministic test patterns handcrafted by exploiting knowledge of the architecture behavior. The work of this section will thus be taken as a reference by the next sections, where BIST approaches will rely on conventional pseudo-random testing and scan-based approaches.


3 Designing a Built-In Scan Chain-Based Testing Framework for Network-on-Chip Switches

3.1 Introduction

Testing of integrated circuits is traditionally performed by means of scan chains. Scan chains are still widely adopted in industrial environments, although several alternative testing techniques have been proposed. Indeed, the success of the scan chain is justified by a few key reasons. First of all, mainstream synthesis tools support semi-automatic insertion of scan chains into the logic under test. In addition, scan chains achieve extremely high stuck-at fault coverage of both the data and the control path. However, it is shown in [14] that traditional full-scan and boundary-scan strategies like [18, 21, 38, 17] incur a hardly affordable area overhead. [14] also proposes a partial scan technique in combination with an IEEE 1500-compliant test wrapper. Area overhead is greatly reduced, but test application times amount to tens of thousands of clock cycles and test pattern generation time does not scale. Thus, the advantages of scan chain technology in terms of coverage and design time are offset by severe area overhead and high latency. Furthermore, scan chains are typically associated with test patterns injected by off-chip sources. The injection of external test patterns becomes infeasible in modern integrated circuits due to the limited pin budget and the difficulty of reaching the widespread logic of the NoC. Finally, lifetime testing is not supported by this solution because of the need for an industrial test environment.

Aware of the relevance of the scan chain-based mechanism, we enrich our design space exploration of testing strategies by implementing this technique in the NoC switch. We do not simply evaluate the conventional implementation, but custom-tailor the scan chain for the target NoC setting in order to alleviate its intrinsic limitations. The next subsections present the switch architecture enhanced with the customized scan chain infrastructure. Finally, the performance of the proposed solution is compared in terms of area, coverage and latency with a baseline (unoptimized) scan chain strategy and with the cooperative framework presented in Section 2.

3.2 The scan chain tool-flow

In order to implement a scan chain in our design under test, we exploited the Synopsys tool enhanced with some dedicated commands (set_dft_signal, set_scan_configuration, set_scan_path, etc.). Thus, the post-synthesis design was automatically augmented with scan-enabled sequential cells and scan-in/scan-out ports for writing/reading the scan chain control bits. Finally, the test patterns were generated by means of the Tetramax tool. Tetramax requires as input the post-synthesis netlist together with the list of the primary inputs/outputs and the scan-in/scan-out ports of the design.

The injection of a test pattern in a scan chain-enabled design follows a sequence of 5 phases (see Figure 9). During the first phase (load), n test bits are driven through the scan-in port. Since a single bit per clock cycle is injected into each chain, the length of this phase is equal to n clock cycles, where n is the number of sequential cells of the chain. During the next two phases (force_all_pis and measure_all_pos), the primary inputs are stimulated with the test data pattern and the primary outputs' response is read out. Both these phases last 1 clock cycle. The reset ports are tested during the fourth phase (pulse), and finally the scan-out bits are read in the last phase (unload). The unload phase lasts n clock cycles. The load and unload phases therefore dominate the latency required to inject a test pattern. Intuitively, the number of sequential cells (n) per chain has a relevant impact on the final latency of the test.

It is possible to model the total latency of a test based on scan chains by means of the following formula:

TotLatency = (TotCells / TotChains + 4) × NumPatterns + TotCells / TotChains    (1)

The formula is parametrized by the number of test patterns (NumPatterns), the total number of design cells (TotCells) and the number of scan chains (TotChains). Note that it takes into account the overlapping of the unload phase with the load phase of the next test pattern.

Figure 9: Latency breakdown of a scan chain-based test.
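Equation 1 can be checked against the unconstrained baseline reported later in Section 3.3 (48 chains of 40 cells each, 147 test patterns, a latency the text rounds to about 6500 cycles). A small Python rendering:

```python
# Equation 1 as executable code, evaluated on the baseline figures from
# Section 3.3 (48 chains x 40 cells, 147 patterns).

def scan_test_latency(tot_cells, tot_chains, num_patterns):
    """Equation 1: each pattern shifts tot_cells/tot_chains bits in (load)
    plus 4 cycles for the remaining per-pattern phases; unload overlaps with
    the next load, leaving a single trailing unload term."""
    shift = tot_cells // tot_chains
    return (shift + 4) * num_patterns + shift

baseline = scan_test_latency(tot_cells=48 * 40, tot_chains=48,
                             num_patterns=147)   # 6508, i.e. ~6500 cycles
```

The computed 6508 cycles matches the roughly 6500-cycle latency quoted for the baseline implementation.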

Similarly, the number of bits stored by the test pattern generator can be modeled by the following formula:

NumBit = (PIs + ScanChainPort + TotCells) × NumPatterns    (2)

In this case, PIs represents the number of primary inputs of the design and ScanChainPort takesinto account the additional inputs required by the scan chain test framework.

The number of bits of Equation 2 gives the size of the test pattern generator, and thus its area footprint. Modeling the latency and the number of bits of the test generator represents a key step for the further optimizations presented in the following sections.
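Equation 2 can be rendered the same way. The PIs and scan-port counts below are illustrative placeholders (the deliverable does not list them), chosen only to show that the model lands in the same ballpark as the ~310k bits reported for the baseline in Section 3.3:

```python
# Equation 2 as executable code. The pis and scan_chain_ports values are
# placeholder assumptions; only 48 chains x 40 cells and 147 patterns are
# baseline figures from Section 3.3.

def tpg_storage_bits(pis, scan_chain_ports, tot_cells, num_patterns):
    """Equation 2: each stored pattern carries one bit per primary input,
    per scan port and per scan cell to be loaded."""
    return (pis + scan_chain_ports + tot_cells) * num_patterns

bits = tpg_storage_bits(pis=150, scan_chain_ports=48, tot_cells=48 * 40,
                        num_patterns=147)
# ~311k bits with these placeholder counts, in line with the ~310k reported.
```

Since every baseline chain needs its own pattern set, the TotCells term dominates, which is precisely the storage cost the customized solution attacks.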

3.3 The baseline implementation

As in the cooperative testing framework of section 2, we use the xpipesLite switch architecture [3] to implement a baseline scan chain-enabled testing strategy in a realistic NoC setting. When implementing a scan chain framework, it is possible to set a few Synopsys parameters to specify the number of chains in the design and the number of cells in each chain, and to group cells into a single chain.

We did not constrain the synthesis tool during the first baseline implementation. As a result, the tool created 48 scan chains for the whole switch design. Each scan chain was composed of 40 sequential cells and required 147 test patterns. The test achieved a high stuck-at fault coverage (99.95%) and a reasonable latency of 6500 cycles. However, each scan chain differs from the others, so a dedicated set of test patterns was required for every chain. As a consequence, 310k bits had to be stored in the test pattern generator, and the resulting area footprint was too severe when enhancing the scan chain framework with a built-in self-testing approach. Therefore, alternative solutions are required in order to alleviate the area overhead.

We then constrained Synopsys to modify the number of scan chains in the switch. Interestingly, the synthesis tool balances the number of sequential cells within each scan chain. As expected, the latency decreased linearly with the increase of the number of scan chains, following Equation 1. However, the area overhead stayed approximately constant, since the same number of test patterns was required despite the modification of the scan chain length (the remaining parameters of Equation 2 did not change either).

In addition, the whole switch needs to be discarded when a single fault is detected with the above implementations. Although the diagnosis logic can associate a faulty cell with a single faulty scan chain, the position of the cell in the switch is still unknown. In fact, the cells of a switch port are not grouped in a single scan chain, and every scan chain collects cells from multiple switch ports. Therefore, a faulty scan chain does not carry the information required to enable a graceful degradation of the switch as in section 2.


Figure 10: Practical implementation of the proposed scan chain-based test.

3.4 Customizations for the NoC setting

A generic implementation of a scan chain-based testing strategy proved area hungry and resulted in poor fault tolerance. Customizations for the NoC setting were therefore envisioned to overcome these concerns.

Following the intuition of Section 2, the key idea of our BIST framework based on scan chains consists of exploiting the inherent structural redundancy of an on-chip switch. We collected the cells of each input and output port in distinct groups. In particular, a port-arbiter, a crossbar-multiplexer and an output-buffer group were instantiated for each output port, while a routing-module and an input-buffer group were instantiated for each input port. We then performed a dedicated synthesis for each group before gluing all the synthesis results into a single design. In this way, every scan chain is associated with a cell group, and the failure of a chain can be viewed as the failure of the associated switch input or output port. Our diagnosis strategy will therefore provide an indication of whether the input and output ports of a switch are operational or not. Moreover, all the scan chain instances are assumed to be identical, therefore there is a unique test pattern generator (TPG) for all the instances of the same block, thus cutting down on the total number of bits stored in the TPGs. Finally, all the instances of the same scan chain should output the same results if there is no fault. As a consequence, the test responses from these instances are fed to a comparator tree similar to the tree of Section 2. As already proved, this makes the successive diagnosis much easier. The core of the diagnosis unit is given by comparators, which are implemented using a two-rail checker (TRC) achieving the self-testing and fault-secure properties. When we consider a 5x5 switch, we use 20 different comparators to compare data from all the possible pairs of switch input ports and switch output ports. Thus, the diagnosis logic diagnoses the correct position of 3 faulty input ports and 3 faulty output ports affected by non-equivalent faults. Finally, it detects the presence of 7, 8, 9 and 10 faults located in different input or output ports. Anyway, since a 5x5 switch affected by more than 6 faults has to be discarded, we do not distinguish between these latter scenarios.

Although the principle is similar to what has been proposed in Section 2, this solution presents a fundamental drawback. If the TPG of a set of block instances is affected by a fault, the comparison logic will not be able to capture it, since all instances provide the same wrong response. Furthermore, the communication channels between switches are not tested as part of the switch testing framework.

3.5 Experimental results

The switch enhanced with the proposed scan chain was synthesized by means of a 40nm Infineon technology library. The test patterns were generated by Tetramax. A test pattern can be generated in hardware by using a clock cycle counter and some logic to generate the values of the input signals for the DUT. We then exploited the Tetramax test patterns to implement a test pattern generator built into the switch. A test wrapper consisting of multiplexers enables test pattern injection by the TPGs into the scan chains they test.

Figure 11: Maximum number of test patterns.

Figure 12: Latency of the scan chain-based test.

3.5.1 Scan chain results: custom vs. baseline implementation

Both the baseline and the proposed solution perform well in terms of coverage: they achieve 99.9% single stuck-at fault coverage. However, the proposed scan chain customization brought an interesting reduction in the number of test patterns. In fact, Tetramax used respectively 103 and 65 test patterns to test the scan chains of the output and the input port. In contrast, 147 test patterns were required to test the baseline scan chains (see Figure 11). Interestingly, the pseudo-random clustering of the cells in the baseline solution did not allow Tetramax to perform the test pattern optimization achieved in the custom-tailored solution.

The reduction of the test patterns enables shorter load and unload phases. Thus, the latency of the proposed solution decreases by 20.2% with respect to the baseline counterpart (see Figure 12). Furthermore, the area overhead benefits from both the test pattern reduction and the proposed optimizations (i.e., single sets of test patterns test all the instances of the same port). As a result, the area overhead of the proposed solution was cut down by 82% with respect to the baseline one (see Figure 13). Although this result was achieved through an aggressive reduction of the built-in test pattern generators' area, these still introduce the highest area overhead (as reported in Figure 14).


Figure 13: Total number of bits stored by the test pattern generators.

Figure 14: Area overhead breakdown of the customized scan chain-enabled testing strategy.

3.5.2 Comparison with the deterministic test pattern-based framework

In this section the proposed scan chain-based solution is compared with the deterministic test pattern-based solution of Section 2 in terms of coverage, latency and area overhead. As shown in Figure 15, the deterministic test pattern-based solution outperforms the scan chain-based framework in terms of area. They introduce, respectively, an area overhead lower than 30% and higher than 130% with respect to a switch without BIST/BISD capability. Although the above-mentioned customizations alleviate the scan chain test pattern generators' area overhead, the area footprint of the latter is still similar to the whole switch area, confirming that the scan chain technique is not well suited for built-in approaches.

Furthermore, the latency of the scan chain solution is five times higher than that of the deterministic counterpart, underlining the intrinsic slowness of the load and unload phases of a scan chain-based test (see Figure 16). Finally, both BIST frameworks achieve a single stuck-at fault coverage higher than 99%, as reported in Figure 17.


Figure 15: Area overhead breakdown of the scan chain-based and deterministic test pattern-based solutions.

Figure 16: Latency of the scan chain-based and deterministic test pattern-based solutions.

Figure 17: Coverage of the scan chain-based and deterministic test pattern-based solutions.


3.6 Conclusions

We demonstrated that knowledge of the design under test can be exploited to heavily reduce the latency and the area overhead of the scan chain testing framework. The result is achieved by exploiting the intrinsic redundancy of the switch and associating dedicated scan chains with its input/output ports. Despite the effectiveness of the proposed optimizations, a built-in self-test still cannot be envisioned, due to the severe area overhead of the test pattern generators required to feed the scan chains. In fact, an area overhead of 137% with respect to a switch without testing capabilities is to be expected when the test pattern generators are built into the switch.

The approach from Section 2 based on deterministic test patterns outperforms the scan chain-based solution both in terms of latency and area overhead. Although the effort required by a hardware designer to implement scan chains by means of automatic tools is low, the resulting design does not meet the area and latency requirements of a modern integrated circuit, especially under constrained resource budgets. Even though the scan chain framework achieves a 99.9% stuck-at fault coverage (slightly higher than the deterministic test pattern-based framework), it does not represent a valid solution for a highly-constrained and performance-critical Network-on-Chip.


4 Optimizing Built-In Pseudo-Random Self-Testing for Network-on-Chip Switches

4.1 Motivation

Most BIST architectures use pseudo-random test pattern generators. However, whenever this technique has been applied to on-chip interconnection networks, overly large testing latencies have been reported. On the other hand, the alternative approaches proposed in the previous sections either suffer from large area penalties (like scan-based testing) or require a high design effort for the use of deterministic test patterns. This section presents the optimization of a built-in self-testing framework based on pseudo-random test patterns for the microarchitecture of network-on-chip switches. As a result, we will demonstrate that, through proper customizations for an on-chip setting, fault coverage and testing latency approach those achievable with deterministic test patterns, while materializing relevant area savings and enhanced flexibility.

This section aims at a low-area-footprint and low-latency testing framework for NoCs by optimizing built-in pseudo-random self-testing for the microarchitecture of a NoC. The choice of pseudo-random testing potentially reduces the area overhead, although a test time concern arises. The latter was tackled by reusing test responses of switch sub-blocks as test patterns for cascaded blocks, combined with specific test pattern optimizations for selected blocks to preserve coverage.

The efficiency of the developed testing framework is demonstrated by comparison with the cooperative testing strategy of Section 2, which relies on deterministic test patterns and strictly aims for low testing latencies.

4.2 Target Architecture

The baseline switch architecture is the same as in Section 2 and is reported in Fig.1. It relies on a stall/go flow control protocol and implements distributed routing by means of route selection logic located at each input port. Failure of a switch input or output port and their associated switch-internal sub-blocks can be viewed as the failure of the connected link. The diagnosis strategy proposed in this section will therefore target this requirement and will provide an indication of whether the input and output ports of a switch are operational or not. As in the cooperative testing strategy of Section 2, each switch can in turn test its several internal instances of the same sub-blocks (crossbar muxes, communication channels, port arbiters, routing modules) concurrently by means of pseudo-random patterns.

The testing framework of Section 2, which achieves one of the lowest testing latencies reported in the open literature for similar single stuck-at fault coverages, will be used for the sake of comparison as a lower bound for testing latency.

4.3 Optimized Pseudo-Random Testing Framework

Deterministic test patterns come with their own drawbacks, especially the large effort required to define the test patterns and the poor adaptation to technology library and/or node migrations.

For these reasons, we extended our analysis to a framework relying on pseudo-random testing. However, naive application of such a testing strategy to NoCs like [10] results in unacceptable testing latencies of hundreds of thousands of cycles. In fact, a baseline pseudo-random testing approach usually tests the modules under test in sequence, so as to exploit a single test pattern generator and thus reduce the area overhead of the framework. However, the total testing latency represents the potential killer of this approach: the testing latency does not scale as the number of modules to test increases. Furthermore, random test patterns do not allow efficient testing of the control path. Indeed, some states of the FSMs typically have a low probability of being reached through pseudo-random input patterns.

We propose NoC test optimization based on the following three ideas:
- To increase the fault coverage within reasonable times, we exploit the knowledge of the architecture under test to raise the probability of driving the FSMs to test-relevant states. We thus foster a hybrid approach (combining deterministic and pseudo-random patterns) for these machines.
- To reduce the area overhead, we reuse the test responses of a switch sub-block as test patterns for the cascaded ones.


(a) Testing communication channel.

(b) Testing crossbar.

(c) Testing LBDR routing logic.

(d) BIST-enhanced switch architecture.

Figure 18: Optimization steps of the pseudo-random testing framework.


- To minimize the latency, we test the redundant replicas of the same modules in parallel, like the solution in Section 2.

We progressively optimize a baseline pseudo-random testing framework for the NoC in incremental steps. First of all, we decompose the network (i.e., the switch and the channel) into its building blocks: the arbiters, the crossbar multiplexers, the input buffer, the output buffer, the LBDR and the link.

Then, we exploit a Linear-Feedback Shift Register (LFSR) for test pattern generation and a Multiple-Input Signature Register (MISR) for compression of test responses.
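The two building blocks can be sketched behaviorally as follows; the 8-bit width, the tap positions and the toy module under test are purely illustrative and do not correspond to the synthesized design:

```python
# Minimal behavioral sketch of an LFSR and a MISR (illustrative width/taps).

def lfsr_step(state, taps, width):
    """One shift of a Fibonacci LFSR; feedback is the XOR of the tap bits."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << width) - 1)

def misr_step(state, response, taps, width):
    """A MISR is an LFSR whose next state is also XORed with the parallel response."""
    return lfsr_step(state, taps, width) ^ response

WIDTH, TAPS = 8, (7, 5, 4, 3)   # taps of a maximal-length 8-bit polynomial

# Sanity check: a maximal-length LFSR visits all 255 non-zero states.
seen, s = set(), 0x01
while s not in seen:
    seen.add(s)
    s = lfsr_step(s, TAPS, WIDTH)
period = len(seen)

# Drive a toy combinational "module under test" for 250 cycles (the test
# window used in this section) and compress its responses into a signature.
lfsr, golden = 0x01, 0x00
for _ in range(250):
    lfsr = lfsr_step(lfsr, TAPS, WIDTH)
    golden = misr_step(golden, lfsr ^ 0xFF, TAPS, WIDTH)

# Same run with a stuck-at-0 fault on response bit 3: the signature diverges
# except with a small aliasing probability (about 2**-8 for an 8-bit MISR).
lfsr, faulty = 0x01, 0x00
for _ in range(250):
    lfsr = lfsr_step(lfsr, TAPS, WIDTH)
    faulty = misr_step(faulty, (lfsr ^ 0xFF) & ~0x08, TAPS, WIDTH)
```

At the end of the test window, the accumulated signature is compared against the fault-free (golden) one to flag the port as faulty or operational.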

Finally, we progressively cascade switch sub-blocks to the LFSR and monitor the variations of testing latency and fault coverage. For cascaded sub-blocks, the test responses of the upstream block are the test patterns for the downstream one. Whenever possible, we try to compensate for coverage degradation by means of ad-hoc optimizations, thus cascading as many blocks as possible and minimizing the number of test phases. This process is detailed hereafter.

4.3.1 Testing communication channels

As a first step, a stand-alone output buffer was tested. A 34-bit LFSR was required to drive the flit, the valid and the stall signals. A high fault coverage (99.5%) was achieved in 250 clock cycles. The coverage for single stuck-at faults was derived by means of the Tetramax tool.

As a next step, the full communication channel was tested: the link and the input buffer of the downstream switch were cascaded to the output buffer, as reported in Fig.18(a). In this case, the testing coverage saturated at 91.2% after 250 clock cycles. This coverage degradation pointed to the need to exploit knowledge of the buffer implementation to increase test efficiency.

As shown in Fig.18(a), we increased the controllability of the channel by driving the stall signals. From an implementation viewpoint, there are several practical issues. In fact, the stall_channel signal of the input buffer, which lies in the downstream switch, should be driven by the LFSR as well. This would require an additional wire in the switch-to-switch link. A similar concern is that the stall_out signal from the output buffer should be brought to the MISR in the downstream switch, again requiring an additional wire in the link. To avoid the extra wires, we opted for the solution in Fig.18(a): stall_channel is driven by the LFSR of the downstream switch, while stall_out is not observed directly by a MISR but is still connected to the arbiter, affecting its behavior. From the testing viewpoint nothing changes, since all the pseudo-random LFSRs inject the same patterns synchronously.

As a result, we obtained a fault coverage of 99.7% in 250 clock cycles.

4.3.2 Testing multiplexers of the crossbar

A similar process was followed to include the multiplexers of the crossbar in the testing framework. The channel modules were cascaded to the multiplexer, as shown in Fig.18(b). The multiplexer was directly fed by the LFSR, while the channel was crossed by the test responses of the latter module.

This incremental testing step required interesting optimizations to limit the LFSR area overhead and to preserve a high fault coverage. First of all, the multiplexer of the crossbar presents 165 inputs (33 inputs per port) when considering a 5x5 switch. Thus, a baseline testing environment could incur a relevant area overhead due to a 165-bit LFSR. As a second concern, few control signal configurations (see select in Fig.18(b)) allow the multiplexer input signals to cross the multiplexer logic. As a consequence, we experienced a high degradation of the fault coverage of the cascaded channel due to the low number of testing packets forwarded by the multiplexer.

In order to tackle the LFSR area overhead, we fed each input port of the multiplexer with the same 34 pseudo-random bits, exploiting a data shift of 6·N bits for every input port, where the parameter N corresponds to the multiplexer port ID. As a result, we preserved the fault coverage, and the extension of the LFSR from 34 bits to 165 bits was no longer required.
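The pattern-reuse trick can be sketched as below. We read the "data shift" as a rotation so that no pattern bits are lost; the deliverable does not specify whether the shift wraps around, so this is an assumption, and the example LFSR word is arbitrary:

```python
# Sketch of the 6*N-bit shift reuse (rotation assumed): each of the 5 data
# input ports of the crossbar multiplexer sees the same 34-bit LFSR word,
# rotated by 6*N bits, so a single 34-bit LFSR serves all 165 data inputs.

LFSR_WIDTH = 34
MASK = (1 << LFSR_WIDTH) - 1

def port_pattern(lfsr_word, port_id):
    """Return the 34-bit pattern seen by multiplexer input port `port_id` (N)."""
    shift = (6 * port_id) % LFSR_WIDTH
    if shift == 0:
        return lfsr_word & MASK
    return ((lfsr_word << shift) | (lfsr_word >> (LFSR_WIDTH - shift))) & MASK

word = 0x3                      # arbitrary example of one LFSR output word
patterns = [port_pattern(word, n) for n in range(5)]
```

With distinct rotations per port, the five replicated multiplexer inputs receive different stimuli on every cycle while sharing one generator.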

Concerning the issue related to the multiplexer control signals, we designed a ring counter driving them to data-transparent configurations (10..0, 01..0, 00..1). Furthermore, the randomness of the configurations is preserved by exploiting an LFSR bit to drive the enable signal of the ring counter. Thus, the counter moves from one configuration to the next when the LFSR pseudo-random bit is set to 1.
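This select-signal generator can be modeled as follows; the enable-bit sequence below stands in for successive LFSR bits and is invented for illustration:

```python
# Sketch of the select generator: a one-hot ring counter cycles through the
# data-transparent configurations 10..0, 01..0, ..., 00..1, advancing only
# when a pseudo-random LFSR bit is 1.

N_PORTS = 5

class RingCounter:
    def __init__(self, n):
        self.n = n
        self.state = 1 << (n - 1)   # start from the "10..0" configuration

    def tick(self, enable_bit):
        if enable_bit:              # advance only when the LFSR bit is 1
            self.state = (self.state >> 1) or (1 << (self.n - 1))
        return self.state

rc = RingCounter(N_PORTS)
enables = [1, 0, 1, 1, 0, 1]        # stand-in for successive LFSR bits
trace = [rc.tick(e) for e in enables]
```

Every configuration keeps exactly one select line asserted, so a data path through the multiplexer is always transparent, while the LFSR-driven enable randomizes when the selected path changes.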


The above-mentioned optimizations finally guaranteed a fault coverage and a testing period for the cascaded crossbar-channel similar to the results obtained for the stand-alone channel (i.e., 99.5% coverage in 250 clock cycles).

4.3.3 Testing LBDR

Unlike the crossbar module, the LBDR block associated with the port under test lies in the downstream switch. Thus, it is fed by the input buffer and is directly connected to the MISR, as shown in Fig.18(c). In this case, the MISR was extended to 39 bits in order to also compress the LBDR outputs.

However, since the LBDR consists of many-input, hard-to-test combinational logic, the implementation of an effective test is challenging. As a result, the following three optimizations were introduced to raise the fault coverage of the routing mechanism:
1- Some of the LBDR inputs were directly driven by the local LFSR to restore the highest randomness of the test patterns.
2- The probability of injecting a packet towards the local port is negligible when exploiting pseudo-random patterns. In fact, the LBDR routes a packet in the direction of the local port only when the local ID (SID bits) is equal to the destination ID of the packet. We therefore connected the SID bits to input 0 of the crossbar multiplexer. As a result, all the packets forwarded by port 0 of the crossbar multiplexer are routed to the local port of the receiver switch. Thus, this solution allows all the routing scenarios to be tested.
3- Since the LBDR logic performs its computation only when stimulated by header flits, the test pattern bits at input 0 of the crossbar multiplexer were connected so as to generate exclusively header or tail flits (i.e., the flit type bits were driven by the same MISR output bit, thus assuming only the 11/00 configurations associated with the header/tail flit types). As a consequence, payload flits are no longer forwarded by port 0 of the multiplexer. In this way, the number of header flits was increased and the routing logic was efficiently stimulated and tested. With these optimizations, the testing framework achieved a 97.2% total fault coverage for the cascaded blocks in 1000 clock cycles.
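Optimization 3 amounts to a one-bit structural constraint. Assuming the encoding implied by the text (flit type "11" = header, "00" = tail, mixed codes = payload), tying both flit-type bits to a single shared bit makes payload codes unreachable:

```python
# Toy illustration of optimization 3: driving both flit-type bits from one
# shared pseudo-random bit means only header (11) and tail (00) encodings
# can ever appear at input 0 of the crossbar multiplexer. The encoding is
# assumed from the text, not taken from the RTL.

def flit_type_bits(shared_bit):
    """Both type bits tied to the same source bit."""
    return (shared_bit, shared_bit)

generated = {flit_type_bits(b) for b in (0, 1, 0, 1, 1)}
# Payload codes (0, 1) and (1, 0) are unreachable by construction.
```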

4.3.4 Testing Arbiters

The test did not achieve a high coverage when the arbiter was directly connected to the existing testing framework following the approach taken so far. The reason lies in the poor efficiency in testing the arbiter FSM. As a result, in the final testing framework the arbiter logic is directly driven by the LFSR and its test responses feed a dedicated 11-bit MISR. Although some area overhead was introduced by the additional MISR, we found it necessary in order to achieve the maximum coverage for such a strategic module.

4.3.5 BIST-enhanced switch architecture

The switch architecture enriched with the BIST infrastructure is illustrated in Fig.18(d). A test wrapper consisting of multiplexers can be clearly seen; it enables test pattern injection from the LFSR into the modules it tests. A unique 34-bit LFSR generates the pseudo-random patterns to test every switch port in parallel. Moreover, a dedicated 11-bit MISR for every port collects the test responses from the output port arbiters, and a 38-bit MISR for every port performs the signature analysis of the test responses from the crossbar, the channel and the LBDR blocks.

Test diagnosis results in the setting of 10 bits (one for every MISR), indicating whether each input/output port is faulty or not. This meets the requirements of the LBDR configuration algorithm in [25]. Interestingly, the testing framework is able to reveal the correct position of multiple faulty channels, since a MISR is dedicated to each port. Obviously, it is not possible to distinguish the elementary faulty module inside the faulty port. However, the proposed testing framework is based on the assumption that the functionally coupled modules for an input port are its routing block, its upstream communication channel, the port arbiter and the crossbar multiplexer in the upstream switch associated with that channel. Thus, when one of the above-mentioned functionally coupled modules fails, the associated fault-free logic would be unusable anyway.


Figure 19: Coverage for single stuck-at faults as a function of the test latency.

Figure 20: Area overhead for BIST implementation.

4.4 Experimental Results

We performed logic synthesis of a 5x5 switch on an industrial 40nm Infineon technology library. The baseline switch architecture of Fig.1 and the proposed switch augmented with the pseudo-random testing framework achieve approximately the same maximum operating speed of 600 MHz when synthesized for maximum performance, thus proving that our BIST-enabled switch is capable of at-speed testing.

Fig.19 reports the total number of test patterns (clock cycles) generated for the 5x5 switch and the associated coverage for both the deterministic and the pseudo-random testing frameworks. It can be seen that the proposed pseudo-random framework exploits a further degree of freedom with respect to the deterministic solution: it can trade latency for coverage. Interestingly, in all the analyzed latency scenarios, the coverage for single stuck-at faults is above 94%. In particular, the deterministic framework achieves 99.3% coverage in 1104 clock cycles, while the coverage of the proposed framework ranges between 94.2% and 98.2%, achieved in 500 and 10,000 clock cycles respectively.

These numbers prove that it is possible to achieve a coverage and a testing latency with pseudo-random test vectors that approach those of deterministic vectors, although without entirely matching the same quality metrics. This was made possible by the performed optimizations, which take advantage of the knowledge of the architecture under test to some extent, without reverting to fully deterministic vectors. The remaining difference in quality metrics with respect to them can be considered the price to pay for reduced area overhead, as proved hereafter. Also, it should be mentioned that a pseudo-random approach to testing is more appealing in terms of shorter design time and adaptability to architecture, library and technology changes.


Fig.20 shows the area overhead of the proposed testing framework and of the reference one based on deterministic patterns when applied to a 5x5 NoC switch. Both of them were synthesized on a 40nm Infineon technology library. The area overhead is relative to the baseline BIST-less switch in Fig.1. The area overhead of the proposed framework based on pseudo-random patterns is 16.3%, which rises to 37.1% for the framework based on deterministic patterns. Interestingly, in the proposed framework most of the overhead comes from the MISR modules; on the other hand, the test wrappers and the TPGs (i.e., the LFSRs) prove extremely lightweight when compared with their deterministic framework counterparts. Note that the removal of most of the test wrappers in the pseudo-random framework was possible thanks to the proposed exploitation of test responses to test other cascaded modules.

4.5 Conclusions

This section has illustrated an efficient implementation of pseudo-random testing for NoC switches. Awareness of the architecture under test enabled testing framework optimizations that improved coverage while reducing latency. As a result, the quality metrics of a testing framework based on handcrafted deterministic test patterns were approached while materializing significant area savings and enhanced flexibility. Achieving precisely the same coverage numbers is, however, not affordable for pseudo-random testing within reasonable testing times. Therefore, the small percentage increase in fault coverage that deterministic test patterns are able to provide represents an advantage that should be strictly traded off against a larger area footprint and a lower flexibility.


5 Built-in Self-Testing and Self-Diagnosis for Bisynchronous Channels

5.1 Introduction

Traditional globally synchronous clocking circuits have become increasingly difficult to design with growing chip size, clock rates, relative wire delays and parameter variations. Additionally, high-speed global clocks consume a significant portion of system power and lack the flexibility to independently control the clock frequencies of submodules to achieve high energy efficiency. The globally asynchronous locally synchronous (GALS) clocking style [39] separates processing blocks such that each block is clocked by an independent clock domain, and is an effective strategy to address global clock design challenges [40]. A GALS NoC features no global clock distribution, thus simplifying the classical timing closure process that would otherwise require several systematic iterations to converge. Moreover, rigid timing constraints between local clock domains do not need to be enforced. In this deliverable, we view the multi-synchronous design style (where the system is partitioned into individual islands of synchronicity, each operating at a different frequency and connected with each other by an on-chip network which implements synchronization interfaces at island boundaries) as a special case of a GALS system. We provide testing support for this case. The more general case is when the on-chip network is fully asynchronous, but testing such an architecture goes outside the scope of the NaNoC project.

Although it is an appealing alternative to common design practice, the GALS NoC paradigm heavily impacts the architecture of the chip-wide communication infrastructure. When a source synchronous design style is considered, circuitry to reliably and efficiently move data across clock domain boundaries needs to be integrated into the NoC channels. Source synchronous interfaces route the source clock along with the data for correct synchronization at the receiving end. The source transmits data words without individual acknowledgments and halts transfers when the destination indicates, via a backward-propagating stall signal, that it can no longer accept new data. We will hereafter refer to a source synchronous channel connecting two islands of synchronicity as a bisynchronous channel.

This link architecture has a number of implications for the BIST/BISD framework for bisynchronous channel testing. First, the framework must tackle the test of the complex control logic introduced by the source synchronous interfaces. Second, the test pattern analyzer lies in a different frequency domain from the test pattern generator when the same cooperative testing strategy considered so far is selected. Since the frequency ratio of the domains can change during the NoC lifetime, the test pattern analyzer has no a priori knowledge of the arrival time of the test responses. Thus, the testing framework needs to support additional mechanisms for the flow control of the test patterns.

This section presents a built-in testing and diagnosis framework for the NoC bisynchronous channel. Following the implementations of the previous sections, it exploits a cooperative testing framework redesigned under relaxed synchronization constraints. Since the pseudo-random and the deterministic approaches were the best performing in the fully synchronous system, both of them are considered; these two solutions are implemented and compared in a multi-synchronous scenario. Once the testing of the bisynchronous channel is tackled, the switch internal logic can be tested by means of one of the fully synchronous frameworks presented so far, since synchronism inside a NoC switch is preserved.

5.2 Target GALS Architecture

The bisynchronous channel under test includes input/output buffers and their intermediate links. We consider two neighboring switches belonging to different clock domains; thus the link is bisynchronous and communication across it is asynchronous. Bi-synchronous FIFO synchronizers are therefore used to connect the switches in a reliable way (see Figure 21).

The xpipesLite NoC architecture introduced in Section 2.2 is used as the baseline experimental setting to implement multi-synchronous communication. It is worth recalling here that the flow control protocol used by xpipesLite is stall/go: a forward signal, synchronous with the data, flags data availability (valid), while a backward signal flags a destination buffer full (stall) or empty (go) condition.


Figure 21: Baseline bisynchronous communication channel.

Figure 22: Dual-Clock FIFO Architecture.

The synchronizers are typically inserted between two connected network building blocks belonging to different clock domains. In practice, they break the switch-to-switch or the network interface-to-switch connections, depending on the decisions about clock domain partitioning. However, there is typically no co-optimization of the synchronizer with the following/preceding NoC sub-module, therefore a significant latency, area and power overhead materializes. This design practice can be denoted as the loose coupling of synchronization interfaces with the NoC. In contrast, the bisynchronous channel under test exploits a synchronizer tightly integrated into the NoC switch, taking care of synchronization but also of other tasks such as switch input buffering and flow control. The consequent reuse of buffering resources for different purposes in turn leads to large energy savings that make a GALS NoC affordable at almost the same area and power cost as its synchronous counterpart. Moreover, by moving the synchronizer inside the switch, the communication latency at the clock domain boundary reduces to the ideal synchronization latency. This implementation is an outcome of the EU-funded Galaxy project. While a tightly coupled solution is extremely appealing in terms of performance, it comes with a challenging testing framework, which needs to test the buffer, the synchronization and the flow control logic.

The dual-clock FIFO architecture under test [1] is illustrated in Figure 22. The dual-clock FIFO has been designed directly for use in a NoC implementing the stall/go flow control protocol. The queuing and dequeuing of data elements in the FIFO obey the following protocol. The input data is queued into the FIFO if and only if the valid input signal is true and the full signal is false at the falling edge of ClkTx. Operation is triggered at the clock falling edge to preserve timing margins. Symmetrically, data is dequeued to the output data link if and only if the RX stall/go signal is false (go) and the empty signal is false at the rising edge of ClkRx.


Figure 23: Dual-Clock FIFO integration into the NoC switch architecture.

The bisynchronous FIFO architecture is composed of two token ring counters. In the sender interface, the token ring counter is driven by ClkTx, synchronous with the incoming data. It generates the write pointer indicating the position to be written in the data buffer. In the receiver interface, the token ring counter is driven by the local clock, ClkRx. It generates the read pointer indicating the position to be read in the data buffer. The data buffer contains the data storage of the FIFO, whose depth is parameterizable. Full and empty detectors signal the fullness and the emptiness of the FIFO. In our solution, these detectors perform asynchronous comparisons between the FIFO write and read pointers, which are generated in clock domains that are asynchronous to each other. The full detector computes the full signal using the write pointer and read pointer contents. The FIFO is considered full when the write pointer points to the position preceding the read pointer. Vice versa, the FIFO is considered empty when the write pointer points to the same position as the read pointer. The comparisons of read and write pointers need to be synchronized by means of carefully engineered brute-force synchronizers.
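The pointer scheme and the full/empty rules just described can be modeled as below. This is a deliberate simplification: the model advances both pointers in lockstep, whereas in the real FIFO they live in two asynchronous clock domains and their comparison requires the synchronizers mentioned above:

```python
# Behavioral model of the token-ring pointers and full/empty detectors
# (single-threaded simplification of the two asynchronous clock domains).

DEPTH = 5                      # the 5-slot dual-clock FIFO of this section

def rotate(one_hot):
    """Advance a one-hot token ring pointer by one position."""
    return ((one_hot << 1) | (one_hot >> (DEPTH - 1))) & ((1 << DEPTH) - 1)

def is_empty(wr, rd):
    return wr == rd            # write token on the same slot as the read token

def is_full(wr, rd):
    return rotate(wr) == rd    # write token just behind the read token

wr = rd = 0b00001              # reset state: both tokens on slot 0 -> empty
empty_at_reset = is_empty(wr, rd)

# Queue DEPTH-1 words without dequeuing: the FIFO becomes full one slot
# before the write token would catch the read token.
for _ in range(DEPTH - 1):
    wr = rotate(wr)
full_after_writes = is_full(wr, rd)
```

Note that with this convention one buffer slot is sacrificed to distinguish full from empty, which is the usual price of pointer-comparison FIFO detectors.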

The proposed FIFO synchronizer was then tightly integrated into the switch architecture, as illustrated in Fig.23. The 2-slot input buffer of the vanilla switch has been replaced by a dual-clock FIFO that requires 5 buffer slots to support any frequency ratio between source and destination peers. During run-time operation, the stall/go signal is provided by the arbiter, which receives the valid signal as an input. In every frequency ratio scenario between sender and receiver, 100% throughput is guaranteed in the presence of such a FIFO synchronizer.

5.3 Built-in self-diagnosis framework

The bisynchronous channel is tested by means of a Test Pattern Generator (TPG), an Auto Test Analyzer (ATA) and some brute-force synchronizers, as illustrated in Fig.24. The test phases are regulated by three asynchronous signals (full, ready and go). The TPG generates the test patterns for the channel while the ATA reads its responses. Since a single ATA is in charge of diagnosing all the switch channels in parallel, the TPGs feed all the channels in parallel and the ATA collects all the test outputs concurrently. The responses from all the channels are then compared in a diagnosis block integrated into the ATA.

A TPG dedicated to each channel is placed at the sender side. It injects the test traffic into the channel output buffer exploiting some control signals (the valid and the stall) and communicates with the ATA through two forward signals (the full and the ready) and one backward signal (the go). These signals cross two frequency domains; thus they are synchronized by dedicated brute-force synchronizers before feeding the ATA and the TPG. At the receiver side, the ATA drives the stall signal and analyzes the valid and the data signals from the channels. It integrates a diagnosis module, 10 comparator trees and a timer. The diagnosis follows the same principles as the fully synchronous solution: the channel instances are assumed to be identical, therefore they


Figure 24: Bisynchronous channel enhanced with the BIST framework.

(a) Deterministic Test Pattern Generator. (b) Linear-Feedback-Shift-Register.

Figure 25: TPGs for deterministic and pseudo-random generation of test patterns.

should output the same results if there is no fault. As the bisynchronous channel width is equal to the width of the fully synchronous channel, the comparator tree module is preserved.

The TPG performance achieved in the previous sections is now validated in a multi-synchronous scenario; therefore, TPGs are implemented in both their deterministic and pseudo-random variants. In the first version, the TPG was designed for the generation of deterministic test patterns, as illustrated in Figure 25(a). It exploits a clock cycle counter and some logic to generate the values of the input signals for the DUT. In the second solution, the TPG was realized by means of the Linear-Feedback Shift Register represented in Figure 25(b).

5.4 Operative principles

The testing framework of the bisynchronous channel is designed to start at boot-time, after the reset phase of the system completes. Thus, both permanent faults due to fabrication defects and wear-out effects are tackled.

At the falling edge of the reset signal, the ATA concurrently sends a go signal to all the TPGs integrated into the neighboring switches in order to signal the start of the test. Afterwards, the ATA enters a wait state and the TPGs start the injection of the first test pattern. A new pattern is injected every time the go signal is asserted (the go signal directly drives the counters of the TPGs). The patterns are composed of 32 bits of data, the valid and the full signals. When the TPG has injected the pattern, it asserts the ready signal and enters an idle state, waiting for the next go assertion by the ATA. During the idle state, the TPG deasserts the valid signal.

Since every TPG works at an independent frequency, the test patterns do not reach the destination at the same clock cycle. Thus the ATA wakes up and enables a timer when it receives the first pattern. The timer generates a timeout flag if the test patterns from all the channels are not received within a maximum period of time. On the contrary, when all the patterns arrive, the diagnosis phase can start. In this case, the incoming full signals are exploited by the ATA


(a) (b)

Figure 26: Effect of the synchronizer set port on the control signal.

Figure 27: Proposed triple-stage brute-force synchronizer.

to drive the stall signal of the dc-FIFOs for a single clock cycle. A clock cycle later, the dc-FIFO stall signals are asserted to freeze the channel state and the data signals are finally sampled by the ATA. The data signals are then compared with each other in the 10 comparator trees and the diagnosis is performed. Note that the diagnosis guarantees the same error detection capability as the fully synchronous counterpart (see section 2.3.4).
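At a behavioral level, this comparison-based diagnosis can be sketched as a majority vote over the sampled data words (a simplification of the comparator-tree structure described in section 2.3.4; the function name and representation are ours):

```python
from collections import Counter

def diagnose(channel_outputs):
    """Flag channels whose sampled data disagree with the majority.
    Since all channel instances are identical and receive the same test
    pattern, any disagreement indicates a faulty channel."""
    majority, _ = Counter(channel_outputs).most_common(1)[0]
    return [i for i, word in enumerate(channel_outputs) if word != majority]
```

For example, `diagnose([0xA5] * 9 + [0x5A])` flags channel 9 as the faulty one.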

If the test succeeds, the next test pattern can be sent. Indeed the ATA asserts the go signal to all the TPGs and resets its timer. When an error is revealed, the information about the failure is stored in a local register and the faulty channel output is no longer considered when the test restarts. When the test is completed, the information stored in the local register is sent to a global manager.

The procedure is equivalent for both the deterministic and the pseudo-random solutions, although the deterministic TPG automatically stops as soon as the last pattern is sent, while the LFSR continues the data injection until a stop signal is received.

5.5 Synchronization of the handshaking signals

A reliable synchronization of the ready, the full and the go control signals is essential for a successful communication between TPG and ATA. In integrated circuits, the brute-force synchronizer represents the common solution for control signal synchronization. The brute-force synchronizer is a lightweight circuit composed of a sequence of flip-flops driven by the receiver clock. The MTTF of the circuit is dictated by the number of cascaded flip-flops. In our proposed framework, brute-force synchronizers with three flip-flops are adopted for the synchronization of channel control signals. A sequence of three flip-flops guarantees a negligible failure probability (i.e., a sufficiently large MTTF) despite the degradation of the resolution time constant with technology scaling.

However, the brute-force synchronizer did not guarantee a correct TPG-to-ATA communication when it was implemented in its baseline configuration. In fact, an input signal asserted for a few clock cycles was filtered by the synchronization circuit if the receiver domain was slower than the sender counterpart. The synchronizer was not able to sample the incoming signal if no positive receiver clock edge occurred during its assertion (see Figure 26(a)). In this latter scenario, the TPG/ATA missed the control signal and the testing framework operations were compromised. As a solution, we enhanced the first flip-flop of the brute-force synchronizer with a set port. Thus the incoming control signal presets the first flip-flop as soon as it is asserted at the synchronizer input. As Figure 26(b) shows, this latter solution allows the correct control signal sampling even when the receiver is slower than the sender and no positive receiver clock edge occurs during the assertion. Afterwards, the synchronizer output is deasserted at the next positive receiver clock edge.
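To make the role of the set port concrete, the following cycle-level sketch models the three-stage synchronizer (metastability and real asynchronous timing are deliberately not modeled; the class and method names are illustrative):

```python
class SetSynchronizer:
    """Three-flip-flop brute-force synchronizer whose first stage has an
    asynchronous set (preset) port, as described above."""
    def __init__(self, stages=3):
        self.ffs = [0] * stages

    def preset(self):
        # the asserted input presets the first flip-flop immediately,
        # even if no receiver clock edge falls within the pulse
        self.ffs[0] = 1

    def clock(self, d_in):
        # positive receiver clock edge: shift the chain, sample the input
        for i in range(len(self.ffs) - 1, 0, -1):
            self.ffs[i] = self.ffs[i - 1]
        self.ffs[0] = d_in

    @property
    def out(self):
        return self.ffs[-1]
```

A pulse deasserted again before any receiver clock edge would be lost by a plain synchronizer, but here `preset()` latches it so it appears at the output two receiver edges later and is deasserted at the following edge.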

The brute-force synchronizers have a further objective in the testing framework. The correctness of the operations is preserved only when the ready and the full control signals arrive together with or later than the test patterns. Thus the triple-stage synchronizer avoids the arrival of those control signals in advance with respect to the test patterns. The proposed brute-force synchronizer is illustrated in Figure 27.

5.6 Experimental results

We performed logic synthesis of the channels under test on an industrial 40nm Infineon technology library. In this case, the baseline channel was composed of fully synchronous input and output buffers, while the bisynchronous channel was enhanced with dc-FIFOs. Two versions of the bisynchronous channel were implemented, one with deterministic and one with pseudo-random TPGs. These two latter frameworks were compared in terms of area, coverage and latency.

Figure 28: Area of baseline, deterministic and pseudo-random channel.

5.6.1 Area

Fig.28 shows the area of the baseline channel and the overall area of the channel enhanced with deterministic and pseudo-random patterns. The area of the channel implemented by means of deterministic TPGs and pseudo-random TPGs is respectively 29% and 26% larger than the area of the baseline solution. Fig.29 shows the area breakdown for the logic required to test the channel by deterministic and pseudo-random patterns. As we can see, the area overhead of the ATA and the synchronizers is equal in both cases (63.71% for the ATA and 7.31% for the synchronizers). On the contrary, the area overhead of the test pattern generator is 28.92% for the pseudo-random solution and 40.71% for the deterministic counterpart. In fact, the area overhead of the deterministic pattern generator is 40.85% greater than that of the pseudo-random pattern generator.

5.6.2 Fault Coverage

Fig.30 shows the fault coverage of the baseline channel enhanced with pseudo-random patterns as a function of the number of clock cycles. As we can see, the coverage increases pseudo-logarithmically with the number of clock cycles. A coverage of 96.5% is reached after 4,000


Figure 29: Breakdown of Area Overhead for Pseudo-Random and Deterministic TPGs.

Figure 30: Fault coverage of the baseline channel with pseudo-random pattern as a function of thenumber of cycles.

clock cycles, while 98% is reached after 24,500 cycles. The deterministic solution performs slightly better than the pseudo-random counterpart, achieving 97% coverage in 3,950 cycles.

5.6.3 Latency

Figure 31 shows the test time as a function of the TPG (Test Pattern Generator) and ATA (Auto-Test Analyzer) frequencies. As expected, when the frequencies of the TPG and the ATA are both low (for example, 10MHz), the test time is very high. Vice versa, when the frequencies are both high (for example, 1GHz), the test time is the smallest. When only one of the frequencies is low, it dictates the final test time.

Summing up, Table 4 compares the performance of the two implementations of bisynchronous channels based on deterministic and pseudo-random test pattern generators when high coverage is required. From the table, it is clear that in spite of the slightly larger area overhead, the approach leveraging deterministic test patterns is able to achieve comparable coverage with very low latency (considering the multi-synchronous environment). Therefore, in a multi-synchronous context pseudo-random test patterns are not as attractive as in fully synchronous settings. Interestingly, the relative area overhead associated with deterministic patterns is decreased with respect to the fully synchronous case because of the relevant area footprint of the ATAs.


Figure 31: Test time as function of ATA and TPG frequencies.

Table 4: Comparison in terms of area overhead, coverage and latency for a bisynchronous channel with deterministic and pseudo-random TPGs with respect to a fully synchronous channel without testing capability.

                 Channel with deterministic TPG    Channel with Pseudo-Random TPG
Area Overhead    29.3%                             26.5%
Coverage         97%                               98.2%
Latency          3950 cycles                       24800 cycles

5.7 Conclusions

This section has illustrated a novel built-in testing and diagnosis framework for NoC bisynchronous channels. Both solutions, based on pseudo-random and on deterministic test patterns, were analysed for the multi-synchronous scenario. Finally, the area overhead, the test time and the coverage of the two bisynchronous channels were presented and compared. The test time was evaluated by varying the frequencies of the transmitter and the receiver from 10MHz to 1GHz. The channel enhanced with the deterministic TPG was faster in achieving the same coverage than the pseudo-random counterpart, at the cost of a slightly larger area overhead. As a result, the choice between the two proposed techniques is dictated by the test time, silicon area and coverage trade-off, although the overall balance is more in favour of the deterministic test pattern strategy than in the fully synchronous scenario.


6 Debug and Trace

6.1 Introduction

So far we have presented hardware testing strategies, although modern SoCs consist of both hardware and software. It is therefore necessary to analyze software debugging techniques, which may be related to an overall improvement of the system-on-chip performance. In this way, by combining the hardware testing strategies and the software debugging techniques, the network will be more robust and the resources can be optimized. This section presents a novel software debugging and trace technique compatible with a SoC Network-on-Chip. This technique allows optimizing the test of the software while keeping the network performance unaltered. The solution proposed in this section is latency tolerant. Moreover, it makes use of a hierarchical version of the dual network proposed in WP2 and exploited for the BIST infrastructures, as well as for the system reconfiguration.

A suboptimal software routine may significantly diminish the overall performance of a system. It also takes a significant amount of time to trace and debug the behavior of a malfunctioning routine within a subsystem. This is due to the fact that, besides logical errors (bugs within the software itself), there might be deeper dependencies between the program flow and all the other activity that is occurring. Such complex interactions are hard to find and determine. Therefore, in order to ease software development, the interaction between the underlying hardware and the software code being executed on the platform has to be made transparent. This can be achieved by observing the transactions generated by the executed software and their propagation through the different parts of the system. Such transparency is key to a better understanding of how a particular piece of code behaves when executed and what its mutual impact with the system really is. Thus, it would eventually lead to shorter and more precise development cycles. Information such as the delay, contents and effects of different processing blocks on a particular transaction can help in optimizing and debugging software.

There exist several approaches that try to solve the problem. For example, the work presented in [27] illustrates an AHB tracer, aimed at producing observations from an AHB bus. However, this work concentrates only on one type of interconnect, lacks the notion of time and is intrusive to the normal system operation. Others, like [28] and [29], concentrate on directly debugging code executed within the processor by making use of scan-chains. Yet, their work primarily concentrates on inserting breakpoints and is limited to operating only within the scope of the processor, rather than the system as a whole. The works [30] and [31] both concentrate on transaction-level debug, where the communication, rather than the state of the processor, is observed. In [30] the regular NoC interconnect of the SoC is reused for observations. These observations, however, are generated only when the system cores are stopped. Thus, it is not particularly suitable for tracing and non-intrusive observations. The work in [31] relies on scan-chains, which can severely limit the bandwidth for trace and does not support accurate accounting of time.

In order to solve the problem, a novel debug and trace system has been devised. It comprises its own interconnect so as to avoid introducing additional traffic into the normal operational state of the system. The developed solution is highly modular, a feature inherited from the fact that it is itself a NoC. The subsystem is not restricted by the implementation of the original SoC, and can be applied equally well to circuit-switched and packet-switched interconnects. It supports multiple voltage and frequency domains and a very accurate timestamping method, which allows observations to be ordered in a trace based on their creation times. Furthermore, it is compliant with partial and full system shutdown and reboot, has a simple design aimed at consuming a minimal amount of area, and supports different levels of trace information granularity. The motivation behind using a NoC for tracing a SoC is the fact that, given adequate timestamps, a trace becomes latency tolerant; additionally, the high concurrency of tasks can easily be handled with packet-switched communication.

6.2 Architecture and Topology

The debug and trace system uses a hierarchical architecture. In order to minimize the use of resources, hierarchical unidirectional rings have been chosen as the underlying topology. The topology consists of a main ring, to which multiple secondary subrings are attached. Each subring interconnects a particular subsystem or a portion of it. The main ring is additionally attached to a Debug/Trace Unit, which collects the information and controls the debug process. It may be on or off chip, and most of its functionality, as will be shown later, is implemented in software.

Figure 32 illustrates what the hierarchical ring topology looks like for a SoC consisting of two subsystems (indicated by the red and blue colors). In this example each subsystem consists of five modules, which are to be traced. These modules are interconnected using a unidirectional ring. The direction of packet flow is indicated by the arc arrows in the rings. Furthermore, there are basically two directions, called upstream and downstream. The first one indicates the packet flow from the Debug/Trace Unit to the monitors and the second one (expected to be much more heavily used) represents the flow of packets from the monitors to the Debug/Trace Unit. The solid circles labeled R0-R15 represent the routers of the dedicated NoC. The boxes labeled M0-M9 represent the monitors (the observation units) attached to each of the modules. B0 and B1 are the bridges, crossing between two different voltage and frequency islands. In order to avoid the effects of power or clock gating, the voltage and frequency island crossings are always done from the subrings to the main ring. The main ring is dedicated to debug and is always on during such operation. This modularity allows the different subrings to also be operated using different bit widths. Finally, the orange box is the Debug/Trace Unit, which is used to collect the trace and respectively control the debug process.

Figure 32: Example topology of a SoC consisting of two subsystems

6.3 Monitors and Routers

The main building blocks of the trace and debug system are the monitors and the routers. The additional hardware blocks required to implement the system are the bridges, which ensure safe crossing from one voltage and frequency island to the other and, if needed, take care of up/down sizing of packets when the bit widths on the two sides differ.

6.3.1 Monitors

The monitors used in this system are abstractly defined, and their functionality may be adapted to the needs of a particular SoC. The main constraint is that they pass the observations in the form of packets to the routers to which they are attached. They are programmable devices and are configurable by the Debug/Trace Unit, which may address them by sending packets upstream. A basic example model of the monitor is a sniffer device, which has four modes of operation. The modes are depicted as states in Figure 33.


Figure 33: State transition diagram of the monitor (states: Idle, Short, Long, Statistics)

In the Idle state the monitor does not do anything and does not produce observations. In the Short state it only traces the command part of the transactions, i.e., the command, burst length and address when circuit-switched communication is used on the main interconnect, and the packet header when a packet-switched one is used. The Long state designates the configuration in which the monitor tries to encapsulate and forward the whole observed transaction. If this is not possible, it trims the contents, e.g. by omitting the last data words. Furthermore, if no further omission is possible in the Long state, or there was no possibility to forward an observation in the Short state due to insufficient buffer space, a loss count is initiated. It is responsible for counting the number of observations that could not be packetized over some period. At the end of such a congestion period a special packet is sent, indicating how many observations have been missed. Finally, the Statistics state forces the monitor to only count transactions and encapsulate statistics on the number of transactions observed per predefined period. All modes have different bandwidth demands, thus the more information is required from the monitors, the more bandwidth will be consumed. As it is not possible to encapsulate everything, the configuration of the monitor allows for tradeoffs during debug. For example, only a few of all the monitors may be operated, so that the chance of congestion is minimized.
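A behavioral sketch of these four modes is given below (class names, the flit-based size model and the loss-counting details are illustrative assumptions, not the deliverable's monitor implementation):

```python
from enum import Enum

class Mode(Enum):
    IDLE = 0
    SHORT = 1
    LONG = 2
    STATISTICS = 3

class Monitor:
    """Sniffer-style monitor sketch: turns observed transactions into
    trace packets according to its current mode."""
    def __init__(self, max_pkt_flits):
        self.mode = Mode.IDLE
        self.max_pkt_flits = max_pkt_flits
        self.observed = 0  # transactions seen (reported in STATISTICS mode)
        self.lost = 0      # observations that could not be packetized

    def observe(self, header, payload):
        """Return a trace packet for one transaction, or None."""
        self.observed += 1
        if self.mode in (Mode.IDLE, Mode.STATISTICS):
            return None  # STATISTICS only counts; a summary packet is sent per period
        if self.mode is Mode.SHORT:
            pkt = [header]  # command part only
        else:  # Mode.LONG: whole transaction, trimmed to the available space
            pkt = ([header] + list(payload))[:self.max_pkt_flits]
        if len(pkt) > self.max_pkt_flits:
            self.lost += 1  # insufficient buffer space: count the loss
            return None
        return pkt
```

The Debug/Trace Unit would switch `mode` by sending configuration packets upstream.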

The monitors can be further extended not only to produce observations, but also, for example, to interact with the module. Such an interaction can be reading/writing registers, needed for breakpoint insertion. The interaction can be controlled in the same way the monitors are programmed, by receiving packets sent by the Debug/Trace Unit on the upstream.

6.3.2 Routers

The routers vary in flavor depending on their position within the topology. All of the routers are two-port routers, where one port is dedicated to the ring the router resides on and the other one to the module attached to the router.

The routers that are directly attached to the monitors are the simplest ones. Their only purpose is to forward packets: if a packet targets the monitor directly attached to such a router, the router forwards it to the monitor, otherwise it forwards it on the ring. In Figure 32 these are the routers R4 - R8 and R11 - R15.

The routers R0, R3 and R9 are essentially the same, but they either put a packet onto or take a packet from the ring. They also generate special timing packets themselves, which are sent to the Debug/Trace Unit. These packets are part of the timestamping strategy and will be discussed in Section 6.4.

Finally, routers R1 and R2, like the latter ones, are responsible for putting and taking packets. Additionally, they are able to detect the powering up of a particular subring and notify this by emitting a packet to the Debug/Trace Unit.

6.3.3 Arbitration

In order to keep the latency equally distributed among all monitors attached to the same ring, an arbitration algorithm has to be implemented in the routers. Each router has to be able to choose a port if there is contention. Furthermore, the arbitration has to be fair in order to ensure an equal chance for all monitors to transmit. Implementing pure Round Robin arbitration, for example, is unfair and will cause larger congestion for a monitor further away from the router bridging to the main ring than for a monitor attached closer to it. This is due to the fact that the packets of monitors which are further away have to pass more hops. In general, with pure Round Robin arbitration, a packet that has to traverse N hops has a chance of 2^(-N) of not being blocked. Thus, a higher hop count increases the chance of blocking.

Figure 34 (a) presents a partial solution to the problem. Instead of using pure Round Robin arbitration, a weighted version is applied. The figure depicts a dummy ring with 4 monitors (M0 - M3) and 5 routers (K0 - K4). The numbers shown in red represent the arbitration weights of the routers' inputs. Additionally, it can be observed that both K0 and K1 have the same port weights. The reason behind this is that the weights are based on the number of monitors preceding a router. K0 is a special case because there are no monitors preceding it. However, a weight of 0 would also block all the upstream traffic and render the system unusable, thus it has to remain 1. This weight distribution, however, is a valid solution if and only if all monitors are injecting traffic. If, for example, M0 is switched off, then M1 will gain an advantage over M2, and both will have an advantage over M3 because of the static weights.
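The weighted arbitration can be sketched with a simple credit scheme (an illustrative software model, not the routers' RTL): each port starts a round with as many credits as its weight, the arbiter grants requesting ports that still have credits, and the credits are refilled once exhausted.

```python
class WeightedRRArbiter:
    """Credit-based weighted round-robin arbiter sketch: over a full round,
    each requesting port is granted a number of slots equal to its weight."""
    def __init__(self, weights):
        self.weights = list(weights)
        self.credits = list(weights)

    def grant(self, requests):
        """requests: one bool per port; returns the granted port or None."""
        for _ in range(2):
            for port, req in enumerate(requests):
                if req and self.credits[port] > 0:
                    self.credits[port] -= 1
                    return port
            # all requesting ports out of credits: refill and retry once
            self.credits = list(self.weights)
        return None
```

Keeping `weights` mutable is what the weight-adjust mechanism exploits: a router increments or decrements the weight of its ring port when notified of a monitor state change, never letting it drop below 1.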


Figure 34: Weight adjustment

In order to compensate for this phenomenon, the weights of the arbitration algorithm are made dynamic. The weights associated with the routers' ports connected to the ring change when monitors change their states (Idle → Any or Any → Idle). Thus, Figure 34 (b)-(d) shows how the weights adapt when monitors switch to and from the Idle state. This switching is accomplished by sending a special packet in response to the state change (Figure 35). The packet


travels downstream and informs all the subsequent routers that the weight on a particular port has to be either scaled up or down by 1. The packet has an effect only if it is received on the port associated with the ring, and a decrease of the weight value below 1 is prohibited because such an action would also block the control packets on the upstream.


Figure 35: Emission of special response packet

6.4 Timestamping

Correct timestamping is essential to building a correctly ordered trace consisting of the observations produced by the monitors. In order to reduce the hardware utilized for this purpose, the timestamping is done in a differential manner. Rather than having a huge counter, which counts all the cycles during debug, a much smaller one is used, which counts only cycles within some small intervals called periods. Thus each observation carries a differential timestamp of only a few bits. Furthermore, this also supports a partial/full system shutdown. The Debug/Trace Unit is then responsible for ordering the observations in an appropriate manner. As will be shown, this is not a trivial hardware task, and thus the Debug/Trace Unit uses software to cope with the ordering problem.

The timestamping solution in this work also relies on a counter, which is a simple module consisting of N bits. The counter counts every clock cycle. The N bits are further virtually subdivided into K lower and M higher bits (K + M = N). Figure 36 shows this for N = 10. It must be noted that this subdivision is virtual and no additional hardware is required. The M bits can be seen as a count of how many times the K bits have overflowed.

Figure 36: A 10-bit counter (N = 10), virtually split into M = 2 higher bits and K = 8 lower bits

Each subring has its own counter. It may be a single entity to which the monitors have access, or a number of identical counters. As each subsystem may have a different clock, these counters are not expected to count in a similar manner. In addition, each router that bridges a subring to the main ring also has access to the counter of its subring, or has its own counter identical to the ones of the monitors on the subring (in Figure 32 these are the routers R3 and R9). As each ring spans at most one voltage and frequency island, all the counters within a subring will have the same value every clock cycle. Each time a monitor makes an observation, it takes the counter value as is and puts it as a timestamp into the packet. The bridging routers, as mentioned before, are responsible for generating special timing packets. Such a packet is generated and emitted every time the K bit portion of the associated counter overflows. It is called an end of period (EOP) packet and designates the end of a period of 2^K clock cycles. Thus, the M bit portion of the counter


essentially keeps count of the number of such periods. The packet has a higher priority than the trace packets. In addition, such a packet is also generated on the main ring, e.g. by R0 in Figure 32.

The Debug/Trace Unit keeps count of the absolute number of periods for each subring (Per(subring)). Each time it receives an EOP packet of a particular subring S (EOP(S)), it increases the value of Per(S) by 1. When a trace packet is received, the absolute time of the observation can be computed by using the count Per of the subring from which the observation originates and its timestamp value. The absolute time T for an observation O_S with timestamp TS(O) from subring S with clock period T_clkS can be computed in the following way:

T(O_S) = T_clkS × (Per(S) × 2^K_S + (TS(O) mod 2^K_S))

However, this equation may lead to errors, as only the K bits are taken into account. For example, back in Figure 32, suppose monitor M0 generates an observation O at time t and, at time t+∆t, before the observation has left the subring, the K bits of the particular subring wrap around; R3 will then generate the EOP packet. As the EOP travels ahead of O, by the time O reaches the Debug/Trace Unit the EOP will already have increased the period count for the particular subring. Thus the above equation will render a false absolute time, as the observation was clearly produced in the previous period. The higher M bits compensate for this. In addition to the cycle count in the K bits, they carry an indicator of the period to which the observation relates. It is also not a full count, but a differential one. However, it allows a packet to be mapped to one of the last 2^M periods counted by the Debug/Trace Unit. The relative period counted for a subring S can be computed as Per_rel(S) = Per(S) mod 2^M_S. If Per_rel(S) < TS(O) div 2^K_S, then the following equation is used:

T(O_S) = T_clkS × ((Per(S) − (2^M_S + Per_rel(S) − (TS(O) div 2^K_S))) × 2^K_S + (TS(O) mod 2^K_S))

Otherwise the following equation is valid:

T(O_S) = T_clkS × ((Per(S) − (Per_rel(S) − (TS(O) div 2^K_S))) × 2^K_S + (TS(O) mod 2^K_S))
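The formulas above can be combined into one reconstruction routine; the sketch below (variable names are ours) applies the period correction carried by the M high bits:

```python
def absolute_time(ts, per, k, m, t_clk):
    """Reconstruct the absolute time of an observation from its N-bit
    differential timestamp `ts` (N = k + m), given the absolute period
    count `per` kept by the Debug/Trace Unit for the subring and the
    subring clock period `t_clk`."""
    cycle = ts % (1 << k)      # low K bits: cycle within the period
    p = ts >> k                # high M bits: differential period tag
    per_rel = per % (1 << m)   # Per_rel(S) = Per(S) mod 2^M
    if per_rel < p:            # the EOP count has wrapped past the tag
        period = per - ((1 << m) + per_rel - p)
    else:
        period = per - (per_rel - p)
    return t_clk * (period * (1 << k) + cycle)
```

With K = 8 and M = 2 (the configuration used in Section 6.5), an observation tagged in period 1 at cycle 44 of a 7 ns subring maps to 7 × 300 = 2100 ns even if the EOP for the next period overtakes it on the way to the Debug/Trace Unit.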

This approach allows a packet to experience a maximum delay of 2^N clock cycles from generation to exiting the subring before it gets wrongly mapped in the trace. Finally, in order to comply with the demand for partial/full system shutdown, time on the main ring is counted in a similar manner. As mentioned before, the router to which the Debug/Trace Unit is attached emits EOP packets for the main ring, and the unit counts them as well. The other routers on the main ring also have access to the counter on the main ring or have equivalent ones. Whenever a subsystem S is powered up, the router bridging the main ring to its subring receives a signal, which samples its current counter value. The router then sends a wake-up packet (WUP_S) containing the sampled value TS(WUP_S). When received at the Debug/Trace Unit, the absolute time T(WUP_S) is computed according to the above formulas and used as an offset T_offsetS for all observations from S that come after it. Thus, T'(O_S) = T_offsetS + T(O_S). As the main ring is never shut down during a trace/debug phase, this ensures proper alignment of the wakeup time of a subsystem to the time counted for the main ring. The timing error introduced by this method is contained within the interval [0, T_clkM), where T_clkM is the clock period of the main ring. This is due to the fact that the offset between the edges of the clocks on the main ring and the subring may not be known. In the case of a full power down, the whole system is switched off; after it is turned on, the trace/debug starts from time 0, which is also a true value for the whole system.

6.5 Results

A SystemC model of the proposed debug and trace solution has been implemented. The following results were obtained after performing numerous simulations and verifying the different aspects of the concept. The system used for the simulations has the following setup:

- 2 subsystems
- 5 monitors per subsystem
- 16-bit wide links
- T_clkM = 3 ns
- T_clkS0 = 7 ns
- T_clkS1 = 9 ns
- All counters have the same N = 10, K = 8 and M = 2
- Monitors inject random observations of lengths between 4 and 40 flits

In order to observe the effect of buffer space on the throughput of the debug and trace system, the flit depths of all routers and bridges within the system were swept between the values of 3 and 10. The system utilized routers with input buffers only. Figure 37 shows how the bandwidth changes according to the buffer resources.

It must be noted that the system in total contains 15 routers, thus each flit depth increment equates to 2×15 16-bit FIFO slots added to the system. There are only two bridges, and thus each flit added on the bridge axis equates to 2 16-bit FIFO slots added (the buffers on the upstream are not changed).

Figure 38 shows how the latency scales with buffer size. As the main ring runs at a significantly higher speed, increasing the buffer size on the bridge allows more information to be stored right before the high-speed interconnect, which reduces latency. If more buffers are added to the routers, there is more space for packets to reside in transit on the subrings.

Figure 37: Average throughput vs. buffer depth

The overall conclusion is that increasing the buffer space at the bridges is much more desirable, as it has a lower impact on area, a very good impact on bandwidth, and considerably reduces the latency.

Table 5 shows part of the trace produced by the Debug/Trace Unit during the simulation. The Tsim column shows the simulation time at which a packet was generated, and Testimate shows the time estimated by the Debug/Trace Unit based on the timestamping technique. The Id and Type columns show the id and the type of the different packets. The type Data designates a packet containing an observation, the type Loss means that the packet carries a loss count, and EOP packets are the period packets. Finally, Src designates the origin of the packet. For the EOP packets it


Figure 38: Average latency vs. buffer depth

designates the subsystem (0 = Main Ring, 1 = Subring0, 2 = Subring1). For all others it shows the monitor that generated the packet. As can be seen, the estimated time matches the time of the simulation, which shows that the timestamping technique is reliable. No misalignments between Tsim and Testimate were encountered during any of these simulations.

Figure 39 shows the latency distribution of packets containing observations during one simulation. It also provides a picture of how much latency can be tolerated by the proposed debug and trace system. Even with latencies approaching 3µs, the trace was correctly ordered.

Figure 40 shows how the system recovers after both subrings have been powered down and subsequently powered up. The error introduced after powering up the first subring is 33% of TclkM, which is equal to 1 ns. The power-up of the second subring comes with a greater error of 66% of TclkM (2 ns), but it does not cross the upper bound of TclkM. The corresponding power-up times and the errors introduced are listed in Table 6.
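
The bound on the power-up error can be verified with a few lines of arithmetic. The sketch below assumes TclkM = 3 ns, which is consistent with 33% of TclkM equalling 1 ns (the exact period is not stated in this section):

```python
T_CLK_M_NS = 3.0  # assumed main-ring clock period: 33% of it is the reported 1 ns

def power_up_error_ns(fraction_of_tclkm):
    """Error introduced at subring power-up, expressed in ns."""
    error = fraction_of_tclkm * T_CLK_M_NS
    # The error must never cross the upper bound of one TclkM.
    assert error <= T_CLK_M_NS
    return error

first = power_up_error_ns(1 / 3)   # first subring:  ~1 ns
second = power_up_error_ns(2 / 3)  # second subring: ~2 ns
```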

To determine the area requirements and frequency constraints, RTL models of the routers and the bridge were constructed and synthesized in a 40 nm technology. A subring with 6 routers and 1 bridge has an area requirement of 1002 flip-flops and 2190 gates (a 10-flit buffer in the bridge and 3-flit buffers in the routers). The maximum operating frequency was determined to be 800 MHz. A monitor model was also created which, depending on the desired functionality, scales between 5000 and 35000 gates.


Table 5: Comparison of simulation time vs. estimated time

Tsim     Testimate  Id         Type  Src |  Tsim     Testimate  Id         Type  Src
0 s      0 s        0x6aa8d0   EOP   0   |  1246 ns  1246 ns    0x1245d00  Data  3
0 s      0 s        0x6e8610   EOP   1   |  1274 ns  1274 ns    0x12461e0  Data  4
0 s      0 s        0x6d1b80   EOP   2   |  1287 ns  1287 ns    0x1246660  Data  8
189 ns   189 ns     0x6ae230   Data  4   |  1287 ns  1287 ns    0x1246720  Data  5
198 ns   198 ns     0x6ec400   Data  5   |  1359 ns  1359 ns    0x1246d00  Data  9
210 ns   210 ns     0x70c800   Data  3   |  1407 ns  1407 ns    0x12478a0  Data  1
243 ns   243 ns     0x71f990   Data  8   |  1467 ns  1467 ns    0x1248290  Data  7
279 ns   279 ns     0x742f00   Data  7   |  1470 ns  1470 ns    0x1248e30  Data  4
315 ns   315 ns     0x6b4e80   Data  2   |  1484 ns  1484 ns    0x12499d0  Data  0
343 ns   343 ns     0x123c190  Data  1   |  1536 ns  1536 ns    0x1249b00  EOP   0
351 ns   351 ns     0x123cbc0  Data  6   |  1638 ns  1638 ns    0x1249e80  Data  8
360 ns   360 ns     0x123d320  Data  9   |  1694 ns  1694 ns    0x124aa20  Data  2
378 ns   378 ns     0x123d8c0  Data  0   |  1792 ns  1792 ns    0x124ac30  EOP   1
504 ns   504 ns     0x123df70  Data  4   |  1818 ns  1818 ns    0x124b080  Data  6
567 ns   567 ns     0x123e3d0  Data  8   |  1820 ns  1820 ns    0x124b290  Data  3
630 ns   630 ns     0x123ec10  Data  3   |  1862 ns  1862 ns    0x124b830  Data  4
657 ns   657 ns     0x123f8b0  Data  9   |  1862 ns  1862 ns    0x124b8f0  Data  1
665 ns   665 ns     0x123fac0  Data  2   |  1863 ns  1863 ns    0x120c1f0  Data  5
768 ns   768 ns     0x123fcf0  EOP   0   |  1899 ns  1899 ns    0x120ce20  Data  9
826 ns   826 ns     0x12410d0  Data  1   |  1932 ns  1932 ns    0x120d370  Loss  0
828 ns   828 ns     0x1241ac0  Data  7   |  2043 ns  2043 ns    0x120db80  Data  7
861 ns   861 ns     0x1242030  Data  4   |  2106 ns  2106 ns    0x120e3c0  Data  8
882 ns   882 ns     0x1242b40  Data  0   |  2170 ns  2170 ns    0x120ea10  Data  1
909 ns   909 ns     0x1243140  Data  8   |  2178 ns  2178 ns    0x120f130  Data  6
945 ns   945 ns     0x12433e0  Data  3   |  2282 ns  2282 ns    0x120f6a0  Data  3
1080 ns  1080 ns    0x1244160  Data  6   |  2304 ns  2304 ns    0x120f840  EOP   0
1169 ns  1169 ns    0x1244940  Data  2   |  2304 ns  2304 ns    0x120f990  EOP   2
1176 ns  1176 ns    0x1245060  Data  0   |  2310 ns  2310 ns    0x1210260  Data  0

Table 6: Error after power-up

Power-up point   Error Min   Error Max
0                0           0
32000 ns         1 ns        1 ns
50002 ns         1 ns        2 ns

6.6 Conclusions

This work presented a highly versatile and robust debug and trace system. It uses minimal resources while at the same time providing great flexibility. The timestamping technique allows it to cope with partial or full shutdowns of the SoC while keeping the hardware architecture simple. It also makes the system highly tolerant to the latencies introduced by its simple, area-saving structure. Furthermore, it provides sufficient bandwidth, which can be increased further at the cost of area. The different levels of granularity also allow different levels of debug and a trade-off between the observations of different modules.


Figure 39: Latency distribution for packets containing data (observations)

Figure 40: Power up error


7 Architecture for Hardware and Software co-debug

7.1 Introduction

Hardware and software debug both require the use of an additional, separate control network: the software debug network has been presented in Section 6, while the hardware debug network was included in Deliverable 2.1. Beyond these, the OSR-Lite protocol for runtime reconfiguration, also described in Deliverable 2.1, requires routing information to be distributed to all the switches of the main Network-on-Chip. The focus of this section is to provide a unified dual network architecture that serves all these needs.

7.2 Architecture Detail

The control network must satisfy many different requirements:

• It must notify diagnosis bits of the BIST procedure from the switches of the main network to the global controller.

• It must be inherently fault tolerant, in order to ensure correct communication between the global controller and the switches during (re-)configuration procedures.

• It must provide fair arbitration, so that in case of congestion in this control network all monitors have an equal probability of dropping a communication trace when performing software debug.

• In addition to the previous features, it must implement runtime notification of transient errors affecting the attached switch, so as to allow runtime reconfiguration and isolation of possibly faulty elements.

• It must provide enough buffering to avoid congestion phenomena that may lead to observation losses during software debug.

• It must be able to satisfy the different bandwidth requirements of all the possible transmissions that take place through such a network. Software debug and OSR-Lite reconfiguration are the most bandwidth-hungry types of transmission. Software debug requires a sufficient flit width to avoid (or make less likely) observation losses, while during reconfiguration there is a clear trade-off between reconfiguration time and the information that can be transmitted in a single flit of a control network packet. For these reasons, the flit width of the control network bus has been set to 20 bits (as in the network for software debug).

• It must be connected to, and must serve, all the devices that require such a control communication network. This means that, differently from before, the control network switch has 3 input/output ports (one connecting the ring, one the observation monitor and one the attached switch) rather than 2 as in the previous implementations.
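
The 20-bit flit width discussed above can be illustrated with a small packing sketch. The payload size and helper name below are illustrative assumptions, not the deliverable's actual packet format:

```python
FLIT_WIDTH = 20  # flit width of the control network bus, as set above

def pack_into_flits(payload, payload_bits):
    """Split an integer payload into FLIT_WIDTH-bit flits, most significant first."""
    n_flits = -(-payload_bits // FLIT_WIDTH)  # ceiling division
    mask = (1 << FLIT_WIDTH) - 1
    return [(payload >> (i * FLIT_WIDTH)) & mask
            for i in range(n_flits - 1, -1, -1)]

# A hypothetical 40-bit observation fits in ceil(40/20) = 2 flits.
flits = pack_into_flits(0xABCDE12345, 40)
assert flits == [0xABCDE, 0x12345]
```

A larger payload simply costs more flits, which is exactly the trade-off noted above between reconfiguration time and per-flit information.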

Table 7 summarizes all the differences between the three instances of the control network.

Table 7: Differences between control network instances

                          Dual Switch for       Dual Switch for       Combined
                          Hardware Debug        Software Debug        Dual Switch
                          (Deliverable D2.1)    (This Deliverable)
Input Buffering           2 slots               3 slots               3 slots
Nr. of In/Out Ports       2                     2                     3
Flit Width                15                    20                    20
Fault Tolerant            Yes (TMR)             No                    Yes (TMR)
Arbitration Protocol      Simple                Advanced              Advanced
                          (Fixed Priority)      (Input Weighted)      (Input Weighted)
Diagnosis Notification    Yes                   No                    Yes
Transient Notification    No                    No                    Yes
Support for OSR           No                    No                    Yes
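
The "Input Weighted" arbitration listed in Table 7 can be sketched as a credit-based arbiter. The class below is a toy model; the port names and weights are our assumptions, not the actual RTL:

```python
class WeightedArbiter:
    """Toy input-weighted arbiter: ports with larger weights win more grants."""

    def __init__(self, weights):
        self.weights = dict(weights)   # static weight per input port
        self.credits = dict(weights)   # remaining credit per port

    def grant(self, requests):
        live = [p for p in requests if self.credits[p] > 0]
        if not live:  # all requesters out of credit: refill and retry
            for p in requests:
                self.credits[p] = self.weights[p]
            live = list(requests)
        winner = max(live, key=lambda p: self.credits[p])
        self.credits[winner] -= 1
        return winner

# With a hypothetical 2:1:1 weighting, the ring port gets half of the grants
# when all three ports request continuously.
arb = WeightedArbiter({"ring": 2, "monitor": 1, "switch": 1})
grants = [arb.grant(["ring", "monitor", "switch"]) for _ in range(8)]
assert grants.count("ring") == 4
```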


Figure 41 presents a comparison between the three implementations of such a control network. It can be seen that, thanks to resource sharing, it is possible to provide additional functionality (runtime reconfiguration with OSR-Lite and transient notification are implemented only in the new network) at little more than the summed cost of the previous implementations. To appreciate this effect, it should be recalled that the network for software debug is natively not triplicated.

Figure 41: Comparison between different control network implementations


8 Conclusions

Modern Networks-on-Chip should be integrated into a reliable framework taking care of fault detection, diagnosis and network reconfiguration. In this scenario, external testers for nanoscale chip testing face severe concerns: lack of scalability of test data volumes, high cost of full-clock-speed testing, and poor suitability for extending production testing to lifetime testing. As a consequence, a migration from external testers to built-in self-test (BIST) infrastructures becomes a must.

In this direction, the deliverable presented four scalable built-in self-test and self-diagnosis infrastructures for NoCs, providing a wide exploration of different testing strategies. Table-less logic-based distributed routing was the foundation of all our approaches, and enabled network reconfiguration with only 10 diagnosis bits per switch. Customizations of conventional testing strategies for the NoC environment were proposed, taking full advantage of the network's intrinsic structural redundancy through cooperative testing and diagnosis frameworks:

• First, the NoC switch was enhanced with a conventional scan chain-based mechanism. When the scan chain was implemented automatically by means of a synthesis tool, a huge number of test patterns was required to support the load and unload of the chains. Thus, the switch area overhead was not affordable due to the size of the built-in TPG.

Therefore, a novel customized framework was envisioned to reduce the number of test patterns. Dedicated scan chains were reserved for each switch port, so the same test patterns could be reused for multiple scan chains, cutting down the TPG area footprint. Although 5 times fewer test patterns were required for a 5x5 switch, the area overhead was still higher than 130%. As a result, the scan-based approach proved unsuitable for use within a built-in self-testing strategy in NoCs.

• Second, the switch was tested by means of conventional pseudo-random patterns generated by a built-in LFSR module. The solution yields relevant area savings and enhanced flexibility. However, pseudo-random patterns suffer from low coverage on the control path: coverage increases only logarithmically with the number of clock cycles, so even a high testing latency achieves only low coverage.

In order to tackle the intrinsic drawbacks of pseudo-random patterns, test responses of switch sub-blocks were reused and test pattern optimizations were introduced. Finally, 96% switch coverage was reached in 2000 clock cycles with a 16% area overhead. The pseudo-random pattern framework proved an appealing solution for a low-area, highly flexible NoC, provided the devised customizations are applied.

• Third, a low-latency alternative to the pseudo-random approach was proposed. Differently from the previous solutions, the framework relied on a non-conventional strategy based on deterministic test patterns. The test patterns were handcrafted by exploiting knowledge of the device under test. Thus a coverage close to 100% was achieved by means of a few test patterns (i.e., 1104 clock cycles). Clearly, these latter results were achieved at the cost of lower flexibility and higher area overhead (37%) with respect to the pseudo-random counterpart. In conclusion, the deterministic test pattern-based strategy represents the best solution for high-performance, high-reliability NoCs.

• Finally, we proposed one of the first BIST/BISD frameworks for GALS Networks-on-Chip based on asynchronous handshaking. We exploited the outcome of the previous testing explorations to achieve high coverage and low area overhead by means of cooperative diagnosis and efficient test strategies. Since the pseudo-random and the deterministic-based approaches were the best performing in the fully synchronous system, both of them were implemented and compared in the multi-synchronous scenario. In conclusion, area and flexibility were traded for coverage and latency when moving from the first to the second approach.
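
The pseudo-random pattern generation of the second approach can be illustrated with a toy LFSR. The sketch below uses a 16-bit Galois LFSR with a well-known maximal-length polynomial; the actual TPG width and taps of the deliverable's implementation are not specified here:

```python
def lfsr16(seed=0xACE1):
    """16-bit Galois LFSR (x^16 + x^14 + x^13 + x^11 + 1) yielding pseudo-random patterns."""
    state = seed
    while True:
        yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400  # feedback mask for a maximal-length (2^16 - 1) sequence

gen = lfsr16()
patterns = [next(gen) for _ in range(4)]
assert patterns == [0xACE1, 0xE270, 0x7138, 0x389C]
```

Such a generator costs only a shift register and a few XOR gates, which is why the LFSR-based approach materializes the area savings mentioned above.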

While the four above-mentioned infrastructures were assessed to tackle the testing and diagnosis of the NoC hardware, the last section of the deliverable addressed the debugging of the software in the system. A novel debug and trace system has been devised, exploiting information such as the delay, contents and effects of different processing blocks on a particular transaction to optimize and debug the software.


Both the proposed software and hardware testing mechanisms rely on the dual-network concept introduced in WP2 for the delivery of control information to a centralized manager. First, the dual network is required by the built-in hardware mechanisms to carry the diagnosis bits from the switches to the global manager and the routing reconfiguration bits from the manager to the switches. Second, the dual network is required to notify the occurrences of transient faults to the global controller for the sake of intermittent fault identification. Third, a hierarchical version of the dual network guarantees reliable and latency-tolerant tracing for the debugging system. However, the microarchitectural requirements of the dual network in these cases are different. In this deliverable we have merged such requirements into a unified dual network architecture capable of meeting them all.

