
A

MINOR PROJECT REPORT

ON

Implementation of Network on chip on FPGA and challenges

Submitted to

Malaviya National Institute of Technology, Jaipur

for the partial fulfillment for the award of the degree

of

Master of Technology (Embedded Systems)

by

Manish Tailor

under the guidance of

Dr. Lava Bhargava

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY

MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY

Department of Electronics & Communication Engineering

CERTIFICATE

This is to certify that the minor project report entitled "Implementation of Network on chip on FPGA and challenges" has been successfully completed and presented by Manish Tailor, First Year, II Semester, in partial fulfillment of the degree of Master of Technology in Embedded Systems during the academic year 2014-2016, under my guidance and supervision in the department, and is approved for submission.

Guided by

Date: Dr. Lava Bhargava

ECE Deptt.

ACKNOWLEDGEMENTS

With deep regards and profound respect, I avail this opportunity to express my deep sense of gratitude and indebtedness to Dr. Vineet Sahula, Head of the Department of Electronics and Communication Engineering, NIT Jaipur, for his valuable guidance and support. I am deeply indebted for the valuable discussions at each phase of the project. I consider it my good fortune to have had the opportunity to work with such a wonderful person.

I would like to give my sincere thanks and gratitude to my esteemed supervisor Dr. Lava Bhargava (Department of Electronics and Communication Engineering, Malaviya National Institute of Technology, Jaipur) for providing his valuable guidance and encouragement to learn new things and for enhancing my knowledge in the field of on-chip networks. His kind cooperation and suggestions throughout the course of this research gave me the impetus to work and to complete the project successfully.

I convey my special thanks to Mrs. Ashish Sharma (Research Scholar, Department of Computer Engineering, Malaviya National Institute of Technology, Jaipur) for their motivation and persistent support, which made the work possible.

Contents

Abstract
List of Figures

1 Introduction
1.1 Objective
1.2 Objective and Motivation
1.3 Report Organization

2 Introduction to On Chip Networks (NoC or InterConnects)
2.1 Introduction
2.2 Interconnect Design Topology
2.2.1 Mesh Topology
2.2.2 Cube (or Hypercube) topology
2.2.3 2-D Torus
2.2.4 K-ary N-Cube
2.2.5 Omega or Butterfly Network
2.2.6 Fat Tree Networks
2.2.7 Star (or fully connected Crossbar)
2.3 Parameters of a Topology
2.3.1 Routing Distance
2.3.2 Diameter of the Network
2.3.3 Average Distance
2.3.4 Minimum Bisection Bandwidth
2.4 Routing Techniques
2.5 Flow Control Techniques for NoC

3 Router Components
3.1 Input Channel
3.2 Routing Logic
3.3 Switch Allocator
3.4 Crossbar Switch

4 Anatomy of a Message and Our Implementation
4.1 Packets
4.2 Flits and Phit
4.2.1 Head Flit
4.2.2 Body Flit
4.2.3 Tail Flit
4.2.4 Phits
4.3 Flit formats for communication
4.4 Flit Format
4.5 Router Architecture
4.5.1 Input Channel
4.5.2 Switch Allocation
4.5.3 Crossbar Switch

5 Logic synthesis from HDL
5.1 Architecture of Synthesizers
5.1.1 Front end
5.1.2 Back end
5.2 FPGAs
5.3 Components of FPGA
5.3.1 Configurable Logic Blocks (CLBs)
5.3.2 Configurable I/O Blocks
5.3.3 Programmable Interconnect
5.4 FPGA programming
5.4.1 Design Input
5.4.2 Synthesis
5.4.3 Design Constraining
5.4.4 Implementation
5.4.5 Bitstream Creation

6 Conclusions and Future Challenges

Bibliography

ABSTRACT

Network-on-Chip (NoC) is an emerging paradigm for on-chip communication in complex system-on-chip (SoC) designs and for designing VLSI/ULSI circuits. It is an attractive alternative to traditional bus-based interconnects, which become a communication bottleneck in the billion-transistor era. A NoC provides scalable communication and decouples communication from computation. Data is routed through the network as packets, mainly by routers, so the router architecture must be efficient, with low latency and high throughput. Pin density and wiring density are growing more slowly than the number of components (processors and memories), and most high-end systems spend a larger fraction of their power driving wires than on useful computation. Interconnection networks, or NoCs, are emerging as the solution to system-level communication.


List of Figures

2.1 Mesh Network
2.2 Cube Network
2.3 2-D Torus
2.4 K-ary N-Cube
2.5 Omega Networks
2.6 Fat Tree Network
2.7 Star Network
2.8 Minimum Bisection Bandwidth for the Mesh
2.9 Path in X-Y Routing
2.10 On-Off Stall Signals
2.11 Pipeline for the Flits in flight
2.12 Simple credit based flow control

3.1 Virtual Channel Router
3.2 Input channel with four virtual channels
3.3 A 5 x 5 Crossbar Switch

4.1 Packet Structure
4.2 Message, Packets, Flits and Phits
4.3 Virtual Channel State Register
4.4 Head Flit Structure
4.5 Body Flit Structure
4.6 Tail Flit Structure
4.7 Input Channel States
4.8 Input Channel States

Chapter 1

Introduction

1.1 Objective

For years, systems have been designed around a bus architecture, in which the bus is a shared communication medium. This architecture has several problems. The bandwidth of the bus is limited and is shared by everything attached to it. The arbitration logic needs multiple levels of complex logic to guarantee some fairness in bus allocation. Wire delay and capacitance cause physical problems, and more clock cycles are spent driving wires than driving logic gates. A bus seems an obvious choice of communication medium, but the complexity and cost of communication grow as the number of communicating entities increases. Routing packets instead of wires is therefore a promising alternative.

1.2 Objective and Motivation

In 1965, Gordon Moore, who later co-founded Intel Corp., observed that the number of transistors on an integrated circuit was doubling at a regular interval, a rate commonly quoted as a doubling every 18 months. This trend, known as Moore's law, has continued to the present day: Intel's Poulson processor has more than 3 billion transistors. It has given us systems with tens to hundreds of IP cores (Intellectual Property cores), such as the Tilera Corp. Tile64 and the MIT RAW machine; Intel Corp. also has an 80-core research chip. These chips run very complex applications, and conventional communication methods such as multi-drop buses do not scale efficiently to meet their performance requirements.

Networks on Chip are proposed as a solution to the problems of conventional bus technology. A NoC is a system that takes a message from one terminal and delivers it to the destination terminal. The terminals can be memory arrays, registers, caches, processors, different ALUs in the same processor, and so on. This type of communication system is becoming very common in systems where several components are integrated together and need to communicate with each other. Multi-drop buses were traditionally used for interconnect design and are still an important part of many systems, but almost all new interconnects are designed using point-to-point switching. The reason is the non-uniform performance scaling of wires: wire speed does not improve with each change in semiconductor technology. Under these conditions, point-to-point interconnects provide the required bandwidth and operate faster and concurrently, so they are rapidly replacing multi-drop bus systems. A NoC also separates communication from computation. Computation is done by the IP core, which hands its communication to the nearest router; the router is then responsible for carrying the data to the destination.

Designing a NoC-based system involves validation and verification, that is, experimenting with the chosen topology, routing algorithm, switching mechanism, buffer size and so on. System verification sometimes also requires emulation. Writing a behavioral description of the system in an HDL and then realizing it on actual hardware gives important insight into the system's actual behavior.

1.3 Report Organization

The rest of this report is organized as follows. Chapter 2 introduces interconnect design (NoC), the topologies proposed for NoC, the suggested routing algorithms and flow control. Chapter 3 discusses the components of a router and virtual channels. Chapter 4 presents the anatomy of a message and defines the related terms.


Chapter 2

Introduction to On Chip Networks (NoC or InterConnects)

2.1 Introduction

Networks on Chip are proposed as a solution to the problems of conventional bus technology. A NoC is a system that takes a message from one terminal and delivers it to the destination terminal. The terminals can be memory arrays, registers, caches, processors, different ALUs in the same processor, and so on. This type of communication system is becoming very common in systems where several components are integrated together and need to communicate with each other. A NoC separates communication from computation. Computation is done by the IP core, which hands its communication to the nearest router; the router is then responsible for carrying the data to the destination.

2.2 Interconnect Design Topology

Multiple arrangements of routers have been proposed in academia and industry. Any arrangement of the routers is known as a topology. The routers are a shared resource of the interconnect network. The choice of topology depends on factors such as the packaging technology, the bandwidth requirement and latency. There is no "perfect" topology; different topologies are chosen under different constraints and requirements. Several topologies are possible for a NoC, including the following.


2.2.1 Mesh Topology

This is one of the simplest topologies in interconnect networks: the router nodes are simply arranged in a grid to form a mesh. Because of its simplicity and its fit with 2D packaging, it is one of the topologies used in industry. The Tilera Corp. Tile64 and TilePro64 use 2D mesh interconnect networks.

Figure 2.1: Mesh Network

2.2.2 Cube (or Hypercube) topology

This topology can be considered a 3D extension of the 2D mesh. Instead of a planar arrangement, the router nodes are placed at the corners of a cube. The idea generalizes to a K-dimensional cube: if every node in a cubic arrangement has out-degree K, the cube is said to be a K-dimensional cube. An example is the Connection Machine-1 (CM-1), which was a hypercube. At the time of the CM-1, wires were faster than transistors. This topology is fine for small systems, but it is not the right choice if the wires get too long.

2.2.3 2-D Torus

This topology is similar to the 2D mesh, except that the top-edge and left-edge routers are connected to the bottom-edge and right-edge routers respectively. The arrangement is regular, so it is well matched to packaging constraints. It allows high-speed operation without repeaters, and the nodes are physically close to each other. For these reasons the torus, along with the mesh, is among the most preferred topologies.


Figure 2.2: Cube Network

Figure 2.3: 2-D Torus

2.2.4 K-ary N-Cube

This topology looks like a hypercube, with the basic difference that not all nodes have the same out-degree. The name contains two values, K and N: K is the number of nodes along one dimension and N is the number of dimensions. A 3-ary 3-cube therefore has 3 nodes along each of 3 dimensions, that is, 3 x 3 x 3 = 27 nodes. The radix can also be mixed. For example, a 2,3,4-ary 3-cube has 2 nodes in the X dimension, 3 in Y and 4 in Z. These topologies are built for packaging reasons and modularity, but they are edge-asymmetric, so they cause load imbalance and non-uniform traffic patterns.
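The node count and the non-uniform out-degree described above can be checked with a short Python sketch. It assumes, as the text implies, that there are no wraparound links at the edges; the function names are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod


def node_count(radices):
    """Total number of nodes: the product of the nodes per dimension."""
    return prod(radices)


def out_degree(node, radices):
    """Neighbours of `node`, assuming no wraparound links at the edges."""
    degree = 0
    for coord, k in zip(node, radices):
        if coord > 0:        # neighbour in the negative direction
            degree += 1
        if coord < k - 1:    # neighbour in the positive direction
            degree += 1
    return degree


def degree_histogram(radices):
    """Map each out-degree to the number of nodes that have it."""
    histogram = {}
    for node in product(*(range(k) for k in radices)):
        d = out_degree(node, radices)
        histogram[d] = histogram.get(d, 0) + 1
    return histogram


if __name__ == "__main__":
    print(node_count((3, 3, 3)))        # 3-ary 3-cube
    print(node_count((2, 3, 4)))        # mixed radix 2,3,4-ary 3-cube
    print(degree_histogram((3, 3, 3)))  # corner, edge, face and centre nodes differ
```

The histogram makes the edge asymmetry visible: corner nodes have fewer links than interior nodes, which is one source of the load imbalance mentioned above.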

2.2.5 Omega or Butterfly Network

This topology is a good example of an indirect interconnect. In an indirect interconnection network, the nodes do not have routers built into them. Instead they are connected to each other via a multistage network of routers that switches the packets. This is similar to the internet, where an end host does no routing. Interconnects of this type are very important when we want to design a massively parallel machine such as a Cray machine. An omega network built from 2 x 2 switches needs at least log2 N stages, where N is the number of nodes to connect. With exactly log2 N stages there is a unique path between each pair of nodes. With more stages we get path diversity, which helps tolerate faults and congestion.

Figure 2.4: K-ary N-Cube
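The stage count and the unique path can be illustrated with a small Python sketch. It assumes 2 x 2 switches, a power-of-two number of nodes, and destination-tag routing over a perfect-shuffle wiring; that routing scheme is a standard textbook technique and is not described in this report, so it is used here only as an assumption to make the path visible.

```python
def omega_stages(num_nodes):
    """Number of stages of 2 x 2 switches: log2 of the node count."""
    if num_nodes < 2 or num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of two >= 2")
    return num_nodes.bit_length() - 1


def omega_path(src, dst, num_nodes):
    """Line positions visited from src to dst, one entry per stage boundary.

    Each stage applies a perfect shuffle (shift the address left by one bit)
    and then the switch sets the lowest bit to the next destination bit,
    taken from the most significant bit downwards.
    """
    stages = omega_stages(num_nodes)
    mask = num_nodes - 1
    position = src
    path = [position]
    for stage in range(stages):
        dest_bit = (dst >> (stages - 1 - stage)) & 1
        position = ((position << 1) & mask) | dest_bit
        path.append(position)
    return path


if __name__ == "__main__":
    print(omega_stages(8))      # 3 stages for 8 nodes
    print(omega_path(5, 2, 8))  # the path ends at the destination line
```

Because every switch setting is fixed by one destination bit, there is exactly one path per source-destination pair when the network has log2 N stages.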

Figure 2.5: Omega Networks

2.2.6 Fat Tree Networks

As the name suggests, the layout has the shape of a tree. The leaf nodes are IP cores and the internal nodes are routers. So that a node can communicate with the other half of the tree under its parent, the link width is doubled at each level up the hierarchy, which provides enough bandwidth for communication.

2.2.7 Star (or fully connected Crossbar)

In this arrangement each router is connected to all other routers. The topology provides very good bisection bandwidth, but it is not generally manufactured because of packaging constraints. With a large number of nodes, the nodes have to be packed in three dimensions and the wires connecting them become a problem.

Figure 2.6: Fat Tree Network

Figure 2.7: Star Network

2.3 Parameters of a Topology.

In order to compare different networks, we need some parameters for the topologies so that we can compare their relative performance. Some of the most widely used network topology parameters are the following.

2.3.1 Routing Distance

This is effectively the number of links that a packet must traverse to go from source to destination. It is neither a best-case nor a worst-case link count, but the number of links for a particular pair of nodes.


2.3.2 Diameter of the Network.

The diameter of the network is defined as the maximum routing distance between any two nodes. It is an important parameter because we always want to minimize the maximum routing distance. For a square mesh, the diameter is $2\sqrt{N} - 2$, where N is the number of nodes in the mesh.

2.3.3 Average Distance.

Average distance is the average routing distance between each pair of nodes. We compute the total routing distance between each pair and divide it by the number of pairs.
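Routing distance, diameter and average distance can be computed by brute force for a small square mesh, as in the sketch below. It assumes minimal routing, so the routing distance between two nodes is their Manhattan distance; the function names are hypothetical.

```python
from itertools import product


def mesh_nodes(side):
    """All (x, y) coordinates of a side x side mesh."""
    return list(product(range(side), repeat=2))


def routing_distance(a, b):
    """Links traversed between a and b under minimal routing (assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def diameter(side):
    """Maximum routing distance over all pairs of nodes."""
    nodes = mesh_nodes(side)
    return max(routing_distance(a, b) for a in nodes for b in nodes)


def average_distance(side):
    """Total routing distance over all distinct pairs, divided by the pair count."""
    nodes = mesh_nodes(side)
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(routing_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    side = 4                       # N = 16 nodes
    print(diameter(side))          # compare with 2 * sqrt(N) - 2
    print(average_distance(side))
```

For a 4 x 4 mesh the brute-force diameter agrees with the closed form for the mesh, since the two farthest nodes sit in opposite corners.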

2.3.4 Minimum Bisection Bandwidth.

Minimum bisection bandwidth is defined as the bandwidth of a minimal cut through the network that divides it into two equal sets of nodes. For a square mesh, the minimum bisection bandwidth is $2\sqrt{N}$ channels, where N is the number of nodes in the mesh. Here we assume that the links are bidirectional, so each link crossing the cut counts as two channels.

Figure 2.8: Minimum Bisection Bandwidth for the Mesh.
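As a rough check of the bisection figure, the sketch below cuts a square mesh down the middle and counts the links that cross the cut. It assumes an even side length and counts every bidirectional link as two unidirectional channels, which for a mesh of N nodes gives twice the square root of N; the function name is hypothetical.

```python
def mesh_bisection_channels(side):
    """Channels crossing a vertical cut through the middle of a side x side mesh."""
    if side < 2 or side % 2:
        raise ValueError("side must be an even number >= 2")
    half = side // 2
    crossing_links = 0
    for y in range(side):
        for x in range(side - 1):
            # Horizontal link between (x, y) and (x + 1, y); it crosses the
            # cut when its endpoints lie in different halves.
            if x < half <= x + 1:
                crossing_links += 1
    # Each bidirectional link is counted as two channels (assumption).
    return 2 * crossing_links


if __name__ == "__main__":
    print(mesh_bisection_channels(4))  # N = 16 nodes
```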


2.4 Routing Techniques

Routing logic decides which direction a packet should take in order to reach its destination. Each packet must carry enough information to reach the destination, and depending on the design decisions this routing information is placed in the packet. The minimum information the routing logic needs to decide the next direction is the destination address; in that case the sender is not identified and no response can be sent. Routing logic can be considerably more complex than this. Routing algorithms are critical for several reasons. A good topology gives us the potential for performance, but a good routing algorithm is what lets us achieve that potential. A good routing algorithm balances the load across the network even when traffic generation is non-uniform.

Routing decision may or may not depend on the network state. Routing can be of two types:

Deterministic and Oblivious Routing:

Deterministic algorithms always choose the same path for a given source-destination pair, even if multiple paths exist between the pair. They do not consider the network state while computing the path, so they do a poor job of load balancing. But because they are easy to implement and can be made deadlock-free, they are quite common in practice.

Oblivious routing likewise does not take the state of the network into consideration. Oblivious algorithms include the deterministic algorithms as a subset.

Adaptive Routing:

As the name suggests, adaptive routing algorithms consider the present network state and, by looking at the traffic and hotspots, try to route packets around the congested spots. The network state variables that can be considered include node state, link state and buffer queue lengths.

In my work, I chose dimension-order routing, also called X-Y routing. It is one of the most intuitive ways to route packets in a mesh network. Since a mesh has two dimensions, the easiest way to send a packet to its destination is to send it along the X dimension first and then along the Y dimension. In other words, the dimensions are ordered: a packet travels in X until it reaches the destination column and then in Y, and no other order is allowed.


Figure 2.9: Path in X-Y Routing
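The path taken under X-Y routing can be traced with a few lines of Python. This is only a sketch: nodes are identified by hypothetical (x, y) coordinates, and the function name is chosen for illustration.

```python
def xy_path(src, dst):
    """Nodes visited from src to dst: all X hops first, then all Y hops."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:              # travel along the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:              # then along the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_path((0, 0), (2, 1)))
```

Because the order of dimensions is fixed, the same source-destination pair always produces the same path, which is what makes the algorithm deterministic.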

2.5 Flow Control Techniques for NoC

Every network is expected to guarantee delivery of flits. We want a network that is resilient and does not drop packets even at peak load. Flow control regulates the traffic so as to optimize bandwidth, delay or some other parameter, and it makes sure that the network does not drop data. It is also responsible for allocating resources so that the network achieves the highest possible fraction of its ideal performance. Flow control can be link-based, in which case we regulate the traffic flowing through each link, or end-to-end, in which case we regulate the round-trip traffic between source and destination so that we do not flood the network. Depending on the design decisions we may choose either, and systems have been designed that layer both types. As an example of layered flow control, consider a node that communicates with a distant memory controller. The node is allowed to have five memory transactions in flight, but for the sixth it has to wait for an acknowledgement to come back. In addition, each link on the path may have its own flow control to ensure that the buffers are not overrun. We can use either of the following flow control approaches:


On-Off stall signals :

This approach is intuitive. The idea is very simple: if you cannot accept any more data, stop the senders and process the data you already have. This situation arises when a buffer is about to overflow. A stall wire runs from the receiver to its neighbors. As soon as the receiver faces a buffer overflow, it generates a stall signal, which is propagated to the sender, and the sender stops sending data. In Figure 2.10, the sender is sending data to the receiver over a pipelined link. At some point the receiver finds that it cannot accept any more data because it has to process D. This condition passes through some combinational logic and a stall signal is generated. The stall signal runs back to the sender, stalling the pipeline stages in the link as well as the sender. As long as the stall signal is high, the sender cannot make progress.

Figure 2.10: On-Off Stall Signals

Figure 2.11: Pipeline for the Flits in flight

Here A, B, C and D are individually flow-controllable units. In Figure 2.11, we can see that the pipeline is stalled while D is being processed. This is not desirable in terms of cycle time. The other issue with the stall wire is that the stall signal is generated by combinational logic and then has to travel all the way to every functional unit. If the cycle time is close to the propagation delay of the wire, this can kill the performance, so this design is not chosen for high-performance networks.

Credit based flow control :

This approach is based on feedback signals that tell the sender how many buffer locations are free at the next hop. The number of free buffers is called the credit. If a destination or next hop has no credits, flits are not forwarded to it in the next cycle. Each sender keeps a counter of the buffer locations available at the receiver. Whenever the sender sends data, it decrements the counter; whenever a credit is received, it increments the counter. When the counter reaches zero, the sender assumes that all the buffer space covering the round-trip latency is occupied and stops sending. Credit-based flow control has two important concerns:

a. If the credit counter can count higher than the number of buffers, packets will be dropped, because the credit count will not be zero when the buffers are actually full.

b. If the credit counter counts fewer than the number of buffers, we get less than the ideal bandwidth and performance: the sender sends some data and then stalls even though buffers are free. No data is lost in this case.

Figure 2.12: Simple credit based flow control
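The credit counter behaviour can be sketched as a small cycle-by-cycle simulation in Python. This is a simplified model under stated assumptions, not the implementation used in this work: credits return to the sender in the same cycle in which the receiver frees a buffer, the receiver frees one buffer every `drain_period` cycles, and all class and parameter names are hypothetical.

```python
from collections import deque


class CreditSender:
    """Sender side: one counter of free buffer slots at the receiver."""

    def __init__(self, initial_credits):
        self.credits = initial_credits

    def can_send(self):
        return self.credits > 0

    def send(self):
        self.credits -= 1       # one buffer slot is now assumed occupied

    def receive_credit(self):
        self.credits += 1       # the receiver freed a slot


def simulate(flits, buffer_slots, drain_period, initial_credits=None):
    """Return (delivered flits, number of cycles the sender was stalled)."""
    if initial_credits is None:
        initial_credits = buffer_slots
    sender = CreditSender(initial_credits)
    pending = deque(flits)
    buffer = deque()
    delivered, stall_cycles, cycle = [], 0, 0
    while pending or buffer:
        cycle += 1
        if pending:
            if sender.can_send():
                if len(buffer) >= buffer_slots:
                    raise OverflowError("flit dropped: receiver buffer is full")
                sender.send()
                buffer.append(pending.popleft())
            else:
                stall_cycles += 1
        if cycle % drain_period == 0 and buffer:
            delivered.append(buffer.popleft())
            sender.receive_credit()
    return delivered, stall_cycles


if __name__ == "__main__":
    print(simulate(list(range(6)), buffer_slots=2, drain_period=2))
```

Setting `initial_credits` above `buffer_slots` reproduces concern (a) as an overflow, while setting it below reproduces concern (b) as extra stall cycles without data loss.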

Chapter 3

Router Components

A router contains registers, switches, functional units and routing logic that collectively implement the routing and flow control functions required to store packets and route them to their destination. Many router architectures are available in industry and academia. This chapter describes the virtual channel router, which is pipelined at the flit level. Most modern router architectures use a similar mechanism: virtual channels and credits for buffer allocation.

Figure 3.1: Virtual Channel Router

A Virtual-Channel router concretely has the following components:

Input Channels

Virtual Channels


Routing Logic

Virtual Channel Allocation

Switch Allocator

Crossbar Switch

Output Channels

3.1 Input Channel.

Input channels are the interfaces responsible for taking packets from the sender and storing them. An input channel is a FIFO queue of a certain size. In a virtual channel router, each input channel is partitioned into several separate channels, each of which can work as an independent input channel. These are known as virtual channels.

Figure 3.2: Input channel with four virtual channels

Virtual channels are just logical partitions of the single input buffer into multiple buffers. Each virtual channel stores some flits and maintains its own state. Typically each virtual channel maintains five state fields (G, R, O, P, C) in a register in order to forward flits towards the destination. The G field holds the state of the virtual channel (Idle / Routing / Waiting for VC / Active). The R field stores the next-hop direction towards the destination of the packet. The O field stores the ID of the VC assigned to the packet at the next hop. The P field stores the pointers to the head and tail flits in the buffer. The C field stores the credits.
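The five fields can be pictured as a small record, sketched below in Python. The field types, the direction strings and the order of state transitions are assumptions made for illustration; the actual register layout of this work is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VCStatus(Enum):
    IDLE = "idle"
    ROUTING = "routing"
    WAITING_FOR_VC = "waiting for VC"
    ACTIVE = "active"


@dataclass
class VCStateRegister:
    g: VCStatus = VCStatus.IDLE   # G: state of the virtual channel
    r: Optional[str] = None       # R: next-hop direction (hypothetical encoding)
    o: Optional[int] = None       # O: VC ID assigned at the next hop
    p_head: int = 0               # P: pointer to the head flit in the buffer
    p_tail: int = 0               # P: pointer to the tail flit in the buffer
    c: int = 0                    # C: credits for the next-hop VC

    def head_flit_arrived(self):
        if self.g is not VCStatus.IDLE:
            raise RuntimeError("virtual channel is already reserved")
        self.g = VCStatus.ROUTING

    def route_computed(self, direction):
        self.r = direction
        self.g = VCStatus.WAITING_FOR_VC

    def vc_allocated(self, next_hop_vc, credits):
        self.o = next_hop_vc
        self.c = credits
        self.g = VCStatus.ACTIVE

    def tail_flit_departed(self):
        # The tail flit frees the virtual channel for other packets.
        self.g, self.r, self.o, self.c = VCStatus.IDLE, None, None, 0


if __name__ == "__main__":
    vc = VCStateRegister()
    vc.head_flit_arrived()
    vc.route_computed("EAST")
    vc.vc_allocated(next_hop_vc=2, credits=4)
    print(vc)
```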

In the simplest scenario, with no virtual channels, on-chip networks may suffer from mixing of traffic from multiple sources. To solve this problem, a channel is reserved for a packet so that it can pass without mixing with traffic from another source. But with a single channel per input, reservation leads to deadlocks and poor performance. We therefore use several virtual channels at each input channel, so that other options remain when one channel is reserved. The head flit reserves the virtual channel and the tail flit frees it.

3.2 Routing Logic :

Routing is the process of selecting a path from the source node to the destination node in a particular topology under given constraints such as congestion, faults and link failures. The topology acts as the road map and routing solves the path-finding problem. The topology determines the ideal performance; routing determines the achievable performance. In our work, we have used simple dimension-order routing (X-Y routing) to route flits to the destination. Each input channel has its own dedicated routing logic. The routing logic knows its tile's ID and takes the destination ID and two other parameters: the number of tiles in the X dimension and the number of tiles in the Y dimension. It then uses simple modulo arithmetic to determine the next direction in which to send the flit.
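The modulo arithmetic can be sketched as follows. The sketch assumes that tile IDs are numbered row by row starting from 0 at the top-left corner and that the five ports are called EAST, WEST, NORTH, SOUTH and LOCAL; the numbering scheme and the port names are assumptions, since they are not specified here.

```python
def next_direction(tile_id, dest_id, tiles_x, tiles_y):
    """Output port for a flit at tile_id heading to dest_id under X-Y routing."""
    total = tiles_x * tiles_y
    if not (0 <= tile_id < total and 0 <= dest_id < total):
        raise ValueError("tile ID outside the mesh")
    # Row-major numbering (assumption): column by modulo, row by division.
    x, y = tile_id % tiles_x, tile_id // tiles_x
    dest_x, dest_y = dest_id % tiles_x, dest_id // tiles_x
    if dest_x > x:
        return "EAST"
    if dest_x < x:
        return "WEST"
    if dest_y > y:
        return "SOUTH"   # rows are assumed to be numbered top to bottom
    if dest_y < y:
        return "NORTH"
    return "LOCAL"       # the flit has reached its destination tile


if __name__ == "__main__":
    print(next_direction(tile_id=0, dest_id=5, tiles_x=4, tiles_y=4))
```

The X comparison is made first, so a flit only turns into the Y dimension once it is in the destination column.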

3.3 Switch Allocator :

Once routing and virtual channel allocation are done, we proceed to switch allocation. At this point we know the next direction for the flit, the VC that will store the flit at the next hop, and the VC that is requesting switch allocation. The arbiter selects one VC from each input channel, in any order. Each VC has a register storing the GROPC fields, and a VC can request switch allocation only if its state is Active.

Next direction, next hop VC identifier and requesting VC identifier are sent to the switch allocator. Requesting VC identifier is needed in order to send a response signal back to input channel for head pointer update. Next direction is needed to generate the multiplexer select lines and next hop VC identifier is needed for Credit check. If Credit is zero, then this switch allocation request is not served in that cycle.

The switch allocation method is simple. Each input channel can request switch allocation at the rising edge of the clock. Each request is stored in a register and its valid bit is set. In each clock cycle, the switch allocator starts from a randomly chosen valid request and allocates the switch to it. The allocator then walks through the remaining requests in circular order and allocates the switch to every non-blocking request, that is, every request that does not conflict with one already granted. Switch allocation itself is simply the generation of multiplexer select lines. All served requests are invalidated at the same moment and a response signal is sent to each requesting input channel. In the next cycle new requests are received and the same procedure is followed. Choosing the starting request at random ensures some fairness.
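One allocation cycle, as described above, can be sketched in Python. The data layout is an assumption made for illustration: `requests` holds one entry per input channel, either `None` or a pair of output port and next-hop VC, and `credits` maps such a pair to the remaining credit count. A request is treated as blocked if its output port has already been granted in this cycle or if its credit is zero.

```python
import random


def allocate_switch(requests, credits, rng=random):
    """Return {input index: output port} for the requests served this cycle."""
    valid = [i for i, req in enumerate(requests) if req is not None]
    if not valid:
        return {}
    start = rng.choice(valid)          # random starting request for fairness
    grants, used_outputs = {}, set()
    for offset in range(len(requests)):
        i = (start + offset) % len(requests)   # walk in circular order
        req = requests[i]
        if req is None:
            continue
        output_port, next_hop_vc = req
        if output_port in used_outputs:
            continue                   # output already allocated this cycle
        if credits.get((output_port, next_hop_vc), 0) == 0:
            continue                   # no credit: not served in this cycle
        grants[i] = output_port
        used_outputs.add(output_port)
    return grants


if __name__ == "__main__":
    requests = [("EAST", 0), ("EAST", 1), None, ("LOCAL", 0), None]
    credits = {("EAST", 0): 2, ("EAST", 1): 1, ("LOCAL", 0): 3}
    print(allocate_switch(requests, credits, random.Random(0)))
```

The returned mapping corresponds to the multiplexer select lines: each granted output port is driven by the input channel that won it.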

3.4 Crossbar Switch :

The crossbar switch provides the physical connectivity from input channels to output channels, which lets us switch flits from any input channel to any output channel. The crossbar is generally fully connected and is built as an arrangement of multiplexers. An m x n crossbar directly connects m input channels to n output channels using n multiplexers of size m:1. If m = n, the crossbar is called a square crossbar; otherwise it is called a rectangular crossbar.

In our work, we use a 5 x 5 crossbar implemented with five 5:1 multiplexers.


Figure 3.3: A 5 x 5 Crossbar Switch
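A behavioral sketch of a crossbar as a row of multiplexers is given below in Python. It is a software model only, with hypothetical function names; `None` on a select line stands for an output that is not driven in this cycle.

```python
def mux(inputs, select):
    """m:1 multiplexer: forward the selected input."""
    return inputs[select]


def crossbar(inputs, selects):
    """m x n crossbar: one m:1 multiplexer per output channel.

    selects[j] is the index of the input routed to output j, or None
    when output j is idle in this cycle.
    """
    return [None if s is None else mux(inputs, s) for s in selects]


if __name__ == "__main__":
    flits_in = ["a", "b", "c", "d", "e"]   # one flit per input channel
    select_lines = [4, None, 0, 1, None]   # produced by the switch allocator
    print(crossbar(flits_in, select_lines))
```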

Chapter 4

Anatomy of a Message and Our Implementation

Packet switching is a good way to deliver a message from one node to another. A message can be a cache line, an operand for an operation, or anything else. Because there are many types of message, a message can have an arbitrary size depending on its type. The underlying network, however, places some constraints on messages. This leads to the concept of packetization.

4.1 Packets :

Messages can be of arbitrary size. In a NoC, we can safely assume a maximum message size of 512 bytes, but even that is large when we want performance. If we send the entire message as a single entity through a virtual channel router, virtual channels may be reserved for a long time, which hurts performance.

In order to improve performance, the long messages are divided in smaller size packets. A packet is a routable entity. A packet may have any structure. Generally, a packet has destination, source, length, data and other parameters needed. Packet formats are not standard. We can use a packet structure according to performance requirements.


Figure 4.1: Packet Structure

4.2 Flits and Phit:

Messages are divided into packets in order to gain performance. In internet-like networks, packets are flow controllable because enough resources are available. On-chip networks, however, have resource constraints: packets are routable, but their flow may not be easily controlled. We therefore further divide the packets into a flow-controllable entity called a flit (Flow Control Digit). This is the basic unit of flow control. The packet is transferred in the form of flits. A flit works in the following way: suppose the flit width is 4 bytes but we can transfer only a single byte per cycle. Whenever we transfer data, all 4 bytes are transferred together. It cannot happen that the receiver accepts 3 bytes and then reports that its buffer is full. The receiver always gets the 4 bytes, one by one, in a single transmission. There can be the following types of flits:

4.2.1 Head Flit :

The head flit is the first flit of the packet. In the virtual channel router architecture with wormhole switching, the head flit is the only flit that contains routing information. The head flit is the first to reach the input channel. All routing and virtual channel allocation are performed for the head flit, and the resulting information is stored in the virtual channel state registers.

4.2.2 Body Flit :

A body flit contains the VC identifier and data. There is no routing information in a body flit; it depends on the virtual channel for forwarding.


4.2.3 Tail Flit :

The tail flit is the last flit in the stream of data from the same packet. It is responsible for cancelling the virtual channel reservation. Once the tail flit has passed through the virtual channel, that virtual channel can be allocated to other packets.

4.2.4 Phits

Phit stands for Physical Transfer Digit. It is the amount of data that can be transferred in a single cycle. Generally, the phit and the flit are of similar width.

Figure 4.2: Message, Packets, Flits and Phits

4.3 Flit formats for communication.

Each and every network for communication is built upon some specific message format that contains the information required for sending the message to its destination. In our implementation, we have made the following assumptions:

The IP core will put the message into a network interface buffer together with the required routing information such as destination, source, length of the message, etc.


Each packet will be divided into flits by a network interface: a head flit, body flits and a tail flit, depending on the packet length.

Router Architecture decisions : In our implementation, we have put four virtual channels at each input channel. The buffers at each input channel are partitioned among the virtual channels. The buffers for the virtual channels are 32-bit wide registers arranged as a FIFO of length 8. Each input channel is responsible for credit updates. In an ideal scenario, a flit takes at least two cycles to travel from the crossbar to the input channel, so the FIFO is designed to accommodate the two flits in flight. The credit is therefore initialized to 6 and is reduced to zero at

(Head+2) % LENGTH == Tail.
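The credit scheme described above can be illustrated with a small behavioural model. This is only a sketch in Python rather than the hardware description of the router. The class name `CreditCounter` and its method names are hypothetical, the head and tail pointers are not modelled, and it assumes that one credit is consumed per flit sent and one credit is returned per flit dequeued downstream.

```python
class CreditCounter:
    """Upstream view of one downstream virtual-channel FIFO (illustrative only)."""

    def __init__(self, fifo_depth=8, in_flight=2):
        # Depth 8 and two flits in flight come from the text: 8 - 2 = 6 credits.
        self.max_credits = fifo_depth - in_flight
        self.credits = self.max_credits

    def can_send(self):
        # A flit may be sent only while at least one credit is left.
        return self.credits > 0

    def send_flit(self):
        if not self.can_send():
            raise RuntimeError("no credit left: downstream FIFO could overflow")
        self.credits -= 1

    def credit_returned(self):
        # Assumed to be called when the downstream FIFO dequeues a flit.
        if self.credits < self.max_credits:
            self.credits += 1


counter = CreditCounter()
for _ in range(6):
    counter.send_flit()
print(counter.can_send())   # False: all six credits are used
counter.credit_returned()
print(counter.can_send())   # True: one slot has been freed downstream
```

With these numbers, the sender stalls after six flits, which leaves two FIFO slots for flits that are still in flight.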

VC identifier bits are used as the demux select lines at the input channel. For VC allocation, a VC is advertised as free if and only if the VC state is Idle. So the rising edge of Idle updates the free VC status, and the VC can then be allocated to other packets. The VC allocator unit is responsible for VC allocation, and the free VC status from all the neighbors is input to it. The VC allocator stores the advertised status in its private registers. On a request for VC allocation, it allocates a free VC to the packet and then updates the private registers to mark that VC as allocated, so that it cannot be allocated to other packets. Only a rising edge of the Idle state of the VC can mark that VC as free again.

The switch allocation unit receives credits from all the neighboring VCs and performs switch allocation depending on the requests and credit availability. The switch allocator keeps track of all the allocations per cycle. It ensures that no two flits are allowed to go in the same direction in the same cycle. The switch allocation unit generates the multiplexer enable signals and forwards the flits to the switch.

Structure of the Virtual Channel State Register : Virtual channel states are stored in 16-bit registers. These registers mainly store three fields: VC state, route, and VC at the next hop. Each virtual channel has its own state register. In the virtual channel state register, bits [2:0] are the VC state bits. The VC state can be IDLE, ROUTING, WAITING FOR VC or ACTIVE. Bits [5:3] store the next direction for the packet. Bits [7:6] store the virtual channel identifier that will be used at the next hop. The remaining bits are not used; they are available for credits, pointers and other purposes.


Figure 4.3: Virtual Channel State Register
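The register layout can be made concrete with a short packing and unpacking sketch. The field positions (bits [2:0], [5:3] and [7:6]) are taken from the text; the numeric codes chosen for the states and directions, and the function names, are assumptions made only for this illustration.

```python
# Assumed encodings: the report names the states but does not give their codes.
VC_STATES = {"IDLE": 0, "ROUTING": 1, "WAITING_FOR_VC": 2, "ACTIVE": 3}
DIRECTIONS = {"LOCAL": 0, "NORTH": 1, "EAST": 2, "SOUTH": 3, "WEST": 4}


def pack_vc_state(state, direction, next_vc):
    """Pack state [2:0], direction [5:3] and next-hop VC [7:6] into 16 bits."""
    if not 0 <= next_vc <= 3:
        raise ValueError("next_vc must fit in 2 bits")
    return VC_STATES[state] | (DIRECTIONS[direction] << 3) | (next_vc << 6)


def unpack_vc_state(reg):
    """Recover (state, direction, next_vc); bits [15:8] are ignored."""
    state_names = {code: name for name, code in VC_STATES.items()}
    dir_names = {code: name for name, code in DIRECTIONS.items()}
    return (state_names[reg & 0x7],
            dir_names[(reg >> 3) & 0x7],
            (reg >> 6) & 0x3)


reg = pack_vc_state("ACTIVE", "EAST", 2)
print(bin(reg))              # 0b10010011
print(unpack_vc_state(reg))  # ('ACTIVE', 'EAST', 2)
```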

The network interface divides the message into 32-bit wide flits and puts these flits into the network. We do not packetize the message; we form the flits directly from the message. Both flits and phits are 32 bits wide. There can be 64 tiles because there are only 6 address bits. The head flit has enough space to accommodate up to 4K tiles. The router is programmable: we can set the Tile ID, the number of tiles in the X-dimension and the number of tiles in the Y-dimension.

4.4 Flit Format

There can be three types of flits.

Head Flit : The head flit is the first flit and contains all the routing information for the flits that follow. It contains the HEAD FLIT IDENTIFIER bits, VC ID bits, Destination bits and Source bits. The head flit is 32 bits wide. Bits [1:0] are always 11, which tells us that this is a head flit. Bits [3:2] are the VC ID bits; they store the VC that was allocated to this packet in VC allocation. The next 6 bits, i.e. bits [9:4], store the destination ID. The next 6 bits, i.e. bits [15:10], store the source ID. The remaining 16 bits are available for any use.

Figure 4.4: Head Flit Structure

Body Flit : Body flits are 32 bits wide. Bits [1:0] are always 10, which tells us that this is a body flit. The next 2 bits, i.e. bits [3:2], contain the VC ID. The next four bits, i.e. bits [7:4], are not used and are available for any other purpose. The remaining 3 bytes contain the data.


Figure 4.5: Body Flit Structure

Tail Flit : Bits [1:0] are always 01, which says that this is a tail flit. The next 2 bits are the VC ID. The next 2 bits are count bits that tell us the length of the data in the tail flit. For example, if the count bits are 10 then only 2 bytes (the second and third least significant bytes) are useful; the rest is treated as padding and discarded. Bits [7:6] are not used. The remaining three bytes may or may not hold data.

Figure 4.6: Tail Flit Structure
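The three flit formats can be summarized by a short encoding sketch. It is a Python illustration, not the network interface of the implementation. The type codes, the VC ID position, the source field at bits [15:10] and the 3-byte payload come from the text. The sketch assumes that the 6 destination bits sit at bits [9:4], directly above the VC ID, and that payload bytes are filled starting from the second least significant byte. All function names are hypothetical.

```python
HEAD, BODY, TAIL = 0b11, 0b10, 0b01  # flit type codes in bits [1:0]


def head_flit(vc, dest, src):
    # Assumed layout: VC in [3:2], destination in [9:4], source in [15:10].
    if not (0 <= vc < 4 and 0 <= dest < 64 and 0 <= src < 64):
        raise ValueError("field out of range")
    return HEAD | (vc << 2) | (dest << 4) | (src << 10)


def body_flit(vc, data):
    # A body flit carries exactly 3 payload bytes in bits [31:8].
    if len(data) != 3:
        raise ValueError("a body flit carries exactly 3 bytes")
    return BODY | (vc << 2) | (int.from_bytes(data, "little") << 8)


def tail_flit(vc, data):
    # Count bits [5:4] give the number of valid payload bytes (0 to 3).
    if len(data) > 3:
        raise ValueError("a tail flit carries at most 3 bytes")
    return (TAIL | (vc << 2) | (len(data) << 4)
            | (int.from_bytes(data, "little") << 8))


def message_to_flits(vc, dest, src, payload):
    """Split a byte string into one head flit, body flits and one tail flit."""
    chunks = [payload[i:i + 3] for i in range(0, len(payload), 3)] or [b""]
    flits = [head_flit(vc, dest, src)]
    flits += [body_flit(vc, chunk) for chunk in chunks[:-1]]
    flits.append(tail_flit(vc, chunks[-1]))
    return flits


for flit in message_to_flits(vc=1, dest=5, src=2, payload=b"ABCDE"):
    print(f"{flit:08x}")
```

For the 5-byte payload in the example, the message becomes one head flit, one body flit with 3 bytes and one tail flit whose count bits are 10.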

4.5 Router Architecture

4.5.1 Input Channel

The input channel is a simple FIFO of 32-bit wide registers. To write a flit to the buffer, the wr-en signal is used. The HEAD and TAIL pointers are updated whenever we enqueue or dequeue a flit. Each input channel has routing logic that performs route computation using X-Y routing. The next hop direction is stored in a register and the input channel is marked as ACTIVE. It can then request switch allocation, and its flits can flow.

Figure 4.7: Input Channel States
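The route computation performed by the input channel can be sketched as follows. This is a minimal Python illustration of X-Y (dimension-order) routing on a 2D mesh, not the routing logic of the router itself. It assumes that tile IDs are numbered row by row, so that the X coordinate is the ID modulo the number of tiles in the X-dimension, and that Y grows towards NORTH. The function name and the direction names are hypothetical.

```python
def xy_route(current_id, dest_id, tiles_x):
    """Return the next-hop direction: correct X first, then Y."""
    cur_x, cur_y = current_id % tiles_x, current_id // tiles_x
    dst_x, dst_y = dest_id % tiles_x, dest_id // tiles_x
    if dst_x > cur_x:
        return "EAST"
    if dst_x < cur_x:
        return "WEST"
    if dst_y > cur_y:
        return "NORTH"
    if dst_y < cur_y:
        return "SOUTH"
    return "LOCAL"  # the flit has reached its destination tile


# 4 x 4 mesh: tile 5 is at (1, 1), tile 15 is at (3, 3)
print(xy_route(5, 15, tiles_x=4))   # EAST, because X is corrected first
print(xy_route(7, 15, tiles_x=4))   # NORTH, because X already matches
```

Because a packet never turns from the Y dimension back into the X dimension, this ordering is the usual reason X-Y routing is deadlock free on a mesh.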


4.5.2 Switch Allocation

The switch allocation block is a reservation station. Each input channel sends a request for allocation. One of the requests is chosen arbitrarily. If this request can be served, the requested output channel is reserved for this input channel. The reservation information is stored in the Reservation Status register of each output channel. We then go through the other requests in cyclic order. All requests that can be served are reserved in a single cycle. Once the reservation has been made, we perform switching by sending the flits and the multiplexer select signals to the crossbar switch in each cycle. We also acknowledge the input channel being served so that its HEAD pointer is updated.

The Reservation Status register is a 7-bit register which has a valid bit, next dir bits for the next direction and input channel bits to identify which input channel has reserved this output channel.

Figure 4.8: Input Channel States
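The reservation procedure can be sketched as a single-cycle function. This is a behavioural Python illustration, not the allocator of the implementation. It starts from a random input channel index rather than from a random valid request, which is a simplification, and the function name and the representation of requests as a list are assumptions.

```python
import random


def allocate_switch(requests, start=None):
    """Grant output ports for one cycle.

    requests[i] is the output port wanted by input channel i, or None.
    Returns a dict {input_channel: output_port} of the grants.
    """
    n = len(requests)
    if start is None:
        start = random.randrange(n)  # random starting point for fairness
    reserved = set()                 # output ports already reserved this cycle
    grants = {}
    for step in range(n):            # walk the requests in circular order
        channel = (start + step) % n
        port = requests[channel]
        if port is not None and port not in reserved:
            reserved.add(port)
            grants[channel] = port
    return grants


# Inputs 0, 1 and 4 all want output 2; input 3 wants output 0.
print(allocate_switch([2, 2, None, 0, 2], start=1))  # {1: 2, 3: 0}
```

Whatever the starting point, each output port is granted to at most one input channel per cycle, which is the property the text requires.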

4.5.3 Crossbar Switch

The crossbar switch is a simple arrangement of 5-to-1 multiplexers in which each input line is 32 bits wide. Each multiplexer has 3 select lines and one enable line, which are received from the switch allocation unit. The flit is switched to the next hop depending on the select lines.
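The behaviour of such a crossbar can be sketched as one multiplexer per output port. This is a Python illustration under stated assumptions, not the crossbar of the implementation: the function name is hypothetical, a disabled output is represented by `None`, and the select and enable values are assumed to come from the switch allocation unit.

```python
def crossbar(inputs, selects, enables):
    """Model a 5 x 5 crossbar built from five 5:1 multiplexers.

    inputs  : five 32-bit flits, one per input channel
    selects : selects[o] is the input index chosen by the mux of output o
    enables : enables[o] is True when output o forwards a flit this cycle
    """
    outputs = []
    for select, enable in zip(selects, enables):
        if enable:
            outputs.append(inputs[select] & 0xFFFFFFFF)  # 32-bit data path
        else:
            outputs.append(None)                         # output stays idle
    return outputs


flits = [0x10, 0x11, 0x12, 0x13, 0x14]
print(crossbar(flits, selects=[4, 0, 0, 2, 1],
               enables=[True, False, True, True, False]))
# [20, None, 16, 18, None]
```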

Chapter 5

Logic synthesis from HDL

Logic synthesis is a process by which an abstract form of desired circuit behavior, typically at register transfer level (RTL), is turned into a design implementation in terms of logic gates, typically by a synthesis tool. Generally, we choose a hardware description language (HDL) such as Verilog and model the abstract form of digital circuit behavior.

5.1 Architecture of Synthesizers

The architecture of various synthesis tools is pretty much like a compiler. Logic synthesis is done by synthesizers in two steps.

Front end

Back end

5.1.1 Front end

In the front end phase, the synthesizer takes the RTL source code and performs parsing and syntax checking. Once this phase is carried out successfully, the synthesizer performs the elaboration step.


5.1.2 Back end

Once the elaboration is done, the back end of the synthesizer starts its job. The output of the elaboration step is taken as input and an optimized gate-level netlist is generated via the following steps:

Analysis and Translation: In logic synthesis, there are two major concerns.

Functional metrics: fan-in, fan-out and others

Non-functional metrics: area, power and delay.

Logic synthesis also includes library binding.

Technology-independent synthesis: This step includes simplification of the logic, restructuring the delay and the logic network.

Technology-dependent synthesis: This step involves technology mapping and library binding. There are two approaches for technology mapping.

A two-step approach: The network is first decomposed into small blocks. Then the number of nodes is reduced and the result is mapped onto the hardware.

FlowMap method: The network is decomposed into LUT-sized blocks and then the number of logic blocks is reduced.

Netlist generation: Once we have done the technology binding, the synthesizer produces the netlist.

5.2 FPGAs

Field-programmable gate arrays (FPGAs) are so called because they are structured very much like the now-obsolete "gate array" form of application specific integrated circuit (ASIC). In fact, FPGAs essentially killed the gate array ASIC business. In the not-so-distant past, FPGAs were marketed for primarily two uses:

For prototyping ASICs

For use in systems to meet time-to-market demand, knowing that they would be replaced with an ASIC implementation at the earliest opportunity

With regard to this latter point, FPGAs can be programmed on our desktop in minutes while ASICs require weeks to fabricate a new design.


5.3 Components of FPGA.

Each FPGA vendor has its own FPGA architecture, but in general terms they are all variations of the same design. The architecture consists of configurable logic blocks, configurable I/O blocks, and programmable interconnect. Also, there will be clock circuitry for driving the clock signals to each logic block. Additional logic resources such as ALUs, memory, and decoders may also be available. The three basic types of programmable elements for an FPGA are static RAM, anti-fuses, and flash EPROM.

5.3.1 Configurable Logic Blocks (CLBs)

These blocks contain the logic for the FPGA. In the large-grain architecture used by all FPGA vendors today, these CLBs contain enough logic to create a small state machine. The block contains RAM for creating arbitrary combinatorial logic functions, also known as lookup tables (LUTs). It also contains flip-flops for clocked storage elements, along with multiplexers in order to route the logic within the block and to and from external resources. The multiplexers also allow polarity selection and reset and clear input selection.
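How a small RAM implements an arbitrary combinatorial function can be shown with a short sketch. This is a Python illustration of the lookup-table idea only: the inputs form an address and the stored bit at that address is the output. The function name `make_lut` and the default of 4 inputs are assumptions for the example and do not describe any particular vendor's CLB.

```python
def make_lut(truth_table_bits, num_inputs=4):
    """Build a LUT: bit i of truth_table_bits is the output for input pattern i."""
    size = 1 << num_inputs  # a k-input LUT stores 2**k bits
    if not 0 <= truth_table_bits < (1 << size):
        raise ValueError("truth table does not fit the LUT")

    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError("wrong number of inputs")
        # The inputs form the read address; input 0 is the least significant bit.
        address = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return (truth_table_bits >> address) & 1

    return lut


xor2 = make_lut(0b0110, num_inputs=2)  # 2-input XOR
and4 = make_lut(1 << 15)               # 4-input AND: only address 15 stores 1
print(xor2(1, 0), xor2(1, 1))          # 1 0
print(and4(1, 1, 1, 1), and4(1, 0, 1, 1))  # 1 0
```

Changing the stored bits changes the function without changing the hardware, which is what makes the logic block configurable.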

5.3.2 Configurable I/O Blocks

A configurable input/output (I/O) block is used to bring signals onto the chip and send them back off again. It consists of an input buffer and an output buffer with three-state and open collector output controls. Typically there are pull-up resistors on the outputs and sometimes pull-down resistors that can be used to terminate signals and buses without requiring discrete resistors external to the chip.


The polarity of the output can usually be programmed for active high or active low output, and often the slew rate of the output can be programmed for fast or slow rise and fall times. There are typically flip-flops on outputs so that clocked signals can be output directly to the pins without encountering significant delay, more easily meeting the setup time requirement for external devices. Similarly, flip-flops on the inputs reduce the delay on a signal before it reaches a flip-flop, thus reducing the hold time requirement of the FPGA.

5.3.3 Programmable Interconnect

There are long wires that can be used to connect critical CLBs that are physically far from each other on the chip without inducing much delay. These long wires can also be used as buses within the chip.

There are also short wires that are used to connect individual CLBs that are located physically close to each other. Transistors are used to turn on or off connections between different wires. There are also several programmable switch matrices in the FPGA to connect these long and short wires together in specific, flexible combinations.

In an ASIC, the majority of the delay comes from the logic in the design, because logic is connected with metal lines that exhibit little delay. In an FPGA, however, most of the delay in the chip comes from the interconnect, because the interconnect, like the logic, is fixed on the chip. Connecting one CLB to another CLB in a different part of the chip often requires a connection through many transistors and switch matrices, each of which introduces extra delay.

5.4 FPGA programming

The end product of FPGA programming is a bitstream file that is downloaded to the FPGA. FPGA vendors provide tools that take the HDL files and modules as input and then synthesize the code to a bitstream. Xilinx and Altera are two leading suppliers of FPGA bitstream synthesis tools.


5.4.1 Design Input

The design input stage is where we take a design you have planned out, and start inputting it into the Xilinx tools. A design can be input using a variety of methods. You can code it in either Verilog or VHDL, or alternatively you can use a Schematic view to input your design graphically. For purposes of this class we input our design as Verilog RTL. Once your design is complete, you can check the syntax of your RTL during this stage and fix any errors that may result.

5.4.2 Synthesis

The synthesis phase of the design is where the Xilinx tools take your Verilog RTL and translate it into physical hardware. The tools use a variety of logic minimization techniques to produce a rough mapping of your design on the FPGA. Xilinx FPGAs are based on Slices. Each slice contains a few multiplexers, look-up tables, and other logic. One of the major differences between a design targeting an FPGA and one targeting an ASIC is that you have to take into account fixed resources. When you run synthesis for an ASIC in Synopsys, the software determines a netlist which eventually determines which standard cells your design needs, how to connect them, and how many you need. Synthesis for an FPGA involves the tool determining how it can map your design to these fixed Slice structures in the most efficient manner possible. At the end of the synthesis process, the tools have a rough idea of the resource utilization and maximum clock frequency of your design.

5.4.3 Design Constraining

At this point in the design flow, the tools know how they are going to map the logic into slices. The tools do not, however, know how to constrain the design. Another resource FPGAs provide other than slices is pins. Pins allow us to get inputs/stimulus in from the outside world and allow us to send out information. Each port in the top level design should correspond to one of the FPGA's pins. If you do not explicitly specify which pin a port should correspond to, the tools will randomly assign ports to pins. This is usually not desired. There are a variety of types of pins available on FPGAs. Some pins should only be used for clock or reset signals. Other pins are for different input/output standards such as TTL or LVDS.


5.4.4 Implementation

After Synthesis and Design Constraining, Implementation determines which specific slices will implement the given logic. Implementation also works out the routing of signals from the input/output pins to the logic in FPGA slices. Some refer to this stage as Place and Route because the logic is placed in specific parts of the FPGA and the signals are routed.

5.4.5 Bitstream Creation

The final step in the design process is to take the implemented design and translate it into a format that can be used by the FPGA. This format is called the Bitstream, and it is a proprietary binary file that gets loaded onto the FPGA. Xilinx FPGAs are an SRAM-based technology, meaning that once we load a bitstream on the FPGA, it only remains in the FPGA while power is applied. Each time power is applied to the FPGA, the design must be reloaded. Most designers handle this by placing a small PROM (Programmable ROM) before the FPGA that, on power up, will automatically program the FPGA in just a few milliseconds. Another part of the Bitstream creation process allows you to create a special bitstream for these PROMs.

Chapter 6

Conclusions and Future Challenges

Use of interconnect networks instead of dedicated wires has advantages in terms of structure, modularity and performance. A network simplifies the wires and gives them well-defined and well-controlled electrical properties. The fine-grain control on parameters enables high performance. This also results in significantly lower power dissipation, high propagation velocity and very high communication bandwidth. Sometimes, the performance enhancement due to these aggressive networks is lost to network overheads. Bandwidth is also increased because of link sharing. The interconnect networks provide a standard interface similar to bus technologies.

The continuing scaling of technology poses several design challenges for future high performance architectures. One such challenge is dealing with the increased effects of process variation. Besides, the continuous increase in operating frequency along with higher on-chip integration of functionalities has been exacerbating chip power density and within-die temperature fluctuations. Many decisions taken at system level can significantly affect the power and temperature profile as well as the overall performance. Assuming a uniform temperature and no process variation can lead to substantial inaccuracies in system-level design choices. It is thus important to accurately and efficiently estimate power taking into account process and temperature variations so that they can be accounted for early in the design stage.

Bibliography

1. Li, Bin; Peh, Li Shiuan; Patra, Priyadarsan, “Impact of process and temperature variations on network-on-chip design exploration,” in Second IEEE International Symposium on Networks-on-Chip, NOCS 2008.

2. Sheng, Xu; Benito, Ibis; Burleson, Wayne, “Thermal impacts on NoC interconnects,” in NOCS 2007: First International Symposium on Networks-on-Chip.

3. On chip network by Natalie.
