
University of Manchester

School of Computer Science

Exploring router microarchitecture for

Networks on Chip

Progress Report

Student Name: Panayiotis Englezakis

Supervisor: Javier Navaridas


Contents

Abstract
Introduction
  Project Goals
  Report Structure
Background
  The need for Networks on Chip
  Network on chip Architectures
    Baseline Router
    Rotary Router
    Hierarchical Router
Research Methodology and Project Planning
  Research Methodology
    Data Collection
  Project Planning
Project Progress
  Building the Baseline router
Summary and future work
References
Appendix A


Abstract

Networks on Chip (NoCs) are a necessity in modern processor design, so a clear picture must exist of the context to which each router architecture is suited. The main router architecture is the crossbar-based architecture. The rotary and hierarchical architectures are proposals that reduce the power and area of the crossbar-based router. There has been no comparison between all three router architectures. Comparing them can make clear in which context each architecture performs best. The comparison will be carried out through a functional simulation of all three architectures. The architectures will be simulated under various kinds of network traffic and with different router configurations to identify the environment in which each router works best. A theoretical analysis of the area and power consumption of the routers will also be carried out. The project aims to identify the environment in which a router performs best with respect to the traffic it handles and the power and area restrictions of the device it works on.


Introduction

Since the invention of the transistor there has been a revolution in digital logic design. Smaller and more complex integrated circuits appeared, among them general purpose processors. Transistors have kept shrinking since then, reaching deep submicron dimensions in the last decade. As processors became more powerful, new computationally intensive, massively parallel applications appeared.

Transistors have scaled down better than the wires that connect them, so on-chip communication has become expensive. Because wires do not scale down well, they take up a lot of chip area that could otherwise be used to implement functional units. Data also propagates more slowly on modern CPUs, because thinner wires have higher resistance. Computationally intensive, massively parallel applications stress on-chip communication even further. In addition, wires are now so close to each other that capacitive coupling between them causes crosstalk noise.

The increasing complexity of processors means that many man-hours are needed to design and test them. Due to the nature of processors, component reuse is not easy either, because few components can be reused.

The solution to these problems came in the form of a network. Networks on Chip are networks composed of links, routers and nodes, like conventional computer networks. The difference is that they are implemented on-chip. The nodes in this case are the processing cores of the processor (CPU). Such a modular design makes it easy to reuse the components the processor consists of. Moreover, massively parallel applications have appeared and parallelization is the future of computing, so processors have to keep up. The only way to sustain the on-chip communication of such a chip is to use a network.

Project Goals

The goal of this project is to examine the major router architecture (the crossbar-based router) and two other architectures proposed to reduce power and area consumption. The three router architectures to be examined are the crossbar-based router [1], the rotary router [2] and the hierarchical router [3]. These architectures will be simulated under various kinds of network traffic to identify the context in which each of them performs best.

The functional simulator will be built in Java. The routers will be simulated in a complete network under various configurations. The complete network is called the system from now on. The system will be reconfigurable in aspects such as the number of routers to be simulated and the characteristics of the routers, such as the number of virtual channels and the size of the input buffers. The system will be tested under network traffic that targets the specific architectures. All the routers will be simulated under all kinds of traffic so that their behaviour can be recorded. A theoretical analysis of the routers will also be carried out with respect to their power consumption and the hardware area needed to implement them.
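The reconfigurable aspects listed above could be gathered in a single configuration object. The following is only a minimal sketch: the class `SystemConfig`, its field names and the helper `bufferFlitsPerRouter` are hypothetical and are not taken from the simulator itself.

```java
// Hypothetical configuration holder for one simulated system.
// All names here are illustrative assumptions, not the simulator's classes.
final class SystemConfig {
    enum RouterType { BASELINE, ROTARY, HIERARCHICAL }

    final RouterType routerType;
    final int routerCount;       // number of routers in the simulated system
    final int virtualChannels;   // virtual channels per input port
    final int bufferDepthFlits;  // capacity of each input buffer, in flits

    SystemConfig(RouterType routerType, int routerCount,
                 int virtualChannels, int bufferDepthFlits) {
        if (routerCount < 1 || virtualChannels < 1 || bufferDepthFlits < 1) {
            throw new IllegalArgumentException("all sizes must be positive");
        }
        this.routerType = routerType;
        this.routerCount = routerCount;
        this.virtualChannels = virtualChannels;
        this.bufferDepthFlits = bufferDepthFlits;
    }

    /** Total input-buffer storage (in flits) of one router with the given number of ports. */
    int bufferFlitsPerRouter(int ports) {
        return ports * virtualChannels * bufferDepthFlits;
    }

    public static void main(String[] args) {
        SystemConfig cfg = new SystemConfig(RouterType.BASELINE, 16, 4, 8);
        System.out.println(cfg.bufferFlitsPerRouter(5)); // 5 ports * 4 VCs * 8 flits
    }
}
```

Keeping the parameters in one immutable object would let the same traffic pattern be replayed against each router type by changing only `routerType`.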

Upon completion, the project aims to build a profile for each router. The profile will be built from the statistics collected from the simulation and from the theoretical analysis. The routers will be classified with respect to the traffic they handle best, their power consumption and the area they need in hardware. To date there has been no comparison of all three routers at the same time.

Report Structure

The remainder of the report is structured as follows:

Chapter 2 – Background:

Chapter 2 explains further the reasons that led to NoC design. It also describes in detail various techniques used in designing NoCs, and then describes the three architectures to be simulated.

Chapter 3 – Research Methodology and Project Planning:

Chapter 3 explains the methodology that will be followed to construct the experiments and the way data will be collected. It also presents the plan for the work remaining until the completion of the project.

Chapter 4 – Project progress:

Chapter 4 explains the progress made in the project. It provides a detailed description of the classes used and explains the basic functionality of the routers.

Background

The need for Networks on Chip

As computer systems evolve they become ever more complex. Modern technology also scales down the size of transistors, enabling more transistors to fit in a square inch [4]. Technology has advanced so much, and the complexity of computer systems has increased to such a point, that there is little room for further evolution along the same path.

Similar problems in the past were the force that radically changed the way computer systems were developed. For example, in the early days the dominant opinion was that memory was cheap and transistors were expensive [5]. Once that problem was solved, transistors were fast and cheap, but memory became slow relative to the ever-increasing CPU speed. CPU designers overcame this problem by introducing an idea that was revolutionary at the time but is trivial today: caches. Caches give the CPU the impression that it has an unlimited amount of fast memory at its disposal. Through smart cache design and replacement policies, CPUs overcame the slow memory problem.

Along the same lines, today’s processors face problems that demand a radical change in their design. Firstly, the scaling of transistors has caused many problems. One problem is that transistors nowadays do not strictly behave as switches [4]. The drain and the source are so close to each other that there is leakage current. So even when a transistor is idle, current leaks through it and contributes to the high power consumption of modern computer systems. At the time of writing, the smallest transistor used in a commercial CPU is 22nm, used by Intel in its latest technology, Ivy Bridge [6]. The other problem is that wires do not scale down as transistors do. Thinner wires mean increased resistance, which slows the transmission of data. The wires are also so close to each other that there is crosstalk noise from one wire to another.

Manufacturing transistors is not an easy task. Transistors are manufactured using lithography (the use of ultraviolet light to pattern the transistors on silicon). Some components of the transistors are even smaller than the wavelength of the ultraviolet light used in the manufacturing process. As a result, the manufactured transistors show a lot of variation. Tackling this variation is not easy. CPUs are designed deterministically. Designing them probabilistically means that extra silicon has to be spent on tackling the variability, taking up space on the chip that could be used for extra memory or an extra functional unit.

For many years the CPU design community focused on increasing instruction level parallelism (ILP) to gain extra speed-up. As in every phase of CPU design history, an idea is abandoned once there is nothing more to gain from it in terms of performance. Once ILP had been exhausted, a new approach was introduced: thread level parallelism (TLP). Before TLP was introduced, designers could easily achieve the desired performance just by finding ways to increase the clock speed of the CPU. By the mid-2000s CPU clock speeds were above 3.5 GHz. Because such speeds could not be sustained any longer, a new idea in CPU design was proposed.

The new trend was to design CPUs with multiple identical cores. Clock speeds decreased and the cores themselves became simpler, yet the combined computing power of the CPU increased. This created an urgent need for networks on chip (NoCs). A single bus could sustain two cores, but the number of cores has kept increasing since then and still does. A bus does not scale well as more cores are added, because its arbitration mechanism would become the centre of the chip. CPU history clearly shows that centralized structures are a bad solution because of the congestion they create. The simplicity of the bus gives it a distinct advantage over the NoC in terms of design time, but the bus performs worse than a NoC. NoCs scale well with an increasing number of cores, and various topologies and routing algorithms can be implemented according to the nature of the CPU (e.g. server or client). To sustain the amount of traffic between multiple cores, NoCs had to be implemented.

In short, computation has become cheaper, and new computationally intensive applications have been built as a result. Communication, though, is becoming more expensive. This is due to the increased amount of on-chip traffic and the physical limitations of the wires. As mentioned above, the increased resistance that results from scaling the wires makes the time of flight of an electrical signal from source to destination longer.

Another problem faced by modern CPUs is synchronisation. A single oscillator cannot synchronise the whole chip. This is again due to the nature of the wires: the latency of transmission and the crosstalk between them. One solution is to divide the CPU into independent units that each work with their own clock, creating a globally asynchronous, locally synchronous scheme. The NoC helps in that direction: NoCs can transfer data between the cores of the CPU without the cores having to synchronise with each other.


Modern CPUs contain more than just simple identical cores. Designers also incorporate simple GPU cores, which turns the CPU into a System on Chip (SoC). SoCs are heterogeneous by nature: each module performs different computations and produces different results. For these modules to communicate, a level of abstraction is needed, and the NoC offers it. Each module sends its data and the NoC transforms the data and delivers a meaningful message to the receiving module. One might argue that a NoC is not needed for this, since the modules can be designed to receive and interpret data from heterogeneous modules. That is correct. But adding this level of abstraction means that the heterogeneous modules can be reused in another design, significantly reducing the design time of a system.

Network on chip Architectures

Before describing the NoC architectures, some terminology should be explained. First, the deadlock and livelock states are described. Deadlock is a state that a system enters when the resources that packets need in order to reach their destination are held by other packets, which are in turn waiting for the resources held by the first packets. The resource dependency path formed is circular, and no packet can advance unless another advances first. The livelock state is similar to deadlock. In this case the resources constantly change state while waiting for other resources to be released, but no packet makes progress towards its destination, so the system is effectively stalled again.
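The circular dependency that characterises deadlock can be illustrated by searching a "waits on" graph for a cycle. The sketch below is an assumption-laden illustration rather than part of the simulator: the class `DependencyCycle` and the string resource names are hypothetical, and each map entry simply lists the resources that a given resource's holder is waiting for.

```java
import java.util.*;

// Illustrative only: detects a circular wait in a resource dependency graph.
final class DependencyCycle {
    /** Returns true if the directed graph (resource -> resources it waits on) has a cycle. */
    static boolean hasCycle(Map<String, List<String>> waitsOn) {
        Set<String> done = new HashSet<>();    // fully explored, known cycle-free
        Set<String> onPath = new HashSet<>();  // nodes on the current DFS path
        for (String node : waitsOn.keySet()) {
            if (visit(node, waitsOn, done, onPath)) return true;
        }
        return false;
    }

    private static boolean visit(String node, Map<String, List<String>> waitsOn,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(node)) return true;   // back edge: circular wait found
        if (done.contains(node)) return false;    // already explored
        onPath.add(node);
        for (String next : waitsOn.getOrDefault(node, Collections.<String>emptyList())) {
            if (visit(next, waitsOn, done, onPath)) return true;
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> waitsOn = new HashMap<>();
        waitsOn.put("A", Arrays.asList("B"));
        waitsOn.put("B", Arrays.asList("C"));
        waitsOn.put("C", Arrays.asList("A")); // closes the cycle A -> B -> C -> A
        System.out.println(hasCycle(waitsOn));
    }
}
```

Removing any one edge of the cycle, for example by letting one packet use a different resource, makes `hasCycle` return false, which is the intuition behind the avoidance mechanisms discussed later.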

Continuing with the terminology, virtual channels are described next. A virtual channel is a logical division of a physical link into multiple logical channels. Each virtual channel has its own exclusive input and output buffers.

As NoCs evolved, several architectures for implementing them were proposed. In contrast to computer networks, emphasis is placed on different aspects of the network. For example, in computer networks the critical aspect is that a packet eventually arrives at its destination, with a latency of some milliseconds or, in the worst case, some seconds. In NoCs, on the other hand, latency is a major constraint: packets have to arrive at their destination as fast as possible. Large NoCs nowadays are used in supercomputers and large servers. The purpose of a supercomputer is to run large simulations or analyse data fast, so a NoC that is a bottleneck would limit the processing power of the whole machine.

Another major constraint is power. Modern CPUs can consume around 150W of power [7]. This consumption causes problems by itself, because it can raise the temperature of the chip to hundreds of degrees Celsius, which can damage it. To prevent damage the temperature of the chip has to be kept low. This is done at the expense of more power consumption, either by fans and heat sinks or by water cooling. A NoC therefore has to be power-efficient so that it does not add power consumption that the chip cannot handle.

Another important aspect is area. Space on a chip is limited, so if the NoC consumes a large amount of area there is less area left for computation. This is again in contrast to modern computer networks, where routers often fill whole racks and the accumulated length of the wires can reach many kilometres. In addition, because of the limited resources and depending on the routing algorithm used, NoCs can be prone to deadlock and livelock. In summary, modern NoC architectures have three critical aspects: latency, power and area.

Messages between cores in the same CPU are often too large to be transmitted in one piece, so they are sent as packets, and packets are broken down into smaller chunks called flits. The terminology used to describe the units that travel in a NoC is shown in figure 1. The message is the top-level entity and describes the actual information that one core wants to pass to another. The packet is the entity that is actually sent. A packet breaks down into flits (flow control units), and flits break down into phits (physical units). In a NoC the wires that transfer data from one core to another are wide, so most of the time a phit is the same entity as a flit.
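The packet-to-flit step of this hierarchy can be sketched as follows. This is a hedged illustration only: the class `Packetiser`, the `Flit` type and the HEAD/BODY/TAIL labels are assumptions introduced here, and a flit is assumed to equal a phit, as the text says is usually the case.

```java
import java.util.*;

// Illustrative sketch: splits a packet payload into fixed-size flits.
final class Packetiser {
    enum FlitType { HEAD, BODY, TAIL } // hypothetical labels, not from the report

    static final class Flit {
        final FlitType type;
        final byte[] payload;
        Flit(FlitType type, byte[] payload) { this.type = type; this.payload = payload; }
    }

    /** Splits a packet into flits of at most flitBytes bytes each (the last may be shorter). */
    static List<Flit> toFlits(byte[] packet, int flitBytes) {
        if (flitBytes < 1) throw new IllegalArgumentException("flitBytes must be positive");
        List<Flit> flits = new ArrayList<>();
        // Ceiling division; an empty packet still yields one (empty) head flit.
        int count = Math.max(1, (packet.length + flitBytes - 1) / flitBytes);
        for (int i = 0; i < count; i++) {
            int from = i * flitBytes;
            int to = Math.min(packet.length, from + flitBytes);
            FlitType type = (i == 0) ? FlitType.HEAD
                          : (i == count - 1) ? FlitType.TAIL : FlitType.BODY;
            flits.add(new Flit(type, Arrays.copyOfRange(packet, from, to)));
        }
        return flits;
    }

    public static void main(String[] args) {
        List<Flit> flits = toFlits(new byte[10], 4);
        System.out.println(flits.size() + " flits");
    }
}
```

In this sketch a packet that fits in a single flit produces only a head flit; a real router would need some way to mark that flit as the end of the packet as well.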

There are three major schemes that apply flow control at the packet level [8]. The first is store-and-forward routing. The core sends a packet (a collection of flits). When the packet reaches the node at its next hop, that node has to store the whole packet before forwarding it. This often causes problems when the next node to which the packet will be forwarded does not have enough space in its input buffers to receive the packet. In that case the packet stalls.

The second is wormhole routing. In wormhole routing the packet is divided into flits. When the first flit arrives at the next hop, the information contained in its header is analysed, the next hop is calculated and the flit is forwarded to the next router. When subsequent flits of the same packet arrive, the node applies the decision made for the first flit and forwards them to the same next hop. The packet is thus spread across different nodes until it reaches its destination, just like a worm. The problem with wormhole routing is that it cannot dynamically adapt the route of the flits once the decision for the first flit has been made. If there is congestion in one node, all of the flits are forced to follow the route taken by the first flit. If a flit in the middle of the worm stalls, the flits following it also stall, so the worm can stall until the blocked flit moves to the next hop. Apart from the extra latency introduced, wormhole routing wastes resources in every router in which a part of the worm resides. Without contention, the latency of a packet in wormhole routing increases linearly with the number of hops (distance) $D$ and with the size of the packet $L$. Because the flits are pipelined across the routers, the two terms add rather than multiply, so the latency is proportional to $D + L$ in wormhole routing.
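The effect of pipelining flits on latency can be made concrete with a simple zero-load model. The sketch below rests on assumptions that are not taken from this report: one flit crosses one link per cycle, routing delay is ignored and there is no contention. The class and method names are hypothetical.

```java
// Illustrative zero-load latency model (in cycles) for a packet of `flits`
// flits travelling `hops` hops. Assumes one flit per link per cycle,
// no contention and no routing delay.
final class ZeroLoadLatency {
    /** Store-and-forward: every hop waits for the whole packet before forwarding it. */
    static int storeAndForward(int hops, int flits) {
        return hops * flits;
    }

    /** Wormhole: the head flit takes `hops` cycles and the remaining flits follow in a pipeline. */
    static int wormhole(int hops, int flits) {
        return hops + flits - 1;
    }

    public static void main(String[] args) {
        int hops = 4, flits = 8;
        System.out.println("store-and-forward: " + storeAndForward(hops, flits));
        System.out.println("wormhole:          " + wormhole(hops, flits));
    }
}
```

Under these assumptions the gap between the two schemes widens as either the distance or the packet size grows, which is why the pipelined schemes are preferred when latency is the main constraint.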

Figure 1: The terminology of the data traversing a NoC

The third is virtual cut-through routing. Virtual cut-through uses the same algorithm as wormhole routing. The only difference is that, before a flit is forwarded to the next node, the sending node ensures that the receiving node has enough space in its buffers to accept the whole packet.

An architect can use any of these schemes, depending on what the NoC requires. The store-and-forward scheme carries less data in its header, but it needs a lot of control overhead and a guarantee of space before a packet is transmitted. In virtual cut-through there is less control overhead, but there is extra information in the header and the scheme again needs assurance that there is space at the destination before transmitting. Wormhole routing has extra information in the header, has less control overhead and does not need a guarantee in order to continue. Without contention, the latency in virtual cut-through is proportional to the sum of the number of hops (distance) $D$ and the size of the packet $L$, that is $D + L$. In store-and-forward, every node must receive the whole packet before forwarding it, so the latency is proportional to the product $D \cdot L$.

The most common technique for improving the performance of the network while also providing a deadlock avoidance mechanism is the virtual channel (VC). This comes at the expense of extra input buffers in each node. A deadlock can be avoided by choosing a different virtual channel to transmit the data, thus breaking the cycle in the resource dependency graph. VCs can also improve performance by utilizing a link better. Using VCs is similar to having multiple logical links over a physical wire.

As in computer networks, NoCs have various topologies [8]. The simplest one is the bus topology, shown in figure 2a. All the nodes are connected to a single bus, which manages the transmission between the nodes. Then there is the mesh topology, shown in figure 2b, in which every node is connected to its neighbours. Figure 2c shows the torus topology, an extension of the mesh topology in which the links at the edges wrap around, so that every node is connected to its neighbours as if the nodes resided on the surface of a doughnut. Figure 2d shows the tree topology. In this case it is a simple binary tree, but different variations can be used depending on the application. There are also topologies that are a hybrid of these basic ones, and others that incorporate heterogeneous routing schemes that use two or more of these topologies together [9].

Figure 2: Possible topologies for a NoC: (a) bus, (b) a mesh, (c) a torus and (d) a tree topology.
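The difference between the mesh and the torus comes down to how neighbours are computed at the edges. The following sketch illustrates this under stated assumptions: nodes are addressed by hypothetical (x, y) coordinates in a grid of at least 3 x 3 nodes, and the class `GridTopology` is not part of the simulator.

```java
import java.util.*;

// Illustrative only: neighbour lookup for a 2D mesh and a 2D torus.
final class GridTopology {
    /**
     * Returns the (x, y) coordinates of the neighbours of node (x, y) in a
     * width x height grid. With torus == true, links wrap around the edges.
     * Assumes width and height are at least 3, so wrap-around links are distinct.
     */
    static List<int[]> neighbours(int x, int y, int width, int height, boolean torus) {
        int[][] steps = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
        List<int[]> result = new ArrayList<>();
        for (int[] s : steps) {
            int nx = x + s[0];
            int ny = y + s[1];
            if (torus) {
                nx = Math.floorMod(nx, width);   // wrap around horizontally
                ny = Math.floorMod(ny, height);  // wrap around vertically
            } else if (nx < 0 || nx >= width || ny < 0 || ny >= height) {
                continue; // a mesh has no link past its edge
            }
            result.add(new int[] { nx, ny });
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("mesh corner:  " + neighbours(0, 0, 4, 4, false).size());
        System.out.println("torus corner: " + neighbours(0, 0, 4, 4, true).size());
    }
}
```

In a mesh the corner and edge nodes have fewer links than the inner nodes, whereas in a torus every node has the same number of links.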

To achieve low latency, simple routing algorithms and simple topologies were proposed. For example, a low-latency router for a NoC has to remain as simple as possible, without extra overheads [1]. The simplest routing algorithm that can be used is XY routing: the flit is first transmitted along the X axis and then along the Y axis until it reaches its destination. With a crossbar, XY routing can be implemented with the least overhead. Wormhole flow control can be used to reduce overheads further, and using VCs as well means that the links are better utilized, which enhances performance. The drawback of this scheme is area. A crossbar router requires a lot of area as the number of input ports increases. This means that the scheme is efficient in terms of latency, but area might be a limitation in an area-sensitive design.
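XY routing as described above reduces to a few comparisons per hop. The sketch below is a minimal illustration: the class `XYRouting`, the port names and the convention that NORTH means increasing Y are assumptions made here, not details taken from the baseline router.

```java
// Illustrative only: next-hop decision for deterministic XY routing in a mesh.
final class XYRouting {
    enum Port { EAST, WEST, NORTH, SOUTH, LOCAL }

    /**
     * Chooses the output port at router (x, y) for a flit destined for
     * (destX, destY). The X offset is resolved first, then the Y offset.
     * Assumes EAST is increasing X and NORTH is increasing Y.
     */
    static Port nextPort(int x, int y, int destX, int destY) {
        if (destX > x) return Port.EAST;
        if (destX < x) return Port.WEST;
        if (destY > y) return Port.NORTH;
        if (destY < y) return Port.SOUTH;
        return Port.LOCAL; // arrived: deliver to the attached core
    }

    public static void main(String[] args) {
        // Trace a flit from (0, 0) to (2, 1).
        int x = 0, y = 0;
        Port p;
        while ((p = nextPort(x, y, 2, 1)) != Port.LOCAL) {
            System.out.println("(" + x + "," + y + ") -> " + p);
            if (p == Port.EAST) x++;
            else if (p == Port.WEST) x--;
            else if (p == Port.NORTH) y++;
            else y--;
        }
    }
}
```

Because the decision depends only on the current and destination coordinates, no routing tables are needed, which is part of what keeps the router simple.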

Another approach to reducing power consumption, improving router performance and reducing communication cost is to use heterogeneous routers [9, 10]. The term heterogeneous refers to routers that use a different topology, routing algorithm or router architecture on the same chip. The idea of heterogeneous routing [9] came from the observation that the utilization of resources in a NoC is not uniform. Specifically, with deterministic XY routing in a simple mesh, the routers that lie towards the centre of the chip exhibit higher buffer utilization than the routers that lie at the edge of the chip, as shown in figure 3.

Figure 3: The utilisation in percentage of the buffers (a) and links (b) in an 8x8 mesh topology. [9]

Buffer size and link bandwidth are therefore redistributed across the NoC, so that the parts of the NoC that exhibit more utilization than others have the resources to cope with the demand. In other cases [10] the emphasis is on using heterogeneous routing to minimize the communication cost of the NoC. Reducing the communication cost means exploiting communication locality. In other words, the data used by a core should be placed in memory banks near that core. One proposed solution is a hierarchical network, that is, a network with two levels, each using a different topology. A bus is used for local communication, connecting 4-8 cores, and a mesh is used for global communication, connecting the core clusters to each other. In this way the benefits of the bus topology, such as simplicity and power efficiency, are exploited, while its unwanted characteristics, such as poor scaling with a large number of cores, are left behind. Using the mesh only for global communication also means that the resulting mesh is smaller, which preserves performance and keeps the power consumption of the network low.

A NoC consists of two major components, the links and the routers. Combined, the routers and the links form the topology, which is not a component in itself but rather a combination of the two, as mentioned before. The main task of this dissertation is to examine the main router architecture that exists today, as well as two other proposals made to reduce area and latency. This examination will cover not only the functional behaviour of the routers but also a theoretical analysis of the area and power consumption of each router.

Baseline Router

The baseline router [1] is the dominant router. Depending on the topology, the baseline router can perform different routing algorithms at the expense of extra hardware. In a mesh topology it can easily perform XY routing.

The heart of the baseline router is the crossbar. A crossbar connects all of the inputs of the router to all of the outputs, as shown in figure 4. With that functionality the baseline router can route a packet to its destination output port in one cycle, provided that the port is free of other packets. The selection of the output port is done in two phases. In phase one, the destination output port is calculated from the information in the packet header and the routing algorithm currently in use. After that, the input ports issue requests for the output ports they would like to use. This is the port request phase. The output ports then answer back, each selecting an input port among those that requested it. If there are multiple input port requests, the output port selects one using one of various schemes: at random, using round robin, or with a priority assigned to each of the input ports. An output port answers back and selects an input port only if it has the resources to serve the request.

Figure 4: The baseline router layout
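Of the selection schemes mentioned, round robin is the easiest to illustrate. The sketch below shows one possible arbiter for a single output port. It is not the simulator's implementation: the class `RoundRobinArbiter`, the boolean request vector and the `outputHasResources` flag are hypothetical.

```java
// Illustrative only: round-robin arbitration for a single output port.
final class RoundRobinArbiter {
    private final int inputs;
    private int next = 0; // input port with the highest priority in the next round

    RoundRobinArbiter(int inputs) {
        this.inputs = inputs;
    }

    /**
     * requests[i] is true if input port i requested this output port.
     * Returns the granted input port, or -1 if nothing is granted, either
     * because there is no request or because the output lacks resources.
     */
    int grant(boolean[] requests, boolean outputHasResources) {
        if (!outputHasResources) return -1;
        for (int offset = 0; offset < inputs; offset++) {
            int candidate = (next + offset) % inputs;
            if (requests[candidate]) {
                next = (candidate + 1) % inputs; // the winner gets lowest priority next time
                return candidate;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        RoundRobinArbiter arbiter = new RoundRobinArbiter(4);
        boolean[] requests = { true, false, true, false };
        for (int cycle = 0; cycle < 3; cycle++) {
            System.out.println("cycle " + cycle + ": granted input " + arbiter.grant(requests, true));
        }
    }
}
```

Rotating the starting point after every grant prevents one busy input port from starving the others, which a fixed priority scheme cannot guarantee.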


To utilize the physical links better and to provide a deadlock avoidance mechanism, the baseline router can also implement virtual channels. As mentioned before, the virtual channel technique is time multiplexing of the physical channel, dividing the physical link into multiple logical links. In architectural terms the VCs are multiple buffers in each input port. When VC zero of port one is used, the data are written to input buffer zero of port one. The choice of which VC to use is not trivial. The simplest scheme is a round robin algorithm that uses one VC at a time. The virtual channels can also be used in a more dynamic way, in which information about the occupancy of each buffer is disclosed, so that the decision about which channel to use is made dynamically.
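The two VC selection schemes can be contrasted in a small sketch of an input port with one buffer per VC. Everything here is an illustrative assumption: the class `InputPort`, the integer flit identifiers and the "most free slots" rule used for the dynamic scheme are not taken from the baseline router design.

```java
import java.util.*;

// Illustrative only: an input port with one FIFO buffer per virtual channel.
final class InputPort {
    private final List<ArrayDeque<Integer>> vcBuffers = new ArrayList<>();
    private final int depth;   // capacity of each VC buffer, in flits
    private int nextVc = 0;    // round-robin pointer

    InputPort(int vcCount, int depth) {
        this.depth = depth;
        for (int i = 0; i < vcCount; i++) vcBuffers.add(new ArrayDeque<Integer>());
    }

    int freeSlots(int vc) {
        return depth - vcBuffers.get(vc).size();
    }

    /** Round robin: the next VC in turn that still has a free slot, or -1 if all are full. */
    int selectRoundRobin() {
        for (int offset = 0; offset < vcBuffers.size(); offset++) {
            int vc = (nextVc + offset) % vcBuffers.size();
            if (freeSlots(vc) > 0) {
                nextVc = (vc + 1) % vcBuffers.size();
                return vc;
            }
        }
        return -1;
    }

    /** Dynamic: the VC with the most free slots, or -1 if all are full (assumed policy). */
    int selectLeastOccupied() {
        int best = -1;
        for (int vc = 0; vc < vcBuffers.size(); vc++) {
            if (freeSlots(vc) > 0 && (best < 0 || freeSlots(vc) > freeSlots(best))) best = vc;
        }
        return best;
    }

    /** Writes a flit (represented by an id) into the buffer of the given VC. */
    boolean write(int vc, int flitId) {
        if (freeSlots(vc) == 0) return false;
        vcBuffers.get(vc).addLast(flitId);
        return true;
    }

    public static void main(String[] args) {
        InputPort port = new InputPort(2, 2);
        port.write(0, 1);
        System.out.println("dynamic choice: VC " + port.selectLeastOccupied());
    }
}
```

The dynamic scheme needs the occupancy of every buffer to be visible to whoever makes the choice, which is the extra information the text refers to.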

A drawback of the baseline router is its hardware cost. A crossbar is always costly and does not scale well in terms of area as the number of input and output ports increases. In a mesh topology, deadlock avoidance and flow control mechanisms can be implemented at low hardware cost. In a different topology, however, the control overhead in the router would result in a significant increase in hardware. The increased hardware, in combination with the crossbar implementation, would be extremely costly to implement.

The baseline router is fast and efficient when used in small networks. With some optimizations it can reduce the latency of a network significantly. Its hardware cost, though, is high, which can be a limiting factor for low-power devices with limited hardware space.

Rotary Router

The rotary router [2] is the second router to be simulated. The rotary router makes use of two rings that are connected to the input and output ports to route the flits. It also uses bubbles in the buffers as a deadlock/livelock avoidance mechanism.

Figure 5: The rotary router layout. [2]

The router consists of three building blocks, the Input, Output and Buffering segments, as shown in figure 5. These blocks form the two rings that are used to route the flits. A flit needs one cycle to move from one element of the router (e.g. an input port) to another element (e.g. a buffering segment). The two rings rotate flits in opposite directions. In a topology where each router is connected to four other routers, such as a torus, the router has five pairs of input and output ports: four serve the connections to the neighbouring routers and the fifth connects the core that injects flits into or consumes flits from the network. Each input port is connected to both rings through a multiplexer. How the direction for a flit is chosen is described below.

The router uses flow control based on bubbles in the buffers to achieve deadlock/livelock avoidance. For a flit to be inserted in a ring, the first buffering segment after the input port must have at least two holes, that is, enough empty memory in the buffer to hold another two flits. If the input port is connected to the core that the router serves, the required number of holes increases to three. In addition, for a flit to advance between buffering segments, each buffer must know its own occupation level and the occupation level of the next buffer in the ring. A buffer sends data forward only if the number of flits in the destination buffer is less than or equal to the number of flits in the current buffer. The routing engine of the router tries to insert the flit in the direction that takes the fewest cycles for the flit to exit the router.
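The injection and advance rules can be written as two small predicates. The following Java sketch is only an illustration: the class and method names are hypothetical, and the extra check that the destination buffer is not completely full is an assumption that the text does not state explicitly.

```java
// Hypothetical sketch of the bubble-based flow control rules of the rotary router.
final class BubbleFlowControl {
    static final int HOLES_FOR_NETWORK_PORT = 2; // input port fed by a neighbouring router
    static final int HOLES_FOR_CORE_PORT = 3;    // input port fed by the local core

    // A flit may enter the ring only if the first buffering segment has enough holes.
    static boolean canInject(int capacity, int occupied, boolean fromCore) {
        int holes = capacity - occupied;
        int required = fromCore ? HOLES_FOR_CORE_PORT : HOLES_FOR_NETWORK_PORT;
        return holes >= required;
    }

    // A buffer forwards a flit only if the next buffer holds no more flits than it does.
    // The capacity check is an added assumption so that a full buffer never receives a flit.
    static boolean canAdvance(int currentOccupied, int nextOccupied, int capacity) {
        return currentOccupied > 0
                && nextOccupied < capacity
                && nextOccupied <= currentOccupied;
    }
}
```

With a buffer of four slots holding two flits, a network port may still inject (two holes remain) while the core may not, which is the asymmetry the deadlock argument relies on.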

The number of complete turns that a flit is allowed to make in a ring is predetermined. When a flit completes that number of turns without leaving the router through the calculated output port, it is marked as a misrouted flit and leaves from the first output port that can serve it. This has two effects. Firstly, it resolves any anomalies caused by miscalculating the destination of the flit. Secondly, if there is congestion at a specific output port, and consequently at the router that the port is connected to, the flit follows another route to its destination instead of waiting for the congestion in the downstream router to clear.
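The turn limit can be sketched as a small per-flit counter. This Java fragment is a hypothetical illustration, not the simulator code: the class name RingTurnTracker is invented, and "first output port that can serve it" is approximated by scanning ports in index order, whereas a real ring would meet them in ring order.

```java
// Hypothetical sketch: tracks ring turns for one flit and decides where it exits.
final class RingTurnTracker {
    private final int maxTurns;      // predetermined turn limit
    private int completedTurns = 0;

    RingTurnTracker(int maxTurns) {
        if (maxTurns < 1) {
            throw new IllegalArgumentException("maxTurns must be positive");
        }
        this.maxTurns = maxTurns;
    }

    // Called each time the flit completes one full turn of the ring.
    void completeTurn() {
        completedTurns++;
    }

    boolean isMisrouted() {
        return completedTurns >= maxTurns;
    }

    // Returns the output port to leave through now, or -1 to keep circulating.
    int exitPort(int calculatedPort, boolean[] portCanServe) {
        if (!isMisrouted()) {
            return portCanServe[calculatedPort] ? calculatedPort : -1;
        }
        // Misrouted: take the first port able to serve the flit (index order is an assumption).
        for (int port = 0; port < portCanServe.length; port++) {
            if (portCanServe[port]) {
                return port;
            }
        }
        return -1;
    }
}
```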

As mentioned before, deadlock/livelock avoidance is achieved through the use of bubbles (holes). The flow control mechanism guarantees that the core attached to a router will stop injecting flits before the other input ports do. This is due to the restriction that the core needs room for at least three other flits in the input buffer before it can inject, as opposed to the other input ports, which need only two. This means that at some point no new flits will be injected into the NoC. Only the existing flits will be served, so nothing will stall. The only elements that can increase the number of flits inside the network are the cores, so the last flit that brings the network to an extreme state will come from a core. If no core consumes any flits, the total number of holes in the network will be 2N + 1, where N is the number of routers. The extra hole can never stay at one router, because after a finite number of cycles a flit in a neighbouring router will be marked as misrouted and moved to the router that has the extra hole. The router that lost the flit then has the extra hole. Since the hole keeps moving in this fashion, the network never reaches a deadlock.

Livelock avoidance comes from the fact that, statistically, no flit can keep following a circular path. When a flit is marked for misrouting, it leaves from the first available port. This means that when the hole moves to another router, an input port of that router will inject a flit into the ring. That frees a place in the input buffer for another router to send to, thus filling the hole. There is, however, the possibility of the hole ping-ponging between two neighbouring routers.

The flow control mechanisms can also be used to avoid starvation. In real-time programs the data flow follows a pattern. This can lead to situations where only one port of a router is used and the others are unable to inject new flits, leading to starvation. To solve this problem, the injection rules can change dynamically: the input port with the most traffic can increase the number of holes that must exist in the next buffer before it inserts a flit. The heavily utilized port thus gives the underutilized ports the chance to insert their flits into the router.

This router design can become costly in terms of area, power consumption and latency. The use of so many buffers has a large impact on the area taken by the router. The fact that flits do not stop moving inside the rings, but circulate until they are marked for misrouting, also affects power consumption. This effect can be reduced if the number of complete turns that a flit has to make before it is marked for misrouting is small. In contrast to the baseline router, if the number of input ports in the rotary router increases, the maximum number of cycles needed by the router to route a flit also increases. The growth rate of the rotary router as the number of input ports n increases is O(n), whereas the baseline router has a growth rate of O(1). Each input port adds one extra cycle to the worst case scenario for the router.

Hierarchical Router

The hierarchical router [3] is the third router to be simulated. The design of the router aims to reduce area and power consumption as much as possible. It mostly uses simple routing algorithms with little or no deadlock/livelock avoidance mechanisms.

The design, as implied by the name, is hierarchical. There is a single routing engine at the heart of the router that is able to handle one flit at a time. The flits are inserted in a FIFO queue and the head of the queue is served by the routing engine. This design serves the goals of low power and small area. When a flit is routed, it is forwarded to one of the output ports of the router. If the output port does not have the resources to serve the request from the routing engine, the whole router is blocked until the port becomes available.

Figure 6: The structure of the hierarchical router

The hierarchical nature of the router lies in the way it handles flits. The router gathers the flits from the input ports (where a flit exists) and then, level by level, the flits are merged until the last level is reached, as shown in figure 6. A drawback of such a design is the number of cycles the router needs to merge all of the flits. The number of cycles needed to merge the flits into one large list is O(log n), where n is the number of input ports. The routing engine then takes the flits one by one and routes them to the appropriate output port.
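To make the level-by-level merging concrete, the Java sketch below counts the merge levels for a given number of input ports. It assumes that streams are merged in pairs at each level, which is an assumption for illustration and may not match the arity of the tree in figure 6; the class and method names are hypothetical.

```java
// Hypothetical sketch: number of merge levels in a pairwise (binary) merge tree.
final class MergeTree {
    // Each level halves the number of streams; an odd stream passes through unmerged.
    static int mergeLevels(int inputPorts) {
        if (inputPorts < 1) {
            throw new IllegalArgumentException("at least one input port is required");
        }
        int levels = 0;
        int streams = inputPorts;
        while (streams > 1) {
            streams = (streams + 1) / 2;
            levels++;
        }
        return levels;
    }
}
```

Under this assumption, a five-port router needs three levels (5 to 3 to 2 to 1 streams), so the merge depth grows with the logarithm of the port count while the single routing engine still serves the merged flits one at a time.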

The router by itself is inefficient in the number of cycles it needs to route the flits. In contrast to the baseline router, where the number of cycles is constant, the cycles required to route all of the flits increase as the number of input ports in the hierarchical router increases. Even with this major drawback, this type of router has many advantages. The most important one is its simplicity. In hardware terms, a simple router consumes less energy and needs less hardware to function. In a wireless sensor network, for example, where the traffic is low and the most critical resource is energy, such routers are ideal.

The router is also capable of using VCs, both to make better use of the physical wires and as a deadlock/livelock avoidance mechanism. Such a mechanism, though, increases the number of transistors in the router, which means more energy and more area. Virtual channels exist mainly to improve link utilization, yet the very nature of the hierarchical router makes it behave poorly in environments with intense traffic. So the use of virtual channels might give the router a way to avoid deadlocks, but it would undermine the low power and small area goals of the implementation. In a mesh topology that uses XY routing, deadlock cannot occur. If the router is used in such an environment, no measures to avoid potential deadlocks are needed.

Research Methodology and Project Planning

Research Methodology

The purpose of the project is to compare the three router architectures at various levels. Firstly, the latency of each router is going to be examined under various workloads and router configurations to identify the strong and weak points of each router. After the functional simulation, a theoretical analysis will be made of the area that the implementation of each router requires. The stages of the research are stated below:

1. Background research:

The background research started with understanding the reasons that led to the need for NoCs. After understanding which problems NoCs are trying to solve, the various problems and limitations that emerged from implementing NoCs were studied. The three main points studied were:

• What is a NoC

• Which are the problems that a NoC solves

• What are the limitations of NoCs

Then the three different architectures were studied in detail. In every case, the assumptions made for each of the three architectures were taken into consideration to provide a general idea of which limitation of NoCs each of them is trying to tackle.

2. Functional simulation:

Each router will be built and simulated in Java. The routers will have a variable number of resources so that various scenarios can be simulated. The baseline router will be built first, then the rotary router and lastly the hierarchical router. After the routers are built, the complete system will be built. The system can simulate various scenarios. The elements that change in each scenario are the resources (memory, number of VCs), the traffic and possibly the topology. The traffic is one of the most important aspects of simulating NoC routers, so special care will be given to developing traffic that stresses the routers and exposes their weak and strong points. Each topology has different routing algorithms, and different routing algorithms might have deadlock/livelock problems. Different topologies will be simulated if there is time to implement the new routing algorithms and the deadlock/livelock avoidance mechanisms. During the building of the routers and the system there will be continuous testing with predetermined traffic to check the correctness of the routers. Testing the correctness of a network with one node and two flits is easy. For more complex traffic, a correctness engine will be built that monitors the traffic and ensures that the system functions as it should. There will also be a final testing stage where the data for the routers will be gathered and analysed.

3. Testing and analysing the data:

As mentioned before, custom benchmarks will be built to stress the functionality of the routers. Each router has different characteristics and handles different kinds of traffic efficiently. Based on these characteristics, benchmarks will be built that favour one router at a time, while observing how the other two routers react to the same traffic. Also, to gain a broader perspective, benchmarks will be built that do not favour any of the routers, to test how the routers behave in the worst case scenario. After building the traffic, the system will be simulated and statistics about the network will be collected. The benchmarks will be produced by a benchmark engine that instructs each core when to produce a flit, what type of flit to produce (stream or standalone flit) and what its destination is. Instructing the cores to generate traffic that either favours or does not favour a specific router, while all the routers are tested under the same traffic, will give a general picture of the behaviour of each router compared to the others.

Data Collection

The conclusions that will be reached depend mostly on the data that will be gathered. Most of the data collected will concern the latency of the router. The latency of a router at a random point in time, though, does not mean much on its own. It has to be correlated with what is going on in the whole NoC.

The data will be collected by the flits as they traverse the network. After a flit reaches its destination, the data it collected will be saved in a file, which will later be analysed to extract the needed measurements. Each flit will collect the following measurements:

• Flit ID (a unique identifier of the flit)

• Start cycle (the simulation will be cycle accurate, so the start time is the ID of the cycle in which the flit is created)

• End cycle

• Hop counter

• Route followed

An analyser will collect the flits after the simulation is finished. Based on the measurements taken in

the flit the analyser will produce meaningful data. The data will be interpreted and the appropriate

conclusions will be drawn.
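The per-flit measurements and a first analysis step could look like the following Java sketch. All names here (FlitTrace, averageLatency, toCsvLine) and the choice of a comma-separated line as the file format are assumptions for illustration; the report does not specify them.

```java
// Hypothetical sketch of the measurements carried by one flit and a simple analysis.
final class FlitTrace {
    final int flitId;        // unique identifier of the flit
    final long startCycle;   // cycle in which the flit was created
    final long endCycle;     // cycle in which the flit reached its destination
    final int hopCount;      // number of routers traversed
    final String route;      // route followed, stored as text (format is an assumption)

    FlitTrace(int flitId, long startCycle, long endCycle, int hopCount, String route) {
        this.flitId = flitId;
        this.startCycle = startCycle;
        this.endCycle = endCycle;
        this.hopCount = hopCount;
        this.route = route;
    }

    // Latency of a single flit in cycles.
    long latency() {
        return endCycle - startCycle;
    }

    // Mean latency over all collected flits; 0.0 for an empty set.
    static double averageLatency(FlitTrace[] traces) {
        if (traces.length == 0) {
            return 0.0;
        }
        long total = 0;
        for (FlitTrace t : traces) {
            total += t.latency();
        }
        return (double) total / traces.length;
    }

    // One line of the results file, to be parsed later by the analyser.
    String toCsvLine() {
        return flitId + "," + startCycle + "," + endCycle + "," + hopCount + "," + route;
    }
}
```

Keeping the start and end cycles, rather than only the latency, lets the analyser correlate each flit with the state of the whole NoC at the time it was in flight.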

Project Planning

As shown by the research methodology, the project is divided into three major sections: the background research, the building of the routers and the testing of the system to obtain statistics for each router. Each of these three major sections was subdivided into single tasks so that the project can be completed in a modular fashion.

As shown in figure 7 in Appendix A, the project is divided into single tasks, some of which run in parallel. The parallel tasks are tasks that have no dependencies between them. They might affect one another, but that does not mean they cannot be executed in parallel. The first task was studying and understanding the background of NoCs. Acquiring this knowledge took a little over a month and set the foundation for starting work on the project. After that, a week was devoted to writing the preliminary report, which set the goals of the project. The building of the routers follows. The process is again divided into three stages: the development of the baseline router, the development of the rotary router and the development of the hierarchical router. The development of the other two routers is expected to take less time than the development of the baseline router, because the building blocks and the testing strategy of the routers are the same whatever the functionality. In parallel with the development of the routers runs the writing of the progress report. The progress report was affected by the development of the router functionality, but not to a degree that prevented the two from being completed in parallel. The hierarchical and rotary routers will be completed after the second milestone, the progress report. The completion of the standalone routers is the third milestone.

When the basic functionality of the routers is finished, the system incorporating the routers will be developed. There will be a way of changing the variables of the routers quickly so that the simulations can be run efficiently. A graphical user interface (GUI) will enable the user to change the various aspects of the routers, such as:

• Topology

• Number of input and output buffers

• Size of input and output buffers

• Number of Virtual Channels

• Type of traffic to be simulated

• Type of router to be simulated

The system will be able to produce results either as raw data or as ready-made graphs. The system, along with the routers, will then be tested to ensure the correctness of the simulator. This will be the final test of correctness, but the functionality of each router will be tested at every step. The complete system is the fourth milestone of the project.

After finishing the system the development of the benchmarks will start. A benchmark engine will be

constructed. The benchmark engine will instruct the cores simulated in the system as to what kind of

traffic to produce. The traffic generated will be constructed to favour one router at a time. Also a

worst case scenario will be developed for each router.

Finishing the development of the benchmarks marks the beginning of data collection for the system. The benchmarks will be applied to the system while changing the various aspects of the NoC, such as topology, router type, size of buffers and number of virtual channels. Work on the final report will start alongside the testing and data collection. The aim for that stage is to finish the parts of the final report that are not affected by the testing of the system, such as the background work.

Finally, when the data are collected, the analysis of the data will begin. The collected data will be analysed to produce results that will lead to the final conclusions. A theoretical analysis of the routers will also be made to identify the area that each router will need when implemented in hardware. With the results and conclusions in hand, the final stage of the project will start: writing the final report. The conclusions drawn from the theoretical analysis and the results obtained from the simulations will be presented and discussed. A draft report will be delivered to my supervisor, and the report will be revised according to the supervisor's comments. Then, by September the 6th, the final report will be delivered along with the produced code.

The project plan changed from the initial plan in the preliminary report. The plan changed because some factors had not been taken into consideration, such as the intense coursework of other modules, the exams of the spring semester and the time needed to complete the progress report.

Project Progress

Up until now (May 2013) the baseline router has been completed. The development of the routers is not going as planned: the whole project is three weeks behind schedule. This is mostly due to underestimating the coursework of the second-semester modules. The rotary router is currently being developed. This part of the report describes the way the routers are developed.

Building the Baseline router

To simplify the design of the routers, various classes were constructed, for two main reasons. First, having distinct classes greatly helps to reuse them with minor or no changes, thus reducing the development time of the subsequent routers. Second, since the behaviour of hardware is simulated, it is easier to divide the parts of the hardware into different classes. This gives a clearer picture of how the code works to someone reading it, and it is easier to program. Not all of the classes represent a hardware module; some are data structures that help the development of the functionality. The classes are described below:

Coordinate: The Coordinate class holds two integers x and y. The integers represent a point on a

grid. The Coordinate class is mainly used to define the point of a router on a grid. Thus the router

gains a unique id in a 2D plane.

Flit: This class currently defines the flits that are exchanged in the router. It has four fields. The first two fields are of type Coordinate and define the source and the destination of the flit so that routing can be performed. The third field is the flit header. The header indicates the id of the flit in its stream, which makes it possible to identify when all of the flits have arrived. In a standalone flit the flit header is -1. The headers are numbered in descending order: the first flit has the highest id and the last flit has a flit header of 1.
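The Coordinate and Flit descriptions can be sketched in Java as follows. This is an illustration rather than the project's classes: only the three fields described in the text are modelled (the fourth field is not described, so it is left out), and the helper methods stream(), isLastOfStream() and isStandalone() are hypothetical additions.

```java
// Hypothetical sketch of the Coordinate and Flit data classes.
final class Coordinate {
    final int x;
    final int y;

    Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

final class Flit {
    static final int STANDALONE = -1; // header value of a standalone flit

    final Coordinate source;
    final Coordinate destination;
    final int header; // position in the stream, counted down to 1

    Flit(Coordinate source, Coordinate destination, int header) {
        this.source = source;
        this.destination = destination;
        this.header = header;
    }

    // Builds a stream whose headers run from length (first flit) down to 1 (last flit).
    static Flit[] stream(Coordinate source, Coordinate destination, int length) {
        Flit[] flits = new Flit[length];
        for (int i = 0; i < length; i++) {
            flits[i] = new Flit(source, destination, length - i);
        }
        return flits;
    }

    boolean isLastOfStream() {
        return header == 1;
    }

    boolean isStandalone() {
        return header == STANDALONE;
    }
}
```

Numbering the headers downwards means a router can tell that a stream has ended simply by seeing a header of 1, without knowing the stream length in advance.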

Buffer: This class describes the functionality of a first in first out (FIFO) buffer. The class wraps an array of Flit objects, as defined in the previous class. The size of the buffer is configurable, which gives an easy way to increase or decrease the resources of the system. The class provides the functionality to peek at the head of the buffer, to remove flits from the head and to add flits to the tail.

Link: The Link class is responsible for connecting the routers to form the topology. It holds the details of the router that a specific port (either input or output) is connected to: the destination Router ID, which is of type Coordinate as described above, and the ID of the (destination) port that the source port is connected to.

Flit Records: The flit record is a structure that helps in the routing. As mentioned before, the routers use wormhole routing. Wormhole routing requires a record of the decision made for the first flit of a stream. The flit record does just that: it holds information about a worm of flits in order to make decisions for the flits to come. The information kept is the source router, the destination router, the output port through which the flits are routed to reach the next hop, and the count of flits passed.

Port: One of the most complex classes of the design, the Port class makes use of the classes described above. It consists of an ID, a Link, the number of virtual channels, the Input Channels and an integer buffer size. The ID uniquely identifies a Port in the Router structure; each port is connected to a different component, so it has to be uniquely identified. The Link gives the information as to which port of which router a specific port is connected to. The number of virtual channels is an integer that simply gives the number of virtual channels in the port. It is variable, so that scenarios with different numbers of virtual channels can be tested. The Input Channels structure is an array of buffers whose length is the number of virtual channels. The size of each buffer is set by the buffer size, which is passed as an argument to the constructor of the Port class.

Router: The Router class uses all of the classes described above to describe the functionality of the whole baseline router. Its attributes are the input ports (an array of Ports), the output ports (again an array of Ports), an array of Flit Records that holds the decisions for subsequent flits, the number of virtual channels, the ID of the router (a Coordinate) and two integers that give the number of input and output ports respectively. All of the values that need to be initialized, such as the number of input and output ports, the number of virtual channels and the ID of the router, are initialized in the constructor of the class. The Port arrays are also initialized in the constructor.

A method called CreateConnections() establishes the connections of each router. It takes two array arguments: the first is of type Coordinate and the second is an integer array. The first element of the first array holds the ID of the router that the first port is going to be connected to. The first element of the second array holds the port, in the router identified by the first array, that the first port of this router is connected to. These identifiers (Router ID, Port ID) initialize the Link object that resides in every port.

The second most important method of the Router class is RoutePackets(). The method takes as its argument the VirtualChannelID, an integer that identifies the virtual channel used in the current cycle. Routing in the baseline router takes place in two phases. The first phase is the Port Request phase. All of the input ports are traversed, and those that have a flit at the head of their buffer that needs routing set their request in a Boolean array, in which each element represents an input port. If the flit belongs to a stream whose output port has already been determined, the decision made for the preceding flits is looked up and applied. Otherwise, the next hop is calculated from the destination of the flit and the coordinates of the router. The next hop determines the output port to which the flit has to be forwarded to reach the next router. The routing performed is XY routing. The second phase is the arbitration phase, in which the output ports choose which of the input port requests they will satisfy. A new Boolean array is constructed to record whether an output port has already satisfied a request in this cycle. All the input port requests are examined. If two input ports want to send a flit through the same output port, only one of the two is satisfied in a cycle. The flit is then deleted from the input port and injected into the output port. If the output port buffer is full, the flit does not move.
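The two routing phases can be illustrated with a compact Java sketch of XY next-hop selection followed by arbitration. The port numbering (LOCAL, EAST, WEST, NORTH, SOUTH), the convention that y grows towards NORTH, and the rule that the lowest-numbered input wins a conflict are all assumptions made for this example; the report does not state which request is favoured.

```java
// Hypothetical sketch of the request and arbitration phases of the baseline router.
final class BaselineRoutingSketch {
    // Port numbering is an assumption for illustration.
    static final int LOCAL = 0, EAST = 1, WEST = 2, NORTH = 3, SOUTH = 4;

    // XY routing: correct the X coordinate first, then the Y coordinate.
    static int xyNextPort(int routerX, int routerY, int destX, int destY) {
        if (destX > routerX) return EAST;
        if (destX < routerX) return WEST;
        if (destY > routerY) return NORTH;
        if (destY < routerY) return SOUTH;
        return LOCAL; // the flit has arrived and is delivered to the core
    }

    // requestedOutput[i] is the output port wanted by input i, or -1 for no request.
    // Returns granted[i] == true when input i may forward its flit this cycle.
    static boolean[] arbitrate(int[] requestedOutput, int outputCount) {
        boolean[] outputTaken = new boolean[outputCount]; // one grant per output per cycle
        boolean[] granted = new boolean[requestedOutput.length];
        for (int in = 0; in < requestedOutput.length; in++) {
            int out = requestedOutput[in];
            if (out >= 0 && !outputTaken[out]) {
                outputTaken[out] = true;
                granted[in] = true;
            }
        }
        return granted;
    }
}
```

A fixed-priority scan like this one is the simplest arbiter but can starve high-numbered inputs; a round robin starting point would be a natural refinement.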

NoC: This is the top level class of the system. It is the class where the connectivity data are generated and passed to the routers, and also the class that generates traffic and injects it into the routers. The NoC is an array of routers. Each router is connected to other routers and thus the topology is formed. A for loop acts as the clock of the system. Inside it, a nested for loop calls the RoutePackets() method on all of the routers, which moves the flits from the input ports of each router to its output ports. The SendMessages() method of the NoC class then sends the flits from the output buffer of one router (the source) to the input port of another router (the destination). The method takes as an argument the virtual channel ID used in the cycle and checks the output buffers of all the routers. If a router has a flit to send, the method uses the Link object to find where to send it. If the input port at the other side of the link has an empty slot, the flit is deleted from the output port and placed in the destination input port.

The baseline router was tested with traffic whose result was easy to predict. This was done to ensure the correctness of the routers on a small scale. For further testing, though, a debugging tool will be developed to check the correctness of the routes the flits follow.

Summary and future work

The shift of the industry towards building processors that have multiple cores connected via a Network on Chip is a one-way road. The scaling down of transistors and the uneven scaling of wires have increased the cost of wires in manufacturing, performance and area. The complexity of modern processors also requires a more modular design that enables designers to reuse hardware efficiently, thus reducing the development time of a processor. For these reasons NoCs are a necessity. NoCs, though, face many problems, and intense research is being done in the area to improve them to a level where they can be applied to the processors used by an average user.

In this project the functionality of three router architectures will be tested. Through the simulations, conclusions will be drawn about the kind of traffic that each router handles best, thus identifying the application context in which each router can be deployed. There will also be a theoretical analysis of the power consumption and area needed by the routers. Comparing the routers at these various levels will provide a specific profile of the contexts in which each router can be used.

Up until now, one of the three standalone routers has been developed and tested with basic scenarios to ensure its correct behaviour. Due to misjudgement, the plan made in the preliminary report was not accurate enough, and adjustments were made to reschedule the tasks.

In the days to come, the behavioural models of the rotary and hierarchical routers will be finished. After that, a system will be developed that encloses the architectures and enables fast and easy testing, through the ability to easily change the various aspects of the routers, such as the number of virtual channels, the size of buffers, the types of traffic and, if possible, the topology. After the data from the functional simulations are gathered, a theoretical analysis will be done on the area and the power consumption of the routers. Together with the statistical analysis of the simulation data, the theoretical analysis will provide a more general idea of the various aspects of the routers.

References

[1] Mullins, Robert, Andrew West, and Simon Moore. "Low-latency virtual-channel routers for on-chip networks." ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society, 2004.

[2] Abad, Pablo, et al. "Rotary router: an efficient architecture for CMP interconnection

networks." ACM SIGARCH Computer Architecture News. Vol. 35. No. 2. ACM, 2007.

[3] Plana, Luis A., et al. "A GALS infrastructure for a massively parallel multiprocessor." Design & Test

of Computers, IEEE 24.5 (2007): 454-463.

[4] Borkar, Shekhar. "Design challenges of technology scaling." Micro, IEEE 19.4 (1999): 23-29.

[5] Ronen, Ronny, et al. "Coming challenges in microarchitecture and architecture." Proceedings of

the IEEE 89.3 (2001): 325-340.

[6] "Intel 22nm 3-D Tri-Gate Transistor Technology." Intel 22nm 3D TriGate Transistor Technology

Version History. Ed. Patric Dalvin. IntelIPR, 20 May 2011. Web. 21 Apr. 2013.

[7] Flynn, Michael J., Patrick Hung, and Kevin W. Rudd. "Deep submicron microprocessor design

issues." Micro, IEEE 19.4 (1999): 11-22.

[8] Bjerregaard, Tobias, and Shankar Mahadevan. "A survey of research and practices of network-on-chip." ACM Computing Surveys (CSUR) 38.1 (2006): 1.

[9] Mishra, Asit K., Narayanan Vijaykrishnan, and Chita R. Das. "A case for heterogeneous on-chip

interconnects for CMPs." Computer Architecture (ISCA), 2011 38th Annual International Symposium

on. IEEE, 2011.

[10] Das, Reetuparna, et al. "Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs." High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on. IEEE, 2009.

Appendix A

Figure 7: The project plan for the dissertation
