University of Manchester
School of Computer Science
Exploring router microarchitecture for
Networks on Chip
Progress Report
Student Name: Panayiotis Englezakis
Supervisor: Javier Navaridas
Contents
Abstract
Introduction
Project Goals
Report Structure
Background
The need for Networks on Chip
Network on chip Architectures
Baseline Router
Rotary Router
Hierarchical Router
Research Methodology and Project Planning
Research Methodology
Data Collection
Project Planning
Project Progress
Building the Baseline router
Summary and future work
References
Appendix A
Abstract
Since Networks on Chip (NoCs) are a necessity in modern processor design, there has to be a clear
picture of the context in which each router architecture performs best. The main router architecture
is the crossbar based architecture. The rotary and hierarchical architectures are proposals that
reduce the power and area of the crossbar based router. There has been no comparison between all
three router architectures. Comparing them can make clear in which context each of the architectures
performs best. The comparison will be done via a functional simulation of all three architectures.
The architectures will be simulated under various kinds of network traffic and with different router
configurations to identify the environment in which each router works best. A theoretical analysis
of the area and power consumption of the routers will also be done. The project aims to identify the
environment in which a router performs best with respect to the traffic that it will be handling and
the power and area restrictions of the device it will be working on.
Introduction
Since the invention of the transistor there has been a revolution in digital logic design. Smaller
and more complex integrated circuits made their appearance, amongst them general purpose
processors. Since then transistors have kept getting smaller, reaching deep submicron dimensions in
the last decade. As processors became more powerful, new computationally intensive, massively
parallel applications made their appearance.
Transistors scaled down better than the wires that connect them, so on chip communication became
expensive. Because wires do not scale down well, they take up a lot of chip area that could be used
to implement functional units. Data also propagate more slowly on modern CPUs, because smaller wires
have increased resistance. The appearance of computationally intensive, massively parallel
applications stressed on chip communication even further. In addition, wires are now so close to
each other that the capacitance between them causes crosstalk noise.
The increasing complexity of processors means that they need a lot of man hours to be designed and
tested. Due to the nature of the processors, component reuse is not an easy task because there are
not a lot of components to be reused.
The solution to these problems came in the form of a network. Networks on Chip are networks composed
of links, routers and nodes, like conventional computer networks. The difference is that they are
implemented on-chip. The nodes in this case are the processing cores of the processor (CPU). Having
a modular design enables easy reuse of the components the processor consists of. Also, since
massively parallel applications have made their appearance and parallelization is the future of
computing, processors have to keep up. The only way to sustain the on chip communications of a chip
is the use of networks.
Project Goals
This project aims to examine the major router architecture (the crossbar based router) and two other
architectures proposed to reduce power and area consumption. The three router architectures to be
examined are the crossbar based router [1], the rotary router [2] and the hierarchical router [3].
These architectures are going to be simulated under various kinds of network traffic to identify the
context in which each of them performs best.
The functional simulator is going to be built in Java. The routers are going to be simulated in a
complete network under various configurations. The complete network will be called the system from
now on. The configurable aspects of the system include the number of routers to be simulated and
various characteristics of the routers, such as the number of virtual channels and the size of the
input buffers. The system will be tested under network traffic that is targeted at the specific
architectures. All the routers are going to be simulated under all kinds of traffic so that their
behaviour can be recorded. A theoretical analysis of the routers will also be done with respect to
their power consumption and the area needed to implement them in hardware.
Upon completion, the project aims to build a profile for each router. The profile will be built from
the statistics collected from the simulation and the theoretical analysis. The routers will be
classified with respect to the traffic they can handle best, their power consumption and the area
they need in hardware. No previous work has compared all three routers at the same time.
Report Structure
The remainder of the report is structured as follows:
Chapter 2 – Background:
Chapter 2 further explains the reasons that led to NoC design. It also describes in detail various
techniques used in designing NoCs and then describes the three architectures to be simulated.
Chapter 3 – Research Methodology and Project Planning:
Chapter 3 explains the methodology that will be followed to construct the experiments and the way
that data are going to be collected. It also states the plan made up until completion of the
project.
Chapter 4 – Project progress:
Chapter 4 explains the progress made in the project. A detailed description of the classes used is
provided as the basic functionality of the routers is explained.
Background
The need for Networks on Chip
As computer systems evolve they become ever more complex. Modern technology also scales down the
size of transistors, enabling more transistors to fit on a square inch [4]. Technology has advanced
so much, and the complexity of computer systems has increased to such a point, that there is no room
for further evolution.
Similar problems in the past were the force that radically changed the way computer systems were
developed. For example, in the early days the dominant opinion was that memory was cheap and
transistors were expensive [5]. When this problem was solved, transistors became fast and cheap, but
memory became slow relative to the ever increasing CPU speed. CPU designers overcame this problem by
introducing an idea that was revolutionary at the time but is commonplace today: caches. Caches gave
the CPU the impression that it had an unlimited amount of fast memory at its disposal. With smart
cache designs and replacement policies, CPUs overcame the slow memory problem.
Along the same lines, today’s processors face problems that demand a radical change in their design.
Firstly, the scaling of transistors has caused many problems. One problem is that transistors
nowadays do not strictly behave as switches [4]. The drain and the source are so close to each other
that there is leakage current. So even when the transistor is idle, current leaks through it and
contributes to the high power consumption of modern computer systems. The smallest transistor used
in a commercial CPU is 22nm and is used by Intel in its latest technology, Ivy Bridge [6]. The other
problem is that wires do not scale down as transistors do. Smaller wires mean increased resistance,
which slows the transmission of data. The wires are also so close to each other that there is
crosstalk noise from one wire to another.
Manufacturing transistors is not an easy task. Transistors are manufactured using lithography (the
use of ultraviolet light to pattern the transistors on silicon). Some of the components of the
transistors are even smaller than the wavelength of the ultraviolet light used in the manufacturing
process. This results in transistors that exhibit a lot of variation. Tackling this variation is not
an easy task. CPUs are designed deterministically. Designing CPUs probabilistically means that extra
silicon has to be used to tackle the variability, taking space on the chip that could be used for
extra memory or an extra functional unit.
For many years the CPU design community focused on increasing instruction level parallelism (ILP) to
gain extra speed up. As in every phase of CPU design history, an idea is abandoned when there is
nothing more to gain from it in terms of performance. Once ILP was exhausted, a new approach was
introduced: thread level parallelism (TLP). Before TLP was introduced, designers could easily
achieve the desired performance just by finding ways to increase the clock speed of the CPU. In the
mid 2000s the clock speeds of CPUs were above 3.5 GHz. Because speeds like that could not be
sustained any more, a new idea in CPU design was proposed.
The new trend was designing CPUs with multiple identical cores. Clock speeds decreased and the cores
themselves became simpler, yet the combined computing power of the CPU increased. This created an
urgent need for networks on chip (NoCs). A single bus could sustain two cores, but the number of
cores has kept increasing since then and still does. A bus does not scale well as more cores are
added, and its arbitration mechanism would become the centre of the chip. CPU history clearly shows
that centralized structures are a bad solution because of the congestion they create. The simplicity
of the bus gives it a distinct advantage over the NoC in design time, but the bus performs worse
than a NoC. NoCs scale well with the number of cores, and various topologies and routing algorithms
can be implemented according to the nature of the CPU (e.g. server or client). To sustain the amount
of traffic between multiple cores, NoCs had to be implemented.
In a few words, computation has become cheaper, and in that fashion new computationally intensive
applications were constructed. Communication, though, is becoming more expensive. This is due to the
increased amount of on-chip traffic and the physical limitations of the wires. As mentioned above,
the increased resistance that results from scaling the wires makes the time of flight of an
electrical signal from source to destination longer.
Another problem faced by modern CPUs is synchronisation. A single oscillator cannot synchronise the
whole chip. This is again due to the nature of the wires, the latency of transmission and the
crosstalk between the wires. A solution to this problem is to divide the CPU into independent units
that each work with their own clock. This creates a globally asynchronous, locally synchronous
scheme. The NoC helps in that direction: it can transfer data between the cores of the CPU without
the cores having to synchronise with each other.
Modern CPUs have more than just simple identical cores inside them. Designers also incorporate
simple GPU cores, which transforms the CPU into a System on Chip (SoC). SoCs are by nature
heterogeneous. Each module does different computations and produces different results. For these
modules to communicate, a level of abstraction is needed, and the NoC offers it. Each module sends
its data, and the NoC transforms the data and delivers a meaningful message to the other module.
Someone might argue that there is no need for a NoC to do that, since the modules can be designed to
receive and interpret data from heterogeneous modules. That is correct. But adding this level of
abstraction means that the heterogeneous modules can be reused in another design, significantly
reducing the design time of a system.
Network on chip Architectures
Before describing the NoC architectures, some terminology should be explained for easier
understanding. First, the deadlock and livelock states will be described. Deadlock is a state that a
system enters when the resources that packets need in order to reach their destination are held by
other packets, which in turn wait for resources held by the first packets. The resource dependency
path formed is circular, and no packet can advance unless another one advances. The livelock state
is similar to deadlock. In this case the resources constantly change state while waiting for another
resource to be released, so activity continues but nothing advances. The system is stalled again.
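The circular dependency that characterises a deadlock can be illustrated with a small sketch. It is not part of the simulator described in this report: the class name `WaitForGraph` and its methods are hypothetical, and it assumes each packet is identified by a string and that "packet A waits for a resource held by packet B" is recorded as a directed edge from A to B. A deadlock then corresponds to a cycle in that graph, which a depth-first search can detect.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: an edge waiter -> holder means that the waiter needs
// a resource (for example a buffer) that the holder currently occupies.
class WaitForGraph {
    private final Map<String, List<String>> waitsFor = new HashMap<>();

    void addWait(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new ArrayList<>()).add(holder);
    }

    // True when the recorded waits contain a cycle, i.e. a deadlock.
    boolean hasDeadlock() {
        Set<String> onPath = new HashSet<>();
        Set<String> finished = new HashSet<>();
        for (String packet : waitsFor.keySet()) {
            if (visit(packet, onPath, finished)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first search that tracks the packets on the current path.
    private boolean visit(String packet, Set<String> onPath, Set<String> finished) {
        if (onPath.contains(packet)) {
            return true;   // returned to a packet on the current path: cycle
        }
        if (finished.contains(packet)) {
            return false;  // already fully explored without finding a cycle
        }
        onPath.add(packet);
        for (String holder : waitsFor.getOrDefault(packet, Collections.emptyList())) {
            if (visit(holder, onPath, finished)) {
                return true;
            }
        }
        onPath.remove(packet);
        finished.add(packet);
        return false;
    }
}
```

With the waits A to B and B to C recorded, `hasDeadlock()` returns false. Adding the wait C to A closes the cycle and it returns true.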
Continuing with the terminology, virtual channels are described next. A virtual channel is the
logical division of a physical link into logical channels. Each virtual channel has its own
exclusive input and output buffers.
As NoCs evolved, several architectures for implementing them were proposed. In contrast to computer
networks, emphasis is given to different aspects of the network. For example, in computer networks
the critical aspect is that the packet eventually arrives at its destination, with a latency of some
milliseconds or, in the worst case, some seconds. In NoCs, on the other hand, latency is a major
constraint. Packets have to arrive at their destination as fast as possible. Large NoCs nowadays are
used in supercomputers or large servers. The purpose of a supercomputer is to run large simulations
or analyse data fast. If the NoC were a bottleneck for the processing power of the supercomputer, it
would defeat that purpose.
Another major constraint is power. Modern CPUs can consume around 150W of power [7]. This
consumption by itself causes problems because it can build up temperatures of hundreds of degrees
Celsius. High temperatures can damage the chip, so the temperature of the chip has to be kept low.
This is done at the expense of more power consumption, either by fans and heat sinks or by water
cooling. This means that a NoC has to be power efficient so that it does not add power consumption
that the chip cannot handle.
Another important aspect is area. The limited space on a chip means that if the NoC consumes a large
amount of area, there will be less area for computation. This again contrasts with modern computer
networks, where routers often fill whole racks and the accumulated length of the wires in a network
can reach many kilometres. In addition, due to the limited resources and depending on the routing
algorithms used, NoCs can be deadlock and livelock prone. Modern NoC architectures therefore have
three critical aspects: low latency, power and area.
Messages between cores in the same CPU are often too large to be transmitted in one piece, so
packets are broken down into smaller chunks called flits. The terminology used to describe the units
that travel in a NoC is shown in figure 1. The message is the top level entity and describes the
actual message that a core wants to pass to another core. The packet is the actual entity that will
be sent. A packet can be broken down into flits (flow control units), and flits break down into
phits (physical units). In a NoC the wires transferring data from one core to another are wide, so
most of the time a phit is the same entity as a flit.
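The breakdown of a packet into flits can be sketched as follows. This is only an illustration and not the report's simulator code: the `Flit` and `Packetizer` classes, the head/body/tail marking and the byte-array payload are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical flit type: the kind marks where the flit sits in its packet.
class Flit {
    enum Kind { HEAD, BODY, TAIL, HEAD_TAIL }

    final Kind kind;
    final int packetId;
    final byte[] payload;

    Flit(Kind kind, int packetId, byte[] payload) {
        this.kind = kind;
        this.packetId = packetId;
        this.payload = payload;
    }
}

class Packetizer {
    // Splits the packet data into flits of at most flitBytes bytes each.
    static List<Flit> toFlits(int packetId, byte[] data, int flitBytes) {
        if (flitBytes <= 0) {
            throw new IllegalArgumentException("flitBytes must be positive");
        }
        // Ceiling division; an empty packet still produces one (empty) flit.
        int count = Math.max(1, (data.length + flitBytes - 1) / flitBytes);
        List<Flit> flits = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int from = i * flitBytes;
            int to = Math.min(data.length, from + flitBytes);
            Flit.Kind kind;
            if (count == 1) {
                kind = Flit.Kind.HEAD_TAIL;   // single-flit packet
            } else if (i == 0) {
                kind = Flit.Kind.HEAD;
            } else if (i == count - 1) {
                kind = Flit.Kind.TAIL;
            } else {
                kind = Flit.Kind.BODY;
            }
            flits.add(new Flit(kind, packetId, Arrays.copyOfRange(data, from, to)));
        }
        return flits;
    }
}
```

For a 10 byte packet and 4 byte flits, `Packetizer.toFlits` returns three flits: a head and a body flit of 4 bytes each and a tail flit carrying the remaining 2 bytes.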
There are three major schemes that apply flow control at packet level [8]. The first one is
store-and-forward routing. The core sends a packet (a collection of flits). When the packet reaches
the node on its next hop, the node has to store the whole packet before forwarding it. This can
cause problems when the next node on the route does not have enough space in its input buffers to
receive the packet. In such a case the packet stalls.
The second is wormhole routing. In wormhole routing the packet is divided into flits. When the first
flit arrives at the next hop, the information contained in its header is analysed, the next hop is
calculated and the flit is forwarded to the next router. When subsequent flits of the same packet
arrive, the node applies the decision made for the first flit and forwards them to the same next
hop. Thus the packet is spread across different nodes until it reaches its destination, just like a
worm. The problem with wormhole routing is that it cannot dynamically adapt the route of the flits
once the decision on the first flit is made. If there is congestion in one node, all of the flits
are still forced to follow the route taken by the first flit. If a flit in the middle of the worm
stalls, the flits following it also stall. Consequently the worm can stall until the blocked flit
moves to the next hop. Apart from the extra latency introduced, wormhole routing wastes resources in
each router in which a part of the worm resides. The latency of a packet in wormhole routing
increases linearly with respect to the number of hops (distance) and the size of the packet, so
$\text{Latency} = \text{hops} \times \text{size}$ in wormhole routing.
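The way the routing decision of the first flit is reused for the rest of the worm can be sketched as below. The class `WormholeInput` and its `route` method are hypothetical names chosen for this example, and the sketch assumes that flits carry a packet id and that the last flit of a packet is marked as the tail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one router input that applies wormhole routing.
class WormholeInput {
    // packet id -> output port chosen when the head flit was routed
    private final Map<Integer, Integer> lockedOutput = new HashMap<>();

    // Returns the output port for a flit. computedPort is the result of the
    // routing algorithm; it is only honoured for the head flit.
    int route(int packetId, boolean isHead, boolean isTail, int computedPort) {
        int port;
        if (isHead) {
            port = computedPort;
            lockedOutput.put(packetId, port);   // the decision is made once
        } else {
            Integer locked = lockedOutput.get(packetId);
            if (locked == null) {
                throw new IllegalStateException("flit arrived before its head flit");
            }
            port = locked;                      // follow the head, no re-routing
        }
        if (isTail) {
            lockedOutput.remove(packetId);      // the worm has fully passed
        }
        return port;
    }
}
```

If the head flit of packet 7 is routed to port 2, every later flit of packet 7 is also sent to port 2, whatever port the routing algorithm would compute at that moment. This is exactly the inability to adapt the route that the paragraph above describes.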
Figure 1: The terminology of the data traversing a NoC
Figure 2: Possible topologies for a NoC: (a) a bus, (b) a mesh, (c) a torus and (d) a tree topology.
The third is virtual cut-through routing. In virtual cut-through the same algorithm as in wormhole
routing is enforced. The only difference is that before a flit is forwarded to the next node, the
sending node ensures that the receiving node has enough space in its buffers to accept the whole
packet.
An architect can use any of these schemes depending on what is wanted from the NoC. The
store-and-forward scheme has less data in its header, but it needs a lot of control overhead and a
guarantee before it can transmit a packet. Virtual cut-through has less control overhead, but it
carries extra information in the header and again needs reassurance that there is space at the
destination before transmitting. Wormhole routing has extra information in the header, has less
control overhead and does not need a guarantee to continue. In virtual cut-through and
store-and-forward the latency is the sum of the number of hops (distance) and the size of the
packet, $\text{Latency} = \text{hops} + \text{size}$.
The most common technique for improving the performance of the network while also providing a
deadlock avoidance mechanism is the virtual channel (VC). This comes at the expense of extra input
buffers in each node. A deadlock can be avoided by choosing a different virtual channel to transmit
the data, thus breaking the cycle in the resource dependency graph. VCs can also improve performance
by better utilizing a link. Using VCs is similar to having multiple logical links over a physical
wire.
As in computer networks, NoCs have various topologies [8]. The simplest one is the bus topology
shown in figure 2a. All the nodes are connected to a single bus, which manages the transmission
between the nodes. Then there is the mesh topology shown in figure 2b, where every node is connected
to its neighbours. Figure 2c shows the torus topology, an extension of the mesh topology in which
the grid wraps around on itself, so that nodes on opposite edges are also neighbours. Figure 2d
shows the tree topology. In this case it is a simple binary tree, but different variations can be
used depending on the application. There are also topologies that are a hybrid of these basic ones,
and others that incorporate heterogeneous routing schemes that use two or more of these topologies
together [9].
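The difference between the mesh and the torus can be made concrete with a short sketch that lists the neighbours of a node. The class `GridTopology`, the node numbering (`y * width + x`) and the assumption that the grid is at least 3 nodes wide and high are choices made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the neighbours of node (x, y) in a grid.
class GridTopology {
    // Node ids are y * width + x. With torus == true the grid wraps around,
    // so nodes on opposite edges become neighbours (assumes width, height >= 3).
    static List<Integer> neighbours(int x, int y, int width, int height, boolean torus) {
        int[][] steps = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
        List<Integer> result = new ArrayList<>();
        for (int[] step : steps) {
            int nx = x + step[0];
            int ny = y + step[1];
            if (torus) {
                nx = Math.floorMod(nx, width);    // wrap around horizontally
                ny = Math.floorMod(ny, height);   // wrap around vertically
            } else if (nx < 0 || nx >= width || ny < 0 || ny >= height) {
                continue;                         // a mesh has no link past the edge
            }
            result.add(ny * width + nx);
        }
        return result;
    }
}
```

In a 4x4 grid the corner node (0, 0) has two neighbours in a mesh (nodes 1 and 4) and four in a torus (nodes 1, 3, 4 and 12).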
To achieve low latency, simple routing algorithms along with simple topologies were proposed. For
example, achieving a low latency router for a NoC means that the router has to remain as simple as
possible, without extra overheads [1]. The simplest routing algorithm that can be used is XY
routing. The flit is first transmitted along the X axis and then along the Y axis until it reaches
its destination. Using a crossbar, XY routing can be implemented with the least overhead. To further
reduce overheads, wormhole packet flow control can be used. Also using VCs means that the links will
be better utilized, thus achieving enhanced performance. The drawback of this scheme is area.
Implementing a crossbar router requires a lot of area when the number of input ports increases. This
means that the scheme is efficient latency wise, but area might be a limitation in an area sensitive
design.
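A minimal sketch of XY routing in a mesh is given below. The class `XYRouting`, the port names and the convention that X grows towards the east and Y towards the north are assumptions made for the example, not the interface of the simulator.

```java
// Hypothetical XY routing function for a 2D mesh.
class XYRouting {
    enum Port { EAST, WEST, NORTH, SOUTH, LOCAL }

    // Chooses the output port at router (curX, curY) for a flit that is
    // destined for router (dstX, dstY). X is corrected first, then Y.
    static Port nextPort(int curX, int curY, int dstX, int dstY) {
        if (dstX > curX) {
            return Port.EAST;
        }
        if (dstX < curX) {
            return Port.WEST;
        }
        if (dstY > curY) {
            return Port.NORTH;
        }
        if (dstY < curY) {
            return Port.SOUTH;
        }
        return Port.LOCAL;   // the flit has arrived, deliver it to the core
    }
}
```

A flit travelling from (0, 0) to (2, 1) is sent EAST at (0, 0) and (1, 0), NORTH at (2, 0) and is delivered to the LOCAL port at (2, 1). Because the decision depends only on the current and destination coordinates, the router needs no routing tables.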
Another approach to reduce the power consumption, improve the performance of the routers and reduce
the communication cost is to use heterogeneous routers [9, 10]. The term heterogeneous refers to
routers that use a different topology, routing algorithm or router architecture on the same chip.
The idea of heterogeneous routing [9] came from the observation that in a NoC the utilization of the
resources is not uniform. Specifically, with deterministic XY routing in a simple mesh, the routers
that lie towards the centre of the chip exhibit higher buffer utilization than the routers that lie
on the edge of the chip, as shown in figure 3.
Figure 3: The utilisation in percentage of the buffers (a) and links (b) in an 8x8 mesh topology. [9]
So the buffer size and link bandwidth are redistributed across the NoC, so that the parts of the NoC
that exhibit more utilization than others have the resources to cope with the demand. In other cases
[10] emphasis is given to using heterogeneous routing to minimize the communication cost on the NoC.
Reducing the communication cost means exploiting communication locality. In other words, data used
by a core should be placed in memory banks near that core. A solution proposed was a hierarchical
network. A hierarchical network is a network that has two levels, with a different topology used at
each level. A bus is used for local communication, connecting 4-8 cores, and a mesh is used for
global communication, connecting the core clusters to each other. In this way the benefits of the
bus topology, such as simplicity and power efficiency, are exploited while leaving behind its
unwanted characteristics, such as not scaling well to a large number of cores. Also, using the mesh
only for global communication means that the resulting mesh is smaller, which preserves performance
and keeps the power consumption of the network at low levels.
A NoC consists of two major components, the links and the routers. Combined, the routers and the
links form the topology, which is not a component by itself but rather a combination, as mentioned
before. The main task of this dissertation is to examine the main router architecture that exists
today, as well as two other proposals made to reduce the area and latency. This examination will
cover not only the functional behaviour of the routers but also a theoretical analysis of the area
and the power consumption of each router.
Baseline Router
The baseline router [1] is the dominant router. Depending on the topology, the baseline router can
perform different routing algorithms at the expense of extra hardware. In a mesh topology it can
easily perform XY routing.
The heart of the baseline router is the crossbar. A crossbar connects all of the inputs of the
router to all of the outputs, as shown in figure 4. With that functionality the baseline router can
route a packet to its destined output port in one cycle, provided that the port is free from other
packets. The selection of the output port is done in two phases. In phase one the destination output
port is calculated from the information in the packet header and the routing algorithm currently
being used. After that the input ports issue a request stating which output port they would like to
use. This is the port request phase. Then the output ports answer back and select the input port
that requested to use them. If there are multiple input port requests, the output ports select one
using various schemes. The selection can be done at random, using round robin, or with a priority on
each of the input ports. The output port will answer back and select an input port only if it has
the resources to serve the request.
Figure 4: The baseline router layout
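The request and grant steps of an output port can be sketched with a round robin arbiter, one of the selection schemes named for the baseline router. The class `RoundRobinArbiter` and its `grant` method are hypothetical, and the boolean `outputHasResources` stands in for whatever resource check the real router performs.

```java
// Hypothetical round robin arbiter owned by a single output port.
class RoundRobinArbiter {
    private final int inputs;
    private int lastGranted;

    RoundRobinArbiter(int inputs) {
        this.inputs = inputs;
        this.lastGranted = inputs - 1;   // so that input 0 is checked first
    }

    // requests[i] is true when input port i asked for this output port.
    // Returns the granted input port, or -1 when nothing is granted.
    int grant(boolean[] requests, boolean outputHasResources) {
        if (!outputHasResources) {
            return -1;                   // the output cannot serve any request
        }
        // Start the search just after the input that was served last time.
        for (int offset = 1; offset <= inputs; offset++) {
            int candidate = (lastGranted + offset) % inputs;
            if (requests[candidate]) {
                lastGranted = candidate;
                return candidate;
            }
        }
        return -1;                       // no input requested this output
    }
}
```

With four inputs, of which inputs 0 and 2 keep requesting the same output, successive calls grant 0, then 2, then 0 again, so neither input is starved.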
To better utilize the physical links and to provide a deadlock avoidance mechanism, the baseline router can
also implement virtual channels (VCs). As mentioned before, the virtual channel technique is time
multiplexing of the physical channel, dividing the physical link into multiple logical links. In
architectural terms the VCs are multiple buffers in each input port. When VC zero of port one is
used, the data are written to input buffer zero of port one. The choice of which VC to use is not
trivial. The simplest scheme is a round robin algorithm that uses one VC at a
time. A more dynamic scheme is also possible, in which each buffer discloses information about its
remaining capacity and the channel to be used is chosen at run time based on that information.
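The two selection schemes can be illustrated with a minimal Java sketch. The class and method names (VcSelector, roundRobin, mostFree) are hypothetical and are not taken from the simulator; the dynamic scheme is assumed here to prefer the VC with the most free buffer slots, which is only one possible policy.

```java
// Hypothetical sketch of two ways of choosing a virtual channel (VC).
final class VcSelector {
    private int next = 0; // round robin pointer

    // Round robin: use one VC at a time, in a fixed cyclic order.
    int roundRobin(int vcCount) {
        int chosen = next;
        next = (next + 1) % vcCount;
        return chosen;
    }

    // Dynamic (assumed policy): pick the VC whose buffer reports the most
    // free slots. freeSlots[i] is the free capacity disclosed by VC i.
    // Returns -1 when every buffer is full.
    static int mostFree(int[] freeSlots) {
        int best = -1;
        int bestFree = 0;
        for (int vc = 0; vc < freeSlots.length; vc++) {
            if (freeSlots[vc] > bestFree) {
                bestFree = freeSlots[vc];
                best = vc;
            }
        }
        return best;
    }
}
```

With three VCs, successive calls to roundRobin(3) return 0, 1, 2 and then 0 again, while mostFree picks whichever buffer currently has the most room.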
A drawback of the baseline router is its cost in hardware. A crossbar is always costly and does not scale
well in terms of area as the number of input and output ports increases. In a mesh topology, deadlock
avoidance and flow control mechanisms can be implemented with low hardware cost. In a
different topology, however, the control overhead in the router would result in a significant increase in
hardware. The increased hardware, combined with the crossbar implementation, would be
extremely costly to implement.
The baseline router is a fast and efficient router when used in small networks. With some possible
optimizations it can reduce the latency of a network significantly. The hardware cost of the router,
though, is high. This can be a limiting factor for low power devices that have limited
hardware space.
Rotary Router
The rotary router [2] is the second router to be simulated. The rotary router makes use of two rings
that are connected to the input and output ports to route the flits. It also uses bubbles in the buffers
as a deadlock/livelock avoidance mechanism.
Figure 5: The rotary router layout. [2]
The router consists of three different building blocks, the input, output and buffering segments, as shown
in figure 5. These blocks construct the two rings that are used to route the flits. For a flit to move
from one element of the router (e.g. an input port) to another element (e.g. a buffering segment) one cycle
is needed. Each ring rotates flits in the opposite direction to the other. In a topology where each
router is connected to four other routers, such as a torus, the router has five pairs of input
and output ports. Four are used to serve the connections between the routers and the fifth
connects the core that injects flits into or consumes flits from the network. An input port is connected to
both directions of the ring using a multiplexer. How the direction of a flit is chosen is described
below.
The router uses flow control based on bubbles in the buffers to establish deadlock/livelock
avoidance. For a flit to be inserted in the ring, the first buffering segment after the input port must
have at least two holes, that is, enough empty memory in the buffer to hold another
two flits. If the input port is connected to the core that the router serves, the required number of holes
increases to three. Also, for a flit to advance between the buffer segments, each buffer must have
information about its own occupation level and about the occupation level of the next buffer in the
router. The buffer will send the data forward only if the number of flits inside the destination buffer is less than or
equal to the number of flits in the current buffer. The routing engine of the router tries to insert the flit in the
direction that will take the fewest cycles for the flit to exit the router.
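The injection and advance rules above can be written as two small predicates. This is a minimal Java sketch under stated assumptions: the class name RotaryFlowControl and its parameters are hypothetical, buffer state is reduced to plain integers, and the extra condition that a buffer must hold at least one flit before it can forward is an assumption added for the sketch.

```java
// Hypothetical sketch of the bubble-based flow control rules of the rotary router.
final class RotaryFlowControl {
    static final int HOLES_FOR_NETWORK_PORT = 2; // ports connected to other routers
    static final int HOLES_FOR_CORE_PORT = 3;    // port connected to the local core

    // A flit may enter the ring only if the first buffering segment after
    // the input port has enough holes (free slots).
    static boolean canInject(int capacity, int occupied, boolean fromCore) {
        int holes = capacity - occupied;
        int required = fromCore ? HOLES_FOR_CORE_PORT : HOLES_FOR_NETWORK_PORT;
        return holes >= required;
    }

    // A buffer forwards a flit only if the next buffer holds no more flits
    // than it does. Requiring a flit to be present is an added assumption.
    static boolean canAdvance(int currentOccupied, int nextOccupied) {
        return currentOccupied > 0 && nextOccupied <= currentOccupied;
    }
}
```

For a segment with four slots of which two are occupied, a network port may inject while the core may not, which is the asymmetry that the deadlock avoidance argument relies on.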
The number of complete turns that a flit is allowed to make in a ring is predetermined. When a flit
completes the predetermined turns without leaving the router through the calculated output port, it
is marked as a misrouted flit and leaves from the first output port that can serve it. This
has two effects. Firstly, it resolves any anomalies caused by miscalculating the destination of the flit.
Secondly, if there is congestion at a specific output port, and consequently at the router that the
port is connected to, the flit will follow another route to reach its destination instead of waiting for the
congestion in the subsequent router to reduce.
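The misrouting rule can be sketched as a turn counter carried by each flit. The class RingFlit and its members are hypothetical names chosen for illustration, and the limit on turns is passed in as a parameter because the report does not fix its value.

```java
// Hypothetical sketch: a flit circulating in a ring of the rotary router.
final class RingFlit {
    final int targetPort;            // output port computed by the routing engine
    private int turns = 0;           // complete turns made around the ring
    private boolean misrouted = false;

    RingFlit(int targetPort) {
        this.targetPort = targetPort;
    }

    // Called each time the flit completes a full turn of the ring.
    void completeTurn(int maxTurns) {
        turns++;
        if (turns >= maxTurns) {
            misrouted = true;
        }
    }

    boolean isMisrouted() {
        return misrouted;
    }

    // A normal flit leaves only through its target port; a misrouted flit
    // leaves through the first output port that can serve it.
    boolean mayLeaveThrough(int port, boolean portCanServe) {
        if (!portCanServe) {
            return false;
        }
        return misrouted || port == targetPort;
    }
}
```

A small limit on turns keeps flits from circulating for long, at the price of more misrouted flits.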
As mentioned before, deadlock/livelock avoidance is achieved by the use of bubbles/holes. The
flow control mechanism guarantees that the core attached to the router will stop
injecting flits before the input ports stop injecting flits. This is due to the restriction that, for the core
to inject flits, there has to be room for at least three other flits in the input buffer, as opposed to the
input ports that need only two. This means that at some point no new flits will be
injected into the NoC. Only the existing flits will be served, so nothing will stall. The only elements that can
increase the number of flits inside the network are the cores, so the last flit that brings the
network to an extreme state will come from a core. If no core consumes any flits, the total
number of holes in the network will be $2N + 1$, where $N$ is the number of routers. The extra hole can
never stay at one router, because after a finite number of cycles a flit will be marked as misrouted
in a router that is next to the router that has the extra hole, and it will be moved to that
router. The router that lost the flit will then have the extra hole. Since the hole keeps moving in
this fashion, the network will never reach a deadlock situation.
Livelock avoidance comes from the fact that statistically no flit can keep following a circular path.
When a flit is marked for misrouting, it leaves from the first available port. This means that when the
hole moves to another router, an input port of that router will inject a flit in the ring. That will
free a place in the input buffer for another router to send to it, thus filling the hole. There is the
possibility of the hole ping-ponging between two neighbouring routers.
The flow control mechanisms can also be used to avoid starvation. In real time programs the
data flow follows a pattern. This can lead to situations where only one port of a router is used
and the others are not able to inject new flits, thus leading to starvation. To solve this problem
the injection rules can change dynamically. The input port that has the most traffic can increase the
number of holes that need to exist in the next buffer before it inserts a flit. In this way the heavily utilized port
gives the underutilized ports the chance to insert their flits in the router.
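One possible form of such a dynamic rule is sketched below in Java. Everything here is an assumption made for illustration: the class name AdaptiveInjection, the idea of counting injections per port over a recent window, and the choice to raise the threshold of the busiest port by exactly one hole.

```java
// Hypothetical sketch: raising the injection threshold of the busiest port.
final class AdaptiveInjection {
    // baseHoles[p] is the normal number of holes port p needs before injecting.
    // injected[p] is the number of flits port p injected in a recent window.
    // Returns the thresholds to use next; the busiest port needs one more hole.
    // On a tie the port with the lowest index is treated as the busiest.
    static int[] adjust(int[] baseHoles, int[] injected) {
        int[] required = baseHoles.clone();
        int busiest = 0;
        for (int p = 1; p < injected.length; p++) {
            if (injected[p] > injected[busiest]) {
                busiest = p;
            }
        }
        if (injected[busiest] > 0) {
            required[busiest]++;
        }
        return required;
    }
}
```

With base thresholds {2, 2, 3} and recent injections {5, 1, 0}, the first port is the busiest and its threshold becomes 3, leaving the other ports unchanged.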
This router design can become really costly in terms of area, power consumption and latency. The
usage of so many buffers has a large impact on the area taken by the router. Also, the fact that the
flits do not stop moving inside the rings and circulate until they are marked for misrouting
has an effect on power consumption. This effect can be reduced if the number of complete turns
that a flit has to complete until it is marked for misrouting is small. In contrast to the baseline router,
if the number of input ports in the rotary router increases, the maximum number of cycles needed by the router
to route a flit also increases. The growth rate of the rotary router when the input ports are increased
is $O(n)$, where $n$ is the number of input ports. The baseline router has a growth rate of $O(1)$. Each input port adds one extra cycle to the
worst case scenario for the rotary router.
Hierarchical Router
The hierarchical router [3] is the third router to be simulated. The design of the router is dedicated
to reducing area and power consumption as much as possible. It mostly uses simple routing
algorithms with little or no deadlock/livelock avoidance mechanisms.
The design, as implied by the name, is hierarchical. There is a single routing engine at the heart of
the router that is able to handle one flit at a time. The flits are inserted in a FIFO queue and the head
of the queue is served by the routing engine. This design serves the purpose of low power and
small area. When a flit is routed it is forwarded to one of the output ports of the
router. If the output port does not have the resources to serve the request by the routing engine,
the whole router is blocked until the port is available.
Figure 6: The structure of the hierarchical router
The hierarchical nature of the router lies in the way that it handles flits. The router gathers the flits
from the input ports (if a flit exists) and then, level by level, the flits are merged until the last level is
reached, as shown in figure 6. Such a design is problematic with respect to the number of cycles the
router needs to merge all of the flits. The number of cycles needed to merge the flits into one large list of
flits is given by $\text{cycles} = \lceil \log n \rceil$ (where $n$ is the number of input ports). Then the routing engine takes the
flits one by one and routes them to the appropriate output port.
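The logarithmic growth of the merge cost can be checked with a short Java sketch. It assumes that each level merges the queues in pairs and takes one cycle, which the report does not state explicitly; the class name MergeTree is hypothetical.

```java
// Hypothetical sketch: number of merge levels in the hierarchical router,
// assuming each level merges the remaining queues in pairs.
final class MergeTree {
    static int levels(int inputPorts) {
        int levels = 0;
        int queues = inputPorts;
        while (queues > 1) {
            queues = (queues + 1) / 2; // an odd queue out is carried to the next level
            levels++;
        }
        return levels;
    }
}
```

Under this assumption four input ports need two levels and eight need three, so doubling the number of ports adds only one level.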
The router by itself is inefficient in terms of the number of cycles it needs to route the flits. As opposed to the
baseline router, where the number of cycles is constant, as the number of input ports in the
hierarchical router increases, the number of cycles required to route all of the flits also increases. Even
with this major drawback, this type of router has a lot of advantages. The most important
one is its simplicity. In hardware terms, a simple router consumes less energy in a given
environment and also needs less hardware to function. In a wireless sensor network, for example,
where the traffic is low and the most critical resource is energy, such routers are ideal.
The router is also capable of using VCs to better utilize the physical wires and as a
deadlock/livelock avoidance mechanism. Such a mechanism, though, will increase the number of
transistors that compose the router, and more transistors mean more energy and
more area. The very
nature of the hierarchical router makes it behave poorly in environments with intense traffic. So the
use of virtual channels might give the router a way to avoid deadlocks, but it will compromise the low
power and small area of the implementation. In a mesh topology that uses XY routing, deadlock
cannot occur. If the router is used in such an environment, no measures to avoid potential deadlocks are
needed.
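XY routing on a mesh can be sketched in a few lines of Java. The port numbering and the convention that y grows towards the north are assumptions made for this sketch, and the class name XyRouting is hypothetical.

```java
// Hypothetical sketch of XY (dimension order) routing on a 2D mesh.
final class XyRouting {
    // Assumed port numbering, not taken from the simulator.
    static final int LOCAL = 0, EAST = 1, WEST = 2, NORTH = 3, SOUTH = 4;

    // Returns the output port for a flit at (curX, curY) heading to (dstX, dstY).
    // The X offset is always corrected first, then the Y offset.
    static int nextPort(int curX, int curY, int dstX, int dstY) {
        if (dstX > curX) return EAST;
        if (dstX < curX) return WEST;
        if (dstY > curY) return NORTH;
        if (dstY < curY) return SOUTH;
        return LOCAL; // the flit has arrived
    }
}
```

Because a flit never turns from the Y dimension back into the X dimension, the channel dependencies cannot form a cycle, which is why no further deadlock avoidance is required in this case.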
Research Methodology and Project Planning
Research Methodology
The purpose of the project is to compare the three router architectures. The architectures are going
to be compared at various levels. Firstly, the latency of each router is going to be examined
under various workloads and router configurations to identify the
strong and weak points of each router. After the functional simulation, a theoretical analysis
of the area that the implementation of each router would need is going to be made. The stages of the research
are stated below:
1. Background research:
The background research started with understanding the reasons that led to the need for
NoCs. After understanding which problems NoCs are trying to solve, the various
problems and limitations that emerged from implementing NoCs were studied. The three
main points studied were:
• What is a NoC
• Which are the problems that a NoC solves
• What are the limitations of NoCs
Then the three different architectures were studied in detail. In every case the assumptions
that were made for each of the three architectures were taken into consideration to
provide a general idea as to which limitation of NoCs each of them is trying to tackle.
2. Functional simulation:
Each router will be built and simulated in Java. The routers are going to have a variable
number of resources so that various scenarios can be simulated. At first the baseline router is
going to be built, then the rotary router and lastly the hierarchical router. After building the
routers the complete system will be built. The system will be able to simulate various scenarios. The
elements that will change in each scenario are the resources (memory, number of VCs),
the traffic and possibly the topology. The traffic is one of the most important aspects in
simulating NoC routers, so special care will be given to developing traffic that
stresses the routers and exposes their vulnerabilities and their strong points. Each topology
has different routing algorithms, and different routing algorithms might have deadlock/livelock
problems. Different topologies will be simulated if there is time to implement the new
routing algorithms and the deadlock/livelock avoidance mechanisms. There will be a final
testing stage where the data for the routers will be gathered and analysed. During the
building of the routers and the system there will be continuous testing with predetermined
traffic to test the correctness of the routers. Testing the correctness of a network with one
node and two flits is easy. For more complex traffic a correctness engine will be built that
will monitor the traffic and ensure that the system functions as it should.
3. Testing and analysing the data:
As mentioned before, custom benchmarks will be built to stress the functionality of the
routers. Each router has different characteristics and can be used efficiently under different
traffic. Based on these characteristics, benchmarks will be built that favour one
router at a time while showing how the other two routers react to the same traffic. Also, to
gain a broader perspective, benchmarks will be built that do not favour any of the routers,
to test how the routers behave in the worst case scenario. After building the traffic, the
system will be simulated and statistics about the network will be collected. The various
benchmarks will be produced by a benchmark engine that will instruct each core when to
produce a flit, what type of flit to produce (stream or standalone flit) and what the
destination will be. Instructing the cores to develop traffic that either favours a
specific router or does not, while all the routers are tested under the same traffic, will give a general
picture of the behaviour of each router compared to the others.
Data Collection
The conclusions that will be reached mostly depend on the data that will be gathered. Most of the
data collected will concern the latency of the router. The latency of a router
at an arbitrary point in time, though, does not mean a lot. The latency has to be correlated with what is going
on in the whole NoC.
The data will be collected by the flits that will be traversing through the network. After reaching
their destination the data collected by the flit will be saved in a file which will be later analysed to
extract the needed measurements. Each flit will collect the following measurements:
• Flit ID (a unique identifier of the packet)
• Start time (the simulation will be cycle accurate, so the start time will be the
id of the cycle in which the flit is created)
• End cycle
• Hop counter
• Route followed
An analyser will collect the flits after the simulation is finished. Based on the measurements taken in
the flit the analyser will produce meaningful data. The data will be interpreted and the appropriate
conclusions will be drawn.
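As an illustration of how the analyser might turn the per-flit measurements into latency figures, a minimal Java sketch follows. The class FlitTrace and its field names are hypothetical, the route followed is left out to keep the example short, and latency is assumed to be the end cycle minus the start cycle.

```java
// Hypothetical sketch: the measurements carried by one flit, and a simple
// analysis step that averages latency over all collected flits.
final class FlitTrace {
    final int flitId;
    final long startCycle; // cycle in which the flit was created
    final long endCycle;   // cycle in which the flit reached its destination
    final int hops;        // hop counter

    FlitTrace(int flitId, long startCycle, long endCycle, int hops) {
        this.flitId = flitId;
        this.startCycle = startCycle;
        this.endCycle = endCycle;
        this.hops = hops;
    }

    // Assumed definition: latency in cycles from creation to arrival.
    long latency() {
        return endCycle - startCycle;
    }

    // Average latency over a set of traces; 0.0 for an empty set.
    static double averageLatency(FlitTrace[] traces) {
        if (traces.length == 0) {
            return 0.0;
        }
        long total = 0;
        for (FlitTrace t : traces) {
            total += t.latency();
        }
        return (double) total / traces.length;
    }
}
```

The same records could also be grouped by hop count or by injection time to relate latency to the state of the whole NoC.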
Project Planning
As shown by the research methodology, the project is divided into three major sections. The
sections are the background research, the building of the routers and the testing of the system to
obtain statistics for each router. Each of these three major sections was subdivided into single tasks
so the project can be completed in a modular fashion.
As shown in figure 7 in Appendix A, the project is divided into single tasks and some of them run in
parallel. The parallel tasks are tasks that have no dependencies between them. They might affect
one another, but that does not mean they cannot be executed in parallel. The first task was
studying and understanding the background of NoCs. The process of acquiring the knowledge
took a little over a month and set the foundation to start working on the project. After
that, a week was devoted to writing the preliminary report, which set the goals of
the project. Then the building of the routers follows. The process is again divided into three stages.
The first stage was the development of the baseline router, the second is the development of
the rotary router and the third is the development of the hierarchical router. The development of
the other two routers is expected to take less time than the development of the baseline router. This is
because the building blocks and the testing strategy of the routers are the same whatever the
functionality. In
parallel with the development of the routers there is the process of writing the progress report. The
progress report was affected by the development of the functionality of the router, but not to a
degree that prohibited the parallel completion of the two. The hierarchical
and rotary routers will be completed after the second milestone of the progress report. The completion of the
standalone routers is the third milestone.
When the basic functionality of the routers is finished, the system incorporating the routers will
be developed. There will be a way of changing the variables of the routers quickly so that the simulations can
be run efficiently. There will be a graphical user interface (GUI) that will enable the user to change
the various aspects of the routers such as:
• Topology
• Number of input and output buffers
• Size of input and output buffers
• Number of Virtual Channels
• Type of traffic to be simulated
• Type of router to be simulated
The system will be able to produce results either in the form of raw data or ready-made
graphs. Then the system along with the routers will be tested to ensure the correctness of the
simulator. This will be the final testing to ensure correctness, but the functionality of
each router will be tested at every step. The complete system is the fourth milestone of the project.
After finishing the system the development of the benchmarks will start. A benchmark engine will be
constructed. The benchmark engine will instruct the cores simulated in the system as to what kind of
traffic to produce. The traffic generated will be constructed to favour one router at a time. Also a
worst case scenario will be developed for each router.
Finishing the development of the benchmarks will mean the beginning of data collection for the
system. The benchmarks will be applied to the system while changing the various aspects of the
NoC like topology, router type, size of buffers, number of virtual channels etc. Along with the testing
and data collection, the writing of the final report will start. The aim for that stage is to finish the parts of the final
report that are not affected by the testing of the system, like the background work.
Finally, when the data are collected, the analysis of the data will begin. The data collected will be
analysed to produce results that will lead to the final conclusions. Also, a theoretical analysis of the
routers will be made to identify the area that each router will need when implemented in hardware.
Having the results and conclusions, the final stage of the project will start: writing the final report.
The conclusions drawn from the theoretical analysis and the results obtained by the simulations will
be presented and commented on. A draft report will be delivered to my supervisor. The comments
made by my supervisor on the report will be examined and the report will be revised accordingly. Then, by the 6th of September,
the final report along with the produced code will be delivered.
The project plan changed from the initial plan in the preliminary report. The plan was changed
because there were factors that had not been taken into consideration, such as the
intense coursework of other modules, the exams of the spring semester and the time that the
progress report needed to complete.
Project Progress
Up until now (May 2013) the baseline router has been completed. The development of the routers is not going as
planned. The whole project is three weeks behind schedule. This is mostly due to misjudging
the coursework load of the modules for the second semester. The rotary router is currently being
developed. This part of the report describes the way that the routers are developed.
Building the Baseline router
To simplify the design of the routers, various classes were constructed for two main reasons. First,
having distinct classes greatly helps to reuse them with minor or no changes, thus reducing
the development time of the subsequent routers. Second, since the behaviour of hardware is
simulated, it is easier to divide the parts of the hardware into different classes. This gives a clearer
picture of how the code works to someone reading it, and makes it easier to program. Not all of the classes
represent a hardware module. Some of the classes are data structures that help the development of
the functionality. The classes are described below:
Coordinate: The Coordinate class holds two integers x and y. The integers represent a point on a
grid. The Coordinate class is mainly used to define the point of a router on a grid. Thus the router
gains a unique id in a 2D plane.
Flit: This class at this point defines the flits that are exchanged by the routers. It has four fields. The
first two fields are of type Coordinate and are used to define the source and the destination of the
flit so that the routing can be made. The third field is the flit header. The header indicates the id of the
flit in the stream, so that it can be identified when all of the flits have arrived. In a standalone flit the flit header is -1. The
headers are numbered in descending order. The first flit has the highest id and the last flit has a flit
header of 1.
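The two data classes described so far could look roughly as follows. This is a sketch for illustration rather than the code of the simulator: it covers only the three Flit fields that are described in the text, and the helper methods (isStandalone, isLastOfStream, stream) are hypothetical additions.

```java
// Sketch of the Coordinate class: a point on the 2D grid of routers.
final class Coordinate {
    final int x;
    final int y;

    Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

// Sketch of the Flit class, limited to the fields described in the text.
final class Flit {
    static final int STANDALONE = -1; // header value of a standalone flit

    final Coordinate source;
    final Coordinate destination;
    final int header; // position in the stream, counting down to 1

    Flit(Coordinate source, Coordinate destination, int header) {
        this.source = source;
        this.destination = destination;
        this.header = header;
    }

    boolean isStandalone() {
        return header == STANDALONE;
    }

    boolean isLastOfStream() {
        return header == 1;
    }

    // Hypothetical helper: builds a stream whose headers run from length down to 1.
    static Flit[] stream(Coordinate source, Coordinate destination, int length) {
        Flit[] flits = new Flit[length];
        for (int i = 0; i < length; i++) {
            flits[i] = new Flit(source, destination, length - i);
        }
        return flits;
    }
}
```

Numbering the headers in descending order lets a receiver recognise the end of a stream by a header of 1 without knowing the stream length in advance.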
Buffer: This class describes the functionality of a first in first out (FIFO) buffer. The class is an array of
Flit, as flits are defined in the previous class. The size of the buffer can be defined, which gives an easy
way to increase or decrease the resources of the system. The class provides the functionality to peek
at the head of the buffer, to remove flits from the head and to add flits to the tail.
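A fixed-size FIFO of this kind is commonly built as a circular array. The sketch below is one such construction in Java; the class name FifoBuffer is hypothetical, and it is written as a generic class so that it stays self-contained, whereas the simulator stores Flit objects directly.

```java
// Hypothetical sketch of a bounded FIFO buffer backed by a circular array.
final class FifoBuffer<T> {
    private final Object[] slots;
    private int head = 0; // index of the oldest element
    private int size = 0; // number of stored elements

    FifoBuffer(int capacity) {
        slots = new Object[capacity];
    }

    boolean isFull() {
        return size == slots.length;
    }

    boolean isEmpty() {
        return size == 0;
    }

    // Adds an element at the tail; returns false when the buffer is full.
    boolean add(T flit) {
        if (isFull()) {
            return false;
        }
        slots[(head + size) % slots.length] = flit;
        size++;
        return true;
    }

    // Returns the element at the head without removing it, or null if empty.
    @SuppressWarnings("unchecked")
    T peek() {
        return isEmpty() ? null : (T) slots[head];
    }

    // Removes and returns the element at the head, or null if empty.
    T remove() {
        T flit = peek();
        if (flit != null) {
            slots[head] = null;
            head = (head + 1) % slots.length;
            size--;
        }
        return flit;
    }
}
```

Making the capacity a constructor argument matches the idea of varying the resources of the system between simulation runs.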
Link: The Link class is the class responsible for connecting the routers to form the topology. The Link
class holds the values of the router that a specific port (either input or output) is connected to.
These values are the destination router ID, which is of type Coordinate as
described above, and the ID of the (destination) port that the source port is
connected to.
Flit Records: The flit record is a structure that helps in the routing. As mentioned before, the routers
will use wormhole routing. In order to achieve wormhole routing there has to be a record
that keeps the decision made for the initial flit routed. The flit record does just that. It holds
information about a worm of flits to make decisions for the flits to come. The information kept is the
source router, the destination router, the decision made as to which port the flit will be routed
through to reach the next hop, and the count of flits passed.
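The role of the flit record in wormhole routing can be sketched as a small lookup table. The names WormRecord and WormTable are hypothetical, coordinates are flattened to plain integers to keep the example self-contained, and matching a worm by its source and destination alone is a simplifying assumption.

```java
// Hypothetical sketch of a flit record: the decision taken for the first
// flit of a worm, kept so that the following flits take the same port.
final class WormRecord {
    final int srcX, srcY, dstX, dstY; // source and destination routers
    final int outputPort;             // decision made for the first flit
    int flitsPassed = 1;              // the first flit has already passed

    WormRecord(int srcX, int srcY, int dstX, int dstY, int outputPort) {
        this.srcX = srcX;
        this.srcY = srcY;
        this.dstX = dstX;
        this.dstY = dstY;
        this.outputPort = outputPort;
    }

    boolean matches(int sx, int sy, int dx, int dy) {
        return srcX == sx && srcY == sy && dstX == dx && dstY == dy;
    }
}

final class WormTable {
    private final java.util.ArrayList<WormRecord> records = new java.util.ArrayList<>();

    void remember(WormRecord record) {
        records.add(record);
    }

    // Returns the stored output port for a following flit and counts it,
    // or -1 if no worm with this source and destination is recorded.
    int portFor(int sx, int sy, int dx, int dy) {
        for (WormRecord r : records) {
            if (r.matches(sx, sy, dx, dy)) {
                r.flitsPassed++;
                return r.outputPort;
            }
        }
        return -1;
    }
}
```

A fuller version would also remove a record once the flit with header 1 has passed, so that a later worm between the same routers is routed afresh.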
Port: One of the most complex classes of the design. The Port class is the class that makes use of the
classes described above. It consists of an ID, a Link, the number of virtual channels, the input channels and an
integer which is the buffer size. The ID uniquely identifies a Port in the Router
structure. Each port is connected to a different component so it has to be uniquely identified. The
Link is the structure that gives the information as to which port of which router a specific port is
connected to. The virtual channels number is an integer that simply describes the number of virtual
channels in the structure. It is variable, so different scenarios using various numbers of virtual channels can be
tested. The input channels structure is an array of buffers. The size of the array is defined by the
number of virtual channels. The size of the buffers is defined by the buffer size, which is set as
a variable in the constructor of the Port class.
Router: The Router class, using all of the other classes described above, describes the functionality
of the whole baseline router. The attributes of the Router class are the input ports, which is an array
of Ports, the output ports, which again is an array of Ports, an array of Flit Records to hold the
decision for subsequent flits, the number of virtual channels, the ID of the router, which is a
Coordinate, and two other integers, which are the number of input and output ports respectively.
All of the values that need to be initialized, like the number of input and output ports, the
number of virtual channels and the ID of the router, are initialized in the constructor of the class. The
Port arrays are also initialized in the constructor.
A method called CreateConnections() that takes two
array arguments is responsible for the establishment of the connections for each router. The first array is
of type Coordinate and the second is an integer array. The first element of the first array holds the
ID of the router that the first port is going to be connected to. The first element of the second array
holds the port, in the router identified by the first array, that the first port of the router is connected to.
The identifiers (Router ID, Port ID) initialize the Link object that resides in each and every port.
The second most important method of the Router class is the RoutePackets() method. The method takes
as argument the VirtualChannelID, an integer that identifies the virtual channel
that this cycle uses. The routing in the baseline router is done in two phases. The first phase is
the Port Request phase. All of the input ports are traversed and those which have a flit at their head
that needs routing set their request in a Boolean array. Each element of the Boolean array
represents an input port. If the flit belongs to a worm whose output port has already been decided, the decision made for the preceding flits is
looked up and applied. If the flit does not have a predetermined output port, the next hop is calculated from the
destination of the flit and the coordinates of the router. The next hop
determines to which output port the flit has to be forwarded to reach the next router. The routing
done is XY routing.
The second phase of routing is the arbitration phase. In this phase the output
ports choose which request of the input ports they will satisfy. A new Boolean array is constructed to
identify whether an output port has already satisfied a request in this cycle. All the input port requests are
examined. If two input ports want to send a flit through the same port, only one of the two will be
satisfied in a cycle. Then the flit is deleted from the input port and injected into the output port.
If the output port buffer is full, the flit will not move.
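The arbitration phase can be sketched independently of the rest of the router. In this Java sketch the requests of the first phase are given as an integer array, and conflicts are resolved in favour of the input port with the lowest index; that fixed priority is an assumption, since the text does not say which of two competing inputs wins. The class name TwoPhaseArbiter is hypothetical.

```java
// Hypothetical sketch of the arbitration phase of the baseline router.
final class TwoPhaseArbiter {
    // requestedOutput[i] is the output port requested by input port i in the
    // port request phase, or -1 if input i has nothing to route.
    // outputHasSpace[o] tells whether the buffer of output port o has a free slot.
    // Returns granted[i] == true when input i may move its flit this cycle.
    static boolean[] arbitrate(int[] requestedOutput, boolean[] outputHasSpace) {
        boolean[] granted = new boolean[requestedOutput.length];
        // Marks output ports that have already satisfied a request this cycle.
        boolean[] outputTaken = new boolean[outputHasSpace.length];
        for (int in = 0; in < requestedOutput.length; in++) {
            int out = requestedOutput[in];
            if (out < 0) {
                continue; // no request from this input
            }
            if (outputTaken[out] || !outputHasSpace[out]) {
                continue; // the flit stays in the input port
            }
            outputTaken[out] = true;
            granted[in] = true;
        }
        return granted;
    }
}
```

If inputs 0 and 1 both request output 1, only input 0 is granted in this sketch and input 1 retries in a later cycle. A round robin starting index would make the choice fairer.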
NoC: This is the top level class of the system. This is the class where the connectivity data are
generated and passed to the routers. This is also the class that generates traffic and injects it into the
routers to route. The NoC is an array of routers. Each router is connected to other routers and thus
the topology is formed. A for loop acts as the clock of the system. In the for loop, a nested for loop
calls the RoutePackets() method in all of the routers. This moves the flits from the input ports of a
router to the output ports. The SendMessages() method in the NoC class sends the messages from
the output buffer of a router (source) to the input port of another router (destination). The method
takes as an argument the virtual channel ID used in the cycle. The method checks the output buffers
of all the routers. If a router has a flit to send, the Link object allows the method to find where to
send the flit. If the input port at the other side of the link has an empty slot, the flit is
deleted from the output port and placed in the destination input port.
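The clock loop with its two steps per cycle can be illustrated on a deliberately tiny network. The sketch below is not the NoC class of the simulator: it assumes a line of routers with one-slot buffers, flits that only travel towards higher router indices, and a flit represented by the index of its destination router. The class name LineNoc is hypothetical.

```java
// Hypothetical sketch of the clock loop: each cycle first routes flits inside
// every router, then sends them across the links.
final class LineNoc {
    final int[] input;  // one-slot input buffer per router: destination index, or -1 if empty
    final int[] output; // one-slot output buffer per router, same encoding
    int delivered = 0;  // flits consumed at their destination

    LineNoc(int routers) {
        input = new int[routers];
        output = new int[routers];
        java.util.Arrays.fill(input, -1);
        java.util.Arrays.fill(output, -1);
    }

    // Step 1: inside each router, consume the flit if it has arrived,
    // otherwise move it to the output buffer when that buffer is free.
    void routePackets() {
        for (int r = 0; r < input.length; r++) {
            if (input[r] == -1) {
                continue;
            }
            if (input[r] == r) {
                delivered++;
                input[r] = -1;
            } else if (output[r] == -1) {
                output[r] = input[r];
                input[r] = -1;
            }
        }
    }

    // Step 2: move flits across the links, from the output buffer of router r
    // to the input buffer of router r + 1, if that input has an empty slot.
    void sendMessages() {
        for (int r = 0; r + 1 < input.length; r++) {
            if (output[r] != -1 && input[r + 1] == -1) {
                input[r + 1] = output[r];
                output[r] = -1;
            }
        }
    }

    // The for loop acts as the clock. Returns the cycle in which the first
    // flit is delivered, or -1 if none arrives within maxCycles.
    int run(int maxCycles) {
        for (int cycle = 1; cycle <= maxCycles; cycle++) {
            routePackets();
            sendMessages();
            if (delivered > 0) {
                return cycle;
            }
        }
        return -1;
    }
}
```

In this model a flit placed in the input buffer of router 0 with destination 2 needs three cycles, one per router it passes through, which gives a simple reference value for checking a simulator by hand.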
The baseline router was tested with traffic whose result was easy to predict. This was done to
ensure the correctness of the routers at small scale. For further testing, though, a debugging tool will be
developed to test the correctness of the routes taken by the flits.
Summary and future work
The shift of the industry towards building processors that have multiple cores connected via a
Network on Chip is a one way road. The scaling down of the transistors and the uneven scaling of the
wires lead to an increase in the cost of wires: cost in manufacturing, performance and area. Also,
the complexity of modern processors requires a more modular design that will enable the designers
to reuse hardware efficiently, thus reducing the development time of a processor. For these reasons
NoCs are a necessity. NoCs, though, face a lot of problems, and intense research is being done in the area to
improve them to a level at which they can be applied to the processors used by an average user.
In this project the functionality of three router architectures will be tested. Through the simulations
conclusions will be drawn about the kind of traffic that each router can handle best thus identifying
the application context that each router can be established in. There will also be a theoretical
analysis on the power consumption and area needed by the routers to be deployed. Through
comparing the routers at various levels conclusions will be drawn. These conclusions will provide a
specific profile as to in which context the routers can be used.
Up until now one of the three standalone routers has been developed and tested with basic
traffic to ensure the correct behaviour of the router. Due to misjudgement, the plan made in the
preliminary report was not accurate enough. Adjustments were made to reschedule the
tasks.
In the days to come the behavioural architectures of the rotary and hierarchical routers will be finished.
After that a system will be developed that will enclose the architectures and will enable fast and
easy testing, by making it easy to change the various
aspects of the routers such as the number of virtual channels, the size of buffers, the types of traffic and, if
possible, the topology. After the data from the functional simulations are gathered, a
theoretical analysis will be done on the area and the power consumption of the routers. Along with
the statistical analysis made on the data from the simulation the theoretical analysis will provide a
more general idea on the various aspects of the routers.
References
[1] Mullins, Robert, Andrew West, and Simon Moore. "Low-latency virtual-channel routers for on-chip
networks." ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society,
2004.
[2] Abad, Pablo, et al. "Rotary router: an efficient architecture for CMP interconnection
networks." ACM SIGARCH Computer Architecture News. Vol. 35. No. 2. ACM, 2007.
[3] Plana, Luis A., et al. "A GALS infrastructure for a massively parallel multiprocessor." Design & Test
of Computers, IEEE 24.5 (2007): 454-463.
[4] Borkar, Shekhar. "Design challenges of technology scaling." Micro, IEEE 19.4 (1999): 23-29.
[5] Ronen, Ronny, et al. "Coming challenges in microarchitecture and architecture." Proceedings of
the IEEE 89.3 (2001): 325-340.
[6] "Intel 22nm 3-D Tri-Gate Transistor Technology." Intel 22nm 3D TriGate Transistor Technology
Version History. Ed. Patric Dalvin. IntelIPR, 20 May 2011. Web. 21 Apr. 2013.
[7] Flynn, Michael J., Patrick Hung, and Kevin W. Rudd. "Deep submicron microprocessor design
issues." Micro, IEEE 19.4 (1999): 11-22.
[8] Bjerregaard, Tobias, and Shankar Mahadevan. "A survey of research and practices of network-on-
chip." ACM Computing Surveys (CSUR) 38.1 (2006): 1.
[9] Mishra, Asit K., Narayanan Vijaykrishnan, and Chita R. Das. "A case for heterogeneous on-chip
interconnects for CMPs." Computer Architecture (ISCA), 2011 38th Annual International Symposium
on. IEEE, 2011.
[10] Das, Reetuparna, et al. "Design and evaluation of a hierarchical on-chip interconnect for next-
generation CMPs." High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th
International Symposium on. IEEE, 2009.
Appendix A
Figure 7: The project plan for the dissertation