Multicore Processors for Mobile Platforms - Future Systems on a Chip

8/10/2019 Multicore Processors for Mobile Platforms - Future Systems on a Chip


Table of Contents

1 Project Overview
Abstract
2 Project Objectives
3 Understanding System-on-a-chip (SoC)
3.1 System-on-a-chip - basics
3.2 System-on-a-Chip - Structure
What's inside a SoC?
4 Briefly about Multi-core processors
4.1 Multithreading, Hyper-Threading, or Multi-Core?
5 On-Chip Network to a Multi-core System
5.1 Abstract
5.2 Topology
5.3 Routing
5.4 Flow Control
5.5 Router Micro-architecture
6 Project Methodologies, Results, and Achievements
6.1 Usage of Smartphones Survey (Q & A)
6.2 Analysis on the survey results
6.3 Discrete event simulations
6.4 Example of DES in real life
6.5 Components of a discrete-event simulation
6.6 Network Simulators as DES
6.7 Network Simulations with OPNET
7 Implementing the project in OPNET
7.1 Adding Traffic
7.2 Network On-Chip realizations in OPNET
7.3 Results Comparison
8 Conclusions
9 References



1 Project Overview

Abstract

Since smartphones and tablets are essentially small computers, they require much the same
components found in desktops and laptops in order to deliver everything they can do
(apps, music and video playback, 3D gaming, advanced wireless features, etc.).

However, smartphones and tablets do not offer the same amount of internal space as desktops
and laptops for components such as the logic board, the processor, the RAM, the graphics
card, and others. These internal parts therefore need to be as small as possible, so that
device manufacturers can use the remaining space to fit the device with a long-lasting
battery.

In recent years, continuous development in silicon technology has made it possible to
implement complex electronic systems in a single integrated circuit. Systems-on-chips (SoCs)
are small, powerful multi-core systems that are being deployed in a vast number of ways
across the booming electronics market, primarily in small mobile devices. This raises a
question: what architecture should be proposed for such a chip, or in other words, how can
one tiny computer be designed so that smartphones can rise to the level of PCs?

The architectural complexity of these SoCs requires new design methodologies and the
development of a seamless design flow that integrates existing and emerging tools. The
continuous progress of micro- and nanotechnologies has led to growing integration and higher
clock frequencies in electronic systems. These combined effects have increased both power
density and energy dissipation, which must consequently be managed, above all in portable
systems. Design and technology issues relating to power efficiency are crucial, in
particular power-optimized cell libraries, clock gating and clock tree optimization, and
dynamic power management. That is why this project discusses options for low-power, faster,
and cheaper design techniques for systems-on-chips, at both the design level and the
architectural level.



2 Project Objectives

This project intends to contribute to solutions for the growing industrial need to design a
reliable network-on-a-chip with efficient multi-core processors for mobile platforms as
future systems on a chip. In particular, it intends to provide theory and practical examples
of possible designs that may contribute to future SoC development.

System-on-a-chip design does not only require new design methodologies and a design flow
built from already known elements and tools. This leads to one of the main objectives of
this project: what are the needs of the users of this growing technology? This question
covers many possible situations, and makes it clear that not all PC components and
performance levels should simply be copied into our handheld devices. That is why this
project is based on earlier research about smartphone users and how they view their devices
(as a calling device, a gaming platform, or a business tool).

Based on the structure of SoCs, multi-core system architectures, and the research we have
done, this project will propose an on-chip network that offers high performance, efficient
use of space, and efficient interconnections between the components of the chip, which in
turn means a long-lasting battery.



3 Understanding System-on-a-chip (SoC)

3.1 System-on-a-chip - basics

System-on-a-chip (SoC) technology is the packaging of all the necessary electronic circuits
and parts for a "system" (such as a cell phone or digital camera) on a single integrated
circuit (IC), generally known as a microchip. For example, a system-on-a-chip for a
sound-detecting device might include an audio receiver, an analog-to-digital converter
(ADC), a microprocessor, the necessary memory, and the input/output logic control for a
user, all on a single microchip.

System-on-a-chip technology is used in small, increasingly complex consumer electronic
devices. Some such devices have more processing power and memory than a typical 10-year-old
desktop computer. In the future, SoC-equipped nanorobots (robots of microscopic dimensions)
might act as programmable antibodies to fend off previously incurable diseases. SoC video
devices might be embedded in the brains of blind people, allowing them to see; SoC audio
devices might allow deaf people to hear. Handheld computers with small whip antennas might
someday be capable of browsing the Internet at megabit-per-second speeds from any point on
the surface of the earth.

SoC is evolving along with other technologies such as silicon-on-insulator (SOI), which can
provide increased clock speed while reducing the power consumed by a microchip.

Image 1: How far has the technology gone? All the components of a desktop computer packed
into a chip a couple of centimetres long. (Image source: Wikipedia)

3.2 System-on-a-Chip - Structure

What's inside a SoC?

Now that we know what a SoC is, let's take a quick look at the components that can be found
inside it. Not all of the following parts are built into every SoC that we're going to show
you later on, but in order to better understand how a SoC works, you should have a general
picture of what goes inside it:

CPU: the central processing unit. Whether it is single-core or multi-core, this is what
makes everything possible on your smartphone. Most processors found inside the SoCs that
we're going to look at are based on ARM technology, but more on that later.

Memory: just like in a computer, memory is required to perform the various tasks smartphones
and tablets are capable of, and therefore SoCs come with various memory architectures on
board.

GPU: the graphics processing unit is also an important component of the SoC; it is
responsible for handling complex 3D games on smartphones and tablets. As you can expect,
there are various GPU architectures available, and we're going to detail them further in
what follows.

Northbridge: a component that handles communications between the CPU and other components
of the SoC, including the southbridge.

Southbridge: a second chipset, usually found on computers, that handles various I/O
functions. In some cases the southbridge can be found on the SoC.

Cellular radios: some SoCs also come with certain modems on board that are needed by mobile
operators. Such is the case with the Snapdragon S4 from Qualcomm, which has an embedded LTE
modem on board responsible for 4G LTE connectivity.

Other radios: some SoCs may also have other components responsible for other types of
connectivity, including Wi-Fi, GPS/GLONASS or Bluetooth. Again, the S4 is a good example in
this regard.

Timing sources, including oscillators and phase-locked loops.

External interfaces, including industry standards such as USB, FireWire, Ethernet, USART,
and SPI.

Peripherals, including counter-timers, real-time timers and power-on reset generators.

Analog interfaces, including ADCs and DACs.

Voltage regulators and power management circuits.

Other circuitry.

Image 2: Simplified look at the layout of Samsung's Exynos 5 Dual. The CPU and GPU are
there, but they're just small pieces of the larger puzzle. (Image source:
http://www.intechopen.com)

4 Briefly about Multi-core processors

A multi-core processor is a single computing component with two or more independent actual
central processing units (called "cores"), which are the units that read and execute
program instructions. The instructions are ordinary CPU instructions such as add, move data,
and branch, but the multiple cores can run multiple instructions at the same time,
increasing overall speed for programs amenable to parallel computing. Manufacturers
typically integrate the cores onto a single integrated circuit die (known as a chip
multiprocessor or CMP), or onto multiple dies in a single chip package.

Processors were originally developed with only one core. Multi-core processors were
developed in the early 2000s by Intel, AMD and others. They may have two cores (dual-core;
e.g. AMD Phenom II X2, Intel Core Duo), four cores (quad-core; e.g. AMD Phenom II X4,
Intel's quad-core i5 and i7 processors in the Intel Core family), six cores (e.g. AMD Phenom
II X6, Intel Core i7 Extreme Edition 980X), eight cores (e.g. Intel Xeon E7-2820, AMD
FX-8350), ten cores (e.g. Intel Xeon E7-2850), or more.

A multi-core processor implements multiprocessing in a single physical package. Designers
may couple cores in a multi-core device tightly or loosely. For example, cores may or may
not share caches, and they may implement message passing or shared memory inter-core
communication methods. Common network topologies to interconnect cores include bus, ring,
two-dimensional mesh, and crossbar.

Homogeneous multi-core systems include only identical cores; heterogeneous multi-core
systems, on the other hand, have cores that are not identical. Just as with single-processor
systems, cores in multi-core systems may implement architectures such as superscalar, VLIW,
vector processing, SIMD, or multithreading.

Multi-core processors are widely used across many application domains including
general-purpose, embedded, network, digital signal processing (DSP), and graphics.

The improvement in performance gained by the use of a multi-core processor depends very
much on the software algorithms used and their implementation. In particular, possible gains
are limited by the fraction of the software that can be run in parallel simultaneously on
multiple cores; this effect is described by Amdahl's law. In the best case, so-called
embarrassingly parallel problems may realize speedup factors near the number of cores, or
even more if the problem is split up enough to fit within each core's cache(s), avoiding use
of much slower main system memory. Most applications, however, are not accelerated so much
unless programmers invest a prohibitive amount of effort in re-factoring the whole problem.
The parallelization of software is a significant ongoing topic of research.
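The limit described by Amdahl's law can be made concrete with a short calculation. The sketch below is illustrative only: the function name and the sample parallel fractions are assumptions chosen for this example, not measurements from the project.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a program can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions, chosen only to show the trend.
    for p in (0.5, 0.9, 0.99):
        row = ", ".join(f"{n} cores: {amdahl_speedup(p, n):.2f}x" for n in (2, 4, 8))
        print(f"parallel fraction {p:.2f} -> {row}")
```

Note that this simple model does not capture the cache effect mentioned above, where a problem that fits into each core's cache can exceed the speedup the formula predicts.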

4.1 Multithreading, Hyper-Threading, or Multi-Core?

Programs are made up of execution threads. These threads are sequences of related

instructions. In the early days of the PC, most programs consisted of a single thread. The

operating systems in those days were capable of running only one such program at a time. The

result was, as some of us painfully recall, that your PC would freeze while it printed a document

or a spreadsheet. The system was incapable of doing two things simultaneously. Innovations in

the operating system introduced multitasking in which one program could be briefly suspended

and another one run. By quickly swapping programs in and out in this manner, the system gave

the appearance of running the programs simultaneously.

By the early 2000s, processor designs had gained additional execution resources (such as
logic dedicated to floating-point and integer math) to support executing multiple
instructions in parallel. Intel saw an opportunity in these extra facilities. The company
reasoned it could make better use of these resources by employing them to execute two
separate threads simultaneously on the same processor core. Intel named this simultaneous
processing Hyper-Threading Technology and released it on the Intel Xeon processors in 2002.
According to Intel benchmarks, applications that were written using multiple threads could
see improvements of up to 30% by running on processors with HT Technology. More important,
however, two programs could now run simultaneously on a processor without having to be
swapped in and out (see Fig. 4.1). To induce the operating system to recognize one processor
as two possible execution pipelines, the new chips were made to appear as two logical
processors to the operating system.

Fig. 4.1 HT Technology enables two threads to execute simultaneously on a single processor core

The performance boost of HT Technology was limited by the availability of shared resources to

the two executing threads. As a result, HT Technology cannot approach the processing

throughput of two distinct processors because of the contention for these shared resources. To

achieve greater performance gains on a single chip, a processor would require two separate

cores, such that each thread would have its own complete set of execution resources. Enter

multi-core.



Multi-Core Processors

Multi-core processors, as the name implies, contain two or more distinct cores in the same

physical package. Fig. 4.2 shows how this appears in relation to previous technologies.

Fig. 4.2 Multi-Core processors have multiple execution cores on a single chip

In this design, each core has its own execution pipeline, and each core has the resources
required to run without blocking resources needed by the other software threads.

While the example in Fig. 4.2 shows a two-core design, there is no inherent limitation in
the number of cores that can be placed on a single chip. Intel committed to shipping
dual-core processors in 2005, with additional cores to be added later. Mainframe processors
already use more than two cores, so there is precedent for this kind of development.

The multi-core design enables two or more cores to run at somewhat slower speeds and at

much lower temperatures. The combined throughput of these cores delivers processing power

greater than the maximum available today on single-core processors and at a much lower level

of power consumption. In this way, Intel increases the capabilities of server platforms as

predicted by Moore's Law while the technology no longer pushes the outer limits of physical

constraints.
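One common first-order way to see why several slower cores can deliver more throughput at lower power is the dynamic power relation P = C * V^2 * f. The sketch below is not taken from the source: it assumes that supply voltage can be scaled down in proportion to clock frequency and that the workload parallelizes perfectly, and all numbers are normalized, hypothetical values.

```python
def dynamic_power(capacitance: float, voltage: float, frequency: float) -> float:
    """First-order dynamic (switching) power: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


def multicore_estimate(cores: int, freq_scale: float,
                       capacitance: float = 1.0,
                       base_voltage: float = 1.0,
                       base_frequency: float = 1.0):
    """Return (total power, total throughput) relative to one full-speed core."""
    # Assumption: voltage is lowered in proportion to the clock frequency.
    voltage = base_voltage * freq_scale
    frequency = base_frequency * freq_scale
    power = cores * dynamic_power(capacitance, voltage, frequency)
    # Assumption: the workload splits perfectly across the cores.
    throughput = cores * frequency
    return power, throughput


if __name__ == "__main__":
    single = multicore_estimate(cores=1, freq_scale=1.0)
    dual = multicore_estimate(cores=2, freq_scale=0.7)
    print(f"one core at full speed: power={single[0]:.3f}, throughput={single[1]:.2f}")
    print(f"two cores at 70% speed: power={dual[0]:.3f}, throughput={dual[1]:.2f}")
```

Under these assumptions, two cores at 70% of the clock rate use less total power than one full-speed core while offering more aggregate throughput. Real chips also have static leakage and imperfect voltage scaling, so the actual gain is smaller.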



5 On-Chip Network to a Multi-core System

5.1 Abstract

An on-chip network architecture can be defined by four parameters: its topology, routing
algorithm, flow control protocol, and router micro-architecture.

Throughout this section, we will discuss how different choices of these four parameters
affect the overall cost-performance of an on-chip network. Clearly, the cost-performance of
an on-chip network depends on the requirements faced by its designers. Latency is a key
requirement in many on-chip network designs, where network latency refers to the delay
experienced by messages as they traverse from source to destination. Most on-chip networks
must also ensure high throughput, where network throughput is the peak bandwidth the network
can handle.

Another metric that is particularly critical in on-chip network design is network power,
which approximately correlates with the activity in the network as well as its complexity.

5.2 Topology

The effect of a topology on overall network cost-performance is profound. A topology
determines the number of hops (or routers) a message must traverse as well as the
interconnect lengths between hops, thus influencing network latency significantly.

As traversing routers and links consumes energy, a topology's effect on hop count also
directly affects network energy consumption. As for its effect on throughput, since a
topology dictates the total number of alternate paths between nodes, it affects how well the
network can spread out traffic and thus the effective bandwidth a network can support.
Network reliability is also greatly influenced by the topology, as it dictates the number of
alternative paths for routing around faults. The implementation complexity cost of a
topology depends on two factors: the number of links at each node (node degree) and the ease
of laying out a topology on a chip (wire lengths and the number of metal layers required).
Figure 2.1 shows three commonly used on-chip topologies. For the same number of nodes, and
assuming uniform random traffic where every node has an equal probability of sending to
every other node, a ring (Fig. 2.1.a) will lead to a higher hop count than a mesh (Fig.
2.1.b) or a torus [11] (Fig. 2.1.c). For instance, in the figure shown, assuming
bidirectional links and shortest-path routing, the maximum hop count of the ring is 4, that
of the mesh is also 4, while the torus improves it to 2. A ring topology also offers fewer
alternate paths between nodes than a mesh or torus, and thus saturates at a lower network
throughput for most traffic patterns. For instance, a message between nodes A and B in the
ring topology can only traverse one of two paths, but in a 3x3 mesh topology there are six
possible paths. As for network reliability, among these three networks, a torus offers the
most tolerance to faults because it has the highest number of alternative paths between
nodes.

Figure 2.1 Common on-chip network topologies: (a) ring, (b) mesh, and (c) torus
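The hop counts quoted above can be checked with a small breadth-first search over each topology. The sketch below assumes nine nodes per network (a 9-node ring and 3x3 mesh and torus) and takes A and B to be opposite corners of the mesh; the helper names are chosen for this illustration only.

```python
from collections import deque
from itertools import product


def ring(n):
    """Adjacency of a bidirectional ring with n nodes."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(k, wrap):
    """Adjacency of a k x k mesh (wrap=False) or torus (wrap=True)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        neighbours = set()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                neighbours.add((nx % k, ny % k))
            elif 0 <= nx < k and 0 <= ny < k:
                neighbours.add((nx, ny))
        neighbours.discard((x, y))
        adj[(x, y)] = neighbours
    return adj


def hop_counts(adj, source):
    """Breadth-first search: shortest hop count from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def max_hops(adj):
    """Largest shortest-path hop count over all node pairs."""
    return max(max(hop_counts(adj, s).values()) for s in adj)


def shortest_path_count(adj, source, target):
    """Number of distinct shortest paths from source to target."""
    dist = hop_counts(adj, source)
    ways = {source: 1}
    # Visit nodes in order of distance so predecessors are counted first.
    for node in sorted(dist, key=dist.get):
        for nxt in adj[node]:
            if dist[nxt] == dist[node] + 1:
                ways[nxt] = ways.get(nxt, 0) + ways[node]
    return ways[target]


if __name__ == "__main__":
    print("ring  max hops:", max_hops(ring(9)))
    print("mesh  max hops:", max_hops(grid(3, wrap=False)))
    print("torus max hops:", max_hops(grid(3, wrap=True)))
    print("mesh corner-to-corner shortest paths:",
          shortest_path_count(grid(3, wrap=False), (0, 0), (2, 2)))
```

The same helpers can be reused for larger networks by changing the node count, which makes it easy to see how quickly the ring's hop count grows compared with the two-dimensional topologies.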

While rings have poorer performance (latency, throughput, energy, and reliability) than
higher-dimensional networks, they have lower implementation overhead. A ring has a node
degree of 2 while a mesh or torus has a node degree of 4, where node degree refers to the
number of links in and out of a node. A higher node degree requires more links and higher
port counts at routers. All three topologies featured are two-dimensional planar topologies
that map readily to a single metal layer, with a layout similar to that shown in the figure,
except for the torus, which should be arranged physically in a folded manner to equalize
wire lengths (see Fig. 2.2) instead of employing long wrap-around links between edge nodes.
A torus illustrates the importance of considering implementation details when comparing
alternative topologies. While a torus has a lower hop count than a mesh (which leads to
lower delay and energy), wire lengths in a folded torus are twice those in a mesh of the
same size, so per-hop latency and energy are actually higher. Furthermore, a torus requires
twice the number of links, which must be factored into the wiring budget. If the available
wiring bisection bandwidth is fixed, a torus will be restricted to narrower links than a
mesh, thus lowering per-link bandwidth and increasing transmission delay. Determining the
best topology for an on-chip network subject to physical and technology constraints is an
area of active research.
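The effect of a fixed bisection wiring budget can be illustrated with a rough calculation. The sketch below assumes a k x k network with even k, counts k bidirectional links across the middle cut of a mesh and 2k across that of a torus (because of the wrap-around links), and uses a hypothetical wire budget and packet size; none of these numbers come from the project.

```python
import math


def bisection_links(k: int, torus: bool) -> int:
    """Bidirectional links crossing the middle cut of a k x k network."""
    if k < 4 or k % 2:
        raise ValueError("k must be an even number of at least 4")
    # Each row contributes one crossing link in a mesh, two in a torus.
    return 2 * k if torus else k


def link_width(wire_budget: int, k: int, torus: bool) -> int:
    """Wires available per link when the bisection wire budget is fixed."""
    return wire_budget // bisection_links(k, torus)


def serialization_cycles(packet_bits: int, width: int) -> int:
    """Cycles needed to push one packet across a link of the given width."""
    return math.ceil(packet_bits / width)


if __name__ == "__main__":
    budget, k, packet = 2048, 8, 512  # hypothetical values
    for name, is_torus in (("mesh", False), ("torus", True)):
        width = link_width(budget, k, is_torus)
        cycles = serialization_cycles(packet, width)
        print(f"{name}: {width}-bit links, {cycles} cycles per {packet}-bit packet")
```

With these assumed numbers the torus ends up with links half as wide as the mesh, so each packet takes twice as many cycles to cross a link, which offsets part of the benefit of the lower hop count.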

Figure 2.2 Layout of an 8x8 folded torus



5.3 Routing

The goal of the routing algorithm is to distribute traffic evenly among the paths supplied
by the network topology, so as to avoid hotspots and minimize contention, thus improving
network latency and throughput. In addition, the routing algorithm is the critical component
in fault tolerance: once faults are identified, the routing algorithm must be able to skirt
the faulty nodes and links without substantially affecting network performance. All of these
performance goals must be achieved while adhering to tight constraints on implementation
complexity: routing circuitry can stretch critical path delay and add to a router's area
footprint. While the energy overhead of routing circuitry is typically low, the specific
route chosen affects hop count directly and thus substantially affects energy consumption.

While numerous routing algorithms have been proposed, the most commonly used routing

algorithm in on-chip networks is dimension-ordered routing (DOR) due to its simplicity. With

DOR, a message traverses the network dimension by dimension, reaching the coordinate

matching its destination before switching to the next dimension.

In a two-dimensional topology such as the mesh in Fig. 2.3, dimension-ordered routing, say XY

routing, sends packets along the X-dimension first, followed by the Y-dimension. A packet

traveling from (0,0) to (2,3) will first traverse two hops along the X-dimension, arriving at (2,0),

before traversing three hops along the Y-dimension to its destination. Dimension-ordered

routing is an example of a deterministic routing algorithm, in which all messages from node A

to B will always traverse the same path. Another class of routing algorithms is oblivious ones,

where messages traverse different paths from A to B, but the path is selected without regard

to the actual network situation at transmission time. For instance, a router could randomly choose among alternative paths prior to sending a message. Figure 2.3 shows an example

where messages from (0,0) to (2,3) can be randomly sent along either the YX route or the XY

route. A more sophisticated routing algorithm can be adaptive, in which the path a message

takes from A to B depends on the network traffic situation. For instance, a message going

along the XY route that sees congestion at (1,0)'s east outgoing link can instead choose to take the north

outgoing link towards the destination (see Fig. 2.3).
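The three routing classes described above can be sketched in a few lines. This is an illustrative sketch only, assuming a 2D mesh with nodes addressed as (x, y), east as +X and north as +Y; the function names and the `congested_links` set are hypothetical and not taken from any particular router design.

```python
import random

def dor_route(src, dst, order="XY"):
    """Nodes visited after src under dimension-ordered routing."""
    x, y = src
    path = []
    for dim in order:
        if dim == "X":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path

def oblivious_route(src, dst, rng=random):
    """Pick XY or YX at random, ignoring the state of the network."""
    return dor_route(src, dst, rng.choice(["XY", "YX"]))

def adaptive_next_hop(cur, dst, congested_links):
    """Prefer the X direction, but turn early if that outgoing link is congested."""
    x, y = cur
    candidates = []
    if dst[0] != x:
        candidates.append((x + (1 if dst[0] > x else -1), y))
    if dst[1] != y:
        candidates.append((x, y + (1 if dst[1] > y else -1)))
    for nxt in candidates:
        if (cur, nxt) not in congested_links:
            return nxt
    # Every productive link is congested (or we have arrived): fall back.
    return candidates[0] if candidates else None

print(dor_route((0, 0), (2, 3)))  # two hops along X, then three along Y
```

For the example in the text, `adaptive_next_hop((1, 0), (2, 3), {((1, 0), (2, 0))})` skips the congested east link and returns the node to the north.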



Fig. 2.3 DOR illustrates an XY route from (0,0) to (2,3) in a mesh, while Oblivious shows two alternative routes (XY and YX) between the same source-destination pair that can be chosen obliviously prior to message transmission. Adaptive shows a possible adaptive route that branches away from the XY route if congestion is encountered at (1,0)

In selecting or designing a routing algorithm, its effect on delay, energy,

throughput, and reliability must be taken into account. In addition, most applications require the network to

guarantee deadlock freedom. A deadlock occurs when a cycle exists among the paths of

multiple messages. Figure 2.4 shows four gridlocked (and deadlocked) messages waiting for

links that are currently held by other messages and prevented from making forward progress:

The packet entering router A from the South input port is waiting to leave through the East

output port, but another packet is holding onto that exact link while waiting at router B to leave

via the South output port, which is again held by another packet that is waiting at router C to

leave via the West output port and so on.

Fig. 2.4 A classic network deadlock where four packets cannot make forward progress as they are waiting for links that other packets are holding on to



Deadlock freedom can be ensured in the routing algorithm by preventing cycles among the

routes generated by the algorithm, or in the flow control protocol by preventing router buffers

from being acquired and held in a cyclic manner. Using the routing algorithms we discussed

above as examples, dimension-ordered routing is deadlock-free since in XY routing, there

cannot be a turn from a Y link to an X link, so cycles cannot occur. Two of the four turns in Fig.

2.4 will not be permitted, so a cycle is not possible. The oblivious algorithm that randomly chooses between XY or YX routes is not deadlock-free because all four turns from Fig. 2.4 are

possible leading to potential cycles in the link acquisition graph. Likewise, the adaptive route

shown in Fig. 2.3 is a superset of oblivious routing and is subject to potential deadlock. A

network that uses a deadlock-prone routing algorithm requires a flow control protocol that

ensures deadlock freedom.
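One way to check the deadlock argument above is to build the link acquisition (channel dependency) graph for a routing function and test it for cycles. The sketch below is a small illustration on a 3x3 mesh; the helper names are hypothetical, and cycle detection uses Kahn's topological-sort algorithm, which is a choice made here rather than something prescribed by the text.

```python
from itertools import product

def route(src, dst, order):
    """Minimal dimension-ordered route as a list of nodes, including src."""
    x, y = src
    nodes = [src]
    for dim in order:
        if dim == "X":
            while x != dst[0]:
                x += 1 if dst[0] > x else -1
                nodes.append((x, y))
        else:
            while y != dst[1]:
                y += 1 if dst[1] > y else -1
                nodes.append((x, y))
    return nodes

def dependency_graph(k, orders):
    """Map each directed link to the links a packet may request while holding it."""
    deps = {}
    nodes = list(product(range(k), repeat=2))
    for src, dst in product(nodes, repeat=2):
        for order in orders:
            path = route(src, dst, order)
            links = list(zip(path, path[1:]))
            for link in links:
                deps.setdefault(link, set())
            for held, wanted in zip(links, links[1:]):
                deps[held].add(wanted)
    return deps

def has_cycle(deps):
    """Kahn's algorithm: a cycle exists if some links can never be removed."""
    indegree = {link: 0 for link in deps}
    for targets in deps.values():
        for target in targets:
            indegree[target] += 1
    ready = [link for link, degree in indegree.items() if degree == 0]
    removed = 0
    while ready:
        link = ready.pop()
        removed += 1
        for target in deps[link]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    return removed < len(deps)

xy_only_cyclic = has_cycle(dependency_graph(3, ["XY"]))
xy_or_yx_cyclic = has_cycle(dependency_graph(3, ["XY", "YX"]))
print(xy_only_cyclic, xy_or_yx_cyclic)
```

With XY routing alone no dependency ever leads from a Y link back to an X link, so the graph is acyclic. Allowing both XY and YX routes makes all turns available and a cycle appears.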

Routing algorithms can be implemented in several ways. First, the route can be embedded in

the packet header at the source, known as source routing. For instance, the XY route in Fig. 2.3

can be encoded as , while the YX route can be encoded as . At each hop, the router will read the left-most direction off the route header, send the packet towards the specified outgoing link, and strip off the portion of the header

corresponding to the current hop. Alternatively, the message can encode the coordinates of the

destination, and comparators at each router determine whether to accept or forward the

message. Simple routing algorithms are typically implemented as combinational circuits within

the router due to the low overhead, while more sophisticated algorithms are realized using

routing tables at each hop which store the outgoing link a message should take to reach a

particular destination. Adaptive routing algorithms need mechanisms to track network

congestion levels, and update the route. Route adjustments can be implemented by modifying

the header, employing combinational circuitry that accepts as input these congestion signals, or

updating entries in a routing table. Many congestion-sensitive mechanisms have been

proposed, with the simplest being tapping into information that is already captured and used

by the flow control protocol, such as buffer occupancy or credits.
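Source routing as described above can be sketched as follows. The header format used here (a list of direction letters ending in an "Eject" marker) is an assumption made for illustration, as are the function names; east is taken as +X and north as +Y.

```python
# Hypothetical header format: one direction letter per hop, then "Eject".
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def encode_route(src, dst):
    """Build an XY source route at the sender."""
    dx = dst[0] - src[0]
    dy = dst[1] - src[1]
    header = ["E" if dx > 0 else "W"] * abs(dx)
    header += ["N" if dy > 0 else "S"] * abs(dy)
    header.append("Eject")
    return header

def forward(node, header):
    """One router hop: read the left-most entry, act on it, and strip it."""
    direction, rest = header[0], header[1:]
    if direction == "Eject":
        return node, rest          # deliver to the local processing element
    dx, dy = MOVES[direction]
    return (node[0] + dx, node[1] + dy), rest

header = encode_route((0, 0), (2, 3))
node = (0, 0)
while header:
    node, header = forward(node, header)
print(node)
```

Because each router only reads and strips the first entry, it needs no routing table and no knowledge of the destination coordinates.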

5.4 Flow Control

Flow control governs the allocation of network buffers and links. It determines when buffers

and links are assigned to which messages, the granularity at which they are allocated, and how

these resources are shared among the many messages using the network. A good flow control

protocol lowers the latency experienced by messages at low loads by not imposing high overhead in resource allocation, and pushes network throughput through enabling effective

sharing of buffers and links across messages. In determining the rate at which packets access

buffers (or skip buffer access altogether) and traverse links, flow control is instrumental in

determining network energy and power. Flow control also critically affects network quality-of-service

since it determines the arrival and departure time of packets at each hop. The

implementation complexity of a flow control protocol includes the complexity of the router



micro-architecture as well as the wiring overhead imposed in communicating resource

information between routers.

In store-and-forward flow control, each node waits until an entire packet has been received

before forwarding any part of the packet to the next node. As a result, long delays are incurred

at each hop, which makes this scheme unsuitable for on-chip networks, which are usually delay-critical.

To reduce the delay packets experience at each hop, virtual cut-through flow control allows

transmission of a packet to begin before the entire packet is received. Latency experienced by a

packet is thus drastically reduced, as shown in Fig. 2.5. However, bandwidth and storage are

still allocated in packet-sized units. Packets still only move forward if there is enough storage to

hold the entire packet. On-chip networks with tight area and power constraints find it difficult

to accommodate the large buffers needed to support virtual cut-through (assuming large

packets).

Like virtual cut-through flow control, wormhole flow control cuts through flits, allowing flits to

move on to the next router as soon as there is sufficient buffering for this flit. However, unlike

store-and-forward and virtual cut-through flow control, wormhole flow control allocates

storage and bandwidth to flits that are smaller than a packet. This allows relatively small flit-

buffers to be used in each router, even for large packet sizes. While wormhole flow control uses

buffers effectively, it makes inefficient use of link bandwidth. Though it allocates storage and

bandwidth in flit-sized units, a link is held for the duration of a packet's lifetime in the router.

As a result, when a packet is blocked, all of the physical links held by that packet are left idle.

Throughput suffers because other packets queued behind the blocked packet are unable to use

the idle physical links.

Fig. 2.5 Timing for (a) store-and-forward and (b) cut-through flow control at low loads, where

tr refers to the delay routing the head flit through each router, ts refers to the serialization delay

transmitting the remaining flits of the packet through each router, and tw refers to the time



involved in propagating bits across the wires between adjacent routers. Wormhole and virtual-

channel flow control have the same timing as cut-through flow control at low loads.
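The timing difference in Fig. 2.5 can be expressed as two simple zero-load latency formulas. The sketch below assumes a contention-free network and treats ts as the serialization delay of the whole packet; both are simplifications introduced here, and the function names are illustrative.

```python
def store_and_forward_latency(hops, t_r, t_s, t_w):
    """The whole packet is received at every hop before it moves on."""
    return hops * (t_r + t_s + t_w)

def cut_through_latency(hops, t_r, t_s, t_w):
    """Only the head pays the per-hop cost; serialization is paid once."""
    return hops * (t_r + t_w) + t_s

# Example values in cycles (assumed, not measured): 5 hops, 1-cycle routing,
# 4-cycle packet serialization, 1-cycle wire delay.
saf = store_and_forward_latency(5, t_r=1, t_s=4, t_w=1)
ct = cut_through_latency(5, t_r=1, t_s=4, t_w=1)
print(saf, ct)
```

The gap between the two grows with both hop count and packet length, which is why store-and-forward is a poor fit for delay-critical on-chip networks.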

Virtual-channel flow control improves upon the link utilization of wormhole flow control,

allowing blocked packets to be passed by other packets. A virtual channel (VC) consists merely of a separate flit queue in the router; multiple VCs share the physical wires (physical link)

between two routers. Virtual channels arbitrate for physical link bandwidth on a flit-by-flit

basis. When a packet holding a virtual channel becomes blocked, other packets can still

traverse the physical link through other virtual channels. Thus, VCs increase the utilization of

the critical physical links and extend overall network throughput. Current on-chip network

designs overwhelmingly adopt wormhole flow control for its small area and power footprint,

and use virtual channels to extend the bandwidth where needed. Virtual channels are also

widely used to break deadlocks, both within the network, and for handling system level

deadlocks.

Figure 2.6 illustrates how two virtual channels can be used to break a cyclic deadlock in the network when the routing protocol permits a cycle.

Fig. 2.6 Two virtual channels (denoted by solid and dashed lines) and their associated separate buffer queues (denoted as two circles at each router) used to break the cyclic route deadlock in Fig. 2.4

Since each VC is time-multiplexed onto the physical link cycle-by-cycle, holding onto a VC

implies holding on to its

associated buffer queue rather than locking down a physical link. By enforcing an

order on VCs, so that lower-priority VCs cannot request and wait for higher-priority

VCs, there cannot be a cycle in resource usage. At the system level, messages that

can potentially block on each other can be assigned to different message classes

that are mapped onto different virtual channels within the network, such as request

and acknowledgment messages of coherence protocols. Implementation complexity

of virtual channel routers will be discussed in detail next in Sect. 5.5 on router micro-architecture.



Buffer backpressure:

Unlike broadband networks, on-chip networks are typically

not designed to tolerate dropping of packets in response to congestion. Instead,

buffer backpressure mechanisms stall flits from being transmitted when buffer space

is not available at the next hop. Two commonly used buffer backpressure mechanisms are credits and on/off signaling. Credits keep track of the number of buffers

available at the next hop, by sending a credit to the previous hop when a flit leaves

and vacates a buffer, and incrementing the credit count at the previous hop upon

receiving the credit. On/off signaling involves a signal between adjacent routers that

is turned off to stop the router at the previous hop from transmitting flits when the

number of free buffers drops below a threshold, with the threshold set to ensure that all

in-flight flits will have buffers upon arrival.
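Credit-based backpressure can be illustrated with a small model of one link. This is a simplified sketch: the class name `CreditLink` is hypothetical, and credits are returned instantly, whereas a real design would model the round-trip delay of the credit signal.

```python
from collections import deque

class CreditLink:
    """Upstream router's view of one downstream input buffer."""

    def __init__(self, downstream_buffers):
        self.credits = downstream_buffers   # free buffer slots at the next hop
        self.downstream_queue = deque()

    def try_send(self, flit):
        """Send a flit only if a credit is available; never drop it."""
        if self.credits == 0:
            return False                    # stall: no buffer at the next hop
        self.credits -= 1
        self.downstream_queue.append(flit)
        return True

    def downstream_forward(self):
        """The next hop forwards a flit, vacating a buffer and returning a credit."""
        flit = self.downstream_queue.popleft()
        self.credits += 1
        return flit

link = CreditLink(downstream_buffers=2)
sent = [link.try_send(f) for f in ("flit0", "flit1", "flit2")]
print(sent)  # the third flit stalls until a credit comes back
```

After `link.downstream_forward()` frees a slot, a retry of `link.try_send("flit2")` succeeds.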

5.5 Router Micro-architecture

How a router is built determines its critical path delay, per-hop delay, and overall network

latency. Router micro-architecture also affects network energy as it determines the circuit

components in a router and their activity. The implementation

of the routing and flow control and the actual router pipeline will affect the efficiency

at which buffers and links are used and thus overall network throughput. In

terms of reliability, faults in the router datapath will lead to errors in the transmitted

message, while errors in the control circuitry can lead to lost and mis-routed messages.

The area footprint of the router is clearly highly determined by the chosen router micro-architecture

and underlying circuits.

Figure 2.7 shows the micro-architecture of a state-of-the-art credit-based virtual

channel (VC) router to explain how typical routers work. The example assumes a

two-dimensional mesh, so that the router has five input and output ports corresponding

to the four neighboring directions and the local processing element (PE) port.

The major components which constitute the router are the input buffers, route computation

logic, virtual channel allocator, switch allocator, and the crossbar switch. Most on-chip network routers are input-buffered, in which packets are stored in

buffers only at the input ports.



Fig. 2.7 Credit-based virtual channel router micro-architecture

6 Project Methodologies, Results, and Achievements

The purpose of this section is to summarize the developments that took place within this project and put them in a larger scientific and technological context. The principles underlying this approach and all the methods used in this project are derived from previously conducted research.

As already explained at the beginning of this project, our purpose is to concentrate on the customers' needs. To accomplish that, a survey was conducted over the internet so that we could gain points of view that may differ from ours, or in other words, determine what smartphones are mostly used for.

Then, according to the survey results and based on the SoC structure and multi-core system architectures, this project will offer an on-chip network that brings high performance, smart use of space, and network interconnections between the components of the chip, which means a long-lasting battery.

All testing of the proposed design will be done on a discrete event simulator, which is described later in this section.

6.1 Usage of Smartphones Survey (Q & A)

The survey ran from 4 March 2014 to 15 March 2014 and was created on the free survey website www.freeonlinesurveys.com. It was then shared over social media and answered by 43 people. The questions and results of the survey are presented in the following sections.



6.2 Analysis on the survey results

From the results we can see that all age groups are represented, but the survey was answered mainly by people aged 15 to 22 (59.5 %). Most of the respondents use the Android operating system (69 % vs 19 % Apple). Given how much time users spend daily on their smartphones (1-4 hours: 59.6 %) and how long their smartphone batteries last (10-12 hours: 41.9 %), it can be concluded that battery life is probably one of the biggest problems of today's smartphones, which makes it a priority in our new architecture plan.

Internet connectivity is also a main factor, so the design must support both Wi-Fi and 3G/4G connections.

From the ranking of items offered to the respondents, we can clearly see what most of them use their phones for. This appears to be a new era for the technology: 40.48 % said that their first usage priority is social media (Facebook, Twitter, Foursquare, etc.), while 38.10 % said that their first priority is phone calls rather than the other items offered in the list.

Another thing we can learn from this question is that people use their phones mostly for Internet calls (Viber, Skype, etc.), Instagram (as a social network and picture editor), Internet surfing, music players, etc. Clearly, many people also use their phones as cameras and for playing games. From the last question of the survey we can see that almost half of the respondents (41.9 %) prefer playing games in HD graphics.

What did we learn from the collected results?

Light and heavy data transfer should be offered in our plan

Internet browsing



HD Graphic Card

Conference calling

HD editing

Quick, low-delay database

Medium and Fast data links

Cellular support

From the results it can be concluded that processors with different performance levels would be needed. For example, it is not the same whether two general processors execute a picture-editing program or one dedicated graphics processor executes the code.

Also, if we want the battery to last longer, it would be useful to use only regular processors for easy tasks (for example, light database transfer).

We can therefore conclude that it would be better to design a heterogeneous multi-core system, which includes non-identical processors. With that in mind, the interconnections of the network and all its components will be of great importance.

To present the planned design, the next section first presents the simulator on which the tests will be done, followed by the architecture and its details.

6.3 Discrete event simulations

In the field of simulation, a discrete-event simulation (DES) models the operation of

a system as a discrete sequence of events in time. Each event occurs at a particular instant in

time and marks a change of state in the system. Between consecutive events, no change in the

system is assumed to occur; thus the simulation can directly jump in time from one event to the

next.

This contrasts with continuous simulation in which the simulation continuously tracks the

system dynamics over time. Instead of being event-based, this is called an activity-based

simulation; time is broken up into small time slices and the system state is updated according to

the set of activities happening in the time slice. Because discrete-event simulations do not have

to simulate every time slice, they can typically run much faster than the corresponding

continuous simulation.

Another alternative to event-based simulation is process-based simulation. In this approach,

each activity in a system corresponds to a separate process, where a process is typically

simulated by a thread in the simulation program. In this case, the discrete events, which are

generated by threads, would cause other threads to sleep, wake, and update the system state.

A more recent method is the three-phased approach to discrete event simulation (Pidd, 1998).

In this approach, the first phase is to jump to the next chronological event. The second phase is

to execute all events that unconditionally occur at that time (these are called B-events). The

third phase is to execute all events that conditionally occur at that time (these are called C-

events). The three-phase approach is a refinement of the event-based approach in which

simultaneous events are ordered so as to make the most efficient use of computer resources.

The three-phase approach is used by a number of commercial simulation software packages,

but from the user's point of view, the specifics of the underlying simulation method are

generally hidden.
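The three phases can be sketched as a small simulation loop. This is a minimal illustration, not the implementation of any commercial package: the function `run_three_phase`, the single-server example, and the fixed 2.0 time-unit service duration are all assumptions made for the sketch.

```python
import heapq
from itertools import count

def run_three_phase(initial_events, conditional_events, state, end_time):
    pending = []                 # bound (B) events: (time, sequence, action)
    order = count()              # tie-breaker so actions are never compared

    def schedule(time, action):
        heapq.heappush(pending, (time, next(order), action))

    for time, action in initial_events:
        schedule(time, action)

    clock = 0.0
    while pending and pending[0][0] <= end_time:
        # Phase A: jump to the time of the next scheduled event.
        clock = pending[0][0]
        # Phase B: execute every bound event due at this time.
        while pending and pending[0][0] == clock:
            _, _, action = heapq.heappop(pending)
            action(state, clock, schedule)
        # Phase C: keep trying conditional events until none can fire.
        fired = True
        while fired:
            fired = False
            for condition, action in conditional_events:
                if condition(state):
                    action(state, clock, schedule)
                    fired = True
    return clock

# Example: a single server with a queue.
def arrival(state, now, schedule):            # B-event
    state["queue"] += 1

def end_service(state, now, schedule):        # B-event
    state["busy"] = False
    state["served"] += 1

def can_begin(state):                         # condition of the C-event
    return state["queue"] > 0 and not state["busy"]

def begin_service(state, now, schedule):      # C-event
    state["queue"] -= 1
    state["busy"] = True
    schedule(now + 2.0, end_service)          # assumed fixed service time

state = {"queue": 0, "busy": False, "served": 0}
final_time = run_three_phase(
    [(0.0, arrival), (1.0, arrival), (5.0, arrival)],
    [(can_begin, begin_service)], state, end_time=100.0)
print(final_time, state["served"])
```

Arrivals and service completions are bound to a time, while the start of service is conditional on the server being idle and the queue being non-empty, which is why it is tried in phase C.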

6.4 Example of DES in real life

A common exercise in learning how to build discrete-event simulations is to model a queue,

such as customers arriving at a bank to be served by a teller. In this example, the system

entities are Customer-queue and Tellers. The system events are Customer-

Arrival and Customer-Departure. (The event of Teller-Begins-Service can be part of the logic of

the arrival and departure events.) The system states, which are changed by these events,

are Number-of-Customers-in-the-Queue (an integer from 0 to n) and Teller-Status (busy or

idle). The random variables that need to be characterized to model this

system stochastically are Customer-Interarrival-Time and Teller-Service-Time. An agent-based

framework for performance modeling of an optimistic parallel discrete event simulator is

another example for a discrete event simulation.
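The bank example can be written as a short event-scheduling simulation. The sketch below assumes a single teller and exponentially distributed interarrival and service times; these distributions, the function name `simulate_bank`, and the parameter values are illustrative choices, not part of the original example.

```python
import heapq
import random
from collections import deque

def simulate_bank(mean_interarrival, mean_service, end_time, seed=1):
    rng = random.Random(seed)       # seeded so a run can be repeated exactly
    clock = 0.0
    events = []                     # pending event set: (time, sequence, kind)
    sequence = 0
    queue = deque()                 # arrival times of waiting customers
    teller_busy = False
    waits = []

    def schedule(time, kind):
        nonlocal sequence
        heapq.heappush(events, (time, sequence, kind))
        sequence += 1

    schedule(rng.expovariate(1.0 / mean_interarrival), "arrival")
    while events:
        time, _, kind = heapq.heappop(events)
        if time > end_time:         # ending condition: stop at time t
            break
        clock = time                # the clock jumps straight to the event
        if kind == "arrival":
            schedule(clock + rng.expovariate(1.0 / mean_interarrival), "arrival")
            if teller_busy:
                queue.append(clock)
            else:
                teller_busy = True
                waits.append(0.0)
                schedule(clock + rng.expovariate(1.0 / mean_service), "departure")
        else:                       # departure
            if queue:
                waits.append(clock - queue.popleft())
                schedule(clock + rng.expovariate(1.0 / mean_service), "departure")
            else:
                teller_busy = False
    mean_wait = sum(waits) / len(waits) if waits else 0.0
    return {"customers_started": len(waits), "mean_wait": mean_wait, "clock": clock}

result = simulate_bank(mean_interarrival=2.0, mean_service=1.5, end_time=100.0, seed=7)
print(result)
```

The Teller-Begins-Service logic is folded into the arrival and departure handlers, as the text suggests, so only two event types are needed.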

6.5 Components of a discrete-event simulation

State

A system state is a set of variables that captures the salient properties of the system to be

studied. The state trajectory over time S(t) can be mathematically represented by a step

function whose values change at discrete events.

Clock

The simulation must keep track of the current simulation time, in whatever measurement units

are suitable for the system being modeled. In discrete-event simulations, as opposed to real-time

simulations, time hops: because events are instantaneous, the clock skips to the next

event start time as the simulation proceeds.

Events list

The simulation maintains at least one list of simulation events. This is sometimes called

the pending event set because it lists events that are pending as a result of previously simulated

events but have yet to be simulated themselves. An event is described by the time at which it occurs and a type, indicating the code that will be used to simulate that event. It is common for

the event code to be parametrized, in which case, the event description also contains

parameters to the event code.

When events are instantaneous, activities that extend over time are modeled as sequences of

events. Some simulation frameworks allow the time of an event to be specified as an interval,

giving the start time and the end time of each event.

Random-number generators

The simulation needs to generate random variables of various kinds, depending on the system

model. This is accomplished by one or more pseudo-random number generators. The use of

pseudo-random numbers as opposed to true random numbers is a benefit should a simulation

need a rerun with exactly the same behavior.

One of the problems with the random number distributions used in discrete-event simulation is

that the steady-state distributions of event times may not be known in advance. As a result, the

initial set of events placed into the pending event set will not have arrival times representative

of the steady-state distribution. This problem is typically solved by bootstrapping the simulation

model. Only a limited effort is made to assign realistic times to the initial set of pending events.

These events, however, schedule additional events, and with time, the distribution of event

times approaches its steady state. This is called bootstrapping the simulation model. In

gathering statistics from the running model, it is important to either disregard events that occur

before the steady state is reached or to run the simulation for long enough that the

bootstrapping behavior is overwhelmed by steady-state behavior. (This use of the

term bootstrapping can be contrasted with its use in both statistics and computing.)

Statistics

The simulation typically keeps track of the system's statistics, which quantify the aspects of

interest. In the bank example, it is of interest to track the mean waiting times. In a simulation

model, performance metrics are not analytically derived from probability distributions, but rather as averages over replications, that is, different runs of the model. Confidence

intervals are usually constructed to help assess the quality of the output.
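Averaging over replications and building a confidence interval can be sketched as follows. The stand-in model, its arrival and service rates, the warm-up length, and the normal-approximation factor z = 1.96 are all assumptions made for illustration; with few replications a Student-t quantile would be the more careful choice.

```python
import math
import random
import statistics

def one_replication(seed, customers=200, warmup=20):
    """Stand-in model: single-teller FIFO queue, mean wait after warm-up."""
    rng = random.Random(seed)
    arrival = 0.0
    teller_free_at = 0.0
    waits = []
    for i in range(customers):
        arrival += rng.expovariate(1.0)            # mean interarrival 1.0 (assumed)
        start = max(arrival, teller_free_at)       # wait if the teller is busy
        teller_free_at = start + rng.expovariate(1.25)   # mean service 0.8 (assumed)
        if i >= warmup:                            # discard the start-up transient
            waits.append(start - arrival)
    return statistics.mean(waits)

def confidence_interval(samples, z=1.96):
    """Approximate 95 % interval for the mean across replications."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width

results = [one_replication(seed) for seed in range(10)]   # ten independent runs
low, high = confidence_interval(results)
print(f"mean wait in [{low:.2f}, {high:.2f}]")
```

Each replication uses its own seed, so the runs are independent but individually repeatable, and the first customers are discarded so that the start-up transient does not bias the estimate.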

Ending condition

Because events are bootstrapped, theoretically a discrete-event simulation could run forever.

So the simulation designer must decide when the simulation will end. Typical choices are at

time t or after processing n number of events or, more generally, when statistical measure

X reaches the value x.

6.6 Network Simulators as DES

Discrete event simulation is used in computer networks to simulate new protocols for different

network traffic scenarios before deployment.

In communication and computer network research, network simulation is a technique where a

program models the behavior of a network either by calculating the interaction between the

different network entities (hosts/packets, etc.) using mathematical formulas, or by actually capturing and playing back observations from a production network. The behavior of the

network and the various applications and services it supports can then be observed in a test

lab; various attributes of the environment can also be modified in a controlled manner to assess

how the network would behave under different conditions.

There are many network simulators, both free/open-source and proprietary. Examples of

notable network simulation software, ordered by how often they are mentioned in

research papers, are:

ns (open source)

OPNET (proprietary software)

NetSim (proprietary software)

6.7 Network Simulations with OPNET

OPNET Technologies, Inc. is a software business that provides performance management for

computer networks and applications. The company was founded in 1986 and went public in 2000.

OPNET can serve a variety of needs. Compared to the cost and time involved in setting up

an entire test bed containing multiple networked computers, routers and data links, OPNET is

relatively fast and inexpensive. It allows engineers and researchers to test scenarios that might

be particularly difficult or expensive to emulate using real hardware, for instance, simulating a

scenario with several nodes or experimenting with a new protocol in the network. Network

simulators are particularly useful in allowing researchers to test new networking protocols or

changes to existing protocols in a controlled and reproducible environment. A typical network

simulator encompasses a wide range of networking technologies and can help the users to

build complex networks from basic building blocks such as a variety of nodes and links. With the

help of simulators, one can design hierarchical networks using various types of nodes like

computers, hubs, bridges, routers, switches, links, mobile units, etc.

Various types of Wide Area Network (WAN) technologies like TCP, ATM, IP, etc. and Local Area

Network (LAN) technologies like Ethernet, token rings, etc. can all be simulated with a typical

simulator, and the user can test and analyze various standard results, apart from devising some

novel protocol or strategy for routing, etc. Network simulators are also widely used to simulate

battlefield networks in network-centric warfare.

Minimally, a network simulator must enable a user to represent a network topology, specifying

the nodes on the network, the links between those nodes and the traffic between the nodes.

More complicated systems like OPNET allow the user to specify everything about the protocols

used to handle traffic in a network. Graphical applications allow users to easily visualize the

workings of their simulated environment. Text-based applications may provide a less intuitive

interface, but may permit more advanced forms of customization.

7 Implementing the project in OPNET

7.1 Adding Traffic

Based on the theory presented so far and the results gained from the survey, we made three architectural designs using OPNET.

We stated that it is better to organize a heterogeneous network (with different processors), but we must pay attention to the interconnections between the components of our System-on-a-Chip. That is why this section presents different interconnections of 8 CPUs. Like most modern smartphones, our proposed architecture is built from basic CPUs that handle simple tasks (keeping the system running, light data transfer, cellular calls, etc.) and additional CPUs in charge of more complicated tasks (heavy data transfer, picture editing, playing HD videos, etc.). In the OPNET tests two of Intel's full node models were used: Intel_D875PBZ_P4 (3200MHz) and Intel_VC820 (800MHz), where the first model represents the architecture's basic CPU and the second model represents the additional CPU.

To make the tests complete, the simulation included traffic data and real-life events. Figure 7.1 shows the profile configuration table with the profiles created specifically for testing events over the network. Five profiles were created so that both basic and additional CPUs would be exercised by different tasks.

Fig. 7.1 Profile configuration table

The five profiles are documented briefly below, including their application tables, start time offsets, and application durations. Each of these applications generates simulated events over the network. The traffic is run over different network configurations, and the configurations are evaluated on several performance measures.

1. Telecom

2. Data Transfer



3. Video Conference

4. Web Browsing

5. Gaming

7.2 Network On-Chip realizations in OPNET

In this section different network architectures are tested in OPNET. All of them were designed and justified theoretically in section 5 (On-Chip Network to a Multi-core System) in terms of four parameters: topology, routing algorithm, flow control protocol, and router microarchitecture. Statistics about the networks are collected on the following:



Simulation process (execution)
Queuing delay of every node in the network
Point-to-point throughput of the channels between the CPUs
Throughput of the channels between the CPUs and the router
Global delay of the network

7.2.1 Star network (and 4 basic CPUs interconnected forming a ring network)

Fig. 7.2 Combination of Star and Ring Network in OPNET Modeler

This network is actually a combination of two networks. First, all 8 nodes form a Star network, since all of them are connected directly to the router in the middle. A second network (a Ring) is formed by the basic CPUs, which are interconnected with each other (every basic node is connected to its neighbors).



Characteristics of the network:
Server
Application Configuration
Profile Configuration
Router to assign tasks to the nodes
1000BaseX (1 Gbps) duplex links to interconnect additional nodes with the router
10GbpsBaseT (10 Gbps) duplex links to interconnect basic nodes with the router and with each other
Max Hop Count of the Network - 2
Max Node Degree - 3
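As a cross-check of the two figures listed above (a maximum hop count of 2 and a maximum node degree of 3), the topology can be modelled as a small undirected graph. The sketch below assumes hypothetical node labels, leaves out the Server, and counts node degree only for the CPUs (the router itself has one link per CPU). It is not taken from the OPNET model.

```python
from collections import deque

BASIC = ["basic_1", "basic_2", "basic_3", "basic_4"]
ADDITIONAL = ["additional_1", "additional_2", "additional_3", "additional_4"]
ROUTER = "router"


def build_star_with_basic_ring():
    """Undirected adjacency map: star around the router plus a basic-CPU ring."""
    adj = {node: set() for node in BASIC + ADDITIONAL + [ROUTER]}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for cpu in BASIC + ADDITIONAL:      # star: every CPU to the router
        link(cpu, ROUTER)
    for i, cpu in enumerate(BASIC):     # ring: each basic CPU to the next one
        link(cpu, BASIC[(i + 1) % len(BASIC)])
    return adj


def hop_counts(adj, source):
    """Breadth-first search: fewest hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


adj = build_star_with_basic_ring()
cpus = BASIC + ADDITIONAL
max_cpu_degree = max(len(adj[cpu]) for cpu in cpus)
max_hops = max(hop_counts(adj, a)[b] for a in cpus for b in cpus)
print("max CPU node degree:", max_cpu_degree)
print("max hop count between CPUs:", max_hops)
```

Under these assumptions a basic CPU has three links (the router and two ring neighbors), and any two CPUs are at most two hops apart, since the router connects every pair.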

Simulation Process:

The execution lasted 90 simulation seconds and completed 37,475,887 events, with an average speed of 309,876 events/sec.

Queuing delay of the nodes in the network:

This statistic is collected on every CPU node in the network in order to compare the queuing delay of the Basic and the Additional CPUs. All the basic processors clearly show a larger delay in queuing packets.

*Note that Basic CPUs run at 3200 MHz and Additional CPUs at 800 MHz.

Fig. 7.3 Queuing delay of the nodes in the network.



Point-to-point throughput of the channels between the CPUs:

Only the Basic Intel nodes are interconnected with their neighbors, which means that each of them is connected by a channel to two other nodes. For simplicity, the throughput statistics are shown for two pairs of nodes (Intel 1 and Intel 2, and Intel 1 and Intel 3). The figure below shows that the rate of successful message delivery differs between the two channels, but within a single channel the rate of exchange is nearly the same.

Fig. 7.4 Point-to-point throughput of the channels between Basic CPUs

Throughput of the channels between the CPUs and the router:

One node of each CPU type is taken as a representative. The figure below suggests that more tasks were assigned to the basic nodes than to the additional ones, or alternatively that the channel connecting the Additional nodes with the router does not allow such a high data transfer rate.



Fig. 7.5 Throughput of the channels between the Router and both types of nodes

Global Delay of the Network:

The delay of the network reaches a maximum point of 700 microseconds, which is quite a good

result.

Fig. 7.6 Global delay of Star Network



7.2.2 Ring network (and interconnections of all nodes with the router)

Fig. 7.7 An image of the Ring architecture designed in OPNET

In the previous network only the Basic nodes were interconnected; now both node models are. Every node is connected to the router and also to both of its neighbor nodes. At two points there is a connection between a Basic and an Additional node (Intel Basic 1 - Additional CPU 1, and Intel Basic 4 - Additional CPU 4). The idea is to see how the nodes interact, or in other words how the events are assigned now.


Server
Application Configuration
1000BaseX (1 Gbps) duplex links to interconnect all the nodes with each other and with the Router
10GbpsBaseT (10 Gbps) duplex links to interconnect the Router and the Server




Max Node Degree - 3

Simulation Process:

The execution lasted 90 simulation seconds and completed 40,614,395 events, with an average speed of 358,979 events/sec.


No major difference can be seen in the queuing delay between the two types of nodes.

Fig. 7.8 Queuing delay on the nodes in the network




The figure below shows a representative of each kind of interconnection in this network: Basic CPUs are connected with each other, Additional CPUs likewise, and there are two connections between Basic and Additional CPUs. The rate of successful message delivery reaches 2000 bits/sec in all of these cases.

Fig. 7.9 Point-to-point throughput on channels connecting Basic, Additional, and Basic-Additional nodes


Fig. 7.9 The channels between the router and both the Basic and the Additional nodes reached a successful delivery rate of 5.9 MB/sec, but the channels to Basic nodes delivered 4 times more packets than the others.




Fig. 7.10 The delay of the network differs by only 30 microseconds from that of the Star architecture

7.2.3 Mesh network

In this architecture all the nodes are again connected to the router, but every node is also connected to its neighbors and to one node of the other type (an additional node to a basic node, and vice versa). The idea is to make a safer network in which, even if more than one link fails, other paths remain for reaching the destination node.

*Note that this is not a full mesh network, in which every node would be connected to every other node.
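The fault-tolerance idea can be illustrated with a brute-force check on a small graph. The exact wiring of the OPNET model is not listed in the text, so the sketch below assumes one layout that is consistent with the description and with a maximum node degree of 4: every CPU is linked to the router, the four Basic CPUs form a ring, the four Additional CPUs form a ring, and Basic CPU i is linked to Additional CPU i. Node labels are hypothetical, and the Server is left out.

```python
from collections import deque
from itertools import combinations

BASIC = [f"basic_{i}" for i in range(1, 5)]
ADDITIONAL = [f"additional_{i}" for i in range(1, 5)]
ROUTER = "router"
NODES = BASIC + ADDITIONAL + [ROUTER]


def build_links():
    """Assumed partial-mesh wiring; each link is an unordered pair of nodes."""
    links = set()
    for cpu in BASIC + ADDITIONAL:                 # every CPU to the router
        links.add(frozenset((cpu, ROUTER)))
    for group in (BASIC, ADDITIONAL):              # one ring per CPU type
        for i, cpu in enumerate(group):
            links.add(frozenset((cpu, group[(i + 1) % len(group)])))
    for basic, additional in zip(BASIC, ADDITIONAL):  # one cross link each
        links.add(frozenset((basic, additional)))
    return links


def is_connected(nodes, links):
    """True if every node can still be reached from the first node."""
    adj = {node: set() for node in nodes}
    for link in links:
        a, b = link
        adj[a].add(b)
        adj[b].add(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        for neighbor in adj[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


LINKS = build_links()
# Remove every possible pair of links and check that the network stays whole.
survives_two_failures = all(
    is_connected(NODES, LINKS - set(failed))
    for failed in combinations(LINKS, 2)
)
print("links:", len(LINKS))
print("connected after any two link failures:", survives_two_failures)
```

With this assumed wiring, no pair of link failures disconnects any node, which matches the intent stated above. A different wiring of the cross links could give a different result.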



Fig. 7.11 An image of the mesh architecture designed in OPNET


Server
Application Configuration
10GbpsBaseT (10 Gbps) duplex links to interconnect Basic CPUs with the Router and with each other
1000BaseX (1 Gbps) duplex links to interconnect Additional CPUs with Basic CPUs and with each other
Max Node Degree - 4

Simulation Process:

The execution lasted 90 simulation seconds and completed 55,280,320 events, with an average speed of 328,347 events/sec.




Fig. 7.12 The queuing delay of packets is considerably larger in the Additional nodes than in the Basic ones.


The figure below shows an interesting situation. Both channels can carry up to 1 Gbps, but in the first case (Additional CPU 1 - Basic Intel 4) the Basic CPU is sending only 3000 bits/sec. This suggests that the node does not need much help in executing its tasks.

Fig. 7.13 Point-to-point throughput on channels connecting different CPUs




In the channel between the Router and the Additional CPU, the successful message delivery rate reaches 250,000,000 bits/sec in both directions of the link. On the other channel, where the Router is connected to a Basic CPU, the situation is reversed for delivery from the Router to the node.

Fig. 7.14 Throughput on channels connecting the Router and the nodes
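To put these rates in perspective, the measured throughput can be related to the nominal capacity of the link. The following sketch assumes that the link between the Router and the Additional CPU is a 1 Gbps (1000BaseX) link, as in the Star network; that capacity is not stated for the Mesh network. The function name is our own.

```python
GBPS = 1_000_000_000  # bits per second in 1 Gbps


def utilization(throughput_bps, capacity_bps):
    """Fraction of the nominal link capacity that is actually used."""
    if capacity_bps <= 0:
        raise ValueError("capacity must be positive")
    return throughput_bps / capacity_bps


# Router <-> Additional CPU: 250,000,000 bits/sec (capacity is an assumption).
router_additional = utilization(250_000_000, 1 * GBPS)

# Additional CPU 1 <-> Basic Intel 4: 3000 bits/sec on a 1 Gbps link.
additional_basic = utilization(3_000, 1 * GBPS)

print(f"router-additional link: {router_additional:.2%} of capacity")
print(f"additional-basic link:  {additional_basic:.6%} of capacity")
```

Seen this way, the Router link uses a quarter of the assumed capacity, while the link between the two CPUs is almost idle.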


Fig. 7.15 The global delay of the network reaches almost 1000 microseconds, which is almost identical to that of the previously presented architectures.



7.3 Results Comparison

Network Delay:

Fig. 7.16 The delay is almost identical in the Star and Ring topologies

Simulation Events:

The simulated time of every run is fixed at 90 seconds, but the architectures differ in the number of events they complete in that time. Figure 7.17 shows the number of events completed by each network. This should be taken seriously as a factor in deciding which architecture to propose.

Fig. 7.17 Number of events completed

[Chart for Fig. 7.16: vertical axis 0 to 1000, horizontal axis 20 sim/s to 100 sim/s; series Star, Ring, Mesh.]
[Chart for Fig. 7.17: horizontal axis Number of Events, 0 to 60,000,000; bars Star, Ring, Mesh.]
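The event counts reported in section 7.2 can also be compared numerically. The sketch below computes each network's event count relative to the Star network and, under the assumption that OPNET's average speed is measured in events per second of wall-clock time, estimates how long each run took to execute. The dictionary and function names are our own.

```python
# Values reported in section 7.2 for the three 90-second simulations.
EVENTS = {"star": 37_475_887, "ring": 40_614_395, "mesh": 55_280_320}
EVENTS_PER_SEC = {"star": 309_876, "ring": 358_979, "mesh": 328_347}


def relative_events(counts, baseline):
    """Event count of each network divided by the baseline network's count."""
    return {name: count / counts[baseline] for name, count in counts.items()}


def estimated_run_seconds(counts, speeds):
    """Events divided by average speed (assumed to be wall-clock events/sec)."""
    return {name: counts[name] / speeds[name] for name in counts}


ratios = relative_events(EVENTS, "star")
run_seconds = estimated_run_seconds(EVENTS, EVENTS_PER_SEC)

for name in EVENTS:
    print(f"{name}: {ratios[name]:.2f}x the Star events, "
          f"about {run_seconds[name]:.0f} s to execute")
```

The Mesh run completes the most events for the same simulated time, and under the stated assumption it also takes the longest to execute.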



Queuing Delay of the nodes:

In designing the ideal architecture, a lot of attention should be given to the queuing delay of the nodes. Generally speaking, the nodes in the Mesh network have the lowest queuing delay: Additional CPUs have a delay of 5.7 microseconds and Basic CPUs a delay of 400 microseconds.

In comparison, the nodes in the Star and Ring networks have delays of 20000 microseconds (Additional nodes) and 10000 microseconds (Basic nodes), respectively.

Point-to-point throughput of channels

The point-to-point throughput is almost the same in the first two networks, as shown in the figure below. With the Mesh topology, on the other hand, the throughput of the channels ranges from 375 Bps up to 50 Mbps. Note also that the Mesh topology has links between two Basic nodes, between two Additional nodes, and between Basic and Additional nodes.

Fig. 7.18 Throughput of channels between nodes. The successful delivery rate is given in bits/s.
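The results quote throughput both in bits per second and in bytes per second, so a small conversion helper makes the figures easier to compare. The sketch assumes that "Bps" means bytes per second and that 1 MB is 10^6 bytes. Under those assumptions the 3000 bits/sec seen on one Mesh link corresponds to 375 Bps, and the 5.9 MB/sec reported for the router channels of the Ring network corresponds to about 47.2 Mbps.

```python
BITS_PER_BYTE = 8
MEGA = 1_000_000  # decimal prefix, assumed for both MB and Mbps


def bits_to_bytes_per_sec(bits_per_sec):
    """Convert a rate in bits/s to bytes/s."""
    return bits_per_sec / BITS_PER_BYTE


def bytes_to_bits_per_sec(bytes_per_sec):
    """Convert a rate in bytes/s to bits/s."""
    return bytes_per_sec * BITS_PER_BYTE


mesh_low_rate_bytes = bits_to_bytes_per_sec(3_000)             # bytes/s
ring_router_rate_bits = bytes_to_bits_per_sec(5.9 * MEGA)      # bits/s

print(f"3000 bits/s = {mesh_low_rate_bytes:.0f} bytes/s")
print(f"5.9 MB/s    = {ring_router_rate_bits / MEGA:.1f} Mbit/s")
```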

Hop count, Max node degree, and Alternative paths:

As for its effect on throughput, since a topology dictates the total number of alternate paths

between nodes, it affects how well the network can spread out traffic and thus the effective

bandwidth a network can support. Network reliability is also greatly influenced by the topology

as it dictates the number of alternative paths for routing around faults.

[Chart for Fig. 7.18: vertical axis 0 to 2500 (bits/s); categories Star, Ring; series Basic, Additional, Additional and Basic.]



Attributes/Topologies    Star           Ring           Mesh
Hop Count                2              1              2
Node Degree              3              3              4
Alternative Paths        More than 3    More than 5    More than 5

A star topology offers fewer alternate paths between nodes than a mesh or ring, and thus saturates at a lower network throughput for most traffic patterns. In case of a fault, the mesh topology offers the most alternate paths. The hop count of the Star network also implies lower network throughput, which is another drawback.
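The difference in alternate paths can be made concrete by counting loop-free paths between two nodes. The sketch below compares a pure star (every CPU linked only to the router) with the star-plus-ring layout of section 7.2.1, and counts the paths of at most three hops between two Basic CPUs that are not ring neighbors. Node labels and the hop limit are our own assumptions.

```python
BASIC = ["basic_1", "basic_2", "basic_3", "basic_4"]
ADDITIONAL = ["additional_1", "additional_2", "additional_3", "additional_4"]
ROUTER = "router"


def build_star(with_basic_ring):
    """Star around the router, optionally with a ring among the basic CPUs."""
    adj = {node: set() for node in BASIC + ADDITIONAL + [ROUTER]}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for cpu in BASIC + ADDITIONAL:
        link(cpu, ROUTER)
    if with_basic_ring:
        for i, cpu in enumerate(BASIC):
            link(cpu, BASIC[(i + 1) % len(BASIC)])
    return adj


def count_simple_paths(adj, src, dst, max_hops):
    """Count paths from src to dst that repeat no node and use <= max_hops links."""

    def walk(node, visited, hops):
        if node == dst:
            return 1
        if hops == max_hops:
            return 0
        return sum(
            walk(neighbor, visited | {neighbor}, hops + 1)
            for neighbor in adj[node]
            if neighbor not in visited
        )

    return walk(src, {src}, 0)


pure_star_paths = count_simple_paths(build_star(False), "basic_1", "basic_3", 3)
star_ring_paths = count_simple_paths(build_star(True), "basic_1", "basic_3", 3)
print("pure star:", pure_star_paths)
print("star plus basic ring:", star_ring_paths)
```

In the pure star the only route runs through the router, so a single failed link cuts the pair off, while the added ring links provide several detours.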

While star networks have poorer performance (latency, throughput, energy, and reliability) than higher-dimensional networks, they have lower implementation overhead. A ring has a node degree of 3 while a mesh has a node degree of 4, where node degree refers to the number of links in and out of a node. A higher node degree requires more links and higher port counts at routers.

It can therefore be concluded that, as far as topology is concerned, the most suitable network is the Mesh network, since we are looking for faster performance and reliable bandwidth.



8 Conclusions

With all the effort invested in this project, there are reasons to believe that the architectures provided, together with the supporting theory, come closer to industrial acceptance. We summarize the progress with respect to the main objectives of the project, namely reliability, high performance, and long-lasting battery life.

Reliability: This is a major obstacle to the acceptance of a particular design. The proposed solution should be a reliable network-on-a-chip with efficient multi-core processors for mobile platforms as future systems on a chip. That is why eight processors and high-bandwidth links were proposed. In addition, with its high number of alternate paths, the Mesh offers tolerance to faults, which further improves the reliability of the network.

High Performance: In line with the trends in today's smartphone technology, where high performance is needed, the proposed network designed in OPNET offered high performance because of its eight processors. Four processors at 3200 MHz and four additional ones at 800 MHz would be quite enough for completing events quickly. Comparing the results we obtained, the overall performance depends on latency, throughput, and energy.

The point-to-point throughput on channels in the three proposed networks showed that the nodes work together in processing events. In the mesh, the channel throughputs vary more widely, from 375 Bps up to 50 Mbps, and reach much higher values than in the star and ring networks.

The latency (delay) of the proposed networks showed that the ring is the most suitable network. However, the number of events completed by the networks should also be kept in mind. In the same simulated time the Mesh completes about one third more events than the Ring (55,280,320 against 40,614,395), and the gap to the Star (37,475,887) is larger still, so the Mesh is again the more suitable solution.

Long Lasting Battery: The battery of a system-on-a-chip cannot be the same as the ones in laptops, tablets, etc. However, with smart use of the chip's power, a lot of energy can be saved. In the OPNET testing section we saw that all the processors in the network divide the tasks among themselves, depending on how busy a particular processor is. The plan is also to avoid using all of the processing power when there are not enough events to be executed.


