Multicore Processors for Mobile Platforms - Future Systems on a Chip

Upload: pavel-gicevski

Post on 02-Jun-2018


  • 8/10/2019 Multicore Processors for Mobile Platforms - Future Systems on a Chip


    Table of Contents

    1 Project Overview
        Abstract
    2 Project Objectives
    3 Understanding System-on-a-chip (SoC)
        3.1 System-on-a-chip - basics
        3.2 System-on-a-Chip - Structure
        What's inside a SoC?
    4 Briefly about Multi-core processors
        4.1 Multithreading, Hyper-Threading, or Multi-Core?
    5 On-Chip Network to a Multi-core System
        5.1 Abstract
        5.2 Topology
        5.3 Routing
        5.4 Flow Control
        5.5 Router Micro-architecture
    6 Project Methodologies, Results, and Achievements
        6.1 Usage of Smartphones - Survey (Q & A)
        6.2 Analysis on the survey results
        6.3 Discrete event simulations
        6.4 Example of DES in real life
        6.5 Components of a discrete-event simulation
        6.6 Network Simulators as DES
        6.7 Network Simulations with OPNET
    7 Implementing the project in OPNET
        7.1 Adding Traffic
        7.2 Network On-Chip realizations in OPNET
        7.3 Results Comparison
    8 Conclusions
    9 References


    1 Project Overview

    Abstract

    Since smartphones and tablets are basically smaller computers, they require pretty much the

    same components we see in desktops and laptops in order to offer us all the amazing things

    they can do (apps, music and video playing, 3D gaming support, advanced wireless features,

    etc).

    But smartphones and tablets do not offer the same amount of internal space as desktops and

    laptops for the various components needed such as the logic board, the processor, the RAM,

    the graphics card, and others. That means these internal parts need to be as small as possible,

    so that device manufacturers can use the remaining space to fit the device with a long-lasting

    battery.

    In recent years, continuous development in silicon technology has made it

    possible to implement complex electronic systems in a single integrated circuit.

    Systems-on-chips (SoCs) are small, powerful multi-core systems that are being implemented in a vast

    number of ways across the booming electronics market, primarily in small mobile devices. This

    raises a question: what architecture should be proposed to solve this problem? In

    other words, how can one tiny computer be designed so that smartphones can rise

    to the level of PCs?

    The architecture complexity of these SoCs requires new design methodologies and the

    development of a seamless design flow that integrates existing and emerging tools. The continuous

    progress of micro- and nanotechnologies has led to growing integration and rising clock

    frequencies in electronic systems. These combined effects have led to an increase

    both in power density and energy dissipation which consequently must be managed, above all,

    in portable systems. Design and technology issues relating to power efficiency are crucial, in

    particular for power optimized cell libraries, clock gating and clock trees optimization, and

    dynamic power management. That is why this project discusses options for different low-power,

    faster, and cheaper design techniques for systems-on-chips, at a design level and an

    architectural level.


    2 Project Objectives

    This project intends to contribute to solutions for the growing industrial need to design reliable

    networks-on-a-chip with efficient multi-core processors for mobile platforms as future systems

    on a chip. In particular, it intends to provide theory and practical examples of possible

    designs that may contribute to future SoC development.

    System-on-a-chip design doesn't only require new design methodologies and the development

    of a design flow from already known elements and tools. Here comes one of the main objectives of

    this project: what are the needs of the users of this growing technology? This question leads to

    many possible scenarios, and makes it clear that not all PC components and

    performance levels should simply be copied into our handheld devices. That is why this project is based on

    previous research about smartphone users and how they view their devices (as a calling

    device, gaming platform, or business tool).

    Based on SoC structure, multi-core system architectures, and the research we have done, this

    project will propose an on-chip network that offers high performance, efficient use of space, and

    network interconnections between the components of the chip, which means a long-lasting

    battery.


    3 Understanding System-on-a-chip (SoC)

    3.1 System-on-a-chip - basics

    System-on-a-chip (SoC) technology is the packaging of all the necessary electronic circuits and

    parts for a "system" (such as a cell phone or digital camera) on a single integrated circuit (IC),

    generally known as a microchip. For example, a system-on-a-chip for a sound-detecting device

    might include an audio receiver, an analog-to-digital converter (ADC), a microprocessor,

    the necessary memory, and the input/output logic control for a user, all on a single microchip.

    System-on-a-chip technology is used in small, increasingly complex consumer electronic

    devices. Some such devices have more processing power and memory than a typical 10-year-old

    desktop computer. In the future, SoC-equipped nanorobots (robots of microscopic

    dimensions) might act as programmable antibodies to fend off previously incurable diseases.

    SoC video devices might be embedded in the brains of blind people, allowing them to see; SoC

    audio devices might allow deaf people to hear. Handheld computers with small whip antennas

    might someday be capable of browsing the Internet at megabit-per-second speeds from any

    point on the surface of the earth.

    SoC is evolving along with other technologies such as silicon-on-insulator (SOI), which can

    provide increased clock speed while reducing the power consumed by a microchip.

    Image 1: How far has the technology gone?

    All the necessary components of a desktop computer packed into a chip a couple of centimeters long. (Image credit: wikipedia.com)


    3.2 System-on-a-Chip - Structure

    What's inside a SoC?

    Now that we know what a SoC is, let's take a quick look at the components that can be found

    inside it. Mind you, not all the following parts are built into every SoC that we're going

    to show you later on, but in order to better understand how a SoC works, you should have a

    general picture of what goes inside it:

    CPU - the central processing unit. Whether it's single- or multiple-core, this is what

    makes everything possible on your smartphone. Most processors found inside the SoCs

    that we're going to look at will be based on ARM technology, but more on that later.

    Memory - just like in a computer, memory is required to perform the various tasks

    smartphones and tablets are capable of, and therefore SoCs come with various memory

    architectures on board.

    GPU - the graphics processing unit is also an important component of the SoC, and it's

    responsible for handling those complex 3D games on smartphones and tablets. As you

    can expect, there are various GPU architectures available out there, and we're going to

    detail them further in what follows.

    Northbridge - a component that handles communications between the CPU and

    other components of the SoC, including the southbridge.

    Southbridge - a second chipset, usually found on computers, that handles various I/O

    functions. In some cases the southbridge can be found on the SoC.

    Cellular radios - some SoCs also come with certain modems on board that are needed

    by mobile operators. Such is the case with the Snapdragon S4 from Qualcomm, which

    has an embedded LTE modem on board responsible for 4G LTE connectivity.

    Other radios - some SoCs may also have other components responsible for other types

    of connectivity, including Wi-Fi, GPS/GLONASS or Bluetooth. Again, the S4 is a good


    example in this regard.

    Timing sources including oscillators and phase-locked loops.

    External interfaces including industry standards such as USB, FireWire, Ethernet, USART, SPI.

    Peripherals including counter-timers, real-time timers and power-on reset generators.

    Analog interfaces including ADCs and DACs.

    Voltage regulators and power management circuits.

    Other circuitry.

    Image 2: Simplified look at the layout of Samsung's Exynos 5 Dual. The CPU and GPU are

    there, but they're just small pieces of the larger puzzle. (Image credit: http://www.intechopen.com)


    4 Briefly about Multi-core processors

    A multi-core processor is a single computing component with two or more independent

    actual central processing units (called "cores"), which are the units that read and

    execute program instructions. The instructions are ordinary CPU instructions such as add, move

    data, and branch, but the multiple cores can run multiple instructions at the same time,

    increasing overall speed for programs amenable to parallel computing. Manufacturers typically

    integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP),

    or onto multiple dies in a single chip package.

    Processors were originally developed with only one core. Multi-core processors were

    developed in the early 2000s by Intel, AMD and others. They may have two cores (dual-core)

    (e.g. AMD Phenom II X2, Intel Core Duo), four cores (quad-core) (e.g. AMD Phenom II X4, Intel's

    quad-core processors such as the Core i5 and i7), 6 cores (e.g. AMD Phenom II X6, Intel Core i7

    Extreme Edition 980X), 8 cores (e.g. Intel Xeon E7-2820, AMD FX-8350), 10 cores (e.g. Intel Xeon

    E7-2850) or more.

    A multi-core processor implements multiprocessing in a single physical package. Designers may

    couple cores in a multi-core device tightly or loosely. For example, cores may or may not

    share caches, and they may implement message passing or shared memory inter-core

    communication methods. Common network topologies to interconnect cores include bus, ring,

    two-dimensional mesh, and crossbar.

    Homogeneous multi-core systems include only identical cores, while

    heterogeneous multi-core systems have cores that are not identical. Just as with single-processor

    systems, cores in multi-core systems may implement architectures such

    as superscalar, VLIW, vector processing, SIMD, or multithreading.

    Multi-core processors are widely used across many application domains including general-purpose,

    embedded, network, digital signal processing (DSP), and graphics.

    The improvement in performance gained by the use of a multi-core processor depends very

    much on the software algorithms used and their implementation. In particular, possible gains

    are limited by the fraction of the software that can be run in parallel simultaneously on multiple

    cores; this effect is described by Amdahl's law. In the best case, so-called embarrassingly

    parallel problems may realize speedup factors near the number of cores, or even more if the

    problem is split up enough to fit within each core's cache(s), avoiding use of much slower main

    system memory. Most applications, however, are not accelerated so much unless programmers

    invest a prohibitive amount of effort in re-factoring the whole problem. The parallelization of

    software is a significant ongoing topic of research.
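The limit described by Amdahl's law can be illustrated with a minimal Python sketch. The function name `amdahl_speedup` and the sample fractions are illustrative assumptions, not part of the project; the formula itself is the standard statement of the law, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of cores.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a program runs in parallel.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    cores: number of identical cores available.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not accelerated; the parallel part is divided among cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: 50%, 90% and 99% parallelizable.
    for fraction in (0.5, 0.9, 0.99):
        for cores in (2, 4, 8):
            speedup = amdahl_speedup(fraction, cores)
            print(f"p={fraction:.2f} cores={cores}: {speedup:.2f}x")
```

Even with many cores, a program that is only half parallelizable can never run more than twice as fast, which is why the gain depends so strongly on the software.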


    4.1 Multithreading, Hyper-Threading, or Multi-Core?

    Programs are made up of execution threads. These threads are sequences of related

    instructions. In the early days of the PC, most programs consisted of a single thread. The

    operating systems in those days were capable of running only one such program at a time. The

    result was, as some of us painfully recall, that your PC would freeze while it printed a document

    or a spreadsheet. The system was incapable of doing two things simultaneously. Innovations in

    the operating system introduced multitasking in which one program could be briefly suspended

    and another one run. By quickly swapping programs in and out in this manner, the system gave

    the appearance of running the programs simultaneously.
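The swapping described above can be sketched as a round-robin scheduler. This is a simplified illustration, not how any particular operating system is implemented; the names `round_robin`, `jobs`, and `time_slice` are hypothetical, and each job is reduced to a number of abstract work units.

```python
from collections import deque


def round_robin(jobs, time_slice):
    """Return the sequence of (job_name, units_run) slices on a single core.

    jobs: mapping of job name to total work units it needs.
    time_slice: maximum units a job may run before it is suspended.
    """
    if time_slice < 1:
        raise ValueError("time_slice must be at least 1")
    queue = deque(jobs.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)
        if run > 0:
            timeline.append((name, run))
        remaining -= run
        if remaining > 0:
            # Suspend the job and put it at the back of the queue.
            queue.append((name, remaining))
    return timeline


if __name__ == "__main__":
    # Two hypothetical programs sharing one core.
    for name, units in round_robin({"print": 3, "edit": 2}, time_slice=2):
        print(f"run {name} for {units} unit(s)")
```

Only one job runs at any instant, but because the slices are short and interleaved, both programs appear to make progress at the same time.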

    By the early 2000s, processor designs had gained additional execution resources

    (such as logic dedicated to floating-point and integer math) to support executing multiple

    instructions in parallel. Intel saw an opportunity in these extra facilities. The company reasoned

    it could make better use of these resources by employing them to execute two separate

    threads simultaneously on the same processor core. Intel named this simultaneous processing

    Hyper-Threading Technology and released it on the Intel Xeon processors in 2002. According to

    Intel benchmarks, applications that were written using multiple threads could see

    improvements of up to 30% by running on processors with HT Technology. More important,

    however, two programs could now run simultaneously on a processor without having to be

    swapped in and out (see Fig. 4.1). To induce the operating system to recognize one processor

    as two possible execution pipelines, the new chips were made to appear as two logical

    processors to the operating system.

    Fig. 4.1 HT Technology enables two threads to execute simultaneously on a single processor core

    The performance boost of HT Technology was limited by the availability of shared resources to

    the two executing threads. As a result, HT Technology cannot approach the processing

    throughput of two distinct processors because of the contention for these shared resources. To

    achieve greater performance gains on a single chip, a processor would require two separate

    cores, such that each thread would have its own complete set of execution resources. Enter

    multi-core.


    Multi-Core Processors

    Multi-core processors, as the name implies, contain two or more distinct cores in the same

    physical package. Fig. 4.2 shows how this appears in relation to previous technologies.

    Fig. 4.2 Multi-Core processors have multiple execution cores on a single chip

    In this design, each core has its own execution pipeline, and each core has the resources

    required to run without blocking resources needed by the other software threads.

    While the example in Fig. 4.2 shows a two-core design, there is no inherent limitation in the

    number of cores that can be placed on a single chip. Intel has committed to shipping dual-core

    processors in 2005, but it will add additional cores in the future. Mainframe processors today

    use more than two cores, so there is precedent for this kind of development.

    The multi-core design enables two or more cores to run at somewhat slower speeds and at

    much lower temperatures. The combined throughput of these cores delivers processing power

    greater than the maximum available today on single-core processors and at a much lower level

    of power consumption. In this way, Intel increases the capabilities of server platforms as

    predicted by Moore's Law while the technology no longer pushes the outer limits of physical

    constraints.


    5 On-Chip Network to a Multi-core System

    5.1 Abstract

    On-chip network architecture can be defined by four parameters: its topology, routing

    algorithm, flow control protocol, and router microarchitecture.

    Throughout this section, we will discuss how different choices of the above four parameters

    affect the overall cost-performance of an on-chip network. Clearly, the cost-performance of an

    on-chip network depends on the requirements faced by its designers. Latency is a key

    requirement in many on-chip network designs, where network latency refers to the delay

    experienced by messages as they traverse from source to destination. Most on-chip networks

    must also ensure high throughput, where network throughput is the peak bandwidth the

    network can handle.

    Another metric that is particularly critical in on-chip network design is network power, which

    approximately correlates with the activity in the network as well as its complexity.

    5.2 Topology

    The effect of a topology on overall network cost-performance is profound. A topology

    determines the number of hops (or routers) a message must traverse as well as the

    interconnect lengths between hops, thus influencing network latency significantly.

    As traversing routers and links incurs energy, a topology's effect on hop count also directly

    affects network energy consumption. As for its effect on throughput, since a topology dictates

    the total number of alternate paths between nodes, it affects how well the network can spread

    out traffic and thus the effective bandwidth a network can support. Network reliability is also

    greatly influenced by the topology as it dictates the number of alternative paths for routing

    around faults. The implementation complexity cost of a topology depends on two factors: the

    number of links at each node (node degree) and the ease of laying out a topology on a chip

    (wire lengths and the number of metal layers required). Figure 2.1 shows three commonly used

    on-chip topologies. For the same number of nodes, and assuming uniform random traffic where

    every node has an equal probability of sending to every other node, a ring (Fig. 2.1.a) will lead

    to higher hop count than a mesh (Fig. 2.1.b) or a torus [11] (Fig. 2.1.c). For instance, in the

    figure shown, assuming bidirectional links and shortest-path routing, the maximum hop count

    of the ring is 4, that of the mesh is also 4, while the torus improves it to 2. A ring topology also

    offers fewer alternate paths between nodes than a mesh or torus, and thus saturates at a lower

    network throughput for most traffic patterns. For instance, a message between nodes A and B

    in the ring topology can only traverse one of two paths, but in a 3x3 mesh topology,


    there are six possible paths. As for network reliability, among these three networks, a torus

    offers the most tolerance to faults because it has the highest number of alternative paths

    between nodes.
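The hop counts quoted above can be checked with a small breadth-first search. The sketch below is illustrative only: it assumes 9 nodes per topology (a 9-node ring and 3x3 mesh and torus), bidirectional links, and that nodes A and B sit at opposite corners of the mesh, which the text does not state explicitly. All function names are hypothetical.

```python
from collections import deque
from itertools import product

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def ring(n):
    """Bidirectional ring of n nodes."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def mesh(k):
    """k x k mesh: links only between adjacent nodes, no wrap-around."""
    graph = {}
    for x, y in product(range(k), repeat=2):
        graph[(x, y)] = {
            (x + dx, y + dy)
            for dx, dy in STEPS
            if 0 <= x + dx < k and 0 <= y + dy < k
        }
    return graph


def torus(k):
    """k x k torus (k >= 3): a mesh with wrap-around links on every row and column."""
    return {
        (x, y): {((x + dx) % k, (y + dy) % k) for dx, dy in STEPS}
        for x, y in product(range(k), repeat=2)
    }


def hop_counts(graph, source):
    """BFS from source: hop distance and number of shortest paths to every node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                paths[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                paths[nxt] += paths[node]
    return dist, paths


def max_hop_count(graph):
    """Largest shortest-path hop count between any pair of nodes."""
    return max(max(hop_counts(graph, src)[0].values()) for src in graph)


def shortest_path_count(graph, source, target):
    """Number of distinct shortest paths from source to target."""
    return hop_counts(graph, source)[1][target]


if __name__ == "__main__":
    print("ring  max hops:", max_hop_count(ring(9)))
    print("mesh  max hops:", max_hop_count(mesh(3)))
    print("torus max hops:", max_hop_count(torus(3)))
    print("mesh corner-to-corner shortest paths:",
          shortest_path_count(mesh(3), (0, 0), (2, 2)))
```

Under these assumptions the sketch gives maximum hop counts of 4, 4, and 2 for the ring, mesh, and torus, and six shortest paths between opposite corners of the 3x3 mesh, matching the figures in the text.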

    Figure 2.1 Common on-chip network topologies: (a) ring, (b) mesh, and (c) torus

    While rings have poorer performance (latency, throughput, energy, and reliability) when

    compared to higher dimensional networks, they have lower implementation overhead. A ring

    has a node degree of 2 while a mesh or torus has a node degree of 4, where node degree refers

    to the number of links in and out of a node. A higher node degree requires more links and

    higher port counts at routers. All three topologies featured are two-dimensional planar

    topologies that map readily to a single-metal layer, with a layout similar to that shown in the

    figure, except for torus which should be arranged physically in a folded manner to equalize wire

    lengths (see Fig. 2.2), instead of employing long wrap-around links between edge nodes. A

    torus illustrates the importance of considering implementation details in comparing alternative

    topologies. While a torus has lower hop count (which leads to lower delay and energy)

    compared to a mesh, wire lengths in a folded torus are twice that in a mesh of the same size, so

    per-hop latency and energy are actually higher. Furthermore, a torus requires twice the number

    of links which must be factored into the wiring budget. If the available wiring bisection

    bandwidth is fixed, a torus will be restricted to narrower links than a mesh, thus lowering per-

    link bandwidth, and increasing transmission delay. Determining the best topology for an on-

    chip network subject to the physical and technology constraints is an area of active research.

    Figure 2.2 Layout of an 8x8 folded torus

    5.3 Routing

    The goal of the routing algorithm is to distribute traffic evenly among the paths supplied by the

    network topology, so as to avoid hotspots and minimize contention, thus improving network

    latency and throughput. In addition, the routing algorithm is the critical component in fault

    tolerance: once faults are identified, the routing algorithm must be able to skirt the faulty

    nodes and links without substantially affecting network performance. All of these performance

    goals must be achieved while adhering to tight constraints on implementation complexity:

    routing circuitry can stretch critical path delay and add to a router's area footprint. While

    energy overhead of routing circuitry is typically low, the specific route chosen affects hop count

    directly and thus substantially affects energy consumption.

    While numerous routing algorithms have been proposed, the most commonly used routing

    algorithm in on-chip networks is dimension-ordered routing (DOR) due to its simplicity. With

    DOR, a message traverses the network dimension-by-dimension, reaching the coordinate

    matching its destination before switching to the next dimension.

    In a two-dimensional topology such as the mesh in Fig. 2.3, dimension-ordered routing, say XY

    routing, sends packets along the X-dimension first, followed by the Y-dimension. A packet

    traveling from (0,0) to (2,3) will first traverse two hops along the X-dimension, arriving at (2,0),

    before traversing three hops along the Y-dimension to its destination. Dimension-ordered

    routing is an example of a deterministic routing algorithm, in which all messages from node A

    to B will always traverse the same path. Another class is oblivious routing algorithms,

    where messages traverse different paths from A to B, but the path is selected without regard

    to the actual network situation at transmission time. For instance, a router could randomly

    choose among alternative paths prior to sending a message. Figure 2.3 shows an example

    where messages from (0,0) to (2,3) can be randomly sent along either the YX route or the XY

    route. A more sophisticated routing algorithm can be adaptive, in which the path a message

    takes from A to B depends on the network traffic situation. For instance, a message traveling

    along the XY route may see congestion at the east outgoing link of (1,0) and instead take the north

    outgoing link towards the destination (see Fig. 2.3).
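The three classes of routing described above can be sketched in a few lines. This is an illustrative sketch only: nodes are (x, y) tuples, congestion is modeled as a hypothetical set of blocked links, and the adaptive rule (take the first productive link that is not congested) is an assumption chosen to reproduce the example in the text, not a specific published algorithm.

```python
import random

def dor_route(src, dst, order="XY"):
    """Dimension-ordered routing: finish one dimension before starting the next.
    Returns the list of nodes visited after src."""
    x, y = src
    path = []
    for dim in order:
        if dim == "X":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path

def oblivious_route(src, dst, rng):
    """Oblivious routing: choose XY or YX at random, ignoring network state."""
    return dor_route(src, dst, rng.choice(["XY", "YX"]))

def adaptive_route(src, dst, congested):
    """Minimal adaptive routing: prefer the X hop, but take the Y hop
    when the X link is congested and a Y hop still brings the packet closer."""
    x, y = src
    path = []
    while (x, y) != dst:
        options = []
        if x != dst[0]:
            options.append((x + (1 if dst[0] > x else -1), y))
        if y != dst[1]:
            options.append((x, y + (1 if dst[1] > y else -1)))
        # First productive hop whose link is free; fall back to the preferred hop.
        nxt = next((n for n in options if ((x, y), n) not in congested), options[0])
        path.append(nxt)
        x, y = nxt
    return path

xy = dor_route((0, 0), (2, 3), "XY")
yx = dor_route((0, 0), (2, 3), "YX")
chosen = oblivious_route((0, 0), (2, 3), random.Random(0))
# The east outgoing link of (1,0) is congested, as in the example in the text.
adaptive = adaptive_route((0, 0), (2, 3), congested={((1, 0), (2, 0))})

print("XY:      ", xy)
print("YX:      ", yx)
print("adaptive:", adaptive)
```

With the congested link at (1,0), the adaptive route turns north one hop earlier than the XY route and still reaches (2,3) in five hops.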

    Fig. 2.3 DOR illustrates an XY route from (0,0) to (2,3) in a mesh, while Oblivious shows two

    alternative routes (XY and YX) between the same source-destination pair that can be chosen

    obliviously prior to message transmission. Adaptive shows a possible adaptive route that

    branches away from the XY route if congestion is encountered at (1,0)

    In selecting or designing a routing algorithm, not only must its effect on delay, energy,

    throughput, and reliability be taken into account, most applications also require the network to

    guarantee deadlock freedom. A deadlock occurs when a cycle exists among the paths of

    multiple messages. Figure 2.4 shows four gridlocked (and deadlocked) messages waiting for

    links that are currently held by other messages and prevented from making forward progress:

    The packet entering router A from the South input port is waiting to leave through the East

    output port, but another packet is holding onto that exact link while waiting at router B to leave

    via the South output port, which is again held by another packet that is waiting at router C to

    leave via the West output port and so on.

    Fig. 2.4 A classic network deadlock where four packets cannot make forward progress as they

    are waiting for links that other packets are holding on to

    Deadlock freedom can be ensured in the routing algorithm by preventing cycles among the

    routes generated by the algorithm, or in the flow control protocol by preventing router buffers

    from being acquired and held in a cyclic manner. Using the routing algorithms we discussed

    above as examples, dimension-ordered routing is deadlock-free since in XY routing, there

    cannot be a turn from a Y link to an X link, so cycles cannot occur. Two of the four turns in Fig.

    2.4 will not be permitted, so a cycle is not possible. The oblivious algorithm that randomly

    chooses between XY or YX routes is not deadlock-free because all four turns from Fig. 2.4 are

    possible leading to potential cycles in the link acquisition graph. Likewise, the adaptive route

    shown in Fig. 2.3 is a superset of oblivious routing and is subject to potential deadlock. A

    network that uses a deadlock-prone routing algorithm requires a flow control protocol that

    ensures deadlock freedom.
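The turn argument above can be checked mechanically by building the link acquisition graph for a small mesh and searching it for a cycle. The sketch below is a simplified model under stated assumptions: each directed link is one resource, a dependency exists from one link to the next whenever the routing algorithm allows that turn, and all function names are hypothetical.

```python
from itertools import product

def mesh_channels(k):
    """All directed links (u, v) between neighbouring nodes of a k x k mesh."""
    channels = []
    for x, y in product(range(k), repeat=2):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < k and 0 <= ny < k:
                channels.append(((x, y), (nx, ny)))
    return channels

def direction(channel):
    (x0, y0), (x1, y1) = channel
    return (x1 - x0, y1 - y0)

def xy_turn_allowed(d_in, d_out):
    """XY routing: keep going straight, or turn from an X link onto a Y link."""
    return d_in == d_out or (d_in[1] == 0 and d_out[0] == 0)

def xy_or_yx_turn_allowed(d_in, d_out):
    """Mixing XY and YX routes: every turn except a 180-degree U-turn can occur."""
    return d_out != (-d_in[0], -d_in[1])

def has_cycle(k, turn_allowed):
    """Depth-first search for a cycle in the link acquisition graph."""
    channels = mesh_channels(k)
    deps = {c: [n for n in channels
                if n[0] == c[1] and turn_allowed(direction(c), direction(n))]
            for c in channels}
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(channels, WHITE)

    def visit(c):
        colour[c] = GREY
        for n in deps[c]:
            # A GREY successor is on the current DFS path, so we found a cycle.
            if colour[n] == GREY or (colour[n] == WHITE and visit(n)):
                return True
        colour[c] = BLACK
        return False

    return any(colour[c] == WHITE and visit(c) for c in channels)

print("XY routing can deadlock:      ", has_cycle(3, xy_turn_allowed))
print("XY-or-YX routing can deadlock:", has_cycle(3, xy_or_yx_turn_allowed))
```

For a 3x3 mesh the search finds no cycle under XY routing and finds one when both XY and YX routes are allowed, which matches the argument in the text. A cycle in this graph only shows that deadlock is possible, not that it will occur in a given run.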

    Routing algorithms can be implemented in several ways. First, the route can be embedded in

    the packet header at the source, known as source routing. For instance, the XY route in Fig. 2.3

    can be encoded as a list of output directions (two hops along X followed by three hops along Y),

    while the YX route lists the same hops in the opposite dimension order. At each hop, the router

    will read the left-most direction off the route header, send the packet towards the specified

    outgoing link, and strip off the portion of the header corresponding to the current hop.

    Alternatively, the message can encode the coordinates of the destination, and comparators at

    each router determine whether to accept or forward the message. Simple routing algorithms are

    typically implemented as combinational circuits within

    the router due to the low overhead, while more sophisticated algorithms are realized using

    routing tables at each hop which store the outgoing link a message should take to reach a

    particular destination. Adaptive routing algorithms need mechanisms to track network

    congestion levels, and update the route. Route adjustments can be implemented by modifying

    the header, employing combinational circuitry that accepts as input these congestion signals, or

    updating entries in a routing table. Many congestion-sensitive mechanisms have been

    proposed, with the simplest being tapping into information that is already captured and used

    by the flow control protocol, such as buffer occupancy or credits.
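Source routing, as described above, amounts to consuming the route header one entry per hop. The following is a minimal sketch; the letter codes E, W, N and S and the coordinate convention (east increases x, north increases y) are assumed for illustration and are not an encoding taken from the text.

```python
from collections import deque

# Assumed direction codes and their effect on (x, y) coordinates.
OFFSETS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def forward(node, header):
    """One router hop: read the left-most direction, strip it from the header,
    and return the neighbouring node the packet is sent to."""
    direction = header.popleft()
    dx, dy = OFFSETS[direction]
    return (node[0] + dx, node[1] + dy)

def deliver(src, route):
    """Follow a source route until the header is empty; return the final node."""
    header = deque(route)
    node = src
    while header:
        node = forward(node, header)
    return node

print(deliver((0, 0), "EENNN"))  # XY route from (0,0)
print(deliver((0, 0), "NNNEE"))  # YX route from (0,0)
```

Both headers deliver the packet to (2,3). The routers need no routing logic beyond decoding one header entry, which is the main appeal of source routing.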

    5.4 Flow Control

    Flow control governs the allocation of network buffers and links. It determines when buffers

    and links are assigned to which messages, the granularity at which they are allocated, and how

    these resources are shared among the many messages using the network. A good flow control

    protocol lowers the latency experienced by messages at low loads by not imposing high

    overhead in resource allocation, and raises network throughput by enabling effective

    sharing of buffers and links across messages. In determining the rate at which packets access

    buffers (or skip buffer access altogether) and traverse links, flow control is instrumental in

    determining network energy and power. Flow control also critically affects network quality of

    service since it determines the arrival and departure time of packets at each hop. The

    implementation complexity of a flow control protocol includes the complexity of the router

    micro-architecture as well as the wiring overhead imposed in communicating resource

    information between routers.

    In store-and-forward flow control, each node waits until an entire packet has been received

    before forwarding any part of the packet to the next node. As a result, long delays are incurred

    at each hop, which makes store-and-forward unsuitable for on-chip networks that are usually delay-critical.

    To reduce the delay packets experience at each hop, virtual cut-through flow control allows

    transmission of a packet to begin before the entire packet is received. Latency experienced by a

    packet is thus drastically reduced, as shown in Fig. 2.5. However, bandwidth and storage are

    still allocated in packet-sized units. Packets still only move forward if there is enough storage to

    hold the entire packet. On-chip networks with tight area and power constraints find it difficult

    to accommodate the large buffers needed to support virtual cut-through (assuming large

    packets).

    Like virtual cut-through flow control, wormhole flow control cuts through flits, allowing flits to

    move on to the next router as soon as there is sufficient buffering for this flit. However, unlike

    store-and-forward and virtual cut-through flow control, wormhole flow control allocates

    storage and bandwidth to flits that are smaller than a packet. This allows relatively small flit-

    buffers to be used in each router, even for large packet sizes. While wormhole flow control uses

    buffers effectively, it makes inefficient use of link bandwidth. Though it allocates storage and

    bandwidth in flit-sized units, a link is held for the duration of a packet's lifetime in the router.

    As a result, when a packet is blocked, all of the physical links held by that packet are left idle.

    Throughput suffers because other packets queued behind the blocked packet are unable to use

    the idle physical links.

    Fig. 2.5 Timing for (a) store-and-forward and (b) cut-through flow control at low loads, where

    tr refers to the delay routing the head flit through each router, ts refers to the serialization delay

    transmitting the remaining flits of the packet through each router, and tw refers to the time

    involved in propagating bits across the wires between adjacent routers. Wormhole and virtual-

    channel flow control have the same timing as cut-through flow control at low loads.
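The timing difference in Fig. 2.5 can be expressed with the usual zero-load latency model. The sketch below assumes no contention and uses the caption's symbols (tr, ts, tw) as parameters t_r, t_s and t_w; the numeric values are made up purely for illustration.

```python
def store_and_forward_latency(hops, t_r, t_s, t_w):
    """Every hop waits for the whole packet: routing, serialization and wire
    delay are all paid once per hop."""
    return hops * (t_r + t_s + t_w)

def cut_through_latency(hops, t_r, t_s, t_w):
    """The head flit pays routing and wire delay at every hop, while the
    serialization delay of the remaining flits is paid only once."""
    return hops * (t_r + t_w) + t_s

# Illustrative values in cycles (assumed, not measured).
hops, t_r, t_s, t_w = 5, 1, 4, 1
saf = store_and_forward_latency(hops, t_r, t_s, t_w)
ct = cut_through_latency(hops, t_r, t_s, t_w)
print("store-and-forward:", saf, "cycles")
print("cut-through:      ", ct, "cycles")
```

With these values store-and-forward needs 30 cycles and cut-through 14. The gap grows with both the hop count and the packet length, because store-and-forward multiplies the serialization delay by the number of hops.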

    Virtual-channel flow control improves upon the link utilization of wormhole flow control,

    allowing blocked packets to be passed by other packets. A virtual channel (VC) consists merely

    of a separate flit queue in the router; multiple VCs share the physical wires (physical link)

    between two routers. Virtual channels arbitrate for physical link bandwidth on a flit-by-flit

    basis. When a packet holding a virtual channel becomes blocked, other packets can still

    traverse the physical link through other virtual channels. Thus, VCs increase the utilization of

    the critical physical links and extend overall network throughput. Current on-chip network

    designs overwhelmingly adopt wormhole flow control for its small area and power footprint,

    and use virtual channels to extend the bandwidth where needed. Virtual channels are also

    widely used to break deadlocks, both within the network, and for handling system level

    deadlocks.

    Figure 2.6 illustrates how two virtual channels can be used to break a cyclic deadlock in the

    network when the routing protocol permits a cycle.

    Fig. 2.6 Two virtual channels (denoted by solid and dashed lines) and their associated separate

    buffer queues (denoted as two circles at each router) used to break the cyclic route deadlock in

    Fig. 2.4

    Since each VC is time-multiplexed onto the physical link cycle-by-cycle, holding onto a VC

    implies holding on to its

    associated buffer queue rather than locking down a physical link. By enforcing an

    order on VCs, so that lower-priority VCs cannot request and wait for higher-priority

    VCs, there cannot be a cycle in resource usage. At the system level, messages that

    can potentially block on each other can be assigned to different message classes

    that are mapped onto different virtual channels within the network, such as request

    and acknowledgment messages of coherence protocols. Implementation complexity

    of virtual channel routers will be discussed in detail next in Sect. 5.5 on router micro-architecture.

    Buffer backpressure:

    Unlike broadband networks, on-chip networks are typically

    not designed to tolerate dropping of packets in response to congestion. Instead,

    buffer backpressure mechanisms stall flits from being transmitted when buffer space

    is not available at the next hop. Two commonly used buffer backpressure mechanisms

    are credits and on/off signaling. Credits keep track of the number of buffers

    available at the next hop, by sending a credit to the previous hop when a flit leaves

    and vacates a buffer, and incrementing the credit count at the previous hop upon

    receiving the credit. On/off signaling involves a signal between adjacent routers that

    is turned off to stop the router at the previous hop from transmitting flits when the

    number of buffers drops below a threshold, with the threshold set to ensure that all

    in-flight flits will have buffers upon arrival.
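Credit-based backpressure can be illustrated with a small model of one link. This is a simplified sketch: the class name is hypothetical, and the credit is returned instantly, whereas in a real router it takes some cycles to travel back to the previous hop.

```python
from collections import deque

class CreditedLink:
    """The previous hop's view of one input buffer at the next hop."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth   # free buffer slots at the next hop
        self.downstream = deque()     # flits currently held in that buffer

    def try_send(self, flit):
        """Send a flit only if a credit is available; otherwise stall."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_one(self):
        """A flit leaves the next hop's buffer, which returns one credit."""
        flit = self.downstream.popleft()
        self.credits += 1
        return flit

link = CreditedLink(buffer_depth=2)
sent = [link.try_send(f) for f in ("a", "b", "c")]  # third flit must stall
freed = link.drain_one()                            # "a" moves on, credit returns
retry = link.try_send("c")                          # the stalled flit now goes through
print(sent, freed, retry, link.credits)
```

No flit is ever dropped: the sender simply stalls until a buffer slot is freed, which is the behavior the text describes.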

    5.5 Router Micro-architecture

    How a router is built determines its critical path delay, per-hop delay, and overall network

    latency. Router micro-architecture also affects network energy as it determines the circuit

    components in a router and their activity. The implementation

    of the routing and flow control and the actual router pipeline will affect the efficiency

    at which buffers and links are used and thus overall network throughput. In terms of

    reliability, faults in the router datapath will lead to errors in the transmitted message,

    while errors in the control circuitry can lead to lost and mis-routed messages. The area

    footprint of the router is largely determined by the chosen router micro-architecture and

    underlying circuits.

    Figure 2.7 shows the micro-architecture of a state-of-the-art credit-based virtual

    channel (VC) router to explain how typical routers work. The example assumes a

    two-dimensional mesh, so that the router has five input and output ports corresponding

    to the four neighboring directions and the local processing element (PE) port.

    The major components which constitute the router are the input buffers, route computation

    logic, virtual channel allocator, switch allocator, and the crossbar switch.

    Most on-chip network routers are input-buffered, in which packets are stored in

    buffers only at the input ports.

    Fig. 2.7 Credit-based virtual channel router micro-architecture

    6 Project Methodologies, Results, and Achievements

    The purpose of this section is to summarize the developments that took place within this

    project and put them in a larger scientific and technological context. The principles underlying

    this approach and all the methods used in this project are derived from previously conducted

    research.

    As already explained at the beginning of this project, our purpose is to concentrate on the

    customers' needs. To accomplish that, a survey was conducted over the internet, so we could gain

    another point of view that may differ from ours, or in other words, to find out what

    smartphones are mostly used for.

    Later on, according to the results of the survey and based on SoC structure and

    multi-core system architectures, this project will propose an on-chip network that offers high

    performance, smart use of space, and network interconnections between the components of

    the chip, which in turn means a long-lasting battery.

    All testing of the proposed design will be done on a discrete-event simulator, which is

    described later in this section.

    6.1 Usage of Smartphones Survey (Q & A)

    The survey ran from 4 March 2014 to 15 March 2014 and was created on the free

    survey website www.freeonlinesurveys.com. It was then shared over social media and

    was answered by 43 people. The questions and results of the survey are presented in the

    following sections.

    6.2 Analysis on the survey results

    The results show that all age groups are represented, but the survey was mainly answered

    by people aged 15 to 22 (59.5 %). Most respondents use the Android operating system

    (69 % vs. 19 % Apple). Considering how much time users spend daily on their smartphones

    (1-4 hours: 59.6 %) and how long their batteries last (10-12 hours: 41.9 %), it can be

    concluded that battery life is probably one of the biggest problems of today's smartphones,

    which makes it our priority in the new architecture plan.

    Internet connectivity is also a main factor, so the design must support both Wi-Fi and

    3G/4G connections.

    The ranking of the items offered to respondents clearly shows what most of them use their

    phones for. This looks like a new era for the technology: 40.48 % said that their first

    usage priority is social media (Facebook, Twitter, Foursquare, etc.), while 38.10 % said

    that their first priority is phone calls rather than any of the other items on the list.

    This question also shows that people use their phones mostly

    for internet calls (Viber, Skype, etc.), Instagram (as a social network and picture editor), internet

    surfing, music players, etc. Clearly, many people also use their phones as cameras and for

    playing games. The last question of the survey shows that almost half of the respondents

    (41.9 %) prefer playing games in HD graphics.

    What did we learn from the collected results?

    Light and heavy data transfer should be offered in our plan

    Internet browsing

    HD Graphic Card

    Conference calling

    HD editing

    Quick, low-delay database

    Medium and Fast data links

    Cellular support

    From the results it can be concluded that processors with different performance levels will be

    needed. For example, it is not the same whether two processors execute a picture-editing

    program or a single dedicated graphics processor executes the code.

    Also, to make the battery last longer, it would be useful to use only regular

    processors for easy tasks (e.g., light database transfers).

    We can now conclude that it would be better to design heterogeneous multi-core systems,

    which include non-identical processors. With that in mind, the interconnections of the

    network and all its components will be of great importance.

    To present the planned design, the next section first introduces the simulator on which the

    tests will be done, followed by the architecture itself.

    6.3 Discrete event simulations

    In the field of simulation, a discrete-event simulation (DES) models the operation of

    a system as a discrete sequence of events in time. Each event occurs at a particular instant in

    time and marks a change of state in the system. Between consecutive events, no change in the

    system is assumed to occur; thus the simulation can directly jump in time from one event to the

    next.

    This contrasts with continuous simulation in which the simulation continuously tracks the

    system dynamics over time. Instead of being event-based, this is called an activity-based

    simulation; time is broken up into small time slices and the system state is updated according to

    the set of activities happening in the time slice. Because discrete-event simulations do not have

    to simulate every time slice, they can typically run much faster than the corresponding

    continuous simulation.

    Another alternative to event-based simulation is process-based simulation. In this approach,

    each activity in a system corresponds to a separate process, where a process is typically

    simulated by a thread in the simulation program. In this case, the discrete events, which are

    generated by threads, would cause other threads to sleep, wake, and update the system state.

    A more recent method is the three-phased approach to discrete event simulation (Pidd, 1998).

    In this approach, the first phase is to jump to the next chronological event. The second phase is

    to execute all events that unconditionally occur at that time (these are called B-events). The

    third phase is to execute all events that conditionally occur at that time (these are called C-

    events). The three-phase approach is a refinement of the event-based approach in which

    simultaneous events are ordered so as to make the most efficient use of computer resources.

    The three-phase approach is used by a number of commercial simulation software packages,

    but from the user's point of view, the specifics of the underlying simulation method are

    generally hidden.

    6.4 Example of DES in real life

    A common exercise in learning how to build discrete-event simulations is to model a queue,

    such as customers arriving at a bank to be served by a teller. In this example, the system

    entities are Customer-queue and Tellers. The system events are Customer-

    Arrival and Customer-Departure. (The event of Teller-Begins-Service can be part of the logic of

    the arrival and departure events.) The system states, which are changed by these events,

    are Number-of-Customers-in-the-Queue (an integer from 0 to n) and Teller-Status (busy or

    idle). The random variables that need to be characterized to model this

    system stochastically are Customer-Interarrival-Time and Teller-Service-Time. An agent-based

    framework for performance modeling of an optimistic parallel discrete event simulator is

    another example of a discrete event simulation.

    6.5 Components of a discrete-event simulation

    State

    A system state is a set of variables that captures the salient properties of the system to be

    studied. The state trajectory over time S(t) can be mathematically represented by a step

    function whose values change in correspondence of discrete events.

    Clock

    The simulation must keep track of the current simulation time, in whatever measurement units

    are suitable for the system being modeled. In discrete-event simulations, as opposed to

    real-time simulations, time hops because events are instantaneous: the clock skips to the next

    event start time as the simulation proceeds.

    Events list

    The simulation maintains at least one list of simulation events. This is sometimes called

    the pending event set because it lists events that are pending as a result of previously simulated

    events but have yet to be simulated themselves. An event is described by the time at which it

    occurs and a type, indicating the code that will be used to simulate that event. It is common for

    the event code to be parametrized, in which case, the event description also contains

    parameters to the event code.

    When events are instantaneous, activities that extend over time are modeled as sequences of

    events. Some simulation frameworks allow the time of an event to be specified as an interval,

    giving the start time and the end time of each event.

    Random-number generators

    The simulation needs to generate random variables of various kinds, depending on the system

    model. This is accomplished by one or more pseudo-random number generators. The use of

    pseudo-random numbers as opposed to true random numbers is a benefit should a simulation

    need a rerun with exactly the same behavior.
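The reproducibility benefit of pseudo-random numbers is easy to demonstrate. The sketch below uses Python's standard `random` module; the choice of an exponential distribution for interarrival times and the seed values are assumptions made for the example.

```python
import random

def interarrival_times(seed, count, mean=2.0):
    """Draw `count` exponentially distributed interarrival times from a
    generator that is private to this run and fully determined by `seed`."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean) for _ in range(count)]

run_a = interarrival_times(seed=42, count=5)
run_b = interarrival_times(seed=42, count=5)  # rerun with the same seed
run_c = interarrival_times(seed=43, count=5)  # independent replication

print(run_a == run_b)  # same seed, same event times
print(run_a == run_c)  # different seed, different event times
```

Using a separate `random.Random` instance per run, instead of the module-level functions, keeps replications from interfering with each other.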

    One of the problems with the random number distributions used in discrete-event simulation is

    that the steady-state distributions of event times may not be known in advance. As a result, the

    initial set of events placed into the pending event set will not have arrival times representative

    of the steady-state distribution. This problem is typically solved by bootstrapping the simulation

    model. Only a limited effort is made to assign realistic times to the initial set of pending events.

    These events, however, schedule additional events, and with time, the distribution of event

    times approaches its steady state. This is called bootstrapping the simulation model. In

    gathering statistics from the running model, it is important to either disregard events that occur

    before the steady state is reached or to run the simulation for long enough that the

    bootstrapping behavior is overwhelmed by steady-state behavior. (This use of the

    term bootstrapping can be contrasted with its use in both statistics and computing.)

    Statistics

    The simulation typically keeps track of the system's statistics, which quantify the aspects of

    interest. In the bank example, it is of interest to track the mean waiting times. In a simulation

    model, performance metrics are not analytically derived from probability distributions, but

    rather as averages over replications, that is, different runs of the model. Confidence

    intervals are usually constructed to help assess the quality of the output.

    Ending condition

    Because events are bootstrapped, theoretically a discrete-event simulation could run forever.

    So the simulation designer must decide when the simulation will end. Typical choices are at

    time t or after processing n events or, more generally, when statistical measure

    X reaches the value x.
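The components described in this section can be put together in a small event-based simulation of the bank example from Section 6.4. This is a minimal sketch under stated assumptions: a single teller, exponentially distributed interarrival and service times, and an ending condition of a fixed number of processed events. All names and parameter values are illustrative.

```python
import heapq
import random

def simulate_bank(seed=1, mean_interarrival=3.0, mean_service=2.0, max_events=1000):
    rng = random.Random(seed)   # random-number generator
    clock = 0.0                 # simulation clock
    queue = []                  # state: arrival times of waiting customers
    teller_busy = False         # state: teller status (busy or idle)
    waits = []                  # statistics: waiting time of each served customer
    # Pending event set, ordered by event time; bootstrapped with one arrival.
    events = [(rng.expovariate(1.0 / mean_interarrival), "arrival")]
    processed = 0

    while events and processed < max_events:   # ending condition: n events
        clock, kind = heapq.heappop(events)    # jump straight to the next event
        processed += 1
        if kind == "arrival":
            queue.append(clock)
            # Each arrival schedules the next one.
            nxt = clock + rng.expovariate(1.0 / mean_interarrival)
            heapq.heappush(events, (nxt, "arrival"))
        else:                                  # departure
            teller_busy = False
        if not teller_busy and queue:          # Teller-Begins-Service logic
            waits.append(clock - queue.pop(0))
            teller_busy = True
            done = clock + rng.expovariate(1.0 / mean_service)
            heapq.heappush(events, (done, "departure"))

    mean_wait = sum(waits) / len(waits) if waits else 0.0
    return {"clock": clock, "served": len(waits), "mean_wait": mean_wait}

result = simulate_bank()
print(result)
```

The clock never advances in fixed steps; it only takes the time stamps of popped events, which is what lets a discrete-event simulation skip idle periods. The sketch does not discard the warm-up period, so for real measurements the early events would have to be excluded or the run made long enough, as discussed above.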

    6.6 Network Simulators as DES

    Discrete event simulation is used in computer networks to simulate new protocols for different

    network traffic scenarios before deployment.

    In communication and computer network research, network simulation is a technique where a

    program models the behavior of a network either by calculating the interaction between the

    different network entities (hosts/packets, etc.) using mathematical formulas, or by actually

    capturing and playing back observations from a production network. The behavior of the

    network and the various applications and services it supports can then be observed in a test

    lab; various attributes of the environment can also be modified in a controlled manner to assess

    how the network would behave under different conditions.

    There are many network simulators, both free/open-source and proprietary. Notable examples,

    ordered by how often they are mentioned in research papers, are:

    ns (open source)

    OPNET (proprietary software)

    NetSim (proprietary software)

    6.7 Network Simulations with OPNET

    OPNET Technologies, Inc. is a software business that provides performance management for
    computer networks and applications. The company was founded in 1986 and went public in 2000.

    OPNET can serve a variety of needs. Compared to the cost and time involved in setting up
    an entire test bed containing multiple networked computers, routers and data links, OPNET is
    relatively fast and inexpensive. It allows engineers and researchers to test scenarios that might
    be particularly difficult or expensive to emulate using real hardware, for instance simulating a
    scenario with several nodes or experimenting with a new protocol in the network. Network


    simulators are particularly useful in allowing researchers to test new networking protocols or

    changes to existing protocols in a controlled and reproducible environment. A typical network

    simulator encompasses a wide range of networking technologies and can help the users to

    build complex networks from basic building blocks such as a variety of nodes and links. With the

    help of simulators, one can design hierarchical networks using various types of nodes like

    computers, hubs, bridges, routers, switches, links, mobile units, etc.

    Various types of Wide Area Network (WAN) technologies like TCP, ATM, IP, etc. and Local Area
    Network (LAN) technologies like Ethernet, token rings, etc. can all be simulated with a typical
    simulator, and the user can test and analyze various standard results apart from devising some
    novel protocol or strategy for routing, etc. Network simulators are also widely used to simulate
    battlefield networks in network-centric warfare.

    Minimally, a network simulator must enable a user to represent a network topology, specifying

    the nodes on the network, the links between those nodes and the traffic between the nodes.

    More complicated systems like OPNET allow the user to specify everything about the protocols

    used to handle traffic in a network. Graphical applications allow users to easily visualize the

    workings of their simulated environment. Text-based applications may provide a less intuitive

    interface, but may permit more advanced forms of customization.
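As a rough illustration of this minimum, a topology can be captured by three collections: nodes, links, and traffic demands. The sketch below is plain Python and does not use any OPNET API; the class names, fields, and sample values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Link:
    a: str                 # node at one end
    b: str                 # node at the other end
    capacity_bps: float    # link capacity in bits per second

@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    rate_bps: float        # offered traffic in bits per second

@dataclass
class Topology:
    nodes: set = field(default_factory=set)
    links: list = field(default_factory=list)
    flows: list = field(default_factory=list)

    def add_link(self, a, b, capacity_bps):
        # Adding a link implicitly registers both endpoints as nodes.
        self.nodes.update((a, b))
        self.links.append(Link(a, b, capacity_bps))

    def add_flow(self, src, dst, rate_bps):
        # Traffic may only be defined between nodes that already exist.
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError("flow endpoints must be nodes of the topology")
        self.flows.append(Flow(src, dst, rate_bps))

    def neighbors(self, node):
        # Links are treated as duplex, so either end counts.
        return sorted({link.b for link in self.links if link.a == node} |
                      {link.a for link in self.links if link.b == node})

# Hypothetical three-node example: two CPUs attached to one router.
net = Topology()
net.add_link("router", "cpu_1", 1e9)
net.add_link("router", "cpu_2", 1e9)
net.add_flow("cpu_1", "cpu_2", 2000.0)
```

A full simulator adds protocol behavior on top of such a description; the point here is only that nodes, links, and traffic are the smallest useful input.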

    7 Implementing the project in OPNET

    7.1 Adding Traffic

    Based on the theory presented so far and the results of the survey, we made
    three architectural designs using OPNET.

    We stated that a heterogeneous network (with different processors) is preferable, but we
    must pay attention to the interconnections between the components of our System-on-a-Chip.
    That is why this section presents different interconnections of 8 CPUs. Like most modern
    smartphones, our proposed architecture is built from basic CPUs that handle simple tasks
    (keeping the system running, light data transfer, cellular calls, etc.) and additional CPUs in
    charge of more complicated tasks (heavy data transfer, picture editing, playing HD video, etc.).
    In the OPNET tests, two of Intel's full node models were used: Intel_D875PBZ_P4 (3200 MHz) and
    Intel_VC820 (800 MHz), where the first model represents the architecture's basic CPU and the
    second model represents the additional CPU.


    For the testing to be complete, the simulation implemented traffic data and real-life events.

    Figure 7.1 provides the profile configuration table with the profiles created specifically for

    testing events over the network. Five profiles were created in such a way that both basic and

    additional CPUs would be tested over different tasks.

    Fig. 7.1 Profile configuration table

    The following pages briefly document all five profiles, including their application tables,
    start time offsets, and application durations. Each of these applications generates and
    simulates events over the network. The traffic is tested over different network configurations
    and evaluated against several performance measures.

    1. Telecom

    2. Data Transfer


    3. Video Conference

    4. Web Browsing

    5. Gaming

    7.2 Network On-Chip realizations in OPNET

    In this section, different network architectures are tested in OPNET. All of them were designed
    and theoretically justified in Section 5 (On-Chip Network to a Multi-core System) in terms of
    four parameters: topology, routing algorithm, flow control protocol, and router microarchitecture.
    Statistics about the networks are collected for the following:


    Simulation process of execution

    queuing delay of every node in the network

    point-to-point throughput of the channels between the CPUs

    throughput of the channels between the CPUs and the router

    and the global delay of the network

    7.2.1 Star network (and 4 basic CPUs interconnected forming a ring network)

    Fig. 7.2 Combination of Star and Ring Network in OPNET Modeler

    This network is actually a combination of two networks. First, all 8 nodes form a Star
    network (since all of them are connected directly to the router in the middle); second, a
    Ring network is formed from the basic CPUs, which are interconnected with each other
    (every node is connected to its neighbors).


    Characteristics of the network:

    Server

    Application Configuration

    Profile Configuration

    Router to assign tasks to the nodes

    1000BaseX (1 Gbps) duplex links to interconnect additional nodes with the router

    10GbpsBaseT (10 Gbps) duplex links to interconnect basic nodes with the router and with each other

    Max Hop Count of the Network - 2

    Max Node Degree - 3
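The hop count and node degree listed above can be cross-checked with a short breadth-first search. The sketch below is an illustration in plain Python, not an OPNET model; the node names, the ring ordering of the Basic CPUs, and the choice to measure both metrics over CPU nodes only (leaving out the router and the server) are assumptions.

```python
from collections import deque
from itertools import combinations

BASIC = ["basic_1", "basic_2", "basic_3", "basic_4"]
ADDITIONAL = ["add_1", "add_2", "add_3", "add_4"]

def build_star_ring():
    """Adjacency sets: every CPU links to the router, Basic CPUs form a ring."""
    adj = {n: set() for n in BASIC + ADDITIONAL + ["router"]}

    def connect(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for cpu in BASIC + ADDITIONAL:
        connect(cpu, "router")
    for i, cpu in enumerate(BASIC):
        connect(cpu, BASIC[(i + 1) % len(BASIC)])   # assumed ring order
    return adj

def hops(adj, src, dst):
    """Fewest links between src and dst, found by breadth-first search."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adj[node] - seen:
            seen.add(nxt)
            frontier.append((nxt, dist + 1))
    raise ValueError("nodes are not connected")

adj = build_star_ring()
cpus = BASIC + ADDITIONAL
max_hops = max(hops(adj, a, b) for a, b in combinations(cpus, 2))
max_degree = max(len(adj[c]) for c in cpus)
```

Under these assumptions `max_hops` is 2 (any two CPUs can reach each other through the router) and `max_degree` is 3 (a Basic CPU has the router plus two ring neighbors), which agrees with the listed characteristics.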

    Simulation Process:

    The execution lasted 90 simulation seconds and completed 37,475,887 events, at an average speed of 309,876 events/sec.

    Queuing delay of the nodes in the network:

    This statistic is collected on every CPU node in the network in order to compare the
    queuing delay of the Basic and Additional CPUs. All the Basic processors show a larger
    delay in queuing packets.

    *Note that the Basic CPUs run at 3200 MHz and the Additional CPUs at 800 MHz.

    Fig. 7.3 Queuing delay of the nodes in the network.


    Point-to-point throughput of the channels between the CPUs:

    Only the Basic Intel nodes are connected to their neighbors, so each of them is connected
    by a channel to two other nodes. For simplicity, throughput statistics are shown for two
    pairs of nodes (Intel 1 and Intel 2, and Intel 1 and Intel 3). The figure below shows that
    the rate of successful message delivery differs between the two channels, but within a
    single channel the rate of exchange is nearly the same.

    Fig. 7.4 Point-to-point throughput of the channels between Basic CPUs

    Throughput of the channels between the CPUs and the router:

    One node of each CPU type is taken as a representative. The figure below suggests that
    more tasks were assigned to the Basic nodes than to the Additional ones, or alternatively
    that the channel connecting the Additional nodes with the router does not allow such a
    high data transfer rate.


    Fig. 7.5 Throughput of the channels between the Router and both types of nodes

    Global Delay of the Network:

    The delay of the network peaks at 700 microseconds, which is quite a good result.

    Fig. 7.6 Global delay of Star Network


    7.2.2 Ring network (and interconnections of all nodes with the router)

    Fig. 7.7 An image of Ring Architecture designed in OPNET

    In the previous network only the Basic nodes were interconnected; now both node models are.
    Every node is connected with the router and also with both of its neighbor nodes. At two
    points there is a connection between Basic and Additional nodes (Intel Basic 1 - Additional
    CPU 1, and Intel Basic 4 - Additional CPU 4). The idea is to see how the nodes will interact,
    or in other words how the events will now be assigned.

    Characteristics of the network:

    Server

    Application Configuration

    Profile Configuration

    Router to assign tasks to the nodes

    1000BaseX (1 Gbps) duplex links to interconnect all the nodes with each other and with the Router

    10GbpsBaseT (10 Gbps) duplex links to interconnect the Router and the Server


    Max Hop Count of the Network - 1

    Max Node Degree - 3

    Simulation Process:

    The execution lasted 90 simulation seconds and completed 40,614,395 events, at an
    average speed of 358,979 events/sec.

    Queuing delay of the nodes in the network:

    No major difference can be seen in the queuing delay between the two types of nodes.

    Fig. 7.8 Queuing delay on the nodes in the network


    Point-to-point throughput of the channels between the CPUs:

    The figure below shows a representative of each kind of interconnection in this network:
    Basic CPUs connected to each other, Additional CPUs connected to each other, and the two
    connections between Basic and Additional CPUs. The rate of successful message delivery
    reaches 2000 bits/sec in all cases.

    Fig. 7.9 Point-to-point throughput on channels connecting Basic, Additional, and Basic-Additional nodes

    Throughput of the channels between the CPUs and the router:

    Fig. 7.9 Both the channels between Basic and Additional nodes and the router reached
    successful delivery of 5.9 MB/sec, but the channels to Basic nodes delivered 4 times more
    packets than the other ones.


    Global Delay of the Network:

    Fig. 7.10 The delay of the network differs by only 30 microseconds from that of the Star
    Architecture

    7.2.3 Mesh network

    In this architecture all the nodes are again connected with the router, but every node is
    also connected with its neighbors and with one node of the other type (Additional with
    Basic, and vice versa). The idea is to make a more fault-tolerant network: if more than one
    link fails, there are still other paths to reach the destination node.

    *Note that this is not a full mesh network, in which all the nodes would be interconnected
    with each other.


    Fig. 7.11 An image of mesh architecture designed in OPNET

    Characteristics of the network:

    Server

    Application Configuration

    Profile Configuration

    Router to assign tasks to the nodes

    10GbpsBaseT (10 Gbps) duplex links to interconnect Basic CPUs with the Router and with each other

    1000BaseX (1 Gbps) duplex links to interconnect Additional CPUs with Basic CPUs and with each other

    Max Hop Count of the Network - 2

    Max Node Degree - 4

    Simulation Process:

    The execution lasted 90 simulation seconds and completed 55,280,320 events, at an
    average speed of 328,347 events/sec.


    Queuing delay of the nodes in the network:

    Fig. 7.12 Packet queuing is considerably higher in the Additional nodes than in the
    Basic ones.

    Point-to-point throughput of the channels between the CPUs:

    The figure below presents an interesting situation. Both channels can carry up to 1 Gbps,
    but in the first case (Additional CPU 1 - Basic Intel 4) the Basic CPU sends only
    3000 bits/sec. This suggests that the node does not need much help in executing its tasks.
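To put the two numbers in proportion, the utilization of the channel can be computed as the observed rate divided by the link capacity. The snippet below is a small Python illustration using only the values quoted above (3000 bits/sec on a 1 Gbps channel); the function name is hypothetical.

```python
def utilization(observed_bps: float, capacity_bps: float) -> float:
    """Fraction of the link capacity that is actually used."""
    if capacity_bps <= 0:
        raise ValueError("capacity must be positive")
    return observed_bps / capacity_bps

# Values quoted in the text: 3000 bits/sec sent over a 1 Gbps channel.
basic_to_additional = utilization(3000, 1e9)
print(f"{basic_to_additional:.6%} of the channel capacity")
```

The result is a tiny fraction of the capacity, which is consistent with the reading that the link is almost idle in that direction.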

    Fig. 7.13 Point-to-point throughput on channels connecting different CPUs


    Throughput of the channels between the CPUs and the router:

    On the channel between the router and the Additional CPU, successful message delivery
    reaches 250,000,000 bits/sec in both directions of the link. On the other channel, which
    connects the Router with a Basic CPU, the situation is reversed for delivery from the
    Router to the node.

    Fig. 7.14 Throughput on channels connecting the Router and the nodes

    Global Delay of the Network:

    Fig. 7.15 The global delay of the network reaches almost 1000 microseconds, which is almost
    identical to that of the previously presented architectures.


    7.3 Results Comparison

    Network Delay:

    Fig. 7.16 The delay is almost identical in the Star and Ring topologies of the network

    Simulation Events:

    The simulated duration of all the simulations is fixed at 90 seconds, but the architectures
    differ in the number of events they complete. Figure 7.17 compares the networks by number
    of events completed. This should be taken seriously as an influential factor in deciding
    which architecture to propose.
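The event counts reported in the three Simulation Process summaries of Section 7.2 can be compared directly. The short Python sketch below uses only those reported figures; the variable and function names are illustrative, and reading "events/sec" as events per second of wall-clock time is an assumption (it would explain why the implied run time differs from the 90 simulated seconds).

```python
# Figures quoted in the Simulation Process summaries of Section 7.2.
RUNS = {
    "Star": {"events": 37_475_887, "events_per_sec": 309_876},
    "Ring": {"events": 40_614_395, "events_per_sec": 358_979},
    "Mesh": {"events": 55_280_320, "events_per_sec": 328_347},
}

def relative_gain(name, baseline):
    """Fraction by which `name` completed more events than `baseline`."""
    return RUNS[name]["events"] / RUNS[baseline]["events"] - 1.0

def implied_run_time(name):
    """Seconds implied by dividing the event count by the average speed."""
    run = RUNS[name]
    return run["events"] / run["events_per_sec"]

for name in RUNS:
    print(f"{name}: {RUNS[name]['events']:,} events, "
          f"about {implied_run_time(name):.0f} s at the reported speed")

print(f"Mesh vs Star: {relative_gain('Mesh', 'Star'):+.1%}")
print(f"Mesh vs Ring: {relative_gain('Mesh', 'Ring'):+.1%}")
```

With these inputs the Mesh run completes roughly 48% more events than the Star run and roughly 36% more than the Ring run.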

    Fig. 7.17 Number of events completed

    [Chart: series Star, Ring, Mesh; vertical axis 0 to 1000; horizontal axis 20 sim/s to 100 sim/s]

    [Chart: Number of Events for Star, Ring, Mesh; axis 0 to 60,000,000]


    Queuing Delay of the nodes:

    To create the ideal architecture, close attention should be paid to the queuing delay of
    the nodes. Generally speaking, the nodes in the Mesh network have the lowest queuing delay:
    the Additional CPUs have a delay of 5.7 microseconds and the Basic CPUs a delay of
    400 microseconds.

    By comparison, the nodes in the Star and Ring networks have delays of 20000 microseconds
    (Additional nodes) and 10000 microseconds (Basic nodes), respectively.

    Point-to-point throughput of channels

    The point-to-point throughput is almost the same in the first two networks, as shown in
    the figure below. When the Mesh topology is implemented, on the other hand, the throughput
    of the channels ranges from 375 Bps up to 50 Mbps. Note also that in the Mesh topology
    there are links between two Basic nodes, between two Additional nodes, and between Basic
    and Additional nodes.

    Fig. 7.18 Throughput of channels between nodes. The successful delivery rate is given in bits/s.

    Hop count, Max node degree, and Alternative paths:

    As for its effect on throughput, since a topology dictates the total number of alternate paths

    between nodes, it affects how well the network can spread out traffic and thus the effective

    bandwidth a network can support. Network reliability is also greatly influenced by the topology

    as it dictates the number of alternative paths for routing around faults.

    [Chart: Star and Ring; series Basic, Additional, Additional and Basic; axis 0 to 2500]


    Attributes/Topologies   Star          Ring          Mesh

    Hop Count               2             1             2

    Node Degree             3             3             4

    Alternative Paths       More than 3   More than 5   More than 5

    A star topology offers fewer alternate paths between nodes than a mesh or ring, and thus
    saturates at a lower network throughput for most traffic patterns. In case of a fault, the
    mesh topology offers the most alternate paths. A higher hop count also implies lower network
    throughput, which is another drawback of the Star network.

    While star networks have poorer performance (latency, throughput, energy, and reliability)
    when compared to higher dimensional networks, they have lower implementation overhead. A
    ring has a node degree of 3 while a mesh has a node degree of 4, where node degree refers to
    the number of links in and out of a node. A higher node degree requires more links and higher
    port counts at routers.

    So it can be concluded that, in terms of topology, the most suitable network is the Mesh
    network, since we are looking for fast performance and reliable bandwidth.


    8 Conclusions

    With all the effort invested in this project, there are reasons to believe that the
    architectures provided, together with the underlying theory, would be considerably closer
    to industrial acceptance. We summarize the progress with respect to the main objectives of
    the project, namely reliability, high performance, and long-lasting battery.

    Reliability: This is a major obstacle to the acceptance of a particular design. The proposed
    solution should be a reliable network-on-a-chip with efficient multi-core processors for
    mobile platforms as future systems on a chip. That is why eight processors and links
    with high bandwidth were proposed. In addition, with its high number of alternate paths, the
    Mesh would offer fault tolerance, which again makes the network reliable.

    High Performance: In line with the trends in today's smartphone technology, where
    high performance is needed, the entire proposed network that was designed in OPNET
    offered high performance because of the eight processors. Four processors at 3200 MHz
    and four additional ones running at 800 MHz would be entirely sufficient for completing
    events quickly. Comparing the results we obtained, the overall performance
    depends on latency, throughput, and energy.

    The point-to-point throughput on channels in the three proposed networks showed
    that the nodes work together in processing events. In the mesh, there is a wider
    spread in channel throughput, from 375 Bps up to 50 Mbps, which is far more than
    the throughput in the star and ring networks.

    The latency, or delay, of the proposed networks showed that the ring is the most
    suitable network. However, we should also keep in mind the number of events
    completed by the networks. The mesh executes roughly one third more events than the
    other networks in the same time, so again it is the more suitable solution.

    Long Lasting Battery: The battery of a system-on-a-chip cannot be the same as the ones
    in laptops, tablets, etc. However, with smart use of the chip's power, a lot of
    energy can be saved. In the section on OPNET testing we saw that all the processors
    in the network divide the tasks between them, depending on how occupied a
    particular processor is. The plan is also to not use all of the processing power when
    there are not enough events to be executed.
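The task division described in the last point can be illustrated with a simple least-loaded dispatcher. This is a hypothetical Python sketch, not the mechanism OPNET uses internally: the `Processor` class, the load units, the sample task costs, and the rule that a processor with no assigned work counts as idle are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Processor:
    name: str
    load: float = 0.0      # outstanding work, in arbitrary units

def dispatch(processors, task_cost):
    """Assign a task to the least-occupied processor and return it."""
    target = min(processors, key=lambda p: p.load)
    target.load += task_cost
    return target

def idle(processors):
    """Processors with no work assigned, candidates for powering down."""
    return [p.name for p in processors if p.load == 0.0]

# Eight processors, mirroring the four Basic and four Additional CPUs.
cpus = [Processor(f"basic_{i}") for i in range(1, 5)] + \
       [Processor(f"additional_{i}") for i in range(1, 5)]

for cost in (3.0, 1.0, 2.0):       # three hypothetical tasks
    dispatch(cpus, cost)
```

With only three tasks in flight, most processors remain idle in this sketch, which is the situation in which unused processing power could be left switched off.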


