Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)


Page 1: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Alternative Architectures

Christopher Trinh, CIS Fall 2009

Chapter 9 (pg 461 – 486)

Page 2: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

What are they?

• Architectures that transcend the classic von Neumann approach.

• Instruction-level parallelism
• Multiprocessing architectures
  – Parallel processing
• Dataflow computing
• Neural networks
• Systolic arrays
• Quantum computing
• Optical computing
• Biological computing

Page 3: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Trade-offs

• Defined as a situation that involves losing one quality or aspect of something in return for gaining another quality or aspect.

• Important concept in the computer field.
• Speed vs. money
• Speed vs. power consumption/heat

Page 4: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

It's All about the Benjamins

• In trade-offs, money in most cases takes precedence.

• Moore's Law vs. Rock's Law
  – Rock's Law is the economic flip side of Moore's Law: the cost of a semiconductor chip fabrication plant doubles every four years. As of 2003, the price had already reached about 3 billion US dollars.

• The consumer market is now dominated by parallel computing, in the form of multiprocessor systems.

• Exceptions: Research

Page 5: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Back in the Day

• CISC vs. RISC (complex vs. reduced instruction sets).

• CISC was largely motivated by the high cost of memory (including registers).

• Analogous to text-message (SMS) abbreviations.
  – "lol," "u," "brb," "omg," and "gr8"
  – Same motivation, same benefit: more information per unit of memory (or per SMS).

Page 6: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

What's the Difference

• RISC
  – Minimizes the number of cycles per instruction; most instructions execute in one clock cycle.
  – Uses hardwired control, which makes instruction pipelining easier.
  – Complexity is pushed up into the domain of the compiler.
  – More instructions per program.

• CISC
  – Increases performance by reducing the number of instructions per program.

Page 7: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Between RISC and CISC

• Cheaper and more plentiful memory became available; money became less of a trade-off factor.

• "The Case for the Reduced Instruction Set Computer," David Patterson and David Ditzel.
  – 45% data movement instructions
  – 25% ALU instructions
  – 30% flow control instructions

• Overall, complex instructions were used only about 20% of the time.

Page 8: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Performance Formula:

time/program = (time/cycle) × (cycles/instruction) × (instructions/program)

The program is the same on both machines, and the time per cycle is (roughly) constant. CISC reduces instructions per program at the cost of more cycles per instruction; RISC reduces cycles per instruction at the cost of more instructions per program.
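A back-of-the-envelope illustration of the formula, as a minimal Python sketch. The clock period, cycles-per-instruction values, and instruction counts below are hypothetical, chosen only to show how the two sides of the trade-off compare.

```python
# Back-of-the-envelope use of the performance equation. The clock, CPI, and
# instruction counts below are hypothetical, chosen only for illustration.

def time_per_program(time_per_cycle_ns: float, cycles_per_instruction: float,
                     instructions_per_program: int) -> float:
    """time/program = (time/cycle) * (cycles/instruction) * (instructions/program)."""
    return time_per_cycle_ns * cycles_per_instruction * instructions_per_program

# Same program, same clock (time/cycle held constant), different trade-offs:
cisc_ns = time_per_program(1.0, cycles_per_instruction=5, instructions_per_program=10_000)
risc_ns = time_per_program(1.0, cycles_per_instruction=1, instructions_per_program=40_000)

print(f"CISC: {cisc_ns:.0f} ns  RISC: {risc_ns:.0f} ns")
# Fewer cycles per instruction can win even though the RISC program needs more instructions.
```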

Page 9: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Microcode

• CISC relies on microcode for instruction complexity.

• Efficiency is limited by variable-length instructions, which slow down the decoding process.
  – This leads to a varying number of clock cycles per instruction, making pipelines difficult to implement.
• Microcode interprets each instruction as it is fetched from memory, adding a translation step.
• The more complex the instruction set, the more time it takes to look up an instruction and execute it.
  – Back to text messages: "IYKWIM" and ~(_8^(|) take longer to decode.

Page 10: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Comparison chart RISC vs. CISC on page 468

RISC is a misnomer. Presently, there are often more instructions in RISC machines than in CISC machines.

Most architectures today are based on RISC.

Page 11: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Register window sets

• Registers offer the greatest potential for performance improvement.
  – Recall that, on average, 45% of instructions in programs involve the movement of data.

• Saving registers, passing parameters, and restoring registers involve considerable effort and resources.

• High-level languages depend on modularization for efficiency; procedure calls and parameter passing are natural side effects.

Page 12: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

• Imagine all registers divided into sets (or windows); each set has a specific number of registers.

• Only one set (window) is "visible" to the processor at a time.
  – Similar in concept to variable "scope."

• Global registers – common to all windows.
• Local registers – local to the current window.
• Input registers – overlap with the preceding window's output registers.
• Output registers – overlap with the next window's input registers.
• Current window pointer (CWP) – points to the register window set to be used at any given time (see the sketch below).
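A minimal simulation sketch of the idea, assuming a hypothetical machine with eight circular windows; the register counts, spill policy, and class/method names are illustrative and do not model any particular processor's layout.

```python
# Toy model of circular register windows with a current window pointer (CWP).
# Sizes and spill behavior are assumptions for illustration only.

class RegisterWindows:
    def __init__(self, num_windows=8, regs_per_overlap=8, regs_local=8):
        self.num_windows = num_windows
        self.cwp = 0            # current window pointer
        self.depth = 0          # number of live procedure calls
        self.globals = [0] * 8  # visible to every window
        self.locals = [[0] * regs_local for _ in range(num_windows)]
        # overlap[i] is shared: the outputs of window i are the inputs of window i+1.
        self.overlap = [[0] * regs_per_overlap for _ in range(num_windows)]
        self.spills = []        # windows saved to main memory when we run out

    def ins(self):
        return self.overlap[(self.cwp - 1) % self.num_windows]

    def outs(self):
        return self.overlap[self.cwp]

    def call(self):
        """Procedure call: advance the CWP; spill the window we are about to reuse."""
        self.depth += 1
        self.cwp = (self.cwp + 1) % self.num_windows
        if self.depth >= self.num_windows:          # windows are circular
            self.spills.append(self.locals[self.cwp][:])

    def ret(self):
        """Procedure return: restore this window from memory if it had been spilled."""
        if self.depth >= self.num_windows and self.spills:
            self.locals[self.cwp] = self.spills.pop()
        self.cwp = (self.cwp - 1) % self.num_windows
        self.depth -= 1
```

Deeply nested or recursive call chains exceed the number of windows, which is exactly when the `spills` list (standing in for main memory) comes into play.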

Page 13: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Register windows have a circular nature: when a procedure returns, its window is marked as "reusable."

Recursion and deeply nested calls spill to main memory when the register windows are full.

Page 14: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Flynn’s Taxonomy

PU - Processing Unit

Considers two factors: the number of instruction streams and the number of data streams that flow into the processor.

Page 469 - 471

Page 15: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Single Instruction, Single Data stream (SISD)
Single Instruction, Multiple Data streams (SIMD)
Multiple Instruction, Single Data stream (MISD)
Multiple Instruction, Multiple Data streams (MIMD)
Single Program, Multiple Data streams (SPMD)

Page 16: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

SPMD

• Single Program, Multiple Data streams.
• Consists of multiprocessors, each with its own data set and program memory.
• The same program is executed on each processor.
• Each node can do different things at the same time.
  – If myNode = 1 do this, else do that (see the sketch below).
• Synchronization occurs at various global control points.
• Often used as "supercomputers."
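A minimal SPMD sketch, assuming the mpi4py package and an MPI runtime are installed (run with something like `mpiexec -n 4 python spmd_sketch.py`); the broadcast-and-square workload is just a placeholder for "each node works on its own data."

```python
# Every process runs this same program, but branches on its own rank ("myNode").
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()            # this node's identity
size = comm.Get_size()            # total number of nodes

if rank == 0:
    data = list(range(size))      # node 0 prepares the work
else:
    data = None                   # the others start empty-handed

data = comm.bcast(data, root=0)   # a global control point: everyone syncs here
result = data[rank] ** 2          # each node processes its own piece of the data

print(f"node {rank} of {size}: computed {result}")
comm.Barrier()                    # another global synchronization point
```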

Page 17: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Vector processors (SIMD)

• Often referred to as supercomputers.
  – The most famous are the Cray series, with little change to their basic architecture in the past 25 years.

• Vector processors are specialized, heavily pipelined processors that perform efficient operations on entire vectors and matrices at once.

• Suited for applications that benefit from a high degree of parallelism (e.g., weather forecasting, medical diagnosis, and image processing).

• Efficient for two reasons: the machine fetches significantly fewer instructions, which means less decoding, less control-unit overhead, and less memory-bandwidth usage; and the processor knows it will have a continuous source of data, so it can begin prefetching the corresponding pairs of values.
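A rough sketch of the "fetch fewer instructions" idea, using NumPy as a software stand-in for hardware vector operations; the array size is arbitrary.

```python
# One operation over whole vectors versus an explicit element-by-element loop.
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# Scalar style: conceptually one fetch-decode-execute per element.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Vector style: a single "instruction" applied to the entire operands.
c_vector = a + b

assert np.array_equal(c_scalar, c_vector)
```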

Page 18: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Vector registers – specialized registers that can hold several vector elements at one time.

Two types of vector processors: register-register vector processors and memory-memory vector processors.

Register-register vector processors:
• Require that all operations use registers as source and destination operands.
• Disadvantage: long vectors must be broken into fixed-length segments that are small enough to fit into the registers (see the strip-mining sketch below).

Memory-memory vector processors:
• Allow operands from memory to be routed directly to the ALU, with results streamed back to memory.
• Disadvantage: a large startup time due to memory latency; once the pipeline is full, the disadvantage disappears.
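A toy sketch of the fixed-length-segment requirement (often called "strip mining"), assuming a hypothetical 64-element vector register; the function name and sizes are made up for illustration.

```python
# Break a long vector into register-sized strips and process one strip at a time.
import numpy as np

VECTOR_REGISTER_LEN = 64   # assumed register length, not any specific machine's

def strip_mined_add(a, b):
    """Add two long vectors one register-sized strip at a time."""
    out = np.empty_like(a)
    for start in range(0, len(a), VECTOR_REGISTER_LEN):
        end = min(start + VECTOR_REGISTER_LEN, len(a))
        # Each slice models loading a strip into vector registers,
        # issuing one vector add, and storing the result back.
        out[start:end] = a[start:end] + b[start:end]
    return out

a = np.arange(1000.0)
b = np.ones(1000)
assert np.array_equal(strip_mined_add(a, b), a + b)
```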

Page 19: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Parallel and multiprocessor architectures

• Two major parallel architectural paradigms. Both fall under MIMD architectures, but they differ in how they use memory.
  – Symmetric multiprocessors (SMPs)
  – Massively parallel processors (MPPs)

MPP = many processors + distributed memory + communication via network
SMP = few processors + shared memory + communication via memory

MPP
• Harder to program: pieces of the program on separate CPUs must communicate with each other.
• Used when the program is easily partitioned.
• Large companies (data warehousing) frequently use this type of system.

SMP
• Easier to program.
• Suffers from a bottleneck when all processors attempt to access the same memory at the same time.

Page 20: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Multiprocessing parallel architecture is analogous to adding horses to help out with the work (horsepower).

We improve processor performance by distributing the computational load among several processors.

Parallelism results in higher throughput (data/sec), better fault tolerance, and a more attractive price/performance ratio.

Amdahl's Law – states that if two processing components run at two different speeds, the slower speed will dominate. Perfect speedup is not possible.

• "You are only as fast as your slowest part."
• Every algorithm eventually has a sequential part. Additional processors have to wait until the serial processing is complete (see the sketch below).

Parallelism is not a "magic" solution for improving speed. Some algorithms/programs have more sequential processing, and for them it is less cost effective to employ a multiprocessing parallel architecture (e.g., processing an individual bank transaction gains little, whereas processing the transactions of all bank customers may benefit greatly).
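A small Python rendering of Amdahl's Law; the parallel fractions and processor counts below are illustrative.

```python
# Amdahl's law: with fraction p of a program parallelizable and n processors,
# speedup = 1 / ((1 - p) + p / n). The serial part (1 - p) dominates as n grows.

def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speedup when a fraction p of the work can use n processors."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.50, 0.90, 0.99):
    print(f"p = {p:.2f}:",
          ", ".join(f"{n} CPUs -> {amdahl_speedup(p, n):.1f}x" for n in (2, 8, 64, 1024)))
# Even with 1024 processors, a 10% serial part caps the speedup below 10x:
# you are only as fast as your slowest (serial) part.
```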

Page 21: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Instruction-level parallelism (ILP)

• Superscalar vs. very long instruction word (VLIW).
• Superscalar – a design methodology that allows multiple instructions to be executed simultaneously in each cycle.
  – Achieves speedup in much the same way as adding another lane to a busy single-lane highway.
  – Exhibits parallelism through pipelining and replication.
• The added "highway lanes" are called execution units.
  – Execution units consist of floating-point adders, multipliers, and other specialized components.
  – It is not uncommon to have these units duplicated.
  – The units are pipelined.
• Pipelining divides the fetch-decode-execute cycle into stages, so that a set of instructions can be in different stages at the same time.

Page 22: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

• Superpipelining is when a pipeline has stages that require less than half a clock cycle to execute.

• Accomplished by adding an internal clock that runs at double the speed of the external clock, allowing two tasks to complete per external clock cycle.

• Instruction fetch – a component that can retrieve multiple instructions simultaneously from memory.
• Decoding unit – determines whether the instructions are independent (and can thus be executed simultaneously); a toy version of this check is sketched below.
• Superscalar processors rely on both the hardware and the compiler to generate approximate schedules that make the best use of the machine's resources.
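A toy version of the independence check a decoding unit performs; the three-address instruction format, register names, and helper names are assumptions for illustration only.

```python
# Two instructions can issue together only if neither reads or writes a register
# that the other writes (no RAW, WAR, or WAW hazard between them).
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]

def independent(i1: Instr, i2: Instr) -> bool:
    """True if i1 and i2 have no register dependency and may issue in the same cycle."""
    raw = i1.dest in i2.srcs          # i2 reads what i1 writes
    war = i2.dest in i1.srcs          # i2 writes what i1 reads
    waw = i1.dest == i2.dest          # both write the same register
    return not (raw or war or waw)

a = Instr("add", "r1", ("r2", "r3"))
b = Instr("mul", "r4", ("r5", "r6"))
c = Instr("sub", "r7", ("r1", "r5"))   # reads r1, which a writes

print(independent(a, b))  # True  -> can be issued together
print(independent(a, c))  # False -> RAW hazard on r1, must wait
```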

Page 23: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

VLIW
• Relies entirely on the compiler for scheduling of operations.
• Packs independent instructions into one long instruction word.
• Because the schedule is fixed at compile time, a change such as a different memory latency requires recompiling the code.
• Can also lead to significant increases in the amount of code generated.
• Intel's Itanium IA-64 is an example of a VLIW processor.
  – Uses an EPIC style of VLIW.
  – Difference: it bundles its instructions in various lengths and uses a special delimiter to indicate where one bundle ends and another begins.
  – Instruction words are prefetched by hardware; instructions within a bundle are executed in parallel with no concern for ordering (a bundling sketch follows below).
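A compile-time-flavored sketch of packing independent instructions into fixed-width bundles; the three-slot bundle width is an assumption, and the `independent` predicate can be the one from the previous sketch.

```python
# Greedy, in-order packing: an instruction joins the current bundle only if it is
# independent of everything already in it; otherwise a new bundle starts.
BUNDLE_WIDTH = 3   # assumed number of slots per long instruction word

def pack_bundles(instrs, independent):
    """Return a list of bundles, each a list of mutually independent instructions."""
    bundles = [[]]
    for instr in instrs:
        current = bundles[-1]
        if len(current) < BUNDLE_WIDTH and all(independent(prev, instr) for prev in current):
            current.append(instr)
        else:
            bundles.append([instr])
    return bundles

# With the Instr/independent definitions from the earlier sketch:
#   pack_bundles([a, b, c], independent) -> [[a, b], [c]]
# a and b share a bundle; c depends on a, so it starts the next bundle.
```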

Page 24: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Interconnection Networks

• Each processor has its own memory, but processors are allowed to access each other's memories via the network.

• Network topology is a factor in the overhead cost of message passing. Factors in message-passing efficiency:
  – Bandwidth
  – Message latency
  – Transport latency
  – Overhead

• Static networks vs. dynamic networks
  – Dynamic networks allow the path between two entities to change from one communication to the next; static networks do not.

Page 25: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)
Page 26: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Dynamic networks allow for dynamic configuration, either bus-based or switch-based.

Bus-based networks are the simplest and most cost efficient when the number of entities is moderate.

Their main disadvantage is that a bottleneck can occur. Parallel buses can remove this issue, but the cost is considerable.

Crossbar switch

2 X 2 switch

Page 27: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Omega Network

Trade-off chart of various networks.

An example of a multistage network, built using 2 × 2 switches (a destination-tag routing sketch follows below).
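A sketch of destination-tag routing through an omega network of 2 × 2 switches; it models only the routing rule (a perfect shuffle between stages, then steering on the current destination bit), not any particular machine, and the function name is made up for illustration.

```python
# At each of the log2(n) stages, the next destination bit (MSB first) picks the
# upper (0) or lower (1) output of the 2x2 switch the message has arrived at.

def omega_route(src: int, dst: int, n: int):
    """Return the (stage, switch, output-port) hops from src to dst; n must be a power of 2."""
    k = n.bit_length() - 1          # number of stages = log2(n)
    line = src                      # which of the n lines the message is currently on
    path = []
    for stage in range(k):
        # Perfect shuffle between stages: rotate the k-bit line number left by one.
        line = ((line << 1) | (line >> (k - 1))) & (n - 1)
        bit = (dst >> (k - 1 - stage)) & 1   # destination bit steers the switch
        line = (line & ~1) | bit             # 0 -> upper output, 1 -> lower output
        path.append((stage, line >> 1, bit))
    assert line == dst               # after k stages the message is on the destination line
    return path

print(omega_route(src=5, dst=2, n=8))
```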

Page 28: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Shared memory processorsDoesn’t mean all processors must share one large memory, each processor can have a local memory, but it must be shared with other processors.

Page 29: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Shared-memory MIMD machines fall into two categories based on how they synchronize their memory operations.

Uniform Memory Access (UMA) – all memory accesses take the same amount of time. A pool of shared memory is connected to a group of processors through a bus or switch network.

Nonuniform Memory Access (NUMA) – memory access time is not consistent across the machine's address space.

• Leads to cache coherence problems (race conditions).
• Snoopy cache controllers can monitor the caches on all processors; such systems are called cache-coherent NUMA (CC-NUMA).
• Various cache update protocols can be used (a small sketch of two of them follows this list):
  – write-through
  – write-through with update
  – write-through with invalidation
  – write-back
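A toy contrast of two of the update policies above, write-through versus write-back, on a single direct-mapped cache; it is not a coherence protocol, just a count of the main-memory traffic each policy generates, and the class layout is an assumption for illustration.

```python
# Write-through: every store also updates main memory immediately.
# Write-back: stores mark the block dirty; memory is updated only on eviction.

class Cache:
    def __init__(self, num_blocks=4, policy="write-back"):
        self.policy = policy
        self.num_blocks = num_blocks
        self.blocks = {}          # index -> (tag, value, dirty)
        self.memory = {}          # address -> value
        self.memory_writes = 0    # traffic to main memory

    def write(self, address, value):
        index, tag = address % self.num_blocks, address // self.num_blocks
        old = self.blocks.get(index)
        if old and old[0] != tag and old[2]:          # evicting a dirty block
            self.memory[old[0] * self.num_blocks + index] = old[1]
            self.memory_writes += 1
        if self.policy == "write-through":
            self.memory[address] = value              # memory updated on every store
            self.memory_writes += 1
            self.blocks[index] = (tag, value, False)
        else:                                         # write-back: defer the update
            self.blocks[index] = (tag, value, True)

wt, wb = Cache(policy="write-through"), Cache(policy="write-back")
for _ in range(100):
    wt.write(0, 42)
    wb.write(0, 42)
print(wt.memory_writes, wb.memory_writes)   # 100 vs 0: write-back saves bus traffic
```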

Page 30: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

Distributed Systems

• Loosely coupled distributed computers that depend on a network for communication among processors to solve a problem.

• Cluster computing – NOWs, COWs, DCPCs, and PoPCs; all resources are within the same administrative domain, working on group tasks.
  – You can build your own cluster using the open-source Beowulf project.

• Public-resource computing, or global computing – grid computing where computing power is supplied by volunteers through the Internet. A very cheap source of computing power.

Page 31: Alternative Architectures Christopher Trinh CIS Fall 2009 Chapter 9 (pg 461 – 486)

SETI@Home project – analyzes radio telescope data to determine whether there is intelligent life out there (think the movie "Contact").

Folding@Home project - designed to perform computationally intensive simulations of protein folding and other molecular dynamics (MD), and to improve on the methods available to do so.

Folding@Home has reached 7.87 petaFLOPS (one petaFLOPS is 10^15 floating-point operations per second) and was the first computing project of any kind to cross the four-petaFLOPS milestone. This level of performance is primarily enabled by the cumulative effort of a vast array of PlayStation 3 consoles and powerful GPUs.