
Page 1:

EE3004 (EE3.cma) - Computer Architecture

Roger Webb

[email protected]

University of Surrey
http://www.ee.surrey.ac.uk/Personal/R.Webb/l3a15

also linked from the Teaching/Course page

Page 2:

Introduction - Book List

Computer Architecture - Design & PerformanceBarry Wilkinson, Prentice-Hall 1996(nearest to course)Advanced Computer ArchitectureRichard Y. Kain, Prentice-Hall 1996(good for multiprocessing + chips + memory)Computer ArchitectureBehrooz Parhami, Oxford Univ Press, 2005(good for advanced architecture and Basics)Computer ArchitectureDowsing & Woodhouse(good for putting the bits together..)Microprocessors & Microcomputers - Hardware & SoftwareAmbosio & Lastowski(good for DRAM, SRAM timing diagrams etc.)Computer Architecture & DesignVan de Goor(for basic Computer Architecture)

Wikipedia is as good as anything...!

Page 3:

Introduction - Outline Syllabus

Memory Topics
• Memory Devices
• Interfacing/Graphics
• Virtual Memory
• Caches & Hierarchies
Instruction Sets
• Properties & Characteristics
• Examples
• RISC v CISC
• Pipelining & Concurrency
Parallel Architectures
• Performance Characteristics
• SIMD (vector) processors
• MIMD (message-passing)
• Principles & Algorithms

Page 4:

Computer Architectures - an overview

What are computers used for?

3 ranges of product cover the majority of processor sales:

• Appliances (consumer electronics)

• Communications Equipment

• Utilities (conventional computer systems)

Page 5:

Computer Architectures - an overview
Consumer Electronics
This category covers a huge range of processor performance.
• Micro-controlled appliances
  – washing machines, time switches, lamp dimmers
  – the lower end, characterised by:
    • low processing requirements
    • a microprocessor replacing logic, in a small package
    • low power requirements
• Higher-performance applications
  – mobile phones, printers, fax machines, cameras, games consoles, GPS, TV set-top boxes, video/DVD/HD recorders…
  – characterised by:
    • high bandwidth - 64-bit data bus
    • low power - to avoid cooling
    • low cost - < $20 for the processor
    • small amounts of software - small cache (tight program loops)

Page 6:

Computer Architectures - an overview
Communications Equipment
This has become the major market - WWW, mobile comms.
• The main products containing powerful processors are:
  – LAN products - bridges, routers, controllers in computers
  – ATM exchanges
  – satellite & cable TV routing and switching
  – telephone networks (all-digital)
• The main characteristics of these devices are:
  – standardised applications (IEEE, CCITT etc.) - which means competitive markets
  – high-bandwidth interconnections
  – wide processor buses - 32 or 64 bits
  – multi-processing (either per-box, or in the distributed computing sense)

Page 7:

Computer Architectures - an overview
Utilities (Conventional Computer Systems)
Large-scale computing devices will, to some extent, be replaced by greater processing power on the desktop.
• But some centralised facilities are still required, especially where data storage is concerned:
  – general-purpose computer servers; supercomputers
  – database servers - often safer to maintain a central corporate database
  – file and printer servers - again simpler to maintain centrally
  – video-on-demand servers
• These applications are characterised by huge memory requirements and:
  – large operating systems
  – high sustained performance over wide workload variations
  – scalability - as the workload increases
  – 64-bit (or greater) data paths, multiprocessing, large caches

Page 8:

Computer Architectures - an overview
Computer System Performance
• Most manufacturers quote the performance of their processors in terms of the peak rate - MIPS (MOPS) or MFLOPS.
• Most of the applications above depend on a continuous supply of data or results - especially for video images.
• Thus the critical criterion is the sustained throughput of instructions:
  – the MPEG image decompression algorithm requires 1 billion operations per second for full-quality widescreen TV
  – less demanding VHS quality requires 2.7 Mb per second of compressed data
  – interactive simulations (games etc.) must respond to a user input within 100 ms - re-computing and displaying the new image
• Important measures are:
  – MIPS per dollar
  – MIPS per Watt

Page 9:

Computer Architectures - an overview
User Interactions
Consider how we interact with our computers:

[Graph: % of CPU time spent managing interaction, rising from near 0% towards 100% between 1955 and 2005, through the eras Lights & Switches; Punched Card & Tape; Timesharing; Menus, Forms; WYSIWYG, Mice, Windows; Virtual Reality, Cyberspace]

What does a typical CPU do?
70% - user interface; I/O processing
20% - network interface; protocols
9% - operating system; system calls
1% - user application

Page 10:

Computer Architectures - an overview
Sequential Processor Efficiency

The current state of the art in large microprocessors includes:

• 64-bit memory words, using interleaved memory

• Pipelined instructions

• Multiple functional units (integer, floating point, memory fetch/store)

• 5 GHz practical maximum clock speed

• Multiple processors

• Instruction set organised for simple decoding (RISC?)

However as word length increases, efficiency may drop:

• many operands are small (16 bits is enough for many VR tasks)
• many literals are small - loading 00…00101 as 64 bits is a waste

• may be worth operating on several literals per word in parallel

Page 11:

Computer Architectures - an overview
Example - reducing the number of instructions

Perform a 3D transformation of a point (x, y, z) by multiplying the 4-element row vector (x, y, z, 1) by a 4x4 transformation matrix A. All operands are 16 bits long.

              | a b c d |
  (x y z 1) x | e f g h |  =  (x' y' z' r)
              | i j k l |
              | m n o p |

Conventionally this requires 20 loads, 16 multiplies, 12 adds and 4 stores, using 16-bit operands on a 16-bit CPU.

On a 64-bit CPU with instructions dealing with groups of four parallel 16-bit operands, as well as a modest amount of pipelining, all this can take just 7 processor cycles.
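As a concrete illustration, here is a minimal scalar C sketch of the transform (not from the slides); the comments map the work onto the operation counts quoted above:

```c
#include <stdint.h>

/* Multiply the row vector (x, y, z, 1) by a 4x4 matrix A of 16-bit
 * coefficients.  Done conventionally this costs 20 loads (4 vector
 * elements + 16 matrix elements), 16 multiplies, 12 adds and 4 stores,
 * matching the counts quoted above.  A 64-bit CPU with packed 16-bit
 * instructions can fetch a whole matrix row as one 64-bit word and do
 * the four lane multiplies in a single instruction. */
void transform_point(const int16_t v[4], const int16_t A[4][4],
                     int16_t out[4])
{
    for (int col = 0; col < 4; col++) {   /* one output element per column */
        int32_t acc = 0;                  /* widened to avoid overflow     */
        for (int row = 0; row < 4; row++)
            acc += (int32_t)v[row] * A[row][col];   /* 16 muls, 12 adds    */
        out[col] = (int16_t)acc;          /* 4 stores                      */
    }
}
```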

Page 12:

Computer Architectures - an overview
The Effect of Processor Intercommunication Latency

In a multiprocessor, and even in a uniprocessor, the delays associated with communicating and fetching data (latency) can dominate the processing times.

Consider:

[Diagram: a symmetrical multiprocessor - several CPU/memory pairs joined by an interconnection network - alongside a uniprocessor consisting of a single CPU, cache and memory]

Delays can be minimised by placing components closer together and:

• Add caches to provide local data storage

• Hide latency by multi-tasking - needs fast context switching

• Interleave streams of independent instructions - scheduling

• Run groups of independent instructions together (each ending with a long-latency instruction)

Page 13:

Computer Architectures - an overview
Memory Efficiency
Quote from the 1980s: "Memory is free."
By the 2000s the cost per bit is no longer falling so fast, and the consumer electronics market is becoming cost sensitive. Hence there is renewed interest in compact instruction sets and data compactness - both ideas from the 1960s and 1970s.

1977 - £3000/MB; 1994 - £4/MB; now - <1p/MB

Instruction Compactness
RISC CPUs have a simple register-based instruction encoding:
• can lead to code bloat - as can poor coding and compiler design
• compactness gets worse as the word size increases
e.g. the INMOS (1980s) transputer had a stack-based register scheme:
• it needed 60% of the code of an equivalent register-based CPU
• this led to smaller cache needs for instruction fetches & data

Page 14:

Computer Architectures - an overview

Cache Efficiency

• Designer should aim to optimise the instruction performance whilst using the smallest cache possible

• Hiding latency (using parallelism & instruction scheduling) is an effective alternative to minimising it (by using large caches)

• Instruction scheduling can initiate cache pre-fetches

• Switch to another thread if the cache is not ready to supply data for the current one

• In video and audio processing, especially, unroll the inner code loops – loop unrolling (more on that later)

Page 15:

Computer Architectures - an overview

Predictable Codes

In many applications (e.g. video and audio processing) much is known about the code which will be executed. Techniques which are suitable for these circumstances include:

• Partition the cache separately for code and different data structures

• The cache requirements of the inner code loops can be pre-determined, so cache usage can be optimised

• Control the amounts of a data structure which are cached

• Prevent interference between threads by careful scheduling

• Notice that a conventional cache’s contents are destroyed by a single block copy instruction

Page 16:

Computer Architectures - an overview

Processor Engineering Issues
• Power consumption must be minimised (to simplify on-chip and in-box cooling):
  – use low-voltage processors (2 V instead of 3.3 V)
  – don't over-clock the processor
  – design logic carefully to avoid propagation of redundant signals
  – tolerance of latency allows lower-performance (cheaper) subsystems to be used
  – explicit subsystem control allows subsystems to be powered down when not in use
  – eliminate redundant actions - e.g. speculative pre-fetching
  – provide non-busy synchronisation to avoid the need for spin-locks
• Battery design is advancing slowly - power stored per unit weight or volume will quadruple (over NiCd) within 5-10 years

Page 17:

Computer Architectures - an overview

Processor Engineering Issues
• Speed to market is increasing, so processor design is becoming critical. Consider the time for several common devices to become established:
  – 70 years: telephone (0% to 60% of households)
  – 40 years: cable television
  – 20 years: personal computer
  – 10 years: video recorders
  – <10 years: web-based video
• Modularity and common processor cores provide design flexibility:
  – reusable cache and CPU cores
  – product-specific interfaces and co-processors
  – common connection schemes

Page 18:

Computer Architectures - an overview

Interconnect Schemes

Wide data buses are a problem:

• They are difficult to route on printed circuit boards

• They require huge numbers of processor and memory pins (expensive to manufacture on chips and PCBs)

• Clocking must accommodate the slowest bus wire.

• Parallel back-planes add to loading and capacitance, slowing signals further and increasing power consumption

Serial chip interconnects offer 1Gbit/s performance using just a few pins and wires. Can we use a packet routing chip as a back-plane?

• Processors, memories, graphic devices, networks, slow external interfaces all joined to a central switch


Page 20:

Memory Devices

Regardless of the scale of the computer, the memory is similar.

Two major types:
• static
• dynamic

Larger memories get cheaper as production increases and smaller memories get more expensive - you pay more for less!

See:

http://www.educypedia.be/computer/memoryram.htm

http://www.kingston.com/tools/umg/default.asp

http://www.ahinc.com/hhmemory.htm

Page 21:

Memory Devices
Static Memories

• made from static logic elements - an array of flip-flops

• don’t lose their stored contents until clocked again

• may be driven as slowly as needed - useful for single stepping a processor

• Any location may be read or written independently

• Reading does not require a re-write afterwards

• Writing data does not require the row containing it to be pre-read

• No housekeeping actions are needed

• The address lines are usually all supplied at the same time

• Fast - 15ns was possible in Bipolar and 4-15ns in CMOS

Not used anymore – too much power for little gain in speed

Page 22:

Memory Devices

[Block diagram: HM6264 - 8K*8 static RAM organisation. A 256x256 memory matrix is addressed by a row decoder (A0…A7) and a column decoder (the remaining address lines) with column I/O; an input data control block and a timing pulse generator are driven by the read/write control; pins: CS1, CS2, WE, OE, I/O0…I/O7, Vcc, Gnd]

Page 23:

Memory Devices
HM6264 - 8K*8 static RAM organisation: HM6264 Read Cycle

Item                                        Symbol  min  max  Unit
Read Cycle Time                             tRC     100  -    ns
Address Access Time                         tAA     -    100  ns
Chip Selection to Output: CS1               tCO1    -    100  ns
Chip Selection to Output: CS2               tCO2    -    100  ns
Output Enable to Output Valid               tOE     -    50   ns
Chip Selection to Output in Low Z: CS1      tLZ1    10   -    ns
Chip Selection to Output in Low Z: CS2      tLZ2    10   -    ns
Output Enable to Output in Low Z            tOLZ    5    -    ns
Chip Deselection to Output in High Z: CS1   tHZ1    0    35   ns
Chip Deselection to Output in High Z: CS2   tHZ2    0    35   ns
Output Disable to Output in High Z          tOHZ    0    35   ns
Output Hold from Address Change             tOH     10   -    ns

[Timing diagram: read cycle - Address, CS1, CS2, OE and Dout waveforms, annotated with tRC, tAA, tCO1, tCO2, tLZ1, tLZ2, tOE, tOLZ, tHZ1, tHZ2, tOHZ and tOH; Dout becomes Data Valid within the cycle]

Page 24:

Memory Devices
HM6264 - 8K*8 static RAM organisation: HM6264 Write Cycle

Item                                  Symbol  min  max  Unit
Write Cycle Time                      tWC     100  -    ns
Chip Selection to End of Write        tCW     80   -    ns
Address Set-up Time                   tAS     0    -    ns
Address Valid to End of Write         tAW     80   -    ns
Write Pulse Width                     tWP     60   -    ns
Write Recovery Time: CS1, WE          tWR1    5    -    ns
Write Recovery Time: CS2              tWR2    15   -    ns
Write to Output in High Z             tWHZ    0    35   ns
Data to Write Time Overlap            tDW     40   -    ns
Data Hold from Write Time             tDH     0    -    ns
Output Enable to Output in High Z     tOHZ    0    35   ns
Output Active from End of Write       tOW     5    -    ns

[Timing diagram: write cycle - Address, CS1, CS2, OE, WE, Din and Dout waveforms, annotated with tWC, tCW, tWR1, tWR2, tAW, tAS, tWP, tDW, tDH and tOHZ; the point at which data is sampled by the memory is marked at the end of the write pulse]

Page 25:

Memory Devices
Dynamic Memories

• information stored on a capacitor - discharges with time

• only one transistor is required per cell - compared with 6 for SRAM

• must be refreshed (0.1-0.01 pF needs refresh every 2-8ms)

• memory cells are organised so that cells can be refreshed a row at a time to minimise the time taken

• row and column organisation lends itself to multiplexed row and column addresses - fewer pins on chip

• Use RAS and CAS to latch row and column addresses sequentially

• DRAM consumes high currents when switching transistors (1024 columns at a time). Can cause nasty voltage transients

Page 26:

Memory Devices

[Block diagram: HM50464 - 64K*4 dynamic RAM organisation. Multiplexed address pins Ai are latched into X and Y address buffers by the RAS and CAS clocks; X and Y decoders select cells in four 256x256 memory arrays; a refresh address counter supplies row addresses during refresh; the WE and OE clocks control the input and output buffers and the R/W switch for data pins I/O1-4. Each dynamic memory cell is a single transistor, gated by the row select line, connecting a storage capacitor to the bit line]

Page 27:

Memory Devices
HM50464 - 64K*4 dynamic RAM organisation: HM50464 Read Cycle

[Timing diagram: read cycle - RAS falls with the row address on the address pins, then CAS falls with the column address; WRITE stays high, and the IO pins drive valid output while OE is low]

Read Cycle
A dynamic memory read operation proceeds as follows:
• The cycle starts by setting all bit lines (columns) to a suitable sense voltage - pre-charging.
• The required row address is applied and RAS (row address strobe) is asserted.
• The selected row is decoded and its transistors open (one per column). This dumps the capacitors' charge into sense amplifiers (high-gain feedback amplifiers), which recharge the capacitors - RAS must remain low throughout.
• Simultaneously the column address is applied and CAS is set. It is decoded, and the requested bits are gated to the output - driven off-chip when OE is active.

Page 28:

Memory Devices
HM50464 - 64K*4 dynamic RAM organisation: HM50464 Early Write Cycle

[Timing diagram: early write cycle - RAS falls with the row address, CAS falls with the column address, WRITE falls before CAS, and valid input data is latched from the IO pins]

Similar to the read cycle, except that the fall of the WRITE signal marks the time at which input data is latched.

During an "Early Write" cycle the WRITE line falls before CAS - this ensures that the memory device keeps its data outputs disabled (otherwise, when CAS goes low, they could output data!).

Alternatively, in a "Late Write" cycle the sequence is reversed and the OE line is kept high - this can be useful in common address/data bus architectures.

Page 29:

Memory Devices
HM50464 - 64K*4 dynamic RAM organisation

Refresh Cycle
For a refresh, no output is needed. A read, with a valid RAS and row address, pulls the data out; all we need to do is put it back again by de-asserting RAS.

This needs to be repeated for all 256 rows (on the HM50464) every 4 ms. There is an on-chip counter which can be used to generate refresh addresses.
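A small C sketch (not from the slides) of what that refresh duty costs, assuming an illustrative 200 ns cycle per row refresh:

```c
#include <stdio.h>

int main(void)
{
    const double refresh_period_s = 4e-3;   /* all rows every 4 ms (HM50464) */
    const int    rows             = 256;
    const double cycle_time_s     = 200e-9; /* assumed cycle per row refresh */

    /* Distributed refresh: one row every refresh_period / rows. */
    double per_row_s = refresh_period_s / rows;            /* 15.6 us        */
    double overhead  = (rows * cycle_time_s) / refresh_period_s;

    printf("one row refreshed every %.1f us\n", per_row_s * 1e6);
    printf("refresh uses %.2f%% of memory bandwidth\n", overhead * 100.0);
    return 0;
}
```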

Page Mode Access ["Fast Page Mode DRAM"] - standard DRAM

The RAS cycle time is relatively long, so optimisations have been made for common access patterns.

The row address is supplied just once and latched with RAS. Then column addresses are supplied and latched using CAS, and data is read or written using WRITE or OE. CAS and the column address can then be cycled to access further bits in the same row. The cycle ends when RAS goes high again.

Care must be taken to continue to refresh the other rows of the memory at the specified rate if needed.
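A toy C model (not from the slides) of the page-mode access pattern - one RAS, many CAS cycles; the "chip" here is just an array standing in for the real device:

```c
#include <stdint.h>
#include <stdio.h>

/* A toy model of fast-page-mode access: the latched row plays the part
 * of the open page.  The shape of the loop is the point. */
static uint8_t cell[256][256];
static uint8_t open_row;

static void ras_fall(uint8_t row)     { open_row = row; }          /* latch row */
static uint8_t cas_cycle(uint8_t col) { return cell[open_row][col]; }

void fpm_read(uint8_t row, const uint8_t *cols, uint8_t *out, int n)
{
    ras_fall(row);                    /* RAS asserted once, stays low   */
    for (int i = 0; i < n; i++)
        out[i] = cas_cycle(cols[i]);  /* CAS cycles within the same row */
    /* RAS rises here: end of the page-mode cycle. */
}

int main(void)
{
    cell[5][10] = 42;
    uint8_t cols[2] = {10, 11}, out[2];
    fpm_read(5, cols, out, 2);
    printf("%u %u\n", out[0], out[1]);
    return 0;
}
```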

Page 30:

Memory Devices
HM50464 - 64K*4 dynamic RAM organisation

[Timing diagram: page-mode DRAM access - RAS falls once with the row address, then CAS cycles with successive column addresses, producing a data word on each CAS cycle; nibble and static column modes are similar]

Nibble Mode
Rather than supplying the second and subsequent column addresses, they can be calculated by incrementing the initial address - the first column address is stored in a register when CAS goes low, then incremented and used on the next low CAS transition. Less common than Page Mode.

Static Column Mode
Column addresses are treated statically: when CAS is low the outputs are read, provided OE is low as well. If the column address changes, the outputs change (after a propagation delay). The frequency of address changes can be higher as there is no need for an inactive CAS time.

Page 31:

Memory Devices
HM50464 - 64K*4 dynamic RAM organisation

[Timing diagram: extended-data-out DRAM access - RAS falls once with the row address; CAS cycles with successive column addresses while OE controls the data outputs, so data stays on the bus across CAS transitions]

Extended Data Out Mode ("EDO DRAM")
EDO DRAM is very similar to page-mode access, except that the data bus outputs are controlled exclusively by the OE line. CAS can therefore be taken high and low again without the data from the previous word being removed from the data bus - so data can be latched by the processor whilst the new column address is being latched by the memory. Overall cycle times can be shortened.

Page 32:

Memory Devices
HM50464 - 64K*4 dynamic RAM organisation

[Timing diagram: simplified SDRAM burst read - on successive clock edges the command bus carries Activate (with bank and row address), NOPs, Read (with the column number), more NOPs, then Precharge and the next Activate; after a 3-cycle latency the IO pins deliver a 4-word burst D0 D1 D2 D3]

Synchronous DRAM ("SDRAM")
Instead of asynchronous control signals, SDRAMs accept one command in each clock cycle. The different stages of an access are initiated by separate commands - initial row address (activate), reading, etc. - all pipelined, so that a read might not return a word for 2 or 3 cycles.

Bursts of accesses to sequential words within a row may be requested by issuing a burst-length command. Subsequent read or write requests then operate in units of the burst length.

Page 33:

Memory Devices
Summary - DRAMs
• A whole row of the memory array must be read at a time.
• After reading, the data must be re-written.
• Writing requires the data to be read first (the whole row has to be stored even if only a few bits are changed).
• Cycle time is a lot slower than static RAM.
• Address lines are multiplexed - saves package pin count.
• The fastest DRAM commonly available has an access time of ~60 ns but a cycle time of 121 ns.
• DRAMs consume more current.
• SDRAMs replace the asynchronous control mechanisms.

Cycles required to access four sequential words:

Memory Type       Word 1  Word 2  Word 3  Word 4
DRAM              5       5       5       5
Page-Mode DRAM    5       3       3       3
EDO DRAM          5       2       2       2
SDRAM             5       1       1       1
SRAM              2       1       1       1
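A quick C sketch (not from the slides) turning that table into effective throughput for a 4-word burst:

```c
#include <stdio.h>

int main(void)
{
    const char *name[] = { "DRAM", "Page-Mode DRAM", "EDO DRAM",
                           "SDRAM", "SRAM" };
    const int cycles[][4] = {            /* per-word cycles, from the table */
        {5,5,5,5}, {5,3,3,3}, {5,2,2,2}, {5,1,1,1}, {2,1,1,1}
    };

    for (int m = 0; m < 5; m++) {
        int total = 0;
        for (int w = 0; w < 4; w++)
            total += cycles[m][w];
        /* Effective throughput for the 4-word burst, in words per cycle. */
        printf("%-15s %2d cycles  %.2f words/cycle\n",
               name[m], total, 4.0 / total);
    }
    return 0;
}
```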


Page 35:

Memory Interfacing

Interfacing

Most processors rely on external memory

The unit of access is a word carried along the Data Bus

Ignoring caching and virtual memory, all memory belongs to a single address space.

Addresses are passed on the Address Bus

Hardware devices may respond to particular addresses - Memory Mapped devices

External memory is a collection of memory chips.

All memory devices are joined to the same data bus

Main purpose of the addressing logic is to ensure only one memory device is activated during each cycle

Page 36:

Memory Interfacing

Interfacing

The Data Bus has n lines - n = 8, 16, 32 or 64.

The Address Bus has m lines - m = 16, 20, 24, 32 or 64 - providing 2^m words of addressable memory.

The Address Bus is used at the beginning of a cycle and the Data Bus at the end. It is therefore possible to multiplex (in time) the two buses.

This can create all sorts of timing complications, but the benefit - a reduced processor pin count - makes it relatively common.

Processor must tell memory subsystem what to do and when to do it

Can do this either synchronously or asynchronously

Page 37:

Memory Interfacing

Interfacing

Synchronous:
• the processor defines the duration of a memory cycle
• control lines are provided for the beginning and end of the cycle
• the most conventional approach
• the durations and relationships might be determined at boot time (available in the 1980s on the INMOS transputer)

Asynchronous:
• the processor starts the cycle; the memory signals the end of the cycle
• error recovery is needed in case non-existent memory is accessed (Bus Error)

Page 38:

Memory Interfacing

Interfacing

Synchronous memory scheme control signals:
• Memory System Active
  – goes active when the processor is accessing external memory
  – used to enable the address decoding logic, which provides one active chip select to a group of chips
• Read Memory
  – says the processor is not driving the data bus
  – the selected memory can return data to the data bus
  – usually connected to the output enable (OE) of the memory

Page 39:

Memory Interfacing
Interfacing
Synchronous memory scheme control signals (cont'd):
• Memory Write
  – indicates the data bus contains data which the selected memory device should store
  – different processors use the leading or trailing edge of the signal to latch data into memory
  – processors with a data bus wider than 8 bits have a separate memory-write-byte signal for each byte of data
  – memory write lines are connected to the write lines of the memories
• Address Latch Enable (in multiplexed-address machines)
  – tells the addressing logic when to take a copy of the address from the multiplexed bus, so the processor can use the bus for data later
• Memory Wait
  – causes the processor to extend the memory cycle
  – allows fast and slow memories to be used together without loss of speed

Page 40:

Memory Interfacing

Address Blocks
How do we place blocks of memory within the address space of our processor?
Two methods of addressing memory:
• Byte addressing
  – each byte has its own address
  – good for 8-bit processors and graphics systems
  – but what if memory is 16 or 32 bits wide?
• Word addressing (see the sketch below)
  – only the address lines which number individual words are used
  – these select a multi-byte word
  – extra byte-address bits are retained in the processor to manipulate individual bytes, or write-byte control signals are used
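A minimal C sketch (not from the slides) of the word-addressing split for a 32-bit-wide memory; the address is an arbitrary example:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t byte_addr = 0x0002A7;          /* arbitrary example address */

    /* With a 32-bit (4-byte) wide memory, the two least significant
     * address bits select the byte lane and the rest select the word. */
    uint32_t word_addr = byte_addr >> 2;    /* goes out on A2..A31       */
    uint32_t byte_lane = byte_addr & 0x3;   /* kept inside the CPU, or   */
                                            /* drives write-byte lines   */
    printf("word 0x%06X, byte lane %u\n", word_addr, byte_lane);
    return 0;
}
```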

Page 41:

Memory Interfacing

Address Blocks
How do we place blocks of memory within the address space of our processor?
Often we want different blocks of memory:
• Particular addresses might be special:
  – memory-mapped I/O ports
  – the location executed first after a reset
  – fast on-chip memory
  – diagnostic or test locations
• We also want:
  – SRAM and/or DRAM in one contiguous block
  – memory-mapped graphics screen memory
  – ROM for booting and low-level system operation
  – extra locations for peripheral controller registers

Page 42:

Memory Interfacing

Address Blocks
How do we place blocks of memory within the address space of our processor?
• Each memory block might be built from individual memory chips:
  – address and control lines wired in parallel
  – data lines brought out separately to provide an n-bit word
• Fit all the blocks together in the overall address map:
  – it is easier to place similar-sized blocks (2^k words each) next to each other, so that they can be combined into a 2^(k+1)-word area
  – jumbling blocks of various sizes complicates address decoding
  – if contiguous blocks are not needed, place them at major power-of-2 boundaries - e.g. put the base of SRAM at 0, ROM half way up, and the lowest memory-mapped peripheral at 7/8ths

Page 43:

Memory Interfacing
Address Decoding
The address decoding logic determines which memory device to enable, depending upon the address.
• If each memory area stores a contiguous block of 2^k words:
  – all memory devices in that area will have k address lines
  – connected (normally) to the k least-significant lines
  – the remaining m-k lines are examined to see if they match the most-significant part of the address of each area
Three schemes are possible:
• Full decoding - unique decoding
  – all m-k bits are compared with exact values to make up the full address of that block
  – only one block can become active

Page 44:

Memory Interfacing
Address Decoding
Three schemes possible (cont'd):
• Partial decoding
  – only some of the m-k lines are decoded, so that a number of blocks of addresses will cause a particular chip select to become active
  – e.g. ignoring one line will mean the same memory device is accessible at two places in the memory map
  – makes decoding simpler
• Non-unique decoding
  – connect a different one of the m-k lines directly to the active-low chip select of each memory block
  – a memory block can be activated by referencing that line
  – no extra logic needed
  – BUT two blocks can be accessed at once this way…


Page 46:

Memory Interfacing

Address Decoding - Example

A processor has a 32-bit data bus. It provides a separate 30-bit word-addressed address bus, labelled A2 to A31 (the labels follow byte addressing, with A0 and A1 reserved as byte-select bits). It is desired to connect two banks of SRAM (each built up from 128K*8 devices) and one bank of DRAM, built from 1M*4 devices, to this processor. The SRAM banks should start at the bottom of the address map, and the DRAM bank should be contiguous with the SRAM. Specify the address map and design the decoding logic.

Page 47:

Memory Interfacing
Address Decoding - Example
Each bank of SRAMs will require 4 devices to make up the 32-bit data bus. The bank of DRAMs will require 8 devices.

Word address map (every row of the map is one 32-bit word):

00040000 - 0013FFFF   DRAM bank   (8 x 1M*4 devices)     1M words   (20 bits)
00020000 - 0003FFFF   SRAM bank 1 (4 x 128K*8 devices)   128K words (17 bits)
00000000 - 0001FFFF   SRAM bank 0 (4 x 128K*8 devices)   128K words (17 bits)

Page 48:

Address Decoding - Example
Memory Interfacing

[Diagram: the CPU drives 17 address lines in parallel to all devices in each SRAM bank (8 data lines to each 128K*8 device) and 20 address lines in parallel to all devices in the DRAM bank (4 data lines to each 1M*4 device); the decode logic produces CS1, CS2 and CS3]

CS1 connects to the chip select on SRAM bank 0
CS2 connects to the chip select on SRAM bank 1
CS3 connects to the chip select on the DRAM bank

CS1 = /A19 · /A20 · /A21 · /A22
CS2 =  A19 · /A20 · /A21 · /A22
CS3 =  A20 + A21 + A22

(/A denotes the complement of A; all address lines A23 and above are omitted to simplify)
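A small C sketch (not from the slides) of the same decode, with A19-A22 read as bits 19-22 of the byte address; the probe addresses are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define BIT(a, n) (((a) >> (n)) & 1u)

/* Chip-select equations from the example.  CS1/CS2 are fully decoded
 * (within A22..A19); CS3 is a partial decode, as on the slide. */
int cs1(uint32_t a) { return !BIT(a,19) && !BIT(a,20) && !BIT(a,21) && !BIT(a,22); }
int cs2(uint32_t a) { return  BIT(a,19) && !BIT(a,20) && !BIT(a,21) && !BIT(a,22); }
int cs3(uint32_t a) { return  BIT(a,20) ||  BIT(a,21) ||  BIT(a,22); }

int main(void)
{
    /* One probe byte address in each region of the map. */
    uint32_t probes[] = { 0x000100, 0x080100, 0x100100 };
    for (int i = 0; i < 3; i++)
        printf("addr 0x%06X -> CS1=%d CS2=%d CS3=%d\n",
               probes[i], cs1(probes[i]), cs2(probes[i]), cs3(probes[i]));
    return 0;
}
```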


Page 50:

Connecting Multiplexed Address and Data Buses

There are many multiplexing schemes but let’s choose 3 processor types and 2 memory types and look at the possible interconnections:

• Processor types (all with 8-bit data and 16-bit address buses):
  – no multiplexing (e.g. Zilog Z80)
  – least-significant address bits multiplexed with the data bus (Intel 8085)
  – most-significant and least-significant halves of the address bus multiplexed together

• Memory types:

– SRAM (8K*8) - no address multiplexing
– DRAM (16K*4) - with multiplexed address inputs

Memory Interfacing

Page 51:

CPU vs Static Memory Configuration

Memory Interfacing

[Diagram: CPU with a non-multiplexed address bus - A0…15 go to the address decode logic and A0…12 directly to an 8K*8 SRAM; D0…7 connect CPU and SRAM; the decode logic drives CS]

Page 52:

CPU vs Static Memory Configuration

Memory Interfacing

[Diagram: CPU with the LS address byte multiplexed onto the data bus - AD0…7 feed a latch to recover the LS address bits while A8…15 (MS) come out directly; together they form A0…12 for the 8K*8 SRAM; AD0…7 also carry D0…7; the address decode drives CS]

Page 53:

CPU vs Static Memory Configuration

Memory Interfacing

[Diagram: CPU with a time-multiplexed address bus - MA0…7 feed a latch to build the full A0…12 for the 8K*8 SRAM; D0…7 connect directly; the address decode drives CS]

Page 54:

CPU vs Dynamic Memory Configuration

Memory Interfacing

[Diagram: CPU with a non-multiplexed address bus driving 2 x 16K*4 DRAMs - an external multiplexer (MPX) turns A0…15 into the multiplexed MA0…6 row/column address; the address decode generates RAS and CAS; one DRAM supplies D0…3 and the other D4…7]

Page 55:

CPU vs Dynamic Memory Configuration

Memory Interfacing

[Diagram: CPU with the LS addresses multiplexed onto the data bus, driving 2 x 16K*4 DRAMs - AD0…7 are latched, a multiplexer (MPX) produces MA0…6, and the decode generates RAS and CAS; one DRAM supplies D0…3 and the other D4…7]

Page 56:

CPU vs Dynamic Memory Configuration

Memory Interfacing

[Diagram: CPU with a time-multiplexed address bus driving 2 x 16K*4 DRAMs directly - MA0…7 supply the DRAMs' MA0…6 inputs with no external multiplexer; the address decode generates RAS and CAS; one DRAM supplies D0…3 and the other D4…7]

Page 57:

Displays
Video Display Characteristics
• Consider a video display capable of producing 640*240 pixel monochrome, non-interlaced images at a frame rate of 50 Hz:

[Diagram: displayed image, h pixels wide by v lines high; add 20% to h for line flyback and 20% to v for frame flyback]

dot rate = (640*1.2) * (240*1.2) * 50 Hz = 11 MHz = 90 ns/pixel

For a 1024*800 non-interlaced display:

dot rate = (1024*1.2) * (800*1.2) * 50 Hz = 65 MHz = 15 ns/pixel

Add colour with 64 levels for each of r, g and b - 18 bits per pixel - and the bandwidth is now 1180 Mbit/s…
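A small C sketch (not from the slides) of these dot-rate sums; note the slide rounds the 1024*800 rate up to 65 MHz, which with 18 bits/pixel gives its 1180 Mbit/s figure:

```c
#include <stdio.h>

/* Dot rate for a non-interlaced display: 20% is added to both the
 * horizontal and vertical totals for flyback, as on the slide. */
double dot_rate_hz(int h, int v, double frame_hz)
{
    return (h * 1.2) * (v * 1.2) * frame_hz;
}

int main(void)
{
    double r1 = dot_rate_hz(640, 240, 50.0);   /* ~11 MHz, ~90 ns/pixel */
    double r2 = dot_rate_hz(1024, 800, 50.0);  /* ~59 MHz by this sum   */

    printf("640x240:  %.1f MHz, %.0f ns/pixel\n", r1 / 1e6, 1e9 / r1);
    printf("1024x800: %.1f MHz, %.0f ns/pixel\n", r2 / 1e6, 1e9 / r2);
    printf("1024x800 at 18 bits/pixel: %.0f Mbit/s\n", r2 * 18 / 1e6);
    return 0;
}
```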

Page 58:

Displays
Video Display Characteristics
• Problems with high bit rates:
  – memory-mapping the screen display within the processor map couples the CPU and display tightly - they must be designed together
  – in order that the screen display may be refreshed at the rates required by video, the display must have higher priority than the processor for DMA on the memory bus - this uses much of the bandwidth
  – in order to update the image, the CPU may require very fast access to the screen memory too
  – the megabytes of memory needed for large screen displays are still relatively expensive - compared with the CPU etc.

Page 59:

Displays
Bit-Mapped Displays
• Even a 640*240 pixel display cannot easily be maintained using DMA access to the CPU's RAM - except with multiple-word access.
• Memory bandwidth for the video display can be increased with special video DRAM:
  – allows a whole row of the DRAM (256 or 1024 bits) to be transferred in one DMA access
• Many video DRAMs may be mapped to provide a single bit each of a multi-bit pixel in parallel - colour displays.

Page 60:

Displays

Character-Based Displays
• Limited to displaying one of a small number of images in fixed positions:
  – typically 24 lines of 80 characters
  – normally 8-bit ASCII
• The character value is used to look the image up in a table:
  – the table is often in ROM (a RAM version allows font changes)
• For a character 9 dots wide by 14 high:
  – 14 rows of pixels are generated for each row of characters
  – to display a complete frame, pixels are drawn at a suitable dot rate:

dot rate = (80*9*1.2) * (24*14*1.2) * 50 Hz = 17.28 MHz = 58 ns/pixel

Page 61:

Displays
Character-Based Displays
• A row of 80 characters must be read for every displayed line:
  – giving a line rate of 20.16 kHz (similar to the EGA standard)
  – the overall memory access rate needed is ~1.6 Mbytes/second (625 ns/byte)
  – barely supportable using DMA on small computers
  – even at 4 bytes at a time (32-bit machines) it is still a major use of the data bus
• To avoid re-reading each line of 80 characters for the other 13 pixel rows, the characters can be stored in a circular shift register on first access and used instead of a memory access:
  – only 80*24*50 accesses/sec are needed - in bursts
  – 167 µs per byte - easily supported
  – the whole 80 bytes can be read during flyback, before the start of a new character row, at full memory speed in one DMA burst - 80 * about 200 ns, at a rate of 24*50 times a second - less than 2% of the bus bandwidth
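A C sketch (not from the slides) checking the burst-mode bandwidth claim:

```c
#include <stdio.h>

int main(void)
{
    const int    bytes_per_row  = 80;      /* one character row          */
    const int    rows_per_frame = 24;
    const double frames_per_sec = 50.0;
    const double access_time_s  = 200e-9;  /* one DMA access, per slide  */

    double bursts_per_sec = rows_per_frame * frames_per_sec;   /* 1200   */
    double busy = bursts_per_sec * bytes_per_row * access_time_s;

    printf("accesses/sec: %.0f\n", bursts_per_sec * bytes_per_row);
    printf("bus bandwidth used: %.2f%%\n", busy * 100.0);      /* ~1.92% */
    return 0;
}
```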

Page 62:

Displays
Character-Based Displays
• Assuming that the rows of 80 characters in the CPU's memory map are stored at 128-byte boundaries (this simplifies addressing), the CPU memory address is split as:

  [ n-12 bits: address of screen memory (address decode) | 5 bits: row (0…23) | 7 bits: column (0…79) ]

• The address of a character position on the screen is:

  [ 5 bits: row (0…23) | 4 bits: line number in row (0…13) | 7 bits: column (0…79) | 4 bits: dot number across char (0…8) ]

  with carries rippling between the fields as the dot, column and line counters wrap; a look-up gives the memory address of the current bit in the shift register.
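A minimal C sketch (not from the slides) of the 128-byte-aligned CPU addressing; the base address is just an illustrative value:

```c
#include <stdint.h>
#include <stdio.h>

/* With each 80-character row aligned to a 128-byte boundary, the CPU
 * address is simply base | row << 7 | column - no multiply is needed. */
uint32_t char_addr(uint32_t base, unsigned row, unsigned col)
{
    return base | ((uint32_t)row << 7) | col;   /* row 0..23, col 0..79 */
}

int main(void)
{
    uint32_t base = 0xB8000;   /* hypothetical screen memory base */
    printf("row 5, col 10 -> 0x%05X\n", char_addr(base, 5, 10));
    return 0;
}
```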

Page 63:

Displays
Character-Based Displays
• An appropriate block diagram of the display would be:

[Block diagram: a screen address (r,c) addresses the screen memory, which supplies 8-bit ASCII bytes into an 80*8-bit FIFO; each byte, together with the 4-bit line number within the character, addresses a (16*256)*9-bit character generator ROM; the selected 9-bit dot row is loaded into a 9-to-1-bit shift register clocked at the dot rate, producing the video data out]

Page 64:

Displays
Character-Based Displays
• The problem with DMA-fetching individual characters from display memory is its interference with the processor.
• An alternative is to use dual-port memories.

Dual-Port SRAMs
• provide 2 (or more) separate data and address pathways to each memory cell
• 100% of the memory bandwidth can be used by the display without affecting the CPU
• can be expensive - ~£25 for 4 Kbytes - which makes megabyte displays impractical; for a character-based display it would be OK

[Diagram: a dual-port SRAM - one memory array with two independent ports, each having its own address decode (row, column), A0…An, D0…Dn, Write, CE and OE]


Page 66:

Bit-Mapped Graphics & Memory Interleaving

Bit-Mapped Displays• Instead of using an intermediate character generator can store all pixel

information in screen memory at pixel rates above.

• Even 640*240 pixel display cannot be maintained using DMA access to CPU’s RAM - except with multiple word access

• Increase memory bandwidth for video display with special video DRAM

– allows whole row of DRAM (256 or 1024 bits) in one DMA access

• Many video DRAMs may be mapped to provide a single bit of a multi-bit pixel in parallel - colour displays.

• Use of video shift register limits clocking frequency to 25MHz - 40ns/pixel

Page 67:

A graphics card consists of:
• GPU - Graphics Processing Unit
  – a microprocessor optimised for 3D graphics rendering
  – clock rate 250-850 MHz, with pipelining - converts 3D images of vertices and lines into a 2D pixel image
• Video BIOS - the program that operates the card, interface timings etc.
• Video Memory - can use the computer's RAM, but more often the card has its own VideoRAM (128 MB - 2 GB) - often multiport VRAM, now DDR (double data rate - uses the rising and falling edges of the clock)
• RAMDAC - Random Access Memory Digital-to-Analogue Converter, driving the CRT

Page 68:

Bit-Mapped Graphics & Memory Interleaving
Using Video DRAMs
• To generate analogue signals for a colour display:
  – 3 fast DAC devices are needed
  – each fed from 6 or 8 bits of data
  – one each for the red, green and blue video inputs
• To save storing so much data per pixel (24 bits), a Colour Look-Up Table (CLUT) device can be used:
  – uses a small RAM as a look-up table
  – e.g. a 256-entry table accessed by the 8-bit value stored for each pixel - each entry contains the 18 or 24 bits used to drive the DACs
  – hence "256 colours may be displayed from a palette of 262144"

[Diagram: CLUT - 8-bit pixel data addresses a 256-entry (2^8) by 18-bit RAM; the 18-bit output splits into three 6-bit fields feeding the red, green and blue DACs; a second port lets the CPU update the RAM contents]
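A minimal C sketch (not from the slides) of the CLUT lookup, assuming 18-bit entries split as 6 bits per gun:

```c
#include <stdint.h>
#include <stdio.h>

/* 256-entry palette, each entry 18 bits: 6 bits red, 6 green, 6 blue. */
static uint32_t clut[256];

void lookup(uint8_t pixel, unsigned *r, unsigned *g, unsigned *b)
{
    uint32_t e = clut[pixel];        /* one RAM read per displayed pixel */
    *r = (e >> 12) & 0x3F;           /* 6-bit value for the red DAC      */
    *g = (e >> 6)  & 0x3F;           /* 6-bit value for the green DAC    */
    *b =  e        & 0x3F;           /* 6-bit value for the blue DAC     */
}

int main(void)
{
    clut[7] = (63u << 12) | (32u << 6) | 0u;   /* an orange-ish entry */
    unsigned r, g, b;
    lookup(7, &r, &g, &b);
    printf("pixel 7 -> R=%u G=%u B=%u (of 63)\n", r, g, b);
    return 0;
}
```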

Page 69:

Bit-Mapped Graphics & Memory Interleaving
Using Video DRAMs
• Addressing considerations:
  – if the number of bits in the shift registers is not the same as the number of displayed pixels, it is easier to ignore the extra ones - wasting memory may make addressing simpler
  – making the processor's screen memory bigger than the displayable memory gives a scrollable virtual window:

  [ remaining bits: screen block select | log2(v) bits: row address (0…v-1) | log2(h) bits: column address (0…h-1) ]  (not all combinations used)

  – even though most 32-bit processors can access individual bytes (used as pixels), this is not as efficient as accessing memory in word (32-bit) units

Page 70:

Bit-Mapped Graphics & Memory Interleaving
Addressing Considerations (cont'd)
• Sometimes it might be better NOT to arrange the displayed pixels in ascending memory address order. Three ways of packing pixels into a 32-bit word:
  – Each word defines 4 horizontally neighbouring pixels (0 1 2 3), each fully specifying its colour - the simplest and most common representation.
  – Each word defines 2 pixels horizontally by 2 vertically (0 1 over 2 3) with all colour data - useful for text or graphics applications where small rectangular blocks are modified - fewer words may need to be accessed for changes.
  – Each word defines one bit of 32 horizontally neighbouring pixels; 8 words (in 8 separate colour planes) must be changed to completely change any pixel - useful for adding or moving blocks of solid colour - CAD.

Page 71:

Bit-Mapped Graphics & Memory Interleaving
Addressing Considerations (cont'd)
• The video memories must now be arranged so that the bits within the CPU's 32-bit words can all be read or written to their relevant locations in video memory in parallel:
  – this is done by making sure that the pixels stored in neighbouring 32-bit words are held in different memory chips - interleaving

Page 72:

Bit-Mapped Graphics & Memory Interleaving
Example
Design a 1024*512 pixel colour display capable of passing 8 bits per pixel to a CLUT. Use a video frame rate of 60 Hz, and video DRAMs with a shift-register maximum clocking frequency of 25 MHz. Produce a solution that supports a processor with an 8-bit data bus.

Page 73:

Bit-Mapped Graphics & Memory Interleaving
Example
• 1024 pixels across the screen can be satisfied using one 1024-bit shift register (or 4 multiplexed 256-bit ones).
• The frame rate is 60 Hz.
• The number of lines displayed is 512.
• The line rate becomes 60*512 = 30.72 kHz - or 32.55 µs/line.
• 1024 pixels per line gives a dot rate of 30.72 kHz * 1024 = 31.46 MHz.
• The dot time is thus 32 ns - too fast for one shift register! So we will have to interleave 2 or more.
• Multiplexing the minimum 3 shift registers would make the addressing complicated; it is easier to use 4 VRAMs - each with 256 rows of 256 columns, each addressed row/column intersection containing 4 bits, interfaced by 4 pins to the processor and to 4 separate shift registers.
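A C sketch (not from the slides) of the timing arithmetic; note the pure rate division gives a minimum of 2 shift registers, while the slide settles on 4 for simpler addressing:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double frame_hz  = 60.0;
    const int    lines     = 512, pixels = 1024;
    const double sr_max_hz = 25e6;            /* VRAM shift register limit */

    double line_hz = frame_hz * lines;        /* 30.72 kHz                 */
    double dot_hz  = line_hz * pixels;        /* 31.46 MHz -> ~32 ns/pixel */
    int    need    = (int)ceil(dot_hz / sr_max_hz);

    printf("line rate %.2f kHz (%.2f us/line)\n", line_hz/1e3, 1e6/line_hz);
    printf("dot rate %.2f MHz (%.0f ns/pixel)\n", dot_hz/1e6, 1e9/dot_hz);
    printf("interleave factor: at least %d shift registers\n", need);
    return 0;
}
```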

Page 74:

Bit-Mapped Graphics & Memory Interleaving
Example
• Hence, for the 8-bit CPU, the memory address (a BYTE address) is split as:

  [ n-20 bits: screen block select (address decode) | 9 bits: row (0…511) | 10 bits: column (0…1023) ]

• The video address (from the pixel counters) is split as:

  [ 1 bit: top/bottom half of screen, to the top/bottom multiplexers | 8 bits: row, to the RAS address inputs | 8 bits: implicit address of bits in the cascaded shift registers | 1 bit: odd/even pixel, to the pixel multiplexer ]

  Between them, the top/bottom and odd/even bits select which VRAM pair (A+B, C+D, E+F or G+H) supplies each pixel.

Page 75:

Example

[Diagram: eight 256*256*4 VRAMs, A-H. A+B hold 8 bits of each odd pixel for the top 256 lines; C+D hold 8 bits of each even pixel for the top 256 lines; E+F the odd pixels of the bottom 256 lines; G+H the even pixels of the bottom 256 lines. Each pair of 4-bit outputs combines into an 8-bit pixel; these pass through top/bottom select multiplexers and an odd/even pixel multiplexer into the CLUT, which drives RGB. The interleaved shift registers cover the left- and right-hand sides of the screen]


Page 77:

Mass Memory Concepts
Disk Technology
• basics unchanged for 50 years; similar for CD, DVD
• 1-12 platters, double-sided, 3600-10000 rpm
• circular tracks, subdivided into sectors
• recording density >3 Gb/cm2
• innermost tracks not used - they cannot be used efficiently
• inner tracks are a factor of 2 shorter than outer tracks, hence there are more sectors in the outer tracks
• cylinder - the set of tracks with the same diameter on all recording surfaces

Page 78:

Mass Memory Concepts
Access Time
• Seek time - align the head with the cylinder containing the track with the sector inside
• Rotational Latency - time for the disk to rotate to the beginning of the sector
• Data Transfer time - time for the sector to pass under the head

Disk Capacity = surfaces x tracks/surface x sectors/track x bytes/sector
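A quick check of this formula against the Barracuda column of the table that follows (a sketch in C; the figures come from that table):

#include <stdio.h>

int main(void) {
    /* Seagate Barracuda figures from the table below */
    long long surfaces = 24, tracks = 24247, sectors = 604, bytes = 512;
    long long capacity = surfaces * tracks * sectors * bytes;
    printf("capacity = %lld bytes (~%.0f GB)\n", capacity, capacity / 1e9);
    /* prints roughly 180 GB, matching the formatted capacity */
    return 0;
}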


Key Attributes of Example Discs

Manufacturer                     Seagate        Hitachi     IBM
Series                           Barracuda      DK23DA      Microdrive
Model Number                     ST1181677LW    ATA-5 40    DSCM-11000
Typical Application              Desktop        Laptop      Pocket device

Storage attributes
Formatted Capacity (GB)          180            40          1
Recording surfaces               24             4           2
Cylinders                        24,247         33,067      7,167
Sector size (B)                  512            512         512
Avg sectors/track                604            591         140
Max recording density (Gb/cm2)   2.4            5.1         2.4

Access attributes
Min seek time (ms)               1              3           1
Max seek time (ms)               17             25          19
External data rate (MB/s)        160            100         13

Physical attributes
Diameter (inches)                3.5            2.5         1
Platters                         12             2           1
Rotation speed (rpm)             7,200          4,200       3,600
Weight (kg)                      1.04           0.10        0.04
Operating power (W)              14.1           2.3         0.8
Idle power (W)                   10.3           0.7         0.5


Key Attributes of Example Discs

Samsung launch a 1TB hard drive:
• 3 x 3.5" platters
• 334GB per platter
• 7200RPM
• 32MB cache
• 3Gb/s SATA interface (SATA - Serial Advanced Technology Attachment)

Highest density so far....


Mass Memory Concepts

Disk Organization
Data bits are small regions of magnetic coating magnetized in different directions to give 0 or 1

Special encoding techniques maximize the storage density - e.g. rather than letting the data bit values dictate the direction of magnetization, magnetize based on the change of bit value - non-return-to-zero (NRZ) - allowing a doubling of recording capacity


Mass Memory Concepts
Disk Organization
• Each sector is preceded by a sector number and followed by a cyclic redundancy check - allows some errors and anomalies to be corrected

• Various gaps within and separating sectors allow processing to finish

• Unit of transfer is a sector - typically 512 bytes to 2K bytes

• Sector address consists of 3 components:

  - Disk address (17-31 bits) = Cylinder# (10-16 bits), Track# (1-5 bits), Sector# (6-10 bits)
  - Cylinder# - positions the actuator arm
  - Track# - selects the read/write head or surface
  - Sector# - compared with the sector number recorded as it passes

• Sectors are independent and can be arranged in any logical order

• Each sector needs some time to be processed - some sectors may pass before the disk is ready to read again, so logical sectors are not stored sequentially as physical sectors:

track i    0 16 32 48 1 17 33 49 2 18 34 50 3 19 35 51 4 20 36 52 .....
track i+1  30 46 62 15 31 47 0 16 32 48 1 17 33 49 2 18 34 50 3 19....
track i+2  60 13 29 45 61 14 30 46 62 15 31 47 0 16 32 48 1 17 33 49.....
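A minimal sketch of this kind of logical-to-physical mapping: the factor-4 interleave reproduces the track-i pattern above, while the per-track skew parameter (which rotates each successive track) is an assumption for illustration:

/* Map a logical sector to its physical slot on a track, given an
   interleave factor and a per-track skew (both illustrative). */
unsigned phys_slot(unsigned logical, unsigned track, unsigned per_track,
                   unsigned interleave, unsigned skew) {
    unsigned group = per_track / interleave;      /* e.g. 64/4 = 16 */
    unsigned s = interleave * (logical % group) + logical / group;
    return (s + track * skew) % per_track;        /* track-to-track skew */
}

With per_track = 64, interleave = 4 and track 0 this places logical sector 0 at slot 0, sector 16 at slot 1, sector 32 at slot 2 - the track-i ordering shown above.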


Mass Memory Concepts
Disk Performance

Disk Access Latency = Seek Time + Rotational Latency

• Seek Time - depends on how far the head travels from the current cylinder
  - mechanical motion - accelerates and brakes

• Rotational Latency - depends upon position
  - Average rotational latency = time for half a rotation
  - at 10,000 rpm = 3ms
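The 3ms figure drops straight out of the rotation speed (a one-line check):

#include <stdio.h>

int main(void) {
    double rpm = 10000.0;
    double rotation_ms = 60000.0 / rpm;        /* one rotation: 6 ms */
    printf("average rotational latency = %.1f ms\n", rotation_ms / 2);
    return 0;
}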


Mass Memory Concepts
RAID - Redundant Array of Inexpensive (Independent) Disks
• High capacity and faster response without specialist hardware


Mass Memory Concepts
RAID0 - multiple disks appear as a single disk, with each access striping parts of a single item across many disks


Mass Memory Concepts
RAID1 - robustness added by mirroring contents on duplicate disks - 100% redundancy


Mass Memory Concepts
RAID2 - robustness using error correcting codes (Hamming codes), reducing redundancy to ~50%


Mass Memory Concepts
RAID3 - robustness using separate parity and spare disks, reducing redundancy to 25%


Mass Memory Concepts
RAID4 - Parity/Checksum applied to sectors instead of bytes - requires heavy use of the parity disk


Mass Memory Concepts
RAID5 - Parity/Checksum distributed across disks - but 2 disk failures can cause data loss


Mass Memory Concepts
RAID6 - Parity/Checksum distributed across disks, plus a second checksum scheme (P+Q) distributed across different disks


Virtual Memory
In order to take advantage of the various performance and prices of different types of memory devices it is normal for a memory hierarchy to be used:

CPU register     fastest data storage medium
cache            for increased speed of access to DRAM
main RAM         normally DRAM for cost reasons; SRAM possible
disc             magnetic, random access
magnetic tape    serial access for archiving; cheap

• How and where do we find memory that is not RAM?

• How does a job maintain a consistent user image when there are many others swapping resources between memory devices?

• How can all users pretend they have access to similar memory addresses?


Virtual Memory

Paging

In a paged virtual memory system the virtual address is treated as groups of bits which correspond to the page number and the offset or displacement within the page

– often denoted as a (P,D) pair.

• The page number is looked up in a page table and the resulting real page number is concatenated with the offset to give the real address.

• There is normally a separate page table for each virtual machine, each pointing to pages in the same real memory.

• There are two methods used for page table lookup

– direct mapping

– associative mapping


Virtual Memory
Direct Mapping
• uses a page table with the same number of entries as there are pages of virtual memory
• thus it is possible to look up the entry corresponding to the virtual page number to find
  – the real address of the page (if the page is currently resident in real memory)
  – or the address of that page on the backing store if not
• This may not be economic for large mainframes with many users
• A large page table is expensive to keep in RAM and may itself be paged...
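A minimal sketch of a direct-mapped lookup, assuming a 32-bit virtual address with 4K pages (both assumptions; the structure follows the description above):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     resident;   /* page currently in real memory?      */
    uint32_t frame;      /* real page number if resident        */
    uint32_t disc_addr;  /* address on the backing store if not */
} pte;

static pte page_table[1u << 20];   /* one entry per virtual page */

/* Returns true and fills *real on success; false means a page fault. */
bool translate(uint32_t vaddr, uint32_t *real) {
    uint32_t p = vaddr >> 12;      /* page number P  */
    uint32_t d = vaddr & 0xFFF;    /* displacement D */
    if (!page_table[p].resident)
        return false;              /* fetch the page from disc_addr */
    *real = (page_table[p].frame << 12) | d;
    return true;
}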


Virtual Memory
Content Addressable Memories

• when an ordinary memory is given an address it returns the data word stored at that location.

• A content addressable memory is supplied data rather than an address.

• It looks through all its storage cells to find a location which matches the pattern and returns which cell contained the data - there may be more than one match


Virtual Memory
Content Addressable Memories

• It is possible to perform a translation operation using a content addressable memory

• An output value is stored together with each cell used for matching

• When a match is made the signal from the match is used to enable the register containing the output value

• Care needs to be taken so that only one output becomes active at any time


Virtual Memory
Associative Mapping

• Associative mapping uses a content addressable memory to find whether the page number exists in the page table

• If it does, the rest of the entry contains the real memory address of the start of the page

• If not, the page is currently in backing store and needs to be found from a directly mapped page table on disc

• The associative memory only needs to contain the same number of entries as the number of pages of real memory - much smaller than the directly mapped table
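A sketch of the associative lookup (sizes are illustrative; real hardware compares every entry in parallel, which the loop only models):

#include <stdint.h>
#include <stdbool.h>

#define REAL_PAGES 4096   /* illustrative: entries = pages of real memory */

typedef struct { bool valid; uint32_t page, frame; } cam_entry;
static cam_entry cam[REAL_PAGES];

bool assoc_lookup(uint32_t page, uint32_t *frame) {
    for (int i = 0; i < REAL_PAGES; i++)      /* parallel in hardware */
        if (cam[i].valid && cam[i].page == page) {
            *frame = cam[i].frame;
            return true;
        }
    return false;   /* not resident: go to the page table on disc */
}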


Virtual Memory
Associative Mapping

• A combination of direct and associative mapping is often used.


Virtual Memory

Paging

• Paging is viable because programs tend to consist of loops and functions which are called repeatedly from the same area of memory. Data tends to be stored in sequential areas of memory and is likely to be used frequently once brought into main memory.

• Some memory accesses will be unexpected, unrepeated and so wasteful of page resources.

• It is easy to produce a program which mis-uses virtual memory, provoking frantic paging as it accesses memory over a wide area.

• When RAM is full, paging cannot just read virtual pages from backing store to RAM; it must first discard old ones to the backing store.


Virtual Memory

Paging

• There are a number of algorithms that can be used to decide which ones to move:

– Random replacement - easy to implement, but takes no account of usage

– FIFO replacement - simple cyclic queue, similar to above

– First-In-Not-Used-First-Out - FIFO queue enhanced with extra bits which are set when page is accessed and reset when entry is tested cyclically.

– Least Recently Used - uses set of counters so that access can be logged

– Working Set - all pages used in last x accesses are flagged as working set. All other pages are discarded to leave memory partially empty, ready for further paging
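As a concrete illustration, here is a sketch of the First-In-Not-Used-First-Out idea (often called the clock algorithm); the frame count is arbitrary:

#include <stdbool.h>

#define FRAMES 256

static bool used[FRAMES];   /* set on each access to the page */
static int  hand = 0;       /* cyclic pointer over the frames */

/* Pick a page to discard: clear and skip used pages until an
   unused one is found. */
int choose_victim(void) {
    for (;;) {
        if (!used[hand]) {               /* not used since last sweep */
            int victim = hand;
            hand = (hand + 1) % FRAMES;
            return victim;
        }
        used[hand] = false;              /* give it a second chance */
        hand = (hand + 1) % FRAMES;
    }
}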


Virtual Memory

Paging - general points

• Every process requires its own page table - so that it can make an independent translation of the location of an actual page

• Memory fragmentation under paging can be serious:
  – as pages are a set size, usage will not fill a complete page, and the last page of a set will not normally be full
  – especially if the page size is large to optimise disc usage (reduce the number of head movements)

• Extra bits can be stored in the page table with the real address - a dirty bit - to determine if the page has been written to since it was copied in, and hence whether it needs to be copied back


Virtual Memory

Segmentation

• A virtual address in a segmented system is made from 2 parts:
  – segment number
  – displacement within the segment - (S,D) pairs

• unlike pages, segments are not of fixed length - they may be variable

• Segments store complete entities - pages allow objects to be split

• Each task has its own segment table

• the segment table contains the base address and length of each segment so that other segments aren't corrupted


Virtual Memory

Segmentation

• Segmentation doesn't give rise to fragmentation in the same way: segments are of variable size, so none of a segment is wasted.

• BUT as they are of variable size it is not very easy to plan how to fit them into memory

• Keep a sorted table of vacant blocks of memory and combine neighbouring blocks when possible

• Can keep information on the "type" of a segment - read-only, executable etc. - as segments correspond to complete entities.


Virtual Memory

Segmentation & Paging

• A combination of segmentation and paging uses a triplet of virtual address fields - the segment number, the page number within the segment and the displacement within the page (S,P,D)

• More efficient than pure paging - the use of space is more flexible

• More efficient than pure segmentation - allows part of a segment to be swapped
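A sketch of (S,P,D) translation with assumed field widths (8-bit segment, 12-bit page, 12-bit displacement - all assumptions):

#include <stdint.h>

typedef struct { uint32_t *frames; uint32_t pages; } segment;
static segment seg_table[1u << 8];   /* one segment table per task */

/* Translate (S,P,D); a real system would raise a fault instead of
   returning 0 when P lies beyond the segment length. */
uint32_t translate_spd(uint32_t vaddr) {
    uint32_t s = vaddr >> 24;            /* segment number S   */
    uint32_t p = (vaddr >> 12) & 0xFFF;  /* page within seg P  */
    uint32_t d = vaddr & 0xFFF;          /* displacement D     */
    if (p >= seg_table[s].pages) return 0;
    return (seg_table[s].frames[p] << 12) | d;
}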


Virtual Memory

Segmentation & Paging

• It is easy to mis-use virtual memory by a simple difference in the way that routines are coded: the two examples below perform exactly the same task, but the first generates 1,000 page faults on a machine with 1K word pages, while the second generates 1,000,000. Most languages (except Fortran) store arrays in memory with the rows laid out sequentially, the right-hand subscript varying most rapidly.....

void order(void)
{
    int array[1000][1000], ii, jj;

    for (ii = 0; ii < 1000; ii++) {
        for (jj = 0; jj < 1000; jj++) {
            array[ii][jj] = 0;    /* rows laid out sequentially:  */
        }                         /* ~1 page fault per 1K words   */
    }
}

void order(void)
{
    int array[1000][1000], ii, jj;

    for (ii = 0; ii < 1000; ii++) {
        for (jj = 0; jj < 1000; jj++) {
            array[jj][ii] = 0;    /* strides to a new page on     */
        }                         /* every access: 1,000,000 faults */
    }
}


Memory Caches

• Most general purpose processor systems use DRAM for their bulk RAM requirements because it is cheap and more dense than SRAM

• The penalty for this is that it is slower - SRAM has a 3-4 times shorter cycle time

• To help, some SRAM can be added:
  – On-chip, directly to the CPU, for use as desired - use depends on the compiler; not always easy to use efficiently, but fast access
  – Caching - between the DRAM and CPU. Built using small fast SRAM; copies of certain parts of the main memory are held here. The method used to decide where to allocate cache determines the performance.
  – A combination of the two - on-chip cache.


Memory Caches
Directly mapped cache - the simplest form of memory cache
• The real memory address is treated in three parts:

    block select tag (t bits) | cache index (c bits)

• For a cache of 2^c words, the cache index section of the real memory address indicates which cache entry is able to store data from that address
• When cached, the tag (msb of the address) is stored in the cache with the data, to indicate which page it came from
• The cache will store 2^c words from 2^t pages
• In operation the tag is compared in every memory cycle:
  – if the tag matches, a cache hit is achieved and the cache data is passed
  – otherwise a cache miss occurs; the DRAM supplies the word, and the data with its tag are stored in the cache

(Diagram: the Tag and Index fields address the tag and data stores; the stored tag is compared with the address tag to select the cache or main memory.)
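A sketch of the read path, using the 9-bit tag / 13-bit index split of the 8kword example later in these notes (dram_read is a hypothetical helper):

#include <stdint.h>
#include <stdbool.h>

extern uint32_t dram_read(uint32_t addr);   /* hypothetical DRAM access */

#define C_BITS 13                 /* 8k cache entries */
#define LINES (1u << C_BITS)

typedef struct { bool valid; uint32_t tag, data; } cache_line;
static cache_line cache[LINES];

uint32_t read_word(uint32_t addr) {
    uint32_t index = addr & (LINES - 1);    /* low c bits  */
    uint32_t tag   = addr >> C_BITS;        /* high t bits */
    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;           /* cache hit              */
    uint32_t word = dram_read(addr);        /* miss: fetch from DRAM  */
    cache[index] = (cache_line){ true, tag, word };
    return word;
}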


Memory Caches
Set Associative Caches
• A 2-way cache contains 2 cache blocks, each capable of storing one word and the appropriate tag
• For any memory access the two stored tags are checked
• Requires an associative memory with 2 entries for each of the 2^c cache lines
• Similarly a 4-way cache stores 4 cache entries for each cache index

(Diagram: as before, but each index now selects two tag/data stores, both compared in parallel to choose the appropriate cache or main memory.)


Memory Caches
Fully Associative Caches
• A 2-way cache has two places which it must read and compare to look for a tag
• This is extended to the size of the cache memory
  – so that any main memory word can be cached at any location in the cache
• the cache has no index (c=0) and contains longer tags and data
  – notice that as c (the index length) decreases, t (the tag length) must increase to match
• all tags are compared on each memory access
• to be fast, all tags must be compared in parallel

    block select tag (t bits) | no cache index (c=0)

The INMOS T9000 had such a cache on chip


Memory Caches

Degree of Set Associativity

• for any chosen size of cache, there is a choice between more associativity or a larger index field width

• the optimum can depend on workload and instruction decoding - and can be assessed by simulation

In practice:

An 8kbyte (2k entries) cache, addressed directly, will produce a hit rate of about 73%; a 32kbyte cache achieves 86% and a 128kbyte 2-way cache 89%

(all these figures depend on characteristics of the instruction set and code executed, data used, etc. - these are for the Intel 80386)

• considering the greater complexity of the 2-way cache there doesn't seem to be a great advantage in applying it


Memory Caches

Cache Line Size

• It is possible to have cache data entries wider than a single word
  – i.e. a line size > 1

• Then a real memory access causes 2, 4 etc. words to be read
  – reading is performed over an n-word data bus
  – or from page-mode DRAM, capable of transferring multiple words from the same row in the DRAM by supplying extra column addresses
  – the extra words are stored in the cache in an extended data area
  – as most code (and data access) occurs sequentially, it is likely that the next word will come in useful...
  – the real memory address specifies which word in the line it wants:

    block select tag (t bits) | cache index (c bits) | line address (l bits)


Memory Caches
Writing Cached Memory

So far we have only really been concerned with reading the cache. But a problem also exists in keeping the cache and main memory consistent:

Unbuffered Write Through

• write the data to the relevant cache entry, update the tag, and also write the data to its location in main memory - speed determined by main memory

Buffered Write Through

• Data (and address) is written to a FIFO buffer between the CPU and main memory; the CPU continues with the next access while the FIFO buffer writes to the DRAM

• The CPU can continue to write at cache speeds until the FIFO is full, then slows down to DRAM speed as the FIFO empties

• If the CPU wants to read from DRAM (instead of the cache) the FIFO must first be emptied to ensure we have the correct data - this can introduce a long delay

• The delay can be shortened if the FIFO has only one entry - a simple latch buffer


Memory Caches

(Diagram: 4Mword memory using an 8kword direct-mapped cache with write-through writes. The 32-bit CPU address divides into DRAM select (8 bits), tag (9 bits), index (13 bits) and 2 bits of byte address. The 13-bit index addresses the tag store and the data cache; the stored 9-bit tag is compared with the address tag to produce a Match signal for the control logic. A 32-bit data FIFO and a 22-bit address FIFO - optional, for buffered write-through - sit between the CPU and the main DRAM memory.)


Memory Caches

Writing Cached Memory (cont’d)

Deferred Write (Copy Back)

• data is written out to the cache only, allowing the cached entry to differ from main memory. If the cache system wants to over-write a cache index with a different tag, it looks to see if the current entry has been changed since it was copied in. If so, it writes the changed value back to main memory before reading the new data into that location in the cache.

• More logic is required for this operation, but the performance gain can be considerable as it allows the CPU to work at cache speed while it stays within the same block of memory. The other techniques all slow down to DRAM speed eventually.

• Adding a buffer to this allows the CPU to write to the cache before the old data is actually copied back to DRAM
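A sketch of the copy-back write path with a dirty bit, reusing the direct-mapped layout sketched earlier (dram_write is a hypothetical helper):

#include <stdint.h>
#include <stdbool.h>

extern void dram_write(uint32_t addr, uint32_t word);  /* hypothetical */

#define CB_BITS 13
#define CB_LINES (1u << CB_BITS)

typedef struct { bool valid, dirty; uint32_t tag, data; } cb_line;
static cb_line cb_cache[CB_LINES];

void write_word(uint32_t addr, uint32_t word) {
    uint32_t index = addr & (CB_LINES - 1);
    uint32_t tag   = addr >> CB_BITS;
    cb_line *l = &cb_cache[index];
    if (l->valid && l->dirty && l->tag != tag)             /* evicting?   */
        dram_write((l->tag << CB_BITS) | index, l->data);  /* copy back   */
    *l = (cb_line){ true, true, tag, word };               /* cache only  */
}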


Memory Caches

(Diagram: 4Mword memory using an 8kword direct-mapped cache with copy-back writes. The address divides as before - DRAM select (8 bits), tag (9 bits), index (13 bits), 2 bits of byte address - but the tag store now holds a dirty bit with each entry, and 32-bit data and 22-bit address latches hold the evicted word and its address while it is copied back to the main DRAM memory.)


Memory Caches

Cache Replacement Policies for non-direct-mapped caches

• when the CPU accesses a location which is not already in the cache, we need to decide which existing entry to send back to main memory

• it needs to be a quick decision

• Possible schemes are:
  – Random replacement - a very simple scheme where a frequently changing binary counter is used to supply a cache set number for rejection
  – First-In-First-Out - a counter is incremented every time a new entry is brought into the cache, and is used to point to the next slot to be filled
  – Least Recently Used - a good strategy as it keeps often-used values in the cache, but difficult to implement with a few gates in short times


Memory Caches

Cache Consistency

A problem occurs when DMA is used by other devices or processors.

• A simple solution is to attach the cache to the memory and make all devices operate through it.

• Not the best idea, as a DMA transfer will cause all cache entries to be overwritten, even though the transferred data is unlikely to be needed again soon

• If the cache is placed on the CPU side of the DMA traffic then the cache might not mirror the DRAM contents

Bus Watching - monitor accesses to the DRAM and invalidate the relevant cache tag entry if that DRAM location has been updated; the cache can then be kept towards the CPU


Instruction Sets

Introduction
Instruction streams control all activity in the processor. All characteristics of the machine depend on the design of the instruction set:
  – ease of programming
  – code space efficiency
  – performance

Look at a few different instruction sets:
  – Zilog Z80
  – DEC Vax-11
  – Intel family
  – INMOS Transputer
  – Fairchild Clipper
  – Berkeley RISC-I


Instruction Sets

General Requirements of an Instruction Set

There are a number of conflicting requirements of an instruction set:

• Space Efficiency - control information should be compact
  – instructions form the major part of all data moved between memory and CPU
  – compactness is obtained by careful design of the instruction set
  – variable-length coding can be used so that frequently used instructions are encoded into fewer bits

• Code Efficiency - a task can only be translated efficiently if it is easy to pick the needed instructions from the set
  – various attempts at optimising instruction sets resulted in:
    • CISC - a rich set of long instructions - results in a small number of translated instructions
    • RISC - very short instructions, combined at compile time to produce the same result


Instruction Sets

General Requirements of an Instruction Set (cont’d)

• Ease of Compilation - in some environments compilation is a more frequent activity than on machines where demanding executables predominate. Both want execution efficiency however.

– it is more time consuming to produce efficient code for CISC - more difficult to map a program onto a wide range of complex instructions

– RISC simplifies compilation

– Ease of compilation doesn’t guarantee better code…..

– Orthogonality of the instruction set also affects code generation:

• regular structure

• no special cases

• thus all actions (add, multiply etc.) are able to work with each addressing mode (immediate, absolute, indirect, register)

• if not, the compiler may have to treat different items differently - constants, arrays and variables


Instruction Sets

General Requirements of an Instruction Set (cont’d)

• Ease of Programming

– there are still times when humans work directly at machine code level:

• compiler code generators

• performance optimisation

– in these cases there are advantages to regular, fixed-length instructions with few side effects and maximum orthogonality

• Backward Compatibility

– many manufacturers produce upgraded versions which allow code written for an earlier CPU to run without change.

– Good for public relations - if not compatible, customers could rewrite for a competitor's CPU instead!

– But it can make an instruction set a mess - deficiencies added to rather than replaced - 8086 - 80286 - 80386 - 80486 - Pentium


Instruction Sets

General Requirements of an Instruction Set (cont’d)

• Addressing Modes & Number of Addresses per Instruction

– A huge range of addressing modes can be provided - specifying operands from 1 bit to several 32-bit words.

– These modes may themselves need to include absolute addresses, index registers, etc. of various lengths.

– Instruction sets can be designed which primarily use 0, 1, 2 or 3 operand addresses just to compound the problem.


Instruction Sets

Important Instruction Set Features:

• Operand Storage in the CPU

– where are operands kept other than in memory?

• Number of operands named per instruction

– How many operands are named explicitly per instruction?

• Operand Location

– can any ALU operand be located in memory or must some or all of the operands be held in the CPU?

• Operations

– What types of operations are provided in the instruction set?

• Type and size of operands

– What is the size and type of each operand and how is it specified?


Instruction Sets
Three Classes of Machine:

• Stack based Machines (zero address machine)
  Advantages     Simple model of expression evaluation
                 Short instructions can give dense code
  Disadvantages  Stack cannot be randomly accessed, making efficient code generation difficult
                 Stack can be a hardware bottleneck

• Accumulator based Machines (one address machine)
  Advantages     Minimises internal state of machine
                 Short instructions
  Disadvantages  Since the accumulator provides only temporary storage, memory traffic is high

• Register based Machines (multi address machine)
  Advantages     Most general model
  Disadvantages  All operands must be named, leading to long instructions


Instruction Sets
Register Machines
• Register to Register
  Advantages     Simple, fixed-length instruction encoding
                 Simple model for code generation
                 Most compact
                 Instructions access operands in similar time
  Disadvantages  Higher instruction count than in architectures with memory references in instructions
                 Some short instruction codings may waste instruction space

• Register to Memory
  Advantages     Data can be accessed without loading first
                 Instruction format is easy to encode and dense
  Disadvantages  Operands are not symmetric, since one operand (in the register) is destroyed
                 The number of registers is fixed by the instruction coding
                 Operand fetch speed depends on location (register or memory)


Instruction Sets
Register Machines (cont'd)

• Memory to Memory
  Advantages     Simple (fixed length?) instruction encoding
                 Does not waste registers on temporary storage
  Disadvantages  Large variation in instruction size - especially as the number of operands is increased
                 Large variation in operand fetch speed
                 Memory accesses create a memory bottleneck


Instruction Sets
Addressing Modes

Register         Add R4, R3           R4=R4+R3                    When a value is in a register
Immediate        Add R4, #3           R4=R4+3                     For constants
Indirect         Add R4, (R1)         R4=R4+M[R1]                 Access via a pointer
Displacement     Add R4, 100(R1)      R4=R4+M[100+R1]             Access local variables
Indexed          Add R3, (R1+R2)      R3=R3+M[R1+R2]              Array access (base + index)
Direct           Add R1, (1001)       R1=R1+M[1001]               Access static data
Memory Indirect  Add R1, @(R3)        R1=R1+M[M[R3]]              Double indirect - pointers
Autoincrement    Add R1, (R2)+        R1=R1+M[R2], then R2=R2+d   Step through arrays - d is the word length
Autodecrement    Add R1, -(R2)        R2=R2-d, then R1=R1+M[R2]   Can also be used for stacks
Scaled           Add R1, 100(R2)[R3]  R1=R1+M[100+R2+(R3*d)]      Index arrays of d-byte elements


Instruction Sets
Instruction Formats

Number of Addresses (operands):

4   operation | 1st operand | 2nd operand | result | next address
3   operation | 1st operand | 2nd operand | result
2   operation | 1st operand & result | 2nd operand
1   operation | 2nd operand   (an implicit register holds the 1st operand & result)
0   operation   (all operands implicit, e.g. on a stack)


Instruction Sets
Example Programs and simulations (used in simulations by Hennessy & Patterson)

gcc     the gcc compiler (written in C), compiling a large number of C source files

TeX     the TeX text formatter (written in C), formatting a set of computer manuals

SPICE   the SPICE electronic circuit simulator (written in FORTRAN), simulating a digital shift register


Instruction Sets
Simulations on Instruction Sets from Hennessy & Patterson
The following tables are extracted from 4 graphs in Hennessy & Patterson's "Computer Architecture: A Quantitative Approach"

Use of Memory Addressing Modes (% of accesses)

Addressing Mode    TeX   Spice   gcc
Memory Indirect      1       6     1   lists
Scaled               0      16     6   arrays
Indirect            24       3    11   pointers
Immediate           43      17    39   constants
Displacement        32      55    40   local variables


Instruction Sets
Simulations on Instruction Sets (cont'd)

Number of bits needed for a Displacement Operand Value
(Percentage of displacement operands using this number of bits)

bits:    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
TeX     17   1   2   8   5  17  16   9   0   0   0   0   0   5   2  22
Spice    4   1  13   9   1   3   3   6   6   5  14  16   5  11   0  12
gcc     27   0   0   5   5  15  14   6   5   1   2   1   0   4   1  12

How local are the local variables?
< 8 bits: 71% for TeX; 37% for Spice; 79% for gcc


Instruction Sets
Simulations on Instruction Sets (cont'd)

Percentage of Operations using Immediate Operands

Operation         TeX   Spice   gcc
Loads              38      26    23
Compares           83      92    84
ALU Operations     52      49    69

The Distribution of Immediate Operand Sizes
(Number of bits needed for an Immediate Value)

bits:     0    4    8   12   16   20   24   28   32
TeX       3   44    3    2   16   23    2    1    0
Spice     0   12   36   16   14   10   12    0    0
gcc       1   50   21    3    2   19    0    0    1


Instruction Sets
The Zilog Z80

• 8-bit microprocessor derived from the Intel 8080

• has a small register set (8-bit accumulator + 6 other registers)

• Instructions are either register based, or register and one memory address - a single address machine

• Enhanced the 8080 with relative jumps and bit manipulation

• 8080 instruction set (8-bit opcodes):
  – unused gaps filled in with extra instructions
  – even more were needed, so some codes cause the next byte to be interpreted as another set of opcodes....

• Typical of early register-based microprocessors

• Let down by lack of orthogonality - inconsistencies in instructions, e.g.:
  – can load a register from an address in a single register
  – but the accumulator can only be loaded by an address in a register pair


Instruction Sets
The Zilog Z80 (cont'd)
• Separate PC, SP and 2 index registers
• Addressing modes:
  – Immediate (1 or 2 byte operands)
  – Relative (one-byte displacement)
  – Absolute (2-byte address)
  – Indexed (M[index reg + 8-bit disp])
  – Register (specified in the opcode itself)
  – Implied (e.g. references the accumulator)
  – Indirect (via HL, DE or BC register pairs)
• Instruction Types:
  – Load & Exchange - 64 opcodes used just for register-register copying
  – Block Copy
  – Arithmetic, rotate & shift - mainly 8 bit; some simple 16-bit operations
  – Jump, call & return - uses the condition code from the previous instruction
  – Input & Output - single byte; block I/O

8 improvements over the 8080:

1) Enhanced instruction set - index registers & instructions

2) Two sets of registers for fast context switching

3) Block move

4) Bit manipulation

5) Built-in DRAM refresh address counter

6) Single 5V power supply

7) Fewer extra support chips needed

8) Very good price...


Instruction Sets
Intel 8086 Family
• 8086 announced in 1978 - not used in a PC until 1987 (the slower 8088 from 1981)
  – 16-bit processor and data paths
  – 20-bit base addressing mode
• 80186 upgrade: small extensions
• 80286 - used in the PC/AT in 1984 (6 times faster than the 8088 - 20MHz)
  – Memory mapping & protection added:
    • support for VM through segmentation
    • 4 levels of protection - to keep applications away from the OS
    • protected mode only switchable by processor reset until the 386!
  – 24-bit addressing (16MB) - the segment table has a 24-bit base field & a 16-bit size field
• 80386 - 1986 - 40MHz
  – 32-bit registers and addressing (4GB)
  – Incorporates "virtual 8086" mode rather than direct hardware support
  – Paging (4kbyte pages) and segmentation (up to 4GB) - allows UNIX implementation
  – general purpose register usage
  – Incorporates 6 parallel stages - concurrent fetch (prefetch) and execute:
    • Bus Interface Unit - I/O and memory
    • Code Prefetch Unit
    • Instruction Decode Unit
    • Execution Unit
    • Segment Unit - logical address to linear address translation
    • Paging Unit - linear address to physical address translation
  – Includes a cache for up to the 32 most recently used pages


Instruction Sets
Intel 8086 Family
• i486 - 1988 - 100MHz
  – more performance:
    • added caching (8KB) to the memory system
    • integrated the floating point processor on board
    • expanded decode and execute into 5 pipelined stages
• Pentium - 1994 - 150-750MHz (10,000 times the speed of the 8088)
  (not "80586" - a court ruling held that you can't trademark a number)
  – added a second pipeline to give superscalar performance
  – Now separate code (8KB) and data (8KB) caches
  – Added branch prediction, with an on-chip branch table for lookups
  – Pages now 4MB as well as 4KB
  – Internal paths 128 and 256 bits, external still 32 bits
  – Dual processor support added
• Pentium Pro
  – Instruction decode now 3 parallel units
  – Breaks up code into "micro-ops"
  – Micro-ops can be executed in any order using 5 parallel execution units: 2 integer, 2 floating point and 1 memory


Instruction Sets
Intel 8086 Registers (initially 16-bit)
Data
AX   used for general arithmetic; AH and AL used in byte arithmetic
BX   general-purpose register, used as an address base register
CX   general-purpose register, used specifically in string, shift & loop instructions
DX   general-purpose register, used in multiply, divide and I/O instructions
Address
SP   Stack Pointer
BP   base register - for base-addressing mode
SI   index, string source base register
DI   index, string destination base register
Registers can be used in 32-bit mode when in 80386 mode


Instruction Sets
Intel 8086 Registers (initially 16-bit)
Segment Base Registers - shifted left 4 bits and added to the address specified in the instruction... (this causes overlap!!! - changed in the 80286)
CS   start address of code accesses
SS   start address of Stack Segment
ES   extra segment (for string destinations)
DS   data segment - used for all other accesses
Control Registers
IP     Instruction Pointer (LS 16 bits of PC)
Flags  6 condition code bits plus 3 processor status control bits
Addressing Modes
A wide range of addressing modes is supported. Many modes can only be accessed via specific registers, e.g.:
Register Indirect    BX, SI, DI
Base + displacement  BP, BX, SI, DI
Indexed              address is the sum of 2 registers - BX+SI, BX+DI, BP+SI, BP+DI


Instruction Sets
The DEC Vax-11

The Vax-11 family was compatible with the PDP-11 range - it had 2 separate processor modes - "Native" (VAX) and "Compatibility" modes

• The VAX had 16 32-bit general purpose registers, including the PC, the SP and a frame pointer.

• All data and address paths were 32 bits wide - a 4GB address space.

• A full range of data types was directly supported by hardware - 8, 16, 32 and 64-bit integers, 32 and 64-bit floating point, 32-digit BCD numbers, character strings etc.

• A very full selection of addressing modes was available

• Instructions were made up from 8-bit bytes which specified:

  – the operation

  – the data type

  – the number of operands


Instruction Sets
The DEC Vax-11
• Special opcodes FD and FF introduce even more opcodes in a second byte.
• Only the number of addresses is encoded into the opcode itself - the addresses of the operands are encoded in one or more succeeding bytes.

So the operation:

    ADDL3 #1, R0, @#12345678(R2)

or "Add 1 to the longword in R0 and store the result in a memory location addressed at an offset of the number of longwords stored in R2 from the absolute address 12345678 (hex)" is stored as 9 bytes:

    C1            ADDL3 opcode (193)
    01            literal (immediate) constant 1
    50            register mode - register 0
    42            index prefix - register 2
    9F            absolute address follows, for indexing
    78 56 34 12   absolute address #12345678 - the VAX was little-endian


Instruction Sets
The INMOS Transputer
• The transputer is a microprocessor designed to operate with other transputers in parallel embedded systems
• The T800 was exceptionally powerful when introduced in 1986
• The T9000 - a more powerful pipelined version - followed in 1994

• The Government sold INMOS to EMI
• EMI decided to concentrate on music
• SGS-Thomson bought what was left
• The Japanese used transputer technology in printers/scanners
• Then it was sold to STMicroelectronics
• Now abandoned


Instruction Sets

The INMOS Transputer

• Designed for synchronised communications applications

• Suitable for coupling into a multiprocessing configuration, allowing a single program to be spread over all machines to perform a task co-operatively

• Has 4kbytes of internal RAM - not cache, but a section of the main memory map for the programmer/compiler to utilise

• Compact instruction set:

  – the most popular instructions get the shortest opcodes - to minimise bandwidth

  – instructions operate in conjunction with a 3-word execution stack - a zero addressing strategy


Instruction Sets
The INMOS Transputer
The processor evaluates the following high-level expression:

    x = a+b+c;

where x, a, b and c represent integer variables.

There is no need to specify which processor registers receive the variables; the processor is just told to load them - pushing them on the stack - and add them. When an operation is performed, the two values at the top are popped and combined, and the result of the operation is pushed back:

                ;stack contents (blank = undefined)
                ;[     ]
    load a      ;[a    ]
    load b      ;[b a  ]
    load c      ;[c b a]
    add         ;[c+b a  ]
    add         ;[c+b+a  ]
    store x     ;[c+b+a  ]

• this removes the need to add extra bits to the instruction to specify which register is accessed; instructions can be packed into smaller words - 80% of instructions are only 1 byte long - resulting in a tighter fit in memory and less time spent fetching the instructions
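A minimal C model of the three-register evaluation stack (the register names A, B and C are described on the next slide; the push/pop ripple is the point here):

#include <stdio.h>

typedef struct { int a, b, c; } stack3;

static void push(stack3 *s, int v) { s->c = s->b; s->b = s->a; s->a = v; }
static int  pop (stack3 *s) { int v = s->a; s->a = s->b; s->b = s->c; return v; }

int main(void) {
    int a = 1, b = 2, c = 3;
    stack3 s = {0, 0, 0};
    push(&s, a); push(&s, b); push(&s, c);  /* load a; load b; load c */
    push(&s, pop(&s) + pop(&s));            /* add -> c+b   */
    push(&s, pop(&s) + pop(&s));            /* add -> c+b+a */
    printf("x = %d\n", pop(&s));            /* store x: prints x = 6 */
    return 0;
}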


Instruction Sets
The INMOS Transputer
• Has 6 registers:
  – 3 make up the register stack
  – a program counter (called the instruction pointer by Inmos)
  – a stack pointer (called the workspace pointer by Inmos)
  – and an operand register

• The stack is interfaced through the first of the 3 registers (A, B, C):
  – "push"ing a value into A will cause A's value to be pushed to B and B's value to C
  – "pop"ping a value from A will cause B's value to be popped to A and C's value to B

• The operand register is the focal point for instruction processing:
  – the 4 upper bits of a transputer instruction contain the operation
  – 16 possible operations
  – the 4 lower bits contain the operand - this can be enlarged to 32 bits by using "prefix" instructions


Instruction Sets
The INMOS Transputer
• The 16 instructions include jump, call, memory load/store and add. Three of the 16 elementary instructions are used to enlarge the two 4-bit fields (opcode or operand) in conjunction with the operand register (OR) as follows:
  – the "prefix" instruction adds its operand data into the OR (4 bits)
  – and then shifts the OR 4 bits to the left
  – allowing numbers (up to 32 bits) to be built up in the OR
  – a negative prefix instruction adds its operand into the OR and then inverts all the bits in the OR before shifting 4 bits to the left - this allows 2's complement negative values to be built up - e.g.

    Mnemonic                 Code   Memory
    ldc #3                   #4     #43
    ldc #35   is coded as
        pfix #3              #2     #23
        ldc #5               #4     #45
    ldc #987  is coded as
        pfix #9              #2     #29
        pfix #8              #2     #28
        ldc #7               #4     #47


Instruction Sets
The INMOS Transputer

    Mnemonic                             Code   Memory
    ldc -31 (ldc #FFFFFFE1)  is coded as
        nfix #1                          #6     #61
        ldc #1                           #4     #41

This last example shows the advantage of the 2's complement negative prefix: otherwise we would have to load all of the Fs, making 5 additional operations....

• An additional "operate" instruction allows the OR to be treated as an extended opcode - up to 32 bits. Such instructions cannot have an operand, as the OR is used for the instruction, so they are all zero-address instructions.

• We thus have 16 1-address instructions and potentially lots of zero-address instructions.

    Mnemonic                 Code   Memory
    add (#5)   is coded as
        opr #5               #F     #F5
    ladd (#16) is coded as
        pfix #1              #2     #21
        opr #6               #F     #F6
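The prefixing rule can be captured in a few lines. This sketch follows the standard Inmos prefixing algorithm and reproduces the byte sequences in the examples above (the output format is illustrative):

#include <stdio.h>
#include <stdint.h>

enum { PFIX = 0x2, NFIX = 0x6, LDC = 0x4 };

/* Emit the (possibly prefixed) byte sequence that builds `value`
   in the operand register for instruction `op`. */
static void emit(unsigned op, int32_t value) {
    if (value >= 0 && value < 16)
        ;                                /* fits in one nibble: no prefix */
    else if (value >= 16)
        emit(PFIX, value >> 4);          /* build the upper nibbles first */
    else
        emit(NFIX, (~value) >> 4);       /* negative: invert via nfix     */
    printf("#%X%X ", op, (unsigned)value & 0xF);
}

int main(void) {
    emit(LDC, 0x987);   /* prints #29 #28 #47 - as in the table above */
    putchar('\n');
    emit(LDC, -31);     /* prints #61 #41 - the nfix example          */
    putchar('\n');
    return 0;
}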


Instruction Sets
The INMOS Transputer

• No dedicated data registers. The transputer does not have dedicated registers, but a stack of registers, which allows for an implicit selection of the registers. The net result is a smaller instruction format.

• Reduced Instruction Set design. The transputer adopts the RISC philosophy and supports a small set of instructions executed in a few cycles each.

• Multitasking supported in microcode. The actions necessary for the transputer to swap from one task to another are executed at the hardware level, freeing the system programmer of this task, and resulting in fast swap operations.


Instruction Sets
The Fairchild (now Intergraph) Clipper
• Had sixteen 32-bit general purpose registers for the user and another 16 for operating system functions
  – this separated interrupt activity and eliminated the time taken to save register information during an ISR

• Tightly coupled to a Floating Point Unit

• Had 101 RISC-like instructions:

  – 16 bits long

  – made up from an 8-bit opcode and two 4-bit register fields

  – some instructions can carry 4 bits of immediate data

  – the 16-bit instructions could be executed in extremely fast cycles

  – also had 67 macro instructions - made up from multiples of the simpler instructions using a microprogramming technique - these incorporated many more complex addressing modes as well as operations which took several clock cycles


A Tale of Intel

Intergraph was a leading producer of CAD workstations for transport, building and local government applications, with products built using Intel chips.

1987 – Intergraph buys Advanced Processor Division of Fairchild from National Semiconductor

1989-92 – Patents for the Clipper transferred to Intergraph

1996 – Intergraph find that Intel are infringing their patents on cache addressing, consistency between cache and memory, write-through & copy-back modes for virtual addressing, bus snooping etc.:
- Intergraph ask Intel to pay for patent rights
- Intel refuse
- Intel then cut off Intergraph from advanced information about Intel chips - without that information Intergraph could not design new products well
- Intergraph go from #1 to #5

1997 – Intergraph sue Intel – lots of legal activity over the next 3 years – the court rules that Intel is not licensed to use Clipper technology in the Pentium

2002 – Intel pays Intergraph $300M for a license plus $150M damages for infringement of PIC (Parallel Instruction Computing) technology – the core of the Itanium chip for high-end servers


The Federal Trade Commission cite Intel in 2 other similar cases:

1997 – Digital sue Intel saying it copied DEC technology to make Pentium Pro.

In retaliation Intel cut off DEC from Intel pre-release material.

Shortly after this DEC get bought out by Compaq.

1994 – Compaq sue Packard Bell for violating patents on the Compaq chip set.

Packard Bell say the chip set was made by Intel.

Intel cut off Compaq from advanced information…..



Instruction Sets - The Fairchild (now Intergraph) Clipper

An example of a Harvard Architecture - having a separate internal instruction bus and data bus (and associated caches)

[Diagram: Integer CPU and FPU joined by an internal instruction bus and an internal data bus, each bus feeding its own Cache/Memory Management Unit, both of which share the off-carrier memory bus]

The Clipper is made up from 3 chips mounted on a ceramic carrier. The Harvard Architecture enables the caches to be optimised to the different characteristics of the instruction and data streams.

Microchip's PIC devices also use a Harvard Architecture.


Instruction Sets - The Berkeley RISC-I Research Processor

A research project at UC Berkeley (1980-83) set out to build:

• a “pure” RISC structure

• highly suited to executing compiled high level language programs

– procedural block, local & global variables

The team examined the frequency of execution of different types of instructions in various C and Pascal programs

The RISC-I had a strong influence on the design of the Sun SPARC architecture (while the Stanford MIPS - Microprocessor without Interlocked Pipeline Stages - architecture influenced the MIPS R2000).

The RISC-I was a register based machine. The registers, data and addresses were all 32 bits wide.

Had a total of 138 registers.

All instructions, except memory LOADs and STOREs, operated on 1,2 or 3 registers.


Instruction Sets - The Berkeley RISC-I Research Processor

A running program had available a total of 32 general-purpose registers:
• 10 (R0-R9) are global
• the remaining 22 were split into 3 groups:
  – low, local and high - 6, 10 and 6 registers respectively
• When a program calls a procedure:
  – the first 6 parameters are stored to the program's low registers
  – a new register window is formed
  – these 6 low registers are relabelled as the high 6 in a new block of 22
  – this is the register space for the new procedure while it runs
  – the running procedure can keep 10 of its local variables in registers
  – it can call further procedures using its own low registers
  – it can nest calls to a depth of 8 (thus using all 138 registers)
  – on return from a procedure the results are in the high registers and appear in the calling procedure's low registers (see the sketch below)
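The window overlap can be captured with a little address arithmetic. Below is a minimal sketch, assuming a circular file of 138 physical registers (10 globals plus 8 windows advancing in steps of 16, overlapping by 6) and assuming, purely for illustration, that a call moves to the next-lower window number; the exact numbering is an assumption, not the documented RISC-I wiring.

GLOBALS, WINDOWS, STRIDE, BANK = 10, 8, 16, 128   # 10 + 128 = 138 registers

def phys(window, reg):
    """Map a visible register (0-31) in a window to a physical index:
    R0-R9 global, R10-R15 low, R16-R25 local, R26-R31 high."""
    if reg < GLOBALS:                  # globals are shared by every window
        return reg
    return GLOBALS + (window * STRIDE + (reg - GLOBALS)) % BANK

# A call from window w runs the callee in window w-1; the caller's low
# registers (R10-R15) are then visible as the callee's high registers
# (R26-R31), which is how the 6 parameters are passed without copying:
caller, callee = 1, 0
assert phys(caller, 10) == phys(callee, 26)
assert phys(caller, 15) == phys(callee, 31)

Because the arithmetic is modulo the 128 windowed registers, an 8-deep chain of calls uses the whole bank and wraps, which is why deeper nesting forces registers to be spilled to memory.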


Instruction Sets - The Berkeley RISC-I Research Processor

Procedure A calls procedure B, which calls procedure C:

[Diagram: a register bank (physical registers 90-137 shown) holding three overlapping high/local/low windows for A, B and C - each caller's low registers coincide with its callee's high registers - with the global registers shared by all three]


Instruction Sets - The Berkeley RISC-I Research Processor

RISC-I Short Immediate Instruction Format (32 bits):

| Op-Code (7) | SCC (1) | DEST (5) | S1 (5) | IMM (1) | S2 (13) |

RISC-I Long Immediate Instruction Format (32 bits):

| Op-Code (7) | SCC (1) | DEST (5) | IMM (19) |

DEST is the register number for all operations except conditional branches, when it specifies the condition.
S1 is the number of the first source register; S2 specifies the second source register if bit 13 is high, and a 2's complement immediate value otherwise.
SCC is a set-condition-code bit which causes the status word register to be activated. (A packing sketch follows.)
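The two fixed formats pack into a single 32-bit word. Here is a minimal sketch of that packing, assuming most-significant-field-first ordering (the bit ordering and the example opcode number are illustrative assumptions, not documented RISC-I values).

def pack(fields):
    """Pack (value, width) pairs into one word, first field most significant."""
    word = 0
    for value, width in fields:
        word = (word << width) | (value & ((1 << width) - 1))
    return word

def short_format(opcode, scc, dest, s1, imm, s2):
    # Op-Code(7) | SCC(1) | DEST(5) | S1(5) | IMM(1) | S2(13)
    return pack([(opcode, 7), (scc, 1), (dest, 5), (s1, 5), (imm, 1), (s2, 13)])

def long_format(opcode, scc, dest, imm19):
    # Op-Code(7) | SCC(1) | DEST(5) | IMM(19)
    return pack([(opcode, 7), (scc, 1), (dest, 5), (imm19, 19)])

# e.g. RDEST = RS1 OP 7 with the IMM bit set (opcode 0x11 is made up):
word = short_format(opcode=0x11, scc=0, dest=3, s1=2, imm=1, s2=7)
assert word.bit_length() <= 32

A fixed format like this is exactly what lets a hard-wired decoder extract every field in parallel, a point picked up in the RISC Principles slides below.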


Instruction Sets - The Berkeley RISC-I Research Processor

The Op-Code (7 bits) can be one of 4 types of instruction:
• Arithmetic
  – where RDEST = RS1 OP S2, and OP is a math, logical or shift operation
• Memory Access
  – where LOADs take the form RDEST = MEM[RS1+S2]
  – and STOREs take the form MEM[RS1+S2] = RDEST
  • Note that RDEST is really the source register in this case
• Control Transfer
  – where various branches may be made relative to the current PC (PC+IMM) or relative to RS1 using the short form (RS1+S2)
• Miscellaneous
  – all the rest. Includes "load immediate high" - uses the long immediate format to load 19 bits into the MS part of a register - can be followed by a short-format load immediate to set the other 13 bits - 32 in all


Instruction Sets - RISC Principles

A RISC is not just a machine with a small set of instructions - the instruction set must also have been optimised and minimised to improve processor performance.
Many processors in the 60s and 70s were developed with a microcode engine at the heart of the processor - easier to design (CAD and formal proof did not exist) and easy to add extra instructions, or change them.
Most CISC programs spend most of their time in a small number of instructions. If the time taken to decode all instructions can be reduced by having fewer of them, then more time can be spent making up the less frequent instructions from sequences of the simple ones.
Various other features become necessary to make this work:
• One clock cycle per instruction
  CISC machines typically take a variable number of cycles:
  – reading in variable numbers of instruction bytes
  – executing microcode
  The time wasted waiting for these to complete is regained if all instructions operate in the same period.
  For this to happen a number of other features are required.


Instruction Sets - RISC Principles

• Hard-wired Controller, Fixed-Format Instructions
  – Single-cycle operation is only possible if instructions can be decoded fast and executed straight away
  – Fast (old-fashioned?) hard-wired instruction sequencers are needed - microcode can be too slow
  – As designing these controllers is hard, it is even more important to have few instructions
  – the design can be simplified by making all instructions share a common format
    • number of bytes, positions of op-code etc.
    • the smaller the better - provided that each instruction contains the needed information
  – Typically only 10% of the logic of a RISC chip is used for the controller function, compared with 50-60% of a CISC chip like the 68020


Instruction Sets - RISC Principles

• Larger Register Set
  – It is necessary to minimise data movement to and from the processor
  – The larger the number of registers, the easier this is to do
  – Enables rapid supply of data to the ALU etc. as needed
  – Many RISC machines have upward of 32 registers, and over 100 is not uncommon
  – There are problems with saving the state of this many registers
  – Some machines have "windows" of sets of registers so that a complete set can be switched by a single reference change
• Memory Access Instructions
  – This one type of instruction cannot be speeded up as much as the others
  – Use indexed addressing (via a processor register) to avoid having to supply (long) absolute addresses in the instruction
  – A Harvard architecture attempts to keep program instructions and data apart by having 2 data and address buses


Instruction Sets - RISC Principles

• Minimal Pipelining, Wide Data Bus
  – CISC machines use pipelining to improve the delivery of instructions to the execution unit
  – it is possible to read ahead in the instruction stream and so decode one instruction whilst executing the previous one and retrieving another
  – Complications from jump or branch instructions can make pipelining unattractive, as they invalidate the backed-up instructions and new instructions have to ripple their way through
  – RISC designers often prefer a large memory cache so that instructions can be read, decoded and executed in a single cycle independent of main memory
  – Regardless of pipelining, fetching program instructions fast is vital to RISC, and a wide data bus is essential to ensure this - the same applies to CISC


Instruction Sets - RISC Principles

• Compiler Effort
  – A CISC compiler has to spend a lot of effort matching high-level language fragments to the many different machine instructions - even more so when the addressing modes are not orthogonal
  – RISC compilers have a much easier job in that respect - fewer choices
  – They do, however, build up longer sequences of their small instructions to achieve the same effect
  – The main complication of compiling for RISC is that of optimising register usage
  – Data must be kept on-chip when possible - but it is difficult to assign an importance to a variable:
    • a variable accessed in a loop can be used many times and one outside may be used only once - but both appear in the code only once...


Instruction Sets - Convergence of RISC and CISC

Many of the principles developed for RISC machine optimisation have been fed back into CISC machines (Intergraph and Intel…). This is tending to bring the two styles of machine back together.
• Large caches on the memory interface - reduce the effects of memory usage
• CISC machines are getting an increasing number of registers
• More orthogonal instruction sets are making compiler implementation easier
• Many of the techniques described above may be applied to the microprogram controller inside a conventional CISC machine
• This suggests that the microprogram will take on a more RISC-like form, with fixed formats and fields, applying orthogonally over the registers etc.


Pipelined Parallelism in Instruction Processing

General Principles
Pipelined processing involves splitting a task into several sequential parts and processing each in parallel with separate execution units:
• for one-off tasks there is little advantage, but
• for repetitive tasks it can make substantial gains
Pipelining can be applied to many fields of computing, such as:
• large-scale multi-processor distributed processing
• arithmetic processing using vector hardware to pipe individual vector elements through a single high-speed arithmetic unit
• multi-stage arithmetic pipelines
• layered protocol processing
• as well as instruction execution within a processor
The overall task must be able to be broken into smaller sub-tasks which can be chained together - ideally with all sub-tasks taking the same time to execute. Choosing the best sub-division of tasks is called load balancing. (A short timing sketch follows.)


Pipelined Parallelism in Instruction Processing

General Principles
A single instruction still takes as long, and each instruction still has to be performed in the same order. Speed-up occurs when all stages are kept in operation at the same time. Start-up and ending become less efficient.

[Diagram: non-pipelined processing - each task passes through stage 1, stage 2, stage 3 before the next begins; pipelined processing - successive tasks overlap, each one stage behind the previous]


Pipelined Parallelism in Instruction Processing

General Principles
Two clocking schemes can be incorporated in pipelining:

Synchronous
Operates using a global clock - it indicates when each stage of the pipeline should pass its result to the next stage.
The clock must run at the rate of the slowest possible element in the pipeline when given the most time-consuming data.
To de-couple the stages they are separated by staging latches.

[Diagram: Task → stage 1 → latch → stage 2 → latch → stage 3 → Results, with a common Clock driving the latches]


Pipelined Parallelism in Instruction Processing

General Principles
Asynchronous
In this case the stages of the pipeline run independently of each other.
Two stages synchronise when a result has to pass from one to the other.
A little more complicated to design than synchronous, but with the benefit that stages can run in the time needed rather than taking the maximum time.
Use of a FIFO buffer instead of a latch between stages can allow queuing of results for each stage.

[Diagram: Task → stage 1 → latch → stage 2 → latch → stage 3 → Results, with ready/acknowledge handshake signals between stages]


Pipelined Parallelism in Instruction Processing

Pipelining for Instruction Processing
Processing a stream of instructions can be performed in a pipeline. Individual instructions can be executed in a number of distinct phases:

Fetch               Read instruction from memory
Decode instruction  Inspect instruction - how many operands, how and where will it be executed
Address generate    Calculate addresses of registers and memory locations to be accessed
Load operand        Read operands stored in memory - might read register operands or set up pathways between registers and functional units
Execute             Drive the ALU, shifter, FPU and other components
Store operand       Store the result of the previous stage
Update PC           The PC must be updated for the next fetch operation

No processor would implement all of these. The most common minimum would be Fetch and Execute.


Pipelined Parallelism in Instruction Processing

Overlapping Fetch & Execute Phases
Fetch - involves memory activity (slow) - can be overlapped with Decode and Execute.
In a RISC only 2 instructions access memory - LOAD and STORE - the remainder operate on registers, so for most instructions only Fetch needs the memory bus.
On starting the processor, the Fetch unit gets an instruction from memory.
At the end of the cycle the instruction just read is passed to the Execute unit.
While the Execute unit is performing the operation, Fetch is getting the next instruction (provided Execute doesn't need to use the memory as well).
This and other contention can be resolved by:
• Extending the clashing cycle to give time for both memory accesses to take place - hesitation - requires the synchronous clock to be delayed
• Providing multi-port access to main memory (or cache) so that both accesses can happen in parallel. Memory interleaving may help.
• Widening the data bus so that 2 instructions are fetched with each Fetch
• Using a Harvard memory architecture - separate instruction and data buses


Pipelined Parallelism in Instruction Processing

Overlapping Fetch & Execute Phases

[Diagram: Fetch #1, Fetch #2, Fetch #3 along the time axis, with Execute #1, Execute #2, Execute #3 overlapped one cycle behind]

Overlapping Fetch, Decode & Execute Phases

[Diagram: Fetch #1-#3, Decode #1-#3 and Execute #1-#3 overlapped, each stage one cycle behind the previous]


Pipelined Parallelism in Instruction Processing

Overlapping Fetch, Decode & Execute Phases

There are benefits to extending the pipeline to more than 2 stages - even though more hardware is needed

A 3-stage pipeline splits the instruction processing into Fetch, Decode and Execute.

The Fetch stage operates as before.

The Decode stage decodes the instruction and calculates any memory addresses used in the Execute

The Execute stage controls the ALU and writes result back to a register - and can perform LOAD and STORE accesses.

The Decode stage is guaranteed not to need a memory access. Thus memory contention is no worse than in the 2 stage version.

Longer Pipelines of 5, 7 or more stages are possible and depend on the complexity of hardware and instruction set.


Pipelined Parallelism in Instruction Processing

The Effect of Branch Instructions

One of the biggest problems with pipelining is the effect of a branch instruction.

A branch is Fetched as usual and the target address Decoded. The Execute stage then has the task of deciding whether or not to branch and so changing the PC.

By this time the PC has already been used at least once by the Fetch (and with a separate Decode maybe twice).

The effect of changing the PC is that all data in the pipeline following the branch must be flushed.

Branches are common in some types of program (up to 10% of instructions), so the benefits of pipelining can be lost on those instructions, which also incur a reloading overhead.

A number of schemes exist to avoid this flushing:


Pipelined Parallelism in Instruction Processing

• Delayed Branching – Sun SPARC
Instead of branching as soon as a branch instruction has been decided, the branch is modified to "execute n more instructions before jumping to the instruction specified" - with n chosen to be 1 smaller than the number of stages in the pipeline. So in a 2-stage pipeline, instead of the loop:

a; b; c; a; b; c; ….. (where c is the branch instruction back to a)

in that order, the code could be stored as:

a; c; b; a; c; b; ……

In this case a is executed, then the decision to jump back to a is taken at c, but before the jump happens b is executed.
The delayed jump at c enables b - which has already been fetched when evaluating c - to be used rather than thrown away.
Care must be taken when operating instructions out of sequence, and the machine code becomes difficult to understand.
A good compiler can hide all of this, and in about 70% of cases delayed branching can be applied easily.


Pipelined Parallelism in Instruction Processing

• Delayed Branching
Consider the following code fragment running on a 3-stage pipeline:

loop: RA = RB ADD RC
      RD = RB SUB RC
      RE = RB MULT RC
      RF = RB DIV RC
      BNZ RA, loop

Cycle  Fetch   Decode  Execute
1      ADD     -       -
2      SUB     ADD     -
3      MULT    SUB     ADD
4      DIV     MULT    SUB
5      BNZ     DIV     MULT
6      next    BNZ     DIV
7      next2   next    BNZ (updates PC)
8(=1)  ADD     -       -

The pipeline has to be flushed to remove the two incorrectly fetched instructions, and the code repeats every 7 cycles.


Pipelined Parallelism in Instruction Processing

• Delayed Branching
We can invoke the delayed branching behaviour of DBNZ and re-order 2 instructions (if possible) from earlier in the loop:

loop: RA = RB ADD RC
      RD = RB SUB RC
      DBNZ RA, loop
      RE = RB MULT RC
      RF = RB DIV RC

Cycle  Fetch   Decode  Execute
1      ADD     -       -
2      SUB     ADD     -
3      DBNZ    SUB     ADD
4      MULT    DBNZ    SUB
5      DIV     MULT    DBNZ (updates PC)
6(=1)  ADD     DIV     MULT
7      SUB     ADD     DIV
8(=3)  DBNZ    SUB     ADD

The loop now executes every 5 processor cycles - no instructions are fetched and left unused.


Pipelined Parallelism in Instruction Processing

• Instruction Buffers – IBM PowerPC
When a branch is found in an early stage of the pipeline, the Fetch unit can be made to start fetching both possible future instruction streams into separate buffers, and start decoding both, before the branch is executed. There are a number of difficulties with this:
– it imposes an extra load on instruction memory
– it requires extra hardware - duplication of decode and fetch
– it becomes difficult to exploit fully if several branches follow closely - each fork will require a separate pair of instruction buffers
– early duplicated stages cannot fetch different values to the same register, so register fetches may have to be delayed - pipeline stalling(?)
– duplicate pipeline stages must not write (memory or registers) unless a mechanism for reversing the changes is included (in case the branch is not taken)


Pipelined Parallelism in Instruction Processing

• Branch Prediction – Intel Pentium
When a branch is executed, the destination address chosen can be kept in a cache. When the Fetch stage detects a branch, it can prime itself with a next-program-counter value looked up in the cached table of previous destinations for a branch at this instruction.
If the branch is made (at the execution stage) in the same direction as before, then the pipeline already contains the correct prefetched instructions and does not need to be flushed.
More complex schemes could even use a most-frequently-taken strategy to guess where the next branch from any particular instruction is likely to go and reduce pipeline flushes still further. (A sketch of such a lookup table follows the diagram.)

[Diagram: Fetch/Decode/Execute pipeline with a look-up table of (instruction address → target address) pairs; the Fetch stage searches the table by PC, and the Execute stage loads the target address when a branch is taken]
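A look-up table like the one in the diagram can be sketched in a few lines. This is a minimal, illustrative branch-target cache keyed by the branch's own address; the class and method names, and the unit instruction size in the fall-through, are assumptions for the sketch, not the Pentium's actual mechanism.

class BranchTargetBuffer:
    def __init__(self):
        self.table = {}                # instruction address -> last target

    def predict(self, pc):
        """At Fetch: return the predicted next PC (fall through if unknown,
        assuming unit-length instructions for simplicity)."""
        return self.table.get(pc, pc + 1)

    def update(self, pc, taken, target):
        """At Execute: remember the destination actually chosen."""
        if taken:
            self.table[pc] = target
        else:
            self.table.pop(pc, None)

btb = BranchTargetBuffer()
btb.update(pc=100, taken=True, target=40)
assert btb.predict(100) == 40          # next time, Fetch primes itself with 40

A most-frequently-taken scheme would simply replace the single stored target with a small counter per destination and predict the one seen most often.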


Pipelined Parallelism in Instruction Processing

• Dependence of Instructions on Others Which Have Not Completed
Instructions cannot be reliably fetched while any previous branch instruction is incomplete - the PC is updated too late for the next fetch.
– A similar problem occurs with memory and registers.
– The memory case can be solved by ensuring that all memory accesses are performed atomically in a single Execute stage - get data only when needed.
– but what if the memory just written contains a new instruction which has already been prefetched? (self-modifying code)
In a long pipeline, several stages may read from a particular register and several may write to the same register.
– Hazards occur when the order of access to operands is changed by the pipeline
– various methods may be used to prevent data from different stages getting confused in the pipeline
Consider 2 sequential instructions i then j, and a 3-stage pipeline. The possible hazards are:


Pipelined Parallelism in Instruction Processing

• Read-after-Write Hazards
When j tries to read a source before i writes it, j incorrectly gets the old value.
– a direct consequence of pipelining conventional instructions
– occurs when a register is read very shortly after it has been updated
– the value in the register is correct

Example:
R1 = R2 ADD R3
R4 = R1 MULT R5

Cycle  Fetch   Decode                  Execute         Comments
1      ADD     -                       -
2      MULT    ADD fetches R2,R3       -
3      next1   MULT fetches R1,R5      ADD stores R1   register fetch probably gets the wrong value
4      next2   next1                   MULT stores R4  wrong value calculated


Pipelined Parallelism in Instruction Processing

• Write-after-Write Hazards
When j tries to write an operand before i writes it, the value left by i rather than the value written by j is left at the destination.
– occurs if the pipeline permits writes from more than one stage
– the value in the register is incorrect

Example:
R3 = R1 ADD R2
R5 = R4 MULT -(R3)

Cycle  Fetch   Decode                                     Execute         Comments
1      ADD     -                                          -
2      MULT    ADD fetches R1,R2                          -
3      next1   MULT fetches (R3-1),R4, saves R3-1 in R3   ADD stores R3   which version of R3?
4      next2   next1                                      MULT stores R5


Pipelined Parallelism in Instruction Processing

• Write-after-Read Hazards
When j tries to write to a register before it is read by i, i incorrectly gets the new value.
– can only happen if the pipeline provides for early (decode-stage) writing of registers and late reading - e.g. auto-increment addressing
– the value in the register is correct

Example:
A realistic example is difficult in this case, for several reasons:
• Firstly, memory accessing introduces dependencies for the data in the read case, or stalls due to bus activity in the write case
• A long pipeline with early writing and late reading of registers is rather untypical……..

• Read-after-Read Hazards
These are not a hazard - multiple reads always return the same value…….


Pipelined Parallelism in Instruction Processing

• Detecting Hazards
Several techniques - normally resulting in some stage of the pipeline being stopped for a cycle - can be used to overcome these hazards.
They all depend on detecting register-usage dependencies between instructions in the pipeline.
An automated method of managing register accesses is needed. The most common detection scheme is scoreboarding.

Scoreboarding – keeping a 1-bit tag with each register:
– all tags are cleared when the machine is booted
– a tag is set by the Fetch or Decode stage when an instruction is going to change that register
– when the change is complete the tag bit is cleared
– if an instruction is decoded which wants a tagged register, it is not allowed to access it until the tag is cleared (see the sketch below)
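A minimal sketch of the tag scheme, assuming one busy bit per register, set at issue and cleared at write-back (the class and method names are illustrative; real scoreboards also track functional units, but the register tags are the part described above):

class Scoreboard:
    def __init__(self, n_regs):
        self.busy = [False] * n_regs   # all tags cleared at boot

    def can_issue(self, sources, dest):
        # stall if any source is pending, or dest already has a write in flight
        return not any(self.busy[r] for r in sources + [dest])

    def issue(self, dest):
        self.busy[dest] = True         # Decode marks the register it will change

    def write_back(self, dest):
        self.busy[dest] = False        # completion clears the tag

sb = Scoreboard(32)
sb.issue(1)                            # R1 = R2 ADD R3 in flight
assert not sb.can_issue([1, 5], 4)     # R4 = R1 MULT R5 must wait for R1
sb.write_back(1)
assert sb.can_issue([1, 5], 4)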


Pipelined Parallelism in Instruction Processing

• Avoiding Hazards - Forwarding
Hazards will always be a possibility, particularly in long pipelines.
Some can be avoided by providing an alternative pathway for data produced in a previous cycle but not yet written back in time. (A sketch follows the diagram.)

[Diagram: register file feeding two multiplexers, one per ALU operand; bypass paths route the ALU output straight back to the multiplexer inputs, alongside the normal register write path]
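Functionally, each bypass multiplexer simply prefers a value still in flight over the register-file copy. A minimal sketch, with a dictionary standing in for the bypass wiring and made-up register values:

def read_operand(reg, regfile, in_flight):
    """in_flight maps register number -> value computed but not yet
    written back (the bypass path); fall back to the register file."""
    return in_flight.get(reg, regfile[reg])

regfile = {2: 10, 3: 4, 5: 3}
in_flight = {}

# R1 = R2 ADD R3 computes in the ALU this cycle:
in_flight[1] = read_operand(2, regfile, in_flight) + read_operand(3, regfile, in_flight)

# Next cycle, R4 = R1 MULT R5 reads R1 via the bypass, not the register file:
r4 = read_operand(1, regfile, in_flight) * read_operand(5, regfile, in_flight)
assert r4 == 42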


Pipelined Parallelism in Instruction Processing

• Avoiding Hazards - Forwarding - Example
On a 5-stage pipeline (Fetch, Decode/Reg read, ALU Execute, Memory Access, Register Write) - no forwarding pathways:

R1 = R2 ADD R3
R4 = R1 SUB R5
R6 = R1 AND R7
R8 = R1 OR R9
R10 = R1 XOR R11

Cycle  Fetch  Decode/regs                 ALU              Memory        Writeback
1      ADD    -                           -                -             -
2      SUB    ADD read R2,R3              -                -             -
3      AND    SUB read R5, R1(not ready)  ADD compute R1   -             -
4      -      SUB read R1(not ready)      -                ADD pass R1   -
5      -      SUB read R1(not ready)      -                -             ADD store R1
6      -      SUB read R1                 -                -             -
7      OR     AND read R1, R7             SUB compute R4   -             -
8      XOR    OR read R1, R9              AND compute R6   SUB pass R4   -
9      next1  XOR read R1, R11            OR compute R8    AND pass R6   SUB store R4
10     next2  next1                       XOR compute R10  OR pass R8    AND store R6
11     next3  next2                       next1            XOR pass R10  OR store R8
12     next4  next3                       next2            next1         XOR store R10
13     next5  next4                       next3            next2         next1


Pipelined Parallelism in Instruction Processing

• Avoiding Hazards - Forwarding - Example
On the same 5-stage pipeline - still no forwarding pathways, BUT registers are read in the second half of each cycle and written in the first half:

R1 = R2 ADD R3
R4 = R1 SUB R5
R6 = R1 AND R7
R8 = R1 OR R9
R10 = R1 XOR R11

Cycle  Fetch  Decode/regs                 ALU              Memory        Writeback
1      ADD    -                           -                -             -
2      SUB    ADD read R2,R3              -                -             -
3      AND    SUB read R5, R1(not ready)  ADD compute R1   -             -
4      -      SUB read R1(not ready)      -                ADD pass R1   -
5      -      SUB read R1                 -                -             ADD store R1
6      OR     AND read R1, R7             SUB compute R4   -             -
7      XOR    OR read R1, R9              AND compute R6   SUB pass R4   -
8      next1  XOR read R1, R11            OR compute R8    AND pass R6   SUB store R4
9      next2  next1                       XOR compute R10  OR pass R8    AND store R6
10     next3  next2                       next1            XOR pass R10  OR store R8
11     next4  next3                       next2            next1         XOR store R10
12     next5  next4                       next3            next2         next1


Pipelined Parallelism in Instruction Processing

• Avoiding Hazards - Forwarding - Example
On the same 5-stage pipeline - with full forwarding:

R1 = R2 ADD R3
R4 = R1 SUB R5
R6 = R1 AND R7
R8 = R1 OR R9
R10 = R1 XOR R11

Cycle  Fetch  Decode/regs                ALU              Memory        Writeback
1      ADD    -                          -                -             -
2      SUB    ADD read R2,R3             -                -             -
3      AND    SUB read R5, R1(from ALU)  ADD compute R1   -             -
4      OR     AND read R1(from ALU), R7  SUB compute R4   ADD pass R1   -
5      XOR    OR read R1(from ALU), R9   AND compute R6   SUB pass R4   ADD store R1
6      next1  XOR read R1, R11           OR compute R8    AND pass R6   SUB store R4
7      next2  next1                      XOR compute R10  OR pass R8    AND store R6
8      next3  next2                      next1            XOR pass R10  OR store R8
9      next4  next3                      next2            next1         XOR store R10
10     next5  next4                      next3            next2         next1

In this case the forwarding prevents any pipeline stalls.


Pipelined Parallelism in Instruction Processing

• Characteristics of Memory Store Operations
Example - using the 5-stage pipeline as before with a store:

R1 = R2 ADD R3
25(R1) = R1    (store in main memory)

Cycle  Fetch   Decode/regs               ALU                   Memory        Writeback
1      ADD     -                         -                     -             -
2      STORE   ADD read R2,R3            -                     -             -
3      next1   STORE read R1(not ready)  ADD compute R1        -             -
4      next2   next1                     STORE compute R1+25   ADD pass R1   -               (R1 from ALU)
5      stall   next2                     next1                 STORE R1(R1)  ADD store R1    (wait for memory indirection)
6      next3   -                         next2                 next1         STORE null

Since STORE is an output operation, it does not create register-based hazards.
It might create memory-based hazards, which may be avoided by instruction re-ordering or store-fetch avoidance techniques - see the next section.


Pipelined Parallelism in Instruction Processing

• Forwarding during Memory Load Operations
Example - using the 5-stage pipeline as before with a load:

R1 = 32(R6)
R4 = R1 ADD R7
R5 = R1 SUB R8
R6 = R1 AND R7

Cycle  Fetch   Decode/regs                  ALU               Memory         Writeback
1      LOAD    -                            -                 -              -
2      ADD     LOAD read R6                 -                 -              -
3      SUB     ADD read R7, R1(not ready)   LOAD R6+32        -              -
4      stall   ADD R1(not ready)            -                 LOAD (R6+32)   -
5      AND     SUB read R8, R1(from Mem)    ADD R4, R1(Mem)   -              LOAD store R1
6      next1   AND read R7, R1              SUB R5            ADD pass R4    -
7      next2   next1                        AND R6            SUB pass R5    ADD store R4
8      next3   next2                        next1             AND pass R6    SUB store R5
9      next4   next3                        next2             next1          AND store R6

In this case the result of the LOAD must be forwarded to the earlier ALU stage, and to the even earlier Decode stage.


Pipelined Parallelism in Instruction Processing

• Forwarding (Optimisation) Applied to Memory Operations
– Store-Fetch Forwarding - where a word stored and then loaded by another instruction further back in the pipeline can be piped directly, without the need to be passed into and out of the memory location, e.g.:

MOV [200],AX   ;copy AX to memory
ADD BX,[200]   ;add memory to BX

transforms to:

MOV [200],AX
ADD BX,AX

– Fetch-Fetch Forwarding - where a word loaded twice in successive stages may be loaded together - or once from memory to register:

MOV AX,[200]   ;copy memory to AX
MOV BX,[200]   ;copy memory to BX

transforms to:

MOV AX,[200]
MOV BX,AX

– Store-Store Overwriting:

MOV [200],AX   ;copy AX to memory
MOV [200],BX   ;copy BX to memory

transforms to:

MOV [200],BX


Pipelined Parallelism in Instruction Processing

• Code Changes Affecting the Pipeline - Instruction Re-ordering
Because hazards and data dependencies cause pipeline stalls, removing them can improve performance. Re-ordering instructions is often the simplest technique.
Consider a program to calculate the sum of n·aⁿ for n = 1 to 100, on a 3-stage pipeline:

loop: RT = RA EXP RN
      RT = RT MULT RN
      RS = RS ADD RT
      RN = RN SUB 1
      BNZ RN, loop

Cycle  Fetch   Decode                        Execute
1      EXP     -                             -
2      MULT    EXP read RA,RN                -
3      ADD     MULT read RN,RT(not ready)    EXP store RT
4      -       MULT read RN,RT               -
5      SUB     ADD read RS,RT(not ready)     MULT store RT
6      -       ADD read RS,RT                -
7      BNZ     SUB read RN,1                 ADD store RS
8      -       BNZ read RN(not ready)        SUB store RN
9      next1   BNZ read RN                   -
10     next2   next1                         BNZ store PC
11     EXP     flushed                       flushed

Needs 10 cycles per iteration.


Pipelined Parallelism in Instruction Processing

• Code Changes Affecting the Pipeline - Instruction Re-ordering
Re-order the sum and decrement instructions:

loop: RT = RA EXP RN
      RT = RT MULT RN
      RN = RN SUB 1    <- these 2
      RS = RS ADD RT   <- swapped
      BNZ RN, loop

Cycle  Fetch   Decode                        Execute
1      EXP     -                             -
2      MULT    EXP read RA,RN                -
3      SUB     MULT read RN,RT(not ready)    EXP store RT
4      -       MULT read RN,RT               -
5      ADD     SUB read RN,1                 MULT store RT
6      BNZ     ADD read RS,RT                SUB store RN
7      next1   BNZ read RN                   ADD store RS
8      next2   next1                         BNZ store PC
9      EXP     flushed                       flushed

Needs 8 cycles per iteration. It can only be made better with forwarding - to remove the final RT dependency.


Pipelined Parallelism in Instruction Processing

• Code Changes Affecting the Pipeline - Loop Unrolling
The unrolling of loops is a conventional technique for increasing performance. It works especially well in pipelined systems:
– start with a tight program loop
– re-organise the loop construct so that the loop is traversed half (or a third, quarter etc.) as many times
– re-write the code body so that it performs two (3, 4) times as much work in each loop
– optimise the new code body
In the case of pipelined execution, the code body gains from:
– more likely benefit from delayed branching
– less need to increment the loop variable
– instruction re-ordering avoids pipeline stalls
– parallelism is exposed - useful for vector and VLIW architectures
(A source-level sketch follows; the machine-level example is on the next slides.)


Pipelined Parallelism in Instruction Processing

• Code Changes Affecting the Pipeline - Loop Unrolling
Example - calculate the sum of array(n) for n = 1 to 100, using a Harvard architecture and forwarding:

      R2 = 0
      R1 = 100
loop: R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      BNZ R1, loop

Cycle  Fetch   Decode                 ALU                  Memory         Writeback
1      LOAD    -                      -                    -              -
2      SUB     LOAD read R1           -                    -              -
3      ADD     SUB read R1,1          LOAD R1+0            -              -
4      BNZ     ADD R2,R3(not ready)   SUB R1-1             LOAD (R1+0)    -
5      next1   BNZ read R1            ADD R2+R3, R3(Mem)   SUB pass R1    LOAD store R3
6      next2   next1                  BNZ R1(from ALU)     ADD pass R2    SUB store R1
7      next3   next2                  next1                BNZ pass R1    ADD store R2
8      next4   next3                  next2                next1          BNZ store PC
9      LOAD    -                      -                    -              -

The code is difficult to write in optimal form - the loop is too short to implement delayed branching. Forwarding prevents stalling, and performing the decrement early hides some of the memory latency.
800 cycles to complete all loops.


Pipelined Parallelism in Instruction Processing

• Code Changes Affecting the Pipeline - Loop Unrolling
Example - calculate the sum of array(n) for n = 1 to 100, using a Harvard architecture and forwarding:

      R2 = 0
      R1 = 100
loop: R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      BNZ R1, loop

Unrolling the loop body (4 times):

loop: R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      BNZ R1, loop

Re-label registers and re-order:

loop: R3 = LOAD array(R1)
      R4 = LOAD array-1(R1)
      R5 = LOAD array-2(R1)
      R6 = LOAD array-3(R1)
      R1 = R1 SUB 4
      DBNZ R1, loop
      R2 = R2 ADD R3
      R2 = R2 ADD R4
      R2 = R2 ADD R5
      R2 = R2 ADD R6


Pipelined Parallelism in Instruction Processing

• Code Changes Affecting the Pipeline - Loop Unrolling
Example - calculate the sum of array(n) for n = 1 to 100, using a Harvard architecture and forwarding.
The branch has been replaced with a delayed branch - it takes effect after 4 more instructions (5-stage pipeline):

Cycle  Fetch   Decode              ALU                  Memory         Writeback
1      LOAD1   -                   -                    -              -
2      LOAD2   LOAD1 read R1       -                    -              -
3      LOAD3   LOAD2 read R1,1     LOAD1 array+R1       -              -
4      LOAD4   LOAD3 read R1,2     LOAD2 array-1+R1     LOAD1 R3       -
5      SUB     LOAD4 read R1,3     LOAD3 array-2+R1     LOAD2 R4       LOAD1 store R3
6      DBNZ    SUB read R1,4       LOAD4 array-3+R1     LOAD3 R5       LOAD2 store R4
7      ADD1    DBNZ read R1        SUB R1               LOAD4 R6       LOAD3 store R5
8      ADD2    ADD1 read R2,R3     DBNZ R1(from ALU)    SUB pass R1    LOAD4 store R6
9      ADD3    ADD2 read R2,R4     ADD1 R2              DBNZ pass R1   SUB store R1
10     ADD4    ADD3 read R2,R5     ADD2 R2(from ALU)    ADD1 pass R2   DBNZ store PC
11     LOAD1   ADD4 read R2,R6     ADD3 R2(from ALU)    ADD2 pass R2   ADD1 store R2
12     LOAD2   LOAD1 read R1       ADD4 R2(from ALU)    ADD3 pass R2   ADD2 store R2
13     LOAD3   LOAD2 read R1,1     LOAD1 array+R1       ADD4 pass R2   ADD3 store R2
14     LOAD4   LOAD3 read R1,2     LOAD2 array-1+R1     LOAD1 R3       ADD4 store R2
15     SUB     LOAD4 read R1,3     LOAD3 array-2+R1     LOAD2 R4       LOAD1 store R3

250 cycles to complete all loops.


Pipelined Parallelism in Instruction Processing

• Code Changes Affecting the Pipeline - Loop Unrolling
Example - calculate the sum of array(n) for n = 1 to 100, using a Harvard architecture and forwarding.

The original loop took 8 cycles per iteration. The unrolled version allows a delayed branch to be implemented and performs 4 iterations in 10 cycles.
This gives an improvement of a factor of 3.2.

Benefits of Loop Unrolling:
– fewer instructions (multiple decrements can be performed in one operation)
– the longer loop allows the delayed branch to fit
– better use of the pipeline - more independent operations
– disadvantage: more registers are required to obtain these results


Pipelined Parallelism in Instruction Processing

• Parallelism at the Instruction Level
Conventional instruction sets rely on the encoding of register numbers, instruction type and addressing modes to reduce the volume of the instruction stream.
CISC processors optimise a lower-level encoding in a longer instruction word - requiring them to consume more instruction bits per cycle, forcing advancements like Harvard memory architectures.
CISC architectures are still sequential processing machines - pipelining and superscalar instruction grouping introduce a limited amount of parallelism.
Parallelism can also be introduced explicitly, with parallel operations in each instruction word.
VLIW (Very Long Instruction Word) machines have instruction formats which contain different fields, each referring to a separate functional unit in the processor; this requires multi-ported access to registers etc.
The choice of parallel activities in a VLIW machine is made by the compiler, which must determine when hazards exist and how to resolve them.


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Uniprocessor Limits on Performance
The speed of a pipelined processor (instructions per second) is limited by:
• clock frequency (e.g. 2.66 GHz AMD parts) - unlikely to increase much more
• depth of pipeline - as depth increases, the work done in each stage per cycle initially decreases, but the effects of register hazards, branching etc. limit further sub-division, and load balancing between stages gets increasingly difficult

So, why only initiate one instruction in each cycle?

Superpipelined processors double the clock frequency by pushing alternate instructions from a conventional instruction stream into 2 parallel pipelines. The compiler must separate instructions to run independently in the 2 streams, and where this is not possible must add NULL operations. More than 2 pipelines could be used. The scheme is not very flexible and has been superseded by:

Superscalar processors, which use a conventional instruction stream read at several instructions per cycle. Decoded instructions are issued to a number of pipelines - 2 or 3 pipelines can be kept busy this way.

Very Long Instruction Word (VLIW) processors use a modified instruction set - each instruction contains sub-instructions, each sent to a separate functional unit.


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Superscalar Architectures– fetch and decode more instructions than needed to feed a single pipeline

– launch instructions down a number of parallel pipelines in each cycle

– compilers often re-order instructions to place suitable instructions in parallel - the details of the strategy used will have a huge effect on the degree of parallelism achieved

– some superscalars can perform re-ordering at run time - to take advantage of free resources

– relatively easy to expand - add another pipelined functional unit. Will run previously compiled code, but will benefit from new compiler

– provide exceptional peak performance, but extra data requirements put heavy demands on memory system and sustained performance might not be much more than 2 instructions per cycle.


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Very Long Instruction Word Architectures
– VLIW machines provide a number of parallel functional units

• typically 2 integer ALUs, 2 floating point units, 2 memory access units and a branch control engine

– the units are controlled from bits in a very long instruction word - this can be 150 bits or more in width

– needs fetching across a wide instruction bus - and hence wide memories and cache.

– Many functional units require 2 register read ports and a register write port

– Application code must have plenty of instruction level parallelism and few control hazards - obtainable by loop unrolling

– Compiler responsible for identifying activities to be combined into a single VLIW.


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Hazards and Instruction Issue Matters with Multiple Pipelines
There are 3 main types of hazard (for an instruction i followed by j):
– read after write - j tries to read an operand before i writes it; j gets the old value
– write after write - j writes a result before i; the value left by i rather than j is left at the destination
– write after read - j writes a result before it is read by i; i incorrectly gets the new value
In a single-pipeline machine with in-order execution, read-after-write is the only one that cannot be avoided, and it is easily solved using forwarding.
Using extra superscalar pipelines (or altering the order of instruction completion or issue) brings all three types of hazard further into play:


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Read-after-Write Hazards
It is difficult to organise forwarding from one pipeline to another.
Better is to allow each pipeline to write its result values directly to any execution unit that needs them.

• Write-after-Read Hazards
Consider:

F0 = F1 DIV F2
F3 = F0 ADD F4
F4 = F2 SUB F6

Assume that DIV takes several cycles to execute in one floating-point pipeline.
Its dependency with ADD (through F0) stops ADD from being executed until DIV finishes.
BUT SUB is independent of F0 and F3, so it could be executed in parallel with DIV and could finish first. If it wrote to F4 before the ADD read it, then ADD would have the wrong value.


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Write-after-Write Hazards
Consider:

F0 = F1 DIV F2
F3 = F0 MUL F4
F0 = F2 SUB F6
F3 = F0 ADD F4

On a superscalar, the DIV and SUB have independent operands (F2 is read twice but not changed).
If there are 2 floating-point pipelines, each could be performed at the same time.
DIV would be expected to take longer.
So SUB might try to write to F0 before DIV does - hence ADD might get the wrong value from F0 (MUL would be made to wait for DIV to finish, however).

We can use scoreboarding to resolve these issues.


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Limits to Superscalar and VLIW Expansion
Only about 5 operations per cycle are typical - why not 50?
– Limited parallelism available
  • VLIW machines depend on a stream of ready-parallelised instructions
    – many parallel VLIW instructions can only be found by unrolling loops
    – if a VLIW field cannot be filled in an instruction, then the functional unit will remain idle during that cycle
  • a superscalar machine depends on a stream of sequential instructions
    – loop unrolling is also beneficial for superscalars
– Limited hardware resources
  • the cost of register read/write ports scales linearly with their number, but the complexity of access increases as the square of the number
  • extra register-access complexity may lead to longer cycle times
  • more memory ports are needed to keep the processor supplied with data
– Code size too high
  • wasted fields in VLIW instructions lead to poor code density, a need for increased memory access, and overall less benefit from the wide instruction bus


Instruction Level Parallelism - Superscalar and Very Long Instruction Word Processors

• Amdahl's Law
Gene Amdahl suggested the following law for vector processors - it is equally appropriate for VLIW and superscalar machines and all multiprocessor machines.
Any parallel code has sequential elements - at startup and shutdown, at the beginning and end of each loop etc.
To find the benefit from parallelism we need to consider how much is done sequentially and how much in parallel.
The speedup factor can be taken as:

S(n) = (execution time using one processor) / (execution time using a multiprocessor with n processors)

If the fraction of code which cannot be parallelised is f, and the time taken for the computation on one processor is t, then the time taken to perform the computation with n processors will be:

ft + (1 - f)t/n

The speedup is therefore:

S(n) = t / (ft + (1 - f)t/n) = n / (1 + (n - 1)f)

(ignoring any overhead due to parallelism or communication between processors)
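The formula can be checked numerically in a few lines (the sample values of f match the curves plotted on the next slide):

def speedup(n, f):
    # Amdahl's law as derived above
    return n / (1 + (n - 1) * f)

for f in (0.05, 0.10, 0.20):
    print(f"f={f:.0%}: S(20)={speedup(20, f):.1f}, limit 1/f={1/f:.0f}")
# f=5%:  S(20)=10.3, limit 20
# f=10%: S(20)=6.9,  limit 10
# f=20%: S(20)=4.2,  limit 5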


Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors

• Amdahl's Law

S(∞) = 1/f

• even for an infinite number of processors, the maximum speedup is bounded by the value above
• a small reduction in the sequential overhead can make a huge difference in throughput

[Chart: Amdahl's Law - Speedup S(n) v Number of Processors (0 to 20), one curve for each of f = 0%, 5%, 10% and 20%]


Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors

• Gustafson's Law
– A result of observation and experience.
– If you increase the size of the problem, then the size (not the fraction) of the sequential part remains the same.
– e.g. if we have a problem that uses a number of grid points to solve a partial differential equation:
• for 1,000 grid points, 10% of the code is sequential
• we might expect that for 10,000 grid points only 1% of the code will be sequential
• if we expand the problem to 100,000 grid points, then only 0.1% of the problem remains sequential
– So after Amdahl's law, things start to look better again!
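One way to see the observation numerically - a sketch with made-up constants: a fixed sequential time s, and parallel work proportional to the number of grid points, chosen so that 1,000 points gives f = 10% as in the example above:

#include <stdio.h>

int main(void) {
    double s = 1.0;            /* sequential time, independent of problem size */
    double per_point = 0.009;  /* parallel time per grid point (invented) */
    int points[] = {1000, 10000, 100000};
    for (int i = 0; i < 3; i++) {
        double total = s + per_point * points[i];
        printf("%6d grid points: sequential fraction f = %.2f%%\n",
               points[i], 100.0 * s / total);
    }
    return 0;
}

This prints roughly 10%, 1.1% and 0.11% - the sequential fraction shrinks as the problem grows, just as the slide describes.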


Running Programs in Parallel

Options for running programs in parallel include:
• Timesharing on a uniprocessor - this is mainly multi-tasking to share a processor, rather than combining resources for a single application. Timesharing is characterised by:
– shared memory and semaphores
– high context-switch overheads
– limited parallelism
• Multiprocessors with shared memory - clustered computing combines several processors communicating via shared memory and semaphores.
– Shared memory limits performance (even with caches) due to the delays when the operating system or user processes wait for other processes to finish with shared memory and let them have their turn.
– Four to eight processors actively communicating on a shared bus is about the limit before access delays become unacceptable.
• Multiprocessors with separate communication switching devices - e.g. the INMOS transputer and Beowulf clusters.
– each element contains a packet routing controller as well as a processor (the transputer contained both on a single chip)
– messages can be sent between any process on any processor in hardware


Running Programs in Parallel

Options for running programs in parallel include (cont'd):
• Vector processors (native and attached)
– may just be specialised pipeline engines pumping operands through heavily-pipelined, chained, floating point units
– or they might have enough parallel floating point units to allow vector operands to be manipulated element-wise in parallel
– can be integrated into otherwise fast scalar processors
– or might be co-processors which attach to general purpose processors
• Active memory (Distributed Array Processor)
– rather than take the data to the processors, it is possible to take the processors to the data, by implementing a large number of very simple processors in association with columns of bits in memory
– thus groups of processors can be programmed to work together, manipulating all the bits of stored words
– all processors are fed the same instruction in a cycle by a master controller
• Dataflow architectures - an overall task is defined in terms of all the operations which need to be performed and all the operands and intermediate results needed to perform them. Some operations can be started immediately with the initial data, whilst others must wait for the results of the first ones, and so on to the final result.


Running Programs in Parallel

• Semaphores:
– lock shared resources
– problems of deadlock and starvation
• Shared memory
– the fastest way to move information between two processors is not to move it at all!
– rather than sender → receiver, the sender and receiver share the same memory
– use a semaphore to prevent the receiver reading until the sender has finished
– the segment is created outside normal process space - a system call maps it into the space of the requesting process

[Diagram: processes Proc 1, Proc 2 and Proc 3 each mapping shared Segments 1, 2 and 3 into their address spaces]
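A minimal sketch of this mechanism in C on Linux (error handling omitted; link with -pthread): a System V segment is created outside either process's address space and mapped in with shmat, and an unnamed process-shared POSIX semaphore stops the receiver reading before the sender has finished:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <semaphore.h>
#include <unistd.h>

struct shared {
    sem_t ready;       /* receiver blocks on this until sender finishes */
    char message[64];
};

int main(void) {
    /* create a segment outside normal process space... */
    int id = shmget(IPC_PRIVATE, sizeof(struct shared), IPC_CREAT | 0600);
    /* ...and map it into the space of the requesting process */
    struct shared *s = shmat(id, NULL, 0);
    sem_init(&s->ready, 1 /* shared between processes */, 0);

    if (fork() == 0) {            /* receiver */
        sem_wait(&s->ready);      /* sleep until the sender signals */
        printf("received: %s\n", s->message);
        return 0;
    }
    /* sender: nothing is copied between the processes - both map
       the same physical memory */
    strcpy(s->message, "hello via shared segment");
    sem_post(&s->ready);
    wait(NULL);
    sem_destroy(&s->ready);
    shmdt(s);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}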


Running Programs in Parallel
Flynn's Classification of Computer Architectures

SISD  Single Instruction, Single Data machines are conventional uni-processors. They read their instructions sequentially and operate on their data operands individually. Each instruction only accesses a few operand words.

SIMD  Single Instruction, Multiple Data machines are typified by vector processors. Instructions are still read sequentially, but this time each performs work on operands which describe multi-word objects such as arrays and vectors. These instructions might perform vector element summation, complete matrix multiplication or the solution of a set of simultaneous equations.

MIMD  Multiple Instruction, Multiple Data machines are capable of fetching many instructions at once, each of which performs operations on its own operands. The architecture here is a multiprocessor - each processor (probably a SISD or SIMD processor) performs its own computations but shares the results with the others. The multiprocessor sub-divides a large task into smaller sections which are suitable for parallel solution, and permits these tasks to share earlier results.

(MISD)  Multiple Instruction, Single Data machines are not really implementable. One might imagine an image processing engine capable of taking an image and performing several concurrent operations upon it...


Major Classifications of Parallelism
Introduction

Almost all parallel applications can be classified into one or more of the following:
• Algorithmic parallelism - the algorithm is split into sections (e.g. pipelining)
• Geometric parallelism - a static data space is split into sections (e.g. process an image on an array of processors)
• Processor farming - the input data is passed out to many processors (e.g. ray tracing: co-ordinates sent to several processors, one ray at a time)

Load Balancing
There are 3 forms of load balancing:
• Static load balancing - the choice of which processor to use for each part of the task is made at compile time
• Semi-dynamic - the choice is made at run-time, but once started, each task must run to completion on the chosen processor - more efficient
• Fully-dynamic load balancing - tasks can be interrupted and moved between processors at will. This enables processors with different capabilities to be used to best advantage, but context switching and communication costs may outweigh the gains.


Major Classifications of Parallelism
Algorithmic Parallelism

• Tasks can be split so that a stream of data can be processed in successive stages on a series of processors.
• As the first stage finishes its processing, the result is passed to the second stage; the first stage accepts more input data, processes it, and so on.
• When the pipeline is full, one result is produced every cycle.
• At the end of continuous operation, the early stages go idle as the last results are flushed through.
• Load balancing is static - the speed of the pipeline is determined by the speed of the slowest stage.

[Diagram: pipeline topologies from data to results - a linear pipeline (or chain); a pipeline with a parallel section; and the general case, an irregular network]


Major Classifications of Parallelism
Geometric Parallelism

• Some regular-patterned tasks can be processed by spreading their data across several processors and performing the same task on each section in parallel.
• Many examples involve image processing - pixels mapped to an array of transputers, for example.
• Many such tasks involve communication of boundary data from one portion to another - finite element calculations.
• Load balancing is static - the initial partitioning of the data determines the time to process each area.
• Rectangular blocks may not be the best choice - stripes, concentric squares...
• Initial loading of the data may prove to be a serious overhead.

[Diagram: a data array partitioned across a grid of connected transputers]


Major Classifications of Parallelism
Geometric v Algorithmic

Compute F(xi) = cos(sin(exp(xi*xi))) for x1, x2, ... x6 using 4 processors.

Algorithmic: a 4-stage pipeline, one stage per processor, each stage taking 1 time unit:

y = x*x  →  e^y  →  sin(y)  →  cos(y)  →  result

F1 is produced in 4 time units; F2 is produced at time 5, and so on.
i.e. time = 4 + (6 - 1) = 9 units, speedup = 24/9 ≈ 2.7

Geometric: each of the 4 processors computes the whole of cos(sin(e^(x*x))) for one value, taking 4 time units per value; x1 to x4 are processed first, then x5 and x6 on the first processors to become free.
i.e. time = 8 units, speedup = 24/8 = 3


Major Classifications of Parallelism
Processor Farming

• Involves sharing work out from a central controller process to several worker processes.
• The "workers" just accept packets of command data and return results.
• The "controller" splits up the task, sending work packets to free processors (ones that have returned a result) and collating the results.
• Global data is sent to all workers at the outset.
• Processor farming is only appropriate if:
– the task can be split into many independent sections
– the amount of communication (commands + results) is small
• To minimise latency, it might be better to keep 2 (or 3) packets in circulation for each worker - buffers are needed.
• Load balancing is semi-dynamic - command packets are sent to processors which have just (or are about to) run out of work. Thus all processors are kept busy except in the closedown phase, when some finish before others. (A skeletal implementation is sketched below.)
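A skeletal processor farm in C using MPI - a sketch, not transputer-era code: rank 0 is the controller, the tags and the trivial "square the task number" work are invented for illustration, and the priming loop sends one packet per worker (2 or 3 each would hide latency better, as noted above):

#include <mpi.h>
#include <stdio.h>

#define NTASKS 100
enum { TAG_WORK = 1, TAG_STOP = 2 };

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* controller */
        int sent = 0, done = 0;
        for (int w = 1; w < size && sent < NTASKS; w++) {  /* prime the workers */
            MPI_Send(&sent, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            sent++;
        }
        while (done < sent) {
            int result;
            MPI_Status st;
            /* collate a result; the source rank identifies a now-free worker */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            done++;
            if (sent < NTASKS) {            /* refill the free worker */
                MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                sent++;
            }
        }
        for (int w = 1; w < size; w++)      /* closedown phase */
            MPI_Send(&w, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                                /* worker */
        for (;;) {
            int task;
            MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            int result = task * task;       /* stand-in for real work */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}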


Major Classifications of Parallelism
Processor Farming (cont'd)

[Diagram: a processor farm controller - work packets (CPU; work) are generated and sent via outgoing routers and buffers to the workers, each section on a separate transputer/processor; result packets (result; CPU) return through buffers and return routers to the controller, which receives and displays results and records the free CPU numbers]


Vector Processors
Introduction

• Vector processors extend the scalar model by incorporating vector registers within the CPU.
• These registers can be operated on by special vector instructions - each performs its calculation element-wise on the vector.
• Vector parallelism could enable a machine to be constructed with a row of FPUs all driven in parallel. In practice a heavily pipelined single FPU is usually used. Both are classified as SIMD.
• A vector instruction is similar to an unrolled loop, but:
– each computation is guaranteed independent of all the others - this allows a deep pipeline (allowing the cycle time to be kept short) and removes the need to check for data hazards (within the vector)
– instruction bandwidth is considerably reduced
– there are no control hazards (e.g. pipeline flushes on branches), since the looping has been removed
– the memory access pattern is well known - thus the latency of memory access can be countered by interleaved memory banks and serial memory techniques
– overlap of ALU and FPU operations, memory accesses and address calculations is possible


Vector Processors
Types of Vector Processors

• Vector-register machines - vector registers, held in the CPU, are loaded and stored using pipelined vector versions of the typical memory access instructions.
• Memory-to-memory vector machines - operate on memory only. Pipelines of memory accesses and FPU instructions operate together without pre-loading the data into vector registers. (This style has been overtaken by vector-register machines.)

Vector-Register Machines
The main sections of a vector-register machine are:
• The vector functional units - a machine can have several such pipelined units, usually dedicated to just one purpose so that they can be optimised.
• Vector load/store activities are usually carried out by a dedicated pipelined memory access unit. This unit must deliver at least one word per cycle so that the FPUs are not held up. If this is the case, vector fetches may be carried out whilst part of the vector is being fed to the FPU.
• Scalar registers and processing engine - a conventional machine.
• Instruction scheduler.


Vector Processors
The Effects of Start-up Time and Initiation Rate

• Like all pipelined systems, the time taken for a vector operation is determined by the start-up time, the initiation (result delivery) rate and the number of calculations performed.
• The initiation rate is usually 1 - new vector elements are supplied in every cycle.
• The start-up cost is the time for one element to pass along the vector pipeline - its depth in stages. This time is increased by the time taken to fetch data operands from memory if they are not already in the vector registers - this can dominate.
• The number of clock cycles per vector element is then:

cycles per result = (start-up time + n * initiation rate) / n

• The start-up time is divided amongst all of the elements and dominates for short vectors.
• The start-up time is more significant (as a fraction of the time per result) when the initiation rate drops to 1.
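Plugging illustrative numbers into the formula (the pipeline depth of 12 cycles and the vector lengths are made up) shows how start-up dominates short vectors:

#include <stdio.h>

int main(void) {
    double startup = 12.0;   /* pipeline depth in cycles - illustrative */
    double rate = 1.0;       /* one new element accepted per cycle */
    int lengths[] = {4, 16, 64, 256};
    for (int i = 0; i < 4; i++) {
        int n = lengths[i];
        /* cycles per result = (start-up time + n * initiation rate) / n */
        printf("n = %3d: %.2f cycles per result\n", n, (startup + n * rate) / n);
    }
    return 0;
}

This prints 4.00, 1.75, 1.19 and 1.05 cycles per result respectively - only long vectors approach the ideal one cycle per result.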


Vector Processors
Load/Store Behaviour

• The pipelined load/store unit must be able to sustain a memory access rate at least as good as the initiation rate of the FPUs, to avoid data starvation.
• This is especially important when chaining the two units.
• Memory has a start-up overhead - access time latency - similar to the pipeline start-up cost.
• Once data starts to flow, how can a rate of one word/cycle be maintained?
– interleaving is usually used

Memory Interleaving
We need to attach multiple memory banks to the processor and operate them all in parallel, so that the overall access rate is sufficient. Two schemes are common:
• Synchronised banks
• Independent banks


Vector Processors
Memory Interleaving (cont'd)

Synchronised Banks
• A single memory address is passed to all memory banks, and they all access a related word in parallel.
• Once stable, all these words are latched and are then read out sequentially across the data bus - achieving the desired rate.
• Once the latching is complete, the memories can be supplied with another address and may start to access it.


Vector Processors
Memory Interleaving (cont'd)

Independent Banks
• If each bank of memory can be supplied with a separate address, we obtain more flexibility - BUT we must generate and supply much more information.
• The data latches (as in the synchronised case) may not be necessary, since all data should be available at the memory interface when required.

In both cases, we require more memory banks than the number of clock cycles taken to get information from a bank of memory.

The number of banks chosen is usually a power of 2 - to simplify addressing (but this can also be a problem - see vector strides).


Vector Processors
Variable Vector Length

In practice the length of user vectors will not match the hardware vector length - 64, 256, or whatever the register size is. A hardware vector length register in the processor is set before each vector operation; it is used by the load/store unit too.

Programming Variable Length Vector Operations
Since the processor's vector length is fixed, operations on long user vectors must be covered by several vector instructions. This is called strip mining.
Frequently, the user vector will not be a precise multiple of the machine vector length, and so one vector operation will have to compute results for a short vector - this incurs greater set-up overheads.

Consider the following:

for (j = 0; j < n; j++)
    x[j] = x[j] + (a * b[j]);

For a vector processor with vectors of length MAX and a vector-length register called LEN, we need to process a number of MAX-sized chunks of x[j] and then one section which covers the remainder:


Vector Processors
Variable Vector Length (cont'd)

start = 0;
LEN = MAX;
for (k = 0; k < n/MAX; k++) {
    for (j = start; j < start + MAX; j++) {
        x[j] = x[j] + (a * b[j]);
    }
    start = start + MAX;
}
LEN = n - start;
for (j = start; j < n; j++)
    x[j] = x[j] + (a * b[j]);

Each j-loop is implemented as three vector instructions - a load, a multiply and an add.

The time to execute the whole program is then simply:

int(n/MAX) * (sum of start-up overheads) + (n * 3 * 1) cycles

This equation exhibits a saw-tooth shape as n increases - the efficiency drops each time a large vector fills up and an extra one-element vector must be used, carrying an extra start-up overhead...

Unrolling the outer loop will be effective too...


Vector Processors
Vector Stride

Multi-dimensional arrays are stored in memory as single-dimensional vectors. In all common languages (except Fortran), row 1 is stored next to row 0, plane 1 is stored next to plane 0, and so on.

Thus, accessing an individual row of a matrix involves reading contiguous memory locations; these reads are easily spread across several interleaved memory banks.

Accessing a column of a matrix - the nth element in every row, say - involves picking individual words from memory. These words are separated from each other by x words, where x is the number of elements in each row of the matrix. x is the stride of the matrix in this dimension. Each dimension has its own stride.

Once loaded, vector operations on columns can be carried out with no further reference to their original memory layout.
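In C (row-major storage) the two access patterns look like this; here the stride x is simply the row length COLS:

#include <stdio.h>

#define ROWS 4
#define COLS 5   /* COLS is the stride between column elements */

double m[ROWS * COLS];    /* a ROWS x COLS matrix flattened into memory */

/* Row access: contiguous words - easily spread over interleaved banks */
double row_sum(int r) {
    double s = 0.0;
    for (int c = 0; c < COLS; c++)
        s += m[r * COLS + c];     /* addresses r*COLS, r*COLS+1, ... */
    return s;
}

/* Column access: words x = COLS apart - the stride in this dimension */
double col_sum(int c) {
    double s = 0.0;
    for (int r = 0; r < ROWS; r++)
        s += m[r * COLS + c];     /* addresses c, c+COLS, c+2*COLS, ... */
    return s;
}

int main(void) {
    for (int i = 0; i < ROWS * COLS; i++) m[i] = i;
    printf("row 1 sum = %g, column 2 sum = %g\n", row_sum(1), col_sum(2));
    return 0;
}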


Vector Processors
Vector Stride (cont'd)

Consider multiplying 2 rectangular matrices together. What is the memory reference pattern of a column-wise vector load?
• We step through the memory in units of our stride.
What about in a memory system with j interleaved banks?
• If j is co-prime with the stride x, then we visit each bank just once before re-visiting any one again (assuming that we use the LS bits of the word address as bank selectors).
• If j has any common factors with x (especially if j is a factor of x), then the banks are visited in a pattern which favours some banks and totally omits others. Since the number of active banks is reduced, the latency of memory accesses is not hidden and the one-cycle-per-access goal is lost. This is an example of aliasing.

Does it matter whether the interleaving uses synchronised or independent banks?
• Yes. In the synchronised case, the actual memory accesses must be timed correctly, since all the MS addresses are the same; and if the stride is wider than the interleaving factor, only some of the word accesses will be used anyway.
• In the independent case, the separate accesses automatically happen at the right time and to the right addresses. The load/store unit must generate the stream of addresses in advance of the data being required, and must send each to the correct bank.

A critically-banked system - the interleaved banks are all used fully in a vector access.
Overbanking - supplying more banks than needed reduces the danger of aliasing.
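The aliasing effect is easy to demonstrate: the sketch below (bank = word address mod j, i.e. the low address bits as bank selectors, as assumed above) counts how many of j banks a strided 64-element vector load actually touches:

#include <stdio.h>

int main(void) {
    int banks = 8;                       /* j interleaved banks */
    int strides[] = {1, 3, 4, 7, 8};     /* example vector strides x */
    for (int s = 0; s < 5; s++) {
        int used[8] = {0}, count = 0;
        for (int e = 0; e < 64; e++) {   /* 64-element vector load */
            int bank = (e * strides[s]) % banks;
            if (!used[bank]) { used[bank] = 1; count++; }
        }
        printf("stride %d over %d banks: %d banks active\n",
               strides[s], banks, count);
    }
    return 0;
}

Strides 1, 3 and 7 (co-prime with 8) keep all 8 banks active; stride 4 uses only 2 banks, and stride 8 hammers a single bank.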


Vector Processors
Forwarding and Chaining

If a vector processor is required to perform multiple operations on the same vector, then it is pointless to save the first result before reading it back into another (or the same) functional unit.

Chaining - the vector equivalent of forwarding - allows the pipelined result output of one functional unit to be joined to the input of another.

The performance of two chained operations is far greater than that of just one, since the first operation does not have to finish before the next starts. Consider:

V1 = V2 MULT V3
V4 = V1 ADD V5

The non-chained solution requires a brief stall (4 cycles), since V1 must be fully written back to the registers before it can be re-used.

In the chained case, the dependence between writes to elements of V1 and their re-reading in the ADD is compensated by the forwarding effect of the chaining - no storage is required prior to use.


Multi-Core Processors

• A multi-core microprocessor is one which combines two or more independent processor cores into a single package, often a single IC. A dual-core device contains just two independent cores.

• In general, multi-core microprocessors allow a computing device to exhibit some form of parallelism without including multiple microprocessors in separate physical packages - often known as chip-level multiprocessing, or CMP.


Multi-Core Processors
Commercial examples

• IBM's POWER4, the first dual-core module processor, released in 2000.
• IBM's POWER5 dual-core chip now in production - in use in the Apple PowerMac G5.
• Sun Microsystems UltraSPARC IV, UltraSPARC IV+, UltraSPARC T1.
• AMD - dual-core Opteron processors on 22 April 2005,
– the dual-core Athlon 64 X2 family on 31 May 2005,
– the FX-60, FX-62 and FX-64 for high-performance desktops,
– and one for laptops.
• Intel's dual-core Xeon processors,
– also developing dual-core versions of its Itanium high-end CPU,
– produced the Pentium D, the dual-core version of the Pentium 4,
– a newer chip, the Core Duo, is available in Apple Computer's iMac.
• Motorola/Freescale has dual-core ICs based on the PowerPC e500 core, with e600 and e700 cores in development.
• Microsoft's Xbox 360 games console uses a triple-core PowerPC microprocessor.
• The Cell processor, in the PlayStation 3, is a 9-core design.


Multi-Core Processors
Why?

• CMOS manufacturing technology continues to improve:
– BUT although single gates keep shrinking, clock speed can't continue to increase
– there is some 5 km of internal interconnect in a modern processor... the speed of light is too slow!
• There are also significant heat dissipation and data synchronisation problems at high clock rates.
• Some gain comes from:
– Instruction Level Parallelism (ILP) - superscalar pipelining - which can be used for many applications
– but many applications are better suited to Thread Level Parallelism (TLP) - multiple independent CPUs
• A combination of the space made available by refined manufacturing processes and the demand for increased TLP has led to multi-core CPUs.


Multi-Core Processors
Advantages

• The proximity of multiple CPU cores on the same die means that the cache coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip - combining equivalent CPUs on a single die significantly improves the cache performance of multiple CPUs.
• Assuming that the die can physically fit into the package, multi-core CPU designs require much less Printed Circuit Board (PCB) space than multi-chip designs.
• A dual-core processor uses slightly less power than two coupled single-core processors - fewer off-chip signals and shared circuitry, such as the L2 cache and the interface to the main bus.
• In terms of competing technologies for the available silicon die area, multi-core design can make use of proven CPU core library designs, producing a product with a lower risk of design error than devising a new, wider core design.
• Also, adding more cache suffers from diminishing returns, so it is better to use the space in other ways.


Multi-Core Processors
Disadvantages

• In addition to operating system (OS) support, adjustments to existing software can be required to maximise utilisation of the computing resources provided by multi-core processors.
• The ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.
– e.g. most current (2006) video games will run faster on a 3 GHz single-core processor than on a 2 GHz dual-core, despite the dual-core theoretically having more processing power, because they are incapable of efficiently using more than one core at a time.
• Integration of a multi-core chip drives production yields down, and multi-core chips are more difficult to manage thermally than lower-density single-chip designs.
• Raw processing power is not the only constraint on system performance: two processing cores sharing the same system bus and memory bandwidth limit the real-world performance advantage. Even in theory, a dual-core system cannot achieve more than a 70% performance improvement over a single core, and in practice will most likely achieve less.


The INMOS Transputer
The Transputer

Necessary features for a message-passing microprocessor are:
• a low context-switch time
• a hardware process scheduler
• support for a communicating process model
• normal microprocessor facilities

Special features of transputers:
• high performance microprocessor
• conceived as building blocks (like transistors or logic gates)
• designed for intercommunication
• CMOS devices - low power, high noise immunity
• integrate with a small supporting chip count
• provided with a hardware task scheduler - supports multi-tasking with low overhead
• capable of sub-microsecond interrupt responses - good for control applications


The INMOS Transputer
Transputer Performance

The fastest first-generation transputer (IMS T805-30) is capable of:
• up to 15 MIPS sustained
• up to 3 MFLOPS sustained
• up to 40 Mbytes/sec at the main memory interface
• up to 120 Mbytes/sec to the 4 Kbyte on-chip memory
• up to 2.3 Mbytes/sec on each of 4 bi-directional links
• 30 MHz clock speed

The fastest projected second-generation transputer (IMS T9000-50):
• is 5 times faster in calculation
• and 6 times faster in communication
• 50 MHz clock speed - equivalent performance to the 100 MHz Intel 486


The INMOS Transputer
Low Chip Count

To run using internal RAM, a T805 transputer only requires:
• a 5 MHz clock
• a 5V power supply at about 150 mA
• a power-on-reset or external reset input
• an incoming link to supply boot code and sink results

Expansion possibilities:
• 32K*32 SRAM (4 devices) requires 3 support chips
• 8 support devices will support 8 Mbytes of DRAM with optimal timing
• extra PALs will directly implement 8-bit memory-mapped I/O ports or timing logic for conventional peripheral devices (Ethernet, SCSI, etc.)
• link adapters can be used for limited expansion, to avoid memory mapping
• TRAMs (transputers plus peripherals) can be used as very high-level building blocks


The INMOS Transputer
Transputer Processes

Software running on a transputer is made up from one or more sequential processes, which run in parallel and communicate with each other periodically.

Software running on many interconnected transputers is simply a group of parallel processes - just the same as if all the code were running on a single processor.

Processes can be reasoned about individually; rules exist which allow the overall effect of parallel processes to be reasoned about too.

The benefits of breaking a task into separate processes include:
• taking advantage of parallel hardware
• taking advantage of parallelism on a single processor
• breaking the task into separately-programmed sections
• easy implementation of buffers and data management code which runs asynchronously with the main processing sections


The INMOS Transputer
Transputer Registers

The transputer implements a stack of 3 hardware registers and is able to execute 0-address instructions. It also has a few one-address instructions, which are used for memory access.

All instructions and data operands are built up in 4-bit sections, using an Operand register and two special Prefix and Negative Prefix instructions.

Extra registers are used to store the head and tail pointers of two linked lists of process workspace headers - these make up the high and low priority run-time process queues. The hardware scheduler takes a new process from one of these queues whenever it suspends the current process (due to time-slicing or communication).


The INMOS Transputer
Action on Context Switching

Each process runs until it communicates, is time-sliced or is pre-empted by a higher priority process. Time-slices occur at the next descheduling point - approx 2 ms apart. Pre-emption can occur at any time.

At a context switch the following happens:
• The PC of the stopping process is saved in its workspace at word WSP-1.
• The process pointed to by the processor's queue back pointer (BPtr) is changed to point to the stopping process's WSP.
• On a pre-emptive context switch (only), the registers in the ALU and FPU may need saving.
• The process pointed to by the queue front pointer (FPtr) is unlinked from the process queue, has its stored PC value loaded into the processor, and starts executing.

A context switch takes about 1 µs. This translates to an interrupt rate of about 1,000,000 per second.

[Diagram: a process workspace holding the local variables for the PROC, a pointer to the workspace chain and the saved program counter]


The INMOS Transputer
Joining Transputers Together

Three routing configurations are possible:
• static - nearest-neighbour communications
• any-to-any routing across static configurations
• dynamic configuration with specialised routing devices

Static Configurations
Transputers can be connected together in fixed configurations, which are characterised by:
• number of nodes
• valency - the number of interconnecting arcs (per processor)
• diameter - the maximum number of arcs traversed from point to point
• latency - the time for a message to pass across the network
• point-to-point bandwidth - the message flow rate along a route

Structures [diagrams of example interconnection structures]


Beowulf Clusters
Introduction

Mass-market competition has driven down the prices of subsystems: processors, motherboards, disks, network cards, etc.

Development of publicly available software: Linux, GNU compilers, PVM and MPI libraries.
PVM - Parallel Virtual Machine (allows many inter-linked machines to be combined as one parallel machine)
MPI - Message Passing Interface (similar to PVM)

High Performance Computing groups have many years of experience working with parallel algorithms.

The history of MIMD computing shows many academic groups and commercial vendors building machines based on "state-of-the-art" processors, BUT they always needed special "glue" chips or one-of-a-kind interconnection schemes.
This leads to interesting research and new ideas, but often results in one-off machines with a short life cycle.
It also leads to vendor-specific code (to use the vendor-specific connections).

Beowulf uses standard parts and the Linux operating system (with MPI - or PVM).


Beowulf Clusters
Introduction

The first Beowulf was built in 1994, with 16 DX4 processors and a 10 Mbit/s Ethernet.
The processors were too fast for a single Ethernet, and Ethernet switches were still much too expensive to use more than one.
So the builders re-wrote the Linux Ethernet drivers and built a channel-bonded Ethernet - network traffic was striped across 2 or more Ethernets.
As 100 Mb/s Ethernet and switches have become cheap, there is less need for channel bonding; this can support 16 200 MHz P6 processors.
The best configuration continues to change, but this does not affect the user. With the robustness of MPI, PVM, Linux (Extreme) and the GNU compilers, programmers have the confidence that what they write today will still work on future Beowulf clusters.

In 1997 CalTech's 140-node cluster ran a problem sustaining a 10.9 Gflop/s rate.


Beowulf Clusters
The Future

Beowulf clusters are not quite Massively Parallel Processors like the Cray T3D: MPPs are typically bigger, have a lower network latency, and a lot of work must be done by the programmer to balance the system.
But the cost effectiveness is such that many people are developing do-it-yourself approaches to HPC and building their own clusters. A large number of computer companies are taking these machines very seriously and offering full clusters.
2002 - a 2096-processor Linux cluster comes in as the 5th fastest computer in the world...
2005 - 4800 2.2 GHz PowerPC cluster is #5 - 42.14 TFlops
       40960 1.4 GHz Itanium is #2 - 114.67 TFlops
       65536 0.7 GHz PowerPC is #1 - 183.5 TFlops
       5000 Opteron (AMD - Cray) is #10 - 20 TFlops


Fastest Supercomputers – June 2006

Rank  Site  Computer  Processors  Year  Rmax  Rpeak

1 LLNL US Blue Gene – IBM 131072 2005 280600 367000

2 IBM US Blue Gene –IBM 40960 2005 91290 114688

3 LLNL US ASCI Purple IBM 12208 2006 75760 92781

4 NASA US Columbia – SGI 10160 2004 51870 60960

5 CEA, France Tera 10, Bull SA 8704 2006 42900 55705.6

6 Sandia US Thunderbird – Dell 9024 2006 38270 64972.8

7 GSIC, Japan TSUBAME - NEC/Sun 10368 2006 38180 49868.8

8 Julich, Germany Blue Gene – IBM 16384 2006 37330 45875

9 Sandia, US Red Storm - Cray Inc. 10880 2005 36190 43520

10 Earth Simulator, Japan Earth-Simulator, NEC 5120 2002 35860 40960

11 Barcelona Super Computer Centre, Spain MareNostrum – IBM 4800 2005 27910 42144

12 ASTRON/University Groningen, Netherlands Stella (Blue Gene) – IBM 12288 2005 27450 34406.4

13 Oak Ridge, US Jaguar - Cray Inc. 5200 2005 20527 24960

14 LLNL, US Thunder - Digital Corporation 4096 2004 19940 22938

15 Computational Biology Research Center, Japan Blue Protein (Blue Gene) –IBM 8192 2005 18200 22937.6

16 Ecole Polytechnique, Switzerland Blue Gene - IBM 8192 2005 18200 22937.6

17 High Energy Accelerator Research Organization, Japan KEK/BG Sakura (Blue Gene) – IBM 8192 2006 18200 22937.6

18 High Energy Accelerator Research Organization, Japan KEK/BG Momo (Blue Gene) – IBM 8192 2006 18200 22937.6

19 IBM Rochester, On Demand Deep Computing Center, US Blue Gene - IBM 8192 2006 18200 22937.6

20 ERDC MSRC, United States Cray XT3 - Cray Inc. 4096 2005 16975 21299


Shared Memory Systems
Introduction

The earliest form of co-operating processors used shared memory as the communication medium.
Shared memory involves connecting the buses of several processors together so that either:
• all memory accesses for all processors share the bus; or
• just inter-processor communication accesses share the common memory.
Clearly the latter involves less contention.

Shared memory systems typically operate under the control of a single operating system, either:
• with one master processor and several slaves; or
• with all processors running separate copies of the OS, maintaining a common set of VM and process tables.


Shared Memory Systems
The Shared-Memory Programming Model

Ideally a programmer wants each process to have access to a contiguous area of memory - how is unimportant.
Somewhere in the memory map will be sections of memory which are also accessible by other processes.

How do we implement this? We certainly need caches (for speed) and VM, secondary storage, etc. (for flexibility).

[Diagram: several processors, each with a local cache, sharing main memory and secondary memory through a shared virtual address space. Notice that cache consistency issues are introduced as soon as multiple caches are provided.]


Shared Memory Systems
Common Bus Structures

A timeshared common bus arrangement can provide the interconnection required:

[Diagram: several processors (P) and memory attached to a single shared bus]

A common bus provides:
• contention resolution between the processors
• limited bandwidth, shared by all processors
• single-point failure modes
• cheap(ish) hardware - although speed requirements and complex wiring add to the expense
• easy, but non-scalable, expansion


Shared Memory Systems
Common Bus Structures (cont'd)

Adding caches, extra buses (making a crossbar arrangement) and multiport memory can help.

[Diagram: processors with local caches connected through a crossbar of buses to multiple cached memory modules]


Shared Memory Systems
Kendall Square Research KSR1

One of the most recent shared memory architectures is the Kendall Square Research KSR1, which implements the virtual memory model across multiple memories, using a layered cacheing scheme.

The KSR1 processors are proprietary:
• 64-bit superscalar; issues 1 integer and 2 chained FP instructions per 50 ns cycle, giving a peak integer and FP performance of 20 MIPS / 40 MFLOPS
• each processor has 256 Kbytes of local instruction cache and 256 Kbytes of local data cache
• there is a 40-bit global addressing scheme
1088 (32*34) processors can be attached in the current KSR1 architecture.
Main memory comprises 32 Mbytes of DRAM per Processor Environment, connected in a hierarchically cached scheme.
If a page is not held in one of the 32 Mbyte caches, it is stored on secondary memory (disc - as with any other system).


Shared Memory Systems
KSR1 Processor Interconnection

The KSR1 architecture connects the caches on each processor with a special memory controller called the Allcache Engine. Several such memory controllers can be connected.

[Diagram: a level-0 router directory connecting a ring of cells, each comprising a cell interconnect unit, a 32 MB main cache and a processor (P) with its 256 KB cache]


Shared Memory Systems
KSR1 Processor Interconnection (cont'd)

The Allcache Engine at the lowest level (level-0) provides:
• connections to all the 32 Mbyte caches on the processor cells
• up to 32 processors may be present in each ring
The level-0 Allcache Engine features:
• a 16-bit wide slotted ring, which synchronously passes packets between the interconnect cells (i.e. every path can carry a packet simultaneously)
• each ring carries 8 million packets per second
• each packet contains a 16-byte header and 128 bytes of data
• this gives a total throughput of 1 Gbyte per second
• each router directory contains an entry for each sub-page held in the main cache memory (below)
• requests for a sub-page are made by the cell interconnect unit, passed around the ring, and satisfied by data if it is found in the other level-0 caches


Shared Memory Systems
KSR1 Processor Interconnection (cont'd)

KSR1 Higher Level Routers
In order to connect more than 32 processors, a second layer of routing is needed.
This contains up to 34 Allcache router directory cells, plus the main level-1 directory, which permits connection to level 2.

[Diagram: a hierarchy of rings - level-0 rings of 32 processors, joined by a level-1 ring scaling to 1088 processors; a level-2 ring would be unaffordable, with minimal bandwidth per processor]


Shared Memory Systems
KSR1 Processor Interconnection (cont'd)

The Level-1 Allcache Ring
The routing directories in the level-1 Allcache engine contain copies of the entries in the lower-level tables, so that requests for sub-page information may be sent downwards as well as upwards - the level-1 table is therefore very large.

The higher level packet pipelines carry 4 Gbytes per second of inter-cell traffic.

[Diagram: the level-1 router directory connected to copies of each level-0 Allcache router directory (ARD), which in turn sit above the ARD 0 directories themselves]


Shared Memory Systems
KSR1 Performance

As with all multi-processor machines, maximum performance is obtained when there is no communication.
The layered KSR1 architecture does not scale linearly in bandwidth or latency as processors are added:

Relative Bandwidths
  unit            bandwidth (MByte/s)   shared amongst   fraction (MByte/s)
  256 K subcache  160                   1 PE             160
  32 MB cache     90                    1 PE             90
  level-0 ring    1000                  32 PEs           31
  level-1 ring    4000                  1088 PEs         3.7

Relative Latencies
  location           latency (cycles)
  subcache           2
  cache              18
  ring 0             150
  ring 1             500
  page fault (disc)  400,000

Copied (read-only) sub-pages reside in more than one cache and thus provide low-latency access to constant information.


Shared Memory Systems
KSR1 Performance - how did it succeed?

Like most other parallel architectures, it relies on locality.

Locality justifies the workings of:
• virtual memory systems (working sets)
• caches (hit rates)
• interprocess connection networks

Kendall Square Research claimed that the locality present in massively-parallel programs could be exploited by their architecture.

1991 - the 2nd commercial machine is installed at Manchester Regional Computer Centre
1994 - upgraded to the 64-bit version
1998 - Kendall Square Research went out of business; its patents transferred to Sun Microsystems


The Cray T3D
Introduction

The Cray T3D is the successor to several generations of conventional vector processors. The T3D has since been replaced by the newer T3E, which is much the same as the T3D.
The T3E (with 512 processors) is capable of 0.4 TFlops.
The SV1ex (unveiled 7/11/00) is capable of 1.8 TFlops with 1000 processors - although it is normally delivered as 8-32 processor machines.

[Photographs: the T3D, the watercooled T3E and the SV1]


The Cray T3D
Introduction

Like every other manufacturer, Cray would like to deliver:
• 1000+ processors with GFLOPS performance
• 10s of Gbytes/s per processor of communication bandwidth
• 100 ns interprocessor latency
...but they can't afford to - just yet...

They have tried to achieve these goals by:
• MIMD - multiple co-operating processors will beat small numbers of intercommunicating ones (even vector supercomputers)
• distributed memory
• communication at the memory-access level, keeping latency short and packet size small
• a scalable communications network
• commercial processors (DEC Alpha)


The T3D Network
After simulation, the T3D network was chosen to be a 3D torus (as is the T3E's). Note:

  config.            max latency   average latency
  8-node ring        4 hops        2 hops
  2D, 4*2 torus      3 hops        1.5 hops
  3D, 2*2*2 torus    2 hops        1 hop

(Diagrams: a 4*4 2D torus; a cube drawn as a 4*2 2D torus; a hypercube.)


T3D Macro-architecture
The T3D designers decided that the programmer's view of the architecture should include:
• globally-addressed, physically-distributed memory
• visible topological relationships between PEs
• synchronisation features visible from a high level
Their goal is led by the need to provide a slowly-changing view (to the programmer) from one hardware generation to the next.

T3D Micro-architecture
Rather than develop their own processor, Cray selected the DEC Alpha:
• 0.75 µm CMOS RISC processor core
• 64-bit bus
• 150 MHz, 150 MFLOPS, 300 MIPS (2 instructions/cycle)
• 32 integer and 32 FP registers
• 8 Kbyte instruction and 8 Kbyte data caches
• 43-bit virtual address space


Latency Hiding
The DEC Alpha has a FETCH instruction which allows memory to be loaded into the cache before it is required in an algorithm. This runs asynchronously with the processor. Sixteen FETCHes may be in progress at once - they are FIFO-queued. When data is received, it is slotted into the FIFO, ready for access by the processor. The processor stalls if data is not available at the head of the FIFO when needed.

Stores do not have a latency - they can proceed independently of the processor (data dependencies permitting).
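As a rough software analogue of FETCH (not Cray code: __builtin_prefetch is a GCC/Clang intrinsic, and the look-ahead distance is an arbitrary tuning assumption, not a T3D value), the idea can be sketched in C:

    #include <stdio.h>
    #include <stddef.h>

    /* Illustrative only: overlap memory fetches with computation by
     * requesting data a fixed distance ahead of its use, in the spirit
     * of the Alpha FETCH instruction. */
    #define PREFETCH_AHEAD 16

    static double sum_with_prefetch(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)   /* request a later element now... */
                __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
            sum += a[i];                  /* ...while useful work continues */
        }
        return sum;
    }

    int main(void)
    {
        double a[1024];
        for (int i = 0; i < 1024; i++) a[i] = 1.0;
        printf("sum = %g\n", sum_with_prefetch(a, 1024));
        return 0;
    }

The loop keeps doing useful work while the fetch for a later element is in flight - the same overlap the FETCH FIFO provides in hardware.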

Synchronisation
Barrier Synchronisation
• no process may advance beyond the barrier until all processes have arrived (see the pthreads sketch below)
• used as a break between two blocks of code with data dependencies
• supported in hardware - 16 special registers; bits are set to 1 on barrier creation, set to 0 by each arriving process, and a hardware interrupt fires on completion
Messaging (a form of synchronisation)
• T3D exchanges 32-byte messages + a 32-byte control header
• Messages are queued at the target PE, and returned to the sender PE's queue if the target queue is full
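The T3D barrier is a hardware register, but the semantics can be sketched with a software analogue - a minimal POSIX-threads barrier in C (thread count and phase names are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("PE %ld: phase 1 (produce data)\n", id);
        pthread_barrier_wait(&barrier);   /* no PE proceeds until all arrive */
        printf("PE %ld: phase 2 (consume other PEs' data)\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NWORKERS];
        pthread_barrier_init(&barrier, NULL, NWORKERS);
        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }

Each "PE" here is a thread; pthread_barrier_wait plays the role of the hardware barrier bit, releasing everyone only when the last process arrives.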


The Connection Machine Family

Introduction
The Connection Machine family of supercomputers has developed since the first descriptions were published in 1981. Today the CM-5 is one of the fastest available supercomputers.

In 1981 the philosophy of the CM founders was for a machine capable of sequential program execution, but where each instruction was spread to use lots of processors.

The CM-1 had 65,536 processors organised in a layer between two communicating planes:

(Diagram: a Host drives a broadcast control network and a hypercube data network; between the two networks lies a plane of 65,536 P-M cells, where P = single-bit processor and M = 4 kbit memory. Total memory = 32 Mbytes.)


Introduction (cont'd)
Each single-bit processor can:
• perform single-bit calculations
• transfer data to its neighbours or via the data network
• be enabled or disabled (for each operation) by the control network and its own stored data

The major lessons learnt from this machine were:
• A new programming model was needed - that of virtual processors. One "processor" could be used per data element and a number of data elements combined onto actual processors (see the sketch below). The massive concurrency makes programming and compiler design clearer
• 32 Mbytes was not enough memory (even in 1985!)
• It was too expensive for AI - but physicists wanted the raw processing power
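A minimal sketch of the virtual-processor idea in C (the cyclic assignment below is one obvious mapping, an assumption - the notes don't specify how elements are combined onto physical processors):

    #include <stdio.h>

    #define V 65536   /* virtual processors: one per data element */
    #define P 1024    /* physical processors actually available   */

    /* Work done by one physical processor: it emulates every virtual
     * processor whose index is congruent to its own. */
    static void physical_processor(int pid, int *data)
    {
        for (int v = pid; v < V; v += P)
            data[v] += 1;               /* the per-element operation */
    }

    int main(void)
    {
        static int data[V];
        for (int pid = 0; pid < P; pid++)   /* run serially here; on the  */
            physical_processor(pid, data);  /* CM these run concurrently  */
        printf("data[0] = %d, data[V-1] = %d\n", data[0], data[V - 1]);
        return 0;
    }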


The Connection Machine 2
This was an enlarged CM-1 with several enhancements. It had:
• 256 kbit DRAM per CPU
• Clusters of 32 bit-serial processors, each cluster augmented by a floating-point chip (2048 in total)
• Parallel I/O added to the processors - 40 discs (RAID - Redundant Array of Inexpensive Disks) and a graphics frame buffer
In addition, multiple hosts could be added to support multiple users; the plane of small processors could be partitioned.

Architectural Lessons:
• Programmers used a high-level language (Fortran 90) rather than a lower-level parallel language. F90 contains array operators, which provide the parallelism directly (see the sketch below). The term data parallel was coined for this style of computation
• Array operators compiled into instructions sent to separate vector or bit processors
• This SIMD programming model gives synchronisation between data elements in each instruction, but a MIMD processor engine doesn't need such constraints
• Differences between shared (single address space) and distributed memory blur
• The data network now carries messages which correspond to memory accesses
• The compiler places memory and computations optimally, but statically
• Multiple hosts are awkward compared with a single timesharing host
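In F90 the data-parallel style is a whole-array statement such as C = A + B. As a rough C/OpenMP analogue (not CM code; the array size is illustrative), the point is that the operation carries no visible loop dependencies, so the elements may be spread across processors:

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10.0 * i; }

        /* Data-parallel elementwise add: the F90 array statement C = A + B
         * says "apply this to every element"; the compiler/runtime is free
         * to distribute the elements.  OpenMP states that freedom explicitly. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("c[%d] = %g\n", i, c[i]);
        return 0;
    }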


The Connection Machine 5
This architecture is more orthogonal than the earlier ones. It just uses larger multi-bit processors, with a communication architecture similar to the CM-1 and CM-2.

Design Goals were:
• > 1 TFLOPS
• Several Tbytes of memory
• > 1 Tbit/s of I/O bandwidth

(Diagram: as for the CM-1 - a Host, a broadcast control network and a hypercube data network - but host (H) and worker (W) processors are identical apart from hosts having more memory, and I/O attaches to the data network.)


The CM-5 Processor
To save on development effort, CM used a common SPARC RISC processor for all the hosts and workers. RISC CPUs are optimised for workstations, so they added extra hardware and fast memory paths.

Each node has:
• 32 Mbytes of memory
• A network interface
• Vector processors capable of up to 128 MFLOPS
• Vector-to-memory bandwidth of 0.5 Gbytes/s
Caching doesn't really work here.

(Diagram: a CM-5 node - a SPARC with cache on a 64-bit main bus, 32 Mbytes of memory, an I/O interface, and four vector processors with 0.5 Gbyte/s vector ports into the memory.)


Fastest Supercomputers - June 2006 (Rmax and Rpeak in GFLOPS)

Rank  Site                                                  Computer                         Processors  Year  Rmax    Rpeak
 1    LLNL, US                                              Blue Gene - IBM                  131072      2005  280600  367000
 2    IBM, US                                               Blue Gene - IBM                  40960       2005  91290   114688
 3    LLNL, US                                              ASCI Purple - IBM                12208       2006  75760   92781
 4    NASA, US                                              Columbia - SGI                   10160       2004  51870   60960
 5    CEA, France                                           Tera 10 - Bull SA                8704        2006  42900   55705.6
 6    Sandia, US                                            Thunderbird - Dell               9024        2006  38270   64972.8
 7    GSIC, Japan                                           TSUBAME - NEC/Sun                10368       2006  38180   49868.8
 8    Julich, Germany                                       Blue Gene - IBM                  16384       2006  37330   45875
 9    Sandia, US                                            Red Storm - Cray Inc.            10880       2005  36190   43520
10    Earth Simulator, Japan                                Earth-Simulator - NEC            5120        2002  35860   40960
11    Barcelona Super Computer Centre, Spain                MareNostrum - IBM                4800        2005  27910   42144
12    ASTRON/University Groningen, Netherlands              Stella (Blue Gene) - IBM         12288       2005  27450   34406.4
13    Oak Ridge, US                                         Jaguar - Cray Inc.               5200        2005  20527   24960
14    LLNL, US                                              Thunder - Digital Corporation    4096        2004  19940   22938
15    Computational Biology Research Center, Japan          Blue Protein (Blue Gene) - IBM   8192        2005  18200   22937.6
16    Ecole Polytechnique, Switzerland                      Blue Gene - IBM                  8192        2005  18200   22937.6
17    High Energy Accelerator Research Org., Japan          KEK/BG Sakura (Blue Gene) - IBM  8192        2006  18200   22937.6
18    High Energy Accelerator Research Org., Japan          KEK/BG Momo (Blue Gene) - IBM    8192        2006  18200   22937.6
19    IBM Rochester, On Demand Deep Computing Center, US    Blue Gene - IBM                  8192        2006  18200   22937.6
20    ERDC MSRC, US                                         Cray XT3 - Cray Inc.             4096        2005  16975   21299


History of Supercomputers
1966/7: Michael Flynn's Taxonomy & Amdahl's Law
1976: Cray Research delivers 1st Cray-1 to LANL
1982: Fujitsu ships 1st VP200 vector machine ~500 MFlops
1985: CM-1 demonstrated to DARPA
1988: Intel delivers iPSC/2 hypercubes
1990: Intel produces iPSC/860 hypercubes
1991: CM5 announced
1992: KSR1 delivered
1992: Maspar delivers its SIMD machine - MP2
1993: Cray delivers Cray T3D
1993: IBM delivers SP1
1994: SGI Power Challenge
1996: Hitachi Parallel System
1997: SGI/Cray Origin 2000 delivered to LANL - 0.7 TFlops
1997: Intel Paragon (ASCI Red) 2.3 TFlops to Sandia Nat Lab
1998: Cray T3E delivered to US military - 0.9 TFlops
2000: IBM (ASCI White) 7.2 TFlops to Lawrence Livermore NL
2002: HP (ASCI Q) 7.8 TFlops to Los Alamos Nat Lab
2002: NEC Earth Simulator, Japan - 36 TFlops
2002: 5th fastest machine in the world is a Linux cluster (2304 processors)



The fundamentals of Computing have remained unchanged for 70 years

• During all of the rapid development of computers in that time, little has changed since Turing and von Neumann.

Quantum computers are potentially different.

• They employ quantum-mechanical principles that expand the range of operations possible on a classical computer.

• The three main differences between classical and quantum computers are:

• Fundamental unit of information is a qubit

• Range of logical operations

• Process of determining the state of the computer


Qubits

Classical computers are built from bits: two states, 0 or 1.

Quantum computers are built from qubits: physical systems which possess states analogous to 0 and 1, but which can also be in states between 0 and 1.

The intermediate states are known as superposition states.

A qubit - in a sense - can store much more information than a bit.
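As a minimal sketch (standard quantum-computing notation, not from the notes): a qubit's state is |psi> = a|0> + b|1>, with complex amplitudes satisfying |a|^2 + |b|^2 = 1, and the squared magnitudes give the probabilities of reading 0 or 1:

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    /* A qubit state |psi> = a0|0> + a1|1>, with |a0|^2 + |a1|^2 = 1. */
    typedef struct { double complex a0, a1; } qubit;

    int main(void)
    {
        /* An equal superposition: the qubit "stores" 0 and 1 at once. */
        qubit q = { 1.0 / sqrt(2.0), 1.0 / sqrt(2.0) };

        double p0 = pow(cabs(q.a0), 2);   /* probability of reading 0 */
        double p1 = pow(cabs(q.a1), 2);   /* probability of reading 1 */
        printf("P(0) = %.3f, P(1) = %.3f (sum = %.3f)\n", p0, p1, p0 + p1);
        return 0;
    }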


Range of logical operations

Classical computers operate according to binary logic

Quantum logic gates take one or more qubits as input and produce one or more qubits as output.

Qubits have states corresponding to 0 and 1, so quantum logic gates can emulate classical logic gates.

With superposition states between 0 and 1 there is a great expansion in the range of quantum logic gates.

• e.g. quantum logic gates that take 0 and 1 as input and produce as output different superposition states between 0 and 1 – no classical analogue

This expanded range of quantum gates can be exploited to achieve greater information processing power in quantum computers
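One concrete example of such a gate (standard material, assumed here rather than taken from the notes) is the Hadamard gate H = (1/sqrt 2)[[1, 1], [1, -1]], which sends the classical inputs 0 and 1 to superposition states - exactly the "no classical analogue" behaviour described above:

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    typedef struct { double complex a0, a1; } qubit;

    /* Hadamard gate: takes |0> and |1> to equal superpositions. */
    static qubit hadamard(qubit q)
    {
        double s = 1.0 / sqrt(2.0);
        qubit r = { s * (q.a0 + q.a1), s * (q.a0 - q.a1) };
        return r;
    }

    int main(void)
    {
        qubit zero = { 1.0, 0.0 };           /* classical 0 */
        qubit h = hadamard(zero);            /* (|0> + |1>) / sqrt 2 */
        printf("H|0>  amplitudes: %.3f, %.3f\n", creal(h.a0), creal(h.a1));

        qubit back = hadamard(h);            /* H is its own inverse */
        printf("HH|0> amplitudes: %.3f, %.3f\n", creal(back.a0), creal(back.a1));
        return 0;
    }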


Determining the State of the Computer

In classical computers we can read out the state of all the bits in the computer at any time.

In a Quantum computer it is in principle impossible to determine the exact state of the computer.

i.e. we can’t determine exactly which superposition state is being stored in the qubits making up the computer

We can only obtain partial information about the state of the computer

Designing algorithms is a delicate balance between exploiting the expanded range of states and logical operations and the restricted readout of information.
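A sketch of the readout restriction (illustrative; rand() stands in for quantum randomness): a measurement yields a single classical bit drawn from the amplitudes, and the superposition is destroyed:

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef struct { double complex a0, a1; } qubit;

    /* Measurement: returns 0 or 1 at random according to the amplitudes,
     * and collapses the qubit - the superposition cannot be recovered. */
    static int measure(qubit *q)
    {
        double p0 = pow(cabs(q->a0), 2);
        int bit = ((double)rand() / RAND_MAX < p0) ? 0 : 1;
        q->a0 = (bit == 0) ? 1.0 : 0.0;   /* collapse to the observed state */
        q->a1 = (bit == 1) ? 1.0 : 0.0;
        return bit;
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        int ones = 0;
        for (int trial = 0; trial < 1000; trial++) {
            qubit q = { 1.0 / sqrt(2.0), 1.0 / sqrt(2.0) };  /* fresh superposition */
            ones += measure(&q);   /* each readout yields just one bit */
        }
        printf("1 observed %d/1000 times (expect ~500)\n", ones);
        return 0;
    }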


(Diagram: a single photon meets a beam-splitter, with detectors A and B on the two output paths; the photon has an equal probability of reaching A or B.)

What actually happens? Does the photon travel each path at random?

(Diagram: two beam-splitters and two mirrors arranged so that the two paths recombine - a Mach-Zehnder-style interferometer - with detectors A and B at the outputs.)

What actually happens here? If the path lengths are the same, photons always hit A. A single photon travels both routes simultaneously.


Photons travel both paths simultaneously.

If we block either of the paths then A or B become equally probable

This is quantum interference, and it applies not just to photons but to all particles and physical systems.
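The interference arithmetic can be sketched numerically (a standard textbook model, assumed here: each 50/50 beam-splitter acts on the two path amplitudes as the unitary B = (1/sqrt 2)[[1, i], [i, 1]]):

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    /* Amplitudes for the photon being on path 0 or path 1. */
    typedef struct { double complex p0, p1; } paths;

    /* A 50/50 beam-splitter as the unitary B = (1/sqrt 2)[[1, i], [i, 1]]. */
    static paths beamsplit(paths v)
    {
        double s = 1.0 / sqrt(2.0);
        paths r = { s * (v.p0 + I * v.p1), s * (I * v.p0 + v.p1) };
        return r;
    }

    static void report(const char *label, paths v)
    {
        printf("%s P(path 0) = %.2f, P(path 1) = %.2f\n",
               label, pow(cabs(v.p0), 2), pow(cabs(v.p1), 2));
    }

    int main(void)
    {
        paths in = { 1.0, 0.0 };      /* photon enters on path 0 */

        /* Both paths open: the amplitudes cancel at one detector and
         * reinforce at the other - every photon exits on one path ("A"). */
        report("both open:", beamsplit(beamsplit(in)));

        /* Block path 1 between the splitters: the interference is gone;
         * the photons that survive split 50/50 (the other half are absorbed,
         * so each detector prints probability 0.25). */
        paths mid = beamsplit(in);
        mid.p1 = 0.0;
        report("blocked:  ", beamsplit(mid));
        return 0;
    }

With both paths open the amplitudes cancel at one detector and reinforce at the other; zeroing one path's amplitude removes the cancellation and the detector counts return to being equal.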

Quantum computation is all about making this effect work for us.

In this case the photon is in a coherent superposition of being on both paths at the same time.

Any qubit can be prepared in a superposition of two logical states – a qubit can store both 0 and 1 simultaneously, and in arbitrary proportions.

Any quantum system with at least two discrete states can be used as a qubit – e.g. energy levels in an atom, photons, trapped ions, spins of atomic nuclei…..


Once the qubit is measured, however, only one of the two values it stores can be detected at random – just like the photon is detected on only one of the two paths.

Not very useful – but….

Consider a traditional 3-bit register: it can represent 8 different numbers, 000 - 111.

A quantum register of 3 qubits can represent 8 numbers at the same time in quantum superposition. The bigger the register the more numbers we can represent at the same time.

A 250 qubit register could hold more numbers than there are atoms in the known universe – all on 250 atoms…..

But we only see one of these if we measure the register's contents.
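A sketch of the register bookkeeping (illustrative): an L-qubit register is described by 2^L complex amplitudes, so a uniform superposition over 3 qubits "holds" all eight numbers at once, even though a measurement returns only one of them:

    #include <math.h>
    #include <stdio.h>

    #define L 3
    #define STATES (1 << L)   /* 2^L = 8 basis states for a 3-qubit register */

    int main(void)
    {
        double amp[STATES];

        /* Uniform superposition: every 3-bit number is present at once,
         * each with probability 1/8. */
        for (int x = 0; x < STATES; x++)
            amp[x] = 1.0 / sqrt((double)STATES);

        for (int x = 0; x < STATES; x++)
            printf("|%d%d%d>  amplitude %.4f  probability %.4f\n",
                   (x >> 2) & 1, (x >> 1) & 1, x & 1,
                   amp[x], amp[x] * amp[x]);
        /* A measurement would return exactly one of these eight numbers,
         * chosen with the probabilities above. */
        return 0;
    }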

We can now do some real quantum computation…..


Mathematical Operations can be performed at the same time on all the numbers held in the register.

If the qubits are atoms then tuned laser pulses can affect their electronic states so that initial superpositions of numbers evolve into different superpositions.

Basically a massively parallel computation

Can perform a calculation on 2^L numbers (L = number of qubits) in a single step, which would take 2^L steps - or 2^L processors - in a conventional architecture.

Only good for certain types of computation….

NOT information storage - it can hold many states at once, but we can only see one of them.

Quantum interference allows us to obtain a single result that depends logically on all 2^L of the intermediate results.
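A sketch of what "one step on 2^L numbers" means (illustrative; the function f is made up): the quantum step applies a reversible function to every basis state of the register at once, whereas the classical simulation below must sweep all 2^L amplitudes to achieve the same effect:

    #include <math.h>
    #include <stdio.h>

    #define L 4
    #define STATES (1 << L)   /* 2^L basis states */

    int main(void)
    {
        /* Register in uniform superposition over all 2^L inputs x. */
        double in[STATES], out[STATES] = { 0.0 };
        for (int x = 0; x < STATES; x++)
            in[x] = 1.0 / sqrt((double)STATES);

        /* One "step": apply the reversible map x -> (x + 1) mod 2^L to
         * every superposed input simultaneously.  The quantum machine
         * performs this once; the classical simulation loops 2^L times. */
        for (int x = 0; x < STATES; x++)
            out[(x + 1) % STATES] += in[x];

        printf("amplitude of |1> after the step: %.4f\n", out[1]);
        return 0;
    }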


Grover’s Algorithm

Searches an unsorted list of N items in only √N steps.

Conventionally this scales as N/2 - by brute-force searching. The quantum computer can search all the entries at the same time.

BUT if the QC is merely programmed to print out the result at that point, it will not be any faster than a conventional system.

Only one of the N paths would check the entry we are looking for, so the probability that measuring the computer's state gives us the correct answer is only 1/N - we would need about as many repetitions as a conventional search.

BUT if we leave the information in the computer, unmeasured, a further quantum operation can cause it to affect the other paths. If we repeat the operation about √N times, a measurement will return information about which entry contains the desired number with a probability of 0.5. Repeating just a few more times will find the entry with a probability extremely close to 1.

This can be turned into a very useful tool for searching, minimisation, or evaluation of the mean.
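The amplitude mechanics can be sketched classically (a standard simulation of Grover's iteration; the list size and marked index are made up): each iteration flips the sign of the marked entry's amplitude, then reflects all amplitudes about their mean; after about (pi/4)·√N iterations nearly all the probability sits on the marked entry:

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define N      64     /* size of the unsorted "list" (illustrative) */
    #define TARGET 42     /* index of the marked entry (illustrative)   */

    int main(void)
    {
        double amp[N];

        /* Start in a uniform superposition over all N entries. */
        for (int i = 0; i < N; i++)
            amp[i] = 1.0 / sqrt((double)N);

        int steps = (int)floor(M_PI / 4.0 * sqrt((double)N));   /* ~ sqrt(N) */
        for (int s = 0; s < steps; s++) {
            amp[TARGET] = -amp[TARGET];            /* oracle marks the entry */

            double mean = 0.0;                     /* inversion about the mean */
            for (int i = 0; i < N; i++) mean += amp[i];
            mean /= N;
            for (int i = 0; i < N; i++) amp[i] = 2.0 * mean - amp[i];
        }

        /* For N = 64 this prints ~0.997: after ~sqrt(N) iterations nearly
         * all the probability sits on the marked entry, versus 1/64 for a
         * single random guess. */
        printf("after %d iterations: P(measure TARGET) = %.3f\n",
               steps, amp[TARGET] * amp[TARGET]);
        return 0;
    }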


Cryptanalysis

The biggest anticipated use of quantum computing is in cracking encrypted data.

Cracking DES (the Data Encryption Standard) requires a search among 2^56 keys.

Conventionally, even at a million keys per second, this takes more than 1000 years.

A QC using Grover's algorithm needs only about √(2^56) = 2^28 ≈ 2.7×10^8 evaluations - a few minutes at the same rate.

Factorisation is the key to the RSA encryption system.

Conventionally, the time taken to factorise a number increases exponentially with the number of digits.

The largest number ever factorised conventionally contained 129 digits.

There is no way to factorise a 1000-digit number - conventionally…..

A QC (running Shor's factoring algorithm) could do this in a fraction of a second.

Already a big worry for data security - it may be only a matter of years before this becomes available.


Decoherence: the obstacle to quantum computation

For a qubit to work successfully it must remain in an entangled quantum superposition of states.

As soon as we measure the state it collapses to a single value.

This happens even if we make the measurement by accident

(Diagrams: a double-slit experiment with a source, and the same arrangement with a spin "marker" particle beside the left slit.)

In a conventional double-slit experiment, the wave amplitudes corresponding to an electron (or photon) travelling along the two possible paths interfere. If a particle with spin is placed close to the left slit, a passing electron will flip the spin. This "accidentally" records which path the electron took, and causes the loss of the interference pattern.


In reality it is very difficult to prevent qubits from interacting with the rest of the world.

The best solution (so far) is to build quantum computers with fault-tolerant designs using error-correction procedures.

The result is that we need more qubits - between 2 and 5 times the number needed in an "ideal world".