TK6123: COMPUTER ORGANISATION & ARCHITECTURE

Lecture 8: CPU and Memory (3)

Prepared By: Associate Prof. Dr Masri Ayob

Contents

This lecture will discuss:
• Cache.
• Error Correcting Codes.

The Memory Hierarchy

Trade-off: cost, capacity and access time.
• Faster access time, greater cost per bit.
• Greater capacity, smaller cost per bit.
• Greater capacity, slower access time.

Transfer Rate - rate at which data can be moved.

Access time - the time it takes to perform a read or write operation.

Memory cycle time - the access time plus the time required for the memory to “recover” before the next access, i.e. access + recovery.

Memory Hierarchies

A five-level memory hierarchy.

Hierarchy List

Registers

L1 Cache

L2 Cache

Main memory

Disk cache

Disk

Optical

Tape

Internal memory

External memory

Decreasing cost/bit, increasing capacity, and slower access time

Hierarchy List

It would be nice to use only the fastest memory, but because that is the most expensive memory:
• We trade off access time for cost by using more of the slower memory.
• The design challenge is to organise the data and programs in memory so that the accessed memory words are usually in the faster memory.

Hierarchy List

In general, it is likely that most future accesses to main memory by the processor will be to locations recently accessed.
• So the cache automatically retains a copy of some of the recently used words from the DRAM.
• If the cache is designed properly, then most of the time the processor will request memory words that are already in the cache.

Hierarchy List

No one technology is optimal in satisfying the memory requirements for a computer system.
• As a consequence, the typical computer system is equipped with a hierarchy of memory subsystems:
  • some internal to the system (directly accessible by the processor) and
  • some external (accessible by the processor via an I/O module).

Cache

Small amount of fast memory

Sits between normal main memory and CPU

May be located on CPU chip or module

A cache slot is also called a cache line.

Cache

The cache contains a copy of portions of main memory.
• When the processor attempts to read a word of memory, a check is made to determine if the word is in the cache.
• If so (hit), the word is delivered to the processor.
• If not (miss), a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor.

Cache

• Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block.

The ratio of hits to the total number of requests is known as the hit ratio.

Cache/Main Memory Structure

Cache operation – overview

CPU requests contents of memory location

Check cache for this data

If present, get from cache (fast)

If not present, read required block from main memory to cache

Then deliver from cache to CPU

Cache includes tags to identify which block of main memory is in each cache slot
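The read sequence above can be sketched in a few lines of Python. This is a toy illustration rather than a model of real hardware: the class name `SimpleCache`, the block size of 4 words, and the rule of evicting the block that was loaded earliest are all assumptions made for the example.

```python
BLOCK_SIZE = 4  # words per block (assumed for illustration)


class SimpleCache:
    """Toy cache: maps a block number (the tag) to a copy of that block."""

    def __init__(self, memory, num_lines):
        self.memory = memory        # main memory, modelled as a list of words
        self.num_lines = num_lines  # number of cache slots
        self.lines = {}             # tag -> list of words in the block
        self.hits = 0
        self.requests = 0

    def read(self, address):
        self.requests += 1
        tag, offset = divmod(address, BLOCK_SIZE)
        if tag in self.lines:
            self.hits += 1          # hit: the word is already in the cache
        else:
            if len(self.lines) >= self.num_lines:
                # Cache full: evict the block loaded earliest (assumed policy;
                # Python dicts preserve insertion order).
                del self.lines[next(iter(self.lines))]
            start = tag * BLOCK_SIZE
            self.lines[tag] = self.memory[start:start + BLOCK_SIZE]
        return self.lines[tag][offset]  # always delivered from the cache

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0


memory = list(range(100, 164))           # 64 words of dummy data
cache = SimpleCache(memory, num_lines=4)
values = [cache.read(a) for a in range(16)]
print(values == memory[:16], cache.hit_ratio())  # True 0.75
```

Sequential reads miss once per block and then hit on the remaining three words of that block, so 12 of the 16 requests are hits.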

Cache Operation

Cache Design

Size

Mapping Function

Replacement Algorithm

Write Policy

Block Size

Number of Caches – L1, L2, L3 etc.

Size does matter

Cost
• More cache is expensive

Speed
• More cache is faster (up to a point)
• Checking cache for data takes time

The larger the cache, the larger the number of gates involved in addressing the cache. The result is that large caches tend to be slightly slower than small ones.

We would like the size of the cache to be small enough so that the overall average cost per bit is close to that of main memory alone and large enough so that the overall average access time is close to that of the cache alone.
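One common textbook way to put numbers on this trade-off, which the slides do not state explicitly, is a two-level model in which a miss costs a cache check plus a main memory access. The formulas, function names, and figures below are assumptions chosen for illustration.

```python
def average_access_time(hit_ratio, t_cache, t_main):
    """Two-level model: a miss costs a cache check plus a main memory access."""
    return hit_ratio * t_cache + (1 - hit_ratio) * (t_cache + t_main)


def average_cost_per_bit(cost_cache, size_cache, cost_main, size_main):
    """Total cost of both levels divided by their total capacity."""
    total_cost = cost_cache * size_cache + cost_main * size_main
    return total_cost / (size_cache + size_main)


# Hypothetical figures: 10 ns cache, 100 ns main memory.
for h in (0.5, 0.9, 0.99):
    print(h, average_access_time(h, t_cache=10, t_main=100))
```

Under this model the average access time approaches the cache's access time as the hit ratio approaches 1, while the average cost per bit stays close to that of main memory as long as the cache is a small fraction of total capacity.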

Comparison of Cache Sizes

Processor | Type | Year of introduction | L1 cache | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | - | -
PDP-11/70 | Minicomputer | 1975 | 1 KB | - | -
VAX 11/780 | Minicomputer | 1978 | 16 KB | - | -
IBM 3033 | Mainframe | 1978 | 64 KB | - | -
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | - | -
Intel 80486 | PC | 1989 | 8 KB | - | -
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | -
PowerPC 601 | PC | 1993 | 32 KB | - | -
PowerPC 620 | PC | 1996 | 32 KB/32 KB | - | -
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | -
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | -
IBM SP | High-end server/supercomputer | 2000 | 64 KB/32 KB | 8 MB | -
CRAY MTA | Supercomputer | 2000 | 8 KB | 2 MB | -
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | -
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | -

Cache: Mapping Function

There are fewer cache lines than main memory blocks:
• An algorithm is needed for mapping main memory blocks into cache lines.
• Three techniques:
  • Direct
  • Associative
  • Set associative

Direct Mapping

Each block of main memory maps to only one cache line
• i.e. if a block is in cache, it must be in one specific place.

Pros & cons
• Simple
• Inexpensive
• Fixed location for given block
  • If a program repeatedly accesses 2 blocks that map to the same line, the cache miss rate is very high
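The conflict problem can be seen in a small sketch. It assumes the usual rule that block j is placed in line j mod m, where m is the number of cache lines; the slides do not give the formula, and the function name is made up for this example.

```python
def direct_mapped_misses(block_refs, num_lines):
    """Count misses for a sequence of block numbers in a direct-mapped cache."""
    lines = [None] * num_lines          # block currently held by each line
    misses = 0
    for block in block_refs:
        index = block % num_lines       # the one line this block may occupy
        if lines[index] != block:
            misses += 1
            lines[index] = block        # no choice: replace whatever is there
    return misses


# Blocks 0 and 8 both map to line 0 of an 8-line cache and evict each other.
print(direct_mapped_misses([0, 8] * 5, num_lines=8))   # 10
# Blocks 0 and 1 map to different lines, so only the first touches miss.
print(direct_mapped_misses([0, 1] * 5, num_lines=8))   # 2
```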

Associative Mapping

A main memory block can load into any line of cache

Memory address is interpreted as tag and word

Tag uniquely identifies block of memory

Every line’s tag is examined for a match

Disadvantage:
• Cache searching gets expensive
• Complex circuitry is required to examine the tags of all cache lines in parallel.
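As a rough sketch of how an address is interpreted under associative mapping, the low-order bits select the word within the block and all remaining bits form the tag. The function name and the 2-bit word field are assumptions for illustration.

```python
def split_address(address, word_bits):
    """Split an address into (tag, word) for a fully associative cache."""
    tag = address >> word_bits                 # identifies the memory block
    word = address & ((1 << word_bits) - 1)    # position within the block
    return tag, word


# With 4 words per block (2 word bits), address 0b110110 is word 2 of block 13.
print(split_address(0b110110, word_bits=2))    # (13, 2)
```

Because the tag alone identifies the block, a lookup has to compare it against the tag stored in every line, which is what makes the hardware expensive.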

Set Associative Mapping

A compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages.

Cache is divided into a number of sets.

Each set contains a number of lines.

A given block maps to any line in a given set
• e.g. Block B can be in any line of set i.

Set Associative Mapping

With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache.

With k-way set associative mapping, the tag in a memory address is much smaller and is only compared to the k tags within a single set.
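The difference in tag size can be estimated with a short calculation. It assumes power-of-two sizes and that the set is selected by the low-order bits of the block number; the function name and the 24-bit address, 4-word block, 16384-line configuration are hypothetical.

```python
def tag_bits(address_bits, word_bits, num_lines, ways):
    """Tag width when the cache is organised as num_lines / ways sets."""
    num_sets = num_lines // ways
    set_bits = num_sets.bit_length() - 1   # log2, valid for powers of two
    return address_bits - word_bits - set_bits


LINES = 16384
for ways in (1, 2, LINES):   # direct, 2-way set associative, fully associative
    print(ways, tag_bits(24, 2, LINES, ways))
```

In this configuration the tag is 8 bits for direct mapping, 9 bits for 2-way set associative, and 22 bits for fully associative, and the number of tag comparisons per lookup equals the number of ways.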

Replacement Algorithms

When cache memory is full, some block in cache memory must be selected for replacement.

Direct mapping:
• No choice
• Each block only maps to one line
• Replace that line

Replacement Algorithms (2): Associative & Set Associative

Hardware implemented algorithm (speed)
• Least recently used (LRU)
  • An LRU algorithm keeps track of the usage of each block and replaces the block that was last used the longest time ago.
• First in first out (FIFO)
  • Replace the block that has been in cache longest
• Least frequently used (LFU)
  • Replace the block which has had fewest hits
• Random
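Real caches implement these policies in hardware, but the difference between LRU and FIFO can be illustrated in software. The sketch below assumes a fully associative cache and a made-up reference string; `count_misses` is a hypothetical helper, not part of the lecture.

```python
from collections import OrderedDict


def count_misses(block_refs, num_lines, policy="LRU"):
    """Count misses in a fully associative cache under LRU or FIFO."""
    cache = OrderedDict()                  # ordered from oldest to newest
    misses = 0
    for block in block_refs:
        if block in cache:
            if policy == "LRU":
                cache.move_to_end(block)   # a hit refreshes recency under LRU
            # Under FIFO a hit does not change the eviction order.
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the entry at the old end
            cache[block] = True
    return misses


refs = [1, 2, 3, 1, 4, 1, 5]
print(count_misses(refs, 3, "LRU"))    # 5
print(count_misses(refs, 3, "FIFO"))   # 6
```

On this reference string LRU keeps the frequently reused block 1 in the cache, whereas FIFO evicts it simply because it was loaded first.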

Write Policy

Issues:
• Must not overwrite a cache block unless main memory is up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly

Write through

All writes go to main memory as well as cache

Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date

Disadvantage:
• Lots of traffic
• Slows down writes
• Creates a bottleneck.
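A minimal sketch of write through, assuming word-granular caching and a made-up class name `WriteThroughCache`, shows where the traffic comes from: every write reaches main memory, even when the same location is written repeatedly.

```python
class WriteThroughCache:
    """Toy write-through cache: every write updates cache and main memory."""

    def __init__(self, memory):
        self.memory = memory       # main memory, modelled as a list of words
        self.cache = {}            # address -> cached word
        self.memory_writes = 0     # counts traffic to main memory

    def write(self, address, value):
        self.cache[address] = value
        self.memory[address] = value   # main memory is always up to date
        self.memory_writes += 1

    def read(self, address):
        if address not in self.cache:
            self.cache[address] = self.memory[address]
        return self.cache[address]


memory = [0] * 8
cache = WriteThroughCache(memory)
for value in range(1, 6):
    cache.write(3, value)              # five writes to the same word
print(memory[3], cache.memory_writes)  # 5 5
```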

Cache: Line Size

As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality.

As the block grows larger still, the hit ratio begins to fall. Two issues:
• Larger blocks reduce the number of blocks that fit into a cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after they are fetched.
• As a block becomes larger, each additional word is farther from the requested word, therefore less likely to be needed in the near future.

Number of Caches

Multilevel caches:
• On-chip cache:
  • A cache on the same chip as the processor.
  • Reduces the processor’s external bus activity and therefore speeds up execution times and increases overall system performance.

Number of Caches

Multilevel caches:
• External cache: is it still desirable?
  • Yes - most contemporary designs include both on-chip and external caches.
  • E.g. a two-level cache, with the internal cache (L1) and the external cache (L2). Why?
  • If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access DRAM or ROM memory across the bus, which gives poor performance.

Number of Caches

More recently, it has become common to split the cache into two:
• One dedicated to instructions and one dedicated to data.
• There are two potential advantages of a unified cache:
  • For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically.
  • Only one cache needs to be designed and implemented.

Number of Caches

More recently, it has become common to split the cache into two:
• The trend is toward split caches, as in the Pentium and PowerPC, which emphasize parallel instruction execution and the prefetching of predicted future instructions.
• Advantage: it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit.

Intel Cache Evolution

Problem | Solution | Processor on which feature first appears
External memory slower than the system bus. | Add external cache using faster memory technology. | 386
Increased processor speed results in external bus becoming a bottleneck for cache access. | Move external cache on-chip, operating at the same speed as the processor. | 486
Internal cache is rather small, due to limited space on chip. | Add external L2 cache using faster technology than main memory. | 486
Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit’s data access takes place. | Create separate data and instruction caches. | Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access. | Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache. | Pentium Pro
(same problem as above) | Move L2 cache on to the processor chip. | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small. | Add external L3 cache. | Pentium III
(same problem as above) | Move L3 cache on-chip. | Pentium 4

Locality

Why does the principle of locality make sense?
• In most cases, the next instruction to be fetched immediately follows the last instruction fetched (except for branch and call instructions).
• A program remains confined to a rather narrow window of procedure-invocation depth. Thus, over a short period of time, references to instructions tend to be localised to a few procedures.

Locality

Why does the principle of locality make sense?
• Most iterative constructs consist of a relatively small number of instructions repeated many times.
• In many programs, much of the computation involves processing data structures, such as arrays or sequences of records. In many cases, successive references to these data structures will be to closely located data items.

Internal Memory (revision)

Memory Packaging and Types

A group of chips, typically 8 or 16, is mounted on a tiny PCB and sold as a unit.
• SIMM - single inline memory module, has a row of connectors on one side.
• DIMM - dual inline memory module, has a row of connectors on both sides.

A SIMM holding 256 MB. Two of the chips control the SIMM.

Error Correction

Hard Failure
• Permanent defect
• Caused by harsh environmental abuse, manufacturing defects, and wear.

Soft Error
• Random, non-destructive
• No permanent damage to memory
• Caused by power supply problems.

Detected using Hamming error correcting code.

Error Correction

When a stored word is read out, a new set of K code bits is generated from the M data bits and compared with the fetched code bits. Results:
• No errors: the fetched data bits are sent out.
• An error is detected, and it is possible to correct the error.
  • The data bits plus the error correction bits are fed into a corrector, which sends out the corrected set of M bits.
• An error is detected, but it is not possible to correct the error. This condition is reported.

Error Correcting Code Function

A function to produce code

Stored codeword: M+K bits

Error Correcting Codes: Venn diagram

(a) Encoding of 1100

(b) Even parity added

(c) Error in AC

Error Correction: Hamming Distance

The number of bit positions in which two codewords differ is called the Hamming distance.

If two codewords are a Hamming distance d apart, it will require d single-bit errors to convert one into the other.
• E.g. the codewords 11110001 and 00110000 are a Hamming distance 3 apart because it takes 3 single-bit errors to convert one into the other.

Error Correction: Hamming Distance

To detect d single-bit errors, you need a distance d + 1 code.

To correct d single-bit errors, you need a distance 2d + 1 code.
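Read the other way round, these two rules give the capability of a code with a known minimum distance. The helper names below are made up for illustration.

```python
def detectable_errors(distance):
    """Detecting d errors needs distance d + 1, so a code detects distance - 1."""
    return distance - 1


def correctable_errors(distance):
    """Correcting d errors needs distance 2d + 1, so a code corrects (distance - 1) // 2."""
    return (distance - 1) // 2


# A single parity bit gives a distance-2 code: it detects 1 error, corrects none.
print(detectable_errors(2), correctable_errors(2))   # 1 0
# A distance-3 code detects 2 errors or corrects 1.
print(detectable_errors(3), correctable_errors(3))   # 2 1
```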

To determine how many bits differ, just compute the bitwise Boolean EXCLUSIVE OR of the two codewords, and count the number of 1 bits in the result.
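The XOR-and-count procedure can be written directly. This sketch assumes the codewords are given as equal-length strings of 0s and 1s; the function name is chosen for the example.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    differing = int(a, 2) ^ int(b, 2)   # 1 bits mark the differing positions
    return bin(differing).count("1")


print(hamming_distance("11110001", "00110000"))   # 3
```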

Example: Hamming algorithm

All bits whose bit number (starting with bit 1) is a power of 2 are parity bits; the rest are used for data.
• E.g. with a 16-bit word, 5 parity bits are added. Bits 1, 2, 4, 8, and 16 are parity bits, and all the rest are data bits.
• Bit b is checked by those parity bits b1, b2, …, bj such that b1 + b2 + … + bj = b.
• For example, bit 5 is checked by bits 1 and 4 because 1 + 4 = 5.
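The parity bits that check a given position are simply the powers of 2 in its binary expansion. A small sketch, with a hypothetical function name:

```python
def checking_parity_bits(b):
    """Return the parity bit positions (powers of 2) that sum to position b."""
    return [1 << i for i in range(b.bit_length()) if (b >> i) & 1]


print(checking_parity_bits(5))    # [1, 4]
print(checking_parity_bits(21))   # [1, 4, 16]
```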


Construction of the Hamming code for the memory word 1111000010101110 by adding 5 check bits to the 16 data bits.

We will (arbitrarily) use even parity in this example.
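The construction for this memory word can be reproduced with a short sketch. It assumes bit positions are numbered from 1 starting at the left of the codeword and that data bits fill the non-power-of-2 positions in order; `hamming_encode` is a name chosen for this example.

```python
def hamming_encode(data_bits):
    """Even-parity Hamming encoding of a string of data bits."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:     # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)            # index 0 unused; positions 1..n
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):         # not a power of 2: a data position
            code[pos] = int(next(data))
    for i in range(r):
        p = 1 << i
        # Parity bit p makes the count of 1s over the positions it checks even.
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    return "".join(str(bit) for bit in code[1:])


print(hamming_encode("1111000010101110"))   # 001011100000101101110
```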


Consider what would happen if bit 5 were inverted by an electrical surge on the power line. The new codeword would be 001001100000101101110 instead of 001011100000101101110.

The 5 parity bits will be checked, with the following results:

Since parity bits 1 and 4 are incorrect but 2, 8, and 16 are correct, bit 5 (1 + 4) has been inverted.
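The same check can be sketched in code: recompute each parity group, add up the positions of the parity bits that fail, and flip the bit at that position. The sketch assumes even parity, positions numbered from 1 at the left, and at most one inverted bit; the function names are made up for the example.

```python
def locate_error(codeword):
    """Return the position of a single inverted bit, or 0 if all parities hold."""
    bits = [0] + [int(c) for c in codeword]   # index 0 unused
    n = len(codeword)
    position = 0
    p = 1
    while p <= n:
        # Parity group p covers every position whose binary form includes p.
        if sum(bits[i] for i in range(1, n + 1) if i & p) % 2:
            position += p                     # this parity bit is incorrect
        p *= 2
    return position


def correct_single_error(codeword):
    """Flip the bit reported by locate_error (assumes at most one error)."""
    pos = locate_error(codeword)
    if pos == 0:
        return codeword
    if pos > len(codeword):
        raise ValueError("more than one bit appears to be in error")
    flipped = "1" if codeword[pos - 1] == "0" else "0"
    return codeword[:pos - 1] + flipped + codeword[pos:]


damaged = "001001100000101101110"
print(locate_error(damaged))            # 5
print(correct_single_error(damaged))    # 001011100000101101110
```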

Thank you

Q & A
