

Page 1

Memory Hierarchy and Cache Design

The following sources are used for preparing these slides:

• Lecture 14 from the course Computer architecture ECE 201 by Professor Mike Schulte.

• Lecture 4 from William Stallings, Computer Organization and Architecture, Prentice Hall; 6th edition, July 15, 2002.

• Lecture 6 from the course Systems Architectures II by Professors Jeremy R. Johnson and Anatole D. Ruslanov

• Some figures are from Computer Organization and Design: The Hardware/Software Approach, Third Edition, by David Patterson and John Hennessy, and are copyrighted material (COPYRIGHT 2004 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).

Page 2

Memory Hierarchy

[Figure: the memory-hierarchy pyramid. The processor sits above Level 1, Level 2, …, Level n; both access time and the size of the memory at each level increase with distance from the CPU, and data are transferred between adjacent levels.]

Memory technology   Typical access time        $ per GB in 2004
SRAM                0.5–5 ns                   $4000–$10,000
DRAM                50–70 ns                   $100–$200
Magnetic disk       5,000,000–20,000,000 ns    $0.50–$2

Page 3

SRAM v DRAM

• Both volatile

– Power needed to preserve data

• Dynamic cell

– Simpler to build, smaller

– More dense

– Less expensive

– Needs refresh

– Used for larger memory units

• Static cell

– Faster

– Used for cache

Page 4

General Principles of Memory

• Locality

– Temporal Locality: referenced memory is likely to be referenced again soon (e.g. code within a loop)

– Spatial Locality: memory close to referenced memory is likely to be referenced soon (e.g., data in a sequentially accessed array)

• Definitions

– Upper: memory closer to processor

– Block: minimum unit that is present or not present

– Block address: location of block in memory

– Hit: Data is found in the desired location

– Hit time: time to access upper level

– Miss rate: fraction of accesses for which the item is not found in the upper level

• Locality + smaller HW is faster = memory hierarchy

– Levels: each smaller, faster, more expensive/byte than the level below

– Inclusive: data found in upper level also found in the lower level

Page 5

Cache

• Small amount of fast memory

• Sits between normal main memory and CPU

• May be located on CPU chip or module

Page 6

Cache operation - overview

• CPU requests contents of memory location

• Check cache for this data

• If present, get from cache (fast)

• If not present, read required block from main memory to cache

• Then deliver from cache to CPU

• Cache includes tags to identify which block of main memory is in each cache slot

Page 7

Cache/memory structure

Page 8

Four Questions for Memory Hierarchy Designers

• Q1: Where can a block be placed in the upper level? (Block placement)

• Q2: How is a block found if it is in the upper level? (Block identification)

• Q3: Which block should be replaced on a miss? (Block replacement)

• Q4: What happens on a write? (Write strategy)

Page 9

Q1: Where can a block be placed?

• Direct Mapped: Each block has only one place that it can appear in the cache.

• Fully associative: Each block can be placed anywhere in the cache.

• Set associative: Each block can be placed in a restricted set of places in the cache.

– If there are n blocks in a set, the cache placement is called n-way set associative

• What is the associativity of a direct mapped cache?

Page 10

Associativity Examples

Cache size is 8 blocks. Where does block 12 from memory go?

Fully associative: Block 12 can go anywhere.

Direct mapped: Block no. = (Block address) mod (No. of blocks in cache).
Block 12 can go only into block 4 (12 mod 8 = 4) => access the block using the lower 3 bits.

2-way set associative: Set no. = (Block address) mod (No. of sets in cache).
Block 12 can go anywhere in set 0 (12 mod 4 = 0) => access the set using the lower 2 bits.

Page 11

• Mapping: each memory block maps to one location in the cache: (Block address) mod (Number of blocks in cache)

• The number of blocks is typically a power of two, i.e., the cache location is obtained from the low-order bits of the block address.

Direct Mapped Cache

[Figure: an eight-entry direct-mapped cache (indices 000–111) below a larger memory. Each memory address maps to the cache index given by its low-order three bits; e.g., addresses 00101, 01101, 10101, and 11101 all map to index 101.]

Page 12

Locating data in the Cache

• The index is 10 bits, while the tag is 20 bits

– We need to address 1024 (2^10) words

– Any of 2^20 memory blocks could map to a given cache location

• The valid bit indicates whether an entry contains a valid address or not

• The number of tag bits is the address size – (log2(cache size in words) + 2)

– E.g., 32 – (10 + 2) = 20

[Figure: a 1024-entry direct-mapped cache. The 32-bit address (bits 31–0) splits into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset. Each entry holds a valid bit, a 20-bit tag, and 32 bits of data; Hit is asserted when the stored tag matches the address tag and the valid bit is set.]

Page 13

Q2: How Is a Block Found?

• The address can be divided into two main parts

– Block offset: selects the data from the block

offset size = log2(block size)

– Block address: tag + index

» index: selects the set in the cache

index size = log2(#blocks / associativity)

» tag: compared to the tag in the cache to determine a hit

tag size = address size - index size - offset size

• Each block has a valid bit that tells if the block is valid - the block is in the cache only if the tags match and the valid bit is set.

Page 14

Q4: What Happens on a Write?

• Write through: The information is written to both the block in the cache and to the block in the lower-level memory.

• Write back: The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

– is block clean or dirty? (add a dirty bit to each block)

• Pros and cons of each:

– Write through

» Read misses cannot result in writes to memory,

» Easier to implement

» Always combine with write buffers to avoid memory latency

– Write back

» Less memory traffic

» Perform writes at the speed of the cache

Page 15

Reducing Cache Misses with a more Flexible Replacement Strategy

• In a direct mapped cache a block can go in exactly one place in cache

• In a fully associative cache a block can go anywhere in cache

• A compromise is to use a set associative cache where a block can go into a fixed number of locations in cache, determined by:

(Block number) mod (Number of sets in cache)

[Figure: locating a block in an eight-block cache under the three organizations — direct mapped (block # 0–7: search exactly one block), two-way set associative (set # 0–3: search the two blocks of one set), and fully associative (search every block). Each entry holds a tag and data.]

Page 16

Example

• Three small caches, each with four one-word blocks:

Direct mapped, two-way set associative, fully associative

• How many misses in the sequence of block addresses: 0, 8, 0, 6, 8?

• How does this change with 8 words, 16 words?

Page 17

Locating a Block in Cache

• Check the tag of every cache block in the appropriate set

• Address consists of 3 parts

• Replacement strategy: e.g., Least Recently Used (LRU)

Address layout: | tag | index | block offset |

Program   Assoc.   I miss rate   D miss rate   Combined rate
gcc       1        2.0%          1.7%          1.9%
gcc       2        1.6%          1.4%          1.5%
gcc       4        1.6%          1.4%          1.5%

[Figure: a four-way set-associative cache with 256 sets (index 0–255). The 32-bit address splits into a 22-bit tag, an 8-bit index, and a byte offset; each way holds a valid bit, a tag, and 32 bits of data. The four tags of the selected set are compared in parallel, and a 4-to-1 multiplexor selects the data from the way that hits.]

Page 18

Size of Tags vs. Associativity

• Increasing associativity requires more comparators, as well as more tag bits per cache block.

• Assume a cache with 4K four-word blocks and 32-bit addresses

• Find the total number of sets and the total number of tag bits for a

– direct mapped cache

– two-way set associative cache

– four-way set associative cache

– fully associative cache

Page 19

Size of Tags vs. Associativity

• Total cache size: 4K blocks x 4 words/block x 4 bytes/word = 64 KB

• Direct mapped cache:

– 16 bytes/block => 28 bits for tag and index

– # sets = # blocks = 4K

– log2(4K) = 12 bits for index => 16 bits for tag

– Total # of tag bits = 16 bits x 4K blocks = 64 Kbits

• Two-way set-associative cache:

– 32 bytes/set

– 16 bytes/block => 28 bits for tag and index

– # sets = # blocks / 2 = 2K sets

– log2(2K) = 11 bits for index => 17 bits for tag

– Total # of tag bits = 17 bits x 2 blocks/set x 2K sets = 68 Kbits

Page 20

Size of Tags vs. Associativity

• Four-way set-associative cache:

– 64 bytes/set

– 16 bytes/block => 28 bits for tag and index

– # sets = # blocks / 4 = 1K sets

– log2(1K) = 10 bits for index => 18 bits for tag

– Total # of tag bits = 18 bits x 4 blocks/set x 1K sets = 72 Kbits

• Fully associative cache:

– 1 set of 4K blocks => 28 bits for tag and index

– Index = 0 bits => the tag has all 28 bits

– Total # of tag bits = 28 bits x 4K blocks/set x 1 set = 112 Kbits

Page 21

Measuring Cache Performance

• CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time

• Memory stall clock cycles = Read-stall cycles + Write-stall cycles

• Read-stall cycles = Reads/program x Read miss rate x Read miss penalty

• Write-stall cycles = (Writes/program x Write miss rate x Write miss penalty) + Write buffer stalls

(assumes a write-through cache)

• Assuming write buffer stalls are negligible and the write and read miss penalties are equal (the cost to fetch the block from memory):

• Memory stall clock cycles = Memory accesses/program x Miss rate x Miss penalty

Page 22

Example I

• Assume I-miss rate of 2% and D-miss rate of 4% (gcc)

• Assume CPI = 2 (without stalls) and miss penalty of 40 cycles

• Assume 36% loads/stores

• What is the CPI with memory stalls?

• How much faster would a machine with perfect cache run?

• What happens if the processor is made faster, but the memory system stays the same (e.g. reduce CPI to 1)?

Page 23

Calculation I

• Instruction miss cycles = I x 100% x 2% x 40 = 0.80 x I

• Data miss cycles = I x 36% x 4% x 40 = 0.576 x I ≈ 0.58 x I

• Total miss cycles = 0.80 x I + 0.58 x I = 1.38 x I

• CPI = 2 + 1.38 = 3.38

• Perf_perfect / Perf_stall = 3.38 / 2 = 1.69

• For a processor with base CPI = 1:

• CPI = 1 + 1.38 = 2.38, so Perf_perfect / Perf_stall = 2.38 / 1 = 2.38

• Fraction of time spent on stalls for the slower processor: 1.38 / 3.38 = 41%

• Fraction of time spent on stalls for the faster processor: 1.38 / 2.38 = 58%