CPE 626 CPU Resources: Introduction to Cache Memories
Aleksandar Milenkovic
E-mail: [email protected]
Web: http://www.ece.uah.edu/~milenka
Outline
- Processor-Memory Speed Gap
- Memory Hierarchy
- Spatial and Temporal Locality
- What is a Cache?
- Four Questions about Caches
- Example: Alpha AXP 21064 Data Cache
- Fully Associative Cache
- N-Way Set Associative Cache
- Block Replacement Policies
- Cache Performance
Review: Processor-Memory Performance Gap
[Figure: performance (log scale, 1 to 1000) vs. year (1980-2000). Processor performance doubles every 1.5 years; DRAM memory performance doubles every 10 years. The processor-memory performance gap grows 50% per year.]
The Memory Hierarchy (MH)
[Figure: processor (control and datapath) at the top of a pyramid of memory levels. Upper levels are the fastest, smallest, and highest in cost per bit; lower levels are the slowest, biggest, and lowest in cost per bit.]
The user sees as much memory as is available in the cheapest technology and accesses it at the speed offered by the fastest technology.
Why does the hierarchy work?
Principle of locality:
- Temporal locality: recently accessed items are likely to be accessed in the near future => keep them close to the processor
- Spatial locality: items whose addresses are near one another tend to be referenced close together in time => move blocks consisting of contiguous words to the upper level
Rule of thumb: programs spend 90% of their execution time in only 10% of their code.
[Figure: probability of reference vs. address space, showing accesses clustered in a few regions.]
What is a Cache?
Small, fast storage used to improve the average access time to slow memory; it exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
- Registers: a cache on variables
- First-level cache: a cache on the second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on disk (virtual memory)
- TLB: a cache on the page map table
- Branch prediction: a cache on prediction information?
Four questions about caches
- Where can a block be placed in the upper level? Block placement: direct-mapped, fully associative, set-associative
- How is a block found if it is in the upper level? Block identification
- Which block should be replaced on a miss? Block replacement: Random, LRU (Least Recently Used)
- What happens on a write? Write strategy: write-through vs. write-back; write allocate vs. no-write allocate
Direct-Mapped Cache (1/3)
- In a direct-mapped cache, each memory address is associated with one possible block within the cache; therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
- A block is the unit of transfer between cache and memory
Direct-Mapped Cache (2/3)
[Figure: a 16-byte memory (addresses 0 through F) mapping into a 4-byte cache (indices 0 through 3).]
Cache location 0 can be occupied by data from memory locations 0, 4, 8, ...; in general, any memory location that is a multiple of 4.
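The mapping on this slide can be sketched in a few lines of Python (the names are illustrative, not from the slides): with a 1-byte block, the cache index is simply the memory address modulo the number of cache locations.

```python
# Direct-mapped placement for the 4-byte cache above (block size = 1 byte):
# every memory address maps to exactly one cache location.
NUM_BLOCKS = 4  # cache locations 0..3

def cache_index(address):
    # Memory address modulo the number of cache blocks.
    return address % NUM_BLOCKS

# Memory locations 0, 4, 8, C (12) all compete for cache location 0.
assert [cache_index(a) for a in (0x0, 0x4, 0x8, 0xC)] == [0, 0, 0, 0]
```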
Direct-Mapped Cache (3/3)
- Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
- What if we have a block size > 1 byte?
- Result: divide the memory address into three fields:
  tttttttttttttttttt iiiiiiiiii oooo
  (tag + index = block address)
- TAG: to check if we have the correct block
- INDEX: to select the block
- OFFSET: to select the byte within the block
Direct-Mapped Cache Terminology
- INDEX: specifies the cache index (which "row" of the cache we should look in)
- OFFSET: once we have found the correct block, specifies which byte within the block we want
- TAG: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
- BLOCK ADDRESS: TAG + INDEX
Direct-Mapped Cache Example
Conditions: 32-bit architecture (word = 32 bits), the address unit is the byte; 8KB direct-mapped cache with 4-word blocks.
Determine the size of the Tag, Index, and Offset fields:
- OFFSET (specifies the correct byte within the block): a cache block contains 4 words = 16 (2^4) bytes => 4 bits
- INDEX (specifies the correct row in the cache): the cache size is 8KB = 2^13 bytes and a cache block is 2^4 bytes; #rows in the cache (1 block = 1 row): 2^13/2^4 = 2^9 => 9 bits
- TAG: memory address length - offset - index = 32 - 4 - 9 = 19 => the tag is the leftmost 19 bits
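The field-width arithmetic above can be checked with a short script (variable names are illustrative):

```python
import math

# Field widths for the worked example: 32-bit byte addresses,
# 8 KB direct-mapped cache, 4-word (16-byte) blocks.
ADDRESS_BITS = 32
CACHE_BYTES  = 8 * 1024   # 2^13
BLOCK_BYTES  = 4 * 4      # 4 words x 4 bytes = 2^4

offset_bits = int(math.log2(BLOCK_BYTES))                 # byte within block
index_bits  = int(math.log2(CACHE_BYTES // BLOCK_BYTES))  # row in the cache
tag_bits    = ADDRESS_BITS - index_bits - offset_bits     # everything left

print(tag_bits, index_bits, offset_bits)  # 19 9 4
```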
What to do on a write hit?
- Write through (or store through): update the word in the cache block and the corresponding word in memory
- Write back (or copy back): update the word in the cache block only, allowing the memory word to be "stale"; the modified cache block is written to main memory only when it is replaced => add a 'dirty' bit to each block indicating that memory needs to be updated when the block is replaced; if the block is clean, it is not written back on a miss, since memory has information identical to the cache => the OS flushes the cache before I/O !!!
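The write-back bookkeeping can be sketched as follows (a minimal illustration with made-up names; real hardware tracks the dirty bit per cache line exactly like this):

```python
# Write-back sketch: a write hit only marks the block dirty;
# memory is updated when a dirty block is evicted.

class CacheBlock:
    def __init__(self, tag, data):
        self.tag = tag
        self.data = data
        self.dirty = False  # memory copy is still up to date

def write_hit(block, word_index, value):
    block.data[word_index] = value
    block.dirty = True      # memory word is now "stale"

def evict(block, memory):
    # Write back only if the block was modified; clean blocks are dropped.
    if block.dirty:
        memory[block.tag] = list(block.data)

memory = {}
b = CacheBlock(tag=0x42, data=[0, 0, 0, 0])
write_hit(b, 1, 99)        # no memory traffic here
evict(b, memory)           # the single write-back happens on replacement
assert memory[0x42] == [0, 99, 0, 0]
```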
What to do on a write miss?
- Write allocate (or fetch on write): the block is loaded on a write miss, followed by the write-hit actions
- No-write allocate (or write around): the block is modified in memory and not loaded into the cache
- Although either write-miss policy can be used with write through or write back, write-back caches generally use write allocate and write-through caches often use no-write allocate
Alpha AXP 21064 Data Cache
- 8KB direct-mapped cache with 32B blocks; the CPU presents a 34-bit address to the cache => the offset is 5 bits, the index is 8 bits, and the tag is 21 bits
- Write through, with a four-block write buffer
- No-write allocate on a write miss
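The address split described above can be expressed with shifts and masks (a minimal sketch; the field widths come from the slide, the function name is illustrative):

```python
# Splitting the 21064 data cache's 34-bit address into
# <tag<21>, index<8>, offset<5>>.
OFFSET_BITS, INDEX_BITS = 5, 8

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                  # byte in 32B block
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # 1 of 256 lines
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)               # remaining 21 bits
    return tag, index, offset

# The all-ones 34-bit address has the maximum value in every field.
assert split_address((1 << 34) - 1) == ((1 << 21) - 1, 255, 31)
```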
Alpha AXP 21064 Data Cache
[Figure: the 34-bit CPU address is split into tag<21>, index<8>, and offset<5>. Each cache line holds valid<1>, tag<21>, and data<256>. The index selects a line, the stored tag is compared (=?) with the address tag, and a 4:1 mux selects the desired word for the data out path. Writes go through a write buffer to the lower level memory.]
Alpha AXP 21064 Data Cache: Read Hit
1. The CPU presents a 34-bit address to the cache
2. The address is divided into <tag, index, offset>
3. The index selects the tag to be tested to see if the desired block is in the cache
4. The tag read from the cache is compared to the tag portion of the address; the data can be read and sent to the CPU in parallel with the tag being read and checked
5. If the tag matches and the valid bit is set, the cache signals the CPU to load the data
Alpha AXP 21064 Data Cache: Write Hit
The first four steps are the same as for a read.
5. The corresponding word in the cache is written and also sent to the write buffer
6a. If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU's perspective
6b. If the buffer contains other modified blocks, the addresses are checked to see if the address of the new data matches; if so, the new data are combined with that entry (write merging)
6c. If the buffer is full and there is no address match, the cache and CPU must wait
[Figure: write merging in a four-entry write buffer; each entry holds a write address (WA) and four words, each with a valid bit (V). Without merging, sequential word writes to addresses 100, 104, 108, and 10C occupy all four entries, each with a single valid word. With merging, all four writes combine into the single entry at write address 100, with all four valid bits set.]
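The write-merging behavior can be sketched like this (an illustrative model, not the 21064's actual logic; names and structure are assumptions):

```python
# Write-merging sketch: the buffer holds block-aligned entries; a new
# store whose block address matches an existing entry is merged into it
# instead of occupying a new slot.
BLOCK_BYTES = 16  # 4 words of 4 bytes, as in the slide's example

class WriteBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.slots = {}  # block address -> {byte offset: data}

    def write(self, addr, data):
        block = addr - addr % BLOCK_BYTES
        if block in self.slots:                # write merging
            self.slots[block][addr % BLOCK_BYTES] = data
            return True
        if len(self.slots) < self.entries:     # take a free entry
            self.slots[block] = {addr % BLOCK_BYTES: data}
            return True
        return False                           # buffer full: CPU must wait

wb = WriteBuffer()
for a in (0x100, 0x104, 0x108, 0x10C):  # four sequential word stores
    wb.write(a, 0xAB)
assert len(wb.slots) == 1               # merged into a single entry
```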
Alpha AXP 21064 Data Cache: Read Miss
The first four steps are the same as for a read.
5. The cache sends a stall signal to the CPU telling it to wait, and 32B are read from the next level of the hierarchy; in the DEC 3000 model 800, the path to the lower level is 16 bytes wide, so it takes 5 clock cycles per transfer, or 10 clock cycles for all 32 bytes
6. Replace the block (update the data, tag, and valid bit)
Alpha AXP 21064 Data Cache: Write Miss
The first four steps are the same as for a read.
5. Since the tag does not match (or the valid bit is not set), the CPU writes around the cache to the lower level memory and does not affect the cache (no-write allocate)
Fully Associative Cache
Memory address fields:
- Tag: same as before
- Offset: same as before
- Index: non-existent
What does this mean?
- any block can be placed anywhere in the cache
- we must compare with all tags in the entire cache to see if the data is there
Fully Associative Cache
[Figure: an 8KB fully associative cache with 4-word blocks (512 cache blocks). The address (bits 31 down to 0) supplies only a cache tag (28 bits long) and a byte offset; the tag is compared in parallel (=) against every valid cache tag, and the matching entry supplies the cache data.]
Fully Associative Cache
Benefit of fully associative caches:
- data can go anywhere in the cache => no conflict misses
Drawbacks:
- expensive in hardware, since every single entry needs a comparator (if we have 64KB of data in a cache with 4B entries, we need 16K comparators: infeasible)
- may slow the processor clock rate
N-Way Set-Associative Cache (1/3)
Memory address fields:
- Tag: same as before
- Offset: same as before
- Index: points us to the correct "row" (called a set in this case)
So what is the difference?
- each set contains multiple blocks
- once we have found the correct set, we must compare with all tags in that set to find our data
N-Way Set-Associative Cache (2/3)
Summary:
- the cache is direct-mapped with respect to sets
- each set is fully associative
- basically N direct-mapped caches working in parallel: each has its own valid bit and data
Given a memory address:
- find the correct set using the Index value
- compare the Tag with all Tag values in the determined set
- if a match occurs, it is a hit; otherwise, a miss
- finally, use the Offset field as usual to find the desired data within the desired block
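The lookup steps above can be sketched as follows (an illustrative model; the function and field names are assumptions, and the per-way comparisons that hardware performs in parallel are written as a loop):

```python
# N-way set-associative lookup: the index selects the set, then every
# tag in the set is compared against the address tag.

def lookup(cache, address, num_sets, block_bytes):
    block_addr = address // block_bytes
    index = block_addr % num_sets     # which set
    tag = block_addr // num_sets      # what must match inside the set
    for way in cache[index]:          # done in parallel in hardware
        if way["valid"] and way["tag"] == tag:
            return True               # hit
    return False                      # miss

# A tiny 2-way cache with 2 sets and 16-byte blocks.
cache = [[{"valid": True, "tag": 5}, {"valid": False, "tag": 0}],
         [{"valid": True, "tag": 7}, {"valid": True, "tag": 2}]]

# Block address 10 -> index 0, tag 5 -> hit in set 0, way 0.
assert lookup(cache, 10 * 16, num_sets=2, block_bytes=16) is True
# Block address 12 -> index 0, tag 6 -> no matching tag -> miss.
assert lookup(cache, 12 * 16, num_sets=2, block_bytes=16) is False
```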
N-Way Set-Associative Cache (3/3)
What is so great about this?
- even a 2-way set associative cache avoids a lot of conflict misses
- the hardware cost is not so bad (N comparators)
Disadvantages:
- extra MUX delay for the data
- the data comes AFTER the hit/miss decision and set selection; in a direct-mapped cache, the cache block is available BEFORE hit/miss
In fact, for a cache with M blocks:
- it is direct mapped if it is 1-way set associative
- it is fully associative if it is M-way set associative
- so these two are just special cases of the more general set-associative cache
2-Way Set Associative, 8KB Cache with 4W Blocks, W=64b
[Figure: the 34-bit CPU address is split into tag<22>, index<7>, and offset<5>. Two ways, each holding valid<1>, tag<22>, and data<256>, are accessed in parallel; each way has its own tag comparator (=?) and 4:1 word mux, and a final 2:1 MUX selects the data out from the hitting way. Writes go through a write buffer to the lower level memory.]
Block Replacement Policy (1/2)
- Direct-mapped cache: the index completely specifies the position a block can go in on a miss
- N-way set associative (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss
- Fully associative: the block can be written into any position
- Question: if we have the choice, where should we write an incoming block?
Block Replacement Policy (2/2)
- If there are any locations with the valid bit off (empty), then usually write the new block into the first one
- If all possible locations already have a valid block, we must pick a replacement policy: a rule by which we determine which block gets "cached out" on a miss
Replacement policies:
- LRU (Least Recently Used)
- Random
- FIFO (First In First Out)
Block Replacement Policy: LRU (1/2)
- Reduce the chance of throwing out data that will be needed soon
- Accesses to blocks are recorded
- Idea: throw out the block which has been accessed least recently
Block Replacement Policy: LRU (2/2)
- Pro: exploits temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
- Con: the implementation is simple for 2-way set associative (one LRU bit to keep track) but complex for 4-way or greater (it requires complicated hardware and much time to keep track of this) => frequently it is only approximated
LRU Example
Assume a 2-way set associative cache with 4 blocks total capacity; we perform the following block accesses: 0, 4, 1, 2, 3, 1, 5.
How many misses and how many hits will there be with the LRU block replacement policy?

Event    | Set 0: B0     B1     | Set 1: B0     B1
Start    | -<lru>  -            | -<lru>  -
0 (miss) | 0       -<lru>       | -<lru>  -
4 (miss) | 0<lru>  4            | -<lru>  -
1 (miss) | 0<lru>  4            | 1       -<lru>
2 (miss) | 2       4<lru>       | 1       -<lru>
3 (miss) | 2       4<lru>       | 1<lru>  3
1 (hit)  | 2       4<lru>       | 1       3<lru>
5 (miss) | 2       4<lru>       | 1<lru>  5

Result: 6 misses and 1 hit.
Instruction and Data Caches
- When a load or store instruction is executed, the pipelined processor will simultaneously request both a data word and an instruction word => a structural hazard for loads and stores
- Divide and conquer: an instruction cache (dedicated to instructions) and a data cache (dedicated to data)
- Opportunity to optimize each cache separately: different capacities, block sizes, and associativities may lead to better performance
Example: Unified vs. Separate I&D (Harvard)
Table on page 384:
- 16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
- 32KB unified: aggregate miss rate = 1.99%
Which is better (ignoring the L2 cache)?
- Assume 33% data ops => 75% of accesses come from instructions (1.0/1.33); hit time = 1, miss time = 50
- Note that a data hit has 1 extra stall cycle in the unified cache (only one port)
- AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
- AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
[Figure: the Harvard organization (processor with separate L1 I-cache and D-cache in front of a unified L2) vs. the unified organization (processor with a unified L1 in front of a unified L2).]
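The two AMAT figures can be reproduced with a few lines of arithmetic (variable names are illustrative):

```python
# AMAT comparison from the example: hit time 1, miss penalty 50 cycles,
# 75% instruction accesses and 25% data accesses.
MISS_PENALTY = 50

def amat(hit_time, miss_rate):
    return hit_time + miss_rate * MISS_PENALTY

# Split (Harvard) caches: separate I and D ports, hit time 1 for both.
amat_harvard = 0.75 * amat(1, 0.0064) + 0.25 * amat(1, 0.0647)
# Unified cache: a data access stalls one extra cycle (single port).
amat_unified = 0.75 * amat(1, 0.0199) + 0.25 * amat(2, 0.0199)

print(round(amat_harvard, 2), round(amat_unified, 2))  # 2.05 2.24
```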
Things to Remember
- Where can a block be placed in the upper level? Block placement: direct-mapped, fully associative, set-associative
- How is a block found if it is in the upper level? Block identification: Tag, Index, Offset
- Which block should be replaced on a miss? Block replacement: Random, LRU (Least Recently Used)
- What happens on a write? Write strategy: write-through (with a write buffer) vs. write-back; write allocate vs. no-write allocate
- AMAT = Average Memory Access Time:
  AMAT = Hit time + Miss rate x Miss penalty