CPE 626 CPU Resources: Introduction to Cache Memories
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka


Page 2: CPE 626 CPU Resources: Introduction to Cache Memories

Outline

Processor-Memory Speed Gap
Memory Hierarchy
Spatial and Temporal Locality
What is Cache?
Four questions about caches
Example: Alpha AXP 21064 Data Cache
Fully Associative Cache
N-Way Set Associative Cache
Block Replacement Policies
Cache Performance

Page 3: CPE 626 CPU Resources: Introduction to Cache Memories

Review: Processor-Memory Performance Gap

[Figure: log-scale plot of relative performance (1 to 1000) versus year (1980-2000) for CPU and DRAM. Processor performance doubles every 1.5 years; memory performance doubles every 10 years. The processor-memory performance gap grows about 50% per year.]

Page 4: CPE 626 CPU Resources: Introduction to Cache Memories

The Memory Hierarchy (MH)

[Figure: the memory hierarchy, with the processor (control and datapath) at the upper levels and larger, slower memories at the lower levels.]

Moving from the upper levels to the lower levels of the hierarchy:
Speed: fastest -> slowest
Capacity: smallest -> biggest
Cost/bit: highest -> lowest

The user sees as much memory as is available in the cheapest technology, and accesses it at the speed offered by the fastest technology.

Page 5: CPE 626 CPU Resources: Introduction to Cache Memories

Why does the hierarchy work?

Principle of locality:

Temporal locality: recently accessed items are likely to be accessed in the near future => keep them close to the processor

Spatial locality: items whose addresses are near one another tend to be referenced close together in time => move blocks consisting of contiguous words to the upper level

Rule of thumb: programs spend 90% of their execution time in only 10% of the code
(a small code illustration follows the figure below)

[Figure: probability of reference versus address space, with references clustered in a few small regions.]
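To make the two kinds of locality concrete, here is a small added C illustration (not part of the original slides): the sequential sweep over a[] shows spatial locality, while the reuse of sum and i on every iteration shows temporal locality.

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    long sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Spatial locality: a[i] is accessed sequentially, so after one miss
     * brings in a cache block, the next several accesses hit in it.
     * Temporal locality: sum and i are reused on every iteration. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld\n", sum);
    return 0;
}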

Page 6: CPE 626 CPU Resources: Introduction to Cache Memories

What is a Cache?

Small, fast storage used to improve the average access time to slow memory

Exploits spatial and temporal locality

In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)
TLB: a cache on the page map table
Branch prediction: a cache on prediction information?

Page 7: CPE 626 CPU Resources: Introduction to Cache Memories

Four questions about caches

Where can a block be placed in the upper level? (block placement)
Direct-mapped, fully associative, set-associative

How is a block found if it is in the upper level? (block identification)

Which block should be replaced on a miss? (block replacement)
Random, LRU (Least Recently Used)

What happens on a write? (write strategy)
Write-through vs. write-back
Write allocate vs. no-write allocate

Page 8: CPE 626 CPU Resources: Introduction to Cache Memories

Direct-Mapped Cache (1/3)

In a direct-mapped cache, each memory address is associated with one possible block within the cache. Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.

A block is the unit of transfer between the cache and memory.

Page 9: CPE 626 CPU Resources: Introduction to Cache Memories

Direct-Mapped Cache (2/3)

[Figure: a 16-byte memory (addresses 0-F) mapped onto a 4-byte direct-mapped cache (indexes 0-3).]

Cache location 0 can be occupied by data from:
memory locations 0, 4, 8, ...
in general: any memory location whose address is a multiple of 4

Page 10: CPE 626 CPU Resources: Introduction to Cache Memories

Direct-Mapped Cache (3/3)

Since multiple memory addresses map to the same cache index, how do we tell which one is in there? And what if we have a block size > 1 byte?

Result: divide the memory address into three fields:

tttttttttttttttttt iiiiiiiiii oooo

TAG: to check if we have the correct block
INDEX: to select the block
OFFSET: to select the byte within the block

The tag and index fields together form the block address.

Page 11: CPE 626 CPU Resources: Introduction to Cache Memories


Direct-Mapped Cache Terminology

INDEX: specifies the cache index (which “row” of the cache we should look in)

OFFSET: once we have found correct block, specifies which byte within the block we want

TAG: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location

BLOCK ADDRESS: TAG + INDEX

Page 12: CPE 626 CPU Resources: Introduction to Cache Memories

Direct-Mapped Cache Example

Conditions:
32-bit architecture (word = 32 bits), the address unit is a byte
8KB direct-mapped cache with 4-word blocks

Determine the size of the Tag, Index, and Offset fields.

OFFSET (specifies the correct byte within a block):
a cache block contains 4 words = 16 (2^4) bytes => 4 bits

INDEX (specifies the correct row in the cache):
the cache size is 8KB = 2^13 bytes and a cache block is 2^4 bytes
#rows in the cache (1 block = 1 row): 2^13 / 2^4 = 2^9 => 9 bits

TAG: memory address length - offset - index = 32 - 4 - 9 = 19
=> the tag is the leftmost 19 bits
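As an added sketch (not from the slides), the following C program hard-codes the field widths derived above and extracts the tag, index, and offset from an address with shifts and masks; the address value itself is arbitrary.

#include <stdio.h>
#include <stdint.h>

/* 8KB direct-mapped cache, 16-byte (4-word) blocks, 32-bit byte addresses */
#define ADDR_BITS   32
#define OFFSET_BITS 4    /* log2(16B block) */
#define INDEX_BITS  9    /* log2(2^13 B / 2^4 B) */
#define TAG_BITS    (ADDR_BITS - INDEX_BITS - OFFSET_BITS)   /* 19 */

int main(void) {
    uint32_t addr = 0x12345678;   /* arbitrary example address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag=0x%x index=0x%x offset=0x%x (tag width: %d bits)\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset, TAG_BITS);
    return 0;
}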

Page 13: CPE 626 CPU Resources: Introduction to Cache Memories

What to do on a write hit?

Write through (or store through):
update the word in the cache block and the corresponding word in memory

Write back (or copy back):
update the word in the cache block only, allowing the memory word to be "stale"
the modified cache block is written to main memory only when it is replaced
=> add a 'dirty' bit to each block, indicating that memory needs to be updated when the block is replaced; if the block is clean, it is not written back on a miss, since memory has information identical to the cache
=> the OS flushes the cache before I/O!

Page 14: CPE 626 CPU Resources: Introduction to Cache Memories

What to do on a write miss?

Write allocate (or fetch on write): the block is loaded into the cache on a write miss, followed by the write-hit actions

No-write allocate (or write around): the block is modified in memory and not loaded into the cache

Although either write-miss policy can be used with write through or write back, write-back caches generally use write allocate and write-through caches often use no-write allocate.
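The toy C model below (an illustrative sketch with made-up structure and names, not any real cache's logic) contrasts the two common pairings: write back with write allocate versus write through with no-write allocate, using a single cache line over a small word-addressed memory.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MEM_WORDS   64
#define BLOCK_WORDS 4

static uint32_t memory[MEM_WORDS];

typedef struct {
    bool     valid, dirty;
    uint32_t tag;                    /* block number held in the line */
    uint32_t data[BLOCK_WORDS];
} line_t;

static void write_back_if_dirty(line_t *l) {
    if (l->valid && l->dirty)
        memcpy(&memory[l->tag * BLOCK_WORDS], l->data, sizeof l->data);
}

/* Write back + write allocate: a miss loads the block, then the write
 * hits in the cache; memory stays stale until replacement. */
static void wb_walloc_write(line_t *l, uint32_t waddr, uint32_t value) {
    uint32_t block = waddr / BLOCK_WORDS;
    if (!l->valid || l->tag != block) {          /* write miss */
        write_back_if_dirty(l);                  /* evict the old block */
        memcpy(l->data, &memory[block * BLOCK_WORDS], sizeof l->data);
        l->valid = true;                         /* write allocate */
        l->tag = block;
    }
    l->data[waddr % BLOCK_WORDS] = value;        /* then the write-hit action */
    l->dirty = true;                             /* memory is now stale */
}

/* Write through + no-write allocate: memory is always updated; on a
 * miss the cache is bypassed ("write around"). */
static void wt_nowalloc_write(line_t *l, uint32_t waddr, uint32_t value) {
    uint32_t block = waddr / BLOCK_WORDS;
    if (l->valid && l->tag == block)             /* write hit: update cache */
        l->data[waddr % BLOCK_WORDS] = value;
    memory[waddr] = value;                       /* always update memory */
}

int main(void) {
    line_t a = {0}, b = {0};
    wb_walloc_write(&a, 5, 0xAA);    /* miss: block 1 is allocated and dirtied */
    wt_nowalloc_write(&b, 5, 0xBB);  /* miss: cache bypassed (write around) */
    printf("write-back line: dirty=%d; memory[5]=0x%X\n",
           a.dirty, (unsigned)memory[5]);
    return 0;
}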

Page 15: CPE 626 CPU Resources: Introduction to Cache Memories

Alpha AXP 21064 Data Cache

8KB direct-mapped cache with 32B blocks; the CPU presents a 34-bit address to the cache => the offset is 5 bits, the index 8 bits, and the tag 21 bits

Write through, with a four-block write buffer
No-write allocate on a write miss

Page 16: CPE 626 CPU Resources: Introduction to Cache Memories

Alpha AXP 21064 Data Cache

[Figure: block diagram of the 21064 data cache. The 34-bit CPU address <34> is split into Tag<21>, Index<8>, and Offset<5>. Each cache entry holds Valid<1>, Tag<21>, and Data<256>. The index selects an entry, its stored tag is compared (=?) with the address tag, and a 4:1 mux selects the requested word from the 256-bit block for the CPU's data-out path. Writes go through a write buffer to the lower-level memory.]

Page 17: CPE 626 CPU Resources: Introduction to Cache Memories

Alpha AXP 21064 Data Cache: Read Hit

1. The CPU presents a 34-bit address to the cache.
2. The address is divided into <tag, index, offset>.
3. The index selects the tag to be tested, to see if the desired block is in the cache.
4. After the tag is read from the cache, it is compared to the tag portion of the address; the data can be read and sent to the CPU in parallel with the tag being read and checked.
5. If the tag matches and the valid bit is set, the cache signals the CPU to load the data.

Page 18: CPE 626 CPU Resources: Introduction to Cache Memories

Alpha AXP 21064 Data Cache: Write Hit

The first four steps are the same as for a read.
5. The corresponding word in the cache is written and also sent to the write buffer.
6a. If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective.
6b. If the buffer contains other modified blocks, the addresses are checked to see if the address of the new data matches; if so, the new data are combined with that entry (write merging).
6c. If the buffer is full and there is no address match, the cache and the CPU must wait.

[Figure: write-merging example. Without merging, four writes to addresses 100, 104, 108, and 10C each occupy their own write-buffer entry, with only one valid word per entry. With write merging, all four words are combined into the single entry for write address 100, with all four of its valid bits set. A sketch of such a merging buffer follows the figure.]
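Below is a small added C sketch of a merging write buffer (illustrative only; the entry layout and names are assumptions, not the 21064's design). It has four entries of four words each, merges a write into a matching entry when possible, and reports a stall when the buffer is full.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 4
#define WORDS   4

typedef struct {
    bool     used;
    uint32_t block_addr;          /* address of word 0 of the block */
    bool     valid[WORDS];
    uint32_t data[WORDS];
} wb_entry_t;

static wb_entry_t buf[ENTRIES];

/* Returns true if the write was absorbed; false means the buffer is
 * full with no matching entry, so the CPU would have to stall. */
bool wb_write(uint32_t addr, uint32_t value) {
    uint32_t block = addr & ~(uint32_t)(4 * WORDS - 1);
    int word = (addr >> 2) & (WORDS - 1);

    for (int i = 0; i < ENTRIES; i++)        /* try write merging first */
        if (buf[i].used && buf[i].block_addr == block) {
            buf[i].valid[word] = true;
            buf[i].data[word] = value;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)        /* otherwise take a free entry */
        if (!buf[i].used) {
            buf[i].used = true;
            buf[i].block_addr = block;
            buf[i].valid[word] = true;
            buf[i].data[word] = value;
            return true;
        }
    return false;                            /* buffer full: stall */
}

int main(void) {
    /* the four sequential writes from the figure merge into one entry */
    for (uint32_t a = 0x100; a <= 0x10C; a += 4)
        wb_write(a, a);
    printf("entry 0 used=%d, entry 1 used=%d\n", buf[0].used, buf[1].used);
    return 0;
}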

Page 19: CPE 626 CPU Resources: Introduction to Cache Memories

Alpha AXP 21064 Data Cache: Read Miss

The first four steps are the same as for a read.
5. The cache sends a stall signal to the CPU telling it to wait, and 32B are read from the next level of the hierarchy. In the DEC 3000 model 800, the path to the lower level is 16 bytes wide; it takes 5 clock cycles per transfer, or 10 clock cycles for all 32 bytes.
6. Replace the block (update the data, tag, and valid bit).

Page 20: CPE 626 CPU Resources: Introduction to Cache Memories

Alpha AXP 21064 Data Cache: Write Miss

The first four steps are the same as for a read.
5. Since the tag does not match (or the valid bit is not set), the CPU writes around the cache to lower-level memory and does not affect the cache (no-write allocate).

Page 21: CPE 626 CPU Resources: Introduction to Cache Memories

Fully Associative Cache

Memory address fields:
Tag: same as before
Offset: same as before
Index: non-existent

What does this mean?
any block can be placed anywhere in the cache
we must compare against all tags in the entire cache to see if the data is there

Page 22: CPE 626 CPU Resources: Introduction to Cache Memories

Fully Associative Cache

8KB cache with 4-word blocks (512 cache blocks)

[Figure: fully associative cache organization. The address is split into a 28-bit cache tag (bits 31-4) and a 4-bit byte offset (bits 3-0). Each entry holds a valid bit, a 28-bit cache tag, and the cache data; the address tag is compared against all stored tags in parallel, with one comparator (=) per entry.]

Page 23: CPE 626 CPU Resources: Introduction to Cache Memories

Fully Associative Cache

Benefit of fully associative caches:
data can go anywhere in the cache => no conflict misses

Drawbacks:
expensive in hardware, since every single entry needs a comparator (if we have 64KB of data in the cache with 4B entries, we need 16K comparators: infeasible)
may slow down the processor clock rate

Page 24: CPE 626 CPU Resources: Introduction to Cache Memories

N-Way set-associative cache (1/3)

Memory address fields:
Tag: same as before
Offset: same as before
Index: points us to the correct "row" (called a set in this case)

So what is the difference?
each set contains multiple blocks
once we have found the correct set, we must compare against all the tags in that set to find our data

Page 25: CPE 626 CPU Resources: Introduction to Cache Memories

N-Way set-associative cache (2/3)

Summary:
the cache is direct-mapped with respect to sets
each set is fully associative
basically, N direct-mapped caches working in parallel: each has its own valid bit and data

Given a memory address (see the sketch after this list):
1. Find the correct set using the Index value.
2. Compare the Tag with all the Tag values in the determined set.
3. If a match occurs, it is a hit; otherwise, a miss.
4. Finally, use the Offset field as usual to find the desired data within the desired block.
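A minimal added C sketch of the lookup steps above; the geometry here (2-way, 8 sets, 16-byte blocks) is chosen arbitrarily for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS        2
#define INDEX_BITS  3                      /* 8 sets */
#define OFFSET_BITS 4                      /* 16-byte blocks */
#define SETS        (1 << INDEX_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << OFFSET_BITS];
} way_t;

static way_t cache[SETS][WAYS];

/* Returns a pointer to the requested byte on a hit, NULL on a miss. */
uint8_t *lookup(uint32_t addr) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & (SETS - 1);  /* 1. find set */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    for (int w = 0; w < WAYS; w++)         /* 2. compare all tags in the set */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return &cache[set][w].data[offset];  /* 3.+4. hit: apply offset */
    return NULL;                           /* 3. miss */
}

int main(void) {
    /* manually install a block so the lookup below hits */
    uint32_t addr = 0x1234;
    uint32_t set  = (addr >> OFFSET_BITS) & (SETS - 1);
    cache[set][0].valid = true;
    cache[set][0].tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("0x%x -> %s\n", addr, lookup(addr) ? "hit" : "miss");
    return 0;
}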

Page 26: CPE 626 CPU Resources: Introduction to Cache Memories

N-Way set-associative cache (3/3)

What is so great about this?
even a 2-way set-associative cache avoids a lot of conflict misses
the hardware cost is not so bad (N comparators)

Disadvantages:
extra MUX delay for the data
the data comes AFTER the hit/miss decision and set selection; in a direct-mapped cache, the cache block is available BEFORE the hit/miss decision

In fact, for a cache with M blocks:
it is direct-mapped if it is 1-way set associative
it is fully associative if it is M-way set associative
so these two are just special cases of the more general set-associative cache

Page 27: CPE 626 CPU Resources: Introduction to Cache Memories

2-Way set associative, 8KB cache with 4W blocks, W=64b

[Figure: 2-way set-associative cache. The 34-bit CPU address <34> is split into Tag<22>, Index<7>, and Offset<5>. The index selects a set; both ways (each holding Valid<1>, Tag<22>, and Data<256>) are read in parallel, their stored tags are compared (=?) against the address tag, a 4:1 mux selects the word within each block, and a final 2:1 MUX driven by the tag comparisons selects between the two ways. Writes go through a write buffer to the lower-level memory.]

Page 28: CPE 626 CPU Resources: Introduction to Cache Memories

Block Replacement Policy (1/2)

Direct-mapped cache: the index completely specifies the position a block can go into on a miss

N-way set associative (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss

Fully associative: the block can be written into any position

Question: if we have the choice, where should we write an incoming block?

Page 29: CPE 626 CPU Resources: Introduction to Cache Memories

Block Replacement Policy (2/2)

If there are any locations with the valid bit off (empty), then usually write the new block into the first one.

If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets "cached out" on a miss.

Replacement policies:
LRU (Least Recently Used)
Random
FIFO (First In, First Out)

Page 30: CPE 626 CPU Resources: Introduction to Cache Memories

Block Replacement Policy: LRU (1/2)

Reduce the chance of throwing out data that will be needed soon.
Accesses to blocks are recorded.
Idea: throw out the block which has been accessed least recently.

Page 31: CPE 626 CPU Resources: Introduction to Cache Memories

Block Replacement Policy: LRU (2/2)

Pro: exploits temporal locality => recent past use implies likely future use; in fact, this is a very effective policy

Con: implementation
simple for 2-way set associative - one LRU bit suffices to keep track
complex for 4-way or greater (requires complicated hardware and much time to keep track) => frequently it is only approximated

Page 32: CPE 626 CPU Resources: Introduction to Cache Memories

LRU Example

Assume a 2-way set associative cache with 4 blocks total capacity. We perform the following block accesses: 0, 4, 1, 2, 3, 1, 5.
How many misses and how many hits will there be with the LRU block replacement policy?

Event      Set 0               Set 1
           B0        B1        B0        B1
Start      - <lru>   -         - <lru>   -
0 (miss)   0         - <lru>   - <lru>   -
4 (miss)   0 <lru>   4         - <lru>   -
1 (miss)   0 <lru>   4         1         - <lru>
2 (miss)   2         4 <lru>   1         - <lru>
3 (miss)   2         4 <lru>   1 <lru>   3
1 (hit)    2         4 <lru>   1         3 <lru>
5 (miss)   2         4 <lru>   1 <lru>   5

Result: 6 misses and 1 hit.
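The trace above can be checked with a short C simulation (an added sketch, not from the slides): the block number mod 2 selects the set, and one LRU bit per set implements true LRU for 2 ways.

#include <stdio.h>

#define SETS 2
#define WAYS 2

int main(void) {
    int blocks[SETS][WAYS] = {{-1, -1}, {-1, -1}};  /* -1 = empty way */
    int lru[SETS] = {0, 0};          /* which way to evict next */
    int trace[] = {0, 4, 1, 2, 3, 1, 5};
    int n = (int)(sizeof trace / sizeof trace[0]);
    int hits = 0, misses = 0;

    for (int i = 0; i < n; i++) {
        int b = trace[i], set = b % SETS, way = -1;

        for (int w = 0; w < WAYS; w++)       /* look for the block */
            if (blocks[set][w] == b)
                way = w;

        if (way >= 0) {
            hits++;
            printf("%d (hit)\n", b);
        } else {
            misses++;
            way = lru[set];                  /* default victim: the LRU way */
            for (int w = 0; w < WAYS; w++)   /* but prefer the first empty way */
                if (blocks[set][w] == -1) { way = w; break; }
            blocks[set][way] = b;
            printf("%d (miss)\n", b);
        }
        lru[set] = 1 - way;                  /* the other way is now LRU */
    }
    printf("misses=%d hits=%d\n", misses, hits);   /* 6 misses, 1 hit */
    return 0;
}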

Page 33: CPE 626 CPU Resources: Introduction to Cache Memories

Instruction and Data caches

When a load or store instruction is executed, the pipelined processor will simultaneously request both a data word and an instruction word => a structural hazard for loads and stores

Divide and conquer:
Instruction cache (dedicated to instructions)
Data cache (dedicated to data)

Opportunity to optimize each cache separately: different capacities, block sizes, and associativities may lead to better performance

Page 34: CPE 626 CPU Resources: Introduction to Cache Memories

Example: Unified vs. Separate I&D (Harvard Architecture)

Table on page 384:
16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
32KB unified: aggregate miss rate = 1.99%

Which is better (ignoring the L2 cache)?
Assume 33% data ops => 75% of accesses are instruction fetches (1.0/1.33)
hit time = 1, miss time = 50
Note that a data hit incurs 1 stall cycle in the unified cache (it has only one port)

AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
(the sketch after the figure below reproduces this arithmetic)

[Figure: the organizations compared. Harvard: Proc connected to a split I-Cache-L1 and D-Cache-L1, backed by a Unified Cache-L2. Unified: Proc connected to a Unified Cache-L1, backed by a Unified Cache-L2.]
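For reference, this short C program (an addition, not from the slides) reproduces the AMAT arithmetic above.

#include <stdio.h>

int main(void) {
    const double hit = 1.0, penalty = 50.0;
    const double f_inst = 0.75, f_data = 0.25;     /* access mix */
    const double mr_i = 0.0064, mr_d = 0.0647;     /* 16KB split caches */
    const double mr_u = 0.0199;                    /* 32KB unified cache */

    double amat_harvard = f_inst * (hit + mr_i * penalty)
                        + f_data * (hit + mr_d * penalty);
    /* unified cache: a data hit pays 1 extra stall (single port) */
    double amat_unified = f_inst * (hit + mr_u * penalty)
                        + f_data * (hit + 1.0 + mr_u * penalty);

    printf("AMAT Harvard = %.2f\n", amat_harvard);  /* ~2.05 */
    printf("AMAT Unified = %.2f\n", amat_unified);  /* ~2.24 */
    return 0;
}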

Page 35: CPE 626 CPU Resources: Introduction to Cache Memories

Things to Remember

Where can a block be placed in the upper level?
Block placement: direct-mapped, fully associative, set-associative

How is a block found if it is in the upper level?
Block identification: Tag, Index, Offset

Which block should be replaced on a miss?
Block replacement: Random, LRU (Least Recently Used)

What happens on a write?
Write strategy: write-through (with a write buffer) vs. write-back; write allocate vs. no-write allocate

AMAT (Average Memory Access Time):
AMAT = Hit time + Miss rate x Miss penalty