CPE 626 CPU Resources: Introduction to Cache Memories
Aleksandar Milenkovic
E-mail: [email protected]
Web: http://www.ece.uah.edu/~milenka
Outline
- Processor-Memory Speed Gap
- Memory Hierarchy
- Spatial and Temporal Locality
- What is a Cache?
- Four Questions about Caches
- Example: Alpha AXP 21064 Data Cache
- Fully Associative Cache
- N-Way Set Associative Cache
- Block Replacement Policies
- Cache Performance
Review: Processor-Memory Performance Gap
[Figure: performance (log scale, 1 to 1000) vs. year (1980-2000). Processor performance doubles every 1.5 years; DRAM memory performance doubles every 10 years. The processor-memory performance gap grows 50% per year.]
The Memory Hierarchy (MH)
[Figure: processor (control and datapath) at the top of a pyramid of memory levels. Upper levels are the fastest, smallest, and highest in cost per bit; lower levels are the slowest, biggest, and lowest in cost per bit.]
The user sees as much memory as is available in the cheapest technology and accesses it at the speed offered by the fastest technology.
Why does the hierarchy work?
Principle of locality:
- Temporal locality: recently accessed items are likely to be accessed in the near future => keep them close to the processor
- Spatial locality: items whose addresses are near one another tend to be referenced close together in time => move blocks consisting of contiguous words to the upper level
Rule of thumb: programs spend 90% of their execution time in only 10% of their code.
[Figure: probability of reference vs. address space, showing accesses clustered in a few regions.]
What is a Cache?
Small, fast storage used to improve the average access time to slow memory; it exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
- Registers: a cache on variables
- First-level cache: a cache on the second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on disk (virtual memory)
- TLB: a cache on the page map table
- Branch prediction: a cache on prediction information?
Four questions about caches
- Where can a block be placed in the upper level? Block placement: direct-mapped, fully associative, set-associative
- How is a block found if it is in the upper level? Block identification
- Which block should be replaced on a miss? Block replacement: Random, LRU (Least Recently Used)
- What happens on a write? Write strategy: write-through vs. write-back; write allocate vs. no-write allocate
Direct-Mapped Cache (1/3)
- In a direct-mapped cache, each memory address is associated with one possible block within the cache; therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
- A block is the unit of transfer between cache and memory
Direct-Mapped Cache (2/3)
[Figure: a 16-byte memory (addresses 0 through F) mapping into a 4-byte cache (indices 0 through 3).]
Cache location 0 can be occupied by data from memory locations 0, 4, 8, ...; in general, any memory location that is a multiple of 4.
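The mapping on this slide can be sketched in a few lines of Python (the names are illustrative, not from the slides): with a 1-byte block, the cache index is simply the memory address modulo the number of cache locations.

```python
# Direct-mapped placement for the 4-byte cache above (block size = 1 byte):
# every memory address maps to exactly one cache location.
NUM_BLOCKS = 4  # cache locations 0..3

def cache_index(address):
    # Memory address modulo the number of cache blocks.
    return address % NUM_BLOCKS

# Memory locations 0, 4, 8, C (12) all compete for cache location 0.
assert [cache_index(a) for a in (0x0, 0x4, 0x8, 0xC)] == [0, 0, 0, 0]
```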
Direct-Mapped Cache (3/3)
- Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
- What if we have a block size > 1 byte?
- Result: divide the memory address into three fields:
  tttttttttttttttttt iiiiiiiiii oooo
  (tag + index = block address)
- TAG: to check if we have the correct block
- INDEX: to select the block
- OFFSET: to select the byte within the block
Direct-Mapped Cache Terminology
- INDEX: specifies the cache index (which "row" of the cache we should look in)
- OFFSET: once we have found the correct block, specifies which byte within the block we want
- TAG: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
- BLOCK ADDRESS: TAG + INDEX
Direct-Mapped Cache Example
Conditions: 32-bit architecture (word = 32 bits), the address unit is the byte; 8KB direct-mapped cache with 4-word blocks.
Determine the size of the Tag, Index, and Offset fields:
- OFFSET (specifies the correct byte within the block): a cache block contains 4 words = 16 (2^4) bytes => 4 bits
- INDEX (specifies the correct row in the cache): the cache size is 8KB = 2^13 bytes and a cache block is 2^4 bytes; #rows in the cache (1 block = 1 row): 2^13/2^4 = 2^9 => 9 bits
- TAG: memory address length - offset - index = 32 - 4 - 9 = 19 => the tag is the leftmost 19 bits
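The field-width arithmetic above can be checked with a short script (variable names are illustrative):

```python
import math

# Field widths for the worked example: 32-bit byte addresses,
# 8 KB direct-mapped cache, 4-word (16-byte) blocks.
ADDRESS_BITS = 32
CACHE_BYTES  = 8 * 1024   # 2^13
BLOCK_BYTES  = 4 * 4      # 4 words x 4 bytes = 2^4

offset_bits = int(math.log2(BLOCK_BYTES))                 # byte within block
index_bits  = int(math.log2(CACHE_BYTES // BLOCK_BYTES))  # row in the cache
tag_bits    = ADDRESS_BITS - index_bits - offset_bits     # everything left

print(tag_bits, index_bits, offset_bits)  # 19 9 4
```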
What to do on a write hit?
- Write through (or store through): update the word in the cache block and the corresponding word in memory
- Write back (or copy back): update the word in the cache block only, allowing the memory word to be "stale"; the modified cache block is written to main memory only when it is replaced => add a 'dirty' bit to each block indicating that memory needs to be updated when the block is replaced; if the block is clean, it is not written back on a miss, since memory has information identical to the cache => the OS flushes the cache before I/O !!!
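The write-back bookkeeping can be sketched as follows (a minimal illustration with made-up names; real hardware tracks the dirty bit per cache line exactly like this):

```python
# Write-back sketch: a write hit only marks the block dirty;
# memory is updated when a dirty block is evicted.

class CacheBlock:
    def __init__(self, tag, data):
        self.tag = tag
        self.data = data
        self.dirty = False  # memory copy is still up to date

def write_hit(block, word_index, value):
    block.data[word_index] = value
    block.dirty = True      # memory word is now "stale"

def evict(block, memory):
    # Write back only if the block was modified; clean blocks are dropped.
    if block.dirty:
        memory[block.tag] = list(block.data)

memory = {}
b = CacheBlock(tag=0x42, data=[0, 0, 0, 0])
write_hit(b, 1, 99)        # no memory traffic here
evict(b, memory)           # the single write-back happens on replacement
assert memory[0x42] == [0, 99, 0, 0]
```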
What to do on a write miss?
- Write allocate (or fetch on write): the block is loaded on a write miss, followed by the write-hit actions
- No-write allocate (or write around): the block is modified in memory and not loaded into the cache
- Although either write-miss policy can be used with write through or write back, write-back caches generally use write allocate and write-through caches often use no-write allocate
Alpha AXP 21064 Data Cache
- 8KB direct-mapped cache with 32B blocks; the CPU presents a 34-bit address to the cache => the offset is 5 bits, the index is 8 bits, and the tag is 21 bits
- Write through, with a four-block write buffer
- No-write allocate on a write miss
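The address split described above can be expressed with shifts and masks (a minimal sketch; the field widths come from the slide, the function name is illustrative):

```python
# Splitting the 21064 data cache's 34-bit address into
# <tag<21>, index<8>, offset<5>>.
OFFSET_BITS, INDEX_BITS = 5, 8

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                  # byte in 32B block
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # 1 of 256 lines
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)               # remaining 21 bits
    return tag, index, offset

# The all-ones 34-bit address has the maximum value in every field.
assert split_address((1 << 34) - 1) == ((1 << 21) - 1, 255, 31)
```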
Alpha AXP 21064 Data Cache
[Figure: the 34-bit CPU address is split into tag<21>, index<8>, and offset<5>. Each cache line holds valid<1>, tag<21>, and data<256>. The index selects a line, the stored tag is compared (=?) with the address tag, and a 4:1 mux selects the desired word for the data out path. Writes go through a write buffer to the lower level memory.]
Alpha AXP 21064 Data Cache: Read Hit
1. The CPU presents a 34-bit address to the cache
2. The address is divided into <tag, index, offset>
3. The index selects the tag to be tested to see if the desired block is in the cache
4. The tag read from the cache is compared to the tag portion of the address; the data can be read and sent to the CPU in parallel with the tag being read and checked
5. If the tag matches and the valid bit is set, the cache signals the CPU to load the data
Alpha AXP 21064 Data Cache: Write Hit
The first four steps are the same as for a read.
5. The corresponding word in the cache is written and also sent to the write buffer
6a. If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU's perspective
6b. If the buffer contains other modified blocks, the addresses are checked to see if the address of the new data matches; if so, the new data are combined with that entry (write merging)
6c. If the buffer is full and there is no address match, the cache and CPU must wait
[Figure: write merging in a four-entry write buffer; each entry holds a write address (WA) and four words, each with a valid bit (V). Without merging, sequential word writes to addresses 100, 104, 108, and 10C occupy all four entries, each with a single valid word. With merging, all four writes combine into the single entry at write address 100, with all four valid bits set.]
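The write-merging behavior can be sketched like this (an illustrative model, not the 21064's actual logic; names and structure are assumptions):

```python
# Write-merging sketch: the buffer holds block-aligned entries; a new
# store whose block address matches an existing entry is merged into it
# instead of occupying a new slot.
BLOCK_BYTES = 16  # 4 words of 4 bytes, as in the slide's example

class WriteBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.slots = {}  # block address -> {byte offset: data}

    def write(self, addr, data):
        block = addr - addr % BLOCK_BYTES
        if block in self.slots:                # write merging
            self.slots[block][addr % BLOCK_BYTES] = data
            return True
        if len(self.slots) < self.entries:     # take a free entry
            self.slots[block] = {addr % BLOCK_BYTES: data}
            return True
        return False                           # buffer full: CPU must wait

wb = WriteBuffer()
for a in (0x100, 0x104, 0x108, 0x10C):  # four sequential word stores
    wb.write(a, 0xAB)
assert len(wb.slots) == 1               # merged into a single entry
```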
Alpha AXP 21064 Data Cache: Read Miss
The first four steps are the same as for a read.
5. The cache sends a stall signal to the CPU telling it to wait, and 32B are read from the next level of the hierarchy; in the DEC 3000 model 800, the path to the lower level is 16 bytes wide, so it takes 5 clock cycles per transfer, or 10 clock cycles for all 32 bytes
6. Replace the block (update the data, tag, and valid bit)
Alpha AXP 21064 Data Cache: Write Miss
The first four steps are the same as for a read.
5. Since the tag does not match (or the valid bit is not set), the CPU writes around the cache to the lower level memory and does not affect the cache (no-write allocate)
Fully Associative Cache
Memory address fields:
- Tag: same as before
- Offset: same as before
- Index: non-existent
What does this mean?
- any block can be placed anywhere in the cache
- we must compare with all tags in the entire cache to see if the data is there
Fully Associative Cache
[Figure: an 8KB fully associative cache with 4-word blocks (512 cache blocks). The address (bits 31 down to 0) supplies only a cache tag (28 bits long) and a byte offset; the tag is compared in parallel (=) against every valid cache tag, and the matching entry supplies the cache data.]
Fully Associative Cache
Benefit of fully associative caches:
- data can go anywhere in the cache => no conflict misses
Drawbacks:
- expensive in hardware, since every single entry needs a comparator (if we have 64KB of data in a cache with 4B entries, we need 16K comparators: infeasible)
- may slow the processor clock rate
N-Way Set-Associative Cache (1/3)
Memory address fields:
- Tag: same as before
- Offset: same as before
- Index: points us to the correct "row" (called a set in this case)
So what is the difference?
- each set contains multiple blocks
- once we have found the correct set, we must compare with all tags in that set to find our data
N-Way Set-Associative Cache (2/3)
Summary:
- the cache is direct-mapped with respect to sets
- each set is fully associative
- basically N direct-mapped caches working in parallel: each has its own valid bit and data
Given a memory address:
- find the correct set using the Index value
- compare the Tag with all Tag values in the determined set
- if a match occurs, it is a hit; otherwise, a miss
- finally, use the Offset field as usual to find the desired data within the desired block
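The lookup steps above can be sketched as follows (an illustrative model; the function and field names are assumptions, and the per-way comparisons that hardware performs in parallel are written as a loop):

```python
# N-way set-associative lookup: the index selects the set, then every
# tag in the set is compared against the address tag.

def lookup(cache, address, num_sets, block_bytes):
    block_addr = address // block_bytes
    index = block_addr % num_sets     # which set
    tag = block_addr // num_sets      # what must match inside the set
    for way in cache[index]:          # done in parallel in hardware
        if way["valid"] and way["tag"] == tag:
            return True               # hit
    return False                      # miss

# A tiny 2-way cache with 2 sets and 16-byte blocks.
cache = [[{"valid": True, "tag": 5}, {"valid": False, "tag": 0}],
         [{"valid": True, "tag": 7}, {"valid": True, "tag": 2}]]

# Block address 10 -> index 0, tag 5 -> hit in set 0, way 0.
assert lookup(cache, 10 * 16, num_sets=2, block_bytes=16) is True
# Block address 12 -> index 0, tag 6 -> no matching tag -> miss.
assert lookup(cache, 12 * 16, num_sets=2, block_bytes=16) is False
```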
N-Way Set-Associative Cache (3/3)
What is so great about this?
- even a 2-way set associative cache avoids a lot of conflict misses
- the hardware cost is not so bad (N comparators)
Disadvantages:
- extra MUX delay for the data
- the data comes AFTER the hit/miss decision and set selection; in a direct-mapped cache, the cache block is available BEFORE hit/miss
In fact, for a cache with M blocks:
- it is direct mapped if it is 1-way set associative
- it is fully associative if it is M-way set associative
- so these two are just special cases of the more general set-associative cache
2-Way Set Associative, 8KB Cache with 4W Blocks, W=64b
[Figure: the 34-bit CPU address is split into tag<22>, index<7>, and offset<5>. Two ways, each holding valid<1>, tag<22>, and data<256>, are accessed in parallel; each way has its own tag comparator (=?) and 4:1 word mux, and a final 2:1 MUX selects the data out from the hitting way. Writes go through a write buffer to the lower level memory.]
Block Replacement Policy (1/2)
- Direct-mapped cache: the index completely specifies the position a block can go in on a miss
- N-way set associative (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss
- Fully associative: the block can be written into any position
- Question: if we have the choice, where should we write an incoming block?
Block Replacement Policy (2/2)
- If there are any locations with the valid bit off (empty), then usually write the new block into the first one
- If all possible locations already have a valid block, we must pick a replacement policy: a rule by which we determine which block gets "cached out" on a miss
Replacement policies:
- LRU (Least Recently Used)
- Random
- FIFO (First In First Out)
Block Replacement Policy: LRU (1/2)
- Reduce the chance of throwing out data that will be needed soon
- Accesses to blocks are recorded
- Idea: throw out the block which has been accessed least recently
Block Replacement Policy: LRU (2/2)
- Pro: exploits temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
- Con: the implementation is simple for 2-way set associative (one LRU bit to keep track) but complex for 4-way or greater (it requires complicated hardware and much time to keep track of this) => frequently it is only approximated
LRU Example
Assume a 2-way set associative cache with 4 blocks total capacity; we perform the following block accesses: 0, 4, 1, 2, 3, 1, 5.
How many misses and how many hits will there be with the LRU block replacement policy?

Event    | Set 0: B0     B1     | Set 1: B0     B1
Start    | -<lru>  -            | -<lru>  -
0 (miss) | 0       -<lru>       | -<lru>  -
4 (miss) | 0<lru>  4            | -<lru>  -
1 (miss) | 0<lru>  4            | 1       -<lru>
2 (miss) | 2       4<lru>       | 1       -<lru>
3 (miss) | 2       4<lru>       | 1<lru>  3
1 (hit)  | 2       4<lru>       | 1       3<lru>
5 (miss) | 2       4<lru>       | 1<lru>  5

Result: 6 misses and 1 hit.
Instruction and Data Caches
- When a load or store instruction is executed, the pipelined processor will simultaneously request both a data word and an instruction word => a structural hazard for loads and stores
- Divide and conquer: an instruction cache (dedicated to instructions) and a data cache (dedicated to data)
- Opportunity to optimize each cache separately: different capacities, block sizes, and associativities may lead to better performance
Example: Unified vs. Separate I&D (Harvard)
Table on page 384:
- 16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
- 32KB unified: aggregate miss rate = 1.99%
Which is better (ignoring the L2 cache)?
- Assume 33% data ops => 75% of accesses come from instructions (1.0/1.33); hit time = 1, miss time = 50
- Note that a data hit has 1 extra stall cycle in the unified cache (only one port)
- AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
- AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
[Figure: the Harvard organization (processor with separate L1 I-cache and D-cache in front of a unified L2) vs. the unified organization (processor with a unified L1 in front of a unified L2).]
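The two AMAT figures can be reproduced with a few lines of arithmetic (variable names are illustrative):

```python
# AMAT comparison from the example: hit time 1, miss penalty 50 cycles,
# 75% instruction accesses and 25% data accesses.
MISS_PENALTY = 50

def amat(hit_time, miss_rate):
    return hit_time + miss_rate * MISS_PENALTY

# Split (Harvard) caches: separate I and D ports, hit time 1 for both.
amat_harvard = 0.75 * amat(1, 0.0064) + 0.25 * amat(1, 0.0647)
# Unified cache: a data access stalls one extra cycle (single port).
amat_unified = 0.75 * amat(1, 0.0199) + 0.25 * amat(2, 0.0199)

print(round(amat_harvard, 2), round(amat_unified, 2))  # 2.05 2.24
```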
Things to Remember
- Where can a block be placed in the upper level? Block placement: direct-mapped, fully associative, set-associative
- How is a block found if it is in the upper level? Block identification: Tag, Index, Offset
- Which block should be replaced on a miss? Block replacement: Random, LRU (Least Recently Used)
- What happens on a write? Write strategy: write-through (with a write buffer) vs. write-back; write allocate vs. no-write allocate
- AMAT = Average Memory Access Time:
  AMAT = Hit time + Miss rate x Miss penalty