Post on 23-Dec-2015
Memory Hierarchy and Cache Design
The following sources are used for preparing these slides:
• Lecture 14 from the course Computer architecture ECE 201 by Professor Mike Schulte.
• Lecture 4 from William Stallings, Computer Organization and Architecture, Prentice Hall; 6th edition, July 15, 2002.
• Lecture 6 from the course Systems Architectures II by Professors Jeremy R. Johnson and Anatole D. Ruslanov
• Some of the figures are from Computer Organization and Design: The Hardware/Software Interface, Third Edition, by David Patterson and John Hennessy, and are copyrighted material (COPYRIGHT 2004 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).
Memory Hierarchy
[Figure: Levels in the memory hierarchy. The CPU sits above Level 1, Level 2, ..., Level n. Distance from the CPU in access time and the size of the memory both increase from Level 1 to Level n. Data are transferred between the processor and the levels.]
Memory technology    Typical access time         $ per GB in 2004
SRAM                 0.5–5 ns                    $4000–$10,000
DRAM                 50–70 ns                    $100–$200
Magnetic disk        5,000,000–20,000,000 ns     $0.50–$2
SRAM vs. DRAM
• Both volatile
– Power needed to preserve data
• Dynamic cell (DRAM)
– Simpler to build, smaller
– More dense
– Less expensive
– Needs refresh
– Used for larger memory units
• Static cell (SRAM)
– Faster
– Used for cache
General Principles of Memory
• Locality
– Temporal Locality: referenced memory is likely to be referenced again soon (e.g., code within a loop)
– Spatial Locality: memory close to referenced memory is likely to be referenced soon (e.g., data in a sequentially accessed array)
• Definitions
– Upper: memory closer to processor
– Block: minimum unit that is present or not present
– Block address: location of block in memory
– Hit: data is found in the upper level
– Hit time: time to access upper level
– Miss rate: percentage of accesses for which the item is not found in upper level
• Locality + smaller HW is faster = memory hierarchy
– Levels: each smaller, faster, more expensive/byte than level below
– Inclusive: data found in upper level also found in the lower level
Cache
• Small amount of fast memory
• Sits between normal main memory and CPU
• May be located on CPU chip or module
Cache operation - overview
• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of main memory is in each cache slot
Cache/memory structure
Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Q1: Where can a block be placed?
• Direct Mapped: Each block has only one place that it can appear in the cache.
• Fully associative: Each block can be placed anywhere in the cache.
• Set associative: Each block can be placed in a restricted set of places in the cache.
– If there are n blocks in a set, the cache placement is called n-way set associative
• What is the associativity of a direct mapped cache?
Associativity Examples
Cache size is 8 blocks. Where does block 12 from memory go?
• Fully associative: block 12 can go anywhere
• Direct mapped: Block no. = (Block address) mod (No. of blocks in cache)
– Block 12 can go only into block 4 (12 mod 8 = 4) => access block using lower 3 bits
• 2-way set associative: Set no. = (Block address) mod (No. of sets in cache)
– Block 12 can go anywhere in set 0 (12 mod 4 = 0) => access set using lower 2 bits
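The placement rules above can be sketched in a few lines of Python. This is only an illustration: the function name `placement` and the convention that set s occupies cache blocks s*ways through s*ways + ways - 1 are assumptions made here, not something the slides specify.

```python
def placement(block_address, num_blocks, ways):
    """Return (set number, list of cache blocks the memory block may occupy).

    Assumes the blocks of set s are numbered s*ways .. s*ways + ways - 1.
    ways == 1 is direct mapped; ways == num_blocks is fully associative.
    """
    num_sets = num_blocks // ways
    set_no = block_address % num_sets
    first = set_no * ways
    return set_no, list(range(first, first + ways))


# Block 12 in an 8-block cache, for the three organizations on the slide.
for name, ways in [("direct mapped", 1),
                   ("2-way set associative", 2),
                   ("fully associative", 8)]:
    set_no, blocks = placement(12, 8, ways)
    print(f"{name}: set {set_no}, candidate cache blocks {blocks}")
```

Treating direct mapped and fully associative as the 1-way and 8-way cases of the same function reflects the point that all three schemes are set associative with different set sizes.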
Direct Mapped Cache
• Mapping: each memory block is mapped to exactly one location in the cache: (Block address) mod (Number of blocks in cache)
• The number of blocks is typically a power of two, so the cache location is obtained from the low-order bits of the address.
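[Figure: a direct-mapped cache with 8 entries (indices 000 to 111). Memory addresses 00001, 01001, 10001, and 11001 map to cache entry 001; addresses 00101, 01101, 10101, and 11101 map to cache entry 101.]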
Locating data in the Cache
• Index is 10 bits, while tag is 20 bits
– We need to address 1024 (2^10) words
– Any of 2^20 words could occupy each cache location
• Valid bit indicates whether an entry contains valid data or not
• Number of tag bits = address size – (log2(number of cache entries) + 2), where the 2 accounts for the byte offset
– E.g. 32 – (10 + 2) = 20
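[Figure: Address (showing bit positions). The 32-bit address is split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). The index selects one of 1024 entries (0–1023), each holding a valid bit, a 20-bit tag, and 32 bits of data. The stored tag is compared with the address tag to produce Hit, and the entry supplies Data.]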
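As a rough sketch of the lookup described on this slide, the Python below splits a 32-bit address into the 20-bit tag, 10-bit index, and 2-bit byte offset and checks the valid bit and tag. The names `split_address`, `lookup`, and `fill`, and the dictionary layout of a cache entry, are illustrative assumptions rather than anything defined in the slides.

```python
INDEX_BITS = 10        # 1024 entries
BYTE_OFFSET_BITS = 2   # 4 bytes per word

# Each entry holds a valid bit, a tag, and one word of data.
cache = [{"valid": False, "tag": 0, "data": 0} for _ in range(1 << INDEX_BITS)]


def split_address(addr):
    """Split a 32-bit byte address into (tag, index, byte offset)."""
    byte_offset = addr & ((1 << BYTE_OFFSET_BITS) - 1)
    index = (addr >> BYTE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_offset


def lookup(addr):
    """Return (hit, data); a hit requires a set valid bit and a matching tag."""
    tag, index, _ = split_address(addr)
    entry = cache[index]
    hit = entry["valid"] and entry["tag"] == tag
    return hit, (entry["data"] if hit else None)


def fill(addr, word):
    """Place a word fetched from memory into the entry selected by the index."""
    tag, index, _ = split_address(addr)
    cache[index] = {"valid": True, "tag": tag, "data": word}


print(split_address(0x12345678))  # tag 0x12345, index 0x19E, byte offset 0
```

A second address with the same index but a different tag (for example 0x00000678) misses even when the entry is valid, which is exactly the conflict a direct-mapped cache cannot avoid.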
Q2: How Is a Block Found?
• The address can be divided into two main parts
– Block offset: selects the data from the block
offset size = log2(block size)
– Block address: tag + index
» index: selects set in cache
index size = log2(#blocks/associativity)
» tag: compared to tag in cache to determine hit
tag size = address size - index size - offset size
• Each block has a valid bit that tells if the block is valid. The block is in the cache if the tags match and the valid bit is set.
[Figure: the block address divided into Tag and Index fields]
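The three size formulas can be checked with a short Python sketch. The function name `field_sizes` is hypothetical, and the code assumes the block size and the number of sets are powers of two so that log2 is exact.

```python
def field_sizes(address_bits, block_bytes, num_blocks, associativity):
    """Return (tag bits, index bits, offset bits) for a cache.

    Assumes block_bytes and num_blocks / associativity are powers of two,
    so bit_length() - 1 equals log2 exactly.
    """
    offset_bits = block_bytes.bit_length() - 1
    num_sets = num_blocks // associativity
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 1024 one-word (4-byte) blocks, direct mapped, 32-bit addresses.
print(field_sizes(32, 4, 1024, 1))
# The same 1024 blocks organized as a four-way set-associative cache.
print(field_sizes(32, 4, 1024, 4))
```

Raising the associativity from 1 to 4 moves two bits from the index to the tag, because there are four times fewer sets to select among.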
Q4: What Happens on a Write?
• Write through: The information is written to both the block in the cache and to the block in the lower-level memory.
• Write back: The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
– Is the block clean or dirty? (Add a dirty bit to each block)
• Pros and cons of each:
– Write through
» Read misses cannot result in writes to memory
» Easier to implement
» Always combined with write buffers to avoid waiting on memory latency
– Write back
» Less memory traffic
» Perform writes at the speed of the cache
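To make the "less memory traffic" point concrete, here is a minimal sketch that counts writes reaching main memory under each policy. It assumes a direct-mapped cache, a stream consisting only of writes given as block addresses, and write-allocate on a write-back miss; the function name `memory_writes` and the sample sequence are made up for illustration.

```python
def memory_writes(write_blocks, num_blocks, policy):
    """Count writes that reach main memory for a sequence of block writes.

    Assumes a direct-mapped cache and, for write-back, write-allocate.
    Dirty blocks still in the cache at the end are not counted.
    """
    if policy == "write-through":
        # Every write goes to the cache and to memory.
        return len(write_blocks)

    tags = [None] * num_blocks     # which memory block each cache block holds
    dirty = [False] * num_blocks   # dirty bit per cache block
    to_memory = 0
    for block in write_blocks:
        index = block % num_blocks
        if tags[index] != block:
            if dirty[index]:
                to_memory += 1     # write the replaced dirty block back
            tags[index] = block
        dirty[index] = True        # the write itself only touches the cache
    return to_memory


sequence = [0, 0, 0, 4, 4, 0]      # blocks 0 and 4 conflict in a 4-block cache
print(memory_writes(sequence, 4, "write-through"))
print(memory_writes(sequence, 4, "write-back"))
```

For this sequence write-through sends all six writes to memory, while write-back sends only the two replaced dirty blocks, with one dirty block still waiting in the cache.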
Reducing Cache Misses with a more Flexible Replacement Strategy
• In a direct mapped cache a block can go in exactly one place in cache
• In a fully associative cache a block can go anywhere in cache
• A compromise is to use a set associative cache where a block can go into a fixed number of locations in cache, determined by:
(Block number) mod (Number of sets in cache)
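[Figure: locating a block in an 8-block cache under three organizations. Direct mapped: blocks 0–7, the search checks the tag of a single block. Set associative: sets 0–3, the search checks every tag in one set. Fully associative: the search checks every tag in the cache.]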
Example
• Three small 4 word caches:
Direct mapped, two-way set associative, fully associative
• How many misses in the sequence of block addresses: 0, 8, 0, 6, 8?
• How does this change with 8 words, 16 words?
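One way to work this example is to simulate it. The sketch below assumes one-word blocks (so a "4 word cache" has 4 blocks) and LRU replacement within each set; the function name `count_misses` is an assumption made here for illustration.

```python
def count_misses(block_addresses, num_blocks, ways):
    """Count misses for a cache with LRU replacement within each set.

    ways == 1 models direct mapped; ways == num_blocks models fully associative.
    """
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each list is ordered LRU first
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)            # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used block
        lines.append(block)
    return misses


sequence = [0, 8, 0, 6, 8]
for size in (4, 8, 16):
    for name, ways in [("direct mapped", 1), ("two-way", 2), ("fully associative", size)]:
        print(f"{size} blocks, {name}: {count_misses(sequence, size, ways)} misses")
```

With 4 blocks the sequence gives 5, 4, and 3 misses for the direct-mapped, two-way, and fully associative caches. Under the same assumptions, 8 blocks give 5, 3, and 3, and 16 blocks give 3 for all three organizations, since blocks 0, 8, and 6 no longer conflict.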
Locating a Block in Cache
• Check the tag of every cache block in the appropriate set
• Address consists of 3 parts: tag, index, block offset
• Replacement strategy: e.g., Least Recently Used (LRU)
Program   Assoc.   I miss rate   D miss rate   Combined rate
gcc       1        2.0%          1.7%          1.9%
gcc       2        1.6%          1.4%          1.5%
gcc       4        1.6%          1.4%          1.5%
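[Figure: a four-way set-associative cache. The 32-bit address supplies a 22-bit tag and an 8-bit index that selects one of 256 sets (0–255). Each set holds four (V, Tag, Data) entries whose tags are compared in parallel; a 4-to-1 multiplexor selects the 32-bit data of the matching entry. Outputs: Hit and Data.]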
Size of Tags vs. Associativity
• Increasing associativity requires more comparators, as well as more tag bits per cache block.
• Assume a cache with 4K 4-word blocks and 32 bit addresses
• Find the total number of sets and the total number of tag bits for a
– direct mapped cache
– two-way set associative cache
– four-way set associative cache
– fully associative cache
Size of Tags vs. Associativity
• Total cache size: 4K blocks x 4 words/block x 4 bytes/word = 64 KB
• Direct mapped cache:
– 16 bytes/block => 4 offset bits, leaving 28 bits for tag and index
– # sets = # blocks = 4K
– log2(4K) = 12 bits for index => 16 bits for tag
– Total # of tag bits = 16 bits x 4K locations = 64 Kbits
• Two-way set-associative cache:
– 32 bytes/set
– 16 bytes/block => 28 bits for tag and index
– # sets = # blocks / 2 => 2K sets
– log2(2K) = 11 bits for index => 17 bits for tag
– Total # of tag bits = 17 bits x 2 locations/set x 2K sets = 68 Kbits
Size of Tags vs. Associativity
• Four-way set-associative cache:
– 64 bytes/set
– 16 bytes/block => 28 bits for tag and index
– # sets = # blocks / 4 => 1K sets
– log2(1K) = 10 bits for index => 18 bits for tag
– Total # of tag bits = 18 bits x 4 locations/set x 1K sets = 72 Kbits
• Fully associative cache:
– 1 set of 4K blocks => 28 bits for tag and index
– Index = 0 bits => tag will have 28 bits
– Total # of tag bits = 28 bits x 4K locations/set x 1 set = 112 Kbits
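The four results can be reproduced with a small calculation. This is a sketch using a hypothetical helper `tag_storage`, and it assumes power-of-two block sizes and set counts, as in the exercise.

```python
def tag_storage(num_blocks, block_bytes, ways, address_bits=32):
    """Return (number of sets, tag bits per block, total tag bits)."""
    num_sets = num_blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return num_sets, tag_bits, tag_bits * num_blocks


NUM_BLOCKS = 4 * 1024    # 4K blocks
BLOCK_BYTES = 4 * 4      # 4 words x 4 bytes/word

for name, ways in [("direct mapped", 1), ("two-way", 2),
                   ("four-way", 4), ("fully associative", NUM_BLOCKS)]:
    sets, tag, total = tag_storage(NUM_BLOCKS, BLOCK_BYTES, ways)
    print(f"{name}: {sets} sets, {tag}-bit tag, {total // 1024} Kbits of tags")
```

Each doubling of associativity halves the number of sets, removes one index bit, and adds one tag bit to every one of the 4K blocks, which is why the total grows by 4 Kbits per step.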
Measuring Cache Performance
• CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock-cycle time
• Memory stall clock cycles = Read-stall cycles + Write-stall cycles
• Read-stall cycles = Reads/program x Read miss rate x Read miss penalty
• Write-stall cycles = (Writes/program x Write miss rate x Write miss penalty) + Write buffer stalls (assumes write-through cache)
• Write buffer stalls should be negligible, and write and read miss penalties equal (the cost to fetch a block from memory)
• Under those assumptions: Memory stall clock cycles = Memory accesses/program x Miss rate x Miss penalty
Example I
• Assume I-miss rate of 2% and D-miss rate of 4% (gcc)
• Assume CPI = 2 (without stalls) and miss penalty of 40 cycles
• Assume 36% loads/stores
• What is the CPI with memory stalls?
• How much faster would a machine with perfect cache run?
• What happens if the processor is made faster, but the memory system stays the same (e.g. reduce CPI to 1)?
Calculation I
• Instruction miss cycles = I x 100% x 2% x 40 = .80 x I
• Data miss cycles = I x 36% x 4% x 40 = .58 x I
• Total miss cycles = .80 x I + .58 x I = 1.38 x I
• CPI = 2 + 1.38 = 3.38
• Perf(perfect) / Perf(stall) = 3.38 / 2 = 1.69
• For a processor with base CPI = 1:
– CPI = 1 + 1.38 = 2.38, so Perf(perfect) / Perf(stall) = 2.38 / 1 = 2.38
• Time spent on stalls for slower processor: 1.38/3.38 = 41%
• Time spent on stalls for faster processor: 1.38/2.38 = 58%
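The arithmetic above can be reproduced with a short Python sketch. The function name `cpi_with_stalls` and its parameter names are illustrative assumptions. The code keeps the unrounded data-miss term (0.576 rather than .58), so its results differ from the slide values only in the third decimal place.

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_fraction, miss_penalty):
    """CPI including memory stalls, per instruction.

    Every instruction is fetched once; mem_fraction of them also access data.
    """
    instruction_stalls = 1.0 * i_miss_rate * miss_penalty
    data_stalls = mem_fraction * d_miss_rate * miss_penalty
    return base_cpi + instruction_stalls + data_stalls


for base in (2, 1):
    cpi = cpi_with_stalls(base, 0.02, 0.04, 0.36, 40)
    stall_fraction = (cpi - base) / cpi
    print(f"base CPI {base}: CPI with stalls = {cpi:.2f}, "
          f"perfect-cache speedup = {cpi / base:.2f}, "
          f"time in stalls = {stall_fraction:.0%}")
```

The stall cycles per instruction are the same in both cases, so lowering the base CPI makes the memory system a larger share of execution time.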