Memory Hierarchy: Cache Design Principles and Performance
CprE 581, Fall 2016. Slides adapted and revised from UC Berkeley CS252 S01 and F06.
Adapted from UCB CS252 S01, Copyright 2001 UCB
Copyright © 2012, Elsevier Inc. All rights reserved.
Introduction
- Programmers want unlimited amounts of memory with low latency.
- Fast memory technology is more expensive per bit than slower memory.
- Solution: organize the memory system into a hierarchy.
  - The entire addressable memory space is available in the largest, slowest memory.
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories.
- This gives the illusion of a large, fast memory being presented to the processor.
Memory Hierarchy
Since 1980, CPU performance has outpaced DRAM:
- CPU: 60% per year (2X in 1.5 years)
- DRAM: 9% per year (2X in 10 years)
[Figure: performance (1/latency) vs. year for CPU and DRAM; the gap grew 50% per year.]
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM, creating a "memory hierarchy".
Memory Performance Gap
In 1977, DRAM was faster than microprocessors. The Apple ][ (1977), by Steve Wozniak and Steve Jobs:
- CPU: 1000 ns
- DRAM: 400 ns
Memory Hierarchy Design
Memory hierarchy design becomes more crucial with recent multi-core processors:
- Aggregate peak bandwidth grows with the number of cores.
  - An Intel Core i7 can generate two references per core per clock.
  - With four cores and a 3.2 GHz clock: 25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s!
- DRAM bandwidth is only 6% of this (25 GB/s).
- This requires multi-port, pipelined caches, two levels of cache per core, and a shared third-level cache on chip (see the arithmetic check below).
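A quick check of the bandwidth arithmetic above, with the core count, clock, and reference widths taken from the slide:

#include <stdio.h>

int main(void) {
    double cores = 4, clock_ghz = 3.2;
    double data_refs = cores * clock_ghz * 2;  /* 25.6 G 64-bit data refs/s  */
    double inst_refs = cores * clock_ghz;      /* 12.8 G 128-bit inst refs/s */
    double peak_gb_s = data_refs * 8 + inst_refs * 16;
    printf("peak demand = %.1f GB/s\n", peak_gb_s);  /* prints 409.6 */
    return 0;
}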
Levels of the Memory Hierarchy
Capacity, access time, cost, and the staging transfer unit at each boundary:
- Registers: 100s of bytes, <10s ns. Staged by the program/compiler as instruction operands (1-8 bytes).
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit. Staged by the cache controller as blocks (8-128 bytes).
- Main Memory: MBytes, 200-500 ns, 10^-4 to 10^-5 cents/bit. Staged by the OS as pages (512 bytes-4 KB).
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit. Staged by the user/operator as files (MBytes).
- Tape: infinite capacity, access time of seconds to minutes, 10^-8 cents/bit.
Upper levels (toward the processor) are faster; lower levels are larger. Today's focus: the cache.
The Principle of Locality
- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  - Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
- For the last 15 years, hardware has relied on locality for speed.
- Locality is a property of programs that is exploited in machine design.
Performance and Power
High-end microprocessors have >10 MB of on-chip cache, which consumes a large share of the area and power budget.
Programs with locality cache well ...
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
[Figure (Hatfield and Gerald): memory address, one dot per access, plotted against time. The plot shows bands of spatial locality, runs of temporal locality, and regions of bad locality behavior.]
Memory Hierarchy: Terminology
- Hit: the data appears in some block in the upper level (example: Block X).
  - Hit Rate: the fraction of memory accesses found in the upper level.
  - Hit Time: the time to access the upper level, which consists of the RAM access time + the time to determine hit/miss.
- Miss: the data must be retrieved from a block in the lower level (Block Y).
  - Miss Rate = 1 - (Hit Rate).
  - Miss Penalty: the time to replace a block in the upper level + the time to deliver the block to the processor.
- Hit Time << Miss Penalty (a miss costs about 500 instructions on the Alpha 21264!)
[Figure: the processor exchanges Block X with the upper-level memory, which exchanges Block Y with the lower-level memory.]
Memory Hierarchy Basics
- When a word is not found in the cache, a miss occurs:
  - Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference.
  - The lower level may be another cache or the main memory.
  - Also fetch the other words contained within the block, taking advantage of spatial locality.
- Place the block into the cache in any location within its set, which is determined by (block address) MOD (number of sets), as in the sketch below.
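A minimal sketch of this placement rule in C, assuming (for illustration only) 64-byte blocks and 512 sets:

#include <stdint.h>
#include <stdio.h>

enum { BLOCK_SIZE = 64, NUM_SETS = 512 };  /* assumed parameters */

int main(void) {
    uint32_t addr = 0x12345678;
    uint32_t block_addr = addr / BLOCK_SIZE;  /* strip the block offset */
    uint32_t set = block_addr % NUM_SETS;     /* block address MOD number of sets */
    uint32_t tag = block_addr / NUM_SETS;     /* remaining high-order bits */
    printf("addr 0x%x -> tag 0x%x, set %u, offset %u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)set,
           (unsigned)(addr % BLOCK_SIZE));
    return 0;
}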
Cache Algorithm (Read)
Look at the processor address and search the cache tags to find a match. Then either:
- HIT (found in cache): return a copy of the data from the cache.
- MISS (not in cache): read the block of data from main memory, wait, then return the data to the processor and update the cache.
Hit Rate = fraction of accesses found in cache
Miss Rate = 1 - Hit Rate
Hit Time = RAM access time + time to determine HIT/MISS
Miss Time = time to replace a block in the cache + time to deliver the block to the processor
Memory Hierarchy Basics
- Set cardinality n (n blocks per set) => n-way set associative.
  - Direct-mapped cache => one block per set.
  - Fully associative => one set containing all the blocks.
- Writing to the cache: two strategies (sketched below).
  - Write-through: immediately update the lower levels of the hierarchy.
  - Write-back: only update the lower levels of the hierarchy when an updated block is replaced.
  - Both strategies use a write buffer to make writes asynchronous.
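A toy sketch of the two write policies, assuming a 64-byte line and a helper lower_level_write() standing in for the next level of the hierarchy (both are illustrative, not any real cache's interface):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t tag;
    bool valid;
    bool dirty;              /* meaningful for write-back only */
    uint8_t data[64];
} CacheLine;

void lower_level_write(uint32_t addr, const uint8_t *data);  /* assumed helper */

void write_hit(CacheLine *line, uint32_t addr, const uint8_t *data,
               bool write_through) {
    memcpy(line->data, data, sizeof line->data);  /* update the cached copy */
    if (write_through)
        lower_level_write(addr, data);  /* propagate now (buffered in practice) */
    else
        line->dirty = true;             /* defer until the block is evicted */
}

void evict(CacheLine *line, uint32_t addr) {
    if (line->valid && line->dirty)     /* write-back: flush dirty data on replace */
        lower_level_write(addr, line->data);
    line->valid = false;
}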
Memory Hierarchy Basics
- Miss rate: the fraction of cache accesses that result in a miss.
- Causes of misses (the three C's):
  - Compulsory: the first reference to a block.
  - Capacity: blocks discarded (the cache cannot hold them all) and later retrieved.
  - Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache.
Inside a Cache
[Figure: the processor issues addresses to the cache and exchanges data with it; the cache does the same with main memory. Each cache line pairs an address tag with a data block of several data bytes, e.g., a line with tag 100 holds copies of main memory locations 100 and 101; other lines show tags such as 304, 6848, and 416.]
4 Questions for Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)
Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache under fully associative, direct-mapped, and 2-way set associative placement. Set associative mapping = block number modulo number of sets.
- Fully associative: block 12 can go in any of the 8 blocks.
- Direct mapped: block 12 can go only in block (12 mod 8) = 4.
- 2-way set associative: block 12 can go anywhere in set (12 mod 4) = 0.
[Figure: the three cache organizations (blocks 0-7) above a 32-block memory (blocks 0-31).]
Q2: How is a block found if it is in the upper level?
- Keep a tag on each block; there is no need to check the index or block offset.
- Increasing associativity shrinks the index and expands the tag.
Memory address layout: | Tag | Index | Block Offset |, where Tag + Index form the block address.
Figure B.5 The organization of the data cache in the Opteron microprocessor. The 64 KB cache is two-way set associative with 64-byte blocks. The 9-bit index selects among 512 sets. The four steps of a read hit, shown as circled numbers in order of occurrence, label this organization. Three bits of the block offset join the index to supply the RAM address to select the proper 8 bytes. Thus, the cache holds two groups of 4096 64-bit words, with each group containing half of the 512 sets. Although not exercised in this example, the line from lower-level memory to the cache is used on a miss to load the cache. The size of address leaving the processor is 40 bits because it is a physical address and not a virtual address. Figure B.24 on page B-47 explains how the Opteron maps from virtual to physical for a cache access.
Q3: Which block should be replaced on a miss?
- A randomly chosen block? Easy to implement, but how well does it work?
- The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity. Also try other LRU approximations.

Miss rates for a 2-way set associative cache:

Size     Random   LRU
16 KB    5.7%     5.2%
64 KB    2.0%     1.9%
256 KB   1.17%    1.15%
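For 2-way associativity, true LRU needs just one bit per set. A minimal sketch (the field names are illustrative, not from the slides):

#include <stdbool.h>

typedef struct {
    unsigned tag[2];
    bool valid[2];
    bool mru;                /* which way was most recently used: 0 or 1 */
} Set2Way;

/* On a hit in `way`, mark that way most recently used. */
void touch(Set2Way *s, int way) { s->mru = (way == 1); }

/* On a miss, the LRU victim is simply the way that is not MRU. */
int victim(const Set2Way *s) { return s->mru ? 0 : 1; }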
Q4: What happens on a write?

Policy:
- Write-Through: data written to the cache block is also written to lower-level memory.
- Write-Back: write data only to the cache; update the lower level when the block falls out of the cache.

                                            Write-Through   Write-Back
Debugging                                   Easy            Hard
Do read misses produce writes?              No              Yes
Do repeated writes make it to lower level?  Yes             No

Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate").
Example: 1 KB Direct-Mapped Cache
- Assume a cache of 2^N bytes holding 2^K blocks of 2^M bytes each; N = M + K (cache size = number of blocks times block size).
- A 32-bit address then splits into a (32 - N)-bit cache tag, a K-bit cache index, and an M-bit block offset.
- The cache stores a tag, data, and a valid bit for each block:
  - The cache index selects a block in the SRAM (recall the BHT and BTB).
  - The stored block tag is compared with the input tag.
  - A word in the data block may be selected as the output.
(A lookup sketch in C follows the figure below.)
[Figure: a 1 KB direct-mapped cache with 32-byte blocks. The 32-bit address splits at bits 9 and 4 into a 22-bit cache tag (example: 0x50), a 5-bit cache index (example: 0x01), and a 5-bit block offset (example: 0x00). Each entry stores a valid bit and tag as part of the cache "state" alongside its data block (Byte 0-Byte 31; the last block holds Byte 992-Byte 1023).]
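A software sketch of a read lookup in this 1 KB direct-mapped organization (a model of the figure, not real hardware):

#include <stdbool.h>
#include <stdint.h>

enum { NBLOCKS = 32, BLOCK = 32 };  /* 1 KB = 32 blocks x 32 bytes */

typedef struct {
    bool valid[NBLOCKS];
    uint32_t tag[NBLOCKS];
    uint8_t data[NBLOCKS][BLOCK];
} DMCache;

/* Returns true on a hit and places the byte in *out. */
bool dm_read_byte(const DMCache *c, uint32_t addr, uint8_t *out) {
    uint32_t offset = addr & (BLOCK - 1);       /* bits 4..0   */
    uint32_t index = (addr / BLOCK) % NBLOCKS;  /* bits 9..5   */
    uint32_t tag = addr >> 10;                  /* bits 31..10 */
    if (c->valid[index] && c->tag[index] == tag) {  /* valid + tag compare */
        *out = c->data[index][offset];
        return true;
    }
    return false;  /* miss: the block must come from the lower level */
}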
Example: Set Associative Cache
Example: a two-way set associative cache.
- The cache index selects a set of two blocks.
- The two tags in the set are compared to the input tag in parallel.
- Data is selected based on the tag comparison.
Set associative or direct mapped? Discussed later.
[Figure: two ways, each with valid, tag, and data arrays. The cache index reads one block from each way; two comparators match the address tag against both stored tags; their outputs are ORed into the Hit signal and drive the mux selects (Sel1/Sel0) that choose the cache block.]
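The same lookup as a software sketch (the sizes are illustrative; hardware performs the two tag compares in parallel):

#include <stdbool.h>
#include <stdint.h>

enum { NSETS = 16, BLK = 32, WAYS = 2 };  /* assumed sizes */

typedef struct {
    bool valid[NSETS][WAYS];
    uint32_t tag[NSETS][WAYS];
    uint8_t data[NSETS][WAYS][BLK];
} SACache;

bool sa_read_byte(const SACache *c, uint32_t addr, uint8_t *out) {
    uint32_t offset = addr & (BLK - 1);
    uint32_t set = (addr / BLK) % NSETS;      /* cache index selects the set */
    uint32_t tag = (addr / BLK) / NSETS;
    for (int w = 0; w < WAYS; w++) {          /* the "parallel" compares */
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            *out = c->data[set][w][offset];   /* mux select = matching way */
            return true;
        }
    }
    return false;
}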
Figure B.9 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Figure B.8. The top diagram shows the actual data cache miss rates, while the bottom diagram shows the percentage in each category. (Space allows the graphs to show one extra cache size than can fit in Figure B.8.)
Memory Hierarchy Basics
Six basic cache optimizations:
1. Larger block size: reduces compulsory misses; increases capacity and conflict misses and the miss penalty.
2. Larger total cache capacity to reduce miss rate: increases hit time and power consumption.
3. Higher associativity: reduces conflict misses; increases hit time and power consumption.
4. More levels of cache: reduces overall memory access time (see the two-level AMAT sketch after this list).
5. Giving priority to read misses over writes: reduces miss penalty.
6. Avoiding address translation during cache indexing: reduces hit time.
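For optimization 4, the standard two-level relation is AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + LocalMissRate_L2 x MissPenalty_L2). A small check with made-up parameters:

#include <stdio.h>

int main(void) {
    /* Hypothetical values in cycles / fractions, chosen only for illustration. */
    double hit_l1 = 1, miss_l1 = 0.05;
    double hit_l2 = 10, local_miss_l2 = 0.40, penalty_l2 = 100;
    double amat = hit_l1 + miss_l1 * (hit_l2 + local_miss_l2 * penalty_l2);
    printf("two-level AMAT = %.2f cycles\n", amat);  /* 1 + 0.05*(10+40) = 3.50 */
    return 0;
}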
Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 [Gee et al. 1993].
Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches smaller than the sum of the two 64 KB first-level caches make little sense, as reflected in the high miss rates. After 256 KB the single cache is within 10% of the global miss rates. The miss rate of a single-level cache versus size is plotted against the local miss rate and global miss rate of a second-level cache using a 32 KB first-level cache. The L2 caches (unified) were two-way set associative with LRU replacement. Each had split L1 instruction and data caches that were 64 KB two-way set associative with LRU replacement. The block size for both L1 and L2 caches was 64 bytes. Data were collected as in Figure B.4.
Figure B.15 Relative execution time by second-level cache size. The two bars are for different clock cycles for an L2 cache hit. The reference execution time of 1.00 is for an 8192 KB second-level cache with a 1-clock-cycle latency on a second-level hit. These data were collected the same way as in Figure B.14, using a simulator to imitate the Alpha 21264.
Optimization: Prioritize Reads over Writes (Write Buffers)
- With a write buffer, write-through might be acceptable.
- The write buffer should act as a bypass for subsequent loads.
- A read miss can simply wait until the write buffer empties, or, better, check the write buffer contents and proceed when there is no conflict.
Figure B.16 Miss rate versus virtually addressed cache size of a program measured three ways: without process switches (uniprocess), with process switches using a process-identifier tag (PID), and with process switches but without PIDs (purge). PIDs increase the uniprocess absolute miss rate by 0.3% to 0.6% and save 0.6% to 4.3% over purging. Agarwal [1987] collected these statistics for the Ultrix operating system running on a VAX, assuming direct-mapped caches with a block size of 16 bytes. Note that the miss rate goes up from 128K to 256K. Such nonintuitive behavior can occur in caches because changing size changes the mapping of memory blocks onto cache blocks, which can change the conflict miss rate.
Avoid Address Translation
Figure B.17 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits.
Figure B.19 The logical program in its contiguous virtual address space is shown on the left. It consists of four pages, A, B, C, and D. The actual location of three of the blocks is in physical main memory and the other is located on the disk.
Figure B.23 The mapping of a virtual address to a physical address via a page table.
Figure B.24 Operation of the Opteron data TLB during address translation. The four steps of a TLB hit are shown as circled numbers. This TLB has 40 entries. Section B.5 describes the various protection and access fields of an Opteron page table entry.
Figure B.25 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 8 KB. The TLB is direct mapped with 256 entries. The L1 cache is a direct-mapped 8 KB, and the L2 cache is a direct-mapped 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 41 bits. The primary difference between this simple figure and a real cache is replication of pieces of this figure.
Figure B.26 The IA-32 segment descriptors are distinguished by bits in the attributes field. Base, limit, present, readable, and writable are all self-explanatory. D gives the default addressing size of the instructions: 16 bits or 32 bits. G gives the granularity of the segment limit: 0 means in bytes and 1 means in 4 KB pages. G is set to 1 when paging is turned on to set the size of the page tables. DPL means descriptor privilege level; this is checked against the code privilege level to see if the access will be allowed. Conforming says the code takes on the privilege level of the code being called rather than the privilege level of the caller; it is used for library routines. The expand-down field flips the check to let the base field be the high-water mark and the limit field be the low-water mark. As you might expect, this is used for stack segments that grow down. Word count controls the number of words copied from the current stack to the new stack on a call gate. The other two fields of the call gate descriptor, destination selector and destination offset, select the descriptor of the destination of the call and the offset into it, respectively. There are many more than these three segment descriptors in the IA-32 protection model.
Figure B.27 The mapping of an Opteron virtual address. The Opteron virtual memory implementation with four page table levels supports an effective physical address size of 40 bits. Each page table has 512 entries, so each level field is 9 bits wide. The AMD64 architecture document allows the virtual address size to grow from the current 48 bits to 64 bits, and the physical address size to grow from the current 40 bits to 52 bits.
Ten Advanced Optimizations
Optimization 1: small and simple first-level caches.
- Critical timing path: addressing the tag memory, then comparing tags, then selecting the correct set.
- Direct-mapped caches can overlap the tag compare and the transmission of data.
- Lower associativity reduces power because fewer cache lines are accessed.
L1 Size and Associativity
Access time vs. size and associativity.
[Figure: access time as a function of cache size and associativity; 64-byte blocks, 40 nm technology.]
L1 Size and Associativity
Energy per read vs. size and associativity.
[Figure: energy per read as a function of cache size and associativity.]
Way Prediction
To improve hit time, predict the way and pre-set the mux.
- A misprediction results in a longer hit time.
- Prediction accuracy: >90% for two-way, >80% for four-way; the I-cache has better accuracy than the D-cache.
- First used on the MIPS R10000 in the mid-90s; also used on the ARM Cortex-A8.
- Extended to predict the block as well ("way selection"); this increases the misprediction penalty.
Pipelining Cache
Pipeline cache access to improve bandwidth. Examples:
- Pentium: 1 cycle
- Pentium Pro through Pentium III: 2 cycles
- Pentium 4 through Core i7: 4 cycles
Pipelining increases the branch misprediction penalty but makes it easier to increase associativity.
Nonblocking Caches
- Allow hits before previous misses complete: "hit under miss" and "hit under multiple miss".
- The L2 cache must support this.
- In general, processors can hide an L1 miss penalty but not an L2 miss penalty.
Figure 2.5 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left) and 9 SPECFP (on the right) benchmarks. The data memory system, modeled after the Intel i7, consists of a 32 KB L1 cache with a four-cycle access latency. The L2 cache (shared with instructions) is 256 KB with a 10-clock-cycle access latency. The L3 is 2 MB with a 36-cycle access latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for the integer benchmarks and 12.5% for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in little additional improvement.
Multibanked Caches
Organize the cache as independent banks to support simultaneous access:
- The ARM Cortex-A8 supports 1-4 banks for L2.
- The Intel i7 supports 4 banks for L1 and 8 banks for L2.
Interleave the banks according to the block address.
Critical Word First, Early Restart
- Critical word first: request the missed word from memory first, and send it to the processor as soon as it arrives.
- Early restart: request the words in normal order, but send the missed word to the processor as soon as it arrives.
- The effectiveness of these strategies depends on the block size and on the likelihood of another access to the portion of the block that has not yet been fetched.
Merging Write Buffer
- When storing to a block that is already pending in the write buffer, update the existing write buffer entry rather than allocating a new one.
- Reduces stalls due to a full write buffer.
- Do not apply this to I/O addresses.
[Figure: write buffer contents without merging vs. with merging; merging packs sequential word writes into one block-wide entry. A toy sketch follows.]
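A sketch of the merging idea; the entry layout and sizes are assumptions for illustration, not the hardware's:

#include <stdbool.h>
#include <stdint.h>

enum { ENTRIES = 4, WORDS = 8 };  /* assumed: 4 entries, 8 words (64 bytes) each */

typedef struct {
    bool valid;
    uint64_t block_addr;          /* which 64-byte block this entry covers */
    uint8_t word_valid;           /* bitmask: which words hold pending data */
    uint64_t word[WORDS];
} WBEntry;

/* Merge a word-sized store into a pending entry if possible, else allocate.
   Returns false when the buffer is full (the store would stall). */
bool wb_store(WBEntry wb[ENTRIES], uint64_t addr, uint64_t data) {
    uint64_t blk = addr / 64;
    unsigned w = (unsigned)((addr / 8) % WORDS);
    for (int i = 0; i < ENTRIES; i++)
        if (wb[i].valid && wb[i].block_addr == blk) {  /* merge into same block */
            wb[i].word[w] = data;
            wb[i].word_valid |= (uint8_t)(1u << w);
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)
        if (!wb[i].valid) {                            /* allocate a new entry */
            wb[i] = (WBEntry){ .valid = true, .block_addr = blk,
                               .word_valid = (uint8_t)(1u << w) };
            wb[i].word[w] = data;
            return true;
        }
    return false;                                      /* buffer full: stall */
}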
Loop Interchange
/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];
/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

After interchange, the inner loop walks consecutive elements of each row, so the accesses march sequentially through memory and exploit spatial locality.
Blocking
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
A snapshot of the three arrays x, y, and z when N = 6 and i = 1. The age of accesses to the array elements is indicated by shade: white means not yet touched, light means older accesses, and dark means newer accesses. Compared to Figure 2.9, elements of y and z are read repeatedly to calculate new elements of x. The variables i, j, and k are shown along the rows or columns used to access the arrays.
/* After: operate on B x B blocks so the working set fits in the cache */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
The age of accesses to the arrays x, y, and z when B = 3. Note that, in contrast to Figure 2.8, a smaller number of elements is accessed.
Real Example: Alpha 21264 Caches
- I-cache: 64 KB, 2-way set associative instruction cache.
- D-cache: 64 KB, 2-way set associative data cache.
Alpha 21264 Data Cache
- D-cache: 64 KB, 2-way set associative.
- Uses the 48-bit virtual address to index the cache, and a tag drawn from the physical address (the 48-bit virtual address maps to a 44-bit physical address).
- 512 sets (9-bit block index), 1024 blocks.
- Cache block size of 64 bytes (6-bit offset).
- The tag has 44 - (9 + 6) = 29 bits.
- Write-back and write-allocate.
(We will study virtual-to-physical address translation later.)
Cache Performance
Three first-order factors: hit time, miss penalty, and miss rate.
AMAT (average memory access time) is a simple metric that balances the three factors:
AMAT = Hit time + Miss rate x Miss penalty
Example: hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 4%. AMAT = 1 + 0.04 x 100 = 5 cycles.
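The example as a two-line computation:

#include <stdio.h>

int main(void) {
    double hit_time = 1, miss_rate = 0.04, miss_penalty = 100;  /* slide's values */
    printf("AMAT = %.1f cycles\n", hit_time + miss_rate * miss_penalty);  /* 5.0 */
    return 0;
}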
Cache Performance
Cache impact on CPI:
CPU time = IC x (CPI_execution + Memory stall cycles per instruction) x Cycle time
or equivalently, CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time
- Note: cycles spent on cache hits are usually counted as part of the execution cycles.
- A better AMAT does not always mean better performance.
Example: Evaluating Split Inst/Data Caches
Unified vs. split instruction/data caches (Harvard architecture).
Example: assume 36% of instructions are data operations, so 74% of accesses (1.0/1.36) are instruction fetches.
- 16 KB split I and D caches: instruction miss rate = 0.4%, data miss rate = 11.4%, overall 3.24%.
- 32 KB unified cache: aggregate miss rate = 3.18%.
Which design is better? Assume hit time = 1 cycle and miss penalty = 100 cycles, and note that a data hit adds one stall cycle in the unified cache (it has only one port):
AMAT_Harvard = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24
AMAT_Unified = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44
[Figure: split organization (processor with separate L1 I- and D-caches over a unified L2) vs. unified organization (processor with a single unified L1 over a unified L2).]
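The comparison as a quick computation (values from the slide; with these rounded percentages the split-cache figure comes out 4.26, close to the slide's 4.24):

#include <stdio.h>

int main(void) {
    double f_inst = 0.74, f_data = 0.26, penalty = 100;
    double amat_split = f_inst * (1 + 0.004 * penalty)
                      + f_data * (1 + 0.114 * penalty);
    double amat_unified = f_inst * (1 + 0.0318 * penalty)
                        + f_data * (1 + 1 + 0.0318 * penalty);  /* +1: single port */
    printf("AMAT split = %.2f, unified = %.2f\n", amat_split, amat_unified);
    return 0;
}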
Disadvantage of Set Associative Cache
Compare an n-way set associative cache with a direct-mapped cache:
- n comparators vs. one comparator.
- Extra MUX delay for the data.
- The data arrives only after the hit/miss decision and way selection.
In a direct-mapped cache, the cache block is available before the hit/miss decision:
- Use the data assuming the access is a hit, and recover if it turns out to be a miss.
[Figure: the two-way set associative datapath again, with valid/tag/data arrays per way, parallel tag comparators ORed into the Hit signal, and a mux (Sel1/Sel0) selecting the cache block.]
Cache and Out-of-Order Processors
As for performance:
- It is very difficult to define the miss penalty: execution partially overlaps with memory access, and memory accesses overlap with one another.
- Cache hit time may increase without increasing the cycle time (multi-cycle hit latency).
- Practical solution: run simulation. AMAT is no longer a good indicator.
Cache and Out-of-Order Processors
As for design:
- Use a nonblocking cache to avoid a full stall.
- Wake up the load instruction upon the return of data.
- The scheduler may tolerate L1 misses, but not misses to DRAM.
- Allow overlapping of memory accesses.
- Cache hit-or-miss uncertainty complicates instruction scheduling: a mini-flush (replay) avoids a high penalty.