![Page 1: Cache (Memory) Performance Optimizationdspace.mit.edu/.../6B1084F7-D4A7-449F-8E2A-E1B47ACA72E9/0/lecture08.pdf · Memory (DRAM) Performance • Upon a cache miss – 4 clocks to send](https://reader034.vdocuments.site/reader034/viewer/2022042023/5e7a849d54c9bc51ab43c643/html5/thumbnails/1.jpg)
Cache (Memory) Performance Optimization
Page 1
Improving Cache Performance
Average memory access time = Hit time + Miss rate x Miss penalty
To improve performance:
• reduce the miss rate (e.g., larger cache)
• reduce the miss penalty (e.g., L2 cache)
• reduce the hit time
The simplest design strategy is to design the largest primary cache without slowing down the clock or adding pipeline stages
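As a rough illustration of the access-time formula above, the following C sketch computes the average memory access time for one and two cache levels. The function names and the example numbers are hypothetical and are not taken from the lecture.

```c
/* Average memory access time, in the same unit as hit_time and miss_penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* With an L2 cache, the L1 miss penalty is itself the access time of L2.
 * l2_local_miss_rate is the fraction of L2 accesses that miss (assumption). */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_local_miss_rate,
                      double mem_penalty)
{
    double l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, mem_penalty);
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty);
}
```

For example, with a 1-clock hit time, a 25% miss rate, and a 20-clock miss penalty, `amat(1.0, 0.25, 20.0)` evaluates to 6 clocks. Each of the three terms corresponds to one of the improvement directions listed on the slide.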
Page 2
RAM-tag 2-way Set-Associative Cache
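[Figure: the address is split into a t-bit tag and a k-bit index. The index selects a set; each of the two ways stores a tag and a data word, the stored tags are compared (=) with the address tag, and a MUX selects the data word.]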
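The lookup in a 2-way set-associative cache can be sketched in software as follows. This is a minimal model under stated assumptions: word addressing with no block offset, k = 4 index bits, and hypothetical names (`struct way`, `struct cache`, `lookup`). Hardware compares both ways in parallel; the loop here only models the result.

```c
#include <stdbool.h>
#include <stdint.h>

#define INDEX_BITS 4                   /* k: hypothetical value */
#define NUM_SETS   (1u << INDEX_BITS)  /* 2^k sets */
#define WAYS       2

struct way {
    bool     valid;
    uint32_t tag;
    uint32_t data;   /* one data word per entry in this sketch */
};

struct cache {
    struct way set[NUM_SETS][WAYS];
};

/* Low k bits of the address index the set; the remaining bits form the tag. */
bool lookup(const struct cache *c, uint32_t addr, uint32_t *out)
{
    uint32_t index = addr & (NUM_SETS - 1);
    uint32_t tag   = addr >> INDEX_BITS;

    for (int w = 0; w < WAYS; w++) {
        const struct way *e = &c->set[index][w];
        if (e->valid && e->tag == tag) {   /* tag comparator for way w */
            *out = e->data;                /* MUX selects the matching way */
            return true;
        }
    }
    return false;                          /* miss in both ways */
}
```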
Page 3
CAM-tag 2-way Set-Associative Cache
[Figure: the address is split into a t-bit tag and a k-bit index. The index drives a demux; each entry holds a tag with its own comparator (=) and a data word.]
Page 4
Causes for Cache Misses
• Compulsory: first reference to a block, a.k.a. cold-start misses
  – misses that would occur even with an infinite cache
• Capacity: cache is too small to hold all data needed by the program
  – misses that would occur even under a perfect placement & replacement policy
• Conflict: misses that occur because of collisions due to the block-placement strategy
  – misses that would not occur with full associativity
Page 5
Block-level Optimizations
• Tags are too large, i.e., too much overhead
  – Simple solution: larger blocks, but the miss penalty could be large
• Sub-block placement
  – A valid bit is added to units smaller than the full block, called sub-blocks
  – Only read a sub-block on a miss
  – If a tag matches, is the word in the cache?
[Figure: three cache lines with tags 100, 300, and 204, each followed by per-sub-block valid bits.]
Main reason for subblock placement is to reduce tag overhead.
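A minimal sketch of the sub-block check follows, assuming four sub-blocks per line and hypothetical names (`struct line`, `subblock_hit`, `subblock_fill`). It shows why a tag match alone is not enough: the valid bit of the requested sub-block must also be set.

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS_PER_LINE 4   /* hypothetical; the slide does not fix this */

struct line {
    uint32_t tag;
    uint8_t  valid;   /* bit s set => sub-block s holds valid data */
};

/* Hit only if the tag matches AND the requested sub-block is valid. */
bool subblock_hit(const struct line *ln, uint32_t tag, unsigned sub)
{
    return sub < SUBBLOCKS_PER_LINE
        && ln->tag == tag
        && ((ln->valid >> sub) & 1u) != 0;
}

/* On a miss, fetch only the requested sub-block.
 * A tag change invalidates every other sub-block of the line. */
void subblock_fill(struct line *ln, uint32_t tag, unsigned sub)
{
    if (sub >= SUBBLOCKS_PER_LINE)
        return;
    if (ln->tag != tag) {
        ln->tag = tag;
        ln->valid = 0;
    }
    ln->valid |= (uint8_t)(1u << sub);
}
```

One tag now covers several independently filled units, which is how the scheme trades a few valid bits for fewer tags.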
Page 6
Write Alternatives
- Writes take two cycles in the memory stage: one cycle for the tag check plus one cycle for the data write on a hit
- Design a data RAM that can perform a read and a write in one cycle, and restore the old value after a tag miss
- Hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store’s tag check
  - Need to bypass from the write buffer if a read matches the write buffer tag
Page 7
Pipelining Cache Writes (Alpha 21064)
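[Figure: the address is split into a t-bit tag, a k-bit index, and a b-bit block offset. Separate tag (with valid bit V) and data arrays hold 2^k lines; the tag comparison (=) produces HIT; a write enable (WE) and a bypass path are shown on the data word or byte.]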
Writes usually take longer than reads because the tags have to be checked before writing the data. First, tags and data are split, so they can be addressed independently.
Page 8
Prefetching
• Speculate on future instruction and data accesses and fetch them into cache(s)
  – Instruction accesses easier to predict than data accesses
• Varieties of prefetching
  – Hardware prefetching
  – Software prefetching
  – Mixed schemes
• What types of misses does prefetching affect?
Reduces compulsory misses, can increase conflict and capacity misses.
Page 9
Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not late and not too early
• Cache and bandwidth pollution
[Figure: CPU and register file (RF) with L1 instruction and L1 data caches backed by a unified L2 cache; one path is labeled "Prefetched data".]
Page 10
Hardware Prefetching of Instructions
• Instruction prefetch in Alpha AXP 21064
  – Fetch two blocks on a miss: the requested block and the next consecutive block
  – Requested block placed in cache, and next block in instruction stream buffer
[Figure: CPU and register file (RF), L1 instruction cache, stream buffer (4 blocks), and unified L2 cache. The requested block goes to the L1 instruction cache; the prefetched instruction block goes to the stream buffer.]
Need to check whether the requested block is in the stream buffer. Never more than one 32-byte block in the stream buffer.
Page 11
Hardware Data Prefetching
• One Block Lookahead (OBL) scheme
  – Initiate prefetch for block b + 1 when block b is accessed
  – Why is this different from doubling block size?
• Prefetch-on-miss:
  – Prefetch b + 1 upon miss on b
• Tagged prefetch:
  – Tag bit for each memory block
  – Tag bit = 0 signifies that a block is demand-fetched or that a prefetched block is referenced for the first time
  – Prefetch for b + 1 initiated only if tag bit = 0 on b
HP PA 7200 uses OBL prefetching.
Tagged prefetching is twice as effective as prefetch-on-miss in reducing miss rates.
Page 12
Comparison of Variant OBL Schemes
Prefetch-on-miss accessing contiguous blocks:
  Demand-fetched | Prefetched
  Demand-fetched | Prefetched
  Demand-fetched | Prefetched | Demand-fetched | Prefetched
Tagged prefetch accessing contiguous blocks (tag bit shown before each block):
  0 Demand-fetched | 1 Prefetched | 0 | 0 | 0
  0 Demand-fetched | 0 Prefetched | 1 Prefetched | 0 | 0
  0 Demand-fetched | 0 Prefetched | 0 Prefetched | 1 Prefetched | 0
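The difference between the two OBL variants on a purely sequential walk can be sketched with a toy model. The assumptions are loud ones: the cache is unbounded, prefetches arrive instantly, and the names `misses_sequential` and `MAX_BLOCKS` are hypothetical. Following the slide's convention, a tag bit of 1 marks a prefetched block that has not yet been referenced.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 1024   /* hypothetical bound for this toy model */

/* Count demand misses when blocks 0..n-1 are accessed in order.
 * tagged == false: prefetch-on-miss; tagged == true: tagged prefetch.
 * Returns -1 if n is out of range. */
int misses_sequential(int n, bool tagged)
{
    bool present[MAX_BLOCKS + 1];
    bool tag[MAX_BLOCKS + 1];   /* true = prefetched, not yet referenced */
    int misses = 0;

    if (n < 0 || n > MAX_BLOCKS)
        return -1;
    memset(present, 0, sizeof present);
    memset(tag, 0, sizeof tag);

    for (int b = 0; b < n; b++) {
        bool prefetch_next = false;

        if (!present[b]) {               /* demand fetch: tag bit = 0 */
            misses++;
            present[b] = true;
            tag[b] = false;
            prefetch_next = true;        /* both schemes prefetch on a miss */
        } else if (tagged && tag[b]) {   /* first reference to a prefetched block */
            tag[b] = false;
            prefetch_next = true;
        }

        if (prefetch_next && !present[b + 1]) {
            present[b + 1] = true;
            tag[b + 1] = true;
        }
    }
    return misses;
}
```

In this idealized model a walk over 8 contiguous blocks misses on every other block with prefetch-on-miss and only on the first block with tagged prefetch. Real workloads are not purely sequential, so the measured benefit is smaller than this extreme case.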
Page 13
Software Prefetching
for (i = 0; i < N; i++) {
    fetch(&a[i + 1]);
    fetch(&b[i + 1]);
    SUM = SUM + a[i] * b[i];
}
• What property do we require of the cache for prefetching to work?
Cache should be non-blocking or lockup-free. By that we mean that the processor can proceed while the prefetched data is being fetched, and the caches continue to supply instructions and data while waiting for the prefetched data to return.
Page 14
Software Prefetching Issues
• Timing is the biggest issue, not predictability
  – If you prefetch very close to when the data is required, you might be too late
  – Prefetch too early, cause pollution
  – Estimate how long it will take for the data to come into L1, so we can set P appropriately
  – Why is this hard to do?
for (i = 0; i < N; i++) {
    fetch(&a[i + P]);
    fetch(&b[i + P]);
    SUM = SUM + a[i] * b[i];
}
Don’t ignore cost of prefetch instructions.
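A compilable variant of the loop above might look like the sketch below. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` builtin stands in for the slide's `fetch`; the `PREFETCH` macro and the function name `dot_prefetch` are hypothetical. On other compilers the macro degrades to a no-op.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))   /* no-op fallback on other compilers */
#endif

/* Dot product that requests data P elements ahead of its use. */
double dot_prefetch(const double *a, const double *b, size_t n, size_t P)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + P < n) {          /* never form an address past the arrays */
            PREFETCH(&a[i + P]);
            PREFETCH(&b[i + P]);
        }
        sum += a[i] * b[i];
    }
    return sum;
}
```

The bounds check and the two prefetch instructions per iteration are exactly the overhead the slide warns about. Since one cache block holds several elements, a tuned version would typically issue one prefetch per block rather than per element.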
Page 15
Compiler Optimizations
• Restructuring code affects the data block access sequence
  – Group data accesses together to improve spatial locality
  – Reorder data accesses to improve temporal locality
• Prevent data from entering the cache
  – Useful for variables that are only accessed once
• Kill data that will never be used
  – Streaming data exploits spatial locality but not temporal locality
Miss-rate reduction without any hardware changes. Hardware designer’s favorite solution.
Page 16
Loop Interchange
for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++) {
        x[i][j] = 2 * x[i][j];
    }
}

for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        x[i][j] = 2 * x[i][j];
    }
}
What type of locality does this improve?
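The two loop orders can be wrapped as functions to confirm that they compute the same result and differ only in access order. The sizes and function names below are hypothetical. Because C stores arrays in row-major order, the first version strides N elements between consecutive accesses, while the interchanged version has stride 1.

```c
#define M 4   /* rows; hypothetical small size for illustration */
#define N 3   /* columns */

/* Original order: the inner loop walks down a column (stride of N elements). */
void scale_column_order(int x[M][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];
}

/* Interchanged order: the inner loop walks along a row (stride of 1 element). */
void scale_row_order(int x[M][N])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}
```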
Page 17
Loop Fusion
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        a[i][j] = b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        d[i][j] = a[i][j] * c[i][j];

for (i = 0; i < N; i++)
    for (j = 0; j < M; j++) {
        a[i][j] = b[i][j] * c[i][j];
        d[i][j] = a[i][j] * c[i][j];
    }
What type of locality does this improve?
Page 18
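Blocking
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }
[Figure: access patterns in x (indexed by i, j), y (indexed by i, k), and z (indexed by k, j), with elements marked as not touched, old access, or new access.]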
Page 19
Blocking
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj + B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
[Figure: access patterns in x (indexed by i, j), y (indexed by i, k), and z (indexed by k, j) for the blocked loop.]
What type of locality does this improve?
y benefits from spatial locality; z benefits from temporal locality.
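A self-contained version of the blocked loop is sketched below next to the unblocked one, so the two can be compared. The sizes N = 6 and B = 4 and the function names are hypothetical; B is deliberately not a divisor of N so that the `min` bounds matter. One detail the slide leaves implicit is that x must start at zero, because each (jj, kk) block only adds a partial sum.

```c
#define N 6   /* matrix dimension; hypothetical */
#define B 4   /* blocking factor; deliberately not a divisor of N */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unblocked reference version: x = y * z. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version: works on B x B sub-blocks of z at a time. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    /* Each (jj, kk) pass accumulates into x, so clear it first. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0.0;

    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

In practice B is chosen so that the B x B block of z, plus the touched parts of y and x, fit in the cache together.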
Page 20
CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline datapath showing the PC (incremented by 0x4, gated by PCen), the primary instruction cache (addr, inst, hit?), the IR with a nop input, decode and register fetch, the ALU, and the primary data cache (addr, wdata, rdata, we, hit?). Both caches connect to memory control and receive cache refill data from lower levels of the memory hierarchy.]
Stall entire CPU on data cache miss.
Instruction miss?
Page 21
Memory (DRAM) Performance
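• Upon a cache miss
  – 4 clocks to send the address
  – 24 clocks for the access time per word
  – 4 clocks to send a word of data
• Latency worsens with increasing block size
1 Gb DRAM: 50-100 ns access time, needs refreshing
A four-word block needs 128 or 116 clocks: 4 x (4 + 24 + 4) = 128 for a dumb memory that repeats the whole sequence for every word, or 4 + 4 x 24 + 4 x 4 = 116 if the address is sent only once.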
Page 22
Memory organizations
[Figure: three memory organizations, each showing CPU, cache, bus, and memory: (1) one-word-wide memory, (2) wide memory, (3) interleaved memory with bank 0 and bank 1.]
Alpha AXP 21064: 256-bit-wide memory and cache.
Page 23
Interleaved Memory
• Banks are often 1 word wide
• Send an address to all the banks
• How long to get 4 words back?
[Figure: CPU, cache, and a single bus connected to banks B0, B1, B2, and B3.]
Bank i contains all words whose address modulo N is i.
4 + 24 + 4 x 4 = 44 clocks from interleaved memory.
Page 24
Independent Memory
• Send an address to all the banks
• How long to get 4 words back?
Bank i contains all words whose address modulo N is i.
[Figure: CPU and non-blocking cache connected to banks B0, B1, B2, and B3, each over its own bus.]
4 + 24 + 4 = 32 clocks from main memory for 4 words.
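The clock counts quoted for the different memory organizations can be reproduced with a few one-line functions. The timing constants come from the slides (4 clocks for the address, 24 per access, 4 per word transferred); the function names are hypothetical, and the reading of the 116-clock case as "address sent once, accesses not overlapped" is an assumption that happens to match the arithmetic.

```c
enum { ADDR_CLOCKS = 4, ACCESS_CLOCKS = 24, XFER_CLOCKS = 4 };

/* One-word-wide "dumb" memory: the full sequence repeats for every word. */
int clocks_dumb(int words)
{
    return words * (ADDR_CLOCKS + ACCESS_CLOCKS + XFER_CLOCKS);
}

/* Assumed variant: address sent once, accesses and transfers still serial. */
int clocks_addr_once(int words)
{
    return ADDR_CLOCKS + words * ACCESS_CLOCKS + words * XFER_CLOCKS;
}

/* Interleaved banks on one bus: accesses overlap, transfers are serial. */
int clocks_interleaved(int words)
{
    return ADDR_CLOCKS + ACCESS_CLOCKS + words * XFER_CLOCKS;
}

/* Independent banks, each with its own bus: transfers overlap too.
 * Assumes there are at least as many banks and buses as words. */
int clocks_independent(int words)
{
    (void)words;
    return ADDR_CLOCKS + ACCESS_CLOCKS + XFER_CLOCKS;
}
```

For a four-word block these return 128, 116, 44, and 32 clocks respectively, matching the figures given in the slides.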
Page 25
Memory Bank Conflicts
Consider a 128-bank memory in the NEC SX/3 where each bank can service independent requests.
int x[256][512];
for (j = 0; j < 512; j++)
    for (i = 0; i < 256; i++)
        x[i][j] = 2 * x[i][j];
Consider column k: x[i][k], 0 <= i < 256
Address of elements in column is i*512 + k
Where do these addresses go?
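The question can be checked numerically. The sketch below assumes word addresses equal to the element index i*512 + k and low-order interleaving (bank = address mod number of banks); the function names are hypothetical. Because 512 is a multiple of 128, the term i*512 vanishes modulo 128 and every element of column k lands in bank k mod 128.

```c
enum { ROWS = 256, COLS = 512 };

/* Low-order interleaving: the bank is the address modulo the bank count. */
unsigned bank_of(unsigned addr, unsigned num_banks)
{
    return addr % num_banks;
}

/* Number of distinct banks touched when walking down column k of x[256][512].
 * Returns 0 if num_banks is 0 or larger than this sketch supports. */
unsigned banks_touched_by_column(unsigned k, unsigned num_banks)
{
    unsigned char seen[1024] = {0};   /* supports up to 1024 banks */
    unsigned count = 0;

    if (num_banks == 0 || num_banks > 1024)
        return 0;
    for (unsigned i = 0; i < ROWS; i++) {
        unsigned bank = bank_of(i * COLS + k, num_banks);
        if (!seen[bank]) {
            seen[bank] = 1;
            count++;
        }
    }
    return count;
}
```

With 128 banks, `banks_touched_by_column` returns 1 for any column: all 256 accesses in the inner loop queue up on a single bank while the other 127 sit idle.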
Page 26
Bank Assignment
[Figure: banks B0, B1, B2, and B3 holding consecutive addresses 0, 1, 2, 3 in each row; the vertical axis is the address within the bank.]
Bank number = Address MOD NumBanks
Address within bank = ⌊Address / NumBanks⌋
Should randomize address to bank mapping.
Could use odd/prime number of banks. Problem?
Bank number for i*512 + k should depend on i.
Can pick more significant bits in address.
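The two remedies mentioned above can be compared on the column walk over `x[256][512]`. This is a sketch under assumptions: word addresses equal to i*512 + k, 127 as the example prime bank count, and a hypothetical `bank_high_bits` mapping that takes the bank number from the address bits above the column index. None of these mappings is claimed to be what the NEC SX/3 uses.

```c
enum { ROWS = 256, COLS = 512 };

typedef unsigned (*bank_fn)(unsigned addr, unsigned num_banks);

/* Plain modulo mapping from the slide. */
unsigned bank_mod(unsigned addr, unsigned num_banks)
{
    return addr % num_banks;
}

/* Hypothetical mapping using more significant bits:
 * addr / COLS is the row index i, so the bank now depends on i. */
unsigned bank_high_bits(unsigned addr, unsigned num_banks)
{
    return (addr / COLS) % num_banks;
}

/* Distinct banks touched when walking down column k under a given mapping.
 * Returns 0 if num_banks is 0 or larger than this sketch supports. */
unsigned column_banks(bank_fn map, unsigned k, unsigned num_banks)
{
    unsigned char seen[1024] = {0};   /* supports up to 1024 banks */
    unsigned count = 0;

    if (num_banks == 0 || num_banks > 1024)
        return 0;
    for (unsigned i = 0; i < ROWS; i++) {
        unsigned bank = map(i * COLS + k, num_banks);
        if (!seen[bank]) {
            seen[bank] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions the column walk touches 1 bank with 128 banks and plain modulo, all 127 banks with a prime count of 127, and all 128 banks with the high-bits mapping. The high-bits mapping has its own cost: every element of a single row now maps to the same bank, so it only moves the conflict to row-order walks.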
Page 27