CENG 450 Computer Systems and Architecture
Lecture 15
Amirali Baniasadi, [email protected]
Announcements
Last Quiz scheduled for March 31st.
Cache Write Policy: Write Through versus Write Back
Cache reads are much easier to handle than cache writes: an instruction cache is much easier to design than a data cache.
Cache write: how do we keep the data in the cache and in memory consistent?
Two options (sketched in code below):
- Write Back: write to the cache only; write the cache block to memory when that block is replaced on a cache miss.
  - Needs a "dirty" bit for each cache block
  - Greatly reduces the memory bandwidth requirement
  - Control can be complex
- Write Through: write to the cache and to memory at the same time.
  - Isn't memory too slow for this?
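As a rough illustration, a minimal sketch in C of how the two policies differ on a store. The cache_line_t layout, the 64-byte block size, and the memory_write_block helper are hypothetical, not from the slides.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;            /* used only by write back */
    uint8_t  data[64];         /* one cache block (hypothetical size) */
} cache_line_t;

/* hypothetical stand-in for a DRAM write of one block */
static void memory_write_block(uint32_t tag, const uint8_t *data) {
    (void)tag; (void)data;
}

/* Write through: update the cache line AND memory on every store,
   so memory is always consistent (but every store costs a DRAM write). */
static void store_write_through(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    memory_write_block(line->tag, line->data);
}

/* Write back: update the cache line only and set the dirty bit;
   memory is written once, when the block is replaced on a miss. */
static void store_write_back(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}

static void evict(cache_line_t *line) {
    if (line->valid && line->dirty)      /* the dirty bit decides */
        memory_write_block(line->tag, line->data);
    line->valid = false;
    line->dirty = false;
}

In the write-through path memory stays consistent on every store; in the write-back path consistency is restored only at eviction, which is exactly why the dirty bit is needed.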
Write Buffer for Write Through
A write buffer is needed between the cache and memory (see the diagram and the FIFO sketch below):
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
- Store frequency (w.r.t. time) -> 1 / DRAM write cycle
- Write buffer saturation
[Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM]
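A minimal sketch of that FIFO behaviour, assuming a hypothetical 4-entry buffer of address/data pairs; real write buffers also merge writes to the same block and are probed on read misses, which this sketch omits.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* typical depth from the slide */

typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;

typedef struct {
    wb_entry_t entry[WB_ENTRIES];
    int head, tail, count;       /* simple circular FIFO */
} write_buffer_t;

/* Processor side: returns false when the buffer is full,
   i.e. the processor must stall (saturation). */
bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES)
        return false;
    wb->entry[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
bool wb_pop(write_buffer_t *wb, wb_entry_t *out) {
    if (wb->count == 0)
        return false;
    *out = wb->entry[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}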
Write Buffer Saturation
Store frequency (w.r.t. time) -> 1 / DRAM write cycle.
If this condition exists for a long period of time (CPU cycle time too short and/or too many store instructions in a row):
- The store buffer will overflow no matter how big you make it
- The CPU cycle time <= DRAM write cycle time
Solutions for write buffer saturation:
- Use a write back cache
- Install a second-level (L2) cache:
[Diagram: Processor -> Cache -> Write Buffer -> DRAM, versus Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM]
Improving Cache Performance
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
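For a quick sense of scale (hypothetical numbers, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty,

AMAT = 1 + 0.05 × 40 = 3 cycles

so even a 5% miss rate triples the average memory access time.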
1. Reduce Misses via Larger Block Size

[Figure: miss rate (0%–25%) vs. block size (16, 32, 64, 128, 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K]
2. Reduce Misses via Higher Associativity
2:1 Cache Rule: the miss rate of a direct-mapped cache of size N is about the same as the miss rate of a 2-way set-associative cache of size N/2.
Beware: execution time is the only final measure! Will the clock cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2% (see the worked numbers below)
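To see why execution time is the final measure, take some hypothetical numbers: a direct-mapped cache with a 1.00 ns hit time and 5.0% miss rate, a 2-way cache with a 1.02 ns hit time (the internal +2%) and 4.0% miss rate, and a 20 ns miss penalty:

AMAT(direct mapped) = 1.00 + 0.050 × 20 = 2.00 ns
AMAT(2-way)         = 1.02 + 0.040 × 20 = 1.82 ns

Here 2-way wins on AMAT, but if the slower hit also stretches the clock cycle of every instruction, the direct-mapped design can still give lower overall execution time.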
3. Reducing Misses via a “Victim Cache”
How do we combine the fast hit time of direct mapped and still avoid conflict misses?
Add a small buffer to hold data discarded from the cache (sketched in code below).
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache.
Used in Alpha and HP machines.
[Diagram: a fully associative victim cache of four entries, each holding one cache line of data with its own tag and comparator, placed between the cache and the next lower level in the hierarchy]
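A minimal sketch of the lookup path, assuming a hypothetical 4-entry fully associative victim cache searched on a direct-mapped miss; a real design also swaps the victim line with the main-cache line on a victim hit, which is only hinted at here.

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4

typedef struct { uint32_t tag; bool valid; /* data omitted */ } vc_line_t;

typedef struct { vc_line_t line[VC_ENTRIES]; } victim_cache_t;

/* Hardware searches all entries in parallel (one comparator per line);
   this C sketch just loops over them. */
bool victim_cache_lookup(victim_cache_t *vc, uint32_t block_tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc->line[i].valid && vc->line[i].tag == block_tag)
            return true;   /* slow hit: swap with the direct-mapped line */
    return false;          /* real miss: go to the next lower level */
}

/* On a direct-mapped replacement, the evicted (victim) block is
   placed into the victim cache instead of being discarded. */
void victim_cache_insert(victim_cache_t *vc, uint32_t victim_tag) {
    static int next = 0;                 /* simple FIFO replacement */
    vc->line[next] = (vc_line_t){ victim_tag, true };
    next = (next + 1) % VC_ENTRIES;
}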
4. Reducing Misses via “Pseudo-Associativity”
How do we combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache?
Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit). A sketch of the lookup follows below.
Drawback: it is difficult to build a CPU pipeline if a hit may take either 1 or 2 cycles.
- Better for caches not tied directly to the processor (L2)
- Used in the MIPS R10000 L2 cache; similar in UltraSPARC
[Diagram: access time line showing Hit Time < Pseudo Hit Time < Miss Penalty]
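A common way to "check the other half" is to flip the most significant index bit on a miss; the sketch below assumes that convention and a hypothetical 1024-set cache.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 1024          /* hypothetical direct-mapped cache */

typedef struct { uint32_t tag; bool valid; } pa_line_t;

pa_line_t cache[NUM_SETS];

/* Returns 1 for a fast hit, 2 for a slow pseudo-hit, 0 for a miss.
   The pseudo-set is the primary set with its top index bit inverted. */
int pseudo_assoc_lookup(uint32_t tag, uint32_t index) {
    if (cache[index].valid && cache[index].tag == tag)
        return 1;                              /* normal 1-cycle hit */

    uint32_t other = index ^ (NUM_SETS >> 1);  /* flip MSB of index */
    if (cache[other].valid && cache[other].tag == tag)
        return 2;                              /* pseudo-hit: extra cycle */

    return 0;                                  /* miss: fetch from below */
}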
5. Reducing Misses by Compiler Optimizations
Instructions: not discussed here.
Data:
- Merging Arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays
- Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
[Diagram: layout of x[i][j] along increasing memory addresses; after interchange, the inner j loop walks consecutive addresses]
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a & c vs. one miss per access; the fused loop reuses a[i][j] and c[i][j] while they are still in the cache, improving temporal locality.
Summary of Compiler Optimizations (by hand)

[Figure: performance improvement factors (1× to 3×) from merged arrays, loop interchange, loop fusion, and blocking on the benchmarks compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]
Summary: Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by Compiler Optimizations

CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU:
- Early Restart—as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical Word First—request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first. (A sketch of the wrapped fill order follows below.)
Generally useful only when the cache line is larger than the bus width.
Spatial locality is a problem: we tend to want the next sequential word anyway, so it is not clear whether early restart helps.
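As an illustration of wrapped fetch, a small sketch that computes the order in which the words of a block return from memory, assuming a hypothetical 8-word block:

#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* hypothetical block size */

/* Critical word first: start the fill at the missed word and wrap
   around, so the missed word arrives in slot 0 of the transfer. */
void fill_order(int missed, int order[WORDS_PER_BLOCK]) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        order[i] = (missed + i) % WORDS_PER_BLOCK;
}

int main(void) {
    int order[WORDS_PER_BLOCK];
    fill_order(5, order);          /* CPU missed on word 5 */
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", order[i]);   /* prints: 5 6 7 0 1 2 3 4 */
    printf("\n");
    return 0;
}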
2. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses
A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss.
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses (see the bookkeeping sketch below).
  - The Pentium Pro allows 4 outstanding memory misses.
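Outstanding misses are commonly tracked in miss status holding registers (MSHRs); the sketch below assumes a hypothetical 4-entry MSHR file, matching the Pentium Pro's 4 outstanding misses.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4   /* matches the Pentium Pro's 4 outstanding misses */

typedef struct { uint32_t block_addr; bool busy; } mshr_t;

mshr_t mshr[NUM_MSHRS];

/* On a miss: try to record it so the cache can keep serving hits.
   Returns false when all MSHRs are busy, i.e. the cache must block. */
bool mshr_allocate(uint32_t block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return true;    /* miss to an in-flight block: merge it */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshr[i].busy) {
            mshr[i] = (mshr_t){ block_addr, true };
            return true;    /* new outstanding miss recorded */
        }
    return false;           /* structural stall: no free MSHR */
}

/* When memory returns the block, free its MSHR entry. */
void mshr_release(uint32_t block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            mshr[i].busy = false;
}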
3: Use a multi-level cache
L2 Equations:

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

Definitions:
- Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
Global miss rate is what matters (worked example below).
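A quick worked example with hypothetical numbers: suppose 1000 CPU memory references produce 40 misses in L1 and 20 misses in L2. The L1 miss rate is 40/1000 = 4%; the local L2 miss rate is 20/40 = 50%, which looks alarming in isolation; but the global L2 miss rate is 20/1000 = 2%, and that is what determines how often main memory is actually visited. With a hypothetical 1-cycle L1 hit, 10-cycle L2 hit, and 100-cycle memory access:

AMAT = 1 + 0.04 × (10 + 0.50 × 100) = 1 + 0.04 × 60 = 3.4 cycles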
Reducing Misses: Which apply to L2 Cache?
Reducing Miss Rate:
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Capacity/Conflict Misses by Compiler Optimizations
L2 cache block size & A.M.A.T. (32KB L1, 8-byte path to memory)

Block size (bytes):   16    32    64    128   256   512
Relative CPU time:    1.36  1.28  1.27  1.34  1.54  1.95
Reducing Miss Penalty Summary
Three techniques:
- Early Restart and Critical Word First on miss
- Non-blocking Caches (Hit under Miss, Miss under Miss)
- Second Level Cache
Can be applied recursively to multilevel caches.
- The danger is that the time to DRAM will grow with multiple levels in between.

CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
Summary: The Cache Design Space
Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back
The optimal choice is a compromise:
- depends on access characteristics: workload, use (I-cache, D-cache, TLB)
- depends on technology / cost
Simplicity often wins.

[Diagram: qualitative trade-off curves for associativity, cache size, and block size, each Good over some range (Less to More) and Bad outside it, with Factor A trading off against Factor B]
IBM POWER4 Memory Hierarchy
L1 (Instr.): 64 KB, direct mapped; 128-byte blocks divided into 32-byte sectors
L1 (Data): 32 KB, 2-way, FIFO replacement; 4 cycles to load to a floating-point register
L2 (Instr. + Data): 1440 KB, 3-way, pseudo-LRU, shared by two processors; write allocate; 128-byte blocks; 14 cycles to load to a floating-point register
L3 (Instr. + Data): 128 MB, 8-way, shared by two processors; 512-byte blocks divided into 128-byte sectors; 340 cycles
Intel Itanium Processor
L1 (Instr.): 16 KB, 4-way; 32-byte blocks; 2 cycles
L1 (Data): 16 KB, 4-way, dual-ported, write through; 32-byte blocks; 2 cycles
L2 (Instr. + Data): 96 KB, 6-way; 64-byte blocks; write allocate; 12 cycles
L3: 4 MB (on package, off chip); 64-byte blocks; 128-bit bus at 800 MHz (12.8 GB/s); 20 cycles
3rd Generation Itanium
1.5 GHz
410 million transistors
6 MB 24-way set-associative L3 cache
6-level copper interconnect, 0.13 micron
130 W (i.e., lasts 17 s on an AA NiCd)
Cache Performance

Miss-oriented approach to memory access:

CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime
CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime

CPI_Execution includes ALU and memory instructions.

Separating out the memory component entirely:
- AMAT = Average Memory Access Time
- CPI_AluOps does not include memory instructions

CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

AMAT = HitTime + MissRate × MissPenalty
     = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data × MissPenalty_Data)
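As a sanity check on the formulas, a small sketch that evaluates the miss-oriented CPU-time expression and the AMAT form; all parameter values in main are hypothetical.

#include <stdio.h>

/* Miss-oriented form:
   CPUtime = IC * (CPI_exec + mem_per_inst * miss_rate * miss_penalty) * cycle_time */
double cpu_time_miss_oriented(double ic, double cpi_exec,
                              double mem_per_inst, double miss_rate,
                              double miss_penalty, double cycle_time) {
    return ic * (cpi_exec + mem_per_inst * miss_rate * miss_penalty) * cycle_time;
}

/* AMAT = hit_time + miss_rate * miss_penalty */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* hypothetical machine: 1e9 instructions, CPI_exec 1.1,
       1.3 memory accesses/instr, 2% miss rate, 50-cycle penalty, 1 ns cycle */
    double t = cpu_time_miss_oriented(1e9, 1.1, 1.3, 0.02, 50.0, 1e-9);
    printf("CPU time: %.3f s\n", t);                      /* 2.400 s */
    printf("AMAT: %.2f cycles\n", amat(1.0, 0.02, 50.0)); /* 2.00 cycles */
    return 0;
}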
Impact on Performance
Suppose a processor executes at:
- Clock Rate = 1 GHz (1 ns per cycle), ideal (no misses) CPI = 1.1
- 50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 100-cycle miss penalty, and that 1% of instructions get the same miss penalty.

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/instr)
      + 0.30 (Data Mops/instr) × 0.10 (miss/Data Mop) × 100 (cycles/miss)
      + 1 (Inst Mop/instr) × 0.01 (miss/Inst Mop) × 100 (cycles/miss)
    = (1.1 + 3.0 + 1.0) cycles/instr = 5.1 cycles/instr

So (3.0 + 1.0)/5.1 ≈ 78% of the time the processor is stalled waiting for memory!
Example: Unified Cache vs. Separate I&D (Harvard Architecture)

16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
32KB unified: aggregate miss rate = 1.99%
Which is better (ignoring the L2 cache)?
- Assume 33% data ops, so 75% of accesses are instruction fetches (1.0/1.33)
- hit time = 1, miss time = 50
- Note that a data hit has 1 extra stall cycle in the unified cache (it has only one port)

AMAT(Harvard) = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMAT(Unified) = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
[Diagram: Proc with separate I-Cache-1 and D-Cache-1 backed by Unified Cache-2, versus Proc with a single Unified Cache-1 backed by Unified Cache-2]
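The same comparison as code, a minimal check of the arithmetic above:

#include <stdio.h>

int main(void) {
    const double f_inst = 0.75, f_data = 0.25;   /* access mix */
    const double miss_time = 50.0;

    /* separate I & D caches (Harvard) */
    double harvard = f_inst * (1 + 0.0064 * miss_time)
                   + f_data * (1 + 0.0647 * miss_time);

    /* unified cache: data hits pay one extra stall (single port) */
    double unified = f_inst * (1 + 0.0199 * miss_time)
                   + f_data * (1 + 1 + 0.0199 * miss_time);

    printf("AMAT Harvard: %.2f\n", harvard);  /* 2.05 */
    printf("AMAT Unified: %.2f\n", unified);  /* 2.24 */
    return 0;
}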
Summary:
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space
Three major categories of cache misses:
- Compulsory Misses: sad facts of life. Example: cold-start misses.
- Conflict Misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Capacity Misses: increase cache size.
Write Policy:
- Write Through: needs a write buffer. Nightmare: write-buffer saturation
- Write Back: control can be complex
Cache performance: measured by AMAT = Hit Time + Miss Rate × Miss Penalty.