CENG 450 Computer Systems and Architecture
Lecture 15
Amirali Baniasadi, [email protected]
Announcements
Last Quiz scheduled for March 31st.
Cache Write Policy: Write Through versus Write Back
Cache reads are much easier to handle than cache writes: an instruction cache is much easier to design than a data cache.
Cache write: how do we keep the data in the cache and in memory consistent?
Two options (sketched in code below):
- Write Back: write to the cache only; write the cache block to memory when that block is replaced on a cache miss.
  - Needs a "dirty" bit for each cache block
  - Greatly reduces the memory bandwidth requirement
  - Control can be complex
- Write Through: write to the cache and to memory at the same time.
  - Isn't memory too slow for this?
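As a rough illustration, a minimal sketch in C of how the two policies differ on a store. The cache_line_t layout, the 64-byte block size, and the memory_write_block helper are hypothetical, not from the slides.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;            /* used only by write back */
    uint8_t  data[64];         /* one cache block (hypothetical size) */
} cache_line_t;

/* hypothetical stand-in for a DRAM write of one block */
static void memory_write_block(uint32_t tag, const uint8_t *data) {
    (void)tag; (void)data;
}

/* Write through: update the cache line AND memory on every store,
   so memory is always consistent (but every store costs a DRAM write). */
static void store_write_through(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    memory_write_block(line->tag, line->data);
}

/* Write back: update the cache line only and set the dirty bit;
   memory is written once, when the block is replaced on a miss. */
static void store_write_back(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}

static void evict(cache_line_t *line) {
    if (line->valid && line->dirty)      /* the dirty bit decides */
        memory_write_block(line->tag, line->data);
    line->valid = false;
    line->dirty = false;
}

In the write-through path memory stays consistent on every store; in the write-back path consistency is restored only at eviction, which is exactly why the dirty bit is needed.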
Write Buffer for Write Through
A write buffer is needed between the cache and memory (see the diagram and the FIFO sketch below):
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
- Store frequency (w.r.t. time) -> 1 / DRAM write cycle
- Write buffer saturation
[Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM]
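A minimal sketch of that FIFO behaviour, assuming a hypothetical 4-entry buffer of address/data pairs; real write buffers also merge writes to the same block and are probed on read misses, which this sketch omits.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* typical depth from the slide */

typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;

typedef struct {
    wb_entry_t entry[WB_ENTRIES];
    int head, tail, count;       /* simple circular FIFO */
} write_buffer_t;

/* Processor side: returns false when the buffer is full,
   i.e. the processor must stall (saturation). */
bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES)
        return false;
    wb->entry[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
bool wb_pop(write_buffer_t *wb, wb_entry_t *out) {
    if (wb->count == 0)
        return false;
    *out = wb->entry[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}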
Write Buffer Saturation
Store frequency (w.r.t. time) -> 1 / DRAM write cycle.
If this condition exists for a long period of time (CPU cycle time too short and/or too many store instructions in a row):
- The store buffer will overflow no matter how big you make it
- The CPU cycle time <= DRAM write cycle time
Solutions for write buffer saturation:
- Use a write back cache
- Install a second-level (L2) cache:
[Diagram: Processor -> Cache -> Write Buffer -> DRAM, versus Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM]
Improving Cache Performance
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
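For a quick sense of scale (hypothetical numbers, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty,

AMAT = 1 + 0.05 × 40 = 3 cycles

so even a 5% miss rate triples the average memory access time.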
1. Reduce Misses via Larger Block Size

[Figure: miss rate (0%–25%) vs. block size (16, 32, 64, 128, 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K]
2. Reduce Misses via Higher Associativity
2:1 Cache Rule: the miss rate of a direct-mapped cache of size N is about the same as the miss rate of a 2-way set-associative cache of size N/2.
Beware: execution time is the only final measure! Will the clock cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2% (see the worked numbers below)
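To see why execution time is the final measure, take some hypothetical numbers: a direct-mapped cache with a 1.00 ns hit time and 5.0% miss rate, a 2-way cache with a 1.02 ns hit time (the internal +2%) and 4.0% miss rate, and a 20 ns miss penalty:

AMAT(direct mapped) = 1.00 + 0.050 × 20 = 2.00 ns
AMAT(2-way)         = 1.02 + 0.040 × 20 = 1.82 ns

Here 2-way wins on AMAT, but if the slower hit also stretches the clock cycle of every instruction, the direct-mapped design can still give lower overall execution time.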
3. Reducing Misses via a “Victim Cache”
How do we combine the fast hit time of direct mapped and still avoid conflict misses?
Add a small buffer to hold data discarded from the cache (sketched in code below).
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache.
Used in Alpha and HP machines.
[Diagram: a fully associative victim cache of four entries, each holding one cache line of data with its own tag and comparator, placed between the cache and the next lower level in the hierarchy]
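A minimal sketch of the lookup path, assuming a hypothetical 4-entry fully associative victim cache searched on a direct-mapped miss; a real design also swaps the victim line with the main-cache line on a victim hit, which is only hinted at here.

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4

typedef struct { uint32_t tag; bool valid; /* data omitted */ } vc_line_t;

typedef struct { vc_line_t line[VC_ENTRIES]; } victim_cache_t;

/* Hardware searches all entries in parallel (one comparator per line);
   this C sketch just loops over them. */
bool victim_cache_lookup(victim_cache_t *vc, uint32_t block_tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc->line[i].valid && vc->line[i].tag == block_tag)
            return true;   /* slow hit: swap with the direct-mapped line */
    return false;          /* real miss: go to the next lower level */
}

/* On a direct-mapped replacement, the evicted (victim) block is
   placed into the victim cache instead of being discarded. */
void victim_cache_insert(victim_cache_t *vc, uint32_t victim_tag) {
    static int next = 0;                 /* simple FIFO replacement */
    vc->line[next] = (vc_line_t){ victim_tag, true };
    next = (next + 1) % VC_ENTRIES;
}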
4. Reducing Misses via “Pseudo-Associativity”
How do we combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache?
Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit). A sketch of the lookup follows below.
Drawback: it is difficult to build a CPU pipeline if a hit may take either 1 or 2 cycles.
- Better for caches not tied directly to the processor (L2)
- Used in the MIPS R10000 L2 cache; similar in UltraSPARC
[Diagram: access time line showing Hit Time < Pseudo Hit Time < Miss Penalty]
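A common way to "check the other half" is to flip the most significant index bit on a miss; the sketch below assumes that convention and a hypothetical 1024-set cache.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 1024          /* hypothetical direct-mapped cache */

typedef struct { uint32_t tag; bool valid; } pa_line_t;

pa_line_t cache[NUM_SETS];

/* Returns 1 for a fast hit, 2 for a slow pseudo-hit, 0 for a miss.
   The pseudo-set is the primary set with its top index bit inverted. */
int pseudo_assoc_lookup(uint32_t tag, uint32_t index) {
    if (cache[index].valid && cache[index].tag == tag)
        return 1;                              /* normal 1-cycle hit */

    uint32_t other = index ^ (NUM_SETS >> 1);  /* flip MSB of index */
    if (cache[other].valid && cache[other].tag == tag)
        return 2;                              /* pseudo-hit: extra cycle */

    return 0;                                  /* miss: fetch from below */
}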
5. Reducing Misses by Compiler Optimizations
Instructions: not discussed here.
Data:
- Merging Arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays
- Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
[Diagram: layout of x[i][j] along increasing memory addresses; after interchange, the inner j loop walks consecutive addresses]
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a & c vs. one miss per access; the fused loop reuses a[i][j] and c[i][j] while they are still in the cache, improving temporal locality.
Summary of Compiler Optimizations (by hand)

[Figure: performance improvement factors (1× to 3×) from merged arrays, loop interchange, loop fusion, and blocking on the benchmarks compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]
Summary: Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by Compiler Optimizations

CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU:
- Early Restart—as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical Word First—request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first. (A sketch of the wrapped fill order follows below.)
Generally useful only when the cache line is larger than the bus width.
Spatial locality is a problem: we tend to want the next sequential word anyway, so it is not clear whether early restart helps.
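As an illustration of wrapped fetch, a small sketch that computes the order in which the words of a block return from memory, assuming a hypothetical 8-word block:

#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* hypothetical block size */

/* Critical word first: start the fill at the missed word and wrap
   around, so the missed word arrives in slot 0 of the transfer. */
void fill_order(int missed, int order[WORDS_PER_BLOCK]) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        order[i] = (missed + i) % WORDS_PER_BLOCK;
}

int main(void) {
    int order[WORDS_PER_BLOCK];
    fill_order(5, order);          /* CPU missed on word 5 */
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", order[i]);   /* prints: 5 6 7 0 1 2 3 4 */
    printf("\n");
    return 0;
}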
2. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses
A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss.
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses (see the bookkeeping sketch below).
  - The Pentium Pro allows 4 outstanding memory misses.
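Outstanding misses are commonly tracked in miss status holding registers (MSHRs); the sketch below assumes a hypothetical 4-entry MSHR file, matching the Pentium Pro's 4 outstanding misses.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4   /* matches the Pentium Pro's 4 outstanding misses */

typedef struct { uint32_t block_addr; bool busy; } mshr_t;

mshr_t mshr[NUM_MSHRS];

/* On a miss: try to record it so the cache can keep serving hits.
   Returns false when all MSHRs are busy, i.e. the cache must block. */
bool mshr_allocate(uint32_t block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return true;    /* miss to an in-flight block: merge it */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshr[i].busy) {
            mshr[i] = (mshr_t){ block_addr, true };
            return true;    /* new outstanding miss recorded */
        }
    return false;           /* structural stall: no free MSHR */
}

/* When memory returns the block, free its MSHR entry. */
void mshr_release(uint32_t block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            mshr[i].busy = false;
}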
3: Use a multi-level cache
L2 Equations:

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

Definitions:
- Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
Global miss rate is what matters (worked example below).
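A quick worked example with hypothetical numbers: suppose 1000 CPU memory references produce 40 misses in L1 and 20 misses in L2. The L1 miss rate is 40/1000 = 4%; the local L2 miss rate is 20/40 = 50%, which looks alarming in isolation; but the global L2 miss rate is 20/1000 = 2%, and that is what determines how often main memory is actually visited. With a hypothetical 1-cycle L1 hit, 10-cycle L2 hit, and 100-cycle memory access:

AMAT = 1 + 0.04 × (10 + 0.50 × 100) = 1 + 0.04 × 60 = 3.4 cycles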
Reducing Misses: Which apply to L2 Cache?
Reducing Miss Rate:
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Capacity/Conflict Misses by Compiler Optimizations
L2 cache block size & A.M.A.T. (32KB L1, 8-byte path to memory)

Block size (bytes):   16    32    64    128   256   512
Relative CPU time:    1.36  1.28  1.27  1.34  1.54  1.95
Reducing Miss Penalty Summary
Three techniques:
- Early Restart and Critical Word First on miss
- Non-blocking Caches (Hit under Miss, Miss under Miss)
- Second Level Cache
Can be applied recursively to multilevel caches.
- The danger is that the time to DRAM will grow with multiple levels in between.

CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
Summary: The Cache Design Space
Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back
The optimal choice is a compromise:
- depends on access characteristics: workload, use (I-cache, D-cache, TLB)
- depends on technology / cost
Simplicity often wins.

[Diagram: qualitative trade-off curves for associativity, cache size, and block size, each Good over some range (Less to More) and Bad outside it, with Factor A trading off against Factor B]
IBM POWER4 Memory Hierarchy
L1 (Instr.): 64 KB, direct mapped; 128-byte blocks divided into 32-byte sectors
L1 (Data): 32 KB, 2-way, FIFO replacement; 4 cycles to load to a floating-point register
L2 (Instr. + Data): 1440 KB, 3-way, pseudo-LRU, shared by two processors; write allocate; 128-byte blocks; 14 cycles to load to a floating-point register
L3 (Instr. + Data): 128 MB, 8-way, shared by two processors; 512-byte blocks divided into 128-byte sectors; 340 cycles
Intel Itanium Processor
L1 (Instr.): 16 KB, 4-way; 32-byte blocks; 2 cycles
L1 (Data): 16 KB, 4-way, dual-ported, write through; 32-byte blocks; 2 cycles
L2 (Instr. + Data): 96 KB, 6-way; 64-byte blocks; write allocate; 12 cycles
L3: 4 MB (on package, off chip); 64-byte blocks; 128-bit bus at 800 MHz (12.8 GB/s); 20 cycles
3rd Generation Itanium
1.5 GHz
410 million transistors
6 MB 24-way set-associative L3 cache
6-level copper interconnect, 0.13 micron
130 W (i.e., lasts 17 s on an AA NiCd)
Cache Performance

Miss-oriented approach to memory access:

CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime
CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime

CPI_Execution includes ALU and memory instructions.

Separating out the memory component entirely:
- AMAT = Average Memory Access Time
- CPI_AluOps does not include memory instructions

CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

AMAT = HitTime + MissRate × MissPenalty
     = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data × MissPenalty_Data)
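As a sanity check on the formulas, a small sketch that evaluates the miss-oriented CPU-time expression and the AMAT form; all parameter values in main are hypothetical.

#include <stdio.h>

/* Miss-oriented form:
   CPUtime = IC * (CPI_exec + mem_per_inst * miss_rate * miss_penalty) * cycle_time */
double cpu_time_miss_oriented(double ic, double cpi_exec,
                              double mem_per_inst, double miss_rate,
                              double miss_penalty, double cycle_time) {
    return ic * (cpi_exec + mem_per_inst * miss_rate * miss_penalty) * cycle_time;
}

/* AMAT = hit_time + miss_rate * miss_penalty */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* hypothetical machine: 1e9 instructions, CPI_exec 1.1,
       1.3 memory accesses/instr, 2% miss rate, 50-cycle penalty, 1 ns cycle */
    double t = cpu_time_miss_oriented(1e9, 1.1, 1.3, 0.02, 50.0, 1e-9);
    printf("CPU time: %.3f s\n", t);                      /* 2.400 s */
    printf("AMAT: %.2f cycles\n", amat(1.0, 0.02, 50.0)); /* 2.00 cycles */
    return 0;
}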
Impact on Performance
Suppose a processor executes at:
- Clock Rate = 1 GHz (1 ns per cycle), ideal (no misses) CPI = 1.1
- 50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 100-cycle miss penalty, and that 1% of instructions get the same miss penalty.

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/instr)
      + 0.30 (Data Mops/instr) × 0.10 (miss/Data Mop) × 100 (cycles/miss)
      + 1 (Inst Mop/instr) × 0.01 (miss/Inst Mop) × 100 (cycles/miss)
    = (1.1 + 3.0 + 1.0) cycles/instr = 5.1 cycles/instr

So (3.0 + 1.0)/5.1 ≈ 78% of the time the processor is stalled waiting for memory!
Example: Unified Cache vs. Separate I&D (Harvard Architecture)

16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
32KB unified: aggregate miss rate = 1.99%
Which is better (ignoring the L2 cache)?
- Assume 33% data ops, so 75% of accesses are instruction fetches (1.0/1.33)
- hit time = 1, miss time = 50
- Note that a data hit has 1 extra stall cycle in the unified cache (it has only one port)

AMAT(Harvard) = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMAT(Unified) = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
[Diagram: Proc with separate I-Cache-1 and D-Cache-1 backed by Unified Cache-2, versus Proc with a single Unified Cache-1 backed by Unified Cache-2]
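The same comparison as code, a minimal check of the arithmetic above:

#include <stdio.h>

int main(void) {
    const double f_inst = 0.75, f_data = 0.25;   /* access mix */
    const double miss_time = 50.0;

    /* separate I & D caches (Harvard) */
    double harvard = f_inst * (1 + 0.0064 * miss_time)
                   + f_data * (1 + 0.0647 * miss_time);

    /* unified cache: data hits pay one extra stall (single port) */
    double unified = f_inst * (1 + 0.0199 * miss_time)
                   + f_data * (1 + 1 + 0.0199 * miss_time);

    printf("AMAT Harvard: %.2f\n", harvard);  /* 2.05 */
    printf("AMAT Unified: %.2f\n", unified);  /* 2.24 */
    return 0;
}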
Summary:
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space
Three major categories of cache misses:
- Compulsory Misses: sad facts of life. Example: cold-start misses.
- Conflict Misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Capacity Misses: increase cache size.
Write Policy:
- Write Through: needs a write buffer. Nightmare: write-buffer saturation
- Write Back: control can be complex
Cache performance: measured by AMAT = Hit Time + Miss Rate × Miss Penalty.