
Page 1:

CENG 450 Computer Systems and Architecture

Lecture 15

Amirali Baniasadi

[email protected]

Page 2:

Announcements

The last quiz is scheduled for March 31st.

Page 3:

Cache Write Policy: Write Through versus Write Back

Cache reads are much easier to handle than cache writes: an instruction cache is much easier to design than a data cache.

Cache writes: how do we keep the data in the cache and in memory consistent?

Two options:
- Write Back: write to the cache only, and write the cache block back to memory when that block is replaced on a cache miss.
  - Needs a "dirty" bit for each cache block
  - Greatly reduces the memory bandwidth requirement
  - Control can be complex
- Write Through: write to the cache and to memory at the same time. Isn't memory too slow for this?
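To make the two policies concrete, here is a minimal C sketch of the write path under each policy. The direct-mapped layout and the mem_write_* helpers are illustrative assumptions, not part of the slides.

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 256
#define BLOCK_SIZE 16

/* One direct-mapped cache block (illustrative layout). */
typedef struct {
    bool     valid;
    bool     dirty;   /* needed only by the write-back policy */
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_block_t;

static cache_block_t cache[NUM_BLOCKS];

/* Hypothetical memory-side helpers. */
extern void mem_write_byte(uint32_t addr, uint8_t byte);
extern void mem_write_block(uint32_t block_addr, const uint8_t *data);

static uint32_t index_of(uint32_t addr) { return (addr / BLOCK_SIZE) % NUM_BLOCKS; }
static uint32_t tag_of(uint32_t addr)   { return addr / (BLOCK_SIZE * NUM_BLOCKS); }

/* Write-through: on a hit, update the cache AND memory together. */
void write_through(uint32_t addr, uint8_t byte) {
    cache_block_t *b = &cache[index_of(addr)];
    if (b->valid && b->tag == tag_of(addr))
        b->data[addr % BLOCK_SIZE] = byte;
    mem_write_byte(addr, byte);   /* memory always stays consistent */
}

/* Write-back: on a hit, update only the cache and mark the block dirty. */
void write_back(uint32_t addr, uint8_t byte) {
    cache_block_t *b = &cache[index_of(addr)];
    if (b->valid && b->tag == tag_of(addr)) {
        b->data[addr % BLOCK_SIZE] = byte;
        b->dirty = true;          /* memory is now stale */
    }
}

/* On replacement, a dirty block must be written back before reuse. */
void evict(uint32_t index) {
    cache_block_t *b = &cache[index];
    if (b->valid && b->dirty) {
        uint32_t block_addr = (b->tag * NUM_BLOCKS + index) * BLOCK_SIZE;
        mem_write_block(block_addr, b->data);
    }
    b->valid = false;
    b->dirty = false;
}

The dirty bit is exactly what lets write-back defer memory traffic until eviction, at the cost of more complex control.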

Page 4:

Write Buffer for Write Through

A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory

The write buffer is just a FIFO:
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

Memory system designer's nightmare:
- Store frequency (w.r.t. time) -> 1 / DRAM write cycle
- Write buffer saturation

[Figure: the Processor writes into the Cache and a Write Buffer; the Write Buffer drains to DRAM]
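A minimal sketch of such a 4-entry FIFO write buffer, assuming a simple ring-buffer implementation; the entry layout and the dram_write helper are illustrative, not from the slides.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* typical depth from the slide */

typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;

static wb_entry_t buf[WB_ENTRIES];
static int head = 0, tail = 0, count = 0;

/* Processor side: returns false when the buffer is saturated,
 * which would stall the store. */
bool wb_push(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;   /* write buffer saturation */
    buf[tail] = (wb_entry_t){ addr, data };
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
extern void dram_write(uint32_t addr, uint32_t data);   /* hypothetical */

void wb_drain_one(void) {
    if (count == 0) return;
    dram_write(buf[head].addr, buf[head].data);
    head = (head + 1) % WB_ENTRIES;
    count--;
}

As long as wb_drain_one runs at least as often as wb_push, the buffer absorbs store bursts; if stores arrive at the DRAM write-cycle rate, wb_push starts returning false, which is the saturation case on the next slide.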

Page 5:

Write Buffer Saturation

Store frequency (w.r.t. time) -> 1 / DRAM write cycle. If this condition persists for a long period of time (CPU cycle time too short and/or too many store instructions in a row):
- The store buffer will overflow no matter how big you make it
- The CPU cycle time <= DRAM write cycle time

Solutions for write buffer saturation:
- Use a write-back cache
- Install a second-level (L2) cache

[Figure: Processor and Cache feeding a Write Buffer that drains to DRAM; in the second organization an L2 cache sits between the Write Buffer and DRAM]

Page 6:

Improving Cache Performance

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
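For example (illustrative numbers, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss penalty, AMAT = 1 + 0.05 * 50 = 3.5 cycles. Each of the three levers above attacks one term of this expression.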

Page 7:

1. Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K; larger blocks reduce the miss rate up to a point, after which the miss rate rises again in the smaller caches]

Page 8:

2. Reduce Misses via Higher Associativity

2:1 Cache Rule: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

Beware: execution time is the only final measure!
- Will the clock cycle time increase?
- Hill [1988] suggested the hit time for 2-way vs. 1-way is +10% for an external cache, +2% for an internal one

Page 9:

3. Reducing Misses via a “Victim Cache”

How do we combine the fast hit time of direct mapping and still avoid conflict misses?

Add a small buffer that holds data discarded from the cache.

Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache.

Used in Alpha and HP machines.

[Figure: a direct-mapped cache (TAGS/DATA) backed by a small fully associative victim cache of four entries, each holding one cache line of data with its own tag and comparator, connected to the next lower level in the hierarchy]
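A minimal C sketch of the victim-cache lookup path, assuming a direct-mapped L1 and Jouppi's 4-entry fully associative victim buffer; the swap-on-victim-hit policy and all names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS     256
#define VICTIM_WAYS 4     /* Jouppi's 4-entry victim cache */

typedef struct { bool valid; uint32_t tag; /* data omitted */ } line_t;

static line_t l1[L1_SETS];
static line_t victim[VICTIM_WAYS];

/* Returns true on a hit in either L1 or the victim cache. */
bool lookup(uint32_t set, uint32_t tag) {
    if (l1[set].valid && l1[set].tag == tag)
        return true;                      /* fast direct-mapped hit */

    for (int w = 0; w < VICTIM_WAYS; w++) {
        if (victim[w].valid && victim[w].tag == tag) {
            /* Victim hit: swap the lines so the hot one is back in L1. */
            line_t tmp = l1[set];
            l1[set] = victim[w];
            victim[w] = tmp;
            return true;
        }
    }
    return false;   /* miss: the evicted L1 line would go into the victim cache */
}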

Page 10:

4. Reducing Misses via “Pseudo-Associativity”

How do we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?

Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit).

Drawback: it is difficult to build a CPU pipeline if a hit may take either 1 or 2 cycles.
- Better for caches not tied directly to the processor (L2)
- Used in the MIPS R10000 L2 cache; similar in UltraSPARC

[Figure: time line showing Hit Time < Pseudo Hit Time < Miss Penalty]
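A minimal C sketch of such a pseudo-associative lookup, assuming the common rehash of flipping the top index bit to find the "other half"; all names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define SETS 256   /* must be a power of two for the bit-flip rehash */

typedef struct { bool valid; uint32_t tag; } line_t;
static line_t cache[SETS];

/* Returns 1 for a fast hit, 2 for a pseudo-hit (slow hit), 0 for a miss. */
int pa_lookup(uint32_t set, uint32_t tag) {
    if (cache[set].valid && cache[set].tag == tag)
        return 1;                          /* regular 1-cycle hit */

    uint32_t other = set ^ (SETS / 2);     /* "other half" of the cache */
    if (cache[other].valid && cache[other].tag == tag)
        return 2;                          /* pseudo-hit: takes an extra cycle */

    return 0;                              /* genuine miss */
}

The variable-latency return value (1 vs. 2 cycles) is precisely what makes this scheme awkward for a pipeline tied directly to the cache.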

Page 11:

5. Reducing Misses by Compiler Optimizations

Instructions: not discussed here.

Data:
- Merging Arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays
- Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables

Page 12:

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key and improves spatial locality.

Page 13:

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

[Figure: row-major memory layout of x; after the interchange, successive accesses touch consecutive memory addresses instead of striding across rows]

Page 14:

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improves spatial locality.

Page 15:

Summary of Compiler Optimizations (by hand)

[Figure: performance improvement (roughly 1x to 3x) from hand-applied merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]

Page 16:

Summary: Miss Rate Reduction

3 Cs: Compulsory, Capacity, Conflict
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by Compiler Optimizations

CPUtime = IC x (CPI_Execution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time

Page 17:

Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Page 18:

1. Reduce Miss Penalty: Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
- Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.

Generally useful only when cache line > bus width. Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear whether early restart helps.

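A minimal C sketch of the wrapped (critical-word-first) fill order, assuming an 8-word block of 4-byte words; the dram_read_word and deliver_to_cpu helpers are illustrative.

#include <stdint.h>

#define WORDS_PER_BLOCK 8

extern uint32_t dram_read_word(uint32_t addr);   /* hypothetical */
extern void     deliver_to_cpu(uint32_t word);   /* hypothetical */

/* Critical word first: fetch the missed word, hand it to the CPU,
 * then wrap around to fill the rest of the block. */
void fill_block(uint32_t block[], uint32_t block_addr, uint32_t miss_offset) {
    for (uint32_t i = 0; i < WORDS_PER_BLOCK; i++) {
        uint32_t off = (miss_offset + i) % WORDS_PER_BLOCK;   /* wrapped order */
        block[off] = dram_read_word(block_addr + 4 * off);
        if (i == 0)
            deliver_to_cpu(block[off]);   /* CPU restarts immediately */
    }
}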

Page 19:

2. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses

A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss.
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
- This significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
- The Pentium Pro allows 4 outstanding memory misses.
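Outstanding misses are typically tracked in miss status holding registers (MSHRs). Below is a minimal illustrative C sketch; the field layout is an assumption, and only the 4-entry limit comes from the Pentium Pro example above.

#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING 4   /* e.g., the Pentium Pro's limit */

/* One miss status holding register: tracks one in-flight miss. */
typedef struct {
    bool     valid;
    uint32_t block_addr;   /* which block is being fetched */
    uint8_t  dest_reg;     /* where to deliver the data on return */
} mshr_t;

static mshr_t mshr[MAX_OUTSTANDING];

/* Try to record a new outstanding miss.
 * Returns false when all MSHRs are busy: only then must the CPU stall. */
bool allocate_mshr(uint32_t block_addr, uint8_t dest_reg) {
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (mshr_t){ true, block_addr, dest_reg };
            return true;
        }
    }
    return false;
}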

Page 20:

3: Use a multi-level cache

L2 Equations:

AMAT = HitTime_L1 + MissRate_L1 x MissPenalty_L1
MissPenalty_L1 = HitTime_L2 + MissRate_L2 x MissPenalty_L2
AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2)

Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (MissRate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (MissRate_L1 x MissRate_L2)

The global miss rate is what matters.
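For example (illustrative numbers, not from the slides): suppose HitTime_L1 = 1 cycle, MissRate_L1 = 5%, HitTime_L2 = 10 cycles, local MissRate_L2 = 40%, and MissPenalty_L2 = 100 cycles. Then:

MissPenalty_L1 = 10 + 0.40 x 100 = 50 cycles
AMAT = 1 + 0.05 x 50 = 3.5 cycles

and the global L2 miss rate is 0.05 x 0.40 = 2%, even though the local L2 miss rate looks alarming at 40%.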

Page 21:

Reducing Misses: Which apply to L2 Cache?

Reducing Miss Rate:
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Capacity/Conflict Misses by Compiler Optimizations

Page 22:

L2 cache block size & A.M.A.T. (32 KB L1, 8-byte path to memory)

Block size (bytes):  16    32    64    128   256   512
Relative CPU time:   1.36  1.28  1.27  1.34  1.54  1.95

Page 23:

Reducing Miss Penalty Summary

Three techniques:
1. Early Restart and Critical Word First on a miss
2. Non-blocking caches (hit under miss, miss under miss)
3. Second-level cache
   - Can be applied recursively to multilevel caches
   - The danger is that the time to DRAM will grow with multiple levels in between

CPUtime = IC x (CPI_Execution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time

Page 24:

Summary: The Cache Design Space

Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back

The optimal choice is a compromise:
- depends on access characteristics: workload and use (I-cache, D-cache, TLB)
- depends on technology / cost

Simplicity often wins.

[Figure: the cache design space as a trade-off surface: for each factor (associativity, cache size, block size), moving from less to more is good along one axis and bad along another]

Page 25:

IBM POWER4 Memory Hierarchy

- L1 (Instr.): 64 KB, direct mapped; 128-byte blocks divided into 32-byte sectors
- L1 (Data): 32 KB, 2-way, FIFO replacement; 4 cycles to load to a floating-point register
- L2 (Instr. + Data): 1440 KB, 3-way, pseudo-LRU, shared by two processors; write allocate; 128-byte blocks; 14 cycles to load to a floating-point register
- L3 (Instr. + Data): 128 MB, 8-way, shared by two processors; 512-byte blocks divided into 128-byte sectors; 340 cycles

Page 26:

Intel Itanium Processor

- L1 (Instr.): 16 KB, 4-way; 32-byte blocks; 2 cycles
- L1 (Data): 16 KB, 4-way, dual-ported, write-through; 32-byte blocks; 2 cycles
- L2 (Instr. + Data): 96 KB, 6-way; 64-byte blocks; write allocate; 12 cycles
- L3: 4 MB (on package, off chip); 64-byte blocks; 128-bit bus at 800 MHz (12.8 GB/s); 20 cycles

Page 27:

3rd Generation Itanium

- 1.5 GHz
- 410 million transistors
- 6 MB 24-way set-associative L3 cache
- 6-level copper interconnect, 0.13 micron
- 130 W (i.e., lasts 17 s on an AA NiCd)

Page 28:

Cache Performance

Miss-oriented approach to memory access (CPI_Execution includes ALU and memory instructions):

CPUtime = IC x (CPI_Execution + (MemAccess / Inst) x MissRate x MissPenalty) x CycleTime
CPUtime = IC x (CPI_Execution + (MemMisses / Inst) x MissPenalty) x CycleTime

Separating out the memory component entirely (AMAT = Average Memory Access Time; CPI_AluOps does not include memory instructions):

CPUtime = IC x ((AluOps / Inst) x CPI_AluOps + (MemAccess / Inst) x AMAT) x CycleTime

AMAT = HitTime + MissRate x MissPenalty
     = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data x MissPenalty_Data)

Page 29:

Impact on Performance

Suppose a processor executes at:
- Clock rate = 1 GHz (1 ns per cycle)
- Ideal (no misses) CPI = 1.1
- 50% arith/logic, 30% ld/st, 20% control

Suppose that 10% of memory operations get a 100-cycle miss penalty, and that 1% of instructions get the same miss penalty.

CPI = ideal CPI + average stalls per instruction
    = 1.1 cycles/instr.
      + (0.30 Data_Mops/instr. x 0.10 miss/Data_Mop x 100 cycles/miss)
      + (1 Inst_Mop/instr. x 0.01 miss/Inst_Mop x 100 cycles/miss)
    = (1.1 + 3.0 + 1.0) cycles/instr.
    = 5.1 cycles/instr.

78% of the time the processor is stalled waiting for memory! (4.0 of the 5.1 cycles per instruction are memory stalls.)

Page 30:

Example: Unified Cache vs. Separate I&D (Harvard Architecture)

16 KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
32 KB unified: aggregate miss rate = 1.99%

Which is better (ignoring the L2 cache)?
- Assume 33% data ops, so 75% of accesses come from instructions (1.0/1.33)
- Hit time = 1, miss time = 50
- Note that a data hit incurs 1 extra stall in the unified cache (it has only one port)

AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

[Figure: Harvard organization (Proc with separate I-Cache-1 and D-Cache-1, backed by Unified Cache-2) vs. unified organization (Proc with Unified Cache-1 backed by Unified Cache-2)]

Page 31:

Summary:

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space

Three major categories of cache misses:
- Compulsory misses: sad facts of life. Example: cold-start misses.
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Capacity misses: increase cache size.

Write policy:
- Write Through: needs a write buffer. Nightmare: write-buffer saturation.
- Write Back: control can be complex.

Cache performance: AMAT = Hit Time + Miss Rate * Miss Penalty.