
Source: class.ece.iastate.edu/tyagi/cpre581/lectures/Lecture16.pdf

Memory Hierarchy: Cache Design Principles and Performance

CprE 581, Fall 2016. Slides adapted and revised from UC Berkeley CS252 S01 and F06.

Adapted from UCB CS252 S01, Copyright 2001 UCB

Copyright © 2012, Elsevier Inc. All rights reserved.

Introduction

Programmers want unlimited amounts of memory with low latency, but fast memory technology is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy.
- The entire addressable memory space is available in the largest, slowest memory.
- Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories.
- This gives the illusion of a large, fast memory being presented to the processor.

Memory Hierarchy

Since 1980, CPU performance has outpaced DRAM:

- CPU: 60% per year (2x in 1.5 years)
- DRAM: 9% per year (2x in 10 years)

[Figure: performance (1/latency) versus year for CPU and DRAM since 1980; the gap grew 50% per year.]

Q. How do architects address this gap?

A. Put smaller, faster “cache” memories between CPU and DRAM.

Create a “memory hierarchy”.

Memory Performance Gap

In 1977, DRAM was faster than microprocessors. The Apple ][ (1977), by Steve Wozniak and Steve Jobs:
- CPU: 1000 ns
- DRAM: 400 ns


Memory Hierarchy Design

Memory hierarchy design becomes more crucial with recent multi-core processors, because aggregate peak bandwidth grows with the number of cores:
- An Intel Core i7 can generate two data references per core per clock.
- Four cores at a 3.2 GHz clock give 25.6 billion 64-bit data references/second plus 12.8 billion 128-bit instruction references/second = 409.6 GB/s!
- DRAM bandwidth is only 6% of this (25 GB/s).

This requires multi-port, pipelined caches, two levels of cache per core, and a shared third-level cache on chip. The bandwidth arithmetic is checked in the sketch below.
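A minimal C sketch of the arithmetic above; the core count, clock rate, and reference widths are the slide's numbers:

    #include <stdio.h>

    int main(void) {
        double cores = 4.0, clock_ghz = 3.2;
        /* Two 64-bit data references per core per clock ... */
        double data_refs_g = cores * clock_ghz * 2.0;    /* 25.6 G refs/s */
        /* ... plus one 128-bit instruction reference per core per clock. */
        double inst_refs_g = cores * clock_ghz * 1.0;    /* 12.8 G refs/s */
        double gb_per_s = data_refs_g * 8.0 + inst_refs_g * 16.0;
        printf("peak bandwidth demand: %.1f GB/s\n", gb_per_s);  /* 409.6 */
        return 0;
    }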

Levels of the Memory Hierarchy

Level         Capacity     Access time            Cost                       Xfer unit (staging)           Managed by
Registers     100s bytes   < 10s ns               --                         1-8 bytes (instr. operands)   program/compiler
Cache         KBs          10-100 ns              1-0.1 cents/bit            8-128 bytes (blocks)          cache controller
Main memory   MBs          200-500 ns             10^-4 - 10^-5 cents/bit    512 B - 4 KB (pages)          OS
Disk          GBs          10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit    MBs (files)                   user/operator
Tape          infinite     sec-min                10^-8 cents/bit            --                            --

Upper levels are smaller and faster; lower levels are larger and slower. Today's focus: the cache level.

The Principle of Locality

Programs access a relatively small portion of the address space at any instant of time. Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

For the last 15 years, hardware has relied on locality for speed. Locality is a property of programs which is exploited in machine design.

Performance and Power

High-end microprocessors have more than 10 MB of on-chip cache, which consumes a large amount of the area and power budget.

Programs with locality cache well ...

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

[Figure: memory address (one dot per access) plotted against time. Horizontal runs of dots show temporal locality, clusters of nearby addresses show spatial locality, and scattered accesses show bad locality behavior.]

Memory Hierarchy: Terminology

- Hit: data appears in some block in the upper level (example: block X).
  - Hit rate: the fraction of memory accesses found in the upper level.
  - Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
- Miss: data needs to be retrieved from a block in the lower level (block Y).
  - Miss rate = 1 - (hit rate).
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
- Hit time << miss penalty (500 instructions on the Alpha 21264!)

[Diagram: the processor exchanges block X with the upper-level memory, which exchanges block Y with the lower-level memory.]


Memory Hierarchy Basics

When a word is not found in the cache, a miss occurs:
- Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference. The lower level may be another cache or the main memory.
- Also fetch the other words contained within the block, taking advantage of spatial locality.
- Place the block into the cache in any location within its set, determined by (block address) MOD (number of sets).

Cache Algorithm (Read)

Look at the processor address and search the cache tags to find a match. Then either:
- HIT (found in cache): return a copy of the data from the cache.
- MISS (not in cache): read the block of data from main memory, wait, then return the data to the processor and update the cache.

Hit rate = fraction of accesses found in the cache.
Miss rate = 1 - hit rate.
Hit time = RAM access time + time to determine HIT/MISS.
Miss time = time to replace a block in the cache + time to deliver the block to the processor.

Memory Hierarchy Basics

Set cardinality n => n-way set associative:
- Direct-mapped cache => one block per set.
- Fully associative => one set containing all blocks.

Writing to the cache: two strategies.
- Write-through: immediately update lower levels of the hierarchy.
- Write-back: only update lower levels of the hierarchy when an updated block is replaced.
Both strategies use a write buffer to make writes asynchronous.

Memory Hierarchy Basics

Miss rate: the fraction of cache accesses that result in a miss. Causes of misses:
- Compulsory: first reference to a block.
- Capacity: blocks discarded and later retrieved.
- Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache.

Inside a Cache

[Diagram: the processor sends addresses and exchanges data with the CACHE, which in turn sends addresses and exchanges data with main memory. Each cache line holds an address tag and a data block of data bytes; for example, the line with tag 100 holds copies of main memory locations 100 and 101. Other lines shown carry tags 304, 6848, and 416.]

4 Questions for Memory Hierarchy

- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)


Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache:
- Fully associative: block 12 can go anywhere.
- Direct mapped: block 12 can go only into block (12 mod 8) = 4.
- 2-way set associative: block 12 can go anywhere in set (12 mod 4) = 0.

Set-associative mapping: set = block number modulo number of sets. The three computations are sketched below.

[Diagram: a 32-block memory (blocks 0-31) mapped into an 8-block cache under the three placement policies.]
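A minimal sketch of the placement computations above (the numbers are the slide's; the variable names are ours):

    #include <stdio.h>

    int main(void) {
        int block = 12, cache_blocks = 8;

        /* Fully associative: one set holds all blocks, so any frame works. */
        printf("fully associative: any of frames 0..%d\n", cache_blocks - 1);

        /* Direct mapped: one block per set -> frame = block mod #blocks. */
        printf("direct mapped: frame %d\n", block % cache_blocks);     /* 4 */

        /* 2-way set associative: 8 blocks / 2 ways = 4 sets. */
        int sets = cache_blocks / 2;
        printf("2-way set associative: set %d\n", block % sets);       /* 0 */
        return 0;
    }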

Q2: How is a block found if it is in the upper level?

The memory address divides into a block address (tag + index) and a block offset. A tag is kept on each block; there is no need to check the index or block offset in the comparison. Increasing associativity shrinks the index and expands the tag.

Figure B.5 The organization of the data cache in the Opteron microprocessor. The 64 KB cache is two-way set associative with 64-byte blocks. The 9-bit index selects among 512 sets. The four steps of a read hit, shown as circled numbers in order of occurrence, label this organization. Three bits of the block offset join the index to supply the RAM address to select the proper 8 bytes. Thus, the cache holds two groups of 4096 64-bit words, with each group containing half of the 512 sets. Although not exercised in this example, the line from lower-level memory to the cache is used on a miss to load the cache. The size of the address leaving the processor is 40 bits because it is a physical address and not a virtual address. Figure B.24 on page B-47 explains how the Opteron maps from virtual to physical for a cache access.

Q3: Which block should be replaced on a miss?

- A randomly chosen block? Easy to implement, but how well does it work?
- The least recently used (LRU) block? Appealing, but hard to implement for high associativity. Also, try other LRU approximations; a one-bit sketch for two-way sets follows the table.

Miss rate for a 2-way set associative cache:

Size      Random   LRU
16 KB     5.7%     5.2%
64 KB     2.0%     1.9%
256 KB    1.17%    1.15%
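For two-way sets, exact LRU needs only one bit per set. A minimal sketch (the struct and function names are illustrative, not from the slides):

    /* One LRU bit per set: it names the way that was NOT used most recently. */
    struct set_lru { unsigned char lru_way; };

    /* Update on every hit or fill of 'way' (0 or 1) in this set. */
    static void lru_touch(struct set_lru *s, int way) {
        s->lru_way = (unsigned char)(1 - way);  /* the other way becomes LRU */
    }

    /* On a miss, evict the least recently used way. */
    static int lru_victim(const struct set_lru *s) {
        return s->lru_way;
    }

For higher associativity, exact LRU requires per-set ordering state, which is why random replacement or LRU approximations (such as pseudo-LRU trees) are used instead.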

Q4: What happens on a write?

Policy                                   Write-Through                     Write-Back
What a store does                        Data written to the cache block   Data written only to the cache;
                                         is also written to lower-level    lower level updated when the
                                         memory                            block falls out of the cache
Debug                                    Easy                              Hard
Do read misses produce writes?           No                                Yes
Do repeated writes reach the
lower level?                             Yes                               No

Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate"). The two store paths are sketched below.
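A minimal sketch contrasting the two store paths on a write hit, using a toy one-block cache (all names and sizes here are illustrative, not from the slides):

    #include <string.h>

    enum { BLOCK_SIZE = 32, MEM_SIZE = 1 << 16 };
    static unsigned char block[BLOCK_SIZE];   /* the (single) cache block */
    static unsigned char memory[MEM_SIZE];    /* the lower level          */
    static int dirty;

    /* Write-through: the store updates the cache block AND the lower level. */
    static void store_write_through(unsigned addr, unsigned char v) {
        block[addr % BLOCK_SIZE] = v;
        memory[addr] = v;            /* every repeated write reaches memory */
    }

    /* Write-back: the store updates only the cache; memory is updated later,
       when the dirty block is evicted. */
    static void store_write_back(unsigned addr, unsigned char v) {
        block[addr % BLOCK_SIZE] = v;
        dirty = 1;
    }

    static void evict(unsigned block_base) {
        if (dirty)                   /* this is why misses can produce writes */
            memcpy(&memory[block_base], block, BLOCK_SIZE);
        dirty = 0;
    }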

Example: 1 KB Direct Mapped Cache

Assume a cache of 2^N bytes with 2^K blocks of block size 2^M bytes; N = M + K (number of blocks times block size). The address splits into a (32 - N)-bit cache tag, a K-bit cache index, and an M-bit block offset.
- The cache stores a tag, data, and a valid bit for each block.
- The cache index is used to select a block in SRAM (recall the BHT and BTB).
- The block tag is compared with the input tag.
- A word in the data block may be selected as the output.

[Diagram: a 1 KB direct-mapped cache with 32-byte blocks. The block address splits into a cache tag (example: 0x50, stored as part of the cache "state" along with the valid bit) and a 5-bit index (example: 0x01), followed by a 5-bit block offset (example: 0x00); the indexed row holds bytes 32-63 of the 1024-byte data array.]
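For this 1 KB cache with 32-byte blocks (M = 5, K = 5, N = 10, hence a 22-bit tag), a sketch of the address split; the example address is assembled from the slide's tag/index/offset values:

    #include <stdio.h>

    int main(void) {
        unsigned addr = (0x50u << 10) | (0x01u << 5) | 0x00u;
        unsigned offset = addr & 0x1F;          /* low M = 5 bits         */
        unsigned index  = (addr >> 5) & 0x1F;   /* next K = 5 bits        */
        unsigned tag    = addr >> 10;           /* upper 32 - N = 22 bits */
        printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
        return 0;   /* prints: tag=0x50 index=0x1 offset=0x0 */
    }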


Example: Set Associative Cache

Example: two-way set associative cache.
- The cache index selects a set of two blocks.
- The two tags in the set are compared to the input tag in parallel.
- Data is selected based on the tag comparison.

Set associative or direct mapped? Discussed later.

[Diagram: two cache ways, each with valid, tag, and data arrays indexed by the cache index; two comparators match the address tag against both ways, an OR of the comparator outputs produces Hit, and a mux (Sel1/Sel0) selects the hitting cache block.]

Figure B.9 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Figure B.8. The top diagram shows the actual data cache miss rates, while the bottom diagram shows the percentage in each category. (Space allows the graphs to show one extra cache size than can fit in Figure B.8.)

Memory Hierarchy Basics

Six basic cache optimizations:
- Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty.
- Larger total cache capacity to reduce miss rate: increases hit time, increases power consumption.
- Higher associativity: reduces conflict misses; increases hit time, increases power consumption.
- Higher number of cache levels: reduces overall memory access time.
- Giving priority to read misses over writes: reduces miss penalty.
- Avoiding address translation in cache indexing: reduces hit time.

Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 [Gee et al. 1993].


Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches smaller than the sum of the two 64 KB first-level caches make little sense, as reflected in the high miss rates. After 256 KB the single cache is within 10% of the global miss rates. The miss rate of a single-level cache versus size is plotted against the local miss rate and global miss rate of a second-level cache using a 32 KB first-level cache. The L2 caches (unified) were two-way set associative with LRU replacement. Each had split L1 instruction and data caches that were 64 KB two-way set associative with LRU replacement. The block size for both L1 and L2 caches was 64 bytes. Data were collected as in Figure B.4.

Figure B.15 Relative execution time by second-level cache size. The two bars are for different clock cycles for an L2 cache hit. The reference execution time of 1.00 is for an 8192 KB second-level cache with a 1-clock-cycle latency on a second-level hit. These data were collected the same way as in Figure B.14, using a simulator to imitate the Alpha 21264.

Prioritize Reads over Writes: Write Buffers

- With a write buffer, write-through might be acceptable.
- The write buffer should act as a bypass for subsequent loads.
- Optimization: prioritize reads. A read miss can either wait until the write buffer empties, or check the write-buffer contents for a conflict and proceed.

Avoid Address Translation

Figure B.16 Miss rate versus virtually addressed cache size of a program measured three ways: without process switches (uniprocess), with process switches using a process-identifier tag (PID), and with process switches but without PIDs (purge). PIDs increase the uniprocess absolute miss rate by 0.3% to 0.6% and save 0.6% to 4.3% over purging. Agarwal [1987] collected these statistics for the Ultrix operating system running on a VAX, assuming direct-mapped caches with a block size of 16 bytes. Note that the miss rate goes up from 128K to 256K. Such nonintuitive behavior can occur in caches because changing size changes the mapping of memory blocks onto cache blocks, which can change the conflict miss rate.

Page 7: Memory Performance Gap - Iowa State Universityclass.ece.iastate.edu/tyagi/cpre581/lectures/Lecture16.pdf · Memory Hierarchy Design Memory hierarchy design becomes more crucial with

7

Copyright © 2011, Elsevier Inc. All rights Reserved.

Figure B.17 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits.

Figure B.19 The logical program in its contiguous virtual address space is shown on the left. It consists of four pages, A, B, C, and D. The actual location of three of the blocks is in physical main memory and the other is located on the disk.

Figure B.23 The mapping of a virtual address to a physical address via a page table.

Figure B.24 Operation of the Opteron data TLB during address translation. The four steps of a TLB hit are shown as circled numbers. This TLB has 40 entries. Section B.5 describes the various protection and access fields of an Opteron page table entry.

Figure B.25 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 8 KB. The TLB is direct mapped with 256 entries. The L1 cache is a direct-mapped 8 KB, and the L2 cache is a direct-mapped 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 41 bits. The primary difference between this simple figure and a real cache is replication of pieces of this figure.

Figure B.26 The IA-32 segment descriptors are distinguished by bits in the attributes field. Base, limit, present, readable, and writable are all self-explanatory. D gives the default addressing size of the instructions: 16 bits or 32 bits. G gives the granularity of the segment limit: 0 means in bytes and 1 means in 4 KB pages. G is set to 1 when paging is turned on to set the size of the page tables. DPL means descriptor privilege level; this is checked against the code privilege level to see if the access will be allowed. Conforming says the code takes on the privilege level of the code being called rather than the privilege level of the caller; it is used for library routines. The expand-down field flips the check to let the base field be the high-water mark and the limit field be the low-water mark. As you might expect, this is used for stack segments that grow down. Word count controls the number of words copied from the current stack to the new stack on a call gate. The other two fields of the call gate descriptor, destination selector and destination offset, select the descriptor of the destination of the call and the offset into it, respectively. There are many more than these three segment descriptors in the IA-32 protection model.


Figure B.27 The mapping of an Opteron virtual address. The Opteron virtual memory implementation with four page table levels supports an effective physical address size of 40 bits. Each page table has 512 entries, so each level field is 9 bits wide. The AMD64 architecture document allows the virtual address size to grow from the current 48 bits to 64 bits, and the physical address size to grow from the current 40 bits to 52 bits.

Ten Advanced Optimizations

Small and simple first-level caches:
- Critical timing path: addressing the tag memory, then comparing tags, then selecting the correct set.
- Direct-mapped caches can overlap the tag compare and the transmission of data.
- Lower associativity reduces power because fewer cache lines are accessed.

L1 Size and Associativity

[Figure: access time versus cache size and associativity, for 64-byte blocks in a 40 nm process.]

[Figure: energy per read versus cache size and associativity.]

Way Prediction

To improve hit time, predict the way in order to pre-set the mux; a mis-prediction gives a longer hit time. Prediction accuracy:
- > 90% for two-way
- > 80% for four-way
- The I-cache has better accuracy than the D-cache.

First used on the MIPS R10000 in the mid-90s; used on the ARM Cortex-A8.

Extended to predict the block as well ("way selection"); this increases the mis-prediction penalty.

Pipelining Cache

Pipeline cache access to improve bandwidth. Examples:
- Pentium: 1 cycle
- Pentium Pro - Pentium III: 2 cycles
- Pentium 4 - Core i7: 4 cycles

Pipelining increases the branch mis-prediction penalty but makes it easier to increase associativity.


Nonblocking Caches

Allow hits before previous misses complete: "hit under miss" and "hit under multiple miss". The L2 cache must support this. In general, processors can hide the L1 miss penalty but not the L2 miss penalty.

Figure 2.5 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left) and 9 SPECFP (on the right) benchmarks. The data memory system modeled after the Intel i7 consists of a 32 KB L1 cache with a four-cycle access latency. The L2 cache (shared with instructions) is 256 KB with a 10-clock-cycle access latency. The L3 is 2 MB with a 36-cycle access latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for the integer benchmarks and 12.5% for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in little additional improvement.

Multibanked Caches

Organize the cache as independent banks to support simultaneous access:
- The ARM Cortex-A8 supports 1-4 banks for L2.
- The Intel i7 supports 4 banks for L1 and 8 banks for L2.

Interleave banks according to block address, as in the sketch below.
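A minimal sketch of sequential interleaving; the 4 banks and 64-byte blocks are illustrative choices, not from the slide:

    enum { BANKS = 4, BLOCK_BYTES = 64 };

    /* Block i of the address space lives in bank (i mod BANKS), so
       consecutive blocks land in different banks and can be accessed
       in parallel. */
    static unsigned bank_of(unsigned addr) {
        return (addr / BLOCK_BYTES) % BANKS;
    }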

Critical Word First, Early Restart

- Critical word first: request the missed word from memory first, and send it to the processor as soon as it arrives.
- Early restart: request words in normal order, and send the missed word to the processor as soon as it arrives.

The effectiveness of these strategies depends on the block size and on the likelihood of another access to the portion of the block that has not yet been fetched. The critical-word-first fetch order is sketched below.
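A sketch of the wrap-around fetch order used by critical word first (the function is illustrative, not from the slides):

    /* For an 8-word block where word 5 missed, fills 'order' with
       5,6,7,0,1,2,3,4: the critical word first, then wrap around. */
    static void fetch_order(int critical, int words_per_block, int order[]) {
        for (int i = 0; i < words_per_block; i++)
            order[i] = (critical + i) % words_per_block;
    }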

Merging Write Buffer

When storing to a block that is already pending in the write buffer, update the existing write-buffer entry instead of allocating a new one. This reduces stalls due to a full write buffer. Do not apply merging to I/O addresses.

[Figure: write-buffer contents without and with write merging.]

Loop Interchange

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

Exchanging the loops traverses x row by row, the order in which a C array is laid out in memory, improving spatial locality.

Blocking

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

[Figure: a snapshot of the three arrays x, y, and z when N = 6 and i = 1. The age of accesses to the array elements is indicated by shade: white means not yet touched, light means older accesses, and dark means newer accesses. Compared to Figure 2.9, elements of y and z are read repeatedly to calculate new elements of x. The variables i, j, and k are shown along the rows or columns used to access the arrays.]

/* After, blocked with block size B */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k + 1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

Note that x[i][j] now accumulates partial sums across the kk blocks, so x must be initialized to zero beforehand.

[Figure: the age of accesses to the arrays x, y, and z when B = 3. Note that, in contrast to Figure 2.8, a smaller number of elements is accessed.]

Real Example: Alpha 21264 Caches

- I-cache: 64 KB, 2-way set associative instruction cache.
- D-cache: 64 KB, 2-way set associative data cache.

Alpha 21264 Data Cache

D-cache: 64 KB, 2-way set associative.
- Uses the 48-bit virtual address to index the cache, and a tag from the physical address.
- 48-bit virtual address => 44-bit physical address.
- 512 sets (9-bit block index), 1024 blocks.
- Cache block size 64 bytes (6-bit offset).
- Tag has 44 - (9 + 6) = 29 bits.
- Write-back and write-allocate.

(We will study virtual-to-physical address translation later; the field widths are re-derived in the sketch below.)
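A sketch re-deriving the D-cache fields above from its geometry; all inputs are the slide's numbers:

    #include <stdio.h>

    int main(void) {
        unsigned size = 64 * 1024, ways = 2, block_bytes = 64, paddr_bits = 44;
        unsigned blocks = size / block_bytes;        /* 1024 blocks */
        unsigned sets = blocks / ways;               /* 512 sets    */
        unsigned offset_bits = 6;                    /* log2(64)    */
        unsigned index_bits = 9;                     /* log2(512)   */
        unsigned tag_bits = paddr_bits - index_bits - offset_bits;
        printf("%u blocks, %u sets, tag = %u bits\n", blocks, sets, tag_bits);
        return 0;   /* prints: 1024 blocks, 512 sets, tag = 29 bits */
    }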

Cache Performance

Three first-order factors: hit time, miss penalty, and miss rate.

AMAT (average memory access time) is a simple metric that balances the three factors:

AMAT = Hit time + Miss rate x Miss penalty

Example: hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 4%; AMAT = 1 + 0.04 x 100 = 5 cycles.
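The formula and the example as a one-line sketch:

    /* AMAT = hit time + miss rate * miss penalty */
    static double amat(double hit, double miss_rate, double penalty) {
        return hit + miss_rate * penalty;
    }
    /* amat(1.0, 0.04, 100.0) = 1 + 4 = 5 cycles */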

Cache Performance: Impact on CPI

CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time

CPU time = IC x (CPI_execution + Memory stall cycles per instruction) x Cycle time

Note that cycles spent on cache hits are usually counted into the execution cycles, and a better AMAT may not always mean better performance. The second formula is sketched below.
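The second formula as code; the example values in the comment are illustrative, not from the slide:

    /* CPU time = IC * (CPI_execution + memory stalls per instruction) * cycle time */
    static double cpu_time(double ic, double cpi_exec,
                           double mem_stalls_per_inst, double cycle_time) {
        return ic * (cpi_exec + mem_stalls_per_inst) * cycle_time;
    }
    /* e.g. 10^9 instructions, CPI_execution = 1.0, 0.04 misses/inst with a
       100-cycle penalty (4 stall cycles/inst), 1 ns cycle:
       cpu_time(1e9, 1.0, 4.0, 1e-9) = 5.0 seconds */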

Example: Evaluating Split Inst/Data Caches

Unified versus split instruction/data caches (Harvard architecture). Assume 36% of operations are data ops, so 74% of accesses come from instructions (1.0/1.36).
- 16 KB split I and D: instruction miss rate = 0.4%, data miss rate = 11.4%, overall 3.24%.
- 32 KB unified: aggregate miss rate = 3.18%.

Which design is better? Let hit time = 1 and miss penalty = 100, and note that a data hit incurs one extra stall in the unified cache (it has only one port):

AMAT(split)   = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) ≈ 4.26
AMAT(unified) = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) ≈ 4.44

[Diagram: a processor with split L1 I- and D-caches backed by a unified L2, versus a processor with a unified L1 backed by a unified L2.]
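Reproducing the comparison numerically (a sketch; the extra "+1" models the single-port stall on unified data accesses):

    #include <stdio.h>

    int main(void) {
        double f_inst = 0.74, f_data = 0.26, penalty = 100.0;
        double split = f_inst * (1 + 0.004 * penalty)
                     + f_data * (1 + 0.114 * penalty);
        double unified = f_inst * (1 + 0.0318 * penalty)
                       + f_data * (1 + 1 + 0.0318 * penalty);
        printf("split: %.2f  unified: %.2f\n", split, unified);  /* 4.26, 4.44 */
        return 0;
    }

The split design wins despite its slightly higher aggregate miss rate, because the unified cache pays a structural stall on every data access.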

Disadvantage of Set Associative Cache

Compare an n-way set associative cache with a direct-mapped cache:
- n comparators versus 1 comparator.
- Extra MUX delay for the data.
- Data comes after the hit/miss decision and set selection.

In a direct-mapped cache, the cache block is available before the hit/miss decision: use the data assuming the access is a hit, and recover if it turns out otherwise.

[Diagram: the same two-way set associative organization as before, with parallel tag comparators and a way-select mux on the data path.]

Cache and Out-of-order Processors

As for performance, it is very difficult to define the miss penalty:
- Execution partially overlaps with memory accesses.
- Memory accesses overlap with each other.

Cache hit time may increase without increasing cycle time (multi-cycle latency). AMAT is no longer a good indication; the practical solution is to run simulation.

Cache and Out-of-order Processors

As for design:
- Use a non-blocking cache to avoid a full stall, waking up the load instruction upon the return of data. This may tolerate L1 misses but not misses to DRAM, and allows overlapping of memory accesses.
- Cache hit-or-miss uncertainty complicates instruction scheduling: a mini-flush (replay) avoids a high mis-scheduling penalty.