memory hierarchies

Memory hierarchies

An introduction to the use of cache memories as a technique for exploiting

the ‘locality-of-reference’ principle

Recall system diagram

Mainmemory

I/O I/O I/O I/O I/O

system bus

CPU

SRAM versus DRAM

• Random Access Memory uses two distinct technologies: ‘static’ and ‘dynamic’

• Static RAM is much faster

• Dynamic RAM is much cheaper

• Static RAM offers ‘persistent’ storage

• Dynamic RAM provides ‘volatile’ storage

• Our workstations have 1GB of DRAM, but only 512KB of SRAM (ratio: 2-million to 1)

Data access time

• The time-period between the supplying of the data access signal and the output (or acceptance) of the data by the addressed subsystem

• FETCH-OPERATION: cpu places address for a memory-location on the system bus, and memory-system responds by placing memory-location’s data on the system bus

Comparative access timings

• Processor registers: 5 nanoseconds

• SRAM memory: 15 nanoseconds

• DRAM memory: 60 nanoseconds

• Magnetic disk: 10 milliseconds

• Optical disk: (even slower)

System design decisions

• How much memory?

• How fast?

• How expensive?

• Tradeoffs and compromises

• Modern systems employ a ‘hybrid’ design in which small, fast, expensive SRAM is supplemented by larger, slower, cheaper DRAM

Memory hierarchy

CPU

CACHE MEMORY

MAIN MEMORY

‘word’ transfers

‘burst’ transfers

Instructions versus Data

• Modern system designs frequently use a pair of separate cache memories, one for storing processor instructions and another for storing the program’s data

• Why is this a good idea?

• Let’s examine the way a cache works

Fetching program-instructions

• CPU fetches its next program-instruction by placing the value from register EIP on the system bus and issuing a ‘read’ signal

• If that instruction has not previously been fetched, it will not be available in the ‘fast’ cache memory, so it must be gotten from the ‘slower’ main memory; however, it will be ‘saved’ in the cache in case it’s needed again very soon (e.g., in a program loop)

Most programs use ‘loops’

// EXAMPLE:

int sum = 0, i = 0;

do {

sum += array[ i ];

++ i;

}

while ( i < 64 );

Assembly language loop

movl $0, %eax # initialize accumulator movl $0, %esi # initialize array-indexagain: addl array(%esi, 4), %eax # add next int

incl %esi # increment array-indexcmp $64, %esi # check for exit-conditionjae finis # index is beyond boundsjmp again # else do another addition

finis:

Benefits of cache-mechanism

• During the first pass through the loop, all of the loop-instructions must be fetched from the (slow) DRAM memory

• But loops are typically short – maybe only a few dozen bytes – and multiple bytes can be fetched together in a single ‘burst’

• Thus subsequent passes through the loop can fetch all of these same instructions from the (fast) SRAM memory

‘Locality-of-Reference’

• The ‘locality-of-reference’ principle is says that, with most computer programs, when the CPU fetches an instruction to execute, it is very likely that the next instruction it fetches will be located at a nearly address

Accessing data-operands

• Accessing a program’s data differs from accessing its instructions in that the data consists of ‘variables’ as well ‘constants’

• Instructions are constant, but data may be modified (i.e., may ‘write’ as well as ‘read’)

• This introduces the problem known as ‘cache coherency’ – if a new value gets assigned to a variable, its cached value will be modified, but what will happen to its original memory-value?

Cache ‘write’ policies

• ‘Write-Through’ policy: a modified variable in the cache is immediately copied back to the original memory-location, to insure that the memory and the cache are kept ‘consistent’ (synchronized)

• ‘Write-Back’ policy: a modified variable in the cache is not immediately copied back to its original memory-location – some values nearby might also get changed soon, in which case all adjacent changes could be done in one ‘burst’

Locality-of-reference for data

// EXAMPLE (from array-processing):

int price[64], int quantity[64], revenue[64];

for (int i = 0; i < 64; i++)

revenue[ i ] = price[ i ] * quantity[ i ];

Cache ‘hit’ or cache ‘miss’

• When the CPU accesses an address, it first looks in the (fast) cache to see if the address’s value is perhaps already there

• In case the CPU finds that address-value is in the cache, this is called a ‘cache hit’

• Otherwise, if the CPU finds that the value it wants to access is not presently in the cache, this is called a ‘cache miss’

Cache-Line Replacement

• Small-size cache-memory quickly fills up

• So some ‘stale’ items need to be removed in order to make room for the ‘fresh’ items

• Various policies exist for ‘replacement’

• But ‘write-back’ of any data in the cache which is ‘inconsistent’ with data in memory can safely be deferred until it’s time for the cached data to be ‘replaced’

Performance impact

Access frequency (proportion of cache-hits)

Average access time

x

y

y = 75 – 60x

15ns

60ns

75ns

0 1

Examples:If x = .90, then y = 21nsIf x = .95, then y = 18ns

Control Register 0

• Privileged software (e.g., a kernel module) can disable Pentium’s cache-mechanism (by using some inline assembly language)

asm(“ movl %cr0, %eax “); asm(“ orl $0x60000000, %eax “); asm(“ movl %eax, %cr0 “);

PG CD NW AM WP NE ET TS EM MP PE

31 30 29 18 16 5 4 3 2 1 0

In-class demonstration

• Install ‘nocache.c’ module to turn off cache

• Execute ‘cr0.cpp’ to verify cache disabled

• Run ‘sieve.cpp’ using the ‘time’ command to measure performance w/o caching

• Remove ‘nocache’ to re-enable the cache

• Run ‘sieve.cpp’ again to measure speedup

• Run ‘fsieve.cpp’ for comparing the speeds of memory-access versus disk-access

In-class exercises

• Try modifying the ‘sieve.cpp’ source-code so that each array-element is stored at an address which is in a different cache-line (cache-line size on Pentium is 32 bytes)

• Try modifying the ‘fsieve.cpp’ source-code to omit the file’s ‘O_SYNC’ flag; also try shrinking the ‘m’ parameter so that more than one element is stored in a disk-sector (size of each disk-sector is 512 bytes)

memory hierarchies

Documents