CSE 4201 Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)


Page 1: Memory Hierarchy Design Ch. 5 - York University

CSE 4201

Memory Hierarchy Design, Ch. 5

(Hennessy and Patterson)

Page 2:

Memory Hierarchy

We need a huge amount of cheap and fast memory.

Memory is either fast or cheap, never both. Do as politicians do: fake it.

Give a little bit of fast memory and tons of cheap memory.

As technology progresses: cheap becomes cheaper rapidly; fast becomes faster rapidly; cheap does not become fast, nor fast cheap.

Page 3:

Since the 80's...

Processors became 10,000 times faster; memory became 10 times faster.

Back then, cache was for high-performance systems. Now we need multiple levels of cache.

Page 4:

Addressing Scheme

[Figure: the address is split into Tag, Cache Index, and Block Offset; the index maps to a set of blocks, and the tag is matched against each block in the set.]

Page 5:

Set-Associative

(Set Address) = (Block Address) MOD K, where K is the number of sets in the cache.

(Block Address) = (Address) DIV b, where b is the number of bytes in a block.

(Block Offset) = (Address) MOD b.

A set has n blocks (n-way associative). Every block has data and an address (tag). If K = 1, fully associative; if n = 1, direct mapped.
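The three formulas above can be sketched directly (a minimal illustration; the 32-byte blocks and 8 sets in the example call are arbitrary choices, not from the slides):

```python
def split_address(addr, block_bytes, num_sets):
    """Split a byte address into (tag, set index, block offset)."""
    block_offset = addr % block_bytes      # (Address) MOD b
    block_address = addr // block_bytes    # (Address) DIV b
    set_index = block_address % num_sets   # (Block Address) MOD K
    tag = block_address // num_sets        # the remaining high-order bits
    return tag, set_index, block_offset

# Example: 32-byte blocks, 8 sets
tag, idx, off = split_address(0x1234, 32, 8)   # → (18, 1, 20)
```

With powers of two, the divisions and remainders reduce to simply slicing bit fields out of the address, which is why caches use power-of-two sizes.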

Page 6:

Victim Selection

Which block to expel to make room for a new entry: least recently used (LRU), random, or first-in first-out (FIFO).

All work more or less the same. LRU is rarely exact, almost always approximated. The choice has little effect on big caches, about a 10% difference for smaller ones.

Page 7:

What Happens on a Write?

Writes are less common than reads: all instruction fetches are reads, stores are about 10% of instructions, loads about 25%. So we have 10 writes for every 125 reads. Better take good care of the reads.

Writes are also costlier than reads: we write 1-8 bytes at a time into a block typically 32-64 bytes long, and we have issues with consistency.

Page 8:

Write Through, Back

Write through: no need to write back on a cache miss, and no need for a dirty bit.

Write back: less bus traffic.

Page 9:

Write Through, Back

Write through pairs with no-write-allocate: allocate a cache block only on reads, so multiple writes without an intervening read do not disturb the cache.

Write back pairs with write-allocate: makes subsequent reads fast.

Page 10:

AMD Opteron Cache

L1 data cache: 64 KB, 64-byte blocks, so 1 K blocks; 2-way associative, so 512 sets. LRU replacement. Write back, write allocate.

Nominally 64-bit addresses: 48 bits virtual, 40 bits physical.

Page 11:

AMD Opteron Cache

Various sizes: physical address 40 bits; block address 34 bits; block offset 6 bits; cache index 9 bits; tag 25 bits. Set size: 2 blocks (2-way set associative). Access takes 2 clock cycles (2 stalls on a hazard).
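A quick check shows these figures are mutually consistent (a sketch derived from the numbers quoted above):

```python
cache_bytes = 64 * 1024   # 64 KB L1 data cache
block_bytes = 64          # 64-byte blocks
ways        = 2           # 2-way set associative
phys_bits   = 40          # physical address width

num_blocks  = cache_bytes // block_bytes            # 1024 blocks
num_sets    = num_blocks // ways                    # 512 sets
offset_bits = block_bytes.bit_length() - 1          # log2(64)  = 6
index_bits  = num_sets.bit_length() - 1             # log2(512) = 9
tag_bits    = phys_bits - index_bits - offset_bits  # 40 - 9 - 6 = 25
```

The block address (34 bits) is just the physical address minus the 6 offset bits, and it splits into the 9-bit index and the 25-bit tag.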

Page 12:

AMD Opteron Cache

Steps of a cache hit: the 40 bits are split into tag (25), index (9), and offset (6). A set (2 blocks) is retrieved using the index. The tags are compared and the valid bits checked. The correct block is selected. The 3 most significant bits of the offset select the word to be read or written. Finally, the LRU bits are updated.

Page 13:

AMD Opteron Cache

Steps of a cache miss: same as above up until we know it is a miss. Identify a victim (LRU). If the victim is dirty, write it back. On a read, stall until the next level responds; on a write, continue (provisionally).
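The hit and miss steps can be sketched as a toy simulator (an illustration only: 2-way, exact LRU, write-back/write-allocate, with small hypothetical sizes, not the Opteron's real control logic):

```python
class SetAssocCache:
    def __init__(self, num_sets=4, ways=2, block_bytes=16):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        # each set is a list of blocks ordered most- to least-recently used
        self.sets = [[] for _ in range(num_sets)]
        self.hits = self.misses = self.writebacks = 0

    def access(self, addr, is_write=False):
        block_addr = addr // self.block_bytes
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        blocks = self.sets[index]
        for blk in blocks:
            if blk['tag'] == tag:                 # tag match (valid implied)
                self.hits += 1
                blocks.remove(blk)
                blocks.insert(0, blk)             # update LRU order
                blk['dirty'] |= is_write
                return 'hit'
        self.misses += 1
        if len(blocks) == self.ways:              # set full: pick LRU victim
            victim = blocks.pop()
            if victim['dirty']:
                self.writebacks += 1              # dirty victim: write back
        blocks.insert(0, {'tag': tag, 'dirty': is_write})  # allocate
        return 'miss'
```

For example, two accesses to addresses 0 and 4 touch the same 16-byte block: the first misses (compulsory), the second hits.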

Page 14:

Miss Rate

Not your elementary school teacher. The three Cs:

Compulsory (the first access to a block), Capacity (the cache is not big enough to hold all the blocks the program needs), Conflict (blocks have to compete for the same set).

We may add one more: Coherency.

Page 15:

Sneaky Miss Rate

Miss rate can be misleading: it is defined as misses per (1000) accesses, but our delay is related to misses per instruction.

Misses per instruction is the miss rate times the memory accesses per instruction.

Even this can be misleading: what we really want to reduce is the delay.

Page 16:

What is the delay

Avg. Mem. Access Time = Hit Time + Miss Rate × Miss Penalty

We do better by decreasing any of the three quantities on the right-hand side.

Unfortunately, these always involve trade-offs, and the formula is just an approximation of the effect on execution time.

Page 17:

Complications...

What exactly is a miss in speculative execution?

How much does a miss cost under dynamic scheduling?

Under multi-threading? What if we allow a miss under a miss?

Page 18:

Example

Effective access time for a 16 KB + 16 KB split cache.

Misses per 1000 instructions: 3.82 (instruction cache), 40.9 (data cache).

Mix: 36% of instructions are loads/stores. Hit: 1 cycle; miss penalty: 100 cycles.

Page 19:

Example

Instruction miss rate: 3.82/1000 = 0.00382 per access.

Data miss rate: 0.0409/0.36 = 0.114 per access.

Percentage of references that are instructions: 100/136 = 74%.

Avg. Mem. Acc. Time: 74% × (1 + 0.004 × 100) + 26% × (1 + 0.114 × 100) ≈ 4.3 cycles. (The 1-cycle hit time applies to data accesses as well.)
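The whole calculation, without intermediate rounding, can be written out as follows (note the 1-cycle hit time applies to data accesses as well as instruction fetches):

```python
miss_per_1000_instr_i = 3.82      # instruction cache
miss_per_1000_instr_d = 40.9      # data cache
mem_refs_per_instr_d  = 0.36      # 36% of instructions are loads/stores
hit_time, miss_penalty = 1, 100   # cycles

# convert misses per 1000 instructions to miss rate per access
miss_rate_i = miss_per_1000_instr_i / 1000   # one fetch per instruction
miss_rate_d = miss_per_1000_instr_d / 1000 / mem_refs_per_instr_d

# fraction of all memory accesses that are instruction fetches: 100/136
frac_i = 1.0 / (1.0 + mem_refs_per_instr_d)
frac_d = mem_refs_per_instr_d / (1.0 + mem_refs_per_instr_d)

amat = (frac_i * (hit_time + miss_rate_i * miss_penalty)
        + frac_d * (hit_time + miss_rate_d * miss_penalty))
# ≈ 4.29 cycles with these numbers
```

The unrounded result (about 4.29 cycles) differs slightly from the hand calculation with rounded rates, a reminder that these averages are approximations anyway.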

Page 20:

Miss Penalty under Dynamic Execution

Is it the full latency? Is it the "exposed" latency? What about the latency due to contention by speculative instructions? Any form of latency has the same problem.

Simple (simplistic) solution: find which instruction did not commit in time and attribute the stall to it.

Page 21:

How to Increase Performance

Larger cache: obviously reduces misses; increases cost, power, and hit time.

Larger block: decreases compulsory (initial) misses by better exploiting spatial locality; decreases the number of blocks; increases miss penalty and bus traffic.

Page 22:

How to Increase Performance

Higher associativity: reduces conflict misses; increases hit time, silicon area, and power consumption.

Multilevel cache: reduces hit time and miss penalty; increases cost and power.

Give priority to read misses: let reads jump the queue.

Overlap TLB and cache read...

Page 23:

TLB and Cache

The cache understands physical addresses, so we have to consult the TLB to convert a virtual address to a physical one. How about if we overlap the two?

When is such a thing possible?

Page 24:

What is the Trick?

The TLB is a small cache that associates a (virtual) page number with a (physical) frame number.

The page offset and the frame offset are the same and need no translation.

If the page offset is enough to index the set in the cache, we do not need any bit from the frame number: we can retrieve the set while the TLB does the translation, and when the TLB is done we compare the tags.

Page 25:

This is the Trick

[Figure: the same addressing diagram, with the virtual/physical split marked — the Block Offset and Cache Index come from the untranslated page offset, so the set is retrieved while the TLB translates the page number; the tag comparison then uses the physical frame number.]

Page 26:

Disadvantages

Cache size ≤ Page size × associativity. Usually we want a "medium" size page and a large cache.

There are ways to deviate from this rule with extra hardware.
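The rule follows from counting bits: the index and block-offset bits must all come from the untranslated page offset. A sketch, assuming a hypothetical 4 KB page and 64-byte blocks:

```python
page_bytes  = 4096   # hypothetical 4 KB page -> 12 untranslated offset bits
ways        = 2
block_bytes = 64

# cache size is capped at (page size) x (associativity)
max_cache_bytes = page_bytes * ways                  # 8 KB for a 2-way cache
num_sets = max_cache_bytes // (block_bytes * ways)   # 64 sets

page_offset_bits = page_bytes.bit_length() - 1       # 12
offset_bits      = block_bytes.bit_length() - 1      # 6
index_bits       = num_sets.bit_length() - 1         # 6
# the whole cache index fits inside the untranslated page offset:
assert offset_bits + index_bits <= page_offset_bits
```

Raising the associativity is what lets the cache grow without needing translated bits for the index, which is why this trick pushes designs toward higher associativity.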

Page 27:

11 Advanced Optimizations

We organize them in 5 groups: reduce hit time, increase bandwidth, reduce miss penalty, reduce miss rate, and prefetching.

Page 28:

Small is Beautiful

Small and simple caches are faster: reduce the size, reduce the associativity, rely on the L2 cache.

L1 cache sizes do not change much with technology.

Page 29:

Way Prediction

Tag comparison is costly. Store, along with the data, prediction bits for the next access; the index is augmented by the predictor bits. The data is sent to the CPU while we check the tags; if the tags do not match, we send an "Oops!".

The Pentium 4 uses it.

Page 30:

Example

Prediction hit rate 85% (typical). Correct prediction: 1 cycle; misprediction: 3 cycles; without prediction: 2 cycles. 0.85 × 1 + 0.15 × 3 = 1.3 < 2.

Page 31:

Trace Cache

Seems so devious... it is almost Harry-Potterish. The cache contains a dynamic trace (the sequence of instructions as they are executed), so branch prediction is folded into the cache.

The Pentium 4 uses it for its micro-operations.

Page 32:

Cache Pipeline

Most caches take more than 1 cycle, and pipelining is tried and true: embed the cache pipeline into the CPU pipeline.

The Pentium 4 takes 4 cycles (despite way prediction, etc.).

Page 33:

Non-Blocking Miss

Allow a hit under a miss, or multiple misses under a miss.

FP-intensive programs benefit from multiple misses under a miss.

Dynamic execution benefits from it.

Page 34:

Multi-Banked Cache

Multi-bank (a.k.a. interleaved) memories were always popular.

Best suited to the L2 cache: allows each bank to be smaller and to work independently, increasing bandwidth.

The AMD Opteron has 2 banks; the Sun T1 has 4.

Page 35:

Critical Word First

Critical word first: if the block is transmitted over multiple cycles, send the word we asked for first.

Early restart: do not wait for the whole block to arrive.

Page 36:

Merging Write Buffer

On a write miss, the block might still be sitting in the (victim) write-back buffer; merging the write into that entry saves a memory access.

Similar idea to a victim buffer (virtual memory).

Page 37:

Compiler Re-ordering

Try to access arrays the way they are laid out in the cache.

This is the magic behind fast matrix multiplication (blocking): break the matrices into pieces that fit comfortably in the cache.
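The blocking idea can be sketched in a few lines (a pure-Python illustration of the loop structure only; real code would use a tuned BLAS, and the tile size bs would be chosen to fit the cache):

```python
def blocked_matmul(A, B, n, bs=4):
    """Multiply two n x n matrices (lists of lists) in bs x bs tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):            # tile row of C
        for jj in range(0, n, bs):        # tile column of C
            for kk in range(0, n, bs):    # tile of the inner products
                # work within one tile: its rows stay resident in cache
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The three outer loops walk over tiles; the three inner loops do an ordinary multiply on data small enough to stay cached, so each element of A and B is brought into the cache roughly n/bs times instead of n times.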

Page 38:

Prefetching

Hardware: if two misses occur in the same page, prefetch. Most designs prefetch instructions from the instruction cache; the Opteron and Pentium 4 prefetch data too.

Compiler: insert special "prefetch" instructions. Needs a non-blocking cache; increases traffic.

Page 39:

Memory Technologies

SRAM (static RAM): big transistors optimized for speed.

DRAM: cheap capacitors optimized for density; reads destructively; needs refreshing.

Page 40:

DRAMs Rule the Desktop

Memory size and CPU speed grow at the same pace: it has always taken about a second to scan the whole memory.

Through most of their history, DRAMs increased 4-fold in capacity every three years; now they increase 4-fold every four years. Speed increases about 5% per year.

Page 41:

DRAM Organization

[Figure: DRAM organization — a memory array with a row decoder, a column decoder, sense amps, an address buffer, and data I/O.]

Page 42:

RAS and CAS

Row Address Strobe, Column Address Strobe. The RAS goes first and a whole row is copied out; the CAS then selects the bit or bits.
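The address is thus presented in two halves over the same pins. A minimal sketch, assuming a hypothetical 1024 × 1024 (1 Mbit) array:

```python
ROWS, COLS = 1024, 1024   # hypothetical 1 Mbit DRAM array

def dram_address(bit_index):
    """Split a bit index into the row (RAS) and column (CAS) addresses."""
    row = bit_index // COLS   # sent first, with RAS: the whole row is copied out
    col = bit_index % COLS    # sent second, with CAS: selects the bit
    return row, col

row, col = dram_address(5000)   # → (4, 904)
```

Multiplexing row and column over the same pins halves the pin count, which is the historical reason for the RAS/CAS protocol.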

Page 43:

Improving RAM

Fast Page Mode: increment the CAS several times with the same RAS.

Make use of the modularity available: memory is organized in blocks of 1-4 Mbits each for manufacturing reasons, which are naturally interleaved.

Page 44:

SDRAM, DDR

Synchronous DRAM (SDRAM): shares the clock with the CPU, so there is no synchronization overhead in communication.

DDR (Double Data Rate): a fast front end with heavily interleaved back ends.

Page 45:

Virtual Memory

Expand RAM to disk (not that useful today). Allow multiple processes to share the physical memory. Allow arbitrary mapping: file I/O, shared memory, dynamic libraries, etc.

Critical to security.

Page 46:

Security

Virtual memory is handled through the kernel; page tables can be manipulated only in monitor mode.

A process does not have the means to access the space of another process.

Page 47:

However...

A kernel is a huge program. Huge programs have bugs. Most bugs cause the system to crash; a few of them are exploits.

Page 48:

A better way...

Use virtual machines: much smaller, fewer bugs, one extra level of protection.

VMs have other advantages as well: sharing a computer, cloud computing, and migrating a live program to different hardware.

Page 49:

VMM

Virtual Machine Monitors (hypervisors) allow a guest OS to run efficiently as a process on a host OS: user-level code runs natively, while system calls are trapped and emulated.

The VMM mediates between the guest OS and the hardware on the host: network connection, USB device management, filesystem and state, etc.

Page 50:

ISA Support

An ISA supporting virtualization is called virtualizable. Virtualization is a new idea (geologically speaking).

Attempts by the guest to execute privileged instructions result in traps. The problem is that not all relevant instructions result in traps, and handling virtual memory is tricky.

Page 51:

Virtual Memory for Virtual Machines

Normally we distinguish between virtual memory and physical memory. Now we have an intermediate level: real memory.

The guest OS maps virtual to real; the VMM maps real to physical.

Page 52:

Shadow Page Table

Two step process Too slow Interferes with h/w assisted virtual memory

Shadow tables do it in one shot But this means guest OS cannot manage the page

tables of its own processes TLBs must have PID tags and/or be flushed on

context switch

IBM ('70s) had one more level of indirection

Page 53:

Virtualized I/O

There are far too many devices and drivers to handle, and I/O happens with the mediation of hardware, so it would be too slow to handle with emulation.

Solution: a generic device for each type of I/O. Network: time-shared or NATed.

Page 54:

Example: Xen

Instead of trying to emulate everything just to trick the guest, allow small modifications to the guest to keep things simple.

This is called paravirtualization, and Xen is the most popular example (VMware is another).

Page 55:

The Tricks of Xen

Augment kernel E.g. 1% of Linux is modified

Uses the protection levels of x86 Xen at level 0 (highest), guest OS at 1, apps at 3

Wraps I/O devices in special virtual machines (driver domains) and talks to them with page remapping

Page 56:

VMM and ISA

Designers of ISAs were cheapskates! To save a couple of bits, they had the same instruction behave differently in monitor mode and user mode: POPF (pop flags) silently ignores the privileged flags in user mode instead of trapping.

The 70's-technology IBM 370 is still the gold standard.

Page 57:

Cache and I/O

Should we do I/O through the cache? We get the data immediately with perfect consistency, but it slows down the processor and pollutes the cache.

I/O directly with memory is most popular. It works well with write through (no stale data); otherwise we can mark pages as non-cacheable, flush the cache, or send cache invalidations.

Page 58:

Fallacy: Predicting cache performance

Miss rates vary by a factor of 10,000 or more, and there is a tremendous difference between instruction and data miss rates.

Page 59:

RAMBUS promises

RAMBUS had 8 times the bandwidth of the competition, but overall performance was only 0-15% faster, while cost was 2-3 times higher (a 20% larger die).

The reason: most of the traffic is filtered by the L2 cache.