CSE 4201 Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)


Page 1: Memory Hierarchy Design Ch. 5 - York University

CSE 4201

Memory Hierarchy Design, Ch. 5

(Hennessy and Patterson)

Page 2:

Memory Hierarchy

We need a huge amount of cheap and fast memory.

Memory is either fast or cheap, never both. Do as politicians do: fake it.

Give a little bit of fast memory and tons of cheap memory.

As technology progresses: cheap becomes cheaper rapidly; fast becomes faster rapidly; cheap does not become fast, nor fast cheap.

Page 3:

Since the 80's...

Processors became 10,000 times faster; memory became 10 times faster.

Back then, cache was for high-performance systems. Now we need multiple levels of cache.

Page 4:

Addressing Scheme

[Figure: the address is split into Tag, Cache Index, and Block Offset; the index maps to a set of blocks, and the tag is matched against each block in the set.]

Page 5:

Set-Associative

(Set Address) = (Block Address) MOD K, where K is the number of sets in the cache.

(Block Address) = (Address) DIV b, where b is the number of bytes in a block.

(Block Offset) = (Address) MOD b.

A set has n blocks (n-way associative). Every block has data and an address (tag). If K = 1, fully associative; if n = 1, direct mapped.
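The three formulas above can be sketched directly (a minimal illustration; the 32-byte blocks and 8 sets in the example call are arbitrary choices, not from the slides):

```python
def split_address(addr, block_bytes, num_sets):
    """Split a byte address into (tag, set index, block offset)."""
    block_offset = addr % block_bytes      # (Address) MOD b
    block_address = addr // block_bytes    # (Address) DIV b
    set_index = block_address % num_sets   # (Block Address) MOD K
    tag = block_address // num_sets        # the remaining high-order bits
    return tag, set_index, block_offset

# Example: 32-byte blocks, 8 sets
tag, idx, off = split_address(0x1234, 32, 8)   # → (18, 1, 20)
```

With powers of two, the divisions and remainders reduce to simply slicing bit fields out of the address, which is why caches use power-of-two sizes.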

Page 6:

Victim Selection

Which block to expel to make room for a new entry: least recently used (LRU), random, or first-in first-out (FIFO).

All work more or less the same. LRU is rarely exact, almost always approximated. The choice has little effect on big caches, about a 10% difference for smaller ones.

Page 7:

What Happens on a Write?

Writes are less common than reads: all instruction fetches are reads, stores are about 10% of instructions, loads about 25%. So we have 10 writes for every 125 reads. Better take good care of the reads.

Writes are also costlier than reads: we write 1-8 bytes at a time into a block typically 32-64 bytes long, and we have issues with consistency.

Page 8:

Write Through, Back

Write through: no need to write back on a cache miss, and no need for a dirty bit.

Write back: less bus traffic.

Page 9:

Write Through, Back

Write through pairs with no-write-allocate: allocate a cache block only on reads, so multiple writes without an intervening read do not disturb the cache.

Write back pairs with write-allocate: makes subsequent reads fast.

Page 10:

AMD Opteron Cache

L1 data cache: 64 KB, 64-byte blocks, so 1 K blocks; 2-way associative, so 512 sets. LRU replacement. Write back, write allocate.

Nominally 64-bit addresses: 48 bits virtual, 40 bits physical.

Page 11:

AMD Opteron Cache

Various sizes: physical address 40 bits; block address 34 bits; block offset 6 bits; cache index 9 bits; tag 25 bits. Set size: 2 blocks (2-way set associative). Access takes 2 clock cycles (2 stalls on a hazard).
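A quick check shows these figures are mutually consistent (a sketch derived from the numbers quoted above):

```python
cache_bytes = 64 * 1024   # 64 KB L1 data cache
block_bytes = 64          # 64-byte blocks
ways        = 2           # 2-way set associative
phys_bits   = 40          # physical address width

num_blocks  = cache_bytes // block_bytes            # 1024 blocks
num_sets    = num_blocks // ways                    # 512 sets
offset_bits = block_bytes.bit_length() - 1          # log2(64)  = 6
index_bits  = num_sets.bit_length() - 1             # log2(512) = 9
tag_bits    = phys_bits - index_bits - offset_bits  # 40 - 9 - 6 = 25
```

The block address (34 bits) is just the physical address minus the 6 offset bits, and it splits into the 9-bit index and the 25-bit tag.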

Page 12:

AMD Opteron Cache

Steps of a cache hit: the 40 bits are split into tag (25), index (9), and offset (6). A set (2 blocks) is retrieved using the index. The tags are compared and the valid bits checked. The correct block is selected. The 3 most significant bits of the offset select the word to be read or written. Finally, the LRU bits are updated.

Page 13:

AMD Opteron Cache

Steps of a cache miss: same as above up until we know it is a miss. Identify a victim (LRU). If the victim is dirty, write it back. On a read, stall until the next level responds; on a write, continue (provisionally).
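The hit and miss steps can be sketched as a toy simulator (an illustration only: 2-way, exact LRU, write-back/write-allocate, with small hypothetical sizes, not the Opteron's real control logic):

```python
class SetAssocCache:
    def __init__(self, num_sets=4, ways=2, block_bytes=16):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        # each set is a list of blocks ordered most- to least-recently used
        self.sets = [[] for _ in range(num_sets)]
        self.hits = self.misses = self.writebacks = 0

    def access(self, addr, is_write=False):
        block_addr = addr // self.block_bytes
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        blocks = self.sets[index]
        for blk in blocks:
            if blk['tag'] == tag:                 # tag match (valid implied)
                self.hits += 1
                blocks.remove(blk)
                blocks.insert(0, blk)             # update LRU order
                blk['dirty'] |= is_write
                return 'hit'
        self.misses += 1
        if len(blocks) == self.ways:              # set full: pick LRU victim
            victim = blocks.pop()
            if victim['dirty']:
                self.writebacks += 1              # dirty victim: write back
        blocks.insert(0, {'tag': tag, 'dirty': is_write})  # allocate
        return 'miss'
```

For example, two accesses to addresses 0 and 4 touch the same 16-byte block: the first misses (compulsory), the second hits.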

Page 14:

Miss Rate

Not your elementary school teacher. The three Cs:

Compulsory (the first access to a block), Capacity (the cache is not big enough to hold all the blocks the program needs), Conflict (blocks have to compete for the same set).

We may add one more: Coherency.

Page 15:

Sneaky Miss Rate

Miss rate can be misleading: it is defined as misses per (1000) accesses, but our delay is related to misses per instruction.

Misses per instruction is the miss rate times the memory accesses per instruction.

Even this can be misleading: what we really want to reduce is the delay.

Page 16:

What is the delay

Avg. Mem. Access Time = Hit Time + Miss Rate × Miss Penalty

We do better by decreasing any of the three quantities on the right-hand side.

Unfortunately, these always involve trade-offs, and the formula is just an approximation of the effect on execution time.

Page 17:

Complications...

What exactly is a miss in speculative execution?

How much does a miss cost under dynamic scheduling?

Under multi-threading? What if we allow a miss under a miss?

Page 18:

Example

Effective access time for a 16 KB + 16 KB split cache.

Misses per 1000 instructions: 3.82 (instruction cache), 40.9 (data cache).

Mix: 36% of instructions are loads/stores. Hit: 1 cycle; miss penalty: 100 cycles.

Page 19:

Example

Instruction miss rate: 3.82/1000 = 0.00382 per access.

Data miss rate: 0.0409/0.36 = 0.114 per access.

Percentage of references that are instructions: 100/136 = 74%.

Avg. Mem. Acc. Time: 74% × (1 + 0.004 × 100) + 26% × (1 + 0.114 × 100) ≈ 4.3 cycles. (The 1-cycle hit time applies to data accesses as well.)
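The whole calculation, without intermediate rounding, can be written out as follows (note the 1-cycle hit time applies to data accesses as well as instruction fetches):

```python
miss_per_1000_instr_i = 3.82      # instruction cache
miss_per_1000_instr_d = 40.9      # data cache
mem_refs_per_instr_d  = 0.36      # 36% of instructions are loads/stores
hit_time, miss_penalty = 1, 100   # cycles

# convert misses per 1000 instructions to miss rate per access
miss_rate_i = miss_per_1000_instr_i / 1000   # one fetch per instruction
miss_rate_d = miss_per_1000_instr_d / 1000 / mem_refs_per_instr_d

# fraction of all memory accesses that are instruction fetches: 100/136
frac_i = 1.0 / (1.0 + mem_refs_per_instr_d)
frac_d = mem_refs_per_instr_d / (1.0 + mem_refs_per_instr_d)

amat = (frac_i * (hit_time + miss_rate_i * miss_penalty)
        + frac_d * (hit_time + miss_rate_d * miss_penalty))
# ≈ 4.29 cycles with these numbers
```

The unrounded result (about 4.29 cycles) differs slightly from the hand calculation with rounded rates, a reminder that these averages are approximations anyway.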

Page 20:

Miss Penalty under Dynamic Execution

Is it the full latency? Is it the "exposed" latency? What about the latency due to contention by speculative instructions? Any form of latency has the same problem.

Simple (simplistic) solution: find which instruction did not commit in time and attribute the stall to it.

Page 21:

How to Increase Performance

Larger cache: obviously reduces misses; increases cost, power, and hit time.

Larger block: decreases compulsory (initial) misses by better exploiting spatial locality; decreases the number of blocks; increases miss penalty and bus traffic.

Page 22:

How to Increase Performance

Higher associativity: reduces conflict misses; increases hit time, silicon area, and power consumption.

Multilevel cache: reduces hit time and miss penalty; increases cost and power.

Give priority to read misses: let reads jump the queue.

Overlap TLB and cache read...

Page 23:

TLB and Cache

The cache understands physical addresses, so we have to consult the TLB to convert a virtual address to a physical one. How about if we overlap the two?

When is such a thing possible?

Page 24:

What is the Trick?

The TLB is a small cache that associates a (virtual) page number with a (physical) frame number.

The page offset and the frame offset are the same and need no translation.

If the page offset is enough to index the set in the cache, we do not need any bit from the frame number: we can retrieve the set while the TLB does the translation, and when the TLB is done we compare the tags.

Page 25:

This is the Trick

[Figure: the same addressing diagram, with the virtual/physical split marked — the Block Offset and Cache Index come from the untranslated page offset, so the set is retrieved while the TLB translates the page number; the tag comparison then uses the physical frame number.]

Page 26:

Disadvantages

Cache size ≤ Page size × associativity. Usually we want a "medium" size page and a large cache.

There are ways to deviate from this rule with extra hardware.
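The rule follows from counting bits: the index and block-offset bits must all come from the untranslated page offset. A sketch, assuming a hypothetical 4 KB page and 64-byte blocks:

```python
page_bytes  = 4096   # hypothetical 4 KB page -> 12 untranslated offset bits
ways        = 2
block_bytes = 64

# cache size is capped at (page size) x (associativity)
max_cache_bytes = page_bytes * ways                  # 8 KB for a 2-way cache
num_sets = max_cache_bytes // (block_bytes * ways)   # 64 sets

page_offset_bits = page_bytes.bit_length() - 1       # 12
offset_bits      = block_bytes.bit_length() - 1      # 6
index_bits       = num_sets.bit_length() - 1         # 6
# the whole cache index fits inside the untranslated page offset:
assert offset_bits + index_bits <= page_offset_bits
```

Raising the associativity is what lets the cache grow without needing translated bits for the index, which is why this trick pushes designs toward higher associativity.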

Page 27:

11 Advanced Optimizations

We organize them in 5 groups: reduce hit time, increase bandwidth, reduce miss penalty, reduce miss rate, and prefetching.

Page 28:

Small is Beautiful

Small and simple caches are faster: reduce the size, reduce the associativity, rely on the L2 cache.

L1 cache sizes do not change much with technology.

Page 29:

Way Prediction

Tag comparison is costly. Store, along with the data, prediction bits for the next access; the index is augmented by the predictor bits. The data is sent to the CPU while we check the tags; if the tags do not match, we send an "Oops!".

The Pentium 4 uses it.

Page 30:

Example

Prediction hit rate 85% (typical). Correct prediction: 1 cycle; misprediction: 3 cycles; without prediction: 2 cycles. 0.85 × 1 + 0.15 × 3 = 1.3 < 2.

Page 31:

Trace Cache

Seems so devious... it is almost Harry-Potterish. The cache contains a dynamic trace (the sequence of instructions as they are executed), so branch prediction is folded into the cache.

The Pentium 4 uses it for its micro-operations.

Page 32:

Cache Pipeline

Most caches take more than 1 cycle, and pipelining is tried and true: embed the cache pipeline into the CPU pipeline.

The Pentium 4 takes 4 cycles (despite way prediction, etc.).

Page 33:

Non-Blocking Miss

Allow a hit under a miss, or multiple misses under a miss.

FP-intensive programs benefit from multiple misses under a miss.

Dynamic execution benefits from it.

Page 34:

Multi-Banked Cache

Multi-bank (a.k.a. interleaved) memories were always popular.

Best suited to the L2 cache: allows each bank to be smaller and to work independently, increasing bandwidth.

The AMD Opteron has 2 banks; the Sun T1 has 4.

Page 35:

Critical Word First

Critical word first: if the block is transmitted over multiple cycles, send the word we asked for first.

Early restart: do not wait for the whole block to arrive.

Page 36:

Merging Write Buffer

On a write miss, the block might still be sitting in the (victim) write-back buffer; merging the write into that entry saves a memory access.

Similar idea to a victim buffer (virtual memory).

Page 37:

Compiler Re-ordering

Try to access arrays the way they are laid out in the cache.

This is the magic behind fast matrix multiplication (blocking): break the matrices into pieces that fit comfortably in the cache.
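The blocking idea can be sketched in a few lines (a pure-Python illustration of the loop structure only; real code would use a tuned BLAS, and the tile size bs would be chosen to fit the cache):

```python
def blocked_matmul(A, B, n, bs=4):
    """Multiply two n x n matrices (lists of lists) in bs x bs tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):            # tile row of C
        for jj in range(0, n, bs):        # tile column of C
            for kk in range(0, n, bs):    # tile of the inner products
                # work within one tile: its rows stay resident in cache
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The three outer loops walk over tiles; the three inner loops do an ordinary multiply on data small enough to stay cached, so each element of A and B is brought into the cache roughly n/bs times instead of n times.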

Page 38:

Prefetching

Hardware: if two misses occur in the same page, prefetch. Most designs prefetch instructions from the instruction cache; the Opteron and Pentium 4 prefetch data too.

Compiler: insert special "prefetch" instructions. Needs a non-blocking cache; increases traffic.

Page 39:

Memory Technologies

SRAM (static RAM): big transistors optimized for speed.

DRAM: cheap capacitors optimized for density; reads destructively; needs refreshing.

Page 40:

DRAMs Rule the Desktop

Memory size and CPU speed grow at the same pace: it has always taken about a second to scan the whole memory.

Through most of their history, DRAMs increased 4-fold in capacity every three years; now they increase 4-fold every four years. Speed increases about 5% per year.

Page 41:

DRAM Organization

[Figure: DRAM organization — a memory array with a row decoder, a column decoder, sense amps, an address buffer, and data I/O.]

Page 42:

RAS and CAS

Row Address Strobe, Column Address Strobe. The RAS goes first and a whole row is copied out; the CAS then selects the bit or bits.
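The address is thus presented in two halves over the same pins. A minimal sketch, assuming a hypothetical 1024 × 1024 (1 Mbit) array:

```python
ROWS, COLS = 1024, 1024   # hypothetical 1 Mbit DRAM array

def dram_address(bit_index):
    """Split a bit index into the row (RAS) and column (CAS) addresses."""
    row = bit_index // COLS   # sent first, with RAS: the whole row is copied out
    col = bit_index % COLS    # sent second, with CAS: selects the bit
    return row, col

row, col = dram_address(5000)   # → (4, 904)
```

Multiplexing row and column over the same pins halves the pin count, which is the historical reason for the RAS/CAS protocol.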

Page 43:

Improving RAM

Fast Page Mode: increment the CAS several times with the same RAS.

Make use of the modularity available: memory is organized in blocks of 1-4 Mbits each for manufacturing reasons, which are naturally interleaved.

Page 44:

SDRAM, DDR

Synchronous DRAM (SDRAM): shares the clock with the CPU, so there is no synchronization overhead in communication.

DDR (Double Data Rate): a fast front end with heavily interleaved back ends.

Page 45:

Virtual Memory

Expand RAM to disk (not that useful today). Allow multiple processes to share the physical memory. Allow arbitrary mapping: file I/O, shared memory, dynamic libraries, etc.

Critical to security.

Page 46:

Security

Virtual memory is handled through the kernel; page tables can be manipulated only in monitor mode.

A process does not have the means to access the space of another process.

Page 47:

However...

A kernel is a huge program. Huge programs have bugs. Most bugs cause the system to crash; a few of them are exploits.

Page 48:

A better way...

Use virtual machines: much smaller, fewer bugs, one extra level of protection.

VMs have other advantages as well: sharing a computer, cloud computing, and migrating a live program to different hardware.

Page 49:

VMM

Virtual Machine Monitors (hypervisors) allow a guest OS to run efficiently as a process on a host OS: user-level code runs natively, while system calls are trapped and emulated.

The VMM mediates between the guest OS and the hardware on the host: network connection, USB device management, filesystem and state, etc.

Page 50:

ISA Support

An ISA supporting virtualization is called virtualizable. Virtualization is a new idea (geologically speaking).

Attempts by the guest to execute privileged instructions result in traps. The problem is that not all relevant instructions result in traps, and handling virtual memory is tricky.

Page 51:

Virtual Memory for Virtual Machines

Normally we distinguish between virtual memory and physical memory. Now we have an intermediate level: real memory.

The guest OS maps virtual to real; the VMM maps real to physical.

Page 52:

Shadow Page Table

Two step process Too slow Interferes with h/w assisted virtual memory

Shadow tables do it in one shot But this means guest OS cannot manage the page

tables of its own processes TLBs must have PID tags and/or be flushed on

context switch

IBM ('70s) had one more level of indirection

Page 53:

Virtualized I/O

There are far too many devices and drivers to handle, and I/O happens with the mediation of hardware, so it would be too slow to handle with emulation.

Solution: a generic device for each type of I/O. Network: time-shared or NATed.

Page 54:

Example: Xen

Instead of trying to emulate everything just to trick the guest, allow small modifications to the guest to keep things simple.

This is called paravirtualization, and Xen is the most popular example (VMware is another).

Page 55:

The Tricks of Xen

Augment kernel E.g. 1% of Linux is modified

Uses the protection levels of x86 Xen at level 0 (highest), guest OS at 1, apps at 3

Wraps I/O devices in special virtual machines (driver domains) and talks to them with page remapping

Page 56:

VMM and ISA

Designers of ISAs were cheapskates! To save a couple of bits, they had the same instruction behave differently in monitor mode and user mode: POPF (pop flags) silently ignores the privileged flags in user mode instead of trapping.

The 70's-technology IBM 370 is still the gold standard.

Page 57:

Cache and I/O

Should we do I/O through the cache? We get the data immediately with perfect consistency, but it slows down the processor and pollutes the cache.

I/O directly with memory is most popular. It works well with write through (no stale data); otherwise we can mark pages as non-cacheable, flush the cache, or send cache invalidations.

Page 58:

Fallacy: Predicting cache performance

Miss rates vary by a factor of 10,000 or more, and there is a tremendous difference between instruction and data miss rates.

Page 59:

RAMBUS promises

RAMBUS had 8 times the bandwidth of the competition, but overall performance was only 0-15% faster, while cost was 2-3 times higher (a 20% larger die).

The reason: most of the traffic is filtered by the L2 cache.