Memory hierarchy (di.univr.it)
TRANSCRIPT
Memory hierarchy
2
Outline
Performance impact
Principles of memory hierarchy
Memory technology and basics
3
Performance impact
A program's memory references typically determine its ultimate performance.
Ideal pipelining assumes every memory access completes in one cycle.
In practice, this is not always possible. Example:
Consider the following code sequence:
  lw  r2,0x20
  lw  r3,0x30
  add r1,r4,r5
  sw  r6,0x40
Assume:
a) 1-cycle access to memory
b) 4-cycle access to memory (for loads/stores, 1 cycle for instruction fetch)
4
Performance impact (2)
a) 1-cycle memory access: 9 cycles
b) 4-cycle memory access: 17 cycles
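The two totals can be reproduced with a simple stall model: an in-order pipeline finishes n instructions in n + depth - 1 cycles, and slower memory adds stall cycles on top. A minimal sketch, assuming a pipeline depth of 6 (so that 4 instructions take 4 + 6 - 1 = 9 cycles) and taking the 8 extra stall cycles of case (b) from the slide's diagram rather than deriving them:

```c
/* Cycles for n instructions on an in-order pipeline of the given
 * depth, plus any stall cycles caused by slow memory accesses. */
int total_cycles(int n_instr, int depth, int stalls) {
    return n_instr + depth - 1 + stalls;
}
```

With these assumptions, total_cycles(4, 6, 0) gives the 9 cycles of case (a) and total_cycles(4, 6, 8) the 17 cycles of case (b).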
[Pipeline diagrams for the code sequence above: with 1-cycle memory the four instructions flow through IF/ID/EX/M/WB without interruption; with 4-cycle memory each lw/sw occupies the M stage for 4 cycles, inserting stall (S) cycles into the instructions behind it.]
5
Performance impact (3)
Need to hide (external) memory latency
Give the illusion of a fast, large & cheap memory system
Solution: use a memory hierarchy
6
Principles of memory hierarchy
7
Typical memory hierarchy
[Diagram: CPU registers at the top, then cache, main memory, and disk; each level serves register, cache, memory, and disk references respectively, and levels become larger, slower, and cheaper moving down.]

  level     size          speed   $/MByte    block size
  regs      200 B         ~1 ns              4 B
  cache     32KB -- 4MB   5 ns    ~$1/MB     16 B
  memory    256 MB        60 ns   ~$0.5/MB   4 KB
  disk      10 GB         10 ms   $0.01/MB

Transfer unit between levels: 4 B (regs-cache), 32 B (cache-memory), 4 KB (memory-disk).
The cache-memory boundary is managed by the cache; the memory-disk boundary by virtual memory.
8
Memories in a processor
[Microprocessor Report 9/12/94]
Caches:
  L1 data
  L1 instruction
  L2 unified
TLB
Branch history
… see page 8 of introduction.
9
Memories in a processor (3)
[Burger]
10
Locality of reference
Why does the memory hierarchy work? The Principle of Locality:
• Programs tend to reuse data and instructions near those they have used recently.
Temporal locality: recently referenced items are likely to be referenced in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.
11
Sources of locality
Data
  Reference array elements in succession (spatial)
Instructions
  Reference instructions in sequence (spatial)
  Cycle through loop repeatedly (temporal)
Example:
  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  *v = sum;

  I0:       sum <-- 0
  I1:       ap  <-- a
  I2:       i   <-- 0
  I3:       if (i >= n) goto done
  I4: loop: t   <-- *ap
  I5:       sum <-- sum + t
  I6:       ap  <-- ap + 4
  I7:       i   <-- i + 1
  I8:       if (i < n) goto loop
  I9: done: *v  <-- sum
Abstract Version of Machine Code
[Memory layout: instructions I0-I9 at consecutive addresses starting at 0x0FC (I1-I6 at 0x100-0x114); array elements a[0]-a[5] at 0x400-0x414; *v at 0x7A4.]
12
Exploiting locality
Temporal locality
  Keep the most recently accessed data items closer to the processor
Spatial locality
  Move data closer to the processor in "blocks" (groups of contiguous words)
  "Block" = minimum unit of data transferred between 2 levels
  Upper-level blocks are a subset of lower-level blocks
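The payoff of spatial locality shows up in how code traverses memory. A hypothetical illustration (the function names and matrix size are ours, not from the slides): summing a matrix row by row touches addresses in order, so every word of each fetched block is used; summing column by column jumps by a whole row between accesses, wasting most of each block. Both loops compute the same result.

```c
#define ROWS 256
#define COLS 256

long sum_row_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];          /* stride-1: good spatial locality */
    return sum;
}

long sum_col_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];          /* stride-COLS: poor spatial locality */
    return sum;
}
```

On a machine with 16-byte blocks, the row-major version uses all four ints per block; the column-major version typically uses one.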
13
Memory hierarchy: terminology
Hit: the required data is found in some block in the upper level
  Hit rate: fraction of memory accesses found in the upper level
  Hit time: time to access the upper level
          = access time + time to determine hit/miss
Miss: the required data must be retrieved from a block in the lower level
  Miss rate = 1 - hit rate
  Miss penalty: time to replace a block in the upper level
          = access time + transfer time
To be effective: hit time << miss penalty
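These quantities combine into the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty. A minimal sketch (the formula is standard; the sample numbers in the usage note are ours, chosen from the hierarchy table earlier):

```c
/* Average memory access time: every access pays the hit time;
 * a fraction miss_rate of accesses additionally pays the miss penalty. */
double amat(double hit_time_ns, double miss_rate, double miss_penalty_ns) {
    return hit_time_ns + miss_rate * miss_penalty_ns;
}
```

For example, a 5 ns cache with a 5% miss rate backed by 60 ns memory averages 5 + 0.05 x 60 = 8 ns per access.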
14
Accessing data in a memory hierarchy
Between any two levels, memory is divided into blocks
Data moves between levels in block-sized chunks
Upper-level blocks are a subset of lower-level blocks
Example:
  Access word w in block a (hit)
  Access word v in block b (miss)
[Diagram: on a hit, word w is served from block a, already present in the upper level; on a miss, block b is first copied from the lower level into the upper level, then word v is served.]
Memory Basics & Technology
16
Outline
Basic memory technologies:
  SRAMs
  DRAMs
  Other types of memory
17
Importance of technology
Locality is only one side of the picture
The memory hierarchy principle is limited by cost & technological issues
Faster memories are "expensive"
Slower memories are "cheap"
Need to understand why
18
Static RAM (SRAM)
Fast
  ~5 ns (depends on size)
Persistent as long as power is supplied
  no refresh required
"Expensive"
  O($1)/MByte, 6 transistors/bit
Stable
  High immunity to noise and environmental disturbances
Same process as standard logic (CMOS)
  Base technology for caches
19
Anatomy of an SRAM bit (cell)
Read:
  Set both bit lines high (precharge)
  Set word line high
  See which bit line goes low
Write:
  Set bit lines to opposite values
  Set word line
  Flip cell to new state
[Cell diagram: 6 transistors; the stored value appears on nodes b and b', accessed via a bit-line pair gated by the word line.]
20
Example 1-level-decode SRAM (16 x 8)
[Diagram: a 4-bit address (A0-A3) drives an address decoder that selects one of 16 word lines (W0-W15); each row holds 8 memory cells; bit-line pairs (b0/b0' … b7/b7') connect to sense/write amps driving the input/output lines d0-d7 under control of R/W'.]
21
Dynamic RAM (DRAM)
Slower than SRAM
  Access time O(50 ns) (depends on size)
  Concept of cycle time > access time
  • Cycle time = min. time between consecutive accesses
Non-persistent
  every row must be refreshed (accessed) every ~1 ms
Cheaper than SRAM
  O($0.1)/MByte, 1 transistor/bit
Fragile
  electrical noise, light, radiation
Different process w.r.t. standard logic
  Requires extra fabrication steps
Workhorse memory technology
22
Anatomy of a DRAM Cell
Write:
  Drive bit line
  Select row
Read:
  Precharge bit line to "1"
  Select row
  Cell and bit line share charges
  • Very small voltage change on the bit line
  Sense the change and restore the value
Refresh:
  Just do a dummy read of every cell
[Cell diagram: one access transistor, gated by the word line, connects the storage node (capacitance Cnode) to the bit line (capacitance CBL).]
23
Addressing arrays with bits
Impractical to decode n address lines into 2^n row select lines
Partition into two-dimensional addressing
  Memory array seen as an R x C array of addresses
  • R = 2^r, C = 2^c
For each address:
  row(address) = address / C = leftmost r bits of address
  col(address) = address % C = rightmost c bits of address
The address is the r row bits followed by the c column bits.
Example (R = 2, C = 4):
         col 0  col 1  col 2  col 3
  row 0   000    001    010    011
  row 1   100    101    110    111
  Address 110 -> row 1, col 2
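The row/column split above is just a shift and a mask; a minimal sketch (the function and parameter names are ours):

```c
/* Split a linear address for an array of 2^r rows by 2^c columns:
 * the leftmost r bits select the row, the rightmost c bits the column. */
unsigned row_of(unsigned addr, unsigned c) { return addr >> c; }
unsigned col_of(unsigned addr, unsigned c) { return addr & ((1u << c) - 1); }
```

With r = 1 and c = 2 as in the example, address 0b110 (6) gives row 1, column 2.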
24
Example 2-level decode DRAM (64Kx1)
[Diagram: the 16-bit address is multiplexed over 8 pins (A7-A0). RAS strobes the row address into the row address latch; the row decoder selects one of 256 rows of a 256x256 cell array, which is read into the column sense/write amps. CAS then strobes the column address into the column latch and decoder, which selects 1 of the 256 columns for Dout/Din under control of R/W'.]
25
DRAM Operation
Row access (~50 ns)
  Set row address on address lines & strobe RAS
  Entire row read & stored in column latches
  Contents of that row of memory cells destroyed
Column access (~10 ns)
  Set column address on address lines & strobe CAS
  Access selected bit
  • READ: transfer from selected column latch to Dout
  • WRITE: set selected column latch to Din
Rewrite (~30 ns)
  Write back entire row
26
Observations About DRAMs
Timing
  Access time = 60 ns < cycle time ~= 90 ns
  Need to rewrite the row
Must refresh periodically
  Perform a complete memory cycle for each row
  Approx. every 1 ms: sqrt(n) cycles
  Handled in the background by the memory controller
Inefficient way to get a single bit
  Effectively reads an entire row of sqrt(n) bits
27
Enhanced Performance DRAMs
Conventional access
  Row + col: RAS CAS RAS CAS ...
Page mode
  Row + series of columns: RAS CAS CAS CAS ...
  Gives successive bits of the row
[Same 64Kx1 DRAM diagram as before; the entire row is buffered in the column sense/write amps, so successive CAS strobes read it out without a new row access.]
Typical performance:

  row access time         50 ns
  col access time         10 ns
  cycle time              90 ns
  page mode cycle time    25 ns
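With these timings, reading k bits from the same row costs k full cycles conventionally, but only one row access plus k page-mode cycles in page mode. A small sketch (the cost model is a simplification of the timing above; function names are ours):

```c
/* Time (ns) to read k bits from one row, using the typical timings:
 * conventional access repeats the full 90 ns cycle per bit; page mode
 * pays one 50 ns row access, then a 25 ns column cycle per bit. */
double conventional_ns(int k) { return k * 90.0; }
double page_mode_ns(int k)    { return 50.0 + k * 25.0; }
```

For k = 8, that is 720 ns versus 250 ns, nearly a 3x speedup.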
28
SRAMs vs. DRAMs
SRAM
  Technology quite stable
  fast, expensive, larger silicon footprint per bit
DRAM
  Several technological variants
  • SDRAM, RAMBUS, …
  Optimized for capacity more than for speed
  slow, cheap, smaller silicon footprint per bit
Use SRAM in fast caches and DRAM in main memory
29
SRAMs vs. DRAMs (2)
DRAM
  metric             1980    1985   1990   1995   2000   2000/1980
  $/MB               8,000   880    100    30     0.5    16,000x
  access (ns)        375     200    100    70     60     6x
  typical size (MB)  0.064   0.256  4      16     128    2,000x

SRAM
  metric             1980    1985   1990   1995   2000   2000/1980
  $/MB               19,200  2,900  320    256    5      4,000x
  access (ns)        300     150    35     15     5      60x

culled from back issues of Byte and PC Magazine
30
Storage price/MByte
[Chart, from Byte and PC Magazine: $/MByte vs. year, 1980 to today, log scale from 0.1 to 100,000; SRAM is the most expensive, then DRAM, then disk, all falling steadily.]
31
Storage access times
[Chart, from Byte and PC Magazine: access time (ns) vs. year, 1980 to today, log scale from 1 to 100,000,000; disk is the slowest, then DRAM, then SRAM.]
32
Memory technology summary
Cost and density improving at enormous rates
Speed lagging processor performance
Memory hierarchies help narrow the gap:
  small fast SRAMs (caches) at the upper levels
  large slow DRAMs (main memory) below them
  large & slow disks to back it all up
Locality of reference makes it all work
  Keep the most frequently accessed data in the fastest memory