Memory hierarchy (di.univr.it)
TRANSCRIPT
Memory hierarchy
2
Outline
Performance impact
Principles of memory hierarchy
Memory technology and basics
3
Performance impact
A program's memory references typically determine its ultimate performance.
Ideal pipelining assumes every memory access completes in one cycle.
In practice, this is not always possible. Example:
Consider the following code sequence:
  lw  r2,0x20
  lw  r3,0x30
  add r1,r4,r5
  sw  r6,0x40
Assume:
a) 1-cycle access to memory
b) 4-cycle access to memory (for loads/stores, 1 cycle for instruction fetch)
4
Performance impact (2)
a) 1-cycle memory access: 9 cycles
b) 4-cycle memory access: 17 cycles
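The two totals can be reproduced with a simple stall model: an in-order pipeline finishes n instructions in n + depth - 1 cycles, and slower memory adds stall cycles on top. A minimal sketch, assuming a pipeline depth of 6 (so that 4 instructions take 4 + 6 - 1 = 9 cycles) and taking the 8 extra stall cycles of case (b) from the slide's diagram rather than deriving them:

```c
/* Cycles for n instructions on an in-order pipeline of the given
 * depth, plus any stall cycles caused by slow memory accesses. */
int total_cycles(int n_instr, int depth, int stalls) {
    return n_instr + depth - 1 + stalls;
}
```

With these assumptions, total_cycles(4, 6, 0) gives the 9 cycles of case (a) and total_cycles(4, 6, 8) the 17 cycles of case (b).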
[Pipeline diagrams for the code sequence above: with 1-cycle memory the four instructions flow through IF/ID/EX/M/WB without interruption; with 4-cycle memory each lw/sw occupies the M stage for 4 cycles, inserting stall (S) cycles into the instructions behind it.]
5
Performance impact (3)
Need to hide (external) memory latency
Give the illusion of a fast, large & cheap memory system
Solution: use a memory hierarchy
6
Principles of memory hierarchy
7
Typical memory hierarchy
[Diagram: CPU registers at the top, then cache, main memory, and disk; each level serves register, cache, memory, and disk references respectively, and levels become larger, slower, and cheaper moving down.]

  level     size          speed   $/MByte    block size
  regs      200 B         ~1 ns              4 B
  cache     32KB -- 4MB   5 ns    ~$1/MB     16 B
  memory    256 MB        60 ns   ~$0.5/MB   4 KB
  disk      10 GB         10 ms   $0.01/MB

Transfer unit between levels: 4 B (regs-cache), 32 B (cache-memory), 4 KB (memory-disk).
The cache-memory boundary is managed by the cache; the memory-disk boundary by virtual memory.
8
Memories in a processor
[Microprocessor Report 9/12/94]
Caches:
  L1 data
  L1 instruction
  L2 unified
TLB
Branch history
… see page 8 of introduction.
9
Memories in a processor (3)
[Burger]
10
Locality of reference
Why does the memory hierarchy work? The Principle of Locality:
• Programs tend to reuse data and instructions near those they have used recently.
Temporal locality: recently referenced items are likely to be referenced in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.
11
Sources of locality
Data
  Reference array elements in succession (spatial)
Instructions
  Reference instructions in sequence (spatial)
  Cycle through loop repeatedly (temporal)
Example:
  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  *v = sum;

  I0:       sum <-- 0
  I1:       ap  <-- a
  I2:       i   <-- 0
  I3:       if (i >= n) goto done
  I4: loop: t   <-- *ap
  I5:       sum <-- sum + t
  I6:       ap  <-- ap + 4
  I7:       i   <-- i + 1
  I8:       if (i < n) goto loop
  I9: done: *v  <-- sum
Abstract Version of Machine Code
[Memory layout: instructions I0-I9 at consecutive addresses starting at 0x0FC (I1-I6 at 0x100-0x114); array elements a[0]-a[5] at 0x400-0x414; *v at 0x7A4.]
12
Exploiting locality
Temporal locality
  Keep the most recently accessed data items closer to the processor
Spatial locality
  Move data closer to the processor in "blocks" (groups of contiguous words)
  "Block" = minimum unit of data transferred between 2 levels
  Upper-level blocks are a subset of lower-level blocks
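The payoff of spatial locality shows up in how code traverses memory. A hypothetical illustration (the function names and matrix size are ours, not from the slides): summing a matrix row by row touches addresses in order, so every word of each fetched block is used; summing column by column jumps by a whole row between accesses, wasting most of each block. Both loops compute the same result.

```c
#define ROWS 256
#define COLS 256

long sum_row_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];          /* stride-1: good spatial locality */
    return sum;
}

long sum_col_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];          /* stride-COLS: poor spatial locality */
    return sum;
}
```

On a machine with 16-byte blocks, the row-major version uses all four ints per block; the column-major version typically uses one.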
13
Memory hierarchy: terminology
Hit: the required data is found in some block in the upper level
  Hit rate: fraction of memory accesses found in the upper level
  Hit time: time to access the upper level
          = access time + time to determine hit/miss
Miss: the required data must be retrieved from a block in the lower level
  Miss rate = 1 - hit rate
  Miss penalty: time to replace a block in the upper level
          = access time + transfer time
To be effective: hit time << miss penalty
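These quantities combine into the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty. A minimal sketch (the formula is standard; the sample numbers in the usage note are ours, chosen from the hierarchy table earlier):

```c
/* Average memory access time: every access pays the hit time;
 * a fraction miss_rate of accesses additionally pays the miss penalty. */
double amat(double hit_time_ns, double miss_rate, double miss_penalty_ns) {
    return hit_time_ns + miss_rate * miss_penalty_ns;
}
```

For example, a 5 ns cache with a 5% miss rate backed by 60 ns memory averages 5 + 0.05 x 60 = 8 ns per access.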
14
Accessing data in a memory hierarchy
Between any two levels, memory is divided into blocks
Data moves between levels in block-sized chunks
Upper-level blocks are a subset of lower-level blocks
Example:
  Access word w in block a (hit)
  Access word v in block b (miss)
[Diagram: on a hit, word w is served from block a, already present in the upper level; on a miss, block b is first copied from the lower level into the upper level, then word v is served.]
Memory Basics & Technology
16
Outline
Basic memory technologies:
  SRAMs
  DRAMs
  Other types of memory
17
Importance of technology
Locality is only one side of the picture
The memory hierarchy principle is limited by cost & technological issues
Faster memories are "expensive"
Slower memories are "cheap"
Need to understand why
18
Static RAM (SRAM)
Fast
  ~5 ns (depends on size)
Persistent as long as power is supplied
  no refresh required
"Expensive"
  O($1)/MByte, 6 transistors/bit
Stable
  High immunity to noise and environmental disturbances
Same process as standard logic (CMOS)
  Base technology for caches
19
Anatomy of an SRAM bit (cell)
Read:
  Set both bit lines high (precharge)
  Set word line high
  See which bit line goes low
Write:
  Set bit lines to opposite values
  Set word line
  Flip cell to new state
[Cell diagram: 6 transistors; the stored value appears on nodes b and b', accessed via a bit-line pair gated by the word line.]
20
Example 1-level-decode SRAM (16 x 8)
[Diagram: a 4-bit address (A0-A3) drives an address decoder that selects one of 16 word lines (W0-W15); each row holds 8 memory cells; bit-line pairs (b0/b0' … b7/b7') connect to sense/write amps driving the input/output lines d0-d7 under control of R/W'.]
21
Dynamic RAM (DRAM)
Slower than SRAM
  Access time O(50 ns) (depends on size)
  Concept of cycle time > access time
  • Cycle time = min. time between consecutive accesses
Non-persistent
  every row must be refreshed (accessed) every ~1 ms
Cheaper than SRAM
  O($0.1)/MByte, 1 transistor/bit
Fragile
  electrical noise, light, radiation
Different process w.r.t. standard logic
  Requires extra fabrication steps
Workhorse memory technology
22
Anatomy of a DRAM Cell
Write:
  Drive bit line
  Select row
Read:
  Precharge bit line to "1"
  Select row
  Cell and bit line share charges
  • Very small voltage change on the bit line
  Sense the change and restore the value
Refresh:
  Just do a dummy read of every cell
[Cell diagram: one access transistor, gated by the word line, connects the storage node (capacitance Cnode) to the bit line (capacitance CBL).]
23
Addressing arrays with bits
Impractical to decode n address lines into 2^n row select lines
Partition into two-dimensional addressing
  Memory array seen as an R x C array of addresses
  • R = 2^r, C = 2^c
For each address:
  row(address) = address / C = leftmost r bits of address
  col(address) = address % C = rightmost c bits of address
The address is the r row bits followed by the c column bits.
Example (R = 2, C = 4):
         col 0  col 1  col 2  col 3
  row 0   000    001    010    011
  row 1   100    101    110    111
  Address 110 -> row 1, col 2
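The row/column split above is just a shift and a mask; a minimal sketch (the function and parameter names are ours):

```c
/* Split a linear address for an array of 2^r rows by 2^c columns:
 * the leftmost r bits select the row, the rightmost c bits the column. */
unsigned row_of(unsigned addr, unsigned c) { return addr >> c; }
unsigned col_of(unsigned addr, unsigned c) { return addr & ((1u << c) - 1); }
```

With r = 1 and c = 2 as in the example, address 0b110 (6) gives row 1, column 2.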
24
Example 2-level decode DRAM (64Kx1)
[Diagram: the 16-bit address is multiplexed over 8 pins (A7-A0). RAS strobes the row address into the row address latch; the row decoder selects one of 256 rows of a 256x256 cell array, which is read into the column sense/write amps. CAS then strobes the column address into the column latch and decoder, which selects 1 of the 256 columns for Dout/Din under control of R/W'.]
25
DRAM Operation
Row access (~50 ns)
  Set row address on address lines & strobe RAS
  Entire row read & stored in column latches
  Contents of that row of memory cells destroyed
Column access (~10 ns)
  Set column address on address lines & strobe CAS
  Access selected bit
  • READ: transfer from selected column latch to Dout
  • WRITE: set selected column latch to Din
Rewrite (~30 ns)
  Write back entire row
26
Observations About DRAMs
Timing
  Access time = 60 ns < cycle time ~= 90 ns
  Need to rewrite the row
Must refresh periodically
  Perform a complete memory cycle for each row
  Approx. every 1 ms: sqrt(n) cycles
  Handled in the background by the memory controller
Inefficient way to get a single bit
  Effectively reads an entire row of sqrt(n) bits
27
Enhanced Performance DRAMs
Conventional access
  Row + col: RAS CAS RAS CAS ...
Page mode
  Row + series of columns: RAS CAS CAS CAS ...
  Gives successive bits of the row
[Same 64Kx1 DRAM diagram as before; the entire row is buffered in the column sense/write amps, so successive CAS strobes read it out without a new row access.]
Typical performance:

  row access time         50 ns
  col access time         10 ns
  cycle time              90 ns
  page mode cycle time    25 ns
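With these timings, reading k bits from the same row costs k full cycles conventionally, but only one row access plus k page-mode cycles in page mode. A small sketch (the cost model is a simplification of the timing above; function names are ours):

```c
/* Time (ns) to read k bits from one row, using the typical timings:
 * conventional access repeats the full 90 ns cycle per bit; page mode
 * pays one 50 ns row access, then a 25 ns column cycle per bit. */
double conventional_ns(int k) { return k * 90.0; }
double page_mode_ns(int k)    { return 50.0 + k * 25.0; }
```

For k = 8, that is 720 ns versus 250 ns, nearly a 3x speedup.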
28
SRAMs vs. DRAMs
SRAM
  Technology quite stable
  fast, expensive, larger silicon footprint per bit
DRAM
  Several technological variants
  • SDRAM, RAMBUS, …
  Optimized for capacity more than for speed
  slow, cheap, smaller silicon footprint per bit
Use SRAM in fast caches and DRAM in main memory
29
SRAMs vs. DRAMs (2)
DRAM
  metric             1980    1985   1990   1995   2000   2000/1980
  $/MB               8,000   880    100    30     0.5    16,000x
  access (ns)        375     200    100    70     60     6x
  typical size (MB)  0.064   0.256  4      16     128    2,000x

SRAM
  metric             1980    1985   1990   1995   2000   2000/1980
  $/MB               19,200  2,900  320    256    5      4,000x
  access (ns)        300     150    35     15     5      60x

culled from back issues of Byte and PC Magazine
30
Storage price/MByte
[Chart, from Byte and PC Magazine: $/MByte vs. year, 1980 to today, log scale from 0.1 to 100,000; SRAM is the most expensive, then DRAM, then disk, all falling steadily.]
31
Storage access times
[Chart, from Byte and PC Magazine: access time (ns) vs. year, 1980 to today, log scale from 1 to 100,000,000; disk is the slowest, then DRAM, then SRAM.]
32
Memory technology summary
Cost and density improving at enormous rates
Speed lagging processor performance
Memory hierarchies help narrow the gap:
  small fast SRAMs (caches) at the upper levels
  large slow DRAMs (main memory) below them
  large & slow disks to back it all up
Locality of reference makes it all work
  Keep the most frequently accessed data in the fastest memory