ENGS 116 Lecture 14 1
Main Memory and Virtual Memory
Vincent H. Berk
October 26, 2005
Reading for today: Sections 5.1 – 5.4, (Jouppi article)
Reading for Friday: Sections 5.5 – 5.8
Reading for Monday: Sections 5.8 – 5.12 and 5.16
Main Memory Background
• Performance of Main Memory:
– Latency: Cache miss penalty
• Access Time: time between request and word arrives
• Cycle Time: time between requests
– Bandwidth: I/O & large block miss penalty (L2)
• Main Memory is DRAM: dynamic random access memory
– Dynamic since it needs to be refreshed periodically (≈ 1% of the time)
– Addresses divided into 2 halves (memory as a 2-D matrix):
• RAS or Row Access Strobe
• CAS or Column Access Strobe
• Cache uses SRAM: static random access memory
– No refresh; 6 transistors/bit vs. 1 transistor/bit
– Size: DRAM/SRAM ≈ 4 – 8; Cost & cycle time: SRAM/DRAM ≈ 8 – 16
Key DRAM Timing Parameters
• tRAC: minimum time from RAS line falling to valid data output.
– Quoted as the speed of a DRAM when buying
– A typical 512 Mbit DRAM has tRAC = 40 – 60 ns
• tRC: minimum time from the start of one row access to the start of the next.
– tRC = 80 ns for a 512 Mbit DRAM with a tRAC of 40 – 60 ns
• tCAC: minimum time from CAS line falling to valid data output.
– 5 ns for a 512 Mbit DRAM with a tRAC of 40 – 60 ns
• tPC: minimum time from the start of one column access to the start of the next.
– 15 ns for a 512 Mbit DRAM with a tRAC of 40 – 60 ns
DRAM Performance
• A 40 ns (tRAC) DRAM can
– perform a row access only every 80 ns (tRC)
– perform a column access (tCAC) in 5 ns, but the time between column accesses is at least 15 ns (tPC)
• In practice, external address delays and bus turnaround push the column-access interval to 20 to 25 ns
• These times do not include the time to drive the addresses off the microprocessor or the memory controller overhead!
DRAM History
• DRAMs: capacity + 60%/yr, cost – 30%/yr
– 2.5X cells/area, 1.5X die size in ≈ 3 years
• ‘98 DRAM fab line costs $2B
• Rely on increasing numbers of computers & memory per computer (60% of market)
– SIMM or DIMM is the replaceable unit ⇒ computers can use any generation of DRAM
• Commodity, second-source industry ⇒ high volume, low profit, conservative
– Little organizational innovation in 20 years
• Order of importance: 1) Cost/bit, 2) Capacity
– First RAMBUS: 10X bandwidth at +30% cost ⇒ little impact
• Current SDRAM yield very high: > 80%
Main Memory Performance
• Simple:
– CPU, Cache, Bus, Memory same width (32 or 64 bits)
• Wide:
– CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
• Interleaved:
– CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved
Main Memory Performance
• Timing model (word size is 32 bits)
– 1 cycle to send address, 6 for access time, 1 to send data
– Cache block is 4 words
• Simple memory: 4 × (1 + 6 + 1) = 32 cycles
• Wide memory: 1 + 6 + 1 = 8 cycles
• Interleaved memory: 1 + 6 + 4 × 1 = 11 cycles
Word-interleaved addresses across four banks:
Bank 0: 0, 4, 8, 12
Bank 1: 1, 5, 9, 13
Bank 2: 2, 6, 10, 14
Bank 3: 3, 7, 11, 15
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
– Multiprocessor
– I/O (DMA)
– CPU with Hit under n Misses, Non-blocking Cache
• Superbank: all memory active on one block transfer (or Bank)
• Bank: portion within a superbank that is word interleaved (or subbank)
Address fields: Superbank # | Bank # | Bank offset
(Superbank offset = Bank # together with Bank offset)
Independent Memory Banks
• How many banks?
number banks ≥ number clocks to access word in bank
– For sequential accesses, otherwise will return to original bank before it has next word ready
– (like in vector case)
• Increasing DRAM capacity ⇒ fewer chips ⇒ harder to have many banks
Avoiding Bank Conflicts
• Lots of banks

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
• SW: loop interchange, or declaring the array not a power of 2 (“array padding”)
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number? easy if 2^N words per bank
Fast Memory Systems: DRAM specific
• Multiple CAS accesses: several names (page mode)
– Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address gap; what will they cost, will they survive?
– RAMBUS: startup company; reinvent DRAM interface
>> Each chip a module vs. slice of memory
>> Short bus between CPU and chips
>> Does own refresh
>> Variable amount of data returned
>> 1 byte / 2 ns (500 MB/s per chip)
– Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz)
– Intel claims RAMBUS Direct is the future of PC memory
• Niche memory or main memory?
– e.g., Video RAM for frame buffers, DRAM + fast serial output
Virtual Memory
• Virtual address (2^32, 2^64) to physical address mapping (2^28)
• Virtual memory in terms of cache:
– Cache block?
– Cache miss?
• How is virtual memory different from caches?
– What controls replacement
– Size (transfer unit, mapping mechanisms)
– Lower-level use
Figure 5.36 The logical program in its contiguous virtual address space is shown on the left; it consists of four pages A, B, C, and D.
[Figure: pages A, B, C, D occupy a contiguous virtual address space (0K – 16K, 4 KB pages); A, B, and C map to non-contiguous frames of physical main memory (0K – 12K), while D resides on disk.]
Figure 5.37 Typical ranges of parameters for caches and virtual memory.
Parameter         | First-level cache    | Virtual memory
Block (page) size | 16 – 128 bytes       | 4096 – 65,536 bytes
Hit time          | 1 – 2 clock cycles   | 40 – 100 clock cycles
Miss penalty      | 8 – 100 clock cycles | 700,000 – 6,000,000 clock cycles
  (Access time)   | 6 – 60 clock cycles  | 500,000 – 4,000,000 clock cycles
  (Transfer time) | 2 – 40 clock cycles  | 200,000 – 2,000,000 clock cycles
Miss rate         | 0.5 – 10%            | 0.00001 – 0.001%
Data memory size  | 0.016 – 1 MB         | 16 – 8192 MB
Virtual Memory
• 4 questions for virtual memory (VM):
– Q1: Where can a block be placed in the upper level?
fully associative, set associative, or direct mapped?
– Q2: How is a block found if it is in the upper level?
– Q3: Which block should be replaced on a miss?
random or LRU?
– Q4: What happens on a write?
write back or write through?
• Other issues: size; pages or segments or hybrid
Figure 5.40 The mapping of a virtual address to a physical address via a page table.
[Figure: the virtual address is split into a virtual page number and a page offset; the page table maps the virtual page number to a physical page frame in main memory, while the page offset passes through unchanged.]
Fast Translation: Translation Buffer (TLB)
• Cache of translated addresses
• Data portion usually includes physical page frame number, protection field, valid bit, use bit, and dirty bit
• Alpha 21064 data TLB: 32-entry fully associative
[Figure: Alpha 21064 data TLB. The 43-bit virtual address splits into a page-frame address <30> and a page offset <13>. The 30-bit tag is compared against all 32 entries (32:1 MUX); each entry holds V <1>, R <2>, W <2> protection bits and a physical page # <21>. The 21-bit physical page number (high-order bits) concatenated with the 13-bit page offset (low-order bits) forms the 34-bit physical address.]
Selecting a Page Size
• Reasons for a larger page size
– Page table size is inversely proportional to the page size;
therefore memory saved
– Fast cache hit time easy when cache ≤ page size (VA caches);
bigger page makes it feasible as cache grows in size
– Transferring larger pages to or from secondary storage,
possibly over a network, is more efficient
– Number of TLB entries is restricted by clock cycle time, so a larger
page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size
– Fragmentation: don’t waste storage; data must be contiguous within a page
– Quicker process start for small processes
• Hybrid solution: multiple page sizes
– Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual address bits)
Alpha VM Mapping
• “64-bit” address divided into 3 segments
– seg0 (bit 63 = 0) user code/heap
– seg1 (bit 63 = 1, 62 = 1) user stack
– kseg (bit 63 = 1, 62 = 0)
kernel segment for OS
• Three level page table, each one page
– Alpha uses only 43 bits of VA
– (future minimum page size up to 64 KB ⇒ 55 bits of VA)
• PTE bits: valid, kernel & user, read & write enable (no reference, use, or dirty bit)
– What do you do without them?
[Figure: Alpha three-level page-table walk. Virtual address fields: seg0/seg1 selector (000…0 or 111…1, 21 bits), level1 (10 bits), level2 (10 bits), level3 (10 bits), page offset (13 bits). The Page Table Base Register plus the level1 field indexes the L1 page table; that page table entry points to an L2 page table indexed by level2; that entry points to an L3 page table indexed by level3, whose entry supplies the physical page-frame number, concatenated with the page offset to form the physical address. Entries are 8 bytes: a 32-bit address plus 32 bits of fields.]
Protection
• Prevent separate processes from accessing each other’s memory
– Violations cause a segmentation fault: SIGSEGV
– Useful for multitasking systems
– An operating system issue
• At least two levels of protection:
– Supervisor (kernel) mode (privileged)
• Creates page tables, sets process bounds, handles exceptions
– User mode (non-privileged)
• Can only make requests to the kernel: called SYSCALLs
• Shared memory
• SYSCALL parameter passing
Protection 2
• Each page needs:
– PID bit
– Read/Write/Execute bit
• Each process needs:
– Stack frame page(s)
– Text or code pages
– Data or heap pages
– State table keeping:
• PC and other CPU status registers
• State of all registers
Alpha 21064
• Separate instruction & data TLBs & caches
• TLBs fully associative
• TLB updates in SW (“Private Arch Lib”)
• Caches 8KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instr. stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256 bit path to main memory, 4 64-bit modules
• Victim buffer: to give read priority over write
• 4-entry write buffer between D$ & L2$
Alpha CPI Components
[Chart: CPI (0.00 – 5.00) broken into components for AlphaSort, TPC-B (db2), TPC-B (db1), Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d; components: L2, I$, D$, I Stall, Other.]
• Instruction stall: branch mispredict (green); data cache (blue); instruction cache (yellow); L2$ (pink)
• Other: compute + register conflicts, structural conflicts
Pitfall: Predicting Cache Performance of One Program from Another (ISA, compiler, ...)
• 4KB data cache: miss rate 8%, 12%, or 28%?
• 1KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8 KB data $: 17% vs. 10%
• Why 2X Alpha vs. MIPS?
[Chart: miss rate (0% – 35%) vs. cache size (1 – 128 KB) for data and instruction caches: D$ tomcatv, D$ gcc, D$ espresso, I$ gcc, I$ espresso, I$ tomcatv.]
Pitfall: Simulating Too Small an Address Trace
[Chart: cumulative average memory access time (1 – 4.5) vs. instructions executed (0 – 12 billion). I$ = 4 KB, B = 16 B; D$ = 4 KB, B = 16 B; L2 = 512 KB, B = 128 B; MP = 12, 200 (miss penalties).]
Additional Pitfalls
• Having too small an address space
• Ignoring the impact of the operating system on the performance of the memory hierarchy
Figure 5.53 Summary of the memory-hierarchy examples in Chapter 5.
Feature                  | TLB                              | First-level cache            | Second-level cache               | Virtual memory
Block size               | 4 – 8 bytes (1 PTE)              | 4 – 32 bytes                 | 32 – 256 bytes                   | 4096 – 16,384 bytes
Hit time                 | 1 clock cycle                    | 1 – 2 clock cycles           | 6 – 15 clock cycles              | 10 – 100 clock cycles
Miss penalty             | 10 – 30 clock cycles             | 8 – 66 clock cycles          | 30 – 200 clock cycles            | 700,000 – 6,000,000 clock cycles
Miss rate (local)        | 0.1 – 2%                         | 0.5 – 20%                    | 15 – 30%                         | 0.00001 – 0.001%
Size                     | 32 – 8192 bytes (8 – 1024 PTEs)  | 1 – 128 KB                   | 256 KB – 16 MB                   | 16 – 8192 MB
Backing store            | First-level cache                | Second-level cache           | Page-mode DRAM                   | Disks
Q1: block placement      | Fully or set associative         | Direct mapped                | Direct mapped or set associative | Fully associative
Q2: block identification | Tag/block                        | Tag/block                    | Tag/block                        | Table
Q3: block replacement    | Random                           | N.A. (direct mapped)         | Random                           | LRU
Q4: write strategy       | Flush on a write to page table   | Write through or write back  | Write back                       | Write back