
Page 1: SE-292 High Performance Computing Memory Hierarchy

SE-292 High Performance Computing: Memory Hierarchy

R. Govindarajan

govind@serc

Page 2: SE-292 High Performance Computing Memory Hierarchy

Reality Check

Question 1: Are real caches built to work on virtual addresses or physical addresses?

Question 2: What about multiple levels in caches?

Question 3: Do modern processors use pipelining of the kind that we studied?

Page 3: SE-292 High Performance Computing Memory Hierarchy

Virtual Memory System

Supports memory management when multiple processes are running concurrently: page based, segment based, ...

Ensures protection across processes.

Address Space: the range of memory addresses a process can address (includes program (text), data, heap, and stack); a 32-bit address gives 4 GB.

With VM, the address generated by the processor is a virtual address.

Page 4: SE-292 High Performance Computing Memory Hierarchy

Page-Based Virtual Memory

A process’ address space is divided into a number of pages (of fixed size).

A page is the basic unit of transfer between secondary storage and main memory.

Different processes share the physical memory.

A virtual address must be translated to a physical address.

Page 5: SE-292 High Performance Computing Memory Hierarchy

Virtual Pages to Physical Frame Mapping

[Figure: the virtual pages of Process 1 through Process k are mapped onto frames of Main Memory.]

Page 6: SE-292 High Performance Computing Memory Hierarchy

Page Mapping Info (Page Table)

A page can be mapped to any frame in the main memory. Where to store/access the mapping? The Page Table. Each process will have its own page table!

Address Translation: virtual to physical address translation.

Page 7: SE-292 High Performance Computing Memory Hierarchy

Address Translation

The virtual address (issued by the processor) is translated to the physical address (used to access memory).

[Figure: Virtual Address = Virtual Page No. (18 bits) | Offset (14 bits). The page table entry (Phy. Page #, Pr, D, V bits) supplies the Physical Frame No. (18 bits), which is concatenated with the unchanged 14-bit Offset to form the Physical Address.]
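As a software analogue of the diagram above, here is a minimal C sketch of the translation, assuming a flat single-level page table; the `page_table` array and the PTE field layout are illustrative assumptions, not from the slides (real translation is done by the MMU in hardware).

```c
#include <stdint.h>

#define OFFSET_BITS 14                          /* 16 KB pages */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

typedef struct {
    uint32_t pfn   : 18;  /* physical frame number */
    uint32_t prot  : 2;   /* protection bits (Pr) */
    uint32_t dirty : 1;   /* D */
    uint32_t valid : 1;   /* V */
} pte_t;

/* Hypothetical flat page table: one entry per virtual page (2^18 entries). */
extern pte_t page_table[1u << 18];

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> OFFSET_BITS;     /* upper 18 bits */
    uint32_t offset = vaddr & OFFSET_MASK;      /* lower 14 bits */
    pte_t    pte    = page_table[vpn];

    if (!pte.valid) {
        /* Page fault: the OS would bring the page in; omitted here. */
    }
    return ((uint32_t)pte.pfn << OFFSET_BITS) | offset;
}
```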

Page 8: SE-292 High Performance Computing Memory Hierarchy

Memory Hierarchy: Secondary to Main Memory

Analogous to main memory to cache.

When the required virtual page is in main memory: page hit.

When the required virtual page is not in main memory: page fault.

The page fault penalty is very high (tens of milliseconds), as it involves disk (secondary storage) access.

The page fault ratio should therefore be very low (10^-4 to 10^-5). Page faults are handled by the OS.
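To see why the ratio must be so low, a back-of-the-envelope check (assuming the 100 ns main-memory access time used on the multi-level cache slide later): with a 10 ms fault penalty, even a fault rate of 10^-5 already doubles the effective access time, since 100 ns + 10^-5 x 10^7 ns = 200 ns.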

Page 9: SE-292 High Performance Computing Memory Hierarchy

Page Placement

A virtual page can be placed anywhere in physical memory (fully associative).

The Page Table keeps track of the page mapping; there is a separate page table for each process.

The page table size is quite large! Assume a 32-bit address space and a 16 KB page size:

# of entries in page table = 2^32 / 2^14 = 2^18 = 256K

Page table size = 256K * 4 B = 1 MB = 64 pages!

The page table itself may therefore be paged (multi-level page tables)!
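The same arithmetic as a compile-time sanity check in C (the 4-byte entry size matches the slide; the macro names are my own):

```c
#define ADDR_BITS 32
#define PAGE_BITS 14                                 /* 16 KB pages */
#define PTE_BYTES 4

#define NUM_PTES  (1ul << (ADDR_BITS - PAGE_BITS))   /* 2^18 = 256 K entries */
#define PT_BYTES  (NUM_PTES * PTE_BYTES)             /* 1 MB */
#define PT_PAGES  (PT_BYTES >> PAGE_BITS)            /* 64 pages */
```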

Page 10: SE-292 High Performance Computing Memory Hierarchy

Page Identification

Use the virtual page number to index into the page table.

Accessing the page table causes one extra memory access!

[Figure: the same translation diagram as before: Virtual Page No. (18 bits) | Offset (14 bits) → page table entry (Phy. Page #, Pr, D, V) → Physical Frame No. (18 bits) | Offset (14 bits).]

Page 11: SE-292 High Performance Computing Memory Hierarchy

Page Replacement

Page replacement can use more sophisticated policies than in a cache: Least Recently Used, the second-chance algorithm, recency vs. frequency.

Write policies: write-back, write-allocate.

Page 12: SE-292 High Performance Computing Memory Hierarchy

Translation Look-Aside Buffer

Accessing the page table causes one extra memory access! To reduce translation time, use a translation look-aside buffer (TLB), which caches recent address translations.

TLB organization is similar to cache organization (direct mapped, set-, or fully-associative).

The size of the TLB is small (4 - 512 entries). TLBs are important for fast translation.

Page 13: SE-292 High Performance Computing Memory Hierarchy

Translation using TLB

Assume a 128-entry, 4-way set-associative TLB (32 sets).

[Figure: the 18-bit Virtual Page No. splits into a 13-bit tag and a 5-bit index; each TLB entry holds a tag plus Phy. Page #, Pr, V, D bits; a tag match in the indexed set signals a TLB hit, and the Physical Frame No. is concatenated with the 14-bit offset.]
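A minimal C sketch of that lookup, assuming a hypothetical `tlb` array organized as 32 sets of 4 ways (entry layout and names are illustrative); hardware compares all four ways in parallel rather than looping:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_SETS 32          /* 128 entries / 4 ways */
#define TLB_WAYS 4

typedef struct {
    uint32_t tag   : 13;     /* upper 13 bits of the 18-bit VPN */
    uint32_t pfn   : 18;     /* physical frame number */
    uint32_t valid : 1;
} tlb_entry_t;

extern tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

/* Returns true on a TLB hit and fills *pfn; on a miss the page table
 * would be walked and the TLB refilled (not shown). */
bool tlb_lookup(uint32_t vpn, uint32_t *pfn)
{
    uint32_t index = vpn & (TLB_SETS - 1);   /* low 5 bits of the VPN */
    uint32_t tag   = vpn >> 5;               /* remaining 13 bits */

    for (int way = 0; way < TLB_WAYS; way++) {
        tlb_entry_t e = tlb[index][way];
        if (e.valid && e.tag == tag) {
            *pfn = e.pfn;
            return true;                     /* TLB hit */
        }
    }
    return false;                            /* TLB miss */
}
```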

Page 14: SE-292 High Performance Computing Memory Hierarchy

Q1: Caches and Address Translation

Physical addressed cache: virtual address → MMU → physical address → cache (and on to main memory on a cache miss).

Virtual addressed cache: virtual address → cache directly; the MMU translates to a physical address only on a cache miss (to access main memory).

Page 15: SE-292 High Performance Computing Memory Hierarchy

Which is less preferable?

Physical addressed cache: hit time is higher (cache access starts only after translation).

Virtual addressed cache: data/instructions of different processes with the same virtual address can be in the cache at the same time ... so flush the cache on a context switch, or include a process id as part of each cache directory entry.

Synonyms: virtual addresses that translate to the same physical address; more than one copy in the cache can lead to a data consistency problem.

Page 16: SE-292 High Performance Computing Memory Hierarchy

Another possibility: Overlapped operation

The MMU translation and the cache access proceed in parallel: the cache directory is indexed using the virtual address, while the tag comparison uses the physical address.

This is a virtually indexed, physically tagged cache.
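A point worth making explicit here: the overlap is only straightforward when all cache index bits fall within the page offset, i.e. when cache size / associativity <= page size; any index bits drawn from the virtual page number reintroduce the synonym problem from the previous slide.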

Page 17: SE-292 High Performance Computing Memory Hierarchy

Addresses and Caches

`Physical addressed cache’: physically indexed, physically tagged.

`Virtual addressed cache’: virtually indexed, virtually tagged.

Overlapped cache indexing and translation: virtually indexed, physically tagged; physically indexed, virtually tagged (?).

Page 18: SE-292 High Performance Computing Memory Hierarchy

Physically Indexed, Physically Tagged Cache

Assume a 16 KB page size and a 64 KB direct-mapped cache with a 32 B block size.

[Figure: the Virtual Address (Virtual Page No., 18 bits | Page Offset, 14 bits) goes through the MMU to form the Physical Address (Physical Page No., 18 bits | Page Offset, 14 bits); the cache then uses the low 16 bits as C-Index (11 bits) plus C-offset (5 bits), and compares the remaining 16 bits as the Cache Tag.]
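A small C sketch of how this cache carves up the 32-bit physical address (the helper names are my own):

```c
#include <stdint.h>

#define BLOCK_BITS 5     /* 32 B blocks */
#define INDEX_BITS 11    /* 64 KB / 32 B = 2048 lines, direct mapped */

static inline uint32_t c_offset(uint32_t paddr) { return paddr & ((1u << BLOCK_BITS) - 1); }
static inline uint32_t c_index (uint32_t paddr) { return (paddr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); }
static inline uint32_t c_tag   (uint32_t paddr) { return paddr >> (BLOCK_BITS + INDEX_BITS); } /* upper 16 bits */
```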

Page 19: SE-292 High Performance Computing Memory Hierarchy

Virtually Indexed, Virtually Tagged Cache

[Figure: the C-Index (11 bits) and C-offset (5 bits) are taken directly from the Virtual Address, and the 18-bit VPN is compared as the cache tag to signal Hit/Miss; the MMU (VPN, 18 bits → PPN, 18 bits, with the 14-bit Page Offset unchanged) is needed only on a miss.]

Page 20: SE-292 High Performance Computing Memory Hierarchy

Virtually Indexed, Physically Tagged Cache

[Figure: the C-Index (11 bits) and C-offset (5 bits) are taken from the Virtual Address so that cache indexing starts immediately, while the MMU translates the 18-bit VPN in parallel; the 16-bit Cache Tag is then compared against the resulting Physical Address (P-Offset, 14 bits, unchanged).]
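Note that in this example the 11-bit cache index plus 5-bit block offset spans 16 bits, two more than the 14-bit page offset, so the top two index bits come from the untranslated virtual page number; the physical tag comparison keeps the result correct, but synonyms may then land in different sets.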

Page 21: SE-292 High Performance Computing Memory Hierarchy

Multi-Level Caches

A small L1 cache gives a low hit time, and hence a faster CPU cycle time.

A large L2 cache reduces the L1 cache miss penalty. The L2 cache is typically set-associative, to reduce the L2 cache miss ratio!

Typically, the L1 cache is direct mapped with separate I and D cache organization, while L2 is unified and set-associative. L1 and L2 are on-chip; L3 is also moving on-chip.

Page 22: SE-292 High Performance Computing Memory Hierarchy

Multi-Level Caches

[Figure: CPU + MMU → split L1 I-Cache and L1 D-Cache (access time 2 - 4 ns) → L2 Unified Cache (access time 16 - 30 ns) → Memory (access time 100 ns).]

Page 23: SE-292 High Performance Computing Memory Hierarchy

Cache Performance

One-level cache:

AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1

Two-level caches:

AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1

Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2

AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
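As a worked example, plugging in the access times from the previous slide; the miss rates here are illustrative assumptions, not figures from the slides:

```c
#include <stdio.h>

int main(void)
{
    double t_l1  = 2.0,  m_l1 = 0.05;  /* L1: 2 ns hit time, 5% miss rate (assumed) */
    double t_l2  = 20.0, m_l2 = 0.20;  /* L2: 20 ns hit time, 20% local miss rate (assumed) */
    double t_mem = 100.0;              /* main memory access time, in ns */

    double l1_penalty = t_l2 + m_l2 * t_mem;      /* 20 + 0.2 * 100 = 40 ns */
    double amat       = t_l1 + m_l1 * l1_penalty; /* 2 + 0.05 * 40 = 4 ns */

    printf("Miss Penalty_L1 = %.1f ns, AMAT = %.1f ns\n", l1_penalty, amat);
    return 0;
}
```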

Page 24: SE-292 High Performance Computing Memory Hierarchy

Putting it Together: Alpha 21264

48-bit virtual and 44-bit physical addresses.

64 KB 2-way associative L1 I-Cache with 64-byte blocks (512 sets).

The L1 I-Cache is virtually indexed and tagged (address translation is required only on a miss).

An 8-bit ASID for each process (to avoid a cache flush on context switch).

Page 25: SE-292 High Performance Computing Memory Hierarchy

Alpha 21264

8 KB page size (13-bit page offset).

128-entry fully associative TLB.

8 MB direct-mapped unified L2 cache, 64 B block size.

Critical word (16 B) first.

Prefetch of the next 64 B into the instruction prefetcher.

Page 26: SE-292 High Performance Computing Memory Hierarchy

21264 Data Cache

The L1 data cache uses a virtual address index but a physical address tag.

Address translation proceeds along with the cache access.

64 KB 2-way associative L1 data cache with write back.

Page 27: SE-292 High Performance Computing Memory Hierarchy

Q3: High Performance Pipelined Processors

Pipelining overlaps the execution of consecutive instructions, improving processor performance.

Current processors use more aggressive techniques for more performance.

Some exploit Instruction Level Parallelism: often, many consecutive instructions are independent of each other and can be executed in parallel (at the same time).

Page 28: SE-292 High Performance Computing Memory Hierarchy

Instruction Level Parallelism Processors

Challenge: identifying which instructions are independent.

Approach 1: build processor hardware to analyze and keep track of dependences. Superscalar processors: Pentium 4, RS6000, ...

Approach 2: the compiler does the analysis and packs suitable instructions together for parallel execution by the processor. VLIW (very long instruction word) processors: Intel Itanium.

Page 29: SE-292 High Performance Computing Memory Hierarchy

ILP Processors (contd.)

[Figure: IF-ID-EX-MEM-WB stage diagrams. Pipelined: one instruction enters the pipeline per cycle, with stages overlapped. Superscalar: two instructions enter per cycle, in pipelines side by side. VLIW/EPIC: one long instruction per cycle, carrying multiple EX operations.]

Page 30: SE-292 High Performance Computing Memory Hierarchy

Multicores

Multiple cores in a single die.

Early efforts utilized multiple cores for multiple programs: throughput oriented rather than speedup-oriented!

Can they be used by parallel programs?