CME212 – Introduction to Large-Scale Computing in Engineering
High Performance Computing and Programming
CME212 Lecture 15
Caches and Memory
2
Memory
• Writable?
  – Read-Only (ROM)
  – Read-Write
• Accessing
  – Random Access (RAM)
  – Sequential Access (Tapes)
• Lifetime
  – Volatile (needs power)
  – Non-Volatile (can be powered off)
3
Conventional RAM
• Dynamic RAM (DRAM)
  – Works in refresh cycles
  – Few transistors means low cost
• Static RAM (SRAM)
  – More transistors than DRAM
  – More expensive
  – No refresh means much faster
4
Flash Memory
• Non-volatile memory
  – Stores charge in floating gates via quantum tunneling
• Cheap NAND Flash has only sequential access
• Finite number of "flashes"
• Problems with writes
  – Can only be written in blocks
• Used in cameras, MP3 players
5
Disk Operation (single-platter view)

[Figure: a single platter rotating around its spindle. The disk surface spins at a fixed rotational rate. By moving radially, the arm can position the read/write head over any track. The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.]
6
Disk Operation (multi-platter view)

[Figure: several platters stacked on one spindle; the read/write heads are mounted on a common arm and move in unison from cylinder to cylinder.]
7
CPU-Memory Gap
Image from Sun Microsystems
8
The CPU-Memory Gap
• Cheap memory must be built out of few transistors
• The most common main memory type is DRAM (dynamic RAM), which saves transistors by operating in refresh cycles
• The other type, SRAM (static RAM), uses a more expensive design without refreshing
• The clock frequency of CPUs has increased at a much higher rate than that of DRAM
• Conclusion: the CPU must wait for data to pass through the memory system
9
Implications for Pipelines
• Waiting for data stalls the pipeline
• Common DRAM latency is about 150 cycles
• UNACCEPTABLE!
• We would need a lot of registers to keep this latency hidden
• Solution: cache memories
• A cache memory is a smaller, faster SRAM memory that acts as temporary storage to hide the DRAM latencies
10
Webster Definition of “cache”
cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT]
1a: a hiding place esp. for concealing and preserving provisions or implements
1b: a secure place of storage
2: something hidden or stored in a cache
11
Cache Memory
CPU <-> Cache <-> Memory (DRAM)

Cache: small but fast, close to the CPU
Memory (DRAM): large and slow (cheap), far away from the CPU
12
Basics of Caches
• Caches hold copies of the memory
  – Need to be synchronized with memory
  – This is handled transparently to the CPU
• Caches have a limited capacity
  – Cannot fit the entire memory at one time
• Caches work because of the principle of locality
13
General Principles of Computer Programs
• Principle of locality:
  – Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of its code.
• We can predict what instructions and data a program will use based on its history
• Temporal locality: recently accessed items are likely to be accessed in the near future
• Spatial locality: items whose addresses are near one another tend to be referenced close together in time
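Both kinds of locality show up in even the simplest loop. A minimal sketch (the function name is illustrative):

```c
/* Both kinds of locality in one loop: sum is reused on every
   iteration (temporal locality), and a[i] walks through
   consecutive addresses (spatial locality). */
static double sum_array(const double *a, int n) {
    double sum = 0.0;            /* temporal: touched every iteration */
    for (int i = 0; i < n; i++)
        sum += a[i];             /* spatial: stride-1 access pattern */
    return sum;
}
```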
14
Cache Knowledge Useful When...
• Designing a new computer
• Writing an optimized program
  – or compiler
  – or operating system …
• Implementing software caching
  – Web caches
  – Proxies
  – File systems
15
Cache Concepts
• Requests for data are sent to the memory subsystem
  – They either hit or miss in a cache
  – On a miss we need to get a copy from memory
• Caches have finite capacity
  – Data needs to be replaced
  – How do we find our victim?
• Caches need to be fast
  – How do we verify if data is in the cache or not?
16
Details of Caching
• Every piece of data is identified using an address
• We can store the address in a “phone book” to find a piece of data
• When the CPU sends out a request for data, we need a fast mechanism to find out if we have a hit or miss
17
Mapping Strategies
• In a direct-mapped cache each piece of data has a given location
• In a fully associative cache any piece of data can go anywhere (parallel search)
• In a set-associative cache any piece of data can go anywhere within a subset
  – Data is directly mapped to sets
  – Each set is associative (must be searched)
18
Set Associativity
• The address space is mapped onto the sets modulo the number of sets in the cache
• The exact mapping is given by some bits of the address
• Example:
  – 4-way set associative, each set holds 256 bytes
  – Address space is 0x800 bytes (hex), i.e., 2048 bytes (decimal)
  – Address bits 8 and 9 (counting from 0) identify the set
set0: 000-0FF, 400-4FF
set1: 100-1FF, 500-5FF
set2: 200-2FF, 600-6FF
set3: 300-3FF, 700-7FF

Potential conflict: addresses in the same set differ only in their highest bits, which specify a tag.
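The address split used in this example can be sketched as two bit-slicing functions. The constants follow the example's parameters (256-byte sets, 4 sets); the function names are illustrative:

```c
#include <stdint.h>

/* Address split for the example above: bits 0-7 select a byte within
   a 256-byte set, bits 8-9 select one of the 4 sets, and the
   remaining high bits form the tag. */
enum { OFFSET_BITS = 8, SET_BITS = 2 };

static uint32_t set_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1u);
}

static uint32_t addr_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + SET_BITS);
}
```

Addresses 0x000 and 0x400 land in the same set but carry different tags, which is exactly the potential conflict the table points out.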
19
Address Book Cache: Looking for Tommy's Telephone Number

[Figure: an address book with one tab per letter. The first letter of the name is the indexing function, the rest of the name is the "address tag", and the telephone number (TOMMY 12345) is the "data". One entry per page => a direct-mapped cache with 28 entries.]
20
Address Book Cache: Looking for Tommy's Number

[Figure: the first letter "T" of TOMMY indexes the page; the stored tag "OMMY" is compared (EQ?) against the rest of the name. They match: a hit, and the data 12345 is returned.]
21
Address Book Cache: Looking for Tomas' Number

[Figure: "T" of TOMAS indexes the same page, but the stored tag "OMMY" does not match "OMAS". Miss! Look up Tomas' number in the telephone directory instead.]
22
Address Book Cache: Looking for Tomas' Number

[Figure: replace Tommy's entry with Tomas' ("OMAS", 23457). There is no other choice (direct mapped).]
23
Cache Blocks
• To speed up the lookup process, data is allocated in cache blocks consisting of several consecutively stored words
• When you access a word you will always allocate several neighboring words in the cache
• Works well due to the principle of locality
24

Cache Blocks and Miss Ratios

• Consider a C array of 1024 doubles (double *array)
  – A pointer to the start address of a contiguous region in memory
  – Block size is 32 bytes, which equals 4 array elements
  – Loop through the array with an index increment of one (stride-1)

[Figure: the array laid out in memory, with cache block boundaries at i = 0, i = 4, i = 8, …]

Every 4th element is a cache miss: 256 misses in total, a miss ratio of 25%.
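The 25% figure can be checked with a minimal first-touch miss count, assuming a stride-1 sweep and 4 doubles per 32-byte block (the helper name is illustrative):

```c
#include <stddef.h>

/* Count cold misses for a stride-1 sweep: an access misses exactly
   when it touches the first element of a new 4-element block. */
static size_t stride1_misses(size_t n_elems, size_t elems_per_block) {
    size_t misses = 0;
    for (size_t i = 0; i < n_elems; i++)
        if (i % elems_per_block == 0)   /* first word of a new block */
            misses++;
    return misses;
}
```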
25
Consequences of Cache Blocks
• Works well because of the principle of locality
  – Codes with a high degree of spatial locality reuse data within blocks
• We should aim for a stride-1 access pattern
• Structs should be packed and aligned to cache blocks
  – Compiler can help
  – Fill out structs using dummy data
26
Who to Replace? Picking a "Victim"

• Least-recently used (LRU)
  – Considered the "best" algorithm (which is not always true…)
  – Only practical up to ~4-way associative
• Pseudo-LRU
  – Based on coarse time stamps
• Random replacement
27
The Memory Hierarchy
• Extend the caching idea and create a hierarchy of caches
• Arranged into levels
• L1 – level 1 cache
• L2 – level 2 cache
• Caches are often of increasing size
• Hide the latency of cheaper memory
28
Memory/Storage

                 Registers & caches (SRAM)   Main memory (DRAM)   Disk and virtual memory
Latency (2000):  1 ns, 1 ns, 3 ns, 10 ns     150 ns               5 000 000 ns
Capacity:        1 kB, 64 kB, 4 MB           1 GB                 1 TB
(Latency 1982:   200 ns, 200 ns              200 ns               10 000 000 ns)
29
An Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices at the bottom:

L0: registers – CPU registers hold words retrieved from the L1 cache
L1: on-chip L1 cache (SRAM) – holds cache lines retrieved from the L2 cache
L2: off-chip L2 cache (SRAM) – holds cache lines retrieved from main memory
L3: main memory (DRAM) – holds disk blocks retrieved from local disks
L4: local secondary storage (local disks) – holds files retrieved from disks on remote network servers
L5: remote secondary storage (distributed file systems, Web servers)
30
Caching in a Memory Hierarchy
[Figure: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (here blocks 8, 9, 14, 3). The larger, slower, cheaper storage device at level k+1 is partitioned into blocks 0-15. Data is copied between levels in block-sized transfer units.]
31
General Caching Concepts
• Program needs object d, which is stored in some block b
• Cache hit
  – Program finds b in the cache at level k, e.g., block 14
• Cache miss
  – b is not at level k, so the level k cache must fetch it from level k+1, e.g., block 12
  – If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
• Placement policy: where can the new block go?
• Replacement policy: which block should be evicted? E.g., LRU
[Figure: a request for block 14 hits at level k; a request for block 12 misses, so block 12 is fetched from level k+1 and evicts a block at level k.]
32
Block Sizes in a Typical Memory Hierarchy

            Capacity   Block size   # of lines   # of 32-bit integers per block
Register    32 bits    4 bytes      1            1
L1 cache    64 kB      32 bytes     2048         8
L2 cache    2 MB       64 bytes     32768        16
33
Address Translation
• Translation is expensive since we need to keep track of many pages on a multi-tasking, multi-user system
  – Need to search or index the page table that maintains this information
• Introduce the Translation Lookaside Buffer (TLB) to remember the most recent translations
  – The TLB is a small on-chip cache
  – If we have an entry in the TLB, the page is probably in physical memory
  – Translation is much quicker (faster access time)
34
Page Sizes and TLB Reach
• Typical page sizes are 4 kB or 8 kB
• TLBs typically hold 256 or 512 entries
• The TLB reach is the amount of data we can fit in the TLB
  – Multiply page size by number of entries
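The TLB reach calculation as a one-liner, using the typical values from this slide (the function name is illustrative):

```c
#include <stddef.h>

/* TLB reach = page size × number of TLB entries. */
static size_t tlb_reach(size_t page_bytes, size_t entries) {
    return page_bytes * entries;
}
```

Note that both typical combinations give the same reach: 4 kB pages with 512 entries and 8 kB pages with 256 entries both cover 2 MB.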
35
General Caching Concepts
• Types of cache misses:
  – Cold (compulsory) miss
    • Cold misses occur because the cache is empty
  – Capacity miss
    • Occurs when the set of active cache blocks (working set) is larger than the cache
  – Conflict miss
    • Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block
    • E.g., referencing blocks 0, 8, 0, 8, 0, 8, … would miss every time
36
Caches in Hierarchies
• To synchronize data in hierarchies, caches can either be:
1. Write-through
  – Reflect changes immediately
  – L1 is often write-through
2. Write-back
  – Synchronize all data at a given signal
  – Less traffic
37
Cache Performance Metrics

• Miss Rate
  – Fraction of memory references not found in cache (misses/references)
  – Typical numbers:
    • 3-10% for L1
    • can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
  – Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  – Typical numbers:
    • 1 clock cycle for L1
    • 3-8 clock cycles for L2
• Miss Penalty
  – Additional time required because of a miss
    • Typically 25-100 cycles for main memory
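These three metrics combine into the standard average memory access time formula, AMAT = hit time + miss rate × miss penalty. A sketch; the example numbers below are illustrative, not from the slide:

```c
/* Average memory access time (in cycles) from the metrics above:
   every access pays the hit time, and a fraction miss_rate of
   accesses additionally pays the miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```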
38
Caches and Performance
• Caches are extremely important for performance
  – Level 1 latency is usually 1 or 2 cycles
• Caches only work well for programs with nice locality properties
• Caching can be used in other areas as well, for example web caching (proxies)
• Modern CPUs have two or three levels of caches
  – Largest caches are tens of megabytes
• Most of the chip area is used for caches
39
Nested Multi-dim Arrays
• Dimensions are stacked consecutively using an index mapping
• Consider a square two-dimensional array of size N × N
40
Row or Column-wise Order
• If you allocate a static multi-dimensional array in C, the rows of your array will be stored consecutively
• This is called row-wise ordering
• Row-wise or row-major ordering means the column index should vary fastest: (i,j)
• Column-wise or column-major ordering means that the row index should vary fastest
  – Used in Fortran
41
Row-Major Ordering
• An (i,j) loop will give stride-1 access
• A (j,i) loop will give stride-N access

Array(i,j) -> (i*N + j)
42
Column-Major Ordering
• An (i,j) loop will give stride-N access
• A (j,i) loop will give stride-1 access

Array(i,j) -> (i + j*N)
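The two mappings side by side, as a minimal sketch (the size N = 8 is arbitrary):

```c
/* Index mappings for an N×N array:
   row-major (C):          (i,j) -> i*N + j, stride-1 in j
   column-major (Fortran): (i,j) -> i + j*N, stride-1 in i */
enum { N = 8 };  /* illustrative size */

static int row_major(int i, int j) { return i * N + j; }
static int col_major(int i, int j) { return i + j * N; }
```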
43
Dynamically Allocated Arrays
• If you use a nested array you can choose row-major or column-major via your indexing function: (i*N + j) or (i + N*j)
• For multi-level arrays there is no guarantee that the rows (the second indirection) will be stored consecutively
• You can still achieve this using some pointer arithmetic (page 92 in Oliveira)
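The pointer-arithmetic trick can be sketched as follows (a sketch, not the book's exact code; function names are illustrative): allocate the data as one contiguous block, then point each row into it.

```c
#include <stdlib.h>

/* Allocate an m×n array of doubles whose rows are guaranteed to be
   consecutive: one contiguous data block, plus a row-pointer table. */
static double **alloc2d(size_t m, size_t n) {
    double *data = malloc(m * n * sizeof *data);   /* contiguous storage */
    double **rows = malloc(m * sizeof *rows);
    if (!data || !rows) { free(data); free(rows); return NULL; }
    for (size_t i = 0; i < m; i++)
        rows[i] = data + i * n;   /* row i starts at offset i*n */
    return rows;
}

static void free2d(double **a) { if (a) { free(a[0]); free(a); } }
```

Indexing still looks like a nested array (`a[i][j]`), but the underlying storage is one row-major block, so stride-1 loops over `j` get the cache-block behavior described earlier.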
44
Data Caches, Example

double x[m][n];
register double sum = 0.0;

for( i = 0; i < m; i++ ) {
  for( j = 0; j < n; j++ ) {
    sum = sum + x[i][j];
  }
}

• Assumptions:
  1. Only one data cache
  2. A cache block contains 4 double elements
  3. The i, j, sum variables stay in registers
45
Storage Visualization, (i,j)-loop

for( i = 0; i < m; i++ ) {
  for( j = 0; j < n; j++ ) {
    sum = sum + x[i][j];
  }
}

[Figure: the m×n array drawn row by row; the inner j loop walks along a row, so only the first access to each 4-element cache block is a MISS.]
46
Storage visualization, (j,i)-loop
1 2 3 n
1
2
3
m
MISS
MISS
MISS
MISS
i
j
for( j = 0; j < m; j++ ) { for( i = 0; i < n; i++) { sum = sum + x[i][j]; }}
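The difference between the two loop orders can be checked with a crude model (a sketch under strong assumptions: 4 doubles per block and a cache that holds only the most recently touched block; the function name is illustrative):

```c
#include <stddef.h>

/* Count misses for a traversal of an m×n row-major array of doubles
   with a one-block cache: a miss whenever the touched block differs
   from the previously touched one. row_major_loop selects the (i,j)
   or the (j,i) loop order. */
static size_t count_misses(size_t m, size_t n, int row_major_loop) {
    long last_block = -1;
    size_t misses = 0;
    for (size_t a = 0; a < (row_major_loop ? m : n); a++)
        for (size_t b = 0; b < (row_major_loop ? n : m); b++) {
            size_t i = row_major_loop ? a : b;
            size_t j = row_major_loop ? b : a;
            long blk = (long)((i * n + j) / 4);  /* 4 doubles per block */
            if (blk != last_block) { misses++; last_block = blk; }
        }
    return misses;
}
```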
47
Cache Thrashing

• The start addresses of x and y might map to the same set
• Accesses to y will conflict with x
  – No data will be mapped to the other sets
  – Only one set will be used (a small part of the cache)
  – Index bits are the same for x and y
• Solution: array padding
  – Make one array larger
  – Distance between arrays will not be a power of 2
  – Same thing can happen in set-associative caches

float dotprod(float x[256], float y[256])
{
  float sum = 0.0;
  int i;
  for( i = 0; i < 256; i++ )
    sum += x[i] * y[i];
  return sum;
}

Beware of array sizes that are powers of two!
48
Array Padding

• Used to reduce thrashing
  – Especially important for multi-dimensional arrays
• Allocate more space
  – Which isn't used in computations
  – Will shift subsequent arrays to addresses that are not powers of two
• Typical padding
  – Use an odd pad size such as 13, 21, or 31 elements
  – Verify the effect experimentally