simulations of memory hierarchy lab 2: cache lab

Simulations of Memory Hierarchy

LAB 2: CACHE LAB

OVERVIEW• Objectives

• Cache Set-Up

• Command line parsing

• Least Recently Used (LRU)

• Matrix Transposition

• Cache-Friendly Code

OBJECTIVE• There are two parts to this lab:

• Part A: Cache Simulator

• Simulate a cache table using the LRU algorithm

• Part B: Optimizing Matrix Transpose

• Write “cache-friendly” code in order to optimize cache hits/misses in the implementation of a matrix transpose function

• When submitting your lab, please submit the handin.tar file as described in the instructions.

MEMORY HIERARCHY• Pick your poison: smaller, faster, and costlier, or larger,

slower, and cheaper

CACHE ADDRESSING• X-bit memory addresses (in Part A, X <= 64 bits)

• Block offset: b bits

• Set index: s bits

• Tag bits: X – b – s

• Cache is a collection of S=2^s cache sets

• Cache set is a collection of E cache lines

• E is the associativity of the cache

• If E=1, the cache is called “direct-mapped”

• Each cache line stores a block of B=2^b bytes of data

ADDRESS ANATOMY

CACHE TABLE BASICS• Conditions:

• Set size (S)

• Block size (B)

• Line size (E)

• Note that the total capacity of this cache would be S*B*E

• Blocks are the fundamental units of the cache

CACHE TABLE CORRESPONDENCE WITH ADDRESS

Example for 32 bit address

CACHE SET LOOK-UP• Determine the set index and the tag bits based on the

memory address

• Locate the corresponding cache set and determine whether or not there exists a valid cache line with a matching tag

• If a cache miss occurs:

• If there is an empty cache line, utilize it

• If the set is full then a cache line must be evicted

TYPES OF CACHE MISSES• Compulsory Miss:

• First access to a block has to be a miss

• Conflict Miss:

• Level k cache is large enough, but multiple data objects all map to the same level k block

• Capacity Miss:

• Occurs when the working set of blocks (blocks of memory being used) is larger than the cache

PART A:CACHE SIMULATION

YOUR OWN CACHE SIMULATOR• NOT a real cache

• Block offsets are NOT used but are important in understanding the concept of a cache

• s, b, and E given at runtime

FUNCTIONS TO USE FOR COMMAND LINE PARSING• int getopt(int argc, char*const* argv, const char*

options)

• See: http://www.gnu.org/software/libc/manual/html_node/Example-of-Getopt.html#Example-of-Getopt

• long long int strtoll(const char* str, char** endptr, int base)

• See: http://www.cplusplus.com/reference/cstdlib/strtoll/

LEAST RECENTLY USED (LRU) ALGORITHM

• A least recently used algorithm should be used to determine which cache lines to evict in what order

• Each cache line will need some sort of “time” field which should be update each time that cache line is referenced

• If a cache miss occurs in a full cache set, the cache line with the least relevant time field should be evicted

PART B:OPTIMIZING MATRIX TRANSPOSE

WHAT IS A MATRIX TRANSPOSITION?

• The transpose of a matrix A is denoted as AT

• The rows of AT are the columns of A, and the columns of AT are the rows of A

• Example:

GENERAL MATRIX TRANSPOSITION

CACHE-FRIENDLY CODE• In order to have fewer cache misses, you must make

good use of:

• Temporal locality: reuse the current cache block if possible (avoid conflict misses [thrashing])

• Spatial locality: reference the data of close storage locations

• Tips:

• Cache blocking

• Optimized access patterns

• Your code should look ugly if done correctly

CACHE BLOCKING• Partition the matrix in question into sub-matrices

• Divide the larger problem into smaller sub-problems

• Main idea:

• Iterate over blocks as you perform the transpose as opposed to the simplistic algorithm which goes index by index, row by row

• Determining the size of these blocks will take some amount of thought and experimentation

QUESTIONS TO PONDER• What would happen if instead of accessing each index in row

order you alternated with jumping from row to row within the same column?

• What would happen if you declared only 4 local variables as opposed to 12 local variables?

• Is it possible to get rid of the local variables all together?

• What happens when accessing elements along the diagonal?

• What happens when the program is run in a different directory?

(XKCD)

simulations of memory hierarchy lab 2: cache lab

Documents

cache lab slide

cache table

cache hitsmisses

cache simulator

cheaper slide

matrix transpose function

memory addresses

lru algorithm