data locality & its optimization techniques

DATA LOCALITY & ITS OPTIMIZATION TECHNIQUES

Presented by Preethi Rajaram

CSS 548 Introduction to Compilers
Professor Carol Zander
Fall 2012

Why?

• Processor speed: increasing at a faster rate than memory speed
• Computer architectures: more levels of cache memory
• Cache: takes advantage of data locality
• Good data locality: good application performance
• Poor data locality: reduces the effectiveness of the cache

Data Locality

• The property that the same memory location, or adjacent locations, are referenced again within a short period of time
• Temporal locality: the same location is referenced again soon
• Spatial locality: adjacent locations are referenced soon after one another

Fig: Program to find the squares of the differences (a) without loop fusion (b) with loop fusion [Image from: The Dragon book 2nd edition]
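Since the figure itself is not reproduced here, the following is a minimal Python sketch of the idea the caption describes. The function names and the sample inputs are illustrative assumptions, not taken from the figure. In the unfused version every z[i] is written in one pass and read again a full pass later; in the fused version it is reused immediately.

```python
def squares_unfused(x, y):
    """Two passes: z[i] is written in the first loop and re-read in the second."""
    n = len(x)
    z = [0.0] * n
    for i in range(n):
        z[i] = x[i] - y[i]
    for i in range(n):
        z[i] = z[i] * z[i]
    return z


def squares_fused(x, y):
    """One pass: each z[i] is squared right after it is computed."""
    n = len(x)
    z = [0.0] * n
    for i in range(n):
        z[i] = x[i] - y[i]
        z[i] = z[i] * z[i]
    return z


# Hypothetical sample data, chosen only to show both versions agree.
x = [3.0, 5.0, 8.0]
y = [1.0, 9.0, 2.0]
assert squares_unfused(x, y) == squares_fused(x, y) == [4.0, 16.0, 36.0]
```

Both versions do the same arithmetic; only the distance between the write and the later read of z[i] changes.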

Matrix Multiplication - Example

Fig: Basic Matrix Multiplication Algorithm [Image from: The Dragon book 2nd edition]

• Poor data locality
  • $N^2$ multiply-add operations separate reuses of the same data element in matrix Y
  • N operations separate reuses of the same cache line in Y
• Solutions
  • Changing the layout of the data structures
  • Blocking
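The reuse distances above can be checked with a small sketch. It assumes the usual i, j, k loop order for Z = X * Y, row-major storage, and a cache line that holds at least two neighbouring elements of a row; the trace list and the variable names are illustrative assumptions.

```python
def matmul_basic(x, y):
    """Textbook triple loop; also records which Y element each multiply-add touches."""
    n = len(x)
    z = [[0.0] * n for _ in range(n)]
    y_trace = []  # (k, j) index of the Y element used at each step
    for i in range(n):
        for j in range(n):
            for k in range(n):
                z[i][j] += x[i][k] * y[k][j]
                y_trace.append((k, j))
    return z, y_trace


n = 4
x = [[float(i + j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
z, y_trace = matmul_basic(x, identity)

first = y_trace.index((0, 0))
# Steps until the same element Y[0][0] is used again.
element_gap = y_trace.index((0, 0), first + 1) - first
# Steps until its row neighbour Y[0][1] (same cache line under row-major) is used.
line_gap = y_trace.index((0, 1)) - first

assert z == x                  # multiplying by the identity returns X
assert element_gap == n * n    # N^2 operations between reuses of an element
assert line_gap == n           # N operations between reuses of a cache line
```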

Matrix Multiplication – Example Contd…

• Changing the data structure layout
  • Store Y in column-major order
  • Improves reuse of cache lines of matrix Y
  • Limited applicability
• Blocking
  • Changes the execution order of instructions
  • Divide the matrix into submatrices, or blocks
  • Order the operations so that an entire block is used over a short period of time
  • Choose the block size B so that one block from each of the matrices fits into the cache

Image from: The Dragon book 2nd edition
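A hedged sketch of blocking follows. It is not the code from the figure: the function names, the parameter b (the block size B above), and the test matrices are assumptions made for illustration. The three outer loops pick a block from each matrix; the three inner loops do all the work on those blocks before moving on.

```python
def matmul_reference(x, y):
    """Plain triple loop, used only to check the blocked version."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_blocked(x, y, b):
    """Blocked multiply: work on b-by-b submatrices so each block is reused soon."""
    n = len(x)
    z = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # min(...) handles a final partial block when b does not divide n.
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        acc = z[i][j]
                        for k in range(kk, min(kk + b, n)):
                            acc += x[i][k] * y[k][j]
                        z[i][j] = acc
    return z


# Small integer-valued inputs so floating-point results match exactly.
n = 5
x = [[float(i * n + j) for j in range(n)] for i in range(n)]
y = [[float(i - j) for j in range(n)] for i in range(n)]
assert matmul_blocked(x, y, 2) == matmul_reference(x, y)
```

In a real setting b would be tuned so that three b-by-b blocks fit in the cache level being targeted; the value 2 here only exercises the partial-block case.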

Data Reuse

• Locality optimization
  • Identify the sets of iterations that access the same data or the same cache line
• Static access: an instruction in a program, e.g. x = z[i,j]
• Dynamic access: an execution of that instruction, which happens many times in a loop nest
• Types of reuse
  • Self: iterations using the same data come from the same static access
  • Group: iterations using the same data come from different static accesses
  • Temporal: the exact same location is referenced
  • Spatial: the same cache line is referenced

Self Temporal Reuse

• Exploiting self reuse can save a substantial number of memory accesses
• Data with k dimensions accessed in a loop nest of depth d is reused $n^{d-k}$ times
  e.g. if a 3-deep loop nest accesses one column of an array, there is a potential saving of a factor of $n^2$ accesses
• Dimensionality of an access: the rank of the matrix in the access
• Iterations referring to the same location: the null space of that matrix
• Rank of a matrix
  • Number of rows or columns that are linearly independent
• Null space of a matrix
  • A reference of rank r in a d-deep loop nest accesses $O(n^r)$ data elements in $O(n^d)$ iterations, so on average $O(n^{d-r})$ iterations must refer to the same array element

Example: rank = dimensionality = 2, because 2nd row = 1st + 3rd and 4th row = 3rd - 2 × 1st

Example: loop depth = 3, rank = 2, so nullity = 3 - 2 = 1
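The rank and nullity arithmetic can be reproduced with a short sketch. The matrix below is an assumption: it is one 4 x 3 matrix that satisfies the two row relations quoted above, chosen for illustration. The rank helper is a plain Gaussian elimination written for this example, not a library call.

```python
from fractions import Fraction


def rank(matrix):
    """Rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


# Assumed example matrix: row 2 = row 1 + row 3, row 4 = row 3 - 2 * row 1.
m = [[1, 2, 3],
     [5, 7, 9],
     [4, 5, 6],
     [2, 1, 0]]

depth = len(m[0])        # one column per loop index, so loop depth = 3
r = rank(m)              # number of linearly independent rows
nullity = depth - r      # dimension of the null space

# A direction in iteration space along which every row evaluates to 0,
# i.e. iterations that differ by this vector touch the same element.
null_vector = [1, -2, 1]
assert all(sum(a * b for a, b in zip(row, null_vector)) == 0 for row in m)
assert (r, nullity) == (2, 1)
```

With rank 2 in a 3-deep nest, the access touches on the order of $n^2$ elements over $n^3$ iterations, so each element is reused about n times.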

Self Spatial Reuse

• Depends on the data layout of the matrix, e.g. row-major order
• In an array of d dimensions, array elements share a cache line if they differ only in the last dimension
  e.g. in a 2-D array, two array elements share the same cache line if and only if they share the same row
• The truncated matrix is obtained by dropping the last row from the matrix
• If the resulting matrix has a rank r that is less than the loop depth d, spatial reuse is assured

Example: truncated matrix with r = 1 and d = 2; r < d, so spatial reuse is assured
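The truncated-matrix test can be sketched as follows, assuming row-major layout and one matrix column per loop index. The helper names and the two sample accesses are illustrative assumptions; the rank helper is a small Gaussian elimination written for this example.

```python
from fractions import Fraction


def rank(matrix):
    """Rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def has_self_spatial_reuse(access_matrix):
    """Drop the last row (the last array dimension) and compare rank with loop depth."""
    depth = len(access_matrix[0])
    truncated = access_matrix[:-1]
    return rank(truncated) < depth


# z[i, j] in a 2-deep nest: truncated matrix [[1, 0]] has rank 1 < 2.
assert has_self_spatial_reuse([[1, 0], [0, 1]])

# a[i, j, 0] in a 2-deep nest: truncated matrix is the identity, rank 2 = depth,
# so every iteration lands in a different row of the last dimension.
assert not has_self_spatial_reuse([[1, 0], [0, 1], [0, 0]])
```

The test only says that reuse exists somewhere in the iteration space; whether the innermost loop is the one that carries it is a separate question.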

Group Reuse

• Group reuse is considered only among accesses in a loop that share the same coefficient matrix

Fig: 2-deep loop nest [Image from: The Dragon book 2nd edition]

• z[i,j] and z[i-1,j] access almost the same set of array elements

• Data read by access z[i-1,j] is same as the data written by z[i,j], except for i = 1

• Rank = 2: no self-temporal reuse
• Truncated matrix has rank 1: self-spatial reuse
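The overlap between the two accesses can be made concrete with a small sketch. It assumes the loop nest has the form z[i,j] = z[i-1,j] with i and j both running from 1 to n, which matches the accesses named above; the figure itself is not reproduced, and the function name is hypothetical.

```python
def group_reuse_sets(n):
    """Collect the elements written by z[i,j] and read by z[i-1,j]."""
    written = set()
    read = set()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            written.add((i, j))    # left-hand side: z[i, j]
            read.add((i - 1, j))   # right-hand side: z[i-1, j]
    return written, read


n = 4
written, read = group_reuse_sets(n)

shared = read & written        # elements touched by both accesses
only_read = read - written     # elements read but never written

# Only row 0, which is read when i = 1, falls outside the written set.
assert only_read == {(0, j) for j in range(1, n + 1)}
assert len(shared) == n * (n - 1)
```

Each shared element is written in iteration (i, j) and read again in iteration (i + 1, j), which is n inner iterations later.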

Locality Optimization

• Temporal locality of data
  • Use the results as soon as they are generated

Fig: Code excerpt for a multigrid algorithm (a) before partition (b) after partition [Image from: The Dragon book 2nd edition]

Locality Optimization Contd…

• Array contraction
  • Reduce the dimension of the array and so reduce the number of memory locations accessed

Fig: Code excerpt for a multigrid algorithm after partition and after array contraction [Image from: The Dragon book 2nd edition]
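Array contraction can be illustrated with a generic sketch. This is not the multigrid code from the figure; the functions, the temporary t, and the arithmetic are assumptions chosen to keep the example short. Once the producer and consumer of t[i] sit in the same iteration, the temporary array can shrink to a scalar.

```python
def scale_with_temp_array(a):
    """Before contraction: a full temporary array t is filled, then consumed."""
    n = len(a)
    t = [0.0] * n                 # n extra memory locations
    out = [0.0] * n
    for i in range(n):
        t[i] = 1.0 / (1.0 + a[i])
    for i in range(n):
        out[i] = t[i] * a[i]
    return out


def scale_contracted(a):
    """After fusion and contraction: t is a scalar that lives in one iteration."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        t = 1.0 / (1.0 + a[i])    # one location instead of n
        out[i] = t * a[i]
    return out


# Hypothetical input; both versions perform identical arithmetic per element.
a = [0.0, 1.0, 3.0, 7.0]
assert scale_with_temp_array(a) == scale_contracted(a)
```

Contraction is only valid when no later code reads the temporary array, which is an assumption of this sketch.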

Locality Optimization Contd…

• Instead of executing each partition one after the other, we interleave a number of the partitions so that reuse among partitions occurs close together
• Interleaving inner loops in a parallel loop
• Interleaving statements in a parallel loop

Fig: The statement interleaving transformation [Image from: The Dragon book 2nd edition]

Fig: Interleaving four instances of the inner loop [Image from: The Dragon book 2nd edition]
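The inner-loop interleaving named in the caption can be sketched by listing the order in which iterations (i, j) run. The sketch assumes the outer loop over i is parallel, so its iterations may be reordered freely; the function names and the width parameter are illustrative assumptions.

```python
def original_order(n):
    """Run each outer iteration i to completion before starting the next."""
    return [(i, j) for i in range(n) for j in range(n)]


def interleaved_order(n, width=4):
    """Interleave `width` instances of the inner loop: for each j, run a strip of i values."""
    order = []
    for ii in range(0, n, width):                   # strips of the parallel outer loop
        for j in range(n):                          # inner loop moved outward
            for i in range(ii, min(n, ii + width)):
                order.append((i, j))
    return order


n = 8
order = interleaved_order(n)

# The same iterations run, only in a different order.
assert sorted(order) == original_order(n)
# Four outer iterations now execute their j = 0 step back to back.
assert order[:4] == [(0, 0), (1, 0), (2, 0), (3, 0)]
```

Data shared by neighbouring i iterations at the same j is now touched within a few steps instead of a full inner loop apart.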

References

• Wolf, Michael E., and Monica S. Lam. "A data locality optimizing algorithm." ACM SIGPLAN Notices 26.6 (1991): 30-44.
• McKinley, Kathryn S., Steve Carr, and Chau-Wen Tseng. "Improving data locality with loop transformations." ACM Transactions on Programming Languages and Systems (TOPLAS) 18.4 (1996): 424-453.
• Bodin, François, et al. "A quantitative algorithm for data locality optimization." Code Generation: Concepts, Tools, Techniques (1992): 119-145.
• Kennedy, Ken, and Kathryn S. McKinley. "Optimizing for parallelism and data locality." Proceedings of the 6th International Conference on Supercomputing. ACM, 1992.
• Aho, A., M. Lam, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools (2nd edition). Addison-Wesley.

Thank You!

Questions??
