Data Locality
CS 524 – High-Performance Computing
CS 524 (Au 05-06)- Asim Karim @ LUMS 2
Motivating Example: Vector Multiplication
- 1 GHz processor (1 ns clock), 4 instructions per cycle, 2 multiply-add units (4 FPUs)
- 100 ns memory latency, no cache, 1 word transfer size
- Peak performance: 4 flops / 1 ns = 4 GFLOPS
- Performance on vector multiply (dot product): every 200 ns (two word fetches), 2 floating-point operations occur
- Performance = 2 flops / 200 ns = 10 MFLOPS
Motivating Example: Vector Multiplication
- Assume a 32 K cache with 1 ns latency, 1-word cache lines, and an optimal replacement policy
- Multiply two 1 K vectors
- No. of floating-point operations = 2 K (2n, n = size of vector)
- Latency to fill cache = 2 K words * 100 ns = 200 microseconds
- Time to execute = 2 K / 4 * 1 ns = 0.5 microseconds
- Performance = 2 K / 200.5 = about 10 MFLOPS
- No increase in performance! Why?
Motivating Example: Matrix-Matrix Multiply
- Assume the same memory system characteristics as on the previous two slides
- Multiply two 32 x 32 matrices
- No. of floating-point operations = 64 K (2n^3, n = dimension of matrix)
- Latency to fill cache = 2 K words * 100 ns = 200 microseconds
- Time to execute = 64 K / 4 * 1 ns = 16 microseconds
- Performance = 64 K / 216 = about 300 MFLOPS
- Repeated use of data in cache helps improve performance: temporal locality
Motivating Example: Vector Multiply
- Assume the same memory system characteristics as on the first slide (no cache)
- Assume a 4-word transfer size (cache line size)
- Multiply two vectors
- Each access brings in four words of a vector
- In 200 ns, four words of each vector are brought in
- 8 floating-point operations can be performed in 200 ns
- Performance: 8 / 200 ns = 40 MFLOPS
- Performance increased by the effective increase in memory bandwidth: spatial locality
Locality of Reference
- Application data space is often too large to fit entirely into cache
- Key is locality of reference
- Types of locality: register locality, cache-temporal locality, cache-spatial locality
- Attempt (in order of preference) to:
  - Keep all data in registers
  - Order algorithms to do all operations on a data element before it is overwritten in cache
  - Order algorithms so that successive operations are on physically contiguous data
Single Processor Performance Enhancement
- Two fundamental issues:
  - Adequate operation-level parallelism (to keep superscalar, pipelined processor units busy)
  - Minimize memory access costs (locality in cache)
- Three useful techniques to improve locality of reference (loop transformations): loop permutation, loop blocking (tiling), loop unrolling
Access Stride and Spatial Locality
- Access stride: separation between successively accessed memory locations
- Unit access stride maximizes spatial locality (only one miss per cache line)
- 2-D arrays have different linearized representations in C and FORTRAN
- FORTRAN's column-major representation favors column-wise access of data
- C's representation favors row-wise access of data
[Figure: the 3 x 3 array
   a b c
   d e f
   g h i
stored linearly. Row-major order (C): a b c d e f g h i. Column-major order (FORTRAN): a d g b e h c f i.]
Matrix-Vector Multiply (MVM)
do i = 1, N
do j = 1, N
y(i) = y(i)+A(i,j)*x(j)
end do
end do
- Total no. of references = 4N^2 (each of the N^2 inner iterations reads and writes y(i) and reads A(i,j) and x(j))
- Two questions:
  - Can we predict the miss ratio of different variations of this program for different cache models?
  - What transformations can we do to improve performance?
[Figure: y = A x, with i indexing rows of A and elements of y, and j indexing columns of A and elements of x.]
Cache Model
- Fully associative cache (so no conflict misses)
- LRU replacement algorithm (ensures temporal locality)
- Cache size:
  - Large cache model: no capacity misses
  - Small cache model: misses depending on problem size
- Cache block size: 1 floating-point number, or b floating-point numbers
MVM: Dot-Product Form (Scenario 1)
- Loop order: i-j
- Cache model:
  - Small cache: assume cache holds fewer than (2N+1) numbers
  - Cache block size: 1 floating-point number
- Cache misses:
  - Matrix A: N^2 cold misses
  - Vector x: N cold misses + N(N-1) capacity misses
  - Vector y: N cold misses
- Miss ratio = (2N^2 + N)/4N^2 → 0.5
MVM: Dot-Product Form (Scenario 2)
- Cache model:
  - Large cache: cache holds more than (2N+1) numbers
  - Cache block size: 1 floating-point number
- Cache misses:
  - Matrix A: N^2 cold misses
  - Vector x: N cold misses
  - Vector y: N cold misses
- Miss ratio = (N^2 + 2N)/4N^2 → 0.25
MVM: Dot-Product Form (Scenario 3)
- Cache block size: b floating-point numbers
- Matrix access order: row-major (C)
- Cache misses (small cache):
  - Matrix A: N^2/b cold misses
  - Vector x: N/b cold misses + N(N-1)/b capacity misses
  - Vector y: N cold misses
  - Miss ratio = 1/2b + 1/4N → 1/2b
- Cache misses (large cache):
  - Matrix A: N^2/b cold misses
  - Vector x: N/b cold misses
  - Vector y: N/b cold misses
  - Miss ratio = (1/4 + 1/2N)*(1/b) → 1/4b
MVM: SAXPY Form
do j = 1, N
do i = 1, N
y(i) = y(i) +A(i,j)*x(j)
end do
end do
Total no. of references = 4N^2
[Figure: y = A x, with the SAXPY form traversing A down column j while updating all of y.]
MVM: SAXPY Form (Scenario 4)
- Cache block size: b floating-point numbers
- Matrix access order: row-major (C)
- Cache misses (small cache):
  - Matrix A: N^2 misses (stride-N access, so no cache line is reused)
  - Vector x: N cold misses
  - Vector y: N/b cold misses + N(N-1)/b capacity misses
  - Miss ratio = 0.25(1 + 1/b) + 1/4N → 0.25(1 + 1/b)
- Cache misses (large cache):
  - Matrix A: N^2 cold misses
  - Vector x: N/b cold misses
  - Vector y: N/b cold misses
  - Miss ratio = 1/4 + 1/2Nb → 1/4
Loop Blocking (Tiling)
Loop blocking enhances temporal locality by ensuring that data remain in cache until all operations are performed on them
- Concept:
  - Divide or tile the loop computation into chunks or blocks that fit into cache
  - Perform all operations on a block before moving to the next block
- Procedure:
  - Add outer loop(s) to tile the inner loop computation
  - Select the block size so that you have a large cache model (no capacity misses, temporal locality maximized)
MVM: Blocked
do bi = 1, N/B
  do bj = 1, N/B
    do i = (bi-1)*B+1, bi*B
      do j = (bj-1)*B+1, bj*B
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do
  end do
end do
[Figure: A tiled into B x B blocks; each block is multiplied against a length-B segment of x, updating a length-B segment of y.]
MVM: Blocked (Scenario 5)
- Cache size greater than 2B + 1 floating-point numbers
- Cache block size: b floating-point numbers
- Matrix access order: row-major (C)
- Cache misses per block:
  - Matrix A: B^2/b cold misses
  - Vector x: B/b
  - Vector y: B/b
  - Miss ratio = (1/4 + 1/2B)*(1/b) → 1/4b
- Lesson: proper loop blocking eliminates capacity misses. Performance becomes essentially independent of cache size (as long as cache size > 2B + 1 fp numbers)
Access Stride Analysis of MVM
c Dot-product form
do i = 1, N
do j = 1, N
y(i) = y(i) + A(i,j)*x(j)
end do
end do
c SAXPY form
do j = 1, N
do i = 1, N
y(i) = y(i) + A(i,j)*x(j)
end do
end do
Access stride for arrays:

                     C    Fortran
Dot-product form
  A                  1    N
  x                  1    1
  y                  0    0
SAXPY form
  A                  N    1
  x                  0    0
  y                  1    1
Loop Permutation
- Changing the order of nested loops can improve spatial locality:
  - Choose the loop ordering that minimizes the miss ratio, or
  - Choose the loop ordering that minimizes array access strides
- Loop permutation issues:
  - Validity: loop reordering should not change the meaning of the code
  - Performance: does the reordering improve performance?
  - Array bounds: what should the new array bounds be?
  - Procedure: is there a mechanical way to perform loop permutation?
- These are of primary concern in compiler design. We will discuss some of these issues, from an application software developer's perspective, when we cover dependences.
Matrix-Matrix Multiply (MMM)
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; k++)
            C[i][j] = C[i][j] + A[k][i]*B[k][j];
    }
}
Access strides (C compiler: row-major) and performance (MFLOPS):

         ijk   jik   ikj   kij   jki   kji
C(i,j)    0     0     1     1     N     N
A(k,i)    N     N     0     0     1     1
B(k,j)    N     N     1     1     0     0
P4 1.6   48    45    77    77     8     8
P3 800   45    48    55    37    10    10
suraj    3.6   3.6   4.4   4.3   4.0   4.0
Loop Unrolling
- Reduce the number of loop iterations, adding statement(s) to the loop body to do the work of the missing iterations
- Increases the amount of operation-level parallelism in the loop body
- Outer-loop unrolling changes the order of access of memory elements and can reduce the number of memory accesses and cache misses
Original loop:

do i = 1, 2n
  do j = 1, m
    loop-body(i,j)
  end do
end do

Unrolled by 2:

do i = 1, 2n, 2
  do j = 1, m
    loop-body(i,j)
    loop-body(i+1,j)
  end do
end do
MVM: Dot-Product 4-Outer Unrolled Form
/* Assume n is a multiple of 4 */
for (i = 0; i < n; i+=4) {
for (j = 0; j < n; j++) {
y[i] = y[i] + A[i][j]*x[j];
y[i+1] = y[i+1] + A[i+1][j]*x[j];
y[i+2] = y[i+2] + A[i+2][j]*x[j];
y[i+3] = y[i+3] + A[i+3][j]*x[j];
    }
}
Performance (MFLOPS):

         32 x 32   1024 x 1024
P4 1.6     23.0        8.5
P3 800     73.1       38.0
suraj       6.6       39.2
MVM: SAXPY 4-Outer Unrolled Form
/* Assume n is a multiple of 4 */
for (j = 0; j < n; j+=4) {
for (i = 0; i < n; i++) {
y[i] = y[i] + A[i][j]*x[j]
+ A[i][j+1]*x[j+1]
+ A[i][j+2]*x[j+2]
+ A[i][j+3]*x[j+3];
    }
}
Performance (MFLOPS):

         32 x 32   1024 x 1024
P4 1.6      5.0        2.1
P3 800     20.5        9.8
suraj       3.5        8.6
Summary
- Loop permutation:
  - Enhances spatial locality
  - C and FORTRAN lay out matrices differently, so the same loop order performs differently in each language
  - Performance increases when the inner loop index has the minimum access stride
- Loop blocking:
  - Enhances temporal locality
  - Ensure that 2B is less than the cache size
- Loop unrolling:
  - Enhances superscalar, pipelined operation