
Page 1: Data Locality

Data Locality

CS 524 – High-Performance Computing
Asim Karim @ LUMS (Au 05-06)

Page 2: Data Locality


Motivating Example: Vector Multiplication

1 GHz processor (1 ns clock), 4 instructions per cycle, 2 multiply-add units (4 FPUs)

100 ns memory latency, no cache, 1 word transfer size

Peak performance: 4 FLOPs / 1 ns = 4 GFLOPS

Performance on vector multiply:
  After every 200 ns (two word fetches), 2 floating-point operations occur
  Performance = 1 FLOP / 100 ns = 10 MFLOPS
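For concreteness, the kernel being analyzed is an ordinary vector multiply (dot product); a minimal C sketch, with names assumed here rather than taken from the slides:

/* Dot product of two n-element vectors: each iteration needs two words
   from memory before its multiply-add can execute, so with no cache the
   100 ns latency per word, not the FPUs, limits performance. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* 2 loads, then 1 multiply + 1 add */
    return sum;
}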

Page 3: Data Locality


Motivating Example: Vector Multiplication

Assume there is a 32 K cache with 1 ns latency, a 1-word cache line, and an optimal replacement policy

Multiply two 1 K vectors:
  No. of floating-point operations = 2 K (2n, n = size of vector)
  Latency to fill cache = 2 K × 100 ns = 200 μs
  Time to execute = 2 K / 4 × 1 ns = 0.5 μs
  Performance = 2 K / 200.5 μs ≈ 10 MFLOPS

No increase in performance! Why?

Page 4: Data Locality


Motivating Example: Matrix-Matrix Multiply

Assume the same memory system characteristics as on the previous two slides

Multiply two 32 × 32 matrices:
  No. of floating-point operations = 64 K (2n³, n = dimension of matrix)
  Latency to fill cache = 2 K × 100 ns = 200 μs
  Time to execute = 64 K / 4 × 1 ns = 16 μs
  Performance = 64 K / 216 μs ≈ 300 MFLOPS

Repeated use of data in cache helps improve performance: temporal locality
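The kernel behind these numbers is an ordinary triple-nested matrix-matrix multiply; a minimal C sketch (the standard C = C + A*B formulation, with names assumed), in which every element of A and B is reused N times once it sits in cache:

/* C = C + A*B for N x N matrices: 2*N^3 FLOPs on only 3*N^2 data,
   so each cached element of A and B is reused N times. */
#define N 32
void matmul(double C[N][N], const double A[N][N], const double B[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];   /* 2 FLOPs per iteration */
}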

Page 5: Data Locality


Motivating Example: Vector Multiply

Assume the same memory system characteristics as on the first slide (no cache)

Assume a 4-word transfer size (cache line size)

Multiply two vectors:
  Each access brings in four words of a vector
  In 200 ns, four words of each vector are brought in
  8 floating-point operations can be performed in 200 ns
  Performance: 8 / 200 ns = 40 MFLOPS

Performance increases with the effective increase in memory bandwidth: spatial locality

Page 6: Data Locality


Locality of Reference

Application data space is often too large to fit entirely into cache

Key is locality of reference

Types of locality:
  Register locality
  Cache-temporal locality
  Cache-spatial locality

Attempt (in order of preference) to:
  Keep all data in registers
  Order algorithms to do all operations on a data element before it is overwritten in cache
  Order algorithms such that successive operations are on physically contiguous data

Page 7: Data Locality


Single Processor Performance Enhancement

Two fundamental issues:
  Adequate op-level parallelism (to keep superscalar, pipelined processor units busy)
  Minimize memory access costs (locality in cache)

Three useful techniques to improve locality of reference (loop transformations):
  Loop permutation
  Loop blocking (tiling)
  Loop unrolling

Page 8: Data Locality


Access Stride and Spatial Locality

Access stride: separation between successively accessed memory locations
  Unit access stride maximizes spatial locality (only one miss per cache line)

2-D arrays have different linearized representations in C and FORTRAN
  FORTRAN’s column-major representation favors column-wise access of data
  C’s representation favors row-wise access of data

[Figure: a 3 × 3 matrix with rows (a b c), (d e f), (g h i) is linearized as a b c d e f g h i in row-major order (C) and as a d g b e h c f i in column-major order (FORTRAN).]
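To make the stride difference concrete, here is a small C sketch (array name assumed) contrasting the two traversal orders of the same row-major array:

/* C stores a[i][j] at offset i*N + j (row-major). */
#define N 1024
double a[N][N];

void row_wise(void)        /* inner loop walks a row: access stride 1 */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
}

void column_wise(void)     /* inner loop walks a column: access stride N */
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}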

Page 9: Data Locality


Matrix-Vector Multiply (MVM)

do i = 1, N
  do j = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

Total no. of references = 4N²

Two questions:
  Can we predict the miss ratio of different variations of this program for different cache models?
  What transformations can we do to improve performance?

[Figure: y = A·x, with A indexed by (i, j), x by j, and y by i.]

Page 10: Data Locality


Cache Model

Fully associative cache (so no conflict misses)

LRU replacement algorithm (ensures temporal locality)

Cache size:
  Large cache model: no capacity misses
  Small cache model: capacity misses possible, depending on problem size

Cache block size:
  1 floating-point number, or
  b floating-point numbers

Page 11: Data Locality


MVM: Dot-Product Form (Scenario 1)

Loop order: i-j

Cache model:
  Small cache: assume cache can hold fewer than (2N+1) numbers
  Cache block size: 1 floating-point number

Cache misses:
  Matrix A: N² cold misses
  Vector x: N cold misses + N(N-1) capacity misses
  Vector y: N cold misses

Miss ratio = (2N² + N)/4N² → 0.5
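As a check on the arithmetic (the same pattern applies to the later scenarios), the listed misses divided by the 4N² total references give:

\[
\text{miss ratio} = \frac{N^2 + \bigl(N + N(N-1)\bigr) + N}{4N^2}
                  = \frac{2N^2 + N}{4N^2}
                  = \frac{1}{2} + \frac{1}{4N} \longrightarrow \frac{1}{2}
\]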

Page 12: Data Locality


MVM: Dot-Product Form (Scenario 2)

Cache model:
  Large cache: cache can hold more than (2N+1) numbers
  Cache block size: 1 floating-point number

Cache misses:
  Matrix A: N² cold misses
  Vector x: N cold misses
  Vector y: N cold misses

Miss ratio = (N² + 2N)/4N² → 0.25

Page 13: Data Locality


MVM: Dot-Product Form (Scenario 3)

Cache block size: b floating-point numbers
Matrix access order: row-major (C)

Cache misses (small cache):
  Matrix A: N²/b cold misses
  Vector x: N/b cold misses + N(N-1)/b capacity misses
  Vector y: N cold misses
  Miss ratio = 1/2b + 1/4N → 1/2b

Cache misses (large cache):
  Matrix A: N²/b cold misses
  Vector x: N/b cold misses
  Vector y: N/b cold misses
  Miss ratio = (1/4 + 1/2N)·(1/b) → 1/4b

Page 14: Data Locality


MVM: SAXPY Form

do j = 1, N
  do i = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

Total no. of references = 4N²

[Figure: y = A·x, with A indexed by (i, j), y by i, and x by j.]

Page 15: Data Locality


MVM: SAXPY Form (Scenario 4)

Cache block size: b floating-point numbers
Matrix access order: row-major (C)

Cache misses (small cache):
  Matrix A: N² cold misses
  Vector x: N cold misses
  Vector y: N/b cold misses + N(N-1)/b capacity misses
  Miss ratio = 0.25(1 + 1/b) + 1/4N → 0.25(1 + 1/b)

Cache misses (large cache):
  Matrix A: N² cold misses
  Vector x: N/b cold misses
  Vector y: N/b cold misses
  Miss ratio = 1/4 + 1/2Nb → 1/4

Page 16: Data Locality


Loop Blocking (Tiling)

Loop blocking enhances temporal locality by ensuring that data remain in cache until all operations are performed on them

Concept:
  Divide or tile the loop computation into chunks or blocks that fit into cache
  Perform all operations on a block before moving to the next block

Procedure:
  Add outer loop(s) to tile the inner loop computation
  Select the block size so that you have a large cache model (no capacity misses, temporal locality maximized)

Page 17: Data Locality


MVM: Blocked

do bi = 1, N/B
  do bj = 1, N/B
    do i = (bi-1)*B+1, bi*B
      do j = (bj-1)*B+1, bj*B
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do
  end do
end do

[Figure: A partitioned into B × B blocks, with the matching B-element segments of x and y.]

Page 18: Data Locality


MVM: Blocked (Scenario 5)

Cache size: greater than 2B + 1 floating-point numbers
Cache block size: b floating-point numbers
Matrix access order: row-major (C)

Cache misses per block:
  Matrix A: B²/b cold misses
  Vector x: B/b
  Vector y: B/b
  Miss ratio = (1/4 + 1/2B)·(1/b) → 1/4b

Lesson: proper loop blocking eliminates capacity misses. Performance essentially becomes independent of cache size (as long as cache size > 2B + 1 floating-point numbers).
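A C rendering of the same blocked MVM, written as a sketch that also handles an N that is not a multiple of the block size B (function and variable names are assumed, not from the slides):

/* Blocked matrix-vector multiply: y = y + A*x, row-major A.
   All work on one B x B tile of A (and the matching pieces of x and y)
   finishes before the next tile is touched, so x and y stay in cache. */
void mvm_blocked(int n, int B, const double *A, const double *x, double *y)
{
    for (int bi = 0; bi < n; bi += B)
        for (int bj = 0; bj < n; bj += B) {
            int imax = (bi + B < n) ? bi + B : n;   /* edge blocks */
            int jmax = (bj + B < n) ? bj + B : n;
            for (int i = bi; i < imax; i++)
                for (int j = bj; j < jmax; j++)
                    y[i] += A[i*n + j] * x[j];
        }
}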

Page 19: Data Locality


Access Stride and Spatial Locality

(Repeat of Page 8.)

Page 20: Data Locality


Access Stride Analysis of MVM

c Dot-product form
do i = 1, N
  do j = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

c SAXPY form
do j = 1, N
  do i = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

Access stride for arrays:

                      C    Fortran
  Dot-product form
    A                 1    N
    x                 1    1
    y                 0    0
  SAXPY form
    A                 N    1
    x                 0    0
    y                 1    1

Page 21: Data Locality


Loop Permutation

Changing the order of nested loops can improve spatial locality
  Choose the loop ordering that minimizes the miss ratio, or
  Choose the loop ordering that minimizes array access strides

Loop permutation issues:
  Validity: loop reordering should not change the meaning of the code (see the sketch below)
  Performance: does the reordering improve performance?
  Array bounds: what should be the new array bounds?
  Procedure: is there a mechanical way to perform loop permutation?

These are of primary concern in compiler design. We will discuss some of these issues when we cover dependences, from an application software developer’s perspective.
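As an illustration of the validity issue (an example constructed here, not from the slides): in the loop nest below, iteration (i, j) reads a value written by the earlier iteration (i-1, j+1), so interchanging the i and j loops would read that element before it is updated and change the result.

/* Interchange is NOT valid here: a[i][j] depends on a[i-1][j+1],
   which the i-outer order has already computed; with j outermost,
   iteration (i-1, j+1) would run later than (i, j). */
void update(int n, double a[n][n])
{
    for (int i = 1; i < n; i++)
        for (int j = 0; j < n - 1; j++)
            a[i][j] = a[i-1][j+1] + 1.0;
}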

Page 22: Data Locality


Matrix-Matrix Multiply (MMM)

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++)
      C[i][j] = C[i][j] + A[k][i]*B[k][j];
  }
}

Access strides (C compiler: row-major) and performance (MFLOPS):

             ijk    jik    ikj    kij    jki    kji
  C(i,j)       0      0      1      1      N      N
  A(k,i)       N      N      0      0      1      1
  B(k,j)       N      N      1      1      0      0
  P4 1.6      48     45     77     77      8      8
  P3 800      45     48     55     37     10     10
  suraj      3.6    3.6    4.4    4.3    4.0    4.0
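For example, the ikj ordering from the table, which keeps every access stride at 0 or 1 and gives the best measured performance on the P4 and P3, is simply the same loop body with the loops reordered:

/* ikj order: A[k][i] is invariant in the inner loop, while C[i][j]
   and B[k][j] are accessed with unit stride (row-major C). */
for (i = 0; i < N; i++) {
  for (k = 0; k < N; k++) {
    for (j = 0; j < N; j++)
      C[i][j] = C[i][j] + A[k][i]*B[k][j];
  }
}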

Page 23: Data Locality


Loop Unrolling

Reduce the number of loop iterations, but add statement(s) to the loop body to do the work of the missing iterations

Increase the amount of operation-level parallelism in the loop body

Outer-loop unrolling changes the order in which memory elements are accessed and can reduce the number of memory accesses and cache misses

Original loop:

do i = 1, 2n
  do j = 1, m
    loop-body(i,j)
  end do
end do

Unrolled by 2:

do i = 1, 2n, 2
  do j = 1, m
    loop-body(i,j)
    loop-body(i+1,j)
  end do
end do

Page 24: Data Locality


MVM: Dot-Product 4-Outer Unrolled Form

/* Assume n is a multiple of 4 */
for (i = 0; i < n; i += 4) {
  for (j = 0; j < n; j++) {
    y[i]   = y[i]   + A[i][j]*x[j];
    y[i+1] = y[i+1] + A[i+1][j]*x[j];
    y[i+2] = y[i+2] + A[i+2][j]*x[j];
    y[i+3] = y[i+3] + A[i+3][j]*x[j];
  }
}
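The slides assume n is a multiple of 4; a sketch of how the same kernel could be written for general n, with the variable limit introduced here and a short cleanup loop for the leftover rows:

/* General n: the unrolled loop covers full groups of 4 rows,
   and a cleanup loop handles the remaining 0-3 rows. */
int limit = n - n % 4;
for (i = 0; i < limit; i += 4) {
  for (j = 0; j < n; j++) {
    y[i]   = y[i]   + A[i][j]*x[j];
    y[i+1] = y[i+1] + A[i+1][j]*x[j];
    y[i+2] = y[i+2] + A[i+2][j]*x[j];
    y[i+3] = y[i+3] + A[i+3][j]*x[j];
  }
}
for (i = limit; i < n; i++)
  for (j = 0; j < n; j++)
    y[i] = y[i] + A[i][j]*x[j];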

Performance (MFLOPS)

            32 x 32    1024 x 1024
  P4 1.6      23.0            8.5
  P3 800      73.1           38.0
  suraj        6.6           39.2

Page 25: Data Locality


MVM: SAXPY 4-Outer Unrolled Form

/* Assume n is a multiple of 4 */
for (j = 0; j < n; j += 4) {
  for (i = 0; i < n; i++) {
    y[i] = y[i] + A[i][j]*x[j]
                + A[i][j+1]*x[j+1]
                + A[i][j+2]*x[j+2]
                + A[i][j+3]*x[j+3];
  }
}

Performance (MFLOPS)

            32 x 32    1024 x 1024
  P4 1.6       5.0            2.1
  P3 800      20.5            9.8
  suraj        3.5            8.6

Page 26: Data Locality


Summary

Loop permutation:
  Enhances spatial locality
  C and FORTRAN lay out matrices differently, so the same loop ordering gives different performance
  Performance increases when the inner loop index has the minimum access stride

Loop blocking:
  Enhances temporal locality
  Ensure that 2B is less than the cache size

Loop unrolling:
  Enhances superscalar, pipelined operation