
Lec 30.1 ECE473

Memory Hierarchy: Cache Optimization

ECE473 Computer Architecture and Organization

Lecturer: Prof. Yifeng Zhu

Fall, 2015

Portions of these slides are derived from:

Dave Patterson © UCB

Lec 30.2 ECE473

Matrix Multiplication

C = A * B

Lec 30.3 ECE473

Matrix Multiplication

C = A * B

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

Lec 30.4 ECE473

Matrix-Matrix Multiply

Sequential code:

void MAT_MULT(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i][j] = 0;
            /* dot product of row i of A and column j of B */
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

C(i,j) = C(i,j) + A(i,:) * B(:,j)

Algorithm has 2*n^3 = O(n^3) flops and operates on 3*n^2 words of memory

Lec 30.5 ECE473

Note on Matrix Storage

• A matrix is a 2-D array of elements, but memory addresses are “1-D”

• Conventions for matrix layout

– by column, or “column major” (Fortran default); A(i,j) at A+i+j*n

– by row, or “row major” (C default); A(i,j) at A+i*n+j

Memory offsets of the elements of a 5 x 4 matrix:

Column major        Row major

 0  5 10 15          0  1  2  3
 1  6 11 16          4  5  6  7
 2  7 12 17          8  9 10 11
 3  8 13 18         12 13 14 15
 4  9 14 19         16 17 18 19

[Figure: column major matrix in memory. The blue row of the matrix is stored in the red cachelines.]

Figure source: Larry Carter, UCSD
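The two address formulas above can be sketched in C as follows. This is only an illustration, assuming 0-based indices and a square n-by-n matrix; the helper names `col_major_index` and `row_major_index` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) when the matrix is stored column by column:
   consecutive elements of one column are adjacent in memory. */
size_t col_major_index(size_t i, size_t j, size_t n) {
    assert(i < n && j < n);
    return i + j * n;
}

/* Offset of element (i, j) when the matrix is stored row by row:
   consecutive elements of one row are adjacent in memory. */
size_t row_major_index(size_t i, size_t j, size_t n) {
    assert(i < n && j < n);
    return i * n + j;
}
```

Walking along a row (fixed i, increasing j) advances the column major offset by n each step but the row major offset by 1, which is why the same row can be spread across many cache lines in one layout and packed into a few in the other.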

Lec 30.6 ECE473

“Naïve” Matrix Multiply

{implements C = C + A*B}

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

Access patterns in the inner loop (column major storage):

– C(i,j): reuse value from a register

– A(i,k): stride-N access to one row

– B(k,j): sequential access through entire matrix

Slide source: Larry Carter, UCSD

Lec 30.7 ECE473

“Naïve” Matrix Multiply

{implements C = C + A*B}

for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

C(i,j) = C(i,j) + A(i,:) * B(:,j)
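The read and write annotations above can be tallied to see how much data the naive loop moves. The sketch below assumes that fast memory holds only one row of A, one column of B, and one element of C at a time, so nothing else is reused; the function name `naive_slow_memory_words` is hypothetical.

```c
#include <assert.h>

/* Count the words moved between slow and fast memory by the annotated
   naive loop, following the {read ...} / {write ...} comments one by one. */
unsigned long long naive_slow_memory_words(unsigned long long n) {
    unsigned long long words = 0;
    for (unsigned long long i = 0; i < n; i++) {
        words += n;          /* read row i of A */
        for (unsigned long long j = 0; j < n; j++) {
            words += 1;      /* read C(i,j) */
            words += n;      /* read column j of B */
            words += 1;      /* write C(i,j) back */
        }
    }
    /* Closed form: n^2 for A, n^3 for B, 2*n^2 for C. */
    assert(words == n * n * n + 3 * n * n);
    return words;
}
```

Under these assumptions the loop moves n^3 + 3*n^2 words while doing 2*n^3 flops, so for large n it performs only about two flops per word moved. The n^3 term comes entirely from re-reading the columns of B, which is the traffic the later slides try to reduce.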

Lec 30.8 ECE473

Cache Friendly Multiply (better code)

Compiler can optimize the code (for row major storage):

void MAT_MULT(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double r = A[i][k];          /* reused across the whole j loop */
            for (int j = 0; j < n; j++)
                C[i][j] += r * B[k][j];  /* stride-1 access to C and B */
        }
}

C(i,j) = C(i,j) + A(i,:) * B(k,:)
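To make the loop interchange concrete, here is a minimal sketch of both orderings on flat row major arrays. It assumes C is zeroed (or holds the value to accumulate into) before the call; the function names `matmul_ijk` and `matmul_ikj` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* i-j-k order: the inner loop reads B[k*n + j] with stride n (down a column). */
void matmul_ijk(size_t n, const double *A, const double *B, double *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* i-k-j order: the inner loop reads B and updates C with stride 1 (along rows),
   and A[i*n + k] is held in a scalar for the whole inner loop. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double r = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += r * B[k * n + j];
        }
}
```

Both functions add the same products to each C element in the same order of k, so they produce identical results; only the memory access pattern differs.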

Lec 30.9 ECE473

Improving Temporal Locality: Blocked Matrix Multiplication

Lec 30.10 ECE473

“Blocked” Matrix Multiplication

[Figure: C = A * B, with row index i, column index j, and a cache block marked]

Key idea: reuse the other elements in each cache block as much as possible

Lec 30.11 ECE473


[Figure: c[i][j] and c[i][j+1] = A * B, with row index i, column index j, and a cache block of b elements marked]

Since column j+1 of B is loaded into the cache lines anyway, compute c[i][j+1] as well.

Reorder the operations:

• compute the first b terms of c[i][j], compute the first b terms of c[i][j+1]

• compute the next b terms of c[i][j], compute the next b terms of c[i][j+1]

• .....
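The reordering above can be sketched for a single pair of adjacent elements. This is an illustration only, assuming row major flat arrays and a chunk length b chosen by the caller; the function name `two_columns_chunked` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate c[i][j] and c[i][j+1] together, b terms at a time, so that the
   rows of B touched for column j are reused for column j+1 while still cached. */
void two_columns_chunked(size_t n, size_t b, const double *A, const double *B,
                         double *C, size_t i, size_t j) {
    assert(b > 0 && i < n && j + 1 < n);
    for (size_t k0 = 0; k0 < n; k0 += b) {
        size_t kend = (k0 + b < n) ? k0 + b : n;   /* last chunk may be short */
        for (size_t k = k0; k < kend; k++)         /* next b terms of c[i][j] */
            C[i * n + j] += A[i * n + k] * B[k * n + j];
        for (size_t k = k0; k < kend; k++)         /* next b terms of c[i][j+1] */
            C[i * n + j + 1] += A[i * n + k] * B[k * n + j + 1];
    }
}
```

The final values are the same as computing each dot product in one pass; only the interleaving of the partial sums changes.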

Lec 30.12 ECE473


[Figure: C = A * B, with row index i, column index j, and a cache block marked]

Compute a whole subrow of C, with the same reordering of the operations.

But then one has to load all columns of B, which one has to do again for computing the next row of C.

Idea: reuse the blocks of B that we have just loaded.

Lec 30.13 ECE473


[Figure: C = A * B, with row index i, column index j, and a cache block marked]

Order of the operations:

• Compute the first b terms of all cij values in the C block

• Compute the next b terms of all cij values in the C block

• . . .

• Compute the last b terms of all cij values in the C block

Lec 30.14 ECE473


Lec 30.15 ECE473


N = 4 * b

C11 C12 C13 C14     A11 A12 A13 A14     B11 B12 B13 B14
C21 C22 C23 C24  =  A21 A22 A23 A24  *  B21 B22 B23 B24
C31 C32 C33 C34     A31 A32 A33 A34     B31 B32 B33 B34
C41 C42 C43 C44     A41 A42 A43 A44     B41 B42 B43 B44

C22 = A21*B12 + A22*B22 + A23*B32 + A24*B42

4 matrix multiplications

4 matrix additions

Main Point: each multiplication operates on small “block” matrices, whose size may be chosen so that they fit in the cache.

Lec 30.16 ECE473

Blocked Algorithm

• The blocked version of the i-j-k algorithm is written simply as

for (i=0; i<N/b; i++)
  for (j=0; j<N/b; j++)
    for (k=0; k<N/b; k++)
      C[i][j] += A[i][k]*B[k][j]

– b is the block size (which we assume divides N)

– X[i][j] is the block of matrix X on block row i and block column j

– “+=” means matrix addition

– “*” means matrix multiplication
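The block-level loop above can be expanded to element level roughly as follows. This is a minimal sketch, assuming row major flat arrays, a block size (called b here to avoid a clash with matrix B) that divides n, and an i-k-j order inside each block; the function name `matmul_blocked` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n-by-n row major matrices, processed in b-by-b blocks. */
void matmul_blocked(size_t n, size_t b, const double *A, const double *B,
                    double *C) {
    assert(b > 0 && n % b == 0);   /* block size must divide n */
    /* Outer three loops walk over blocks: (ii, jj, kk) are block origins. */
    for (size_t ii = 0; ii < n; ii += b)
        for (size_t jj = 0; jj < n; jj += b)
            for (size_t kk = 0; kk < n; kk += b)
                /* Block update: C block (ii,jj) += A block (ii,kk) * B block (kk,jj). */
                for (size_t i = ii; i < ii + b; i++)
                    for (size_t k = kk; k < kk + b; k++) {
                        double r = A[i * n + k];
                        for (size_t j = jj; j < jj + b; j++)
                            C[i * n + j] += r * B[k * n + j];
                    }
}
```

Choosing b so that three b-by-b blocks fit in the cache together lets each loaded block be reused b times before it is evicted. With b = n the code reduces to the unblocked i-k-j loop.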
