Lec 30.1 ECE473

Memory Hierarchy: Cache Optimization

ECE473 Computer Architecture and Organization

Lecturer: Prof. Yifeng Zhu

Fall, 2015

Portions of these slides are derived from: Dave Patterson © UCB



Lec 30.2 ECE473

Matrix Multiplication

[Figure: C = A * B]

Lec 30.3 ECE473

Matrix Multiplication

[Figure: C = A * B; entry c_ij is formed from row i of A (a_ik, k = 1...n) and column j of B (b_kj, k = 1...n).]

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

Lec 30.4 ECE473

Matrix-Matrix Multiply

Sequential code:

/* C = A * B for n x n matrices stored as C arrays (row major) */
void MAT_MULT(int n, double A[n][n], double B[n][n], double C[n][n]) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      C[i][j] = 0;
      for (int k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
    }
}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Algorithm has 2*n^3 = O(n^3) flops and operates on 3*n^2 words of memory
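The 2*n^3 count can be checked by instrumenting the i-j-k loop with a counter. The sketch below is illustrative only: the function name `mat_mult_count_flops` and the flat row-major `double` arrays are assumptions, not part of the slides.

```c
#include <stddef.h>

/* Hypothetical helper (not from the slides): multiplies two n x n
 * row-major matrices with the i-j-k loop order and returns the number
 * of floating-point operations it performed. */
unsigned long mat_mult_count_flops(size_t n, const double *A,
                                   const double *B, double *C)
{
    unsigned long flops = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j]; /* 1 multiply + 1 add */
                flops += 2;
            }
            C[i * n + j] = sum;
        }
    }
    return flops; /* the inner statement runs n*n*n times, so 2*n^3 */
}
```

The three matrices together hold 3*n^2 words, so each word is used about 2n/3 times on average; that reuse is what the cache optimizations on the following slides try to exploit.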

Lec 30.5 ECE473

Note on Matrix Storage

• A matrix is a 2-D array of elements, but memory addresses are “1-D”

• Conventions for matrix layout

– by column, or “column major” (Fortran default); A(i,j) at A+i+j*n

– by row, or “row major” (C default); A(i,j) at A+i*n+j

Column major          Row major
 0  5 10 15            0  1  2  3
 1  6 11 16            4  5  6  7
 2  7 12 17            8  9 10 11
 3  8 13 18           12 13 14 15
 4  9 14 19           16 17 18 19

[Figure: column major matrix in memory, with cachelines marked. Blue row of matrix is stored in red cachelines.]

Figure source: Larry Carter, UCSD
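The two address formulas can be written as small C helpers. This is a minimal sketch: the function names are hypothetical, and offsets are counted in elements from the start of an n x n matrix, as in the formulas on this slide.

```c
#include <stddef.h>

/* Column major (Fortran default): A(i,j) is at A + i + j*n,
 * so moving down a column advances the offset by 1. */
size_t col_major_offset(size_t i, size_t j, size_t n)
{
    return i + j * n;
}

/* Row major (C default): A(i,j) is at A + i*n + j,
 * so moving along a row advances the offset by 1. */
size_t row_major_offset(size_t i, size_t j, size_t n)
{
    return i * n + j;
}
```

For example, with n = 4 the element (1,2) sits at offset 9 in column major order and at offset 6 in row major order. Walking along a row of a column major matrix changes the offset by n at every step, which is why one row is spread over many cachelines.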

Lec 30.6 ECE473

“Naïve” Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

Sequential access through entire matrix

Stride-N access to one row*

Reuse value from a register

Slide source: Larry Carter, UCSD
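The cost of the stride-N access can be made concrete by counting how many distinct cache lines one pass of the inner k loop touches. The sketch below assumes the column major layout from the storage slide, so that A(i,k) moves by n elements per step of k while B(k,j) moves by 1. The helper name `lines_touched` and the cache line size `line_elems` (in matrix elements) are assumptions for illustration.

```c
#include <stddef.h>

/* Counts the distinct cache lines touched by `count` accesses that
 * start at element offset `start` and advance by `stride` elements.
 * A cache line is assumed to hold `line_elems` consecutive elements.
 * Offsets only increase, so a new line is detected by comparing with
 * the previous access. */
size_t lines_touched(size_t start, size_t stride, size_t count,
                     size_t line_elems)
{
    size_t lines = 0;
    size_t prev_line = (size_t)-1; /* sentinel: no line seen yet */
    for (size_t t = 0; t < count; t++) {
        size_t line = (start + t * stride) / line_elems;
        if (line != prev_line) {
            lines++;
            prev_line = line;
        }
    }
    return lines;
}
```

With n = 8 and 4 elements per line, reading one row of A (stride 8) touches 8 lines, while reading one column of B (stride 1) touches only 2.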

Lec 30.7 ECE473

“Naïve” Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
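Following the read and write annotations literally gives the slow-memory traffic of the naive loop. The sketch below simply tallies the words moved; it assumes fast memory can hold one row of A, one column of B, and one element of C at a time. The function name is hypothetical.

```c
/* Words moved between slow and fast memory by the annotated naive
 * loop, counted one annotation at a time. */
unsigned long naive_slow_memory_words(unsigned long n)
{
    unsigned long words = 0;
    for (unsigned long i = 0; i < n; i++) {
        words += n;         /* read row i of A */
        for (unsigned long j = 0; j < n; j++) {
            words += 1;     /* read C(i,j) */
            words += n;     /* read column j of B */
            /* the k loop itself runs entirely in fast memory */
            words += 1;     /* write C(i,j) back */
        }
    }
    return words;           /* n^3 + 3*n^2 in total */
}
```

The n^3 term comes from re-reading every column of B for every row of A, so for large n the traffic is dominated by B.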

Lec 30.8 ECE473

Cache Friendly Multiply (better code)

Compiler can optimize the code (for row major storage):

/* Computes C = C + A * B; C must be zeroed first to get C = A * B */
void MAT_MULT(int n, double A[n][n], double B[n][n], double C[n][n]) {
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++) {
      double r = A[i][k];   /* reused for the whole j loop */
      for (int j = 0; j < n; j++)
        C[i][j] += r * B[k][j];
    }
}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(k,:)]
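The i-k-j order computes the same product as the i-j-k order; only the order of the additions changes. The sketch below puts both side by side on flat row-major arrays so the results can be compared. The function names and the flat-array interface are assumptions for illustration.

```c
#include <stddef.h>

/* i-j-k order: the inner loop walks down a column of B (stride n). */
void mat_mult_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* i-k-j order: the inner loop walks along row k of B and row i of C,
 * both with stride 1 in row major storage. */
void mat_mult_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t t = 0; t < n * n; t++)
        C[t] = 0.0;                      /* start from C = 0 */
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double r = A[i * n + k];     /* loaded once per (i, k) */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += r * B[k * n + j];
        }
}
```

Because floating-point addition is not associative, the two versions can differ in the last bits for general inputs; they agree exactly when all intermediate values are exactly representable, as with small integers.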

Lec 30.9 ECE473

Improving Temporal Locality: Blocked Matrix Multiplication

Lec 30.10 ECE473

“Blocked” Matrix Multiplication

[Figure: C = A * B, with row index i, column index j, and one cache block marked.]

Key idea: reuse the other elements in each cache block as much as possible

Lec 30.11 ECE473

“Blocked” Matrix Multiplication

[Figure: C = A * B, with c[i][j] and c[i][j+1] marked in C, row index i, column index j, and cache blocks of b elements.]

Since one loads column j+1 of B in the cache lines anyway, compute c[i][j+1] as well.

Reorder the operations:

• compute the first b terms of c[i][j], compute the first b terms of c[i][j+1]

• compute the next b terms of c[i][j], compute the next b terms of c[i][j+1]

.....

Lec 30.12 ECE473

“Blocked” Matrix Multiplication

[Figure: C = A * B, with row index i, column index j, and one cache block marked.]

Compute a whole subrow of C, with the same reordering of the operations.

But then one has to load all columns of B, which one has to do again for computing the next row of C.

Idea: reuse the blocks of B that we have just loaded.

Lec 30.13 ECE473

“Blocked” Matrix Multiplication

[Figure: C = A * B, with row index i, column index j, and one cache block marked.]

Order of the operations:

Compute the first b terms of all cij values in the C block

Compute the next b terms of all cij values in the C block

. . .

Compute the last b terms of all cij values in the C block
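The ordering above can be sketched for a single block of C as follows. This is an illustration under stated assumptions, not the lecture's code: matrices are flat row-major arrays, b divides n, the block of C starts at row `i0` and column `j0`, and the block already holds the values to accumulate into (zeros for a plain product). The name `update_c_block` is hypothetical.

```c
#include <stddef.h>

/* Updates the b x b block of C whose top-left corner is (i0, j0).
 * Each pass of the outer loop adds b more terms of the dot product
 * to every c[i][j] in the block before moving on to the next b terms. */
void update_c_block(size_t n, size_t b, size_t i0, size_t j0,
                    const double *A, const double *B, double *C)
{
    for (size_t k0 = 0; k0 < n; k0 += b)          /* next b terms */
        for (size_t i = i0; i < i0 + b; i++)
            for (size_t j = j0; j < j0 + b; j++) {
                double sum = C[i * n + j];
                for (size_t k = k0; k < k0 + b; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
}
```

During one pass, only a b x b piece of A and a b x b piece of B are touched, so both can stay in the cache while every element of the C block reuses them.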

Lec 30.14 ECE473

“Blocked” Matrix Multiplication

Lec 30.15 ECE473

“Blocked” Matrix Multiplication

C11 C12 C13 C14     A11 A12 A13 A14     B11 B12 B13 B14
C21 C22 C23 C24  =  A21 A22 A23 A24  *  B21 B22 B23 B24
C31 C32 C33 C34     A31 A32 A33 A34     B31 B32 B33 B34
C41 C42 C43 C44     A41 A42 A43 A44     B41 B42 B43 B44

N = 4 * b

C22 = A21B12 + A22B22 + A23B32 + A24B42

4 matrix multiplications

4 matrix additions

Main Point: each multiplication operates on small “block” matrices, whose size may be chosen so that they fit in the cache.
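The C22 example is one instance of a general rule. Assuming the 4 x 4 partition into b x b blocks shown on this slide (N = 4b), every block of C can be written as:

```latex
C_{ij} = \sum_{k=1}^{4} A_{ik} B_{kj}, \qquad i, j \in \{1, 2, 3, 4\}
```

Setting i = j = 2 gives back the C22 formula term by term. Each product A_ik B_kj is itself a multiplication of two b x b matrices.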

Lec 30.16 ECE473

Blocked Algorithm

• The blocked version of the i-j-k algorithm is written simply as

for (i=0;i<N/b;i++)
  for (j=0;j<N/b;j++)
    for (k=0;k<N/b;k++)
      C[i][j] += A[i][k]*B[k][j]

– b is the block size (which we assume divides N)

– X[i][j] is the block of matrix X on block row i and block column j

– “+=” means matrix addition

– “*” means matrix multiplication
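Expanding the block operations into element loops gives a six-loop version. The sketch below is one possible expansion, not the lecture's own code: it assumes flat row-major arrays, a block size that divides n, and an i-k-j order inside each block. The block size is called `bs` here to keep it distinct from matrix `B`, and the function name is hypothetical.

```c
#include <stddef.h>

/* Blocked C = C + A * B for n x n row-major matrices.
 * Assumes bs divides n. The three outer loops step over blocks;
 * the three inner loops perform one bs x bs block multiply-add. */
void blocked_mat_mult(size_t n, size_t bs,
                      const double *A, const double *B, double *C)
{
    for (size_t ib = 0; ib < n; ib += bs)
        for (size_t jb = 0; jb < n; jb += bs)
            for (size_t kb = 0; kb < n; kb += bs)
                /* block update: C[ib][jb] += A[ib][kb] * B[kb][jb] */
                for (size_t i = ib; i < ib + bs; i++)
                    for (size_t k = kb; k < kb + bs; k++) {
                        double r = A[i * n + k];
                        for (size_t j = jb; j < jb + bs; j++)
                            C[i * n + j] += r * B[k * n + j];
                    }
}
```

Because the routine accumulates into C, the caller zeroes C first to obtain a plain product. A reasonable starting point is to pick `bs` so that three bs x bs blocks fit in the cache together, then tune by measurement.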