“matrix multiply ― in parallel”
Joe Hummel, PhD
U. of Illinois, Chicago
Loyola University Chicago
Class: “Introduction to CS for Engineers”
Lang: C/C++
Focus: programming basics, vectors, matrices
Timing: present this after introducing 2D arrays…
Background…
Yes, it’s boring, but…
◦ everyone understands the problem
◦ good example of triply-nested loops
◦ non-trivial computation
Matrix multiply
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);
1500x1500 matrix:
2.25M elements » 32 seconds…
Matrix multiply is a great candidate for multicore
◦ embarrassingly parallel
◦ easy to parallelize via the outermost loop
Multicore
#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);
Cores
1500x1500 matrix:
Quad-core CPU » 8 seconds…
Parallelism alone is not enough…
Designing for HPC
HPC == Parallelism + Memory Hierarchy ─ Contention
Expose parallelism
Maximize data locality:
• network
• disk
• RAM
• cache
• core
Minimize interaction:
• false sharing
• locking
• synchronization
What’s the other half of the chip?
Implications?
◦ No one implements MM this way
◦ Rewrite to use loop interchange, and access B row-wise…
Data locality
Cache!
#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += (A[i][k] * B[k][j]);
1500x1500 matrix:
Quad-core + cache » 2 seconds…