A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION
Austin Benson ([email protected]), ICME, Stanford
Grey Ballard, Sandia National Laboratories
Sandia National Laboratories, October 21, 2014
arXiv: 1409.2908
Fast matrix multiplication: bridging theory and practice
• There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently. [Smirnov13], [Benson&Ballard14]
• We show that they can achieve higher performance than Intel’s Math Kernel Library (MKL)
• We use code generation for extensive prototyping.
(Figure: the exponent of matrix multiplication — 3 classically, 2.81 [Strassen69], 2.37 [Williams12].)
Strassen’s algorithm
Key ingredients of Strassen’s algorithm
• 1. Block partitioning of matrices (<2, 2, 2>)
• 2. Seven linear combinations of sub-blocks of A
• 3. Seven linear combinations of sub-blocks of B
• 4. Seven matrix multiplies to form the Mr (recursive)
• 5. Linear combinations of the Mr to form the Cij
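The five steps above can be sketched in a few lines of numpy — a minimal one-level version of Strassen, not the tuned generated code described later in this talk:

```python
import numpy as np

def strassen_one_level(A, B):
    # One recursion level of Strassen's <2, 2, 2> algorithm.
    # Assumes square matrices with even dimension.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Seven multiplies on linear combinations of the sub-blocks.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Linear combinations of the Mr form the output blocks.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```

A recursive implementation would call `strassen_one_level` (instead of `@`) on the seven sub-multiplies.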
Key ingredients of fast matmul algorithms
• 1. Block partitioning of matrices (<M, K, N>)
• 2. R linear combinations of sub-blocks of A
• 3. R linear combinations of sub-blocks of B
• 4. R matrix multiplies to form the Mr (recursive); R < MKN means faster than classical
• 5. Linear combinations of the Mr to form the Cij
Main idea: trade O(N^3) computation (matmul) for more O(N^2) computation (matrix additions)
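The trade-off fixes the asymptotic exponent: composing the three cyclic permutations of an <M, K, N> algorithm with R multiplies gives a square algorithm, and recursion yields exponent 3·log(R)/log(MKN). A quick sanity check using only values from these slides:

```python
import math

def exponent(m, k, n, r):
    # Asymptotic exponent of an <m, k, n> algorithm with r multiplies:
    # composing the three permutations gives a square <mkn, mkn, mkn>
    # algorithm with r^3 multiplies, hence exponent 3*log(r)/log(mkn).
    return 3 * math.log(r) / math.log(m * k * n)

print(exponent(2, 2, 2, 7))    # Strassen: log2(7), about 2.81
print(exponent(4, 2, 4, 26))   # the <4, 2, 4> algorithm below, about 2.82
```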
“Outer product” fast algorithm
• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32): a 23% speedup per recursive step (if everything else were free)
• Linear combinations of the Aij to form Sr: 68 terms
• Linear combinations of the Bij to form Tr: 52 terms
• Linear combinations of the Mr to form Cij: 69 terms
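Where the 23% figure comes from: one recursive step replaces the classical 32 block multiplies with 26, so ignoring the additions the step is 32/26 times faster.

```python
# Per-step speedup of the <4, 2, 4> algorithm if additions were free.
classical = 4 * 2 * 4          # 32 block multiplies
fast = 26                      # R = 26 multiplies
speedup = classical / fast
print(round(100 * (speedup - 1)))   # -> 23 (percent)
```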
Discovering fast algorithms is a numerical challenge
• Low-rank tensor decompositions lead to fast algorithms
• The tensors are small, but we need exact decompositions, which is NP-hard
• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]
• We have around 10 fast algorithms for <M, K, N> partitions. Also have permutations, e.g., <K, M, N>.
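The alternating-least-squares step can be sketched as follows — a toy numpy version of generic regularized CP-ALS, not the authors' actual search code; `als_sweep`, `khatri_rao`, and the value of `lam` are illustrative:

```python
import numpy as np

def khatri_rao(U, V):
    # Column-wise Kronecker product: column r is kron(U[:, r], V[:, r]).
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def als_sweep(T, factors, lam=1e-3):
    # One sweep of regularized ALS for the CP decomposition
    # T ~ sum_r a_r (x) b_r (x) c_r.  Regularization keeps the normal
    # equations well conditioned; exact decompositions additionally
    # need the rounding tricks cited above.
    R = factors[0].shape[1]
    for mode in range(3):
        U, V = [factors[m] for m in range(3) if m != mode]
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        G = (U.T @ U) * (V.T @ V)            # Gram matrix of the KR product
        rhs = unfolding @ khatri_rao(U, V)
        factors[mode] = rhs @ np.linalg.inv(G + lam * np.eye(R))
    return factors
```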
(Figure: algorithms labeled [Smirnov13] and [Strassen69].)
Code generation lets us prototype algorithms quickly
• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (Sr, Tr)
  3. linear combinations of the Mr to form the Cij
• We use code generation to rapidly prototype fast algorithms
• Our approach: test all algorithms on a bunch of different problem sizes and look for patterns
Sequential performance

Effective GFLOPS for an M x P x Q multiply = 1e-9 * 2 * M * P * Q / (time in seconds)
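The metric above charges every algorithm the classical flop count, so a fast algorithm that finishes sooner can report more effective GFLOPS than the machine's true peak. As a helper:

```python
def effective_gflops(m, p, q, seconds):
    # Effective GFLOPS: the classical 2*M*P*Q flop count divided by the
    # measured time, regardless of how many flops the algorithm did.
    return 1e-9 * 2 * m * p * q / seconds

# A 1000 x 1000 x 1000 multiply finishing in 0.1 s:
print(effective_gflops(1000, 1000, 1000, 0.1))   # -> 20.0 effective GFLOPS
```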
Sequential performance
• All algorithms beat MKL on large problems
• Strassen’s algorithm is hard to beat
Sequential performance

• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best
Sequential performance

• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best
Practical issues, or what is not obvious when
…you just think about the theory
…your performance models are too simple
Practical issue #1: when to stop recursion? Look at the gemm curve.
Basic idea: take another recursive step if the sub-problems will still operate at high performance
<M, K, N> = <4, 2, 3>
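The stopping rule can be sketched as a simple predicate — `should_recurse` is a hypothetical name, and the cutoff of 1000 is a made-up placeholder that a real implementation would read off the machine's dgemm performance curve:

```python
def should_recurse(M, K, N, m=4, k=2, n=3, cutoff=1000):
    # Take another <m, k, n> recursive step only if every sub-problem
    # dimension stays above a cutoff at which the base-case gemm still
    # runs near peak performance.
    return M // m >= cutoff and K // k >= cutoff and N // n >= cutoff
```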
Practical issue #2: matrix additions
(Diagram: the blocks A11, A12, A21, A22 feed the linear combinations S1 through S7.)

“Pairwise”: each Sr is built with a sequence of DAXPY calls (Y := alpha*X + Y), e.g. two DAXPYs per two-term Sr.
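The pairwise strategy amounts to accumulating each Sr one term at a time — a minimal numpy sketch, with `+=` standing in for a BLAS DAXPY call:

```python
import numpy as np

def pairwise_s(blocks, coeffs):
    # "Pairwise": accumulate S with one DAXPY-style update per term,
    # S := alpha * X + S.  Simple, but S is re-read and re-written
    # once for every term.
    S = np.zeros_like(blocks[0])
    for alpha, X in zip(coeffs, blocks):
        S += alpha * X
    return S

# e.g. Strassen's S6 = A21 - A11 would be pairwise_s([A21, A11], [1.0, -1.0])
```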
Practical issue #2: matrix additions

(Diagram: the blocks A11, A12, A21, A22 feed the linear combinations S1 through S7.)

“Write once”: each Sr is formed by a single custom “DAXPY”-like kernel, so every output entry is written exactly once.
Practical issue #2: matrix additions

(Diagram: the blocks A11, A12, A21, A22 feed the linear combinations S1 through S7.)

“Streaming”: entry-wise updates; a single pass over the inputs updates all of the Sr together.
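The streaming variant can be expressed as one fused contraction over all the Sr at once — a sketch using Strassen's S-coefficients, with einsum standing in for the generated fused loop nest:

```python
import numpy as np

# Strassen's S-coefficients: row r gives Sr in terms of (A11, A12, A21, A22).
U = np.array([
    [ 1, 0, 0, 1],    # S1 = A11 + A22
    [ 0, 0, 1, 1],    # S2 = A21 + A22
    [ 1, 0, 0, 0],    # S3 = A11
    [ 0, 0, 0, 1],    # S4 = A22
    [ 1, 1, 0, 0],    # S5 = A11 + A12
    [-1, 0, 1, 0],    # S6 = A21 - A11
    [ 0, 1, 0, -1],   # S7 = A12 - A22
], dtype=float)

def streaming_s(A11, A12, A21, A22):
    # "Streaming": each input entry is read once and every Sr is
    # updated entry-wise in a single pass.
    blocks = np.stack([A11, A12, A21, A22])      # shape (4, n, n)
    return np.einsum('rb,bij->rij', U, blocks)   # shape (7, n, n)
```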
Practical issue #3: redundant linear combinations
• Example in <4, 2, 4> algorithm (R = 26 multiplies):
(Diagram: T11 and T25 are linear combinations of B12, B22, B23, B24 that share a common subexpression.)

Four additions, six reads, two writes
Common subexpression elimination
• Example in <4, 2, 4> algorithm (R = 26 multiplies):
(Diagram: the shared combination of sub-blocks of B is computed once into a temporary Y and reused for T11 and T25.)

Three additions, six reads, three writes: a net increase in communication!
Common subexpression elimination (CSE) does not really help
Practical issue #4: parallelization on shared memory
Fast matmul recursion tree
(Diagram: the recursion tree for fast matmul. C is at the root; each node spawns subproblems M1, M2, …, M7, whose results are combined by additions.)
DFS Parallelization

(Diagram: the recursion tree is traversed depth-first; all threads work on one node at a time, using parallel MKL at the base cases.)

+ Easy to implement
+ Load balanced
+ Same memory footprint as sequential
- Need large base cases for high performance
BFS Parallelization

(Diagram: independent subtrees run as OpenMP tasks, one thread each, synchronized with omp taskwait.)

+ High performance for smaller base cases
- Sometimes harder to load balance: 24 threads, 49 subproblems
- More memory
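The BFS scheme can be sketched with a Python thread pool standing in for OpenMP tasks (a hedged analogue, not the talk's C++/OpenMP code): the seven subproblems are submitted as independent tasks, and gathering the results plays the role of omp taskwait.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def bfs_strassen(A, B, pool):
    # One BFS level of Strassen: seven independent tasks, then a join.
    # Assumes square matrices with even dimension; each task runs a
    # (single-threaded) base-case gemm.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    tasks = [
        (A11 + A22, B11 + B22), (A21 + A22, B11), (A11, B12 - B22),
        (A22, B21 - B11), (A11 + A12, B22), (A21 - A11, B11 + B12),
        (A12 - A22, B21 + B22),
    ]
    futures = [pool.submit(np.matmul, S, T) for S, T in tasks]
    M1, M2, M3, M4, M5, M6, M7 = (f.result() for f in futures)  # "taskwait"
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```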
HYBRID parallelization

(Diagram: most subproblems run as single-threaded tasks; the leftover subproblem uses all threads, with omp taskwait in between.)

+ Better load balancing
- Explicit synchronization, or else we can over-subscribe threads
Bandwidth problems

• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write performance on 24 cores
Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: can sometimes beat MKL, but barely
Parallel performance (plot annotation: bad MKL performance)

• 6 cores: similar performance to sequential
• 24 cores: MKL best for large problems
Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: MKL usually the best
High-level conclusions

• For square matrix multiplication, Strassen’s algorithm is hard to beat
• For rectangular matrix multiplication, use a fast algorithm that “matches the shape”
• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; this should be less of an issue in distributed memory

Future work:
• Numerical stability
• Using fast matmul as a kernel for other algorithms in numerical linear algebra