EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT
JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH
11/20/2014
2 | EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT | NOVEMBER 20, 2014
THIS TALK IN ONE SLIDE
Demonstrate how to save space and time by using the CSR format for GPU-based SpMV
Optimized GPU-based SpMV algorithm that increases memory coalescing by using LDS
14.7x faster than other CSR-based algorithms
2.3x faster than using other matrix formats
Background
SPARSE MATRIX-VECTOR MULTIPLICATION
Traditionally Bandwidth Limited
1.0 - 2.0 - 3.0
- 4.0 - 5.0 -
- - 6.0 - -
7.0 - - 8.0 -
- 9.0 - - -
× (1 2 3 4 5) = (22 28 18 39 18)
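A rough way to see why SpMV is traditionally bandwidth limited is to compare floating-point work with bytes of matrix data moved. The sketch below assumes double-precision values (8 bytes) and 32-bit column indices (4 bytes); these sizes and the helper name `flops_per_byte` are illustrative assumptions, not figures from the talk, and traffic for the vectors and row delimiters is ignored.

```python
# Rough arithmetic-intensity estimate for CSR SpMV.
# Assumptions (not from the slides): 8-byte values, 4-byte column indices.
VALUE_BYTES = 8
INDEX_BYTES = 4


def flops_per_byte(nnz: int) -> float:
    """Flops per byte of matrix data streamed, ignoring vector traffic."""
    flops = 2 * nnz  # one multiply and one add per non-zero
    matrix_bytes = nnz * (VALUE_BYTES + INDEX_BYTES)
    return flops / matrix_bytes


intensity = flops_per_byte(9)  # the 9 non-zeros of the example matrix
print(f"{intensity:.3f} flops per byte")
```

Under these assumptions each byte fetched supports well under one flop, so memory throughput rather than arithmetic is likely to set the pace.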
COMPRESSED SPARSE ROW (CSR)
1.0 - 2.0 - 3.0
- 4.0 - 5.0 -
- - 6.0 - -
7.0 - - 8.0 -
- 9.0 - - -
values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
columns: 0 2 4 1 3 2 0 3 1
row delimiters: 0 3 5 6 8 9
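The three CSR arrays can be rebuilt from the example matrix with a short serial sketch. The function names `dense_to_csr` and `csr_spmv` are hypothetical, a `-` on the slide is treated as 0.0, and the input vector (1 2 3 4 5) is the one from the SpMV example.

```python
def dense_to_csr(dense):
    """Build the values, columns and row-delimiter arrays of a dense matrix."""
    values, columns, row_delimiters = [], [], [0]
    for row in dense:
        for col, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                columns.append(col)
        # Each delimiter marks where the next row starts in `values`.
        row_delimiters.append(len(values))
    return values, columns, row_delimiters


def csr_spmv(values, columns, row_delimiters, x):
    """Serial reference SpMV: y = A * x with A stored in CSR."""
    y = []
    for row in range(len(row_delimiters) - 1):
        acc = 0.0
        for k in range(row_delimiters[row], row_delimiters[row + 1]):
            acc += values[k] * x[columns[k]]
        y.append(acc)
    return y


A = [
    [1.0, 0.0, 2.0, 0.0, 3.0],
    [0.0, 4.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 6.0, 0.0, 0.0],
    [7.0, 0.0, 0.0, 8.0, 0.0],
    [0.0, 9.0, 0.0, 0.0, 0.0],
]
values, columns, row_delimiters = dense_to_csr(A)
y = csr_spmv(values, columns, row_delimiters, [1.0, 2.0, 3.0, 4.0, 5.0])
print(values, columns, row_delimiters, y)
```

Row `r` owns the half-open range `row_delimiters[r]` to `row_delimiters[r + 1]`, which is why the array has one more entry than the matrix has rows.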
AMD GPU MEMORY SYSTEM
[Diagram: eight L1 caches plus I$ and K$ caches, backed by three L2 partitions, each attached to a 64-bit dual-channel memory controller]
64 bytes per clock L1 bandwidth per CU
64 bytes per clock L2 bandwidth per partition
Uncoalesced accesses are much less efficient!
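One way to picture the cost of uncoalesced accesses is to count how many distinct memory segments a group of simultaneous loads touches. The sketch below assumes 64 work-items each loading one 4-byte element and a 64-byte fetch granularity; these numbers and the helper `segments_touched` are illustrative assumptions, not a model of the actual hardware.

```python
# Assumptions (illustrative, not from the slides): 64 work-items issue one
# 4-byte load each, and memory is fetched in 64-byte segments.
SEGMENT_BYTES = 64
ELEMENT_BYTES = 4
WORK_ITEMS = 64


def segments_touched(indices):
    """Count distinct 64-byte segments covered by one load per work-item."""
    return len({(i * ELEMENT_BYTES) // SEGMENT_BYTES for i in indices})


# Neighbouring work-items read neighbouring elements.
coalesced = segments_touched(range(WORK_ITEMS))
# Each work-item reads an element 16 entries (64 bytes) after its neighbour.
strided = segments_touched(i * 16 for i in range(WORK_ITEMS))
print(coalesced, strided)
```

Under these assumptions the strided pattern pulls in sixteen times as many segments for the same amount of useful data.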
AMD GPU COMPUTE UNIT
Local Data Share (64 KB)
Vector Registers (4x 64 KB)
Vector Units (4x SIMD-16)
L1 Cache (16 KB)
Must do parallel loads to maximize bandwidth
LOCAL DATA SHARE (LDS) MEMORY
Scratchpad (software-addressed) on-chip cache
Statically divided between all workgroups in a CU
Highly ported to allow scatter/gather (uncoalesced) accesses
Local Data Share (64KB)
CSR-SCALAR
One thread/work-item per matrix row
[Figure: matrix rows split across Workgroup #1, Workgroup #2, and Workgroup #3, one work-item per row]
Uncoalesced accesses
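CSR-Scalar can be emulated serially to show where the uncoalesced accesses come from. In this sketch (the function name `csr_scalar` and the `steps` trace are hypothetical), the inner loop over rows stands in for work-items running in lock-step, and `steps` records which positions of `values` are read in each step. The arrays are the CSR form of the 5x5 example matrix.

```python
def csr_scalar(values, columns, row_delimiters, x):
    """Emulate CSR-Scalar: work-item `row` walks its own row serially."""
    num_rows = len(row_delimiters) - 1
    y = [0.0] * num_rows
    longest = max(row_delimiters[r + 1] - row_delimiters[r] for r in range(num_rows))
    steps = []  # steps[s] = positions in `values` read during lock-step s
    for s in range(longest):
        touched = []
        for row in range(num_rows):  # each iteration plays one work-item
            k = row_delimiters[row] + s
            if k < row_delimiters[row + 1]:
                y[row] += values[k] * x[columns[k]]
                touched.append(k)
        steps.append(touched)
    return y, steps


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
columns = [0, 2, 4, 1, 3, 2, 0, 3, 1]
row_delimiters = [0, 3, 5, 6, 8, 9]
y, steps = csr_scalar(values, columns, row_delimiters, [1.0, 2.0, 3.0, 4.0, 5.0])
print(y)
print(steps)
```

In the first step, neighbouring work-items read positions 0, 3, 5, 6 and 8 rather than consecutive ones, and in later steps most work-items have already run out of work.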
CSR-VECTOR
One workgroup (multiple threads) per matrix row
[Figure: Workgroup #1, Workgroup #2, ..., Workgroup #12, one workgroup per matrix row]
Good: coalesced accesses
Bad: poor parallelism
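One way to read "poor parallelism" is that most work-items in a workgroup sit idle when a row has few non-zeros. The sketch below estimates that fraction; the workgroup size of 64 and the helper `lane_utilization` are assumptions made for illustration, not values taken from the talk.

```python
WORKGROUP_SIZE = 64  # assumed number of work-items per workgroup


def lane_utilization(row_delimiters, workgroup_size=WORKGROUP_SIZE):
    """Fraction of work-item slots doing useful multiplies under CSR-Vector."""
    useful = 0
    issued = 0
    for row in range(len(row_delimiters) - 1):
        nnz = row_delimiters[row + 1] - row_delimiters[row]
        # Passes over the row: ceil(nnz / workgroup_size), at least one.
        passes = max(1, -(-nnz // workgroup_size))
        useful += nnz
        issued += passes * workgroup_size
    return useful / issued


short_rows = lane_utilization([0, 3, 5, 6, 8, 9])  # the 5x5 example matrix
long_row = lane_utilization([0, 128])  # a single row with 128 non-zeros
print(short_rows, long_row)
```

With these assumptions the example matrix keeps under 3% of the slots busy, while a row that is a multiple of the workgroup size keeps all of them busy.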
MAKING GPU SPMV FAST
Neither CSR-Scalar nor CSR-Vector offers ideal performance.
CSR works well on CPUs and is widely used
Traditional solution: new matrix storage format + new algorithms
‒ ELLPACK/ITPACK, BCSR, DIA, BDIA, SELL, SBELL, ELL+COO Hybrid, many others
‒ Auto-tuning frameworks like clSpMV try to find the best format for each matrix
‒ Can require more storage space or dynamic transformation overhead
Can we find a good GPU algorithm that does not modify the original CSR data?
CSR-Adaptive
INCREASING BANDWIDTH EFFICIENCY
CSR-Vector coalesces long rows by streaming into the LDS
CSR-VECTOR USING THE LDS
[Figure: a long row staged in the Local Data Share and summed with parallel additions]
Fast uncoalesced accesses!
Fast accesses everywhere
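The two phases, coalesced loads into the LDS followed by a reduction inside it, can be emulated for a single row. This is a minimal serial sketch, not the authors' kernel: `csr_vector_row` is a hypothetical name, a Python list stands in for the LDS, and the workgroup size is assumed to be a power of two (4 here to keep the trace small).

```python
def csr_vector_row(values, columns, x, start, end, workgroup_size=4):
    """Emulate one workgroup reducing one row through an LDS buffer."""
    lds = [0.0] * workgroup_size  # stand-in for the Local Data Share
    # Phase 1: lane `lane` reads elements start+lane, start+lane+size, ...
    # so neighbouring lanes touch neighbouring addresses in each pass.
    for lane in range(workgroup_size):
        for k in range(start + lane, end, workgroup_size):
            lds[lane] += values[k] * x[columns[k]]
    # Phase 2: tree reduction inside the LDS (power-of-two size assumed).
    stride = workgroup_size // 2
    while stride > 0:
        for lane in range(stride):
            lds[lane] += lds[lane + stride]
        stride //= 2
    return lds[0]


# Row 0 of the 5x5 example matrix, multiplied by the vector (1 2 3 4 5).
row0 = csr_vector_row([1.0, 2.0, 3.0], [0, 2, 4], [1.0, 2.0, 3.0, 4.0, 5.0], 0, 3)

# A longer synthetic row: ten ones against x = (0 1 ... 9).
long_row = csr_vector_row([1.0] * 10, list(range(10)),
                          [float(i) for i in range(10)], 0, 10)
print(row0, long_row)
```

The scattered reads happen only in phase 2, against the LDS, which is where the hardware tolerates them.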
INCREASING BANDWIDTH EFFICIENCY
How can we get good bandwidth for other “short” rows?
Load these into the LDS
CSR-STREAM: STREAM MATRIX INTO THE LDS
Block 1 Block 2 Block 3 Block 4
Local Data Share
Fast accesses everywhere
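The idea of streaming a block of short rows into the LDS can be sketched serially. This is a simplified emulation under stated assumptions, not the authors' kernel: `csr_stream_block` is a hypothetical name, a Python list stands in for the LDS, and the reduction assigns one work-item per row.

```python
def csr_stream_block(values, columns, row_delimiters, x, first_row, last_row):
    """Emulate one workgroup processing rows [first_row, last_row)."""
    base = row_delimiters[first_row]
    count = row_delimiters[last_row] - base
    # Phase 1: work-item i handles non-zero base+i, so global reads of
    # `values` and `columns` are contiguous across the workgroup.
    lds = [values[base + i] * x[columns[base + i]] for i in range(count)]
    # Phase 2: one work-item per row sums its slice of the LDS buffer.
    out = []
    for row in range(first_row, last_row):
        lo = row_delimiters[row] - base
        hi = row_delimiters[row + 1] - base
        out.append(sum(lds[lo:hi]))
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
columns = [0, 2, 4, 1, 3, 2, 0, 3, 1]
row_delimiters = [0, 3, 5, 6, 8, 9]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Two blocks: rows 0-3 (8 non-zeros) and row 4 (1 non-zero).
y = (csr_stream_block(values, columns, row_delimiters, x, 0, 4)
     + csr_stream_block(values, columns, row_delimiters, x, 4, 5))
print(y)
```

Each block must fit in the LDS buffer, which is why the rows have to be grouped into blocks ahead of time.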
FINDING BLOCKS FOR CSR-STREAM
DONE ONCE ON THE CPU
row delimiters: 0 3 6 8 10 14 16 17 19 20 22 24 32
0 3 6 11 12
row blocks:
DONE ONCE ON THE CPU
Assuming 8 LDS entries per workgroup
columns:
![Page 65: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/65.jpg)
65 | EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT | NOVEMBER 20, 2014
FINDING BLOCKS FOR CSR-STREAM
row delimiters:
values:
0 3 6 11 12
row blocks:
DONE ONCE ON THE CPU
Assuming 8 LDS entries per workgroup
columns:
![Page 66: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/66.jpg)
66 | EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT | NOVEMBER 20, 2014
FINDING BLOCKS FOR CSR-STREAM
row delimiters:
values:
0 3 6 11 12
row blocks:
Block 1
DONE ONCE ON THE CPU
Assuming 8 LDS entries per workgroup
columns:
![Page 67: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/67.jpg)
67 | EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT | NOVEMBER 20, 2014
FINDING BLOCKS FOR CSR-STREAM
row delimiters:
values:
0 3 6 11 12
row blocks:
Block 1 Block 2
DONE ONCE ON THE CPU
Assuming 8 LDS entries per workgroup
columns:
![Page 68: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/68.jpg)
68 | EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT | NOVEMBER 20, 2014
FINDING BLOCKS FOR CSR-STREAM
row delimiters:
values:
0 3 6 11 12
row blocks:
Block 1 Block 2 Block 3 Block 4
DONE ONCE ON THE CPU
Assuming 8 LDS entries per workgroup
columns:
![Page 69: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/69.jpg)
69 | EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT | NOVEMBER 20, 2014
FINDING BLOCKS FOR CSR-STREAM
row delimiters:
values:
0 3 6 11 12
row blocks:
Block 1 Block 2 Block 3 Block 4
CSR Structure Unchanged
DONE ONCE ON THE CPU
Assuming 8 LDS entries per workgroup
columns:
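The row-blocking step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `create_row_blocks` and its parameters are hypothetical, and placing a row that is longer than the LDS into a block of its own is an assumption made to match the "Single Row?" test later in the talk. It greedily packs consecutive rows until their nonzeros fill the LDS, using the row delimiters and the 8-entry LDS from this slide.

```python
def create_row_blocks(row_ptr, lds_entries):
    """Greedily group consecutive CSR rows so each group fits in the LDS.

    row_ptr     -- CSR row delimiters (length = number of rows + 1)
    lds_entries -- how many nonzeros one workgroup can stage in the LDS
    Returns the list of row indices at which each block starts, plus the
    final row count as an end marker.
    """
    n_rows = len(row_ptr) - 1
    blocks = [0]
    start = 0  # first row of the block currently being built
    for row in range(n_rows):
        nnz = row_ptr[row + 1] - row_ptr[start]  # nonzeros in rows start..row
        if nnz == lds_entries:
            # Exact fit: close the block after this row.
            blocks.append(row + 1)
            start = row + 1
        elif nnz > lds_entries:
            if row > start:
                # This row overflows the block: close the block before it.
                blocks.append(row)
                start = row
            if row_ptr[row + 1] - row_ptr[row] >= lds_entries:
                # Assumption: a row that fills or exceeds the LDS by itself
                # becomes a single-row block.
                blocks.append(row + 1)
                start = row + 1
    if blocks[-1] != n_rows:
        blocks.append(n_rows)  # close the last, partially filled block
    return blocks


# Row delimiters from the slide, 8 LDS entries per workgroup.
row_delimiters = [0, 3, 6, 8, 10, 14, 16, 17, 19, 20, 22, 24, 32]
print(create_row_blocks(row_delimiters, 8))  # [0, 3, 6, 11, 12]
```

The only extra data structure is the short `row blocks` array; the CSR arrays themselves are read but never modified.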
![Page 75: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/75.jpg)
WHAT ABOUT VERY LONG ROWS?
Block 1, Block 2 and Block 4 fit in the Local Data Share: CSR-Stream
Block 3 is a very long row that does not fit in the Local Data Share: CSR-Vector
CSR-Stream and CSR-Vector combined: CSR-Adaptive
![Page 83: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/83.jpg)
CSR-ADAPTIVE
(Once per matrix) On CPU: Create Row Blocks
On GPU (each workgroup assigned one row block):
Read Row Block Entries
Single Row?
‒ Yes: CSR-Vector. Vector Load Row into LDS, then Perform Parallel Reduction out of LDS
‒ No: CSR-Stream. Vector Load Multiple Rows into LDS, then Perform Reduction out of LDS (serially or in parallel)
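The control flow above can be emulated sequentially in Python. This is only a sketch under stated assumptions: it runs on the CPU, the list `lds` merely stands in for the Local Data Share, each loop iteration plays the role of one workgroup, and the names `csr_adaptive_spmv` and `tree_reduce` are hypothetical. The matrix and vector are the 5×5 example from the background slide of this talk, and the row blocks assume 4 LDS entries per workgroup so that both paths are exercised.

```python
def tree_reduce(buf):
    """Pairwise (tree) sum, mimicking a parallel reduction done in steps."""
    buf = list(buf)
    n = len(buf)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):  # on a GPU these additions run in parallel
            buf[i] += buf[i + half]
        n = half
    return buf[0] if buf else 0.0


def csr_adaptive_spmv(row_ptr, cols, vals, x, row_blocks):
    """Sequential emulation of the per-row-block dispatch (y = A * x)."""
    y = [0.0] * (len(row_ptr) - 1)
    for b in range(len(row_blocks) - 1):  # one "workgroup" per row block
        first, last = row_blocks[b], row_blocks[b + 1]
        lo, hi = row_ptr[first], row_ptr[last]
        # Stage the products value * x[column] in the LDS stand-in.
        lds = [vals[k] * x[cols[k]] for k in range(lo, hi)]
        if last - first == 1:
            # Single row: CSR-Vector path, parallel-style reduction.
            y[first] = tree_reduce(lds)
        else:
            # Several rows: CSR-Stream path, one reduction per row.
            for row in range(first, last):
                y[row] = sum(lds[row_ptr[row] - lo:row_ptr[row + 1] - lo])
    return y


# CSR form of the 5x5 example matrix from the background slide.
row_ptr = [0, 3, 5, 6, 8, 9]
cols = [0, 2, 4, 1, 3, 2, 0, 3, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
row_blocks = [0, 1, 3, 5]  # assumes 4 LDS entries per workgroup

print(csr_adaptive_spmv(row_ptr, cols, vals, x, row_blocks))
# [22.0, 28.0, 18.0, 39.0, 18.0]
```

A real kernel would perform the staging loads and the reductions with many work-items at once; the emulation only shows which rows each path handles and that both paths read the unmodified CSR arrays.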
![Page 84: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/84.jpg)
Experiments
![Page 85: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/85.jpg)
EXPERIMENTAL SETUP
CPU: AMD A10-7850K (3.7 GHz)
‒ 32 GB dual-channel DDR3-2133
GPU: AMD FirePro™ W9100
‒ 44 parallel compute units (CUs)
‒ 930 MHz core clock rate (5.2 single-precision TFLOPs)
‒ 16 GDDR5 channels at 1250 MHz (320 GB/s)
CentOS 6.4 (kernel 2.6.32-358.23.2)
‒ AMD FirePro Driver version 14.20 beta
‒ AMD APP SDK v2.9
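The peak figures quoted for the GPU can be reproduced with simple arithmetic. The sketch below is a sanity check, not vendor data: the 64 lanes per compute unit, 2 FLOPs per lane per cycle (one fused multiply-add), 32-bit memory channels and 4 transfers per memory clock are assumptions that are not stated on the slide.

```python
# Figures taken from the slide.
COMPUTE_UNITS = 44
CORE_CLOCK_GHZ = 0.930
MEMORY_CHANNELS = 16
MEMORY_CLOCK_GHZ = 1.250

# Assumptions (not on the slide).
LANES_PER_CU = 64             # single-precision lanes per compute unit
FLOPS_PER_LANE_PER_CYCLE = 2  # one fused multiply-add counts as 2 FLOPs
TRANSFERS_PER_CLOCK = 4       # GDDR5 data rate relative to the quoted clock
BYTES_PER_TRANSFER = 4        # 32-bit channel

peak_gflops = (COMPUTE_UNITS * LANES_PER_CU
               * FLOPS_PER_LANE_PER_CYCLE * CORE_CLOCK_GHZ)
peak_gb_per_s = (MEMORY_CHANNELS * MEMORY_CLOCK_GHZ
                 * TRANSFERS_PER_CLOCK * BYTES_PER_TRANSFER)

print(round(peak_gflops / 1000, 1), "TFLOPs")  # 5.2 TFLOPs
print(peak_gb_per_s, "GB/s")                   # 320.0 GB/s
```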
![Page 86: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/86.jpg)
SPMV SETUP
SHOC Benchmark Suite (Aug. 2013 version)
‒ CSR-Vector (details in paper: CSR-Scalar, ELLPACK)
ViennaCL v1.5.2
‒ CSR-Scalar, COO, ELLPACK, ELLPACK+COO HYB
‒ 4-value and 8-value padded versions of CSR-Scalar
‒ Best of these for each matrix: “ViennaCL Best”
clSpMV v0.1
‒ CSR-Scalar, CSR-Vector, BCSR, COO, DIA, BDIA, ELLPACK, SELL, BELL, SBELL
‒ Best of these for each matrix: “clSpMV Best”
‒ Details in paper: clSpMV “Cocktail” format
CSR-Adaptive
20 matrices from Univ. of Florida Sparse Matrix Collection
![Page 87: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/87.jpg)
DATA STRUCTURE GENERATION TIMES
[Figure: bar chart. Y-axis: generation time vs. COO→CSR, ticks from 1 to 1024 in powers of two. Series: Row Blocking, ELLPACK, BCSR, Cocktail.]
Note: Some matrices could not be held in ELLPACK or BCSR
![Page 88: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/88.jpg)
PERFORMANCE IN SINGLE-PRECISION GFLOPS
[Figure: bar chart. Y-axis: GFLOPS (single precision), 0 to 90; one off-scale value is labeled 110.7.]
Roughly Equal Performance (7/20)
Others Better (4/20)
CSR-Adaptive Performance Higher (9/20)
![Page 94: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/94.jpg)
PERFORMANCE BENEFIT OF CSR-ADAPTIVE
[Figure: bar chart. Y-axis: CSR-Adaptive speedup, 0 to 35. Bars: CSR-Scalar, CSR-Vector, ELLPACK, ViennaCL Best, clSpMV Cocktail, clSpMV Best Single.]
![Page 97: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/97.jpg)
CONCLUSION
CSR-Adaptive up to 14.7x faster than previous GPU-based SpMV algorithms
Requires little work beyond generating the traditional CSR data structure
‒ Less than 0.1% extra storage compared to CSR
Algorithm is not complex: easy to integrate into libraries and existing SW
AMD Sample Code Available Soon – Ask Us!
Independently Implemented Version in ViennaCL 1.6.1
![Page 98: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/98.jpg)
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
![Page 99: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/99.jpg)
Backup Slides
![Page 100: Efficient Sparse Matrix-Vector Multiplication on GPUs ... · efficient sparse matrix-vector multiplication on gpus using the csr storage format joseph l. greathouse, mayank daga amd](https://reader030.vdocuments.site/reader030/viewer/2022040118/5e1a8fdd4544d5450b56f5f2/html5/thumbnails/100.jpg)
AMD GPU MICROARCHITECTURE