locality-aware software throttling for sparse matrix ... · –split-join (sj) –split-join queue...
TRANSCRIPT
![Page 1: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/1.jpg)
Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs
Yanhao Chen1, Ari B. Hayes1, Chi Zhang2, Timothy Salmon1, Eddy Z. Zhang1
1. Rutgers University2. University of Pittsburgh
![Page 2: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/2.jpg)
2018 USENIX Annual Technical Conference
Sparse Matrix• Sparse Linear Systems
- CG- GMRES- …
• Physics Based Simulations- CFD
• Deep Learning Optimizations- Pruned Neural Networks
• ……
![Page 3: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/3.jpg)
2018 USENIX Annual Technical Conference
Sparse Matrix Operation
y=A xSparse Matrix
AInput Vector
xOutput Vector
y
&' = ()*+,) { .'/ ⨀ 1/}
Binary Operator3 ∈ 1, … ,8 , 9 ∈ [1, … , ;]
![Page 4: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/4.jpg)
2018 USENIX Annual Technical Conference
Sparse Matrix Vector Multiplication (SpMV)
!"#$%" = sum⨀ = ∗
,- = ./0{2-,4 ∗ 54}
78,8 78,9 78,: 78,;
< < < <
7:,8 < 7:,: <
< 7;,9 < 7;,;
=8=9=:=;
>8>9>:>;
* =
A x y
,- = !"#$%"{2-4⨀54}
,: = 2:,8 ∗ 58 + 2:,: ∗ 5:@ ∈ 1, … ,D , E ∈ [1, … , G]
![Page 5: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/5.jpg)
2018 USENIX Annual Technical Conference
Single Source Shortest Path (SSSP)
!"#$%" = min⨀ = +
,- = ./0{23- + 43}
,- = !"#$%"{2-3⨀43}
S
2
1
3
4
5
S 1 2 3 4 5S12345
67 = 89:{;7 , 2=,7+;= , 2>,7+;> }
Adjacent Matrix A
y: distance vector of j th iteration
x ∶ (previous) distance vector of j-1 th iterationA ∈ 1, … ,E , F ∈ [1, … , H]
![Page 6: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/6.jpg)
2018 USENIX Annual Technical Conference
Problem with Sparsity on GPUs• Low data reuse is always a big problem
• e.g. SpMV
– The input vector and the output vector can be reused a lot
– They are usually too large to fit into GPU’s cache
– The sparsity of the matrix causes irregular accesses of the vectors
– This means low reuse of the data in the cache
!" = $%&{(",* ∗ ,*}
![Page 7: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/7.jpg)
2018 USENIX Annual Technical Conference
Existing Methods to Improve Data Reuse on GPUs
• Warp Scheduling Policy– Throttling concurrent threads– Limits the number of active warps [Rogers+, MICRO’12]– DYNCTA: controls the number of CTAs [Kayiran+,PACT’13]
• Computation and Data Layout Transformation– Reduce irregular memory accesses– Improve Memory Coalescing [Zhang+, ICS’10]
Need Hardware Modification!
Only focus on Spatial Data Reuse!
![Page 8: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/8.jpg)
2018 USENIX Annual Technical Conference
Ø Is the First Software Throttling implementation
Ø Is focused on Temporal Data Reuse
Ø Exploits the Trade-off between throttling performance and GPU throughput
![Page 9: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/9.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling !"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
![Page 10: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/10.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling !"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
Matrix A is bypassing the cache
![Page 11: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/11.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }
Original Case
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
Matrix A is bypassing the cache
Running at one time
![Page 12: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/12.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }
Original Case
Cache Capacity: 4
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
!" #" !$ !%!& #% #&
Matrix A is bypassing the cache
Running at one time
![Page 13: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/13.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
< !% #" > < !% #& > < !& #" > < !& #& >
Tim
e
Throttling Phase 1
Cache Capacity: 4
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
![Page 14: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/14.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
Throttling Phase 1
< !% #" > < !% #& > < !& #" > < !& #& >
Cache Capacity: 4
Tim
e
!" #" !$ #$
All Data Can Fit into the Cache
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
![Page 15: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/15.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
Throttling Phase 2
< !% #" > < !% #& > < !& #" > < !& #& >
Cache Capacity: 4
Tim
e
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
![Page 16: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/16.jpg)
2018 USENIX Annual Technical Conference
SpMV with Software Throttling
< !" #" > < !" #$ > < !$ #" > < !$ #$ >
Throttling Phase 2
< !% #" > < !% #& > < !& #" > < !& #& >
Cache Capacity: 4
Tim
e
!% #" !& #&
All Data Can Fit into the Cache
!"!#!$!%
&"&#&$&%
X =
X Y
'"," '",# '",$ '",%
) ) ) )
'$," ) '$,$ )
) '%,# ) '%,%A
![Page 17: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/17.jpg)
2018 USENIX Annual Technical Conference
What We Need for Software Throttling• An effective partitioning algorithm
– All data items in one partition can fit into the cache– The interaction between different partitions are minimized
• Applicable scheduling methods– Handle the trade-off between throttling and throughput
![Page 18: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/18.jpg)
2018 USENIX Annual Technical Conference
What We Need for Software Throttling• An effective partitioning algorithm
– All data items in one partition can fit into the cache– The interaction between different partitions are minimized
• Applicable scheduling methods– Handle the trade-off between throttling and throughput
![Page 19: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/19.jpg)
2018 USENIX Annual Technical Conference
Graph Representation• Graph Edge Partition Model
– Places an emphasis on Data– Node → Data object– Edge → Interaction between data
1 2 3 4
1 !"," !",$ !",% !",&2 ' ' ' '
3 !%," ' !%,% '
4 ' !&,$ ' !&,&
("($(%(&
)")$)%)&
X =
A X YX1 X2 X3 X4
Y1 Y2 Y3 Y4
![Page 20: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/20.jpg)
2018 USENIX Annual Technical Conference
Why Edge Partition Model?1. Better load balancing
– PowerGraph [OSDI’12], Streaming Edge Partition [KDD’14], SPAC[SIGMETRICS’17]
• Balanced vertex partition is sometimes NOT equal to balanced workload
2. Quantifying the communication cost
3. Applies to a large class of parallel applications– N-body, CFD, Sparse Linear Algebra, Graph Analytics, …
20
![Page 21: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/21.jpg)
2018 USENIX Annual Technical Conference
Different Edge Partition ModelsLoad Balanced Partition Data Balanced Partition
X1 X2 X3 X4
Y1 Y2 Y3 Y4
Partition 1 Partition 2
X1 X2 X3 X4
Y1 Y2 Y3 Y4
Partition 1 Partition 2
# Nodes: 4 # Nodes: 4# Edges: 3 # Edges: 3 Cache-Fit Partition
![Page 22: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/22.jpg)
2018 USENIX Annual Technical Conference
Data Balanced Partition• Recursive Bisection Framework
– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]
– Minimize vertex replica (data copy)
X1 X2 X3 X4
Y1 Y2 Y3 Y4
Cache Capacity: 4
![Page 23: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/23.jpg)
2018 USENIX Annual Technical Conference
Data Balanced Partition• Recursive Bisection Framework
– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]
– Minimize vertex replica (data copy)
X1 X2 X3 X4
Y1 Y2 Y3 Y4
Partition A Partition B
Cache Capacity: 4
![Page 24: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/24.jpg)
2018 USENIX Annual Technical Conference
Data Balanced Partition• Recursive Bisection Framework
– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]
– Minimize vertex replica (data copy)
X1 X2 X3 X4
Y1 Y2 Y3 Y4
Partition A Partition B
Cache Capacity: 4
![Page 25: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/25.jpg)
2018 USENIX Annual Technical Conference
Data Balanced Partition• Recursive Bisection Framework
– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]
– Minimize vertex replica (data copy)
Cache Capacity: 4
X2 X3 X4
Y2 Y3 Y4
Partition B Partition C
![Page 26: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/26.jpg)
2018 USENIX Annual Technical Conference
Data Balanced Partition• Recursive Bisection Framework
– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]
– Minimize vertex replica (data copy)
Cache Capacity: 4
Partition B
Partition C
X1 X2 X3 X4
Y1 Y2 Y3 Y4
Partition A
![Page 27: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/27.jpg)
2018 USENIX Annual Technical Conference
What We Need for Software Throttling• A good partitioning algorithm
– All data items in one partition can fit into the cache– The interaction between different partitions are minimized
• Applicable scheduling methods– Handle the trade-off between throttling and throughput
![Page 28: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/28.jpg)
2018 USENIX Annual Technical Conference
Effective scheduling methods• Four different scheduling methods
– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)
![Page 29: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/29.jpg)
2018 USENIX Annual Technical Conference
Effective scheduling methods• Four different scheduling methods
– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)
![Page 30: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/30.jpg)
2018 USENIX Annual Technical Conference
Cache-Fit (CF) Scheduling• Isolate the computation of different Cache-Fit Partitions• Run one Cache-Fit Partition at one time
TL: tuple list
N: # of tuples
TL’: new tuple list
Pi: # of tuples in TL[i]
Strict Barriers
CUDA Function
![Page 31: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/31.jpg)
2018 USENIX Annual Technical Conference
Low Pipeline Utilization
Kernel 1 -- Partition ACache Capacity: 4
Partition A Partition B
Partition C
X1 X2 X3 X4
Y1 Y2 Y3 Y4
X1
Y1
X1
Y2
X2
Y1
4 Working Threads
X2
Y2
![Page 32: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/32.jpg)
2018 USENIX Annual Technical Conference
Low Throughput
Cache Capacity: 4
Partition A Partition B
Partition C
X1 X2 X3 X4
Y1 Y2 Y3 Y4
X2
Y3
X3
Y3
4 Working Threads
Kernel 2 -- Partition B
IDLE
![Page 33: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/33.jpg)
2018 USENIX Annual Technical Conference
Low Pipeline Utilization
Cache Capacity: 4
Partition A Partition B
Partition C
X1 X2 X3 X4
Y1 Y2 Y3 Y4
X4
Y3
X4
Y4
4 Working Threads
Kernel 3 -- Partition C
IDLE
low pipeline utilization
![Page 34: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/34.jpg)
2018 USENIX Annual Technical Conference
Effective scheduling methods• Four different scheduling methods
– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)
![Page 35: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/35.jpg)
2018 USENIX Annual Technical Conference
Cache-Fit Queue (CF-Q) Scheduling• Invoke a single kernel call but still
enable throttling
• Set up a FIFO queue • Each entry corresponds to a chunk
– A chunk is part of a cache-fit partition– Adjacent chunks are from the same
Cache-Fit Partition• Each warp fetches a chunk from the
queue and processes it
Cache-Fit Partition 1
Chunk 1
…Chunk 2
Chunk N
Que
ue
Cache-Fit Partition 2
Chunk 1
…Chunk 2
Chunk M
…
![Page 36: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/36.jpg)
2018 USENIX Annual Technical Conference
Cache-Fit Queue (CF-Q) Scheduling cont.
Cache-Fit Partition 1
Chunk 1
…Chunk 2
Chunk N
Que
ue
Cache-Fit Partition 2
Chunk 1
…Chunk 2
Chunk M
…
Tim
e
Warp 1 Warp 2
…
…
… …Chunk 1
Chunk 4
Chunk 2
Chunk 3
Chunk NChunk 1
Chunk 2Chunk 3
Relaxed Barrier
Chunk 3
Chunk 4
![Page 37: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/37.jpg)
2018 USENIX Annual Technical Conference
Effective scheduling methods• Four different scheduling methods
– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)
![Page 38: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/38.jpg)
2018 USENIX Annual Technical Conference
Split-Join (SJ) Scheduling• Dynamically merge Cache-Fit Partitions
• Perform an Online Profiling to decide which partitions should be merged
• Use the Tree Representation of the data balanced partition to help the Online Profiling
![Page 39: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/39.jpg)
2018 USENIX Annual Technical Conference
Split-Join (SJ) Scheduling
Cache Capacity: 4
Partition A Partition B
Partition C
X1 X2 X3 X4
Y1 Y2 Y3 Y4
![Page 40: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/40.jpg)
2018 USENIX Annual Technical Conference
Split-Join (SJ) Scheduling
A.stime: 0.5A.btime: 0.5
stime: measured standalone running timebtime: optimal running time on this node
B.stime: 0.2B.btime: 0.2
C.stime: 0.2C.btime: 0.2
All Edges
Partition A Partition B, C
Partition B Partition C
![Page 41: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/41.jpg)
2018 USENIX Annual Technical Conference
Split-Join (SJ) Schedulingstime: measured standalone running timebtime: optimal running time on this node
B.stime: 0.2B.btime: 0.2
C.stime: 0.2C.btime: 0.2
BC.stime: 0.3BC.btime: 0.3
All.stime: 1.2All.btime: 0.8
BC.stime < B.btime + C.btime
All Edges
Partition A Partition B, C
Partition B Partition C
All.stime > A.btime + BC.btime
A.stime: 0.5A.btime: 0.5
![Page 42: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/42.jpg)
2018 USENIX Annual Technical Conference
Effective scheduling methods• Four different scheduling methods
– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)
![Page 43: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/43.jpg)
2018 USENIX Annual Technical Conference
Split-Join Queue (SJ-Q) Scheduling• Provide strict barriers between different merged partitions
• No barriers inside a merged partition of SJ– No guarantee of the execution order
• Set up one FIFO queue for each merged partition– Provide relaxed barriers between cache-fit partitions from the same
merged partition
![Page 44: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/44.jpg)
2018 USENIX Annual Technical Conference
Four Scheduling Methods Summary
Methods Pipeline Utilization Profiling Barriers Queue Code
ChangeCF Low No Strict No No
CF-Q High No Relaxed Yes YesSJ High Yes Strict No No
SJ-Q High Yes Strict / Relaxed Yes Yes
![Page 45: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/45.jpg)
2018 USENIX Annual Technical Conference
Software Throttling Performance• Experiment Settings
GPU Model Titan X GTX 745Architecture Pascal Maxwell
Core # 5376 576L2 Cache 3MB 2MB
CUDA Version CUDA 8.0 CUDA 8.0
CPU Intel XeonE5-2620
Intel Core i7-4790
![Page 46: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/46.jpg)
2018 USENIX Annual Technical Conference
Benchmarks• Sparse Linear Algebra Workloads
– Sparse Matrix Vector Multiplication (SPMV)
• Graph Processing Workloads– Bellman-Ford (BMF)– Page Rank (PR)
• Neural Network Optimization– Deep Compression: Pruned AlexNet
![Page 47: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/47.jpg)
2018 USENIX Annual Technical Conference
Sparse Matrix Vector Multiplication• Baseline: CUSP• Matrices: Florida Sparse Matrix Collection
– Focus on large matrices: working set cannot fit into L2 cache– 228 large matrices on GTX 745– 192 large matrices on Titan X
![Page 48: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/48.jpg)
2018 USENIX Annual Technical Conference
Overall SPMV Performance
2.018
0.81
1.21.41.61.8
2
Org + R CF SJ CF-Q SJ-Q
Norm
alize
d Sp
eedu
p
Average Speedup For SPMV
GTX 745Titan X
![Page 49: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/49.jpg)
2018 USENIX Annual Technical Conference
Effect of Cache Hit Rate
0
0.5
1
1.5
2
2.5
3
Org + R CF SJ CF-Q SJ-Q
Nor
mal
ized
Spe
edup
40% - 100%30% - 40%20% - 30%10% - 20%0% - 10%
4.16 4.24Titan X
![Page 50: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/50.jpg)
2018 USENIX Annual Technical Conference
Effect of Working Set Size
0
0.5
1
1.5
2
2.5
3
Org + R CF SJ CF-Q SJ-Q
Nor
mal
ized
Spe
edup 1X - 2X (cache size) 2X - 4X
4X - 8X 8X ~ INF
3.02
Titan X
![Page 51: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/51.jpg)
2018 USENIX Annual Technical Conference
Graph Application Performance
0
0.5
1
1.5
2
2.5
3
3.5
Pokec WebGoogle Wikipedia* WikiTalk IMDB RoadCentral RoadCal
Nor
mal
ized
Spee
dup
BMF & PR Performance using SJ w/ overhead
BMFPR
RoadCal is a small Graph
������������ � �� �������
Titan X
![Page 52: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/52.jpg)
2018 USENIX Annual Technical Conference
Graph Application Performance cont.
0102030405060
Pokec
WebGoogle
Wiki
pedia*
Wiki
TalkIM
DB
RoadCentral
RoadCalL2 C
ache
Hit
Rate
(%)
BMF Cache Hit Rate
Original SJ
05
101520253035
Pokec
WebGoogle
Wiki
pedia*
Wiki
TalkIM
DB
RoadCentral
RoadCalL2 C
ache
Hit
Rate
(%)
PR Cache Hit Rate
Original SJ
Titan XTitan X
RoadCal is a small Graph
![Page 53: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/53.jpg)
2018 USENIX Annual Technical Conference
Deep Learning Benchmark• Deep Compression [Han+,ICLR’16]
– Prune AlexNet to remove low weight elements in fully connected layers
– Deep Compression provide us sparse matrices
0
0.5
1
1.5
2
fc6 fc7 fc8 Combined
Norm
alize
d Sp
eedu
p
Speedup for AlexNet using Deep Compression
CFSJCF-QSJ-Q
Titan X
fc7 and fc8 has small vector size
![Page 54: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/54.jpg)
2018 USENIX Annual Technical Conference
Conclusion• We proposed the first locality-aware Software
Throttling framework for GPUs
• Our framework can increase data reuse by improving Temporal Locality
• We exploited the Trade-off between cache performance and pipeline utilization
![Page 55: Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue (SJ-Q) 2018 USENIX Annual Technical Conference ... –Focus on large matrices: working](https://reader034.vdocuments.site/reader034/viewer/2022042400/5f0e67f97e708231d43f1758/html5/thumbnails/55.jpg)
2018 USENIX Annual Technical Conference