Warp Size Impact in GPUs: Large or Small?
Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
ECE, University of Tehran; ECE, University of Victoria
This Work

- Accelerators amortize control flow over groups of threads (warps)
- Warp size impacts performance (branch/memory divergence and memory access coalescing)
- Small warp: low branch/memory divergence (+), low memory coalescing (-)
- Large warp: high branch/memory divergence (-), high memory coalescing (+)

Question: possible solutions?
- Enhance coalescing in a small-warp machine (SW+), OR
- Enhance divergence handling in a large-warp machine (LW+)

Winner: SW+
Outline
- Branch/Memory Divergence
- Memory Access Coalescing
- Warp Size Impact
- Warp Size: Large or Small?
  - Machine models to find the answer:
    - Small-Warp Coalescing-Enhanced Machine (SW+)
    - Large-Warp Control-flow-Enhanced Machine (LW+)
- Experimental Results
- Conclusions & Future Work
Warping
Opportunities:
- Reduce scheduling overhead
- Improve utilization of execution units (SIMD efficiency)
- Exploit inter-thread data locality

Challenges:
- Memory divergence
- Branch divergence
Memory Divergence
Threads of a warp may hit or miss in the L1 cache:

J = A[S]; // L1 cache access
L = K * J;

[Figure: a 4-thread warp (T0-T3) where three threads hit in L1 and one misses; the whole warp stalls until the miss is serviced.]
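The lock-step rule behind this slide can be sketched in a few lines of host-side Python. The latency constants are illustrative assumptions, not numbers from the slides:

```python
HIT_LATENCY = 1      # assumed L1 hit latency in cycles (illustrative)
MISS_LATENCY = 100   # assumed miss latency in cycles (illustrative)

def warp_mem_latency(hits):
    """Latency of one memory instruction for a lock-step warp, given
    per-thread hit/miss flags: the slowest lane decides for everyone."""
    return max(HIT_LATENCY if h else MISS_LATENCY for h in hits)

# One miss stalls all four threads of the warp:
print(warp_mem_latency([True, True, False, True]))  # -> 100
```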
Branch Divergence
A branch instruction can diverge to two different paths, dividing the warp into two groups:
1. Threads with the taken outcome
2. Threads with the not-taken outcome

if (J == K) {
    C[tid] = A[tid] * B[tid];
} else if (J > K) {
    C[tid] = 0;
}

[Figure: over time, a 4-thread warp executes both paths serially; on one path lanes T0 and T3 are active (T1 and T2 masked with X), on the other lanes T1 and T2 are active.]
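The cost of this serialization can be captured by a tiny sketch (our simplification, not the paper's model): a warp needs one pass if all lanes agree at the branch, and two passes if they split:

```python
def branch_passes(outcomes):
    """Serial passes a lock-step warp needs for a two-way branch:
    1 if all lanes agree on the outcome, 2 if the warp diverges."""
    return 1 if len(set(outcomes)) == 1 else 2

print(branch_passes([True, True, True, True]))    # -> 1 (no divergence)
print(branch_passes([True, False, False, True]))  # -> 2 (both paths serialized)
```

With two passes and half the lanes masked in each, SIMD efficiency drops to 50% for this instruction sequence.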
Memory Access Coalescing
Memory accesses of neighboring threads to a common location are coalesced into one transaction.

[Figure: three 4-thread warps issue loads touching cache lines A-E; accesses to the same line are merged, e.g. warp T8-T11's four misses to line C become a single memory request (Mem. Req. C).]
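The coalescing rule can be sketched as counting distinct cache lines per warp access. The 64-byte line size matches the L1 block size in the configuration slide; everything else is our simplification:

```python
LINE_SIZE = 64  # bytes per cache line (matches the config slide)

def num_transactions(addresses):
    """Memory transactions for one warp's memory instruction:
    one transaction per distinct cache line touched."""
    return len({a // LINE_SIZE for a in addresses})

# Four consecutive 4-byte accesses coalesce into one transaction...
print(num_transactions([0, 4, 8, 12]))       # -> 1
# ...while 64-byte-strided accesses need one transaction per thread.
print(num_transactions([0, 64, 128, 192]))   # -> 4
```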
Warp Size
Warp size: the number of threads in a warp

Small-warp advantages:
- Less branch/memory divergence
- Less synchronization overhead at every instruction

Large-warp advantage:
- Greater opportunity for memory access coalescing
Warp Size and Branch Divergence
The smaller the warp, the lower the branch divergence.

if (J > K) {
    C[tid] = A[tid] * B[tid];
} else {
    C[tid] = 0;
}

[Figure: eight threads T1-T8 evaluate the branch. Grouped as 2-thread warps, no warp diverges; grouped as 4-thread warps, one warp contains both outcomes and diverges.]
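This effect is easy to reproduce with a sketch that regroups the same per-thread outcomes under different warp sizes (the outcome pattern below is hypothetical, chosen to mirror the slide):

```python
def diverged_warps(outcomes, warp_size):
    """Number of warps whose lanes disagree on a branch outcome."""
    warps = [outcomes[i:i + warp_size] for i in range(0, len(outcomes), warp_size)]
    return sum(1 for w in warps if len(set(w)) > 1)

outcomes = [True, True, True, True, False, False, False, False]
print(diverged_warps(outcomes, 2))  # -> 0: 2-thread warps, no divergence
print(diverged_warps(outcomes, 8))  # -> 1: the single 8-thread warp diverges
```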
Warp Size and Memory Divergence
[Figure: warps T0-T3 and T8-T11 hit in L1 while warp T4-T7 misses. With small warps, only the missing warp stalls and the other warps keep executing; with one large 12-thread warp, the misses stall all twelve threads. Small warps improve latency hiding.]
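The latency-hiding advantage above can be sketched by counting which warps remain runnable after a load (a simplification of the slide's timeline, with a hypothetical hit/miss pattern):

```python
def runnable_warps(hits, warp_size):
    """Warps in which every thread hit, i.e. warps that need not stall
    and can keep hiding the latency of the missing warp."""
    warps = [hits[i:i + warp_size] for i in range(0, len(hits), warp_size)]
    return sum(1 for w in warps if all(w))

hits = [True] * 4 + [False] + [True] * 7   # one miss among 12 threads
print(runnable_warps(hits, 4))   # -> 2: two small warps keep running
print(runnable_warps(hits, 12))  # -> 0: the single large warp stalls
```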
Warp Size and Memory Access Coalescing
[Figure: all twelve threads miss. With 4-thread warps, accesses coalesce only within each warp, producing 5 memory requests; with one 12-thread warp, wider coalescing merges them into 2 requests. Wider coalescing reduces the number of memory accesses.]

5 memory requests (small warps) vs. 2 memory requests (large warps)
Warp Size Impact on Coalescing
Often, a larger warp size yields a higher coalescing rate.

[Chart: coalescing rate for BKP, CP, HSPT, and MU at warp sizes 8, 16, 32, and 64.]
Warp Size Impact on Idle Cycles
MU: larger warps increase divergence, raising idle cycles. BKP: larger warps improve coalescing, reducing idle cycles.

[Chart: contribution of idle cycles (0-100%) for BKP, CP, HSPT, and MU at warp sizes 8, 16, 32, and 64.]
Warp Size Impact on Performance
MU: larger warps increase divergence, hurting performance. BKP: larger warps improve coalescing, improving performance.

[Chart: normalized IPC for BKP, CP, HSPT, and MU at warp sizes 8, 16, 32, and 64.]
Approach

Baseline machines:
- SW+: an ideal MSHR compensates the coalescing loss of small warps
- LW+: MIMD lanes compensate the divergence of large warps
SW+

Warps as wide as the SIMD width:
- Low branch/memory divergence, improved latency hiding

Compensating the coalescing loss -> ideal MSHR:
- Compensates the small-warp deficiency (memory access coalescing loss)
- The ideal MSHR prevents redundant memory transactions by merging redundant requests of warps on the same SM
- Outstanding MSHRs are searched to perform the merge
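The merging idea can be sketched as follows. This is our simplified model of an idealized MSHR, not the paper's implementation; the class and method names are ours:

```python
LINE_SIZE = 64  # bytes per cache line (matches the config slide)

class IdealMSHR:
    """Simplified ideal MSHR: outstanding misses are indexed by cache-line
    address, and a new miss to a line already in flight merges into the
    existing entry instead of generating another memory transaction."""

    def __init__(self):
        self.outstanding = {}     # line address -> list of waiting requester ids
        self.memory_requests = 0  # transactions actually sent to memory

    def miss(self, addr, requester):
        line = addr // LINE_SIZE
        if line in self.outstanding:
            self.outstanding[line].append(requester)  # merged: no new transaction
        else:
            self.outstanding[line] = [requester]
            self.memory_requests += 1                 # first miss goes to memory

    def fill(self, line):
        """Data returns from memory; wake every merged requester."""
        return self.outstanding.pop(line, [])

mshr = IdealMSHR()
for warp, addr in [(0, 0), (1, 16), (2, 32)]:  # three warps, same 64B line
    mshr.miss(addr, warp)
print(mshr.memory_requests)  # -> 1: redundant requests were merged
```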
LW+
Warps 8x larger than the SIMD width:
- Improved memory access coalescing

Compensating the divergence -> lock-step MIMD execution:
- Compensates the large-warp deficiency (branch/memory divergence)
Methodology
Cycle-accurate GPU simulation through GPGPU-sim:
- Six memory controllers (76 GB/s)
- 16 8-wide SMs (332.8 GFLOPS)
- 1024 threads per core
- Warp sizes: 8, 16, 32, and 64

Workloads:
- RODINIA
- CUDA SDK
- GPGPU-sim
Coalescing Rate
SW+: 86%, 58%, and 34% higher coalescing rate vs. 16, 32, and 64 threads/warp. LW+: 37% and 17% higher, and 1% lower, coalescing rate vs. 16, 32, and 64 threads/warp.

[Chart: coalescing rate (log scale) for BFS, BKP, GAS, HSPT, MTM, MU, NQU, SR1, and the average, comparing SW+, warp sizes 8/16/32/64, and LW+.]
Idle Cycles
SW+: 11%, 6%, and 8% fewer idle cycles vs. 8, 16, and 32 threads/warp. LW+: 1% more idle cycles vs. 8, and 4% and 2% fewer vs. 16 and 32 threads/warp.

[Chart: contribution of idle cycles for BFS, BKP, GAS, HSPT, MTM, MU, NQU, SR1, and the average, comparing SW+, warp sizes 8/16/32/64, and LW+.]
Performance
SW+ outperforms LW+ by 11%, and 8, 16, and 32 threads/warp by 16%, 13%, and 20%. LW+ outperforms 8, 16, 32, and 64 threads/warp by 5%, 1%, 7%, and 15%.

[Chart: normalized IPC for BFS, BKP, GAS, HSPT, MTM, MU, NQU, SR1, and the average, comparing SW+, warp sizes 8/16/32/64, and LW+.]
Conclusion & Future Work
- Warp size impacts coalescing, idle cycles, and performance
- Investing in enhancing the small-warp machine (SW+) returns a higher gain than investing in enhancing the large-warp machine (LW+)
- Future work: evaluating the warp size impact on energy efficiency
Thank you! Questions?
Backup Slides
Coalescing Width
The range of threads in a warp considered together for memory access coalescing:
- NVIDIA G80: over a sub-warp
- NVIDIA GT200: over a half-warp
- NVIDIA GF100: over the entire warp

When the coalescing width spans the entire warp, the optimal warp size depends on the workload.
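The coalescing-width idea can be sketched by merging accesses only within each window of the warp (our simplification; the window sizes below stand in for half-warp vs. full-warp coalescing):

```python
LINE_SIZE = 64  # bytes per cache line

def transactions(addresses, coalescing_width):
    """Memory transactions for one warp's access, merging only among
    threads that fall in the same coalescing window."""
    total = 0
    for i in range(0, len(addresses), coalescing_width):
        window = addresses[i:i + coalescing_width]
        total += len({a // LINE_SIZE for a in window})
    return total

# Both half-warps of a 32-thread warp touch the same 64B line:
addrs = [(t % 16) * 4 for t in range(32)]
print(transactions(addrs, 16))  # -> 2: half-warp coalescing can't merge across halves
print(transactions(addrs, 32))  # -> 1: full-warp coalescing merges everything
```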
Warp Size and Branch Divergence (continued)
[Figure: the same divergent branch executed with three 4-thread warps vs. one 12-thread warp. With small warps, a non-diverging warp (T4-T7) skips the inactive path entirely; the large warp must issue both paths with many lanes masked (X). Small warps save some idle cycles.]
Warping
- Thousands of threads are scheduled with zero overhead: all thread contexts are kept on-core
- Tens of threads are grouped into a warp and execute the same instruction in lock-step
Key Question
Which warp size should be chosen as the baseline?
- Then invest in augmenting the processor to remove the associated deficiency
- Machine models are used to find the answer
GPGPU-sim Config

NoC:
- Total number of SMs: 16
- Number of memory controllers: 6
- Number of SMs sharing a network interface: 2

SM:
- Threads per SM: 1024
- Maximum CTAs per SM: 8
- Shared memory / register file size: 16KB / 64KB
- SM SIMD width: 8
- Warp size: 8 / 16 / 32 / 64
- L1 data cache: 48KB, 8-way, LRU, 64B per block
- L1 texture cache: 16KB, 2-way, LRU, 64B per block
- L1 constant cache: 16KB, 2-way, LRU, 64B per block

Clocking:
- Core clock: 1300 MHz
- Interconnect clock: 650 MHz
- DRAM clock: 800 MHz

Memory:
- Banks per memory controller: 8
- DRAM scheduling policy: FCFS
Workloads

Name                                          Grid Size                          Block Size        #Insn
BFS: BFS Graph [3]                            16x(8,1,1)                         16x(512,1)        1.4M
BKP: Back Propagation [3]                     2x(1,64,1)                         2x(16,16)         2.9M
DYN: Dyn_Proc [3]                             13x(35,1,1)                        13x(256)          64M
FWAL: Fast Walsh Transform [6]                6x(32,1,1), 3x(16,1,1), (128,1,1)  7x(256), 3x(512)  11.1M
GAS: Gaussian Elimination [3]                 48x(3,3,1)                         48x(16,16)        8.8M
HSPT: Hotspot [3]                             (43,43,1)                          (16,16,1)         76.2M
MP: MUMmer-GPU++ [8]                          (1,1,1)                            (256,1,1)         0.3M
MTM: Matrix Multiply [14]                     (5,8,1)                            (16,16,1)         2.4M
MU: MUMmer-GPU [1]                            (1,1,1)                            (100,1,1)         0.15M
NNC: Nearest Neighbor on cuda [2]             4x(938,1,1)                        4x(16,1,1)        5.9M
NQU: N-Queen [1]                              (256,1,1)                          (96,1,1)          1.2M
NW: Needleman-Wunsch [3]                      2x(1,1,1)…2x(31,1,1), (32,1,1)     63x(16)           12.9M
SC: Scan [14]                                 (64,1,1)                           (256,1,1)         3.6M
SR1: Speckle Reducing Anisotropic Diffusion [3] (large dataset)  3x(8,8,1)       3x(16,16)         9.1M
SR2: Speckle Reducing Anisotropic Diffusion [3] (small dataset)  4x(4,4,1)       4x(16,16)         2.4M