
Page 1: Warp Size Impact in GPUs: Large or Small?

Warp Size Impact in GPUs: Large or Small?

Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari

ECE, University of Tehran; ECE, University of Victoria

Page 2: Warp Size Impact in GPUs: Large or Small?

2

This Work

Accelerators amortize control flow over groups of threads (warps).
Warp size impacts performance through branch/memory divergence and memory access coalescing:
  o Small warp: low branch/memory divergence (+), low memory coalescing (-)
  o Large warp: high branch/memory divergence (-), high memory coalescing (+)

Question: possible solutions?
  o Enhance coalescing in a small-warp machine (SW+), OR
  o Enhance divergence handling in a large-warp machine (LW+)

Winner: SW+


Page 3: Warp Size Impact in GPUs: Large or Small?

3

Outline

Branch/Memory Divergence
Memory Access Coalescing
Warp Size Impact
Warp Size: Large or Small?
  o Use machine models to find the answer:
    o Small-Warp Coalescing-Enhanced Machine (SW+)
    o Large-Warp Control-flow-Enhanced Machine (LW+)

Experimental Results

Conclusions & Future Work


Page 4: Warp Size Impact in GPUs: Large or Small?

4

Warping

Opportunities
  o Reduce scheduling overhead
  o Improve utilization of execution units (SIMD efficiency)
  o Exploit inter-thread data locality

Challenges
  o Memory divergence
  o Branch divergence


Page 5: Warp Size Impact in GPUs: Large or Small?

5

Memory Divergence

Threads of a warp may hit or miss in the L1 cache:

J = A[S];  // L1 cache access
L = K * J;

[Figure: a 4-thread warp (T0-T3) performs the load; three threads hit in L1 and one misses, so the whole warp stalls until the miss is serviced.]
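
To make this concrete, here is a minimal CUDA sketch (my illustration, not from the slides; the gather-through-S indexing mirrors the J = A[S] line above, and the kernel and array names are assumptions):

__global__ void gather(const int *A, const int *S, int *C, int K, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int J = A[S[tid]];   // data-dependent address: lanes of the same warp may hit or miss in L1 independently
        C[tid] = K * J;      // the warp cannot issue this until every lane's load has returned (memory-divergence stall)
    }
}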

Page 6: Warp Size Impact in GPUs: Large or Small?

6

Branch Divergence

A branch instruction can diverge to two different paths, dividing the warp into two groups:
  1. Threads with the taken outcome
  2. Threads with the not-taken outcome

if (J == K) {
    C[tid] = A[tid] * B[tid];
} else if (J > K) {
    C[tid] = 0;
}

[Figure: a 4-thread warp reaches the branch; lanes T0 and T3 execute one path while T1 and T2 are masked (X), then the mask flips for the other path, and the warp finally reconverges with all four lanes (T0-T3) active.]
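
For context, a minimal complete kernel around the slide's fragment (my wrapping; the per-thread J comes from an indexed load as on the previous slide, and the argument list is an assumption):

__global__ void divergent(const int *A, const int *B, const int *S, int *C, int K, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int J = A[S[tid]];                // per-thread value, so lanes of one warp can disagree
    if (J == K) {
        C[tid] = A[tid] * B[tid];     // some lanes of the warp take this path...
    } else if (J > K) {
        C[tid] = 0;                   // ...while others take this one: the warp executes the two paths serially
    }
}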

Page 7: Warp Size Impact in GPUs: Large or Small?

7

Memory Access Coalescing

Common memory accesses of neighboring threads are coalesced into one transaction.

[Figure: three 4-thread warps access data blocks A B A B, C C C C, and D E E D; accesses of neighboring threads to the same block are coalesced, so the warps issue five memory requests (A, B, C, D, E) rather than one per thread.]
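
A hedged CUDA sketch of the access patterns behind this (mine, not from the slides): in the first kernel, neighboring threads of a warp read neighboring words, so their accesses coalesce into a few wide transactions; in the second, a large stride spreads one warp's accesses over many memory segments:

__global__ void coalesced(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[tid];          // warp touches consecutive addresses -> few transactions
}

__global__ void strided(const float *in, float *out, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[tid * stride]; // warp's addresses land in many segments -> many transactions
}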

Page 8: Warp Size Impact in GPUs: Large or Small?

8

Warp Size

Warp size: the number of threads in a warp.

Small-warp advantages:
  o Less branch/memory divergence
  o Less synchronization overhead at every instruction

Large-warp advantage:
  o Greater opportunity for memory access coalescing


Page 9: Warp Size Impact in GPUs: Large or Small?

9

Warp Size and Branch Divergence

The smaller the warp size, the lower the branch divergence.

if (J > K) {
    C[tid] = A[tid] * B[tid];
} else {
    C[tid] = 0;
}

[Figure: the same per-thread branch outcomes across threads T1-T8 grouped two ways: 2-thread warps see no branch divergence, while 4-thread warps diverge.]
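
The effect in the figure can be reproduced by counting, for a fixed pattern of per-thread branch outcomes, how many warps contain both taken and not-taken threads. This host-side sketch is my illustration; the outcome pattern is made up:

#include <cstdio>
#include <vector>

int divergentWarps(const std::vector<bool>& taken, int warpSize) {
    int count = 0;
    int n = (int)taken.size();
    for (int base = 0; base < n; base += warpSize) {
        bool anyTaken = false, anyNotTaken = false;
        for (int i = 0; i < warpSize && base + i < n; ++i) {
            if (taken[base + i]) anyTaken = true; else anyNotTaken = true;
        }
        if (anyTaken && anyNotTaken) ++count;   // this warp has to serialize both paths
    }
    return count;
}

int main() {
    // Made-up outcome pattern for threads T1..T8: pairs agree, but groups of four do not.
    std::vector<bool> taken = {true, true, false, false, true, true, true, true};
    std::printf("2-thread warps that diverge: %d\n", divergentWarps(taken, 2));  // prints 0
    std::printf("4-thread warps that diverge: %d\n", divergentWarps(taken, 4));  // prints 1
}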

Page 10: Warp Size Impact in GPUs: Large or Small?

10

Warp Size and Memory Divergence

[Figure: small warps vs. large warps under the same hit/miss pattern. With three 4-thread warps, the warps whose threads all hit keep executing while only the warp with misses stalls, improving latency hiding; a single 12-thread warp stalls as a whole until all of its misses are serviced.]

Page 11: Warp Size Impact in GPUs: Large or Small?

11

Warp Size and Memory Access Coalescing

[Figure: small warps vs. large warps issuing the same misses. The three 4-thread warps coalesce only within each warp and generate 5 memory requests; the single 12-thread warp coalesces across all of its threads and generates 2 requests. Wider coalescing reduces the number of memory accesses: 5 memory requests vs. 2 memory requests.]
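
The request counts can be illustrated by grouping each warp's addresses into aligned segments. This host-side sketch is mine and uses a simpler, contiguous address pattern than the slide's figure; the 64-byte coalescing segment is an assumption:

#include <cstdint>
#include <cstdio>
#include <set>

int transactionsPerLaunch(int numThreads, int warpSize, std::uint64_t segmentBytes = 64) {
    int total = 0;
    for (int base = 0; base < numThreads; base += warpSize) {
        std::set<std::uint64_t> segments;                          // coalescing happens per warp
        for (int t = base; t < base + warpSize && t < numThreads; ++t)
            segments.insert((std::uint64_t)t * 4 / segmentBytes);  // each thread loads one 4-byte word
        total += (int)segments.size();                             // one request per touched segment
    }
    return total;
}

int main() {
    std::printf("12 threads as 4-thread warps:     %d requests\n", transactionsPerLaunch(12, 4));  // prints 3
    std::printf("12 threads as one 12-thread warp: %d requests\n", transactionsPerLaunch(12, 12)); // prints 1
}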

Page 12: Warp Size Impact in GPUs: Large or Small?

12

Warp Size Impact on Coalescing

Often, the larger the warp size, the higher the coalescing rate.

[Chart: coalescing rate for BKP, CP, HSPT, and MU at warp sizes 8, 16, 32, and 64.]

Page 13: Warp Size Impact in GPUs: Large or Small?

13

Warp Size Impact on Idle Cycles

MU: larger warp size -> more divergence -> more idle cycles. BKP: larger warp size -> better coalescing -> fewer idle cycles.

[Chart: contribution of idle cycles for BKP, CP, HSPT, and MU at warp sizes 8, 16, 32, and 64.]

Page 14: Warp Size Impact in GPUs: Large or Small?

14

Warp Size Impact on Performance

MU: larger warp size -> more divergence -> lower performance. BKP: larger warp size -> better coalescing -> higher performance.

[Chart: normalized IPC for BKP, CP, HSPT, and MU at warp sizes 8, 16, 32, and 64.]

Page 15: Warp Size Impact in GPUs: Large or Small?

15

Approach

Baseline machine, enhanced in two directions:
  o SW+: an ideal MSHR compensates for the coalescing loss of small warps
  o LW+: MIMD lanes compensate for the divergence of large warps


Page 16: Warp Size Impact in GPUs: Large or Small?

16

SW+

Warps as wide as the SIMD width:
  o Low branch/memory divergence, improved latency hiding

Compensating the coalescing loss -> ideal MSHR:
  o Compensates for the small-warp deficiency (loss of memory access coalescing)
  o The ideal MSHR prevents redundant memory transactions by merging the redundant requests of warps on the same SM
  o Outstanding MSHRs are searched to perform the merge (see the sketch after this list)

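
The merging behavior can be pictured with a small host-side sketch. This is my illustration of the idea, not the paper's implementation; the structure and names (IdealMshr, request, fill) are assumptions:

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// One entry per outstanding cache-block address, with the warps waiting on it.
struct IdealMshr {
    std::unordered_map<std::uint64_t, std::vector<int>> outstanding;

    // Returns true if a new memory transaction must be issued, false if the
    // request was merged into an already outstanding miss to the same block.
    bool request(std::uint64_t blockAddr, int warpId) {
        auto it = outstanding.find(blockAddr);
        if (it != outstanding.end()) {          // redundant request from another warp: merge it
            it->second.push_back(warpId);
            return false;
        }
        outstanding[blockAddr] = {warpId};      // first miss to this block on the SM
        return true;
    }

    // When the data returns, wake every warp that was merged onto this block.
    std::vector<int> fill(std::uint64_t blockAddr) {
        std::vector<int> waiters = std::move(outstanding[blockAddr]);
        outstanding.erase(blockAddr);
        return waiters;
    }
};

A real MSHR has a bounded number of entries and merge slots; the "ideal" qualifier on the slide suggests those limits are removed so every redundant request can be merged.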

Page 17: Warp Size Impact in GPUs: Large or Small?

17

LW+

Warps 8x larger than the SIMD width:
  o Improve memory access coalescing

Compensating divergence -> lock-step MIMD execution:
  o Compensates for the large-warp deficiency (branch/memory divergence); a toy sketch follows below

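
To make "lock-step MIMD" concrete, here is a toy host-side sketch (my illustration, not the paper's microarchitecture): every lane advances one instruction per cycle, but each lane fetches at its own PC, so diverged lanes stay busy instead of being masked off. The toy ISA and names are assumptions:

#include <cstdio>
#include <vector>

enum Op { ADD, JUMP, HALT };
struct Instr { Op op; int arg; };
struct Lane  { int pc = 0; int acc = 0; bool done = false; };

// One cycle: every lane issues in lock-step, but each lane fetches at its own PC.
void cycle(std::vector<Lane>& lanes, const std::vector<Instr>& prog) {
    for (Lane& l : lanes) {
        if (l.done) continue;
        const Instr& i = prog[l.pc];
        switch (i.op) {
            case ADD:  l.acc += i.arg; l.pc++; break;
            case JUMP: l.pc = i.arg;           break;
            case HALT: l.done = true;          break;
        }
    }
}

int main() {
    // Lane 0 starts on the "then" path (pc 0), lane 1 on the "else" path (pc 2);
    // neither lane sits idle while the other lane's path executes.
    std::vector<Instr> prog = { {ADD, 1}, {JUMP, 4}, {ADD, 100}, {JUMP, 4}, {HALT, 0} };
    std::vector<Lane>  lanes(2);
    lanes[1].pc = 2;
    while (!lanes[0].done || !lanes[1].done) cycle(lanes, prog);
    std::printf("lane0=%d lane1=%d\n", lanes[0].acc, lanes[1].acc);  // prints lane0=1 lane1=100
}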

Page 18: Warp Size Impact in GPUs: Large or Small?

18

Methodology

Cycle-accurate GPU simulation using GPGPU-sim:
  o Six memory controllers (76 GB/s)
  o 16 8-wide SMs (332.8 GFLOPS)
  o 1024 threads per core
  o Warp sizes: 8, 16, 32, and 64

Workloads from:
  o Rodinia
  o CUDA SDK
  o GPGPU-sim


Page 19: Warp Size Impact in GPUs: Large or Small?

19

Coalescing Rate

SW+: 86%, 58%, and 34% higher coalescing rate vs. 16, 32, and 64 threads/warp.
LW+: 37% and 17% higher coalescing rate vs. 16 and 32 threads/warp, and 1% lower vs. 64 threads/warp.


[Chart: coalescing rate (log scale) per benchmark (BFS, BKP, GAS, HSPT, MTM, MU, NQU, SR1, avg) for SW+, warp sizes 8/16/32/64, and LW+.]

Page 20: Warp Size Impact in GPUs: Large or Small?

20

Idle Cycles

SW+: 11%, 6%, and 8% fewer idle cycles vs. 8, 16, and 32 threads/warp.
LW+: 1% more idle cycles vs. 8 threads/warp, and 4% and 2% fewer vs. 16 and 32 threads/warp.


[Chart: contribution of idle cycles per benchmark (BFS, BKP, GAS, HSPT, MTM, MU, NQU, SR1, avg) for SW+, warp sizes 8/16/32/64, and LW+.]

Page 21: Warp Size Impact in GPUs: Large or Small?

21

Performance

SW+ outperforms LW+ by 11%, and the 8-, 16-, and 32-thread-warp configurations by 16%, 13%, and 20%.
LW+ outperforms the 8-, 16-, 32-, and 64-thread-warp configurations by 5%, 1%, 7%, and 15%.


[Chart: normalized IPC per benchmark (BFS, BKP, GAS, HSPT, MTM, MU, NQU, SR1, avg) for SW+, warp sizes 8/16/32/64, and LW+.]

Page 22: Warp Size Impact in GPUs: Large or Small?

22

Conclusion & Future Work

Warp Size Impacts Coalescing, Idle Cycles, and Performance

Investing in enhancing the small-warp machine (SW+) returns a higher gain than investing in enhancing the large-warp machine (LW+).

Future Work: Evaluating warp size impact on energy efficiency


Page 23: Warp Size Impact in GPUs: Large or Small?

23

Thank you! Questions?


Page 24: Warp Size Impact in GPUs: Large or Small?

24

Backup Slides


Page 25: Warp Size Impact in GPUs: Large or Small?

25

Coalescing Width

The range of threads in a warp that are considered together for memory access coalescing:
  o NVIDIA G80 -> over a sub-warp
  o NVIDIA GT200 -> over a half-warp
  o NVIDIA GF100 -> over the entire warp

When the coalescing width covers the entire warp, the optimal warp size depends on the workload.


Page 26: Warp Size Impact in GPUs: Large or Small?

26

Warp Size and Branch Divergence (continued)

[Figure: small warps vs. large warps executing the same divergent branch. Small warps whose threads all take the same path skip the other path entirely, saving some idle cycles; the large warp executes both paths with some lanes masked (X) throughout.]

Page 27: Warp Size Impact in GPUs: Large or Small?

27

Warping

Thousands of threads are scheduled with zero overhead:
  o The contexts of all threads are kept on-core

Tens of threads are grouped into a warp:
  o Threads of a warp execute the same instruction in lock-step (see the sketch below)

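
As a small hedged CUDA illustration of this grouping (mine, not from the slides), the kernel below records each thread's warp index and lane index within its thread block; the output array names are assumptions:

__global__ void whoAmI(int *warpOf, int *laneOf, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        warpOf[tid] = threadIdx.x / warpSize;  // warp index within the block (warpSize is a CUDA built-in, 32 on NVIDIA GPUs)
        laneOf[tid] = threadIdx.x % warpSize;  // lane index within the warp
    }
}

All threads of a block that share the same warpOf value are issued together and execute the same instruction in lock-step.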

Page 28: Warp Size Impact in GPUs: Large or Small?

28

Key Question

Which warp size should be chosen as the baseline?
  o Then invest in augmenting the processor to remove the associated deficiency

Machine models are used to find the answer.


Page 29: Warp Size Impact in GPUs: Large or Small?

29

GPGPU-sim Config

NoC:
  o Total number of SMs: 16
  o Number of memory controllers: 6
  o Number of SMs sharing a network interface: 2

SM:
  o Threads per SM: 1024
  o Maximum allowed CTAs per SM: 8
  o Shared memory / register file size: 16KB / 64KB
  o SM SIMD width: 8
  o Warp size: 8 / 16 / 32 / 64
  o L1 data cache: 48KB, 8-way, LRU, 64B per block
  o L1 texture cache: 16KB, 2-way, LRU, 64B per block
  o L1 constant cache: 16KB, 2-way, LRU, 64B per block

Clocking:
  o Core clock: 1300 MHz
  o Interconnect clock: 650 MHz
  o DRAM memory clock: 800 MHz

Memory:
  o Banks per memory controller: 8
  o DRAM scheduling policy: FCFS

Page 30: Warp Size Impact in GPUs: Large or Small?

30

Workloads (name; grid size; block size; instruction count):

  o BFS: BFS Graph [3]; 16x(8,1,1); 16x(512,1); 1.4M
  o BKP: Back Propagation [3]; 2x(1,64,1); 2x(16,16); 2.9M
  o DYN: Dyn_Proc [3]; 13x(35,1,1); 13x(256); 64M
  o FWAL: Fast Walsh Transform [6]; 6x(32,1,1), 3x(16,1,1), (128,1,1); 7x(256), 3x(512); 11.1M
  o GAS: Gaussian Elimination [3]; 48x(3,3,1); 48x(16,16); 8.8M
  o HSPT: Hotspot [3]; (43,43,1); (16,16,1); 76.2M
  o MP: MUMmer-GPU++ [8]; (1,1,1); (256,1,1); 0.3M
  o MTM: Matrix Multiply [14]; (5,8,1); (16,16,1); 2.4M
  o MU: MUMmer-GPU [1]; (1,1,1); (100,1,1); 0.15M
  o NNC: Nearest Neighbor on cuda [2]; 4x(938,1,1); 4x(16,1,1); 5.9M
  o NQU: N-Queen [1]; (256,1,1); (96,1,1); 1.2M
  o NW: Needleman-Wunsch [3]; 2x(1,1,1)…2x(31,1,1), (32,1,1); 63x(16); 12.9M
  o SC: Scan [14]; (64,1,1); (256,1,1); 3.6M
  o SR1: Speckle Reducing Anisotropic Diffusion [3] (large dataset); 3x(8,8,1); 3x(16,16); 9.1M
  o SR2: Speckle Reducing Anisotropic Diffusion [3] (small dataset); 4x(4,4,1); 4x(16,16); 2.4M
