SIMD Divergence Optimization through Intra-Warp Compaction

Aniruddha Vaidya, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi

ISCA 13




Problem

• GPU: wide SIMD lanes

– 16 lanes per warp in this work

• SIMD control flow divergence on “if/else” conditions

• Common solution: execute every control flow path sequentially for all channels

– Both the “if” and “else” portions are executed in turn by all channels, with the appropriate channels turned off in each path

• Recent studies: combine threads from different warps that follow the same if/else path

– Problem: this increases memory divergence (i.e., the number of distinct memory or cache line requests per SIMD instruction)
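Memory divergence as defined above can be counted directly. The sketch below is only an illustration: the 64-byte cache line size, the addresses, and the function name `distinct_cache_lines` are assumptions, not values from the slides.

```python
def distinct_cache_lines(addresses, line_bytes=64):
    """Count the distinct cache lines touched by one SIMD memory instruction.

    line_bytes=64 is an assumed cache line size, not a figure from the slides.
    """
    return len({addr // line_bytes for addr in addresses})


# One warp whose 16 lanes read consecutive 4-byte elements: one cache line.
same_warp = [0x1000 + 4 * lane for lane in range(16)]

# Hypothetical regrouped warp: each lane holds a thread taken from a different
# warp, and each of those threads works on data 256 bytes apart.
regrouped = [0x1000 + 256 * lane for lane in range(16)]

print(distinct_cache_lines(same_warp))   # 1
print(distinct_cache_lines(regrouped))   # 16
```

In this toy case, regrouping threads across warps turns one cache line request into sixteen, which is the cost that motivates staying within a single warp.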


Observation

• The number of hardware execution lanes is typically a fraction of the SIMD instruction width

– Example: 4-wide SIMD ALU in Intel’s Ivy Bridge GPU

• Wide SIMD instructions typically execute over multiple cycles because the hardware is narrower than the instruction.
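The relationship between logical and physical width can be written down as a small sketch. It assumes the 16-wide instruction and 4-wide ALU mentioned in the slides, and that channels are processed in consecutive groups; the function names are hypothetical.

```python
from math import ceil


def execution_cycles(simd_width=16, hw_lanes=4):
    """Cycles needed to push one SIMD instruction through a narrower ALU."""
    return ceil(simd_width / hw_lanes)


def channel_groups(simd_width=16, hw_lanes=4):
    """Channels handled in each cycle, assuming consecutive grouping."""
    return [list(range(start, min(start + hw_lanes, simd_width)))
            for start in range(0, simd_width, hw_lanes)]


print(execution_cycles())      # 4
print(channel_groups()[1])     # [4, 5, 6, 7]
```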


Goal

• By exploiting the difference between the logical and physical SIMD width of a GPU pipeline, this work addresses the SIMD control divergence problem with intra-warp compaction


GPU register file

[Figure: GPU register file. Registers r0, r1, r2, ..., r(n-1) are shown for Warp 0 and Warp 1; each register row spans 16 lanes, numbered 0 to f.]


Basic Cycle Compression (BCC)

Fused multiply-add (FMA) r3 = r0 * r1 + r2

Operand fetch: r0 @ cycle 1, r1 @ cycle 2, r2 @ cycle 3

Issue: @ cycle 4, @ cycle 5, @ cycle 6, @ cycle 7

Instead, we want to issue the next warp at cycle 5.


Basic Cycle Compression (BCC)

• In this example, the compressed execution time equals the execution time without the divergence caused by the “if/else” clause

[Figure: the “if” and “else” paths]
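A minimal sketch of the cycle counting behind BCC, assuming a 16-channel mask processed in groups of 4 and that a group with no enabled channel can be skipped entirely. The masks are hypothetical and are not taken from the slide's figure.

```python
def quads(mask, hw_lanes=4):
    """Split a channel mask into groups matching the hardware width."""
    return [mask[i:i + hw_lanes] for i in range(0, len(mask), hw_lanes)]


def baseline_cycles(mask, hw_lanes=4):
    """Without compaction, every group takes a cycle, enabled or not."""
    return len(quads(mask, hw_lanes))


def bcc_cycles(mask, hw_lanes=4):
    """With BCC, only groups with at least one enabled channel take a cycle."""
    return sum(1 for q in quads(mask, hw_lanes) if any(q))


# Hypothetical divergence: channels 0-7 take the "if", channels 8-15 the "else".
if_mask = [1] * 8 + [0] * 8
else_mask = [1 - bit for bit in if_mask]

print(baseline_cycles(if_mask) + baseline_cycles(else_mask))  # 8
print(bcc_cycles(if_mask) + bcc_cycles(else_mask))            # 4
```

With these masks the compressed total of 4 cycles matches what a single non-divergent 16-wide instruction would take on 4 lanes.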


Unfruitful Cases for BCC

• BCC does not help when the turned-off channels in an instruction are not contiguous, or are contiguous but not favorably aligned to the hardware SIMD pipeline width
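Both unfruitful cases can be reproduced with two hypothetical masks. The sketch assumes 4 hardware lanes and compares the cycles BCC needs against the lower bound set by the number of enabled channels; the function names are illustrative only.

```python
from math import ceil


def bcc_cycles(mask, hw_lanes=4):
    """Cycles under BCC: one per aligned group holding an enabled channel."""
    groups = [mask[i:i + hw_lanes] for i in range(0, len(mask), hw_lanes)]
    return sum(1 for g in groups if any(g))


def lower_bound_cycles(mask, hw_lanes=4):
    """Fewest cycles possible if enabled channels could be packed freely."""
    return ceil(sum(mask) / hw_lanes)


# Not contiguous: every other channel is enabled, so no group is empty.
scattered = [1, 0] * 8

# Contiguous but misaligned: channels 2-9 straddle three groups.
misaligned = [0, 0] + [1] * 8 + [0] * 6

print(bcc_cycles(scattered), lower_bound_cycles(scattered))    # 4 2
print(bcc_cycles(misaligned), lower_bound_cycles(misaligned))  # 3 2
```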


Swizzled Cycle Compression (SCC)

• The positions of disabled and enabled channels are rearranged

[Figure: the “if” and “else” paths]
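The effect of rearranging channels can be sketched as a simple packing step. This is an idealized illustration: it assumes any enabled channel can be routed to any hardware lane, which may not hold for the actual swizzle hardware, and `scc_schedule` is a hypothetical name.

```python
def scc_schedule(mask, hw_lanes=4):
    """Pack enabled channels into as few cycles as possible.

    Idealized: ignores any restriction on which lane a channel may use.
    """
    enabled = [ch for ch, on in enumerate(mask) if on]
    return [enabled[i:i + hw_lanes] for i in range(0, len(enabled), hw_lanes)]


# Hypothetical divergence: even channels take the "if", odd ones the "else".
if_mask = [1, 0] * 8
else_mask = [0, 1] * 8

print(scc_schedule(if_mask))    # [[0, 2, 4, 6], [8, 10, 12, 14]]
print(scc_schedule(else_mask))  # [[1, 3, 5, 7], [9, 11, 13, 15]]
```

Each path now takes 2 cycles instead of 4, even though no aligned group of 4 channels is fully disabled.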


Control Algorithm for Swizzling

• Method:

1. Detect the optimal number of cycles for execution

2. Balance occupancy across lanes

Example: 8 enabled channels on 4 hardware lanes, so the optimum is 8/4 = 2 cycles.

– For the 1st EXE cycle, fill idle lanes (1, 3) from busy lanes

– For the 2nd EXE cycle, fill idle lanes (1, 3) similarly

Enabled channels per lane:

Lane     Initial   After 1st fill   After 2nd fill
Lane 0   4         3                2
Lane 1   0         1                2
Lane 2   4         3                2
Lane 3   0         1                2
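The two steps of the method can be sketched as follows. This is an interpretation of the slide, not the authors' hardware logic: it assumes channel `ch` normally executes on lane `ch % 4`, that an idle lane borrows from the fullest lane, and that a lane only donates while it holds more than the optimum. The function name `balance_lanes` is hypothetical.

```python
from math import ceil


def balance_lanes(mask, hw_lanes=4):
    """Return (optimum cycles, occupancy history, channels per lane)."""
    # Assumption: channel ch normally executes on lane ch % hw_lanes.
    lanes = [[ch for ch, on in enumerate(mask) if on and ch % hw_lanes == lane]
             for lane in range(hw_lanes)]

    # Step 1: detect the optimal number of execution cycles.
    optimum = ceil(sum(len(q) for q in lanes) / hw_lanes)

    # Step 2: for each execution cycle, fill idle lanes from busy lanes.
    history = [[len(q) for q in lanes]]
    for cycle in range(1, optimum + 1):
        for lane in range(hw_lanes):
            if len(lanes[lane]) >= cycle:
                continue  # this lane already has work for this cycle
            donor = max(range(hw_lanes), key=lambda i: len(lanes[i]))
            if len(lanes[donor]) > optimum:
                lanes[lane].append(lanes[donor].pop())
        history.append([len(q) for q in lanes])
    return optimum, history, lanes


# Lanes 0 and 2 hold 4 enabled channels each; lanes 1 and 3 hold none.
optimum, history, lanes = balance_lanes([1, 0, 1, 0] * 4)
print(optimum)   # 2
print(history)   # [[4, 0, 4, 0], [3, 1, 3, 1], [2, 2, 2, 2]]
```

The occupancy history reproduces the per-lane totals of the slide's example: 4/0/4/0 initially, 3/1/3/1 after the first fill, and 2/2/2/2 after the second.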


Simulation Methods

• Execution-driven simulation

– In-house cycle-level Intel GPGPU simulator (standalone GPU simulation, or a module in parallel CPU+GPU simulation)

– Entire GPU performance simulation with entire memory hierarchy

– 50+ OpenCL benchmark applications evaluated

• Trace-driven simulation

– GPU core performance simulation only

– ~600 OpenCL, OpenGL, multimedia workload traces


Results

[Figure: ALU cycles saved (OpenGL and OpenCL). Y-axis: ALU cycles saved, 0% to 50%. X-axis: individual benchmarks (e.g., BFS, Lava, MD, NW, LuxMark, glbench). Series: BCC% and SCC%.]


Results

• System Performance (OpenCL; RayTracing)

• Dependent on Data Cluster Bandwidth (L3 cache)

[Figure: speedup (0% to 60%) for BCC% and SCC% at three bandwidths: ∞ bandwidth, 2 L3$ lines / cycle, 1 L3$ line / cycle]

On average (across divergent applications):

– +12% with 1 L3$ line / cycle bandwidth

– +18% with 2 L3$ lines / cycle bandwidth


Conclusion

• SIMD control divergence solutions

– Exploiting the multi-cycle execution feature of GPUs

– Intra-Warp Compaction

• Basic cycle compression

• Swizzled cycle compression


Register file organization

Baseline: use pairs of registers

BCC: fetch only half-width registers


Register file organization

Operand fetch (16 lanes, 512b) is done in 1 cycle. This operand is held in a 512b latch.

Each quad (128b) passes through a four-lane swizzler with individual lane enables.
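The widths quoted above imply 32 bits per lane and four lanes per quad. The sketch below works through that arithmetic and models the latch as a Python integer; the bit layout (lane 0 in the lowest bits) and the function name `split_into_quads` are assumptions for illustration.

```python
LANES = 16
OPERAND_BITS = 512
QUAD_BITS = 128

bits_per_lane = OPERAND_BITS // LANES        # 512 / 16
lanes_per_quad = QUAD_BITS // bits_per_lane  # 128 / bits_per_lane


def split_into_quads(latch_value):
    """Slice a 512b operand into four 128b quads, lowest bits first (assumed)."""
    quad_mask = (1 << QUAD_BITS) - 1
    return [(latch_value >> (QUAD_BITS * q)) & quad_mask
            for q in range(OPERAND_BITS // QUAD_BITS)]


# Fill each lane with its own index so the quads are easy to recognize.
latch = sum(lane << (bits_per_lane * lane) for lane in range(LANES))
quad_values = split_into_quads(latch)

print(bits_per_lane, lanes_per_quad, len(quad_values))  # 32 4 4
```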

Overhead