simd divergence optimization through intra-warp compactioncamelab.org/uploads/main/simd...

SIMD Divergence Optimization through Intra-Warp Compaction

Aniruddha Vaidya, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi

ISCA 13

Problem

• GPU: wide SIMD lanes – 16 lanes per warp in this work

• SIMD control flow divergence on “if/else” condition

• Common solution: sequentially execute all the control flow paths for all channels
– Both the “if” and “else” portions are executed in turn by all channels, while turning off the appropriate channels in each path

• Recent studies: combine threads from different warps that follow the same if/else path
– Problem: this increases memory divergence (i.e. the number of distinct memory or cache line requests per SIMD instruction)
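The common solution described above can be sketched in a few lines. This is a minimal illustration only: `run_divergent`, the lambda bodies, and the even/odd condition are hypothetical names and data chosen for the example, and the 16-channel width is taken from the warp size quoted on this slide.

```python
WARP_WIDTH = 16  # channels per warp, as stated for this work

def run_divergent(cond, if_fn, else_fn, values):
    """Execute the "if" path, then the "else" path, over all channels.

    Each path only writes results for the channels its mask enables;
    the other channels are turned off for that path.
    """
    if_mask = [bool(c) for c in cond]
    else_mask = [not m for m in if_mask]
    out = list(values)
    for mask, fn in ((if_mask, if_fn), (else_mask, else_fn)):
        for ch in range(len(values)):
            if mask[ch]:  # disabled channels keep their old value
                out[ch] = fn(values[ch])
    return out

values = list(range(WARP_WIDTH))
cond = [v % 2 == 0 for v in values]  # hypothetical divergent condition
result = run_divergent(cond, lambda x: x * 10, lambda x: -x, values)
```

Both paths walk all 16 channels, so a divergent warp pays for two full passes even though each channel does useful work in only one of them.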

Observation

• The number of hardware execution lanes is typically a fraction of the SIMD instruction width
– e.g. a 4-wide SIMD ALU in Intel’s Ivy Bridge GPU

• Wide SIMD instructions therefore typically execute over multiple cycles, because the hardware is narrower than the instruction.
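As a rough illustration of this observation, the cycle count of one SIMD instruction can be estimated from the two widths. The function name is hypothetical, and the defaults of 16 and 4 are simply the logical and physical widths quoted on these slides.

```python
import math

def execution_cycles(simd_width: int = 16, hw_lanes: int = 4) -> int:
    """Cycles needed to push one SIMD instruction through a narrower ALU."""
    return math.ceil(simd_width / hw_lanes)

# 16 logical channels over 4 physical lanes
baseline = execution_cycles()
print(baseline)  # prints 4
```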

Goal

• By exploiting the difference between the logical and physical SIMD width of a GPU pipeline, this work addresses the SIMD control divergence problem with intra-warp compaction

GPU register file

[Figure: GPU register file. Each warp (Warp 0, Warp 1, ...) has registers r0, r1, r2, ..., r(n-1); each register is 16 lanes wide, with lanes numbered 0 through f.]

Basic Cycle Compression (BCC)

Fused multiply-add (FMA) r3 = r0 * r1 + r2

Operand fetch: r0 @ cycle 1, r1 @ cycle 2, r2 @ cycle 3

Baseline issue: cycles 4, 5, 6, 7

Instead, we want to issue the next warp at cycle 5.

Basic Cycle Compression (BCC)

• In this example, the compressed execution time equals the execution time without the divergence caused by the “if/else” clause

[Figure: execution of the “if” and “else” paths]
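A minimal sketch of the cycle count under BCC, assuming a 4-lane ALU that processes a 16-channel instruction one aligned quad per cycle and can skip a quad only when all four of its channels are disabled. The masks are an assumed example (the "if" path takes the lower 8 channels, the "else" path the upper 8); the helper names are hypothetical.

```python
HW_LANES = 4  # physical ALU width (assumed, per the Ivy Bridge example)

def quads(mask):
    """Split a channel-enable mask into aligned groups of HW_LANES."""
    return [mask[i:i + HW_LANES] for i in range(0, len(mask), HW_LANES)]

def baseline_cycles(mask):
    """Baseline: every quad takes a cycle, enabled or not."""
    return len(quads(mask))

def bcc_cycles(mask):
    """BCC: only quads with at least one enabled channel take a cycle."""
    return sum(1 for quad in quads(mask) if any(quad))

if_mask = [1] * 8 + [0] * 8    # assumed: "if" path uses channels 0-7
else_mask = [0] * 8 + [1] * 8  # assumed: "else" path uses channels 8-f

divergent_baseline = baseline_cycles(if_mask) + baseline_cycles(else_mask)
divergent_bcc = bcc_cycles(if_mask) + bcc_cycles(else_mask)
no_divergence = baseline_cycles([1] * 16)
```

With these masks the divergent baseline costs 8 cycles, while BCC costs 4, the same as a single non-divergent instruction.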

Unfruitful Cases for BCC

• The turned-off channels in an instruction are not contiguous, or are contiguous but not favorably aligned to the hardware SIMD pipeline width

Swizzled Cycle Compression (SCC)

• The positions of disabled and enabled channels are rearranged

[Figure: the “if” and “else” paths after swizzling]
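The two unfruitful cases for BCC, and what rearranging channels could recover, can be compared with a small sketch. It assumes a 4-lane ALU and an idealized SCC that always packs the enabled channels perfectly; the masks and function names are hypothetical.

```python
import math

HW_LANES = 4  # physical ALU width (assumed)

def bcc_cycles(mask):
    """BCC: skip only aligned quads whose channels are all disabled."""
    groups = [mask[i:i + HW_LANES] for i in range(0, len(mask), HW_LANES)]
    return sum(1 for quad in groups if any(quad))

def scc_cycles(mask):
    """Idealized SCC: enabled channels are repacked into full quads."""
    return math.ceil(sum(mask) / HW_LANES)

alternating = [1, 0] * 8                   # enabled channels not contiguous
misaligned = [0] * 2 + [1] * 8 + [0] * 6   # contiguous, but straddles quads

print(bcc_cycles(alternating), scc_cycles(alternating))  # prints 4 2
print(bcc_cycles(misaligned), scc_cycles(misaligned))    # prints 3 2
```

In both masks 8 of 16 channels are enabled, so two cycles would suffice, but BCC alone cannot skip any quad in the first case and skips only one in the second.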

Control Algorithm for Swizzling

• Method:
1. Detect the optimal number of cycles for execution
2. Balance occupancy across lanes

• Example (8 enabled channels, 4 hardware lanes):
– Optimum cycle: 8/4 = 2
– Initial enabled channels per lane: Lane 0: Total 4, Lane 1: Total 0, Lane 2: Total 4, Lane 3: Total 0
– For 1st EXE cycle, fill idle lanes (1, 3) from busy lanes: totals become 3, 1, 3, 1
– For 2nd EXE cycle, fill idle lanes (1, 3) similarly: totals become 2, 2, 2, 2
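The two-step method can be sketched as a simple greedy balancer. This is an illustration under stated assumptions, not the hardware control logic: it assumes channel `c` natively sits on lane `c % 4`, and it moves one channel at a time from the busiest lane to the emptiest lane until no lane exceeds the optimum. The name `swizzle_schedule` and the example mask are hypothetical.

```python
import math

HW_LANES = 4  # physical ALU width (assumed)

def swizzle_schedule(mask):
    """Return (optimum_cycles, schedule); schedule[cycle][lane] is a channel
    index, or None if that lane is idle in that cycle."""
    # Per-lane queues of enabled channels (assumed home lane: c % HW_LANES).
    queues = [[c for c in range(len(mask)) if mask[c] and c % HW_LANES == lane]
              for lane in range(HW_LANES)]
    total = sum(len(q) for q in queues)

    # Step 1: detect the optimal number of execution cycles.
    optimum = math.ceil(total / HW_LANES)

    # Step 2: balance occupancy by moving channels from busy to idle lanes.
    while True:
        donor = max(range(HW_LANES), key=lambda l: len(queues[l]))
        if len(queues[donor]) <= optimum:
            break
        taker = min(range(HW_LANES), key=lambda l: len(queues[l]))
        queues[taker].append(queues[donor].pop())

    schedule = [[q[cyc] if cyc < len(q) else None for q in queues]
                for cyc in range(optimum)]
    return optimum, schedule

# Channels enabled only on lanes 0 and 2: per-lane totals 4, 0, 4, 0.
mask = [1, 0] * 8
optimum, schedule = swizzle_schedule(mask)
```

For this mask the per-lane totals go from 4, 0, 4, 0 to 2, 2, 2, 2, and every lane is busy in both execution cycles.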

Simulation Methods

• Execution-driven simulation
– In-house cycle-level Intel GPGPU simulator
  • Standalone GPU simulation
  • A module in parallel CPU+GPU simulation
– Entire GPU performance simulation with entire memory hierarchy
– 50+ OpenCL benchmark applications evaluated

• Trace-driven simulation
– GPU core performance simulation only
– ~600 OpenCL, OpenGL, multimedia workload traces

Results

[Figure: ALU cycles saved (OpenGL and OpenCL). Bar chart of ALU cycles saved (0% to 50%) by BCC and SCC per benchmark, including BFS, HtS, Lava, MD, NW, several RT-PR and RT-AO ray tracing variants, LuxMark, and glbench workloads.]

Results

• System Performance (OpenCL; RayTracing)

• Dependent on Data Cluster Bandwidth (L3 cache)

[Figure: speedup (0% to 60%) of BCC and SCC at three data cluster bandwidths: ∞ bandwidth, 2 L3$ lines / cycle, 1 L3$ line / cycle]

On average (across divergent applications):
– +12% with 1 L3$ line / cycle bandwidth
– +18% with 2 L3$ lines / cycle bandwidth

Conclusion

• SIMD control divergence solutions

– Exploiting the multi-cycle execution feature of GPUs

– Intra-Warp Compaction

• Basic cycle compression

• Swizzled cycle compression

Register file organization

• Baseline: use pairs of registers

• BCC: fetch only half-width registers

Register file organization

Operand fetch (16 lanes, 512b) is done in 1 cycle. This operand is held in a 512b latch. Each quad (128b) passes through a four-lane swizzler with individual lane enables.

Overhead
