GPUDet: A Deterministic GPU Architecture

Page 1: GPUDet: A Deterministic GPU Architecture

Hadi Jooybar GPUDet: A Deterministic GPU Architecture 1

GPUDet: A Deterministic GPU Architecture

Hadi Jooybar (1), Wilson Fung (1), Mike O’Connor (2), Joseph Devietti (3), Tor M. Aamodt (1)

(1) The University of British Columbia
(2) AMD Research
(3) University of Washington

Page 2: GPUDet: A Deterministic GPU Architecture

GPUs are ...
• Fast
• Energy efficient
• Commodity hardware

But...

× Mostly used for a certain range of applications

Why?

Communication among concurrent threads (1000s of threads)

Page 3: GPUDet: A Deterministic GPU Architecture

Motivation

BFS algorithm, published in HiPC 2007

    __global__ void BFS_step_kernel(...) {
      if( active[tid] ) {
        active[tid] = false;
        visited[tid] = true;
        foreach (int id = neighbour_nodes){
          if( visited[id] == false ){
            cost[id] = cost[tid] + 1;
            active[id] = true;
            *over = true;
          }
        }
      }
    }

[Figure: a three-node graph (V0, V1, V2) shown at three successive steps. First both labelled nodes show Cost = -, Active = -; then both show Cost = 1, Active = 1; finally one shows Cost = 1, Active = 1 and the other Cost = 2, Active = 1.]

Page 4: GPUDet: A Deterministic GPU Architecture

Motivation

I will debug it this time

What about debuggers?!

The bug may appear only occasionally, or in a different place in each run.

OMG! Where was that bug?!

Page 5: GPUDet: A Deterministic GPU Architecture

GPUDet
Strong Determinism (hardware proposal)

• Same outputs
• Same execution path

Makes the program easier to:
• Debug
• Test

Page 6: GPUDet: A Deterministic GPU Architecture

Motivation

BFS algorithm, published in HiPC 2007

[The BFS_step_kernel listing from Page 3 is shown again.]

[Figure: the V0, V1, V2 graph in its final state: one node shows Cost = 1, Active = 1 and the other Cost = 2, Active = 1.]

Page 7: GPUDet: A Deterministic GPU Architecture

GPUDet
Strong Determinism

• Same outputs
• Same execution path

Makes the program easier to:
• Debug
• Test

× There is no free lunch
× Performance overhead

Our goal is to provide deterministic execution on GPU architectures with acceptable performance overhead.

Page 8: GPUDet: A Deterministic GPU Architecture

GPU Architecture

[Figure: GPU block diagram. Labels: CPU (kernel launch); workgroups (workgroup 0, workgroup 1, workgroup 2); compute unit (ALUs, memory unit, L1 cache); L2 cache; DRAM.]

Example kernel body:

    x = input[threadID];
    y = func(x);
    output[threadID] = y;

Page 9: GPUDet: A Deterministic GPU Architecture

Outline

• Introduction
• GPU Architecture
• Challenges
• Deterministic Execution with GPUDet
• GPUDet Optimizations
  • Workgroup-Aware Quantum Formation
  • Deterministic parallel commit using Z-Buffer Unit
  • Compute Unit level serialization
• Results and Conclusion

Page 10: GPUDet: A Deterministic GPU Architecture

Deterministic GPU Execution Challenges

• Isolation mechanism
• Provide a method to pause execution of a thread

[Figure: normal execution of threads T0 to T3, compared with execution divided into quanta (Quantum 0 ... Quantum n). Within the quanta, threads T0 to T3 alternate between an isolation phase and a communication phase.]

Page 11: GPUDet: A Deterministic GPU Architecture

Deterministic GPU Execution Challenges

• Isolation mechanism
  • Lack of private caches
  • Lack of cache coherency
• Provide a method to pause execution of a thread
  • Single Instruction Multiple Threads (SIMT)
  • Potential deadlock condition
  • Major changes in control flow hardware
  • Performance overhead

[Figure: a workgroup and its wavefronts.]

Page 12: GPUDet: A Deterministic GPU Architecture

Deterministic GPU Execution Challenges

• Very large number of threads
  • Expensive global synchronization
  • Expensive serialization
• Different program properties
  • Large number of short-running threads
  • Frequent workgroup synchronization
  • Less locality in intra-thread memory accesses

Page 14: GPUDet: A Deterministic GPU Architecture

Deterministic Execution of a Wavefront

    if (tid < 16) x[tid%2] = tid;

[Figure: threads T0, T1, T2, ..., T15 send the stores x[0] = 0, x[1] = 1, x[0] = 2, ..., x[1] = 15 to the coalescing unit. The stores form a data race on x[0] and x[1].]

Coalescing unit output (to memory):

    Mask:     v   v   -   -   -   -   -   -   ...   -
    Address:  x
    Data:     14  15  -   -   -   -   -   -   ...   -

x[0] = 14, x[1] = 15; the remaining locations are not modified.

Execution of one wavefront is deterministic
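
The merge on this slide can be sketched in host-side C++. This is only an illustrative model, not the hardware design: `LaneStore`, `merge_wavefront_stores`, and `example_stores` are hypothetical names, and the rule "the highest-numbered active lane wins" is an assumption chosen because it reproduces the slide's result (x[0] = 14, x[1] = 15).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// One lane's pending store; `active` mirrors the SIMT active mask.
struct LaneStore {
    bool active;
    std::size_t index;   // element index into x
    std::int32_t value;
};

// Hypothetical model of the coalescing unit: lanes are visited in ascending
// lane order and later lanes overwrite earlier ones, so a conflict between
// lanes of the same wavefront is always resolved the same way.
std::map<std::size_t, std::int32_t>
merge_wavefront_stores(const std::vector<LaneStore>& lanes) {
    std::map<std::size_t, std::int32_t> merged;
    for (const LaneStore& lane : lanes) {
        if (lane.active) merged[lane.index] = lane.value;
    }
    return merged;
}

// Builds the per-lane stores for `if (tid < 16) x[tid % 2] = tid;`.
std::vector<LaneStore> example_stores(std::size_t wavefront_size) {
    assert(wavefront_size >= 16);
    std::vector<LaneStore> lanes;
    for (std::size_t tid = 0; tid < wavefront_size; ++tid) {
        lanes.push_back({tid < 16, tid % 2, static_cast<std::int32_t>(tid)});
    }
    return lanes;
}
```

Calling `merge_wavefront_stores(example_stores(32))` yields exactly two entries, for x[0] and x[1], which is the masked write the figure sends to memory.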

Page 15: GPUDet: A Deterministic GPU Architecture

Deterministic GPU Execution Challenges

• Isolation mechanism
• Provide a method to pause execution of a thread: at wavefront granularity, this is not a challenge anymore

[Figure: threads T0 to T3 alternating between isolation and communication phases.]

Page 16: GPUDet: A Deterministic GPU Architecture

• GPUDet-Basic

[Figure: wavefronts with their store buffers and local memory, and global memory marked read only. Operations shown: Load Op, Atomic Op, Commit.]

Reaching Quantum Boundary

1. Instruction Count
2. Atomic Operations
3. Memory Fences
4. Workgroup Barriers
5. Execution Complete
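
A minimal host-side C++ sketch of the two ideas on this slide: the five quantum-boundary conditions, and store buffers that keep global memory read-only until commit. All names (`Op`, `ends_quantum`, `Wavefront`, `commit`) are hypothetical. Two modelling choices are assumptions rather than details from the slides: unwritten addresses read as 0, and on a conflicting address the lowest wavefront ID wins at commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Address = std::uint64_t;
using Memory = std::map<Address, std::int32_t>;

// Instruction classes relevant to quantum formation (hypothetical names).
enum class Op { Alu, Load, Store, Atomic, Fence, Barrier, Exit };

// True when the wavefront must stop at a quantum boundary: it reached the
// instruction-count limit, or the instruction is an atomic, a memory fence,
// a workgroup barrier, or the end of execution.
bool ends_quantum(Op op, int executed, int quantum_size) {
    return executed >= quantum_size || op == Op::Atomic || op == Op::Fence ||
           op == Op::Barrier || op == Op::Exit;
}

struct Wavefront {
    int id = 0;
    Memory store_buffer;  // private writes, invisible to other wavefronts

    // Stores never touch global memory during parallel mode.
    void store(Address a, std::int32_t v) { store_buffer[a] = v; }

    // Loads see this wavefront's own buffered stores first, then the
    // read-only global memory (assumption: missing addresses read as 0).
    std::int32_t load(const Memory& global, Address a) const {
        auto hit = store_buffer.find(a);
        if (hit != store_buffer.end()) return hit->second;
        auto g = global.find(a);
        return g == global.end() ? 0 : g->second;
    }
};

// Commit: drain the buffers in descending wavefront ID, so the value left
// in memory for a conflicting address is always the lowest ID's value,
// regardless of which wavefront finished its quantum first.
void commit(Memory& global, std::vector<Wavefront>& wavefronts) {
    for (std::size_t i = 1; i < wavefronts.size(); ++i)
        assert(wavefronts[i - 1].id < wavefronts[i].id);  // sorted by ID
    for (auto w = wavefronts.rbegin(); w != wavefronts.rend(); ++w) {
        for (const auto& entry : w->store_buffer) global[entry.first] = entry.second;
        w->store_buffer.clear();
    }
}
```

The point of the model is that nothing a wavefront writes during a quantum can be observed by another wavefront until `commit` runs, and `commit` applies a fixed ordering rule.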

Page 18: GPUDet: A Deterministic GPU Architecture

Workgroup-Aware Quantum Formation

• Extra global synchronizations
• Load imbalance

Reducing the number of synchronizations: avoid unnecessary quantum termination

Page 19: GPUDet: A Deterministic GPU Architecture

Workgroup-Aware Quantum Formation

[Figure: percentage of quantum termination reasons (0% to 100%) for AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt. Series: Atomic Operations, Instruction Count, Execution Complete, Workgroup Barriers.]

Quanta are finished by workgroup barriers

Workgroup-Aware Decision Making: when all wavefronts of a workgroup reach a workgroup barrier, continue execution in the parallel mode
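
The decision rule can be sketched as a small predicate in C++. This is an illustrative sketch under assumptions: `WavefrontState` and `workgroup_continues_in_parallel_mode` are hypothetical names, and the three states are a simplification chosen for the example.

```cpp
#include <cassert>
#include <vector>

// Simplified per-wavefront status within the current quantum (hypothetical).
enum class WavefrontState { Running, AtWorkgroupBarrier, AtOtherBoundary };

// Returns true when every wavefront of the workgroup is waiting at the
// workgroup barrier. In that case the barrier can be released locally and
// the workgroup keeps running in parallel mode, instead of ending the
// quantum and paying for a global synchronization.
bool workgroup_continues_in_parallel_mode(
        const std::vector<WavefrontState>& workgroup) {
    assert(!workgroup.empty());
    for (WavefrontState state : workgroup) {
        if (state != WavefrontState::AtWorkgroupBarrier) return false;
    }
    return true;
}
```

If any wavefront of the workgroup is still running, or stopped for another reason such as an atomic operation, the predicate is false and the barrier is treated as an ordinary quantum boundary.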

Page 20: GPUDet: A Deterministic GPU Architecture

Workgroup-Aware Quantum Formation

[Figure: percentage of quantum termination reasons (0% to 100%) for AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt. Series: Atomic Operations, Instruction Count, Execution Complete, Workgroup Barriers.]

Finish execution of the kernel function

Workgroup-Aware Decision Making

Deterministic workgroup partitioning

Page 21: GPUDet: A Deterministic GPU Architecture

Deterministic Parallel Commit using the Z-Buffer Unit

Depth Buffer states, in sequence:

    (1)            (2)            (3)            (4)
    ∞ ∞ ∞ ∞ ∞ ∞    ∞ ∞ ∞ ∞ ∞ ∞    8 8 8 8 8 8    8 8 5 5 8 8
    ∞ ∞ ∞ ∞ ∞ ∞    ∞ ∞ ∞ ∞ ∞ ∞    8 8 8 8 8 8    8 8 5 5 5 8
    ∞ ∞ ∞ ∞ ∞ ∞    7 7 7 ∞ ∞ ∞    7 7 7 8 8 8    7 5 5 5 5 5
    ∞ ∞ ∞ ∞ ∞ ∞    7 7 7 ∞ ∞ ∞    7 7 7 8 8 8    7 5 5 5 5 5
    ∞ ∞ ∞ ∞ ∞ ∞    7 7 7 ∞ ∞ ∞    7 7 7 8 8 8    5 5 5 5 5 5

Store Buffer Contents ≈ Color Values

Wavefront ID ≈ Depth Values

Z-Buffer Unit
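
The analogy can be sketched in C++ as a depth-tested write. This is a software model under assumptions, not the hardware unit: `Cell`, `StoreEntry`, `zbuffer_write`, and `commit_all` are hypothetical names, the maximum `uint32_t` value stands in for the ∞ depth, and each wavefront is assumed to hold at most one buffered store per address.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <vector>

using Address = std::uint64_t;

// One memory location with its depth tag. The depth starts at "infinity".
struct Cell {
    std::uint32_t depth = std::numeric_limits<std::uint32_t>::max();
    std::int32_t value = 0;
};

// One store-buffer entry: the wavefront ID plays the role of the depth value
// and the stored data plays the role of the color value.
struct StoreEntry {
    std::uint32_t wavefront_id;
    Address addr;
    std::int32_t value;
};

// Depth test: the write lands only if its wavefront ID is smaller than the
// depth already recorded for that address.
void zbuffer_write(std::map<Address, Cell>& memory, const StoreEntry& entry) {
    assert(entry.wavefront_id != std::numeric_limits<std::uint32_t>::max());
    Cell& cell = memory[entry.addr];
    if (entry.wavefront_id < cell.depth) {
        cell.depth = entry.wavefront_id;
        cell.value = entry.value;
    }
}

// Applies the entries in whatever order they arrive. Because of the depth
// test, the final contents do not depend on that order.
std::map<Address, Cell> commit_all(const std::vector<StoreEntry>& entries) {
    std::map<Address, Cell> memory;
    for (const StoreEntry& entry : entries) zbuffer_write(memory, entry);
    return memory;
}
```

Feeding the same entries in two different arrival orders produces the same final values, which is what lets the commit run in parallel without a serial ordering step.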

Page 22: GPUDet: A Deterministic GPU Architecture

Compute Unit Level Serialization

• GPUs preserve point-to-point ordering
• Serialization is only among compute units
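
A rough C++ sketch of what serializing among compute units could look like. It is a simplified model under stated assumptions, not the actual hardware: `AtomicAdd` and `serial_mode` are hypothetical names, each compute unit's queue is assumed to already hold its wavefronts' atomic operations in issue order, and point-to-point ordering is modelled by processing each queue front to back.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One atomic add issued by a wavefront (hypothetical representation).
struct AtomicAdd {
    int wavefront_id;
    std::int32_t operand;
};

// Serial mode with one turn per compute unit. Within a turn, the compute
// unit's atomics are applied in issue order, so only the order among
// compute units has to be enforced. Returns the old values observed by
// each atomic, in the order the atomics were applied.
std::vector<std::int32_t> serial_mode(
        std::int32_t& counter,
        const std::vector<std::vector<AtomicAdd>>& per_compute_unit) {
    assert(!per_compute_unit.empty());
    std::vector<std::int32_t> old_values;
    for (const auto& queue : per_compute_unit) {  // one turn per compute unit
        for (const AtomicAdd& op : queue) {       // in issue order
            old_values.push_back(counter);
            counter += op.operand;
        }
    }
    return old_values;
}
```

In this model the number of turns equals the number of compute units rather than the number of wavefronts, which is the source of the saving the slide points at.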

Page 24: GPUDet: A Deterministic GPU Architecture

Results

• GPGPU-Sim 3.0.2

[Figure: normalized execution time (0 to 5) for AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt, broken down into Parallel Mode, Commit Mode, and Serial Mode. Annotations: "2x Slowdown" and "Applications with atomic operations".]

Page 25: GPUDet: A Deterministic GPU Architecture

Quantum Formation

• 20% performance improvement for applications with barriers
• 19% performance improvement for applications with small kernel functions

[Figure: normalized execution time (0 to 5) per benchmark, plus the average (AVG). Series: GPUDet-base, Workgroup Barrier, End of the Kernel.]

Page 26: GPUDet: A Deterministic GPU Architecture

Deterministic Parallel Commit using the Z-Buffer Unit

[Figure: normalized execution time (0 to 10) with a Z-Buffer bar and a Locking bar for each of AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, and CLopt. The series legend did not survive extraction.]

60% performance improvement on average

Page 27: GPUDet: A Deterministic GPU Architecture

Compute Unit Level Serialization

[Figure: normalized execution time (0 to 14) for CLopt, HT, and ATM, with a W-Ser bar and a CU-Ser bar for each. Serial Mode is one of the stacked series; the other series names did not survive extraction.]

6.1x performance improvement in serial mode

Page 28: GPUDet: A Deterministic GPU Architecture

Conclusion

• Encourages programmers to use GPUs in a broader range of applications
• Exploits GPU characteristics to reduce performance overhead
  • Deterministic execution within a wavefront
  • Workgroup-aware quantum formation
  • Deterministic parallel commit using Z-Buffer Unit
  • Compute Unit level serialization

Questions?