GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications
Bo Fang, Karthik Pattabiraman, Matei Ripeanu (The University of British Columbia); Sudhanva Gurumurthi (AMD Research)


TRANSCRIPT

Page 1: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications

Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia

Sudhanva Gurumurthi, AMD Research

Page 2: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Graphics Processing Units: the main accelerators for general-purpose applications (e.g. TITAN, with 18,000 GPUs)

Trade-off: performance vs. power vs. reliability

Page 3: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Reliability trends: the soft-error rate per chip will continue to increase because of larger arrays; the probability of soft errors in combinational logic will also increase. [Constantinescu, Micro 2003]

Source: final report for the CCC cross-layer reliability visioning study (http://www.relxlayer.org/)

Page 4: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

GPGPUs vs. Traditional GPUs

Traditional GPUs: image rendering. GPGPUs: general-purpose computation, e.g. DNA matching:

ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT
ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT

A single corrupted character changes the matching result: a serious consequence!
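The point can be made concrete with a tiny sketch: the two sequences on the slide are identical except for one character, yet an exact-match check (standing in here, in Python, for a real DNA-matching kernel) already gives the opposite answer:

```python
def exact_match(a: str, b: str) -> bool:
    """Trivial exact-match check standing in for a DNA-matching kernel."""
    return a == b

# The two sequences from the slide: identical except for one character.
ref   = "ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT"
query = "ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT"

diffs = [i for i, (x, y) in enumerate(zip(ref, query)) if x != y]
print(len(diffs), exact_match(ref, query))  # 1 False
```

One corrupted character (e.g. caused by a transient hardware fault) flips the outcome of the whole computation.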

Page 5: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Vulnerability and Error Resilience

Vulnerability: the probability that the system experiences a fault that causes a failure; it does not consider the behavior of applications.
Error resilience: given that a fault occurs, how resilient the application is. ✔

Faults propagate up the system stack: Device/Circuit Level, Architectural Level, Operating System Level, Application Level.
Architecture Vulnerability Factor (AVF) studies stop at the architectural level; error-resilience studies extend to the application level.

Page 6: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Goal

Investigate the error-resilience characteristics of applications, find correlations with application properties, and enable application-specific, software-based fault-tolerance mechanisms.

Page 7: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

GPU-Qin: a methodology and a tool

Fault injection: introduce faults in a systematic, controlled manner and study the system's behavior.

GPU-Qin injects faults on real hardware via cuda-gdb. Advantages: representativeness and efficiency.

Page 8: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Challenges

High level: massive parallelism; injecting into all threads is impossible!
In practice: limitations in cuda-gdb; breakpoints are set at the source-code level, not the instruction level, so single-stepping is required to reach target instructions.

Page 9: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Fault Model

Transient faults (e.g. soft errors): single-bit flips, with no cost to extend to multiple-bit flips.
Faults are injected into the arithmetic and logic unit (ALU), floating-point unit (FPU), and load-store unit (LSU); other structures are ECC-protected.
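The single-bit-flip model can be sketched as follows. This is an illustrative Python simulation, not the actual injector, which flips bits in live register state through cuda-gdb:

```python
import random

def flip_bit(value, bit=None, width=32):
    """Flip one randomly chosen bit (or a given bit) of a width-bit value,
    emulating a transient single-bit fault in an ALU/FPU/LSU result."""
    if bit is None:
        bit = random.randrange(width)
    return value ^ (1 << bit)

# Example: a single-bit fault in the result of an addition.
correct = 7 + 5                    # fault-free result: 12
faulty = flip_bit(correct, bit=4)  # deterministic flip of bit 4 for illustration
print(correct, faulty)             # 12 28
```

Extending to a multiple-bit model only requires calling `flip_bit` more than once, which is why the slide notes there is no cost to do so.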

Page 10: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Overview of the FI methodology

Three phases: Grouping, Profiling, Fault injection

Page 11: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Grouping

• Detect groups of threads with similar behavior (i.e. the same number of dynamic instructions)
• Pick representative groups (challenge 1)

Category      Benchmarks                                     Groups   Groups to profile   % of threads in picked groups
Category I    AES, MRI-Q, MAT, MergeSort-k0, Transpose       1        1                   100%
Category II   SCAN, Stencil, Monte Carlo, SAD, LBM, HashGPU  2-10     1-4                 95%-100%
Category III  BFS                                            79       2                   >60%
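The grouping step can be sketched as: cluster thread IDs by their dynamic instruction count and pick one representative per group, largest groups first. This is a simplified Python illustration with made-up counts; GPU-Qin obtains the real counts by profiling the kernels:

```python
from collections import defaultdict

def group_threads(dyn_inst_counts):
    """Group thread IDs by their dynamic instruction count."""
    groups = defaultdict(list)
    for tid, count in dyn_inst_counts.items():
        groups[count].append(tid)
    return dict(groups)

def pick_representatives(groups):
    """Pick one representative thread per group, largest groups first."""
    ordered = sorted(groups.values(), key=len, reverse=True)
    return [members[0] for members in ordered]

# Hypothetical kernel: threads 0-3 execute 100 instructions, thread 4 executes 250.
counts = {0: 100, 1: 100, 2: 100, 3: 100, 4: 250}
groups = group_threads(counts)        # {100: [0, 1, 2, 3], 250: [4]}
print(pick_representatives(groups))   # [0, 4]
```

Injecting only into representatives is what makes whole-application injection tractable despite massive parallelism.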

Page 12: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Profiling

Finds the instructions executed by a thread and their corresponding CUDA source-code lines (challenge 2).

Output: an instruction trace consisting of the program-counter values and the source line associated with each instruction.
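Each trace entry pairs a program counter with a source line. A minimal sketch of such a trace (the PC and line values here are hypothetical, for illustration only):

```python
from typing import NamedTuple

class TraceEntry(NamedTuple):
    """One profiled instruction: its program counter and CUDA source line."""
    pc: int
    source_line: int

# Hypothetical trace fragment for one thread of a kernel.
trace = [
    TraceEntry(pc=0x78, source_line=42),
    TraceEntry(pc=0x80, source_line=42),
    TraceEntry(pc=0x88, source_line=43),
]

# The trace maps each source line to its PCs, so a source-level breakpoint
# can be refined by single-stepping to one specific instruction.
pcs_on_line_42 = [e.pc for e in trace if e.source_line == 42]
print([hex(pc) for pc in pcs_on_line_42])  # ['0x78', '0x80']
```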

Page 13: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Fault injection

An injection run proceeds as follows:
1. Native execution until a breakpoint is hit
2. Single-step execution until the target PC is hit
3. Fault injection
4. Single-step execution through the activation window
5. Native execution until the end of the program
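The control flow above can be sketched as a simulation. This is pure Python standing in for the cuda-gdb-driven debugger loop; the trace, `target_pc`, and the window limit are illustrative values:

```python
def injection_run(trace, target_pc, flip, max_window=1000):
    """Simulate one fault-injection run over an instruction trace.

    Runs 'natively' to the target PC, injects the fault there, then
    'single-steps' through a bounded activation window (the heuristic
    that limits single-stepping cost), and finishes natively.
    """
    injected = False
    steps_after_injection = 0
    for pc in trace:
        if pc == target_pc and not injected:
            flip()                     # corrupt state at the target instruction
            injected = True
        elif injected:
            steps_after_injection += 1
            if steps_after_injection >= max_window:
                break                  # stop single-stepping, run natively to the end
    return injected

state = {"r0": 12}
ran = injection_run(trace=[0x70, 0x78, 0x80], target_pc=0x78,
                    flip=lambda: state.update(r0=state["r0"] ^ 1))
print(ran, state["r0"])  # True 13
```

The bounded `max_window` corresponds to the activation-window heuristic described on the next slide.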

Page 14: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Heuristics for efficiency

Loops may have a large number of iterations: limit the number of loop iterations explored.
Single-stepping is time-consuming: limit the activation-window size.

Page 15: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Characterization study

Hardware: NVIDIA Tesla C2070/C2075. 12 CUDA benchmarks, 15 kernels, from Rodinia, Parboil, and the CUDA SDK.

Outcomes:
• Benign: correct output
• Crash: exceptions from the system or application
• Silent Data Corruption (SDC): incorrect output
• Hang: takes considerably longer than normal to finish
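Classifying a run into these four outcomes can be sketched as follows. This is a simplified Python illustration; the hang threshold and return-code convention are assumptions, not GPU-Qin's exact rules:

```python
def classify_outcome(exit_code, output, golden_output, runtime, normal_runtime,
                     hang_factor=10.0):
    """Map one injected run to Benign / Crash / SDC / Hang.

    hang_factor: how many multiples of the fault-free runtime count as a
    hang (an assumed threshold, for illustration).
    """
    if runtime > hang_factor * normal_runtime:
        return "Hang"
    if exit_code != 0:
        return "Crash"    # exception from the system or the application
    if output != golden_output:
        return "SDC"      # program finished but produced incorrect output
    return "Benign"

print(classify_outcome(0, [1, 2, 3], [1, 2, 3], runtime=1.0, normal_runtime=1.0))  # Benign
print(classify_outcome(0, [1, 2, 9], [1, 2, 3], runtime=1.0, normal_runtime=1.0))  # SDC
```

The `golden_output` comes from a fault-free reference run of the same benchmark and input.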

Page 16: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Error Resilience Characteristics

[Bar charts: SDC rate and crash rate for each benchmark: AES, HashGPU-md5, HashGPU-sha1, MRI-Q, MAT, Transpose, SAD-k0, SAD-k1, SAD-k2, Stencil, SCAN-block, MONTE, MergeSort-k0, BFS, LBM]

1. SDC rates and crash rates vary significantly across benchmarks: SDC 1%-38%, crash 5%-70%.
2. The average failure rate is 67%.
3. CPUs show 5%-15% differences in SDC rates.

We need application-specific FT mechanisms.

Page 17: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Hardware error detection mechanisms

• Hardware exceptions

AES (crash rate: 43%), MAT (crash rate: 30%)

1. Exceptions are due to memory-related errors.
2. Crash rates are comparable on CPUs and GPUs.

Page 18: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Crash latency: measuring fault propagation

[Two example crash-latency distributions]

1. Device-illegal-address exceptions have the highest latency in all benchmarks except Stencil.
2. Warp-out-of-range exceptions have lower crash latency otherwise.
3. Crash latency on CPUs is shorter (thousands of cycles) [Gu, DSN04].

We can use crash latency to design application-specific checkpoint/recovery techniques.

Page 19: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Resilience Categories

Resilience Category   Benchmarks                                   Measured SDC   Dwarfs
Search-based          MergeSort                                    6%             Backtrack and Branch+Bound
Bit-wise Operation    HashGPU, AES                                 25%-37%        Combinational Logic
Average-out Effect    Stencil, MONTE                               1%-5%          Structured Grids, Monte Carlo
Graph Processing      BFS                                          10%            Graph Traversal
Linear Algebra        Transpose, MAT, MRI-Q, SCAN-block, LBM, SAD  15%-25%        Dense Linear Algebra, Sparse Linear Algebra, Structured Grids

Page 20: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Conclusion

GPU-Qin: a methodology and a fault-injection tool to characterize the error resilience of GPU applications.
We find significant variations in the resilience characteristics of GPU applications.
Algorithmic characteristics of an application can help us understand the variation in SDC rates.

Page 21: GPU-Qin: A Methodology  For Evaluating Error Resilience of GPGPU Applications

Thank you!

Project homepage: http://netsyslab.ece.ubc.ca/wiki/index.php/FTGPU
Code: https://github.com/DependableSystemsLab/GPU-Injector