functional safety and the gpu - nvidia€¦ · description 2013 statistics 2015 statistics fatal...

Richard Bramley, 5/11/2017

FUNCTIONAL SAFETY AND THE GPU

2

AGENDA

How good is good enough

What is functional safety

Functional safety and the GPU

Safety support in Nvidia GPU

Conclusions

3

HOW GOOD IS GOOD ENOUGH ?

4

N. Saxena4

ACCIDENT STATISTICS– US1

Description 2013 Statistics 2015 Statistics

Fatal Crashes 30,057 35,092

Non-Fatal Crashes 5,657,000 6,263,834

Number of Registered Vehicles 269,294,000 281,312,446

Licensed Drivers 212,160,000 218,084,465

Vehicle Miles Travelled 2,988,000,000,000 3,095,373,000,000

Fatal Crash Rate in FITs 2,3 250 – 500 283 - 566

Non-Fatal Crash Rate in FITs 2,3 46K – 92K 51K – 102K

What is an appropriate target ?

Google Non-Fatal Crash FIT Rate = 150K

1Source: Traffic Safety Facts 2013/2015, NHTSA document reference DOT HS 8123842 Derived from NHTSA data on driver related fatal crashes3Assumes an average speed of 50MPH

5

TARGET FAILURE RATES

Description Statistics

Acceptable risk (no further improvement required) 1:1,000,0001

US population (2015) >321,000,000

Traffic deaths 35,092

“Acceptable” deaths as per guidelines 321

Required improvement x100

1 Derived following data from UK health and safety executive publications

Wide variety of targets in industry Target risk reduction of 2x to 100x compared to human driver

6

SAFETY AND AUTONOMOUS VEHICLES

Safety during intended operationSafety of the intended function

(SOTIF ISO/PAS 21448 in development)

Safety in presence of a faultFunctional Safety ISO-26262

Algorithms Software Hardware

7

FUNCTIONAL SAFETY BASICS

8

DEFINITION PER STANDARDS

“Absence of unreasonable risk due to hazards caused by malfunctioning behavior of electrical/electronic systems” – ISO 26262-1:2011; 1.51

“Part of overall safety relating to the equipment under control and the equipment under control, control system that depends on the correct functioning of the electrical/electronic/programmable electronic safety-related systems and other risk reduction measures” – IEC 61508-4:2010; 3.1.12

9

CLASSIC EXAMPLE

• Consider a motor winding which may overheat and cause a hazard.

• Reliability engineering approach might design the winding to be more resilient to over-temperature conditions

• Functional safety engineering approach might add a temp sensor to detect the over-temperature condition and switch off the motor

IEC 61508-0:2005; 3.1

https://upload.wikimedia.org/wikipedia/commons/0/0f/Stator_Winding_of_a_BLDC_Motor.jpg

10

ACHIEVING FUNCTIONAL SAFETY

Systematic and random faults must be considered

Systematic faults mitigated by:

Following compliant process at all stages of development

Monitoring of the complete product lifecycle

Random faults are mitigated by:

Failure mode analysis to understand the fault behavior of the system

Application of diagnostic measures to detect the failure modes

Transition to the safe state on failure mode detection

11

FAIL SAFE

Undetected Failures

Good

State

Safe

State

Detected Failures

m – mission , b- backup, (x), m or b is in repair mode.

Failed

State

12

FAIL OPERATIONAL

Undetected Failures

Good

State

Failed

State

m – mission , b- backup, (x), m or b is in repair mode.

For full autonomy the initial “safe state” can be a transition to the backup system

Backup

Detected Failures

Repair

Final

safe

state

Detected Failures

Undetected Failures

13

FAULT CLASSIFICATIONSISO 26262-10; B.1

All Faults λ

Non-Safety Related

Element λNSR

Safe λS

Safety Related Element λSR

Safe λS

Single Point λSPF

Residual λRF

Multi-Point Latent λMPF, L

Multi-Point Detected λMPF, D

Multi-Point Perceived λMPF, P

14

SINGLE POINT FAULT METRIC (SPFM)

Shows the percentage of overall single point faults which are:

Safety related AND

Safe OR dangerous but detected

λs - safe fault failure rate, can also be expressed as a % (Fsafe) the ration of overall possible faults which are safe.

15

LATENT FAULT METRIC (LFM)

Shows the percentage of overall multiple point faults which are:

Safety related AND

Safe OR dangerous but detected OR dangerous but perceived

Customarily limited to scenarios considering 2 point independent faults

Primary consideration is fault in mission logic combined with fault in safety mechanism

16

ARCHITECTURAL METRIC TARGETS

ASIL A ASIL B ASIL C ASIL D

SPFM N/A >=90% >=97% >=99%

LFM N/A >=60% >=80% >=90%

All targets are recommendations. Developers can set their own targets based on appropriate argumentation.

17

PROBABILISTIC METRICSProbabilistic Metric for (Random) Hardware Failure (PMHF)

Examines the residual probability of violation of safety goal after application of diagnostics, in a given time of operation.

Some pushback in market due to inconsistency between methods used by different vendors.

ISO 26262-10:2011; 8.3.3

NOTE: Multiple versions of equation possible depending on conditional probability of failures. Simplest form shown

18

PMHF TARGETS

ASIL A ASIL B ASIL C ASIL D

PMHF N/A 100 FIT 100 FIT 10 FIT

All targets are recommendations. Developers can set their own targets based on appropriate argumentation.

19

RELEVANCE TO GPU

20

TRADITIONAL CV

EXAMPLES OF SAFETY CRITICAL OPERATION ON GPU

Normalize gamma and color

Compute gradients

Weighted voting

Contrast and normalize

Collect HOGS

Traditional Classification: (pattern and template matching)

MACHINE LEARNING*

CNN (Convolutional Neural network)

MLP (Multi-layer perceptron)

SVM (Support vector machine)

*Focus is inferencing, training handled analogously to validation and calibration of a traditional safety related algorithm.

21

GPU MEASUREMENT METHODOLOGIES

Silicon-based fault Injection

Design

Simulation

Fault

Injection

Beamtesting

C-models/

RAM Liveness

Much of the measurement is done on representative kernels asthe final applications are not available at design time

Architectural

Safeness,

Diagnostic

Coverage

(SPFM,LFM),

SRAM

“Liveness”

RepresentativeWorkloads

FurtherSafety Analysis

22

MEASURING SAFE FAULTS IN RAMS“LIVENESS”

RAMs are sensitive to particle radiation (4x larger failure rate per bit than flops)

RAM contents may not be sensitive to faults (pixels)

RAM contents may be very sensitive to faults (instructions)

An important indicator is RAM Liveness

0

20

40

60

80

100

120

k27

37

6_

rfd

pk2

74

22

_rf

dp

spar

se_

rfd

pcu

dn

n3

_rfd

pk2

73

82

_L2

cud

nn

1_L

2H

OG

_L2

win

ogr

ad_

L2k2

73

98

_L1

har

ris_

L1cu

dn

n2

_L1

k27

37

6_

icc

k27

42

2_

icc

spar

se_

icc

cud

nn

3_i

cck2

73

82

_if

bcu

dn

n1

_ifb

HO

G_i

fbwinograd

_i…

k27

39

8_

gcc

har

ris_

gcc

cud

nn

2_g

cck2

73

76

_ta

ilk2

74

22

_ta

ilsp

arse

_ta

ilcu

dn

n3

_tai

l

time

W1

W2

W4

W3

tr1

tr2

tr4 tr4’

Texe

write

read

tr3 = 0

The occupancy can be computed: (tr1 + tr2 + tr3 + tr4 + tr4’) / 4 x Texe .

Occ

up

ancy

%

Majority of RAMS in this GPUless than 10% occupancyFsafe > 90%

23

TESTING REPRESENTATIVE KERNELS

Parameter measurement is very sensitive to kernel definition

Traditional CV has a wide diversity of operations

Difficult to define representative kernels

Machine learning has a smaller set of repeated operations

Enabling a more complete definition of kernels for measurements

More accurate and reliable measurements

24

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

0

200

400

600

800

1000

1200

0 1 2 3 4 5 6 62 63 64 65 66

Rat

io

#Ru

ns

Kernel ID sorted by Launch order

Output 1000 Class, GoogLeNet, total_run=5000

FAIL Counts Residual fault ratio Diagnosed fault ratio

DEEP LEARNING APPLICATION SAFENESS

GIE GoogLeNet

67 kernels in GoogLeNetinference

Faults in latter kernels have a higher possibility to cause errors

#FAIL Counts represents the proportion of faults for which the application predicted the wrong final class

Weighted average safeness is >99 %

26

SAFETY SUPPORT IN NVIDIA GPUS

27

SYSTEMATIC DEVELOPMENT OF GPUHARDWARE

Selected GPU cores targeted for automotive usage are developed with a process for ISO 26262 compliance

28NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

LAYERED SAFETY MECHANISMS

HW plausibility checks enabling multiple execution checks throughout the GPU,

Protection of large safety related memories,

Dependent failure mitigation; mainly caches and shared structures,

Parity/ECC protection of key

structures

HW machine checks

Redundant execution

29

FLEXIBLE REDUNDANCY MODEL

GPU Context

Channel 0

Channel 1

Channel N

Work Distribution

SM0

SM1

SM2

SM3

MemoryAccess

GPU

Flexible Execution modelBuilt-in HW and SW diagnostics

Machine Checks

Parity /ECC

Common cause

failure

mitigation

30

FLEXIBLE REDUNDANCY MODEL

GPU Context

Channel 0

Channel 1

Channel N

Work Distribution

SM0

SM1

SM2

SM3

MemoryAccess

GPU

Flexible Execution modelBuilt-in HW and SW diagnostics

Machine Checks

Parity /ECC

Common cause

failure

mitigation

31

SYSTEMATIC CONSIDERATIONS

Software in the runtime is under development for ISO 26262 compliance

Software used in development (training) considered as off-line tools per ISO 26262

Software and tools

TensorRT

32

GPU FAULT MITIGATION

33

CONCLUSIONS

Nvidia is developing selected GPUs for compliance to ISO 26262

Nvidia has multiple unique capabilities to analyze safety-related performance of GPUs

Analysis to date indicates DNNs have a high degree of internal redundancy that results in high ratio of safe faults

Selected GPUs are being built with additional hardware and software diagnostic mechanisms

Nvidia is developing software and tools needed to support safety related development

functional safety and the gpu - nvidia€¦ · description 2013 statistics 2015 statistics fatal...

Documents