class: combined logic architecture soft error sensitivity … · 2013-04-24 · chair for...

CLASS: Combined Logic Architecture Soft Error Sensitivity Analysis

Mojtaba Ebrahimi, Liang Chen, Hossein Asadi, and Mehdi B. Tahoori

Institute of Computer Engineering (ITEC), Chair for Dependable Nano Computing (CDNC)
KIT, University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu


Soft Errors

Soft errors are caused by radiation-induced particle strikes
  A strike releases electron-hole pairs
  The charge is absorbed by the source and drain
  The transistor state is altered
Soft errors at the logic level and higher:
  Bit-flips in flip-flops
  Transient pulses in combinational gates
System Soft Error Rate (SER) grows exponentially with Moore's law [Dixit11IRPS][Ibe10TED]
  SER is proportional to the number of transistors per system

[Figure: particle strike on a transistor device, with electron-hole pairs between source and drain]
[Figure: SER vs. design rule, normalized soft error rate (0 to 5) for design rules of 250, 180, 130, 90, 65, 45, 32, and 22 nm, adapted from Ibe10TED]

Motivation: Importance of SER Estimation

Not all soft errors cause system failures
  Circuit-level masking (logical, timing, and electrical)
  Architecture-level masking
  Application-level masking
Nodes are not equally susceptible to soft errors
  Find the most susceptible modules
  Apply selective protection schemes
  Balance reliability, cost, and performance
An SER estimation technique is needed
  Rank the nodes according to their SER

Motivation: What is missing?

Previous Soft Error Rate (SER) estimation techniques
  Fault injection (time-consuming)
  Analytical (limited accuracy)
    Architecture-level techniques: regular structures (register file, cache)
    Circuit-level techniques: irregular structures (controller unit, pipeline)
Microprocessors include both regular and irregular structures
  Independent analysis is not accurate
Proposed: CLASS (Combined Logic Architecture Soft error Sensitivity analysis)
  Combines a circuit-level technique with an architecture-level analysis
  Uses discrete-time Markov chains to handle different error propagation and masking scenarios

Outline

Preliminaries
CLASS
Experimental Results
Conclusions


Analytical Architecture-Level SER Estimation Techniques

ACE (Architecturally Correct Execution) analysis is the most common technique [Mukherjee03Micro]
  Also known as lifetime analysis
The lifetime of a bit is divided into:
  ACE
    e.g. the program counter is almost always in an ACE state
  un-ACE (unnecessary for ACE)
    e.g. the branch predictor is always in an un-ACE state
AVF (Architectural Vulnerability Factor)
  Fraction of time that a bit is in an ACE state
  Program counter ~ 100%, branch predictor = 0%
ACE analysis overestimates the failure rate
  In some cases by up to 7x [George10DSN], which leads to overdesign

Analytical Circuit-Level SER Estimation Techniques

Binary/Algebraic Decision Diagram (BDD/ADD)-based techniques [Zhang06ISQED] [Norman05TCAD] [Krishnaswamy05DATE] [Miskov08TCAD]
  Based on decision diagrams
  Highly accurate
  Scalability problem
Error Probability Propagation (EPP)-based techniques [Asadi06ISCAS] [Fazeli10DSN] [Chen12IOLTS]
  Based on propagation rules
  Reasonable accuracy
  Scalable
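To make "based on propagation rules" concrete, here is a minimal Python sketch of the simplest form such rules can take. It assumes a single erroneous input per gate and independent side inputs described only by their signal probability (the probability of being 1). The cited 4-value EPP tracks more states than this, and the function names are hypothetical.

```python
from math import prod

def epp_and(ep_in, side_sps):
    """AND gate: the error reaches the output only if every other input is 1."""
    return ep_in * prod(side_sps)

def epp_or(ep_in, side_sps):
    """OR gate: the error reaches the output only if every other input is 0."""
    return ep_in * prod(1.0 - sp for sp in side_sps)

def epp_not(ep_in):
    """Inverter: an erroneous input always gives an erroneous output."""
    return ep_in

# Hypothetical path: error site -> AND (side input SP = 0.8) -> NOT -> OR (side input SP = 0.5)
ep = epp_and(1.0, [0.8])
ep = epp_not(ep)
ep = epp_or(ep, [0.5])
print(ep)  # error probability at the end of the path
```

Each gate is visited once in topological order, which is why rule-based propagation scales better than building decision diagrams for the whole circuit.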

Shortcomings of Previous Techniques

Circuit-level techniques consider:
  [Figure: a circuit, next to what is seen by circuit-level techniques]
ACE cannot completely model the effect of irregular structures on error propagation from regular structures


CLASS: A Simple Example

[Figure: a simple circuit with primary inputs, two flip-flops, a memory, and primary outputs, annotated with the propagation probabilities 0.3, 0.2, 0.6, 0.1, and 0.4]

The error propagation model is built in three steps:
  Propagation from the error site
  Propagation from memory contents to its output
  Propagation from memory output (iterative)
    Saturates in a few iterations

[Figure: the resulting Discrete Time Markov Chain (DTMC). States: Error at Error Site, Latent in Memory, Error at Memory Outputs, Error at Output (Failure), Masked. Edge labels: 0.2, 0.3, and 0.5 leaving the error site; 0.6 (to Error at Memory Outputs) and 0.4 (to Masked) leaving Latent in Memory; 0.1, 0.4, and 0.5 leaving Error at Memory Outputs; 1]
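The iterative step can be sketched as repeated multiplication of a state distribution by the DTMC transition matrix. The Python sketch below uses the edge probabilities from the example figure. Which of 0.2/0.3 and 0.1/0.4 goes to Latent in Memory and which to Failure is an assumption read off the figure layout, and treating Failure and Masked as absorbing states is also an assumption.

```python
# States of the error propagation chain.
SITE, LATENT, MEM_OUT, FAILURE, MASKED = range(5)

# P[i][j] = probability of moving from state i to state j in one step.
# Edge assignment is an assumption (see the note above).
P = [
    [0.0, 0.2, 0.0, 0.3, 0.5],  # error at error site
    [0.0, 0.0, 0.6, 0.0, 0.4],  # latent in memory
    [0.0, 0.1, 0.0, 0.4, 0.5],  # error at memory outputs
    [0.0, 0.0, 0.0, 1.0, 0.0],  # error at output (failure), absorbing
    [0.0, 0.0, 0.0, 0.0, 1.0],  # masked, absorbing
]

def step(dist, matrix):
    """One DTMC step: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def failure_probability(matrix, start=SITE, tol=1e-12, max_iter=1000):
    """Iterate until the distribution stops changing; return (P(failure), iterations)."""
    dist = [0.0] * len(matrix)
    dist[start] = 1.0
    for iteration in range(1, max_iter + 1):
        new_dist = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(new_dist, dist)) < tol:
            return new_dist[FAILURE], iteration
        dist = new_dist
    return dist[FAILURE], max_iter

p_fail, iterations = failure_probability(P)
print(round(p_fail, 4), iterations)  # failure probability is about 0.3511
```

Under these assumptions the probability mass still circulating between the memory states shrinks by a factor of 0.6 x 0.1 = 0.06 per round trip, which is consistent with the remark that the iteration saturates quickly.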

Overview of CLASS

Phase I: System Decomposition
  Irregular structures (combinational and sequential elements)
  Regular structures (Memory 1, Memory 2, ..., Memory m)
Phase II: Independent Analysis
  Enhanced EPP analysis
  Memory ACE (MACE) analysis
Phase III: Joint Analysis
  DTMC

Important Error Propagations

Propagation from irregular structures to regular structures and vice versa:
  Short-term effect: an error in memory inputs affects its outputs immediately
  Long-term effect: an error in memory inputs contaminates its contents
  MVF (Memory Vulnerability Factor): probability that an error propagates from memory contents to its output

[Figure: an irregular structure and a memory between primary inputs and primary outputs, connected through memory inputs and memory outputs; numbered paths mark (1) the short-term effect, (2) the long-term effect, and (3) MVF]

Computation of Short- and Long-term Effects

Probability of a short-term and a long-term effect for each input port of the memory

[Figure: a simple memory]

Probability of error, by where an error at the given port ends up (SP(WEn) is the signal probability of the write-enable port):

Error at Port   None         Output (Short-term)   Contents (Long-term)   Both
Address         0            1-SP(WEn)             SP(WEn)                0
Data-in         1-SP(WEn)    0                     0                      SP(WEn)
WEn             0            0                     0                      1

Memory ACE (MACE)

Looking at memory boundaries
  Intervals leading to a read access are ACE
  MVF_Word: fraction of time that a word is in an ACE state
  MVF_Memory: average of MVF_Word over all words

[Figure: timeline of accesses to one word (Write, Read, Write, Read, Read, Write, Write), with the intervals between accesses marked ACE or un-ACE]
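The MACE rule can be sketched in a few lines of Python. The trace format (a list of `(time, kind)` pairs per word, with `kind` being `"W"` or `"R"`) and the function names are assumptions made for illustration. The sketch also assumes that the interval after the last access is un-ACE and that an interval ending in a read is ACE even when no write precedes it.

```python
def mvf_word(accesses, start, end):
    """Fraction of [start, end] during which the word is ACE.

    An interval between two consecutive accesses is ACE when it ends in a read.
    """
    ace_time = 0.0
    previous = start
    for time, kind in sorted(accesses):
        if kind == "R":
            ace_time += time - previous
        previous = time
    return ace_time / (end - start)

def mvf_memory(traces, start, end):
    """Average of the per-word MVF over all words of the memory."""
    return sum(mvf_word(trace, start, end) for trace in traces) / len(traces)

# Hypothetical trace following the access order in the figure, observed over 0..100.
word0 = [(10, "W"), (30, "R"), (40, "W"), (50, "R"), (70, "R"), (80, "W"), (90, "W")]
word1 = []  # never accessed, so never ACE
print(mvf_word(word0, 0, 100))             # 0.5
print(mvf_memory([word0, word1], 0, 100))  # 0.25
```

In `word0` the ACE intervals are 10..30, 40..50, and 50..70, which sum to 50 of 100 time units.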

Enhanced Circuit-level Technique

CLASS is compatible with different circuit-level techniques
  It inherits the characteristics of the employed circuit-level technique
Multi-cycle 4-value EPP [Fazeli10DSN] is employed
  Based on a topological traversal of the netlist
  O(n^2) complexity, inaccuracy within 2%
The circuit-level technique should be enhanced to:
  Model the short-term effect of errors in memories
  Compute the long-term effect probability (treated as an independent error)

[Figure: worked example with gates A and B, two flip-flops, and a memory with Address, Data-in, and WEn ports. An error with EP = 1 passes gate A (SP = 0.8) and reaches the memory with EP = 1 x 0.8 = 0.8. With SP(WEn) = 0.25, the short-term effect at the memory output is 0.8 x 0.75 = 0.6 and the long-term effect is 0.8 x 0.25 = 0.2. The memory output error passes gate B (SP = 0.5), giving EP = 0.6 x 0.5 = 0.3. The long-term contribution is MVF x EPP(memory output to primary output).]
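The arithmetic in the figure can be reproduced with the per-port table of short- and long-term effects. The Python sketch below is illustrative only: the function name and dictionary keys are hypothetical, and it assumes, as the figure values suggest, that the error enters the memory through the Address port.

```python
def memory_port_effects(port, sp_wen):
    """Split an error at a memory input port into its possible outcomes.

    Returns P(none), P(short-term only), P(long-term only), P(both),
    following the table for a simple memory; sp_wen is SP(WEn).
    """
    table = {
        "address": (0.0, 1.0 - sp_wen, sp_wen, 0.0),
        "data_in": (1.0 - sp_wen, 0.0, 0.0, sp_wen),
        "wen": (0.0, 0.0, 0.0, 1.0),
    }
    none, short_only, long_only, both = table[port]
    return {"none": none, "short_term": short_only, "long_term": long_only, "both": both}

ep_address = 1.0 * 0.8                       # error site through gate A (SP = 0.8)
effects = memory_port_effects("address", sp_wen=0.25)
ep_short = ep_address * (effects["short_term"] + effects["both"])  # reaches the memory output now
ep_long = ep_address * (effects["long_term"] + effects["both"])    # stays latent in the contents
ep_after_b = ep_short * 0.5                  # memory output through gate B (SP = 0.5)
print(round(ep_short, 3), round(ep_long, 3), round(ep_after_b, 3))  # 0.6 0.2 0.3
```

The long-term part (0.2 here) is not propagated further by EPP in the same pass. It is handed to the DTMC as the probability of entering the Latent in Memory state.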

Integration of MACE and EPP Using DTMC

[Figure: a simple system with primary inputs, logic, memory, and primary outputs]
[Figure: error propagation DTMC. States: Error at Error Site, Latent in Memory, Error at Memory Outputs, Error at Output (Failure), Masked. Edge labels: PEO, PELT, 1-PELT-PEO, MVF, 1-MVF, PMO, PMLT, 1-PMLT-PMO, 1. Annotations: "Steady state solution of DTMC", "Independent of error site"]

Symbol   Definition
PEO      P(Error site → Primary Outputs)
PELT     P(Error site → Latent in Memory)
PMO      P(Memory Output → Primary Outputs)
PMLT     P(Memory Output → Latent in Memory)
MVF      P(Latent in Memory → Memory Outputs)
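For a chain this small the steady-state failure probability can be written in closed form. The derivation below is a sketch that does not appear on the slides. The names f_E, f_L, and f_M are introduced here for the probability of eventually reaching Error at Output (Failure) from the error site, from Latent in Memory, and from Error at Memory Outputs, and Failure and Masked are assumed to be absorbing.

```latex
\begin{align}
f_M &= P_{MO} + P_{MLT}\, f_L \\
f_L &= \mathit{MVF}\, f_M
      \quad\Longrightarrow\quad
      f_M = \frac{P_{MO}}{1 - \mathit{MVF}\, P_{MLT}} \\
f_E &= P_{EO} + P_{ELT}\, f_L
     = P_{EO} + P_{ELT}\,\frac{\mathit{MVF}\, P_{MO}}{1 - \mathit{MVF}\, P_{MLT}}
\end{align}
```

The fraction involves only memory-side quantities (MVF, PMO, PMLT), which appears to be what the "Independent of error site" annotation refers to: only PEO and PELT change from one error site to the next. As a check, PEO = 0.3, PELT = 0.2, MVF = 0.6, PMO = 0.4, and PMLT = 0.1 give f_E = 0.3 + 0.2 x 0.24 / 0.94, which is about 0.351.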

Overall Flow

[Figure: flow chart built up over three steps]
Done by commercial tools:
  System Specification, HDL Description, Synthesis, Gate-level Netlist
  Application, Post-synthesis Simulation, VCD File
CLASS:
  VCD File Analysis, Signal Probabilities
  MACE Analysis, MVF
  EPP Analysis, EPPs
  DTMC Matrix, Solve Matrix for Each Cell, Vulnerability of Cells


Experimental Setup

OpenRISC 1200 processor [www.opencores.org]
  Synthesis results:
    44,480 gates
    4,960 flip-flops
    4 memory units
    Overall: 49,444 error sites
Benchmarks are selected from MiBench:
  QSort, CRC32, Stringsearch

Comparison with Other Techniques

EPP [Asadi06ISCAS][Fazeli10DSN]
ACE [Biswas05ISCA]
  Implemented for caches and their tags
Simulation-based Statistical Fault Injection (SFI)
  Faults are injected during post-synthesis simulation
  One fault per run of the entire application
  Each fault injection takes 87 seconds on average
  Only logic masking is considered
  Comparable with functional simulation
System configuration:
  Intel Xeon E5520
  16 GB RAM

Inaccuracy Analysis

Regular structures:
  OpenRISC regular structures:
    Instruction Cache (IC)
    Instruction Cache Tag (ICT)
    Data Cache (DC)
    Data Cache Tag (DCT)
CLASS
  Average inaccuracy: 3.4%
  Maximum inaccuracy: 9.2%
ACE
  Average inaccuracy: 11.7%
  Maximum inaccuracy: 23.3%
CLASS may overestimate or underestimate due to the inaccuracy of 4-value EPP

[Figure: failure probability of each regular structure as estimated by SFI, CLASS, and ACE]

Inaccuracy Analysis

Irregular structures:
  Three representative blocks of OpenRISC:
    ALU
    Instruction Cache FSM (IC FSM)
    Pipeline fetch unit
CLASS
  Average inaccuracy: 2.1%
  Maximum inaccuracy: 4.0%
EPP
  Average inaccuracy: 7.2%
  Maximum inaccuracy: 26.3%
The average inaccuracy of individual error sites is less than 7%

[Figure: failure probability of each block as estimated by SFI, CLASS, and EPP]

Runtime

An individual inaccuracy of 7% needs 196 faults at each error site
49,444 possible error sites
49,444 x 196 x 87 seconds: about 27 years of fault injection
CLASS is 5 orders of magnitude faster than simulation-based SFI
  Less than one hour
Runtime breakdown:
  VCD analysis: 85%
  EPP: 14%
  Solving DTMCs: 1%
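The runtime estimate can be reproduced with a few lines of Python. The figure of 196 faults is consistent with the standard sample-size formula n = z^2 p(1-p) / e^2 for a 7% margin at 95% confidence (z = 1.96) with worst-case p = 0.5. That this is how the authors obtained it is an assumption, and the function name is hypothetical.

```python
import math

def sfi_sample_size(margin, z=1.96, p=0.5):
    """Faults needed per error site for the given error margin (assumed formula)."""
    # The small epsilon guards against floating-point noise just above an integer.
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin) - 1e-9)

error_sites = 49_444
seconds_per_injection = 87

faults_per_site = sfi_sample_size(0.07)
total_seconds = error_sites * faults_per_site * seconds_per_injection
years = total_seconds / (365 * 24 * 3600)
speedup_vs_one_hour = total_seconds / 3600  # CLASS is reported to finish in under one hour

print(faults_per_site)                         # 196
print(total_seconds)                           # 843119088
print(round(years, 1))                         # 26.7
print(math.floor(math.log10(speedup_vs_one_hour)))  # 5
```

With the rounded average of 87 seconds per injection the product comes to roughly 26.7 years. Set against a one-hour CLASS run, that is a ratio of about 2.3 x 10^5, in line with the stated five orders of magnitude.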


Conclusion

CLASS: an SER estimation technique for entire systems
  Unlike existing techniques, which focus on either regular or irregular structures
  Interactions between regular and irregular structures are considered
Compared to SFI:
  Inaccuracy is less than 7%
  5 orders of magnitude faster
Multiple times more accurate than EPP (4x) and ACE (3x)


Thanks for your attention!

Q & A


Backup Slides

Sources of Inaccuracy

Sources of inaccuracy in the current implementation:
  Propagation of an error to several memory units
  Multiple propagations of an error to memory outputs

ACE Analysis: Instruction Cache

Access types: Read Miss (RM) and Read Hit (RH)
Un-ACE intervals: *RM
ACE intervals: *RH

How to Estimate the Soft Error Rate of an Embedded Processor?

ACE Analysis: Data Cache (Write-back)

Access types: Read Miss (RM), Read Hit (RH), Write Miss (WM), Write Hit (WH)
ACE intervals: *RH
Un-ACE intervals: *WH
Potentially ACE intervals: *WM, *RM
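The interval rules for the write-back data cache can be sketched as a lookup on the access that ends each interval. This Python sketch assumes that "*" stands for any preceding access, so only the ending access type matters. The trace format and names are hypothetical.

```python
# Label of an interval, chosen by the type of access that ends it.
DATA_CACHE_RULES = {
    "RH": "ACE",
    "WH": "un-ACE",
    "WM": "potentially ACE",
    "RM": "potentially ACE",
}

def classify_intervals(accesses, rules=DATA_CACHE_RULES):
    """Return (start, end, label) for each interval between consecutive accesses."""
    ordered = sorted(accesses)
    return [
        (t0, t1, rules[kind])
        for (t0, _), (t1, kind) in zip(ordered, ordered[1:])
    ]

# Hypothetical accesses to one cache line: (time, access type).
trace = [(0, "WM"), (10, "RH"), (25, "WH"), (40, "RM")]
for interval in classify_intervals(trace):
    print(interval)
# (0, 10, 'ACE')
# (10, 25, 'un-ACE')
# (25, 40, 'potentially ACE')
```

How a "potentially ACE" interval is finally resolved is not stated on the slide, so the sketch leaves it as a separate label.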

Enhanced EPP

CLASS has no limitation regarding circuit-level techniques
Multi-cycle 4-value EPP [Fazeli10DSN] is employed
  Only logic masking is considered
  Enhanced to calculate the probability of error propagation to memory inputs
    Short-term effect of errors in memories
    Long-term effects are treated as propagated independent errors