
1

Design Validation and Debugging

Tim Cheng

Department of Electrical & Computer Engineering

UC Santa Barbara

VLSI Design and Education Center (VDEC)

Univ. of Tokyo

2

Harder to Design Robust and Reliable Chips

• First-silicon success rate has been dropping

– ~30% for complex ASIC/[email protected] (according to an ASIC vendor)

– Pre-silicon logic bugs have been increasing at 3X-4X per generation for Intel’s processors

• Yield has been dropping for volume production and takes longer to ramp up

– IBM’s 8-core Cell-Processor chips: ~10-20% yield (July 2006)

• “Better than worst-case” design resulting in failures w/o defects

– Increase in variation of process parameters with scaling

– Worst-case design getting way too conservative

3

In-Field Failures are Common and Costly

• Xbox: 16.4% failure rate

• Additional warranty and refund will cost Microsoft $1.15B ($86 per $300 item)

• More than financial cost: reputation and market loss

• Non-trivial failure rate

– 15% on average

http://arstechnica.com/news.ars/post/20080214-xbox-360-failure-rates-worse-than-most-consumer-electornics.html

4

Design for Robustness and Reliability

• Systems must be designed to cope with failures

• Efficient silicon debug is becoming a must

– Need efficient design validation and debugging methodology

– Design for debugging would become necessary

• Must have embedded self-test for error detection

– For both testing in manufacturing line and in-field testing

– Both on-line and off-line testing

• Re-configurability and adaptability for error recovery make better sense

– Using spares to replace defective parts

– Using redundancy to mask errors

– Using tuning to compensate variations

5

Outline

• Post-Silicon Validation and Debug

• SMT-Based RTL Error Diagnosis [ITC 2008]

• SAT-Based Diagnostic Test Generation [ATS 2007]

6

Bugs in Silicon

• Manufacturing defects

– Discovered during manufacturing test (<<1M DPM)

• Functional bugs (AKA logic bugs)

– Exist in all components

– ~98% found before tape out, ~2% post-silicon*

• Circuit bugs (AKA electrical bugs)

– Not all components exhibit failures

– Fails in some operating region (voltage, temperature, or frequency)

– Usually caused by design margin errors, IR drop, crosstalk coupling, L di/dt noise, process variation, …

– ~50% found before tape out, ~50% post-silicon*

* Source: Intel

7

Validation Domain Characteristics

• Pre-silicon validation

– Cycle accurate simulation

– F_SIM << F_PROD: cycle poor

– Any signal visible (i.e. white box): debugging is straightforward

– Limited platform level interaction

• Post-silicon validation

– Tests run at F_PROD: cycle rich

– Component tested in platform configuration

– Only package pins visible: difficult debug

8

Post-Si History And Trends

• Functional bugs relatively constant

– Correlate well to design complexity (amount of new and changed RTL)

– Late specification changes are contributors

• Circuit and analog bugs growing over time

– I/O circuit complexity increasing sharply

– Speedpaths (limiting FMAX of component) dominate CPU core circuit issues

9

Post-Si Debug Challenges

• Trend is toward lower observability

– Integration increasing towards SoC

• Functional and circuit issues require different solutions

• On average circuit bugs take 3x as much time to root cause vs. functional bugs

– Bugs found on platforms, but are debugged on debug-enabled automatic test equipment (ATE)

– Often need multiple iterations to reproduce on the tester

– Often long latency between a circuit issue and its syndrome

10

Pre-Si Verification vs. Post-Si Debugging

[Figure: the design flow from Specification through RTL Description and Logic Netlist to Physical Design. Pre-silicon functional debugging is labeled "Insert Corrections"; silicon debugging & fault diagnosis is labeled "Insert Faults/Errors".]

11

Automated Debugging/Diagnosis

A failed verification/test step is followed by debugging/diagnosis:

[Figure: a testbench or test vectors, together with the design or silicon, feed a verification or testing step. On PASS the flow ends; on FAIL, the counterexamples/diagnostic patterns are passed to automated debugging/diagnosis.]

12

Leveraging Pre-Si Verification & Manufacturing Test Efforts for Post-Si Validation

[Figure: the design flow (Specification, RTL Description, Logic Netlist, Physical Design) annotated with pre-silicon verification, post-silicon validation, and manufacturing test. Labels in the figure: "White Box"; "Black Box" (twice); "Models at very low level of abstraction"; "Lack of error propagation analysis/metrics".]





14

SAT-Based Diagnosis

Inputs: erroneous design, failing tests

• Replicate circuit for each test

• Add additional circuitry into circuit model

• Add input/output constraints

SAT assignment(s) → Fault location(s)!

15

SAT-Based Diagnosis - Example

• Stuck-at-1 fault on line l1

• Input vector v=(0, 0, 1) detects 1/0 at y

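[Figure: example circuit with inputs x1, x2, x3, internal line l1, and output y, annotated with fault-free/faulty signal values for the test vector]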

Courtesy: A. Veneris

16

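SAT-Based Diagnosis – Example (Cont’d)

1. Insert a MUX at each error candidate location

[Figure: the example circuit (inputs x1, x2, x3; output y) with a MUX inserted on line l1, controlled by select s1 and fed by free input w1]

2. Apply input/output vector constraints

[Figure: the same MUX-inserted circuit with the test vector values constrained at x1, x2, x3 and the response constrained at y]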


17

SAT-Based Diagnosis – Multiple Diagnostic Tests

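[Figure: the MUX-inserted circuit replicated once per diagnostic test (three copies). Copy k has its own inputs x1^k, x2^k, x3^k, line l1^k, free input w1^k, and output y^k, each constrained by its test; the select line s1 is shared by all copies.]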
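The multiple-test model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the netlist (l1 = x1 AND x2, y = l1 OR x3) is hypothetical and is not the circuit in the figures, the function names are invented for this sketch, and exhaustive enumeration of the free MUX input stands in for a SAT solver. The select is shared by all tests, while each test may pick its own free value w.

```python
# Hypothetical fault-free netlist (an assumption, not the slide's circuit):
#   l1 = x1 AND x2 ;  y = l1 OR x3
LINES = ("x1", "x2", "x3", "l1", "y")


def model(inputs, sel=None, w=0):
    """MUX-inserted model: the line named by sel is driven by the free
    value w; every other line keeps its functional value."""
    def mux(name, value):
        return w if name == sel else value

    x1 = mux("x1", inputs[0])
    x2 = mux("x2", inputs[1])
    x3 = mux("x3", inputs[2])
    l1 = mux("l1", x1 & x2)
    return mux("y", l1 | x3)


def candidates(failing_tests):
    """Return the lines whose MUX can explain every (vector, observed
    response) pair. The select is shared across tests; the free value w
    is chosen independently for each test (one circuit copy per test)."""
    result = []
    for line in LINES:
        if all(any(model(v, line, w) == observed for w in (0, 1))
               for v, observed in failing_tests):
            result.append(line)
    return result


# Responses observed on a chip with l1 stuck-at-1 (y is always 1).
one_test = [((0, 1, 0), 1)]
two_tests = one_test + [((0, 0, 0), 1)]
print(candidates(one_test))   # ['x1', 'x3', 'l1', 'y']
print(candidates(two_tests))  # ['x3', 'l1', 'y']
```

Adding the second failing test removes x1 from the candidate list. The three remaining locations cannot be separated by any test on this netlist, because a stuck-at-1 at any of them forces y to 1.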


18

RTL Design Error Diagnosis

• Using Boolean SAT-Solvers for RTL design error diagnosis is not efficient

– The translation to Boolean is expensive

– High level information is discarded

Propose an SMT-based, automated method for RTL design error diagnosis

19

Satisfiability Modulo Theory (SMT) Solvers

• Targets combined decision procedures (CDP)

• Integrates the Boolean-level approach with higher-level decision procedures, such as ILP

• SHIVA-UIF: an SMT solver developed for RTL circuits, combining

– Boolean theory

– Bit-vector theory

– Equality theory

• This combination makes it a good candidate as the satisfiability engine for hardware designs

20

RTL Design Error Diagnosis Utilizing SHIVA-UIF

• Extend the main idea of the Boolean-SAT-based diagnosis approach to word level

– MUXs are added to word-level signals

[Flowchart: inputs are the failing patterns and the error candidates. Add MUXs to design → impose test as constraints → run SMT. On SAT: add identified candidate to possible candidate list, add constraints to avoid same solution, and solve again. On UNSAT: remove remaining candidates, which yields the reduced candidate list.]
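The solve, record, and block loop in the flow above can be sketched as follows. Everything here is an assumption made for illustration: the word-level design (L = X1 + X2, Y = (L == X3), plus a second output Z = X1 XOR X3 added so that the test can rule out some candidates), the 3-bit width, and the function names. The `solve` function enumerates values instead of calling an SMT solver such as SHIVA-UIF.

```python
WIDTH = 3                      # assumed word width
MASK = (1 << WIDTH) - 1
SIGNALS = ("X1", "X2", "X3", "L", "Y")


def muxed_design(x1, x2, x3, sel, w):
    """Hypothetical word-level design with a MUX on every candidate
    signal: the signal named by sel is replaced by the free word w."""
    def mux(name, value):
        return w if name == sel else value

    x1, x2, x3 = mux("X1", x1), mux("X2", x2), mux("X3", x3)
    l = mux("L", (x1 + x2) & MASK)
    y = mux("Y", int(l == x3))
    z = x1 ^ x3                # extra observable output (assumption)
    return y, z


def solve(tests, blocked):
    """Stand-in for one solver call: return a select that satisfies all
    test constraints and is not blocked, or None for UNSAT."""
    for sel in SIGNALS:
        if sel in blocked:
            continue
        if all(any(muxed_design(*inputs, sel, w) == expected
                   for w in range(MASK + 1))
               for inputs, expected in tests):
            return sel
    return None


def diagnose(tests):
    """Collect candidates until the constraints become UNSAT."""
    found, blocked = [], set()
    while True:
        sel = solve(tests, blocked)
        if sel is None:        # UNSAT: every remaining candidate is removed
            return found
        found.append(sel)      # SAT: record the identified candidate
        blocked.add(sel)       # constraint that avoids the same solution


# One failing test: inputs (X1, X2, X3) and the expected (Y, Z).
failing_tests = [((3, 3, 5), (1, 6))]
print(diagnose(failing_tests))  # ['X2', 'L', 'Y']
```

X1 and X3 drop out because any value that fixes Y would disturb the Z output, which the test expects to stay at 6.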

21

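Initialization Steps

• Simple effect-cause analysis used to limit the potential candidates

• A MUX is inserted at each potential erroneous signal

[Figure: word-level example in which X1 and X2 feed an adder (+) producing L, and L is compared (=) with X3 to produce Y; a MUX with select S and free input W is inserted on L]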

22

Could Directly Modify HDL Code (at Potential Erroneous Statements)

// Original full adder
module full_adder_imp (a1, a2, c_in, s, c_out);
  input a1, a2, c_in;
  output s, c_out;
  wire temp;
  assign s = a1 ^ a2 ^ c_in;
  assign temp = (a1 & a2) | (a1 & c_in);
  assign c_out = temp | (a2 & c_in);
endmodule

// MUX-inserted version: every assigned signal gets a select input (s1..s3)
// and a free input (free1..free3). With sN = 1 the original expression
// drives the signal; with sN = 0 the free input replaces it.
module full_adder_muxed (a1, a2, free1, free2,
                         free3, s1, s2, s3, c_in, s, c_out);
  input a1, a2, c_in;
  input free1, free2, free3;
  input s1, s2, s3;
  output s, c_out;
  wire temp, temp_mux, s_mux, c_out_mux;
  assign s_mux = a1 ^ a2 ^ c_in;
  assign s = s1 ? s_mux : free1;
  assign temp_mux = (a1 & a2) | (a1 & c_in);
  assign temp = s2 ? temp_mux : free2;
  assign c_out_mux = temp | (a2 & c_in);
  assign c_out = s3 ? c_out_mux : free3;
endmodule

23

Inserting Constraints w.r.t. Failing Test and Expected Response

• Add constraints corresponding to a failing test and its expected response to the MUX-inserted circuit/code

[Figure: the MUX-inserted word-level example annotated with the test values 3, 3, and 5 and the expected response 1]

Constraint: ( ( S ? W : (3+3) ) = 5 )

SAT with S = 1, W = 5
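As a quick check of the constraint above, the following sketch enumerates every assignment to S and W and keeps the satisfying ones. The 4-bit word width and the function name are assumptions, and plain enumeration stands in for the SMT solver.

```python
WIDTH = 4  # assumed word width; the slide does not state one


def constraint(s, w, x1=3, x2=3, x3=5):
    """( (S ? W : (X1 + X2)) = X3 ) for the test values on the slide."""
    adder_out = (x1 + x2) % (1 << WIDTH)
    mux_out = w if s else adder_out
    return mux_out == x3


solutions = [(s, w)
             for s in (0, 1)
             for w in range(1 << WIDTH)
             if constraint(s, w)]
print(solutions)  # [(1, 5)]
```

With S = 0 the adder output 6 can never equal 5, so the only satisfying assignment selects the free input and sets it to 5.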

24

Experimental Results

• 11 example circuits (IWLS 2005 benchmarks)

• An error is randomly injected in each circuit

Design   No. of word-level elements   No. of patterns   No. of initial candidates*   No. of final candidates
B03      108                          4                 72                           6
B04      108                          5                 72                           9
B05      9700                         28                12949                        5
C5       115                          20                211                          13
C10      230                          18                561                          9
C12      420                          12                579                          100
C15      345                          13                911                          25
C16      540                          9                 595                          7
C17      720                          8                 815                          28
C18      1800                         28                2135                         10
C30      2910                         26                3499                         87

* after applying simple effect-cause analysis

25

Experimental Results

• 4 sample circuits, each with 1000 random errors

• Average/Max/Minimum number of remaining candidates

[Figure: four plots, one each for circuits C5, B03, C15, and C10. x-axis: number of failing patterns imposed (1 to 19); y-axis: number of remaining candidates]

26

Experimental Results – Effect of Applying More Failing Tests

• Average of 4 sample circuits, each with 1000 random errors

Range of failing test indexes   Erroneous ckt instances with # of candidates reduced (out of 1000)   Average reduction in size of candidate list
5 to 200                        588                                                                  1.74%
10 to 200                       418                                                                  1.16%
20 to 200                       318                                                                  0.97%
50 to 200                       177                                                                  0.73%
100 to 200                      102                                                                  0.62%

27

Disadvantage of Model-Free Diagnosis

• Some errors are indistinguishable from each other

• Example: L is the real error location but the solver can find satisfying values for all initial error candidates

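[Figure: the golden model (X1 and X2 added to give L, which is compared with X3 to give Y) shown next to the erroneous design, in which the adder is replaced by a subtractor (−). MUXes with selects S1 to S5 and free inputs W1 to W5 are inserted at the candidate signals of the design.]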

28

Advantages of SMT-Based RTL Design Error Diagnosis

• The learned information can be reused

• Candidates are identified in order from easy to difficult, implicitly by the solver

– The solver tends to set the MUXs of easy-to-diagnose candidates first, and

– By the time difficult candidates are checked, the accumulated learned clauses help reduce complexity

• Running All-SAT for this model results in:

– Eliminating a group of candidates without explicitly targeting them one at a time





30

Diagnostic Test Pattern Generation (DTPG)

• Generates tests that distinguish fault types or locations

• One of the most computationally intensive problems

• Most existing methods are based on modified conventional ATPG or Sequential ATPG

• Very complex and tedious implementation


31

Traditional SAT-based DTPG

• Use a miter-like model to transform DTPG into a SAT problem

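[Figure: a miter built from two faulty copies of the circuit, one with fault f1 injected and one with fault f2, sharing the primary inputs (PI); their primary outputs (PO) are compared to form the miter output M, and the objective is M = 1]

• SAT: the two faults are distinguishable

• UNSAT: the two faults are indistinguishable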
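A minimal sketch of the miter idea, assuming a hypothetical netlist (l1 = x1 AND x2, y = l1 OR x3) and stuck-at faults. Exhaustive search over the primary inputs replaces the SAT solver, and the function names are invented for this example.

```python
from itertools import product


def circuit(x1, x2, x3, fault=None):
    """Hypothetical netlist: l1 = x1 AND x2, y = l1 OR x3.
    fault is (line_name, stuck_value) or None for the fault-free circuit."""
    def inject(name, value):
        return fault[1] if fault and fault[0] == name else value

    x1, x2, x3 = inject("x1", x1), inject("x2", x2), inject("x3", x3)
    l1 = inject("l1", x1 & x2)
    return inject("y", l1 | x3)


def distinguishing_test(f1, f2):
    """Miter of two faulty copies: M = 1 when their outputs differ.
    Returns a PI vector with M = 1, or None when no such vector exists."""
    for pi in product((0, 1), repeat=3):
        if circuit(*pi, fault=f1) != circuit(*pi, fault=f2):
            return pi
    return None


print(distinguishing_test(("x1", 0), ("x3", 1)))  # (0, 0, 0)
print(distinguishing_test(("l1", 1), ("y", 1)))   # None
```

The first pair is distinguishable (the "SAT" case). The second pair is equivalent on this netlist, which corresponds to the "UNSAT" case. Note that a separate search is needed for every fault pair.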

32

SAT-based DTPG

• Limitations:

– Need to build a miter circuit for each fault pair

– Cannot share learned information for different fault pairs

Objectives: Reduce the number of miter circuits and the computational cost of each DTPG run by using learned information from previous runs

33

DTPG Model for Injecting Multiple Fault Pairs

• Inject the same set of N = 2^n to-be-differentiated faults into each of the two circuits in the miter

• Add an n-to-2^n decoder in each circuit to activate exactly one fault at a time

• The inputs to the two decoders, PI1 and PI2, are extra primary inputs

• Solve objective M = 1

[Figure: a miter of two faulty copies sharing PI. In each copy an n-to-2^n decoder, driven by PI1 (respectively PI2), generates the select lines sel1 … selN (sel'1 … sel'N) that enable the injected faults; the POs are compared to form M.]

Example: with decoder inputs 001 and 110 and PI = Vi, Vi differentiates f1 and f6

34

DTPG Procedure Using Proposed Model

• For a SAT solution, the values assigned at PI1 and PI2 represent the indices of the activated fault pair; the values assigned at PI form a diagnostic test

• After a diagnostic test for fault pair fi and fj is found, add a blocking clause so that a test for the same pair is not generated again

• After UNSAT, all remaining fault pairs are indistinguishable

[Flowchart: starting from the list of fault candidates, build the DTPG model and simplify the circuit, then solve M = 1. On SAT: diagnostic pattern found; add SAT constraint and solve again. On UNSAT: end.]
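The procedure can be sketched with a single model that holds all faults, where an index into the fault list plays the role of each decoder input. The netlist, the fault list, and the function names are hypothetical, and enumeration replaces the SAT solver (so no learned clauses are modeled here).

```python
from itertools import combinations, product

# Hypothetical fault list (N = 4 = 2**2) on a hypothetical netlist:
#   l1 = x1 AND x2 ;  y = l1 OR x3
FAULTS = [("x1", 0), ("x3", 1), ("l1", 1), ("y", 1)]
PAIRS = list(combinations(range(len(FAULTS)), 2))


def faulty_circuit(pi, index):
    """One faulty copy; index stands for the decoder input that
    activates exactly one of the injected stuck-at faults."""
    line, stuck = FAULTS[index]

    def inject(name, value):
        return stuck if name == line else value

    x1, x2, x3 = (inject(n, v) for n, v in zip(("x1", "x2", "x3"), pi))
    l1 = inject("l1", x1 & x2)
    return inject("y", l1 | x3)


def solve(blocked):
    """Stand-in for solving M = 1: find a PI vector and an unblocked
    fault pair whose outputs differ, or return None (UNSAT)."""
    for pi in product((0, 1), repeat=3):
        for i, j in PAIRS:
            if (i, j) not in blocked and \
                    faulty_circuit(pi, i) != faulty_circuit(pi, j):
                return pi, (i, j)
    return None


def dtpg():
    blocked, tests = set(), {}
    while True:
        solution = solve(blocked)
        if solution is None:   # UNSAT: remaining pairs are indistinguishable
            break
        pi, pair = solution
        tests[pair] = pi       # diagnostic pattern found
        blocked.add(pair)      # blocking clause for this pair
    return tests, [p for p in PAIRS if p not in blocked]


tests, indistinguishable = dtpg()
print(tests)              # {(0, 1): (0, 0, 0), (0, 2): (0, 0, 0), (0, 3): (0, 0, 0)}
print(indistinguishable)  # [(1, 2), (1, 3), (2, 3)]
```

In this small example one pattern serves three fault pairs, and the three pairs left after the final UNSAT are reported as indistinguishable without being targeted individually.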

35

Main Advantages of the DTPG Model

• The learned information can be reused

• Order of target fault pair selection is automatically determined by SAT solving

– Easy-to-distinguish fault pairs would be implicitly targeted first

• Running All-SAT for this miter model could:

– Find diagnostic patterns for all pairs of faults

– Naturally perform diagnostic pattern compaction

• Identify a group of indistinguishable fault pairs without explicitly targeting them one at a time

36

Finding More Compact Diagnostic Tests

[Figure: the multiple-fault-pair miter, with an n-to-2^n decoder in each faulty copy driven by PI1 and PI2 respectively]

• Decoder inputs 000 and 011 with PI = Vi: Vi differentiates f0 and f3

• Decoder inputs 0x0 and 11x with PI = Vj: Vj differentiates {f0, f2} and {f6, f7}
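The fault groups follow directly from the don't-care positions in the decoder inputs. A small sketch, with function names invented for illustration, expands each cube into the fault indices it covers and lists the fault pairs that one pattern differentiates:

```python
from itertools import product


def expand(cube):
    """Expand a decoder-input cube such as '0x0' into the sorted list of
    fault indices it covers ('x' marks a don't-care bit)."""
    choices = [("0", "1") if c == "x" else (c,) for c in cube]
    return sorted(int("".join(bits), 2) for bits in product(*choices))


def differentiated_pairs(cube1, cube2):
    """All fault pairs separated by a pattern whose two decoder inputs
    are cube1 and cube2."""
    return [(i, j) for i in expand(cube1) for j in expand(cube2)]


print(expand("0x0"))                       # [0, 2]
print(expand("11x"))                       # [6, 7]
print(differentiated_pairs("0x0", "11x"))  # [(0, 6), (0, 7), (2, 6), (2, 7)]
```

A single pattern with one don't care on each side therefore covers four fault pairs instead of one.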

37

DTPG with Compaction Heuristic

• Solve objective M = 1 using SAT solver

• Use existing patterns to guide the SAT solving

• Find don’t cares at PI1 and PI2 in the newly generated pattern, so that the corresponding pattern differentiates two groups of faults

38

DTPG for Multiple Faults

• Need m n-to-2^n decoders in each faulty circuit (m is the cardinality of multiple faults)

• One output from each decoder is connected to an m-input OR gate

• Can inject m or fewer faults

• Combine existing methods before using the proposed DTPG model

[Figure: m n-to-2^n decoders, each with n inputs; corresponding decoder outputs are ORed to form the fault select signals Seli, Selj, …]
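The decoder and OR-gate arrangement can be modeled as below. This is a behavioral sketch under assumptions: the function name and argument layout are hypothetical, and each decoder is represented simply by the integer value applied to its n inputs.

```python
def fault_enables(decoder_inputs, n):
    """decoder_inputs: the values applied to the m n-to-2^n decoders.
    Select k is the OR of output k of every decoder, so fault k is
    enabled when at least one decoder selects it."""
    return [int(any(value == k for value in decoder_inputs))
            for k in range(2 ** n)]


# m = 2 decoders, n = 3: two distinct faults are injected.
print(fault_enables([1, 6], 3))  # [0, 1, 0, 0, 0, 0, 1, 0]
# Both decoders select fault 4: fewer than m faults are injected.
print(fault_enables([4, 4], 3))  # [0, 0, 0, 0, 1, 0, 0, 0]
```

The second call shows why the model can inject m or fewer faults: when two decoders pick the same index, the OR gate collapses them into a single active fault.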

39

DTPG Results

Circuit   # Initial fault pairs   #D/#E/#A    # Diagnostic patterns   CPU (sec)
S5378     66                      63/3/0      13                      0.3
S13207    1225                    1198/27/0   28                      3.9
S15850    231                     204/27/0    7                       3.3
S35932    120                     106/14/0    7                       2.0
S38417    351                     351/0/0     8                       2.9
S38584    1225                    1205/20/0   33                      7.3

• Initial fault pairs: generated by a critical-path-tracing tool

• All fault pairs injected into one miter circuit

• #D: distinguishable, #E: equivalent, #A: aborted

40

Summary

• SMT-based RTL Design Error Diagnosis

– An enhanced model injecting single/multiple design errors

– Enable sharing of the learned information

– Identify false candidates without explicitly targeting them

• SAT-based DTPG

– Use an enhanced miter model injecting multiple faults

– Enable sharing of the learned information

– Identify undifferentiable faults efficiently

– Support diagnosis between mixed, multiple fault types

– Combine with diagnostic test pattern compaction
