HPCA, Austin, Texas, February 13, 2006

BulletProof: A Defect-Tolerant CMP Switch Architecture


Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Bin Zhang†, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Michael Orshansky†

‡Advanced Computer Architecture Lab, University of Michigan
†Department of Electrical and Computer Engineering, University of Texas at Austin


Introduction
• Reliability is a critical aspect of any computer design
• System designers target very small failure rates
• Today, reliability targets are met with fault-avoidance design techniques
  – use of conservative design margins
• For future process technologies, conservative design margins alone will not be enough to avoid system failures
  – defect-tolerant design techniques are needed

[Figure: transistor reliability versus transistor lifetime (years), with one curve for "Now" and one for "Future"]


Reliable System Design Space

[Chart: design features (rows) against type of defect (columns: manufacturing defect, wear-out defect, transient error)]

NO-DETECTION
  – Manufacturing defect: untestable defects
  – Wear-out defect: system fails in an unpredictable way
  – Transient error: system glitch manifests in an unpredictable way

DETECTION
  – Manufacturing defect: testing
  – Wear-out defect: component terminates at first error
  – Transient error: component terminates; hard-reset restore

DETECTION + CORRECTION
  – Manufacturing defect: post-manufacturing recovery
  – Wear-out defect: online defect recovery
  – Transient error: transient fault recovery

DETECTION + CORRECTION + REPAIR
  – Manufacturing defect: post-manufacturing reconfiguration
  – Wear-out defect: online repair

Techniques placed on the chart: DMR, ECC (memory), cache-line swap-out, memory-array spares, TMR, Diva, Razor, BulletProof

Legend: Mainstream Solutions, High-end Solutions, Specialized Solutions, Research-stage Solutions

• Need for cost- and performance-efficient techniques that can provide high reliability in the presence of unreliable components – "BulletProof"


CMP Switch Architecture
• Goal: a defect-tolerant CMP switch design
• Baseline switch architecture is provided by Li-Shiuan Peh
• Implements the routing and flow-control functions required for transmitting packets in a 2D torus network
• Wormhole switch pipelined at the flit level (32-bit flits)
• Dimension-order routing
• Specified in Verilog and synthesized to a gate-level netlist
  – ~9K logic gates and 1,700 sequential elements

[Figure: switch block diagram with five input controllers (each containing input buffers, VC state, and routing logic), a switch arbiter, a cross-bar controller, and the cross-bar]
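
As an illustration of dimension-order routing on a 2D torus, the sketch below resolves the X coordinate first and then the Y coordinate, taking the shorter way around each ring. The function names, the tie-breaking rule, and the 4x4 network size are assumptions made for this example; the slides do not describe the switch's routing logic at this level of detail.

```python
def ring_step(cur, dst, size):
    """Return +1, -1, or 0: the direction of the shorter way around a ring."""
    if cur == dst:
        return 0
    forward = (dst - cur) % size  # hops needed when moving in the +1 direction
    # Assumption: ties are broken in favour of the +1 direction.
    return 1 if forward <= size - forward else -1


def dimension_order_route(src, dst, width, height):
    """List the (x, y) nodes visited from src to dst, X dimension first."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:  # finish the X dimension before touching Y
        x = (x + ring_step(x, dst[0], width)) % width
        path.append((x, y))
    while y != dst[1]:
        y = (y + ring_step(y, dst[1], height)) % height
        path.append((x, y))
    return path


# On a 4x4 torus, going from x=0 to x=3 uses the wrap-around link.
print(dimension_order_route((0, 0), (3, 1), 4, 4))  # [(0, 0), (3, 0), (3, 1)]
```

Because every packet finishes one dimension before starting the next, the route depends only on the source and destination, which is what keeps the routing logic small.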


Soft Error (SEU) Vulnerability
• In earlier work we studied the vulnerability of the switch architecture to soft errors
  – Only 3.2% of faults eventually cause an error
• Age-related wear-out silicon defects are a more challenging reliability threat for future technologies
• In this work we focus on solutions for in-field silicon defects
• These solutions also provide soft-error tolerance to the design


Self-Repairing Systems
• Defect-tolerant self-repairing systems need to support:
  – Error Detection
  – System Diagnosis (locate the origin of the error)
  – System Repair
  – System Recovery
• Key idea:
  – error detection must be performance efficient
    • it continuously checks execution for errors
  – diagnosis, repair and recovery are not performance critical
    • they are invoked only when an error is detected (a rare scenario)
    • performance can be traded off for more cost-efficient techniques


Traditional Defect-Tolerant Techniques
• Traditional techniques for designing defect-tolerant systems:
  – Triple Modular Redundancy (TMR)
    • Forward recovery
    • Applicable to both combinational and sequential logic
    • Cannot tolerate more than one defective module
    • Area and power overhead ~3X
  – Error Correction Codes (ECC)
    • Lower-overhead solution
    • Applicable only to state-holding structures and busses

[Figure: TMR, three copies of module M feeding a voter V]
[Figure: ECC codeword layout R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8, where R1 to R4 are ECC bits and D1 to D8 are data bits]
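
Both techniques can be sketched in a few lines. The voter below is a bitwise majority over three module outputs, and the ECC functions follow the codeword layout shown on the slide (check bits at positions 1, 2, 4 and 8), which matches a standard Hamming(12,8) single-error-correcting code. The function names are chosen for this example, and treating the slide's layout as a plain Hamming code is an assumption.

```python
PARITY_POSITIONS = (1, 2, 4, 8)                   # R1..R4 in the slide's layout
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)      # D1..D8 in the slide's layout


def tmr_vote(a, b, c):
    """Bitwise majority of three module outputs (masks one faulty module)."""
    return (a & b) | (a & c) | (b & c)


def encode(data_bits):
    """Place 8 data bits and 4 check bits into a 12-bit codeword (list of 0/1)."""
    word = [0] * 13                               # index 0 unused; positions are 1-based
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 13):
            if pos & p and pos != p:              # positions covered by check bit p
                parity ^= word[pos]
        word[p] = parity
    return word[1:]


def correct(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the flipped position."""
    word = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 13):
        if word[pos]:
            syndrome ^= pos                       # XOR of the positions holding a 1
    if 1 <= syndrome <= 12:
        word[syndrome] ^= 1                       # repair the single flipped bit
    return [word[pos] for pos in DATA_POSITIONS], syndrome


data = [1, 0, 1, 1, 0, 0, 1, 0]
damaged = encode(data)
damaged[5] ^= 1                                   # flip position 6 (D3)
print(correct(damaged))                           # original data, syndrome 6
print(bin(tmr_vote(0b1010, 0b1010, 0b0110)))      # 0b1010
```

The contrast in cost is visible here: the voter needs three full copies of the protected module, while the code adds four check bits to eight data bits but only protects stored or transmitted values.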


Error Detection: Low-Cost Domain-Specific Technique
• The synthesized netlist of the added components accounts for ~10% of the total switch area
• Provides error detection for both hard and soft errors

[Figure: switch datapath annotated with the added checkers. Labels: Buffer Checker, Input Buffers, Header, Routing Logic, ARB, Cross-bar Controller, Cross-bar, CRC Checker, FLIT + CRC, Error]
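
A minimal sketch of the end-to-end check, assuming each 32-bit flit travels with a CRC that the checker recomputes at the switch output. The slides do not state which CRC the design uses; the 8-bit polynomial 0x07 and the helper names below are assumptions chosen only to keep the example short.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8 over a bytes object (polynomial is an assumption)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def attach_crc(flit):
    """Return (flit, crc) for a 32-bit flit entering the switch."""
    return flit, crc8(flit.to_bytes(4, "big"))


def crc_ok(flit, crc):
    """Checker at the switch output: recompute the CRC and compare."""
    return crc8(flit.to_bytes(4, "big")) == crc


flit, crc = attach_crc(0xDEADBEEF)
print(crc_ok(flit, crc))             # True: flit crossed the switch intact
print(crc_ok(flit ^ (1 << 7), crc))  # False: one corrupted bit is detected
```

The check does not care whether the corruption came from a permanent defect or a transient upset, which is why the same hardware covers both hard and soft errors.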


Adding Defect Resiliency With Lower Cost
• Automatic Cluster Decomposition
• Balanced recursive min-cut heuristic algorithm
  Input: a) the design's gate-level netlist, b) the number of partitions
  Output: a partitioned netlist
  Goal:
  – Balance partition sizes:
    - smaller partitions give higher resilience
  – Minimize cut edges:
    - reduce cost overhead
    - reduce vulnerable logic
• Partitions can have both combinational and sequential logic

[Figure: example netlist with nodes A through J, shown before and after partitioning]
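
The idea of balanced recursive min-cut partitioning can be sketched as repeated bisection with a greedy refinement step. The code below is not the authors' tool: it uses a simple pair-swap heuristic (in the spirit of Kernighan-Lin), treats the netlist as a plain list of two-pin edges, and only supports a power-of-two number of partitions. All of these are assumptions for illustration.

```python
def cut_size(edges, side_a):
    """Count edges with exactly one endpoint in side_a."""
    return sum(1 for u, v in edges if (u in side_a) != (v in side_a))


def bisect(nodes, edges):
    """Split nodes into two equal halves, swapping pairs while the cut shrinks."""
    nodes = sorted(nodes)
    half = len(nodes) // 2
    a, b = set(nodes[:half]), set(nodes[half:])
    members = a | b
    local = [(u, v) for u, v in edges if u in members and v in members]
    improved = True
    while improved:
        improved = False
        best = cut_size(local, a)
        for u in sorted(a):
            for v in sorted(b):
                trial = (a - {u}) | {v}           # swap keeps both halves balanced
                if cut_size(local, trial) < best:
                    a, b = trial, (b - {v}) | {u}
                    improved = True
                    break
            if improved:
                break
    return a, b


def partition(nodes, edges, k):
    """Recursively bisect into k parts (k assumed to be a power of two)."""
    if k == 1:
        return [set(nodes)]
    a, b = bisect(nodes, edges)
    return partition(a, edges, k // 2) + partition(b, edges, k // 2)


# Two tightly connected clusters joined by a single edge (A-B).
cluster1, cluster2 = "ACEG", "BDFH"
edges = [(u, v) for c in (cluster1, cluster2)
         for i, u in enumerate(c) for v in c[i + 1:]] + [("A", "B")]
print(partition("ABCDEFGH", edges, 2))
```

On this small example the refinement recovers the two clusters and leaves a single cut edge, which corresponds to the slide's goal of few cut edges between equally sized partitions.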


Partition Sparing – Silicon Protection Factor

[Figure: partitioned netlist (nodes A through J) in which each partition has a spare copy]

• Partition sparing:
  – Only one spare is active for each partition of the switch
  – Replace voting logic with spare swapping logic
  – Lower power overhead
  – A defect is fatal if it hits the last spare of a partition or the spare swapping logic

Silicon Protection Factor (SPF) = Mean Defects to Failure / Area Overhead
  – The number of defects in a design is proportional to the design's area
  – Enables comparison of different defect-tolerant designs

[Figure: defect resiliency (0 to 18) versus number of partitions (0 to 900), with 1 extra spare per partition. Curves: Mean Defects to Failure and SPF (defect tolerance). Annotations: 15.8X more defects tolerated; 7.6X more defects tolerated per unit area]
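
To make the SPF definition concrete, the sketch below estimates mean defects to failure by Monte Carlo simulation and divides by the area overhead. It is a simplified model, not the authors' evaluation: partitions are assumed to have equal area, the spare swapping logic and checkers are assumed to be defect-free and to take no area, and the area overhead is taken to be the number of copies. The function names and trial count are also assumptions.

```python
import random


def defects_to_failure(num_partitions, spares, rng):
    """Inject random defects until some partition has no working copy left."""
    copies = spares + 1
    dead = [set() for _ in range(num_partitions)]
    count = 0
    while True:
        count += 1
        p = rng.randrange(num_partitions)   # defects land uniformly over the area
        c = rng.randrange(copies)
        dead[p].add(c)                      # a repeat hit on a dead copy is harmless
        if len(dead[p]) == copies:
            return count


def silicon_protection_factor(num_partitions, spares, trials=2000, seed=0):
    """Return (mean defects to failure, SPF) under the simplified model."""
    rng = random.Random(seed)
    total = sum(defects_to_failure(num_partitions, spares, rng)
                for _ in range(trials))
    mean = total / trials
    area_overhead = spares + 1              # assumption: copies dominate the area
    return mean, mean / area_overhead


for n in (1, 12, 206):
    mean, spf = silicon_protection_factor(n, spares=1)
    print(n, round(mean, 2), round(spf, 2))
```

Even this crude model shows the trend on the slide: with a fixed number of spares per partition, finer partitioning lets the design absorb more defects before two of them land on the same partition.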


System Recovery
• Add a Recovery Pointer to each input buffer
• Recovery pointers advance 4 cycles after the input controller grants the requesting output channel
  – Guarantees that the flit has been CRC checked
• On error detection:
  – All CRC checkers drop outgoing flits
  – Switch pipeline is flushed
  – Head pointers are set to recovery pointers
  – Restart execution

[Figure: interconnect switch with a CRC checker on each routed-flit output; the error detection signal from the checkers drives the recovery logic]
[Figure: input buffer holding flits a through e, with Tail, Head, and Recovery Head pointers]
  a: Correctly routed flit
  b, c: In the switch pipeline
  d: Next flit to be routed
  e: Last flit buffered
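
The recovery-pointer mechanism can be modelled in software as follows. This is a behavioural sketch only: the class and method names are invented for the example, buffer entries are never freed, and the 4-cycle delay from the slide is exposed as a parameter.

```python
from collections import deque


class InputBuffer:
    """Input buffer with a head pointer and a lagging recovery pointer."""

    def __init__(self, check_latency=4):
        self.flits = []
        self.head = 0              # next flit to be routed
        self.recovery = 0          # oldest flit not yet known to be CRC checked
        self.cycle = 0
        self.check_latency = check_latency
        self.pending = deque()     # cycles at which the recovery pointer may advance

    def push(self, flit):
        self.flits.append(flit)

    def grant(self):
        """Input controller grants the output channel to the head flit."""
        flit = self.flits[self.head]
        self.head += 1
        self.pending.append(self.cycle + self.check_latency)
        return flit

    def tick(self):
        """Advance one cycle; retire flits whose CRC check has completed."""
        self.cycle += 1
        while self.pending and self.pending[0] <= self.cycle:
            self.pending.popleft()
            self.recovery += 1

    def on_error(self):
        """Flush in-flight flits and replay from the recovery pointer."""
        self.pending.clear()
        self.head = self.recovery


buf = InputBuffer()
for flit in "abcde":
    buf.push(flit)
buf.grant()                        # a leaves the buffer
for _ in range(4):
    buf.tick()                     # a has now been CRC checked
buf.grant(); buf.tick()            # b enters the pipeline
buf.grant(); buf.tick()            # c enters the pipeline
buf.on_error()                     # error detected while b and c are in flight
print(buf.grant())                 # b is replayed first
```

The scenario mirrors the figure: flit a is safely past the checker, b and c are still in the pipeline when the error is raised, so execution restarts from b.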


System Diagnosis and Repair
• Iterative trial-and-error technique
  – Recover to the last correct state of the switch
  – For partition i, swap in the spare for the current copy and restart execution
  – Error detected?
    • No: continue execution
    • Yes: if i < # partitions, increase i and try again; otherwise the defect is fatal
• Built-In-Self-Test (BIST)
  – For each partition keep automatically generated test vectors in ROM
  – Apply test vectors to each partition through scan chains to locate the defective partition
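
The trial-and-error flowchart can be sketched as a loop over partitions. The model below is hypothetical: `active` holds the index of the copy currently in use for each partition, `defective` is the hidden set of broken (partition, copy) pairs, and `run_has_error` stands in for recovering, replaying, and watching the error detection signal. Undoing an unsuccessful swap before trying the next partition is an assumption; the slide does not say what happens to a spare that did not fix the error.

```python
def run_has_error(active, defective):
    """Stand-in for replay: an error appears if any active copy is defective."""
    return any((p, c) in defective for p, c in enumerate(active))


def iterative_repair(active, copies_per_partition, defective):
    """Try one spare swap per partition; return the repaired index or None."""
    for i in range(len(active)):
        original = active[i]
        if original + 1 >= copies_per_partition:
            continue                       # partition i has no spare left
        active[i] = original + 1           # swap in the spare and restart
        if not run_has_error(active, defective):
            return i                       # error gone: continue execution
        active[i] = original               # assumption: undo an unhelpful swap
    return None                            # every partition tried: fatal defect


active = [0, 0, 0, 0]
print(iterative_repair(active, 2, {(2, 0)}), active)   # 2 [0, 0, 1, 0]
```

The loop is slow compared with BIST because each trial replays execution, but it needs no test-vector ROM or scan-chain access, which is the cost trade-off the slide is pointing at.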


Exploring Defect-Tolerant CMP Switch Designs

[Figure: area overhead (0 to 5) versus normalized defect resiliency, Silicon Protection Factor (SPF, 0 to 12), for 38 numbered design points, marked as Pareto optimal or Pareto sub-optimal. Cheaper designs have lower area overhead; more robust designs have higher SPF.]

Labeled design points: 1-S_TMR, 2-S+CL_TMR, 3-S+CL_TMR+ECC, 4-C_TMR, 7-G_TMR, 8-S_1SP_IR, 9-S+CL_1SP_IR, 10-S+CL_1SP_BIST, 14-C_1SP_BIST, 15-C+CL_1SP_IR, 18-S+CL_2SP_IR, 19-S+CL_2SP_BIST, 21-C_2SP_IR, 22-C_2SP_BIST, 23-C+CL_2SP_IR, 24-C+CL_2SP_BIST, 27-C_3SH(IC)_IR, 30-C_2SH(IC)+1SP_IR, 31-C_3SH(IC)+1SP_IR, 37-C_5SH(IC)+2SP_IR, 38-S_ECC

Highlighted designs:
  – 12 partitions (components), TMR: Area = 3.04X, SPF = 1.54
  – 12 partitions (components), 2/5 spare input controllers, 1 spare per component (rest), iterative replay: Area = 1.76X, SPF = 2.53
  – 206 partitions, 1 spare per partition, iterative replay: Area = 2.3X, SPF = 7.6
  – 206 partitions, 1 spare per partition, Built-In-Self-Test: Area = 3.16X, SPF = 5.54
  – 206 partitions, 2 spares per partition, iterative replay: Area = 3.4X, SPF = 11.1

How do these techniques affect the system's lifetime?
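
The Pareto classification can be illustrated with the five designs whose area and SPF values are given on this slide. The sketch treats a design as dominated if another design has no more area and no less SPF. It only compares these five callouts with each other, so it is not a reproduction of the full 38-point plot, and the short design names are labels chosen for the example.

```python
# (area overhead, SPF) for the five designs called out on the slide
designs = {
    "12 partitions, TMR": (3.04, 1.54),
    "12 partitions, 2/5 spare IC + 1 spare, IR": (1.76, 2.53),
    "206 partitions, 1 spare, IR": (2.3, 7.6),
    "206 partitions, 1 spare, BIST": (3.16, 5.54),
    "206 partitions, 2 spares, IR": (3.4, 11.1),
}


def dominates(a, b):
    """a dominates b if it is no larger in area, no lower in SPF, and differs."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b


def pareto_front(points):
    """Names of the designs not dominated by any other design in `points`."""
    return [name for name, p in points.items()
            if not any(dominates(q, p) for q in points.values())]


for name in pareto_front(designs):
    print(name, designs[name])
```

Among these five, the TMR design and the BIST design drop out: each is matched or beaten on both axes by an iterative-replay design with sparing.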


"Bathtub Curve": A Model for Semiconductor Hard Failures
• The lifetime failure rate of semiconductor systems follows what is known as the bathtub curve
• Trend for future process technologies:
  – The failure rate during the grace period gets larger
  – The breakdown period starts earlier in the system's lifetime

[Figure: failure rate (FIT) versus time across the infant period, grace period, and breakdown period, with a second curve for future process technologies]


System Lifetime – A Post-65nm Technology Case Scenario

[Figure: defective parts (%, 0 to 100) and failure rate (FIT, 12,000 to 120,000) versus time (0 to 12 years), with one curve per design:
  – TMR, SPF = 1.54
  – 3/5 spare IC, 1 spare for the rest, SPF = 3.01
  – 1 spare, SPF = 7.63
  – 2 spares, SPF = 11.11
Annotation: 1 defect every two years]
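
To relate the "1 defect every two years" annotation to the FIT axis, recall that one FIT is one failure per 10^9 device-hours. The conversion below is a worked example of that arithmetic; the helper names and the 365-day year are assumptions.

```python
HOURS_PER_YEAR = 24 * 365          # assumption: 365-day year


def fit_from_interval(years_between_failures):
    """Failure rate in FIT for one failure every given number of years."""
    return 1e9 / (years_between_failures * HOURS_PER_YEAR)


def interval_from_fit(fit):
    """Mean years between failures for a given FIT rate."""
    return 1e9 / (fit * HOURS_PER_YEAR)


print(round(fit_from_interval(2)))          # 57078
print(round(interval_from_fit(120000), 2))  # 0.95
```

So one defect every two years corresponds to roughly 57,000 FIT, about the middle of the plotted range, and the top of the axis (120,000 FIT) is a little more than one defect per year.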


Conclusions – Future Work

Conclusions
• Traditional mechanisms are insufficient for tolerating moderate numbers of defects
• Domain-specific techniques along with resource sparing, iterative diagnosis and reconfiguration are more effective
• Decomposing the design into modest-sized partitions is the most effective granularity at which to apply redundancy

Future Work
• Use of spare components based on component wear-out profiles
• Explore low-cost defect-tolerant techniques for microprocessors


Questions?