

Page 1: Online Diagnosis of Network-on-Chip

Online Diagnosis of Networks-on-Chip

Sebastian Klotz

Supervisor: Dipl.-Inf. Stefan Holst

Reliable NoC in the Many Core Era

6/15/2009

Page 2: Online Diagnosis of Network-on-Chip


Agenda

1. Objectives of Online Diagnosis

2. System Level Fault Models

3. Control Fault Detection/Localization Methods

4. Data Fault Detection/Localization Methods

5. Conclusion

Reliable NoC in the Many Core Era - Online Diagnosis of Networks-on-Chip


Page 4: Online Diagnosis of Network-on-Chip

Objectives of Online Diagnosis

Online Diagnosis:

Detection and localization of faulty switches and inter-switch links

Fault classification: distinguish between transient, intermittent and permanent faults

Observe network behavior during operation

Provide service (input) for recovery mechanism


Page 6: Online Diagnosis of Network-on-Chip

NoC Switch Architecture

Switch Components:

First-In-First-Out (FIFO) Buffer

Multiplexer (MUX)

Crossbar Switch

Router

[Figure: switch block diagram with ports N, S, E, W and the processor port; labeled blocks: five FIFOs, MUXes, Crossbar, Router]

Page 7: Online Diagnosis of Network-on-Chip

System Level Fault Models

Control Faults:

Dropped Data Fault

Direction Fault

Multiple Copies in Space Fault

Multiple Copies in Time Fault

Data Fault:

Corrupted Data Fault

Goal: Route data from W to N

Fault-free case

[Figure: switch block diagram (same as page 6)]

Page 8: Online Diagnosis of Network-on-Chip

System Level Fault Models

Control Fault: Dropped Data Fault

Goal: Route data from W to N

Fault in Router, FIFO or MUX

[Figure: switch block diagram (same as page 6)]

Page 9: Online Diagnosis of Network-on-Chip

System Level Fault Models

Control Fault: Direction Fault

Goal: Route data from W to N

Fault in the Router

[Figure: switch block diagram (same as page 6)]

Page 10: Online Diagnosis of Network-on-Chip

System Level Fault Models

Control Fault: Multiple Copies in Space Fault

Goal: Route data from W to N

Fault in the Router or MUX

[Figure: switch block diagram (same as page 6)]

Page 11: Online Diagnosis of Network-on-Chip

System Level Fault Models

Control Fault: Multiple Copies in Time Fault

Goal: Route data from W to N

Fault in the FIFO Buffer

[Figure: switch block diagram (same as page 6), with data copies marked @t and @t+1]

Page 12: Online Diagnosis of Network-on-Chip

System Level Fault Models

Data Fault: Corrupted Data Fault

Goal: Route data from W to N

Routing is ok, but data is affected!

[Figure: switch block diagram (same as page 6), with the bit patterns 10100101 and 10101101, which differ in one bit]
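The five system-level fault models can be told apart by what leaves the switch compared with what should have left it. The following is a minimal sketch under assumptions: the helper `classify`, the `(port, payload)` representation and the order in which the cases are tested are all hypothetical and not taken from the slides, which only name the fault models.

```python
from collections import Counter
from typing import List, Tuple

Delivery = Tuple[str, int]  # (output port, payload); hypothetical representation

def classify(expected: Delivery, observed: List[Delivery]) -> str:
    """Name the fault model that matches the observed switch outputs."""
    exp_port, exp_data = expected
    if not observed:
        return "Dropped Data Fault"              # nothing left the switch
    ports = Counter(port for port, _ in observed)
    if len(ports) > 1:
        return "Multiple Copies in Space Fault"  # copies on several ports
    if ports[exp_port] == 0:
        return "Direction Fault"                 # left on the wrong port
    if ports[exp_port] > 1:
        return "Multiple Copies in Time Fault"   # same port, more than once
    if observed[0][1] != exp_data:
        return "Corrupted Data Fault"            # right port, wrong payload
    return "Fault-free"

word = 0b10100101
print(classify(("N", word), [("N", 0b10101101)]))
```

The last line uses the two bit patterns from the Corrupted Data Fault slide; routing is correct there and only the payload differs.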


Page 14: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Distraction Detection

Switch Count (TTL)

Time-out Detection

Sequence Number

Trapped Packet Detection

3x3 Network-on-Chip / XY-Routing, switch coordinates [x,y]:

S7[0,2]  S8[1,2]  S9[2,2]
S4[0,1]  S5[1,1]  S6[2,1]
S1[0,0]  S2[1,0]  S3[2,0]

$(X_{Destination} = X_{Switch}) \lor (Y_{Switch} = Y_{Source}) = 1$
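The examples on the following slides rely on XY-Routing over the 3x3 mesh: a packet first travels along X until it reaches the destination column, then along Y. A minimal sketch of the fault-free route, assuming the S1..S9 labels map to [x,y] exactly as printed on the slide; the names `xy_route` and `NAMES` are illustrative only.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y)

# Slide labels: S1[0,0], S2[1,0], S3[2,0], S4[0,1], ..., S9[2,2]
NAMES = {(x, y): f"S{3 * y + x + 1}" for y in range(3) for x in range(3)}

def xy_route(source: Coord, dest: Coord) -> List[Coord]:
    """Switches visited under fault-free XY routing: resolve X first, then Y."""
    x, y = source
    path = [(x, y)]
    while x != dest[0]:
        x += 1 if dest[0] > x else -1
        path.append((x, y))
    while y != dest[1]:
        y += 1 if dest[1] > y else -1
        path.append((x, y))
    return path

# Ex. 1 from the slides: S4 [0,1] to S9 [2,2]
print([NAMES[c] for c in xy_route((0, 1), (2, 2))])
```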

Page 15: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Distraction Detection

Direction Faults ("Stuck-at N, E, S, W"), Multiple Copies in Space Faults

Ex. 1: Route from S4 [0,1] to S9 [2,2]

[Figure: 3x3 mesh S1[0,0] to S9[2,2] with a switch marked SaN (stuck-at N)]

Page 16: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Distraction Detection

Direction Faults ("Stuck-at N, E, S, W"), Multiple Copies in Space Faults

Ex. 1: Route from S4 [0,1] to S9 [2,2]

[Figure: 3x3 mesh S1[0,0] to S9[2,2] with a switch marked SaN (stuck-at N); the check is shown at S8 [1,2]]

Comparator (=?) in S8 [1,2] for packets arriving from other switches; inputs: Switch Address, Source Address, Destination Address; output labels: No, Error

$(X_{Destination} = X_{Switch}) \lor (Y_{Switch} = Y_{Source}) = 1$

$X_{Destination} \neq X_{Switch}$ and $Y_{Switch} \neq Y_{Source}$

Distraction Detected!
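The comparator can be read as a check of an XY-Routing invariant: on a valid route a packet is either still in the source row (Y of the switch equals Y of the source) or already in the destination column (X of the switch equals X of the destination). The sketch below assumes that reading; the function name `distraction_detected` is hypothetical.

```python
from typing import Tuple

Coord = Tuple[int, int]  # (x, y)

def distraction_detected(switch: Coord, source: Coord, dest: Coord) -> bool:
    """True if a packet seen at `switch` cannot lie on a valid XY route."""
    in_source_row = switch[1] == source[1]  # still on the X leg
    in_dest_column = switch[0] == dest[0]   # already on the Y leg
    return not (in_source_row or in_dest_column)

# Ex. 1: packet from S4 [0,1] to S9 [2,2] shows up at S8 [1,2]
print(distraction_detected((1, 2), (0, 1), (2, 2)))
```

The check is necessary but not sufficient: a misrouted packet that stays in the source row or the destination column passes it, which is the limitation the later examples illustrate.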

Page 17: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Distraction Detection

Direction Faults ("Stuck-at N, E, S, W"), Multiple Copies in Space Faults

Ex. 2: Route from S2 [1,0] to S6 [2,1]

[Figure: 3x3 mesh S1[0,0] to S9[2,2] with a switch marked SaN (stuck-at N)]

This stuck-at port fault cannot be discovered; S3 acts as expected!

Page 18: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Distraction Detection

Direction Faults ("Stuck-at N, E, S, W"), Multiple Copies in Space Faults

Ex. 3: Route from S8 [1,2] to S4 [0,1]

[Figure: 3x3 mesh S1[0,0] to S9[2,2] with a switch marked SaE (stuck-at E)]

$Y_{Switch} = Y_{Source}$: No violation of XY-Routing!

Livelock!

Page 19: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Switch Count (TTL)

Direction Faults: "Reflection"

Ex. 3: Route from S8 [1,2] to S4 [0,1]

[Figure: 3x3 mesh S1[0,0] to S9[2,2] with a switch marked SaE (stuck-at E); Livelock!]

Idea:

Decrement the "switch count field" on each hop.

Switch count field = "10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0"!? Drop packet
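The switch count idea can be simulated for Ex. 3. The sketch assumes the stuck-at-East switch is S7 [0,2], which is consistent with the reflection back to S8 but is not stated in the transcript; `forward`, `xy_port` and the `stuck_at` map are hypothetical names.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (x, y)
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def xy_port(switch: Coord, dest: Coord) -> str:
    """Output port chosen by fault-free XY routing (X first, then Y)."""
    if switch[0] != dest[0]:
        return "E" if dest[0] > switch[0] else "W"
    return "N" if dest[1] > switch[1] else "S"

def forward(source: Coord, dest: Coord,
            stuck_at: Dict[Coord, str], ttl: int) -> Tuple[bool, int]:
    """Return (delivered, hops); the packet is dropped once ttl reaches 0."""
    pos, hops = source, 0
    while pos != dest:
        if ttl == 0:
            return False, hops          # switch count exhausted: drop packet
        port = stuck_at.get(pos, xy_port(pos, dest))
        pos = (pos[0] + STEP[port][0], pos[1] + STEP[port][1])
        ttl -= 1                        # decrement on each hop
        hops += 1
    return True, hops

print(forward((1, 2), (0, 1), {}, ttl=10))             # fault-free
print(forward((1, 2), (0, 1), {(0, 2): "E"}, ttl=10))  # S7 stuck-at E
```

With the fault in place the packet bounces between S8 and S7 until the count hits zero, so the livelock ends in a drop instead of occupying the network indefinitely.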

Page 20: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Time-out Detection

Dropped Data Faults

Ex. 3: Route from S8 [1,2] to S4 [0,1]

[Figure: 3x3 mesh S1[0,0] to S9[2,2] with a switch marked SaE (stuck-at E); ACK/NACK?]

Idea:

Start timer whenever data leaves the source.

Stop timer when reception is confirmed.

Expiration of the timer indicates dropped data.
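The three timer rules translate directly into a small source-side monitor. This is a sketch under assumptions: time is a logical cycle counter passed in by the caller, each packet carries an identifier, and the class name `TimeoutMonitor` and the timeout value are made up for illustration.

```python
from typing import Dict, List

class TimeoutMonitor:
    """Source-side timers, one per outstanding packet (logical clock in cycles)."""

    def __init__(self, timeout: int) -> None:
        self.timeout = timeout
        self.deadline: Dict[int, int] = {}

    def sent(self, packet_id: int, now: int) -> None:
        self.deadline[packet_id] = now + self.timeout  # start timer

    def acked(self, packet_id: int) -> None:
        self.deadline.pop(packet_id, None)             # stop timer

    def expired(self, now: int) -> List[int]:
        """Packets whose timer ran out: presumed dropped."""
        return sorted(p for p, d in self.deadline.items() if now >= d)

monitor = TimeoutMonitor(timeout=20)
monitor.sent(1, now=0)
monitor.sent(2, now=5)
monitor.acked(1)                 # reception of packet 1 confirmed
print(monitor.expired(now=25))   # packet 2 was never confirmed
```

Choosing the timeout is a trade-off the slides leave open: too short and congestion looks like a dropped packet, too long and detection latency grows.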

Page 21: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Sequence Number

Dropped Data Faults, Multiple Copies in Space / Time Faults

[Figure: the source node sends numbered packets through the Network-on-Chip to the destination node; packet stream #1, #2, #2, #3, #5, #6, #7, #8, #9, #10, #11. Annotations: Duplicated Packet (Multiple Copies Faults), Packet #4 missing (Dropped Data Faults). Blocks: Sequence Number, Watchdog Counter]
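The receiver-side sequence number check can be sketched as follows, using the packet stream from the figure. It assumes in-order delivery (reasonable for a single deterministic XY route, but an assumption nonetheless); the function name `check_sequence` is hypothetical.

```python
from typing import Iterable, List, Tuple

def check_sequence(received: Iterable[int], first: int = 1) -> Tuple[List[int], List[int]]:
    """Return (missing, duplicated) sequence numbers, assuming in-order delivery."""
    expected = first
    missing: List[int] = []
    duplicated: List[int] = []
    for seq in received:
        if seq < expected:
            duplicated.append(seq)                  # already seen: extra copy
        else:
            missing.extend(range(expected, seq))    # gap: packets were dropped
            expected = seq + 1
    return missing, duplicated

print(check_sequence([1, 2, 2, 3, 5]))
```

If the network could reorder packets, a late arrival would be reported as a duplicate here, so a real receiver would need a reordering window.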

Page 22: Online Diagnosis of Network-on-Chip

Control Fault Detection/Localization

Trapped Packet Detection

Direction Faults: "Stuck-at Processor" (SaP)

[Figure: switch with ports N, S, E, W and processor port P; Router with SaP Fault]

Comparison that is performed in the PE:

$XY_{Switch} = XY_{Destination}$

Violation indicates SaP Fault!
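The comparison in the processing element is small enough to write out. A minimal sketch, assuming the PE knows the [x,y] address of its own switch and can read the destination field of every packet handed to it; the function name is hypothetical.

```python
from typing import Tuple

Coord = Tuple[int, int]  # (x, y)

def sap_fault_suspected(switch_xy: Coord, packet_dest_xy: Coord) -> bool:
    """PE-side check: a packet delivered to the processor must be addressed here."""
    return packet_dest_xy != switch_xy

# A packet for S9 [2,2] is handed to the PE at S5 [1,1]
print(sap_fault_suspected((1, 1), (2, 2)))
```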


Page 24: Online Diagnosis of Network-on-Chip

Data Fault Detection/Localization

End-to-End (e2e)

Switch-to-Switch (s2s)

Code-Disjoint-Detection (cdd)

Detect Corrupted Data Faults

Perform Data Fault Detection by means of Error Detection and Correction (EDC) Codes.

ED:

Parity-Check Codes

Cyclic Redundancy Check (CRC) Codes

EDC:

Hamming Codes (SEC/DED)

SEC: Single Error Correction

DED: Double Error Detection
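To make SEC/DED concrete, here is a sketch of an extended Hamming code on 4 data bits (7 Hamming bits plus one overall parity bit). The word width, the bit layout and the function names are assumptions chosen for brevity; the slides do not specify which code parameters are used.

```python
from typing import List, Tuple

def encode(d: List[int]) -> List[int]:
    """Encode 4 data bits; c[0] is overall parity, c[1], c[2], c[4] are Hamming parity."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c

def decode(word: List[int]) -> Tuple[List[int], str]:
    """Return (data bits, status); status is 'ok', 'corrected' or 'double error'."""
    c = list(word)
    syndrome = ((c[1] ^ c[3] ^ c[5] ^ c[7])
                | ((c[2] ^ c[3] ^ c[6] ^ c[7]) << 1)
                | ((c[4] ^ c[5] ^ c[6] ^ c[7]) << 2))
    overall = sum(c) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1          # single error: syndrome is its position
        status = "corrected"
    else:
        status = "double error"   # detected, but not correctable
    return [c[3], c[5], c[6], c[7]], status

code = encode([1, 0, 1, 1])
code[5] ^= 1                      # inject a single bit flip
print(decode(code))
```

A plain parity bit would only report that something is wrong; the syndrome additionally says where, which is what allows correction of single errors.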

Page 25: Online Diagnosis of Network-on-Chip

Data Fault Detection/Localization

End-to-End (e2e)

Encode data at the sending node

Data Fault Detection at the destination

Clear localization is not possible

[Figure: PE, Sender NI with Encoder and packet buffer, Switch A and Switch B with queuing buffers, Receiver NI with Decoder, PE; signals: Data, Credit signal]

PE: Processing Element

NI: Network Interface

Page 26: Online Diagnosis of Network-on-Chip

Data Fault Detection/Localization

Switch-to-Switch (s2s)

Encode data at the sending node

Perform "checking" at every switch input

The last switch and link are suspicious

[Figure: PE, Sender NI with Encoder and packet buffer, Switch A and Switch B with Decoders and circular (queuing and retransmission) buffers, Receiver NI with Decoder, PE; signals: Data, ACK, NACK]

PE: Processing Element

NI: Network Interface

Page 27: Online Diagnosis of Network-on-Chip

Data Fault Detection/Localization

Code-Disjoint-Detection (cdd)

Encode data at the sending node

Perform "checking" at every switch input as well as output

Clear localization of the fault (switch/link)

[Figure: switch with ports N, S, E, W, P, FIFO and Router; labels: Xi, Xip, Pi(Xi), link error flag, Xo, Xop, Po(Xo), switch error flag]
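One way to read the two error flags: the check at the switch input catches corruption that happened on the incoming link, and the check at the output catches corruption that happened inside the switch. The sketch below assumes a single even-parity bit that travels unchanged with the data (Xip with Xi, Xop with Xo); the wiring, the parity choice and the name `cdd_flags` are assumptions, since the transcript only preserves the figure labels.

```python
from typing import Tuple

def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1s."""
    return bin(word).count("1") % 2

def cdd_flags(x_in: int, x_in_p: int, x_out: int, x_out_p: int) -> Tuple[bool, bool]:
    """Return (link_error, switch_error) for one pass through a switch."""
    link_error = parity(x_in) != x_in_p            # input check
    # Only blame the switch if the data arrived intact.
    switch_error = (not link_error) and parity(x_out) != x_out_p
    return link_error, switch_error

good, bad = 0b10100101, 0b10101101
print(cdd_flags(bad, parity(good), bad, parity(good)))    # corrupted on the link
print(cdd_flags(good, parity(good), bad, parity(good)))   # corrupted in the switch
```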

Page 28: Online Diagnosis of Network-on-Chip

Data Fault Detection/Localization

Comparison of the Fault Localization capabilities: End-to-End, Switch-to-Switch and Code-Disjoint-Detection

[Figure: two paths through the network. Path I: S1, L1, Sw1, L2, Sw2, L3, D1. Path II: S2, L4, Sw3, L5, Sw4, L6, Sw5, L7, Sw6, L8, Sw7, L9, D2]

Path I (S1 to D1), fault in Sw1:

e2e = {S1, L1, Sw1, L2, Sw2, L3, D1}

s2s = {Sw1, L2}

cdd = {Sw1}

Path II (S2 to D2), fault in L5:

e2e = {S2, L4, Sw3, L5, Sw4, L6, Sw5, L7, Sw6, L8, Sw7, L9, D2}

s2s = {Sw3, L5}

cdd = {L5}
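The suspect sets above follow a simple rule per scheme, which can be reproduced in a few lines. This sketch assumes a path is an alternating list of nodes and links, and that for s2s the error is noticed at the first checker downstream of the fault; the function name `candidates` is hypothetical.

```python
from typing import List, Set

PATH_I = ["S1", "L1", "Sw1", "L2", "Sw2", "L3", "D1"]
PATH_II = ["S2", "L4", "Sw3", "L5", "Sw4", "L6", "Sw5",
           "L7", "Sw6", "L8", "Sw7", "L9", "D2"]

def candidates(path: List[str], faulty: str, scheme: str) -> Set[str]:
    """Elements that remain suspect after detection under the given scheme."""
    i = path.index(faulty)               # even index: node, odd index: link
    if scheme == "e2e":
        return set(path)                 # checked only at the destination
    if scheme == "s2s":
        checker = i + 1 if i % 2 else i + 2   # next node input after the fault
        if checker >= len(path):
            raise ValueError("no checker downstream of the faulty element")
        return set(path[checker - 2:checker])  # last node and last link
    if scheme == "cdd":
        return {faulty}                  # input and output checks isolate it
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("e2e", "s2s", "cdd"):
    print(scheme, sorted(candidates(PATH_II, "L5", scheme)))
```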


Page 30: Online Diagnosis of Network-on-Chip

Conclusion

Objectives of Online Diagnosis:

Fault Detection/Localization

Fault Classification (transient, intermittent & permanent)

Fault Modeling

Regard altered switch behavior (abstraction)

Classify models into either Control or Data Faults

Control Fault Detection/Localization

Distraction Detection with different extensions

Data Fault Detection/Localization

ED(C): Detection

e2e, s2s and cdd: Localization

Page 31: Online Diagnosis of Network-on-Chip

Thank you for your attention!

Reliable NoC in the Many Core Era

6/15/2009