Online Diagnosis of Networks-on-Chip
Sebastian Klotz
Supervisor: Dipl.-Inf. Stefan Holst
Reliable NoC in the Many Core Era
6/15/2009
Agenda
1. Objectives of Online Diagnosis
2. System Level Fault Models
3. Control Fault Detection/Localization Methods
4. Data Fault Detection/Localization Methods
5. Conclusion
Reliable NoC in the Many Core Era - Online Diagnosis of Networks-on-Chip
Objectives of Online Diagnosis
Online Diagnosis:
Detection and localization of faulty switches and inter-switch links
Fault classification: distinguish between transient, intermittent and permanent faults
Observe network behavior during operation
Provide service (input) for recovery mechanism
NoC Switch Architecture
Switch Components:
First-In-First-Out (FIFO) Buffer
Multiplexer (MUX)
Crossbar Switch
Router
[Figure: NoC switch attached to a processor, with ports N, S, E and W, five FIFO buffers, a router, and a crossbar with MUXes]
System Level Fault Models
Control Faults:
Dropped Data Fault
Direction Fault
Multiple Copies in Space Fault
Multiple Copies in Time Fault
Data Fault:
Corrupted Data Fault
Goal: Route data from W to N
Fault-free case
System Level Fault Models
Goal: Route data from W to N
Dropped Data Fault: fault in Router, FIFO or MUX
System Level Fault Models
Goal: Route data from W to N
Direction Fault: fault in the Router
System Level Fault Models
Goal: Route data from W to N
Multiple Copies in Space Fault: fault in the Router or MUX
System Level Fault Models
Goal: Route data from W to N
Multiple Copies in Time Fault: fault in the FIFO Buffer (the same data is delivered at time t and again at t+1)
System Level Fault Models
Goal: Route data from W to N
Corrupted Data Fault: routing is OK, but the data is affected!
Example: 10100101 vs. 10101101 (one bit differs)
Control Fault Detection/Localization
Distraction Detection
Switch Count (TTL)
Time-out Detection
Sequence Number
Trapped Packet Detection
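3x3 Network-on-Chip / XY-Routing, switch coordinates [x,y]:
S7[0,2]  S8[1,2]  S9[2,2]
S4[0,1]  S5[1,1]  S6[2,1]
S1[0,0]  S2[1,0]  S3[2,0]
Condition that holds at every switch on a fault-free XY route:
$(Y_{Source} = Y_{Switch}) \lor (X_{Switch} = X_{Destination}) = 1$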
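A minimal sketch of XY-routing on the 3x3 mesh above may help to follow the later examples. It assumes the usual definition (first correct the X coordinate, then the Y coordinate); the function name `xy_route` and the tuple representation of the [x,y] coordinates are illustrative choices, not taken from the slides.

```python
def xy_route(source, destination):
    """Switch coordinates (x, y) visited on the fault-free XY route."""
    x, y = source
    path = [(x, y)]
    # Phase 1: move along X until the destination column is reached.
    while x != destination[0]:
        x += 1 if destination[0] > x else -1
        path.append((x, y))
    # Phase 2: move along Y until the destination row is reached.
    while y != destination[1]:
        y += 1 if destination[1] > y else -1
        path.append((x, y))
    return path


# S4 [0,1] to S9 [2,2] passes S5 [1,1] and S6 [2,1].
print(xy_route((0, 1), (2, 2)))  # [(0, 1), (1, 1), (2, 1), (2, 2)]
```

During the X phase the packet stays in the source row, and during the Y phase it stays in the destination column, which is why a check on these two coordinates can expose a misrouted packet.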
Control Fault Detection/Localization
Distraction Detection
Direction Faults ("Stuck-at N, E, S, W") and Multiple Copies in Space Faults
Ex. 1: Route from S4 [0,1] to S9 [2,2], with a stuck-at-North (SaN) fault on the route
[Figure: 3x3 mesh; the packet is misrouted to S8 [1,2]. In each switch, a comparator checks the switch address against the source address and destination address of packets arriving from other switches and signals an error.]
Check at every switch: $(Y_{Source} = Y_{Switch}) \lor (X_{Switch} = X_{Destination}) = 1$
At S8 [1,2]: $Y_{Source} \neq Y_{Switch}$ and $X_{Switch} \neq X_{Destination}$
Distraction Detected!
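The comparator check can be sketched in a few lines. This is only an illustration under the assumption that every packet carries its source and destination [x,y] address; the function name `distraction_detected` and the tuple encoding are hypothetical.

```python
def distraction_detected(source, destination, switch):
    """True if a packet seen at `switch` violates the XY-routing condition."""
    in_x_phase = switch[1] == source[1]        # still in the source row
    in_y_phase = switch[0] == destination[0]   # already in the destination column
    return not (in_x_phase or in_y_phase)


S4, S5, S8, S9 = (0, 1), (1, 1), (1, 2), (2, 2)
print(distraction_detected(S4, S9, S5))  # False: S5 lies on the XY route
print(distraction_detected(S4, S9, S8))  # True: the packet was misrouted to S8
```

The check is purely local: a switch needs only its own address and the two addresses in the packet header.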
Control Fault Detection/Localization
Distraction Detection
Direction Faults ("Stuck-at N, E, S, W") and Multiple Copies in Space Faults
Ex. 2: Route from S2 [1,0] to S6 [2,1], with a stuck-at-North (SaN) fault at S3
This stuck-at port fault cannot be discovered; S3 acts as expected, because the fault-free XY route also leaves S3 through its N port.
Control Fault Detection/Localization
Distraction Detection
Direction Faults ("Stuck-at N, E, S, W") and Multiple Copies in Space Faults
Ex. 3: Route from S8 [1,2] to S4 [0,1], with a stuck-at-East (SaE) fault on the route
$Y_{Source} = Y_{Switch}$: no violation of XY-Routing!
Livelock!
Control Fault Detection/Localization
Switch Count (TTL)
Direction Faults ("Reflection")
Ex. 3: Route from S8 [1,2] to S4 [0,1], with a stuck-at-East (SaE) fault on the route
Livelock!
Idea:
Decrement the "switch count field" on each hop.
Switch count field = "10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0": drop the packet.
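The switch count idea can be illustrated with a small simulation. It is a sketch under assumptions: the initial count of 10 follows the slide, while the function name `traverse` and the reflection pattern (the packet bouncing between S7 and S8) are chosen here only for illustration.

```python
from itertools import cycle


def traverse(hops, destination, switch_count=10):
    """Follow a packet along `hops`, decrementing the switch count per hop.

    Returns ("delivered" | "dropped" | "in flight", number of hops taken).
    """
    hops_taken = 0
    for switch in hops:
        hops_taken += 1
        if switch == destination:
            return "delivered", hops_taken
        switch_count -= 1
        if switch_count == 0:
            # The packet has visited too many switches: break the livelock.
            return "dropped", hops_taken
    return "in flight", hops_taken


print(traverse(["S7", "S4"], "S4"))           # ('delivered', 2)
print(traverse(cycle(["S7", "S8"]), "S4"))    # ('dropped', 10)
```

The initial value has to exceed the longest legal route in the network, otherwise fault-free packets would be dropped as well.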
Control Fault Detection/Localization
Time-out Detection
Dropped Data Faults
Ex. 3: Route from S8 [1,2] to S4 [0,1], with a stuck-at-East (SaE) fault on the route
Idea:
Start a timer whenever data leaves the source.
Stop the timer when reception is confirmed (ACK/NACK?).
Expiration of the timer indicates dropped data.
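A minimal sketch of the timer idea, using an explicit cycle counter instead of a real clock so that it stays deterministic. The class name `TimeoutDetector`, the packet identifiers and the time-out value are assumptions made for this example.

```python
class TimeoutDetector:
    """Tracks outstanding packets at the source and reports expired timers."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = {}  # packet id -> cycle in which the packet left the source

    def sent(self, packet_id, now):
        self.pending[packet_id] = now          # start timer

    def acknowledged(self, packet_id):
        self.pending.pop(packet_id, None)      # stop timer

    def expired(self, now):
        """Packets whose timer ran out: these are suspected to be dropped."""
        return sorted(p for p, start in self.pending.items()
                      if now - start >= self.timeout)


detector = TimeoutDetector(timeout=5)
detector.sent("a", now=0)
detector.sent("b", now=1)
detector.acknowledged("a")
print(detector.expired(now=3))  # []
print(detector.expired(now=6))  # ['b']
```

The time-out must be longer than the worst-case round trip of data and confirmation, otherwise slow but correct deliveries are reported as dropped.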
Control Fault Detection/Localization
Sequence Number
Dropped Data Faults and Multiple Copies in Space / Time Faults
[Figure: the source node sends numbered packets through the Network-on-Chip to the destination node. Packet stream: #1, #2, #2, #3, #5, #6, #7, #8, #9, #10, #11. Packet #2 is duplicated and packet #4 is missing.]
Detection means shown: Sequence Number, Watchdog Counter
Detected faults: Dropped Data Faults, Multiple Copies Faults
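The sequence number check at the destination node can be sketched as follows. The sketch assumes in-order delivery and numbering that starts at 1; `check_sequence` is a hypothetical name.

```python
def check_sequence(received, first=1):
    """Return (duplicated, missing) sequence numbers of an in-order stream."""
    duplicated, missing = [], []
    expected = first
    for number in received:
        if number < expected:
            # Already seen: Multiple Copies (in space or time) Fault.
            duplicated.append(number)
        else:
            # Any skipped numbers point to a Dropped Data Fault.
            missing.extend(range(expected, number))
            expected = number + 1
    return duplicated, missing


stream = [1, 2, 2, 3, 5, 6, 7, 8, 9, 10, 11]
print(check_sequence(stream))  # ([2], [4])
```

A gap only becomes visible once a later packet arrives, so a dropped final packet goes unnoticed by this check alone; a time-based mechanism such as a watchdog counter is one plausible way to cover that case.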
Control Fault Detection/Localization
Trapped Packet Detection
Direction Faults ("Stuck-at Processor", SaP)
[Figure: switch with ports N, S, E, W and P (Processor); the router has an SaP fault]
Comparison that is performed in the PE: $XY_{Destination} = XY_{Switch}$
A violation indicates an SaP fault!
Data Fault Detection/Localization
End-to-End (e2e)
Switch-to-Switch (s2s)
Code-Disjoint-Detection (cdd)
Detect Corrupted Data Faults
Perform Data Fault Detection by means of Error Detection and Correction (EDC) Codes.
ED:
Parity-Check Codes
Cyclic Redundancy Check (CRC) Codes
EDC:
Hamming Codes (SEC/DED)
SEC: Single Error Correction
DED: Double Error Detection
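To make the difference between ED and EDC concrete, here is a hedged sketch of an even-parity check and of an extended Hamming (8,4) code with SEC/DED behavior. The word width, the bit layout and all function names are assumptions for illustration; the slides do not specify which code parameters a NoC would use.

```python
def parity(word):
    """Even-parity bit of an integer word (ED: detects an odd number of bit flips)."""
    return bin(word).count("1") % 2


def hamming_encode(data):
    """Encode 4 data bits; list index i is code position i, index 0 the overall parity."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    return [sum(word) % 2] + word


def hamming_decode(code):
    """Return (data bits or None, status) with single-error correction."""
    code = list(code)
    syndrome = 0
    for position, bit in enumerate(code):
        if bit:
            syndrome ^= position
    if sum(code) % 2 == 1:
        code[syndrome] ^= 1            # single error (syndrome 0: parity bit itself)
        status = "corrected"
    elif syndrome != 0:
        return None, "double error detected"
    else:
        status = "ok"
    return [code[3], code[5], code[6], code[7]], status


print(parity(0b10100101), parity(0b10101101))  # 0 1
sent = hamming_encode([1, 0, 1, 1])
one_flip = sent[:]
one_flip[5] ^= 1
print(hamming_decode(one_flip))  # ([1, 0, 1, 1], 'corrected')
```

Parity only reports that something went wrong, whereas the Hamming code repairs a single flipped bit and still flags two flipped bits as uncorrectable.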
Data Fault Detection/Localization
End-to-End (e2e)
[Figure: data path from the sending PE through the Sender NI (Encoder), Switch A and Switch B to the Receiver NI (Decoder) and the receiving PE; labels: packet buffer, queuing buffer, credit signal]
Encode data at the sending node
Data Fault Detection at the destination
Clear localization is not possible
PE: Processing Element
NI: Network Interface
Data Fault Detection/Localization
Switch-to-Switch (s2s)
[Figure: same data path, with a Decoder at Switch A, at Switch B and in the Receiver NI; labels: packet buffer, circular (queuing and retransmission) buffers, ACK, NACK]
Encode data at the sending node
Perform "checking" at every switch input
The last switch and link are suspect
PE: Processing Element
NI: Network Interface
Data Fault Detection/Localization
Code-Disjoint-Detection (cdd)
[Figure: switch with ports N, S, E, W and P, FIFO and Router. At the input, the check Pi(Xi) on the data Xi and its check bit Xip drives the link error flag; at the output, the check Po(Xo) on Xo and Xop drives the switch error flag.]
Encode data at the sending node
Perform "checking" at every switch input as well as output
Clear localization of the fault (switch/link)
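The two flags can be sketched as follows. This is an interpretation rather than the presented design: it assumes a single even-parity bit as the code, that Xip and Xop are the check bits travelling with the data Xi and Xo, and that the switch error flag is only raised when the input check passed. The name `cdd_check` is hypothetical.

```python
def parity(word):
    """Even-parity bit of an integer word."""
    return bin(word).count("1") % 2


def cdd_check(x_in, x_in_parity, x_out, x_out_parity):
    """Return (link_error_flag, switch_error_flag) for one switch traversal."""
    # Input check: a mismatch means the data was corrupted on the incoming link.
    link_error_flag = parity(x_in) != x_in_parity
    # Output check: assumed to be meaningful only if the input was clean,
    # so a mismatch here points at the switch itself.
    switch_error_flag = (not link_error_flag) and parity(x_out) != x_out_parity
    return link_error_flag, switch_error_flag


print(cdd_check(0b10100101, 0, 0b10100101, 0))  # (False, False)
print(cdd_check(0b10101101, 0, 0b10101101, 0))  # (True, False)
print(cdd_check(0b10100101, 0, 0b10101101, 0))  # (False, True)
```

Because each flag belongs to exactly one component, the diagnosis can name either the link or the switch instead of a set of suspects.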
Data Fault Detection/Localization
Comparison of the Fault Localization capabilities:
End-to-End, Switch-to-Switch and Code-Disjoint-Detection
[Figure: network with switches Sw1 to Sw7 and links L1 to L9. Path I runs from source S1 to destination D1, Path II from source S2 to destination D2.]
Path I (S1 to D1)
Fault in Sw1:
e2e = {S1, L1, Sw1, L2, Sw2, L3, D1}
s2s = {Sw1, L2}
cdd = {Sw1}
Path II (S2 to D2)
Fault in L5:
e2e = {S2, L4, Sw3, L5, Sw4, L6, Sw5, L7, Sw6, L8, Sw7, L9, D2}
s2s = {Sw3, L5}
cdd = {L5}
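The suspect sets of both examples follow from one simple rule per scheme, sketched here. The path encoding as an alternating list of nodes and links, the `L` prefix test for links and the name `suspects` are assumptions for this illustration; faults in the end nodes themselves are not handled.

```python
def suspects(path, faulty):
    """Suspect sets (e2e, s2s, cdd) for a faulty switch or link on `path`."""
    i = path.index(faulty)
    e2e = set(path)                      # checked only at the destination
    if faulty.startswith("L"):
        # The switch input after the link notices: link plus previous switch.
        s2s = {path[i - 1], faulty}
    else:
        # The next switch input notices: faulty switch plus its outgoing link.
        s2s = {faulty, path[i + 1]}
    cdd = {faulty}                       # input and output checks separate the two
    return e2e, s2s, cdd


path_1 = ["S1", "L1", "Sw1", "L2", "Sw2", "L3", "D1"]
path_2 = ["S2", "L4", "Sw3", "L5", "Sw4", "L6", "Sw5",
          "L7", "Sw6", "L8", "Sw7", "L9", "D2"]

print(suspects(path_1, "Sw1")[1] == {"Sw1", "L2"})  # True
print(suspects(path_2, "L5")[1] == {"Sw3", "L5"})   # True
```

The size of the suspect set shrinks from the whole path (e2e) to two components (s2s) to a single component (cdd), at the price of more checking hardware per switch.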
Conclusion
Objectives of Online Diagnosis:
Fault Detection/Localization
Fault Classification (transient, intermittent & permanent)
Fault Modeling
Regard altered switch behavior (abstraction)
Classify models into either Control or Data Faults
Control Fault Detection/Localization
Distraction Detection with different extensions
Data Fault Detection/Localization
ED(C) codes for Detection
e2e, s2s and cdd for Localization
Thank you for your attention!