2009-12-2, Advanced Processor Technology Group, The School of Computer Science
From Channel Slicing to Spatial Division Multiplexing
-- the asynchronous router design
Wei Song, 03/12/2009
Index
• Channel Slicing (current)
  – Asynchronous NoCs and routers
  – Channel Slicing
  – A wormhole router design
• Spatial Division Multiplexing (SDM)
  – Motives
  – Switching networks
  – 2-stage Clos network
  – The distributed scheduler
  – Implementation results
Asynchronous NoCs
• GALS
• Full async comm fabric
• QDI pipelines
• Low dynamic power
• Tolerance to variation
• Fast prototype
Synchronised QDI Pipelines
[Figure: synchronised QDI pipeline; figure labels 8 and 4; Nangate Cell Lib 65 nm, 1-of-4 encoding]
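As a reminder of what the 1-of-4 label refers to: in a 1-of-4 code, each 2-bit digit travels on four wires, exactly one of which is high for valid data, and all four are low for the spacer. The sketch below is a minimal illustration only. It assumes an 8-bit word split into four 2-bit groups (one reading of the 8 and 4 labels, not confirmed by the slide), and the function names are hypothetical.

```python
def encode_1of4(value: int, width: int = 8) -> list[list[int]]:
    """One-hot encode each 2-bit digit of `value`, least significant digit first."""
    if width <= 0 or width % 2:
        raise ValueError("width must be a positive even number of bits")
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in width bits")
    groups = []
    for i in range(width // 2):
        digit = (value >> (2 * i)) & 0b11
        # Exactly one of the four wires is high for a valid digit.
        groups.append([1 if wire == digit else 0 for wire in range(4)])
    return groups


def is_complete(groups: list[list[int]]) -> bool:
    """True when every sub-channel holds a valid codeword (exactly one wire high).

    In a synchronised pipeline this is the condition that the C-element tree
    waits for before a single acknowledge is produced (assumed model).
    """
    return all(sum(group) == 1 for group in groups)


def decode_1of4(groups: list[list[int]]) -> int:
    """Inverse of encode_1of4 for a complete word."""
    if not is_complete(groups):
        raise ValueError("word is not complete")
    return sum(group.index(1) << (2 * i) for i, group in enumerate(groups))
```

The all-zero spacer `[[0, 0, 0, 0]] * 4` is reported as incomplete, which is how the model distinguishes data from the return-to-zero phase.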
Channel Slicing (1)
• Remove the C-element tree
• Sub-channels run independently
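A toy timing model can show why independent sub-channels may be faster. It is a sketch under stated assumptions, not the router's actual timing: each token on each sub-channel has its own handshake delay, the synchronised pipeline waits for the slowest sub-channel on every token (plus a hypothetical `tree_delay` for the C-element tree), and the sliced pipeline lets each sub-channel run at its own pace. Any re-synchronisation at packet boundaries is ignored.

```python
import random


def synchronised_time(delays, tree_delay=0.0):
    """Total time when all sub-channels are re-synchronised on every token.

    delays[t][s] is the handshake delay of token t on sub-channel s.
    """
    return sum(max(row) + tree_delay for row in delays)


def sliced_time(delays):
    """Total time when sub-channels run independently (slowest sub-channel wins)."""
    n_sub = len(delays[0])
    return max(sum(row[s] for row in delays) for s in range(n_sub))


def random_delays(tokens, n_sub, seed=0):
    """Hypothetical delay samples, uniformly spread around 1.0 time unit."""
    rng = random.Random(seed)
    return [[rng.uniform(0.8, 1.2) for _ in range(n_sub)] for _ in range(tokens)]
```

For the two-token example `[[1, 3], [3, 1]]`, the synchronised model takes 6 time units and the sliced model 4, because a sum of per-token maxima is never smaller than the maximum of per-channel sums.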
Channel Slicing (2)
Channel Slicing (3)
[Figure: sub-channels]
The Wormhole Router
[Figure: router block diagram with 5 input ports (d_i_0/ack_i_0 to d_i_4/ack_i_4), 5 output ports (d_o_0/ack_o_0 to d_o_4/ack_o_4), ctl blocks, and arbiters; bus labels 80 and 16; labels N, N+1, N+2]
• Faraday 130 nm
• 5 32-bit ports
• 3 routers:
  – Synchronised
  – Channel Sliced
  – Plus lookahead
Area Results
• Channel Slicing: 23%
  – extra controllers in input buffer
  – increased wire count in crossbar
• Lookahead: 5.3%
  – extra AND gates and C2P elements on critical path
Speed Results
• Synchronised: 345 MHz
• Channel Slicing: 450 MHz
• ChSlice+LH: 590 MHz
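The three speed figures can be turned into cycle periods and relative speed-ups with simple arithmetic. The snippet below only restates the numbers on the slide; the helper names are hypothetical.

```python
# Equivalent speeds from the slide, in MHz.
speeds_mhz = {"Synchronised": 345, "Channel Slicing": 450, "ChSlice+LH": 590}


def period_ns(f_mhz: float) -> float:
    """Cycle period in nanoseconds for a frequency given in MHz."""
    return 1e3 / f_mhz


def speedup(name: str, baseline: str = "Synchronised") -> float:
    """Speed of `name` relative to the baseline router."""
    return speeds_mhz[name] / speeds_mhz[baseline]


for name, f in speeds_mhz.items():
    print(f"{name}: {period_ns(f):.2f} ns, x{speedup(name):.2f}")
```

The periods come out at roughly 2.90 ns, 2.22 ns, and 1.69 ns, so channel slicing is about 30% faster than the synchronised router and channel slicing with lookahead about 71% faster.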
Compare with Other Routers
• Asynchronous cell library: constrains the adaptation to other projects (ANoC, ASPIN)
• Bundled-data: less tolerant to variation (MANGO, QNoC, ASPIN)
• Customized design: design complexity (ASPIN)
Data Width Effect
[Figure: cycle period (ns, axis 0.0 to 4.0) versus data width of ports (bit, axis 0 to 140) for ChSlice + LH, Channel Slicing, and Synchronised]
Index
• Channel Slicing
  – Asynchronous NoCs and routers
  – Channel Slicing
  – A wormhole router design
• Spatial Division Multiplexing (SDM) (current)
  – Motives
  – Switching networks
  – 2-stage Clos network
  – The distributed scheduler
  – Implementation results
SDM: Motivation (1)
• The problems that the wormhole router cannot handle:
  – QoS, delay and throughput guaranteed services
  – Fault-tolerance
  – Network efficiency
Motivation (2)
[Figure: router architectures, with panel labels Wormhole, Virtual Channel, and SDM. Recoverable labels: Switch Allocator and PxP Crossbar with port width W; Input Buffer, Switch Scheduler, and MPxMP Switching Network with labels M and W/M; Input Ports 0 to P-1 and Output Ports 0 to P-1]
Motivation (3) – Problems of VC
• Pipelines are synchronised
• Area overhead
• QoS (complicated arbiters)
• TDMA (time slot definition)
• Fault-tolerance (partial faulty link)
[Figure: virtual channel router with Input Buffer, VC Allocator, Switch Allocator, and PxP Crossbar; Input Ports 0 to P-1 and Output Ports 0 to P-1; labels M and W]
Motivation (4) – Benefits of SDM
• Delay and throughput guarantee
• Fault-tolerance
• Speed (Channel slicing)
• Area
• Link efficiency
  – interrupts
[Figure: SDM router with Input Buffer, Switch Scheduler, and MPxMP Switching Network; Input Ports 0 to P-1 and Output Ports 0 to P-1; labels M and W/M]
Motivation (4) – Problems of SDM
• Area overhead
  $C_{CB} = P^2 \times W$
  $C_{SDM} = M \times P^2 \times W$
• Scheduling Algorithm
  – Wormhole (P to 1)
  – SDM (MP to M)
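The two cost expressions on this slide can be checked numerically. The sketch below assumes that cost counts crosspoints times wire width, that a wormhole crossbar is P x P with W-bit ports, and that the SDM switching network is MP x MP with W/M-bit sub-channels, as in the Motivation (2) figure. The function names are hypothetical.

```python
def crossbar_cost(P: int, W: int) -> int:
    """Cost of a P x P crossbar with W-bit ports: P^2 * W."""
    return P * P * W


def sdm_cost(P: int, W: int, M: int) -> int:
    """Cost of an MP x MP switching network with (W/M)-bit sub-channels."""
    if W % M:
        raise ValueError("W must be divisible by M")
    # (M*P)^2 crosspoints, each W/M bits wide, which simplifies to M * P^2 * W.
    return (M * P) ** 2 * (W // M)
```

With P = 5, W = 32, and M = 4 the crossbar cost is 800 and the SDM cost is 3200, M times larger, which is the area overhead this slide points at.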
Index
• Channel Slicing
  – Asynchronous NoCs and routers
  – Channel Slicing
  – A wormhole router design
• Spatial Division Multiplexing (SDM)
  – Motives
  – Switching networks (current)
  – 2-stage Clos network
  – The distributed scheduler
  – Implementation results
SDM: Switching Networks
• Strict Non-Blocking (SNB)
  – An input port and an output port are always connectable
• Rearrangeable Non-Blocking (RNB)
  – An input port and an output port are connectable, possibly after changing existing connections
• Blocking
  – Not all input ports and output ports are connectable in certain cases
Crossbar
• SNB
$C_{CB} = N^2 \times W$
Clos Network
SNB/RNB
C(m,n,k)
N = nk
SNB: m >= 2n-1
RNB: m = n
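The C(m,n,k) parameters can be turned into a crosspoint count and a blocking class. The sketch below assumes the usual symmetric three-stage Clos structure: k input switches of size n x m, m middle switches of size k x k, and k output switches of size m x n, so that N = nk. The slide lists m = n for RNB; the code accepts any m >= n, which is the usual statement of the rearrangeability condition. Wire width is ignored and the function names are hypothetical.

```python
def clos_crosspoints(m: int, n: int, k: int) -> int:
    """Crosspoints of a symmetric three-stage Clos network C(m, n, k)."""
    input_stage = k * (n * m)    # k switches, each n x m
    middle_stage = m * (k * k)   # m switches, each k x k
    output_stage = k * (m * n)   # k switches, each m x n
    return input_stage + middle_stage + output_stage


def clos_class(m: int, n: int) -> str:
    """Blocking class from the conditions on the slide."""
    if m >= 2 * n - 1:
        return "SNB"
    if m >= n:
        return "RNB"
    return "blocking"
```

For example, C(3,2,2) has N = 4 ports, is SNB, and needs 36 crosspoints under this count, while C(2,2,2) is RNB with 24.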
Benes Network
Multi-stage Clos C(2,2,4) + 2C(2,2,2)
RNB
Area of Switching Networks
Problems of all Switching Networks
• Crossbar
  – Area ~ $N^2$
  – Easy to schedule
• Clos
  – Area ~ $N^{1.5}$
  – Difficult but possible to schedule by hardware
  – Optimal area is reached when [condition missing from the transcript]
• Benes
  – Area ~ $N \log N$
  – Impossible to schedule by hardware (microprocessor)
  – Optimal area is reached when $N = 2^n$
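The difference between the $N^2$ and $N \log N$ scalings can be made concrete with crosspoint counts. This is a sketch assuming a standard Benes network built from 2 x 2 switches, with $2\log_2 N - 1$ stages of N/2 switches each, and ignoring wire width. The function names are hypothetical.

```python
def crossbar_crosspoints(N: int) -> int:
    """Crosspoints of an N x N crossbar."""
    return N * N


def benes_crosspoints(N: int) -> int:
    """Crosspoints of an N x N Benes network of 2 x 2 switches (N a power of two)."""
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of two, at least 2")
    log2_n = N.bit_length() - 1
    stages = 2 * log2_n - 1
    switches_per_stage = N // 2
    return stages * switches_per_stage * 4   # 4 crosspoints per 2 x 2 switch
```

For N = 8 the counts are 80 (Benes) against 64 (crossbar); for N = 64 they are 1408 against 4096, so under this count the Benes advantage only appears at larger port counts.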
Index
• Channel Slicing, the wormhole router
• Spatial Division Multiplexing (SDM)
  – Motives
  – Switching networks
  – 2-stage Clos network (current)
  – The distributed scheduler
  – Implementation results
SDM: 2-stage Clos Network
[Figure: 2-stage Clos switching network with five M x M IMs (SIM, WIM, NIM, EIM, LIM), M CMs of size 5 x 5 (CM(0), ..., CM(r), ..., CM(M-1)), and five OMs (SOM, WOM, NOM, EOM, LOM)]
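The module sizes in this figure allow a rough crosspoint estimate. The sketch below assumes P IMs of size M x M and M CMs of size P x P, and that the OMs contain no switches. That last point is an assumption, since the transcript gives no size for them. Every crosspoint is taken to be W/M bits wide, so width cancels when comparing with an MP x MP crossbar. The function names are hypothetical.

```python
def two_stage_clos_crosspoints(P: int, M: int) -> int:
    """Crosspoints of P input modules (M x M) plus M central modules (P x P)."""
    return P * (M * M) + M * (P * P)


def sdm_crossbar_crosspoints(P: int, M: int) -> int:
    """Crosspoints of a single MP x MP crossbar serving the same sub-channels."""
    return (P * M) ** 2
```

For the 5-port router with M = 4 this gives 180 crosspoints against 400 for the crossbar. The estimate says nothing about scheduler area.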
Area Comparison
Benefits of the 2-stage Clos Network
• Minimal area when M <= 16
• Only 2 stages, so latency is reduced
• Latency is bounded
• The scheduling algorithm is also simplified
• The CMs could be further reduced
• It is an RNB network; an SNB network requires 3 stages
Index
• Channel Slicing, the wormhole router
• Spatial Division Multiplexing (SDM)
  – Motives
  – Switching networks
  – 2-stage Clos network
  – The distributed scheduler (current)
  – Implementation results
SDM: Scheduling Algorithms
• Optimized algorithms
  – Always reach the optimal configuration, in which every possible connection is configured
  – Time complexity $O(N^2)$
  – Normally software based ([Leroy 2008]: microprocessor, 64 ports, 50 us)
• Heuristic algorithms
  – Capable of configuring part of the possible connections with less time and area
  – Time complexity $O(N)$ ~ $O(\log N)$
  – Normally hardware implementable, distributed, and scalable
Synchronous Dispatch Algs.
[Figures: sequence of 5 slides]
Problems of Sync Algs.
• Iterations are synchronised.
• The requests from IMs are blind and greedy.
• CMs are blind and greedy too.
• Multiple requests are sent out by IMs.
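The effect of blind, greedy requests can be illustrated with a deliberately simple random dispatch model. This is not the algorithm shown on the preceding slides, whose figures are not in the transcript. It is a hypothetical stand-in that assumes one link from each IM to each CM and one link from each CM to each OM, as in the 2-stage Clos network, and it processes requests sequentially within an iteration rather than in parallel.

```python
import random


def dispatch(requests, n_cm, iterations, seed=0):
    """Try to route (im, om) requests through randomly chosen CMs.

    Returns (granted, pending), where granted holds (im, cm, om) paths and
    pending holds the requests still unserved after the last iteration.
    """
    rng = random.Random(seed)
    im_links = set()   # (im, cm) links already in use
    cm_links = set()   # (cm, om) links already in use
    granted = []
    pending = list(requests)
    for _ in range(iterations):
        still_pending = []
        for im, om in pending:
            cm = rng.randrange(n_cm)   # blind choice: no knowledge of CM state
            if (im, cm) in im_links or (cm, om) in cm_links:
                still_pending.append((im, om))   # lost this iteration, retry later
            else:
                im_links.add((im, cm))
                cm_links.add((cm, om))
                granted.append((im, cm, om))
        pending = still_pending
    return granted, pending
```

With a single CM, `dispatch([(0, 1), (0, 2), (1, 1)], n_cm=1, iterations=3)` grants only the first request: the second needs the same IM-to-CM link and the third the same CM-to-OM link, and extra iterations cannot help because no alternative CM exists.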
Asynchronous Scheduling Alg.
[Figures: sequence of 5 slides]
• The IM scheduler and the CM schedulers are independent
• The scheduling algorithm can support an arbitrary number of CMs
• Lower transition rate than synchronous schedulers
IM scheduler (1)
[Figure: IM scheduler circuit with signals IMr[0][0] to IMr[VCN-1][SN-1] and IMrb[0][0] to IMrb[VCN-1][SN-1]]
IM scheduler (2)
[Figure: IM scheduler circuit. Signal names: h[i][j][k], CMrKeep[i][k][j], CMrMx[i][k][j], CMrME[k][i], CMr[k][i], CMa[k][i], CMs[k][i], CMsb[i][k], cfgMx[j][k][i], cfg[j][k], IMa[j], IMr[j][i], IMrb[j][i]; index bounds VCN-1, SN-1, and CMN-1]
CM scheduler
Index
• Channel Slicing, the wormhole router
• Spatial Division Multiplexing (SDM)
  – Motives
  – Switching networks
  – 2-stage Clos network
  – The distributed scheduler
  – Implementation results (current)
SDM: implementation (1)
• Faraday 130 nm
• Wormhole, SDM crossbar, and SDM Clos
• 64-bit ports, 4 virtual circuits/port
• Synthesized with Design Compiler
• SystemVerilog for the testbench
• Switches are back-annotated with latency from synthesis
SDM: implementation (2)
Network Performance (1)
Network Performance (2)
Conclusion of Results
• SDM outperforms Wormhole with short frames and local traffic
• The connection loss from SNB to RNB is significant
• SDM is good at GT traffic; this work is the first step towards a QoS router
• How to configure the SDM to settle GT paths is the next problem
References
• Channel Slicing:
  – ASP-DAC 2010.
  – UK Async Forum, 2009.
  – International Symposium on SOC, 2009.
• SDM
  – In submission to ASYNC 2010.
Questions?