from channel slicing to spatial division...

53
2009-12-2 Advanced Processor Technology Group The School of Computer Science From Channel Slicing to From Channel Slicing to Spatial Division Multiplexing Spatial Division Multiplexing -- -- the asynchronous router design the asynchronous router design Wei Song 03/12/2009

Upload: others

Post on 01-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

From Channel Slicing to From Channel Slicing to Spatial Division MultiplexingSpatial Division Multiplexing

---- the asynchronous router designthe asynchronous router design

Wei Song03/12/2009

Page 2: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IndexIndex

•• Channel SlicingChannel Slicing– Asynchronous NoCs and routers– Channel Slicing– A wormhole router design

• Spatial Division Multiplexing (SDM)– Motives– Switching networks– 2-stage Clos network– The distributed scheduler– Implementation results

Page 3: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Asynchronous Asynchronous NoCsNoCs

• GALS• Full async comm fabric • QDI pipelines

• Low dynamic power• Tolerance to variation• Fast prototype

Page 4: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

SynchronisedSynchronised QDI PipelinesQDI Pipelines

8

4

Nangate Cell Lib 65nm 1-of-4

Page 5: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Channel Slicing (1)Channel Slicing (1)

• Remove the C-element tree• Sub-channels run

independently

Page 6: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Channel Slicing (2)Channel Slicing (2)

Page 7: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Channel Slicing (3)Channel Slicing (3)

sub-channels

Page 8: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

The Wormhole RouterThe Wormhole Router

arbiter

arbiter

5 input ports

5 output ports

ctl

ctl

80

16

80

16

80

16

80

16

d_i_0

ack_i_0

d_i_4

ack_i_4

d_o_0

ack_o_0

d_o_4

ack_o_4

• Faraday 130 nm• 5 32-bit ports• 3 routers:

– Synchronised– Channel Sliced– Plus lookahead

N N+1

N+2

Page 9: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Area Results

Channel Slicing: 23%extra controllers in input buffer increased wire count in crossbar

Lookahead: 5.3%extra AND gates and C2P elements on critical path

Page 10: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Speed Results

Synchronised: 345MHzChannel Slicing: 450MHzChSlice+LH: 590MHz

Page 11: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Compare with Other Routers

Asynchronous cell library: constrains the adaptation to other projects ANoC, ASPIN

Bundled-data: less tolerant to variationMANGO, QNoC, ASPIN

Customized design: design complexityASPIN

Page 12: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Data Width Effect

0 20 40 60 80 100 120 1400.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0C

ycle

Per

iod

(ns)

Data Wdith of Ports (bit)

ChSlice + LH Channel Slicing Synchronised

Page 13: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IndexIndex

• Channel Slicing– Asynchronous NoCs and routers– Channel Slicing– A wormhole router design

•• Spatial Division Multiplexing (SDM)Spatial Division Multiplexing (SDM)– Motives– Switching networks– 2-stage Clos network– The distributed scheduler– Implementation results

Page 14: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

SDM: Motivation (1)SDM: Motivation (1)

• The problems that the wormhole router cannot handle:– QoS, delay and throughput guaranteed services– Fault-tolerance– Network efficiency

Page 15: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Motivation (2)Motivation (2)

Switch AllocatorInput Port 0

Input Port P-1

Output Port 0

Output Port P-1PxP

Crossbar

W

W

Input Buffer

Switch Scheduler

Input Port 0

Input Port P-1

Output Port 0

Output Port P-1

M

MPxMP

Switching Network

W/M

W/M

Wormhole

Virtual Channel SDM

Page 16: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Motivation (3) Motivation (3) –– Problems of VCProblems of VC

• Pipelines are synchronised

• Area overhead• QoS (complicated

arbiters)• TDMA (time slot

definition)• Fault-tolerance (partial

faulty link)

Input Buffer

VC Allocator

Switch Allocator

Input Port 0

Input Port P-1

Output Port 0

Output Port P-1

M

PxP

Crossbar

W

W

Page 17: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Motivation (4) Motivation (4) –– Benefits of SDMBenefits of SDM

• Delay and throughput Guarantee

• Fault-tolerance• Speed (Channel slicing)• Area• Link efficiency

– interruptsInput Buffer

Switch Scheduler

Input Port 0

Input Port P-1

Output Port 0

Output Port P-1

M

MPxMP

Switching Network

W/M

W/M

Page 18: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Motivation (4) Motivation (4) –– Problems of SDMProblems of SDM

• Area overhead

• Scheduling Algorithm– Wormhole (P to 1)– SDM (MP to M)

2CBC P W= ×

2SDMC M P W= × ×

Page 19: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IndexIndex

• Channel Slicing– Asynchronous NoCs and routers– Channel Slicing– A wormhole router design

• Spatial Division Multiplexing (SDM)– Motives–– Switching networksSwitching networks– 2-stage Clos network– The distributed scheduler– Implementation results

Page 20: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

SDM: Switching NetworksSDM: Switching Networks

• Strict Non-Blocking (SNB)– An input port and an output port is always

connectable• Rearrangeable Non-Blocking (RNB)

– An input port and an output port is connectable with possible changes on existing connections

• Blocking– Not all input ports and output ports are connectable

under certain cases

Page 21: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

CrossbarCrossbar

• SNB2

CBC N W= ×

Page 22: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

ClosClos NetworkNetwork

SNB/RNB

C(m,n,k)

N = nk

SNB: m >= 2n-1

RNB: m = n

Page 23: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Benes NetworkBenes Network

Multi-stage Clos C(2,2,4) + 2C(2,2,2)

SNB

Page 24: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Area of Switching NetworksArea of Switching Networks

Page 25: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Problems of all Switching NetworksProblems of all Switching Networks

• Crossbar– Area ~ N2

– Easy to schedule• Clos

– Area ~ N1.5

– Difficult but possible to schedule by hardware– Optimal area is reached when

• Benes– Area ~ NlogN– Impossible to schedule by hardware (microprocessor)– Optimal area is reached when N=2n

Page 26: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IndexIndex

• Channel Slicing, the wormhole router• Spatial Division Multiplexing (SDM)

– Motives– Switching networks–– 22--stage stage ClosClos networknetwork– The distributed scheduler– Implementation results

Page 27: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

SDM: 2SDM: 2--stage stage ClosClos NetworkNetwork

M M

SIM

M M

WIM

M M

NIM

M M

EIM

M M

LIM

5 5

CM(0)

5 5

CM(r)

5 5

CM(M-1)

SOM

WOM

NOM

EOM

LOM

Page 28: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Area ComparisonArea Comparison

Page 29: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Benefits of the 2Benefits of the 2--stage stage ClosClos NetworkNetwork

• Minimal area when M <= 16• Only have 2-stages, latency is reduced• Latency bounded• Scheduling algorithm is also simplified• The CMs could be further reduced

• It is a RNB network. An SNB network requires 3 stages

Page 30: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IndexIndex

• Channel Slicing, the wormhole router• Spatial Division Multiplexing (SDM)

– Motives– Switching networks– 2-stage Clos network–– The distributed schedulerThe distributed scheduler– Implementation results

Page 31: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

SDM: Scheduling AlgorithmsSDM: Scheduling Algorithms

•• Optimized algorithmsOptimized algorithms– Always reach the optimal configuration that every possible

connection is configured– Time complexity O(N2)– Normally software based ( [Leroy 2008] microprocessor,

64 ports, 50us)•• Heuristic algorithmsHeuristic algorithms

– Capable of configuring part of the possible connections with less time and area

– Time complexity O(N) ~ O(logN)– Normally hardware implementable, distributed, and

scalable

Page 32: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Synchronous Dispatch Synchronous Dispatch AlgsAlgs..

Page 33: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Synchronous Dispatch Synchronous Dispatch AlgsAlgs..

Page 34: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Synchronous Dispatch Synchronous Dispatch AlgsAlgs..

Page 35: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Synchronous Dispatch Synchronous Dispatch AlgsAlgs..

Page 36: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Synchronous Dispatch Synchronous Dispatch AlgsAlgs..

Page 37: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Problems. Of Sync Problems. Of Sync AlgsAlgs..

• Iterations are synchronised.• The requests from IMs are blind and greedy.• CMs are blind and greedy too.• Multiple requests are sent out by IMs

Page 38: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Asynchronous Scheduling Alg.Asynchronous Scheduling Alg.

Page 39: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Asynchronous Scheduling Alg.Asynchronous Scheduling Alg.

Page 40: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Asynchronous Scheduling Alg.Asynchronous Scheduling Alg.

Page 41: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Asynchronous Scheduling Alg.Asynchronous Scheduling Alg.

Page 42: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Asynchronous Scheduling Alg.Asynchronous Scheduling Alg.

• IM scheduler and CM schedulers are independent

• The scheduling algorithm can support arbitrary number of CMs

• Less transition rate than synchronous schedulers

Page 43: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IM scheduler (1)IM scheduler (1)

IMrb[VCN-1][SN-1]

IMr[0][0]

IMr[0][SN-1]

IMr[VCN-1][0]

IMr[VCN-1][SN-1]

IMrb[0][0]

IMrb[0][SN-1]

IMrb[VCN-1][0]

Page 44: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IM scheduler (2)IM scheduler (2)

h[i][j][k]

CMrKeep[i][k][j]CMrMx[i][k][j]

CMrMx[i][k][0]

CMrMx[i][k][VCN-1]

CMrME[k][i]

CMrME[k][0]

CMrME[k][SN-1]

CMr[k][0]

CMr[k][SN-1]

CMa[k][i]cfgMx[j][k][i]

cfgMx[j][k][0]

cfgMx[j][k][SN-1]

cfg[j][k]

cfg[j][0]cfg[j][CMN-1]

IMa[j]

CMrMx[i][k][j]CMr[k][i]

IMrb[j][i]CMrMx[j][0][i]

CMrMx[j][CMN-1][i]

CMs[k][i]CMs[k][i]

IMr[j][i]CMrKeep[i][k][j]

CMrME[k][i]CMsb[i][k]

Page 45: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

CM schedulerCM scheduler

Page 46: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

IndexIndex

• Channel Slicing, the wormhole router• Spatial Division Multiplexing (SDM)

– Motives– Switching networks– 2-stage Clos network– The distributed scheduler–– Implementation resultsImplementation results

Page 47: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

SDM: implementation (1)SDM: implementation (1)

• Faraday 130nm• Wormhole, SDM crossbar, and SDM Clos• 64-bit ports, 4 virtual circuits/port• Design Compiler synthesized• System Verilog for testbench• Switches are back-annotated with latency from

synthesis

Page 48: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

SDM: implementation (2)SDM: implementation (2)

Page 49: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Network Performance (1)Network Performance (1)

Page 50: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Network Performance (2)Network Performance (2)

Page 51: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Conclusion of ResultsConclusion of Results

• SDM outperforms Wormhole with short frames and local traffic

• The connection loss from SNB to RNB is significant

• SDM is good at GT traffic, this work is the first step to a QoS router

• How to configure the SDM to settle GT paths is the next problem.

Page 52: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

ReferencesReferences

• Channel Slicing:– ASP-DAC 2010. – UK Async Forum, 2009. – International Symposium on SOC, 2009.

• SDM– In submission to ASYNC 2010.

Page 53: From Channel Slicing to Spatial Division Multiplexingwsong83.github.io/presentation/sdmnoc20091203.pdf• SDM outperforms Wormhole with short frames and local traffic • The connection

2009-12-2Advanced Processor Technology GroupThe School of Computer Science

Questions?Questions?