TKT-2431 SoC Design, Lec 10: On-chip communication


Page 1: TKT-2431 SoC Design
Lec 10: On-chip communication
Erno Salminen, Tero Arpinen
Department of Computer Systems, Tampere University of Technology
Fall 2010

Page 2: Copyright notice
Part of the slides adapted from slide sets
  by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley, http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
  by Timo D. Hämäläinen, Managing On-Chip Communications, SoC Symposium, Tampere, 19.11.2003

Page 3: Copyright (2)
Part of figures from
  L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, Computer, Vol. 35, Iss. 1, Jan. 2002, pp. 70-78.
  V. Lahtinen, Design and Analysis of Interconnection Architectures for On-Chip Digital Systems, PhD Thesis, Tampere University of Technology, Department of Information Technology, June 2004. http://www.tkt.cs.tut.fi/research/daci/pub_open/lahtinen_thesis.pdf
  W. Wolf, A.A. Jerraya, G. Martin, "Multiprocessor System-on-Chip (MPSoC) Technology," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008.

Erno Salminen - Nov. 2010

Page 4: Contents
Problem statement
Physical limitations
Network-on-chip (NoC)
Extra
See also:
  E. Salminen, A. Kulmala, T.D. Hämäläinen, "Survey of Network-on-chip Proposals", white paper, OCP-IP, [online]: http://www.ocpip.org/socket/whitepapers/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, 2008, 13 pages.
  E. Salminen, A. Kulmala, T.D. Hämäläinen, "On Network-on-chip comparison", Euromicro Conf. on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 503-510. http://daci.digitalsystems.cs.tut.fi:8180/pubfs/fileservlet?download=true&filedir=dacifs&freal=Salminen_-_On_Network-on-chip_compar.pdf&id=82519

Page 5: At first
Make sure that simple things work before even trying more complex ones

Page 6: Problem statement - SoC complexity
SoC consists of heterogeneous components
Varying communication requirements/profiles
Not all components communicate with each other
[Figure: SoC with Mem_1 ... Mem_N, Periph_1 ... Periph_N, Proc_1 ... Proc_N and Acc_1 ... Acc_N attached to a communication network]

Page 7: Different requirements
1. Varying bandwidth (or throughput)
  Amount of data transferred in unit time [MB/s]
  High requirement between CPU and memory
  Low requirement between CPU and peripheral
2. Different latency expectations
[Figure: Mem_1 ... Mem_N, Periph_1 ... Periph_N, CPU_1 ... CPU_N and Acc_1 ... Acc_N joined by connections marked "High BW" and "Low BW"]

Page 8: Characteristics of offered traffic load
1. Spatial: where the data go; are all sources similar?
2. Temporal: average data rate
3. Temporal: when to transfer
  a) Short bursts of high transfer activity and long periods of inactivity
  b) Transfers with constant sizes and intervals
[Figure, spatial patterns seen from a source (src): a) one dst: neighbor, b) one dst: some, c) few dst, d) send to all]
[Figure, temporal patterns (data amount vs. time): very bursty, moderately bursty, constant bitrate]

Page 9: Basic metric: Latency
Delay between start of transfer and completion
  time (last data ejected) - time (first data enters) [n cycles for transferring d words]
Interrupts usually require low latency
Cache fills require low latency
Real-time systems require guaranteed latency (always below some limit)
Stream data (voice, video) may require constant latency (low jitter)
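The latency definition and the two special requirements (guaranteed latency, low jitter) can be sketched in a few lines. This is only an illustration: the function names and the cycle stamps are made up for the example, and jitter is taken here as the peak-to-peak variation, which is one of several possible definitions.

```python
def transfer_latency(first_in_cycle, last_out_cycle):
    """Latency in cycles: time(last data ejected) - time(first data enters)."""
    if last_out_cycle < first_in_cycle:
        raise ValueError("transfer cannot complete before it starts")
    return last_out_cycle - first_in_cycle


def meets_guarantee(latencies, limit):
    """Guaranteed latency: every observed transfer stays at or below the limit."""
    return all(lat <= limit for lat in latencies)


def jitter(latencies):
    """Peak-to-peak latency variation (assumed definition); 0 means constant latency."""
    return max(latencies) - min(latencies)


# Hypothetical (first data enters, last data ejected) cycle stamps of four transfers.
stamps = [(0, 12), (20, 31), (40, 55), (60, 72)]
latencies = [transfer_latency(a, b) for a, b in stamps]
```

With these stamps a real-time limit of 16 cycles holds, a limit of 14 cycles does not, and a stream would see 4 cycles of jitter.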

Page 10: Measuring load-latency behavior
Traffic generator mimics IPs
  Sends data
  Receives data
One should
1. include the latency of network interface (NI)
2. exclude the headers when calculating traffic load
3. measure the latency of the whole transfers (which may be several packets, i.e. at least one full packet, not just header latency)
4. include "infinite" buffer at source to avoid throttling
[Salminen, On the credibility of load-latency measurements, SoC, 2008]
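Rules 2 and 3 can be made concrete with a small bookkeeping sketch. The `Packet` record and its field names are hypothetical, not taken from the cited paper. The inject stamp is assumed to be taken when the packet enters the (unbounded) source buffer, and the eject stamp when its last word leaves the destination network interface, so NI latency is included.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    header_words: int
    payload_words: int
    inject_cycle: int  # packet enters the source buffer
    eject_cycle: int   # last word leaves the destination NI


def offered_load(packets, cycles):
    """Payload words per cycle; header words are deliberately not counted."""
    return sum(p.payload_words for p in packets) / cycles


def transfer_latency(packets):
    """Latency of a whole transfer, which may span several packets."""
    first_in = min(p.inject_cycle for p in packets)
    last_out = max(p.eject_cycle for p in packets)
    return last_out - first_in


# Hypothetical transfer of 12 payload words sent as three packets.
transfer = [Packet(1, 4, 0, 9), Packet(1, 4, 5, 14), Packet(1, 4, 10, 21)]
```

Counting the headers as well would report a load of 15/24 instead of 12/24 words per cycle for the same useful traffic, which is why they are excluded.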

Page 11: Measured load-latency curve
Network saturates when the traffic load gets too high
  Latency approaches infinity
Certain bounds can be derived analytically
Of course, the goal is minimum latency and maximum saturation point
[Salminen, On the credibility of load-latency measurements, SoC, 2008]
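The shape of such a curve can be imitated with a generic queueing-style approximation. This is not the measured curve from the cited paper and not a bound derived for any particular network; the formula, the function name, and the parameter values are assumptions chosen only to show latency growing without limit near the saturation point.

```python
import math


def latency_model(load, zero_load_latency, saturation_load):
    """Illustrative model: latency = T0 / (1 - load / saturation_load)."""
    if load < 0:
        raise ValueError("load must be non-negative")
    if load >= saturation_load:
        return math.inf  # network saturated, source queues grow without limit
    return zero_load_latency / (1.0 - load / saturation_load)


# Hypothetical network: 10 cycles at zero load, saturating at 0.4 words/cycle/node.
curve = [(load, latency_model(load, 10, 0.4)) for load in (0.0, 0.2, 0.4)]
```

In this model a better network shows up as a lower zero-load latency, a higher saturation load, or both.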

Page 12: Physical limitations

Page 13: ITRS 2003: Interconnect
Chip cross-section
Several metal layers - less congestion
Hierarchical scaling
  Wires on top levels are wider and taller than on lower levels
Top layers for
  Power supply
  Clock
  Global signals
[Figure: chip cross-section with transistors below the metal layers]

Page 14: ITRS 2003: Interconnect
NOTE! Very important!
Delay of global wires does not scale with technology
[Figure: curves for gate, local signals, global signals, and global signals with repeaters (bigger area and energy)]

Page 15: Several clock domains
Not possible/practical to use same clock in every component
GALS - Globally asynchronous, locally synchronous
  Components have local clocks
  Communication needs handshaking/synchronization
[Figure: Mem_1 ... Mem_N, Periph_1 ... Periph_N, Proc_1 ... Proc_N and Acc_1 ... Acc_N, marked "High freq" and "Low freq"]

Page 16: Energy breakdown forecast
[Figure: energy breakdown forecast, annotated "compare"]
[Mattan Erez, Stream Architectures - Programmability and Efficiency, Tampere SoC, Nov. 17 2004]

Page 17: Localization
Communication must be localized to avoid long wires, which
  consume much energy
  are slow, prone to error, cause routing congestion
Several small components instead of few large
Communication between non-neighboring components requires many hops
[Mattan Erez, Stream Architectures - Programmability and Efficiency, Tampere SoC, Nov. 17 2004]

Page 18: Reliability problems
"Synchronization failures between clock domains will be rare but unavoidable" - Benini
Electrical noise due to crosstalk, electromagnetic interference, radiation...
  Data errors or upsets, soft errors
Data transfers become unreliable and nondeterministic
Design needs both deterministic and stochastic models

Page 19: Achieving reliability
Today, designers use physical techniques to overcome reliability problems
  Wire sizing
  Length optimization
  Repeater insertion
  Shielding
  Data coding
  Bunch of others...
  Huge design effort required
In (near) future, 100% reliability on physical level cannot be afforded anymore
Reliability must be increased with additional HW or SW layers
  Error detecting/correcting codes
  Retransmissions
  Request/acknowledge and time-out counters
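A minimal sketch of the layered approach: an error-detecting code (a single even-parity bit), retransmission on a failed check, and a bounded attempt counter standing in for the time-out. The channel model and all names are hypothetical. A single parity bit only detects an odd number of bit errors; a real design would more likely use a CRC or an error-correcting code.

```python
def parity(word):
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


def send_with_retransmission(word, channel, max_attempts=4):
    """Send word plus parity; retransmit until the receiver's check passes.

    Returns (received_word, attempts). Exhausting max_attempts raises
    TimeoutError, which plays the role of the time-out counter.
    """
    for attempt in range(1, max_attempts + 1):
        rx_word, rx_parity = channel(word, parity(word))
        if parity(rx_word) == rx_parity:  # check passed, receiver acknowledges
            return rx_word, attempt
    raise TimeoutError("no error-free transfer within max_attempts")


def make_flaky_channel(bad_transfers):
    """Hypothetical channel that flips bit 0 during its first bad_transfers uses."""
    state = {"left": bad_transfers}

    def channel(word, par):
        if state["left"] > 0:
            state["left"] -= 1
            return word ^ 1, par  # single-bit upset on the data wires
        return word, par

    return channel
```

For example, `send_with_retransmission(0b1011, make_flaky_channel(2))` succeeds on the third attempt, while a channel that corrupts five transfers in a row exhausts the four allowed attempts.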

Page 20: Network-on-chip (NoC)

Page 21: Network-on-Chip (NoC)
Communication network on chip
NoC motivation
1. High fab cost and effort in traditional VLSI
  Design general-purpose platform
2. Flexibility - for changing application needs
3. Concurrency in transfers
4. Only short signal wires due to power and delay problems
5. On-chip wires are no longer reliable
Usually packet-switched, multi-hop network

Page 22: Differences between Multiprocessors and SoC
Multiprocessor systems (past) | System-on-Chip (portable device)
Scalability important after fab (increase nodes) | Scalability an issue only at design time (reuse, easy addition of nodes)
Load balancing and even distribution of computation important for maximum performance | Energy consumption important, idle nodes must be shut down
Communication network used as means of balancing computation and communication (both adjusted for optimal performance) | Computation might already be fixed per node (functional partition). Network serves nodes (only network adjusted)
Dataflow computing | Computation is very heterogeneous, both dataflow and control style
In principle any node can compute a given task | Execution of various applications clustered within SoC (specialized nodes)
Much experience and well established research of routing, switching, scalability | Some research seems to be "re-inventing the wheel" of past multiprocessor research. New challenge: energy saving combined with tailoring according to applications

Page 23: Micronetwork protocol stack
Layers are specialized and optimized according to application (domain)
[Figure: protocol stack ordered by abstraction, with annotations:
  HW dependent SW
  Splitting long transfer into packets, reordering
  Routing
  Arbitration, packetization to increase reliability]

Page 24: NoC terminology
Processing elements exchange messages
Network interface converts messages to/from network-specific packets/streams
Packet consists of several flits (≈words)
Routers communicate via ports, and ports on the boundary of the whole network are called terminals
[Figure: agent(0) and agent(1), each a processing element with a network interface, attached to a communication network of router(0), router(1) and router(2) (degree=4) joined by links; a message becomes packets (or a stream), packets become flits, flits become phits; a port and a terminal are marked]
Abbreviations: fl = flit, flow ctrl unit; ph = phit, physical unit; pkt = packet
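The message, packet and flit hierarchy can be sketched as what a network interface does on each side. The flit layout used here (one header flit carrying destination and sequence number, then one word per payload flit) is invented for the example; real NoCs define their own formats.

```python
def packetize(message, dst, payload_flits_per_packet):
    """Split a message (list of words) into packets, each a list of flits.

    Illustrative format: header flit ("H", dst, seq) followed by
    payload flits ("D", word).
    """
    if payload_flits_per_packet < 1:
        raise ValueError("a packet needs at least one payload flit")
    packets = []
    starts = range(0, len(message), payload_flits_per_packet)
    for seq, start in enumerate(starts):
        chunk = message[start:start + payload_flits_per_packet]
        packets.append([("H", dst, seq)] + [("D", word) for word in chunk])
    return packets


def depacketize(packets):
    """Receiving NI: reorder packets by sequence number and strip headers."""
    ordered = sorted(packets, key=lambda pkt: pkt[0][2])
    return [flit[1] for pkt in ordered for flit in pkt[1:]]


# Hypothetical 10-word message sent to node 3 with 4 payload flits per packet.
message = list(range(10))
packets = packetize(message, dst=3, payload_flits_per_packet=4)
```

The ten words become three packets of 5, 5 and 3 flits, and the receiver rebuilds the message even if the packets arrive in reverse order.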

Page 25: Design choices of NoC
Basic considerations deal with
1. Structure
  topology - logical structure: routers and links (floorplan defines the physical layout)
  router design
2. Control
  routing - which way to take
  flow control and switching - when to transmit

Page 26: Homogeneous network
replication effect
solve realization issues once and for all
less flexible
Problematic if processing units are heterogeneous
  assumes uniform size for components and hence either
  a) wastes area
  b) components have to be split
H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003.

Page 27: Heterogeneous network
common in contemporary SoCs
better fit to application domain - better performance
components are not uniformly sized
hierarchical structure
Are ASICs possible in the future anymore?
H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003.

Page 28: Network topology
Defines
  the components (e.g. routers)
  the connections (e.g. each router connected to 4 neighbours)
Vast number of topologies proposed in literature, but there's no free lunch!
[Figure legend: b=bus, hb=hierarchical bus, r=ring, p=point-to-point, ft=fat-tree, x=crossbar, c=custom, t=2-D torus]

Page 29: Network topology (2)
Can be modeled with graphs
  node = router (+processing unit)
  edge = data stream
Number of nodes denoted with N
Average path length L
  Avg. num of edges between all nodes in graph
  Small L desired for small latency
Average degree <k>
  Avg. num of edges in each switch
  Large <k> may decrease L but implementation gets more complex also
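The graph model lends itself to a short calculation of L and <k>. The sketch below builds router-only graphs and measures hop counts with breadth-first search; the helper names are made up. One caveat, which is an inference from the numbers and not something the slides state: the L and <k> values printed on the later topology slides (e.g. ring L = 6.3, 2-D mesh L = 4.7) are about 2 and 1 larger than the router-only figures computed here, which would be consistent with also counting the links to the attached processing units.

```python
from collections import deque


def ring(n):
    """Bidirectional ring of n routers as an adjacency dict."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def mesh2d(side, wrap=False):
    """side x side 2-D mesh; wrap=True adds wrap-around links (torus)."""
    adj = {}
    for r in range(side):
        for c in range(side):
            nbrs = set()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if wrap:
                    rr, cc = rr % side, cc % side
                if 0 <= rr < side and 0 <= cc < side:
                    nbrs.add((rr, cc))
            adj[(r, c)] = nbrs
    return adj


def hops_from(adj, src):
    """Breadth-first search: hop count from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Mean hop count over all ordered pairs of distinct nodes."""
    n = len(adj)
    total = sum(d for s in adj for d in hops_from(adj, s).values())
    return total / (n * (n - 1))


def average_degree(adj):
    """Mean number of router-to-router links per router."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)
```

For N = 16 this gives roughly 4.27 hops for the ring, 2.67 for the 2-D mesh and 2.13 for the 2-D torus, with router-only degrees of 2, 3 and 4: the torus buys its shorter paths with more links per router.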

Page 30: Metric: Bisection bandwidth
When design is partitioned into two (nearly) equal halves, it is the minimum number of wires which must cross between the halves considering all possible partitions
  Number of nodes in halves differs at most by 1
  Also other definitions...
High number means higher number of possible routes and hence increased bandwidth, flexibility and possibly fault-tolerance
Should increase with the number of nodes in scalable networks
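For small networks the definition can be checked by brute force: try every split into two (nearly) equal halves and keep the smallest number of crossing links. The sketch assumes every edge is one link of unit width, so it counts links instead of individual wires, and it is exponential in the number of nodes, so it is only usable for toy examples. The function name and example graphs are illustrative.

```python
from itertools import combinations


def bisection_width(nodes, edges):
    """Minimum number of edges crossing any (nearly) equal two-way partition."""
    nodes = list(nodes)
    half = len(nodes) // 2  # halves differ by at most one node
    best = None
    for group in combinations(nodes, half):
        side = set(group)
        cut = sum(1 for u, v in edges if (u in side) != (v in side))
        if best is None or cut < best:
            best = cut
    return best


# Three 7- or 8-node example networks.
ring_edges = [(i, (i + 1) % 8) for i in range(8)]
cube_edges = [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < u ^ (1 << b)]
tree_edges = [(p, c) for p in range(3) for c in (2 * p + 1, 2 * p + 2)]
```

The results are 2 for the 8-node ring, 4 for the 3-D hypercube and 1 for the 7-node binary tree, so only the hypercube's value grows with the node count.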

Page 31: Generic router
Forwards data from input ports to outputs
FIFOs can be on either side of the crossbar
  1 FIFO per port is the most common
  virtual channels allow multiple FIFOs per port
Area and delay increase rapidly with the number of ports
[Figure: generic router with input ports, FIFOs, crossbar and output ports, plus routing and arbitrator blocks]

Page 32: Routing algorithm
Selects route from source to destination
1. Deterministic
  Same route always used between source and destination
  e.g. 2-D mesh: first find correct row, then correct column
  All packets arrive in-order
  One blocked (or faulty) link/router blocks all packets on that route
2. Adaptive
  Route varies according to blockage
  Better performance (at least when reordering neglected)
  Better fault-tolerance
  Deadlock avoidance needs extra care
  Data may arrive out-of-order
    Reordering buffers required at receiver
    Buffers may consume large area/energy
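The deterministic 2-D mesh example ("first find correct row, then correct column") can be written out directly. This is a sketch of dimension-ordered routing under the assumption that routers are addressed as (row, col) pairs; the function name is made up.

```python
def row_then_column_route(src, dst):
    """Routers visited from src to dst in a 2-D mesh, as (row, col) pairs.

    Deterministic: the row is corrected first, then the column, so the same
    source/destination pair always yields the same path.
    """
    row, col = src
    path = [(row, col)]
    while row != dst[0]:  # phase 1: move to the correct row
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    while col != dst[1]:  # phase 2: move to the correct column
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    return path


route = row_then_column_route((0, 0), (2, 3))
```

Because the path never changes, packets between two nodes stay in order, but a blocked link on that path stalls all of them; an adaptive router would instead pick among several candidate next hops.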

Page 33: Switching
1. Store-and-forward switching
  Data forwarded when whole packet received
  Whole packet buffered
    increases area and latency
2. Virtual cut-through
  Data forwarded ASAP
  Whole packet buffered if output blocked
3. Wormhole
  Data forwarded ASAP
  Buffer sizes can be independent of the packet size
  Reserves the whole transfer path and hence increases contention
Some schemes drop packets when contention is high
  Highly nondeterministic
  Acknowledges required (roundtrip latency, buffers for retransfers)
  Not recommended in general
Buffering has big impact on NoC performance and router area
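The latency cost of buffering whole packets can be estimated with the usual zero-load textbook model. The assumptions are not from the slides: one flit crosses one link per cycle, there is no contention, and router pipeline delay is ignored. Virtual cut-through and wormhole behave the same under these conditions, because they only differ when an output is blocked.

```python
def store_and_forward_cycles(links, flits):
    """Each hop waits for the whole packet before forwarding it."""
    return links * flits


def cut_through_cycles(links, flits):
    """Header advances one link per cycle; the remaining flits follow in a pipeline."""
    return links + flits - 1


# Hypothetical 8-flit packet crossing 4 links.
saf = store_and_forward_cycles(4, 8)
vct = cut_through_cycles(4, 8)
```

For this packet the estimate is 32 cycles with store-and-forward against 11 cycles when data is forwarded as soon as possible, and the gap widens with both path length and packet size.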

Page 34: Quick terminology quiz
What is in common with the following terms?
  Koala bear
  Whale fish (valaskala in Finnish)
  Wormhole routing
Such things do not exist although many people talk about them
  Koala is marsupial
  Whale is mammal
  Wormhole is switching policy

Page 35: Example topologies

Page 36: (Shared multimaster) bus
Bus = set of signals connected to all devices
Shared resource
  One connection between devices reserves the whole interconnection
  Bandwidth shared among devices
  Bandwidth may be scaled by adding links
Most common SoC network
Low implementation costs, simple
Long signal lines problematic
[Figure: Single bus: N = 16, L = 1, <k> = -. Multiple bus: N = 16, L = 1, <k> = -]

Page 37: Bus arbitration / addr decoding
Arbitration decides which master can use the shared resource (e.g. bus or memory)
  Single-master system does not need arbitration
  E.g. priority, round-robin, TDMA
  Two-level: e.g. TDMA + priority
  May be pipelined with previous transfer
Decoding is needed to determine the target
  Central / distributed schemes
  Address and data are broadcast to every node
  Decoder selects which node reads the data or responds
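Two of the arbitration policies named above, fixed priority and round-robin, can be compared with a small behavioral sketch. The class and function names are made up, the model grants one master per arbitration round, and it leaves out pipelining and TDMA.

```python
def fixed_priority_grant(requests):
    """Grant the requesting master with the lowest index, or None."""
    for idx, req in enumerate(requests):
        if req:
            return idx
    return None


class RoundRobinArbiter:
    """Grants requesting masters in circular order, starting after the last winner."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.last = num_masters - 1  # so that master 0 is checked first

    def grant(self, requests):
        requests = list(requests)
        for offset in range(1, self.num_masters + 1):
            idx = (self.last + offset) % self.num_masters
            if requests[idx]:
                self.last = idx
                return idx
        return None


# Three masters that all request the bus in four consecutive rounds.
arbiter = RoundRobinArbiter(3)
rr_grants = [arbiter.grant([True, True, True]) for _ in range(4)]
prio_grants = [fixed_priority_grant([True, True, True]) for _ in range(4)]
```

Under constant load round-robin grants 0, 1, 2, 0 and so shares the bus evenly, whereas fixed priority grants master 0 every time and starves the others.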

Page 38: Centralized / Distributed
[Figure 2. Centralized vs. distributed control.
  a) Centralized: masters M1, M2, M3 exchange request + grant with an Arbiter; a Decoder drives select to slaves S1, S2, S3
  b) Distributed: agents A1, A2, A3, A4, A5 with local arbiter/decoder or decoder blocks
  M = master, S = slave]

Page 39: Complex bus topologies
Hierarchical bus - several bus segments connected with bridges
  Fast access as long as the target is in the same segment
  Requires locality of accesses
  Theoretical max. speed-up = num of segments
  Segments either circuit- or packet-switched together
  Packet-switching provides more parallelism with added buffering
Split-bus
  No data storage - only three-state buffers
  If switches are non-conducting, smaller effective capacitance and, hence, smaller energy
[Figure: Hierarchical bus (chain): N = 16, L = 2.3, <k> = 2. Hierarchical bus (chain + tree): N = 16, L = 2.1, <k> = 2.5. Split-bus with agents A]

Page 40: Other topologies
Ring: N = 16, L = 6.3, <k> = 3
  Simple layout
  Unidirectional ring may result in long latency
  Good for pipelines
3D hypercube: N = 8, L = 3.7, <k> = 8
  3-D topologies are hard to map on 2-D silicon die
Fully connected, point-to-point network: N = 16, L = 1, <k> = -
  Highest performance
  Clearly not scalable approach

Page 41: Topologies: mesh and torus
2-D mesh and torus are very popular
Simple layout for uniformly sized nodes
Wrap-around wires in torus need special attention
[Figure: 2-D mesh: N = 16, L = 4.7, <k> = 4. 2-D torus: N = 16, L = 4.1, <k> = 5]

Page 42: Topologies: Tree
Trad. tree has bisection bandwidth = 1
  Bottleneck for uniform traffic
  Does not matter when the traffic is localized
Fat-tree has more (or wider) links near root
  Becoming more popular as NoC topology
Trees also constructed so that each node is processing node
[Figure: Rooted, complete, binary tree: N = 16, L = 6.5, <k> = 2.9. Fat tree with butterfly elements and fanout of 2 (binary fat tree): N = 16, L = 6.5, <k> = 3.5]

Page 43: Topologies: static analysis
Some basic properties may be analyzed statically
Simulation with real applications preferred (i.e. dynamic analysis)

Lahtinen 2004: Table 3.2 Performance
Network | Parallel transactions | Longest path | Bisection bandwidth | Links
Single bus | 1 | 1 | 1 | Bi
Multiple bus | e (e ≤ N) | 1 | e | Bi
Hierarchical bus (chain) | e (e ≤ N) | e (e ≤ N) | 1 | Bi
Crossbar | N | N | N-1 | Bi
One-sided crossbar | N | 2N-1 | N/2 | Bi
Binary tree | N | 2 log2(N) | 1 | Bi
Fat tree (fanout 2) | N | 2 log2(N) | N | Bi
Ring | N | N/2+2 | 2 | Bi
3-D hypercube | N | log2(N)+2 | N/2 | Bi
2-D mesh | N | 2N^(1/2) | N^(1/2) | Bi
2-D torus | N | N^(1/2)+2 | 2N^(1/2) | Bi
Point-to-point, fully connected | N | 1 | (N/2)*(N/2) | Bi
Omega network (MIN) | N/2 | log2(N) | N | Uni

Lahtinen 2004: Table 3.3 Implementation costs
Network | Number of switches | Number of wires | Links
Single bus | 0 | 1 | Bi
Multiple bus | 0 | e | Bi
Hierarchical bus (chain) | e-1 | e | Bi
Crossbar | N^2/4 | N^2/2 | Bi
One-sided crossbar | N^2/2 | N^2-N/2 | Bi
Binary tree | N-1 | 2(N-1) | Bi
Fat tree (fanout 2) | N log2(N) | 2N log2(N) | Bi
Ring | N | 2N | Bi
3-D hypercube | N | N+(N/2) log2(N) | Bi
2-D mesh | N | 3N-2N^(1/2) | Bi
2-D torus | N | 3N | Bi
Point-to-point, fully connected | 0 | (N^2-N)/2 | Bi
Omega network (MIN) | (N/4)(log2(N)-1) | (N/2) log2(N) | Uni

Page 44: Daytona (2001), OMAP (2004), MPCore (2005)
[Figure: three block diagrams, annotated "Single bus", "Two buses" and "Single bus"]
W. Wolf et al., "Multiprocessor System-on-Chip (MPSoC) Technology," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701-1713, Oct. 2008

Page 45: Industrial example: Viper by Philips (2001)
Four buses
S. Dutta et al., "Viper: A multiprocessor SOC for advanced set-top box and digital TV systems," IEEE Design & Test of Computers, vol. 18, no. 5, pp. 21-31, Sep-Oct 2001

Page 46: ST Nomadik (2003)
Multiple buses

Page 47: Cell BE by IBM/Sony/Toshiba (2005)
Four rings
F. Khunjush, N.J. Dimopoulos, "Extended characterization of DMA transfers on the Cell BE processor," IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), pp. 1-8, 14-18 April 2008
See also: D. Shippy, M. Phipps, The Race for a New Game Machine: Creating the Chips Inside the XBox 360 and the Playstation 3, Citadel, 2009

Page 48: Tile64 by Tilera (2008)
2-D mesh with 4 DDR controllers for external memories
Tile = 3-wide 32b VLIW, 750 MHz
90nm, 615M tran, 11W
S. Bell et al., TILE64 Processor: A 64-Core SoC with Mesh Interconnect, ISSCC 2008

Page 49: Faust (2009)
Modified 2-D mesh, asynchronous NoC
[E. Beigne et al., An Asynchronous Power Aware and Adaptive NoC Based Circuit, JSSC, 2009]

Page 50: Conclusion
SoC has many components, different requirements
Wire delays and power consumption becoming very problematic
  Big difference between local and global (or off-chip) communication
Fully synchronous approach becoming unfeasible
Network-on-chip = multi-hop on-chip network
  Often packet-switched
  Buffering, routing, and topology are important design decisions

Page 51: NoC Survey
Note: All slides in this set are lecture material!

Page 52: Survey of Network-on-chip proposals [2008]
This paper gives an overview of state-of-the-art regarding the network-on-chip (NoC) proposals.
NoC paradigm replaces dedicated, design-specific wires with scalable, general purpose, multi-hop network. Numerous examples from literature are selected to highlight the contemporary approaches and reported implementation results. The major trends of NoC research and aspects that require more investigations are pointed out.
A packet-switched 2-D mesh is the most used and studied topology so far. It is also a sort of an average NoC currently. Good results and interesting proposals are plenty.
However, large differences in implementation results, vague documentation, and lack of comparison were also observed.
http://www.ocpip.org/uploads/documents/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf

Page 53: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Basic NoC properties

--- clip clip (39 lines omitted in the slide show)---

Page 54: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

NoC implementations

--- clip clip (14 lines omitted in the slide show)---

Page 55: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Average NoC 2008

[Salminen et al. Survey of NoC proposals, OCP-IP, 2008]

Page 56: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Average NoC 2008 (2)

[Salminen et al. Survey of NoC proposals, OCP-IP, 2008]

Page 57: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Case Study

Managing Interconnection Complexity in Heterogeneous IP Block Interconnection (HIBI)

Page 58: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Overview of Managing On-Chip Communications

[Figure: interconnect schemes ordered from simplest to most complex: dedicated point-to-point links, single bus, hierarchical bus structures, regular multi-hop topologies, customized multi-hop structures. Compared properties: network elements, latency & BW, scalability & flexibility, # of IP blocks, network reuse. Labels at the point-to-point end: simple, always guaranteed, limited, limited, IP block specific. Labels at the customized multi-hop end: very complex, best-effort/predictable, arbitrary, arbitrary, design once, general purpose.]

Page 59: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Lessons Learned

Many communication networks have been studied in TUT
  On-chip communication research started 1997
A regular topology can well be fitted to algorithm specific comp/comm balanced implementation
In general case there is no optimal topology
Communication-centric design was successfully conducted for performance
Important to exploit features of application(s) to optimize interconnection
Established parallel processing doctrines can be applied to SoC
SoC challenge is heterogeneity in computation

Page 60: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Interconnection Implementation View

Make lowest level data transfer mechanisms simple and efficient
  Minimum number of signals
  "Every clock edge carries useful data in transaction"
Perform all high-level operations on basic mechanisms
  Layered protocol model, OCP compatible
  Message passing
Use identical HW modules to compose overall interconnection
  Translate IP specific communication operations to network
  Support all (practical) topologies
  No limits to number of IP blocks (whole design)
  Support (re-)configurability
  Fit to all communication needs, from memories to peripherals

"Gives body to build interconnect"

Page 61: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

System Design View

Make interconnection aware of application functionality
A) System design time
  Communication profiled from application processes
  Clustering: localization of communication
  Allocation of communication resources (segments, buffers)
  Optimization of non-reconfigurable parameters
  Initial QoS and other transfer parameters
B) Run time
  Utilize knowledge of predictable communication events if available
  Guaranteed QoS in transfers
  Track communication, change QoS & other parameters if required
  Totally change mode of operation if required
HIBI Design Flow is 80% of the HIBI interconnect scheme

"Gives brains to the communication"

Page 62: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

HIBI Identical Interconnection Modules

HIBI wrapper is the only building block used everywhere in interconnection
  Between network and IP-blocks
  Between network segments
Wrapper is parametrizable, modular, and configurable
  Asynchronous FIFO buffering

[Figure: HIBI network built from HIBI wrappers. Each wrapper connects one IP (P1 ... PN, Mem1 ... MemN, Acc1 ... AccN) through a FIFO / OCP interface.]

Page 63: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

HIBI Network

HIBI network consists of bus segments and bridges
  Transfers in segment synchronous circuit switched
  Transfers across bridges asynchronous packet switched
  Scales from serial point-to-point link to an arbitrary topology
Identical signals between wrappers in network side
  No dedicated point-to-point signals
  All signals shared within network segment
  Wrapper layout is independent of the number of agents
Totally distributed arbitration
  No central arbiter
  Each wrapper is aware of communication details

Page 64: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

HIBI Network Example

[Figure: example HIBI network. Labels: HIBI wrapper, IP block, bridge, clock domain.]

Page 65: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Bus latency

Total latency consists of several phases. From: K. Kuusilinna, PhD Thesis, TUT, 2001.

Phase | Action | Available methods
Arbitration latency | Request bus ownership | Central arbiter, daisy chain, wired-OR, connectionless arbitration
Arbitration latency | Wait for higher priority transactions to complete / Arbitration | Round-robin, hierarchical round-robin, time-slot, fixed priority, adaptive
Arbitration latency | Bus ownership granted | (See Request)
Initial latency | Begin transaction | Address/data multiplexing, handshaking
Initial latency | Wait for master ready / Wait for target ready |
Initial latency | Transfer first data |
Subsequent data latency | Transfer data |
Subsequent data latency | Wait for master ready / Wait for target ready |
Turn-around latency | Drive or wait for the bus to settle to idle state |

Notes in the figure:
  Arbitration: waiting time may be long during high contention
  Subsequent data: repeated until all data has been transferred or a limit for data transfers per burst is reached
  Subsequent data: optimizing this phase has biggest impact in long transfers

Figure: Bus latency
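The phases listed on this slide add up to the latency of one burst. A minimal sketch of that sum, assuming each phase can be summarized by a fixed cycle count; the function name, parameters, and cycle counts are illustrative and do not come from the thesis:

```python
def burst_latency(arbitration, initial, per_word, turnaround, words):
    """Total cycles for one burst of `words` data words.

    Assumed model: the first word is covered by the initial latency and
    every later word adds `per_word` cycles (subsequent data latency).
    """
    if words < 1:
        raise ValueError("a burst carries at least one word")
    return arbitration + initial + (words - 1) * per_word + turnaround


# Hypothetical cycle counts, chosen only for illustration.
short_burst = burst_latency(arbitration=4, initial=2, per_word=1, turnaround=1, words=2)
long_burst = burst_latency(arbitration=4, initial=2, per_word=1, turnaround=1, words=64)
print(short_burst, long_burst)
```

In the long burst most cycles come from the per-word term, which matches the remark that the subsequent data phase matters most in long transfers.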

Page 66: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

HIBI Quality of Service

TDMA (time division multiple access) with freely run-time adjustable frame length and slot durations and allocations
Re-synchronization to application phase
Also traditional priority/round-robin

[Figure: two consecutive time frames on a time axis t. Agents A1, A2, A3 own allocated time slots; the remaining time is competition, resolved by priority or by round-robin.]
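The mix of reserved TDMA slots and competition can be sketched as a cycle-by-cycle arbiter. This is only an illustration under stated assumptions, not the HIBI implementation: a frame is a list with one optional slot owner per cycle, an unused slot falls back to competition, and competition is resolved by round-robin. All names are hypothetical.

```python
def arbitrate(frame, requests, agents):
    """Return the bus owner for each cycle (None = idle).

    frame    -- slot owner per cycle of one time frame, None = competition
    requests -- per cycle, the set of agents that want the bus
    agents   -- agent order used for round-robin in competition cycles
    """
    owners = []
    rr_next = 0  # index of the agent with highest round-robin priority
    for cycle, wanting in enumerate(requests):
        slot_owner = frame[cycle % len(frame)]
        if slot_owner in wanting:
            owners.append(slot_owner)  # reserved slot is used by its owner
            continue
        winner = None
        for i in range(len(agents)):
            candidate = agents[(rr_next + i) % len(agents)]
            if candidate in wanting:
                winner = candidate
                rr_next = (agents.index(candidate) + 1) % len(agents)
                break
        owners.append(winner)
    return owners


frame = ["A1", None, "A2", None]  # hypothetical slot allocation
requests = [{"A1", "A3"}, {"A2", "A3"}, {"A3"}, {"A1", "A2"}]
print(arbitrate(frame, requests, ["A1", "A2", "A3"]))
```

Changing `frame` between calls corresponds to the run-time adjustable frame length and slot allocation mentioned on the slide.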

Page 67: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

HIBI Basic Transfer

Pipelined with arbitration
Split-transactions
Burst transfers
No wait cycles allowed
Non pre-emptive transfers
QoS is guaranteed with TDMA or with a combination of Send Max + Priority/Round-Robin

[Figure: bus contents over time t, showing write (w addr, w data), request (rq addr, rq data), and return (ret addr, ret data) transfers. Annotations: pipeline, split transaction.]

Page 68: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

HIBI Wrapper Structure (v.2)

[Figure: wrapper block diagram. IP side: IP signals in, IP signals out. Buffers: HI prior tx FIFO, LO prior tx FIFO, HI prior rx FIFO, LO prior rx FIFO. Logic: mux, demux, config mem, Tx FSM, addr decoder, Rx FSM. Network side: HIBI signals out, HIBI signals in.]

Page 69: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Wrapper Configuration Memory

Stores all information for distributed arbitration
  Permanent: ROM, 1 page
  Semi run-time configurable: ROM with several pages
  Full run-time configurable: RAM, with pages

[Figure: configuration memory block diagram. Labels: curr page, curr conf, new conf values, demux, mux, conf page, time slot values, time slot mux, time slot logic, time slot signals, cycle counter.]

Page 70: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

HIBI Wrapper Area in ASIC

[Figure: bar chart of wrapper area in gates (axis 0 to 35 000) for data widths 8 b, 16 b, 32 b, and 64 b, with RAM and ROM configuration memory. Three groups: lo prior FIFOs = 3 / 3, hi prior FIFOs = 0 / 0, 1-page mem; lo prior FIFOs = 5 / 5, hi prior FIFOs = 5 / 5, 1-page mem; lo prior FIFOs = 10 / 5, hi prior FIFOs = 10 / 5, 2-page mem.]

Page 71: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Runtime comparison

Salminen et al., SAMOS 2005.

Page 72: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Other notes on NoC

Page 73: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Network topology categories

1. Static networks utilize only point-to-point or shared connection lines
2. Dynamic networks use switches (or routers) for communication
  a) Direct = each processing node connected to switch
  b) Indirect = some switches are not connected directly to any processing node

Page 74: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Problems with Current NoC Discussion

What is "NoC": no common definition
  Something new, good by definition (needs no proof),...
General purpose, but to what extent
  Arbitrary connectivity between any node?
  Uniform overall transfer distribution?
Discussion about "optimal topology"
  Multiprocessor architectures for scientific computations?
  Can massive fine-grain granularity parallelism be utilized in realistic SoC applications?
Copying computer network ideas without criticism
  In-network data buffering, routing tables and algorithms
  Compare to current TCP/IP or past ATM routers!
Toy test case applications
  Billion transistors: executes single FFT?
  Common benchmarks should be designed!

Page 75: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Wiring hierarchy

How far can signal reach in one local clock cycle?
  Depends on
    frequency (i.e. duration of clock cycle)
    Wiring parameters (layer, width, height, density, shielding)
  Not far anyway...
Global wires will function as lossy transmission lines
  RC models of today become inaccurate
  3-D modeling s-l-o-w and difficult

[Figure: wiring layers: global, intermediate, local.]

[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]

Page 76: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Crosstalk impact

Long wires close to each other
Long, fast switching wires
Switching on neighbor wires affects delay
  Delay on wire 4 shown in table 2

P. Liljeberg et al., Self-timed Approach for Noise Reduction in NoC, in "Interconnect-centric design for advanced SoC and NoC", Kluwer, 2004

Page 77: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Transaction latency components

Scalable Multiprocessors, lecture slides, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/

Page 78: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Impact of DMA

[Figure: i) agent structure: CPU core, instr. mem, data mem, DMA, network interface, other periph. ii) timelines of computation (comp) and communication (comm) without DMA (w/o DMA) and with DMA (w/ DMA) for three cases: a) short comm time, b) equal comp and comm time, c) long comm time.]
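The three cases in the figure can be compared with a simple timing model. This is a sketch under stated assumptions: without DMA the CPU alternates computation and communication, with DMA the two overlap perfectly, and DMA setup cost is ignored. The function names and time values are illustrative.

```python
def period_without_dma(comp, comm):
    """CPU handles the transfer itself, so comp and comm are serialized."""
    return comp + comm


def period_with_dma(comp, comm):
    """DMA moves data while the CPU computes; the longer one dominates."""
    return max(comp, comm)


# Hypothetical (comp, comm) durations for the three cases on the slide.
cases = [("a) short comm time", 8, 2),
         ("b) equal comp and comm time", 5, 5),
         ("c) long comm time", 2, 8)]
for label, comp, comm in cases:
    speedup = period_without_dma(comp, comm) / period_with_dma(comp, comm)
    print(f"{label}: speedup {speedup:.2f}")
```

In this model the gain peaks when computation and communication take equal time, and shrinks when either one dominates.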

Page 79: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Retransfer buffers

If packets are dropped or corrupted in delivery, (usually) they have to be retransferred
  Variable latencies problematic: is the packet dropped or just having longer latency?
  If time-out latency is exceeded, packet is assumed to be missing
Source must store packets until it receives acknowledge of successful transfer
  Sending acknowledge after each packet results in small buffer but (at least) double latency
  Sending ack after each N packets requires bigger buffers but gives better performance

[Figure: source and destination.
a) ack for each packet: src holds one buffer, dst returns ack (ok). Latency per pkt = send_latency + ack_latency
b) ack for each N packets: src holds N buffers, dst returns ack (ok,ok,fail,ok). Latency per pkt = (N*send_latency + ack_latency) / N]
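The two latency formulas on this slide are the same expression with N = 1 and N > 1. A minimal sketch with illustrative latency values; the function names are hypothetical:

```python
def latency_per_packet(send_latency, ack_latency, n):
    """Average latency per packet when one ack covers n packets."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n * send_latency + ack_latency) / n


def source_buffer_packets(n):
    """Packets the source must keep until the covering ack arrives."""
    return n


# Hypothetical latencies in cycles.
for n in (1, 4, 16):
    print(n, latency_per_packet(10, 10, n), source_buffer_packets(n))
```

With equal send and ack latency, N = 1 doubles the latency, while larger N amortizes the ack at the cost of N packet buffers in the source.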

Page 80: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Reordering buffers

Packets arriving out-of-order may require huge reordering buffers
  Sometimes processing units may accept out-of-order delivery, or buffers can be integrated with internal memory of the processing unit
  If ack is sent after 4 packets, buffer for 4 packets is needed
  Furthermore, separate buffers are needed for each source as data may be received in interleaved manner
    E.g. (pkt_<n>_<src>) received: pkt_1_1, pkt_4_1, pkt_4_2, pkt_3_3...
  E.g. if ack is sent after N packets and there are S sources, reorder buffer size = N*S packets

[Figure: source 0, source 1, and destination.
a) ack for each packet: one buffer at the destination; ack forces in-order delivery
b) ack for each N packets: the destination holds N buffers per source]
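Per-source reordering can be sketched as follows. This is an illustrative model, not code from the lecture: packets carry a source id and a sequence number starting at 1, and the class and method names are assumptions.

```python
from collections import defaultdict


class ReorderBuffer:
    """Per-source reordering; releases payloads in sequence order."""

    def __init__(self, first_seq=1):
        self.first_seq = first_seq
        self.pending = defaultdict(dict)  # src -> {seq: payload}
        self.expected = {}                # src -> next seq to deliver

    def receive(self, src, seq, payload):
        """Store one packet and return the payloads that are now in order."""
        nxt = self.expected.setdefault(src, self.first_seq)
        self.pending[src][seq] = payload
        released = []
        while nxt in self.pending[src]:
            released.append(self.pending[src].pop(nxt))
            nxt += 1
        self.expected[src] = nxt
        return released

    def occupancy(self):
        """Number of packets currently waiting for an earlier packet."""
        return sum(len(p) for p in self.pending.values())


def worst_case_packets(n, sources):
    """Slide's bound: ack after n packets, S sources -> n * S packets."""
    return n * sources


rb = ReorderBuffer()
print(rb.receive(1, 2, "b"))  # held back, packet 1 from source 1 is missing
print(rb.receive(2, 1, "x"))  # source 2 is independent of source 1
print(rb.receive(1, 1, "a"))  # releases both packets of source 1
```

Keeping a separate pending map per source is what makes the buffer grow with the number of sources.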

Page 81: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Buffer reservation

[Figure: two message sequence charts between a sender agent and a receiver agent. Labels: notification of the next tx, notification of the reserved buffer, reserve buffer, configure rx DMA, ACK, (optional ACK), actual data, (copy data), consume data, reserve buffer etc., observed tx duration.]

Page 82: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Intertwined/Reordering

Transfers from different sources may be arbitrarily intertwined
In addition, packets may arrive out-of-order

[Figure: source0 sends a, b, c and source1 sends d, e through the network to destination0, where they arrive intertwined, e.g. d a b e c. These are either single words, bursts, or packets, depending on the network.
i) fixed-length packets: "FIFO"-like buffers
ii) variable-length packets: linked list buffers]

Page 83: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Irregular IP size

IP's tend to have irregular size and shape
  Largest IP per row/column decides its height/width
  Some space wasted, links will have varying length
Reordering the IPs reduces area
  Ensure that frequently communicating IPs are still close to each other

<19.5% reduction in area>
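The effect of reordering can be illustrated with a small area calculation. This is a sketch with made-up IP sizes, so it does not reproduce the 19.5% figure above. It assumes a rectangular mesh where each row is as tall as its tallest IP and each column as wide as its widest IP.

```python
def mesh_area(grid):
    """Bounding area of a mesh whose tiles hold IPs of size (width, height)."""
    row_heights = [max(h for _, h in row) for row in grid]
    col_widths = [max(row[c][0] for row in grid) for c in range(len(grid[0]))]
    return sum(col_widths) * sum(row_heights)


big, small = (4, 4), (2, 2)                 # hypothetical IP sizes
scattered = [[big, small], [small, big]]    # a big IP in every row and column
grouped = [[big, big], [small, small]]      # big IPs share one row
print(mesh_area(scattered), mesh_area(grouped))
```

Grouping the large IPs into one row shrinks the second row, so less space is wasted around the small IPs. A real placement would also have to weigh the communication distance between IPs.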

Page 84: TKT-2431 Soc Design · TKT-2431 Soc Design ... IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, ... applications past multiprocessor research

Customized mesh

Connect more than one IP to one router
  Somewhat smaller bandwidth available per IP
  Usually enough, though
Adopt totally customized topology (the rightmost fig)