1
Lecture 17: On-Chip Networks
Today: background wrap-up and innovations
Post on 19-Dec-2015
2
Router Pipeline
• Four typical stages:
  RC (routing computation): the head flit indicates the VC that it belongs to; the VC state is updated; the headers are examined and the next output channel is computed (note: this is done for all head flits arriving on the various input channels)
  VA (virtual-channel allocation): the head flits compete for the available virtual channels on their computed output channels
  SA (switch allocation): a flit competes for access to its output physical channel
  ST (switch traversal): the flit is transmitted on the output channel
• A head flit goes through all four stages; the other flits do nothing in the first two stages (this is an in-order pipeline and flits cannot jump ahead); a tail flit also de-allocates the VC
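As a rough sketch of the per-flit behavior above (the function names and dict-based VC state are my own, not from the lecture):

```python
# Hypothetical sketch: which stages each flit type actively performs
# in the baseline 4-stage router pipeline.

def stages_for(flit_type):
    """Head flits do all four stages; body and tail flits inherit the
    head's route and VC, so they idle through RC/VA (in-order pipeline,
    no jumping ahead) and only perform SA and ST."""
    if flit_type == "head":
        return ["RC", "VA", "SA", "ST"]
    return ["--", "--", "SA", "ST"]

def on_tail(vc_state):
    """A tail flit additionally de-allocates the VC it was using."""
    vc_state["allocated"] = False
    return vc_state

print(stages_for("head"))            # ['RC', 'VA', 'SA', 'ST']
print(stages_for("body"))            # ['--', '--', 'SA', 'ST']
print(on_tail({"allocated": True}))  # {'allocated': False}
```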
3
Speculative Pipelines
• Perform VA and SA in parallel
• Note that SA only requires knowledge of the output physical channel, not the VC
• If VA fails, the successfully allocated channel goes un-utilized
Cycle         1     2      3     4     5     6
Head flit     RC   VA+SA   ST
Body flit 1         --     SA    ST
Body flit 2                --    SA    ST
Tail flit                        --    SA    ST
• Perform VA, SA, and ST in parallel (can cause collisions and re-tries)
• Typically, VA is the critical path – can possibly perform SA and ST sequentially
• Router pipeline latency is a greater bottleneck when there is little contention
• When there is little contention, speculation will likely work well!
• Single-stage pipeline?
Cycle         1      2          3       4       5
Head flit     RC   VA+SA+ST
Body flit 1                   SA+ST
Body flit 2                           SA+ST
Tail flit                                     SA+ST
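The latency effect of shortening the pipeline can be sketched as follows (a toy model of my own, assuming no contention so speculation always succeeds):

```python
# Sketch: cycle in which each of four flits (head, two body, tail)
# completes its final stage in a single router, for pipelines of
# different depths, assuming no stalls.

def finish_cycles(num_flits, depth):
    """Flit i enters in cycle i (1-indexed) and, with no stalls,
    finishes its last stage in cycle i + depth - 1."""
    return [i + depth - 1 for i in range(1, num_flits + 1)]

print(finish_cycles(4, 4))  # baseline RC|VA|SA|ST:      [4, 5, 6, 7]
print(finish_cycles(4, 3))  # speculative RC|VA+SA|ST:   [3, 4, 5, 6]
print(finish_cycles(4, 2))  # aggressive RC|VA+SA+ST:    [2, 3, 4, 5]
```

Each removed stage shaves one cycle off every flit, which is why speculation pays off precisely when contention is low and mis-speculation is rare.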
4
Alpha 21364 Pipeline
• Pipeline stages: RC, T, DW, SA1, WrQ, RE, SA2, ST1, ST2, ECC
• Operations performed across these stages:
  Routing
  Transport/wire delay
  Update of input unit state
  Write to input queues
  Switch allocation – local
  Switch allocation – global
  Switch traversal
  Append ECC information
5
Recent Intel Router
Source: Partha Kundu, “On-Die Interconnects for Next-Generation CMPs”, talk at On-Chip Interconnection Networks Workshop, Dec 2006
• Used for a 6x6 mesh
• 16 B, > 3 GHz
• Wormhole with VC flow control
8
Data Points
• On-chip network’s power contribution:
  in RAW (tiled) processor: 36%
  in network of compute-bound elements (Intel): 20%
  in network of storage elements (Intel): 36%
  bus-based coherence (Kumar et al. ’05): ~12%
  Polaris (Intel) network: 28%
  SCC (Intel) network: 10%
• Power contributors:
  RAW: links 39%; buffers 31%; crossbar 30%
  TRIPS: links 31%; buffers 35%; crossbar 33%
  Intel: links 18%; buffers 38%; crossbar 29%; clock 13%
9
Network Power
• Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO’03, Princeton
• Energy for a flit:
  E_flit = E_R × H + E_wire × D = (E_buf + E_xbar + E_arb) × H + E_wire × D
  E_R = router energy          H = number of hops
  E_wire = wire transmission energy          D = physical Manhattan distance
  E_buf = router buffer energy          E_xbar = router crossbar energy
  E_arb = router arbiter energy
• This paper assumes that E_wire × D is the ideal network energy (assuming no change to the application and how it is mapped to physical nodes)
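The model above can be evaluated directly; the sketch below uses placeholder energy values (notionally picojoules), purely for illustration:

```python
# Sketch of the flit-energy model:
#   E_flit = (E_buf + E_xbar + E_arb) * H + E_wire * D
# All numeric values here are placeholders, not measured data.

def flit_energy(e_buf, e_xbar, e_arb, e_wire, hops, dist):
    e_router = e_buf + e_xbar + e_arb   # per-router energy E_R
    return e_router * hops + e_wire * dist

total = flit_energy(e_buf=1.0, e_xbar=0.8, e_arb=0.2, e_wire=0.5,
                    hops=3, dist=6)
ideal = 0.5 * 6          # E_wire * D: the paper's "ideal" network energy
print(total, ideal)      # 9.0 3.0
```

Note that the router term scales with hop count H, so techniques that cut per-hop buffer/crossbar/arbiter energy (or the hop count itself) move the total toward the E_wire × D ideal.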
10
Segmented Crossbar
• By segmenting the row and column lines, parts of these lines need not switch, reducing switching capacitance (especially if the output and input ports are close to the bottom-left in the figure above)
• Need a few additional control signals to activate the tri-state buffers
• Overall crossbar power savings: ~15-30%
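A toy model of the effect (the coordinate convention is my own assumption: port 0 sits nearest the corner where the row and column lines are driven):

```python
# Sketch: in a segmented crossbar, only the row segments up to the turn
# point and the column segments after it need to switch; the rest of
# each line stays quiet.

def segments_switched(in_port, out_port, num_segments_per_line):
    """Fraction of row+column segments that toggle for one connection."""
    row = in_port + 1    # row segments driven up to the turn point
    col = out_port + 1   # column segments driven after the turn
    return (row + col) / (2 * num_segments_per_line)

print(segments_switched(0, 0, 4))  # 0.25: near-corner ports switch little
print(segments_switched(3, 3, 4))  # 1.0:  far ports switch every segment
```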
11
Cut-Through Crossbar
• Attempts to optimize the common case: in dimension-order routing, flits make up to one turn and usually travel straight
• 2/3 the number of tri-state buffers and 1/2 the number of data wires
• “Straight” traffic does not go through tri-state buffers
• Some combinations of turns are not allowed, such as E→N and N→W (note that such a combination cannot happen with dimension-order routing)
• Crossbar energy savings of 39-52%
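The dimension-order argument above can be made concrete. The sketch below (my own helper, not from the lecture) checks which single turns are even possible under XY routing; Y-to-X turns like N→W never occur, which is what lets the cut-through crossbar omit hardware for them:

```python
# Sketch: under dimension-order (XY) routing a packet first travels in
# X (E/W), then in Y (N/S), so a turn from a Y direction back to an X
# direction can never occur.

X_DIRS = {"E", "W"}
Y_DIRS = {"N", "S"}

def turn_allowed_xy(incoming, outgoing):
    """Can a flit arriving from `incoming` leave toward `outgoing`
    under XY routing?"""
    if incoming == outgoing:
        return True                 # going straight: the common case
    return incoming in X_DIRS and outgoing in Y_DIRS

print(turn_allowed_xy("E", "N"))   # True:  X -> Y turn is legal
print(turn_allowed_xy("N", "W"))   # False: Y -> X never happens
print(turn_allowed_xy("W", "W"))   # True:  straight traffic
```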
12
Write-Through Input Buffer
• Input flits must be buffered in case there is a conflict in a later pipeline stage
• If the queue is empty, the input flit can move straight to the next stage: helps avoid the buffer read
• To avoid adding a separate bypass datapath, the write bitlines can serve as the bypass path
• Power savings are a function of rd/wr energy ratios and probability of finding an empty queue
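The bypass decision itself is simple; here is a minimal sketch (my own modeling, assuming in-order draining of the queue):

```python
# Sketch of the write-through bypass: if the input queue is empty, the
# arriving flit proceeds directly to the next stage, avoiding the
# buffer read; otherwise flits drain from the queue in order.

from collections import deque

def accept_flit(queue, flit):
    """Return (flit_to_advance, buffer_read_needed)."""
    if not queue:
        return flit, False       # bypass: no buffer read performed
    queue.append(flit)
    head = queue.popleft()       # normal path: read oldest from queue
    return head, True

q = deque()
print(accept_flit(q, "f0"))      # ('f0', False) -- empty queue, bypass
q.append("f0")
print(accept_flit(q, "f1"))      # ('f0', True)  -- drained in order
```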
13
Express Channels
• Express channels connect non-adjacent nodes – flits traveling a long distance can use express channels for most of the way and navigate on local channels near the source/destination (like taking the freeway)
• Helps reduce the number of hops
• The router in each express node is much bigger now
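The hop savings can be sketched for a 1-D network (a hypothetical model of my own: express stops every `interval` nodes, local hops to reach the nearest stop on each end):

```python
# Sketch: hop count with and without express channels on a 1-D network,
# with express stops at multiples of `interval`.

def hops_local(src, dst):
    return abs(dst - src)

def hops_express(src, dst, interval):
    """Walk locally to the nearest express stop, ride express links,
    then walk locally to the destination (the freeway analogy)."""
    if src > dst:
        src, dst = dst, src
    first_stop = -(-src // interval) * interval   # ceil to next stop
    last_stop = (dst // interval) * interval      # floor to prior stop
    if first_stop >= last_stop:                   # too close: stay local
        return dst - src
    local = (first_stop - src) + (dst - last_stop)
    express = (last_stop - first_stop) // interval
    return local + express

print(hops_local(1, 14))        # 13 hops on local channels only
print(hops_express(1, 14, 4))   # 7 hops: 5 local + 2 express
```

A larger `interval` makes each express hop cover more distance, but fewer source/destination pairs have an express stop conveniently nearby, which matches the trade-off on the next slide.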
14
Express Channels
• Routing: in a ring, there are 5 possible routes and the best is chosen; in a torus, there are 17 possible routes
• A large express interval results in fewer savings because fewer messages exercise the express channels
15
Express Virtual Channels
• To a large extent, maintain the same physical structure as a conventional network (changes to be explained shortly)
• Some virtual channels are treated differently: they go through a different router pipeline and can effectively avoid most router overheads
16
Router Pipelines
• If normal VC (NVC): at every router, the flit must compete for the next VC and for the switch; it will get buffered in case there is a conflict during VA/SA
• If EVC (at an intermediate bypass router):
  need not compete for a VC (an EVC is a VC reserved across multiple routers)
  similarly, the EVC is also guaranteed the switch (only 1 EVC can compete for an output physical channel)
  since VA/SA are guaranteed to succeed, no need for buffering
  simple router pipeline: the incoming flit moves directly to the ST stage
• If EVC (at the EVC source/sink router):
  must compete for VC/SA as in a conventional pipeline
  before moving on, must confirm a free buffer at the next EVC router
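The three cases above amount to a dispatch on the flit's VC class and router role; a minimal sketch (my own encoding of the cases, not from the lecture):

```python
# Sketch: which pipeline a flit sees under express virtual channels.
# NVCs and EVC endpoints use the full conventional pipeline; EVC flits
# at intermediate bypass routers skip straight to switch traversal.

def pipeline_for(vc_kind, at_bypass_router):
    if vc_kind == "NVC":
        return ["RC", "VA", "SA", "ST"]   # compete everywhere, may buffer
    if at_bypass_router:
        return ["ST"]                      # VA/SA guaranteed, no buffering
    # EVC source/sink: conventional pipeline, plus a check for a free
    # buffer at the next EVC router before moving on
    return ["RC", "VA", "SA", "ST"]

print(pipeline_for("NVC", False))                  # ['RC', 'VA', 'SA', 'ST']
print(pipeline_for("EVC", at_bypass_router=True))  # ['ST']
```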