1
Lecture 17: On-Chip Networks
Today: background wrap-up and innovations
Post on 19-Dec-2015
2
Router Pipeline
• Four typical stages:
  RC (routing computation): the head flit indicates the VC that it belongs to; the VC state is updated; the headers are examined and the next output channel is computed (note: this is done for all head flits arriving on the various input channels)
  VA (virtual-channel allocation): the head flits compete for the available virtual channels on their computed output channels
  SA (switch allocation): a flit competes for access to its output physical channel
  ST (switch traversal): the flit is transmitted on the output channel
• A head flit goes through all four stages; the other flits do nothing in the first two stages (this is an in-order pipeline and flits cannot jump ahead); a tail flit also de-allocates the VC
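As a rough sketch of the per-flit behavior above (the function names and dict-based VC state are my own, not from the lecture):

```python
# Hypothetical sketch: which stages each flit type actively performs
# in the baseline 4-stage router pipeline.

def stages_for(flit_type):
    """Head flits do all four stages; body and tail flits inherit the
    head's route and VC, so they idle through RC/VA (in-order pipeline,
    no jumping ahead) and only perform SA and ST."""
    if flit_type == "head":
        return ["RC", "VA", "SA", "ST"]
    return ["--", "--", "SA", "ST"]

def on_tail(vc_state):
    """A tail flit additionally de-allocates the VC it was using."""
    vc_state["allocated"] = False
    return vc_state

print(stages_for("head"))            # ['RC', 'VA', 'SA', 'ST']
print(stages_for("body"))            # ['--', '--', 'SA', 'ST']
print(on_tail({"allocated": True}))  # {'allocated': False}
```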
3
Speculative Pipelines
• Perform VA and SA in parallel
• Note that SA only requires knowledge of the output physical channel, not the VC
• If VA fails, the successfully allocated channel goes un-utilized
Cycle         1     2      3     4     5     6
Head flit     RC   VA+SA   ST
Body flit 1         --     SA    ST
Body flit 2                --    SA    ST
Tail flit                        --    SA    ST
• Perform VA, SA, and ST in parallel (can cause collisions and re-tries)
• Typically, VA is the critical path – can possibly perform SA and ST sequentially
• Router pipeline latency is a greater bottleneck when there is little contention
• When there is little contention, speculation will likely work well!
• Single-stage pipeline?
Cycle         1      2          3       4       5
Head flit     RC   VA+SA+ST
Body flit 1                   SA+ST
Body flit 2                           SA+ST
Tail flit                                     SA+ST
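The latency effect of shortening the pipeline can be sketched as follows (a toy model of my own, assuming no contention so speculation always succeeds):

```python
# Sketch: cycle in which each of four flits (head, two body, tail)
# completes its final stage in a single router, for pipelines of
# different depths, assuming no stalls.

def finish_cycles(num_flits, depth):
    """Flit i enters in cycle i (1-indexed) and, with no stalls,
    finishes its last stage in cycle i + depth - 1."""
    return [i + depth - 1 for i in range(1, num_flits + 1)]

print(finish_cycles(4, 4))  # baseline RC|VA|SA|ST:      [4, 5, 6, 7]
print(finish_cycles(4, 3))  # speculative RC|VA+SA|ST:   [3, 4, 5, 6]
print(finish_cycles(4, 2))  # aggressive RC|VA+SA+ST:    [2, 3, 4, 5]
```

Each removed stage shaves one cycle off every flit, which is why speculation pays off precisely when contention is low and mis-speculation is rare.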
4
Alpha 21364 Pipeline
• Pipeline stages: RC, T, DW, SA1, WrQ, RE, SA2, ST1, ST2, ECC
• Operations performed across these stages:
  Routing
  Transport/wire delay
  Update of input unit state
  Write to input queues
  Switch allocation – local
  Switch allocation – global
  Switch traversal
  Append ECC information
5
Recent Intel Router
Source: Partha Kundu, “On-Die Interconnects for Next-Generation CMPs”, talk at On-Chip Interconnection Networks Workshop, Dec 2006
• Used for a 6x6 mesh
• 16 B, > 3 GHz
• Wormhole with VC flow control
8
Data Points
• On-chip network’s power contribution:
  in RAW (tiled) processor: 36%
  in network of compute-bound elements (Intel): 20%
  in network of storage elements (Intel): 36%
  bus-based coherence (Kumar et al. ’05): ~12%
  Polaris (Intel) network: 28%
  SCC (Intel) network: 10%
• Power contributors:
  RAW: links 39%; buffers 31%; crossbar 30%
  TRIPS: links 31%; buffers 35%; crossbar 33%
  Intel: links 18%; buffers 38%; crossbar 29%; clock 13%
9
Network Power
• Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO’03, Princeton
• Energy for a flit:
  E_flit = E_R × H + E_wire × D = (E_buf + E_xbar + E_arb) × H + E_wire × D
  E_R = router energy          H = number of hops
  E_wire = wire transmission energy          D = physical Manhattan distance
  E_buf = router buffer energy          E_xbar = router crossbar energy
  E_arb = router arbiter energy
• This paper assumes that E_wire × D is the ideal network energy (assuming no change to the application and how it is mapped to physical nodes)
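The model above can be evaluated directly; the sketch below uses placeholder energy values (notionally picojoules), purely for illustration:

```python
# Sketch of the flit-energy model:
#   E_flit = (E_buf + E_xbar + E_arb) * H + E_wire * D
# All numeric values here are placeholders, not measured data.

def flit_energy(e_buf, e_xbar, e_arb, e_wire, hops, dist):
    e_router = e_buf + e_xbar + e_arb   # per-router energy E_R
    return e_router * hops + e_wire * dist

total = flit_energy(e_buf=1.0, e_xbar=0.8, e_arb=0.2, e_wire=0.5,
                    hops=3, dist=6)
ideal = 0.5 * 6          # E_wire * D: the paper's "ideal" network energy
print(total, ideal)      # 9.0 3.0
```

Note that the router term scales with hop count H, so techniques that cut per-hop buffer/crossbar/arbiter energy (or the hop count itself) move the total toward the E_wire × D ideal.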
10
Segmented Crossbar
• By segmenting the row and column lines, parts of these lines need not switch, reducing switching capacitance (especially if the output and input ports are close to the bottom-left in the figure above)
• Need a few additional control signals to activate the tri-state buffers
• Overall crossbar power savings: ~15-30%
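A toy model of the effect (the coordinate convention is my own assumption: port 0 sits nearest the corner where the row and column lines are driven):

```python
# Sketch: in a segmented crossbar, only the row segments up to the turn
# point and the column segments after it need to switch; the rest of
# each line stays quiet.

def segments_switched(in_port, out_port, num_segments_per_line):
    """Fraction of row+column segments that toggle for one connection."""
    row = in_port + 1    # row segments driven up to the turn point
    col = out_port + 1   # column segments driven after the turn
    return (row + col) / (2 * num_segments_per_line)

print(segments_switched(0, 0, 4))  # 0.25: near-corner ports switch little
print(segments_switched(3, 3, 4))  # 1.0:  far ports switch every segment
```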
11
Cut-Through Crossbar
• Attempts to optimize the common case: in dimension-order routing, flits make up to one turn and usually travel straight
• 2/3 the number of tri-state buffers and 1/2 the number of data wires
• “Straight” traffic does not go through tri-state buffers
• Some combinations of turns are not allowed, such as E→N and N→W (note that such a combination cannot happen with dimension-order routing)
• Crossbar energy savings of 39-52%
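The dimension-order argument above can be made concrete. The sketch below (my own helper, not from the lecture) checks which single turns are even possible under XY routing; Y-to-X turns like N→W never occur, which is what lets the cut-through crossbar omit hardware for them:

```python
# Sketch: under dimension-order (XY) routing a packet first travels in
# X (E/W), then in Y (N/S), so a turn from a Y direction back to an X
# direction can never occur.

X_DIRS = {"E", "W"}
Y_DIRS = {"N", "S"}

def turn_allowed_xy(incoming, outgoing):
    """Can a flit arriving from `incoming` leave toward `outgoing`
    under XY routing?"""
    if incoming == outgoing:
        return True                 # going straight: the common case
    return incoming in X_DIRS and outgoing in Y_DIRS

print(turn_allowed_xy("E", "N"))   # True:  X -> Y turn is legal
print(turn_allowed_xy("N", "W"))   # False: Y -> X never happens
print(turn_allowed_xy("W", "W"))   # True:  straight traffic
```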
12
Write-Through Input Buffer
• Input flits must be buffered in case there is a conflict in a later pipeline stage
• If the queue is empty, the input flit can move straight to the next stage: helps avoid the buffer read
• To avoid adding a separate bypass datapath, the write bitlines can serve as the bypass path
• Power savings are a function of rd/wr energy ratios and probability of finding an empty queue
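The bypass decision itself is simple; here is a minimal sketch (my own modeling, assuming in-order draining of the queue):

```python
# Sketch of the write-through bypass: if the input queue is empty, the
# arriving flit proceeds directly to the next stage, avoiding the
# buffer read; otherwise flits drain from the queue in order.

from collections import deque

def accept_flit(queue, flit):
    """Return (flit_to_advance, buffer_read_needed)."""
    if not queue:
        return flit, False       # bypass: no buffer read performed
    queue.append(flit)
    head = queue.popleft()       # normal path: read oldest from queue
    return head, True

q = deque()
print(accept_flit(q, "f0"))      # ('f0', False) -- empty queue, bypass
q.append("f0")
print(accept_flit(q, "f1"))      # ('f0', True)  -- drained in order
```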
13
Express Channels
• Express channels connect non-adjacent nodes – flits traveling a long distance can use express channels for most of the way and navigate on local channels near the source/destination (like taking the freeway)
• Helps reduce the number of hops
• The router in each express node is much bigger now
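The hop savings can be sketched for a 1-D network (a hypothetical model of my own: express stops every `interval` nodes, local hops to reach the nearest stop on each end):

```python
# Sketch: hop count with and without express channels on a 1-D network,
# with express stops at multiples of `interval`.

def hops_local(src, dst):
    return abs(dst - src)

def hops_express(src, dst, interval):
    """Walk locally to the nearest express stop, ride express links,
    then walk locally to the destination (the freeway analogy)."""
    if src > dst:
        src, dst = dst, src
    first_stop = -(-src // interval) * interval   # ceil to next stop
    last_stop = (dst // interval) * interval      # floor to prior stop
    if first_stop >= last_stop:                   # too close: stay local
        return dst - src
    local = (first_stop - src) + (dst - last_stop)
    express = (last_stop - first_stop) // interval
    return local + express

print(hops_local(1, 14))        # 13 hops on local channels only
print(hops_express(1, 14, 4))   # 7 hops: 5 local + 2 express
```

A larger `interval` makes each express hop cover more distance, but fewer source/destination pairs have an express stop conveniently nearby, which matches the trade-off on the next slide.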
14
Express Channels
• Routing: in a ring, there are 5 possible routes and the best is chosen; in a torus, there are 17 possible routes
• A large express interval results in fewer savings because fewer messages exercise the express channels
15
Express Virtual Channels
• To a large extent, maintain the same physical structure as a conventional network (changes to be explained shortly)
• Some virtual channels are treated differently: they go through a different router pipeline and can effectively avoid most router overheads
16
Router Pipelines
• If normal VC (NVC): at every router, the flit must compete for the next VC and for the switch; it will get buffered in case there is a conflict during VA/SA
• If EVC (at an intermediate bypass router):
  need not compete for a VC (an EVC is a VC reserved across multiple routers)
  similarly, the EVC is also guaranteed the switch (only 1 EVC can compete for an output physical channel)
  since VA/SA are guaranteed to succeed, no need for buffering
  simple router pipeline: the incoming flit moves directly to the ST stage
• If EVC (at the EVC source/sink router):
  must compete for VC/SA as in a conventional pipeline
  before moving on, must confirm a free buffer at the next EVC router
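The three cases above amount to a dispatch on the flit's VC class and router role; a minimal sketch (my own encoding of the cases, not from the lecture):

```python
# Sketch: which pipeline a flit sees under express virtual channels.
# NVCs and EVC endpoints use the full conventional pipeline; EVC flits
# at intermediate bypass routers skip straight to switch traversal.

def pipeline_for(vc_kind, at_bypass_router):
    if vc_kind == "NVC":
        return ["RC", "VA", "SA", "ST"]   # compete everywhere, may buffer
    if at_bypass_router:
        return ["ST"]                      # VA/SA guaranteed, no buffering
    # EVC source/sink: conventional pipeline, plus a check for a free
    # buffer at the next EVC router before moving on
    return ["RC", "VA", "SA", "ST"]

print(pipeline_for("NVC", False))                  # ['RC', 'VA', 'SA', 'ST']
print(pipeline_for("EVC", at_bypass_router=True))  # ['ST']
```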