cs/ece 757: advanced computer architecture ii (parallel ...mr56/ece757/lecture17.pdftopologies...
Post on 24-Feb-2021
4 Views
Preview:
TRANSCRIPT
Page 1
CS/ECE 757: Advanced Computer Architecture II(Parallel Computer Architecture)
Interconnection Network Design (Chapter 10)
Copyright 2001 Mark D. HillUniversity of Wisconsin-Madison
Slides are derived from work bySarita Adve (Illinois), Babak Falsafi (CMU),
Alvy Lebeck (Duke), Steve Reinhardt (Michigan),and J. P. Singh (Princeton). Thanks!
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 2
Interconnection Networks
• Goal: Communication between computers• Warning: Terminology-rich environment• Focus on Networks for Parallel Computing
– today’s System Area Networks exhibit many of the same properties
• Traffic Patterns– Arbitrary (bisection bandwidth matters)– Near-neighbor– Permutations (e.g., FFT)– Multicasts or Broadcasts?
– Latency or Bandwidth Critical?
Page 2
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 3
Terms
Network characterized by• Topology
– physical structure of the graph
• Routing Algorithm– through which paths can message flow
• Switching Strategy– How data in message traverses its route – Circuit Switched vs Packet Switched
• Flow Control– When does a packet (or portions of it) move along its route
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 4
More Terms
• Given topology constructed by linking switches and network interfaces, must deliver packet from node A to node B
• Link: cable with connectors on each end– connect switches to other switches or network interfaces
• Switch: N inputs N outputs (degree N)• Phit: Minimum # of bits physically moved across link
in one cycle• Flit: Minimum # of bits move across link as a single
unit• Packet: unit that requires routing information, some
number of flits
Page 3
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 5
Outline
• Topology
• Switching, Routing, & Deadlock
• Switch Design
• Flow Control
• Case Studies
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 6
Topology
• Structure of the interconnect• Determines
– Switch Degree: number of links from a node– Diameter: number of links crossed between nodes on maximum
shortest path– Average distance: number of hops to random destination– Bisection: minimum number of links that separate the network into
two halves
• Don’t forget Network Interface (NI)– “ On-” and “ off-ramp” to the interconnect “ freeway”– NI latency can dominate interconnect latency
(reducing the importance of topology)
Page 4
(C) 2003 J. E. Smith CS/ECE 757 7
Topology
Node
SW Interface
HW Interface
linklink
bandwidth
Node
SW Interface
HW Interface
linklink
bandwidth
. . .
. . .
Interconnect
Bisection Bandwidth
Overhead
Latency
Application Application
(C) 2003 J. E. Smith CS/ECE 757 8
Buses
• Direct interconnect• Performance/Cost:
– Switch cost: N– Wire cost: const– Avg. latency: const– Bisection B/W: const– Not neighbor optimized– May be local optimized
• Capable of broadcast (good for MP coherence)• Bandwidth not scalable => major problem
– Hierarchical buses?» Bisection B/W remains constant» Becomes neighbor optimized
MemoryMemoryMemoryMemory
Proc
cache
Proc
cache
Proc
cache
Proc
cache
Page 5
(C) 2003 J. E. Smith CS/ECE 757 9
Rings
• Direct interconnect• Performance/Cost:
– Switch cost: N– Wire cost: N– Avg. latency: N / 2– Bisection B/W: const– neighbor optimized, if bi-directional– probably local optimized
• Not easily scalable– Hierarchical rings?
» Bisection B/W remains constant» Becomes neighbor optimized
M
P
RING
S
M
P
S
M
P
S
(C) 2003 J. E. Smith CS/ECE 757 10
Hypercubes
• n-dimensional unit cube• Direct interconnect• Performance/Cost:
– Switch cost: N log2 N– Wire cost: (N log2 N) /2– Avg. latency: (log2 N) /2– Bisection B/W: N /2– neighbor optimized– probably local optimized
• latency and bandwidth scale well, BUT– individual switch complexity
grows=> max size is built in
S
M P
S
M P
S
M P
S
M P
S
M P
S
M P
S
M P
S
M P
000
110001 101
111011
010
100
Page 6
(C) 2003 J. E. Smith CS/ECE 757 11
2D Torus
• Direct interconnect• Performance/Cost:
– Switch cost: N– Wire cost: 2 N– Avg. latency: N 1/2
– Bisection B/W: 2 N 1/2
– neighbor optimized– probably local optimized
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
(C) 2003 J. E. Smith CS/ECE 757 12
2D Torus, contd.
• Cost scales well• Latency and bandwidth do not scale as well as
hypercube, BUT– difference is relatively small for practical-sized systems
• Weave nodes to make inter-node latencies const.• Routing can be tricky (deadlocks)• 2D Mesh similar, but without wraparound
SM P
SM P
SM P
SM P
SM P
SM P
SM P
SM P
Page 7
(C) 2003 J. E. Smith CS/ECE 757 13
3D Torus
• Direct interconnect• Performance/Cost:
– Switch cost: N– Wire cost: 3 N– Avg. latency: 3(N1/3 / 2)– Bisection B/W: 2 N2/3
– neighbor optimized– probably local optimized
(C) 2003 J. E. Smith CS/ECE 757 14
3D Torus, contd.
• Cost scales well• Latency and bandwidth do not scale as well
as hypercube, BUT– difference is relatively small for practical-sized systems
• Seems to have become an interconnect of choice:– Cray, Intel, Tera, DASH, etc.
• Routing can be tricky (deadlocks)• 3D Mesh similar, but without wraparound
Page 8
(C) 2003 J. E. Smith CS/ECE 757 15
Crossbars
• Indirect interconnect• Performance/Cost:
– Switch cost: N 2
– Wire cost: 2 N– Avg. latency: const– Bisection B/W: N– Not neighbor optimized– Not local optimized
Processor
MemoryMemoryMemoryMemory
MemoryMemoryMemoryMemory
Processor
Processor
Processor
corssbar
(C) 2003 J. E. Smith CS/ECE 757 16
Crossbars, contd.
• Capable of broadcast• No network conflicts• Cost does not scale well• Often implemented with
muxes
mux
P
M
P
P
P
mux
M
mux
M
mux
M
Page 9
(C) 2003 J. E. Smith CS/ECE 757 17
Multistage Interconnection Networks (MINs)
• AKA: Banyan, Baseline, Omega networks, etc.
• Indirect interconnect• Crossbars interconnected
with shuffles• Can be viewed as
overlapped MUX trees• Destination address
specifies the path• The shuffle interconnect
routes addresses in a binary fashion
• This can most easily be seen with MINs in the form at right
0101
1011
(C) 2003 J. E. Smith CS/ECE 757 18
MINs, contd.
• f switch outputs => decode log2 f bits in switch• Performance/Cost:
– Switch cost: (N log f N) / f– Wire cost: N log f N– Avg. latency: log f N– Bisection B/W: N– Not neighbor optimized
• Problem (in combination with latency growth)– Not local optimized
• Capable of broadcast• Commonly used in large UMA systems
Page 10
(C) 2003 J. E. Smith CS/ECE 757 19
Multistage Nets, Equivalence
• By rearranging switches, multistage nets can be shown to be equivalent
A
B
C
D
A
B
C
D
A
B
C
D
E
F
G
H
A
B
C
D
E
F
G
H
E
F
G
H
E
F
G
H
I
J
K
L
M
N
O
P
I
J
M
N
K
L
O
P
(C) 2003 J. E. Smith CS/ECE 757 20
Fat Trees
• Tree-like, with constant bandwidth at all levels
• Closely related to MINs• Indirect interconnect• Performance/Cost:
– Switch cost: N log f N
– Wire cost: f N log f N
– Avg. latency: approx 2 log f N– Bisection B/W: f N– neighbor optimized– may be local optimized
• Capable of broadcastP
SM
P
SM
P
SM
P
SM
P
SM
P
SM
P
SM
P
SM
Page 11
(C) 2003 J. E. Smith CS/ECE 757 21
Fat Trees, Contd.
• The MIN-derived Fat Tree, is, in fact, a Fat Tree:• However, the switching "nodes" in effect do not have full
crossbar connectivity
P
SM
P
SM
P
SM
P
SM
P
SM
P
SM
P
SM
P
SM
A B C D E F G H
I J K L M N O P
Q R S T U V W X
A B C D E F G H
I J K L M N O P
Q R S T U V W X
S S S S S S S S
M M M M M M M M
P P P P P P P P
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 22
Important Topologies
N = 1024
Type Degree Diameter Ave Dist Bisection Diam Ave D
1D mesh 2 N-1 2N/3 1
2D mesh 4 2(N1/2 - 1) 2N1/2 / 3 N1/2 63 21
3D mesh 6 3(N1/3 - 1) 3N1/3 / 3 N2/3 ~30 ~10
nD mesh 2n n(N1/n - 1) nN1/n / 3 N(n-1) / n
(N = kn)
Ring 2 N / 2 N/4 2
2D torus 4 N1/2 N1/2 / 2 2N1/2 32 16
k-ary n-cube 2n n(N1/n) nN1/n/2 15 8 (3D) (N = kn) nk/2 nk/4 2kn-1
Hypercube n n = LogN n/2 N/2 10 5
Page 12
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 23
N = 1024
Type Degree Diameter Ave Dist Bisection Diam Ave D
2D Tree 3 2Log2 N ~2Log2 N 1 20 ~20
4D Tree 5 2Log4 N 2Log4 N - 2/3 1 10 9.33
kD k+1 Logk N
2D fat tree 4 Log2 N N
2D butterfly 4 Log2 N Log2 N N/2 20 20
Topologies (cont)
CM-5 Thinned Fat Tree
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 24
Butterfly
N/2 Butterfly
°°°
N/2 Butterfly
°°°
Benes Network
N Butterfly
ReversedN
Butterfly
°°°
• Routes all permutations w/o conflict
• Notice similarity to Fat Tree (Fold in half)
• Randomization is major breakthrough
• All paths equal length
• Unique path from anyinput to any output
• Conflicts cause tree saturation
Multistage: nodes at ends, switches in middle
Page 13
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 25
Outline
• Topology
• Switching, Routing, & Deadlock
• Switch Design
• Flow Control
• Case Studies
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 26
Queue on each end• Can send both ways (“ Bi-directional, Full
Duplex” )• Rules for communication? “ protocol”
– Synchronous send » Need Request & Response signaling
– Name for standard group of bits sent: Packet
ABCs of Networks
• Starting Point: Send bits between 2 computers
Page 14
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 27
A Simple Example
• What is the packet format?– Fixed? (for HW Interpretation)– Number bytes?
Request/Response
Address/Data
1 bit 32 bits
0: Please send data from Address1: Packet contains data corresponding to request
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 28
Questions About Simple Example
• What if more than 2 computers want to communicate?
– Need node identifier field (destination) in packet– Routing and topology
• What if packet is garbled in transit?– Add error detection field in packet (e.g., CRC)
• What if packet is lost?– More elaborate protocols to detect loss (e.g., NAK, time outs)
• What if multiple processes/machine?– Dispatch– Queue per process
• Questions such as these lead to more complex protocols and packet formats
Page 15
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 29
General Packet Format
• Header– routing and control information
• Payload– carries data (non HW specific information)– can be further divided (framing, protocol stacks…)
• Error Code– generally at tail of packet so it can be generated on the way out
Header Payload Error Code
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 30
Message vs. Packet
• A Message may be composed of several packets• Applications reason about messages• Network transfers packets• Small fixed size packets. Problems?
Fragmentation and reassembly (SW overhead)• Variable Size packets. Problems?
Congestion
Page 16
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 31
Packet Switched vs Circuit Switched
Circuit Switched• Establish Route then Send Data• Telephone systemPacket Switched• Route each packet individually• Delivery Guarantees
– Reliable– In order, what if not?
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 32
Routing
• Store-and-forward• Cut-through• Virtual cut-through• Wormhole
Page 17
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 33
Store and Forward
• Store-and-forward policy: each switch waits for the full packet to arrive in the switch before it is sent on to the next switch
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 34
Cut Through
• Cut-through routing: switch examines the header, decides where to send the message, and then starts forwarding it immediately
Page 18
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 35
Virtual Cut-Through
• What to do if output port is blocked?• Lets the tail continue when the head is blocked,
absorbing the whole message into a single switch. – Requires a buffer large enough to hold the largest packet.
• Degenerates to store-and-forward with high contention
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 36
Wormhole
• When the head of the message is blocked the message stays strung out over the network
– Potentially blocks other messages (needs only buffer the piece of the packet that is sent between switches).
– CM-5 used it, with each switch buffer being 4 bits per port.– Myrinet uses it
• Interaction with Pack Size• Can cause tree saturation…
Page 19
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 37
Store and Forward vs. Cut-Through
• Advantage– Latency reduces from function of:
number of intermediate switches times the size of the packet
to
time for 1st part of the packet to negotiate the switches + the packet size ÷ interconnect BW
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 38
Routing Algorithm
• How do I know where a packet should go?• Arithmetic• Source-Based• Table Lookup• Adaptive—route based on network state
(e.g., contention)
Page 20
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 39
Arithmetic Routing
• For regular topology, simple arithmetic to determine route
• 2D Mesh (Also called NEWS network)– packet header contains signed offset to destination– switch ++ or -- one field of header (x or y dimension)– when x == 0 and y == 0, then at correct processor
• Requires ALU in switch• Must recompute CRC
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 40
Source Based and Table Lookup Routing
Source Based• Source specifies output port for each switch in route• Very Simple Switches
– no control state– strip output port off header
• Myrinet uses thisTable Lookup• Very Small Header, index into table for output port• Big tables, must be kept up to date...
Page 21
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 41
001
000
101
100
010 110
111011
Deterministic v.s. Adaptive Routing
• Deterministic—follows a pre-specified route
– mesh: dimension-order routing» (x1, y1) -> (x2, y2)» first Dx = x2 - x1,» then Dy = y2 - y1,
– hypercube: edge-cube routing» X = x0x1x2 . . .xn -> Y = y0y1y2 . . .yn» R = X xor Y» Traverse dimensions of differing
address in order– tree: common ancestor
• Adaptive—route determined by contention for output port
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 42
Adaptive Routing
• Essential for fault tolerance– at least multipath
• Can improve utilization of the network• Simple deterministic algorithms easily run into bad
permutations
• Fully/partially adaptive, minimal/non-minimal• Can introduce complexity or anomalies• Little adaptation goes a long way!
Page 22
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 43
Deadlock
Guri Sohi
• Break a Necessary Condition– Use more than one resource– Not willing to release resource in use– Cycle in order of recourse use
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 44
Deadlock Free Routing
• Virtual Channels– Not virtual cut-through– Add buffers so flits of wormhole packets can be interleaved
• Up*-Down*– Number switches: higher = farther away from processors– route up, make one turn, route down
• Turn Model Routing– Restrict order of turns
» West First» North Last» Negative First
– Can increase number of hops
Page 23
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 45
Minimal turn restrictions in 2D
West-f irst
north-last negative first
-x +x
+y
-y
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 46
Outline
• Topology
• Switching, Routing, & Deadlock
• Switch Design
• Flow Control
• Case Studies
Page 24
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 47
A Generic Switch
• At minimum, must route inputs to outputs
InputBuffer
OutputBuffer
Cross-bar
ControlRouting, Scheduling
Receiver Transmitter
OutputPorts
InputPorts
VLSI makes it easier to create larger fully connected switches
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 48
Switch Design Issues
• ports, pin limited• data path (cross bar designs)• non-blocking crossbar• routing logic per input
– ALU– table– Finite State Machine for cut-through
Page 25
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 49
Switch Buffering
• need to absorb some transient peaks• shared buffer pool
– need high bandwidth– one output could hog all buffer space
• input buffers
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 50
Input Buffering
• buffer per port• routing logic• head of line (HOL) blocking
– subsequent packet may be routed to unused output port
Page 26
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 51
Output Buffering
• Buffers “ logically” associated with output– split on either side of cross bar
• Arbitration for physical link (output scheduling)– static priority– random– round-robin– oldest-first
• Effects of adaptive routing?– Select output based on availability– requires feedback from output port
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 52
Stacked Dimension Switch
• Uses only 2x2 switch to build higher dimension switch
2x2
2x2
2x2
zin zout
yin
xin
yout
xout
Hostin
Hostout
Page 27
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 53
Outline
• Topology
• Switching, Routing, & Deadlock
• Switch Design
• Flow Control
• Case Studies
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 54
Congestion Control
• Packet switched networks do not reserve bandwidth; this leads to contention
• Solution: prevent packets from entering until contention is reduced (e.g., metering lights)
• Options:– End-to-end Flow Control– Link-level Flow Control
Page 28
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 55
Flow Control
• Packet discarding: If a packet arrives at a switch and there is no room in the buffer, the packet is discarded
– no communication between switches, requires higher level protocol
• Flow control: between pairs of receivers and senders; use feedback to tell the sender when it is allowed to send the next packet
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 56
Link-Level Flow Control
Data
Ready
• Transfer single flit when receiver is ready• Could have long links with many flits in flight
Page 29
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 57
Credit-based (Window) Flow Control
• Receiver gives N credits to sender– sender decrements count– stops sending if zero– receiver sends back credit as it drains its buffer– bundle credits to reduce overhead
• Must account for link latency
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 58
Water Level
• high water, low water• stop & go back to source switch (Myrinet)• can send redundant stop/go
Stop
Go
Incoming phits
Outgoing phits
Page 30
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 59
Outline
• Topology
• Switching, Routing, & Deadlock
• Switch Design
• Flow Control
• Case Studies
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 60
Case Study Cray T3D
• 1024 switch nodes each connected to 2 processors• 3D Torus, bidirectional, 300 MB/s• Link: 16 bits, 8 control bits• Variable size packet (multiple of 16 bits)• Logical request & response networks
– 2 virtual channels each for deadlock
• Stacked dimension routing• Wormhole for large packets, virtual cut-through for
small packets
Page 31
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 61
IBM SP-2 (Vulcan)
• Switch has eight bidirectional 40 MB/s links• Link: 8 data bits, 1 tag, 1 reverse flow-control• Flit is 16 bits, phit is 8• input FIFO + output FIFO + central buffer 128 8-byte
segments
(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 62
Real Machines
• Wide links, smaller routing delay• Tremendous variation
top related