dynamic networks

Dynamic Networks

L.N. BhuyanPartly from Berkeley Notes

04/19/23 CS258 S99 2

What is Dynamic Network

• Dynamic Network is the network that can connect any input to any output by enabling or disabling some switches in the network

• Examples: - Shared Bus: The bus arbiter connects a

processor to a memory - Crossbar: Consists of a lot of switching

elements, which can be enabled to connect many inputs to many outputs simultaneously

- Multistage Network: Consists of several stages of switches that are enabled to get connections

- The nodes in static networks (like Mesh) also consist of dynamic crossbars

04/19/23 CS258 S99 3

Crossbar Switch Design

• Complexity O(N**2) for an NXN Crossbar – Why? See next page

Cross-bar

InputBuffer

Control

OutputPorts

Input Receiver Transmiter

Ports

Routing, Scheduling

OutputBuffer

04/19/23 CS258 S99 4

How do you build a crossbar

Io

I1

I2

I3

Io I1 I2 I3

O0

Oi

O2

O3

N**2 switches => Cost O(N**2)

Time taken by the arbiter = O(N**2)Multiplexors are controlled from controller

From Control

04/19/23 CS258 S99 5

Crossbar Contd.

• An NXN Crossbar allows all N inputs to be connected simultaneously to all N outputs

• It allows all one-to-one mappings, called permutations. No. of permutations = N!

• When two or more inputs request the same output, only one of them is connected and others are either dropped or buffered

• When processors access memories through crossbar, this situation is called memory access conflicts

• Given p as the probability of request by a processor per cycle and assuming that a processor’s request is uniformly directed to all N memories, the average number of connections allowed per cycle, called Bandwidth (BW) is

BW = N{1(1-p/N)**(N-1)} – Derive this!!!

04/19/23 CS258 S99 6

Input buffered swtich

• Independent routing logic per input - FSM• Scheduler logic arbitrates each output - priority, FIFO, random• Head-of-line blocking problem – The head packet in a buffer cannot depart because the

output is busy with another packet. The second packet may be destined to an output that is free, but cannot depart due to blocking by the first packet => One solution is to create multiple input queues, one per output, called Virtual Output Queuing – adopted in most routers.

• Scheduler Design – How to ensure maximum simultaneous connections is a challenging research area.

Cross-bar

OutputPorts

Input Ports

Scheduling

R0

R1

R2

R3

04/19/23 CS258 S99 7

Problems with Input-Buffered Switch

• FIFO Input buffers give rise to Head of the Line (HOL) problem

• Current routers employ a separate input queue for each output, called virtual output queue (VOQ)

• Then how to schedule the packets from different VOQ’s for transmission?

04/19/23 CS258 S99 8

VOQ-based Input Buffered Switch

04/19/23 CS258 S99 9

Scheduling in Input Buffered Switch

• n independent arbitration problems?– static priority, random, round-robin

• simplifications due to routing algorithm?• general case is max bipartite matching – Iterative

algorithms – iSLIP in Cisco

Cross-bar

OutputPorts

R0

R1

R2

R3

O0

O1

O2

InputBuffers

04/19/23 CS258 S99 10

Iterative Matching– A 3-step Procedure

Request Grant Accept

04/19/23 CS258 S99 11

04/19/23 CS258 S99 12

04/19/23 CS258 S99 13

04/19/23 CS258 S99 14

04/19/23 CS258 S99 15

Fair Scheduling in Crossbar (Infocom 2002)

Motivation:

•Current routers employ fair scheduling at the output link, but with high link speed there are very few packets at the output buffer. These packets were selected by the crossbar with equal probability from the input buffers.

•Many more packets are waiting in the input queues. Choosing packets during arbitration depending on the reservation will ensure better QoS among competing flows at the input buffers.

04/19/23 CS258 S99 16

The iFS Algorithm

Initially, all inputs and outputs are considered as unmatched and none of the inputs have any candidates.Then in each iteration:

Grant stage: Each unmatched output selects a flow with the smallest virtual time for its head-of-line cell and marks the cell as a candidate for the corresponding input. Grant signal is then given to the input.

Accept stage: Each unmatched input examines its candidate set, selects a winner according to age and sends an accept signal to its output. The input and output are

then considered as matched. Reset the candidate set to empty.

04/19/23 CS258 S99 17

Output/Shared Buffered Switch

Control

OutputPorts

Input Ports

OutputPorts

OutputPorts

OutputPorts

R0

R1

R2

R3

RAMphase

O0

Oi

O2

O3

DoutDin

Io

I1

I2

I3

addr

Shared Buffer

Output Buffered Switch has buffers at output to store packets. There is always a minimal transmitting buffer at the input. What happens if there are 2 or more packets to the same output at the same time. In order to capture both, the switch speed has to be N times that of link speed => Difficult to design.

RAM speed has to be N times the link speed.

04/19/23 CS258 S99 18

Shared Buffer Switch: IBM SP Vulcan switch

• Many gigabit Ethernet switches use similar design without the cut-through

• 128 8-byte ‘chunks’ in central queue, LRU per output

FIFO

CRCcheck

Routecontrol

FlowControl

8 8

Des

eria

lizer

64

Input Port

RAM64x128

InArb

OutArb

8 x 8Crossbar

CentralQueue

FIFO

CRCGen

FlowControl

8 8Seri

aliz

er

64

Ouput Port

XBarArb

FIFO

CRCcheck

Routecontrol

FlowControl

8 8

Des

eria

lize

rInput Port

°°°

64

°°°

FIFO

CRCGen

FlowControl

8 8Ser

ializ

er

Ouput Port

XBarArb

8

°°°

8

04/19/23 CS258 S99 19

SGI SPIDER: IEEE Micro Jan 1997

04/19/23 CS258 S99 20

Flow Control

• What do you do when push comes to shove?– Ethernet: collision detection and retry after delay

– FDDI, token ring: arbitration token

– TCP/WAN: buffer, drop, adjust rate

– any solution must adjust to output rate

• Link-level flow control

Data

Ready

04/19/23 CS258 S99 21

Examples• Short Links

• long links– several flits on the wire

So

urce

Des

tin

atio

n

Data

Req

Ready/AckF/E F/E

04/19/23 CS258 S99 22

Multistage Interconnection Network

A network consisting of multiple stages of crossbar switches has the following properties.

• NxN network for N=2n

• Consists of log2N stages of 2x2 switches

• Has N/2 2x2 switches per stage

• Cost O(N log n) instead of O(N2) for Crossbar

• For N= an, a MIN can be similarly designed with axa switches

04/19/23 CS258 S99 23

Multistage interconnection networks

0

1

2

3

4

5

6

7

000

001

010

011

100

101

110

111

1

1

0

Omega NetworkComplexity O(Nlog2N)

04/19/23 CS258 S99 24

000

001

010

011

100

101

110

111

000

001

010

011

100

101

110

111

000

001

010

011

100

101

110

111

000

001

010

011

100

101

110

111

=0

=1

=2

=3

=4

=5

=6

=7

(a) Perfect shuffle (b) Inverse perfect shuffle

shuffle interconnection

S(an-1 an-2 … a1 a0) = (an-2 an-3 … a0 an-1 )

04/19/23 CS258 S99 25

Omega Network

• Every stage of switches is preceded by a perfect shuffle interconnection

S(an-1 an-2 … a1 a0) = (an-2 an-3 … a0 an-1 )• An input can be connected to a straight or

exchange output in a 2x2 switch.

E(an-1 an-2 … a1 a0) = (an-1 an-2 … a1 ā0) • To route a message/packet in an Omega

network, the destination tag which is binary equivalent of the destination is used, (dn-1 dn-2 …

d1 d0). The ith bit di is used to control the routing at the ith stage counted from the right with 0 <= i <= n-1. If di = 0, the input is connected to the upper output. If di = 1, it is connected to the lower output.

04/19/23 CS258 S99 26

Self Routing

• A processor generates a tag that is binary equivalent of the destination

• MSB controls the leftmost stage and the lsb controls the rightmost stage of the Omega network. A small controller inside the 2 x 2 switch senses this bit and enables the connection

• If bit ci = 0, the request is to the upper output; if it is 1, the request is to the lower output.

• Based on digit if switch size is greater than 2

• Network conflict - Select Round Robin

• Less Bandwidth than crossbar, but more cost effective

• What about QoS? Future research

04/19/23 CS258 S99 27

Theorem: The Omega network is self routing

Let source be (sn-1sn-2 … s2 … s1s0) and destination be (dn-1dn-2 … d2 … d1d0). Before Stage 1, the source is switched to the position (sn-2sn-3 … s1 … s0sn-1) due to perfect shuffle connection. After Stage 1 it is switched to (sn-2sn-3 … s1 … s0dn-

1) as per the (n-1)th of the destination.

Before 2nd stage of the switches, the source is connected to (sn-3 … s0dn-1sn-2) as after 2nd stage it becomes (sn-3 … s0dn-1dn-2)

If we continue like this for n stages, the source matches (dn-1dn-2 … di … d1d0) which is the destination.

04/19/23 CS258 S99 28

Example: SP

• 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock

• packet sw, cut-through, no virtual channel, source-based routing

• variable packet <= 255 bytes, 31 byte fifo per input, 7 bytes per output, 16 phit links

P0P1P2P3 P15

E0E1E2E3 E15

Intra-Rack Host Ports

Inter-Rack External Switch Ports

16-node Rack

SwitchBoard

Multi-rack Configuration

04/19/23 CS258 S99 29

Summary

• Routing Algorithms restrict the set of routes within the topology

– simple mechanism selects turn at each hop

– arithmetic, selection, lookup

• Deadlock-free if channel dependence graph is acyclic– limit turns to eliminate dependences

– add separate channel resources to break dependences

– combination of topology, algorithm, and switch design

• Deterministic vs. adaptive routing

• Switch design issues– input/output/pooled buffering, routing logic, selection logic

• Flow control

• Real networks are a ‘package’ of design choices

04/19/23 CS258 S99 30

Protocols: HW/SW Interface

• Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently;

– Enabling technologies: SW standards that allow reliable communications without reliable networks

– Hierarchy of SW layers, giving each layer responsibility for portion of overall communications task, called protocol families or protocol suites

• Transmission Control Protocol/Internet Protocol (TCP/IP)

– This protocol family is the basis of the Internet

– IP makes best effort to deliver; TCP guarantees delivery

– TCP/IP used even when communicating locally: NFS uses IP even though communicating across homogeneous LAN

04/19/23 CS258 S99 31

TCP/IP packet

• Application sends message• TCP breaks into 64KB

segements, adds 20B header

• IP adds 20B header, sends to network

• If Ethernet, broken into 1500B packets with headers, trailers

• Header, trailers have length field, destination, window number, version, ...

TCP data(≤ 64KB)

TCP Header

IP Header

IP Data

Ethernet

CPU

User

Kernel

NIC NIC

PCI Bus

Communicating with the Server: The O/S Wall

Problems:• O/S overhead to move a packet between network and application level => Protocol Stack (TCP/IP)

• O/S interrupt • Data copying from kernel space to user space and vice versa

• Oh, the PCI Bottleneck!

04/19/23 CS258 S99 33

The Send/Receive Operation

• The application writes the transmit data to the TCP/IP sockets interface for transmission in payload sizes ranging from 4 KB to 64 KB.

• The data is copied from the User space to the Kernel space

• The OS segments the data into maximum transmission unit (MTU)–size packets, and then adds TCP/IP header information to each packet.

• The OS copies the data onto the network interface card (NIC) send queue.

• The NIC performs the direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts CPU activities to indicate completion of the transfer.

04/19/23 CS258 S99 34

Transmitting data across the memory bus using a standard NIC

http://www.dell.com/downloads/global/power/1q04-her.pdf

04/19/23 CS258 S99 35

Timing Measurement in UDP Communication

X.Zhang, L. Bhuyan and W. Feng, ““Anatomy of UDP and M-VIA for Cluster Communication” JPDC, October 2005

04/19/23 CS258 S99 36

I/O Acceleration Techniques

• TCP Offload: Offload TCP/IP Checksum and Segmentation to Interface hardware or programmable device (Ex. TOEs) – A TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.

• O/S Bypass: User-level software techniques to bypass protocol stack – Zero Copy Protocol

(Needs programmable device in the NIC for direct user level memory access – Virtual to Physical Memory Mapping. Ex. VIA)

• Architectural Techniques: Instruction set optimization, Multithreading, copy engines, onloading, prefetching, etc.

04/19/23 CS258 S99 37

Comparing standard TCP/IP and TOE enabled TCP/IP stacks

(http://www.dell.com/downloads/global/power/1q04-her.pdf)

04/19/23 CS258 S99 38

Chelsio 10 Gbs TOE

04/19/23 CS258 S99 39

Cluster (Network) of Workstations/PCs

04/19/23 CS258 S99 40

Myrinet Interface Card

04/19/23 CS258 S99 41

InfiniBand Interconnection

• Zero-copy mechanism. The zero-copy mechanism enables a user-level application to perform I/O on the InfiniBand fabric without being required to copy data between user space and kernel space.

• RDMA. RDMA facilitates transferring data from remote memory to local memory without the involvement of host CPUs.

• Reliable transport services. The InfiniBand architecture implements reliable transport services so the host CPU is not involved in protocol-processing tasks like segmentation, reassembly, NACK/ACK, etc.

• Virtual lanes. InfiniBand architecture provides 16 virtual lanes (VLs) to multiplex independent data lanes into the same physical lane, including a dedicated VL for management operations.

• High link speeds. InfiniBand architecture defines three link speeds, which are characterized as 1X, 4X, and 12X, yielding data rates of 2.5 Gbps, 10 Gbps, and 30 Gbps, respectively.

Reprinted from Dell Power Solutions, October 2004. BY ONUR

CELEBIOGLU, RAMESH RAJAGOPALAN, AND RIZWAN ALI

04/19/23 CS258 S99 42

InfiniBand system fabric

04/19/23 CS258 S99 43

UDP Communication – Life of a Packet

X. Zhang, L. Bhuyan and W. Feng, “Anatomy of UDP and M-VIA for Cluster Communication” Journal of Parallel and Distributed Computing (JPDC), Special issue on Design and Performance of Networks for Super-, Cluster-, and Grid-Computing, Vol. 65, Issue 10, October 2005, pp. 1290-1298.

dynamic networks

Documents

output priority

output link

output buffer

input fsmscheduler logic

separate input queue

multiple input queues

virtual output queuing

n inputs