27 February 2006

RapidIO FT Research Update: Adaptive Routing

David Bueno
February 27, 2006

HCS Research Laboratory
Dept. of Electrical and Computer Engineering
University of Florida


Overview

- RapidIO switches traditionally handle routing using routing tables with destID-output port pairs
  - Packet route is NOT specified at source, instead is determined by switches as packet travels through network
  - Routing tables are generally fixed with one port for each destID
- Want to explore capabilities of adaptive routing in RapidIO switches for purposes of performance (load balancing) and fault tolerance
  - Many of our FT network designs provide the option of “over-provisioning” the backplane by providing an extra switch
  - Initial experiments with high-bandwidth corner turns found it best to leave extra switch inactive as no benefits were gained by using it in active mode with a fixed-routed application
- Early GMTI experiments tested adaptive round-robin routing with GMTI corner turns and found a fixed-routed version performed better
  - Lesson learned here is that if an application CAN be effectively statically routed and network provides enough bandwidth, use fixed routing


Background

- Several relevant papers on this topic uncovered in previous literature searches
  - [1] and [2] of most interest since they deal with expanding an existing protocol (InfiniBand) to support adaptive routing
  - Unlike IBA, RIO spec does not forbid adaptive routing, but leaves implementation up to developer
- One important issue in implementing adaptive routing in a RIO system is in-order delivery
  - For traffic flows requiring in-order delivery, [2] suggests assigning multiple destIDs to each node that may be the recipient of an in-order flow
  - All switch routing tables would then provide only a single output port for the destID reserved for in-order traffic
  - Example: Assign destIDs 5 and 6 to physical processing element P
    - Assume use ID 5 for adaptive traffic, 6 for in-order traffic
    - Sample routing table entry for a RIO switch then could look like:

        destID: 5   Port: 1, 2, 3
        destID: 6   Port: 3

    - Packets for destID 5 can leave through ports 1, 2, or 3, but packets for destID 6 must leave through port 3
    - All packets for destID 5 and 6 end up at the same destination, processing element P
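The destID 5 / destID 6 example above can be sketched as a small lookup structure. This is only an illustration, not the simulation model's code: the class name `SwitchRoutingTable` and its methods are hypothetical, and round-robin is assumed as the way an adaptive entry cycles through its ports.

```python
from itertools import cycle


class SwitchRoutingTable:
    """Hypothetical routing table: each destID maps to a list of allowed output ports."""

    def __init__(self, entries):
        # entries: {dest_id: [port, ...]}; a single-port list means in-order (fixed) routing
        self._ports = {dest: list(ports) for dest, ports in entries.items()}
        # One independent round-robin iterator per destID (assumed selection policy)
        self._next = {dest: cycle(ports) for dest, ports in self._ports.items()}

    def is_in_order(self, dest_id):
        """A destID with exactly one output port preserves packet order."""
        return len(self._ports[dest_id]) == 1

    def next_port(self, dest_id):
        """Return the output port for the next packet addressed to dest_id."""
        return next(self._next[dest_id])


# Both destIDs belong to the same processing element P.
table = SwitchRoutingTable({5: [1, 2, 3], 6: [3]})
adaptive_ports = [table.next_port(5) for _ in range(6)]
in_order_ports = [table.next_port(6) for _ in range(6)]
print(adaptive_ports)  # [1, 2, 3, 1, 2, 3]
print(in_order_ports)  # [3, 3, 3, 3, 3, 3]
```

The sender chooses ordering behavior simply by choosing which destID it addresses; the switch needs no per-flow state.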


Initial Model Improvements (2-10-06)

- Models already supported adaptive routing assumed similar to Honeywell RIOS “aggregate” capabilities
  - Round-robin selection of output ports from a list similar to previous example
- Expanded simulation models to allow selection of output port based on port with smallest number of packets outstanding to be sent and accepted
- Expanded models to allow random selection of output port
- Selection of port takes place prior to decision to accept or reject a packet based on buffer space, priority, etc.
- Created additional 32-node benchmarks to test usefulness of adaptive routing for traffic that cannot be statically scheduled
  - Random reads: Each processing element issues 1000 read requests to random destinations for 256 B. Request N+1 is not issued until request N is filled. Generally ~32 packets are in flight in the network at any one time.
  - Random sends 256: Each processing element issues 1000 message passing packets (256 B) to random destinations. There is a large delay after each packet is sent so that each iteration is not subject to contention prior to starting (i.e., everyone sends their packet, then waits a while, then everyone sends again at the same time, and this happens a total of 1000 times).
  - Random sends 4096: Each processing element issues 1000 full RapidIO messages (4096 B) to random destinations. There is a large delay after each message is sent so that each iteration is not subject to contention prior to starting.
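The three port-selection tactics listed above (round robin, shortest buffer, random) can be sketched as follows. This is a minimal illustration under assumptions, not the simulation model itself: the function names, the `state` dictionary, the `outstanding` mapping of port to packet count, and the port numbers are all hypothetical.

```python
import random


def select_round_robin(ports, state):
    """Pick the next port in list order; state["next"] is the rotating index."""
    port = ports[state["next"] % len(ports)]
    state["next"] = (state["next"] + 1) % len(ports)
    return port


def select_shortest_buffer(ports, outstanding):
    """Pick the port with the fewest outstanding packets (first one found on ties)."""
    return min(ports, key=lambda p: outstanding[p])


def select_random(ports, rng):
    """Pick any candidate port with equal probability."""
    return rng.choice(ports)


ports = [7, 8, 9]                    # illustrative candidate output ports
outstanding = {7: 2, 8: 0, 9: 0}     # packets waiting per port (made-up values)
state = {"next": 0}
rng = random.Random(0)

rr_choices = [select_round_robin(ports, state) for _ in range(4)]
sb_choice = select_shortest_buffer(ports, outstanding)
rnd_choice = select_random(ports, rng)
print(rr_choices)  # [7, 8, 9, 7]
print(sb_choice)   # 8
```

In each case the port is chosen first, and only afterwards would the switch decide whether to accept or reject the packet for that port.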


Experiments Overview (1)

- All experiments use the Fault-Tolerant Clos (FTC) network architecture
  - Results generally hold for any of our FT architectures with 5-switch core stage if routing is configured identically
- Adaptive routing only possible in FIRST stage if a shortest-hop path is to be taken to destination
  - First-stage switch may choose among the active core switches (up to 5 active switches), assuming packet is destined for a node NOT connected to the same first-stage switch
- Most paths traverse three switches to get from one node to another
  - Some paths only require one switch, when both source and dest are connected to same switch
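The path rules above can be written down as a short sketch. It assumes four nodes per first-stage switch with consecutive destIDs, which is inferred from the partial Switch 0 routing table in the Fixed Routing Problems slide (destIDs 0-3 map to local ports) and is not stated explicitly; all names are hypothetical.

```python
NODES_PER_FIRST_STAGE_SWITCH = 4  # assumption inferred from the Switch 0 table


def first_stage_switch(node_id):
    """Index of the first-stage switch a node attaches to (assumed consecutive IDs)."""
    return node_id // NODES_PER_FIRST_STAGE_SWITCH


def switches_on_path(src, dst):
    """Shortest path crosses 1 switch if local, else first stage, core, first stage."""
    return 1 if first_stage_switch(src) == first_stage_switch(dst) else 3


def core_choices(src, dst, active_core):
    """Core switches the source's first-stage switch may pick for a shortest path."""
    if first_stage_switch(src) == first_stage_switch(dst):
        return []  # local traffic never reaches the core, so there is nothing to choose
    return list(active_core)


print(switches_on_path(0, 3))            # 1
print(switches_on_path(0, 4))            # 3
print(core_choices(0, 16, range(5)))     # [0, 1, 2, 3, 4]
```

Once the core switch is chosen, the remainder of a shortest-hop path is determined, which is why the adaptive decision exists only at the first stage.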


Experiments Overview (2)

- For all experiments, 5-switch core assumes all 5 switches are active
- 4-switch core may represent either of two cases:
  - 4 active switches with a 5th switch unpowered as a spare
  - 4 active switches, when the 5th switch has previously failed
- 3-switch core should be interpreted similarly
- Note that based on number of nodes and network bandwidth, 5 switches is over-provisioned, 4 switches is “correct” provisioning, and 3 switches is under-provisioned


New Model/Experiment Revisions (2-23-06)

- Updated models now used for collection of fixed results
  - Old fixed-routed models had been based on switch model prior to summer 05 internship
    - Older switch model treated central switch memory as a single pool of buffer space
    - Made decision to accept or reject packets based on priority and total switch memory free (set of 4 thresholds, 1 per priority)
  - Model revised during internship to treat each output port individually, much like understanding of Honeywell RIOS
    - Decision to accept packet based on output-port dependent factors:
      - Priority and number of packets of this priority currently buffered for its destination output port
      - Total number of packets currently buffered for its destination output port
      - Total amount of free switch memory (i.e., can another packet fit in the switch at all)
  - Wasn’t a perfect “apples to apples” comparison between fixed-routed and adaptive-routed systems because adaptive systems were based on new switch model
- For shortest-buffer tactic, added capability to choose a random buffer from the set of shortest buffers rather than choosing the first one the simulation finds
- Additional experiments:
  - Changed sequence of random destinations generated
    - Insignificant effect on all results (<1%)
    - Already performing enough repetitions to fairly gather latency results for random sends experiments
  - Changed initialization of round-robin sequence to random rather than first port in the list
    - Again, insignificant effect on all results (<1%)
    - Random traffic quickly ensures that port lists of each switch are not “synchronized” at all with respect to each other
    - Fair load balance is achieved regardless of starting point of each list
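The per-output-port accept decision described above can be sketched as a single predicate. This is an illustration only: the limits (`per_priority_limit`, `per_port_limit`) and their values are hypothetical placeholders, since the slides do not give the thresholds used in the models.

```python
from dataclasses import dataclass, field


@dataclass
class OutputPortState:
    """Packets currently buffered for one output port, indexed by priority 0..3."""
    buffered_by_priority: list = field(default_factory=lambda: [0, 0, 0, 0])

    @property
    def total_buffered(self):
        return sum(self.buffered_by_priority)


def accept_packet(port, priority, free_switch_packets,
                  per_priority_limit, per_port_limit):
    """Return True if a packet of this priority may be accepted for this port."""
    if free_switch_packets < 1:
        return False  # no room anywhere in shared switch memory
    if port.total_buffered >= per_port_limit:
        return False  # this output port already holds too many packets
    # Finally, check the per-priority allowance for this output port
    return port.buffered_by_priority[priority] < per_priority_limit[priority]


port = OutputPortState([2, 0, 0, 0])   # two priority-0 packets already queued
limits = [2, 3, 4, 5]                  # hypothetical per-priority limits
print(accept_packet(port, 0, 10, limits, 6))  # False: priority-0 allowance used up
print(accept_packet(port, 1, 10, limits, 6))  # True
print(accept_packet(port, 1, 0, limits, 6))   # False: switch memory is full
```

The older single-pool model would correspond to keeping only the free-memory check, with one threshold per priority, regardless of which output port the packet needs.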


Random Reads: Revised

- Shortest buffer with random selection of shortest buffer now slightly outperforms round robin in all cases
  - Old shortest-buffer tactic most often did not make use of all available backplane switch resources
    - Caused unbalanced network load and performance penalty
  - New tactic slightly improves upon round robin in most cases
    - Round robin is simpler and still does a very good job of balancing the load of random traffic
- Fixed method performance remained mostly the same, except slightly worse in 3-switch case
  - Note this does NOT indicate that separate buffer management is a worse scheme
    - Instead, it is simply a fairer, more correct comparison
  - Switches could be configured to allow more packets of this priority (0), which would change results across the board
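One plausible reading of why the old shortest-buffer tactic left backplane switches unused is tie-breaking: when several buffers are equally short (for example, all empty), "first one found" always returns the same port. The sketch below contrasts the two tie-breaking rules. The port numbers, the all-empty buffer state, and the function names are illustrative assumptions, not values from the models.

```python
import random
from collections import Counter


def first_shortest(ports, outstanding):
    """Old tactic: the first port found with the fewest outstanding packets."""
    return min(ports, key=lambda p: outstanding[p])


def random_shortest(ports, outstanding, rng):
    """Revised tactic: a random port from the set of equally short buffers."""
    shortest = min(outstanding[p] for p in ports)
    candidates = [p for p in ports if outstanding[p] == shortest]
    return rng.choice(candidates)


ports = [5, 6, 7, 8, 9]            # illustrative ports toward five core switches
idle = {p: 0 for p in ports}       # lightly loaded switch: every buffer is empty
rng = random.Random(1)

first_counts = Counter(first_shortest(ports, idle) for _ in range(1000))
random_counts = Counter(random_shortest(ports, idle, rng) for _ in range(1000))
print(first_counts)                 # Counter({5: 1000})
print(sorted(random_counts))        # [5, 6, 7, 8, 9]
```

With ties broken at random, an otherwise idle switch spreads packets over every core port, which matches the observation that the revised tactic behaves much like an out-of-order round robin.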

[Figure: Random Read Requests (256 B): Old Results. Completion time (ns) for 5-switch, 4-switch, and 3-switch cores. Series: Round Robin, Shortest Buffer, Random Buffer, Fixed.]

[Figure: Random Read Requests (256 B): Revised. Completion time (ns). Series: Round Robin, Shortest Buffer, Random Buffer, Fixed, Random Shortest Buffer.]


Random Sends (256 B): Revised

- For 5-switch and 4-switch cases, light traffic still lends itself to fixed mapping
- Round robin and random shortest buffer now very similar in all cases
  - Random shortest buffer will behave similarly to an “out-of-order” round robin in many cases
- Fixed performance again slightly degraded in 3-switch case for reasons already discussed
  - Results further emphasize the effectiveness of adaptive routing when network is under-provisioned

[Figure: Random Message Passing Sends (256 B): Old Results. Average packet latency (ns). Series: Round Robin, Shortest Buffer, Random Buffer, Fixed.]

[Figure: Random Message Passing Sends (256 B): Revised. Average packet latency (ns). Series: Round Robin, Shortest Buffer, Random Buffer, Fixed, Random Shortest Buffer.]


Random Sends (4096 B): Revised

- New fixed setup performs worse in all cases due to more restrictive buffer management
  - Again, previous comparison was not fair
  - Current configuration could be optimized and would affect all results, not just fixed
- Fixed routing and random adaptive routing two worst options in all cases
  - Old fixed results were actually aided by unfair buffer management scheme as explained earlier
  - This experiment was most dramatically affected by the change due to the high contention and high number of retries issued
- New fixed results suffer in all cases
  - Fixed routing for under-provisioned 3-switch case now even worse than before!
  - Explanation for poor performance on following slide

[Figure: Random Message Passing Sends (4096 B): Old Results. Average packet latency (ns). Series: Round Robin, Shortest Buffer, Random Buffer, Fixed.]

[Figure: Random Message Passing Sends (4096 B): Revised. Average packet latency (ns). Series: Round Robin, Shortest Buffer, Random Buffer, Fixed, Random Shortest Buffer.]


Fixed Routing Problems

- Fixed routing for under-provisioned 3-switch case now even worse than before!
- Explanation for poor performance in both cases:
  - Imagine P0 wants to send a 4096 B message to P4
  - Imagine P1 simultaneously wants to send a 4096 B message to P16
  - Both messages must travel through switch 0, whose (partial) balanced, fixed routing table looks like:

        Switch 0
        Dest ID   Port
        0         0
        1         1
        2         2
        3         3
        4         8
        5         9
        6         7
        7         8
        8         9
        9         7
        10        8
        11        9
        12        7
        13        8
        14        9
        15        7
        16        8

  [Figure: P0 and P1 attach to Switch 0, which connects through the 2nd-level active switches to P4 and P16.]

  - Both messages are entirely serialized through Switch 0 Port 8
- With only 3 backplane switches, this scenario becomes very likely
  - But, similar scenario may occur in 4- and 5-switch cases with less frequency
- Any form of adaptive routing that will use ports 7, 8, and 9 for this traffic will be better
  - This is why even random selection on a per-packet basis performs better than fixed routing in the 3-switch case
  - Fixed also the worst method in 4-switch case by a lesser margin
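The P0-to-P4 and P1-to-P16 collision can be reproduced directly from the Switch 0 table above. The sketch below is only a load count, not a timing model; it assumes each 4096 B message is carried as 16 packets of 256 B, and the function names are hypothetical.

```python
# Partial fixed routing table for Switch 0, copied from the slide (destID -> port)
FIXED_TABLE = {0: 0, 1: 1, 2: 2, 3: 3, 4: 8, 5: 9, 6: 7, 7: 8, 8: 9,
               9: 7, 10: 8, 11: 9, 12: 7, 13: 8, 14: 9, 15: 7, 16: 8}
CORE_PORTS = [7, 8, 9]                # ports toward the 3 active core switches
PACKETS_PER_MESSAGE = 4096 // 256     # assumed segmentation: 16 packets of 256 B


def fixed_load(dests):
    """Packets queued per core port when every message follows the fixed table."""
    load = {port: 0 for port in CORE_PORTS}
    for dest in dests:
        load[FIXED_TABLE[dest]] += PACKETS_PER_MESSAGE
    return load


def round_robin_load(dests):
    """Packets queued per core port when packets are spread round-robin."""
    load = {port: 0 for port in CORE_PORTS}
    total_packets = len(dests) * PACKETS_PER_MESSAGE
    for i in range(total_packets):
        load[CORE_PORTS[i % len(CORE_PORTS)]] += 1
    return load


fixed = fixed_load([4, 16])           # P0 -> P4 and P1 -> P16 at the same time
adaptive = round_robin_load([4, 16])
print(fixed)     # {7: 0, 8: 32, 9: 0}
print(adaptive)  # {7: 11, 8: 11, 9: 10}
```

Under the fixed table all 32 packets queue on port 8 while ports 7 and 9 sit idle; any per-packet spreading across the three ports cuts the worst-case queue to roughly a third.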


Conclusions

- Optimal routing strategy highly dependent on algorithm and communication patterns
  - Adaptive routing not very useful when high traffic amounts (such as corner turns) can be adequately balanced statically
    - Previous experiments have shown it can do more harm than good
  - These experiments show adaptive routing most useful in cases of heavy network contention when large transactions cannot be statically scheduled
- In general, round robin and random shortest buffer appear to be most effective adaptive routing strategies for Clos-based RIO networks
  - Results may vary widely for other network configurations, but Clos networks are the focus here due to their FT properties and high performance
  - Random shortest buffer improved upon initial shortest buffer routing but still may not be worth the cost of extra logic required to make decisions based on buffer status
    - Effectiveness is limited in a Clos network because choice can only be made at first-stage switch
      - Even if buffer at first-stage switch is empty, the packet could be headed to a highly congested second-stage switch!
      - Do NOT want to concern switches with the status of OTHER switches in the network
    - May be more useful in some applications specifically tailored towards this routing strategy
      - But, similar queue “bypass” could be handled just using RapidIO priority mechanism already present in protocol
- Adaptive routing improved upon fixed routing in almost all experiments, even when selection of port was completely random
  - Exception was random sends (256 B) case, where traffic was so light that fixed routing was relatively efficient and balanced
  - Best case for adaptive routing was random sends (4096 B), where large messages cause problems when statically scheduled for the same output port
- Extra-switch core helpful in ALL cases when traffic is random, even without adaptive routing
  - Adaptive routing enhances usefulness of active 5th core switch


References

[1] J. M. Montanana, J. Flich, A. Robles, P. Lopez, and J. Duato, “A Transition-Based Fault-Tolerant Routing Methodology for InfiniBand Networks,” in Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.

[2] J. C. Martinez, J. Flich, A. Robles, P. Lopez, and J. Duato, “Supporting Adaptive Routing in InfiniBand Networks,” in Proceedings of the Eleventh Euromicro Conference on Parallel, Distributed, and Network-Based Processing, pp. 165-172, February 2003.
