27 February 2006
RapidIO FT Research Update: Adaptive Routing
David Bueno
February 27, 2006
HCS Research Laboratory
Dept. of Electrical and Computer Engineering
University of Florida
Overview
- RapidIO switches traditionally handle routing using routing tables with destID-output port pairs
  - Packet route is NOT specified at source; instead it is determined by switches as the packet travels through the network
  - Routing tables are generally fixed, with one port for each destID
- Want to explore capabilities of adaptive routing in RapidIO switches for purposes of performance (load balancing) and fault tolerance
  - Many of our FT network designs provide the option of "over-provisioning" the backplane by providing an extra switch
  - Initial experiments with high-bandwidth corner turns found it best to leave the extra switch inactive, as no benefits were gained by using it in active mode with a fixed-routed application
  - Early GMTI experiments tested adaptive round-robin routing with GMTI corner turns and found a fixed-routed version performed better
  - Lesson learned here is that if an application CAN be effectively statically routed and the network provides enough bandwidth, use fixed routing
Background
- Several relevant papers on this topic uncovered in previous literature searches
  - [1] and [2] of most interest since they deal with expanding an existing protocol (InfiniBand) to support adaptive routing
  - Unlike IBA, RIO spec does not forbid adaptive routing, but leaves implementation up to developer
- One important issue in implementing adaptive routing in a RIO system is in-order delivery
  - For traffic flows requiring in-order delivery, [2] suggests assigning multiple destIDs to each node that may be the recipient of an in-order flow
  - All switch routing tables would then provide only a single output port for the in-order destID
- Example: Assign destIDs 5 and 6 to physical processing element P
  - Assume ID 5 is used for adaptive traffic, 6 for in-order traffic
  - Sample routing table entry for a RIO switch then could look like:
      destID: 5   Port: 1, 2, 3
      destID: 6   Port: 3
  - Packets for destID 5 can leave through ports 1, 2, or 3, but packets for destID 6 must leave through port 3
  - All packets for destID 5 and 6 end up at the same destination, processing element P
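The two-destID example can be sketched as a small lookup table. This is only an illustration of the idea suggested in [2], not the simulation model itself: the names ROUTING_TABLE, candidate_ports, and route are hypothetical, and picking an adaptive port at random is just one possible policy.

```python
import random

# Hypothetical routing table for one switch, copied from the example:
# destID 5 (adaptive traffic) may leave through ports 1, 2, or 3;
# destID 6 (in-order traffic) may leave only through port 3.
ROUTING_TABLE = {
    5: [1, 2, 3],
    6: [3],
}


def candidate_ports(dest_id):
    """Return the output ports this switch permits for the given destID."""
    return ROUTING_TABLE[dest_id]


def route(dest_id, rng=random):
    """Pick one permitted output port.

    An in-order destID has a single-entry port list, so every packet for it
    follows the same path and cannot be reordered by this switch.
    """
    return rng.choice(candidate_ports(dest_id))
```

Both destIDs resolve to processing element P; only the set of permitted paths differs.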
Initial Model Improvements (2-10-06)
- Models already supported adaptive routing assumed similar to Honeywell RIOS "aggregate" capabilities
  - Round-robin selection of output ports from a list similar to previous example
- Expanded simulation models to allow selection of output port based on port with smallest number of packets outstanding to be sent and accepted
- Expanded models to allow random selection of output port
- Selection of port takes place prior to decision to accept or reject a packet based on buffer space, priority, etc.
- Created additional 32-node benchmarks to test usefulness of adaptive routing for traffic that cannot be statically scheduled
  - Random reads: Each processing element issues 1000 read requests to random destinations for 256 B. Request N+1 is not issued until request N is filled. Generally ~32 packets are in flight in the network at any one time.
  - Random sends 256: Each processing element issues 1000 message passing packets (256 B) to random destinations. There is a large delay after each packet is sent so that each iteration is not subject to contention prior to starting (i.e. everyone sends their packet, then waits awhile, then everyone sends again at the same time, and this happens a total of 1000 times).
  - Random sends 4096: Each processing element issues 1000 full RapidIO messages (4096 B) to random destinations. There is a large delay after each message is sent so that each iteration is not subject to contention prior to starting.
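A minimal sketch of the three port-selection tactics named on this slide (round robin, shortest buffer, random). The class and method names are hypothetical and the real simulation models are not shown in these slides; "outstanding" is assumed to be a mapping from port number to the count of packets still waiting to be sent and accepted.

```python
import random


class PortSelector:
    """Chooses an output port from the list a routing table entry allows."""

    def __init__(self, ports, seed=None):
        self.ports = list(ports)
        self._next = 0                      # round-robin position
        self._rng = random.Random(seed)

    def round_robin(self):
        """Cycle through the port list in order."""
        port = self.ports[self._next]
        self._next = (self._next + 1) % len(self.ports)
        return port

    def shortest_buffer(self, outstanding):
        """Return the first port found with the fewest outstanding packets."""
        return min(self.ports, key=lambda p: outstanding[p])

    def random_port(self):
        """Pick any permitted port uniformly at random."""
        return self._rng.choice(self.ports)
```

In each case the port is chosen first; whether the switch then accepts or rejects the packet is a separate decision.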
Experiments Overview (1)
- All experiments use the Fault-Tolerant Clos (FTC) network architecture
  - Results generally hold for any of our FT architectures with 5-switch core stage if routing is configured identically
- Adaptive routing only possible in FIRST stage if a shortest-hop path is to be taken to destination
  - First-stage switch may choose between any active core switch (up to 5 active switches), assuming packet is destined for a node NOT connected to the same first-stage switch
- Most paths traverse three switches to get from one node to another
  - Some paths only require one switch, when both source and dest are connected to same switch
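The path rules above can be written down as a short sketch. The grouping of nodes onto first-stage switches is an assumption made only for illustration (4 consecutive node IDs per first-stage switch); the function names are hypothetical.

```python
# Assumption for illustration only: consecutive node IDs share a
# first-stage switch, 4 nodes per switch.
NODES_PER_FIRST_STAGE_SWITCH = 4


def first_stage_switch(node_id):
    """Index of the first-stage switch a node is attached to."""
    return node_id // NODES_PER_FIRST_STAGE_SWITCH


def switches_traversed(src, dst):
    """1 switch if both nodes share a first-stage switch, otherwise 3."""
    return 1 if first_stage_switch(src) == first_stage_switch(dst) else 3


def adaptive_choices(src, dst, active_core_switches):
    """Number of core switches the first-stage switch may choose between.

    Traffic that stays on one first-stage switch never reaches the core,
    so there is nothing to choose.
    """
    if first_stage_switch(src) == first_stage_switch(dst):
        return 0
    return active_core_switches
```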
Experiments Overview (2)
- For all experiments, 5-switch core assumes all 5 switches are active
- 4-switch core may represent either of two cases:
  - 4 active switches with a 5th switch unpowered as a spare
  - 4 active switches, when the 5th switch has previously failed
- 3-switch core should be interpreted similarly
- Note that based on number of nodes and network bandwidth, 5 switches is over-provisioned, 4 switches is "correct" provisioning, and 3 switches is under-provisioned
New Model/Experiment Revisions (2-23-06)
- Updated models now used for collection of fixed results
  - Old fixed-routed models had been based on switch model prior to summer 05 internship
  - Older switch model treated central switch memory as a single pool of buffer space
    - Made decision to accept or reject packets based on priority and total switch memory free (set of 4 thresholds, 1 per priority)
  - Model revised during internship to treat each output port individually, much like understanding of Honeywell RIOS
  - Decision to accept packet based on output-port dependent factors:
    - Priority and number of packets of this priority currently buffered for its destination output port
    - Total number of packets currently buffered for its destination output port
    - Total amount of free switch memory (i.e. can another packet fit in the switch at all)
  - Previous comparison wasn't a perfect "apples to apples" comparison between fixed-routed and adaptive-routed systems because adaptive systems were based on new switch model
- For shortest-buffer tactic, added capability to choose a random buffer from the set of shortest buffers rather than choosing the first one the simulation finds
- Additional experiments:
  - Changed sequence of random destinations generated
    - Insignificant effect on all results (<1%)
    - Already performing enough repetitions to fairly gather latency results for random sends experiments
  - Changed initialization of round-robin sequence to random rather than first port in the list
    - Again, insignificant effect on all results (<1%)
    - Random traffic quickly ensures that port lists of each switch are not "synchronized" at all with respect to each other
    - Fair load balance is achieved regardless of starting point of each list
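The per-output-port accept decision of the revised switch model could look roughly like the sketch below. The slides list the three factors but not the actual rule or threshold values, so the comparison logic, the limit parameters, and all names here are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OutputPortState:
    # Packets currently buffered for this output port, indexed by priority 0..3.
    buffered_by_priority: list = field(default_factory=lambda: [0, 0, 0, 0])

    @property
    def total_buffered(self):
        return sum(self.buffered_by_priority)


def accept_packet(port, priority, free_memory_packets,
                  per_priority_limit, per_port_limit):
    """Decide whether a packet bound for `port` is accepted.

    All limits are hypothetical parameters, measured in packets.
    """
    # Factor 3: can another packet fit in the switch at all?
    if free_memory_packets < 1:
        return False
    # Factor 2: total packets already buffered for this output port.
    if port.total_buffered >= per_port_limit:
        return False
    # Factor 1: packets of this priority already buffered for this port.
    if port.buffered_by_priority[priority] >= per_priority_limit[priority]:
        return False
    return True
```

Under the older single-pool model only the free-memory check (with one threshold per priority) would apply, which is why a busy output port could not push back on its own.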
Random Reads: Revised
- Shortest buffer with random selection of shortest buffer now slightly outperforms round robin in all cases
  - Old shortest-buffer tactic most often did not make use of all available backplane switch resources
    - Caused unbalanced network load and performance penalty
  - New tactic slightly improves upon round robin in most cases
  - Round robin is simpler and still does very good job of balancing the load of random traffic
- Fixed method performance remained mostly the same, except slightly worse in 3-switch case
  - Note this does NOT indicate that separate buffer management is a worse scheme
  - Instead, it is just simply a more fair, correct comparison
  - Switches could be configured to allow more packets of this priority (0), which would change results across the board
[Figure: two charts, "Random Read Requests (256 B): Old Results" and "Random Read Requests (256 B): Revised". Y-axis: Completion Time (ns), 2700000 to 3050000. X-axis: 5-Switch Core, 4-Switch Core, 3-Switch Core. Series: Round Robin, Shortest Buffer, Random Buffer, Fixed; the revised chart adds Random Shortest Buffer.]
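The difference between the old and new shortest-buffer tactics can be illustrated with a small sketch. When several buffers tie for shortest (for example, when all are empty), a "first found" rule always returns the same port, while a random tie-break spreads the load. The function names and the port numbers 7, 8, 9 are chosen only for illustration and do not come from the simulation code.

```python
import random
from collections import Counter


def first_shortest(outstanding):
    """Old tactic: the first port found with the fewest outstanding packets."""
    return min(outstanding, key=outstanding.get)


def random_shortest(outstanding, rng):
    """New tactic: a random port from the set of shortest buffers."""
    shortest = min(outstanding.values())
    ties = [port for port, n in outstanding.items() if n == shortest]
    return rng.choice(ties)


# Three core-facing ports, all with empty buffers (a three-way tie).
rng = random.Random(0)
empty = {7: 0, 8: 0, 9: 0}
first_counts = Counter(first_shortest(empty) for _ in range(300))
random_counts = Counter(random_shortest(empty, rng) for _ in range(300))
```

Here first_counts contains only port 7, whereas random_counts spreads the 300 selections over all three ports.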
Random Sends (256 B): Revised
- For 5-switch and 4-switch cases, light traffic still lends itself to fixed mapping
- Round robin and random shortest buffer now very similar in all cases
  - Random shortest buffer will behave similarly to an "out-of-order" round robin in many cases
- Fixed performance again slightly degraded in 3-switch case for reasons already discussed
  - Results further emphasize the effectiveness of adaptive routing when network is under-provisioned
[Figure: two charts, "Random Message Passing Sends (256 B): Old Results" and "Random Message Passing Sends (256 B): Revised". Y-axis: Average Packet Latency (ns), 2400 to 2750. X-axis: 5-Switch Core, 4-Switch Core, 3-Switch Core. Series: Round Robin, Shortest Buffer, Random Buffer, Fixed; the revised chart adds Random Shortest Buffer.]
Random Sends (4096 B): Revised
- New fixed setup performs worse in all cases due to more restrictive buffer management
  - Again, previous comparison was not fair
  - Current configuration could be optimized and would affect all results, not just fixed
- Fixed routing and random adaptive routing two worst options in all cases
  - Old fixed results were actually aided by unfair buffer management scheme as explained earlier
  - This experiment was most dramatically affected by the change due to the high contention and high number of retries issued
- New fixed results suffer in all cases
  - Fixed routing for under-provisioned 3-switch case now even worse than before!
  - Explanation for poor performance on following slide
[Figure: two charts, "Random Message Passing Sends (4096 B): Old Results" and "Random Message Passing Sends (4096 B): Revised". Y-axis: Average Packet Latency (ns), 10000 to 13500 (old) and 10000 to 14000 (revised). X-axis: 5-Switch Core, 4-Switch Core, 3-Switch Core. Series: Round Robin, Shortest Buffer, Random Buffer, Fixed; the revised chart adds Random Shortest Buffer.]
Fixed Routing Problems
- Fixed routing for under-provisioned 3-switch case now even worse than before!
- Explanation for poor performance in both cases:
  - Imagine P0 wants to send a 4096 B message to P4
  - Imagine P1 simultaneously wants to send a 4096 B message to P16
  - Both messages must travel through switch 0, whose (partial) balanced, fixed routing table looks like:

      Switch 0
      Dest ID   Port
      0         0
      1         1
      2         2
      3         3
      4         8
      5         9
      6         7
      7         8
      8         9
      9         7
      10        8
      11        9
      12        7
      13        8
      14        9
      15        7
      16        8

[Figure: diagram labeled P0, P1, Switch 0, 2nd-Level Active Switches, P4, and P16.]

- Both messages are entirely serialized through Switch 0 Port 8
  - With only 3 backplane switches, this scenario becomes very likely
  - But, similar scenario may occur in 4- and 5-switch cases with less frequency
- Any form of adaptive routing that will use ports 7, 8, and 9 for this traffic will be better
  - This is why even random selection on a per-packet basis performs better than fixed routing in the 3-switch case
  - Fixed also the worst method in 4-switch case by a lesser margin
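The collision can be checked directly against the Switch 0 table from this slide. The sketch below copies the table as given; the round-robin helper and all names are hypothetical and stand in for any adaptive tactic that spreads packets over ports 7, 8, and 9.

```python
# Switch 0 fixed routing table, copied from the slide (destID -> output port).
SWITCH0_FIXED = {
    0: 0, 1: 1, 2: 2, 3: 3,
    4: 8, 5: 9, 6: 7, 7: 8, 8: 9, 9: 7, 10: 8,
    11: 9, 12: 7, 13: 8, 14: 9, 15: 7, 16: 8,
}

# Ports of switch 0 that lead to the three active backplane switches.
CORE_PORTS = [7, 8, 9]


def fixed_port(dest_id):
    """Output port under fixed routing: one port per destID."""
    return SWITCH0_FIXED[dest_id]


def round_robin_ports(dest_ids, ports=CORE_PORTS):
    """Assign successive packets to core-facing ports in round-robin order."""
    return [ports[i % len(ports)] for i, _ in enumerate(dest_ids)]


# P0 -> P4 and P1 -> P16 at the same time.
fixed = [fixed_port(4), fixed_port(16)]   # both map to port 8
adaptive = round_robin_ports([4, 16])     # two different ports
```

With fixed routing both messages queue on port 8; with the round-robin assignment they leave through different ports and can proceed in parallel.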
Conclusions
- Optimal routing strategy highly dependent on algorithm and communication patterns
  - Adaptive routing not very useful when high traffic amounts (such as corner turns) can be adequately balanced statically
    - Previous experiments have shown it can do more harm than good
  - These experiments show adaptive routing most useful in cases of heavy network contention when large transactions cannot be statically scheduled
- In general, round robin and random shortest buffer appear to be most effective adaptive routing strategies for Clos-based RIO networks
  - Results may vary widely for other network configurations, but Clos networks are the focus here due to their FT properties and high performance
- Random shortest buffer improved upon initial shortest buffer routing but still may not be worth the cost of extra logic required to make decisions based on buffer status
  - Effectiveness is limited in a Clos network because choice can only be made at first-stage switch
    - Even if buffer at first-stage switch is empty, packet could be headed to a highly congested second-stage switch!
    - Do NOT want to concern switches with the status of OTHER switches in the network
  - May be more useful in some applications specifically tailored towards this routing strategy
    - But, similar queue "bypass" could be handled just using RapidIO priority mechanism already present in protocol
- Adaptive routing improved upon fixed routing in almost all experiments, even when selection of port was completely random
  - Exception was random sends (256 B) case, where traffic was so light that fixed routing was relatively efficient and balanced
  - Best case for adaptive routing was random sends (4096 B), where large messages cause problems when statically scheduled for the same output port
- Extra-switch core helpful in ALL cases when traffic is random, even without adaptive routing
  - Adaptive routing enhances usefulness of active 5th core switch
References
[1] J. M. Montanana, J. Flich, A. Robles, P. Lopez, and J. Duato, "A Transition-Based Fault-Tolerant Routing Methodology for InfiniBand Networks," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.
[2] J. C. Martinez, J. Flich, A. Robles, P. Lopez, and J. Duato, "Supporting Adaptive Routing in InfiniBand Networks," in Proceedings of the Eleventh Euromicro Conference on Parallel, Distributed, and Network-Based Processing, pp. 165-172, February 2003.