Case study: InfiniBand CM Parameter Tuning
Mitch Gusat and Wolfgang Denzel with IBM team
IBM ZRL
IEEE 802 Plenary Session, Dallas, Nov. 2006
IBM Zurich Research Lab 2
Outline
• Problem: Congestion in IBA Networks
• Solution (elements thereof): CCA
• Need: Tune the CCA parameters
• Method: ZRL Congestion Benchmarking
• Simulation results and conclusions
Problem: Hotspot Congestion in Lossless ICTNs => Saturation Trees
[Figure: 128-port bidirectional fat tree. Source host channel adapters (HCAs) on ports 1..128 connect through four stages of 32 4x4 half SEs and a middle stage of 16 8x8 SEs to the destination host channel adapters; shortcut routes exist within the same switch elements (SE). The sources send "innocent" traffic plus one hotspot flow set toward a single hotspot destination.]
Zurich Research Laboratory
Congestion Control © 2005 IBM Corporation
Effect: Global collapse of throughput caused by single hotspot
[Figure: 128-port bidirectional fat tree (source HCAs, four stages of 32 4x4 half SEs, 16 8x8 SEs, destination HCAs, shortcut routes within the same SEs), now with too much traffic to the hotspot.]
[Plot: throughput (for each output) vs. time, 0 to 10 ms, y-axis 0 to 1, with congestion start and congestion end marked.]
• Constant offered load = 0.9 at all times
• 300% hotspot oversubscription during the congestion period
• Throughput of the hotspot output saturates
• Throughput of the other outputs breaks down
Solution (elements thereof): IBA Congestion Control Annex (CCA)
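[Diagram: Source HCA -> Switch -> Destination HCA. The switch applies a congestion threshold and marks passing packets with FECN (Forward Explicit Congestion Notification). The destination HCA returns BECN (Backward Explicit Congestion Notification) to the source HCA. At the source, each BECN moves the index (+) into the Congestion Control Table containing inter-packet delays (IPD); a timer moves the index back (−). The selected IPD is inserted between successive packets.]
Parameters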
• Switch queue threshold (~Qeq): sw_th. Congestion detection and FECN marking.
• IPD table size = 64-256 entries. Dynamic range of per-flow rate control.
• Highest IPD entry: max_ipd. Largest inter-packet gap = 1/Rmin.
• IPD index increment: ipd_idx_incr. Step on each new BECN => rate control granularity.
• IPD recovery timer: rec_time. Timeout of the autonomic rate increase timer.
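The interplay of these parameters at the source HCA can be sketched as follows. This is a minimal illustration, not the simulator's code: the type `CcaFlow`, its method names, the linear table filling, and the step-down of one entry per recovery timeout are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-flow CCA state at the source HCA (names are illustrative).
struct CcaFlow {
    std::vector<double> ipd_table;  // Congestion Control Table: inter-packet delays
    std::size_t index = 0;          // current table index; entry 0 adds no delay
    std::size_t ipd_idx_incr = 1;   // index step applied on each BECN

    // Assumption: the table is filled linearly from 0 up to max_ipd.
    CcaFlow(std::size_t table_size, double max_ipd, std::size_t incr)
        : ipd_table(table_size), ipd_idx_incr(incr) {
        assert(table_size >= 2);
        for (std::size_t i = 0; i < table_size; ++i) {
            ipd_table[i] = max_ipd * static_cast<double>(i) /
                           static_cast<double>(table_size - 1);
        }
    }

    // A BECN arrived: step up the table, saturating at the last entry.
    void on_becn() {
        index = std::min(index + ipd_idx_incr, ipd_table.size() - 1);
    }

    // The recovery timer (period rec_time) expired: step down by one entry
    // (assumed step size), which raises the flow's rate again.
    void on_recovery_timeout() {
        if (index > 0) --index;
    }

    // Delay to insert before the next packet of this flow.
    double current_ipd() const { return ipd_table[index]; }
};
```

A larger `ipd_idx_incr` throttles faster but more coarsely; a shorter `rec_time` means `on_recovery_timeout()` fires more often and the rate recovers sooner.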
Questions
1. Does it work?
2. Tuning: What parameter settings for which cases? => exploration of a combinatorial sub-space
3. Where are the limits of its operation?
Does it work?
• Qualified “yes” => needs tuning: easy for small fabrics w/ simple traffic, hard for others...
• Param tuning required per (1) fabric architecture and (2) traffic pattern
• Fragile stability (stiffness): CCA sensitivity to traffic and params...
[Figure: two panels, “Un-tuned congestion control” and “Tuned parameters (almost..)”. Each plots one trace per output, output[0]..output[31], with the hotspot marked, on a y-axis from 0 to 1.2 vs. simulation time from 0 to 500000.]
Early Observations
• With min. pkt size under high background load, ECN traffic can lead to instability
• A hotspot relay destination receiving permanent short-packet traffic may be overloaded with F/BECN signaling
• BECN and ACK packets can be caught in reverse congestion
• Reaction delay improvable by direct BCN vs. the CCA mechanism
• CCA operation is sensitive to parameters. Oscillations observed... seem to be controllable
• Parameter settings strongly depend on fabric and traffic!
Simulation Model
• Switching network
  4x4..12x12 input-buffered CIOQ switches
  Credit-based flow control
  Multistage bidirectional fat-tree network
  Source routing: static or dynamic (shortest path)
• Edge adapter (HCA) operation
  Queue pairs, credit-controlled round-robin scheduling with rate ctrl. triggered by CM
  Destination will ACK (64B pkt) after successful reception of user packets (MTUs)
  Return of short BECN packets if FECN bit is set
• CM mechanism
  Switch action: Output buffer threshold with hysteresis -> FECN bit set in packets leaving switch module and BECN packets returned from destination HCA
  SRC HCA action: Rate control according to CCA.
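The switch action, a buffer threshold with hysteresis that drives FECN marking, can be sketched as below. This is an illustrative sketch only: the class `FecnMarker` and the separate low-water mark are assumptions, since the slides give the upper threshold but not the width of the hysteresis band.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical congestion detector for one switch buffer (names are illustrative).
class FecnMarker {
public:
    // high: occupancy at which marking starts; low: occupancy at which it stops.
    FecnMarker(std::size_t high, std::size_t low) : high_(high), low_(low) {
        assert(low <= high);
    }

    // Feed the current buffer occupancy; returns true while departing
    // packets should carry the FECN bit.
    bool update(std::size_t occupancy) {
        if (!congested_ && occupancy >= high_) {
            congested_ = true;   // crossed the upper threshold: start marking
        } else if (congested_ && occupancy <= low_) {
            congested_ = false;  // drained below the lower threshold: stop marking
        }
        return congested_;
    }

private:
    std::size_t high_;
    std::size_t low_;
    bool congested_ = false;
};
```

Between the two thresholds the detector keeps its previous state, so small fluctuations around the upper threshold do not toggle marking on every packet.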
Baseline Switch: Xbar-based CIOQ with Input Buffering
[Diagram: inputs In[1]..In[n], each with a physical input buffer organized into logical output queues, feed a crossbar; one scheduler per output serves Out[1]..Out[n].]
Switch architecture
• IN buffer = 72KB
• Output service: RR
Switch element (SE) described in Omnet++, following the typical 3-method structure:

    #include <omnetpp.h>
    #include <packetdefinition.h>

    class Switch : public cSimpleModule
    {
        Module_Class_Members( Switch, cSimpleModule, 0 )
        // < some parameter and variable definitions …….. >
        virtual void initialize();
        virtual void handleMessage( cMessage *msg );
        virtual void finish();
    };

    Define_Module( Switch );

    // 1. performed before simulation run
    void Switch::initialize()
    {
        // < some initializations, allocation of queue arrays,
        //   schedule first SCHEDULE_EVENT, …….. >
    }

    // 2. handling of received messages during simulation run
    void Switch::handleMessage( cMessage *msg )
    {
        switch( msg->kind() )
        {
            case PACKET:
                // < en-queue received packet …….. >
                break;
            case SCHEDULE_EVENT:
                // < de-queue and send packet, return credit, update scheduler
                //   pointer, schedule next SCHEDULE_EVENT, ….. >
                break;
            case CREDIT:
                // < return credit to credit pool …….. >
                break;
        }
    }

    // 3. performed after simulation run
    void Switch::finish() {}
Baseline Fabric: MIN Topology => Bidir Fat Tree
• 2-level / 3-stage bidir MIN
  Simulated: 8 – 32 nodes
  Time per run: 2-30 min.
  [Figure: 32-port fabric. HCA Tx on ports 1..32, two stages of 8 4x4 modules around 4 8x8 modules, HCA Rx on ports 1..32; shortcut routes within same modules.]
• 3-level / 5-stage bidir MIN
  Simulated: 128 – 432 nodes
  Time per run: 20-500 min.
  [Figure: 128-port fabric. HCA Tx on ports 1..128, four stages of 32 4x4 modules around 16 8x8 modules, HCA Rx on ports 1..128; shortcut routes within same modules.]
Traffic: ZRL Congestion Benchmark
• Source nodes generate one or more hotspots according to matrix [λij_hot]: tp->q = αk_hot [λij_hot]: tp->q, where [λij_hot] is specified per each case below
1. Congestion type: IN- or OUTput-generated
2. Hotspot severity: HSV = λaggr / μHS, where λaggr = ∑ λi at the hotspotted output and μHS = service rate of the HS
   Mild: HSV ~ 1..2
   Moderate: HSV = 3..7
   Severe: HSV >> 10
3. Hotspot degree: HSD is the fan-in of the congestive tree at the measured hotspot
   Small: HSD < 10% (of all sources inject hot traffic)
   Medium: HSD 20..60%
   Large: HSD > 90%
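The two benchmark metrics can be computed directly from their definitions. The sketch below is illustrative: the function names are hypothetical, and expressing HSD as a fraction of all sources follows the percentage classes above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// HSV = lambda_aggr / mu_HS, where lambda_aggr is the sum of the rates
// offered to the hotspotted output and mu_HS is its service rate.
double hotspot_severity(const std::vector<double>& rates_to_hotspot, double mu_hs) {
    assert(mu_hs > 0.0);
    const double lambda_aggr =
        std::accumulate(rates_to_hotspot.begin(), rates_to_hotspot.end(), 0.0);
    return lambda_aggr / mu_hs;
}

// HSD as the fraction of all sources that inject hot traffic.
double hotspot_degree(std::size_t hot_sources, std::size_t all_sources) {
    assert(all_sources > 0 && hot_sources <= all_sources);
    return static_cast<double>(hot_sources) / static_cast<double>(all_sources);
}
```

With rates normalized to the hotspot's service rate, three sources each offering 1.0 give HSV = 3 (moderate), and 3 hot sources out of 32 give an HSD just under 10% (small).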
Sensitivity Analysis: Combinatorial Tree
Via analysis and validated by simulation, rank CCA params by sensitivity:
1. ipd_idx_incr
2. rec_time
3. max_ipd
[Figure: simulation subtree for a 128-node IB network. Tree levels from root to leaves: # stages / SE size (3/16x16, 5/8x8); hotspot degree as # hot flows (3, 32); background load (0.5, 0.9); maximum IPD (min .. max, 84us marked); IPD index step (min .. max, 20 marked); IPD recovery time (min .. max, 10.5us marked).]
• 1 sim. point: ~ 1hr
• Exhaustive coverage: ~ 78 yrs., output: ~ 23 PB of data... not feasible, therefore...
• Prune the tree by analytical pre-selection
  Network: 32-node 3-stage / 128-node 5-stage
  Traffic: single HSV = moderate congestion; two HSD = 3 and 32 flows
  ca. 800 cases were analyzed
Sample Results: “Input-generated” Congestion
Selection: 4 ‘corner’ cases
Offered load
• pkt size = 2KB (all user traffic MTU)
• negative-exponential interarrival times
• background traffic destinations uniformly distributed, load: a) 50% b) 90%
2 congestion patterns, both with HSV ~ 3
• Small HSD: during the hotspot period, 3 sources send 100% of their load to DST #31
• Large HSD: during the hotspot period, all 32 sources send 10% of their load to DST #31
Four corner cases: Un-tuned CM ~ No CM
[Figure: four panels, “large hotspot, 50% load”, “small hotspot, 50% load”, “large hotspot, 90% load” and “small hotspot, 90% load”. Each plots one trace per output, output[0]..output[31], with the hotspot marked, on a y-axis from 0 to 1.2 vs. simulation time from 0 to 500000.]
With marginal settings of ipd_idx_incr and rec_time
Focus: 4th corner case
• Load: High (0.9)
• Hotspot degree: Large (32)
• 32x32 3-stage (2-level) fat-tree
• Common parameters: packet size 2 KB, hotspot severity 300%
[Figure: throughput with congestion control, four panels, one per out-of-range parameter. Each plots throughput (for outputs 0..31), y-axis 0 to 1, vs. time from 0 to 10 ms, with congestion start and end marked.]
Out-of-range parameters
• IPD index step too low (=2)
• IPD index step too high (=40)
• IPD recovery timer too fast (=2.6us)
• IPD recovery timer too slow (=84us)
Effects annotated on the panels: oscillations, breakdown, hot flow suffering, oscillations
Tuning for IG congestion in 32 to 128-node fabrics
32-node
• sw_th = 90% of switch input buffer size
• max_ipd = 1000 ( = 21µs )
  from fairness under worst case => IPD sufficient for the flows of all other N-1 sources to access a share (1 pkt) of the HS bandwidth
• IPD table index increment ipd_idx_incr = 5 (sweet spot)
• IPD recovery time rec_time = 500 ( = 10.5µs )
128-node
• max_ipd and ipd_idx_incr ~ 4x larger than above
  linear scalability confirmed by simulations
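The paired values above (1000 = 21µs, 500 = 10.5µs) imply a time unit of about 21 ns per table value. The conversion can be sketched as follows; the constant `kNsPerUnit` is inferred from those two pairs rather than stated in the slides, and the function names are illustrative.

```cpp
#include <cassert>

// Assumption: one simulator time unit is 21 ns, inferred from
// "max_ipd = 1000 ( = 21µs )" and "rec_time = 500 ( = 10.5µs )".
const double kNsPerUnit = 21.0;

// Convert a parameter value in simulator units to microseconds.
inline double units_to_us(double units) {
    assert(units >= 0.0);
    return units * kNsPerUnit / 1000.0;
}

// Convert a duration in microseconds to simulator units.
inline double us_to_units(double us) {
    assert(us >= 0.0);
    return us * 1000.0 / kNsPerUnit;
}
```

Under this assumption, a 128-node max_ipd of about 4 x 1000 = 4000 units corresponds to roughly 84 µs.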
Output-generated congestion with CM tuned for IG
[Figure: two columns, “50% load” and “90% load”. Each column has a per-output plot of output[0]..output[31] with the hotspot marked (y-axis 0 to 1.2) and an “ipd” plot of IPD[0]..IPD[31] (y-axis 0 to 1000), both vs. simulation time from 0 to 500000.]
OG w/ “Big Bang” control
[Figure: two panels, “50% load” and “90% load”. Each plots one trace per output, output[0]..output[31], with the hotspot marked, on a y-axis from 0 to 1.2 vs. simulation time from 0 to 500000.]
Parameters for OG:
1. ipd table filling: multiplicative (vs. linear in the IG)
2. max_ipd = 10000 ( = 210us )
3. recovery time: rec_time = 10000 ( = 210us )
4. ipd increase: ipd_idx_incr[i] += 127
   ipd decrease: ipd_idx_incr[i] -= 1
OG w/ “Big Bang” control and enhancements
[Figure: two panels, “90% load, with false BECN filtering” and “90% load, 3x switch memory”. Each plots one trace per output, output[0]..output[31], with the hotspot marked, on a y-axis from 0 to 1.2 vs. simulation time from 0 to 500000.]
• With the above enhancements, a control algorithm different from the one used for IG congestion, and different table values, a (marginally) stable solution to OG congestion is feasible.
Tuning Guidelines for IG Congestion
N = network size, i.e., number of ports
HSD = maximum hot spot degree. Worst case = N, but can be less if system partitioning is known.
Absolute time (µsec) values are w/ reference to our model’s RTT of 2-20 µsec. (w/o congestion)

Item                                         Value
IPD table size                               128
Switch threshold for congestion detection    90% of queue capacity, with hysteresis
IPD table index increment                    min(1/6·N, 1/2·HSD)
Max IPD value                                2/3·HSD µsec.
Recovery timer                               10 µsec.
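The guideline table can be turned into a small helper. This is a sketch under stated assumptions: the struct `IgTuning` and the function `ig_guideline` are hypothetical names, the index increment is returned unrounded, and the time values carry the same caveat as the table (they refer to the model's RTT of 2-20 µsec.).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the IG tuning guideline values.
struct IgTuning {
    int    ipd_table_size;   // entries
    double sw_th_fraction;   // fraction of queue capacity (with hysteresis)
    double ipd_idx_incr;     // table index increment per BECN, unrounded
    double max_ipd_us;       // maximum IPD value in microseconds
    double rec_time_us;      // recovery timer in microseconds
};

// n = network size (number of ports), hsd = maximum hotspot degree.
// In the worst case hsd == n.
IgTuning ig_guideline(int n, int hsd) {
    assert(n > 0 && hsd > 0 && hsd <= n);
    IgTuning t;
    t.ipd_table_size = 128;
    t.sw_th_fraction = 0.90;
    t.ipd_idx_incr   = std::min(n / 6.0, hsd / 2.0);  // min(1/6*N, 1/2*HSD)
    t.max_ipd_us     = 2.0 * hsd / 3.0;               // 2/3 * HSD
    t.rec_time_us    = 10.0;
    return t;
}
```

For N = HSD = 32 this yields an index increment of about 5.3 and a maximum IPD of about 21.3 µs, in line with the 32-node settings quoted earlier in the deck (ipd_idx_incr = 5, max_ipd = 21 µs).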
Limitations
• Stability limits...?
  Large hotspot degree (128 sources send 5% to hotspot) with 50% background load has no solution in the param space
• Input- and output-generated congestion require two distinct control loops (table values and rate control algorithms)
  Corollary: Any mixture of IG and OG must be controlled by the dominant case, although the solution might not be stable.
• Why?
  Under-sampling: generally not enough true BECNs for convergence
  Next problem: false BECNs
» That’s all.
Contributors
N. Ni†, G. Pfister†, C. Minkenberg*, T. Engbersen*, D. Craddock§, W. Rooney§, R. Luijten* and T. Schlipf‡
* IBM Research GmbH, Rüschlikon, Switzerland; § IBM Systems and Technology Group, Poughkeepsie, NY USA;
† IBM Systems and Technology Group, Austin, TX USA; ‡ IBM Systems and Technology Group Boeblingen, Germany.
Contact: [email protected]