June 4, 2005, MoBS 2005: Sampling and Stability in TCP/IP Workloads. Lisa Hsu, Ali Saidi, Nathan...
Slide 1 (June 4, 2005, MoBS 2005)
Sampling and Stability in TCP/IP Workloads
Lisa Hsu, Ali Saidi, Nathan Binkert
Prof. Steven Reinhardt
University of Michigan
Slide 2: Background
- During networking experiments, some runs would inexplicably get no bandwidth
- Searched high and low for what was “wrong”
  - Simulator bug?
  - Benchmark bug?
  - OS bug?
- Answer: none of the above
Slide 3: The Real Answer
- Simulation methodology!?
  - Tension between speed and accuracy in simulation
  - Want to capture representative portions of simulation WITHOUT running the entire application
  - Solution: fast functional simulation
- So what’s the problem here?
Slide 4: TCP Tuning
- TCP tunes itself to the performance of the underlying system
- Sets its send rate based on perceived end-to-end bandwidth
  - Performance of network
  - Performance of receiver
- During checkpointing simulation, TCP had tuned to the performance of a meaningless system
- After switching to detailed simulation, the dramatic change in underlying system performance disrupted flow
Slide 5: Timing Dependence
- The degree to which an application’s performance depends upon execution timing (e.g. memory latencies)
- Three classes:
  - Non-timing dependent (like SPEC2000)
  - Weakly timing dependent (like multithreaded)
  - Strongly timing dependent
Slide 6: Strongly Timing Dependent
[Diagram: execution path]
Packet from application
  - Perceived bandwidth high: send it now!
  - Perceived bandwidth low: wait till later
Application execution depends on stored feedback state from the underlying system (like TCP/IP workloads)
Slide 7: Correctness Issue
[Diagram: execution path, divided into Functional Simulation and Detailed Simulation]
Packet from application
  - Perceived bandwidth high: send it now!
  - Perceived bandwidth low: wait till later
Diagram label: MEANINGLESS
Slide 8: Need to…
Packet from application
  - Perceived bandwidth high: send it now!
  - Perceived bandwidth low: wait till later
Perceived bandwidth reflects that of the configuration under test
Safe to take data!!
Slide 9: Goals
- More rigorous characterization of this phenomenon
- Determine severity of this tuning problem across a variety of networking workloads
  - Network link latency sensitivity?
  - Benchmark type sensitivity?
  - Functional CPU performance sensitivity?
Slide 10: M5 Simulator
- Network-targeted full-system simulator
- Real NIC model
  - National Semiconductor DP83820 GigE Ethernet Controller
- Boots Linux 2.6
  - Uses Linux 2.6 driver for DP83820
- All systems (and link) modeled in a single process
  - Synchronization between systems managed by a global tick frequency
Slide 11: Operating Modes

Mode                        | Used for         | Wall clock speed | Simulated CPU speed
Pure Functional (PF)        | Checkpointing    | Very fast        | 1 or 8 IPC, 1-cycle memory
Functional with Caches (FC) | Cache warmup     | Fast             | 1 IPC + blocking caches (<< 1 IPC)
Detailed (D)                | Data measurement | Very slow        | OoO superscalar, non-blocking caches

Simulated CPU speed annotations on the slide:
  - FC (1 IPC + blocking caches, << 1 IPC): SLOWEST
  - PF (1 or 8 IPC, 1-cycle memory): FASTER and FASTEST
  - D (OoO superscalar, non-blocking caches): FASTER
Slide 12: Benchmarks
- 2-system client/server configuration
  - Netperf
    - Stream: a transmit microbenchmark
    - Maerts: a receive microbenchmark
  - SPECWeb99
- NAT configuration (3-system config)
  - Netperf maerts with a NAT gateway between client and server
Slide 13: Experimental Configuration
[Diagram: System Under Test (sender/NAT/receiver) connected by a link to the Drive System (receiver/sender)]
Diagram labels, in slide order: PF8; CHECKPOINTING; PF1/PF8; CACHE WARMUP; FC1 cache; MEASUREMENT; D; (x2 if NAT)
Slide 14: “Graph Theory”
- Tuning periods after CPU model changes?
- How long do they last?
- Which graph minimizes the Detailed modeling time necessary?
- Effects of checkpointing PF width?
Slide 15: Netperf Maerts
[Figure: two panels, "Detailed" and "FC->Detailed", plotting bandwidth (Gbps) against millions of cycles (10 to 100) for width=1 and width=8. Annotations: COV 1.66%; COV .5%; "PF checkpoints loaded, transition to D or FC"; "FC cache warmup ends, transition to D"; "Known achievable bandwidth by each system configuration"; "Tuning period" (twice); "No tuning!"; "bears brunt of tuning time"]
Takeaways:
1) A shift from a “high performance” CPU to a lower-performance one causes more drastic tuning periods
2) A shift from lower performance to higher performance has a more gentle transition
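The COV labels on the Maerts graphs are coefficients of variation. A minimal sketch of how such a figure could be computed from per-interval bandwidth samples is given here; the sample values are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the slides do not say which was used.

```python
import statistics

def coefficient_of_variation(samples):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("COV is undefined when the mean is zero")
    return statistics.stdev(samples) / mean

# Hypothetical per-interval bandwidth samples in Gbps (not taken from the slides).
steady = [5.0, 5.1, 4.9, 5.0, 5.0]   # after TCP has tuned to the system
tuning = [0.0, 1.0, 3.0, 5.0, 5.0]   # samples that include a tuning period

cov_steady = coefficient_of_variation(steady)
cov_tuning = coefficient_of_variation(tuning)
print(f"steady COV = {cov_steady:.2%}, tuning COV = {cov_tuning:.2%}")
```

A low COV over the measurement window is one plausible signal that the tuning period is over and data can be taken.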
Slide 16: Netperf Stream
- Why no tuning periods?
  - Because it is SENDER limited!
  - Change in performance is local: no feedback from network or receiver required
  - Thus changes in send rate can be immediate
[Figure: two panels, "FC->Detailed" and "Detailed", plotting bandwidth (Gbps, 0 to 2.5) against millions of cycles (10 to 100) for width = 1 and width = 8]
Slide 17: NAT Netperf Maerts
[Figure: two panels, "FC->Detailed" (0 to 4.5 Gbps) and "Detailed" (0 to 2.5 Gbps), plotting bandwidth against millions of cycles (10 to 100) for width = 1 and width = 8]
[Diagram: sender, NAT, receiver. The NAT is the System Under Test; CPU changes are applied there]
The “pipe” is changing. This feedback takes longer to receive in TCP because it is not explicit, and it may ruin the simulation.
Slide 18: TCP Kernel Parameters
TCP RULES:
- pouts may NOT exceed cwnds
- bytes(pouts) may NOT exceed sndwnds
[Figure: "Detailed Kernel Params", plotting pouts, cwnds and sndwnds against millions of cycles (0.5 to 90.5); one axis in packets (0 to 300), a second axis ranging from 37490 to 37590]
- pouts: unACKed packets in flight
- cwnds: congestion window (in packets)
  - Reflects state of the network pipe
- sndwnds: available receiver buffer space (in bytes)
  - Reflects receiver’s ability to receive
Deadlock? Solved in the real world by TCP timeouts, but those would take much too long to simulate
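The two TCP rules on this slide can be sketched as a single send check. This is an illustrative model, not the Linux kernel code: `SenderState`, `can_send` and the 1460-byte packet size are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class SenderState:
    pouts: int        # unACKed packets in flight
    cwnd: int         # congestion window, in packets
    sndwnd: int       # available receiver buffer space, in bytes
    mss: int = 1460   # assumed bytes per packet (hypothetical value)

def can_send(state: SenderState) -> bool:
    """Return True if one more full-sized packet would respect both rules."""
    packets_after = state.pouts + 1
    within_cwnd = packets_after <= state.cwnd                   # pouts <= cwnds
    within_sndwnd = packets_after * state.mss <= state.sndwnd   # bytes(pouts) <= sndwnds
    return within_cwnd and within_sndwnd

ok = can_send(SenderState(pouts=10, cwnd=20, sndwnd=64000))
cwnd_limited = can_send(SenderState(pouts=20, cwnd=20, sndwnd=64000))
rcv_limited = can_send(SenderState(pouts=10, cwnd=20, sndwnd=15000))
print(ok, cwnd_limited, rcv_limited)
```

If both limits stay stale after a CPU model switch, the sender can sit with `can_send` false until new feedback arrives, which is the stall the slide flags as a possible deadlock.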
Slide 19: SPECWeb99
[Figure: two panels, "Detailed" and "FC->Detailed", plotting bandwidth (Gbps, 0 to 6) against millions of cycles for width = 1 and width = 8]
- Much more complex than Netperf
- Harder to understand fundamental interactions
- Speculations in paper, but understanding this more deeply is definitely future work
Slide 20: What About Link Delay?
[Figure: "Maerts Link Delay Comparison", plotting bandwidth (Gbps, 0 to 6) against millions of cycles (10 to 1090) for Zero delay and 400us Delay]
[Figure: "400us Delay Kernel Parameters", plotting pouts and cwnds (packets, 0 to 300) against millions of cycles (0.5 to 1100)]
- TCP algorithm: cwnd can increase only upon receipt of an ACK packet
- Ramp-up of cwnd is limited by RTT
- KEY POINT: tuning time is sensitive to RTT
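Because cwnd grows only when ACKs arrive, each growth step costs one RTT. A rough sketch of the resulting ramp-up time is given here, under the textbook assumptions that cwnd doubles once per RTT in slow start and grows by one packet per RTT otherwise. The function names and the RTT values are hypothetical; the slides give a 400us link delay but not the resulting RTT.

```python
def doubling_rounds(target_cwnd, initial_cwnd=1):
    """Count the RTTs needed to reach target_cwnd when cwnd doubles every RTT."""
    rounds = 0
    cwnd = initial_cwnd
    while cwnd < target_cwnd:
        cwnd *= 2
        rounds += 1
    return rounds

def slow_start_ramp_time(target_cwnd, rtt_seconds, initial_cwnd=1):
    """Ramp-up time in seconds under per-RTT doubling."""
    return doubling_rounds(target_cwnd, initial_cwnd) * rtt_seconds

def additive_ramp_time(start_cwnd, target_cwnd, rtt_seconds):
    """Ramp-up time in seconds when cwnd grows by one packet per RTT."""
    return max(target_cwnd - start_cwnd, 0) * rtt_seconds

# Hypothetical RTTs: a short link versus a link with added delay.
for rtt in (100e-6, 800e-6):
    t = slow_start_ramp_time(256, rtt)
    print(f"RTT {rtt * 1e6:.0f} us: about {t * 1e3:.2f} ms to reach cwnd 256")
```

In both growth modes the ramp-up time scales linearly with RTT, which matches the key point that tuning time is sensitive to RTT.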
Slide 21: Conclusions
- TCP/IP workloads require a tuning period relative to the network RTT when receiver limited
- Sender-limited workloads are generally not problematic
- Some cases lead to unstable system behavior
- Tips for minimizing tuning time:
  - “Slow” fast-forwarding CPU
  - Try different switchover points
  - Use a fast-ish cache warmup period to bear the brunt of the transition
Slide 22: Future Work
- Identify other strongly timing dependent workloads (feedback directed optimization?)
- Examine SPECWeb behavior further
- Further investigate protocol interactions that cause zero bandwidth periods
  - Hopefully this leads to a more rigorous avoidance method
Slide 24: Non-Timing Dependent
[Diagram: execution path through a memory access, with labels Perfect Cache, L1, HIT, MISS]
Single-threaded, application-only execution (like SPEC2000)
Slide 25: Weakly Timing Dependent
[Diagram: execution path through a memory access]
  - Perfect Cache: continue
  - L1 Miss: idle loop
  - RAM access: schedule different thread
Application execution tied to OS decisions (like multi-threaded apps)
Slide 26: Basic TCP Overview
- Congestion Control Algorithm
  - Match send rate to the network’s ability to receive it
- Flow Control Algorithm
  - Match send rate to the receiver’s ability to receive it
- Overall goal:
  - Send data as fast as possible without overwhelming the system, which would effectively cause slowdown
Slide 27: Congestion Control
- Feedback in the form of
  - Timeouts
  - Duplicate ACKs
- Feedback dictates the Congestion Window parameter
  - Limits the number of unACKed packets out at a given time (i.e. send rate)
Slide 28: Congestion Control cont.
- Slow Start
  - Congestion window starts at 1; every ACK received increases it by 1, so the window doubles each RTT (exponential increase)
- Additive Increase, Multiplicative Decrease (AIMD)
  - Each RTT worth of ACKs increases the window by about 1; losses perceived by DupACK halve the window
- Timeout recovery
  - Upon timeout, go back to slow start
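The three behaviors on this slide can be sketched as a per-event window update. This is a simplified model that assumes the textbook Reno-style rules, in which congestion avoidance adds roughly one packet per RTT (1/cwnd per ACK). It is not the Linux 2.6 implementation, and `next_cwnd`, `run_round` and the `ssthresh` handling are illustrative names and choices.

```python
def next_cwnd(cwnd, ssthresh, event):
    """Return (cwnd, ssthresh) after one 'ack', 'dupack_loss' or 'timeout' event."""
    if event == "ack":
        if cwnd < ssthresh:
            return cwnd + 1.0, ssthresh       # slow start: +1 per ACK
        return cwnd + 1.0 / cwnd, ssthresh    # additive increase: about +1 per RTT
    if event == "dupack_loss":
        half = max(cwnd / 2.0, 1.0)
        return half, half                     # multiplicative decrease
    if event == "timeout":
        return 1.0, max(cwnd / 2.0, 2.0)      # go back to slow start
    raise ValueError(f"unknown event: {event}")

def run_round(cwnd, ssthresh):
    """Apply one RTT worth of ACKs, one per packet currently in the window."""
    for _ in range(int(cwnd)):
        cwnd, ssthresh = next_cwnd(cwnd, ssthresh, "ack")
    return cwnd, ssthresh

cwnd, ssthresh = 1.0, 8.0
history = []
for _ in range(4):
    cwnd, ssthresh = run_round(cwnd, ssthresh)
    history.append(cwnd)
print(history)  # three doubling rounds, then roughly +1 in the fourth
```

The window doubles per round until it reaches `ssthresh`, then creeps up by about one packet per round, so recovering from a halved or reset window takes a number of RTTs rather than a fixed amount of time.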
Slide 29: Flow Control
- Feedback in the form of explicit TCP header notifications
  - Receiver tells sender how much kernel buffer space it has available
- Feedback dictates the send window parameter
  - Limits the amount of unACKed data out at any given time
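A minimal sketch of the flow-control feedback loop, assuming the receiver advertises its free kernel buffer space and the sender subtracts what is already in flight. The function names and byte counts are hypothetical.

```python
def advertised_window(buffer_size, bytes_unread):
    """Free receiver buffer space, in bytes, as reported in the TCP header."""
    return max(buffer_size - bytes_unread, 0)

def sendable_bytes(window, bytes_in_flight):
    """How many more bytes the sender may put on the wire right now."""
    return max(window - bytes_in_flight, 0)

# Hypothetical 64 KB receive buffer.
fast_receiver = advertised_window(65536, bytes_unread=4096)    # application drains quickly
slow_receiver = advertised_window(65536, bytes_unread=61440)   # application falls behind

print(sendable_bytes(fast_receiver, bytes_in_flight=16384))
print(sendable_bytes(slow_receiver, bytes_in_flight=16384))
```

A receiver that drains its buffer slowly advertises a small window, so a change in the receiver's simulated CPU speed feeds directly back into the sender's rate.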
Slide 31: Non-Timing Dependent
- Single-threaded, application-only simulation (like SPEC2000)
- The execution timing does not affect the commit order of instructions
- Architectural state generated by a fast functional simulator would be the same as that generated by a detailed simulator
Slide 32: Weakly Timing Dependent
- Applications whose performance is tied to OS decisions
  - Multi-threaded (CMP, SMT, etc.)
- Execution timing effects like cache hits and misses, memory latencies, etc. can affect scheduling decisions
- However, these execution path variations are all valid and do not pose a correctness problem
Slide 33: Strongly Timing Dependent
- Workloads that explicitly tune themselves to the performance of the underlying system
- Tuning to an artificially fast system affects system performance
- When switching to detailed simulation, you may get meaningless results