Computer Measurement Group, India
www.cmgindia.org
Automatically Determining Load Test Duration Using Confidence Intervals
Rajesh Mansharamani, Freelancer
Subhasri Duttagupta, Tata Consultancy Services
Anuja Nehete, Persistent Systems
1st CMG India Annual Conference, Dec 2014
Load Test Duration?
System Under Test
Load testing tool: N virtual users with think time Z, request/response
LoadRunner, JMeter, Grinder, VSTS, Rational etc.
Test objective:
• Determine response time and throughput of the run, for regular performance tests (not long-duration tests or soak tests)
How long should the load test be run until the average response time converges?
Assumptions:
1. Average response time will converge over time
2. Number of concurrent users will not change during the test (except initial ramp up)
Solution from PT tools: none! The tester needs to manually enter a duration or specify a fixed number of iterations
Load Test Duration: State of the Art
Approach 1: Ad hoc
= Whatever comes to your mind or cut and paste from your predecessor
• Large Financial Services: Test Duration = 30 sec
• Many Load Testing Projects: 5 min without knowing why
• Several Projects: 15min to 20min, without knowing why
Load Test Duration: State of the Art
Approach 2: Visually Determine Start of Steady State and Then Use Ad hoc Duration After Start of Steady State
[Figure: average response time vs. test duration in sec, showing a transient state followed by steady state, with an ad hoc duration measured after steady state begins]
Load Test Duration: State of the Art
Approach 3: Ad hoc Transient Duration, Ad hoc Steady State Duration
= Discard first X minutes of data, and take measurements from next Y minutes where both X and Y are arbitrarily chosen
• We have seen for example: 20 min discard, 20 min keep, which then becomes the ‘golden standard’ for the organization to follow in the future
Approach 4: Duration in Hours to Mitigate Effect of Transients
• Large Manufacturing: 2 hours x 6 tests per run
• Large Stock Exchange: 5.5 hours (duration of entire day!)
Load Test Duration: Limitations of State of the Art
1. Test duration too short → estimates of performance are way off from reality
2. Test duration too long → limited number of PT cycles or a long schedule
3. Visual determination of transient/steady state needs to be repeated for each type of application
4. If you wish to offer PT as a service and want a small team to manage many applications, you wouldn't want to visually inspect each and every application under test
It would be best if a performance test could decide for itself when it has converged
Let the Test Determine Its Run Duration
• As run duration increases you expect to converge to a given value of mean response time
• What does one mean by ‘convergence’?
True Mean Response Time: E[R] = lim(n→∞) (1/n) Σ(i=1..n) Ri

Estimated Mean Response Time: R(n) = (1/n) Σ(i=1..n) Ri
Solution: keep increasing n until R(n) ≈ E[R]
What is wrong with this approach?
We don’t know E[R] to start with
Let the Test Determine Its Run Duration
We need a level of confidence that our estimate of mean response time R(n) is in the neighbourhood of the true mean E[R]
Proposed Approach
• User inputs:
  • MaxDuration
  • MinDuration in steady state
  • [Desired level of confidence in output (for advanced users)]
1. Start Test for Max Duration
2. When steady state is reached, reset all measurement counters
3. If the test converges (to the desired level of confidence) and MinDuration has elapsed in steady state, then stop the test before MaxDuration
4. Output Test Results
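The control flow above can be sketched in Python; the predicates `is_steady` and `has_converged` stand in for steps 2 and 3, and all names are illustrative assumptions rather than a real load-testing tool API:

```python
import time

def run_load_test(max_duration_s, min_steady_s, is_steady, has_converged,
                  tick=1.0):
    """Sketch of the proposed approach: run until convergence in steady
    state, or until the max duration expires."""
    start = time.time()
    steady_start = None                 # not yet in steady state
    while time.time() - start < max_duration_s:
        now = time.time()
        if steady_start is None:
            if is_steady():
                steady_start = now      # step 2: reset counters here
        elif has_converged() and now - steady_start >= min_steady_s:
            return "converged"          # step 3: stop before max duration
        time.sleep(tick)                # one measurement interval
    return "max duration reached"
```

The tick interval is only a pacing device for the sketch; a real tool would evaluate the predicates as measurement samples arrive.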
Statistics 101: Confidence Intervals
68.2% of values within (µ ± σ); 95.4% of values within (µ ± 2σ); 99.7% of values within (µ ± 3σ)
90% of values within (µ ± 1.645σ); 95% of values within (µ ± 1.960σ); 99% of values within (µ ± 2.576σ)
Normal Distribution with Mean µ and Std σ
Thus we can say with 99% Confidence that any sample of the ‘Normal’ Random Variable is in the interval (µ ± 2.576σ)
Central Limit Theorem
Let Y1, Y2, …, YN be N independent random variables each with
mean μ and std σ
Then YAVG = (Y1 + Y2 + ... + YN)/N approaches a Normal distribution with mean μ and std σ/sqrt(N)
But the problem is that successive response time samples from a test run are not necessarily independent
Therefore use batch means, which reduce the correlation. For example, if the batch size is 100 then let
Y1 = (R1 + R2 + … + R100)/100
Y2 = (R101 + R102 + … + R200)/100
…
Now apply Central Limit Theorem on Y1 , Y2 , …
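A minimal sketch of the batch-means step, assuming raw response-time samples are grouped into fixed-size batches (100 in the slides):

```python
def batch_means(samples, batch_size=100):
    """Collapse raw response-time samples into batch averages Y1, Y2, ...
    to reduce correlation between successive samples.
    An incomplete trailing batch is dropped."""
    n_batches = len(samples) // batch_size
    return [sum(samples[i * batch_size:(i + 1) * batch_size]) / batch_size
            for i in range(n_batches)]
```

The Central Limit Theorem is then applied to the returned list rather than to the raw samples.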
Statistics 101: Confidence Intervals
Y1 = avg response time of the first 100 samples, Y2 = avg response time of the next 100 samples, …
Then YAVG = (Y1 + Y2 + ... + YN)/N approaches a Normal distribution with mean μ and std σ/sqrt(N)
Thus we can say with 99% confidence that YAVG is in the interval (µ ± 2.576σ)
However, we don't know µ and σ to start with. So instead of the Normal distribution, use the Student t-distribution, which uses the estimated (computed) values of the mean and std.
Statistics 101: Student t-Distribution
For the Normal distribution we said that with 99% confidence YAVG is in the interval (µ ± 2.576σ)
Likewise, tables are available for the Student t-distribution that tell you what value to use instead of 2.576: µ' ± tconf,n-1 σ'
For example visit: http://easycalculation.com/statistics/t-distribution-critical-value-table.php
n-1 | 90% Conf | 95% Conf | 99% Conf
1   | 6.3138   | 12.7065  | 63.6551
10  | 1.8124   | 2.2282   | 3.1693
100 | 1.6602   | 1.9840   | 2.6259
200 | 1.6525   | 1.9719   | 2.6007
The interval gets tighter as n increases; the t-distribution converges to the Normal distribution as n increases
The factor tconf, n-1 is a function of number of samples n and the degree of confidence, such as 90% or 95% or 99%
µ’ and σ’ are computed estimates of the true mean and std
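As a sketch, the interval can be computed in Python using the 99% critical values from the table above (only those four degrees of freedom are covered here; the σ'/sqrt(n) scaling follows the algorithm summary later in the deck):

```python
import math

# 99% two-sided critical values copied from the t-table above; any other
# degrees of freedom would need a fuller table (or scipy.stats.t.ppf).
T_CRIT_99 = {1: 63.6551, 10: 3.1693, 100: 2.6259, 200: 2.6007}

def conf_interval_99(batch_avgs):
    """99% confidence interval mu' +/- t_{99,n-1} * sigma'/sqrt(n),
    where sigma' is the sample std of the n batch averages."""
    n = len(batch_avgs)
    mu = sum(batch_avgs) / n
    var = sum((y - mu) ** 2 for y in batch_avgs) / (n - 1)  # sample variance
    half = T_CRIT_99[n - 1] * math.sqrt(var) / math.sqrt(n)
    return (mu - half, mu + half)
```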
Load Test Duration Algorithm
Step 1: Discard Transient State Data
[Figure: throughput X vs. run duration]
Xk = throughput after k minutes of run duration
Heuristic: Throughput convergence
If Xk is within 90% of Xk-1 then steady state is reached.
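The heuristic can be sketched as below; note that "within 90% of Xk-1" is interpreted here as a relative change of at most 10%, which is an assumption since the slide does not spell out the direction:

```python
def first_steady_minute(throughputs):
    """Throughput-convergence heuristic: declare steady state at the
    first minute k where X_k is within 90% of X_{k-1}, read here as
    |X_k - X_{k-1}| <= 0.1 * X_{k-1} (an assumption).
    throughputs[k] = throughput after k+1 minutes of run duration.
    Returns the 1-based minute of steady state, or None."""
    for k in range(1, len(throughputs)):
        if abs(throughputs[k] - throughputs[k - 1]) <= 0.1 * throughputs[k - 1]:
            return k + 1
    return None
```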
Load Test Duration Algorithm
Step 2: Start collecting samples in steady state until the desired level of confidence is reached.
[Figure: transient period, then the average of batch response-time averages (batch size = 100)]
Width of the confidence interval is determined by the Student t-distribution for 99% conf and n batch avg resp time samples: µ' ± t99,n-1 σ'
The interval is within 15% of the mean
Empirically 99% confidence interval within 15% of estimated average response time works well
Load Test Duration Algorithm: Min/Max Duration
• Periodic events such as garbage collection, background noise, or daemon processes can affect results during short periods
• Keep a Min Duration in Steady State: say 5 minutes(This could have been samples too, but it is easier for practitioners to give a duration)
• Also, what if the average response time does not converge quickly?
• Keep a Max Duration of the run: say 20 to 30 min
Load Test Duration Algorithm: Summary
1. Start test for Max Duration
2. From the first sample onwards, compute performance metrics including throughput Xk for k minutes of run duration
   If (Xk is within 90% of Xk-1) then steady state is reached; reset computation of all metrics
   Else if (Max Duration is reached) stop the test and output metrics
3. In steady state, use batches of 100 samples; set n = 0 (no. of batches), Rbsum = 0, Rbsumsq = 0
   For completion of every 100 samples:
       Rb = average response time of this batch of 100 samples
       n = n + 1
       Rbsum = Rbsum + Rb
       Rbsumsq = Rbsumsq + Rb*Rb
       AvgRb = Rbsum/n                                   (running mean and std)
       StdRb = sqrt(Rbsumsq/n - AvgRb*AvgRb)
       If ((t99,n-1 * StdRb/sqrt(n) <= 0.15*AvgRb) and (MinDuration over in steady state)) or (MaxDuration is reached)
           stop test and output performance metrics      (99% confidence interval within 15% of mean)
   End for
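Step 3 can be sketched as a streaming O(1) checker in Python. The constant t critical value is an assumption: it approximates t99,n-1 for moderate n (the table shows 2.6259 at n-1 = 100); a real implementation would use a table lookup per n.

```python
import math

class ConvergenceChecker:
    """Streaming implementation of step 3: batch means with a running
    mean/std and a 99% confidence-interval stopping rule (interval
    half-width within 15% of the running mean). O(1) state per sample."""

    def __init__(self, batch_size=100, rel_width=0.15, t_crit=2.6):
        self.batch_size = batch_size
        self.rel_width = rel_width
        self.t_crit = t_crit     # assumed constant stand-in for t99,n-1
        self.batch = []          # raw samples of the current batch
        self.n = 0               # number of completed batches
        self.rbsum = 0.0         # running sum of batch averages
        self.rbsumsq = 0.0       # running sum of squared batch averages

    def add_sample(self, r):
        """Feed one response-time sample; return True once converged."""
        self.batch.append(r)
        if len(self.batch) < self.batch_size:
            return False
        rb = sum(self.batch) / self.batch_size   # this batch's average
        self.batch = []
        self.n += 1
        self.rbsum += rb
        self.rbsumsq += rb * rb
        if self.n < 2:
            return False                         # need >= 2 batches
        avg = self.rbsum / self.n
        std = math.sqrt(max(self.rbsumsq / self.n - avg * avg, 0.0))
        half_width = self.t_crit * std / math.sqrt(self.n)
        return half_width <= self.rel_width * avg
```

The MinDuration/MaxDuration guards from the summary would wrap calls to `add_sample` in the test driver.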
Validation Against Lab Applications
300 concurrent users, think time = 2 sec, max duration = 22 minutes
[Figure: bar chart of RAVG (ms) for DellDVD, JPetStore, RUBiS, eQuiz and NxGT with MinDur = 5 min, against the true mean]
Time to reach steady state = 3 min for DellDVD and 2 min for the other apps
True mean = RAVG at 22 min
All apps converged within the min duration of 5 min (99% conf interval within 15% of estimated mean)
99% conf interval within P% of estimated mean, with P = 8.1%, 1.9%, 5.5%, 2.9% and 0.9% across the five apps
Estimated mean at convergence within 5% of true mean
7-8 min of test duration instead of 22 min!!
Validation Against Lab Applications: Min Dur = 0
300 concurrent users, think time = 2 sec, max duration = 22 minutes
[Figure: bar chart of RAVG (ms) per app for MinDur = 5 min, MinDur = 0 and the true mean]
Time to reach steady state = 3 min for DellDVD and 2 min for the other apps
True mean = RAVG at 22 min
99% conf interval converges to 15% of estimated mean within S seconds of steady state, with S = 52, 3, 14, 2 and 2 sec across the five apps
With MinDur = 0, errors as high as 14% and 21% were observed for two apps
Such high errors are acceptable only during the initial stages of PT. Hence MinDur = 5 min makes sense.
Validation Against Real World Apps
• MORT: 80 conc users, 20 min, 26 pages; page response varies from a few millisec for some pages to 30 seconds for other pages
• VMS: 11 pages, 25 conc users, 20 min, 5 sec think time
• HelpDesk: 31 pages, 150 conc users, 15 min, think time 0 to 15 sec
[Figure: bar chart of RAVG (ms) for MORT, VMS and HelpDesk, algo estimate vs. true mean]
Convergence as per the algo (MinDur = 5 min): T = 7, 13.6 and 8 minutes across the three apps
Error < 5% in all three cases
Does the Algo Work for Page Level Response Times?
Page Number of MORT | Time to Steady State | Time to Converge after Steady State | 99% Conf Interval Size | RAVG at Convergence | True Mean | Error
Page 1 | 2 min | 8.9 min  | 14.6% | 32.74 sec | 32.67 sec | 0.2%
Page 2 | 2 min | 5.0 min  | 7.3%  | 47.51 ms  | 45.51 ms  | 4.4%
Page 3 | 3 min | 8.9 min  | 14.9% | 33.40 sec | 33.46 sec | 0.2%
Page 4 | 3 min | 10.1 min | 14.4% | 34.44 sec | 34.31 sec | 0.1%
Page 6 | 2 min | 9.8 min  | 13.7% | 35.59 sec | 35.65 sec | 0.1%
21 pages in each of MORT and HelpDesk did not converge due to lack of samples. We should look at page-level convergence only for tagged pages.
Error < 5% for HelpDesk too, and < 8% for VMS
Three VMS pages did not converge due to outliers!
Outliers in VMS
[Figure: response time (ms) vs. elapsed time (sec), showing outliers in VMS]
Real Time Handling of Outliers
• How do we know what an outlier is?
• What if we classify something as an outlier initially, but as we progress it becomes an 'inlier'?
• Heuristic:
  • Maintain a running average of response time
  • If a response time sample >= 5 times the running average, keep it in an outlier bucket
  • If the number of samples in the outlier bucket exceeds 5% of total samples, then include them in the total sample population and recompute the running mean, std and confidence interval [O(1)]
  • Otherwise, remove the outliers
Two VMS pages converged with this approach but the third did not
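The outlier heuristic can be sketched in Python. The class and method names are illustrative assumptions; only the running mean is maintained here, and the recomputation of std and confidence interval on a merge is omitted for brevity:

```python
class OutlierFilter:
    """Real-time outlier heuristic: samples >= 5x the running average
    go into an outlier bucket; if the bucket grows past 5% of all
    samples, its contents are merged back as inliers in O(1) using
    the kept running sums."""

    def __init__(self, threshold=5.0, max_frac=0.05):
        self.threshold = threshold
        self.max_frac = max_frac
        self.n = 0               # count of accepted (inlier) samples
        self.total = 0.0         # sum of accepted samples
        self.bucket_n = 0        # outlier-bucket count
        self.bucket_sum = 0.0    # outlier-bucket sum

    def add(self, r):
        avg = self.total / self.n if self.n else 0.0
        if self.n and r >= self.threshold * avg:
            self.bucket_n += 1
            self.bucket_sum += r
            # reclassify bucket as inliers once it exceeds 5% of samples
            if self.bucket_n > self.max_frac * (self.n + self.bucket_n):
                self.n += self.bucket_n
                self.total += self.bucket_sum
                self.bucket_n = 0
                self.bucket_sum = 0.0
        else:
            self.n += 1
            self.total += r

    def mean(self):
        return self.total / self.n if self.n else 0.0
```

Samples still sitting in the bucket when the test ends are the ones removed as outliers.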
VMS Page That Did Not Converge
[Figure: response time (ms) vs. elapsed time (sec) for the VMS page that did not converge]
Too many outliers were reclassified as inliers. Our algo rightly shows that for this page the run duration must go beyond the specified max duration.
Summary & Future Work
• A simple O(1) streaming algo that can easily be integrated into load test tools, or run off a response time log in real time, for a load test to automatically determine its own convergence
• Inputs – resp time log, min, max duration, conf level, tagging of pages
• Can skip min duration and tagging during initial rounds of testing
• Outlier removal is biased towards initial set – if these are reclassified as inliers they will never be classified as outliers again
• Histograms are more accurate for outlier removal, but maintaining them at run time during load testing is more expensive
• What about convergence of percentiles instead of RAVG?
• What about fluctuating workload, where number of concurrent users varies over time?
Any questions?