Computer Measurement Group, India
www.cmgindia.org
Automatically Determining Load Test Duration Using Confidence Intervals
Rajesh Mansharamani, Freelancer
Subhasri Duttagupta, Tata Consultancy Services
Anuja Nehete, Persistent Systems
1st CMG India Annual Conference, Dec 2014
Load Test Duration?
System Under Test
Load testing tool: N virtual users with think time Z, request/response
LoadRunner, JMeter, Grinder, VSTS, Rational etc.
Test objective:
• Determine response time and throughput of the run, for regular performance tests (not long-duration tests or soak tests)
How long should the load test be run until the average response time converges?
Assumptions:
1. Average response time will converge over time
2. Number of concurrent users will not change during the test (except initial ramp up)
Solution from PT tools: none! The tester needs to manually enter a duration or specify a fixed number of iterations
Load Test Duration: State of the Art
Approach 1: Ad hoc
= Whatever comes to your mind or cut and paste from your predecessor
• Large Financial Services: Test Duration = 30 sec
• Many Load Testing Projects: 5 min without knowing why
• Several Projects: 15min to 20min, without knowing why
Load Test Duration: State of the Art
Approach 2: Visually Determine Start of Steady State and Then Use Ad hoc Duration After Start of Steady State
[Figure: average response time vs. test duration in sec, showing a transient state followed by steady state, with an ad hoc duration measured after steady state begins]
Load Test Duration: State of the Art
Approach 3: Ad hoc Transient Duration, Ad hoc Steady State Duration
= Discard first X minutes of data, and take measurements from next Y minutes where both X and Y are arbitrarily chosen
• We have seen for example: 20 min discard, 20 min keep, which then becomes the ‘golden standard’ for the organization to follow in the future
Approach 4: Duration in Hours to Mitigate Effect of Transients
• Large Manufacturing: 2 hours x 6 tests per run
• Large Stock Exchange: 5.5 hours (duration of entire day!)
Load Test Duration: Limitations of State of the Art
1. Test duration too short → estimates of performance are way off from reality
2. Test duration too long → limited number of PT cycles or a long schedule
3. Visual determination of transient/steady state needs to be repeated for each type of application
4. If you wish to offer PT as a service and want a small team to manage many applications, you wouldn't want to visually inspect each and every application under test
It would be best if a performance test could decide for itself when it has converged
Let the Test Determine Its Run Duration
• As run duration increases you expect to converge to a given value of mean response time
• What does one mean by ‘convergence’?
True Mean Response Time: E[R] = lim(n→∞) (1/n) Σ(i=1..n) Ri

Estimated Mean Response Time: R(n) = (1/n) Σ(i=1..n) Ri
Solution: keep increasing n until R(n) ≈ E[R]
What is wrong with this approach?
We don’t know E[R] to start with
Let the Test Determine Its Run Duration
We need a level of confidence that our estimate of mean response time R(n) is in the neighbourhood of the true mean E[R]
Proposed Approach
• User inputs:
  • MaxDuration
  • MinDuration in steady state
  • [Desired level of confidence in output (for advanced users)]
1. Start Test for Max Duration
2. When steady state is reached, reset all measurement counters
3. If the test converges (to the desired level of confidence) and MinDuration has elapsed in steady state, then stop the test before MaxDuration
4. Output Test Results
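The control flow above can be sketched in Python; the predicates `is_steady` and `has_converged` stand in for steps 2 and 3, and all names are illustrative assumptions rather than a real load-testing tool API:

```python
import time

def run_load_test(max_duration_s, min_steady_s, is_steady, has_converged,
                  tick=1.0):
    """Sketch of the proposed approach: run until convergence in steady
    state, or until the max duration expires."""
    start = time.time()
    steady_start = None                 # not yet in steady state
    while time.time() - start < max_duration_s:
        now = time.time()
        if steady_start is None:
            if is_steady():
                steady_start = now      # step 2: reset counters here
        elif has_converged() and now - steady_start >= min_steady_s:
            return "converged"          # step 3: stop before max duration
        time.sleep(tick)                # one measurement interval
    return "max duration reached"
```

The tick interval is only a pacing device for the sketch; a real tool would evaluate the predicates as measurement samples arrive.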
Statistics 101: Confidence Intervals
68.2% of values within (µ ± σ); 95.4% of values within (µ ± 2σ); 99.7% of values within (µ ± 3σ)
90% of values within (µ ± 1.645σ); 95% of values within (µ ± 1.960σ); 99% of values within (µ ± 2.576σ)
Normal Distribution with Mean µ and Std σ
Thus we can say with 99% Confidence that any sample of the ‘Normal’ Random Variable is in the interval (µ ± 2.576σ)
Central Limit Theorem
Let Y1, Y2, …, YN be N independent random variables each with
mean μ and std σ
Then YAVG = (Y1 + Y2 + ... + YN)/N approaches a Normal distribution with mean μ and std σ/sqrt(N)
But the problem is that successive response time samples from a test run are not necessarily independent
Therefore use batch means, which reduce the correlation. For example, if the batch size is 100 then let
Y1 = (R1 + R2 + … + R100)/100
Y2 = (R101 + R102 + … + R200)/100
…
Now apply Central Limit Theorem on Y1 , Y2 , …
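A minimal sketch of the batch-means step, assuming raw response-time samples are grouped into fixed-size batches (100 in the slides):

```python
def batch_means(samples, batch_size=100):
    """Collapse raw response-time samples into batch averages Y1, Y2, ...
    to reduce correlation between successive samples.
    An incomplete trailing batch is dropped."""
    n_batches = len(samples) // batch_size
    return [sum(samples[i * batch_size:(i + 1) * batch_size]) / batch_size
            for i in range(n_batches)]
```

The Central Limit Theorem is then applied to the returned list rather than to the raw samples.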
Statistics 101: Confidence Intervals
Y1 = avg response time of the first 100 samples, Y2 = avg response time of the next 100 samples, …
Then YAVG = (Y1 + Y2 + ... + YN)/N approaches a Normal distribution with mean μ and std σ/sqrt(N)
Thus we can say with 99% confidence that YAVG is in the interval (µ ± 2.576σ)
However, we don't know µ and σ to start with. So instead of the Normal distribution, use the Student t-distribution, which uses the estimated (computed) values of the mean and std.
Statistics 101: Student t-Distribution
For the Normal distribution we said that with 99% confidence YAVG is in the interval (µ ± 2.576σ)
Likewise, tables are available for the Student t-distribution that tell you what value to use instead of 2.576: µ' ± tconf,n-1 σ'
For example visit: http://easycalculation.com/statistics/t-distribution-critical-value-table.php
n-1 | 90% Conf | 95% Conf | 99% Conf
1   | 6.3138   | 12.7065  | 63.6551
10  | 1.8124   | 2.2282   | 3.1693
100 | 1.6602   | 1.9840   | 2.6259
200 | 1.6525   | 1.9719   | 2.6007
The interval gets tighter as n increases; the t-distribution converges to the Normal distribution as n increases
The factor tconf, n-1 is a function of number of samples n and the degree of confidence, such as 90% or 95% or 99%
µ’ and σ’ are computed estimates of the true mean and std
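As a sketch, the interval can be computed in Python using the 99% critical values from the table above (only those four degrees of freedom are covered here; the σ'/sqrt(n) scaling follows the algorithm summary later in the deck):

```python
import math

# 99% two-sided critical values copied from the t-table above; any other
# degrees of freedom would need a fuller table (or scipy.stats.t.ppf).
T_CRIT_99 = {1: 63.6551, 10: 3.1693, 100: 2.6259, 200: 2.6007}

def conf_interval_99(batch_avgs):
    """99% confidence interval mu' +/- t_{99,n-1} * sigma'/sqrt(n),
    where sigma' is the sample std of the n batch averages."""
    n = len(batch_avgs)
    mu = sum(batch_avgs) / n
    var = sum((y - mu) ** 2 for y in batch_avgs) / (n - 1)  # sample variance
    half = T_CRIT_99[n - 1] * math.sqrt(var) / math.sqrt(n)
    return (mu - half, mu + half)
```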
Load Test Duration Algorithm
Step 1: Discard Transient State Data
[Figure: throughput X vs. run duration]
Xk = throughput after k minutes of run duration
Heuristic: Throughput convergence
If Xk is within 90% of Xk-1 then steady state is reached.
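The heuristic can be sketched as below; note that "within 90% of Xk-1" is interpreted here as a relative change of at most 10%, which is an assumption since the slide does not spell out the direction:

```python
def first_steady_minute(throughputs):
    """Throughput-convergence heuristic: declare steady state at the
    first minute k where X_k is within 90% of X_{k-1}, read here as
    |X_k - X_{k-1}| <= 0.1 * X_{k-1} (an assumption).
    throughputs[k] = throughput after k+1 minutes of run duration.
    Returns the 1-based minute of steady state, or None."""
    for k in range(1, len(throughputs)):
        if abs(throughputs[k] - throughputs[k - 1]) <= 0.1 * throughputs[k - 1]:
            return k + 1
    return None
```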
Load Test Duration Algorithm
Step 2: Start collecting samples in steady state until the desired level of confidence is reached.
[Figure: transient period, then the average of batch response-time averages (batch size = 100)]
Width of the confidence interval is determined by the Student t-distribution for 99% conf and n batch avg resp time samples: µ' ± t99,n-1 σ'
The interval is within 15% of the mean
Empirically 99% confidence interval within 15% of estimated average response time works well
Load Test Duration Algorithm: Min/Max Duration
• Periodic events such as garbage collection, background noise, or daemon processes can affect results during short periods
• Keep a Min Duration in Steady State: say 5 minutes(This could have been samples too, but it is easier for practitioners to give a duration)
• Also, what if the average response time does not converge quickly?
• Keep a Max Duration of the run: say 20 to 30 min
Load Test Duration Algorithm: Summary
1. Start test for Max Duration
2. From the first sample onwards, compute performance metrics including throughput Xk for k minutes of run duration
   If (Xk is within 90% of Xk-1) then steady state is reached; reset computation of all metrics
   Else if (Max Duration is reached) stop the test and output metrics
3. In steady state, use batches of 100 samples; set n = 0 (no. of batches), Rbsum = 0, Rbsumsq = 0
   For completion of every 100 samples:
       Rb = average response time of this batch of 100 samples
       n = n + 1
       Rbsum = Rbsum + Rb
       Rbsumsq = Rbsumsq + Rb*Rb
       AvgRb = Rbsum/n                                   (running mean and std)
       StdRb = sqrt(Rbsumsq/n - AvgRb*AvgRb)
       If ((t99,n-1 * StdRb/sqrt(n) <= 0.15*AvgRb) and (MinDuration over in steady state)) or (MaxDuration is reached)
           stop test and output performance metrics      (99% confidence interval within 15% of mean)
   End for
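Step 3 can be sketched as a streaming O(1) checker in Python. The constant t critical value is an assumption: it approximates t99,n-1 for moderate n (the table shows 2.6259 at n-1 = 100); a real implementation would use a table lookup per n.

```python
import math

class ConvergenceChecker:
    """Streaming implementation of step 3: batch means with a running
    mean/std and a 99% confidence-interval stopping rule (interval
    half-width within 15% of the running mean). O(1) state per sample."""

    def __init__(self, batch_size=100, rel_width=0.15, t_crit=2.6):
        self.batch_size = batch_size
        self.rel_width = rel_width
        self.t_crit = t_crit     # assumed constant stand-in for t99,n-1
        self.batch = []          # raw samples of the current batch
        self.n = 0               # number of completed batches
        self.rbsum = 0.0         # running sum of batch averages
        self.rbsumsq = 0.0       # running sum of squared batch averages

    def add_sample(self, r):
        """Feed one response-time sample; return True once converged."""
        self.batch.append(r)
        if len(self.batch) < self.batch_size:
            return False
        rb = sum(self.batch) / self.batch_size   # this batch's average
        self.batch = []
        self.n += 1
        self.rbsum += rb
        self.rbsumsq += rb * rb
        if self.n < 2:
            return False                         # need >= 2 batches
        avg = self.rbsum / self.n
        std = math.sqrt(max(self.rbsumsq / self.n - avg * avg, 0.0))
        half_width = self.t_crit * std / math.sqrt(self.n)
        return half_width <= self.rel_width * avg
```

The MinDuration/MaxDuration guards from the summary would wrap calls to `add_sample` in the test driver.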
Validation Against Lab Applications
300 concurrent users, think time = 2 sec, max duration = 22 minutes
[Figure: bar chart of RAVG (ms) for DellDVD, JPetStore, RUBiS, eQuiz and NxGT with MinDur = 5 min, against the true mean]
Time to reach steady state = 3 min for DellDVD and 2 min for the other apps
True mean = RAVG at 22 min
All apps converged within the min duration of 5 min (99% conf interval within 15% of estimated mean)
99% conf interval within P% of estimated mean, with P = 8.1%, 1.9%, 5.5%, 2.9% and 0.9% across the five apps
Estimated mean at convergence within 5% of true mean
7-8 min of test duration instead of 22 min!!
Validation Against Lab Applications: Min Dur = 0
300 concurrent users, think time = 2 sec, max duration = 22 minutes
[Figure: bar chart of RAVG (ms) per app for MinDur = 5 min, MinDur = 0 and the true mean]
Time to reach steady state = 3 min for DellDVD and 2 min for the other apps
True mean = RAVG at 22 min
99% conf interval converges to 15% of estimated mean within S seconds of steady state, with S = 52, 3, 14, 2 and 2 sec across the five apps
With MinDur = 0, errors as high as 14% and 21% were observed for two apps
Such high errors are acceptable only during the initial stages of PT. Hence MinDur = 5 min makes sense.
Validation Against Real World Apps
• MORT: 80 conc users, 20 min, 26 pages; page response varies from a few millisec for some pages to 30 seconds for other pages
• VMS: 11 pages, 25 conc users, 20 min, 5 sec think time
• HelpDesk: 31 pages, 150 conc users, 15 min, think time 0 to 15 sec
[Figure: bar chart of RAVG (ms) for MORT, VMS and HelpDesk, algo estimate vs. true mean]
Convergence as per the algo (MinDur = 5 min): T = 7, 13.6 and 8 minutes across the three apps
Error < 5% in all three cases
Does the Algo Work for Page Level Response Times?
Page Number of MORT | Time to Steady State | Time to Converge after Steady State | 99% Conf Interval Size | RAVG at Convergence | True Mean | Error
Page 1 | 2 min | 8.9 min  | 14.6% | 32.74 sec | 32.67 sec | 0.2%
Page 2 | 2 min | 5.0 min  | 7.3%  | 47.51 ms  | 45.51 ms  | 4.4%
Page 3 | 3 min | 8.9 min  | 14.9% | 33.40 sec | 33.46 sec | 0.2%
Page 4 | 3 min | 10.1 min | 14.4% | 34.44 sec | 34.31 sec | 0.1%
Page 6 | 2 min | 9.8 min  | 13.7% | 35.59 sec | 35.65 sec | 0.1%
21 pages in each of MORT and HelpDesk did not converge due to lack of samples. We should look at page-level convergence only for tagged pages.
Error < 5% for HelpDesk too, and < 8% for VMS
Three VMS pages did not converge due to outliers!
Outliers in VMS
[Figure: response time (ms) vs. elapsed time (sec), showing outliers in VMS]
Real Time Handling of Outliers
• How do we know what an outlier is?
• What if we classify something as an outlier initially, but as we progress it becomes an 'inlier'?
• Heuristic:
  • Maintain a running average of response time
  • If a response time sample >= 5 times the running average, keep it in an outlier bucket
  • If the number of samples in the outlier bucket exceeds 5% of total samples, then include them in the total sample population and recompute the running mean, std and confidence interval [O(1)]
  • Otherwise, remove the outliers
Two VMS pages converged with this approach but the third did not
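The outlier heuristic can be sketched in Python. The class and method names are illustrative assumptions; only the running mean is maintained here, and the recomputation of std and confidence interval on a merge is omitted for brevity:

```python
class OutlierFilter:
    """Real-time outlier heuristic: samples >= 5x the running average
    go into an outlier bucket; if the bucket grows past 5% of all
    samples, its contents are merged back as inliers in O(1) using
    the kept running sums."""

    def __init__(self, threshold=5.0, max_frac=0.05):
        self.threshold = threshold
        self.max_frac = max_frac
        self.n = 0               # count of accepted (inlier) samples
        self.total = 0.0         # sum of accepted samples
        self.bucket_n = 0        # outlier-bucket count
        self.bucket_sum = 0.0    # outlier-bucket sum

    def add(self, r):
        avg = self.total / self.n if self.n else 0.0
        if self.n and r >= self.threshold * avg:
            self.bucket_n += 1
            self.bucket_sum += r
            # reclassify bucket as inliers once it exceeds 5% of samples
            if self.bucket_n > self.max_frac * (self.n + self.bucket_n):
                self.n += self.bucket_n
                self.total += self.bucket_sum
                self.bucket_n = 0
                self.bucket_sum = 0.0
        else:
            self.n += 1
            self.total += r

    def mean(self):
        return self.total / self.n if self.n else 0.0
```

Samples still sitting in the bucket when the test ends are the ones removed as outliers.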
VMS Page That Did Not Converge
[Figure: response time (ms) vs. elapsed time (sec) for the VMS page that did not converge]
Too many outliers were reclassified as inliers. Our algo rightly shows that for this page the run duration must go beyond the specified max duration.
Summary & Future Work
• A simple O(1) streaming algo that can easily be integrated into load test tools, or run off a response time log in real time, for a load test to automatically determine its own convergence
• Inputs – resp time log, min, max duration, conf level, tagging of pages
• Can skip min duration and tagging during initial rounds of testing
• Outlier removal is biased towards initial set – if these are reclassified as inliers they will never be classified as outliers again
• Histograms are more accurate for outlier removal, but maintaining them at run time during load testing is more expensive
• What about convergence of percentiles instead of RAVG?
• What about fluctuating workload, where number of concurrent users varies over time?
Any questions?