Performance Evaluations
An overview and some lessons learned
“Knowing is not enough; we must apply. Willing is not enough; we must do.”
Johann Wolfgang von Goethe
Performance Evaluations - 2007 (U. Roehm)
Performance Evaluation
• Research methodology:
– quantitative evaluation
– of an existing software or hardware artefact: the SUT, 'system under test'
– using a set of experiments
– each consisting of a series of performance measurements
– to collect realistic performance data
Performance Evaluations
• Prepare – What do you want to measure?
• Plan – How can you measure, and what is needed?
• Implement – Pitfalls for the correct implementation of benchmarks.
• Evaluate – How to conduct performance experiments.
• Analyse and Visualise – What does the outcome mean?
Step 1: Preparations
• What do you want to show?
– Understanding of the behaviour of an existing system?
– Or proof of an approach's superiority?
• Which performance metrics are you interested in?
– Mean response times
– Throughput
– Scalability
– …
• What are the variables?
– Parameters of your own system / algorithm
– number of concurrent users, number of nodes, …
Performance Metrics
• Response Time (rt)
– Time duration the SUT takes to answer a request
– Note: For complex tasks, response time ≠ runtime
• Mean Response Time (mrt)
– Mean of the response times of a set of requests
– Only successful requests count!
• Throughput (thp)
– Number of successful requests per time unit
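As a concrete illustration of these definitions, here is a minimal Java sketch (class and method names are my own, not from the slides) that computes the mean response time over successful requests only, and throughput as successful requests per time unit:

```java
// Illustrative sketch of the metric definitions above (not the slides' code).
public class Metrics {
    // mrt: mean over the response times of *successful* requests only.
    public static double meanResponseTime(double[] responseTimesMs, boolean[] succeeded) {
        double sum = 0; int n = 0;
        for (int i = 0; i < responseTimesMs.length; i++) {
            if (succeeded[i]) { sum += responseTimesMs[i]; n++; }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    // thp: number of successful requests per time unit.
    public static double throughput(boolean[] succeeded, double measuringPeriodSec) {
        int ok = 0;
        for (boolean s : succeeded) if (s) ok++;
        return ok / measuringPeriodSec;
    }

    public static void main(String[] args) {
        double[] rt = {10.0, 20.0, 30.0, 1000.0};
        boolean[] ok = {true, true, true, false};   // the 1000 ms request failed
        System.out.println("mrt = " + meanResponseTime(rt, ok)); // 20.0
        System.out.println("thp = " + throughput(ok, 1.5));      // 2.0 req/s
    }
}
```

Note how the failed request is excluded both from the mean and from the throughput count.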
Runtime versus Response Time
• Model: client-server communication with server queueing incoming requests (e.g. web servers or database servers)
• The client sends a request a
– runt(a) - runtime to complete request a:
runt(a) = t_receiveN(last_result) - t_send
– rt(a) - response time for action a (until the first result comes back to the client!):
rt(a) = t_receive1(first_result) - t_send
– wt(a) - waiting time of action a in the server queue
– et(a) - execution time of action a (after the server took the request from the queue)
– frt(a) - first result time of action a (Note: frt(a) <= et(a))
– tc(a) - network transmission times for the request a and its result(s)
[Timeline diagram: the client sends a at t_send (network transfer tc(a)); the request arrives at the server queue at t_arrival, waits wt(a), then executes for et(a); each result travels back with transfer time tc(result); the first result reaches the client at t_receive1 (end of rt(a)) and the last at t_receiveN (end of runt(a)).]
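Under these definitions, response time and runtime can be distinguished on the client side roughly as in this minimal Java sketch (the result iterator stands in for results streaming back from a server):

```java
import java.util.Iterator;
import java.util.List;

// Sketch: response time (first result) vs. runtime (last result) for a
// streaming request. The iterator stands in for server results.
public class TimingDemo {
    public static long[] measure(Iterator<Integer> results) {
        long tSend = System.nanoTime();
        long tFirst = -1, tLast = tSend;
        while (results.hasNext()) {
            results.next();                 // consume one result
            long now = System.nanoTime();
            if (tFirst < 0) tFirst = now;   // t_receive1: first result back
            tLast = now;                    // t_receiveN: advances until the end
        }
        long rt = tFirst - tSend;           // rt(a)   = t_receive1 - t_send
        long runt = tLast - tSend;          // runt(a) = t_receiveN - t_send
        return new long[]{rt, runt};
    }

    public static void main(String[] args) {
        long[] t = measure(List.of(1, 2, 3).iterator());
        System.out.println("rt <= runt: " + (t[0] <= t[1]));
    }
}
```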
More Performance Metrics
• Scalability / Speedup: thp_n / thp_1
• Fairness:
• Resource Consumption
– memory usage
– CPU load
– energy consumption
– …
• Note: The latter are typically server-side statistics!
Step 2: Planning
• Which experiments will show the intended results?
• What do you need to run those experiments?
– Hardware
– Software
– Data (!)
• Prepare an evaluation schedule
– Evaluations always take longer than you expect!
– Expect to change some goals / your approach based on the outcome of initial experiments
• Some initial runs might be helpful to explore the space
Typical Client-Server Evaluation Setup
[Diagram: one or more client emulators connect over a test network to the system under test (SUT); the servers may additionally use a separate server-to-server network.]
• Often just one multithreaded client that emulates n concurrent clients.
• In general, the SUT can be arbitrarily complex, e.g. clustered servers or multi-tier architectures.
• Response time and throughput are measured at the client emulator.
• The client emulator(s) should run on a separate machine from the server(s).
Workload Specifications
• Multiprogramming Level (MPL)
– How many concurrent users / clients?
• Heterogeneous workload?
– Is every client doing the same, or are there variations?
– Typically: a well-defined set of transactions / request kinds with a defined distribution
• Do you emulate programs or users?
– If just interested in peak performance, issue as many requests as possible
– Sometimes a more complex user model is needed, e.g. TPC-C users with think times and sleep times
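A think-time user model can be sketched as follows (illustrative Java only; sendRequest() and the uniform think time are placeholders - TPC-C prescribes its own think-time distributions):

```java
import java.util.Random;

// Sketch of a user-model client thread: alternate requests with 'think times'.
public class EmulatedUser implements Runnable {
    static int served = 0;                  // total requests issued (all users)
    private final Random rng;
    private final int requests;

    EmulatedUser(long seed, int requests) { this.rng = new Random(seed); this.requests = requests; }

    static synchronized void sendRequest() { served++; }   // stand-in for a real request

    @Override public void run() {
        try {
            for (int i = 0; i < requests; i++) {
                sendRequest();
                Thread.sleep(rng.nextInt(5));   // think time; real models use e.g. exponential delays
            }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws Exception {
        Thread[] users = new Thread[4];                       // MPL = 4
        for (int i = 0; i < users.length; i++) {
            users[i] = new Thread(new EmulatedUser(i, 10));   // distinct seed per user!
            users[i].start();
        }
        for (Thread t : users) t.join();
        System.out.println(served + " requests issued");      // 40
    }
}
```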
Experimental Data
• Problem: How do we get reasonable test data?
• Approach 1: Tracing
– If you have an existing system available, trace 'typical' usages and use such traces to drive your experiments
• Approach 2: Standard Benchmarks
– Use the data generator of a standard benchmark
– In some areas, there are explicit data corpora for evaluation
• Approach 3: Make something up yourself
– Always the least preferable way!!!
– Justify why you think that your data setup is representative, e.g. by using the pattern of a standard benchmark
Standard Benchmarks
• There are many standard benchmarks available
– very helpful to make results more comparable
– most come with synthetic data generators
– make your results more publishable (reviewers will have more trust in your experimental setup)
• Disadvantages:
– Standard benchmarks can be very complex
– Some specifications are not free, but cost money
• Examples:
– TPC-C, TPC-H, TPC-R, TPC-W
– ECPerf, SPECjAppServer, IBM's Trade2, etc.
Example: TPC Benchmarks
• TPC - Transaction Processing Performance Council (tpc.org)
– Non-profit corporation of commercial software vendors
– Defined a set of database and e-business performance benchmarks
• TPC-C
– Measures the performance of OLTP systems (order-entry scenario of a wholesale supplier)
– V1.0 became official in 1992; current version v5.8
• TPC-H and TPC-R (formerly TPC-D)
– Performance of OLAP systems (data-warehouse scenario)
– In 1999, TPC-D was replaced by TPC-H (ad-hoc queries) + TPC-R (reporting)
• TPC-W and TPC-App
– transactional web benchmark simulating an interactive e-business website
– TPC-W obsolete since April 2005; replaced(?) by TPC-App
• TPC-E
– New OLTP benchmark that simulates the workload of a brokerage firm
Step 3: Implementation
• Goal: Evaluation program(s) (client emulators) that measure what you have planned.
• Typical elements to take care of:
– Accurate timing
– Random number generation
– Fast logging
– No hidden serialisation, e.g. via global singletons
– No screen output during the measurement interval
• Avoid measuring the test harness rather than the SUT…
Time Measurements
• Every programming language offers some timing functions
• But be aware that there is a timer resolution
– E.g. Java's System.currentTimeMillis() suggests by its name that it measures time in milliseconds… The question is: how many milliseconds pass between updates?
• There is no point in trying to measure something taking microseconds with a timer of milliseconds resolution!
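The actual tick size of a timer can be probed empirically, e.g. with this small Java sketch that busy-waits until currentTimeMillis() changes and records the step size:

```java
// Probe the granularity of System.currentTimeMillis() by spinning until the
// reported value changes twice; the difference is the size of one timer tick.
public class TimerGranularity {
    public static long granularityMillis() {
        long t0 = System.currentTimeMillis();
        long t1;
        while ((t1 = System.currentTimeMillis()) == t0) { /* spin */ }
        long t2;
        while ((t2 = System.currentTimeMillis()) == t1) { /* spin */ }
        return t2 - t1;   // one timer tick, in ms
    }

    public static void main(String[] args) {
        System.out.println("currentTimeMillis() advances in steps of "
                + granularityMillis() + " ms");
    }
}
```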
Example: Java Timing
• Standard (all JDKs): System.currentTimeMillis()
– Be aware: Has different resolutions depending on OS and platform (see table below)!
• Since JDK 1.4.2 (portable, undocumented!!): sun.misc.Perf
– Example: // may throw SecurityException
sun.misc.Perf perf = sun.misc.Perf.getPerf();
long ticksPerSecond = perf.highResFrequency();
long currTick = perf.highResCounter();
long milliSeconds = (currTick * 1000) / ticksPerSecond;
• In JDK 1.5: java.lang.System.nanoTime()
– Always uses the best precision available on a system, but no guaranteed resolution
• Some third-party solutions use, e.g., Windows' high-performance timers through Java JNI (hence limited portability; best for Windows)
OS / Platform                     Resolution of currentTimeMillis()
Linux (2.2, x86)                  1 ms
Mac OS X                          1 ms
Windows 2000                      10 ms
Windows 98                        60 ms
Solaris (2.7/i386, 2.8/sun4u)     1 ms
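With JDK 1.5's nanoTime(), timing a piece of work looks like this (minimal sketch; the workload lambda is just an example):

```java
// Time a piece of work with the highest-resolution timer the JVM offers.
public class NanoTiming {
    public static long elapsedNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long ns = elapsedNanos(() -> {
            double x = 0;
            for (int i = 0; i < 100_000; i++) x += Math.sqrt(i);  // some work
        });
        System.out.println("took " + ns / 1_000 + " µs");
    }
}
```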
Example: Wrong Timer Usage
Example: Same Experiment with a High-res Timer
[Histogram of bean creation times (server-side): frequency over time in ms (0-3 ms); series show cache-hit times for TimesTen, MySQL (local), MySQL (remote), and Oracle (remote).]
Note the ‘faster’ response times as compared to the previous experiment
Note: No chance to measure the duration of a cache hit with currentTimeMillis()
Random Number Generators
• Common Mistakes:
– A multi-threaded client, but all threads use the same global Random object
This effectively serialises your threads!
– A large set of random numbers is generated within the measured code
We do not want to measure how fast Java can generate random numbers
Use an array with pre-generated random numbers (space vs. time)
– The seeds are the same for all threads
This makes your program deterministic…
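Both fixes can be sketched in a few lines of Java (illustrative only; names are my own):

```java
import java.util.Random;

// Sketch of the fixes above: a generator per thread with distinct seeds
// (nothing shared to serialise on), and pre-generation outside the
// measurement interval (space vs. time trade-off).
public class RandomInClients {

    // Pre-generate n random numbers before the measurement interval starts.
    public static int[] pregenerate(int n, long seed) {
        Random r = new Random(seed);
        int[] numbers = new int[n];
        for (int i = 0; i < n; i++) numbers[i] = r.nextInt(100);
        return numbers;
    }

    public static void main(String[] args) throws InterruptedException {
        // Each emulated client thread gets its OWN Random with its OWN seed;
        // one shared java.util.Random serialises threads on its internal state,
        // and identical seeds would make all clients issue the same workload.
        Thread[] clients = new Thread[4];
        for (int i = 0; i < clients.length; i++) {
            final long seed = 42 + i;                   // distinct seed per thread
            clients[i] = new Thread(() -> {
                int[] mine = pregenerate(1000, seed);   // done before measuring
                long sum = 0;
                for (int x : mine) sum += x;            // stand-in for using the numbers
            });
            clients[i].start();
        }
        for (Thread t : clients) t.join();
        System.out.println("done");
    }
}
```

Since JDK 7, java.util.concurrent.ThreadLocalRandom offers per-thread generators out of the box.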
Logging
• Goal: Fast logging of results during experiments without interfering with the measurements
• Approach:
– Log to a file, not the screen
Screen output / scrolling is VERY slow - a very common mistake
– Use standard log libraries with low overhead, e.g. Java's log4j (http://sourceforge.net/projects/log4j/) or Windows' performance counter API
– If your client reads data from a hard drive, write your log data to a different disk
– Log asynchronously, be fast, be thread-safe (careful!)
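Asynchronous, thread-safe logging can be sketched with a producer-consumer queue (illustrative Java; a real harness would write to a file and use a proper shutdown protocol instead of the "EOF" sentinel):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Measurement threads enqueue records cheaply; one writer thread drains the
// queue. A list stands in for the log file here.
public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<String> sink = new ArrayList<>();   // stand-in for a log file
    private final Thread writer;

    AsyncLogger() {
        writer = new Thread(() -> {
            try {
                String rec;
                while (!(rec = queue.take()).equals("EOF")) sink.add(rec);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        writer.start();
    }

    void log(String record) { queue.add(record); }   // cheap for producers

    void close() throws InterruptedException { queue.add("EOF"); writer.join(); }

    public static void main(String[] args) throws Exception {
        AsyncLogger log = new AsyncLogger();
        for (int i = 0; i < 100; i++) log.log("rt=" + i);
        log.close();
        System.out.println(log.sink.size() + " records written");
    }
}
```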
Windows Performance Monitor
• Windows includes a performance monitor application
– online GUI
– can capture to file
– supports remote monitoring!
• Based on the Windows Performance Counters API
– \\ComputerName\Object(Instance)\Counter
– supported by basically every server application; huge number of statistics
– can be used in your own programs
– Cf. http://technet2.microsoft.com/WindowsServer/en/library/3fb01419-b1ab-4f52-a9f8-09d5ebeb9ef21033.mspx?mfr=true
– From Java: http://www.javaworld.com/javaworld/jw-11-2004/jw-1108-windowspm.html
Step 4: Evaluation
• Objective: To collect accurate performance data in a set of experiments.
• Three major issues:
– Controlled evaluation environment
– Documentation
– Archiving of raw data
Evaluation Environment
• We want a stable evaluation environment that allows us to measure the system under test under repeatable settings without interference
– Clean computer initialisation (client(s) and server(s))
No concurrent programs, minimum set of system services
Disable anti-virus software!
– Decide: open or closed system?
– Decide: cold or warm caches?
– Make sure you do not measure any side-effects!
– Many prefer to measure during the night - why?
Open vs. Closed Systems
• Open system: clients arrive and leave the system (with an appropriate arrival distribution)
• Closed system: fixed number of clients; each starts a new task after finishing the previous one
• An open system is generally more realistic
• A closed system is much easier to write as a test harness
• An open system behaves much worse when contention is high
• See Schroeder et al., Proc. NSDI'06
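The two client models can be contrasted in a short sketch (illustrative Java; doRequest() is a stand-in, and the exponential inter-arrival times are one common choice for open systems, not a prescription):

```java
import java.util.Random;

// Closed model: next request only after the previous one finished.
// Open model: requests arrive independently of completions.
public class ClientModels {
    static int closedRequests = 0, openRequests = 0;

    static void doRequest() { /* stand-in for a real request */ }

    // Closed system: a fixed client issues requests back-to-back.
    static void closedClient(int n) {
        for (int i = 0; i < n; i++) { doRequest(); closedRequests++; }
    }

    // Open system: e.g. exponentially distributed inter-arrival gaps
    // (Poisson arrivals) at a given rate, regardless of completions.
    static void openArrivals(double ratePerMs, int n, long seed) {
        Random rng = new Random(seed);
        double t = 0;
        for (int i = 0; i < n; i++) {
            t += -Math.log(1 - rng.nextDouble()) / ratePerMs;  // next gap
            doRequest(); openRequests++;                        // scheduled at time t
        }
    }

    public static void main(String[] args) {
        closedClient(100);
        openArrivals(0.5, 100, 42);
        System.out.println(closedRequests + " closed, " + openRequests + " open");
    }
}
```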
Sequential Evaluation
• The special case of MPL = 1
– There are no multiple clients, or the system is centralised
• The test harness is typically the same; many of the implementation details still apply
• Several experiment repetitions
– In order to determine stable statistics, e.g. the mean response time
– Note: Use the same number of repetitions for all experiments!
Parallel Evaluation
• Multiprogramming level > 1
• A parallel evaluation has three distinct phases: ramp-up, steady phase, and close-down
• Performance can be measured either
– periodically during the steady phase, or
– in summary at the end (measuring period = whole steady phase); then the experiment must be repeated several times to determine the mean value
[Diagram: performance over time for a fixed MPL, showing ramp-up, steady phase, and close-down. Only measure during the steady phase!]
Performance Behaviour over MPL
• The measurements from the previous slide give you a single measuring point for one fixed MPL
• Typically one is interested in the system behaviour over varying MPLs
– One needs to conduct separate experiments for each MPL
• What to expect if we plot the throughput over the MPL?
This? Or this?
Outlier Policy
• Inherent complexity of the evaluated systems
– There is always a 'noise signal' or 'uncertainty factor'
• We expect individual measurements to vary around a stable mean value
– Note that the standard deviation can be quite high.
– But some measurements are 'way off' and have to be dealt with
• Outlier policy:
– What is an outlier? E.g. more than n times the standard deviation off the mean
– How to deal with it? E.g. replace with a 'spare' value
• Trace outliers - if there are too many, are they still outliers?
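A policy of "more than n standard deviations off the mean" can be sketched as follows (illustrative Java; the measurement values in main are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Flag measurements more than nSigma sample standard deviations from the mean.
public class OutlierFilter {
    public static List<Double> withoutOutliers(double[] xs, double nSigma) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / (xs.length - 1));   // sample standard deviation
        List<Double> kept = new ArrayList<>();
        for (double x : xs)
            if (Math.abs(x - mean) <= nSigma * sd) kept.add(x);
        return kept;
    }

    public static void main(String[] args) {
        double[] rt = {10, 11, 9, 10, 12, 10, 95};      // one measurement 'way off'
        System.out.println(withoutOutliers(rt, 2.0));   // the 95 is dropped
    }
}
```

Keep the dropped values in the raw data archive - if there are too many of them, they are not outliers but behaviour.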
Evaluation Scripting
• Conducting evaluations is tedious and error-prone
– E.g. to test scalability in the example paper:
3 scenarios x 4 algorithms = 12 configurations
12 configurations x 10 MPLs x 21 measurements = 2520 runs
– Note: The benchmark implementation is only one piece of those 2520 runs
• Need a 'test harness' - a set of scripts/programs that
– automates the evaluation as much as possible, and
– makes the evaluation repeatable!
start/stop servers, copy logging and configuration files, check for hanging runs, etc.
– Write scripts in the language that you prefer: Python, shell scripts, …; good knowledge of the OS and shell scripting is helpful
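The configuration sweep above can be sketched as a harness loop (illustrative Java; the scenario/algorithm names and the ten MPL values are made up, and runExperiment() only records what a real harness would execute):

```java
import java.util.ArrayList;
import java.util.List;

// Loop over all configurations (scenario x algorithm x MPL), as counted above.
public class Harness {
    static List<String> ran = new ArrayList<>();

    static void runExperiment(String scenario, String algorithm, int mpl) {
        // A real harness would: start the server, run the client emulator,
        // archive logs/configs, watch for hanging runs (timeout + kill),
        // and stop the server again.
        ran.add(scenario + "/" + algorithm + "/MPL=" + mpl);
    }

    public static void main(String[] args) {
        String[] scenarios = {"s1", "s2", "s3"};                 // hypothetical
        String[] algorithms = {"a1", "a2", "a3", "a4"};          // hypothetical
        int[] mpls = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512};   // hypothetical
        for (String s : scenarios)
            for (String a : algorithms)
                for (int m : mpls)
                    runExperiment(s, a, m);
        System.out.println(ran.size() + " runs");   // 3 x 4 x 10 = 120
    }
}
```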
Example Evaluation Script

REM @Echo OFF
TITLE QueryTest
REM parameters:
REM %1 number of nodes
REM … (this script has a bunch of parameters)
REM -----------------------------------------------------------------------------------------
REM freshly start server
Z:/programs/gnu/bin/sleep 10
start ./server.exe
Z:/programs/gnu/bin/sleep 20
REM configure server and prepare cache
./pdbsetconfig Nodes=%2 MaxQueueSize=22 MaxLoad=1 WrapperType=%3 Nodes=%2 Routing=RoundRobin Password=%4 Username=%4 ScanInputQueue=FCFS
./client Verbose=0 Directory=P:/tpc-r/%3/pdb_queries Files=CacheFlush.sql Loops=%1
REM run the actual measurement
REM the result logging is implemented as part of the client program
./pdbstatistics RESET
IF %10==verbose ./pdbgetconfig
IF %10==verbose echo === Testing %5-Routing (maxload %6 history %7) with %8/%9ms
IF %10==verbose ./client Files=%8 Verbose=1 Directory=P:/tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True
IF NOT %10==verbose ./client Files=%8 Verbose=0 Directory=P:/tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True ExcelOutput="%1 %6 %7 %8"
REM stop server again
REM first try normally, but if it hangs - kill it
./pdbsetconfig Shutdown
Z:/programs/gnu/bin/sleep 20
Z:/programs/reskit4/kill -F server.exe

Notes: Archiving of results is done as part of the client; the directory structure is parameterised in the script. The script also logs the current configuration and which test is being run.
Evaluation Documentation
• Goal: Being able to verify your results, e.g. by re-running your experiments.
• Keep detailed documentation of what you are doing
– Full disclosure of the evaluation environment, including hardware, OS, and software with full version numbers and any patches/changes
– Write a 'Readme': how you set up and conduct the tests
Chances are high that you will want to re-run something later, but already a few days(!) afterwards you will not remember every detail anymore…
Not to mention follow-up projects
– Keep an evaluation logbook: when was what evaluated, etc.
Evaluation: Archiving of Results
• Goal: To be able to analyse the results later in all details, even with regard to aspects that you did not think about when you planned the evaluation
• Archive ALL RAW DATA of your results
– not just average values etc.
– include any server logfiles, error files, and configurations
• Best practice:
– Keep a directory structure that corresponds to your evaluations
– Collect all raw result files, logfiles, config files etc. for each individual experiment; include the environment description
Example Result Archive Structure
• Evaluation
  – Cluster Mix 1
      Evaluation Setup.Readme
      Cluster Mix 1.xls
      Serial locking
        - 2006-8-12_run1
            Client.log
            Server.log
        - 2006-8-13_run2
            …
      Object-level locking
      Field-level locking
      Semantic locking
  – Cluster Mix 2
      …
  – Semantic Mix
      …

I prefer to have separate sheets in the Excel file for all raw results, and then one aggregation sheet for the means, which are plotted.
Archiving: Common Mistakes
• No server logs
– How do you verify that some effect wasn't due to a fault?
• No configuration / environment description
– Was this result from before or after you changed some setting?
• No raw data values, just aggregates - example:

Updates  Reads  Staleness  Drift  Read ratio  Mode    MRT      Executions  Aborts
500      100    2500       0      90          reader  93.2654  1861        981
500      100    2500       500    90          reader  84.9423  2097        721
500      100    2500       1000   90          reader  72.2220  2387        601
500      100    2500       1500   90          reader  68.5     2692        338
500      100    2500       2000   90          reader  67.1024  2744        181
500      100    2500       2500   90          reader  66.0952  2953        3

Problems: Is MRT in minutes, seconds, or milliseconds? How many runs were aggregated? Where is the standard deviation? What does the 'Read ratio' column mean - it contradicts the first two columns.
Step 5: Result Analysis
• Raw performance data is analysed with statistical methods
– Cf. last week's lecture
• Important:
– Standard deviation
– Confidence intervals
– Include error bars in graphs
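The basic statistics can be sketched as follows (normal approximation with z = 1.96; for few repetitions a Student-t quantile would be more appropriate, and the sample values in main are made up):

```java
// Sample mean, sample standard deviation, and a ~95% confidence interval
// for the mean: mean ± 1.96 * sd / sqrt(n).
public class Stats {
    public static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    public static double stddev(double[] xs) {
        double m = mean(xs), v = 0;
        for (double x : xs) v += (x - m) * (x - m);
        return Math.sqrt(v / (xs.length - 1));   // sample standard deviation
    }

    public static double[] confidenceInterval95(double[] xs) {
        double m = mean(xs);
        double half = 1.96 * stddev(xs) / Math.sqrt(xs.length);
        return new double[]{m - half, m + half};   // use as error bars in graphs
    }

    public static void main(String[] args) {
        double[] mrt = {93.1, 95.4, 92.8, 94.0, 93.5};   // repetitions of one experiment
        double[] ci = confidenceInterval95(mrt);
        System.out.printf("mean=%.2f  95%% CI=[%.2f, %.2f]%n", mean(mrt), ci[0], ci[1]);
    }
}
```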
Step 6: Result Presentation
• Finally, some remarks on good visualisation of results
• Note: The following examples are copied from conference submissions for illustration purposes only, not to blame the authors…
Example 1: Wrong Origin
• Which one is better, and by how much - LEACH or DEEAC?
The Y-axis does not start at 0! This misrepresents the actual performance difference, and the two graphs side-by-side are not comparable!
Also note: What does the Y-axis show?
Example 2: Wrong Scaling
• How would you describe the scalability of algorithm ‘cc’?
• Standard problem when using Excel the ‘easy way’...
Example 3: ?
• Which approach is better?
Ok, this is unfair as the authors used this graph explicitly to show that all approaches behave the same…
Presentation of the Results
• Be consistent with naming, colours, and order
– Two graphs are very hard to compare if you use different colours or names for the same things
• Axes start at 0; give descriptive axis titles with unit names
• Show error bars
– If it becomes too messy, just show one and explain in the text
• Excel and non-linear intervals == MESS
– e.g. gnuplot is much better
• Make sure everything is readable
– use large fonts, thick curves (not just 1 point wide) and dark colours
• Everything in the graphs should be explained
Links and References
• Performance Evaluations
– Jim Gray. "The Benchmark Handbook: For Database and Transaction Processing Systems". Morgan Kaufmann, 1992.
– G. Haring, C. Lindemann and M. Reiser (eds.). "Performance Evaluation: Origins and Directions". LNCS 1769, 2000.
– B. Schroeder et al. "Open versus Closed: A Cautionary Tale". In Proceedings of USENIX NSDI'06, pp. 239-252, 2006.
– P. Wu, A. Fekete and U. Roehm. "The Efficacy of Commutativity-Based Semantic Locking in Real-World Applications". 2006.
– U. Roehm. "OLAP with a Cluster of Databases". DISBIS 80, 2002.
• Java Time Measurements
– http://www.jsresources.org/faq_performance.html
– V. Roubtsov. "My kingdom for a good timer!", January 2003. http://www.javaworld.com/javaworld/javaqa/2003-01/01-qa-0110-timing.html
• Java vs. C etc.
– http://www.idiom.com/~zilla/Computer/javaCbenchmark.html