Performance Evaluations
An overview and some lessons learned
“Knowing is not enough; we must apply. Willing is not enough; we must do.”
Johann Wolfgang von Goethe
Performance Evaluations - 2007 (U. Roehm)
Performance Evaluation
• Research methodology:
– quantitative evaluation
– of an existing software or hardware artefact: the SUT, 'system under test'
– using a set of experiments
– each consisting of a series of performance measurements
– to collect realistic performance data
Performance Evaluations
• Prepare – What do you want to measure?
• Plan – How can you measure, and what is needed?
• Implement – Pitfalls for the correct implementation of benchmarks.
• Evaluate – How to conduct performance experiments.
• Analyse and Visualise – What does the outcome mean?
Step 1: Preparations
• What do you want to show?
– Understanding of the behaviour of an existing system?
– Or proof of an approach's superiority?
• Which performance metrics are you interested in?
– Mean response times
– Throughput
– Scalability
– …
• What are the variables?
– Parameters of your own system / algorithm
– number of concurrent users, number of nodes, …
Performance Metrics
• Response Time (rt)
– Time duration the SUT takes to answer a request
– Note: For complex tasks, response time ≠ runtime
• Mean Response Time (mrt)
– Mean of the response times of a set of requests
– Only successful requests count!
• Throughput (thp)
– Number of successful requests per time unit
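As a concrete illustration of these definitions, here is a minimal Java sketch (class and method names are my own, not from the slides) that computes the mean response time over successful requests only, and throughput as successful requests per time unit:

```java
// Illustrative sketch of the metric definitions above (not the slides' code).
public class Metrics {
    // mrt: mean over the response times of *successful* requests only.
    public static double meanResponseTime(double[] responseTimesMs, boolean[] succeeded) {
        double sum = 0; int n = 0;
        for (int i = 0; i < responseTimesMs.length; i++) {
            if (succeeded[i]) { sum += responseTimesMs[i]; n++; }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    // thp: number of successful requests per time unit.
    public static double throughput(boolean[] succeeded, double measuringPeriodSec) {
        int ok = 0;
        for (boolean s : succeeded) if (s) ok++;
        return ok / measuringPeriodSec;
    }

    public static void main(String[] args) {
        double[] rt = {10.0, 20.0, 30.0, 1000.0};
        boolean[] ok = {true, true, true, false};   // the 1000 ms request failed
        System.out.println("mrt = " + meanResponseTime(rt, ok)); // 20.0
        System.out.println("thp = " + throughput(ok, 1.5));      // 2.0 req/s
    }
}
```

Note how the failed request is excluded both from the mean and from the throughput count.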
Runtime versus Response Time
• Model: client-server communication with server queueing incoming requests (e.g. web servers or database servers)
• The client sends a request a
– runt(a) - runtime to complete request a:
runt(a) = t_receiveN(last_result) - t_send
– rt(a) - response time for action a (until the first result comes back to the client!):
rt(a) = t_receive1(first_result) - t_send
– wt(a) - waiting time of action a in the server queue
– et(a) - execution time of action a (after the server took the request from the queue)
– frt(a) - first result time of action a (Note: frt(a) <= et(a))
– tc(a) - network transmission times for the request a and its result(s)
[Timeline diagram: the client sends a at t_send (network transfer tc(a)); the request arrives at the server queue at t_arrival, waits wt(a), then executes for et(a); each result travels back with transfer time tc(result); the first result reaches the client at t_receive1 (end of rt(a)) and the last at t_receiveN (end of runt(a)).]
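Under these definitions, response time and runtime can be distinguished on the client side roughly as in this minimal Java sketch (the result iterator stands in for results streaming back from a server):

```java
import java.util.Iterator;
import java.util.List;

// Sketch: response time (first result) vs. runtime (last result) for a
// streaming request. The iterator stands in for server results.
public class TimingDemo {
    public static long[] measure(Iterator<Integer> results) {
        long tSend = System.nanoTime();
        long tFirst = -1, tLast = tSend;
        while (results.hasNext()) {
            results.next();                 // consume one result
            long now = System.nanoTime();
            if (tFirst < 0) tFirst = now;   // t_receive1: first result back
            tLast = now;                    // t_receiveN: advances until the end
        }
        long rt = tFirst - tSend;           // rt(a)   = t_receive1 - t_send
        long runt = tLast - tSend;          // runt(a) = t_receiveN - t_send
        return new long[]{rt, runt};
    }

    public static void main(String[] args) {
        long[] t = measure(List.of(1, 2, 3).iterator());
        System.out.println("rt <= runt: " + (t[0] <= t[1]));
    }
}
```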
More Performance Metrics
• Scalability / Speedup: thp_n / thp_1
• Fairness:
• Resource Consumption
– memory usage
– CPU load
– energy consumption
– …
• Note: The latter are typically server-side statistics!
Step 2: Planning
• Which experiments will show the intended results?
• What do you need to run those experiments?
– Hardware
– Software
– Data (!)
• Prepare an evaluation schedule
– Evaluations always take longer than you expect!
– Expect to change some goals / your approach based on the outcome of initial experiments
• Some initial runs might be helpful to explore the space
Typical Client-Server Evaluation Setup
[Diagram: one or more client emulators connect over a test network to the system under test (SUT); the servers may additionally use a separate server-to-server network.]
• Often just one multithreaded client that emulates n concurrent clients.
• In general, the SUT can be arbitrarily complex, e.g. clustered servers or multi-tier architectures.
• Response time and throughput are measured at the client emulator.
• The client emulator(s) should run on a separate machine from the server(s).
Workload Specifications
• Multiprogramming Level (MPL)
– How many concurrent users / clients?
• Heterogeneous workload?
– Is every client doing the same, or are there variations?
– Typically: a well-defined set of transactions / request kinds with a defined distribution
• Do you emulate programs or users?
– If just interested in peak performance, issue as many requests as possible
– Sometimes a more complex user model is needed, e.g. TPC-C users with think times and sleep times
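A think-time user model can be sketched as follows (illustrative Java only; sendRequest() and the uniform think time are placeholders - TPC-C prescribes its own think-time distributions):

```java
import java.util.Random;

// Sketch of a user-model client thread: alternate requests with 'think times'.
public class EmulatedUser implements Runnable {
    static int served = 0;                  // total requests issued (all users)
    private final Random rng;
    private final int requests;

    EmulatedUser(long seed, int requests) { this.rng = new Random(seed); this.requests = requests; }

    static synchronized void sendRequest() { served++; }   // stand-in for a real request

    @Override public void run() {
        try {
            for (int i = 0; i < requests; i++) {
                sendRequest();
                Thread.sleep(rng.nextInt(5));   // think time; real models use e.g. exponential delays
            }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws Exception {
        Thread[] users = new Thread[4];                       // MPL = 4
        for (int i = 0; i < users.length; i++) {
            users[i] = new Thread(new EmulatedUser(i, 10));   // distinct seed per user!
            users[i].start();
        }
        for (Thread t : users) t.join();
        System.out.println(served + " requests issued");      // 40
    }
}
```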
Experimental Data
• Problem: How do we get reasonable test data?
• Approach 1: Tracing
– If you have an existing system available, trace 'typical' usages and use such traces to drive your experiments
• Approach 2: Standard Benchmarks
– Use the data generator of a standard benchmark
– In some areas, there are explicit data corpora for evaluation
• Approach 3: Make something up yourself
– Always the least preferable way!!!
– Justify why you think that your data setup is representative, e.g. by using the pattern of a standard benchmark
Standard Benchmarks
• There are many standard benchmarks available
– very helpful to make results more comparable
– most come with synthetic data generators
– make your results more publishable (reviewers will have more trust in your experimental setup)
• Disadvantages:
– Standard benchmarks can be very complex
– Some specifications are not free, but cost money
• Examples:
– TPC-C, TPC-H, TPC-R, TPC-W
– ECPerf, SPECjAppServer, IBM's Trade2, etc.
Example: TPC Benchmarks
• TPC - Transaction Processing Performance Council (tpc.org)
– Non-profit corporation of commercial software vendors
– Defined a set of database and e-business performance benchmarks
• TPC-C
– Measures the performance of OLTP systems (order-entry scenario of a wholesale supplier)
– V1.0 became official in 1992; current version v5.8
• TPC-H and TPC-R (formerly TPC-D)
– Performance of OLAP systems (data-warehouse scenario)
– In 1999, TPC-D was replaced by TPC-H (ad-hoc queries) + TPC-R (reporting)
• TPC-W and TPC-App
– transactional web benchmark simulating an interactive e-business website
– TPC-W obsolete since April 2005; replaced(?) by TPC-App
• TPC-E
– New OLTP benchmark that simulates the workload of a brokerage firm
Step 3: Implementation
• Goal: Evaluation program(s) (client emulators) that measure what you have planned.
• Typical elements to take care of:
– Accurate timing
– Random number generation
– Fast logging
– No hidden serialisation, e.g. via global singletons
– No screen output during the measurement interval
• Avoid measuring the test harness rather than the SUT…
Time Measurements
• Every programming language offers some timing functions
• But be aware that there is a timer resolution
– E.g. Java's System.currentTimeMillis() suggests by its name that it measures time in milliseconds… The question is: how many milliseconds pass between updates?
• There is no point in trying to measure something taking microseconds with a timer of milliseconds resolution!
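The actual tick size of a timer can be probed empirically, e.g. with this small Java sketch that busy-waits until currentTimeMillis() changes and records the step size:

```java
// Probe the granularity of System.currentTimeMillis() by spinning until the
// reported value changes twice; the difference is the size of one timer tick.
public class TimerGranularity {
    public static long granularityMillis() {
        long t0 = System.currentTimeMillis();
        long t1;
        while ((t1 = System.currentTimeMillis()) == t0) { /* spin */ }
        long t2;
        while ((t2 = System.currentTimeMillis()) == t1) { /* spin */ }
        return t2 - t1;   // one timer tick, in ms
    }

    public static void main(String[] args) {
        System.out.println("currentTimeMillis() advances in steps of "
                + granularityMillis() + " ms");
    }
}
```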
Example: Java Timing
• Standard (all JDKs): System.currentTimeMillis()
– Be aware: Has different resolutions depending on OS and platform (see table below)!
• Since JDK 1.4.2 (portable, undocumented!!): sun.misc.Perf
– Example: // may throw SecurityException
sun.misc.Perf perf = sun.misc.Perf.getPerf();
long ticksPerSecond = perf.highResFrequency();
long currTick = perf.highResCounter();
long milliSeconds = (currTick * 1000) / ticksPerSecond;
• In JDK 1.5: java.lang.System.nanoTime()
– Always uses the best precision available on a system, but no guaranteed resolution
• Some third-party solutions use, e.g., Windows' high-performance timers through Java JNI (hence limited portability; best for Windows)
OS / Platform                     Resolution of currentTimeMillis()
Linux (2.2, x86)                  1 ms
Mac OS X                          1 ms
Windows 2000                      10 ms
Windows 98                        60 ms
Solaris (2.7/i386, 2.8/sun4u)     1 ms
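With JDK 1.5's nanoTime(), timing a piece of work looks like this (minimal sketch; the workload lambda is just an example):

```java
// Time a piece of work with the highest-resolution timer the JVM offers.
public class NanoTiming {
    public static long elapsedNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long ns = elapsedNanos(() -> {
            double x = 0;
            for (int i = 0; i < 100_000; i++) x += Math.sqrt(i);  // some work
        });
        System.out.println("took " + ns / 1_000 + " µs");
    }
}
```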
Example: Wrong Timer Usage
Example: Same Experiment with a High-res Timer
[Histogram of bean creation times (server-side): frequency over time in ms (0-3 ms); series show cache-hit times for TimesTen, MySQL (local), MySQL (remote), and Oracle (remote).]
Note the ‘faster’ response times as compared to the previous experiment
Note: No chance to measure the duration of a cache hit with currentTimeMillis()
Random Number Generators
• Common Mistakes:
– A multi-threaded client, but all threads use the same global Random object
This effectively serialises your threads!
– A large set of random numbers is generated within the measured code
We do not want to measure how fast Java can generate random numbers
Use an array with pre-generated random numbers (space vs. time)
– The seeds are the same for all threads
This makes your program deterministic…
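Both fixes can be sketched in a few lines of Java (illustrative only; names are my own):

```java
import java.util.Random;

// Sketch of the fixes above: a generator per thread with distinct seeds
// (nothing shared to serialise on), and pre-generation outside the
// measurement interval (space vs. time trade-off).
public class RandomInClients {

    // Pre-generate n random numbers before the measurement interval starts.
    public static int[] pregenerate(int n, long seed) {
        Random r = new Random(seed);
        int[] numbers = new int[n];
        for (int i = 0; i < n; i++) numbers[i] = r.nextInt(100);
        return numbers;
    }

    public static void main(String[] args) throws InterruptedException {
        // Each emulated client thread gets its OWN Random with its OWN seed;
        // one shared java.util.Random serialises threads on its internal state,
        // and identical seeds would make all clients issue the same workload.
        Thread[] clients = new Thread[4];
        for (int i = 0; i < clients.length; i++) {
            final long seed = 42 + i;                   // distinct seed per thread
            clients[i] = new Thread(() -> {
                int[] mine = pregenerate(1000, seed);   // done before measuring
                long sum = 0;
                for (int x : mine) sum += x;            // stand-in for using the numbers
            });
            clients[i].start();
        }
        for (Thread t : clients) t.join();
        System.out.println("done");
    }
}
```

Since JDK 7, java.util.concurrent.ThreadLocalRandom offers per-thread generators out of the box.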
Logging
• Goal: Fast logging of results during experiments without interfering with the measurements
• Approach:
– Log to a file, not the screen
Screen output / scrolling is VERY slow - a very common mistake
– Use standard log libraries with low overhead, e.g. Java's log4j (http://sourceforge.net/projects/log4j/) or Windows' performance counter API
– If your client reads data from a hard drive, write your log data to a different disk
– Log asynchronously, be fast, be thread-safe (careful!)
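Asynchronous, thread-safe logging can be sketched with a producer-consumer queue (illustrative Java; a real harness would write to a file and use a proper shutdown protocol instead of the "EOF" sentinel):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Measurement threads enqueue records cheaply; one writer thread drains the
// queue. A list stands in for the log file here.
public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<String> sink = new ArrayList<>();   // stand-in for a log file
    private final Thread writer;

    AsyncLogger() {
        writer = new Thread(() -> {
            try {
                String rec;
                while (!(rec = queue.take()).equals("EOF")) sink.add(rec);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        writer.start();
    }

    void log(String record) { queue.add(record); }   // cheap for producers

    void close() throws InterruptedException { queue.add("EOF"); writer.join(); }

    public static void main(String[] args) throws Exception {
        AsyncLogger log = new AsyncLogger();
        for (int i = 0; i < 100; i++) log.log("rt=" + i);
        log.close();
        System.out.println(log.sink.size() + " records written");
    }
}
```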
Windows Performance Monitor
• Windows includes a performance monitor application
– online GUI
– can capture to file
– supports remote monitoring!
• Based on the Windows Performance Counters API
– \\ComputerName\Object(Instance)\Counter
– supported by basically every server application; huge number of statistics
– can be used in your own programs
– Cf. http://technet2.microsoft.com/WindowsServer/en/library/3fb01419-b1ab-4f52-a9f8-09d5ebeb9ef21033.mspx?mfr=true
– From Java: http://www.javaworld.com/javaworld/jw-11-2004/jw-1108-windowspm.html
Step 4: Evaluation
• Objective: To collect accurate performance data in a set of experiments.
• Three major issues:
– Controlled evaluation environment
– Documentation
– Archiving of raw data
Evaluation Environment
• We want a stable evaluation environment that allows us to measure the system under test under repeatable settings without interference
– Clean computer initialisation (client(s) and server(s))
No concurrent programs, minimum set of system services
Disable anti-virus software!
– Decide: open or closed system?
– Decide: cold or warm caches?
– Make sure you do not measure any side-effects!
– Many prefer to measure during the night - why?
Open vs. Closed Systems
• Open system: clients arrive and leave the system (with an appropriate arrival distribution)
• Closed system: fixed number of clients; each starts a new task after finishing the previous one
• An open system is generally more realistic
• A closed system is much easier to write as a test harness
• An open system behaves much worse when contention is high
• See Schroeder et al., Proc. NSDI'06
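The two client models can be contrasted in a short sketch (illustrative Java; doRequest() is a stand-in, and the exponential inter-arrival times are one common choice for open systems, not a prescription):

```java
import java.util.Random;

// Closed model: next request only after the previous one finished.
// Open model: requests arrive independently of completions.
public class ClientModels {
    static int closedRequests = 0, openRequests = 0;

    static void doRequest() { /* stand-in for a real request */ }

    // Closed system: a fixed client issues requests back-to-back.
    static void closedClient(int n) {
        for (int i = 0; i < n; i++) { doRequest(); closedRequests++; }
    }

    // Open system: e.g. exponentially distributed inter-arrival gaps
    // (Poisson arrivals) at a given rate, regardless of completions.
    static void openArrivals(double ratePerMs, int n, long seed) {
        Random rng = new Random(seed);
        double t = 0;
        for (int i = 0; i < n; i++) {
            t += -Math.log(1 - rng.nextDouble()) / ratePerMs;  // next gap
            doRequest(); openRequests++;                        // scheduled at time t
        }
    }

    public static void main(String[] args) {
        closedClient(100);
        openArrivals(0.5, 100, 42);
        System.out.println(closedRequests + " closed, " + openRequests + " open");
    }
}
```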
Sequential Evaluation
• The special case of MPL = 1
– There are no multiple clients, or the system is centralised
• The test harness is typically the same; many of the implementation details still apply
• Several experiment repetitions
– In order to determine stable statistics, e.g. the mean response time
– Note: Use the same number of repetitions for all experiments!
Parallel Evaluation
• Multiprogramming level > 1
• A parallel evaluation has three distinct phases: ramp-up, steady phase, and close-down
• Performance can be measured either
– periodically during the steady phase, or
– in summary at the end (measuring period = whole steady phase); then the experiment must be repeated several times to determine the mean value
[Diagram: performance over time for a fixed MPL, showing ramp-up, steady phase, and close-down. Only measure during the steady phase!]
Performance Behaviour over MPL
• The measurements from the previous slide give you a single measuring point for one fixed MPL
• Typically one is interested in the system behaviour over varying MPLs
– One needs to conduct separate experiments for each MPL
• What to expect if we plot the throughput over the MPL?
This? Or this?
Outlier Policy
• Inherent complexity of the evaluated systems
– There is always a 'noise signal' or 'uncertainty factor'
• We expect individual measurements to vary around a stable mean value
– Note that the standard deviation can be quite high.
– But some measurements are 'way off' and have to be dealt with
• Outlier policy:
– What is an outlier? E.g. more than n times the standard deviation off the mean
– How to deal with it? E.g. replace with a 'spare' value
• Trace outliers - if there are too many, are they still outliers?
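A policy of "more than n standard deviations off the mean" can be sketched as follows (illustrative Java; the measurement values in main are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Flag measurements more than nSigma sample standard deviations from the mean.
public class OutlierFilter {
    public static List<Double> withoutOutliers(double[] xs, double nSigma) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / (xs.length - 1));   // sample standard deviation
        List<Double> kept = new ArrayList<>();
        for (double x : xs)
            if (Math.abs(x - mean) <= nSigma * sd) kept.add(x);
        return kept;
    }

    public static void main(String[] args) {
        double[] rt = {10, 11, 9, 10, 12, 10, 95};      // one measurement 'way off'
        System.out.println(withoutOutliers(rt, 2.0));   // the 95 is dropped
    }
}
```

Keep the dropped values in the raw data archive - if there are too many of them, they are not outliers but behaviour.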
Evaluation Scripting
• Conducting evaluations is tedious and error-prone
– E.g. to test scalability in the example paper:
3 scenarios x 4 algorithms = 12 configurations
12 configurations x 10 MPLs x 21 measurements = 2520 runs
– Note: The benchmark implementation is only one piece of those 2520 runs
• Need a 'test harness' - a set of scripts/programs that
– automates the evaluation as much as possible, and
– makes the evaluation repeatable!
start/stop servers, copy logging and configuration files, check for hanging runs, etc.
– Write scripts in the language that you prefer: Python, shell scripts, …; good knowledge of the OS and shell scripting is helpful
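The configuration sweep above can be sketched as a harness loop (illustrative Java; the scenario/algorithm names and the ten MPL values are made up, and runExperiment() only records what a real harness would execute):

```java
import java.util.ArrayList;
import java.util.List;

// Loop over all configurations (scenario x algorithm x MPL), as counted above.
public class Harness {
    static List<String> ran = new ArrayList<>();

    static void runExperiment(String scenario, String algorithm, int mpl) {
        // A real harness would: start the server, run the client emulator,
        // archive logs/configs, watch for hanging runs (timeout + kill),
        // and stop the server again.
        ran.add(scenario + "/" + algorithm + "/MPL=" + mpl);
    }

    public static void main(String[] args) {
        String[] scenarios = {"s1", "s2", "s3"};                 // hypothetical
        String[] algorithms = {"a1", "a2", "a3", "a4"};          // hypothetical
        int[] mpls = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512};   // hypothetical
        for (String s : scenarios)
            for (String a : algorithms)
                for (int m : mpls)
                    runExperiment(s, a, m);
        System.out.println(ran.size() + " runs");   // 3 x 4 x 10 = 120
    }
}
```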
Example Evaluation Script

REM @Echo OFF
TITLE QueryTest
REM parameters:
REM %1 number of nodes
REM … (this script has a bunch of parameters)
REM -----------------------------------------------------------------------------------------
REM freshly start server
Z:/programs/gnu/bin/sleep 10
start ./server.exe
Z:/programs/gnu/bin/sleep 20
REM configure server and prepare cache
./pdbsetconfig Nodes=%2 MaxQueueSize=22 MaxLoad=1 WrapperType=%3 Nodes=%2 Routing=RoundRobin Password=%4 Username=%4 ScanInputQueue=FCFS
./client Verbose=0 Directory=P:/tpc-r/%3/pdb_queries Files=CacheFlush.sql Loops=%1
REM run the actual measurement
REM the result logging is implemented as part of the client program
./pdbstatistics RESET
IF %10==verbose ./pdbgetconfig
IF %10==verbose echo === Testing %5-Routing (maxload %6 history %7) with %8/%9ms
IF %10==verbose ./client Files=%8 Verbose=1 Directory=P:/tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True
IF NOT %10==verbose ./client Files=%8 Verbose=0 Directory=P:/tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True ExcelOutput="%1 %6 %7 %8"
REM stop server again
REM first try normally, but if it hangs - kill it
./pdbsetconfig Shutdown
Z:/programs/gnu/bin/sleep 20
Z:/programs/reskit4/kill -F server.exe

Notes: Archiving of results is done as part of the client; the directory structure is parameterised in the script. The script also logs the current configuration and which test is being run.
Evaluation Documentation
• Goal: Being able to verify your results, e.g. by re-running your experiments.
• Keep detailed documentation of what you are doing
– Full disclosure of the evaluation environment, including hardware, OS, and software with full version numbers and any patches/changes
– Write a 'Readme': how you set up and conduct the tests
Chances are high that you will want to re-run something later, but already a few days(!) afterwards you will not remember every detail anymore…
Not to mention follow-up projects
– Keep an evaluation logbook: when was what evaluated, etc.
Evaluation: Archiving of Results
• Goal: To be able to analyse the results later in all details, even with regard to aspects that you did not think about when you planned the evaluation
• Archive ALL RAW DATA of your results
– not just average values etc.
– include any server logfiles, error files, and configurations
• Best practice:
– Keep a directory structure that corresponds to your evaluations
– Collect all raw result files, logfiles, config files etc. for each individual experiment; include the environment description
Example Result Archive Structure
• Evaluation
  – Cluster Mix 1
      Evaluation Setup.Readme
      Cluster Mix 1.xls
      Serial locking
        - 2006-8-12_run1
            Client.log
            Server.log
        - 2006-8-13_run2
            …
      Object-level locking
      Field-level locking
      Semantic locking
  – Cluster Mix 2
      …
  – Semantic Mix
      …

I prefer to have separate sheets in the Excel file for all raw results, and then one aggregation sheet for the means, which are plotted.
Archiving: Common Mistakes
• No server logs
– How do you verify that some effect wasn't due to a fault?
• No configuration / environment description
– Was this result from before or after you changed some setting?
• No raw data values, just aggregates - example:

Updates  Reads  Staleness  Drift  Read ratio  Mode    MRT      Executions  Aborts
500      100    2500       0      90          reader  93.2654  1861        981
500      100    2500       500    90          reader  84.9423  2097        721
500      100    2500       1000   90          reader  72.2220  2387        601
500      100    2500       1500   90          reader  68.5     2692        338
500      100    2500       2000   90          reader  67.1024  2744        181
500      100    2500       2500   90          reader  66.0952  2953        3

Problems: Is MRT in minutes, seconds, or milliseconds? How many runs were aggregated? Where is the standard deviation? What does the 'Read ratio' column mean - it contradicts the first two columns.
Step 5: Result Analysis
• Raw performance data is analysed with statistical methods
– Cf. last week's lecture
• Important:
– Standard deviation
– Confidence intervals
– Include error bars in graphs
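The basic statistics can be sketched as follows (normal approximation with z = 1.96; for few repetitions a Student-t quantile would be more appropriate, and the sample values in main are made up):

```java
// Sample mean, sample standard deviation, and a ~95% confidence interval
// for the mean: mean ± 1.96 * sd / sqrt(n).
public class Stats {
    public static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    public static double stddev(double[] xs) {
        double m = mean(xs), v = 0;
        for (double x : xs) v += (x - m) * (x - m);
        return Math.sqrt(v / (xs.length - 1));   // sample standard deviation
    }

    public static double[] confidenceInterval95(double[] xs) {
        double m = mean(xs);
        double half = 1.96 * stddev(xs) / Math.sqrt(xs.length);
        return new double[]{m - half, m + half};   // use as error bars in graphs
    }

    public static void main(String[] args) {
        double[] mrt = {93.1, 95.4, 92.8, 94.0, 93.5};   // repetitions of one experiment
        double[] ci = confidenceInterval95(mrt);
        System.out.printf("mean=%.2f  95%% CI=[%.2f, %.2f]%n", mean(mrt), ci[0], ci[1]);
    }
}
```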
Step 6: Result Presentation
• Finally, some remarks on good visualisation of results
• Note: The following examples are copied from conference submissions for illustration purposes only, not to blame the authors…
Example 1: Wrong Origin
• Which one is better, and by how much - LEACH or DEEAC?
The Y-axis does not start at 0! This misrepresents the actual performance difference, and the two graphs side-by-side are not comparable!
Also note: What does the Y-axis show?
Example 2: Wrong Scaling
• How would you describe the scalability of algorithm ‘cc’?
• Standard problem when using Excel the ‘easy way’...
Example 3: ?
• Which approach is better?
Ok, this is unfair as the authors used this graph explicitly to show that all approaches behave the same…
Presentation of the Results
• Be consistent with naming, colours, and order
– Two graphs are very hard to compare if you use different colours or names for the same things
• Axes start at 0; give descriptive axis titles with unit names
• Show error bars
– If it becomes too messy, just show one and explain in the text
• Excel and non-linear intervals == MESS
– e.g. gnuplot is much better
• Make sure everything is readable
– use large fonts, thick curves (not just 1 point wide) and dark colours
• Everything in the graphs should be explained
Links and References
• Performance Evaluations
– Jim Gray. "The Benchmark Handbook: For Database and Transaction Processing Systems". Morgan Kaufmann, 1992.
– G. Haring, C. Lindemann and M. Reiser (eds.). "Performance Evaluation: Origins and Directions". LNCS 1769, 2000.
– B. Schroeder et al. "Open versus Closed: A Cautionary Tale". In Proceedings of USENIX NSDI'06, pp. 239-252, 2006.
– P. Wu, A. Fekete and U. Roehm. "The Efficacy of Commutativity-Based Semantic Locking in Real-World Applications". 2006.
– U. Roehm. "OLAP with a Cluster of Databases". DISBIS 80, 2002.
• Java Time Measurements
– http://www.jsresources.org/faq_performance.html
– V. Roubtsov. "My kingdom for a good timer!", January 2003. http://www.javaworld.com/javaworld/javaqa/2003-01/01-qa-0110-timing.html
• Java vs. C etc.
– http://www.idiom.com/~zilla/Computer/javaCbenchmark.html