Performance Evaluations: An overview and some lessons learned. "Knowing is not enough; we must apply. Willing is not enough; we must do." Johann Wolfgang von Goethe.

TRANSCRIPT

  • Performance Evaluations: An overview and some lessons learned. "Knowing is not enough; we must apply. Willing is not enough; we must do." Johann Wolfgang von Goethe

    Performance Evaluations - 2007 (U. Roehm)

  • Performance Evaluation. Research methodology: quantitative evaluation of an existing software or hardware artefact (SUT - system under test), using a set of experiments, each consisting of a series of performance measurements, to collect realistic performance data.


  • Performance Evaluations
    Prepare: What do you want to measure?
    Plan: How can you measure it, and what is needed?
    Implement: Pitfalls for the correct implementation of benchmarks.
    Evaluate: How to conduct performance experiments.
    Analyse and Visualise: What does the outcome mean?


  • Step 1: Preparations
    What do you want to show? An understanding of the behaviour of an existing system? Or a proof of an approach's superiority?
    Which performance metrics are you interested in? Mean response times, throughput, scalability.
    What are the variables? Parameters of your own system / algorithm: number of concurrent users, number of nodes, ...


  • Performance Metrics
    Response Time (rt): the time duration the SUT takes to answer a request. Note: for complex tasks, response time and runtime are not the same (see the next slide).
    Mean Response Time (mrt): the mean of the response times of a set of requests. Only successful requests count!
    Throughput (thp): the number of successful requests per time unit.
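
    To make the two definitions concrete, here is a minimal sketch (not from the slides; the RequestResult class and its field names are invented): only successful requests enter the mean response time, and throughput divides the count of successful requests by the length of the measurement interval.

      // Sketch only: per-request record with response time in ms and a success flag.
      class RequestResult {
          final double responseTimeMillis;
          final boolean success;
          RequestResult(double rt, boolean ok) { responseTimeMillis = rt; success = ok; }
      }

      class MetricsSketch {
          // Mean response time (mrt): averaged over the successful requests only.
          static double meanResponseTime(java.util.List<RequestResult> results) {
              double sum = 0; int n = 0;
              for (RequestResult r : results) {
                  if (r.success) { sum += r.responseTimeMillis; n++; }
              }
              return n == 0 ? Double.NaN : sum / n;
          }

          // Throughput (thp): successful requests per second of the measurement interval.
          static double throughput(java.util.List<RequestResult> results, double intervalSeconds) {
              int ok = 0;
              for (RequestResult r : results) { if (r.success) ok++; }
              return ok / intervalSeconds;
          }
      }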


  • Runtime versus Response Time
    Model: client-server communication with the server queueing incoming requests (e.g. web servers or database servers); the client sends a request a.
    runt(a) - runtime to complete request a: runt(a) = t_receiveN(last_result) - t_send
    rt(a) - response time for action a (until the first result comes back to the client!): rt(a) = t_receive1(first_result) - t_send
    wt(a) - waiting time of action a in the server queue
    et(a) - execution time of action a (after the server took the request from the queue)
    frt(a) - first result time of action a (Note: frt(a) ...)
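
    A rough sketch of how a client could record both quantities for one request (the ResultStream interface is a stand-in for however the SUT actually returns its results): rt(a) stops the clock at the first result, runt(a) at the last.

      // Sketch only: ResultStream stands in for the real result channel of the SUT.
      interface ResultStream {
          boolean hasNext();
          Object next();
      }

      class RuntimeVsResponseTime {
          static void measure(ResultStream results) {
              long tSend = System.nanoTime();      // request is sent here
              long tFirst = -1, tLast = -1;
              while (results.hasNext()) {
                  results.next();                  // consume one result chunk
                  long now = System.nanoTime();
                  if (tFirst < 0) tFirst = now;    // first result: response time
                  tLast = now;                     // last result: runtime
              }
              double rtMillis   = (tFirst - tSend) / 1e6;
              double runtMillis = (tLast  - tSend) / 1e6;
              System.out.println("rt(a) = " + rtMillis + " ms, runt(a) = " + runtMillis + " ms");
          }
      }
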
  • More Performance Metrics
    Scalability / Speedup: thp_n / thp_1

    Fairness:

    Resource Consumption: memory usage, CPU load, energy consumption. Note: the latter are typically server-side statistics!


  • Step 2: Planning
    Which experiments will show the intended results? What do you need to run those experiments? Hardware, software, data (!)
    Prepare an evaluation schedule: evaluations always take longer than you expect!
    Expect to change some goals / your approach based on the outcome of initial experiments; some initial runs might be helpful to explore the space.


  • Typical Client-Server Evaluation Setup
    [Diagram: client emulator(s) connected over a test network to the system under test (SUT), i.e. the server(s), with a separate server-server network.]
    In general, the SUT can be arbitrarily complex, e.g. clustered servers or multi-tier architectures. The client emulator(s) should run on a separate machine from the server(s).


  • Workload Specifications
    Multiprogramming Level (MPL): how many concurrent users / clients?
    Heterogeneous workload? Is every client doing the same, or are there variations? Typically: a well-defined set of transactions / request kinds with a defined distribution.
    Do you emulate programs or users? If just interested in peak performance, issue as many requests as possible. Sometimes a more complex user model is needed, e.g. TPC-C users with think times and sleep times.
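
    A small sketch of such a user emulator (the transaction names, weights, and think-time range are invented for illustration; a real TPC-C mix prescribes its own values):

      import java.util.Random;

      // Emulated user: picks the next request kind according to a fixed distribution
      // and sleeps for a think time between requests (values are illustrative only).
      class WorkloadMix {
          private static final String[] KINDS   = { "NewOrder", "Payment", "StockLevel" };
          private static final double[] WEIGHTS = { 0.45,       0.45,      0.10 };

          private final Random rng = new Random();

          String nextRequestKind() {
              double p = rng.nextDouble(), acc = 0.0;
              for (int i = 0; i < KINDS.length; i++) {
                  acc += WEIGHTS[i];
                  if (p < acc) return KINDS[i];
              }
              return KINDS[KINDS.length - 1];
          }

          void thinkTime() throws InterruptedException {
              Thread.sleep(500 + rng.nextInt(1000));   // e.g. 0.5-1.5 s between requests
          }
      }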


  • Experimental Data
    Problem: how do we get reasonable test data?
    Approach 1: Tracing. If you have an existing system available, trace typical usage and use such traces to drive your experiments.
    Approach 2: Standard benchmarks. Use the data generator of a standard benchmark; in some areas, there are explicit data corpora to evaluate against.
    Approach 3: Make something up yourself. Always the least preferable way!!! Justify why you think that your data setup is representative, e.g. by following the pattern of a standard benchmark.


  • Standard Benchmarks
    There are many standard benchmarks available: very helpful to make results more comparable; most come with synthetic data generators; they make your results more publishable (reviewers will have more trust in your experimental setup).
    Disadvantages: standard benchmarks can be very complex; some specifications are not free, but cost money.
    Examples: TPC-C, TPC-H, TPC-R, TPC-W, ECPerf, SPECjAppServer, IBM's Trade2, etc.


  • Example: TPC Benchmarks
    TPC - Transaction Processing Performance Council (tpc.org): a non-profit corporation of commercial software vendors that defined a set of database and e-business performance benchmarks.
    TPC-C: measures the performance of OLTP systems (order-entry scenario); v1.0 became official in 1992; current version v5.8.
    TPC-H and TPC-R (formerly TPC-D): performance of OLAP systems (warehouse scenario); in 1999, TPC-D was replaced by TPC-H (ad-hoc queries) + TPC-R (reporting).
    TPC-W and TPC-App: transactional web benchmark simulating an interactive e-business website; TPC-W obsolete since April 2005; replaced(?) by TPC-App.
    TPC-E: new OLTP benchmark that simulates the workload of a brokerage firm.


  • Step 3: Implementation
    Goal: evaluation program(s) (client emulators) that measure what you have planned.
    Typical elements to take care of: accurate timing; random number generation; fast logging; no hidden serialisation, e.g. via global singletons; no screen output during the measurement interval; avoid measuring the test harness rather than the SUT.


  • Time Measurements
    Every programming language offers some timing functions, but be aware that there is a timer resolution.
    E.g. Java's System.currentTimeMillis() suggests by its name that it measures time in milliseconds; the question is how many milliseconds pass between updates.
    There is no point in trying to measure something taking microseconds with a timer of milliseconds resolution!


  • Example: Java Timing
    Standard (all JDKs): System.currentTimeMillis(). Be aware: it has different resolutions depending on OS and platform!

    Since JDK 1.4.2 (portable, undocumented!!): sun.misc.Perf
    Example:
        // may throw SecurityException
        sun.misc.Perf perf = sun.misc.Perf.getPerf();
        long ticksPerSecond = perf.highResFrequency();
        long currTick = perf.highResCounter();
        long milliSeconds = (currTick * 1000) / ticksPerSecond;
    In JDK 1.5: java.lang.System.nanoTime(). Always uses the best precision available on the system, but with no guaranteed resolution.
    Some third-party solutions use, e.g., Windows high-performance timers through Java JNI (hence limited portability, best for Windows).

    Resolution of System.currentTimeMillis() by platform:
        Linux (2.2, x86)                1 ms
        Mac OS X                        1 ms
        Windows 2000                   10 ms
        Windows 98                     60 ms
        Solaris (2.7/i386, 2.8/sun4u)   1 ms
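
    A quick way to check the effective resolution on your own platform is to spin until the reported value changes; a minimal sketch:

      public class TimerGranularity {
          public static void main(String[] args) {
              // Spin until currentTimeMillis() ticks over, several times,
              // and report the smallest observed step as the effective resolution.
              long minStep = Long.MAX_VALUE;
              for (int i = 0; i < 10; i++) {
                  long t0 = System.currentTimeMillis();
                  long t1;
                  do { t1 = System.currentTimeMillis(); } while (t1 == t0);
                  minStep = Math.min(minStep, t1 - t0);
              }
              System.out.println("currentTimeMillis() advances in steps of ~" + minStep + " ms");
          }
      }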


  • Example: Wrong Timer Usage


  • Example: Same Experiment - High-res Timer


    Client response times for GetShares(), 500 beans:

                               cache-hit       TimesTen        MySQL (local)   MySQL (remote)   Oracle (remote)
        500 beans (total)          -           1786            2375            2787             12228
        1 bean (average)           2.324       3.572           4.75            5.574            24.456
        factor vs cache-hit        1           1.5370051635    2.0438898451    2.3984509466     10.5232358003

    (One of the averages above is copied from the 100 beans average of data2.xls.)

  • Random Number Generators
    Common mistakes:
    A multi-threaded client, but all threads use the same global Random object: this effectively serialises your threads!
    A large set of random numbers is generated within the measured code: we do not want to measure how fast Java can generate random numbers; use an array of pre-generated random numbers (space vs. time).
    The seeds are the same: you make your program deterministic.
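
    A minimal sketch of the two fixes, assuming plain java.util.Random (class and method names invented): one generator per thread with distinct seeds, and values pre-generated before the measurement interval.

      import java.util.Random;

      class RandomPerThread {
          // One Random per thread with a distinct seed: avoids the hidden
          // serialisation of a shared generator and avoids identical streams.
          private static final ThreadLocal<Random> RNG = new ThreadLocal<Random>() {
              @Override protected Random initialValue() {
                  return new Random(System.nanoTime() ^ Thread.currentThread().getId());
              }
          };

          // Pre-generate the random values before the measurement interval starts,
          // so the experiment does not also measure random-number generation.
          static int[] pregenerate(int count, int bound) {
              Random r = RNG.get();
              int[] values = new int[count];
              for (int i = 0; i < count; i++) values[i] = r.nextInt(bound);
              return values;
          }
      }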


  • Logging
    Goal: fast logging of results during experiments without interfering with the measurements.
    Approach:
    Log to a file, not the screen: screen output / scrolling is VERY slow; a very common mistake.
    Use standard log libraries with low overhead, e.g. Java's log4j (http://sourceforge.net/projects/log4j/) or the Windows performance counter API.
    If your client reads data from a hard drive, write your log data to a different disk.
    Log asynchronously, be fast, be thread-safe (careful!).
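
    A bare-bones sketch of asynchronous logging (class and sentinel names invented; a library such as log4j is the more robust choice): client threads only enqueue a line, and a single background thread writes the file.

      import java.io.BufferedWriter;
      import java.io.FileWriter;
      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.LinkedBlockingQueue;

      // Minimal asynchronous result logger: measurement threads only enqueue a line
      // (cheap, thread-safe); a single background thread writes to the log file.
      class AsyncResultLog {
          private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
          private final Thread writer;

          AsyncResultLog(final String fileName) {
              writer = new Thread(new Runnable() {
                  public void run() {
                      try {
                          BufferedWriter out = new BufferedWriter(new FileWriter(fileName));
                          String line;
                          while (!(line = queue.take()).equals("__EOF__")) {
                              out.write(line);
                              out.newLine();
                          }
                          out.close();
                      } catch (Exception e) { e.printStackTrace(); }
                  }
              });
              writer.start();
          }

          void log(String line) { queue.offer(line); }   // called from client threads

          void close() throws InterruptedException {     // call after the measurement run
              queue.offer("__EOF__");
              writer.join();
          }
      }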


  • Windows Performance Monitor
    Windows includes a performance monitor application: online GUI; can capture to file; supports remote monitoring!
    Based on the Windows API Performance Counters: \\ComputerName\Object(Instance)\Counter
    Supported by basically every server application; a huge number of statistics; can be used in your own programs.
    Cf. http://technet2.microsoft.com/WindowsServer/en/library/3fb01419-b1ab-4f52-a9f8-09d5ebeb9ef21033.mspx?mfr=true
    From Java: http://www.javaworld.com/javaworld/jw-11-2004/jw-1108-windowspm.html


  • Step 4: Evaluation
    Objective: to collect accurate performance data in a set of experiments.
    Three major issues: a controlled evaluation environment; documentation; archiving of raw data.


  • Evaluation Environment
    We want a stable evaluation environment that allows us to measure the system under test under repeatable settings without interference.
    Clean computer initialisation (client(s) and server(s)): no concurrent programs, a minimum set of system services, disable anti-virus software!
    Decide: open or closed system? Decide: cold or warm caches?
    Make sure you do not measure any side-effects! Many prefer to measure during the night - why?


  • Open vs. Closed Systems
    Open system: clients arrive and leave the system (with an appropriate arrival distribution).
    Closed system: a fixed number of clients; each starts a new task after finishing the previous one.
    An open system is generally more realistic; a closed system is much easier to write as a test harness; an open system behaves much worse when contention is high.
    See Schroeder et al., Proc. NSDI'06.
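
    A sketch of the difference in the client emulator (issueRequest() stands in for the real benchmark request): the closed client submits its next request only after the previous one returned, while the open variant generates arrivals independently of outstanding work.

      import java.util.Random;

      class OpenVsClosed {
          static void issueRequest() { /* stand-in for the real benchmark request */ }

          // Closed system: a fixed number of such clients; each starts the next
          // request only after finishing the previous one.
          static void closedClient(int requests) {
              for (int i = 0; i < requests; i++) {
                  issueRequest();                  // blocks until the request completes
              }
          }

          // Open system: requests arrive with exponentially distributed inter-arrival
          // times, independent of how many requests are still in the system.
          static void openArrivals(double arrivalsPerSecond, int requests) throws InterruptedException {
              Random rng = new Random();
              for (int i = 0; i < requests; i++) {
                  double gapSeconds = -Math.log(1.0 - rng.nextDouble()) / arrivalsPerSecond;
                  Thread.sleep((long) (gapSeconds * 1000));
                  new Thread(new Runnable() { public void run() { issueRequest(); } }).start();
              }
          }
      }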


  • Sequential Evaluation
    The special case of MPL = 1: there are no multiple clients, or the system is centralised.
    The test harness is typically the same; many of the implementation details still apply.
    Several experiment repetitions in order to determine stable statistics, e.g. the mean response time. Note: use the same number of repetitions for all experiments!


  • Parallel Evaluation
    Multiprogramming level > 1. A parallel evaluation has three distinct phases: ramp-up, steady phase, and close-down.
    [Figure: performance over time, showing the ramp-up, steady phase, and close-down periods.]
    Performance can be measured either periodically during the steady phase, or in summary at the end (measuring period = the whole steady phase); then it must be repeated several times to obtain the mean value.
    Only measure during the steady phase! Note: fixed MPL.
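
    A tiny sketch of the steady-phase rule (the phase lengths are invented; in practice they are chosen from initial runs): samples outside the window are simply discarded.

      class SteadyPhaseWindow {
          // Illustrative phase lengths only.
          static final long RAMP_UP_MS = 60000;
          static final long STEADY_MS  = 300000;

          // Returns true if a sample taken at 'sampleTimeMs' (ms since the run started)
          // falls into the steady phase and should be counted.
          static boolean inSteadyPhase(long sampleTimeMs) {
              return sampleTimeMs >= RAMP_UP_MS && sampleTimeMs < RAMP_UP_MS + STEADY_MS;
          }
      }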


  • Performance Behaviour over MPL
    The measurements from the previous slide give you a single measuring point for one fixed MPL. Typically you are interested in the system behaviour over varying MPLs, so you need to conduct separate experiments for each MPL.
    What should we expect if we plot the throughput over the MPL?


  • Outlier Policy
    Due to the inherent complexity of the evaluated systems, there is always a noise signal or uncertainty factor. We expect individual measurements to vary around a stable mean value; note that the standard deviation can be quite high. But some measurements are way off.
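
    One common policy, sketched below (the threshold k is a choice, not a rule): flag measurements more than k standard deviations from the mean, and state the policy in the write-up.

      class OutlierPolicy {
          // Flags values more than k standard deviations away from the mean.
          static boolean[] flagOutliers(double[] values, double k) {
              double sum = 0;
              for (double v : values) sum += v;
              double mean = sum / values.length;

              double sq = 0;
              for (double v : values) sq += (v - mean) * (v - mean);
              double stdDev = Math.sqrt(sq / values.length);

              boolean[] outlier = new boolean[values.length];
              for (int i = 0; i < values.length; i++) {
                  outlier[i] = Math.abs(values[i] - mean) > k * stdDev;
              }
              return outlier;
          }
      }
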
  • Evaluation Scripting
    Conducting evaluations is tedious and error-prone. E.g. to test scalability in the example paper: 3 scenarios x 4 algorithms = 12 configurations; 12 configurations x 10 MPLs x 21 measurements = 2520 runs. Note: the benchmark implementation is only one of those 2520.
    You need a test harness - a set of scripts/programs that automates the evaluation as much as possible and makes the evaluation repeatable: start/stop servers, copy logging and configuration files, check for hanging runs, etc.
    Write scripts in the language that you prefer (Python, shell scripts); good knowledge of the OS and shell scripting is helpful.


  • Example Evaluation Script
    REM @Echo OFF
    TITLE QueryTest
    REM parameters:
    REM %1  number of nodes
    REM (this script has a bunch of parameters)
    REM -----------------------------------------------------------------------------------------

    REM freshly start server
    Z:/programs/gnu/bin/sleep 10
    start ./server.exe
    Z:/programs/gnu/bin/sleep 20

    REM configure server and prepare cache
    ./pdbsetconfig Nodes=%2 MaxQueueSize=22 MaxLoad=1 WrapperType=%3 Nodes=%2 Routing=RoundRobin Password=%4 Username=%4 ScanInputQueue=FCFS
    ./client Verbose=0 Directory=P:/tpc-r/%3/pdb_queries Files=CacheFlush.sql Loops=%1

    REM run the actual measurement
    REM the result logging is implemented as part of the client program
    ./pdbstatistics RESET
    IF %10==verbose ./pdbgetconfig
    IF %10==verbose echo === Testing %5-Routing (maxload %6 history %7) with %8/%9ms
    IF %10==verbose ./client Files=%8 Verbose=1 Directory=P:tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True
    IF NOT %10==verbose ./client Files=%8 Verbose=0 Directory=P:tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True ExcelOutput="%1%6%7%8"

    REM stop server again
    REM first try normally, but if it hangs - kill it.
    ./pdbsetconfig Shutdown
    Z:/programs/gnu/bin/sleep 20
    Z:/programs/reskit4/kill -F server.exe

    Archiving of results is done as part of the client; the directory structure is parameterised in the script, which logs the current configuration and which test it is.


  • Evaluation Documentation
    Goal: being able to verify your results, e.g. by re-running your experiments.
    Keep detailed documentation of what you are doing. Full disclosure of the evaluation environment, including hardware, OS, and software with full version numbers and any patches/changes.
    Write a Readme on how you set up and conduct the tests. Chances are high that you will want to re-run something later, but already a few days(!) afterwards you will not remember every detail anymore, not to mention follow-up projects.
    Keep an evaluation logbook: when was what evaluation done, etc.


  • Evaluation: Archiving of Results
    Goal: to be able to analyse the results later in all detail, even with regard to aspects that you did not think about when you planned the evaluation.
    Archive ALL RAW DATA of your results, not just average values etc.; include any server logfiles, error files, and configurations.
    Best practice: keep a directory structure that corresponds to your evaluations; collect all raw result files, logfiles, config files etc. for each individual experiment; include the environment description.


  • Example Result Archive Structure
    Evaluation/
        Cluster Mix 1/
            Evaluation Setup.Readme
            Cluster Mix 1.xls
            Serial locking/
                2006-8-12_run1/
                    Client.log
                    Server.log
                2006-8-13_run2/
            Object-level locking/
            Field-level locking/
            Semantic Locking/
        Cluster Mix 2/
        Semantic Mix/
    I prefer to have separate sheets in the Excel file for all raw results, and then one aggregation sheet for the means, which are plotted...
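
    A small sketch of automating this structure (the directory layout and names are invented): create one directory per run and copy the raw files into it immediately after the run.

      import java.io.File;
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.StandardCopyOption;
      import java.text.SimpleDateFormat;
      import java.util.Date;

      class ArchiveRun {
          // Copies raw result/log/config files into results/<config>/<date>_runN/.
          static void archive(String config, int runNumber, File[] rawFiles) throws IOException {
              String date = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
              File runDir = new File("results/" + config + "/" + date + "_run" + runNumber);
              if (!runDir.mkdirs() && !runDir.isDirectory()) {
                  throw new IOException("could not create " + runDir);
              }
              for (File f : rawFiles) {
                  Files.copy(f.toPath(), new File(runDir, f.getName()).toPath(),
                             StandardCopyOption.REPLACE_EXISTING);
              }
          }
      }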
