Linux Cluster Production Readiness. Egan Ford, IBM. egan@us.ibm.com, egan@sense.net


  • Slide 1
  • Linux Cluster Production Readiness Egan Ford IBM egan@us.ibm.com egan@sense.net
  • Slide 2
  • Agenda: Production Readiness, Diagnostics, Benchmarks, STAB, Case Study, SCAB
  • Slide 3
  • What is Production Readiness? Production readiness is a series of tests to help determine if a system is ready for use. Production readiness falls into two categories: diagnostics and benchmarks. The purpose is to confirm that all hardware is good and identical (per class): the search for consistency and predictability.
  • Slide 4
  • What are diagnostics? Diagnostic tests are usually pass/fail and include, but are not limited to: simple version checks (OS, BIOS versions), inventory checks (memory, CPU, etc.), configuration checks (is HT off?), and vendor-supplied diagnostics (DOS on a CD).
  • Slide 5
  • Why benchmark? Diagnostics are usually pass/fail, and thresholds may be undocumented, so "why did it fail?" is difficult to answer. Diagnostics may also be incomplete; they may not test all subsystems. Other issues with diagnostics: false positives; inconsistency from vendor to vendor; they do no real work, so they cannot check for accuracy; and they are usually hardware-based. What about software? What about the user environment?
  • Slide 6
  • Why benchmark? Benchmarks can be checked for accuracy. Benchmarks can stress all used subsystems. Benchmarks can stress all used software. Benchmarks can be measured and you can determine the thresholds.
  • Slide 7
  • Benchmark or diagnostics? Do both. All diagnostics should pass first. Benchmarks will be inconsistent if diagnostics fail.
  • Slide 8
  • WARNING! The following slides will contain the word statistics. Statistics cannot prove anything. Exercise common sense.
  • Slide 9
  • A few words on statistics Statistics increases human knowledge through the use of empirical data. "There are three kinds of lies: lies, damned lies, and statistics." -- Benjamin Disraeli (1804-1881) There are three kinds of lies: lies, damned lies, and Linpack.
  • Slide 10
  • What is STAB? STatistical Analysis of Benchmarks. A systematic way of running a series of increasingly complex benchmarks to find avoidable inconsistencies. Avoidable inconsistencies may lead to performance problems. GOAL: consistent, repeatable, accurate results.
  • Slide 11
  • What is STAB? Each benchmark is run one or more times per node, then the best representative of each node (ignored for multi-node tests) is grouped together and analyzed as a single population. The results themselves are not as interesting as the shape of their distribution. Empirical evidence for all the benchmarks in the STAB HOWTO suggests that they should all form a normal distribution. A normal distribution is the classic bell curve that appears so frequently in statistics. It is the sum of smaller, independent (possibly unobservable), identically-distributed variables or random events.
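The "best representative per node" step above can be sketched in a few lines; the node names and raw numbers below are hypothetical, and "best" is taken to mean the highest measured rate:

```python
# Hypothetical raw results: several runs per node (e.g., a rate in MB/s).
runs = {
    "node1": [5410.2, 5395.8, 5412.7],
    "node2": [5408.9, 5411.3, 5409.5],
    "node3": [4102.4, 4099.8, 4101.1],   # a slow node worth investigating
}

# Keep the best run per node, then analyze the bests as one population.
best = {node: max(results) for node, results in runs.items()}
population = sorted(best.values())
print(population)
```

The sorted population is what the distribution analysis (and the plots in the following slides) operates on.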
  • Slide 12
  • Uniform Distribution The plot below is of 20,000 random single-die rolls.
  • Slide 13
  • Normal Distribution Sum of 5 dice thrown 10000 times.
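The dice experiments on these two slides are easy to reproduce; a minimal sketch (with a fixed seed so the run is repeatable):

```python
import random
from collections import Counter

random.seed(0)

# One die: each face 1-6 is equally likely, so the histogram is flat (uniform).
single = [random.randint(1, 6) for _ in range(10000)]

# Sum of 5 dice: many small independent contributions, so the histogram of
# sums is bell-shaped (approximately normal), peaking near 5 * 3.5 = 17.5.
sums = [sum(random.randint(1, 6) for _ in range(5)) for _ in range(10000)]

uniform_counts = Counter(single)
sum_counts = Counter(sums)

# Flatness check: tallest/shortest uniform bin ratio stays close to 1.
print(max(uniform_counts.values()) / min(uniform_counts.values()))
# Bell check: the center bin towers over a tail bin.
print(sum_counts[17], sum_counts.get(6, 0))
```

The same effect is why many independent per-run disturbances push benchmark results toward a normal distribution.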
  • Slide 14
  • Normal Distribution Benchmarks also have many small, independent (possibly unobservable), identically-distributed variables that may affect performance, e.g.: competing processes, context switching, hardware interrupts, software interrupts, memory management, process/thread scheduling, cosmic rays. The above may be unavoidable, but they are part of the source of a normal distribution.
  • Slide 15
  • Non-normal Distribution Benchmarks may also have non-identically-distributed observable variables that may affect performance, e.g.: memory configuration, BIOS version, processor speed, operating system, kernel type (e.g. NUMA vs. SMP vs. UNI), kernel version, bad memory (e.g. excessive ECCs), chipset revisions, Hyper-Threading or SMT, non-uniform competing processes (e.g. httpd running on some nodes but not others), shared library versions, bad cables, bad administrators, users. The above are avoidable, and eliminating them is the purpose of the STAB HOWTO. Avoidable inconsistencies may lead to multimodal or non-normal distributions.
  • Slide 16
  • STAB Toolkit The STAB tools are a collection of scripts to help run selected benchmarks and to analyze their results. Some of the tools are specific to a particular benchmark; others are general and operate on the data collected by the specific tools. Benchmark-specific tools consist of benchmark launch scripts, accuracy validation scripts, miscellaneous utilities, and analysis scripts that collect the data, report some basic descriptive statistics, and create input files to be used with the general STAB tools for additional analysis.
  • Slide 17
  • STAB Toolkit With a goal of consistent, repeatable, accurate results, it is best to start with as few variables as possible. Start with single-node benchmarks, e.g., STREAM. If all machines have similar STREAM results, then memory can be ruled out as a factor in other benchmark anomalies. Next, work your way up to processor and disk benchmarks, then two-node (parallel) benchmarks, then multi-node (parallel) benchmarks. After each more complicated benchmark run, check for consistent, repeatable, accurate results before continuing.
  • Slide 18
  • The STAB Benchmarks Single Node (serial) Benchmarks: STREAM (memory MB/s) NPB Serial (uni-processor FLOP/s and memory) NPB OpenMP (multi-processor FLOP/s and memory) HPL MPI Shared Memory (multi-processor FLOP/s and memory) IOzone (disk MB/s, memory, and processor) Parallel Benchmarks (for MPI systems only): Ping-Pong (interconnect sec and MB/s) NAS Parallel (multi-node FLOP/s, memory, and interconnect) HPL Parallel (multi-node FLOP/s, memory, and interconnect)
  • Slide 19
  • Getting STAB http://sense.net/~egan/bench: bench.tgz (code with source, all script); bench-oss.tgz (OSS code, e.g. Gnuplot); bench-examples.tgz (1 GB of collected data, all text, 186000+ files); stab.pdf (documentation, currently 150 pages, WIP, check back before 11/30/2005).
  • Slide 20
  • Install STAB Extract bench*.tgz into your home directory: cd ~; tar zxvf bench.tgz; tar zxvf bench-oss.tgz; tar zxvf bench-examples.tgz. Add the STAB tools to PATH: export PATH=~/bench/bin:$PATH. Append the same export line to .bashrc so it persists across logins.
  • Slide 21
  • Install STAB STAB requires Gnuplot 4, and it must be built a specific way: cd ~/bench/src; tar zxvf gnuplot-4.0.0.tar.gz; cd gnuplot-4.0.0; ./configure --prefix=$HOME/bench --enable-thin-splines; make; make install
  • Slide 22
  • STAB Benchmark Tools Each benchmark supported in this document contains an anal (short for analysis) script. This script is usually run from an output directory, e.g.: cd ~/bench/benchmark/output; ../anal
    benchmark   nodes  low     high    %      mean    median  std dev
    bt.A.i686   4      615.77  632.08  2.65   627.85  632.02  8.06
    cg.A.i686   4      159.78  225.08  40.87  191.05  193.16  26.86
    ep.A.i686   4      11.51   11.53   0.17   11.52   11.52   0.01
    ft.A.i686   4      448.05  448.90  0.19   448.63  448.81  0.39
    lu.A.i686   4      430.60  436.59  1.39   433.87  434.72  2.51
    mg.A.i686   4      468.12  472.54  0.94   470.86  472.12  2.00
    sp.A.i686   4      449.01  449.87  0.19   449.58  449.72  0.39
    The anal scripts produce statistics about the results to help find anomalies. The theory is that if you have identical nodes, then you should be able to obtain identical results (not always true). The anal scripts will also produce plot.* files for use by dplot, to graphically represent the distribution of the results, and by cplot, to plot 2D correlations.
  • Slide 23
  • Rant: % vs. normal distribution Is % good? % variability can tell you something about the data with respect to itself, without knowing anything else about the data. It is non-dimensional, with a range (usually 0-100) that has meaning to anyone; IOW, management understands percentages. Is % bad? It minimizes the amount of useful empirical data. It hides the truth.
  • Slide 24
  • % is not good, exhibit A Clearly this is a normal distribution, but the variability is 500%. This is an extreme case where all the possible values exist within a predetermined range.
  • Slide 25
  • % is not good, exhibit B Low variability can hide a skewed distribution. Variability is low, only 1.27%, but the distribution is clearly skewed to the right.
  • Slide 26
  • % is not good, exhibit C A 5.74% variability hides a bimodal distribution. Bimodal distributions are clear indicators that there is an observable difference between two different sets of nodes.
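One crude way to expose the kind of bimodal population that a small % spread hides is to bin the results and look for an empty gap between two clusters; the per-node values below are hypothetical, two groups roughly 6% apart:

```python
# Hypothetical per-node results: two clusters with a small overall spread.
results = [449.0, 449.3, 449.5, 449.8, 474.9, 475.2, 475.5, 475.8]

low, high = min(results), max(results)
nbins = 8
width = (high - low) / nbins

# Histogram: count how many results fall into each bin.
counts = [0] * nbins
for r in results:
    i = min(int((r - low) / width), nbins - 1)
    counts[i] += 1

# Empty interior bins between two occupied end regions suggest two modes,
# i.e., two observably different sets of nodes.
bimodal = any(c == 0 for c in counts[1:-1]) and counts[0] > 0 and counts[-1] > 0
print(counts, "bimodal suspicion:", bimodal)
```

This is the same information dplot shows graphically; the point is that the % column alone would report under 6% and look healthy.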
  • Slide 27
  • STAB General Analysis Tools dplot is for plotting distributions. All the graphical output used as illustrations in this document up to this point was created with dplot. dplot provides a number of options for binning the data and analyzing the distribution. cplot is for correlating two different sets of results, e.g., does poor memory performance correlate with poor application performance? danal is very similar to the output provided by the custom anal scripts provided with each benchmark, but has additional output options. You can safely discard any anal screen output, because it can be recreated with danal and the resulting plot.benchmark file. Each script requires one or more plot.benchmark files. dplot and danal are less strict and will work with any file of numbers, as long as the numbers are in the first column; subsequent columns are ignored. cplot, however, requires the 2nd column; it is impossible to correlate two sets of results without an index.
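The cplot idea, joining two result sets on the node index (the required 2nd column) and checking whether they move together, can be sketched with a Pearson correlation; the values and benchmark names below are hypothetical:

```python
# Hypothetical plot.* data: (value, node-index) pairs from two benchmarks.
stream = [(5410.0, 1), (5408.0, 2), (4100.0, 3), (5411.0, 4)]  # memory MB/s
hpl    = [(61.2, 1), (61.0, 2), (49.5, 3), (61.3, 4)]          # GFLOP/s

# Join the two sets on the node index, then correlate the paired values.
by_node = {n: v for v, n in stream}
pairs = [(by_node[n], v) for v, n in hpl]

xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)

# Pearson r: covariance divided by the product of the standard deviations.
cov = sum((x - mx) * (y - my) for x, y in pairs)
r = cov / (sum((x - mx) ** 2 for x in xs) ** 0.5
           * sum((y - my) ** 2 for y in ys) ** 0.5)
print(f"r = {r:.3f}")
```

An r near 1.0 here would say the node with poor memory bandwidth is the same node dragging down the compute benchmark, which is exactly the question cplot is built to answer.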
