TRANSCRIPT
1
Special Topics in Grid Enablement of Scientific Applications (CIS 6612)
Presented by Javier Figueroa
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
Performance/Profiling
2
Presentation Outline
Performance
Benchmarking
Parallel Performance
Profiling
Dimemas
Paraver
Acknowledgements
3
Why Performance? To do a time-consuming operation in less time
I am an aircraft engineer I need to run a simulation to test the stability of the wings
at high speed I’d rather have the result in 5 minutes than in 5 hours so
that I can complete the aircraft final design sooner. To do an operation before a tighter deadline
I am a weather prediction agency I am getting input from weather stations/sensors I’d like to make the forecast for tomorrow before
tomorrow
4
Why Performance?
To do a high number of operations per second
I am the CTO of Amazon.com
My Web server gets 1,000 hits per second
I’d like my Web server and my databases to handle 1,000 transactions per second so that customers do not experience delays
Also called scalability
Amazon does “process” several GBytes of data per second
5
Performance - Software “High Performance Code” Performance concerns
Correctness You will see that when trying to make code go fast one often
breaks it Readability
Fast code typically requires more lines! Modularity can hurt performance
e.g., Too many classes Portability
Code that is fast on machine A can be slow on machine B At the extreme, highly optimized code is not portable at all, and in
fact is done in hardware!
6
Performance - Time Time between the start and the end of an
operation Also called running time, elapsed time, wall-clock
time, response time, latency, execution time, ... Most straightforward measure: “my program
takes 12.5s on a Pentium 3.5GHz” Can be normalized to some reference time
Must be measured on a “dedicated” machine
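Measuring wall-clock time can be sketched as follows (here in Python, using time.perf_counter; the work() function is a made-up stand-in for the program being timed):

```python
import time

def work():
    # Stand-in for the operation being timed.
    return sum(range(1_000_000))

start = time.perf_counter()   # high-resolution wall-clock timer
work()
elapsed = time.perf_counter() - start
print(f"my program takes {elapsed:.3f}s")
```

On a loaded (non-dedicated) machine the same measurement can vary considerably from run to run, which is why the slide insists on a “dedicated” machine.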
7
Performance - Rate
Used often so that performance can be independent of the “size” of the application
e.g., compressing a 1MB file takes 1 minute; compressing a 2MB file takes 2 minutes. The performance is the same.
Millions of instructions / sec (MIPS)
MIPS = instruction count / (execution time * 10^6) = clock rate / (CPI * 10^6)
But Instruction Set Architectures are not equivalent
1 CISC instruction = many RISC instructions
Programs use different instruction mixes
May be ok for same program on same architectures
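As a quick sanity check, the two MIPS expressions above agree; the numbers below are purely illustrative, not from any real machine:

```python
# Illustrative values, not from a real machine.
clock_rate = 1e9        # 1 GHz, cycles per second
cpi = 2.0               # average cycles per instruction
execution_time = 10.0   # seconds

# By definition of CPI: instructions executed = cycles / CPI.
instruction_count = clock_rate * execution_time / cpi

mips_from_count = instruction_count / (execution_time * 1e6)
mips_from_clock = clock_rate / (cpi * 1e6)
print(mips_from_count, mips_from_clock)  # 500.0 500.0
```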
8
Performance – Rate cont.
Millions of floating point operations / sec (MFlops)
Very popular, but often misleading
e.g., a high MFlops rate in a badly designed algorithm can still come with poor application performance
Application-specific:
Millions of frames rendered per second
Millions of amino-acids compared per second
Millions of HTTP requests served per second
Application-specific metrics are often preferable, and the others may be misleading
MFlops can be application-specific though
For instance:
I want to add two n-element vectors
This requires 2*n Floating Point Operations
Therefore MFlops is a good measure
9
What is “Peak” Performance?
Resource vendors always talk about the peak performance rate, computed based on specifications of the machine
For instance:
I build a machine with 2 floating point units
Each unit can do an operation in 2 cycles
My CPU is at 1GHz
Therefore I have a 1*2/2 = 1 GFlops machine
Problem: in real code you will never be able to use the two floating point units constantly
Data needs to come from memory and causes the floating point units to be idle
Typically, real code achieves only a (often small) fraction of the peak performance
10
Performance - Benchmarks Since many performance metrics turn out to be
misleading, people have designed benchmarks Example: SPEC Benchmark
Integer benchmark Floating point benchmark
These benchmarks are typically a collection of several codes that come from “real-world software”
The question “what is a good benchmark?” is difficult If the benchmarks do not correspond to what you’ll do
with your computer, then the benchmark results are not relevant to you
11
Performance – A Program Concerned with improving a program’s performance
For a given platform, take a given program
Run it and measure its wall-clock time
Enhance it, run it again and quantify the performance improvement, i.e., the reduction in wall-clock time
For each version compute its performance preferably as a relevant performance rate so that you can say: the best implementation we have so far goes
“this fast” (perhaps a % of the peak performance)
12
Performance – Program Speedup We need a metric to quantify the impact of
your performance enhancement Speedup: ratio of “old” time to “new” time
old time = 2h, new time = 1h, speedup = 2h / 1h = 2
Sometimes one talks about a “slowdown” in case the “enhancement” is not beneficial Happens more often than one thinks
13
Parallel Performance The notion of speedup is completely generic
By using a rice cooker I’ve achieved a 1.20 speedup for rice cooking
For parallel programs one defines the Parallel Speedup (we’ll just say “speedup”):
Parallel program takes time T1 on 1 processor
Parallel program takes time Tp on p processors
Parallel Speedup(p) = T1 / Tp
In the ideal case, if my sequential program takes 2
hours on 1 processor, it will take 1 hour on 2
processors: called linear speedup
14
Speedup
[Figure: speedup vs. number of processors, showing linear, sub-linear, and superlinear speedup curves]
15
Superlinear Speedup?
There are several possible causes Algorithm
e.g., with optimization problems, throwing many processors at it increases the chances that one will “get lucky” and find the optimum fast
Hardware e.g., with many processors, it is possible that the
entire application data resides in cache (vs. RAM) or in RAM (vs. Disk)
16
Amdahl’s Law
Consider a program whose execution consists of two phases: one sequential phase, and one phase that can be perfectly parallelized (linear speedup)
Sequential program: T1 = time spent in the phase that cannot be parallelized; T2 = time spent in the phase that can be parallelized
Old time: T = T1 + T2
Parallel program: T1’ = T1 (the sequential phase is unchanged); T2’ = time spent in the parallelized phase, with T2’ <= T2
New time: T’ = T1’ + T2’
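A minimal sketch of this two-phase model, with T2’ = T2 / p for the perfectly parallelized phase (the phase times below are hypothetical):

```python
def parallel_time(t1, t2, p):
    # T' = T1' + T2', with T1' = T1 and T2' = T2 / p (perfect parallelization).
    return t1 + t2 / p

t1, t2 = 1.0, 9.0   # hypothetical: 1h sequential, 9h parallelizable
for p in (1, 2, 10, 1000):
    t_new = parallel_time(t1, t2, p)
    speedup = (t1 + t2) / t_new
    print(f"p={p:5d}  T'={t_new:6.3f}h  speedup={speedup:.2f}")
```

No matter how large p gets, the speedup never exceeds (T1 + T2) / T1 = 10 here: the sequential phase dominates.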
17
Amdahl’s Law cont.
Consider four parts with P1 = 11%, P2 = 18%, P3 = 23%, P4 = 48%, which add up to 100%
P1 is not sped up, so S1 = 1 or 100%; P2 is sped up 5x, so S2 = 500%; P3 is sped up 20x, so S3 = 2000%; and P4 is sped up 1.6x, so S4 = 160%
Using P1/S1 + P2/S2 + P3/S3 + P4/S4, we find the new running time is 0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575, a little less than ½ the original running time, which we know is 1
The overall speed boost is 1/0.4575 = 2.186, a little more than double the original speed, using the formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)^−1
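The worked example above can be checked directly:

```python
parts    = [0.11, 0.18, 0.23, 0.48]   # P1..P4, fractions of the original runtime
speedups = [1.0, 5.0, 20.0, 1.6]      # S1..S4

new_time = sum(p / s for p, s in zip(parts, speedups))
overall_speedup = 1.0 / new_time
print(round(new_time, 4), round(overall_speedup, 3))  # 0.4575 2.186
```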
18
Amdahl’s Law - Parallelization
If P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is
S(N) = 1 / ((1 − P) + P/N)
19
Parallel Efficiency Definition: Eff(p) = S(p) / p Typically < 1, unless linear or superlinear speedup Used to measure how well the processors are
utilized If increasing the number of processors by a factor 10
increases the speedup by a factor 2, perhaps it’s not worth it: efficiency drops by a factor 5
Important when purchasing a parallel machine for instance: if due to the application’s behavior efficiency is low, forget buying a large cluster
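The efficiency bookkeeping from the 10x-processors/2x-speedup example above, sketched with made-up numbers:

```python
def efficiency(speedup, p):
    # Eff(p) = S(p) / p
    return speedup / p

p, s = 4, 3.0                          # hypothetical: speedup 3 on 4 processors
eff_small = efficiency(s, p)           # 0.75
eff_large = efficiency(2 * s, 10 * p)  # 10x processors, only 2x speedup -> 0.15
print(eff_small, eff_large)            # efficiency drops by a factor 5
```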
20
Scalability Measure of the “effort” needed to maintain
efficiency while adding processors For a given problem size, plot Efd(p) for increasing
values of p It should stay close to a flat line
Isoefficiency: At which rate does the problem size need to be increase to maintain efficiency By making a problem ridiculously large, one can typically
achieve good efficiency Problem: is it how the machine/code will be used?
21
Performance Measurement how does one measure the performance of a
program in practice? Two issues:
Measuring wall-clock times We’ll see how it can be done shortly
Measuring performance rates
Measure wall-clock time (see above)
“Count” the number of “operations” (frames, flops, amino-acids: whatever makes sense for the application)
Either by actively counting (count++)
Or by looking at the code and figuring out how many operations are performed
Divide the count by the wall-clock time
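The count-and-divide recipe, sketched with an active counter (the comparison function is a made-up stand-in for whatever the application’s “operation” is):

```python
import time

count = 0   # the "count++" of the slide

def compare(a, b):
    # Stand-in "operation" (e.g., one amino-acid comparison).
    global count
    count += 1
    return a == b

start = time.perf_counter()
for i in range(100_000):
    compare(i, i + 1)
elapsed = time.perf_counter() - start

rate = count / elapsed   # operations per second
print(f"{count} comparisons in {elapsed:.4f}s -> {rate:,.0f} ops/sec")
```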
22
Why is Performance Poor? Performance is poor because the code
suffers from a performance bottleneck Definition
An application runs on a platform that has many components
CPU, Memory, Operating System, Network, Hard Drive, Video Card, etc.
Pick a component and make it faster If the application performance increases, that
component was the bottleneck!
23
Removing a Bottleneck Hardware Upgrade
Is sometimes necessary But can only get you so far and may be very
costly e.g., memory technology
Code Modification
The bottleneck is there because the code uses a “resource” heavily or in a non-intelligent manner
We will learn techniques to alleviate bottlenecks
at the software level
24
Identifying a Bottleneck It can be difficult
You’re not going to change the memory bus just to see what happens to the application
But you can run the code on a different machine and see what happens
One Approach:
Know/discover the characteristics of the machine
Observe the application execution on the machine
Tinker with the code
Run the application again
Repeat
Reason about what the bottleneck is
25
Memory Latency and Bandwidth
The performance of memory is typically defined by Latency and Bandwidth (or Rate)
Latency: time to read one byte from memory measured in nanoseconds these days
Bandwidth: how many bytes can be read per second measured in GB/sec
Note that you don’t have bandwidth = 1 / latency! Reading 2 bytes in sequence may be cheaper than
reading one byte only Let’s see why...
26
Latency and Bandwidth
Latency = time for one “data item” to go from memory to the CPU
In-flight = number of data items that can be “in flight” on the memory bus
Bandwidth = Capacity / Latency Maximum number of items I get in a latency period
Pipelining: initial delay for getting the first data item = latency if I ask for enough data items (> in-flight), after latency seconds, I receive
data items at rate bandwidth Just like networks, hardware pipelines, etc.
CPU
“memory bus”
Memory
27
Latency and Bandwidth Why is memory bandwidth important?
Because it gives us an upper bound on how fast one could feed data to the CPU
Why is memory latency important? Because typically one cannot feed the CPU data
constantly and one gets impacted by the latency Latency numbers must be put in perspective with
the clock rate, i.e., measured in cycles
28
Memory Bottleneck: Example Fragment of code: a[i] = b[j] + c[k]
Three memory references: 2 reads, 1 write One addition: can be done in one cycle
If the memory bandwidth is 12.8GB/sec, then the rate at which the processor can access integers (4 bytes) is: 12.8*1024*1024*1024 / 4 = 3.4GHz
The above code needs to access 3 integers Therefore, the rate at which the code gets its data is ~
1.1GHz But the CPU could perform additions at 4GHz! Therefore: The memory is the bottleneck
And we assumed memory worked at the peak!!! We ignored other possible overheads on the bus In practice the gap can be around a factor 15 or higher
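The arithmetic of this example can be verified directly (12.8 GB/sec is taken as 12.8 * 1024^3 bytes/sec, as on the slide):

```python
bandwidth = 12.8 * 1024**3          # bytes per second
int_rate = bandwidth / 4            # 4-byte integers per second -> ~3.4e9
per_statement_rate = int_rate / 3   # a[i] = b[j] + c[k] needs 3 accesses -> ~1.1e9
print(f"{int_rate / 1e9:.2f} GHz, {per_statement_rate / 1e9:.2f} GHz")
```

Since the CPU could issue additions faster than ~1.1 GHz, memory is the bottleneck even at peak bandwidth.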
29
A realistic approach: Profiling A profiler is a tool that monitors the execution of a program
and reports the computational characteristics used in different parts of the code i.e. dynamic program analysis
Useful to identify “expensive” functions
Profiling cycle:
Compile the code with the profiler
Run the code
Identify the most expensive function
Optimize that function: call it less often if possible, make it faster, etc.
Repeat
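One pass of this cycle can be tried with Python’s built-in cProfile module (the cheap/expensive functions are made-up stand-ins; in a real run you would profile your own code):

```python
import cProfile
import io
import pstats

def cheap():
    return sum(range(1_000))

def expensive():
    # Deliberately heavier, so the profiler should rank it near the top.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()          # "run the code with the profiler"
for _ in range(5):
    cheap()
expensive()
profiler.disable()

# "Identify the most expensive function": sort by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)
```

The remaining steps of the cycle (optimize the top function, re-run, repeat) happen outside the tool.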
30
Profiler Types based on Output
Flat profiler
Flat profilers compute the average call times from the calls, and do not break down the call times based on the callee or the context.
Call-Graph profiler
Call-graph profilers show the call times and frequencies of the functions, and also the call chains involved, based on the callee. However, context is not preserved.
31
Methods of data gathering Event based profilers
The programming languages listed here all have event-based profilers:
Java: JVM Profiler Interface (JVMPI)
The JVM API provides hooks to the profiler for trapping events like calls, class load/unload, and thread enter/leave.
Python: Python's profilers are the profile module and hotshot, which are call-graph based, and use 'sys.setprofile()' to trap events like c_{call,return,exception} and python_{call,return,exception}.
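Python's hook can be tried directly (note the spelling, sys.setprofile); it delivers call/return and c_call/c_return events to a callback. The target function here is a made-up example:

```python
import sys

events = []

def hook(frame, event, arg):
    # Record each profiling event with the function name it occurred in.
    events.append((event, frame.f_code.co_name))

def target():
    return len("hello")   # len() triggers c_call/c_return events

sys.setprofile(hook)      # install the event hook
target()
sys.setprofile(None)      # uninstall it

print([(e, name) for e, name in events if name == "target"])
```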
32
Methods of data gathering Statistical profilers
Some profilers operate by sampling; others instrument the target program with additional instructions to collect the required information
The resulting data are not exact, but a statistical approximation.
Some of the most commonly used statistical profilers are GNU's gprof, OProfile, and SGI's Pixie.
33
Methods of data gathering Instrumentation
Manual: Done by the programmer, e.g. by adding instructions to explicitly calculate runtimes.
Compiler assisted: Example: "gcc -pg ..." for gprof, "quantify g++ ..." for Quantify
Binary translation: The tool adds instrumentation to a compiled binary. Example: ATOM
Runtime instrumentation: Directly before execution the code is instrumented. The program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind
Runtime injection: More lightweight than runtime instrumentation. Code is modified at runtime to have jumps to helper functions. Example: DynInst
Hypervisor: Data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON
Simulator: Data are collected by running under an Instruction Set Simulator. Example: SIMMON
34
UPC – BSC Profiling Tools
Dimemas Simulation tool for the parametric analysis of the
behavior of message-passing applications on a configurable parallel platform.
Paraver A very powerful performance visualization and
analysis tool based on traces that can be used to analyze any information that is expressed on its input trace format.
35
Dimemas
Utilization: development, tuning, benchmarking, training
Features: predict behavior of target architecture; forecast effects of code optimization; what-if analysis
36
Dimemas cont.
Integrated with visualization tool Extensive application characterization Squeezing the information that a single
instrumented run can provide
Usage
37
Dimemas cont.
enables the user to develop and tune parallel applications on a workstation
providing an accurate prediction of their performance on the parallel target machine
The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modeled by a set of performance parameters
38
Dimemas – DiP Environment
39
Dimemas - Benefits
the possibility of not using a parallel machine to run the application to get the traces
the possibility to analyze different application choices without changing or rerunning the application itself
40
Dimemas – Benefits cont.
The loop involving Dimemas and Paraver, the simulator and the visualization tool: the user has the chance to analyze the application behavior when some parameters are changed. For example, what will happen to the application execution time if the execution time of a given function is reduced by 50%?
41
Dimemas - Prediction
A tracefile can be either obtained in a dedicated parallel machine or in a shared sequential machine.
Using the instrumentation records, Dimemas
will rebuild the application behavior, using the tracefile and the architectural parameters defined using a GUI. Per process CPU time is used as opposed to elapsed time.
42
Dimemas – Prediction Advantages and Disadvantages
Cost reduction
Analysis of non-existent machines
Quality of performance prediction
Performance
43
Dimemas – Output Information
Basic results: execution time and speedup
For each task:
computation time
blocking time
information related to messages: number of sends, number of receives (divided into receives that produce a block because the message is not local to the task, and messages that produce no block)
44
Dimemas – Output Information cont.
Basic results
45
Dimemas – Output Information cont.
Application Analysis
Is my application well balanced? Are there any dependencies?
Can we improve the application by grouping messages?
Does latency influence application time? Are we planning to run our application in a low bandwidth network?
Does bandwidth have an influence on application time?
Do resources limit application time?
46
Dimemas – Output Information cont.
Additional Analysis
What-if analysis
Critical path
Synthetic perturbation analysis
47
Paraver
Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI OpenMP MPI+OpenMP Java Hardware counters profile Operating system activity... and many other
things you may think of
48
Paraver – DiP Environment
49
Paraver – Features
Motif based GUI Detailed quantitative analysis of program
performance Concurrent comparative analysis of several
traces Fast analysis of very large traces
50
Paraver – Features cont.
Support for mixed message passing and shared memory (networks of SMPs)
Customizable semantics of the visualized information Cooperative work, sharing views of the tracefile
Building of derived metrics
51
Paraver – Views
Graphical view Textual view
Analysis view
52
Paraver – Views cont.
Paraver displayed information consists of three elements: a time dependent value (called semantic
value) for each represented object flags that correspond to punctual events
within a represented object communication lines that relate the
displayed objects.
53
Paraver – Expressive Power
The separation between the visualization module (how to display) and the semantic module (value to display) offers flexibility and expressive power
The filtering module sits in front of the semantic one
54
Paraver – Quantitative Analysis
1D analysis Number of messages sent
in the selected interval Average CPU utilization Number of events of a given
type on each processor Average CPU time between
two communications.
55
Paraver – Quantitative Analysis
2D analysis Number of bytes sent between each pair of
tasks Average number of L2 cache misses per thread
on each parallel region Number of TLB misses per thread on each
application routine Average number of IPC (instructions per cycle)
per thread on each iteration
56
Paraver – Quantitative Analysis
1D analysis
2D analysis
57
Paraver – Other functionality
Multiple traces Large traces Cooperative analysis Derived Metrics Batch processing
58
Acknowledgments
GCB FIU SCIS NSF UPC – BSC Dr. Masoud Sadjadi Javier Muñoz and Diego Lopez