TRANSCRIPT
1
Special Topics in Grid Enablement of Scientific Applications (CIS 6612)
Presented by Javier Figueroa
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
Performance/Profiling
2
Presentation Outline
Performance
Benchmarking
Parallel Performance
Profiling
Dimemas
Paraver
Acknowledgements
3
Why Performance? To do a time-consuming operation in less time
I am an aircraft engineer I need to run a simulation to test the stability of the wings
at high speed I’d rather have the result in 5 minutes than in 5 hours so
that I can complete the aircraft final design sooner. To do an operation before a tighter deadline
I am a weather prediction agency I am getting input from weather stations/sensors I’d like to make the forecast for tomorrow before
tomorrow
4
Why Performance?
To do a high number of operations per second
I am the CTO of Amazon.com
My Web server gets 1,000 hits per second
I’d like my Web server and my databases to handle 1,000 transactions per second so that customers do not experience delays
Also called scalability
Amazon does “process” several GBytes of data per second
5
Performance - Software “High Performance Code” Performance concerns
Correctness You will see that when trying to make code go fast one often
breaks it Readability
Fast code typically requires more lines! Modularity can hurt performance
e.g., Too many classes Portability
Code that is fast on machine A can be slow on machine B At the extreme, highly optimized code is not portable at all, and in
fact is done in hardware!
6
Performance - Time Time between the start and the end of an
operation Also called running time, elapsed time, wall-clock
time, response time, latency, execution time, ... Most straightforward measure: “my program
takes 12.5s on a Pentium 3.5GHz” Can be normalized to some reference time
Must be measured on a “dedicated” machine
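Measuring wall-clock time can be sketched as follows (here in Python, using time.perf_counter; the work() function is a made-up stand-in for the program being timed):

```python
import time

def work():
    # Stand-in for the operation being timed.
    return sum(range(1_000_000))

start = time.perf_counter()   # high-resolution wall-clock timer
work()
elapsed = time.perf_counter() - start
print(f"my program takes {elapsed:.3f}s")
```

On a loaded (non-dedicated) machine the same measurement can vary considerably from run to run, which is why the slide insists on a “dedicated” machine.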
7
Performance - Rate
Used often so that performance can be independent of the “size” of the application
e.g., compressing a 1MB file takes 1 minute; compressing a 2MB file takes 2 minutes. The performance is the same.
Millions of instructions / sec (MIPS)
MIPS = instruction count / (execution time * 10^6) = clock rate / (CPI * 10^6)
But Instruction Set Architectures are not equivalent
1 CISC instruction = many RISC instructions
Programs use different instruction mixes
May be ok for same program on same architectures
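As a quick sanity check, the two MIPS expressions above agree; the numbers below are purely illustrative, not from any real machine:

```python
# Illustrative values, not from a real machine.
clock_rate = 1e9        # 1 GHz, cycles per second
cpi = 2.0               # average cycles per instruction
execution_time = 10.0   # seconds

# By definition of CPI: instructions executed = cycles / CPI.
instruction_count = clock_rate * execution_time / cpi

mips_from_count = instruction_count / (execution_time * 1e6)
mips_from_clock = clock_rate / (cpi * 1e6)
print(mips_from_count, mips_from_clock)  # 500.0 500.0
```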
8
Performance – Rate cont.
Millions of floating point operations / sec (MFlops)
Very popular, but often misleading
e.g., a high MFlops rate in a badly designed algorithm can still come with poor application performance
Application-specific:
Millions of frames rendered per second
Millions of amino-acids compared per second
Millions of HTTP requests served per second
Application-specific metrics are often preferable, and the others may be misleading
MFlops can be application-specific though
For instance:
I want to add two n-element vectors
This requires 2*n Floating Point Operations
Therefore MFlops is a good measure
9
What is “Peak” Performance?
Resource vendors always talk about the peak performance rate, computed based on specifications of the machine
For instance:
I build a machine with 2 floating point units
Each unit can do an operation in 2 cycles
My CPU is at 1GHz
Therefore I have a 1*2/2 = 1 GFlops machine
Problem: in real code you will never be able to use the two floating point units constantly
Data needs to come from memory and causes the floating point units to be idle
Typically, real code achieves only a (often small) fraction of the peak performance
10
Performance - Benchmarks Since many performance metrics turn out to be
misleading, people have designed benchmarks Example: SPEC Benchmark
Integer benchmark Floating point benchmark
These benchmarks are typically a collection of several codes that come from “real-world software”
The question “what is a good benchmark?” is difficult If the benchmarks do not correspond to what you’ll do
with your computer, then the benchmark results are not relevant to you
11
Performance – A Program Concerned with improving a program’s performance
For a given platform, take a given program
Run it and measure its wall-clock time
Enhance it, run it again and quantify the performance improvement, i.e., the reduction in wall-clock time
For each version compute its performance preferably as a relevant performance rate so that you can say: the best implementation we have so far goes
“this fast” (perhaps a % of the peak performance)
12
Performance – Program Speedup We need a metric to quantify the impact of
your performance enhancement Speedup: ratio of “old” time to “new” time
old time = 2h, new time = 1h, speedup = 2h / 1h = 2
Sometimes one talks about a “slowdown” in case the “enhancement” is not beneficial Happens more often than one thinks
13
Parallel Performance The notion of speedup is completely generic
By using a rice cooker I’ve achieved a 1.20 speedup for rice cooking
For parallel programs one defines the Parallel Speedup (we’ll just say “speedup”):
Parallel program takes time T1 on 1 processor
Parallel program takes time Tp on p processors
Parallel Speedup(p) = T1 / Tp
In the ideal case, if my sequential program takes 2
hours on 1 processor, it will take 1 hour on 2
processors: called linear speedup
14
Speedup
[Figure: speedup vs. number of processors, showing linear, sub-linear, and superlinear speedup curves]
15
Superlinear Speedup?
There are several possible causes Algorithm
e.g., with optimization problems, throwing many processors at it increases the chances that one will “get lucky” and find the optimum fast
Hardware e.g., with many processors, it is possible that the
entire application data resides in cache (vs. RAM) or in RAM (vs. Disk)
16
Amdahl’s Law
Consider a program whose execution consists of two phases: one sequential phase, and one phase that can be perfectly parallelized (linear speedup)
Sequential program: T1 = time spent in the phase that cannot be parallelized; T2 = time spent in the phase that can be parallelized
Old time: T = T1 + T2
Parallel program: T1’ = T1 (the sequential phase is unchanged); T2’ = time spent in the parallelized phase, with T2’ <= T2
New time: T’ = T1’ + T2’
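A minimal sketch of this two-phase model, with T2’ = T2 / p for the perfectly parallelized phase (the phase times below are hypothetical):

```python
def parallel_time(t1, t2, p):
    # T' = T1' + T2', with T1' = T1 and T2' = T2 / p (perfect parallelization).
    return t1 + t2 / p

t1, t2 = 1.0, 9.0   # hypothetical: 1h sequential, 9h parallelizable
for p in (1, 2, 10, 1000):
    t_new = parallel_time(t1, t2, p)
    speedup = (t1 + t2) / t_new
    print(f"p={p:5d}  T'={t_new:6.3f}h  speedup={speedup:.2f}")
```

No matter how large p gets, the speedup never exceeds (T1 + T2) / T1 = 10 here: the sequential phase dominates.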
17
Amdahl’s Law cont.
Consider four parts with P1 = 11%, P2 = 18%, P3 = 23%, P4 = 48%, which add up to 100%
P1 is not sped up, so S1 = 1 or 100%; P2 is sped up 5x, so S2 = 500%; P3 is sped up 20x, so S3 = 2000%; and P4 is sped up 1.6x, so S4 = 160%
Using P1/S1 + P2/S2 + P3/S3 + P4/S4, we find the new running time is 0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575, a little less than ½ the original running time, which we know is 1
The overall speed boost is 1/0.4575 = 2.186, a little more than double the original speed, using the formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)^−1
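The worked example above can be checked directly:

```python
parts    = [0.11, 0.18, 0.23, 0.48]   # P1..P4, fractions of the original runtime
speedups = [1.0, 5.0, 20.0, 1.6]      # S1..S4

new_time = sum(p / s for p, s in zip(parts, speedups))
overall_speedup = 1.0 / new_time
print(round(new_time, 4), round(overall_speedup, 3))  # 0.4575 2.186
```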
18
Amdahl’s Law - Parallelization
If P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is
S(N) = 1 / ((1 − P) + P/N)
19
Parallel Efficiency Definition: Eff(p) = S(p) / p Typically < 1, unless linear or superlinear speedup Used to measure how well the processors are
utilized If increasing the number of processors by a factor 10
increases the speedup by a factor 2, perhaps it’s not worth it: efficiency drops by a factor 5
Important when purchasing a parallel machine for instance: if due to the application’s behavior efficiency is low, forget buying a large cluster
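The efficiency bookkeeping from the 10x-processors/2x-speedup example above, sketched with made-up numbers:

```python
def efficiency(speedup, p):
    # Eff(p) = S(p) / p
    return speedup / p

p, s = 4, 3.0                          # hypothetical: speedup 3 on 4 processors
eff_small = efficiency(s, p)           # 0.75
eff_large = efficiency(2 * s, 10 * p)  # 10x processors, only 2x speedup -> 0.15
print(eff_small, eff_large)            # efficiency drops by a factor 5
```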
20
Scalability Measure of the “effort” needed to maintain
efficiency while adding processors For a given problem size, plot Efd(p) for increasing
values of p It should stay close to a flat line
Isoefficiency: At which rate does the problem size need to be increase to maintain efficiency By making a problem ridiculously large, one can typically
achieve good efficiency Problem: is it how the machine/code will be used?
21
Performance Measurement how does one measure the performance of a
program in practice? Two issues:
Measuring wall-clock times We’ll see how it can be done shortly
Measuring performance rates
Measure wall-clock time (see above)
“Count” the number of “operations” (frames, flops, amino-acids: whatever makes sense for the application)
Either by actively counting (count++)
Or by looking at the code and figuring out how many operations are performed
Divide the count by the wall-clock time
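The count-and-divide recipe, sketched with an active counter (the comparison function is a made-up stand-in for whatever the application’s “operation” is):

```python
import time

count = 0   # the "count++" of the slide

def compare(a, b):
    # Stand-in "operation" (e.g., one amino-acid comparison).
    global count
    count += 1
    return a == b

start = time.perf_counter()
for i in range(100_000):
    compare(i, i + 1)
elapsed = time.perf_counter() - start

rate = count / elapsed   # operations per second
print(f"{count} comparisons in {elapsed:.4f}s -> {rate:,.0f} ops/sec")
```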
22
Why is Performance Poor? Performance is poor because the code
suffers from a performance bottleneck Definition
An application runs on a platform that has many components
CPU, Memory, Operating System, Network, Hard Drive, Video Card, etc.
Pick a component and make it faster If the application performance increases, that
component was the bottleneck!
23
Removing a Bottleneck Hardware Upgrade
Is sometimes necessary But can only get you so far and may be very
costly e.g., memory technology
Code Modification
The bottleneck is there because the code uses a “resource” heavily or in a non-intelligent manner
We will learn techniques to alleviate bottlenecks
at the software level
24
Identifying a Bottleneck It can be difficult
You’re not going to change the memory bus just to see what happens to the application
But you can run the code on a different machine and see what happens
One Approach:
Know/discover the characteristics of the machine
Observe the application execution on the machine
Tinker with the code
Run the application again
Repeat
Reason about what the bottleneck is
25
Memory Latency and Bandwidth
The performance of memory is typically defined by Latency and Bandwidth (or Rate)
Latency: time to read one byte from memory measured in nanoseconds these days
Bandwidth: how many bytes can be read per second measured in GB/sec
Note that you don’t have bandwidth = 1 / latency! Reading 2 bytes in sequence may be cheaper than
reading one byte only Let’s see why...
26
Latency and Bandwidth
Latency = time for one “data item” to go from memory to the CPU
In-flight = number of data items that can be “in flight” on the memory bus
Bandwidth = Capacity / Latency Maximum number of items I get in a latency period
Pipelining: initial delay for getting the first data item = latency if I ask for enough data items (> in-flight), after latency seconds, I receive
data items at rate bandwidth Just like networks, hardware pipelines, etc.
CPU
“memory bus”
Memory
27
Latency and Bandwidth Why is memory bandwidth important?
Because it gives us an upper bound on how fast one could feed data to the CPU
Why is memory latency important? Because typically one cannot feed the CPU data
constantly and one gets impacted by the latency Latency numbers must be put in perspective with
the clock rate, i.e., measured in cycles
28
Memory Bottleneck: Example Fragment of code: a[i] = b[j] + c[k]
Three memory references: 2 reads, 1 write One addition: can be done in one cycle
If the memory bandwidth is 12.8GB/sec, then the rate at which the processor can access integers (4 bytes) is: 12.8*1024*1024*1024 / 4 = 3.4GHz
The above code needs to access 3 integers Therefore, the rate at which the code gets its data is ~
1.1GHz But the CPU could perform additions at 4GHz! Therefore: The memory is the bottleneck
And we assumed memory worked at the peak!!! We ignored other possible overheads on the bus In practice the gap can be around a factor 15 or higher
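The arithmetic of this example can be verified directly (12.8 GB/sec is taken as 12.8 * 1024^3 bytes/sec, as on the slide):

```python
bandwidth = 12.8 * 1024**3          # bytes per second
int_rate = bandwidth / 4            # 4-byte integers per second -> ~3.4e9
per_statement_rate = int_rate / 3   # a[i] = b[j] + c[k] needs 3 accesses -> ~1.1e9
print(f"{int_rate / 1e9:.2f} GHz, {per_statement_rate / 1e9:.2f} GHz")
```

Since the CPU could issue additions faster than ~1.1 GHz, memory is the bottleneck even at peak bandwidth.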
29
A realistic approach: Profiling A profiler is a tool that monitors the execution of a program
and reports the computational characteristics used in different parts of the code i.e. dynamic program analysis
Useful to identify “expensive” functions
Profiling cycle:
Compile the code with the profiler
Run the code
Identify the most expensive function
Optimize that function: call it less often if possible, make it faster, etc.
Repeat
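One pass of this cycle can be tried with Python’s built-in cProfile module (the cheap/expensive functions are made-up stand-ins; in a real run you would profile your own code):

```python
import cProfile
import io
import pstats

def cheap():
    return sum(range(1_000))

def expensive():
    # Deliberately heavier, so the profiler should rank it near the top.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()          # "run the code with the profiler"
for _ in range(5):
    cheap()
expensive()
profiler.disable()

# "Identify the most expensive function": sort by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)
```

The remaining steps of the cycle (optimize the top function, re-run, repeat) happen outside the tool.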
30
Profiler Types based on Output
Flat profiler
Flat profilers compute the average call times from the calls, and do not break down the call times based on the callee or the context.
Call-Graph profiler
Call-graph profilers show the call times and frequencies of the functions, and also the call chains involved, based on the callee. However, context is not preserved.
31
Methods of data gathering Event based profilers
The programming languages listed here all have event-based profilers:
Java: JVM Profiler Interface (JVMPI)
The JVM API provides hooks to the profiler for trapping events like calls, class load/unload, and thread enter/leave.
Python: Python's profilers are the profile module and hotshot, which are call-graph based, and use 'sys.setprofile()' to trap events like c_{call,return,exception} and python_{call,return,exception}.
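Python's hook can be tried directly (note the spelling, sys.setprofile); it delivers call/return and c_call/c_return events to a callback. The target function here is a made-up example:

```python
import sys

events = []

def hook(frame, event, arg):
    # Record each profiling event with the function name it occurred in.
    events.append((event, frame.f_code.co_name))

def target():
    return len("hello")   # len() triggers c_call/c_return events

sys.setprofile(hook)      # install the event hook
target()
sys.setprofile(None)      # uninstall it

print([(e, name) for e, name in events if name == "target"])
```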
32
Methods of data gathering Statistical profilers
Some profilers operate by sampling; others instrument the target program with additional instructions to collect the required information
The resulting data are not exact, but a statistical approximation.
Some of the most commonly used statistical profilers are GNU's gprof, OProfile, and SGI's Pixie.
33
Methods of data gathering Instrumentation
Manual: Done by the programmer, e.g. by adding instructions to explicitly calculate runtimes.
Compiler assisted: Example: "gcc -pg ..." for gprof, "quantify g++ ..." for Quantify
Binary translation: The tool adds instrumentation to a compiled binary. Example: ATOM
Runtime instrumentation: Directly before execution the code is instrumented. The program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind
Runtime injection: More lightweight than runtime instrumentation. Code is modified at runtime to have jumps to helper functions. Example: DynInst
Hypervisor: Data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON
Simulator: Data are collected by running under an Instruction Set Simulator. Example: SIMMON
34
UPC – BSC Profiling Tools
Dimemas Simulation tool for the parametric analysis of the
behavior of message-passing applications on a configurable parallel platform.
Paraver A very powerful performance visualization and
analysis tool based on traces that can be used to analyze any information that is expressed on its input trace format.
35
Dimemas
Utilization: development, tuning, benchmarking, training
Features: predict behavior of target architecture; forecast effects of code optimization; what-if analysis
36
Dimemas cont.
Integrated with visualization tool Extensive application characterization Squeezing the information that a single
instrumented run can provide
Usage
37
Dimemas cont.
enables the user to develop and tune parallel applications on a workstation
providing an accurate prediction of their performance on the parallel target machine
The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modeled by a set of performance parameters
38
Dimemas – DiP Environment
39
Dimemas - Benefits
the possibility of not using a parallel machine to run the application to get the traces
the possibility to analyze different application choices without changing or rerunning the application itself
40
Dimemas – Benefits cont.
The loop involving Dimemas and Paraver, the simulator and the visualization tool: the user has the chance to analyze the application behavior when some parameters are changed. For example, what will happen to the application execution time if the execution time of a given function is reduced by 50%?
41
Dimemas - Prediction
A tracefile can be either obtained in a dedicated parallel machine or in a shared sequential machine.
Using the instrumentation records, Dimemas
will rebuild the application behavior, using the tracefile and the architectural parameters defined using a GUI. Per process CPU time is used as opposed to elapsed time.
42
Dimemas – Prediction Advantages and Disadvantages
Cost reduction
Analysis of non-existent machines
Quality of performance prediction
Performance
43
Dimemas – Output Information
Basic results: execution time and speedup
For each task:
computation time
blocking time
information related to messages: number of sends, number of receives (divided into receives that produce a block because the message is not local to the task, and messages that produce no block)
44
Dimemas – Output Information cont.
Basic results
45
Dimemas – Output Information cont.
Application Analysis
Is my application well balanced? Are there any dependencies?
Can we improve the application by grouping messages?
Does latency influence application time? Are we planning to run our application in a low bandwidth network?
Does bandwidth have an influence on application time?
Do resources limit application time?
46
Dimemas – Output Information cont.
Additional Analysis
What-if analysis
Critical path
Synthetic perturbation analysis
47
Paraver
Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI OpenMP MPI+OpenMP Java Hardware counters profile Operating system activity... and many other
things you may think of
48
Paraver – DiP Environment
49
Paraver – Features
Motif based GUI Detailed quantitative analysis of program
performance Concurrent comparative analysis of several
traces Fast analysis of very large traces
50
Paraver – Features cont.
Support for mixed message passing and shared memory (networks of SMPs)
Customizable semantics of the visualized information Cooperative work, sharing views of the tracefile
Building of derived metrics
51
Paraver – Views
Graphical view Textual view
Analysis view
52
Paraver – Views cont.
Paraver displayed information consists of three elements: a time dependent value (called semantic
value) for each represented object flags that correspond to punctual events
within a represented object communication lines that relate the
displayed objects.
53
Paraver – Expressive Power
The separation between the visualization module (how to display) and the semantic module (value to display) offers flexibility and expressive power
The filtering module sits in front of the semantic one
54
Paraver – Quantitative Analysis
1D analysis Number of messages sent
in the selected interval Average CPU utilization Number of events of a given
type on each processor Average CPU time between
two communications.
55
Paraver – Quantitative Analysis
2D analysis Number of bytes sent between each pair of
tasks Average number of L2 cache misses per thread
on each parallel region Number of TLB misses per thread on each
application routine Average number of IPC (instructions per cycle)
per thread on each iteration
56
Paraver – Quantitative Analysis
1D analysis
2D analysis
57
Paraver – Other functionality
Multiple traces Large traces Cooperative analysis Derived Metrics Batch processing
58
Acknowledgments
GCB FIU SCIS NSF UPC – BSC Dr. Masoud Sadjadi Javier Muñoz and Diego Lopez