1 special topics in grid enablement of scientific applications (cis 6612) presented by javier...

58
1 Special Topics in Grid Enablement Special Topics in Grid Enablement of Scientific Applications (CIS of Scientific Applications (CIS 6612) 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi http://www.cs.fiu.edu/~sadjadi/Teaching/ sadjadi At cs Dot fiu Dot edu Performance/Profiling Performance/Profiling

Upload: harold-barton

Post on 11-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

1

Special Topics in Grid Enablement of Special Topics in Grid Enablement of Scientific Applications (CIS 6612)Scientific Applications (CIS 6612)

Presented by Javier Figueroa

Instructor: S. Masoud Sadjadihttp://www.cs.fiu.edu/~sadjadi/Teaching/

sadjadi At cs Dot fiu Dot edu

Performance/ProfilingPerformance/Profiling

Page 2: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

2

Presentation Outline

Performance Benchmarking Parallel Performance

Profiling Dimemas Paraver Acknowledgements

Page 3: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

3

Why Performance? To do a time-consuming operation in less time

I am an aircraft engineer I need to run a simulation to test the stability of the wings

at high speed I’d rather have the result in 5 minutes than in 5 hours so

that I can complete the aircraft final design sooner. To do an operation before a tighter deadline

I am a weather prediction agency I am getting input from weather stations/sensors I’d like to make the forecast for tomorrow before

tomorrow

Page 4: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

4

Why Performance? To do a high number of operations per

seconds I am the CTO of Amazon.com My Web server gets 1,000 hits per seconds I’d like my Web server and my databases to

handle 1,000 transactions per seconds so that customers do not experience delays

Also called scalability Amazon does “process” several GBytes of data

per seconds

Page 5: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

5

Performance - Software “High Performance Code” Performance concerns

Correctness You will see that when trying to make code go fast one often

breaks it Readability

Fast code typically requires more lines! Modularity can hurt performance

e.g., Too many classes Portability

Code that is fast on machine A can be slow on machine B At the extreme, highly optimized code is not portable at all, and in

fact is done in hardware!

Page 6: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

6

Performance - Time Time between the start and the end of an

operation Also called running time, elapsed time, wall-clock

time, response time, latency, execution time, ... Most straightforward measure: “my program

takes 12.5s on a Pentium 3.5GHz” Can be normalized to some reference time

Must be measured on a “dedicated” machine

Page 7: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

7

Performance - Rate Used often so that performance can be

independent on the “size” of the application e.g., compressing a 1MB file takes 1 minute.

compressing a 2MB file takes 2 minutes. The performance is the same.

Millions of instructions / sec (MIPS) MIPS = instruction count / (execution time * 106)

= clock rate / (CPI * 106) But Instructions Set Architectures are not equivalent

1 CISC instruction = many RISC instructions Programs use different instruction mixes May be ok for same program on same architectures

Page 8: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

8

Performance – Rate cont. Millions of floating point operations /sec (MFlops)

Very popular, but often misleading e.g., A high MFlops rate in a badly designed algorithm could have

poor application performance Application-specific:

Millions of frames rendered per second Millions of amino-acid compared per second Millions of HTTP requests served per seconds

Application-specific metrics are often preferable and others may be misleading

MFlops can be application-specific thought For instance:

I want to add to n-element vectors This requires 2*n Floating Point Operations Therefore MFlops is a good measure

Page 9: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

9

What is “Peak” Performance? Resource vendors always talk about peak performance rate

computed based on specifications of the machine For instance:

I build a machine with 2 floating point units Each unit can do an operation in 2 cycles My CPU is at 1GHz Therefore I have a 1*2/2 =1GFlops Machine

Problem: In real code you will never be able to use the two floating point units

constantly Data needs to come from memory and cause the floating point units to

be idle Typically, real code achieves only a (often small) fraction of

the peak performance

Page 10: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

10

Performance - Benchmarks Since many performance metrics turn out to be

misleading, people have designed benchmarks Example: SPEC Benchmark

Integer benchmark Floating point benchmark

These benchmarks are typically a collection of several codes that come from “real-world software”

The question “what is a good benchmark?” is difficult If the benchmarks do not correspond to what you’ll do

with your computer, then the benchmark results are not relevant to you

Page 11: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

11

Performance – A Program Concerned with improving a program’s performance

For a given platform, take a given program Run it and measure its wall-clock time Enhance it, run it again an quantify the performance

improvement i.e., the reduction in wall-clock time

For each version compute its performance preferably as a relevant performance rate so that you can say: the best implementation we have so far goes

“this fast” (perhaps a % of the peak performance)

Page 12: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

12

Performance – Program Speedup We need a metric to quantify the impact of

your performance enhancement Speedup: ratio of “old” time to “new” time

old time = 2h new time = 1h speedup = 2h / 1h = 2h

Sometimes one talks about a “slowdown” in case the “enhancement” is not beneficial Happens more often than one thinks

Page 13: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

13

Parallel Performance The notion of speedup is completely generic

By using a rice cooker I’ve achieved a 1.20 speedup for rice cooking

For parallel programs on defines the Parallel Speedup (we’ll just say “speedup”):

Parallel program takes time T1 on 1 processor

Parallel program takes time Tp on p processors

Parallel Speedup(p) = T1 / Tp

In the ideal case, if my sequential program takes 2

hours on 1 processor, it will take 1 hour on 2

processors: called linear speedup

Page 14: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

14

Speedup

speed

up

number of processors

linear speedup

sub-linear speedup

supe

rline

ar s

peed

up!!

Page 15: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

15

Superlinear Speedup?

There are several possible causes Algorithm

e.g., with optimization problems, throwing many processors at it increases the chances that one will “get lucky” and find the optimum fast

Hardware e.g., with many processors, it is possible that the

entire application data resides in cache (vs. RAM) or in RAM (vs. Disk)

Page 16: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

16

Amdahl’s Law

Consider a program whose execution consists of two phases One sequential phase One phase that can be perfectly parallelized (linear speedup)

T1 T2

Sequential programT1 = time spent in phase that cannot be parallelized.

T2 = time spent in phase that can be parallelized.

T2’ = time spent in parallelized phase

Old time: T = T1 + T2

T1’ = T1 T2’ <= T2

Parallel program

New time: T’ = T1’ + T2’

Page 17: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

17

Amdahl’s Law cont.

P1 = 11%, P2 = 18%, P3 = 23%, P4 = 48%, which add up to 100%

S1 = 1 or 100%, P2 is sped up 5x, so S2 = 500%, P3 is sped up 20x, so S3 = 2000%, and P4 is sped up 1.6x, so S4 = 160%

P1/S1 + P2/S2 + P3/S3 + P4/S4, we find the running time is 0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575 or a little less than ½ the original running time which we know is 1

the overall speed boost is 1/0.4575 = 2.186 or a little more than double the original speed using the formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)−1

Page 18: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

18

Amdahl’s Law - Parallelization

P is the proportion of a program that can be made parallel (i.e. benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is

Page 19: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

19

Parallel Efficiency Definition: Eff(p) = S(p) / p Typically < 1, unless linear or superlinear speedup Used to measure how well the processors are

utilized If increasing the number of processors by a factor 10

increases the speedup by a factor 2, perhaps it’s not worth it: efficiency drops by a factor 5

Important when purchasing a parallel machine for instance: if due to the application’s behavior efficiency is low, forget buying a large cluster

Page 20: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

20

Scalability Measure of the “effort” needed to maintain

efficiency while adding processors For a given problem size, plot Efd(p) for increasing

values of p It should stay close to a flat line

Isoefficiency: At which rate does the problem size need to be increase to maintain efficiency By making a problem ridiculously large, one can typically

achieve good efficiency Problem: is it how the machine/code will be used?

Page 21: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

21

Performance Measurement how does one measure the performance of a

program in practice? Two issues:

Measuring wall-clock times We’ll see how it can be done shortly

Measuring performance rates Measure wall clock time (see above) “Count” number of “operations” (frames, flops, amino-acids:

whatever makes sense for the application) Either by actively counting (count++) Or by looking at the code and figure out how many operations are

performed Divide the count by the wall-clock time

Page 22: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

22

Why is Performance Poor? Performance is poor because the code

suffers from a performance bottleneck Definition

An application runs on a platform that has many components

CPU, Memory, Operating System, Network, Hard Drive, Video Card, etc.

Pick a component and make it faster If the application performance increases, that

component was the bottleneck!

Page 23: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

23

Removing a Bottleneck Hardware Upgrade

Is sometimes necessary But can only get you so far and may be very

costly e.g., memory technology

Code Modification The bottleneck is there because the code uses a

“resource” heavily or in non-intelligent manner We will learn techniques to alleviate bottlenecks

at the software level

Page 24: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

24

Identifying a Bottleneck It can be difficult

You’re not going to change the memory bus just to see what happens to the application

But you can run the code on a different machine and see what happens

One Approach Know/discover the characteristics of the machine Observe the application execution on the machine Tinker with the code Run the application again Repeat Reason about what the bottleneck is

Page 25: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

25

Memory Latency and Bandwidth

The performance of memory is typically defined by Latency and Bandwidth (or Rate)

Latency: time to read one byte from memory measured in nanoseconds these days

Bandwidth: how many bytes can be read per seconds measured in GB/sec

Note that you don’t have bandwidth = 1 / latency! Reading 2 bytes in sequence may be cheaper than

reading one byte only Let’s see why...

Page 26: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

26

Latency and Bandwidth

Latency = time for one “data item” to go from memory to the CPU

In-flight = number of data items that can be “in flight” on the memory bus

Bandwidth = Capacity / Latency Maximum number of items I get in a latency period

Pipelining: initial delay for getting the first data item = latency if I ask for enough data items (> in-flight), after latency seconds, I receive

data items at rate bandwidth Just like networks, hardware pipelines, etc.

CPU

“memory bus”

Memory

Page 27: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

27

Latency and Bandwidth Why is memory bandwidth important?

Because it gives us an upper bound on how fast one could feed data to the CPU

Why is memory latency important? Because typically one cannot feed the CPU data

constantly and one gets impacted by the latency Latency numbers must be put in perspective with

the clock rate, i.e., measured in cycles

Page 28: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

28

Memory Bottleneck: Example Fragment of code: a[i] = b[j] + c[k]

Three memory references: 2 reads, 1 write One addition: can be done in one cycle

If the memory bandwidth is 12.8GB/sec, then the rate at which the processor can access integers (4 bytes) is: 12.8*1024*1024*1024 / 4 = 3.4GHz

The above code needs to access 3 integers Therefore, the rate at which the code gets its data is ~

1.1GHz But the CPU could perform additions at 4GHz! Therefore: The memory is the bottleneck

And we assumed memory worked at the peak!!! We ignored other possible overheads on the bus In practice the gap can be around a factor 15 or higher

Page 29: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

29

A realistic approach: Profiling A profiler is a tool that monitors the execution of a program

and reports the computational characteristics used in different parts of the code i.e. dynamic program analysis

Useful to identify “expensive” functions Profiling cycle

Compile the code with the profiler Run the code Identify the most expensive function Optimize that function

call it less often if possible, make it faster, etc. Repeat

Page 30: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

30

Profiler Types based on Output

Flat profiler Flat profiler's compute the average call times, from the

calls, and do not breakdown the call times based on the callee or the context.

Call-Graph profiler Call Graph profilers show the call times, and

frequencies of the functions, and also the call-chains involved based on the callee. However context is not preserved.

Page 31: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

31

Methods of data gathering Event based profilers

In Programming languages listed, all of them have event-based profilers

Java: JVM-Profiler Interface JVM API provides hooks to profiler, for trapping events like calls, class-load, unload, thread enter leave.

Python: Python profilers are profile module, hotspot which are call-graph based, and use the 'sys.set_profile()' module to trap events like c_{call,return,exception}, python_{call,return,exception}.

Page 32: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

32

Methods of data gathering Statistical profilers

Some profilers operate by sampling. Some profilers instrument the target program with

additional instructions to collect the required information The resulting data are not exact, but a statistical

approximation. Some of the most commonly used statistical profilers are

GNU's gprof, Oprofile and SGI's Pixie.

Page 33: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

33

Methods of data gathering Instrumentation

Manual: Done by the programmer, e.g. by adding instructions to explicitly calculate runtimes.

Compiler assisted: Example: "gcc -pg ..." for gprof, "quantify g++ ..." for Quantify

Binary translation: The tool adds instrumentation to a compiled binary. Example: ATOM

Runtime instrumentation: Directly before execution the code is instrumented. The program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind

Runtime injection: More lightweight than runtime instrumentation. Code is modified at runtime to have jumps to helper functions. Example: DynInst

Hypervisor: Data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON

Simulator: Data are collected by running under an Instruction Set Simulator. Example: SIMMON

Page 34: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

34

UPC – BSC Profiling Tools

Dimemas Simulation tool for the parametric analysis of the

behavior of message-passing applications on a configurable parallel platform.

Paraver A very powerful performance visualization and

analysis tool based on traces that can be used to analyze any information that is expressed on its input trace format.

Page 35: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

35

Dimemas

Development Tuning Benchmarking Training

Predict behavior of target architecture

Forecast effects of code optimization

What if analysis

Utilization - Features

Page 36: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

36

Dimemas cont.

Integrated with visualization tool Extensive application characterization Squeezing the information that a single

instrumented run can provide

Usage

Page 37: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

37

Dimemas cont.

enables the user to develop and tune parallel applications on a workstation

providing an accurate prediction of their performance on the parallel target machine

The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modeled by a set of performance parameters

Page 38: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

38

Dimemas – DiP Environment

Page 39: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

39

Dimemas - Benefits

the possibility of not using a parallel machine to run the application to get the traces

the possibility to analyze different application choices without changing neither rerunning the application itself

Page 40: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

40

Dimemas – Benefits cont.

The loop involving Dimemas and Paraver, the simulator and the visualization tool. The user has the chance to analyze the

application behavior when some parameters are changed, for example, what will happen to application execution time is the execution time of a given function is reduced in 50%?

Page 41: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

41

Dimemas - Prediction

A tracefile can be either obtained in a dedicated parallel machine or in a shared sequential machine.

Using the instrumentation records, Dimemas

will rebuild the application behavior, using the tracefile and the architectural parameters defined using a GUI. Per process CPU time is used as opposed to elapsed time.

Page 42: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

42

Dimemas – Prediction Advantages and Disadvantages

Cost reduction

Analysis of non-existent machines

Quality of performance prediction

Performance

Page 43: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

43

Dimemas – Output Information

Basic results execution time and speedup For each task

computation time blocking time information related to messages: number of sends,

number of received (divided in number of receives that produces a block because a message is not local to the task, and number of messages that promotes no block).

Page 44: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

44

Dimemas – Output Information cont.

Basic results

Page 45: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

45

Dimemas – Output Information cont.

Application Analysis Is my application well balanced? Are there any

dependencies? Can we improve the application grouping

messages? Does latency influence application time?

Are we planning to run our application in a low bandwidth network? Does bandwidth have influence in application time?

Do resources limit application time?

Page 46: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

46

Dimemas – Output Information cont.

Additional Analysis What is analysis Critical path Synthetic perturbation analysis

Page 47: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

47

Paraver

Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI OpenMP MPI+OpenMP Java Hardware counters profile Operating system activity... and many other

things you may think of

Page 48: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

48

Paraver – DiP Environment

Page 49: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

49

Paraver – Features

Motif based GUI Detailed quantitative analysis of program

performance Concurrent comparative analysis of several

traces Fast analysis of very large traces

Page 50: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

50

Paraver – Features cont.

Support for mixed message passing and shared memory (networks of SMPs)

Customizable semantics of the visualized information Cooperative work, sharing views of the tracefile

Building of derived metrics

Page 51: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

51

Paraver – Views

Graphical view Textual view

Analysis view

Page 52: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

52

Paraver – Views cont.

Paraver displayed information consists of three elements: a time dependent value (called semantic

value) for each represented object flags that correspond to punctual events

within a represented object communication lines that relate the

displayed objects.

Page 53: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

53

Paraver – Expressive Power

separation between the visualization power (how to display) and the semantic module (value to display) offers a flexibility and expressive power above

the filtering module is in front of the semantic one

Page 54: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

54

Paraver – Quantitative Analysis

1D analysis Number of messages sent

in the selected interval Average CPU utilization Number of events of a given

type on each processor Average CPU time between

two communications.

Page 55: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

55

Paraver – Quantitative Analysis

2D analysis Number of bytes sent between each pair of

tasks Average number of L2 cache misses per thread

on each parallel region Number of TLB misses per thread on each

application routine Average number of IPC (instructions per cycle)

per thread on each iteration

Page 56: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

56

Paraver – Quantitative Analysis

1D analysis

2D analysis

Page 57: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

57

Paraver – Other functionality

Multiple traces Large traces Cooperative analysis Derived Metrics Batch processing

Page 58: 1 Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Presented by Javier Figueroa Instructor: S. Masoud Sadjadi sadjadi/Teaching

58

Acknowledgments

GCB FIU SCIS NSF UPC – BSC Dr. Masoud Sadjadi Javier Muñoz and Diego Lopez