Andrés S. Charif Rubial
Ter@tec – 2nd July 2014
Andrés S. Charif-Rubial, William Jalby
Performance Evaluation
MAQAO Toolsuite
Outline
1. Introduction
2. PAMDA Methodology
3. MAQAO Framework
4. PerfEval: Profiling
5. CQA: Code Quality Analysis
6. DECAN: Differential Analysis
7. Success Stories
Introduction: Performance evaluation
Characterize the performance of an application
Complex multicore CPUs and memory systems
How well does it behave on a given machine?
Generally a multifaceted problem
What are the issues (numerous but finite)?
Which one(s) dominate?
Maximizing the number of views
=> Need for specialized tools
Several tools available
Which one to use ?
=> Need for a methodology ?
Introduction: Existing tools and methodologies (1/2)
ROI-oriented and global view:
Lack of performance impact prediction:
=> Will fixing a given pathology pay off?
=> No way to get a return on investment metric
Global view:
=> What are the issues?
=> Which ones have a high speedup potential?
Can lead to useless optimization:
Example 1: restructuring data accesses across the whole application may be a waste of time if the potential speedup is only 2%.
Example 2: various tools can detect high miss rates. Fixing a high miss rate can be useless if it is combined with div/sqrt operations, because the dominating bottleneck might be the FP operations.
Introduction: Existing tools and methodologies (2/2)
One-way approaches/techniques:
• HPCToolkit, PerfExpert and VTune rely heavily on sampling and hardware events.
=> Sampling-based profiling aggregates everything together (all instances), which might be counterproductive
• Scalasca/Vampir rely heavily on tracing and source code probe insertion.
=> Tracing-based profiling is heavier (time consuming, subject to deviation with the number of function invocations)
• In practice, it is usually a trade-off: the best choice or combination has to be found for a given application
Introduction: Motivating example
Source code and associated issues (~10% walltime):
1) High number of statements
2) Non-unit stride accesses
3) Indirect accesses
4) DIV/SQRT
5) Reductions
6) Vector vs Scalar
Special issues:
Low trip count: from 2 to 2186 at binary level
Is it possible to:
– detect all these issues with current tools?
– obtain potential speedup estimates to guide the optimization effort?
PAMDA Methodology: overview
Our approach: Performance Assessment using MAQAO toolset and Differential Analysis
• Work done at binary level
• Get a global hierarchical view of performance pathologies/bottlenecks
• Estimate the performance impact of a given performance pathology while taking into account all of the other pathologies present
• Use different tools for pathology detection and pathology analysis
• Tool selection on pathology basis
• Fine grain - “expensive” - tools only used if necessary on specific issues
PAMDA Methodology: overview
Decision tree:
Profiling
Loops of interest
Differential analysis
CPU oriented
Code Quality Analysis
Value Profiling
Differential analysis
Memory oriented
Memory behavior characterization
Differential analysis
PAMDA Methodology: overview
Compiler remains our best friend
Be sure to select proper flags
Know default flags (e.g., -xHost on AVX capable machines)
Bypass conservative behavior when possible
Pragmas:
Vectorization, Alignment, Unrolling, etc.
Portable transformations
Open source (LGPL 3.0)
Currently binary release
Source release soon
Available for:
x86-64
Xeon Phi
MAQAO: Introduction
www.maqao.org
Audience
User/Tool developer:
analysis and optimization tool
Performance tool developer: framework services
BULL SAS: on-going effort – PerfCloud (MIL*)
University of Oregon: TAU tool – tau_rewrite (MIL*)
ScoreP project: on-going effort – VI-HPS (MIL*)
MAQAO: Introduction
www.maqao.org
* MAQAO Instrumentation Language
History
Started ten years ago on Itanium
Strong emphasis on code generated by the compiler
Contributors
ECR (Intel, CEA, GENCI, UVSQ)
UVSQ through non-ECR funded projects:
H4H
PerfCloud
University of Bordeaux
MAQAO: Introduction
Binary level
Framework services
Scripting language
Low level API
Loop-centric (HPC)
Produce reports
We deal with low level details
Users get high level reports
MAQAO: Introduction
CQA
PerfEval
Profiling
Locating hotspots
Measurement methods
Instrumentation
Through binary rewriting
High overhead / More precision
Sampling
Hardware counters through the perf_event_open system call
Very low overhead / less details
Default method: Sampling using hardware counters
MAQAO PerfEval
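To illustrate why sampling is cheap but less detailed, here is a minimal sketch of how a sampling profiler might turn raw samples into per-function time shares. The function name `exclusive_time_shares` and the sample data are hypothetical; this is not MAQAO's implementation, only an illustration of the idea that each sample attributes one period to the function executing at that moment.

```python
from collections import Counter

def exclusive_time_shares(samples):
    """Return {function: fraction of samples}, most sampled first.

    Each sample is the name of the function that was executing when the
    counter overflowed, so the fraction approximates exclusive time.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fn: n / total for fn, n in counts.most_common()}

# Hypothetical samples collected during a run.
samples = ["solve", "solve", "mpi_wait", "solve", "rhs", "solve", "rhs", "mpi_wait"]
for fn, share in exclusive_time_shares(samples).items():
    print(f"{fn:10s} {share:6.1%}")
```

Only aggregated counts are kept, which is why the overhead is low and why individual loop instances cannot be told apart.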
Collection level
Inter-Node
Node
Sockets
Core level
SIMD: data parallelism
ILP: instruction-level parallelism
Runtime-agnostic:
Only system processes and threads are considered
Function hotspots load balancing view at (multi)node level
Categorization (MPI/OpenMP/Pthreads/IO/…)
MAQAO PerfEval
Display functions and their exclusive time
Associated callchains and their contribution
Loops
Hardware counters profiles:
cache oriented
compute oriented
Innermost loops can then be analyzed by the code quality analyzer module (CQA)
Command line and GUI (HTML) outputs
MAQAO PerfEval
MAQAO PerfEval
Example: NPB-MPI bt.C 36 processes
(multi)node load balancing view
Node view
MAQAO PerfEval
Profiling
Runtime specific tools
Online profiling
Aggregated metrics (coarse grained analyses)
No traces
No IOs (only one result file)
Reduced memory footprint
Scalable on 100+ procs
MAQAO PerfEval MPI
CQA
Code Quality Analysis
Main performance issues:
Core level
Multicore interactions
Communications
Most of the time core level is forgotten
CQA: Code Quality Analyzer
Targets innermost loops
Source loop versus assembly loop(s)
Versioning
Peel / Main / Tail
Or combination of both
CQA: Code Quality Analyzer
[Figure: one source loop mapped to five assembly loops (ASM Loop 1 to ASM Loop 5)]
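The Peel / Main / Tail versioning can be pictured with a short sketch showing how a compiler might split one source loop's iteration range. The helper `split_loop` and its parameters are hypothetical, and it assumes alignment is reached when the index is a multiple of the number of vector lanes; real compilers decide this from addresses, not indices.

```python
def split_loop(start, end, lanes):
    """Split the iteration range [start, end) into peel, main and tail ranges.

    peel: scalar iterations up to the first index that is a multiple of `lanes`
    main: vector iterations, each covering `lanes` consecutive elements
    tail: leftover scalar iterations
    """
    peel_len = min((-start) % lanes, end - start)
    main_start = start + peel_len
    main_len = ((end - main_start) // lanes) * lanes
    tail_start = main_start + main_len
    return (range(start, main_start),
            range(main_start, tail_start, lanes),
            range(tail_start, end))

peel, main, tail = split_loop(3, 21, 4)
print(list(peel), list(main), list(tail))
```

Each of the three ranges corresponds to a distinct assembly loop, which is one reason a single source loop can show up as several binary loops.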
Simplified static performance model
Simulates a target (micro)architecture execution pipeline
Instructions description (latency, uops dispatch...)
Microbench MAQAO module
Out of order considered as ideal => no buffers (ROB, RS, PRF)
Data is considered resident in L1$
=> Memory issues should be solved before using CQA
CQA: Code Quality Analyzer
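A static model of this kind can be sketched in a few lines. The sketch below is only an illustration under strong assumptions, not CQA's actual model: the instruction table `UOPS`, the port names, the divider occupancy and the issue width are all made-up values, each instruction maps to a single port, and out-of-order execution is treated as ideal, so the bound is just the busiest resource.

```python
from collections import defaultdict

# Hypothetical instruction table: mnemonic -> (port, cycles occupying that port).
# Real tables are specific to each microarchitecture; these values are illustrative.
UOPS = {
    "load":  ("P_LOAD", 1),
    "store": ("P_STORE", 1),
    "add":   ("P_FP_ADD", 1),
    "mul":   ("P_FP_MUL", 1),
    "div":   ("P_FP_MUL", 14),  # assumed occupancy of a non-pipelined divider
}

def cycles_lower_bound(loop_body, issue_width=4):
    """Lower bound on cycles per iteration, assuming all data hits in L1."""
    port_pressure = defaultdict(int)
    for mnemonic in loop_body:
        port, cost = UOPS[mnemonic]
        port_pressure[port] += cost
    front_end = len(loop_body) / issue_width
    return max(front_end, max(port_pressure.values(), default=0))

print(cycles_lower_bound(["load", "load", "mul", "add", "store"]))
print(cycles_lower_bound(["load", "div", "store"]))
```

Because buffers and cache misses are ignored, the measured time can only be equal to or worse than such a bound.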
Assess code quality given a binary loop
Static performance estimation: lower bounds on cycles
Quality metrics:
Vectorization degree
Impact of address computations (scalar integers)
FP contribution (all or pure arith without memory)
Detect high latency instructions
Unrolling factor detection
Provide high level reports
Provide source loop context when available
Describing a pathology
Suggested workarounds to improve static performance
Reports categorized by confidence level:
gain, potential gain, hint and expert
CQA: Code Quality Analyzer
DECAN
Differential Analysis
Targets innermost loops
Assembly transformations:
Insert a new instruction
Replace an existing instruction
Remove an existing instruction (fill with nops)
Differential analysis:
Compare the performance of two loops
The original binary loop (ref) and a transformed copy of it
Goal: create transformations that can
Detect bottlenecks
Estimate associated ROI
DECAN
Principle
Performance of the original loop is measured
Some instructions are removed in the loop body (for example loads and stores)
Performance of the transformed loop is measured
Usage
Can perform sampling by transforming only 1 instance and aborting execution
Can replay the original loop execution after the modified one
The differential analysis speedup is an upper bound on what optimizing the removed instructions can gain
DECAN
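The principle above boils down to one ratio. As a minimal sketch (the function name and the cycle counts are hypothetical, not DECAN output), the speedup of the transformed loop over the reference bounds what any optimization of the removed instructions could achieve:

```python
def speedup_upper_bound(ref_cycles, variant_cycles):
    """Speedup of the transformed loop over the original (reference) loop.

    Since the variant removes the instructions entirely, no optimization of
    those instructions can be expected to do better than this ratio.
    """
    if ref_cycles <= 0 or variant_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return ref_cycles / variant_cycles

# Hypothetical measurements: original loop vs. loop with loads/stores removed.
print(speedup_upper_bound(40.0, 32.0))
```

A ratio close to 1 suggests that the removed instructions are not the bottleneck, so optimizing them is unlikely to pay off.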
Typical transformations:
FP: only FP arithmetic instructions are preserved
=> loads and stores are removed
LS: only loads and stores are preserved
=> compute instructions are removed
DL1: memory references are replaced with references to global variables
=> data is now accessed from L1
DECAN
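The FP and LS variants can be illustrated on a toy instruction list. This is a sketch under simplifying assumptions: the `Insn` class, the `kind` labels and the sample loop are hypothetical, every instruction is assumed to be purely memory, purely FP, or loop control, and removal is modeled as replacement by a nop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insn:
    text: str
    kind: str  # "fp", "mem" or "other" (hypothetical classification)

NOP = Insn("nop", "other")

def make_variant(loop, keep):
    """Keep instructions of kind `keep` and loop control; nop out the rest."""
    return [i if i.kind in (keep, "other") else NOP for i in loop]

loop = [
    Insn("vmovsd (%rsi,%rax,8), %xmm0", "mem"),
    Insn("vdivsd %xmm1, %xmm0, %xmm0", "fp"),
    Insn("vmovsd %xmm0, (%rdi,%rax,8)", "mem"),
    Insn("inc %rax", "other"),
    Insn("cmp %rcx, %rax", "other"),
    Insn("jb loop_head", "other"),
]

fp_variant = make_variant(loop, "fp")   # loads and stores removed
ls_variant = make_variant(loop, "mem")  # compute instructions removed
print([i.text for i in fp_variant])
print([i.text for i in ls_variant])
```

Loop control is preserved in both variants so that the trip count, and therefore the comparison with the reference loop, stays meaningful.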
DECAN
FP LS
Ref
DECAN
Monitor:
• Execution times
• Loop iteration numbers
• Hardware counter values
DECAN: Polaris example
Polaris: introduction motivating example solution
1) High number of statements
2) Non-unit stride accesses
3) Indirect accesses
4) DIV/SQRT
5) Reductions
6) Vector vs Scalar
Special issues:
Low trip count: from 2 to 2186 at binary level
DECAN: Polaris example
FP / LS transformations
[Figure: bar chart of execution time, in cycles per source iteration, for the variants Best_estimated, REF, FP and LS]
ROI = FP / LS = 4.1
Imbalance between the two streams
=> Try to consume more elements inside one iteration.
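The FP / LS ratio can be read as a balance indicator between the compute stream and the memory stream. The sketch below assumes the ratio is taken between the cycle counts of the FP and LS variants; the function names, the tolerance and the cycle values are hypothetical and chosen only to reproduce a ratio of 4.1.

```python
def stream_ratio(fp_cycles, ls_cycles):
    """Ratio of the FP variant time to the LS variant time."""
    if fp_cycles <= 0 or ls_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return fp_cycles / ls_cycles

def dominant_stream(fp_cycles, ls_cycles, tolerance=0.1):
    """Name the stream that dominates, or 'balanced' within the tolerance."""
    ratio = stream_ratio(fp_cycles, ls_cycles)
    if ratio > 1 + tolerance:
        return "FP"
    if ratio < 1 - tolerance:
        return "LS"
    return "balanced"

# Hypothetical cycle counts per source iteration.
print(round(stream_ratio(41.0, 10.0), 2), dominant_stream(41.0, 10.0))
```

A ratio well above 1 points at the arithmetic stream, which is where the optimization effort should go first.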
DECAN: Polaris example
FP bound: CQA provides the following metrics:
Estimated cycles: 43 (FP = 44)
Vector efficiency ratio: 25% (4 DP elements can fit into a 256-bit vector, only 1 is used)
DIV/SQRT bound + DP elements:
~4/8x speedup on a 128/256-bit DIV/SQRT unit (2/4x by vectorization + ~2x by using SP)
Sandy/Ivy Bridge: still 128 bits (potential speedup 2x DP, 4x SP)
=> First optimization = VECTORIZATION
Using SIMD directive
Two binary loops:
Main (packed instructions, 4 elements per iteration)
Tail (scalar instructions, 1 element per iteration)
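The figures on this slide follow from simple lane arithmetic, sketched below. The function names are hypothetical, and the ~2x gain from switching to single precision is taken from the slide as an approximation rather than computed.

```python
def lanes(vector_bits, element_bits):
    """Number of elements that fit into one vector register or unit."""
    return vector_bits // element_bits

def vector_efficiency(used_elements, vector_bits, element_bits):
    """Fraction of the available vector lanes actually used."""
    return used_elements / lanes(vector_bits, element_bits)

def div_sqrt_speedup(unit_bits, dp_bits=64, sp_gain=2.0):
    """(gain from vectorizing DP, gain from vectorizing and switching to SP)."""
    vectorization_gain = lanes(unit_bits, dp_bits)
    return vectorization_gain, vectorization_gain * sp_gain

print(vector_efficiency(1, 256, 64))  # scalar DP code in a 256-bit vector
print(div_sqrt_speedup(128))          # 128-bit DIV/SQRT unit
print(div_sqrt_speedup(256))          # 256-bit DIV/SQRT unit
```

One DP element out of four lanes gives the 25% ratio, and 2 or 4 lanes times the ~2x SP gain gives the ~4/8x estimate.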
DECAN: Polaris example
ROI = FP / LS = 2.07 (initial ROI was 4.1)
Removing loads/stores provides a much smaller speedup than removing arithmetic instructions => focus on the arithmetic instructions
[Figure: after vectorization, bar chart of execution time, in cycles per source iteration, for the variants Best_estimated, REF, FP and LS]
DECAN: Polaris example
REF_NSD : removing DIV/SQRT instructions provides a 2x speedup
=> the bottleneck is the presence of these DIV/SQRT instructions
FPIS_NSD : removing loads/stores after DIV/SQRT provides a small additional speedup
Conclusion: No room left for improvement here (algorithm bound)
[Figure: one step further, bar chart of execution time, in cycles per source iteration, for the variants Best_estimated, REF, FP, LS, REF_NSD (DIV/SQRT instructions removed) and FPIS_NSD (loads/stores + DIV/SQRT instructions removed)]
Success stories
POLARIS (MD)
Anti-Coagulant
(7.46 nm)³
Example of multi scale problem:
Factor Xa, involved in thrombosis
• CEA-DSV : Direction des Sciences du Vivant
• Molecular Dynamics
• Speedup: 1.5 – 1.7x
• Effort to speedup:
• ~2 man-months (*)
* For the MAQAO team, using ECR tools (MAQAO) and methodology
QMC=Chem
• IRSAMC : Institut de Recherche sur les Systèmes Atomiques et Moléculaires Complexes
• Quantum chemistry (Monte Carlo)
• Speedup: > 3x
• Effort to speedup:
• ~2 man-months (*)
* For the MAQAO team, using ECR tools (MAQAO) and methodology
YALES2
• CORIA : Complexe de Recherche Inter-professionnel en Aérothermochimie
• Computational fluid dynamics (CFD)
• Speedup: up to 2.8x
• Effort to speedup:
• ~3 man-months (*)
* For the MAQAO team, using ECR tools (MAQAO) and methodology
Thanks for your attention !
Questions ?
Acknowledgements
This work was supported by CEA, GENCI, Intel and UVSQ.
www.maqao.org
Meet us @ ECR Booth 24
Backup Slides
MIL: MAQAO Instrumentation Language
MIL: MAQAO Instrumentation Language
A domain specific language to easily build custom tools
Fast prototyping of evaluation tools
Easy to use, easy to express: productivity
Focus on what (research) and not how (technical)
Coupling static and dynamic analyses
Static binary instrumentation
Efficient: lowest overhead
Robust: ensure the program semantics
Accurate: correctly identify program structure
Drive binary manipulation layer of MAQAO tool
MIL: MAQAO Instrumentation Language
Current state of the art:
Dyninst appears as the most complete
Not sufficient given our goals
| | Dyninst | PIN | PEBIL |
|---|---|---|---|
| Language type | API Oriented / DSL | API Oriented | API Oriented |
| Instrumentation type | Static/Dynamic binary | Dynamic binary | Static binary |
| Overhead | High/High | High | Low |
| Safe Method | Yes | Yes | No |
MIL: MAQAO Instrumentation Language
Objects
Events
Filters
Probes
Actions
Variable classes
Runtime embedded code
Configuration features (output, properties, etc.)
MIL: MAQAO Instrumentation Language
Example 1: TAU Profiler
MIL: MAQAO Instrumentation Language
Example 2: Filtering
Previous example only needs an additional statement