![Page 1: Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU Aroon Nataraj Performance Research Lab University of Oregon](https://reader035.vdocuments.site/reader035/viewer/2022062313/56649f515503460f94c75421/html5/thumbnails/1.jpg)
Kernel-level Measurement for Integrated
Parallel Performance Views
KTAU: Kernel - TAU
Aroon Nataraj
Performance Research Lab, University of Oregon
KTAU: Outline
- Introduction
- Motivations
- Objectives
- Architecture / Implementation Choices
- Experimentation: the performance views
- Perturbation Study
- Future work and directions
- Acknowledgements
Introduction: ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation (FastOS)
  - Conduct OS research to provide an effective OS/runtime for petascale systems
- ZeptoOS (under FastOS)
  - Scalable components for petascale architectures
  - Joint project: Argonne National Lab and University of Oregon
  - ANL: putting a light-weight kernel (based on Linux) on BG/L and other platforms (XT3)
- University of Oregon
  - Kernel performance monitoring, tuning
  - KTAU
    - Integration of TAU infrastructure with Linux kernel
    - Integration with ZeptoOS, installation on BG/L
    - Port to 32-bit and 64-bit Linux platforms
KTAU: Motivation
- Application performance = user-level execution performance + OS-level operations performance
- Domains: time and hardware performance metrics
- PAPI (Performance Application Programming Interface)
  - Exposes virtualized hardware counters
- TAU (Tuning and Analysis Utilities)
  - Measures many user-level entities: parallel application, MPI, libraries …
  - Time domain
  - Uses PAPI to correlate counter information with source
KTAU: Motivation
Simple Parallel Model
KTAU: Motivation
Simple Parallel Model - Scale
KTAU: Motivation
Effects of Scale
- As HPC systems continue to scale to larger processor counts
  - Application performance becomes more sensitive
  - New OS factors become performance bottlenecks (e.g. [Petrini’03, Jones’03, other works…])
  - Isolating these system-level issues as bottlenecks is non-trivial
- Comprehensive performance understanding
  - Observation of all performance factors
  - Relative contributions and interrelationships: can we correlate?
KTAU: Motivation
Program - OS Interactions
- Program-OS interactions: direct vs. indirect entry points
- Direct: applications invoke the OS for certain services
  - Syscalls (and internal OS routines called directly from syscalls)
- Indirect: OS takes actions without explicit invocation by the application
  - Preemptive scheduling
  - (HW) interrupt handling
  - OS background activity (keeping track of time and timers, bottom-half handling, etc.)
- Indirect interactions can occur at any OS entry (not just when entering through syscalls)
KTAU: Motivation
Program - OS Interactions
KTAU: Motivation
Program - OS Interactions
- Direct interactions are easier to handle
  - Synchronous with user code and in process context
- Indirect interactions are more difficult to handle
  - Usually asynchronous and in interrupt context: hard to measure and harder to correlate/integrate with application measurements
- Indirect interactions may be unrelated to the current task
  - E.g. kernel-level packet processing for another process
  - But related in terms of time to the current process
KTAU: Motivation
Program - OS Interactions (Partial)
KTAU: Motivation
Kernel-wide vs. Process-centric
- Kernel-wide: aggregate kernel activity of all active processes in the system
  - Understand overall OS behavior, identify and remove kernel hot spots
  - Cannot show what parts of the application spend time in the OS and why
- Process-centric perspective: OS performance within the context of a specific application’s execution
  - Virtualization and mapping of performance to process
  - Interactions between programs, daemons, and system services
  - Tune OS for a specific workload or tune the application to better conform to the OS configuration
  - Expose the real source of performance problems (in the OS or the application)
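The difference between the two perspectives can be illustrated with a minimal Python sketch. It is not KTAU code: the event tuples, PIDs, routine names and timings below are hypothetical, chosen only to show that the kernel-wide view sums over all processes while the process-centric view keeps only one process's kernel activity.

```python
from collections import defaultdict

# Hypothetical event records: (pid, kernel_routine, elapsed_microseconds).
# All values are made up for illustration; they are not KTAU output.
EVENTS = [
    (4066, "sys_read", 120),
    (4066, "schedule", 300),
    (4068, "sys_read", 80),
    (4068, "tcp_v4_rcv", 50),
    (1, "schedule", 40),
]


def kernel_wide(events):
    """Aggregate time per kernel routine across all processes."""
    totals = defaultdict(int)
    for _pid, routine, usec in events:
        totals[routine] += usec
    return dict(totals)


def process_centric(events, pid):
    """Aggregate time per kernel routine for a single process only."""
    totals = defaultdict(int)
    for event_pid, routine, usec in events:
        if event_pid == pid:
            totals[routine] += usec
    return dict(totals)


if __name__ == "__main__":
    print("kernel-wide:    ", kernel_wide(EVENTS))
    print("process-centric:", process_centric(EVENTS, 4066))
```

The kernel-wide totals can reveal a hot routine, but only the per-process totals say which application paid for it.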
KTAU: Motivation Kernel-wide vs. Process-centric
KTAU: Motivation
Existing Approaches
- User-space-only measurement tools
  - Many tools only work at user level and cannot observe system-level performance influences
- Kernel-level-only measurement tools
  - Most only provide the kernel-wide perspective and lack proper mapping/virtualization
  - Some provide process-centric views but cannot integrate OS and user-level measurements
- Combined or integrated user/kernel measurement tools
  - A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance
  - Typically these focus only on direct OS interactions; indirect interactions are not merged
- Using combinations of the above tools
  - Without better integration, does not allow fine-grained correlation between OS and application
  - Many kernel tools do not explicitly recognize parallel workloads (e.g. MPI ranks)
- Need an integrated approach to parallel performance observation and analysis
KTAU: High-Level Objectives
- Support low-overhead OS performance measurement at multiple levels of function and detail
- Provide both kernel-wide and process-centric perspectives of OS performance
- Merge user-level and kernel-level performance information across all program-OS interactions
- Provide online information and the ability to function without a daemon where possible
- Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
- Leverage existing parallel performance analysis tools
  - Support for observing, collecting and analyzing parallel data
KTAU: Outline
- Introduction
- Motivations
- Objectives
- Architecture / Implementation Choices
- Experimentation: the performance views
- Perturbation Study
- ZeptoOS: KTAU on Blue Gene/L
- Future work and directions
- Acknowledgements
KTAU Architecture
KTAU: Arch. / Impl. Choices
- Instrumentation
  - Static source instrumentation
  - Macro Map-ID: map a block of code and process context to a unique index (dense ID space) for easy array lookup
  - Macros Start, Stop: provide the mapping index; process context is implicit
- Measurement
  - Differentiate between ‘local/self’ and ‘inter-context’ access. HPC codes primarily use ‘self’.
  - Store performance data in PCB (task_struct)
- Integrating kernel/user performance state
  - Don’t assume synchronous kernel entry or process context
  - Have to use memory mapping between kernel and application state
  - Pinning shared state in memory
  - Kernel call groups: program-OS interactions summary
- Analyses and visualization: use TAU facilities
KTAU: Controlled Experiments
- Controlled experiments
  - Exercise the kernel in a controlled fashion
  - Check if KTAU produces the expected correct and meaningful views
- Test machines
  - Neutron: 4-CPU Intel P3 Xeon 550MHz, 1GB RAM, Linux 2.6.14.3 (ktau)
  - Neuronic: 16-node 2-CPU Intel P4 Xeon 2.8GHz, 2GB RAM/node, Redhat Enterprise Linux 2.4 (ktau)
- Benchmarks
  - NPB LU application [NPB]
    - Simulated computational fluid dynamics (CFD) application. A regular-sparse, block lower and upper triangular system solution.
  - LMBENCH [LMBENCH]
    - Suite of micro-benchmarks exercising the Linux kernel
  - A few others not shown (e.g. SKAMPI)
KTAU: Controlled Examples continued… Profiling
KTAU: Controlled Examples continued… Tracing
Merging App / OS Traces
MPI_Send OS Routines
Fine-grained Tracing
Shows detail inside interrupts and bottom halves
Using VAMPIR Trace Visualization [VAMPIR]
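As a rough illustration of what merging application and OS traces involves, the Python sketch below interleaves two time-sorted event lists by timestamp. It is not KTAU's or VAMPIR's trace format: the record layout, event names and timestamps are hypothetical, and it assumes both traces were stamped by the same clock.

```python
import heapq

# Hypothetical trace records: (timestamp_usec, source, event).
# Both lists are assumed to be sorted by timestamp already.
app_trace = [
    (100, "app", "MPI_Send enter"),
    (180, "app", "MPI_Send exit"),
]
os_trace = [
    (110, "os", "sys_writev enter"),
    (130, "os", "tcp_sendmsg enter"),
    (150, "os", "tcp_sendmsg exit"),
    (170, "os", "sys_writev exit"),
]


def merge_traces(*traces):
    """Merge already time-sorted traces into one time-ordered list."""
    return list(heapq.merge(*traces, key=lambda record: record[0]))


merged = merge_traces(app_trace, os_trace)

if __name__ == "__main__":
    for timestamp, source, event in merged:
        print(f"{timestamp:>5} {source:<3} {event}")
```

In the merged order the OS routines appear nested inside the `MPI_Send` interval, which is the kind of view the slide's figure shows.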
KTAU: Controlled Examples continued… Tracing
Correlating CIOD and RPC-IOD Activity
KTAU: Larger-Scale Runs
- Run parallel benchmarks at larger scale (128 dual-CPU nodes)
  - Identify (and remove) system-level performance issues
  - Understand perturbation overheads introduced by KTAU
- NPB benchmark: LU application [NPB]
  - Simulated computational fluid dynamics (CFD) application. A regular-sparse, block lower and upper triangular system solution.
- ASC benchmark: Sweep3D [Sweep3d]
  - Solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh.
- Test machine: Chiba-City Linux cluster (ANL)
  - 128 dual-CPU Pentium III, 450MHz, 512MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet
KTAU: Larger-Scale Runs
- Experienced problems on Chiba by chance
  - Initially ran NPB-LU and Sweep3D codes on 128x1 configuration
  - Then ran on 64x2 configuration
  - Extreme performance hit (72% slower!) with the 64x2 runs
  - Used KTAU views to identify and solve issues iteratively
  - Eventually brought the performance gap down to 13% for LU and 9% for Sweep3D
KTAU: Larger-scale Runs
- Two ranks: relatively very low MPI_Recv() time.
- Two ranks: MPI_Recv() differs from the mean in OS-SCHED.
- Figure panels: User-level MPI_Recv; MPI_Recv OS Interactions
KTAU: Larger-scale Runs
- Two ranks have very low voluntary scheduling durations.
- (Same) two ranks have very large preemptive scheduling.
- Figure panels: Voluntary Scheduling; Preemptive Scheduling (note: x-axis log scale)
KTAU Larger-scale Runs
- NPB LU processes PID:4066, PID:4068 active. No other significant activity!
- Why the preemption?
- 64x2 pinned: interrupt activity bimodal across MPI ranks.
- Figure panels: ccn10 Node-level View; Interrupt Activity
KTAU Larger-scale Runs
- Many more OS-TCP calls
- Approx. 100% longer
- 100% more background OS-TCP activity in the compute phase. More imbalance!
- Use ‘merged’ performance data to identify imbalance. Why does a purely compute-bound region have lots of I/O?
- Figure panels: TCP within Compute: Time; TCP within Compute: Calls
KTAU Larger-scale Runs
- OS-TCP in SMP costlier
- IRQ balancing blindly distributes interrupts and bottom halves.
  - E.g. handling TCP-related BH on CPU-0 for the LU process on CPU-1
  - Cache issues! [COMSWARE]
- Figure panel: Cost / Call of OS-level TCP
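One way to experiment with where interrupts are handled on Linux is the per-IRQ `/proc/irq/<irq>/smp_affinity` file, which takes a hexadecimal CPU bitmask. The Python sketch below only computes the mask and builds the command string; it does not run it (writing the file needs root). The slides do not say which mechanism the authors used, so treat this as an illustration, and note that the IRQ number in the usage line is hypothetical.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string for a set of CPU numbers (bit n = CPU n)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_command(irq, cpus):
    """Build (but do not run) a shell command that would set an IRQ's CPU affinity."""
    return f"echo {cpu_mask(cpus)} > /proc/irq/{irq}/smp_affinity"


if __name__ == "__main__":
    # Hypothetical: steer IRQ 19 to CPU-1, where the process that consumes
    # the network data is assumed to run.
    print(affinity_command(19, [1]))
```

Keeping the bottom-half work on the same CPU as the consuming process is one plausible way to avoid the cross-CPU cache effects the slide points at.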
KTAU Perturbation Study
- Five different configurations
  - Base: vanilla kernel, un-instrumented benchmark
  - Ktau-Off: kernel patched with KTAU and instrumentation compiled in, but all instrumentation turned off (boot-time control)
  - Prof-All: all kernel instrumentation turned on
  - Prof-Sched: only the scheduler subsystem’s instrumentation turned on
  - Prof-All+TAU: Prof-All, but also with user-level TAU instrumentation enabled
- NPB LU application benchmark: 16 nodes, 5 different configurations, mean over 5 runs each
- ASC Sweep3D: 128 nodes, Base and Prof-All+TAU, mean over 5 runs each
- Test machine: Chiba-City (ANL)
KTAU Perturbation Study
- Disabled probe effect. Single instrumentation very cheap, e.g. scheduling.
- Complete integrated profiling cost under 3% on average and as low as 1.58%.

Sweep3D on 128 nodes:

| | Base | Prof-All+TAU |
|---|---|---|
| Elapsed time | 368.25 | 369.9 |
| % avg. slowdown | | 0.49% |
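A relative slowdown of this kind is commonly computed as the extra elapsed time divided by the baseline. The Python sketch below applies that formula to the two Sweep3D elapsed times above. It is an assumption that this is how the slide's percentage was derived: the ratio of the two mean times comes out near 0.45%, slightly below the reported 0.49%, which would be consistent with the slide averaging per-run slowdowns instead (the per-run data is not given).

```python
def percent_slowdown(base, instrumented):
    """Relative slowdown of the instrumented run versus the base run, in percent."""
    return (instrumented - base) / base * 100.0


# Mean elapsed times from the Sweep3D table (units as given on the slide).
sweep3d_slowdown = percent_slowdown(368.25, 369.9)

if __name__ == "__main__":
    print(f"Sweep3D slowdown from mean elapsed times: {sweep3d_slowdown:.2f}%")
```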
KTAU: Future Work
- Dynamic measurement control: enable/disable events without recompilation or reboot
- Improve performance data sources that KTAU can access, e.g. PAPI
- Improve integration with TAU’s user-space capabilities to provide even better correlation of user and kernel performance information
  - Full callpaths, phase-based profiling, merged user/kernel traces
- Integration of TAU, KTAU with Supermon (possibly MRNet?), TAUg (next)
- Porting efforts: IA-64, PPC-64 and AMD Opteron
- ZeptoOS: planned characterization efforts
  - BGL I/O node
  - Dynamically adaptive kernels
Acknowledgements
Prof. Allen D Malony
Dr. Sameer Shende, Senior Scientist
Alan Morris, Senior Software Engineer, PRL
Suravee Suthikulpanit, MS Student (Graduated)
Support Acknowledgements
Department of Energy’s Office of Science (contract no. DE-FG02-05ER25663) and
National Science Foundation (grant no. NSF CCF 0444475)
References
[Petrini’03]: F. Petrini, D. J. Kerbyson, and S. Pakin, “The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q,” in SC ’03.
[Jones’03]: T. Jones et al., “Improving the scalability of parallel jobs by adding parallel awareness to the operating system,” in SC ’03.
[PAPI]: S. Browne et al., “A portable programming interface for performance evaluation on modern processors,” The International Journal of High Performance Computing Applications, vol. 14, no. 3, pp. 189–204, Fall 2000.
[VAMPIR]: W. E. Nagel et al., “VAMPIR: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, no. 1, pp. 69–80, 1996.
[ZeptoOS]: “ZeptoOS: The small Linux for big computers,” http://www.mcs.anl.gov/zeptoos/
[NPB]: D. H. Bailey et al., “The NAS parallel benchmarks,” The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991.
[Sweep3d]: A. Hoisie et al., “A general predictive performance model for wavefront algorithms on clusters of SMPs,” in International Conference on Parallel Processing, 2000.
[LMBENCH]: L. W. McVoy and C. Staelin, “lmbench: Portable tools for performance analysis,” in USENIX Annual Technical Conference, 1996, pp. 279–294.
[TAU]: “TAU: Tuning and Analysis Utilities,” http://www.cs.uoregon.edu/research/paracomp/tau/
[KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, “Early experiences with KTAU on the IBM BG/L,” in EuroPar’06, European Conference on Parallel Processing, 2006.
[KTAU]: A. Nataraj et al., “Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project” (under submission).