
Page 1: System-level Performance Management


System-level Performance Management

Ken McDonell
Engineering Manager, CSBU
[email protected]

Page 2: System-level Performance Management

Overview

• Status quo for system-level performance monitoring and management in Linux.

• Factors conspiring to change this.

• Features of a desirable solution.

• Porting considerations.

• Support for distributed processing environments.

Page 3: System-level Performance Management

Influence of Linux Philosophies

• Anti-bloat mantra … available instrumentation is very sparse.

• 1-2 processor design center … many hard problems are off the radar screen.

• Developer-centric view leads to terse tools … and making them more like sar is not innovative.

• The /proc/stat model is both good and bad (see the sketch after this list).

• Bias towards running tools on the system under investigation.
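
For concreteness, the raw counters behind the /proc/stat model can be read with a handful of lines of C. The sketch below is not part of any tool discussed in the talk; it parses only the first four CPU fields, since the column layout varies across kernel versions, and the values are cumulative jiffies since boot.

```c
/* Minimal sketch: reading the aggregate CPU counters that the
 * /proc/stat model exposes.  Only user/nice/system/idle are parsed
 * because later columns depend on the kernel version. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/stat", "r");
    unsigned long long user, nice, system, idle;

    if (fp == NULL) {
        perror("/proc/stat");
        return 1;
    }
    if (fscanf(fp, "cpu %llu %llu %llu %llu",
               &user, &nice, &system, &idle) == 4)
        printf("user=%llu nice=%llu system=%llu idle=%llu (jiffies)\n",
               user, nice, system, idle);
    fclose(fp);
    return 0;
}
```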

Page 4: System-level Performance Management

Challenges to the Status Quo

• Linux deployment on larger platforms.

• Linux deployment in production environments.

• Cluster and federated server configurations.

• More complex application architectures.

• Focus shift from kernel performance:
  – application performance is key
  – quality of service matters
  – system-level performance management

Page 5: System-level Performance Management

Large Systems Influences

• There may be a lot of data, e.g. a large (128-processor) server yields 1000+ metrics and 30,000+ values from the platform and O/S.

• Data comes from the hardware, the operating system, the service layers, the libraries and the applications.

• Clustered and distributed architectures compound the difficulties.

• All of the data is needed at some time, but only a small part is needed for each specific problem.

Page 6: System-level Performance Management

Production Environment Influences

• Something is broken all of the time.

• Cyclic patterns of workload and demand.

• Transients are common.

• Service-level agreements are written in terms of performance as seen by an end-user.

• Environmental evolution changes the assumptions, rules and bottlenecks, e.g. upgrades, workload, filesystem age, re-organization.

Page 7: System-level Performance Management

Neanderthal Approaches

Making the Problem Harder

• Tool and data islands: ownership, functional, temporal and geographic domains.

• Primitive filtering and information presentation.

• Protocols and UIs that are not scalable.

• Emphasis on tools rather than toolkits.

• Very little automated monitoring that is useful for the hard problems.

Page 8: System-level Performance Management

Features of a Desirable Export Infrastructure

• Low overhead and small perturbation.

• Unified API for all performance data (see the sketch after this list).

• Extensible (plug-in) architecture to accommodate new sources of performance data.

• Sufficient metadata to allow evolution and change.

• Support for remote access to performance data.

• Platform neutral protocols & data formats.
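
As a rough illustration of what a unified API with metadata and remote access can look like, the sketch below uses the Performance Co-Pilot PMAPI, assuming libpcp and its headers are installed; the host name, metric name and minimal error handling are illustrative only.

```c
/* Sketch only: fetch one metric and its metadata through PCP's PMAPI.
 * Assumes libpcp is installed (compile with -lpcp); the host name,
 * metric name and minimal error handling are illustrative. */
#include <pcp/pmapi.h>
#include <stdio.h>

int main(void)
{
    const char *names[] = { "kernel.all.load" };
    pmID pmid;
    pmDesc desc;
    pmResult *result;
    int sts;

    /* PM_CONTEXT_HOST talks to a (possibly remote) pmcd;
     * PM_CONTEXT_ARCHIVE would replay a recorded archive instead */
    if ((sts = pmNewContext(PM_CONTEXT_HOST, "localhost")) < 0) {
        fprintf(stderr, "pmNewContext: %s\n", pmErrStr(sts));
        return 1;
    }
    if ((sts = pmLookupName(1, names, &pmid)) < 0 ||
        (sts = pmLookupDesc(pmid, &desc)) < 0 ||       /* metadata */
        (sts = pmFetch(1, &pmid, &result)) < 0) {      /* values   */
        fprintf(stderr, "fetch failed: %s\n", pmErrStr(sts));
        return 1;
    }
    printf("%s: %d value(s), type=%d, units=%s\n",
           names[0], result->vset[0]->numval,
           desc.type, pmUnitsStr(&desc.units));
    pmFreeResult(result);
    return 0;
}
```

Because only the context type distinguishes a live pmcd from a recorded archive, the same client code serves both real-time and retrospective analysis, and monitoring a remote host needs nothing more than a different host name.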

Page 9: System-level Performance Management

Plug-in Collector and Client-Server Architecture

Page 10: System-level Performance Management

Features of a Desirable Performance Tool Environment

• Complement, not displace, simple tools.

• The same tools for both real-time and retrospective analysis.

• Visualization and drill-down user navigation.

• Remote and multi-host monitoring.

• Toolkits not tools.

• Smarter reasoning about performance data.

Page 11: System-level Performance Management

2-D Performance Visualization

Page 12: System-level Performance Management

3-D Performance Visualization

Page 13: System-level Performance Management

3-D Visualization of Platform Performance

Page 14: System-level Performance Management

3-D Visualization of Application Performance

Page 15: System-level Performance Management

Reasoning About Performance Data

Thresholds are not enough

• Need quantification predicates: existential, universal, percentile, temporal, instantial (see the sketch after this list).

• Multi-source predicates for client-server and distributed applications.

• Retrospection is essential.

• Customized alarms and notification.
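
To make the contrast with simple thresholds concrete, the sketch below shows existential, universal/temporal and percentile predicates evaluated over a window of recent samples. It is plain C rather than the syntax of any particular inference engine, and the host count, window size and thresholds are arbitrary.

```c
/* Conceptual sketch: quantification predicates over a window of
 * recent samples.  N_HOSTS, WINDOW and the sample data are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define N_HOSTS 4
#define WINDOW  12      /* samples of history kept per host */

/* existential: true if any host exceeds the threshold right now */
bool some_host(const double now[N_HOSTS], double thresh)
{
    for (int h = 0; h < N_HOSTS; h++)
        if (now[h] > thresh)
            return true;
    return false;
}

/* universal + temporal: true if a host exceeded the threshold in
 * every one of its last WINDOW samples */
bool all_samples(const double hist[WINDOW], double thresh)
{
    for (int s = 0; s < WINDOW; s++)
        if (hist[s] <= thresh)
            return false;
    return true;
}

/* percentile: true if at least pct% of the last WINDOW samples
 * exceeded the threshold */
bool pct_samples(const double hist[WINDOW], double thresh, double pct)
{
    int over = 0;
    for (int s = 0; s < WINDOW; s++)
        if (hist[s] > thresh)
            over++;
    return 100.0 * over / WINDOW >= pct;
}

int main(void)
{
    double load_now[N_HOSTS] = { 0.2, 0.9, 0.1, 0.3 };
    double cpu_hist[WINDOW]  = { 0.7, 0.8, 0.9, 0.6, 0.7, 0.8,
                                 0.9, 0.7, 0.6, 0.8, 0.9, 0.7 };

    printf("some host loaded:   %d\n", some_host(load_now, 0.8));
    printf("busy all samples:   %d\n", all_samples(cpu_hist, 0.5));
    printf("busy 75%% of window: %d\n", pct_samples(cpu_hist, 0.5, 75.0));
    return 0;
}
```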

Page 16: System-level Performance Management

Performance Co-Pilot Porting History

• Initial development for IRIX

• 1994 Linux experiments

• 1995-96 HP/UX port

• 1998 NT port

• 1998-99 Linux port

Page 17: System-level Performance Management

Performance Co-Pilot Porting

Some things that did not help

• For efficiency and historical reasons we’d chosen to avoid XDR and SNMP.

• HP/UX secrets.

• Lack of instrumentation in the Linux kernel.

• Tool frameworks used for IRIX development are not universally available, e.g. Motif, ViewKit, OpenInventor, XRT.

Page 18: System-level Performance Management

Performance Co-Pilot Porting

Some things that did help

• Programmer discipline.

• Obsessive attitude to automated QA.

• Orthogonal functionality, especially for APIs.

• Monitoring tools that are predominantly shell scripts in front of a small number of generic applications (the “toolkit” approach).

Page 19: System-level Performance Management

A Linux Performance Monitoring Architecture

[Diagram: the Linux kernel exposes its instrumentation through procfs and /proc/stat; a linux PMDA reads that data and exports it to pmcd.]
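
A collector plug-in (PMDA) for this architecture can be very small. The sketch below outlines a daemon PMDA written against a current libpcp_pmda; the domain number, log file name, single hard-coded counter metric and the missing name-space file are illustrative simplifications, and a real linux PMDA would read /proc instead of incrementing a counter.

```c
/* Sketch of a tiny daemon PMDA, the kind of plug-in that feeds pmcd
 * in the diagram above.  Written against a current libpcp_pmda; all
 * identifiers and numbers here are placeholders. */
#include <pcp/pmapi.h>
#include <pcp/pmda.h>

static pmdaMetric metrictab[] = {
    /* one metric: cluster 0, item 0, 32-bit counter, no instances */
    { NULL, { PMDA_PMID(0, 0), PM_TYPE_U32, PM_INDOM_NULL,
              PM_SEM_COUNTER, PMDA_PMUNITS(0, 0, 1, 0, 0, PM_COUNT_ONE) } },
};

/* called by the PMDA library for each metric in a fetch request */
static int
myfetch(pmdaMetric *mdesc, unsigned int inst, pmAtomValue *atom)
{
    static unsigned int count;

    (void)mdesc; (void)inst;
    atom->ul = ++count;         /* a real PMDA would read /proc, hardware, ... */
    return 1;                   /* one value placed in atom */
}

int
main(int argc, char **argv)
{
    pmdaInterface dispatch;

    (void)argc;
    /* 123 is a placeholder performance metrics domain number */
    pmdaDaemon(&dispatch, PMDA_INTERFACE_2, argv[0], 123,
               "mypmda.log", NULL /* no help text for this sketch */);
    pmdaSetFetchCallBack(&dispatch, myfetch);
    pmdaInit(&dispatch, NULL, 0, metrictab, 1);
    pmdaConnect(&dispatch);     /* register with pmcd */
    pmdaMain(&dispatch);        /* service requests from pmcd forever */
    return 0;
}
```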

Page 20: System-level Performance Management

A Beowulf Perf Monitoring Architecture - Node View

[Diagram: as in the previous slide, the linux PMDA exports kernel data from procfs and /proc/stat to pmcd; a second beowulf PMDA exports metrics from the cluster infrastructure to the same pmcd.]

Page 21: System-level Performance Management

A Beowulf Perf Monitoring Architecture - Application View

[Diagram: the same node with a third agent added; “my application” exports its own metrics through “my pmda” to pmcd, alongside the linux and beowulf PMDAs.]

Page 22: System-level Performance Management

A Beowulf Perf Monitoring Architecture - Cluster View

[Diagram: Node A, Node B and Node C each run a pmcd; a monitor connects to all three pmcds for a cluster-wide view.]
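
A monitor along these lines can hold one PMAPI context per node and switch between them before each fetch. The fragment below is a sketch only; the node names are placeholders.

```c
/* Sketch: one PMAPI context per cluster node, matching the cluster
 * view above.  Node names are placeholders. */
#include <pcp/pmapi.h>
#include <stdio.h>

int main(void)
{
    const char *nodes[] = { "nodeA", "nodeB", "nodeC" };
    int ctx[3];

    for (int i = 0; i < 3; i++) {
        ctx[i] = pmNewContext(PM_CONTEXT_HOST, nodes[i]);
        if (ctx[i] < 0)
            fprintf(stderr, "%s: %s\n", nodes[i], pmErrStr(ctx[i]));
    }
    /* later, pmUseContext(ctx[i]) selects which node's pmcd the
     * next pmFetch() talks to */
    return 0;
}
```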

Page 23: System-level Performance Management

Some Concluding Comments

• System-level performance management for large systems is a hard problem.

• Simple solutions do not exist.

• Need an extensible collection architecture.

• Monitoring tools should provide centralized control for distributed processing.

• Retrospection is not optional.

• Linux offers real opportunities for “better” solutions in this area.