april 14, 2004 the distributed performance consultant: automated performance diagnosis on 1000s of...

April 14, 2004

The Distributed Performance Consultant:

Automated Performance Diagnosis on 1000s of Processors

Philip C. [email protected]

Computer Sciences DepartmentUniversity of Wisconsin-Madison

Madison, WI 53706USA

© Philip C. Roth 2004 -2- Distributed Performance Consultant

High Performance Computing Today

• Large parallel computing resources– Tightly coupled systems

• Earth Simulator (Japan, 5120 CPUs)• ASCI Purple (LLNL, 12K CPUs planned)• BlueGene/L (LLNL, 128K CPUs planned)

– Clusters• Lightning (LANL, 2816 CPUs)• Aspen Systems (Forecast Systems Lab, 1536 CPUs)

– Grid

• Large applications– ASCI Q job sizes (1/1 to 1/24 2004)

• 257–512 cpus: 19.7%• 513–1024 cpus: 24.6%• 1024–2048 cpus: 16.4%


Tool Front End

d0 d1 d2 d3 dP-4 dP-3 dP-2 dP-1

Barriers to Large-Scale Performance Diagnosis

a0 a1 a2 a3 aP-4 aP-3 aP-2 aP-1

Tool Daemons

App Processes

•Managing performance data volume

•Communicating efficiently between distributed tool components

•Making scalable presentation of data and analysis results


Overcoming Scalability Barriers: Our Approach

1. MRNet: multicast/reduction infrastructure for building scalable tools

2. SStart: strategy for improving tool start-up latency

3. Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications

4. Distributed Data Manager: for efficient, distributed performance data management

5. Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck search results for large-scale applications


The Performance Consultant• Automated performance diagnosis• Search for application performance problems

– Start with global, general experiments (e.g., test CPUbound across all processes)

– Collect performance data using dynamic instrumentation

– Make decisions about truth of each experiment– Refine search: create more specific experiments

based on “true” experiments– Repeat

CPUbound

main

Do_row Do_col

… …

…

c002.cs.wisc.educ001.cs.wisc.edu c128.cs.wisc.edu…

main

…

…

main

…

…

main

…

…


The Performance Consultant

• Works well for sequential and small-scale parallel applications…

• …but front-end is a bottleneck when looking for performance problems in large-scale applications – High data processing load– Limited network bandwidth


Our Approach• MRNet to reduce load for processing global

performance data (e.g., average CPU utilization across all processes)

• But front-end still processes local performance data (e.g., CPU utilization in process 5247 on host blue26.pacific.llnl.gov)

Examine behavior of a specific process on the host that runs the process


c002.cs.wisc.educ001.cs.wisc.edu

CPUbound

c001.cs.wisc.edu

main

myapp{367}

Do_row Do_col

Do_mult

c128.cs.wisc.edu

main

myapp{27549}

Do_row Do_col

Do_mult

main

Do_row Do_col

Do_mult

c002.cs.wisc.edu

main

myapp{4287}

Do_row Do_col

Do_mult

…

… …

…

……

…

…

cham.cs.wisc.edu

c128.cs.wisc.edu

myapp367 myapp4287 myapp27549

Our Approach: Example


The Distributed Performance Consultant (DPC)

• Distributed performance problem search strategy

• Distributed performance data management

• DPC-aware Dynamic Instrumentation management

MRNet MRNet

Front-End

DataManager

LocalPerformanceConsultant

PerformanceConsultant

Daemon

DataCollector

LocalInstrumentation

Manager

LocalData

Manager

Data

Control






MRNet MRNet

Front-End

DataManager



Daemon

DataCollector


Manager

LocalData

Manager

Data

Control


DPC: Search Strategy• Local Performance

Consultant agents (LPCs) in each daemon, Global PC in front-end

• GPC controls overall search, manage global experiments

• LPCs control portion of search specific to their local host

MRNet MRNet

Front-End

DataManager



Daemon

DataCollector


Manager

LocalData

Manager

Data

Control


DPC: Search Strategy• Like traditional PC, DPC begins search

with global, general experiments (e.g., CPUbound across all processes)

• When search is refined to specific host, GPC delegates that portion of the search to LPC on that host

• LPC examines local behavior of all processes on the local host, returns results to GPC






MRNet MRNet

Front-End

DataManager



Daemon

DataCollector


Manager

LocalData

Manager

Data

Control


DPC: Performance Data Management

• Local Data Managers (LDMs) in each daemon, Global Data Manager in front-end

• Publish/subscribe model for data flow

• MRNet for global data aggregation

• Possible caching at LDMs (no caching in initial implementation)

MRNet MRNet

Front-End

DataManager



Daemon

DataCollector


Manager

LocalData

Manager

Data

Control






MRNet MRNet

Front-End

DataManager



Daemon

DataCollector


Manager

LocalData

Manager

Data

Control


DPC: Instrumentation Management• Local Instrumentation

Manager (LIM) services requests from GPC and LPC without starving either

• Instrumentation management tailored to distributed nature of DPC– New model for

instrumentation cost– New policy for scheduling

instrumentation

MRNet MRNet

Front-End

DataManager



Daemon

DataCollector


Manager

LocalData

Manager

Data

Control


DPC: Instrumentation Management

• Mismatch between centralized control of instrumentation cost and distributed nature of DPC

• DPC’s instrumentation management– New model for cost of Dynamic

Instrumentation– New policy for scheduling Dynamic

Instrumentation requests


DPC: Instrumentation Cost Model

• User-specified cost threshold, broadcast to each Local Instrumentation Manager (LIM)

• Each LIM restricts active instrumentation to cost threshold

• GPC throttles instrumentation requests when it does not receive timely responses to instrumentation requests (avoids need to gather instrumentation cost from all LIMs to front-end)


DPC: Instrumentation Scheduling Policy

• LIM needs to schedule instrumentation requests from both LPC and GPC– GPC requests must be satisfied by LIM in all

daemons to be useful– LPC requests are restricted to a single daemon

• Prototype LIM’s instrumentation scheduling policy chosen to ensure both local and global searches make progress– Distinguishes instrumentation cost by

instrumentation type (local vs. global)– If both types are present, guarantee a percentage of

total instrumentation cost threshold for each type (percentage is user-configurable)


DPC: Status• Implementation well underway, but

prototype implementation not yet complete– Daemons have been multithreaded– Local Data Managers in place– Local Performance Consultant agents in

progress

• Planning experiments to evaluate overall scalability with various instrumentation cost models and scheduling policies


SummaryThe DPC is a major part of our approach for overcoming barriers that limit the scalability of automated performance diagnosis tools.

Using a distributed search strategy, distributed performance data manager, and instrumentation management techniques tailored to its distributed nature, the DPC will greatly increase the scalability of Paradyn’s automated performance diagnosis functionality.

http://[email protected]

april 14, 2004 the distributed performance consultant: automated performance diagnosis on 1000s of...

Documents

performance bottlenecks

local performance data

process philip

usa philip

analysis results philip

bottleneck search results

cpus plannedbluegenel

7q31024 cpus