April 14, 2004

The Distributed Performance Consultant:

Automated Performance Diagnosis on 1000s of Processors

Philip C. Roth
pcroth@cs.wisc.edu

Computer Sciences Department
University of Wisconsin-Madison

Madison, WI 53706
USA


© Philip C. Roth 2004 -2- Distributed Performance Consultant

High Performance Computing Today

• Large parallel computing resources
  – Tightly coupled systems
    • Earth Simulator (Japan, 5120 CPUs)
    • ASCI Purple (LLNL, 12K CPUs planned)
    • BlueGene/L (LLNL, 128K CPUs planned)
  – Clusters
    • Lightning (LANL, 2816 CPUs)
    • Aspen Systems (Forecast Systems Lab, 1536 CPUs)
  – Grid
• Large applications
  – ASCI Q job sizes (1/1 to 1/24 2004)
    • 257–512 CPUs: 19.7%
    • 513–1024 CPUs: 24.6%
    • 1024–2048 CPUs: 16.4%


Barriers to Large-Scale Performance Diagnosis

[Figure: Tool Front End, Tool Daemons d0 … dP-1, and App Processes a0 … aP-1]

• Managing performance data volume
• Communicating efficiently between distributed tool components
• Making scalable presentation of data and analysis results


Overcoming Scalability Barriers: Our Approach

1. MRNet: multicast/reduction infrastructure for building scalable tools

2. SStart: strategy for improving tool start-up latency

3. Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications

4. Distributed Data Manager: for efficient, distributed performance data management

5. Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck search results for large-scale applications


The Performance Consultant

• Automated performance diagnosis
• Search for application performance problems
  – Start with global, general experiments (e.g., test CPUbound across all processes)
  – Collect performance data using dynamic instrumentation
  – Make decisions about truth of each experiment
  – Refine search: create more specific experiments based on “true” experiments
  – Repeat

[Figure: search tree rooted at CPUbound, refined through main into Do_row and Do_col, and across hosts c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu]
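The search loop on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not Paradyn's implementation: `Experiment`, `search`, the call-tree table, and the CPU fractions are all hypothetical, and a fixed lookup table stands in for data that the real tool collects with dynamic instrumentation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical; this is not Paradyn's API.
Focus = Tuple[str, ...]  # path into the call graph, e.g. ("main", "Do_row")


@dataclass(frozen=True)
class Experiment:
    hypothesis: str  # e.g. "CPUbound"
    focus: Focus     # part of the program the hypothesis is tested on


def search(root: Experiment,
           measure: Callable[[Experiment], float],
           children: Callable[[Focus], List[Focus]],
           threshold: float) -> List[Experiment]:
    """Return every experiment found true, in the order it was tested."""
    found: List[Experiment] = []
    pending = [root]
    while pending:
        exp = pending.pop(0)
        if measure(exp) > threshold:  # the experiment is "true"
            found.append(exp)
            # Refine: same hypothesis, more specific foci.
            pending.extend(Experiment(exp.hypothesis, f)
                           for f in children(exp.focus))
    return found


# Made-up call tree and CPU fractions, standing in for measured data.
CALL_TREE: Dict[Focus, List[Focus]] = {
    (): [("main",)],
    ("main",): [("main", "Do_row"), ("main", "Do_col")],
}
CPU_FRACTION: Dict[Focus, float] = {
    (): 0.9,
    ("main",): 0.9,
    ("main", "Do_row"): 0.7,
    ("main", "Do_col"): 0.1,
}

found = search(Experiment("CPUbound", ()),
               measure=lambda e: CPU_FRACTION[e.focus],
               children=lambda f: CALL_TREE.get(f, []),
               threshold=0.2)
```

With these invented numbers the search refines into `Do_row` and stops at `Do_col`, whose experiment is false.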


The Performance Consultant

• Works well for sequential and small-scale parallel applications…

• …but front-end is a bottleneck when looking for performance problems in large-scale applications
  – High data processing load
  – Limited network bandwidth


Our Approach

• MRNet to reduce load for processing global performance data (e.g., average CPU utilization across all processes)
• But front-end still processes local performance data (e.g., CPU utilization in process 5247 on host blue26.pacific.llnl.gov)

Examine behavior of a specific process on the host that runs the process
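To see why a reduction network lowers the front-end's load for global data, consider averaging CPU utilization through a tree. The sketch below is not MRNet's API. `tree_average` and `combine` are hypothetical names, and the tree is simulated in one process. The point it illustrates is that each internal node forwards a single (sum, count) pair, so the front-end combines at most `fanout` values instead of one per process.

```python
from typing import List, Tuple

# Hypothetical sketch; MRNet's real interface is not shown here.
Partial = Tuple[float, int]  # (sum of samples, number of samples)


def combine(parts: List[Partial]) -> Partial:
    """Reduction applied at each internal node of the tree."""
    return (sum(p[0] for p in parts), sum(p[1] for p in parts))


def tree_average(samples: List[float], fanout: int) -> float:
    """Average the samples by reducing them level by level."""
    if not samples:
        raise ValueError("need at least one sample")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level: List[Partial] = [(v, 1) for v in samples]
    while len(level) > 1:
        # Each group of `fanout` children is merged by one parent node.
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    total, count = level[0]
    return total / count


avg = tree_average([0.5, 1.0, 0.0, 0.5], fanout=2)
```

Carrying the count alongside the sum keeps the result correct even when the tree is unbalanced.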


Our Approach: Example

[Figure: the CPUbound search tree split across hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu, with per-process subtrees (main; Do_row, Do_col; Do_mult) for myapp{367} on c001.cs.wisc.edu, myapp{4287} on c002.cs.wisc.edu, and myapp{27549} on c128.cs.wisc.edu]


The Distributed Performance Consultant (DPC)

• Distributed performance problem search strategy
• Distributed performance data management
• DPC-aware Dynamic Instrumentation management

[Figure: DPC architecture. Front-End (Performance Consultant, Data Manager) and Daemon (Local Performance Consultant, Local Data Manager, Local Instrumentation Manager, Data Collector) connected through MRNet, with separate arrows for Data and Control]


DPC: Search Strategy

• Local Performance Consultant agents (LPCs) in each daemon, Global PC (GPC) in front-end
• GPC controls overall search, manages global experiments
• LPCs control portion of search specific to their local host


DPC: Search Strategy

• Like traditional PC, DPC begins search with global, general experiments (e.g., CPUbound across all processes)
• When search is refined to specific host, GPC delegates that portion of the search to LPC on that host
• LPC examines local behavior of all processes on the local host, returns results to GPC
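The delegation step can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: the class and method names are hypothetical, the CPU values are invented, and a direct method call stands in for the message the GPC would send to a daemon. Host and process names are borrowed from the earlier example slide.

```python
from typing import Dict, List


class LocalPerformanceConsultant:
    """Hypothetical LPC: sees only the processes on its own host."""

    def __init__(self, host: str, cpu_by_process: Dict[str, float]) -> None:
        self.host = host
        self.cpu_by_process = cpu_by_process

    def refine(self, threshold: float) -> List[str]:
        # Examine every local process and report those found CPU bound.
        return sorted(p for p, cpu in self.cpu_by_process.items()
                      if cpu > threshold)


class GlobalPerformanceConsultant:
    """Hypothetical GPC: runs global experiments, delegates per host."""

    def __init__(self, lpcs: List[LocalPerformanceConsultant]) -> None:
        self.lpcs = {lpc.host: lpc for lpc in lpcs}

    def search(self, global_cpu: float, cpu_by_host: Dict[str, float],
               threshold: float) -> Dict[str, List[str]]:
        if global_cpu <= threshold:  # global experiment is false: stop
            return {}
        # Delegate only for hosts whose host-level experiment is true.
        return {host: self.lpcs[host].refine(threshold)
                for host, cpu in cpu_by_host.items() if cpu > threshold}


gpc = GlobalPerformanceConsultant([
    LocalPerformanceConsultant("c001.cs.wisc.edu", {"myapp{367}": 0.9}),
    LocalPerformanceConsultant("c002.cs.wisc.edu", {"myapp{4287}": 0.1}),
    LocalPerformanceConsultant("c128.cs.wisc.edu", {"myapp{27549}": 0.8}),
])
results = gpc.search(
    global_cpu=0.6,
    cpu_by_host={"c001.cs.wisc.edu": 0.9, "c002.cs.wisc.edu": 0.1,
                 "c128.cs.wisc.edu": 0.8},
    threshold=0.5)
```

Per-process data never reaches the GPC in this sketch; only the LPC's conclusions do, which is the load reduction the slide describes.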


DPC: Performance Data Management

• Local Data Managers (LDMs) in each daemon, Global Data Manager in front-end
• Publish/subscribe model for data flow
• MRNet for global data aggregation
• Possible caching at LDMs (no caching in initial implementation)
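A publish/subscribe data flow at a Local Data Manager might look like the sketch below. The slide gives no interface, so everything here is an assumption: the class shape, the `subscribe` and `publish` methods, and the string stream keys are hypothetical. Matching the initial implementation described above, nothing is cached, so a sample published with no subscribers is simply dropped.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[float], None]


class LocalDataManager:
    """Hypothetical LDM: routes samples to subscribers, keeps no cache."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, stream: str, callback: Callback) -> None:
        self._subscribers[stream].append(callback)

    def publish(self, stream: str, value: float) -> int:
        """Deliver one sample; return how many subscribers received it."""
        callbacks = self._subscribers.get(stream, [])
        for callback in callbacks:
            callback(value)
        return len(callbacks)


ldm = LocalDataManager()
seen_by_lpc: List[float] = []     # consumer on the same host
sent_upstream: List[float] = []   # stand-in for data forwarded to the front-end

ldm.subscribe("cpu/myapp{367}", seen_by_lpc.append)
ldm.subscribe("cpu/myapp{367}", sent_upstream.append)

delivered = ldm.publish("cpu/myapp{367}", 0.75)
dropped = ldm.publish("cpu/myapp{4287}", 0.5)  # no subscribers, no cache
```

A data collector only needs to know the stream key, not who consumes the data, so a local consumer and an upstream one can share a stream.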


DPC: Instrumentation Management

• Local Instrumentation Manager (LIM) services requests from GPC and LPC without starving either
• Instrumentation management tailored to distributed nature of DPC
  – New model for instrumentation cost
  – New policy for scheduling instrumentation


DPC: Instrumentation Management

• Mismatch between centralized control of instrumentation cost and distributed nature of DPC
• DPC’s instrumentation management
  – New model for cost of Dynamic Instrumentation
  – New policy for scheduling Dynamic Instrumentation requests


DPC: Instrumentation Cost Model

• User-specified cost threshold, broadcast to each Local Instrumentation Manager (LIM)

• Each LIM restricts active instrumentation to cost threshold

• GPC throttles instrumentation requests when it does not receive timely responses to instrumentation requests (avoids need to gather instrumentation cost from all LIMs to front-end)
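The two halves of this cost model can be sketched separately. The sketch rests on several assumptions the slide does not confirm: the class names, the additive cost accounting, deferring (rather than rejecting) requests that do not fit, and the fixed response timeout are all hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
from typing import Dict, List, Tuple


class LocalInstrumentationManager:
    """Hypothetical LIM: keeps active cost at or below the threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # user-specified, broadcast to every LIM
        self.active_cost = 0.0
        self.deferred: List[Tuple[str, float]] = []

    def request(self, name: str, cost: float) -> bool:
        if self.active_cost + cost <= self.threshold:
            self.active_cost += cost
            return True
        self.deferred.append((name, cost))  # assumed: wait, do not reject
        return False


class GlobalThrottle:
    """Hypothetical GPC-side throttle driven only by response timeliness."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.outstanding: Dict[str, float] = {}  # request id -> send time

    def sent(self, request_id: str, now: float) -> None:
        self.outstanding[request_id] = now

    def answered(self, request_id: str) -> None:
        self.outstanding.pop(request_id, None)

    def may_send(self, now: float) -> bool:
        # Hold back new requests while any response is overdue.
        return all(now - t <= self.timeout for t in self.outstanding.values())


lim = LocalInstrumentationManager(threshold=1.0)
first = lim.request("cpu_main", 0.5)
second = lim.request("cpu_do_row", 0.25)
third = lim.request("cpu_do_col", 0.5)  # would exceed the threshold

throttle = GlobalThrottle(timeout=2.0)
throttle.sent("req-1", now=0.0)
ok_early = throttle.may_send(now=1.0)
ok_late = throttle.may_send(now=5.0)
throttle.answered("req-1")
ok_after = throttle.may_send(now=5.0)
```

The throttle never learns any LIM's actual cost. Late responses are its only signal, which is what lets the front-end avoid gathering cost data from every daemon.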


DPC: Instrumentation Scheduling Policy

• LIM needs to schedule instrumentation requests from both LPC and GPC
  – GPC requests must be satisfied by LIM in all daemons to be useful
  – LPC requests are restricted to a single daemon
• Prototype LIM’s instrumentation scheduling policy chosen to ensure both local and global searches make progress
  – Distinguishes instrumentation cost by instrumentation type (local vs. global)
  – If both types are present, guarantee a percentage of total instrumentation cost threshold for each type (percentage is user-configurable)
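One way to realize the guaranteed-share policy is sketched below. The slide states only that each type is guaranteed a percentage of the threshold when both are present; the rest is assumed for illustration. In particular, the function name `schedule`, the first-come admission order, and treating the guarantee as a fixed partition of the threshold (a simplification, since a guarantee is only a minimum) are not from the source.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, str, float]  # (kind, name, cost); kind is "local" or "global"


def schedule(pending: List[Request], threshold: float,
             global_share: float) -> List[str]:
    """Return the names of the requests admitted, in arrival order."""
    kinds = {kind for kind, _, _ in pending}
    if kinds == {"local", "global"}:
        # Both types present: split the threshold by the configured share.
        budget: Dict[str, float] = {
            "global": threshold * global_share,
            "local": threshold * (1.0 - global_share),
        }
    else:
        # Only one type present: it may use the whole threshold.
        budget = {kind: threshold for kind in kinds}
    admitted: List[str] = []
    for kind, name, cost in pending:
        if cost <= budget[kind]:
            budget[kind] -= cost
            admitted.append(name)
    return admitted


mixed = schedule(
    [("global", "g1", 0.5), ("global", "g2", 0.25),
     ("local", "l1", 0.25), ("local", "l2", 0.5)],
    threshold=1.0, global_share=0.5)

global_only = schedule(
    [("global", "g1", 0.5), ("global", "g2", 0.25)],
    threshold=1.0, global_share=0.5)
```

In the mixed case each search admits work even though the other has more queued, so neither is starved. With a single type present, the same requests fit under the full threshold.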


DPC: Status

• Implementation well underway, but prototype implementation not yet complete
  – Daemons have been multithreaded
  – Local Data Managers in place
  – Local Performance Consultant agents in progress
• Planning experiments to evaluate overall scalability with various instrumentation cost models and scheduling policies


Summary

The DPC is a major part of our approach for overcoming barriers that limit the scalability of automated performance diagnosis tools.

Using a distributed search strategy, distributed performance data manager, and instrumentation management techniques tailored to its distributed nature, the DPC will greatly increase the scalability of Paradyn’s automated performance diagnosis functionality.

http://[email protected]