
Page 1: Tools at Scale - Requirements and Experience

Tools at Scale - Requirements and Experience

Mary Zosel, LLNL

ASCI / PSE

ASCI Simulation Development Environment

Tools Project

Prepared for SciComp 2000

La Jolla, Ca.

Aug 14-16, 2000

UCRL: VG - 139702

Page 2: Tools at Scale - Requirements and Experience

Presentation Outline:

Overview of Systems

Requirements for Scale

Experience/Progress in debugging and tuning

Page 3: Tools at Scale - Requirements and Experience

ASCI WHITE
• 8192 P3 CPUs
• NightHawk 2 nodes
• Colony Switch
• 12.3 TF peak
• 160 TB disk
• 28 tractor trailers
• Classified Network

Full system at IBM

120 nodes in new home at LLNL - remainder due late Aug.

Page 4: Tools at Scale - Requirements and Experience

White joins these IBM platforms at LLNL

• 128 cpu - SNOW - (8-way P3 NH 1 nodes - Colony)
  – Experimental software development platform - Unclassified

• 1344 cpu - BLUE - (4-way 604e silver nodes / TB3MX)
  – Production unclassified platform

• 16 cpu - BABY - (4-way 604e silver nodes / TB3MX)
  – Experimental development platform - first stop for new system software

• 64 cpu - ER - (4-way 604e silver nodes / TB3MX)
  – Backup production system “parts” - and experimental software

• 5856 cpu - SKY (3 sectors of 488 silver nodes - connected with TB3MX and 6 HPGN IP routers) - Classified production system

• When White is complete - ~2/3 of SKY will become the unclassified production system

Page 5: Tools at Scale - Requirements and Experience

Why the big machines?

• The purpose of ASCI is new 3-D codes for use in place of testing for Stockpile Certification.

• ASCI program plan calls for a series of application milepost demonstrations of increasingly complex calculations which require the very large platforms.
  – Last year - 1000 cpu requirement
  – This year - 1500 cpu requirement
  – Next year - ~4000 cpu requirement

• Tri-lab resource -> multiple code teams with large scale requirements

Page 6: Tools at Scale - Requirements and Experience

What does this imply for the development environment?

Pressure Stress Pressure

• Deadlines: multiple code teams working against time

• Long Calculations: need to understand and optimize time requirements of each component to plan for production runs

• Large Scale: easy to push past the knee of scalability - and past the Troutbeck US limit of 1024 tasks

• Large Memory: n**2 buffer management schemes hurt (a sketch of the problem follows this list)

• Access Contention: not easy to get large test runs - especially for tool work
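To make the "n**2 buffer management" point concrete, here is a minimal C sketch (not from the original slides; the buffer size and layout are illustrative assumptions) of a per-peer buffering scheme whose aggregate memory grows quadratically with task count:

    #include <mpi.h>
    #include <stdlib.h>

    #define BUF_BYTES (64 * 1024)   /* assumed per-peer staging buffer size */

    int main(int argc, char **argv)
    {
        int ntasks, i;
        char **peer_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        /* One staging buffer per possible peer: memory per task grows
           linearly with ntasks, so the machine-wide total grows as
           ntasks**2 - tolerable at 64 tasks, painful at 8192. */
        peer_buf = (char **) malloc(ntasks * sizeof(char *));
        for (i = 0; i < ntasks; i++)
            peer_buf[i] = (char *) malloc(BUF_BYTES);

        /* ... message exchange using peer_buf[dest] would go here ... */

        for (i = 0; i < ntasks; i++)
            free(peer_buf[i]);
        free(peer_buf);
        MPI_Finalize();
        return 0;
    }

At 8192 tasks and 64 KB per peer, each task would hold 512 MB of staging buffers before computing anything, which is why such schemes hurt at this scale.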

Page 7: Tools at Scale - Requirements and Experience

What Tools are in use?

Staying with standards helps make tools usable

• Languages/Compilers:
  – C, C++, Fortran from both IBM and KAI

• Runtime: OpenMP and MPI (a minimal hybrid sketch follows this list)
  – Production codes not using pvm, shmem, direct LAPI use, etc., and direct use of pthreads is very limited

• Debugging / Tuning:
  – TotalView, LCF, Great Circle, ZeroFault, Guide, Vampir, xprofiler, pmapi / papi, and hopefully new IBM tools
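As a point of reference for the programming model listed above, a minimal hybrid MPI + OpenMP example in C (illustrative only - production ASCI codes are of course far larger) looks like this:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, n = 1000000;
        double local = 0.0, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* OpenMP supplies the on-node, loop-level parallelism ... */
    #pragma omp parallel for reduction(+:local)
        for (i = 0; i < n; i++)
            local += 1.0 / (double) n;

        /* ... and MPI supplies the cross-node communication. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }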

Page 8: Tools at Scale - Requirements and Experience

Debugging --- LLNL Experience

• Users DO want to use the debugger with large # cpus

• There have been lots of frustrations - but there is progress and expectation of further improvements
  – Slow to attach / start … what was hours is now minutes
  – Experience / education helps avoid some problems ...
    • Need large memory settings in ld
    • Now have MP_SYNC_ON_CONNECT off by default
    • Set startup timeouts (MP_TIMEOUT)
  – “Sluggish but tolerable” describes a recent 512 cpu session

• Local feature development aimed at scale ...
  – Subsetting, collapse, shortcuts, filtering, … both CLI and X versions

• Etnus continuing to address scalability

Page 9: Tools at Scale - Requirements and Experience

New Attach Option to get subset of tasks

Page 10: Tools at Scale - Requirements and Experience

Root window collapsed - shows task 4 in a different state.

Same root window opened to show all tasks.

Page 11: Tools at Scale - Requirements and Experience

Example of thumb-screw on msg window

Cycle thru message state

Page 12: Tools at Scale - Requirements and Experience

Performance … status quo is less promising

• MPI scale is an issue - OpenMP reduces the problem

• Understanding thread performance is an issue

• Users DO want to use the tools - this is new
  – They need estimates for their large code runs … (a simple timing sketch follows this list)
  – Is my job running or hung?

• Tools aren’t yet ready for scale - including size-of-code scaling

• Several tools do not support threads

• Problems often not in the user’s code
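A first-cut version of the per-phase estimates users are asking for can be had with MPI_Wtime alone. The sketch below is hypothetical (timed_phase and solve are made-up names): it brackets each phase and reports the slowest task, which also gives a heartbeat for telling "running" from "hung":

    #include <mpi.h>
    #include <stdio.h>

    static void solve(void)
    {
        /* ... an application phase would go here ... */
    }

    /* Time one phase and report the slowest task; the slowest
       task is what sets the pace of the whole run. */
    static void timed_phase(const char *name, void (*phase)(void))
    {
        int rank;
        double t0, dt, dt_max;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        t0 = MPI_Wtime();
        phase();
        dt = MPI_Wtime() - t0;

        MPI_Reduce(&dt, &dt_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%s: %.3f s (slowest task)\n", name, dt_max);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        timed_phase("solve", solve);
        MPI_Finalize();
        return 0;
    }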

Page 13: Tools at Scale - Requirements and Experience

List of sample problems

User observes that …

• … as the number of tasks grows, the code becomes relatively slower and slower. The sum of the CPU time and the system time doesn't add up to wall-clock time – and this missing time is the component growing the fastest. [Diagnosis – bad adaptor software configuration was causing excessive fragmentation and retransmission of MPI messages]

• … unexplained code slow-down from previous runs and nothing in the code has changed. [Diagnosis – orphaned processes on one node slowed down the entire code.]

• … threaded version of code much slower than straight MPI. [Diagnosis – code had many small malloc calls and was serializing through the malloc code.] (A sketch of this pattern follows this list.)

• … certain part of code takes 10 seconds to run while the problem is small – and then after a call to a memory-intensive routine – the same portion of code takes 18 seconds to run. [Diagnosis – not sure – but believed to be memory heap fragmentation causing paging.]

• … job runs faster on Blue (604e system) than it does on Snow (P3 system). [Diagnosis – not yet known – wonder about flow-control default setting].

• … a non-blocking message-test code is taking up to 15 times longer to run on Snow than it does on Blue. [Diagnosis - not yet known - flow control setting doesn’t help.]
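The malloc-serialization problem in the third item above is common enough to be worth a sketch. The code below is illustrative (sizes and counts are made up): it shows the slow pattern and one way out - hoisting the allocation so the threads stay off the heap lock:

    #include <stdlib.h>

    /* Slow: every iteration takes the process-wide heap lock twice,
       so the threads serialize through malloc/free. */
    void slow_version(int n)
    {
        int i;
    #pragma omp parallel for
        for (i = 0; i < n; i++) {
            double *tmp = (double *) malloc(16 * sizeof(double));
            /* ... use tmp ... */
            free(tmp);
        }
    }

    /* Faster: one allocation per thread, reused across iterations. */
    void faster_version(int n)
    {
    #pragma omp parallel
        {
            double *tmp = (double *) malloc(16 * sizeof(double));
            int i;
    #pragma omp for
            for (i = 0; i < n; i++) {
                /* ... use tmp ... */
            }
            free(tmp);
        }
    }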

Page 14: Tools at Scale - Requirements and Experience

What are we doing about this?

• PathForward contracts: KAI/Pallas, Etnus, MSTI

• Infrastructure development: to facilitate new tools / probes
  – supports click-back to source
  – currently QT on DPCL … future???

• Probe components: memory usage, MPI classification (a PMPI-style sketch follows this list)

• Lightweight CoreFile … and Performance Monitors

• External observation … Monitor, PS, VMSTAT …

• Testing new IBM beta tools

• Sys admins starting performance regression database
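The "MPI classification" probes mentioned above can be built on the standard MPI profiling (PMPI) interface. A minimal sketch follows - illustrative only, covering just MPI_Send where a real probe would wrap each call of interest - linked into the application ahead of the MPI library:

    #include <mpi.h>
    #include <stdio.h>

    static double send_time = 0.0;

    /* Intercept MPI_Send and accumulate the time spent inside it;
       the real work is done by the underlying PMPI_Send. */
    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        send_time += MPI_Wtime() - t0;
        return rc;
    }

    /* Report the per-task classification at shutdown. */
    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("task %d: %.6f s in MPI_Send\n", rank, send_time);
        return PMPI_Finalize();
    }

This interposition mechanism is the same one profiling tools such as Vampir rely on to produce breakdowns like the chart on the next page.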

Page 15: Tools at Scale - Requirements and Experience

[Chart: Tool Work In Progress - cumulative time in microseconds (0 to 175,000,000) vs. number of processors (4 to 256), broken down by category: User code, Wait, Send, Irecv, Init, Comm_size, Comm_rank, Bcast, Barrier, Allreduce]

Page 16: Tools at Scale - Requirements and Experience

the faster I go, the behinder I get

… we ARE making progress, but the problems are getting harder and coming in faster ...

It’s a Team Effort

Debugging: Rich Zwakenberg, Karen Warren, Bor Chan
Performance tools: John May, Jeff Vetter, John Gyllenhaal, Chris Chambreau, Mike McCracken
Compiler support: John Engle
MPI related: Linda Stanberry, Bronis de Supinski
System testing: Susan Post
General: Brian Carnes, Mary Zosel
Emeritus: Scott Taylor, John Ranelletti