Working Group updates, Suite Tests and Scalability, Race conditions,
SSS-OSCAR Releases and Hackerfest
Al Geist
August 17-19, 2005
Oak Ridge, TN
Welcome to Oak Ridge National Lab! First quarterly meeting here.
Demonstration: Faster than Light Computer
Able to calculate the answer before the problem is specified
Scalable Systems Software

Participating Organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SNL, LANL, Ames
Industry: IBM, Cray, Intel, SGI
Problem
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems

Goals
• Collectively (with industry) define standard interfaces between system components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers
Focus areas: Resource Management; Accounting & User Management; System Build & Configure; Job Management; System Monitoring

To learn more visit www.scidac.org/ScalableSystems
Scalable Systems Software Suite (SSS-OSCAR) – any updates to this diagram?

[Suite architecture diagram] Meta Services: Meta Scheduler, Meta Monitor, Meta Manager, and Grid Interfaces. Core components: Accounting, Event Manager, Service Directory, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint/Restart, Hardware Infrastructure Manager, and Validation & Testing. All components are connected through standard XML interfaces with common authentication and communication.

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite.
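As an illustration of this XML-interface integration model, here is a minimal Python sketch of a client querying a service directory for registered components. The element names, host, port, and framing are invented for this sketch and are not the actual SSS wire protocol; real suite components also authenticate through the communication library (ssslib).

import socket
import xml.etree.ElementTree as ET

def query_service_directory(host="localhost", port=5000):
    # Build a small XML request asking which components are registered.
    request = ET.Element("Request", type="Query")
    ET.SubElement(request, "Get", name="Component")
    payload = ET.tostring(request)

    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)
        sock.shutdown(socket.SHUT_WR)   # signal end of request
        reply = b"".join(iter(lambda: sock.recv(4096), b""))

    # Parse the XML reply into (name, location) pairs.
    root = ET.fromstring(reply)
    return [(c.get("name"), c.get("location"))
            for c in root.iter("Component")]

Because the interface is the XML on the wire rather than a language binding, a Perl scheduler and a Python monitor can interoperate as long as both speak the agreed schema.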
Components in Suites
Gold (accounting & allocation), EM (event manager), SD (service directory), Grid scheduler, Warehouse MetaManager, Maui sched, NSM (node state manager), PM (process manager), Usage Reports, Meta Services, Warehouse (Supermon, NWPerf), Bamboo QM (job queue manager), BCM (build & configuration manager), ssslib, BLCR (checkpoint/restart), APITest, HIM (hardware infrastructure manager).
Multiple component implementations exist.
Compliant with PBS and LoadLeveler job scripts.
Scalable Systems Users
Production use today:
• Running an SSS suite at ANL and Ames
• ORNL industrial cluster (soon)
• Running components at PNNL
• Maui w/ SSS API (3000/mo), Moab (Amazon, Ford, TeraGrid, …)

Who can we involve before the end of the project?
• National Leadership-class facility? NLCF is a partnership between ORNL (Cray), ANL (BG), PNNL (cluster)
• NERSC and NSF centers: NCSA cluster(s), NERSC cluster? NCAR BG
Goals for This Meeting
• Updates on the Integrated Software Suite components
• Change in Resource Management Group – Scott Jackson left PNNL
• Planning for SciDAC phase 2 – discuss new directions
• Preparing for next SSS-OSCAR software suite release – what needs to be done at hackerfest?
• Getting more outside users – production use of and feedback on the suite
Since Last Meeting
• FastOS Meeting in DC – any chatting about leveraging our system software?
• SciDAC 2 Meeting in San Francisco – Scalable Systems poster, talk on ISICs; several SSS members there. Anything to report?
• Telecoms and new entries in electronic notebooks – pretty sparse since last meeting
Agenda – August 17
8:00 Continental Breakfast, CSB room B226
8:30 Al Geist – Project Status
9:00 Craig Steffen – Race Conditions in Suite
9:30 Paul Hargrove – Process Management and Monitoring
10:30 Break
11:00 Todd Kordenbrock – Robustness and Scalability Testing
12:00 Lunch (on own at cafeteria)
1:30 Brett Bode – Resource Management components
2:30 Narayan Desai – Node Build, Configure, Cobalt status
3:30 Break
4:00 Craig Steffen – SSSRMAP in ssslib
4:30 Discuss proposal ideas for SciDAC 2; discussion of getting SSS users and feedback
5:30 Adjourn for dinner
Agenda – August 18
8:00 Continental Breakfast
8:30 Thomas Naughton – SSS-OSCAR software releases
9:30 Discussion and voting
• Your name here
10:30 Group discussion of ideas for SciDAC-2
11:30 Discussion of Hackerfest goals; set next meeting date/location
12:00 Lunch (walk over to cafeteria)
1:30 Hackerfest begins, room B226
3:00 Break
5:30 (or whenever) break for dinner
Agenda – August 19
8:00 Continental Breakfast
8:30 Hackerfest continues
12:00 Hackerfest ends
What is going on in SciDAC 2
• Executive Panel
• Five workshops in past 5 weeks
• Preparing a SciDAC 2 program plan at LBNL today!
• ISIC section has words about system software and tools
View to the Future
HW, CS, and Science Teams all contribute to the science breakthroughs.

[Diagram] Breakthrough Science emerges from: SciDAC Science Teams posing high-end science problems; SciDAC CS teams providing tuned codes, software & libraries, and research; a computing environment with a common look & feel across diverse HW; and ultrascale hardware – leadership-class platforms (Rainier, Blue Gene, Red Storm) with their OS/HW teams.
SciDAC Phase 2 and CS ISICs
Future CS ISICs need to be mindful of the needs of:
• The National Leadership Computing Facility, w/ Cray, IBM BG, SGI, clusters, and multiple OSes – no one architecture is best for all applications
• SciDAC Science Teams – needs depend on application areas chosen; end stations? Do they have special SW needs?
• FastOS research projects – complement, don't duplicate these efforts
• The Cray software roadmap – making the Leadership computers usable, efficient, fast
Gaps and potential next steps
• Heterogeneous leadership-class machines – science teams need a robust environment that presents similar programming interfaces and tools across the different machines
• Fault tolerance requirements in apps and systems software, particularly as systems scale up to petascale around 2010
• Support for application users submitting interactive jobs – computational steering as a means of scientific discovery
• High-performance file system and I/O research – increasing demands of security, scalability, and fault tolerance
• Security – one-time passwords and their impact on scientific progress
Heterogeneous Machines
• Heterogeneous architectures – vector, scalar, SMP, hybrids, clusters. How is a science team to know what is best for them?
• Multiple OSes, even within one machine, e.g., Blue Gene, Red Storm. How to effectively and efficiently administer such systems?
• Diverse programming environment – science teams need a robust environment that presents similar programming interfaces and tools across the different machines
• Diverse system management environment – managing and scheduling multiple node types; system updates, accounting, … everything will be harder in round 2
Fault Tolerance
• Holistic fault tolerance – research into schemes that take into account the full impact of faults: application, middleware, OS, and hardware
• Fault tolerance in systems software – research into prediction and prevention; survivability and resiliency when faults cannot be avoided
• Application recovery – transparent failure recovery; research into intelligent checkpointing based on active monitoring, sophisticated rule-based recovery, diskless checkpointing, … (see the sketch after this list)
• For petascale systems, research into recovery w/o checkpointing
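To ground the checkpointing discussion, here is a minimal application-level checkpoint/restart sketch in Python. It is illustrative only: the suite's actual mechanism is BLCR, which checkpoints processes at the kernel level without application changes, and the file name and state layout below are assumptions.

import os
import pickle

CKPT = "state.ckpt"          # hypothetical checkpoint file name

def save_checkpoint(state):
    # Write to a temp file, then rename atomically so a crash
    # mid-write cannot corrupt the previous checkpoint.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    # Resume from the last checkpoint if one exists.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0.0}   # fresh start

state = load_checkpoint()
for step in range(state["step"], 1000):
    state["result"] += step * 1e-6      # stand-in for real computation
    state["step"] = step + 1
    if state["step"] % 100 == 0:        # checkpoint every 100 steps
        save_checkpoint(state)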
Interactive Computing
• Batch jobs are not always the best for science – good for large numbers of users and a wide mix of jobs, but the National Leadership Computing Facility has a different focus
• Computational steering as a paradigm for discovery – break the cycle: simulate, dump results, analyze, rerun simulation (a toy steering loop is sketched after this list); more efficient use of the computer resources
• Needed for application development – scaling studies on terascale systems; debugging applications which only fail at scale
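The contrast with the batch cycle can be made concrete with a toy steering loop in Python: the simulation polls for parameter updates between steps instead of requiring a dump-analyze-rerun round trip. The control-file mechanism and parameter names are invented for illustration.

import json
import os

params = {"viscosity": 1.0}   # hypothetical steerable parameter
CONTROL = "steer.json"        # hypothetical control file a user edits

for step in range(10_000):
    # ... advance the simulation one step using params ...
    if step % 100 == 0 and os.path.exists(CONTROL):
        with open(CONTROL) as f:
            params.update(json.load(f))   # apply the user's new settings
        os.remove(CONTROL)                # consume the update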
File System and I/O Research
• Lustre is today's answer – there are already concerns about its capabilities as systems scale up to 100+ TF
• What is the answer for 2010? Research is needed to explore the file system and I/O requirements of the petascale systems that will be here in 5 years
• I/O continues to be a bottleneck in large systems – hitting the memory access wall on a node; too expensive to scale I/O bandwidth with Teraflops across nodes
• Research needed to understand how to structure applications or modify I/O to allow applications to run efficiently (see the sketch after this list)
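One direction for restructuring application I/O is collective I/O, where processes coordinate their writes so the MPI-IO layer can aggregate many small requests into a few large ones. A minimal sketch with mpi4py, with an arbitrary file name and block size:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

block = np.full(1 << 20, rank % 256, dtype=np.uint8)   # 1 MiB per rank

# All ranks write disjoint, contiguous blocks of one shared file
# in a single coordinated call.
fh = MPI.File.Open(comm, "shared.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * block.nbytes, block)            # collective write
fh.Close()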
Security
• New, stricter access policies at computer centers – attacks on supercomputer centers have gotten worse
• One-time passwords, PIV? Sites are shifting policies, tightening firewalls, going to SecurID tokens
• Impact on scientific progress – collaborations within international teams; foreign nationals' clearance delays; access to data and computational resources
• Advances required in system software – to allow compliance with different site policies and handle the tightest requirements; study how to reduce the impact on scientists
Meeting notes
Al Geist – see slides.
Craig Steffen – Exciting new race condition!
• Nodes go offline – Warehouse doesn't find out quickly enough
• Event manager, scheduler, lots of components affected
• Problem grows linearly with system size
• Order of operations needs to be considered – something we haven't considered before
• Issue can be reduced, but can't be solved
• Good discussion on ways to reduce race conditions (one mitigation idea is sketched below)
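One of the mitigation ideas, having the scheduler distrust stale monitor data, can be sketched in Python. This is a hypothetical illustration, not the fix the group adopted; the class, method names, and threshold are invented.

import time

STALE_AFTER = 5.0    # seconds; assumed freshness bound

class NodeView:
    """Scheduler-side view of node state, fed by monitor events."""

    def __init__(self):
        self.last_report = {}    # node -> (state, timestamp)

    def on_monitor_event(self, node, state):
        # Called when a Warehouse-style monitoring event arrives.
        self.last_report[node] = (state, time.monotonic())

    def schedulable(self, node):
        # Refuse to place work on a node whose last report is stale,
        # shrinking the window in which an offline node still looks up.
        state, ts = self.last_report.get(node, ("unknown", 0.0))
        fresh = (time.monotonic() - ts) < STALE_AFTER
        return state == "up" and fresh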
SSS use at NCSA
• Paul Egli rewrote Warehouse – many new features added
• Now monitoring sessions
• All configuration is dynamic
• Multiple debugging channels
• Sandia user – tested to 1024 virtual nodes
• Web site: http://arrakis.ncsa.uiuc.edu/warehouse/
• New hire full time on SSS
• Lining up T2 scheduling (500 proc)
Meeting notes
Paul Hargrove – Checkpoint Manager (BLCR) status
• AMD64/EM64T port now in beta (crashes some users' machines)
• Recently discovered kernel panic during signal interaction (must fix at hackerfest)
• Next step: process groups/sessions – begins next week
• LRS-XML and events "real soon now"
• Open MPI checkpoint/restart support by SC2005
• Torque integration done at U. Mich. for a PhD thesis (needs hardening)
Process manager – MPD rewrite ("refactoring"); getting a PM stable and working on BG.
Todd Kordenbrock – Scalability and robustness tests
• ESP2 efficiency ratio 0.9173 on 64 nodes
• Scalability – Bamboo 1000-job submission
• Gold (Java version) – reservation slow; Perl version not tested
• Warehouse – up to 1024 nodes
• Maui on 64 nodes (needs more testing)
• Durability – node warm stop: 30 seconds to Maui notification; node warm start: 10 seconds; node cold stop: 30 seconds
Meeting notes
Todd Kordenbrock – testing continued
• Single node failure – good
• Resource hog (stress)
• Resource exhaustion on a service node (Gold fails in logging package)
• Anomalies: Maui, Warehouse, Gold, happy, nsm
To do:
• Test BLCR module
• Retest on larger cluster
• Get latest release of all software and retest
• Write report on results
Meeting notes
Brett Bode – RM status
• New release of components: Bamboo v1.1, Maui 3.2.6p13, Gold 2b2.10.2
• Gold being used on Utah cluster
• SSS suite on several systems at Ames
• New Fountain component – front end for Supermon, Ganglia, etc.
• Demoed new tool called Goanna for looking at Fountain output; has the same interface as Warehouse – could plug right in
• General release of Gold 2.0.0.0 available; new Perl CGI GUI – no Java dependency at all in Gold
• X509 support in Mcom (for Maui and Silver)
• Cluster scheduler – bunch of new features
• Grid scheduler – enabled basic accounting for grid jobs
• Future work – Gary needs to get up to speed on Gold code; make it all work with LRS
Meeting notes
Narayan – LRS conversion status
• All components in the center cloud converted to LRS: Service Directory, Event Manager, BCM stack, Process Manager
• Targeted for SC05 release
• ssslib changeover – completed
• SDK support – completed
Cobalt overview
• SSS suite on Chiba and BG
• Motivations – scalability, flexibility, simplicity, support for research ideas
• Tools included: parallel programming tools
• Porting has been easy – now running on Linux, MacOS, and BG/L
• Only about 5K lines of code; targeted for Cray XT3, X1, ZeptoOS
• Unique features – small-partition support on BG/L, OS spec support
• Agile – swap out components; user and admin requests easier to satisfy
• Running at ANL and NCAR (evaluation at other BG sites); may be running on JAZZ soon
• Future – better scheduler, new platforms, more front ends, better docs