TRANSCRIPT
GSFC NCCS
NCCS User Forum
25 September 2008
Agenda
Welcome & Introduction: Phil Webster, CISTO Chief; Scott Wallace, CSC PM
Current System Status: Fred Reitz, Operations Manager
Compute Capabilities at the NCCS: Dan Duffy, Lead Architect
Questions and Comments: Phil Webster, CISTO Chief
User Services Updates: Bill Ward, User Services Lead
SMD Allocations: Sally Stemwedel, HEC Allocation Specialist
• Large scale HEC computing cluster and on-line storage
• Comprehensive toolsets for job scheduling and monitoring
• Large capacity storage
• Tools to manage and protect data
• Data migration support
• Help Desk
• Account/Allocation support
• Computational science support
• User teleconferences
• Training & tutorials
• Interactive analysis environment
• Software tools for image display
• Easy access to data archive
• Specialized visualization support
• Capability to share data & results
• Supports community-based development
• Facilitates data distribution and publishing
• Code repository for collaboration
• Environment for code development and test
• Code porting and optimization support
• Web based tools
• Internal high speed interconnects for HEC components
• High-bandwidth to NCCS for GSFC users
• Multi-gigabit network supports on-demand data transfers
NCCS Support Services
Service areas: HEC Compute; Data Archival and Stewardship; Code Development & Collaboration; Analysis & Visualization; User Services; Data Sharing; High Speed Networks
A global file system (DATA) enables data access for the full range of modeling and analysis activities.
Resource Growth at the NCCS
[Chart: NCCS compute capacity (Tflop/s, 0–100), June 2005 through December 2009, for Halem, Palm, Explore, and Discover (Base and SCU1–SCU5).]
NCCS Staff Transitions
∙ New Govt. Lead Architect: Dan Duffy, (301) 286-8830, [email protected]
∙ New CSC Lead Architect: Jim McElvaney, (301) 286-2485, [email protected]
∙ New User Services Lead: Bill Ward, (301) 286-2954, [email protected]
∙ New HPC System Administrator: Bill Woodford, (301) 286-5852, [email protected]
Key Accomplishments
∙ SLES10 upgrade
∙ GPFS 3.2 upgrade
∙ Integrated SCU3
∙ Data Portal storage migration
∙ Transition from Explore to Discover
Integration of Discover SCU4
∙ SCU4 to be connected to Discover 10/1 (date firm)
∙ Staff testing and scalability 10/1-10/5 (dates approximate)
∙ Opportunity for interested users to run large jobs 10/6-10/13 (dates approximate)
Explore Utilization Past Year
[Chart: Explore utilization over the past year: 67.0%, 69.0%, 85.6%, 73.0%; 613,975 CPU hours.]
[Chart: Explore availability by month, May–August (93–100%), with outage durations (hours) for the longest outages: 5/14 hardware maintenance, 8/13 hardware failure, 8/13 facilities maintenance.]
Explore Availability
May through August availability:
∙ 13 outages
▶ 8 unscheduled
◆ 7 hardware failures
◆ 1 human error
▶ 5 scheduled
∙ 65 hours total downtime
▶ 32.8 unscheduled
▶ 32.2 scheduled
Longest outages:
∙ 8/13 – Hardware failure, 19.9 hrs
▶ E2 hardware issues after scheduled power down for facilities maintenance
▶ System left down until normal business hours for vendor repair
▶ Replaced/reseated PCI bus, IB connection
∙ 8/13 – Hardware maintenance, 11.25 hrs
▶ Scheduled outage
▶ Electrical work – facility upgrade
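As a rough check, the overall availability implied by the downtime totals above can be computed directly. The May 1 through August 31 window length (123 days) is our assumption; the slide gives only the downtime hours.

```python
# Rough overall availability from total downtime over a reporting window.
# Assumption: the window is May 1 - August 31, i.e. 123 days.
window_hours = 123 * 24            # 2952 hours in May-August
downtime_hours = 32.8 + 32.2       # unscheduled + scheduled = 65 hours (from the slide)
availability = 1 - downtime_hours / window_hours
print(f"{availability:.1%}")       # about 97.8% overall
```

The monthly bars in the chart can dip lower than this overall figure, since a single long outage concentrates in one month.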
Explore Queue Expansion Factor
Expansion Factor = (Queue Wait Time + Run Time) / Run Time
Weighted over all queues for all jobs (Background and Test queues excluded)
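The expansion factor above is computed per job and then weighted across jobs; a minimal sketch, weighting by run time (the sample wait/run pairs are illustrative, not actual queue data):

```python
# Queue expansion factor: (queue wait + run time) / run time per job,
# aggregated with run-time weighting. The (wait, run) pairs below are
# made-up sample data in hours, for illustration only.
jobs = [(0.5, 2.0), (3.0, 6.0), (0.1, 1.0)]  # (queue wait, run time)

total_run = sum(run for _, run in jobs)
# Weight each job's expansion factor by its run time, then normalize.
weighted_ef = sum(((wait + run) / run) * run for wait, run in jobs) / total_run
print(round(weighted_ef, 3))  # 1.4 for this sample
```

An expansion factor of 1.0 means jobs start immediately; values well above 1 indicate long queue waits relative to run time.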
Discover Utilization Past Year
[Chart: Discover utilization over the past year: 81.1%, 76.1%, 58.7%, 69.2%; 1,320,683 CPU hours.]
Discover Utilization
[Chart: Discover utilization over time, with an annotation marking when SCU3 cores were added.]
Discover Availability
[Chart: Discover availability by month, May–August (92–100%).]
May through August availability
∙ 11 outages
▶ 6 unscheduled
◆ 2 hardware failures
◆ 2 software failures
◆ 3 extended maintenance windows
▶ 5 scheduled
∙ 89.9 hours total downtime
▶ 35.1 unscheduled
▶ 54.8 scheduled
Longest outages:
∙ 7/10 – SLES 10 upgrade, 36 hrs (15 hours planned, 21-hour extension)
∙ 8/20 – Connect SCU3, 15 hrs (scheduled outage)
∙ 8/13 – Facilities maintenance, 10.6 hrs (electrical work, facility upgrade)
[Chart: Discover outage durations (hours): 7/10 extended SLES10 upgrade window, 7/10 SLES10 upgrade, 8/20 connect SCU3 to cluster, 8/13 facilities electrical upgrade.]
Discover Queue Expansion Factor
Expansion Factor = (Queue Wait Time + Run Time) / Run Time
Weighted over all queues for all jobs (Background and Test queues excluded)
Current Issues on Discover: InfiniBand Subnet Manager
∙ Symptom: Working nodes erroneously removed from GPFS following Infiniband Subnet problems with other nodes.
∙ Outcome: Job failures due to node removal
∙ Status: Modified several subnet manager configuration parameters on 9/17 based on IBM recommendations. Problem has not recurred; admins monitoring.
Current Issues on Discover: PBS Hangs
∙ Symptom: PBS server experiencing 3-minute hangs several times per day
∙ Outcome: PBS-related commands (qsub, qstat, etc.) hang
∙ Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented work-around on 9/17 to prevent conflicting use of these ports. No further occurrences.
Current Issues on Discover: Problems with PBS -V Option
∙ Symptom: Jobs with large environments not starting
∙ Outcome: Jobs placed on hold by PBS
∙ Status: Investigating with Altair (vendor). In the interim, users are requested not to pass the full environment via -V; instead, use -v or define the necessary variables within job scripts.
Current Issues on Discover: Problem with PBS and LDAP
∙ Symptom: Intermittent PBS failures while communicating with LDAP server.
∙ Outcome: Jobs rejected with bad UID error due to failed lookup
∙ Status: LDAP configuration changes to improve information caching and reduce queries to LDAP server; significantly reduced problem frequency. Still investigating with Altair.
Future Enhancements
∙ Discover Cluster
▶ Hardware platform – SCU4, 10/1/2008
▶ Additional storage
∙ Discover PBS Select Changes
▶ Syntax changes to streamline job resource requests
∙ Data Portal
▶ Hardware platform
∙ DMF
▶ Hardware platform
Overall Acquisition Planning Schedule, 2008–2009
[Chart: timeline, January through December 2008 into 2009. Facilities (E100): Stage 1 power & cooling upgrade; Stage 2 cooling upgrade. Compute upgrade: write RFP; issue RFP, evaluate responses, purchase; delivery; Stage 1 integration & acceptance; Stage 2 integration & acceptance; Explore decommissioned. Storage upgrade: write RFP; issue RFP, evaluate responses, purchase; delivery & integration. Current position marked "We are here!"]
What does this schedule mean to you? Expect some outages – please be patient.
[Chart: timeline, January through December 2008 into 2009, of user-visible milestones for the storage upgrade, compute upgrade, and Discover mods: GPFS 3.2 upgrade (not RDMA), SLES 10 software stack upgrade, additional storage on-line, Stage 1 compute capability available for users, Stage 2 compute capability available for users, decommission Explore. Three milestones are marked DONE and one Delayed.]
Cubed Sphere Finite Volume Dynamic Core Benchmark
[Chart: Cubed Sphere Benchmark 3: run time (seconds, log scale 100 to 100,000) versus number of cores (10 to 10,000), for IBM proposed, Discover measured, and IBM measured.]
∙ Non-hydrostatic, 10 km resolution
∙ Most computationally intensive benchmark
∙ Discover reference timings:
▶ 216 cores (6x6) – 6,466 s
▶ 288 cores (6x8) – 4,879 s
▶ 384 cores (8x8) – 3,200 s
∙ All runs made using ALL cores on a node.
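From the reference timings above, one can estimate relative parallel efficiency between core counts; a small sketch using the slide's numbers (the efficiency metric itself is our addition, not from the slide):

```python
# Relative parallel efficiency between two core counts for a fixed problem:
# eff = (t_ref * cores_ref) / (t_new * cores_new); 1.0 means ideal scaling.
# Timings (seconds) are the Discover reference numbers from the slide.
timings = {216: 6466, 288: 4879, 384: 3200}  # cores -> seconds

def relative_efficiency(c_ref, c_new):
    return (timings[c_ref] * c_ref) / (timings[c_new] * c_new)

print(round(relative_efficiency(216, 288), 2))  # 0.99: near-ideal scaling
print(round(relative_efficiency(216, 384), 2))  # 1.14: superlinear (e.g. cache effects)
```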
Near Future
∙ Additional storage to be added to the cluster
▶ 240 TB raw
▶ By the end of the calendar year
▶ RDMA pushed into next year
∙ Potentially one additional scalable unit
▶ Same as the new IBM units
▶ By the end of the calendar year
∙ Small IBM Cell application development testing environment
▶ In 2 to 3 months
SMD Allocation Policy Revisions 8/1/08
∙ 1-year allocations only during regular spring and fall cycles
▶ Fall e-Books deadline 9/20 for November 1 awards
▶ Spring e-Books deadline 3/20 for May 1 awards
∙ Projects started off-cycle
▶ Must have support of HQ Program Manager to start off-cycle
▶ Will get a limited allocation expiring at the next regular cycle award date
∙ Increases over 10% of the award, or 100K processor-hours, during the award period need support of the funding manager; email [email protected] to request an increase.
∙ Projects using up their allocation faster than anticipated are encouraged to submit for the next regular cycle.
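The increase rule above can be expressed as a simple check. This is a hypothetical helper for illustration, not an NCCS tool, and it assumes one reading of the policy: either threshold (10% of the award, or 100K processor-hours) triggers the requirement.

```python
# Does a requested allocation increase need funding-manager support?
# Assumed reading of the policy: support is needed if the increase exceeds
# 10% of the award OR 100K processor-hours during the award period.
def needs_funding_manager(award_hours, increase_hours):
    return increase_hours > 0.10 * award_hours or increase_hours > 100_000

print(needs_funding_manager(500_000, 40_000))    # False: under both thresholds
print(needs_funding_manager(500_000, 60_000))    # True: over 10% of award
print(needs_funding_manager(2_000_000, 150_000)) # True: over 100K processor-hours
```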
Questions about Allocations?
∙ Allocation POC: Sally Stemwedel, HEC Allocation Specialist, [email protected], (301) 286-5049
∙ SMD allocation procedure and e-Books submission link: www.HEC.nasa.gov
Explore Will Be Decommissioned
∙ It is a leased system
∙ e1, e2, and e3 must be returned to the vendor
∙ Palm will remain
∙ Palm will be repurposed
∙ Users must move to Discover
Transition to Discover - Phases
∙ PI notified
∙ Users on Discover
∙ Code migrated
∙ Data accessible
∙ Code acceptably tuned
∙ Performing production runs
Transition to Discover - Status
[Chart: percent of 58 projects complete at each transition phase (PI notified, on system, code ported, data accessible, code tuned, production runs), weekly from 3-Jul through 25-Sep.]
Transition to Discover - Status
[Chart: percent of 58 projects complete at each transition phase (PI notified, on system, code ported, data accessible, code tuned, production runs), as of 25-Sep.]
Transition to Discover - Status
[Chart: percent of allocation complete at each transition phase (PI notified, on system, code ported, data accessible, code tuned, production runs), as of 25-Sep.]
Accessing Discover Nodes
∙ We are in the process of making the PBS select statement simpler and more streamlined
∙ Keep doing what you are doing until we publish something better
∙ For most folks, the changes will not break what you are using now
Discover Compilers
∙ comp/intel-10.1.017 – preferred
∙ comp/intel-9.1.052 – if intel-10 doesn't work
∙ comp/intel-9.1.042 – only if absolutely necessary
MPI on Discover
∙ mpi/scali-5
▶ Not supported on new nodes
▶ -l select=<n>:scali=true to get it
∙ mpi/impi-3.1.038
▶ Slower startup than OpenMPI
▶ Catches up later (anecdotal)
▶ Self-tuning feature still under evaluation
∙ mpi/openmpi-1.2.5/intel-10
▶ Does not pass user environment
▶ Faster startup due to built-in PBS support
Data Migration Facility Transition
∙ DMF hosts Dirac/Newmintz (SGI Origin 3800s) to be replaced by parts of Palm (SGI Altix)
∙ Actual cutover Q1 CY09
∙ Impacts to Dirac users:
▶ Source code must be recompiled
▶ Some COTS must be relicensed
▶ Other COTS must be rehosted
Accounts for Foreign Nationals
∙ Codes 240, 600, and 700 have established a well-defined process for creating NCCS accounts for foreign nationals
∙ Several candidate users have been navigated through the process
∙ Prospective users from designated countries must go to NASA HQ
∙ Process will be posted on the web very soon: http://securitydivision.gsfc.nasa.gov/index.cfm?topic=visitors.foreign
Feedback
∙ Now – Voice your…
▶ Praises?
▶ Complaints?
∙ Later – NCCS Support
▶ [email protected]
▶ (301) 286-9120
∙ Later – USG Lead (me!)
▶ [email protected]
▶ (301) 286-2954
Open Discussion: Questions & Comments