Solving the hard problems . . .
Distribution A: Approved for public release, distribution unlimited. HPC User Forum, 7 Apr 2011
User Environment Enhancements in the DoD HPC Modernization Program
7 April 2011
Steve Scherr, DoD HPCMP
Topics
● Background: HPCMP Storage Initiative
● Enhanced User Environment
● HPC EUE Infrastructure
● HPC Portal
MB Revised: 5/4/2009
HPC Modernization Program
Vision
A pervasive culture existing among DoD’s scientists and engineers where they routinely use advanced computational environments to solve the most demanding problems, transforming the way DoD does business: finding better solutions faster.
Mission
Accelerate development and transition of advanced defense technologies into superior warfighting capabilities by exploiting and strengthening US leadership in supercomputing, communications and computational modeling.
MB Revised: 12/11/2009
HPCMP Serves a Large, Diverse DoD User Community
● FY11 statistics
– 501 active projects with 4,408 users at 250 sites
– 5,098 Habus* batch requirements
● FY10 statistics (as of 9/30/2010)
– 496 projects with 4,345 users
– 2,866 Habus* non-real-time requirements
* Requirements and usage measured in Habus
92 users are self-characterized as “Other”
New CTA: Space and Astrophysical Science (SAS)
Computational Structural Mechanics – 465 Users
Electronics, Networking, and Systems/C4I – 211 Users
Computational Chemistry, Biology & Materials Science – 690 Users
Computational Electromagnetics & Acoustics – 323 Users
Computational Fluid Dynamics – 1,223 Users
Environmental Quality Modeling & Simulation – 163 Users
Signal/Image Processing – 586 Users
Integrated Modeling & Test Environments – 105 Users
Climate/Weather/Ocean Modeling & Simulation – 315 Users
Forces Modeling & Simulation – 235 Users
Source: Portal to the Information Environment – July 2010
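The per-area user counts on this slide can be cross-checked against the FY11 total of 4,408 users. A minimal sketch, assuming the 92 "Other" users are counted in addition to the ten listed areas (the slide does not say so explicitly):

```python
# User counts per computational technology area, copied from the slide.
users_by_area = {
    "Computational Structural Mechanics": 465,
    "Electronics, Networking, and Systems/C4I": 211,
    "Computational Chemistry, Biology & Materials Science": 690,
    "Computational Electromagnetics & Acoustics": 323,
    "Computational Fluid Dynamics": 1223,
    "Environmental Quality Modeling & Simulation": 163,
    "Signal/Image Processing": 586,
    "Integrated Modeling & Test Environments": 105,
    "Climate/Weather/Ocean Modeling & Simulation": 315,
    "Forces Modeling & Simulation": 235,
}
other_users = 92    # self-characterized as "Other"
fy11_users = 4408   # FY11 total quoted on the slide


def total_users(by_area, other):
    """Sum the per-area counts plus the users not assigned to a listed area."""
    return sum(by_area.values()) + other


total = total_users(users_by_area, other_users)
largest_area = max(users_by_area, key=users_by_area.get)
print(total, total == fy11_users, largest_area)
```

Under that assumption the listed figures are internally consistent, and Computational Fluid Dynamics is the largest single area.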
MB Revised: 1/26/2011
Customer Focus
DoD Supercomputing Resource Centers (DSRCs): Six Large HPC Centers
● DSRC systems support classified, unclassified and open computing capabilities
● 17 large HPC systems
– 1 system: 44,000+ cores
– 6 systems: 10,000 to 22,000+ cores
– 10 systems: 2,000 to 9,000+ cores
– 1.873 peak PetaFlops
– 4,750 Habus
● Three new FY10 HPC systems
– 773 TeraFlops
– 2,251 Habus
● 14 Petabytes single copy data storage
– 28 Petabytes including Disaster Recovery
● Connections to Customers
– 212 locations
MB Revised: 12/22/2010
HPCMP Data Storage Growth
43% increase over FY 2008
34% increase over FY 2009
MB Revised: 12/22/2010
User View of HPCMP Storage
[Figure: HPC A and HPC B, each with an HPC File System providing $WORKDIR short-term storage; Archive Server; Center Archive Cache; Tape; DR Cache; DR Tape]
Computational results used in many different ways
– Source for additional computation
– Interrogated for post-processing
– Archived for scientific value
Users are mobile within HPCMP
Storage Lifecycle Management (SLM) Rationale
● HPCMP can provide enough storage for NEW data
● Centers support 2+ generations of storage media
– Older media unreadable after tech obsolescence
● Users: we can live with constraints & manage data
– Need tools to manage data
– Need intermediate-length storage
[Figure: data lifecycle stages: Creation, Active Use, Archival Use, Removal]
ENHANCED USER ENVIRONMENT
Evolving Enterprise Service Model
[Figure: enterprise service model, with layers labeled Customers, Services, and Infrastructure]
Communities: Research Community, T & E, Software Development, Acquisition Community
Capabilities shown: Single Authentication, Advance Reservations, Web Portal Framework, Remote SciViz, ezHPC, Remote Job Management, Computational Infrastructure for Software Development (Tools / Environment), Data Management Tools – Metadata, Batch, Interactive Grid Generation
HPC Centers 1 through 6: each shows two HPC systems, a Utility Server, SciViz, Job Submit, Metadata Attributes, Disk, and Tape. HPC Center 4 is drawn in detail: HPC System 1, HPC System 2, Utility Server, SciViz, Job Submit, Metadata Attributes, 30-day Disk Storage, Archive Tape Storage.
MB Revised: 8/27/2010
HPC Enhanced User Environment Architecture
[Figure: architecture diagram with columns labeled Services, Compute, and Storage]
Components shown: DR&E Portal; Single Point of Access; Center-wide Job Management; Data Analysis Services; Grid-Generation Capabilities; Software Development Environment; Storage Lifecycle Management; Utility Server; HPC System A and HPC System B, each with Temporary Storage (10 days); Center-wide ILM-managed File System (30 days); SLM Metadata Catalog Service; Metadata Replication Between all DSRCs; Archive Server; Local Tape Archive; Remote Disaster Recovery Facility
MB Revised: 12/22/2010
HPC Enhanced User Environment
● Interactive Computing
– Single point of access
– Center-wide job management
– Remote data analysis
● Center-wide filesystem
– Medium-term storage
– User-specified metadata
● Data Management Tools
– Insight into file archives
– Program-wide visibility
● HPC Portal
– Supercharge the engineering desktop
MB Revised: 8/3/2010
HPC EUE INFRASTRUCTURE
Hardware Components
● Center-wide File System: Panasas PAS 8
– 340 blades, 4 TB unformatted
– Arista 7508 switch
● Utility Server: Appro 1U Tetra, 88 nodes
– 44 compute: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 128 GB memory
– 22 large memory: 4 AMD Opteron 2.3 GHz CPUs, 32 cores, 256 GB memory
– 22 graphics: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 256 GB memory, NVIDIA Tesla M2050
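The Utility Server node mix can be totaled to check the 88-node figure and to estimate aggregate cores and memory. A minimal sketch, assuming the core counts and memory sizes on the slide are per node (the slide does not state this, so the aggregate numbers are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    nodes: int
    cores_per_node: int    # assumption: slide core count is per node
    mem_gb_per_node: int   # assumption: slide memory figure is per node


UTILITY_SERVER = [
    NodeType("compute", 44, 16, 128),
    NodeType("large memory", 22, 32, 256),
    NodeType("graphics", 22, 16, 256),
]


def totals(node_types):
    """Return (nodes, cores, memory in GB) summed over all node types."""
    nodes = sum(t.nodes for t in node_types)
    cores = sum(t.nodes * t.cores_per_node for t in node_types)
    mem_gb = sum(t.nodes * t.mem_gb_per_node for t in node_types)
    return nodes, cores, mem_gb


nodes, cores, mem_gb = totals(UTILITY_SERVER)
print(f"{nodes} nodes, {cores} cores, {mem_gb} GB memory")
```

The node count matches the 88 nodes quoted for the Appro system; the core and memory totals hold only under the per-node assumption.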
System Configuration
● $HOME
− 10 GB quota
● $WORKDIR
− 200 TB
− 100 TB user quota
− Standard scrubbing
● $CENTER
− 800 TB
− Possible user quota (200 TB)
− 30-day scrub policy
− SLM compatible
● $ARCHIVE
− Managed by SLM
− Accessed through SLM
● Center-wide Job Management
− qsub, qstat, qdel
● Resource Requests
− PBS Pro
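As an illustration of how these pieces fit together, the sketch below generates a PBS Pro batch script that runs in $WORKDIR and copies a result to $CENTER. The queue name, job name, application command, and result file name are hypothetical placeholders, not values from the presentation; actual queue names and resource limits are site-specific.

```python
def build_pbs_script(job_name, queue, chunks, ncpus, walltime, command):
    """Build the text of a PBS Pro job script (illustrative values only)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -q {queue}",                      # hypothetical queue name
        f"#PBS -l select={chunks}:ncpus={ncpus}",
        f"#PBS -l walltime={walltime}",
        "cd $WORKDIR",                           # short-term scratch space
        command,                                 # hypothetical application command
        "cp results.dat $CENTER/",               # hypothetical file; $CENTER is scrubbed after 30 days
    ]
    return "\n".join(lines) + "\n"


def qsub_command(script_path):
    """Argument list for submitting the script, e.g. via subprocess.run."""
    return ["qsub", script_path]


script = build_pbs_script("demo", "standard", 1, 16, "01:00:00", "./my_solver input.cfg")
print(script)
```

After submission, `qstat` reports job status and `qdel` followed by the job identifier cancels the job.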
Storage Lifecycle Management
● Based on Nirvana SRB and SAM-QFS
● Manages $ARCHIVE
– Set metadata to specify retention period
● Can register files on $CENTER; target is to automate registration by end of 2011
● HPC access to $ARCHIVE through transfer queue
– Also working on a PBS parameter mechanism (future: just-in-time)
● Customer Experience workgroup developing auxiliary commands (Sdata) for user-defined metadata
● Global visibility
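The idea of a retention period carried as metadata can be sketched as follows. This is not the SLM or Sdata interface, which the slides do not describe; the class and field names are hypothetical and only show how a retention attribute could determine when an archived file becomes eligible for removal.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ArchivedFile:
    path: str
    archived_on: date
    retention_days: int  # hypothetical user-set metadata attribute

    def expires_on(self):
        """Date on which the retention period ends."""
        return self.archived_on + timedelta(days=self.retention_days)

    def is_expired(self, today):
        """True once the retention period has elapsed."""
        return today >= self.expires_on()


record = ArchivedFile("project/run42/output.dat", date(2011, 4, 7), 365)
print(record.expires_on(), record.is_expired(date(2011, 12, 31)))
```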
HPC PORTAL
HPC Desktop Portal Initiative
Goals
– Enable DoD scientists and engineers to apply the power of HPC without being HPC experts
– Provide access to HPC resources using current web technology; attract and retain new technology experts to DoD
Methods
– Provide HPC Software as a Service over web with zero or minimal footprint
– Provide common analysis tools enabled for seamless HPC use (MATLAB)
– Provide accessible optimized tools for technology domains (CREATE, institutes)
– Extension of desktop; interactive response
– Single sign-on through CAC
HPC Portal
● Engaging with DoD engineering organizations
– Understand their requirements and how we can support
● Examining Cloud Computing Concepts
– Software as a Service
– Infrastructure as a Service
● Phase 1: Parallel MATLAB capability
– ARL lead, deliver in June
– Built on Microsoft HPC Server
– Additional available applications, FMS, CFD, etc.
● Phase 2: Present CREATE capability
– Identifying API, middleware, design framework
HPC Modernization Program
MB Revised: 11/23/2009
BACKUP
Storage Lifecycle Management
● Layered Software Capability
– Information Lifecycle Management
− Metadata – user and system defined
− Policies – drive HSM
− Reporting
– Hierarchical Storage Management
− Tiered Storage
− Disaster Recovery
● Multi-system, multi-center
– Assign metadata attributes from all HPC systems
– Work toward “shared” files between centers
Storage Lifecycle Management
● Information Lifecycle Management
– Provide capability to users and administrators
– Control costs
● Hierarchical Storage Management
– Based on ILM information
– Includes disaster recovery
● Common user interface
● Work toward shared files
ILM Requirements
● Metadata attributes
– User-assignable
– System-assignable
– Defaults
● Tools and Reports
– Enable management of data files
● Policies
– Based on attributes
– Used to drive HSM
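A minimal sketch of how attribute-based policies could drive HSM decisions. The attribute names, default values, tier names, and thresholds are hypothetical illustrations of the requirement, not the behavior of any product named in the presentation.

```python
# Hypothetical defaults applied when neither user nor system sets a value.
DEFAULTS = {"retention_days": 30, "keep_on_disk": False}


def effective_attributes(user_attrs, system_attrs):
    """Merge defaults, user-assigned, and system-assigned attributes.

    In this sketch system-assigned values take precedence over user values.
    """
    merged = dict(DEFAULTS)
    merged.update(user_attrs)
    merged.update(system_attrs)
    return merged


def hsm_action(attrs):
    """Map a file's attributes to a storage action: 'disk', 'tape', or 'remove'."""
    if attrs["age_days"] > attrs["retention_days"]:
        return "remove"
    if attrs["keep_on_disk"] or attrs["age_days"] <= 7:
        return "disk"
    return "tape"


attrs = effective_attributes({"retention_days": 90}, {"age_days": 10})
print(attrs, hsm_action(attrs))
```

Keeping the policy as a pure function of attributes means the same rule can be evaluated by reporting tools without touching the underlying files.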
ILM Attribute Requirements
● Associated with all objects
● Arbitrary number, size, type
● Attribute permissions separate from underlying files
– System read/write
– Creator/Owner read/write
– Collections of other users
● Inheritance or default-setting at creation
– Settable via templates or functions
● ILM must scale to 1B files today
– No impact on I/O performance for HSM
● Attributes can be output textually
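One way to picture these requirements is a data structure in which each attribute carries its own reader and writer sets, defaults come from a template at creation, and attributes can be dumped as text. This is a sketch under those assumptions; the class names and the "system" principal are hypothetical.

```python
from dataclasses import dataclass, field

SYSTEM = "system"  # hypothetical name for the system principal


@dataclass
class Attribute:
    value: object
    readers: set = field(default_factory=set)
    writers: set = field(default_factory=set)


class ManagedObject:
    def __init__(self, owner, template=None):
        self.owner = owner
        self.attributes = {}
        # Default-setting at creation: copy values from a template.
        for name, value in (template or {}).items():
            self._create(name, value)

    def _create(self, name, value):
        # System and creator/owner get read/write on every new attribute.
        principals = {self.owner, SYSTEM}
        self.attributes[name] = Attribute(value, set(principals), set(principals))

    def set_attribute(self, user, name, value):
        attr = self.attributes.get(name)
        if attr is None:
            if user not in (self.owner, SYSTEM):
                raise PermissionError(f"{user} may not create {name}")
            self._create(name, value)
        elif user in attr.writers:
            attr.value = value
        else:
            raise PermissionError(f"{user} may not modify {name}")

    def grant_read(self, name, users):
        # Extend read access to a collection of other users.
        self.attributes[name].readers.update(users)

    def dump(self, user):
        # Textual output, limited to attributes this user may read.
        return "\n".join(
            f"{name}={attr.value}"
            for name, attr in sorted(self.attributes.items())
            if user in attr.readers
        )


obj = ManagedObject("alice", template={"project": "cfd", "retention_days": 30})
obj.set_attribute("alice", "retention_days", 90)
obj.grant_read("project", {"bob"})
print(obj.dump("alice"))
```

Permissions here live on the attribute, not on the file, which is the separation the slide calls for.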
ILM Tool Requirements
● Tools for manipulating files under ILM control
– Attribute-aware
– Attribute-preserving
– Operate on files, directories, or lists of objects
– Create/modify attributes
● Reports
– Based on multiple criteria, attribute values
– Status of pending operations
– Consistent with attribute permissions
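A report that filters on several attribute values while respecting attribute permissions might look like the sketch below. The catalog layout (dictionaries with "path", "attrs", and "readers" keys) is a hypothetical stand-in for whatever the metadata service actually stores.

```python
def report(objects, user, **criteria):
    """Return paths of objects whose attributes match every criterion.

    Only attributes the user is allowed to read are considered, so a user
    cannot infer hidden values from report results.
    """
    matches = []
    for obj in objects:
        readable = {
            name: value
            for name, value in obj["attrs"].items()
            if user in obj["readers"].get(name, ())
        }
        if all(k in readable and readable[k] == v for k, v in criteria.items()):
            matches.append(obj["path"])
    return matches


catalog = [
    {
        "path": "a.dat",
        "attrs": {"project": "cfd", "status": "archived"},
        "readers": {"project": {"alice", "bob"}, "status": {"alice"}},
    },
    {
        "path": "b.dat",
        "attrs": {"project": "cfd", "status": "pending"},
        "readers": {"project": {"alice"}, "status": {"alice"}},
    },
]

print(report(catalog, "alice", project="cfd", status="pending"))
```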
HPCMP Storage Initiative
● Computing power grows annually; so do stored files
● Archived data is hard for users to use and manage
● Costs: User time, labor, hardware, software and media
● Storage Initiative
– Objective: Refresh to manage data for next 10 years
– Goals: 10-year architecture
− Leverage advances in technology
− Improve user productivity
− Improve reliability & adaptability
− Sustain within current storage budget
MB Revised: 5/4/2009
HPCMP Data Storage Growth
[Figure: chart of Single Copy Data Storage, 2001 to 2010; y-axis: Single Copy of HPCMP Storage in Petabytes, 0 to 16; annotation: 22 x]
● Impact of 16x growth in eight years
– Data Analysis
– Data Locality and Movement
– Data Duplication
– Disaster Recovery
– Network Loading
– Storage Technologies
MB Revised: 12/22/2010
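The 16x growth in eight years quoted on this slide corresponds to a compound annual growth rate of roughly 41%. A minimal sketch of that arithmetic, assuming growth was uniform from year to year (the yearly chart values are not recoverable here, so this is an average, not a per-year figure):

```python
def annual_growth_rate(factor, years):
    """Compound annual growth rate implied by an overall growth factor."""
    return factor ** (1.0 / years) - 1.0


rate = annual_growth_rate(16, 8)
print(f"{rate:.1%}")  # prints 41.4%
```

Since 16 is 2 to the 4th power, a 16x increase over eight years is the same as doubling every two years.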
HPC Enhanced User Environment (HEUE)
● Purpose
– Provide computational scientists more tools and capabilities to perform research more efficiently and effectively
● Benefit
– Decrease time-to-solution, increase S&E productivity and analytical power, reduce future costs of data archive
● Tasks
– Storage lifecycle management implementation
− Metadata for file management and identification
− Program-wide datafile visibility and access
– Center-wide filesystem: efficient storage for data analysis and extraction
– Center-wide job management: single point-of-access, increase user productivity
– Remote visualization for large datasets
– Web-based access to HPC capability
MB Revised: 12/22/2010
Requested Software
System Software
● PBS Pro, OpenMPI
● InfiniBand Software Stack
● NVIDIA Linux x86_64 driver set
● Compliance with BCT policies
Development Tools
● PGI Compiler Suite (C/C++/Fortran)
● GNU Compiler Suite & debugger
● TotalView debugger
● NVIDIA GPGPU development Environment (OpenCL and CUDA)
● Common Set of Open Source Utilities
● BC policy: PAPI, SCALASCA, TAU, PDT, Valgrind
● DDT and DDT with CUDA debugger
Data Analysis Tools
● CEI – EnSight Suite
● Intelligent Light – FieldView
● RSI, Inc. – IDL
● Mathworks – MATLAB
● NCAR Graphics Library
● Kitware – ParaView
● Tecplot, Inc. – Tecplot
● VisIt Visualization Tool
● Computational Science Environment (CSE)
● ezVIZ
Requested Software
Pre/Post Processing Software
● ANSYS CFD
● Abaqus
● LS-PrePost
● Parasolid Designer (pre)
● Pointwise – Gridgen
Math Libraries
● ARPACK, FFTW, PETSc, SuperLU, LAPACK, ScaLAPACK, BLAS, ATLAS, GotoBLAS, SPRNG, GSL
New
● Pipeline Pilot (Accelrys product): automation of the process of predicting compute intensity on the fly and submitting jobs to the US
● Isight (DSS product): design optimization & process integration (some portions are interactive & some are for batch processing)
Secure Remote Visualization
● PKI-VNC
● Longhorn