HPC Resource Management: Futures
TRANSCRIPT
Intel Confidential

HPC Resource Management: View to the Future
Ralph H. Castain
Jan 27, 2016
Definition: RM
• Scheduler/Workload Manager (WLM)
  o Allocates resources to session
  o Interactive and batch
• Run-Time Environment
  o Launch and monitor applications
  o Support inter-process communication wireup
  o Serve as intermediary between applications and WLM
    • Dynamic resource requests
    • Error notification
      o Implement failure policies
Current State
• Provided as complete package (WLM+RTE)
  o Sometimes offer hook to replace one
  o Mostly proprietary
  o Open source often GPL (integration difficult)
• Programming model specific, static
• Independent island
  o Not integrated with monitoring, file systems, or networks
• Limited fault tolerance support
  o Restart failed job
RM => Orchestrator
[Diagram: the Resource Manager acting as orchestrator, linked via an overlay network, pub-sub, and SCON to monitoring, console, database, file system, network, a provisioning agent, and the RM-launched applications]
Resource Management Emerging Needs
• Emerging Application Needs
  o Dynamic resource management
  o Application involvement in decisions, including fault notification
  o Workflow orchestration
• Emerging System Needs
  o Scalable operations
  o Cross-subsystem integration (RM, RAS, site utilities, …)
  o Data staging, burst buffers, persistent memory management
  o System notification requests, topology, QoS
  o Power management
Multi-Tiered Strategy
[Diagram: PMIx enables existing HPC RMs (SLURM, MOAB, PBSPro, LSF) and the Open Resilient Cluster Mgr reference implementation; surrounding technologies include RAS integration, Hadoop support, overlay network, containers (S&S), provisioning (WW), data analytics, and ULFM & fault prediction]
Reference Implementation: ORCM
• Scalable to exascale levels & beyond
  – Better-than-linear scaling
  – Constrained memory footprint
• Dynamically configurable
  – Sense and adapt, user-directable
  – On-the-fly updates
  – Fully utilize underlying hw
• Open source (non-GPL)
  – Support proprietary add-ons
• Maintainable, flexible
  – Plugin architecture
  – Easy extension for R&D
• Resilient
  – Self-heal around failures
  – Reintegrate recovered resources
Launch Scaling: Core Capability
[Plot, repeated for each stage: time (sec) spent in launch, initialization, exchange of MPI contact info, setup of MPI structures, barrier, and mpi_init completion barrier, from MPI_Init through MPI_Finalize; RRZ, 16 nodes, 8 ppn, rank=0]
• Stage I: Provide a method for the RM to share job info; work with fabric and library implementers to compute endpoints from RM info
• Stage II: Add on 1st communication (non-PMIx)
• Stage III: Use HSN, daemon “instant on”, collectives
Current Status
• Stage I-II
• Stage III (projected)
• Adoption: SLURM, LSF, Moab, PBSPro
Flexible Allocation Support
• Request additional resources
  o Compute, memory, network, NVM, burst buffer
  o Immediate, forecast
  o Expand existing allocation, separate allocation
• Return extra resources
  o No longer required
  o Will not be used for some specified time; reclaim (handshake) when ready to use
• Notification of preemption
  o Provide opportunity to cleanly pause
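The request/return handshake above can be sketched in a few lines of Python. This is a toy model, not any existing RM interface: `AllocRequest`, `MockRM`, and every field name are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical allocation-request record; field names are illustrative only.
@dataclass
class AllocRequest:
    resource: str            # "compute", "memory", "network", "nvm", "burst_buffer"
    count: int
    when: str = "immediate"  # or "forecast"
    expand_existing: bool = True   # expand current allocation vs. separate one

class MockRM:
    """Toy stand-in for an RM that grants requests and takes back idle resources."""
    def __init__(self):
        self.allocation = {}

    def request(self, req: AllocRequest) -> bool:
        # A real RM would schedule; here we simply grant the request.
        self.allocation[req.resource] = self.allocation.get(req.resource, 0) + req.count
        return True

    def release(self, resource: str, count: int, idle_until=None):
        # Handshake: resources flagged idle-until can be reclaimed now and
        # handed back when the application is ready to use them again.
        self.allocation[resource] -= count

rm = MockRM()
rm.request(AllocRequest("compute", 16))
rm.release("compute", 4)   # return extra resources no longer required
```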
I/O Support
• Asynchronous operations
  o Anticipatory data fetch, staging
  o Advise time to complete
  o Notify upon availability
• Storage policy requests
  o Hot/warm/cold data movement
  o Desired locations and striping/replication patterns
  o Persistence of files, shared memory regions across jobs, sessions
  o ACL to generated data across jobs, sessions
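The asynchronous-staging idea (anticipatory fetch plus notify-on-available) maps naturally onto a future-style API. The sketch below uses Python's standard `concurrent.futures`; `stage_in` and the path are hypothetical stand-ins for real data movement.

```python
import concurrent.futures
import time

# Toy sketch of anticipatory data staging: the application asks for data to
# be staged ahead of need and is notified when it becomes available.
def stage_in(path: str) -> str:
    time.sleep(0.01)   # stand-in for the actual data movement
    return f"{path} staged"

with concurrent.futures.ThreadPoolExecutor() as pool:
    fut = pool.submit(stage_in, "/scratch/input.dat")   # anticipatory fetch
    fut.add_done_callback(lambda f: print(f.result()))  # notify upon availability
    result = fut.result()   # app blocks only when it actually needs the data
```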
Spawn Support
• Staging support
  o Files, libraries required by new apps
  o Allow RM to consider in scheduling
    • Current location of data
    • Time to retrieve and position
    • Schedule scalable preload
  o Specified topology, min/max procs, …
• Provisioning requests
  o Allow RM to consider in selecting resources, minimize startup time due to provisioning
  o Desired image, packages
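A spawn request carrying these staging and provisioning hints might look like the sketch below, so the scheduler can weigh data location and image-provisioning cost. All field names, the image names, and the cost heuristic are invented for illustration.

```python
# Hypothetical spawn request with staging and provisioning hints.
spawn_request = {
    "executable": "analysis_app",
    "min_procs": 32,
    "max_procs": 128,
    "topology": "pack",                 # requested placement
    "stage": {
        "files": ["input.h5"],          # files/libraries the new app needs
        "preload": "scalable",          # let the RM schedule a scalable preload
    },
    "provision": {
        "image": "centos7-hpc",         # desired image
        "packages": ["libfabric"],      # required packages
    },
}

def provisioning_cost(node_image: str, req: dict) -> int:
    # RM-side heuristic: nodes already running the desired image start faster,
    # minimizing startup time due to provisioning.
    return 0 if node_image == req["provision"]["image"] else 1

# Pick the candidate image with the lowest provisioning cost.
best = min(["centos7-hpc", "sles12"], key=lambda img: provisioning_cost(img, spawn_request))
```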
Network Integration
• Quality of service requests
  o Bandwidth, traffic priority, power constraints
  o Multi-fabric failover, striping prioritization
  o Security requirements
• Network domain definitions, ACLs
• Notification requests
  o State-of-health
  o Update process endpoint upon fault recovery
• Topology information
  o Torus, dragonfly, …
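A QoS request of the kind listed above could be expressed as structured data handed to the fabric manager, with failover handled by walking the requested fabric order. The keys, fabric names, and `pick_fabric` helper are all hypothetical.

```python
# Hypothetical QoS request; keys mirror the bullet list above.
qos_request = {
    "bandwidth_gbps": 10,
    "traffic_priority": "high",
    "power_constraint_w": 50,
    "failover_fabrics": ["fabric0", "fabric1"],   # multi-fabric failover order
    "striping": ["fabric0", "fabric1"],           # striping prioritization
    "security": {"domain": "project-a", "acl": ["rank:*"]},
}

def pick_fabric(healthy: set, req: dict) -> str:
    """Return the first healthy fabric in the requested failover order."""
    for fabric in req["failover_fabrics"]:
        if fabric in healthy:
            return fabric
    raise RuntimeError("no healthy fabric available")

chosen = pick_fabric({"fabric1"}, qos_request)   # fabric0 is down; fail over
```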
Power Control/Management
• Scheduler analytics
  o Predict power usage, advise updates
• Application requests
  o Advise of changing workload requirements
  o Request changes in policy
  o Specify desired policy for spawned applications
• RM notifications
  o Need to change power policy
  o Preemption notification
    • Allow application to accept, request pause
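The RM-to-application power negotiation (notify, then accept or request a clean pause) can be sketched as a simple callback protocol. The class, method names, and policy strings are illustrative only.

```python
# Sketch of the RM/application power-policy negotiation described above.
class App:
    def __init__(self, can_pause: bool):
        self.can_pause = can_pause
        self.policy = "performance"
        self.paused = False

    def on_power_notification(self, new_policy: str) -> str:
        """RM notifies a need to change power policy; app replies."""
        if self.can_pause:
            self.paused = True      # cleanly pause rather than run throttled
            return "pause"
        self.policy = new_policy    # accept the RM's new policy
        return "accept"

app = App(can_pause=False)
reply = app.on_power_notification("power-cap-80pct")
```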
Fault Tolerance
• Notification
  o App can register for error notifications, incipient faults
• RM-app negotiate to determine response
  o App can notify RM of errors
    • RM will notify specified, registered procs
• Rapid application-driven checkpoint
  o Local on-node NVRAM, auto-stripe checkpoints
  o Policy-driven “bleed” to remote burst buffers and/or global file system
• Restart support
  o Specify source (remote NVM checkpoint, global filesystem, etc.)
  o Location hints/requests
  o Entire job, specific processes
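The register/notify pattern above, where only registered processes hear about a given fault class, boils down to a callback registry. This is a minimal sketch, not the real PMIx event interface; all names are invented.

```python
# Minimal sketch of error-notification registration: processes register
# callbacks per fault class, and the RM notifies only registered procs.
class ErrorManager:
    def __init__(self):
        self.handlers = {}   # fault class -> list of callbacks

    def register(self, fault_class: str, callback):
        self.handlers.setdefault(fault_class, []).append(callback)

    def notify(self, fault_class: str, detail: str):
        # Only procs registered for this fault class are notified.
        for cb in self.handlers.get(fault_class, []):
            cb(detail)

events = []
em = ErrorManager()
em.register("node-failure", lambda d: events.append(("checkpoint", d)))
em.register("incipient-fault", lambda d: events.append(("migrate", d)))
em.notify("node-failure", "node042 unreachable")
```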
Workflow Orchestration
• Growing range of workflow patterns
  o Traditional HPC: bulk synchronous, single application
  o Analytics: asynchronous, staged
  o Parametric: large number of simultaneous independent jobs
• Ability to dynamically spawn a job
  o Need flexibility – min/max size, rolling allocation, …
  o Events between jobs and cross-linking of I/O
  o Queuing of spawn requests
• Tool interfaces that work in these environments
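The flexible (min/max size) spawn and queuing behavior above can be modeled in a few lines. The `schedule` function and the request tuples are a toy model of rolling allocation, not a real scheduler.

```python
from collections import deque

# Toy model of queued dynamic-spawn requests: each job asks for a flexible
# size (min/max), the RM launches whatever fits, and queues the rest.
def schedule(free_nodes: int, requests):
    queue = deque(requests)          # (name, min_size, max_size)
    launched = []
    while queue:
        name, lo, hi = queue[0]
        if free_nodes < lo:
            break                    # head of queue cannot start yet
        grant = min(hi, free_nodes)  # give up to max, never below min
        launched.append((name, grant))
        free_nodes -= grant
        queue.popleft()
    return launched, list(queue)

launched, pending = schedule(10, [("sim", 4, 8), ("analytics", 4, 4)])
```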
Workload Manager: Job Description Language
• Complexity of describing a job is growing
  o Power, file/lib positioning
  o Performance vs capacity, programming model
  o System, project, application-level defaults
• Provide templates?
  o System defaults, with modifiers
    • --hadoop:mapper=foo,reducer=bar
    • User-defined
    • Application templates
    • Shared, group templates
  o Markup language definition of behaviors, priorities
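The template-with-modifiers idea, in the spirit of `--hadoop:mapper=foo,reducer=bar`, amounts to overlaying user key=value pairs onto a named system default. The template contents and field names below are invented for illustration.

```python
# Hypothetical system job templates; users overlay modifiers on the defaults.
SYSTEM_TEMPLATES = {
    "hadoop": {"mapper": "default_map", "reducer": "default_reduce", "power": "balanced"},
}

def apply_template(name: str, modifiers: str) -> dict:
    """Overlay 'key=value,key=value' modifiers onto a named template."""
    job = dict(SYSTEM_TEMPLATES[name])       # start from system defaults
    for pair in modifiers.split(","):
        key, _, value = pair.partition("=")
        job[key] = value                     # modifier overrides the default
    return job

job = apply_template("hadoop", "mapper=foo,reducer=bar")
```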
Flexible Architecture
• Each tool built on top of the same plugin system
  o Different combinations of frameworks
  o Different plugins activated to play different roles
  o Example: orcmd on compute node vs on rack/row controllers
• Designed for distributed, centralized, hybrid operations
  o Centralized for small clusters
  o Hybrid for larger clusters
  o Example: centralized scheduler, distributed “worker-bees”
• Accessible to users for interacting with RM
  o Add shim libraries (abstract, public APIs) to access framework APIs
  o Examples: SCON, pub-sub, in-flight analytics
Breaking it Down
• Workload Manager
  o Dedicated framework
  o Plugins for two-way integration to external WM (Moab, Cobalt)
  o Plugins for implementing internal WM (FIFO)
• Run-Time Environment
  o Broken down into functional blocks, each with its own framework
    • Loosely divided into three general categories: messaging, launch, error handling
    • One or more frameworks for each category
  o Knitted together via “state machine”
    • Event-driven, async
    • Each functional block can be a separate thread
    • Each plugin within each block can be separate thread(s)
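The event-driven, threaded "state machine" idea above can be sketched with one functional block per thread reacting to posted events. Block names and events are illustrative; a real RTE would route events among many blocks.

```python
import queue
import threading

# Minimal event-driven functional block: runs in its own thread and reacts
# to events posted on its queue, as sketched in the bullets above.
class Block(threading.Thread):
    def __init__(self, name, handler, results):
        super().__init__(daemon=True)
        self.events = queue.Queue()
        self.name, self.handler, self.results = name, handler, results

    def run(self):
        while True:
            event = self.events.get()
            if event is None:        # sentinel: shut the block down
                break
            self.results.append(self.handler(event))

results = []
launch = Block("launch", lambda ev: f"launched {ev}", results)
launch.start()
launch.events.put("job-17")   # post an event to the launch block
launch.events.put(None)
launch.join()
```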
Analytics Workflow Concept
[Diagram: sensors and other workflows feed an input module that converts data to a generalized format; workflow output flows to other workflows, RAS events, pub-sub, and the database. Available in SCON as well.]
Workflow Elements
• Average (window, running, etc.)
• Rate (convert incoming data to events/sec)
• Threshold (high, low)
• Filter
  o Selects input values based on provided params
• RAS event
  o Generates a RAS event corresponding to input description
• Publish data
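Two of these elements chained together, a window average feeding a high threshold that emits a RAS event, can be sketched as below. The plugin shapes and the sensor values are illustrative, not the ORCM plugin API.

```python
from collections import deque

# Sketch of chained workflow elements: window average -> threshold -> RAS event.
class WindowAverage:
    def __init__(self, window):
        self.buf = deque(maxlen=window)

    def step(self, value):
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)

class Threshold:
    def __init__(self, high):
        self.high = high

    def step(self, value):
        # Emit a RAS-style event when the smoothed value crosses the high mark.
        return ("RAS-event", value) if value > self.high else None

avg, thr = WindowAverage(window=3), Threshold(high=70.0)
events = []
for temp in [60.0, 65.0, 80.0, 90.0]:   # e.g. a sensor temperature stream
    event = thr.step(avg.step(temp))
    if event:
        events.append(event)
```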
Analytics
• Execute on aggregator nodes for in-flight reduction
  o Sys admin defines, user can define (if permitted)
• Event-based state machine
  o Each workflow in own thread, own instance of each plugin
  o Branch and merge of workflows
  o Tap stream between workflow steps
  o Tap data streams (sensors, others)
• Event generation
  o Generate events/alarms
  o Specify data to be included (window)
Distributed Architecture
• Hierarchical, distributed approach for unlimited scalability
  o Utilize daemons on rack/row controllers
• Analysis done at each level of the hierarchy
  o Support rapid response to critical events
  o Distribute processing load
  o Minimize data movement
• RM’s error manager framework controls response
  o Based on specified policies
Fault Diagnosis
• Identify root cause and location
  o Sometimes obvious – e.g., when directly measured
  o Other times non-obvious
    • Multiple cascading impacts
    • Cause identified by multi-sensor correlations (indirect measurement)
    • Direct measurement yields early report of non-root cause
    • Example: power supply fails due to borderline cooling + high load
• Estimate severity
  o Safety issue, long-term damage, imminent failure
• Requires in-depth understanding of hardware
Fault Prediction: Methodology
• Exploit access to internals
  o Investigate optimal location, number of sensors
  o Embed intelligence, communications capability
• Integrate data from all available sources
  o Engineering design tests
  o Reliability life tests
  o Production qualification tests
• Utilize learning algorithms to improve performance
  o Both embedded, post-process
  o Seed with expert knowledge
Fault Prediction: Outcomes
• Continuous update of mean time to preventative maintenance
  o Feed into projected downtime planning
  o Incorporate into scheduling algorithm
• Alarm reports for imminent failures
  o Notify impacted sessions/applications
  o Plan/execute preemptive actions
• Store predictions
  o Algorithm improvement
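A continuously updated mean time to preventative maintenance can be as simple as a running mean over observed inter-maintenance intervals. This is a toy update rule for illustration; a production predictor would use the learning methods from the previous slide.

```python
# Toy running update of mean time to preventative maintenance (MTTPM),
# of the kind that could feed projected-downtime planning and scheduling.
class MTTPMEstimator:
    def __init__(self):
        self.n = 0
        self.mean_hours = 0.0

    def observe(self, interval_hours: float) -> float:
        """Fold one observed inter-maintenance interval into the running mean."""
        self.n += 1
        self.mean_hours += (interval_hours - self.mean_hours) / self.n
        return self.mean_hours

est = MTTPMEstimator()
for interval in [900.0, 1100.0, 1000.0]:
    mttpm = est.observe(interval)
```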
HPC Controls
Thank You!