HPC Resource Management: Futures
TRANSCRIPT
Intel Confidential

HPC Resource Management: View to the Future
Ralph H. Castain
Jan 27, 2016
Definition: RM
• Scheduler/Workload Manager (WLM)
  o Allocates resources to session
  o Interactive and batch
• Run-Time Environment
  o Launch and monitor applications
  o Support inter-process communication wireup
  o Serve as intermediary between applications and WLM
    • Dynamic resource requests
    • Error notification
      o Implement failure policies
Current State
• Provided as complete package (WLM+RTE)
  o Sometimes offer hook to replace one
  o Mostly proprietary
  o Open source often GPL (integration difficult)
• Programming model specific, static
• Independent island
  o Not integrated with monitoring, file systems, or networks
• Limited fault tolerance support
  o Restart failed job
RM => Orchestrator
[Diagram: the Resource Manager acting as orchestrator, linked via an overlay network, pub-sub, and SCON to monitoring, console, database, file system, network, a provisioning agent, and the RM-launched applications]
Resource Management Emerging Needs
• Emerging Application Needs
  o Dynamic resource management
  o Application involvement in decisions, including fault notification
  o Workflow orchestration
• Emerging System Needs
  o Scalable operations
  o Cross-subsystem integration (RM, RAS, site utilities, …)
  o Data staging, burst buffers, persistent memory management
  o System notification requests, topology, QoS
  o Power management
Multi-Tiered Strategy
[Diagram: PMIx enables existing HPC RMs (SLURM, MOAB, PBSPro, LSF) and the Open Resilient Cluster Mgr reference implementation; surrounding technologies include RAS integration, Hadoop support, overlay network, containers (S&S), provisioning (WW), data analytics, and ULFM & fault prediction]
Reference Implementation: ORCM
• Scalable to exascale levels & beyond
  – Better-than-linear scaling
  – Constrained memory footprint
• Dynamically configurable
  – Sense and adapt, user-directable
  – On-the-fly updates
  – Fully utilize underlying hw
• Open source (non-GPL)
  – Support proprietary add-ons
• Maintainable, flexible
  – Plugin architecture
  – Easy extension for R&D
• Resilient
  – Self-heal around failures
  – Reintegrate recovered resources
Launch Scaling: Core Capability
[Plot, repeated for each stage: time (sec) spent in launch, initialization, exchange of MPI contact info, setup of MPI structures, barrier, and mpi_init completion barrier, from MPI_Init through MPI_Finalize; RRZ, 16 nodes, 8 ppn, rank=0]
• Stage I: Provide a method for the RM to share job info; work with fabric and library implementers to compute endpoints from RM info
• Stage II: Add on 1st communication (non-PMIx)
• Stage III: Use HSN, daemon “instant on”, collectives
Current Status
• Stage I-II
• Stage III (projected)
• Adoption: SLURM, LSF, Moab, PBSPro
Flexible Allocation Support
• Request additional resources
  o Compute, memory, network, NVM, burst buffer
  o Immediate, forecast
  o Expand existing allocation, separate allocation
• Return extra resources
  o No longer required
  o Will not be used for some specified time; reclaim (handshake) when ready to use
• Notification of preemption
  o Provide opportunity to cleanly pause
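The request/return handshake above can be sketched in a few lines of Python. This is a toy model, not any existing RM interface: `AllocRequest`, `MockRM`, and every field name are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical allocation-request record; field names are illustrative only.
@dataclass
class AllocRequest:
    resource: str            # "compute", "memory", "network", "nvm", "burst_buffer"
    count: int
    when: str = "immediate"  # or "forecast"
    expand_existing: bool = True   # expand current allocation vs. separate one

class MockRM:
    """Toy stand-in for an RM that grants requests and takes back idle resources."""
    def __init__(self):
        self.allocation = {}

    def request(self, req: AllocRequest) -> bool:
        # A real RM would schedule; here we simply grant the request.
        self.allocation[req.resource] = self.allocation.get(req.resource, 0) + req.count
        return True

    def release(self, resource: str, count: int, idle_until=None):
        # Handshake: resources flagged idle-until can be reclaimed now and
        # handed back when the application is ready to use them again.
        self.allocation[resource] -= count

rm = MockRM()
rm.request(AllocRequest("compute", 16))
rm.release("compute", 4)   # return extra resources no longer required
```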
I/O Support
• Asynchronous operations
  o Anticipatory data fetch, staging
  o Advise time to complete
  o Notify upon availability
• Storage policy requests
  o Hot/warm/cold data movement
  o Desired locations and striping/replication patterns
  o Persistence of files, shared memory regions across jobs, sessions
  o ACL to generated data across jobs, sessions
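The asynchronous-staging idea (anticipatory fetch plus notify-on-available) maps naturally onto a future-style API. The sketch below uses Python's standard `concurrent.futures`; `stage_in` and the path are hypothetical stand-ins for real data movement.

```python
import concurrent.futures
import time

# Toy sketch of anticipatory data staging: the application asks for data to
# be staged ahead of need and is notified when it becomes available.
def stage_in(path: str) -> str:
    time.sleep(0.01)   # stand-in for the actual data movement
    return f"{path} staged"

with concurrent.futures.ThreadPoolExecutor() as pool:
    fut = pool.submit(stage_in, "/scratch/input.dat")   # anticipatory fetch
    fut.add_done_callback(lambda f: print(f.result()))  # notify upon availability
    result = fut.result()   # app blocks only when it actually needs the data
```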
Spawn Support
• Staging support
  o Files, libraries required by new apps
  o Allow RM to consider in scheduling
    • Current location of data
    • Time to retrieve and position
    • Schedule scalable preload
  o Specified topology, min/max procs, …
• Provisioning requests
  o Allow RM to consider in selecting resources, minimize startup time due to provisioning
  o Desired image, packages
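A spawn request carrying these staging and provisioning hints might look like the sketch below, so the scheduler can weigh data location and image-provisioning cost. All field names, the image names, and the cost heuristic are invented for illustration.

```python
# Hypothetical spawn request with staging and provisioning hints.
spawn_request = {
    "executable": "analysis_app",
    "min_procs": 32,
    "max_procs": 128,
    "topology": "pack",                 # requested placement
    "stage": {
        "files": ["input.h5"],          # files/libraries the new app needs
        "preload": "scalable",          # let the RM schedule a scalable preload
    },
    "provision": {
        "image": "centos7-hpc",         # desired image
        "packages": ["libfabric"],      # required packages
    },
}

def provisioning_cost(node_image: str, req: dict) -> int:
    # RM-side heuristic: nodes already running the desired image start faster,
    # minimizing startup time due to provisioning.
    return 0 if node_image == req["provision"]["image"] else 1

# Pick the candidate image with the lowest provisioning cost.
best = min(["centos7-hpc", "sles12"], key=lambda img: provisioning_cost(img, spawn_request))
```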
Network Integration
• Quality of service requests
  o Bandwidth, traffic priority, power constraints
  o Multi-fabric failover, striping prioritization
  o Security requirements
• Network domain definitions, ACLs
• Notification requests
  o State-of-health
  o Update process endpoint upon fault recovery
• Topology information
  o Torus, dragonfly, …
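A QoS request of the kind listed above could be expressed as structured data handed to the fabric manager, with failover handled by walking the requested fabric order. The keys, fabric names, and `pick_fabric` helper are all hypothetical.

```python
# Hypothetical QoS request; keys mirror the bullet list above.
qos_request = {
    "bandwidth_gbps": 10,
    "traffic_priority": "high",
    "power_constraint_w": 50,
    "failover_fabrics": ["fabric0", "fabric1"],   # multi-fabric failover order
    "striping": ["fabric0", "fabric1"],           # striping prioritization
    "security": {"domain": "project-a", "acl": ["rank:*"]},
}

def pick_fabric(healthy: set, req: dict) -> str:
    """Return the first healthy fabric in the requested failover order."""
    for fabric in req["failover_fabrics"]:
        if fabric in healthy:
            return fabric
    raise RuntimeError("no healthy fabric available")

chosen = pick_fabric({"fabric1"}, qos_request)   # fabric0 is down; fail over
```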
Power Control/Management
• Scheduler analytics
  o Predict power usage, advise updates
• Application requests
  o Advise of changing workload requirements
  o Request changes in policy
  o Specify desired policy for spawned applications
• RM notifications
  o Need to change power policy
  o Preemption notification
    • Allow application to accept, request pause
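The RM-to-application power negotiation (notify, then accept or request a clean pause) can be sketched as a simple callback protocol. The class, method names, and policy strings are illustrative only.

```python
# Sketch of the RM/application power-policy negotiation described above.
class App:
    def __init__(self, can_pause: bool):
        self.can_pause = can_pause
        self.policy = "performance"
        self.paused = False

    def on_power_notification(self, new_policy: str) -> str:
        """RM notifies a need to change power policy; app replies."""
        if self.can_pause:
            self.paused = True      # cleanly pause rather than run throttled
            return "pause"
        self.policy = new_policy    # accept the RM's new policy
        return "accept"

app = App(can_pause=False)
reply = app.on_power_notification("power-cap-80pct")
```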
Fault Tolerance
• Notification
  o App can register for error notifications, incipient faults
• RM-app negotiate to determine response
  o App can notify RM of errors
    • RM will notify specified, registered procs
• Rapid application-driven checkpoint
  o Local on-node NVRAM, auto-stripe checkpoints
  o Policy-driven “bleed” to remote burst buffers and/or global file system
• Restart support
  o Specify source (remote NVM checkpoint, global filesystem, etc.)
  o Location hints/requests
  o Entire job, specific processes
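The register/notify pattern above, where only registered processes hear about a given fault class, boils down to a callback registry. This is a minimal sketch, not the real PMIx event interface; all names are invented.

```python
# Minimal sketch of error-notification registration: processes register
# callbacks per fault class, and the RM notifies only registered procs.
class ErrorManager:
    def __init__(self):
        self.handlers = {}   # fault class -> list of callbacks

    def register(self, fault_class: str, callback):
        self.handlers.setdefault(fault_class, []).append(callback)

    def notify(self, fault_class: str, detail: str):
        # Only procs registered for this fault class are notified.
        for cb in self.handlers.get(fault_class, []):
            cb(detail)

events = []
em = ErrorManager()
em.register("node-failure", lambda d: events.append(("checkpoint", d)))
em.register("incipient-fault", lambda d: events.append(("migrate", d)))
em.notify("node-failure", "node042 unreachable")
```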
Workflow Orchestration
• Growing range of workflow patterns
  o Traditional HPC: bulk synchronous, single application
  o Analytics: asynchronous, staged
  o Parametric: large number of simultaneous independent jobs
• Ability to dynamically spawn a job
  o Need flexibility – min/max size, rolling allocation, …
  o Events between jobs and cross-linking of I/O
  o Queuing of spawn requests
• Tool interfaces that work in these environments
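The flexible (min/max size) spawn and queuing behavior above can be modeled in a few lines. The `schedule` function and the request tuples are a toy model of rolling allocation, not a real scheduler.

```python
from collections import deque

# Toy model of queued dynamic-spawn requests: each job asks for a flexible
# size (min/max), the RM launches whatever fits, and queues the rest.
def schedule(free_nodes: int, requests):
    queue = deque(requests)          # (name, min_size, max_size)
    launched = []
    while queue:
        name, lo, hi = queue[0]
        if free_nodes < lo:
            break                    # head of queue cannot start yet
        grant = min(hi, free_nodes)  # give up to max, never below min
        launched.append((name, grant))
        free_nodes -= grant
        queue.popleft()
    return launched, list(queue)

launched, pending = schedule(10, [("sim", 4, 8), ("analytics", 4, 4)])
```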
Workload Manager: Job Description Language
• Complexity of describing a job is growing
  o Power, file/lib positioning
  o Performance vs capacity, programming model
  o System, project, application-level defaults
• Provide templates?
  o System defaults, with modifiers
    • --hadoop:mapper=foo,reducer=bar
    • User-defined
    • Application templates
    • Shared, group templates
  o Markup language definition of behaviors, priorities
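The template-with-modifiers idea, in the spirit of `--hadoop:mapper=foo,reducer=bar`, amounts to overlaying user key=value pairs onto a named system default. The template contents and field names below are invented for illustration.

```python
# Hypothetical system job templates; users overlay modifiers on the defaults.
SYSTEM_TEMPLATES = {
    "hadoop": {"mapper": "default_map", "reducer": "default_reduce", "power": "balanced"},
}

def apply_template(name: str, modifiers: str) -> dict:
    """Overlay 'key=value,key=value' modifiers onto a named template."""
    job = dict(SYSTEM_TEMPLATES[name])       # start from system defaults
    for pair in modifiers.split(","):
        key, _, value = pair.partition("=")
        job[key] = value                     # modifier overrides the default
    return job

job = apply_template("hadoop", "mapper=foo,reducer=bar")
```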
Flexible Architecture
• Each tool built on top of the same plugin system
  o Different combinations of frameworks
  o Different plugins activated to play different roles
  o Example: orcmd on compute node vs on rack/row controllers
• Designed for distributed, centralized, hybrid operations
  o Centralized for small clusters
  o Hybrid for larger clusters
  o Example: centralized scheduler, distributed “worker-bees”
• Accessible to users for interacting with RM
  o Add shim libraries (abstract, public APIs) to access framework APIs
  o Examples: SCON, pub-sub, in-flight analytics
Breaking it Down
• Workload Manager
  o Dedicated framework
  o Plugins for two-way integration to external WM (Moab, Cobalt)
  o Plugins for implementing internal WM (FIFO)
• Run-Time Environment
  o Broken down into functional blocks, each with its own framework
    • Loosely divided into three general categories: messaging, launch, error handling
    • One or more frameworks for each category
  o Knitted together via “state machine”
    • Event-driven, async
    • Each functional block can be a separate thread
    • Each plugin within each block can be separate thread(s)
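The event-driven, threaded "state machine" idea above can be sketched with one functional block per thread reacting to posted events. Block names and events are illustrative; a real RTE would route events among many blocks.

```python
import queue
import threading

# Minimal event-driven functional block: runs in its own thread and reacts
# to events posted on its queue, as sketched in the bullets above.
class Block(threading.Thread):
    def __init__(self, name, handler, results):
        super().__init__(daemon=True)
        self.events = queue.Queue()
        self.name, self.handler, self.results = name, handler, results

    def run(self):
        while True:
            event = self.events.get()
            if event is None:        # sentinel: shut the block down
                break
            self.results.append(self.handler(event))

results = []
launch = Block("launch", lambda ev: f"launched {ev}", results)
launch.start()
launch.events.put("job-17")   # post an event to the launch block
launch.events.put(None)
launch.join()
```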
Analytics Workflow Concept
[Diagram: sensors and other workflows feed an input module that converts data to a generalized format; workflow output flows to other workflows, RAS events, pub-sub, and the database. Available in SCON as well.]
Workflow Elements
• Average (window, running, etc.)
• Rate (convert incoming data to events/sec)
• Threshold (high, low)
• Filter
  o Selects input values based on provided params
• RAS event
  o Generates a RAS event corresponding to input description
• Publish data
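Two of these elements chained together, a window average feeding a high threshold that emits a RAS event, can be sketched as below. The plugin shapes and the sensor values are illustrative, not the ORCM plugin API.

```python
from collections import deque

# Sketch of chained workflow elements: window average -> threshold -> RAS event.
class WindowAverage:
    def __init__(self, window):
        self.buf = deque(maxlen=window)

    def step(self, value):
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)

class Threshold:
    def __init__(self, high):
        self.high = high

    def step(self, value):
        # Emit a RAS-style event when the smoothed value crosses the high mark.
        return ("RAS-event", value) if value > self.high else None

avg, thr = WindowAverage(window=3), Threshold(high=70.0)
events = []
for temp in [60.0, 65.0, 80.0, 90.0]:   # e.g. a sensor temperature stream
    event = thr.step(avg.step(temp))
    if event:
        events.append(event)
```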
Analytics
• Execute on aggregator nodes for in-flight reduction
  o Sys admin defines, user can define (if permitted)
• Event-based state machine
  o Each workflow in own thread, own instance of each plugin
  o Branch and merge of workflows
  o Tap stream between workflow steps
  o Tap data streams (sensors, others)
• Event generation
  o Generate events/alarms
  o Specify data to be included (window)
Distributed Architecture
• Hierarchical, distributed approach for unlimited scalability
  o Utilize daemons on rack/row controllers
• Analysis done at each level of the hierarchy
  o Support rapid response to critical events
  o Distribute processing load
  o Minimize data movement
• RM’s error manager framework controls response
  o Based on specified policies
Fault Diagnosis
• Identify root cause and location
  o Sometimes obvious – e.g., when directly measured
  o Other times non-obvious
    • Multiple cascading impacts
    • Cause identified by multi-sensor correlations (indirect measurement)
    • Direct measurement yields early report of non-root cause
    • Example: power supply fails due to borderline cooling + high load
• Estimate severity
  o Safety issue, long-term damage, imminent failure
• Requires in-depth understanding of hardware
Fault Prediction: Methodology
• Exploit access to internals
  o Investigate optimal location, number of sensors
  o Embed intelligence, communications capability
• Integrate data from all available sources
  o Engineering design tests
  o Reliability life tests
  o Production qualification tests
• Utilize learning algorithms to improve performance
  o Both embedded, post-process
  o Seed with expert knowledge
Fault Prediction: Outcomes
• Continuous update of mean time to preventative maintenance
  o Feed into projected downtime planning
  o Incorporate into scheduling algorithm
• Alarm reports for imminent failures
  o Notify impacted sessions/applications
  o Plan/execute preemptive actions
• Store predictions
  o Algorithm improvement
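A continuously updated mean time to preventative maintenance can be as simple as a running mean over observed inter-maintenance intervals. This is a toy update rule for illustration; a production predictor would use the learning methods from the previous slide.

```python
# Toy running update of mean time to preventative maintenance (MTTPM),
# of the kind that could feed projected-downtime planning and scheduling.
class MTTPMEstimator:
    def __init__(self):
        self.n = 0
        self.mean_hours = 0.0

    def observe(self, interval_hours: float) -> float:
        """Fold one observed inter-maintenance interval into the running mean."""
        self.n += 1
        self.mean_hours += (interval_hours - self.mean_hours) / self.n
        return self.mean_hours

est = MTTPMEstimator()
for interval in [900.0, 1100.0, 1000.0]:
    mttpm = est.observe(interval)
```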
HPC Controls
Thank You!