www.cactuscode.org www.gridlab.org
HPC Issues
Thomas Radke, Max Planck Institute for Gravitational Physics, [email protected]
John Shalf, Lawrence Berkeley National Laboratory
Architectures and Programming Models
Parallel Architectures
Hardware Architectures for High Performance Computing and their Parallel Programming Models:
Shared Memory Systems (SMP): Shared Memory Programming
Distributed Memory Systems (MPP, Clusters): Message Passing
Vector Processing: Data Parallel Programming
Specialized Architectures
Grid Computing (The Grid)
Parallel Architectures Comparison
(Chart taken from http://www.top500.org/slides/2002/06/)
Shared Memory Architecture
Processors have direct access to global memory and I/O, which is accessed through a bus or a fast switching network.
A cache coherency protocol guarantees consistency of memory and I/O accesses.
Each processor also has its own local memory (cache).
Data structures are shared in the global address space.
Concurrent access to shared data must be synchronized.
Programming Models: Multithreading (thread libraries), OpenMP
[Figure: processors P0 ... Pn, each with a private cache, connected via a shared bus to global shared memory]
Multithreading Programming Model
A thread is an independent sequential flow of execution.
Threads can run in parallel in a multithreaded program.
All threads run in the same single process address space and share all of the process' resources (heap memory, file handles, process environment, etc.).
[Figure: threads inside a single process address space, sharing the process resources, managed by the OS / runtime system]
Threads are light-weight (each has only a private stack to maintain) and thus cheaper for the OS/runtime system to manage than multiple cooperating processes.
Programming with Threads
Thread management routines:
Thread creation and termination (create/exit/join)
Get thread ID
Thread scheduling (suspend/resume/yield)
Thread-specific data
Thread communication is done via synchronization primitives: locks/mutexes, semaphores, condition variables, barriers
Several platform-specific multithreading APIs and thread libraries exist which differ slightly from each other: Solaris threads, SGI threads, Windows Threads Library, OS/2 threads
The POSIX standard (P1003.4a draft, finalized as IEEE 1003.1c): Pthreads
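As an illustration of these routines, here is a minimal sketch using Python's threading module (chosen for brevity; a Pthreads program would use the C calls named in the comments). The thread and iteration counts are arbitrary choices for this example:

```python
import threading

counter = 0
lock = threading.Lock()          # analogous to a Pthreads mutex

def worker(iters):
    """Each thread increments the shared counter under the lock,
    so concurrent updates are serialized and none are lost."""
    global counter
    for _ in range(iters):
        with lock:               # pthread_mutex_lock / pthread_mutex_unlock
            counter += 1

# Thread creation and termination (create / join):
threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()                    # cf. pthread_create
for t in threads:
    t.join()                     # cf. pthread_join: wait for termination

print(counter)                   # 4 threads * 10000 increments = 40000
```

Without the lock, the increments of the shared counter could interleave and updates would be lost; this is exactly the synchronization problem the slide's primitives address.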
OpenMP
OpenMP: portable shared memory parallelism
Higher-level API for writing portable multithreaded applications
Provides a set of compiler directives and library routines for parallel application programmers
API bindings for Fortran, C, and C++
Standardizes the last 15 years of SMP practice
Supported by:
Hardware vendors (Intel, HP, SGI, IBM, SUN, Compaq)
Software tool vendors (KAI, PGI, PSR, APR, Absoft)
Application vendors (Fluent, NAG, DOE ASCI, Dash, etc.)
http://www.OpenMP.org
Programming in OpenMP
Insert compiler directives or pragmas in your code.
For C/C++ code:
#pragma omp construct [clause [clause]...]
For Fortran code:
C$OMP construct [clause [clause]...]
!$OMP construct [clause [clause]...]
*$OMP construct [clause [clause]...]
OpenMP constructs:
Parallel Regions (to create parallel threads)
Worksharing (for loop parallelization)
Data Environment clauses (to declare shared/private variables)
Synchronization (mutexes, atomic updates, critical sections, barriers)
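Conceptually, an OpenMP parallel loop with a reduction does the following. This is a Python sketch with explicit threads, purely to illustrate worksharing and reduction; in C the entire loop body would be covered by a single `#pragma omp parallel for reduction(+:total)` directive:

```python
import threading

def parallel_sum(data, nthreads=4):
    """Mimic '#pragma omp parallel for reduction(+: total)':
    loop iterations are divided among threads (worksharing) and each
    thread's private partial sum is combined at the end (reduction)."""
    partial = [0] * nthreads                 # private per-thread partial sums

    def worker(tid):
        # static-style schedule: thread tid handles every nthreads-th element
        partial[tid] = sum(data[tid::nthreads])

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                             # implicit barrier at region end
    return sum(partial)                      # the reduction step

print(parallel_sum(list(range(1, 101))))     # 5050
```

The point of the private partial sums is the same as OpenMP's reduction clause: threads never write a shared accumulator concurrently, so no explicit lock is needed in the hot loop.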
Writing OpenMP Applications
[Figure: the programmer and/or a parallelizing compiler inserts OpenMP directives into the program; the OpenMP program then undergoes performance tuning]
Program is built with OpenMP-enabled compiler flags.
Programmer explicitly adds OpenMP pragmas.
Fine tuning using OpenMP profiling and performance analysis tools.
Distributed Memory Architecture
Each processor has direct access only to its local memory.
Processors are connected via a high-speed interconnect.
Data structures must be decomposed.
Data exchange is done via explicit processor-to-processor communication: send/receive messages.
Programming Models:
The standard: MPI
Others: PVM, Express, P4, Chameleon, PARMACS, ...
[Figure: processors P0 ... Pn, each with its own local memory, connected via a communication interconnect]
Message Passing Interface
MPI provides:
Point-to-point communication
Collective operations: barrier synchronization, gather/scatter operations, broadcast, reductions
Different communication modes: synchronous/asynchronous, blocking/non-blocking, buffered/unbuffered
Predefined and derived datatypes
Virtual topologies
One-sided communication (MPI-2)
Parallel I/O (MPI-2)
C/C++ and Fortran bindings
Message Passing Interface
MPI is a standard defined by the MPI Forum (http://www.mpi-forum.org), with many implementations for essentially all HPC systems:
Vendor MPI for specific platforms (native MPI)
Freely available implementations, e.g.:
MPICH: different devices optimized for different platforms, e.g. ch_p4, ch_shmem; http://www-unix.mcs.anl.gov/mpi/mpich/
LAM: http://www.lam-mpi.org/
MPI implementations also exist for Distributed/Grid Computing:
MPICH-G2: http://www3.niu.edu/mpi/
PACX: http://www.hlrs.de/organization/pds/projects/pacx-mpi/
Message Passing Example
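A minimal sketch of the message passing model, simulated inside one Python process with queues standing in for the network (in a real code the put/get pairs would be MPI_Send/MPI_Recv calls; the rank functions, tags, and data here are our own invention for illustration):

```python
import threading
import queue

# One mailbox per rank; put() plays the role of MPI_Send,
# get() the role of a blocking MPI_Recv.
mailbox = {0: queue.Queue(), 1: queue.Queue()}
result = {}

def rank0():
    """Rank 0 sends an array to rank 1, then waits for the reply."""
    mailbox[1].put(("data", [1.0, 2.0, 3.0]))   # cf. MPI_Send to rank 1
    tag, value = mailbox[0].get()               # cf. blocking MPI_Recv
    result[0] = (tag, value)

def rank1():
    """Rank 1 receives the array, reduces it, and sends the result back."""
    tag, data = mailbox[1].get()                # blocking receive
    mailbox[0].put(("sum", sum(data)))          # send the reduced value back

t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1)
t0.start(); t1.start()
t0.join(); t1.join()

print(result[0])    # ('sum', 6.0)
```

Each "rank" only ever touches its own variables; all data exchange goes through explicit send/receive operations, which is exactly the distributed-memory model described above.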
Parallelisation
Consider a finite difference scheme with a stencil width of 1.
A standard way to use multiple processors is to decompose the computational grid between the processors.
Message passing is then used to communicate between the different processors.
[Figure: the full computational grid, owned by a single processor (Proc 0)]
Parallelisation
For example, split the grid in an optimal way between the different processors; in this example, two processors.
If the grid is simply split between two processors, there is no longer enough data to update the blue point.
[Figure: the grid split between Proc 0 and Proc 1]
Parallelisation
To overcome this, ghostzones are used.
Ghostzones are additional (duplicated) grid points which are added to each processor, containing the data required to advance to the next iteration.
[Figure: the grid split between Proc 0 and Proc 1, each part extended by a ghostzone]
Parallelisation
The data on each grid must be periodically synchronised: the ghostzones are updated with current data from the neighbouring processor.
[Figure: ghostzone data exchanged between Proc 0 and Proc 1]
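The decomposition, ghostzone exchange, and stencil update described above can be sketched as follows (a serial Python toy in which two lists stand in for the two processors; in a real code the `synchronise` step would be a pair of MPI messages, and the grid values are arbitrary):

```python
# Two "processors" each own half of a 1-D grid, plus one ghost cell
# toward the shared boundary (stencil width 1).
N = 8
grid = [float(i * i) for i in range(N)]   # the full global grid

proc0 = grid[0:N // 2] + [0.0]            # last entry is the ghost cell
proc1 = [0.0] + grid[N // 2:N]            # first entry is the ghost cell

def synchronise(p0, p1):
    """Ghostzone exchange: copy the neighbour's boundary point into
    the local ghost cell (done with send/receive messages in practice)."""
    p0[-1] = p1[1]      # proc 1's first owned point -> proc 0's ghost
    p1[0] = p0[-2]      # proc 0's last owned point  -> proc 1's ghost

def smooth(local):
    """3-point averaging stencil on the interior of the local array;
    physical boundaries and ghost cells are not updated."""
    old = local[:]
    for i in range(1, len(local) - 1):
        local[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0

synchronise(proc0, proc1)   # ghostzones now hold current neighbour data
smooth(proc0)
smooth(proc1)

# Serial reference: the same stencil applied to the undivided grid
# must give the same interior values.
ref = grid[:]
old = ref[:]
for i in range(1, N - 1):
    ref[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0
print(proc0[1:4] == ref[1:4] and proc1[1:4] == ref[4:7])   # True
```

Without the synchronise step the ghost cells would still hold stale (zero) data and the points next to the split would be updated incorrectly, which is the "blue point" problem from the previous slide.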
HPC and I/O
[Figure: the HPC I/O software stack, from Applications and Data Models and Formats down through Scientific Data Libraries and Parallel I/O Libraries to Parallel Filesystems and HPC Hardware]
I/O in HPC Applications
Large-scale parallel applications generate data in the terabyte range which cannot be efficiently managed by traditional serial I/O schemes (the I/O bottleneck). Design your applications with parallel I/O in mind!
Datafiles should be interchangeable in a heterogeneous computing environment.
Make use of existing tools for data postprocessing and visualization.
Efficient support for checkpointing/recovery.
Need for high-performance I/O systems and techniques, scientific data libraries, and standard data representations.
HPC I/O Requirements
I/O in HPC applications should be:
Fast
Implement parallel I/O mode(s)
Use parallel I/O drivers and filesystems if available
Provide postprocessing tools for chunked data
Provide asynchronous/buffered I/O for overlapping with calculations
Dump data in native format
Automatically apply necessary datatype conversions (transparent to the application)
Flexible
General scientific data model applicable to a broad range of applications
Standard data representation
Self-describing datafiles
Portable
Generate platform-independent data files
Work in the same way on different machines: create data on one platform, post-process on another
Interface with standard post-processing / visualization packages
HDF5
Hierarchical Data Format, Version 5: http://hdf.ncsa.uiuc.edu/hdf5
Provides a general scientific data format specification and a supporting I/O library implementation
Simple but powerful data model: datasets + hierarchical groups
Self-describing data format (implicit and user-supplied metadata)
Predefined standard datatypes + user-defined datatypes
Efficient partial reads/writes of datasets (hyperslab selections)
Provides a parallel I/O driver (based on MPI I/O)
HDF5 API with different language bindings (C, C++, Fortran)
Thread-safe library code for multithreaded applications
Modular library design with a Virtual File Driver layer to plug in different low-level drivers (Stream, SRB, GASS, GridFtp, etc.)
Support and active development by NCSA's HDF5 group
Becoming more widely accepted by the HPC community as a standard
Serial vs. Parallel I/O
Serial I/O via a dedicated I/O processor:
The I/O processor becomes a bottleneck.
[Figure: P0 gathers data from all processors (P1 ... P3) and writes a single unchunked file]
Serial vs. Parallel I/O
Parallel chunked I/O:
System limit on the number of open files
Chunked data needs to be recombined
[Figure: all processors write concurrently, producing multiple chunked files (possibly distributed); recombination yields a single unchunked file]
Serial vs. Parallel I/O
Parallel unchunked I/O:
MPI I/O requires a parallel filesystem
[Figure: all processors write concurrently via MPI File I/O into a single unchunked file]
Remote Data Analysis
Data analysis example: visualization of output data
Run the visualization tools where the data is stored: the tools need to be installed on the remote side, and interactivity is often limited by bandwidth/latency issues.
Stage the data to be visualized to the local machine and run the visualization as usual: data staging requires a lot of bandwidth and local disk resources, and often transfers much more data than is actually needed.
Direct remote access to data: needs adequate remote I/O techniques (partial file access, subsampling, etc.).
Combination of remote rendering and local visualization: e.g. the Visapult application.
I/O Examples
Parallel I/O
In this example we just want to output fields from 2 processors, but it could be 2000.
Each processor could write its own data to disk.
The data is then usually moved to one place and "recombined" to produce a single coherent file.
[Figure: Proc 0 and Proc 1 each write their own chunk]
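A toy version of this chunked output and recombination (serial Python, with temporary files standing in for the per-processor chunk files; the filenames and the raw binary layout are our own choices for illustration):

```python
import os
import struct
import tempfile

# A 1-D field decomposed over 2 "processors".
field = [float(i) for i in range(8)]
chunks = [field[0:4], field[4:8]]

# Each processor writes its own chunk file (parallel chunked I/O).
filenames = []
for rank, chunk in enumerate(chunks):
    fd, name = tempfile.mkstemp(suffix=".chunk%d" % rank)
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack("%dd" % len(chunk), *chunk))  # raw doubles
    filenames.append(name)

# Recombination: a postprocessing tool reads the chunks back
# in rank order and rebuilds the single coherent array.
recombined = []
for name in filenames:
    with open(name, "rb") as f:
        data = f.read()
    recombined += list(struct.unpack("%dd" % (len(data) // 8), data))
    os.remove(name)

print(recombined == field)   # True
```

Rank order matters here: the recombination step must know how the domain was decomposed, which is why chunked files usually carry metadata about their position in the global grid.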
Parallel I/O
Alternatively, processor 0 can gather the data from the other processors and write it all to disk.
Usually a combination of these works best: let every nth processor gather data and write it to disk.
[Figure: Proc 1 sends its data to Proc 0, which writes to disk]
Checkpointing
Checkpointing & Recovery
What is Checkpointing & Recovery?
At a checkpoint, the running process is interrupted and its current program state is recorded in a checkpoint file, in order to be able to restart the program at that point at a later time.
A checkpoint file is a "snapshot picture" of a program which includes all necessary context information, such as: CPU state; relevant memory contents (registers, data, stack); information about opened files; the process environment.
At recovery, the state contained in the checkpoint file is restored by the recovery startup code, and the process resumes execution at the point where the checkpoint was created.
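User-level checkpointing (discussed later in this section) can be sketched as follows. This is a serial Python toy; a real simulation would dump its distributed state with parallel I/O, e.g. into an HDF5 file, and all names and state fields here are our own:

```python
import os
import pickle
import tempfile

# The program state: an iteration counter plus the evolved data.
state = {"iteration": 0, "data": [0.0] * 4}

def evolve(s):
    """One (trivial) time step of the 'simulation'."""
    s["iteration"] += 1
    s["data"] = [x + 1.0 for x in s["data"]]

def checkpoint(s, path):
    """Record a snapshot of the program state in a checkpoint file."""
    with open(path, "wb") as f:
        pickle.dump(s, f)

def recover(path):
    """Restore the state recorded in the checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")

for _ in range(3):
    evolve(state)
checkpoint(state, path)          # snapshot taken at iteration 3

for _ in range(2):               # run on; pretend the job is then killed
    evolve(state)

state = recover(path)            # restart from the checkpoint...
print(state["iteration"])        # ...back at iteration 3
```

The two iterations after the checkpoint are lost and must be recomputed after recovery; the checkpoint interval is therefore a trade-off between I/O cost and the amount of work at risk.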
Checkpointing & Recovery
Why is it needed?
Job swapping: jobs run out of CPU time in a queue, or are about to exceed other allocated resources (available memory, disk space, etc.)
Fault-tolerant applications: back up the current state before a process enters a critical section (where it might crash); periodically checkpoint applications with nondeterministic behaviour
Data replication: for parameter studies, calculate compute-intensive initial data only once
Job migration: move a job to bigger/faster/better resources as they become available
Checkpointing & Recovery
How to enable applications for checkpointing/recovery:
OS / runtime system driven
The runtime system sends a checkpoint signal to freeze the application; the complete memory image is written to persistent storage (like a core dump), along with other process information.
Example: Condor (http://www.cs.wisc.edu/condor)
The application just needs to be linked against the Condor runtime library
All system calls are wrapped by Condor
After checkpointing, Condor finds another available resource, moves the checkpoint there, and restarts the job
User-level checkpointing
The application itself decides when to checkpoint and what will be written to the checkpoint file
The application must be instrumented to use a checkpointing library, or provide its own checkpoint/recovery functionality
Requirements on Checkpointing
Same as those for I/O:
Fast and efficient
Use parallel file I/O to write checkpoint files
Dump only relevant state information (use application hints)
Flexible
Restart on a different number of processors
Portable
Checkpoint files should be platform-independent: checkpoint on one platform, recover on another
Hide system-specific details from the user
Checkpointing & Recovery Example
Checkpointing example: job migration
[Figure: a checkpointed job migrates between an SMP system, an MPP system, and a PC cluster]
Interactive Issues
Monitoring and Steering
Interactive monitoring: determine a job's state and progress; view current parameters; list used resources; how well (or badly) is it performing?
Computational steering: correct/adjust runtime-steerable parameters; modify algorithms; enable/disable output; kill the job
Online data visualization: analyze the data as it is being calculated
Shorten the traditional production cycle: submit job to batch system, wait for scheduling, job runs and generates data, analyse data
Remote Visualization Issues
Interactive visualization
Integrated steering capabilities
Individual parameters for each client: each user can customize what they are seeing (variables, viewpoints, etc.) or see exactly what someone else is seeing
Collaboration
User authentication: who can interact in which way with the application (monitor, access data, steer parameters)
Security issues (data encryption, port handling, etc.)
Non-intrusive interaction
Parallel streaming
Data reduction, sub-/downsampling
Data selection
Generic network protocols, standard data formats
Support both remote online and offline visualization with the same set of tools
Programming Frameworks
Frameworks
Provide the necessary infrastructure to connect existing software modules with different functionality in a single application:
Integrate software packages and libraries through a common interface
Manage scheduling of, and communication between, modules
Enable collaborative application development:
Minimize dependencies between independent modules
Define standardized module interfaces with clear APIs
Allow seamless interchange of different modules providing the same functionality
Easy integration of new modules as they are developed
Development of reusable, cross-problem-domain components to enable rapid application development:
Shorter time from problem inception to working parallel simulations
Portability across serial, distributed, and parallel architectures without the need to change source code
Why use a Framework?
Modular frameworks enable code sharing within a large collaboration of many people.
Scientists can focus on solving their science problems and let other people concentrate on what they know best: separate computational physics from CS infrastructure headaches, e.g. the make system, parallelization, I/O, checkpointing, scheduling, numerical methods, etc.
Community building:
Promote the open-source software philosophy
Make collaborative software development projects manageable
Simplify code development and reuse
Guarantee code portability
Improve code quality and performance by having it tested in the community
Deciding to use a Framework (I)
Does it suit your application's needs? Try it out with a simple problem, then find out if it can handle the most complicated thing you want to do.
Is it already used in your community?
How does it integrate your existing software components? Flexible generic interfaces; bindings for your favorite programming language(s); legacy code support
Is it available on the platforms/machines you want to use?
Is it open source?
Does it use standard technologies?
Are there performance issues? How does it scale on parallel architectures?
Deciding to use a Framework (II)
Is it supported and developed, and will it still be next year? Code versions and releases; bug tracking system; feedback to maintainers and developers
Is there documentation? Working examples, demos, tutorials, and other training material
How does it interact with other packages and tools? Parallel debuggers; profiling and performance analysis tools; monitoring and remote steering; visualization toolkits; communication libraries; I/O libraries; math packages