www.cactuscode.org www.gridlab.org
HPC Issues
Thomas Radke, Max Planck Institute for Gravitational Physics, [email protected]
John Shalf, Lawrence Berkeley National Laboratory
Architectures and Programming Models
Parallel Architectures
Hardware Architectures for High Performance Computing and their Parallel Programming Models:
Shared Memory Systems (SMP): Shared Memory Programming
Distributed Memory Systems (MPP, Clusters): Message Passing
Vector Processing: Data Parallel Programming
Specialized Architectures
Grid Computing (The Grid)
Parallel Architectures Comparison
(Chart taken from http://www.top500.org/slides/2002/06/)
Shared Memory Architecture
Processors have direct access to global memory and I/O, which is accessed through a bus or a fast switching network.
A cache coherency protocol guarantees consistency of memory and I/O accesses.
Each processor also has its own local memory (cache).
Data structures are shared in the global address space.
Concurrent access to shared data must be synchronized.
Programming Models: Multithreading (thread libraries), OpenMP
[Figure: processors P0 ... Pn, each with a private cache, connected via a shared bus to global shared memory]
Multithreading Programming Model
A thread is an independent sequential flow of execution.
Threads can run in parallel in a multithreaded program.
All threads run in the same single process address space and share all of the process' resources (heap memory, file handles, process environment, etc.).
[Figure: threads inside a single process address space, sharing the process resources, managed by the OS / runtime system]
Threads are light-weight (each has only a private stack to maintain) and thus cheaper for the OS/runtime system to manage than multiple cooperating processes.
Programming with Threads
Thread management routines:
Thread creation and termination (create/exit/join)
Get thread ID
Thread scheduling (suspend/resume/yield)
Thread-specific data
Thread communication is done via synchronization primitives: locks/mutexes, semaphores, condition variables, barriers
Several platform-specific multithreading APIs and thread libraries exist which differ slightly from each other: Solaris threads, SGI threads, Windows Threads Library, OS/2 threads
The POSIX standard (P1003.4a draft, finalized as IEEE 1003.1c): Pthreads
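As an illustration of these routines, here is a minimal sketch using Python's threading module (chosen for brevity; a Pthreads program would use the C calls named in the comments). The thread and iteration counts are arbitrary choices for this example:

```python
import threading

counter = 0
lock = threading.Lock()          # analogous to a Pthreads mutex

def worker(iters):
    """Each thread increments the shared counter under the lock,
    so concurrent updates are serialized and none are lost."""
    global counter
    for _ in range(iters):
        with lock:               # pthread_mutex_lock / pthread_mutex_unlock
            counter += 1

# Thread creation and termination (create / join):
threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()                    # cf. pthread_create
for t in threads:
    t.join()                     # cf. pthread_join: wait for termination

print(counter)                   # 4 threads * 10000 increments = 40000
```

Without the lock, the increments of the shared counter could interleave and updates would be lost; this is exactly the synchronization problem the slide's primitives address.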
OpenMP
OpenMP: portable shared memory parallelism
Higher-level API for writing portable multithreaded applications
Provides a set of compiler directives and library routines for parallel application programmers
API bindings for Fortran, C, and C++
Standardizes the last 15 years of SMP practice
Supported by:
Hardware vendors (Intel, HP, SGI, IBM, SUN, Compaq)
Software tool vendors (KAI, PGI, PSR, APR, Absoft)
Application vendors (Fluent, NAG, DOE ASCI, Dash, etc.)
http://www.OpenMP.org
Programming in OpenMP
Insert compiler directives or pragmas in your code.
For C/C++ code:
#pragma omp construct [clause [clause]...]
For Fortran code:
C$OMP construct [clause [clause]...]
!$OMP construct [clause [clause]...]
*$OMP construct [clause [clause]...]
OpenMP constructs:
Parallel Regions (to create parallel threads)
Worksharing (for loop parallelization)
Data Environment clauses (to declare shared/private variables)
Synchronization (mutexes, atomic updates, critical sections, barriers)
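Conceptually, an OpenMP parallel loop with a reduction does the following. This is a Python sketch with explicit threads, purely to illustrate worksharing and reduction; in C the entire loop body would be covered by a single `#pragma omp parallel for reduction(+:total)` directive:

```python
import threading

def parallel_sum(data, nthreads=4):
    """Mimic '#pragma omp parallel for reduction(+: total)':
    loop iterations are divided among threads (worksharing) and each
    thread's private partial sum is combined at the end (reduction)."""
    partial = [0] * nthreads                 # private per-thread partial sums

    def worker(tid):
        # static-style schedule: thread tid handles every nthreads-th element
        partial[tid] = sum(data[tid::nthreads])

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                             # implicit barrier at region end
    return sum(partial)                      # the reduction step

print(parallel_sum(list(range(1, 101))))     # 5050
```

The point of the private partial sums is the same as OpenMP's reduction clause: threads never write a shared accumulator concurrently, so no explicit lock is needed in the hot loop.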
Writing OpenMP Applications
[Figure: the programmer and/or a parallelizing compiler inserts OpenMP directives into the program; the OpenMP program then undergoes performance tuning]
Program is built with OpenMP-enabled compiler flags.
Programmer explicitly adds OpenMP pragmas.
Fine tuning using OpenMP profiling and performance analysis tools.
Distributed Memory Architecture
Each processor has direct access only to its local memory.
Processors are connected via a high-speed interconnect.
Data structures must be decomposed.
Data exchange is done via explicit processor-to-processor communication: send/receive messages.
Programming Models:
The standard: MPI
Others: PVM, Express, P4, Chameleon, PARMACS, ...
[Figure: processors P0 ... Pn, each with its own local memory, connected via a communication interconnect]
Message Passing Interface
MPI provides:
Point-to-point communication
Collective operations: barrier synchronization, gather/scatter operations, broadcast, reductions
Different communication modes: synchronous/asynchronous, blocking/non-blocking, buffered/unbuffered
Predefined and derived datatypes
Virtual topologies
One-sided communication (MPI-2)
Parallel I/O (MPI-2)
C/C++ and Fortran bindings
Message Passing Interface
MPI is a standard defined by the MPI Forum (http://www.mpi-forum.org), with many implementations for essentially all HPC systems:
Vendor MPI for specific platforms (native MPI)
Freely available implementations, e.g.:
MPICH: different devices optimized for different platforms, e.g. ch_p4, ch_shmem; http://www-unix.mcs.anl.gov/mpi/mpich/
LAM: http://www.lam-mpi.org/
MPI implementations also exist for Distributed/Grid Computing:
MPICH-G2: http://www3.niu.edu/mpi/
PACX: http://www.hlrs.de/organization/pds/projects/pacx-mpi/
Message Passing Example
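A minimal sketch of the message passing model, simulated inside one Python process with queues standing in for the network (in a real code the put/get pairs would be MPI_Send/MPI_Recv calls; the rank functions, tags, and data here are our own invention for illustration):

```python
import threading
import queue

# One mailbox per rank; put() plays the role of MPI_Send,
# get() the role of a blocking MPI_Recv.
mailbox = {0: queue.Queue(), 1: queue.Queue()}
result = {}

def rank0():
    """Rank 0 sends an array to rank 1, then waits for the reply."""
    mailbox[1].put(("data", [1.0, 2.0, 3.0]))   # cf. MPI_Send to rank 1
    tag, value = mailbox[0].get()               # cf. blocking MPI_Recv
    result[0] = (tag, value)

def rank1():
    """Rank 1 receives the array, reduces it, and sends the result back."""
    tag, data = mailbox[1].get()                # blocking receive
    mailbox[0].put(("sum", sum(data)))          # send the reduced value back

t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1)
t0.start(); t1.start()
t0.join(); t1.join()

print(result[0])    # ('sum', 6.0)
```

Each "rank" only ever touches its own variables; all data exchange goes through explicit send/receive operations, which is exactly the distributed-memory model described above.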
Parallelisation
Consider a finite difference scheme with a stencil width of 1.
A standard way to use multiple processors is to decompose the computational grid between the processors.
Message passing is then used to communicate between the different processors.
[Figure: the full computational grid, owned by a single processor (Proc 0)]
Parallelisation
For example, split the grid in an optimal way between the different processors; in this example, two processors.
If the grid is simply split between two processors, there is no longer enough data to update the blue point.
[Figure: the grid split between Proc 0 and Proc 1]
Parallelisation
To overcome this, ghostzones are used.
Ghostzones are additional (duplicated) grid points which are added to each processor, containing the data required to advance to the next iteration.
[Figure: the grid split between Proc 0 and Proc 1, each part extended by a ghostzone]
Parallelisation
The data on each grid must be periodically synchronised: the ghostzones are updated with current data from the neighbouring processor.
[Figure: ghostzone data exchanged between Proc 0 and Proc 1]
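The decomposition, ghostzone exchange, and stencil update described above can be sketched as follows (a serial Python toy in which two lists stand in for the two processors; in a real code the `synchronise` step would be a pair of MPI messages, and the grid values are arbitrary):

```python
# Two "processors" each own half of a 1-D grid, plus one ghost cell
# toward the shared boundary (stencil width 1).
N = 8
grid = [float(i * i) for i in range(N)]   # the full global grid

proc0 = grid[0:N // 2] + [0.0]            # last entry is the ghost cell
proc1 = [0.0] + grid[N // 2:N]            # first entry is the ghost cell

def synchronise(p0, p1):
    """Ghostzone exchange: copy the neighbour's boundary point into
    the local ghost cell (done with send/receive messages in practice)."""
    p0[-1] = p1[1]      # proc 1's first owned point -> proc 0's ghost
    p1[0] = p0[-2]      # proc 0's last owned point  -> proc 1's ghost

def smooth(local):
    """3-point averaging stencil on the interior of the local array;
    physical boundaries and ghost cells are not updated."""
    old = local[:]
    for i in range(1, len(local) - 1):
        local[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0

synchronise(proc0, proc1)   # ghostzones now hold current neighbour data
smooth(proc0)
smooth(proc1)

# Serial reference: the same stencil applied to the undivided grid
# must give the same interior values.
ref = grid[:]
old = ref[:]
for i in range(1, N - 1):
    ref[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0
print(proc0[1:4] == ref[1:4] and proc1[1:4] == ref[4:7])   # True
```

Without the synchronise step the ghost cells would still hold stale (zero) data and the points next to the split would be updated incorrectly, which is the "blue point" problem from the previous slide.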
HPC and I/O
[Figure: the HPC I/O software stack, from Applications and Data Models and Formats down through Scientific Data Libraries and Parallel I/O Libraries to Parallel Filesystems and HPC Hardware]
I/O in HPC Applications
Large-scale parallel applications generate data in the terabyte range which cannot be efficiently managed by traditional serial I/O schemes (the I/O bottleneck). Design your applications with parallel I/O in mind!
Datafiles should be interchangeable in a heterogeneous computing environment.
Make use of existing tools for data postprocessing and visualization.
Efficient support for checkpointing/recovery.
Need for high-performance I/O systems and techniques, scientific data libraries, and standard data representations.
HPC I/O Requirements
I/O in HPC applications should be:
Fast
Implement parallel I/O mode(s)
Use parallel I/O drivers and filesystems if available
Provide postprocessing tools for chunked data
Provide asynchronous/buffered I/O for overlapping with calculations
Dump data in native format
Automatically apply necessary datatype conversions (transparent to the application)
Flexible
General scientific data model applicable to a broad range of applications
Standard data representation
Self-describing datafiles
Portable
Generate platform-independent data files
Work in the same way on different machines: create data on one platform, post-process on another
Interface with standard post-processing / visualization packages
HDF5
Hierarchical Data Format, Version 5: http://hdf.ncsa.uiuc.edu/hdf5
Provides a general scientific data format specification and a supporting I/O library implementation
Simple but powerful data model: datasets + hierarchical groups
Self-describing data format (implicit and user-supplied metadata)
Predefined standard datatypes + user-defined datatypes
Efficient partial reads/writes of datasets (hyperslab selections)
Provides a parallel I/O driver (based on MPI I/O)
HDF5 API with different language bindings (C, C++, Fortran)
Thread-safe library code for multithreaded applications
Modular library design with a Virtual File Driver layer to plug in different low-level drivers (Stream, SRB, GASS, GridFtp, etc.)
Support and active development by NCSA's HDF5 group
Becoming more widely accepted by the HPC community as a standard
Serial vs. Parallel I/O
Serial I/O via a dedicated I/O processor:
The I/O processor becomes a bottleneck.
[Figure: P0 gathers data from all processors (P1 ... P3) and writes a single unchunked file]
Serial vs. Parallel I/O
Parallel chunked I/O:
System limit on the number of open files
Chunked data needs to be recombined
[Figure: all processors write concurrently, producing multiple chunked files (possibly distributed); recombination yields a single unchunked file]
Serial vs. Parallel I/O
Parallel unchunked I/O:
MPI I/O requires a parallel filesystem
[Figure: all processors write concurrently via MPI File I/O into a single unchunked file]
Remote Data Analysis
Data analysis example: visualization of output data
Run the visualization tools where the data is stored: the tools need to be installed on the remote side, and interactivity is often limited by bandwidth/latency issues.
Stage the data to be visualized to the local machine and run the visualization as usual: data staging requires a lot of bandwidth and local disk resources, and often transfers much more data than is actually needed.
Direct remote access to data: needs adequate remote I/O techniques (partial file access, subsampling, etc.).
Combination of remote rendering and local visualization: e.g. the Visapult application.
I/O Examples
Parallel I/O
In this example we just want to output fields from 2 processors, but it could be 2000.
Each processor could write its own data to disk.
The data is then usually moved to one place and "recombined" to produce a single coherent file.
[Figure: Proc 0 and Proc 1 each write their own chunk]
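A toy version of this chunked output and recombination (serial Python, with temporary files standing in for the per-processor chunk files; the filenames and the raw binary layout are our own choices for illustration):

```python
import os
import struct
import tempfile

# A 1-D field decomposed over 2 "processors".
field = [float(i) for i in range(8)]
chunks = [field[0:4], field[4:8]]

# Each processor writes its own chunk file (parallel chunked I/O).
filenames = []
for rank, chunk in enumerate(chunks):
    fd, name = tempfile.mkstemp(suffix=".chunk%d" % rank)
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack("%dd" % len(chunk), *chunk))  # raw doubles
    filenames.append(name)

# Recombination: a postprocessing tool reads the chunks back
# in rank order and rebuilds the single coherent array.
recombined = []
for name in filenames:
    with open(name, "rb") as f:
        data = f.read()
    recombined += list(struct.unpack("%dd" % (len(data) // 8), data))
    os.remove(name)

print(recombined == field)   # True
```

Rank order matters here: the recombination step must know how the domain was decomposed, which is why chunked files usually carry metadata about their position in the global grid.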
Parallel I/O
Alternatively, processor 0 can gather the data from the other processors and write it all to disk.
Usually a combination of these works best: let every nth processor gather data and write it to disk.
[Figure: Proc 1 sends its data to Proc 0, which writes to disk]
Checkpointing
Checkpointing & Recovery
What is Checkpointing & Recovery?
At a checkpoint, the running process is interrupted and its current program state is recorded in a checkpoint file, in order to be able to restart the program at that point at a later time.
A checkpoint file is a "snapshot picture" of a program which includes all necessary context information, such as: CPU state; relevant memory contents (registers, data, stack); information about opened files; the process environment.
At recovery, the state contained in the checkpoint file is restored by the recovery startup code, and the process resumes execution at the point where the checkpoint was created.
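User-level checkpointing (discussed later in this section) can be sketched as follows. This is a serial Python toy; a real simulation would dump its distributed state with parallel I/O, e.g. into an HDF5 file, and all names and state fields here are our own:

```python
import os
import pickle
import tempfile

# The program state: an iteration counter plus the evolved data.
state = {"iteration": 0, "data": [0.0] * 4}

def evolve(s):
    """One (trivial) time step of the 'simulation'."""
    s["iteration"] += 1
    s["data"] = [x + 1.0 for x in s["data"]]

def checkpoint(s, path):
    """Record a snapshot of the program state in a checkpoint file."""
    with open(path, "wb") as f:
        pickle.dump(s, f)

def recover(path):
    """Restore the state recorded in the checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")

for _ in range(3):
    evolve(state)
checkpoint(state, path)          # snapshot taken at iteration 3

for _ in range(2):               # run on; pretend the job is then killed
    evolve(state)

state = recover(path)            # restart from the checkpoint...
print(state["iteration"])        # ...back at iteration 3
```

The two iterations after the checkpoint are lost and must be recomputed after recovery; the checkpoint interval is therefore a trade-off between I/O cost and the amount of work at risk.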
Checkpointing & Recovery
Why is it needed?
Job swapping: jobs run out of CPU time in a queue, or are about to exceed other allocated resources (available memory, disk space, etc.)
Fault-tolerant applications: back up the current state before a process enters a critical section (where it might crash); periodically checkpoint applications with nondeterministic behaviour
Data replication: for parameter studies, calculate compute-intensive initial data only once
Job migration: move a job to bigger/faster/better resources as they become available
Checkpointing & Recovery
How to enable applications for checkpointing/recovery:
OS / runtime system driven
The runtime system sends a checkpoint signal to freeze the application; the complete memory image is written to persistent storage (like a core dump), along with other process information.
Example: Condor (http://www.cs.wisc.edu/condor)
The application just needs to be linked against the Condor runtime library
All system calls are wrapped by Condor
After checkpointing, Condor finds another available resource, moves the checkpoint there, and restarts the job
User-level checkpointing
The application itself decides when to checkpoint and what will be written to the checkpoint file
The application must be instrumented to use a checkpointing library, or provide its own checkpoint/recovery functionality
Requirements on Checkpointing
Same as those for I/O:
Fast and efficient
Use parallel file I/O to write checkpoint files
Dump only relevant state information (use application hints)
Flexible
Restart on a different number of processors
Portable
Checkpoint files should be platform-independent: checkpoint on one platform, recover on another
Hide system-specific details from the user
Checkpointing & Recovery Example
Checkpointing example: job migration
[Figure: a checkpointed job migrates between an SMP system, an MPP system, and a PC cluster]
Interactive Issues
Monitoring and Steering
Interactive monitoring: determine a job's state and progress; view current parameters; list used resources; how well (or badly) is it performing?
Computational steering: correct/adjust runtime-steerable parameters; modify algorithms; enable/disable output; kill the job
Online data visualization: analyze the data as it is being calculated
Shorten the traditional production cycle: submit job to batch system, wait for scheduling, job runs and generates data, analyse data
Remote Visualization Issues
Interactive visualization
Integrated steering capabilities
Individual parameters for each client: each user can customize what they are seeing (variables, viewpoints, etc.) or see exactly what someone else is seeing
Collaboration
User authentication: who can interact in which way with the application (monitor, access data, steer parameters)
Security issues (data encryption, port handling, etc.)
Non-intrusive interaction
Parallel streaming
Data reduction, sub-/downsampling
Data selection
Generic network protocols, standard data formats
Support both remote online and offline visualization with the same set of tools
Programming Frameworks
Frameworks
Provide the necessary infrastructure to connect existing software modules with different functionality in a single application:
Integrate software packages and libraries through a common interface
Manage scheduling of, and communication between, modules
Enable collaborative application development:
Minimize dependencies between independent modules
Define standardized module interfaces with clear APIs
Allow seamless interchange of different modules providing the same functionality
Easy integration of new modules as they are developed
Development of reusable, cross-problem-domain components to enable rapid application development:
Shorter time from problem inception to working parallel simulations
Portability across serial, distributed, and parallel architectures without the need to change source code
Why use a Framework?
Modular frameworks enable code sharing within a large collaboration of many people.
Scientists can focus on solving their science problems and let other people concentrate on what they know best: separate computational physics from CS infrastructure headaches, e.g. the make system, parallelization, I/O, checkpointing, scheduling, numerical methods, etc.
Community building:
Promote the open-source software philosophy
Make collaborative software development projects manageable
Simplify code development and reuse
Guarantee code portability
Improve code quality and performance by having it tested in the community
Deciding to use a Framework (I)
Does it suit your application's needs? Try it out with a simple problem, then find out if it can handle the most complicated thing you want to do.
Is it already used in your community?
How does it integrate your existing software components? Flexible generic interfaces; bindings for your favorite programming language(s); legacy code support
Is it available on the platforms/machines you want to use?
Is it open source?
Does it use standard technologies?
Are there performance issues? How does it scale on parallel architectures?
Deciding to use a Framework (II)
Is it supported and developed, and will it still be next year? Code versions and releases; bug tracking system; feedback to maintainers and developers
Is there documentation? Working examples, demos, tutorials, and other training material
How does it interact with other packages and tools? Parallel debuggers; profiling and performance analysis tools; monitoring and remote steering; visualization toolkits; communication libraries; I/O libraries; math packages