Workshop on Parallelization of Coupled-Cluster Methods
Panel 1: Parallel efficiency
An incomplete list of thoughts
Bert de Jong
High Performance Software Development
Molecular Science Computing Facility
Overall hardware issues
- Computer power per node has increased
  - Growth in single-CPU speed has flattened out (but you never know!)
  - Multiple cores together tax the other hardware resources in a node
- Bandwidth and latency for the other major hardware resources lag far behind, limiting the flops we actually use
  - Memory: very difficult to feed the CPU; multiple cores further reduce per-core bandwidth
  - Network: data access considerably slower than memory; the speed of light is our enemy
  - Disk input/output: slowest of them all; disks spin only so fast
Dealing with memory
- The amounts of data needed in coupled cluster can be huge
  - Amplitudes: too large to store on a single node (except for T1)
    - Shared memory would be good, but will shared memory of hundreds of terabytes be feasible and accessible?
  - Integrals: recompute vs. store (on disk or in memory)
    - Can we avoid access to memory when recomputing?
- Coupled cluster has one advantage: it can easily be formulated as matrix multiplication
  - Can be very efficient: DGEMM on EMSL's 1.5 GHz Itanium-2 system reached over 95% of peak efficiency
  - As long as we can get all the needed data in memory!
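To illustrate the point about formulating coupled cluster as matrix multiplication: a minimal numpy sketch (dimensions and tensor names are hypothetical, not from any production code) showing that a typical tensor contraction over amplitudes and integrals can be flattened into a single DGEMM-style matrix-matrix multiply.

```python
import numpy as np

# Hypothetical small occupied/virtual dimensions.
no, nv = 4, 6

rng = np.random.default_rng(0)
# A T2-like amplitude block t[i,j,a,b] and an integral-like block v[b,c].
t = rng.standard_normal((no, no, nv, nv))
v = rng.standard_normal((nv, nv))

# Reference contraction: sum_b t[i,j,a,b] * v[b,c], done index by index.
ref = np.einsum('ijab,bc->ijac', t, v)

# The same contraction as ONE matrix multiply (DGEMM): group the
# uncontracted indices (i,j,a) into rows and contract over b.
out = (t.reshape(no * no * nv, nv) @ v).reshape(no, no, nv, nv)

assert np.allclose(ref, out)
```

The reshape costs nothing extra when the grouped indices are already contiguous in memory, which is why a well-blocked contraction can run near DGEMM peak.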
Dealing with networks
- With tens of terabytes of data on distributed-memory systems, getting data from remote nodes is inevitable
  - Can be a non-issue, as long as you can hide the communication behind computation
  - Fetch data while computing = one-sided communication; NWChem uses Global Arrays to accomplish this
- Issues:
  - Low bandwidth and high latency relative to increasing node speed
  - Non-uniform network
    - Cabling a full fat tree can be cost prohibitive
    - Effect of network topology
    - Fault resiliency of the network
  - Multiple cores must compete for a limited number of buses
  - Data contention increases with increasing node count
- Data locality, data locality, data locality
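The "fetch data while computing" idea above can be sketched as a double-buffering loop. This is a toy illustration in Python threads, not Global Arrays code: `fake_get` is a hypothetical stand-in for a one-sided get (e.g. GA's `ga_get`), and the helper thread stands in for the hardware-driven RMA transfer that proceeds while the main loop computes.

```python
import threading
import numpy as np

def fake_get(block_id):
    """Hypothetical stand-in for a one-sided get of a remote data patch."""
    return np.full((4, 4), float(block_id))

def overlapped_sum(n_blocks):
    """Double-buffering: prefetch block k+1 while computing on block k."""
    total = 0.0
    current = fake_get(0)
    for k in range(n_blocks):
        buf = {}
        fetcher = None
        if k + 1 < n_blocks:
            # Issue the get for the NEXT block before touching the current one.
            fetcher = threading.Thread(
                target=lambda nxt=k + 1: buf.update(data=fake_get(nxt)))
            fetcher.start()
        total += current.sum()       # compute while the transfer is in flight
        if fetcher is not None:
            fetcher.join()           # wait only if the transfer is still pending
            current = buf['data']
    return total

assert overlapped_sum(3) == 48.0     # 0*16 + 1*16 + 2*16
```

If the compute on each block takes at least as long as the transfer, the join returns immediately and the communication is fully hidden.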
Dealing with spinning disks
- Using local disk
  - Will only contain data needed by its own node
  - Can be fast enough if you put a large number of spindles behind it
  - And, again, if you can hide the I/O behind computation (pre-fetch)
  - With hundreds of thousands of disks, the chance of failure becomes significant; fault tolerance of the computation becomes an issue
- Using globally shared disk
  - Crucial when going to very large systems
  - Allows for large files shared by large numbers of nodes; Lustre file systems of petabytes are possible
  - Speed limited by the number of access points (hosts): a large number of reads and writes must be handled by a small number of hosts, creating lock and access contention
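The pre-fetch point above is the same overlap idea applied to disk: issue the next read while computing on the current chunk. A minimal sketch using Python's standard `concurrent.futures` (file names and the "compute" step are placeholders):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def _read(path):
    with open(path, 'rb') as f:
        return f.read()

def prefetched_read(paths):
    """Process a sequence of files, issuing the read of file k+1
    while 'computing' on file k, so the disk stays busy."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(_read, paths[0])
        for k in range(len(paths)):
            data = future.result()                 # current chunk, hopefully ready
            if k + 1 < len(paths):
                future = pool.submit(_read, paths[k + 1])  # prefetch next chunk
            total += len(data)                     # stand-in for real computation
    return total

# Demo on three small scratch files.
paths = []
for text in (b'aa', b'bbb', b'c'):
    fd, p = tempfile.mkstemp()
    os.write(fd, text)
    os.close(fd)
    paths.append(p)
assert prefetched_read(paths) == 6
for p in paths:
    os.remove(p)
```

A real integral-file reader would prefetch fixed-size records rather than whole files, but the overlap structure is the same.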
What about beyond 1 petaflop?
- Possibly hundreds of thousands of multicore nodes
  - How does one create a fat enough network between that many nodes?
- Possibly 32, 64, 128 or more cores per node
  - All cores simply cannot do the same thing anymore
    - Not enough memory bandwidth
    - Not enough network bandwidth
  - Heterogeneous computing within a node (CPU+GPU)
  - Designate nodes for certain tasks
    - Communication
    - Memory access, put and get
    - Recomputing integrals, hopefully using cache only
    - DGEMM operations
- Task scheduling will become an issue
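One common way to attack the task-scheduling problem is dynamic load balancing from a shared task counter, where idle workers draw the next task number atomically (the pattern behind shared-counter schemes such as GA's nxtask). A toy sketch with Python threads standing in for compute nodes; the `t * t` payload is a placeholder for a real tile of work:

```python
import itertools
import threading

class DynamicScheduler:
    """Minimal sketch of shared-counter dynamic load balancing:
    workers atomically draw the next task index until tasks run out."""
    def __init__(self, n_tasks):
        self._counter = itertools.count()
        self._n_tasks = n_tasks
        self._lock = threading.Lock()

    def next_task(self):
        with self._lock:                 # atomic fetch-and-increment
            t = next(self._counter)
        return t if t < self._n_tasks else None

def worker(sched, results):
    while (t := sched.next_task()) is not None:
        results.append(t * t)            # placeholder for a real work tile

sched = DynamicScheduler(10)
results = []
threads = [threading.Thread(target=worker, args=(sched, results))
           for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert sorted(results) == [t * t for t in range(10)]
```

On a real machine the counter itself becomes a contention hot spot at large node counts, which is exactly why scheduling is flagged above as a looming issue.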
WR Wiley Environmental Molecular Sciences Laboratory
A national scientific user facility integrating experimental and computational resources for
discovery and technological innovation