Parallel CC & Petaflop Applications
Ryan Olson, Cray, Inc.

TRANSCRIPT
Did you know …

- Teraflop - Current
- Petaflop - Imminent
- What's next? Exaflop, Zettaflop, YOTTAflop!
Outline

Sanibel Symposium:
- Programming Models
- Parallel CC Implementations
- Benchmarks
- Petascale Applications

This Talk:
- Distributed Data Interface
- GAMESS MP-CCSD(T)
- O vs. V
- Local & Many-Body Methods
Programming Models: The Distributed Data Interface (DDI)

- A programming interface, not a programming model.
- Choose the key functionality from the best programming models and provide a common interface that is simple, portable, and generally implementable.
- Provide an interface to:
  - SPMD: TCGMSG, MPI
  - AMOs: SHMEM, GA
  - SMPs: OpenMP, pThreads
  - SIMD: GPUs, vector directives, SSE, etc.
- Use the best models for the underlying hardware.
Overview

- GAMESS - application level
- Distributed Data Interface (DDI) - high-level API
- Implementation layer: SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP, System V IPC, and hardware APIs (Elan, GM, etc.) - spanning native and non-native implementations.
Programming Models: The Distributed Data Interface

Overview:
- Virtual shared-memory model (native)
- Cluster implementation (non-native)
- Shared memory / SMP awareness: clusters of SMPs (DDI versions 2-3)

Goal: multilevel parallelism
- Intra-/inter-node parallelism
- Maximize data locality
- Minimize latency / maximize bandwidth
Virtual Shared Memory Model

[Figure: a distributed matrix created with DDI_Create(Handle,NRows,NCols); its NCols columns are divided among CPU0-CPU3, each CPU holding one subpatch of the distributed memory storage.]

Key Point: the physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
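The model above - a matrix block-distributed by columns and accessed through one-sided GET/PUT/ACC - can be sketched in miniature. This is a toy, single-process model with hypothetical names, not the real DDI API:

```python
import numpy as np

class DistMatrix:
    """Toy model of a DDI-style distributed matrix (hypothetical API):
    columns are block-distributed over 'nproc' ranks, and all access
    goes through one-sided get/put/acc on the owning rank's block."""

    def __init__(self, nrows, ncols, nproc):
        self.nproc = nproc
        # Rank p owns columns edges[p] .. edges[p+1]-1 (a "subpatch").
        self.edges = [ncols * p // nproc for p in range(nproc + 1)]
        self.blocks = [np.zeros((nrows, self.edges[p + 1] - self.edges[p]))
                       for p in range(nproc)]

    def _owner(self, col):
        # Which rank owns this column?
        for p in range(self.nproc):
            if self.edges[p] <= col < self.edges[p + 1]:
                return p

    def put(self, col, data):   # one-sided write
        p = self._owner(col)
        self.blocks[p][:, col - self.edges[p]] = data

    def get(self, col):         # one-sided read
        p = self._owner(col)
        return self.blocks[p][:, col - self.edges[p]].copy()

    def acc(self, col, data):   # one-sided accumulate (+=)
        p = self._owner(col)
        self.blocks[p][:, col - self.edges[p]] += data
```

In the real library each block would live in a different process's memory, and get/put/acc would be remote operations; the ownership arithmetic is the part this sketch is meant to show.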
Non-Native Implementations (and lost opportunities …)

[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3). Compute processes 0-3 issue GET, PUT, and ACC(+=) operations against the distributed memory storage, which is held on separate data-server processes 4-7.]
DDI until 2003 …
System V Shared Memory (Fast Model)

[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3), with compute processes 0-3 and data servers 4-7. The distributed memory storage is placed in System V shared memory segments, against which GET, PUT, and ACC(+=) operate.]
DDI v2 - Full SMP Awareness

[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3), with compute processes 0-3 and data servers 4-7. GET, PUT, and ACC(+=) target the distributed memory storage held in separate System V shared memory segments.]
Proof of Principle - 2003

UMP2 gradient calculation, 380 basis functions, on a dual AMD MP2200 cluster using the SCI network (2003 results). Times by core count:

| Cores | 8 | 16 | 32 | 64 | 96 |
|---|---|---|---|---|---|
| DDI v2 | 18283 | 12978 | 8024 | 5034 | 3718 |
| DDI-Fast | 27400 | 19534 | 14809 | 11424 | 9010 |
| DDI v1 | 109839 | 95627 | 85972 | N/A | - |

Note: DDI v1 was especially problematic on the SCI network.
DDI v2

- The DDI library is SMP aware.
- It offers new interfaces to make applications SMP aware.
- DDI programs inherit improvements in the library.
- DDI programs do not automatically become SMP aware unless they use the new interfaces.
Parallel CC and Threads (Shared Memory Parallelism)

- Bentz and Kendall: parallel BLAS3, WOMPAT '05
- OpenMP: parallelized the remaining terms; proof of principle
Results

- Au4 ==> GOOD
  - CCSD and (T) cost about the same
  - No disk I/O problems
  - Both CCSD and (T) scale well
- Au+(C3H6) ==> POOR/AVERAGE
  - CCSD scales poorly due to the I/O vs. FLOP balance
  - (T) scales well, but is overshadowed by the bad CCSD performance
- Au8 ==> GOOD
  - CCSD scales reasonably (greater FLOP count, about equal I/O)
  - The N^7 (T) step dominates the relatively small time for CCSD
  - (T) scales well, so the overall performance is good
Detailed Speedups …

Au4:

| Threads | CCSD | T3WT2 | T3SQTOT | (T) | CCSD(T) |
|---|---|---|---|---|---|
| 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 1.91 | 1.80 | 2.17 | 1.88 | 1.90 |
| 4 | 3.18 | 3.55 | 4.20 | 3.70 | 3.39 |
| 8 | 4.60 | 5.30 | 6.29 | 5.52 | 4.97 |

Au+(C3H6):

| Threads | CCSD | T3WT2 | T3SQTOT | (T) | CCSD(T) |
|---|---|---|---|---|---|
| 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 8 | 1.99 | 5.40 | 6.07 | 5.52 | 2.61 |

Au8:

| Threads | CCSD | T3WT2 | T3SQTOT | (T) | CCSD(T) |
|---|---|---|---|---|---|
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 8 | 4.5 | 5.6 | 6.8 | 5.8 | 5.2 |
DDI v3 - Shared Memory for ALL

[Figure: compute processes and data servers on each node share the node's aggregate distributed storage.]

Memory hierarchy per node:
- Replicated storage: ~500 MB - 1 GB
- Shared memory: ~1 GB - 12 GB
- Distributed memory: ~10 - 1000 GB
DDI v3

- Memory hierarchy: replicated, shared, and distributed
- Program models:
  - Traditional DDI
  - Multilevel model
  - DDI groups (a different talk)
- Multilevel models:
  - Intra-/inter-node parallelism
  - A superset of the MPI/OpenMP and/or MPI/pThreads models
  - MPI lacks "true" one-sided messaging
Parallel Coupled Cluster (Topics)

- Data distribution for CCSD(T):
  - Integrals distributed
  - Amplitudes in shared memory, once per node
  - Direct [vv|vv] term
- Parallelism based on data locality
- First-generation algorithm: ignore I/O; focus on data and FLOP parallelism
Important Array Sizes (in GB)

Rows: o (occupied); columns: v (virtual).

[vv|oo], [vo|vo], T2:

| o \ v | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.1 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
| 15 | 0.2 | 0.3 | 0.4 | 0.6 | 0.8 | 1.1 | 1.4 | 1.7 |
| 20 | 0.3 | 0.5 | 0.7 | 1.1 | 1.5 | 1.9 | 2.4 | 3.0 |
| 25 | 0.4 | 0.7 | 1.2 | 1.7 | 2.3 | 3.0 | 3.8 | 4.7 |
| 30 | 0.6 | 1.1 | 1.7 | 2.4 | 3.3 | 4.3 | 5.4 | 6.7 |
| 35 | 0.8 | 1.5 | 2.3 | 3.3 | 4.5 | 5.8 | 7.4 | 9.1 |
| 40 | 1.1 | 1.9 | 3.0 | 4.3 | 5.8 | 7.6 | 9.7 | 11.9 |
| 45 | 1.4 | 2.4 | 3.8 | 5.4 | 7.4 | 9.7 | 12.2 | 15.1 |
| 50 | 1.7 | 3.0 | 4.7 | 6.7 | 9.1 | 11.9 | 15.1 | 18.6 |
| 55 | 2.0 | 3.6 | 5.6 | 8.1 | 11.0 | 14.4 | 18.3 | 22.5 |
| 60 | 2.4 | 4.3 | 6.7 | 9.7 | 13.1 | 17.2 | 21.7 | 26.8 |

[vv|vo]:

| o \ v | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 10 | 1 | 2 | 5 | 8 | 13 | 19 | 27 | 37 |
| 15 | 2 | 4 | 7 | 12 | 19 | 29 | 41 | 56 |
| 20 | 2 | 5 | 9 | 16 | 26 | 38 | 54 | 75 |
| 25 | 3 | 6 | 12 | 20 | 32 | 48 | 68 | 93 |
| 30 | 3 | 7 | 14 | 24 | 38 | 57 | 82 | 112 |
| 35 | 4 | 8 | 16 | 28 | 45 | 67 | 95 | 131 |
| 40 | 4 | 10 | 19 | 32 | 51 | 76 | 109 | 149 |
| 45 | 5 | 11 | 21 | 36 | 58 | 86 | 122 | 168 |
| 50 | 5 | 12 | 23 | 40 | 64 | 95 | 136 | 186 |
| 55 | 6 | 13 | 26 | 44 | 70 | 105 | 150 | 205 |
| 60 | 6 | 14 | 28 | 48 | 77 | 115 | 163 | 224 |
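The table entries are consistent with simple size formulas: o²v² double-precision words for the [vv|oo]/[vo|vo]/T2 arrays, and roughly v³o/2 words for [vv|vo] (the factor of 1/2 is my inference of a pair-index symmetry; the slide does not state it):

```python
def gib(nwords, bytes_per_word=8):
    """Size in GiB of nwords double-precision numbers."""
    return nwords * bytes_per_word / 2**30

def t2_size(o, v):
    # [vv|oo], [vo|vo], T2 amplitudes: o^2 * v^2 words
    return gib(o * o * v * v)

def vvvo_size(o, v):
    # [vv|vo]: v^3 * o words, halved for an assumed index symmetry
    return gib(v**3 * o // 2)

print(round(t2_size(60, 1000), 1))   # matches the table's 26.8
print(round(vvvo_size(10, 1000)))    # matches the table's 37
```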
MO-Based Terms
Some code …

```fortran
      DO 123 I=1,NU
         IOFF=NO2U*(I-1)+1
         CALL RDVPP(I,NO,NU,TI)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,TI,NU2,ONE,
     &              T2(IOFF),NO2)
  123 CONTINUE
```

```fortran
      CALL TRMD(O2,TI,NU,NO,20)
      CALL TRMD(VR,TI,NU,NO,21)
      CALL VECMUL(O2,NO2U2,HALF)
      CALL ADT12(1,NO,NU,O1,O2,4)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,VR,NOU,O2,NOU,ONE,VL,NOU)
      CALL ADT12(2,NO,NU,O1,O2,4)
      CALL VECMUL(O2,NO2U2,TWO)
```

```fortran
      CALL TRMD(O2,TI,NU,NO,27)
      CALL TRMD(T2,TI,NU,NO,28)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
      CALL TRANMD(O2,NO,NU,NU,NO,23)
      CALL TRANMD(T2,NO,NU,NU,NO,23)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
```
MO Parallelization

[Figure: four processes, 0-3; each holds its own patches of [vo*|vo*], [vv|o*o*], and [vv|v*o*] (starred indices distributed) and updates its own portion of the T2 solution.]

Goal: disjoint updates to the solution matrix. Avoid locking/critical sections whenever possible.
Direct [VV|VV] Term

Phase 1 - compute, transform, contract, and distribute:

    do ν = 1,nshell
       do σ = 1,ν
          compute:   the AO integral block (µν|λσ)
          transform: (aν|λσ) = Σ_µ C_µa (µν|λσ)
          transform: v_ab^νσ = (aν|bσ) = Σ_λ C_λb (aν|λσ)
          contract:  I_ij^νσ = Σ_ab v_ab^νσ c_ij^ab
          PUT I_ij^νσ and I_ij^σν for each ij
       end do
    end do
    synchronize

Phase 2 - back-transform the local columns:

    for each "local" ij column do
       GET I_ij^νσ
       reorder: shell order --> AO order
       transform: I_ij^ab = Σ_νσ I_ij^νσ C_νa C_σb
       STORE in "local" solution vector
    end do

[Figure: the I_ij^νσ blocks form a distributed matrix over processes 0, 1, …, P-1, with atomic-orbital index pairs (11, 12, 13, …, N_bf²) along one axis and occupied index pairs (11, 21, 22, …, (NoNo)*) along the other.]
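Ignoring the shell blocking and the distributed PUT/GET, the core contractions of the direct [vv|vv] term can be sketched with dense tensors. This is a toy model, not the GAMESS code; the AO integrals and MO coefficients here are random stand-ins:

```python
import numpy as np

nbf, nv, no = 8, 5, 2                       # AO, virtual, occupied dims (toy)
rng = np.random.default_rng(1)
ao = rng.standard_normal((nbf,) * 4)        # (µν|λσ) AO integral stand-in
C = rng.standard_normal((nbf, nv))          # AO -> virtual MO coefficients
c2 = rng.standard_normal((no, no, nv, nv))  # amplitudes c_ij^ab

# Half-transform two AO indices to virtuals: v_ab^{νσ} = (aν|bσ)
v = np.einsum('ma,mnls,lb->abns', C, ao, C)

# Contract with the amplitudes: I_ij^{νσ} = Σ_ab v_ab^{νσ} c_ij^{ab}
I_ao = np.einsum('abns,ijab->ijns', v, c2)

# Back-transform the remaining AO indices: I_ij^{ab} = Σ_νσ I_ij^{νσ} C_νa C_σb
I_mo = np.einsum('ijns,na,sb->ijab', I_ao, C, C)
```

Doing the half-transform and contraction inside the shell loops is what lets the algorithm avoid ever storing the full [vv|vv] integral set.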
(T) Parallelism

- Trivial -- in theory
- [vv|vo] distributed
- v^3 work arrays: at large v, stored in shared memory
- Disjoint updates where both quantities are shared
Timings …

(H2O)6 prism, aug'-cc-pVTZ. Fastest timing: < 6 hours on 8x8 Power5. S = speedup, E = parallel efficiency; columns are processes per node.

1 Node:

| | 1 (S / E) | 2 (S / E) | 4 (S / E) | 8 (S / E) |
|---|---|---|---|---|
| CCSD-AO | 1.00 / 100% | 1.90 / 95% | 3.70 / 92% | 6.18 / 77% |
| CCSD-MO | 1.00 / 100% | 1.87 / 93% | 3.11 / 78% | 4.21 / 53% |
| CCSD-Total | 1.00 / 100% | 1.86 / 93% | 3.58 / 89% | 5.68 / 71% |
| Triples Correction (T) | 1.00 / 100% | 1.78 / 89% | 2.59 / 65% | 4.06 / 51% |

2 Nodes:

| | 1 (S / E) | 2 (S / E) | 4 (S / E) | 8 (S / E) |
|---|---|---|---|---|
| CCSD-AO | 2.00 / 100% | 3.76 / 94% | 7.43 / 93% | 12.31 / 77% |
| CCSD-MO | 1.38 / 69% | 2.46 / 62% | 4.10 / 51% | 6.21 / 39% |
| CCSD-Total | 1.88 / 94% | 3.34 / 84% | 6.53 / 82% | 9.56 / 60% |
| Triples Correction (T) | 1.94 / 97% | 3.38 / 85% | 4.73 / 59% | 7.13 / 45% |

3 Nodes:

| | 1 (S / E) | 2 (S / E) | 4 (S / E) | 8 (S / E) |
|---|---|---|---|---|
| CCSD-AO | 3.00 / 100% | 5.85 / 97% | 11.07 / 92% | 18.48 / 77% |
| CCSD-MO | 1.68 / 56% | 2.96 / 49% | 4.56 / 38% | 6.91 / 29% |
| CCSD-Total | 2.55 / 85% | 4.80 / 80% | 8.28 / 69% | 14.57 / 61% |
| Triples Correction (T) | 2.95 / 98% | 5.24 / 87% | 7.63 / 64% | 11.82 / 49% |
Improvements …

- Semi-direct [vv|vv] term (IKCUT)
- Concurrent MO terms
- Generalized amplitude storage
Semi-Direct [VV|VV] Term

The same shell-pair loop as the direct term, with the loops labeled:

    do ν = 1,nshell      ! I-SHELL
       do σ = 1,ν        ! K-SHELL
          compute, transform, contract, and PUT as in the direct [vv|vv] algorithm
       end do
    end do

- Define IKCUT.
- Store the batch if: LEN(I) + LEN(K) > IKCUT
- Automatic contention avoidance.
- Adjustable: from fully direct to fully conventional.
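The IKCUT heuristic - store a batch only when the shell pair is large enough to be worth the I/O, otherwise recompute it each iteration - can be sketched as a toy decision function (LEN read as shell size, which is my interpretation of the slide):

```python
def should_store(len_i: int, len_k: int, ikcut: int) -> bool:
    """Semi-direct criterion: store the integral batch to disk when the
    combined shell sizes exceed IKCUT; otherwise recompute it on the fly.
    ikcut small -> mostly conventional (store nearly everything);
    ikcut large -> mostly direct (store almost nothing)."""
    return len_i + len_k > ikcut

# Example: with IKCUT=8, only the large shell pairs are stored.
pairs = [(1, 1), (3, 4), (6, 6), (10, 3)]
stored = [p for p in pairs if should_store(*p, ikcut=8)]
```

One knob thus interpolates continuously between the fully direct and fully conventional limits, which is what the slide's "adjustable" bullet refers to.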
Semi-Direct [vv|vv] Timings

Water tetramer, aug'-cc-pVTZ. CCSD times vs. IKCUT:

| IKCUT | Direct | 12 | 8 | 6 | Save All |
|---|---|---|---|---|---|
| CCSD - 64 cores | 3122 | 2563 | 1805 | 1710 | 1702 |
| CCSD - 32 cores | 5076 | 4088 | 2620 | 2363 | |
| Storage (GB) | - | 7.6 | 18.8 | 21.3 | 25.6 |
| Seconds per MB - 64 | | 73 | 70 | 66 | 55 |
| Seconds per MB - 32 | | 129 | 131 | 127 | |

However: GPUs generate AOs much faster than they can be read off the disk.

Storage here: a shared NFS mount (a bad example). Local disk or a higher-quality parallel file system (Lustre, etc.) should perform better.
Concurrency

Should everything be N-ways parallel? NO.

Biggest mistake: parallelizing every MO term over all cores.

Fix: concurrency.
Concurrent MO Terms

[Figure: the nodes are split between the MO terms and the [vv|vv] term.]

- MO terms are parallelized over the minimum number of nodes while still efficient and fast.
- MO nodes join the [vv|vv] term already in progress … dynamic load balancing.
Adaptive Computing

- Self-adjusting / self-tuning:
  - Concurrent MO terms
  - Value of IKCUT
- Use the iterations to improve the calculation:
  - Adjust initial node assignments
  - Increase IKCUT
- Monte Carlo approach to tuning parameters.
Conclusions …

A good first start …
- [vv|vv] scales perfectly with node count
- multilevel parallelism
- adjustable I/O usage

A lot to do …
- improve intra-node memory bottlenecks
- concurrent MO terms
- generalized amplitude storage
- adaptive computing

Use the knowledge from these hand-coded methods to refine the CS structure in automated methods.
Acknowledgements

People: Mark Gordon, Mike Schmidt, Jonathan Bentz, Ricky Kendall, Alistair Rendell

Funding: DoE SciDAC, SCL (Ames Lab), APAC / ANU, NSF, MSI
Petaflop Applications (benchmarks, too)

- A petaflop = ~125,000 2.2 GHz AMD Opteron cores.
- O vs. V:
  - small O, big V ==> CBS limit
  - big O ==> see below
- Local and many-body methods:
  - FMO, EE-MB, etc. - use existing parallel methods
  - Sampling
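The core-count estimate is consistent with about 4 double-precision flops per clock per Opteron core (the per-core rate is my assumption; the slide does not state it):

```python
clock_hz = 2.2e9          # 2.2 GHz Opteron
flops_per_cycle = 4       # assumed: 2-wide SSE2 multiply + add per cycle
per_core = clock_hz * flops_per_cycle     # 8.8 GF/s per core
cores = 1e15 / per_core                   # cores needed for 1 petaflop/s
print(round(cores))       # roughly the slide's ~125,000
```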