Parallel CC & Petaflop Applications
Ryan Olson, Cray, Inc.

TRANSCRIPT
Did you know …

- Teraflop - Current
- Petaflop - Imminent
- What's next? Exaflop, Zettaflop, YOTTAflop!
Outline

Sanibel Symposium:
- Programming Models
- Parallel CC Implementations
- Benchmarks
- Petascale Applications

This Talk:
- Distributed Data Interface
- GAMESS MP-CCSD(T)
- O vs. V
- Local & Many-Body Methods
Programming Models: The Distributed Data Interface (DDI)

- A programming interface, not a programming model.
- Choose the key functionality from the best programming models and provide a common interface that is simple, portable, and generally implementable.
- Provide an interface to:
  - SPMD: TCGMSG, MPI
  - AMOs: SHMEM, GA
  - SMPs: OpenMP, pThreads
  - SIMD: GPUs, vector directives, SSE, etc.
- Use the best models for the underlying hardware.
Overview

- GAMESS - application level
- Distributed Data Interface (DDI) - high-level API
- Implementation layer: SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP, System V IPC, and hardware APIs (Elan, GM, etc.) - spanning native and non-native implementations.
Programming Models: The Distributed Data Interface

Overview:
- Virtual shared-memory model (native)
- Cluster implementation (non-native)
- Shared memory / SMP awareness: clusters of SMPs (DDI versions 2-3)

Goal: multilevel parallelism
- Intra-/inter-node parallelism
- Maximize data locality
- Minimize latency / maximize bandwidth
Virtual Shared Memory Model

[Figure: a distributed matrix created with DDI_Create(Handle,NRows,NCols); its NCols columns are divided among CPU0-CPU3, each CPU holding one subpatch of the distributed memory storage.]

Key Point: the physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
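The model above - a matrix block-distributed by columns and accessed through one-sided GET/PUT/ACC - can be sketched in miniature. This is a toy, single-process model with hypothetical names, not the real DDI API:

```python
import numpy as np

class DistMatrix:
    """Toy model of a DDI-style distributed matrix (hypothetical API):
    columns are block-distributed over 'nproc' ranks, and all access
    goes through one-sided get/put/acc on the owning rank's block."""

    def __init__(self, nrows, ncols, nproc):
        self.nproc = nproc
        # Rank p owns columns edges[p] .. edges[p+1]-1 (a "subpatch").
        self.edges = [ncols * p // nproc for p in range(nproc + 1)]
        self.blocks = [np.zeros((nrows, self.edges[p + 1] - self.edges[p]))
                       for p in range(nproc)]

    def _owner(self, col):
        # Which rank owns this column?
        for p in range(self.nproc):
            if self.edges[p] <= col < self.edges[p + 1]:
                return p

    def put(self, col, data):   # one-sided write
        p = self._owner(col)
        self.blocks[p][:, col - self.edges[p]] = data

    def get(self, col):         # one-sided read
        p = self._owner(col)
        return self.blocks[p][:, col - self.edges[p]].copy()

    def acc(self, col, data):   # one-sided accumulate (+=)
        p = self._owner(col)
        self.blocks[p][:, col - self.edges[p]] += data
```

In the real library each block would live in a different process's memory, and get/put/acc would be remote operations; the ownership arithmetic is the part this sketch is meant to show.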
Non-Native Implementations (and lost opportunities …)

[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3). Compute processes 0-3 issue GET, PUT, and ACC(+=) operations against the distributed memory storage, which is held on separate data-server processes 4-7.]
DDI until 2003 …
System V Shared Memory (Fast Model)

[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3), with compute processes 0-3 and data servers 4-7. The distributed memory storage is placed in System V shared memory segments, against which GET, PUT, and ACC(+=) operate.]
DDI v2 - Full SMP Awareness

[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3), with compute processes 0-3 and data servers 4-7. GET, PUT, and ACC(+=) target the distributed memory storage held in separate System V shared memory segments.]
Proof of Principle - 2003

UMP2 gradient calculation, 380 basis functions, on a dual AMD MP2200 cluster using the SCI network (2003 results). Times by core count:

| Cores | 8 | 16 | 32 | 64 | 96 |
|---|---|---|---|---|---|
| DDI v2 | 18283 | 12978 | 8024 | 5034 | 3718 |
| DDI-Fast | 27400 | 19534 | 14809 | 11424 | 9010 |
| DDI v1 | 109839 | 95627 | 85972 | N/A | - |

Note: DDI v1 was especially problematic on the SCI network.
DDI v2

- The DDI library is SMP aware.
- It offers new interfaces to make applications SMP aware.
- DDI programs inherit improvements in the library.
- DDI programs do not automatically become SMP aware unless they use the new interfaces.
Parallel CC and Threads (Shared Memory Parallelism)

- Bentz and Kendall: parallel BLAS3, WOMPAT '05
- OpenMP: parallelized the remaining terms; proof of principle
Results

- Au4 ==> GOOD
  - CCSD and (T) cost about the same
  - No disk I/O problems
  - Both CCSD and (T) scale well
- Au+(C3H6) ==> POOR/AVERAGE
  - CCSD scales poorly due to the I/O vs. FLOP balance
  - (T) scales well, but is overshadowed by the bad CCSD performance
- Au8 ==> GOOD
  - CCSD scales reasonably (greater FLOP count, about equal I/O)
  - The N^7 (T) step dominates the relatively small time for CCSD
  - (T) scales well, so the overall performance is good
Detailed Speedups …

Au4:

| Threads | CCSD | T3WT2 | T3SQTOT | (T) | CCSD(T) |
|---|---|---|---|---|---|
| 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 1.91 | 1.80 | 2.17 | 1.88 | 1.90 |
| 4 | 3.18 | 3.55 | 4.20 | 3.70 | 3.39 |
| 8 | 4.60 | 5.30 | 6.29 | 5.52 | 4.97 |

Au+(C3H6):

| Threads | CCSD | T3WT2 | T3SQTOT | (T) | CCSD(T) |
|---|---|---|---|---|---|
| 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 8 | 1.99 | 5.40 | 6.07 | 5.52 | 2.61 |

Au8:

| Threads | CCSD | T3WT2 | T3SQTOT | (T) | CCSD(T) |
|---|---|---|---|---|---|
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 8 | 4.5 | 5.6 | 6.8 | 5.8 | 5.2 |
DDI v3 - Shared Memory for ALL

[Figure: compute processes and data servers on each node share the node's aggregate distributed storage.]

Memory hierarchy per node:
- Replicated storage: ~500 MB - 1 GB
- Shared memory: ~1 GB - 12 GB
- Distributed memory: ~10 - 1000 GB
DDI v3

- Memory hierarchy: replicated, shared, and distributed
- Program models:
  - Traditional DDI
  - Multilevel model
  - DDI groups (a different talk)
- Multilevel models:
  - Intra-/inter-node parallelism
  - A superset of the MPI/OpenMP and/or MPI/pThreads models
  - MPI lacks "true" one-sided messaging
Parallel Coupled Cluster (Topics)

- Data distribution for CCSD(T):
  - Integrals distributed
  - Amplitudes in shared memory, once per node
  - Direct [vv|vv] term
- Parallelism based on data locality
- First-generation algorithm: ignore I/O; focus on data and FLOP parallelism
Important Array Sizes (in GB)

Rows: o (occupied); columns: v (virtual).

[vv|oo], [vo|vo], T2:

| o \ v | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.1 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
| 15 | 0.2 | 0.3 | 0.4 | 0.6 | 0.8 | 1.1 | 1.4 | 1.7 |
| 20 | 0.3 | 0.5 | 0.7 | 1.1 | 1.5 | 1.9 | 2.4 | 3.0 |
| 25 | 0.4 | 0.7 | 1.2 | 1.7 | 2.3 | 3.0 | 3.8 | 4.7 |
| 30 | 0.6 | 1.1 | 1.7 | 2.4 | 3.3 | 4.3 | 5.4 | 6.7 |
| 35 | 0.8 | 1.5 | 2.3 | 3.3 | 4.5 | 5.8 | 7.4 | 9.1 |
| 40 | 1.1 | 1.9 | 3.0 | 4.3 | 5.8 | 7.6 | 9.7 | 11.9 |
| 45 | 1.4 | 2.4 | 3.8 | 5.4 | 7.4 | 9.7 | 12.2 | 15.1 |
| 50 | 1.7 | 3.0 | 4.7 | 6.7 | 9.1 | 11.9 | 15.1 | 18.6 |
| 55 | 2.0 | 3.6 | 5.6 | 8.1 | 11.0 | 14.4 | 18.3 | 22.5 |
| 60 | 2.4 | 4.3 | 6.7 | 9.7 | 13.1 | 17.2 | 21.7 | 26.8 |

[vv|vo]:

| o \ v | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 10 | 1 | 2 | 5 | 8 | 13 | 19 | 27 | 37 |
| 15 | 2 | 4 | 7 | 12 | 19 | 29 | 41 | 56 |
| 20 | 2 | 5 | 9 | 16 | 26 | 38 | 54 | 75 |
| 25 | 3 | 6 | 12 | 20 | 32 | 48 | 68 | 93 |
| 30 | 3 | 7 | 14 | 24 | 38 | 57 | 82 | 112 |
| 35 | 4 | 8 | 16 | 28 | 45 | 67 | 95 | 131 |
| 40 | 4 | 10 | 19 | 32 | 51 | 76 | 109 | 149 |
| 45 | 5 | 11 | 21 | 36 | 58 | 86 | 122 | 168 |
| 50 | 5 | 12 | 23 | 40 | 64 | 95 | 136 | 186 |
| 55 | 6 | 13 | 26 | 44 | 70 | 105 | 150 | 205 |
| 60 | 6 | 14 | 28 | 48 | 77 | 115 | 163 | 224 |
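The table entries are consistent with simple size formulas: o²v² double-precision words for the [vv|oo]/[vo|vo]/T2 arrays, and roughly v³o/2 words for [vv|vo] (the factor of 1/2 is my inference of a pair-index symmetry; the slide does not state it):

```python
def gib(nwords, bytes_per_word=8):
    """Size in GiB of nwords double-precision numbers."""
    return nwords * bytes_per_word / 2**30

def t2_size(o, v):
    # [vv|oo], [vo|vo], T2 amplitudes: o^2 * v^2 words
    return gib(o * o * v * v)

def vvvo_size(o, v):
    # [vv|vo]: v^3 * o words, halved for an assumed index symmetry
    return gib(v**3 * o // 2)

print(round(t2_size(60, 1000), 1))   # matches the table's 26.8
print(round(vvvo_size(10, 1000)))    # matches the table's 37
```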
MO-Based Terms
Some code …

```fortran
      DO 123 I=1,NU
         IOFF=NO2U*(I-1)+1
         CALL RDVPP(I,NO,NU,TI)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,TI,NU2,ONE,
     &              T2(IOFF),NO2)
  123 CONTINUE
```

```fortran
      CALL TRMD(O2,TI,NU,NO,20)
      CALL TRMD(VR,TI,NU,NO,21)
      CALL VECMUL(O2,NO2U2,HALF)
      CALL ADT12(1,NO,NU,O1,O2,4)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,VR,NOU,O2,NOU,ONE,VL,NOU)
      CALL ADT12(2,NO,NU,O1,O2,4)
      CALL VECMUL(O2,NO2U2,TWO)
```

```fortran
      CALL TRMD(O2,TI,NU,NO,27)
      CALL TRMD(T2,TI,NU,NO,28)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
      CALL TRANMD(O2,NO,NU,NU,NO,23)
      CALL TRANMD(T2,NO,NU,NU,NO,23)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
```
MO Parallelization

[Figure: four processes, 0-3; each holds its own patches of [vo*|vo*], [vv|o*o*], and [vv|v*o*] (starred indices distributed) and updates its own portion of the T2 solution.]

Goal: disjoint updates to the solution matrix. Avoid locking/critical sections whenever possible.
Direct [VV|VV] Term

Phase 1 - compute, transform, contract, and distribute:

    do ν = 1,nshell
       do σ = 1,ν
          compute:   the AO integral block (µν|λσ)
          transform: (aν|λσ) = Σ_µ C_µa (µν|λσ)
          transform: v_ab^νσ = (aν|bσ) = Σ_λ C_λb (aν|λσ)
          contract:  I_ij^νσ = Σ_ab v_ab^νσ c_ij^ab
          PUT I_ij^νσ and I_ij^σν for each ij
       end do
    end do
    synchronize

Phase 2 - back-transform the local columns:

    for each "local" ij column do
       GET I_ij^νσ
       reorder: shell order --> AO order
       transform: I_ij^ab = Σ_νσ I_ij^νσ C_νa C_σb
       STORE in "local" solution vector
    end do

[Figure: the I_ij^νσ blocks form a distributed matrix over processes 0, 1, …, P-1, with atomic-orbital index pairs (11, 12, 13, …, N_bf²) along one axis and occupied index pairs (11, 21, 22, …, (NoNo)*) along the other.]
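Ignoring the shell blocking and the distributed PUT/GET, the core contractions of the direct [vv|vv] term can be sketched with dense tensors. This is a toy model, not the GAMESS code; the AO integrals and MO coefficients here are random stand-ins:

```python
import numpy as np

nbf, nv, no = 8, 5, 2                       # AO, virtual, occupied dims (toy)
rng = np.random.default_rng(1)
ao = rng.standard_normal((nbf,) * 4)        # (µν|λσ) AO integral stand-in
C = rng.standard_normal((nbf, nv))          # AO -> virtual MO coefficients
c2 = rng.standard_normal((no, no, nv, nv))  # amplitudes c_ij^ab

# Half-transform two AO indices to virtuals: v_ab^{νσ} = (aν|bσ)
v = np.einsum('ma,mnls,lb->abns', C, ao, C)

# Contract with the amplitudes: I_ij^{νσ} = Σ_ab v_ab^{νσ} c_ij^{ab}
I_ao = np.einsum('abns,ijab->ijns', v, c2)

# Back-transform the remaining AO indices: I_ij^{ab} = Σ_νσ I_ij^{νσ} C_νa C_σb
I_mo = np.einsum('ijns,na,sb->ijab', I_ao, C, C)
```

Doing the half-transform and contraction inside the shell loops is what lets the algorithm avoid ever storing the full [vv|vv] integral set.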
(T) Parallelism

- Trivial -- in theory
- [vv|vo] distributed
- v^3 work arrays: at large v, stored in shared memory
- Disjoint updates where both quantities are shared
Timings …

(H2O)6 prism, aug'-cc-pVTZ. Fastest timing: < 6 hours on 8x8 Power5. S = speedup, E = parallel efficiency; columns are processes per node.

1 Node:

| | 1 (S / E) | 2 (S / E) | 4 (S / E) | 8 (S / E) |
|---|---|---|---|---|
| CCSD-AO | 1.00 / 100% | 1.90 / 95% | 3.70 / 92% | 6.18 / 77% |
| CCSD-MO | 1.00 / 100% | 1.87 / 93% | 3.11 / 78% | 4.21 / 53% |
| CCSD-Total | 1.00 / 100% | 1.86 / 93% | 3.58 / 89% | 5.68 / 71% |
| Triples Correction (T) | 1.00 / 100% | 1.78 / 89% | 2.59 / 65% | 4.06 / 51% |

2 Nodes:

| | 1 (S / E) | 2 (S / E) | 4 (S / E) | 8 (S / E) |
|---|---|---|---|---|
| CCSD-AO | 2.00 / 100% | 3.76 / 94% | 7.43 / 93% | 12.31 / 77% |
| CCSD-MO | 1.38 / 69% | 2.46 / 62% | 4.10 / 51% | 6.21 / 39% |
| CCSD-Total | 1.88 / 94% | 3.34 / 84% | 6.53 / 82% | 9.56 / 60% |
| Triples Correction (T) | 1.94 / 97% | 3.38 / 85% | 4.73 / 59% | 7.13 / 45% |

3 Nodes:

| | 1 (S / E) | 2 (S / E) | 4 (S / E) | 8 (S / E) |
|---|---|---|---|---|
| CCSD-AO | 3.00 / 100% | 5.85 / 97% | 11.07 / 92% | 18.48 / 77% |
| CCSD-MO | 1.68 / 56% | 2.96 / 49% | 4.56 / 38% | 6.91 / 29% |
| CCSD-Total | 2.55 / 85% | 4.80 / 80% | 8.28 / 69% | 14.57 / 61% |
| Triples Correction (T) | 2.95 / 98% | 5.24 / 87% | 7.63 / 64% | 11.82 / 49% |
Improvements …

- Semi-direct [vv|vv] term (IKCUT)
- Concurrent MO terms
- Generalized amplitude storage
Semi-Direct [VV|VV] Term

The same shell-pair loop as the direct term, with the loops labeled:

    do ν = 1,nshell      ! I-SHELL
       do σ = 1,ν        ! K-SHELL
          compute, transform, contract, and PUT as in the direct [vv|vv] algorithm
       end do
    end do

- Define IKCUT.
- Store the batch if: LEN(I) + LEN(K) > IKCUT
- Automatic contention avoidance.
- Adjustable: from fully direct to fully conventional.
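The IKCUT heuristic - store a batch only when the shell pair is large enough to be worth the I/O, otherwise recompute it each iteration - can be sketched as a toy decision function (LEN read as shell size, which is my interpretation of the slide):

```python
def should_store(len_i: int, len_k: int, ikcut: int) -> bool:
    """Semi-direct criterion: store the integral batch to disk when the
    combined shell sizes exceed IKCUT; otherwise recompute it on the fly.
    ikcut small -> mostly conventional (store nearly everything);
    ikcut large -> mostly direct (store almost nothing)."""
    return len_i + len_k > ikcut

# Example: with IKCUT=8, only the large shell pairs are stored.
pairs = [(1, 1), (3, 4), (6, 6), (10, 3)]
stored = [p for p in pairs if should_store(*p, ikcut=8)]
```

One knob thus interpolates continuously between the fully direct and fully conventional limits, which is what the slide's "adjustable" bullet refers to.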
Semi-Direct [vv|vv] Timings

Water tetramer, aug'-cc-pVTZ. CCSD times vs. IKCUT:

| IKCUT | Direct | 12 | 8 | 6 | Save All |
|---|---|---|---|---|---|
| CCSD - 64 cores | 3122 | 2563 | 1805 | 1710 | 1702 |
| CCSD - 32 cores | 5076 | 4088 | 2620 | 2363 | |
| Storage (GB) | - | 7.6 | 18.8 | 21.3 | 25.6 |
| Seconds per MB - 64 | | 73 | 70 | 66 | 55 |
| Seconds per MB - 32 | | 129 | 131 | 127 | |

However: GPUs generate AOs much faster than they can be read off the disk.

Storage here: a shared NFS mount (a bad example). Local disk or a higher-quality parallel file system (Lustre, etc.) should perform better.
Concurrency

Should everything be N-ways parallel? NO.

Biggest mistake: parallelizing every MO term over all cores.

Fix: concurrency.
Concurrent MO Terms

[Figure: the nodes are split between the MO terms and the [vv|vv] term.]

- MO terms are parallelized over the minimum number of nodes while still efficient and fast.
- MO nodes join the [vv|vv] term already in progress … dynamic load balancing.
Adaptive Computing

- Self-adjusting / self-tuning:
  - Concurrent MO terms
  - Value of IKCUT
- Use the iterations to improve the calculation:
  - Adjust initial node assignments
  - Increase IKCUT
- Monte Carlo approach to tuning parameters.
Conclusions …

A good first start …
- [vv|vv] scales perfectly with node count
- multilevel parallelism
- adjustable I/O usage

A lot to do …
- improve intra-node memory bottlenecks
- concurrent MO terms
- generalized amplitude storage
- adaptive computing

Use the knowledge from these hand-coded methods to refine the CS structure in automated methods.
Acknowledgements

People: Mark Gordon, Mike Schmidt, Jonathan Bentz, Ricky Kendall, Alistair Rendell

Funding: DoE SciDAC, SCL (Ames Lab), APAC / ANU, NSF, MSI
Petaflop Applications (benchmarks, too)

- A petaflop = ~125,000 2.2 GHz AMD Opteron cores.
- O vs. V:
  - small O, big V ==> CBS limit
  - big O ==> see below
- Local and many-body methods:
  - FMO, EE-MB, etc. - use existing parallel methods
  - Sampling
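The core-count estimate is consistent with about 4 double-precision flops per clock per Opteron core (the per-core rate is my assumption; the slide does not state it):

```python
clock_hz = 2.2e9          # 2.2 GHz Opteron
flops_per_cycle = 4       # assumed: 2-wide SSE2 multiply + add per cycle
per_core = clock_hz * flops_per_cycle     # 8.8 GF/s per core
cores = 1e15 / per_core                   # cores needed for 1 petaflop/s
print(round(cores))       # roughly the slide's ~125,000
```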