
Page 1

Updated: June 16, 2015 (mberger@nvidia)
Source: accc.riken.jp/wp-content/uploads/2015/06/MarkBerge.pdf

Page 2: Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated: ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResso, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, NAMD, OpenMM, SOP-GPU*. Great multi-GPU performance. Focus: dense (up to 16) GPU nodes and large numbers of GPU nodes.

QC: All key codes are ported or being optimized. GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, Quantum Espresso/PWscf, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, QUICK, Q-Chem, TeraChem*. Active GPU acceleration projects: CASTEP, CPMD, GAMESS, Gaussian, NWChem, ONETEP, Quantum Supercharger Library*, VASP, and more. Focus: GPU-accelerated math libraries and OpenACC directives.

green* = application where all of the workload runs on the GPU

Page 3: 3 Ways to Accelerate Applications

Libraries: "drop-in" acceleration
OpenACC directives: easily accelerate applications
Programming languages: maximum flexibility
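The "programming languages" path on this slide means writing GPU kernels directly. As a hedged illustration only (not taken from the deck), a minimal CUDA C SAXPY kernel has this shape:

    // Minimal CUDA C sketch of the "programming languages" path: a hand-written
    // kernel gives maximum flexibility over how work is mapped onto the GPU.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the sketch short
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);                // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }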

Page 4: GPU Accelerated Libraries: "Drop-in" Acceleration for Your Applications

Linear algebra (FFT, BLAS, SPARSE, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
Numerical & math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
Data structures & AI (sort, scan, zero sum): GPU AI for board games, GPU AI for path finding
Visual processing (image & video): NVIDIA NPP, NVIDIA Video Encode
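As a hedged companion sketch (again not from the deck), the "drop-in" library path replaces a host BLAS call with its cuBLAS counterpart, so the hand-written kernel from the previous example shrinks to a single library call (built with something like nvcc -lcublas):

    // "Drop-in" acceleration sketch: the SAXPY above becomes one cuBLAS call.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1 << 20;
        const float alpha = 2.0f;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  // y = alpha*x + y, computed on the GPU
        cudaDeviceSynchronize();
        printf("y[0] = %f\n", y[0]);                 // expect 4.0

        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }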

Page 5
Page 6: AMBER 14

Page 7: Courtesy of Scott Le Grand, from his GTC 2014 presentation.

Page 8: AMBER 14: Large P2P Impact, Smaller Boost-Clock Impact

AMBER 14 (ns/day) on 4x K40, P2P and boost clocks impact. DHFR NVE PME, 2 fs benchmark (CUDA 6.0, ECC off):
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 745 MHz, no P2P: 125.77
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 875 MHz (boost), no P2P: 132.97
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 745 MHz, P2P: 196.68
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 875 MHz (boost), P2P: 215.18
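The P2P gains above come from letting GPUs in the same node exchange data directly. As a hedged sketch (not part of AMBER itself), the CUDA runtime can report and enable that capability per GPU pair:

    // Sketch: query and enable peer-to-peer (P2P) access between GPUs in a node.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        for (int i = 0; i < ndev; ++i) {
            for (int j = 0; j < ndev; ++j) {
                if (i == j) continue;
                int can = 0;
                cudaDeviceCanAccessPeer(&can, i, j);      // 1 if GPU i can map GPU j's memory
                printf("GPU %d -> GPU %d: P2P %s\n", i, j, can ? "available" : "unavailable");
                if (can) {
                    cudaSetDevice(i);
                    cudaDeviceEnablePeerAccess(j, 0);     // direct GPU-GPU copies bypass host memory
                }
            }
        }
        return 0;
    }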

Page 9: Courtesy of Scott Le Grand, from his GTC 2014 presentation.

Page 10: AMBER 14 with P2P, Higher Density Nodes

AMBER 14 (ns/day) on K40 with P2P and boost clocks. DHFR NVE PME, 2 fs benchmark (CUDA 5.5, ECC off):
  2x Xeon E5-2690 v2 @ 3.00 GHz (CPU only): 23.79
  2x Xeon E5-2690 v2 @ 3.00 GHz + 1x Tesla K40 @ 875 MHz: 133.74
  2x Xeon E5-2690 v2 @ 3.00 GHz + 2x Tesla K40 @ 875 MHz: 200.58
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 875 MHz: 223.56

Page 11: AMBER 14 and K40 with P2P, the Fastest GPU Yet

AMBER 14 (ns/day) on K40 with P2P and boost clocks. Factor IX NPT PME, 2 fs benchmark (CUDA 5.5, ECC off):
  2x Xeon E5-2690 v2 @ 3.00 GHz (CPU only): 6.1
  2x Xeon E5-2690 v2 @ 3.00 GHz + 1x Tesla K40 @ 875 MHz: 37.15
  2x Xeon E5-2690 v2 @ 3.00 GHz + 2x Tesla K40 @ 875 MHz: 56.08
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 875 MHz: 60.27

Page 12: AMBER 14 and K40 with P2P, the Fastest GPU Yet

AMBER 14 (ns/day) on K40 with P2P and boost clocks. Cellulose NVE PME, 2 fs benchmark (CUDA 5.5, ECC off):
  2x Xeon E5-2690 v2 @ 3.00 GHz (CPU only): 1.32
  2x Xeon E5-2690 v2 @ 3.00 GHz + 1x Tesla K40 @ 875 MHz: 9.22
  2x Xeon E5-2690 v2 @ 3.00 GHz + 2x Tesla K40 @ 875 MHz: 14.03
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 875 MHz: 16.64

Page 13: AMBER 14 and K40 with P2P, the Fastest GPU Yet

AMBER 14 (ns/day) on K40 with P2P and boost clocks. Nucleosome GB, 2 fs benchmark (CUDA 5.5, ECC off):
  2x Xeon E5-2690 v2 @ 3.00 GHz (CPU only): 0.08
  2x Xeon E5-2690 v2 @ 3.00 GHz + 1x Tesla K40 @ 875 MHz: 3.97
  2x Xeon E5-2690 v2 @ 3.00 GHz + 2x Tesla K40 @ 875 MHz: 7.6
  2x Xeon E5-2690 v2 @ 3.00 GHz + 4x Tesla K40 @ 875 MHz: 13.77

Page 14: Kepler: Our Fastest Family of GPUs Yet, Running AMBER 14

The blue node contains dual E5-2697 CPUs (12 cores per CPU). The green nodes contain dual E5-2697 CPUs (12 cores per CPU) and either 1x or 2x NVIDIA K20X, K40, or K80 GPUs.

AMBER 14, SPFP DHFR (JAC) production NVE (ns/day):
  1x E5-2697 v2: 14.54 | + 1x Phi 5110P (offload): 4.08 | + 1x Phi 7120P (offload): 3.82 | + 1x Tesla K20X: 111.32 | + 1x Tesla K40 @ 875 MHz: 134.08 | + 2x Tesla K20X: 159.25 | + 1x Tesla K80 board: 175.43 | + 2x Tesla K40 @ 875 MHz: 196.69
  2x E5-2697 v2: 25.80 | + 1x K20X: 110.87 | + 1x K40: 132.68 | + 2x K20X: 159.06 | + 2x K40: 196.86

Page 15: Kepler: Our Fastest Family of GPUs Yet, Running AMBER 14

The blue node contains dual E5-2697 CPUs (12 cores per CPU). The green nodes contain dual E5-2697 CPUs (12 cores per CPU) and either 1x or 2x NVIDIA K20X, K40, or K80 GPUs.

AMBER 14, SPFP Factor IX production NVE (ns/day):
  1x E5-2697 v2: 3.70 | + 1x Phi 5110P (offload): 3.29 | + 1x Phi 7120P (offload): 3.35 | + 1x Tesla K20X: 32.45 | + 1x Tesla K40 @ 875 MHz: 38.65 | + 2x Tesla K20X: 46.58 | + 1x Tesla K80 board: 51.12 | + 2x Tesla K40 @ 875 MHz: 57.89
  2x E5-2697 v2: 6.87 | + 1x K20X: 32.30 | + 1x K40: 38.60 | + 2x K20X: 46.50 | + 2x K40: 57.83

Page 16: Kepler: Our Fastest Family of GPUs Yet, Running AMBER 14

The blue node contains dual E5-2697 CPUs (12 cores per CPU). The green nodes contain dual E5-2697 CPUs (12 cores per CPU) and either 1x or 2x NVIDIA K20X, K40, or K80 GPUs.

AMBER 14, SPFP Cellulose production NVE (ns/day):
  1x E5-2697 v2: 0.74 | + 1x Phi 5110P (offload): 1.50 | + 1x Phi 7120P (offload): 1.56 | + 1x Tesla K20X: 7.60 | + 1x Tesla K40 @ 875 MHz: 8.95 | + 2x Tesla K20X: 10.82 | + 1x Tesla K80 board: 11.86 | + 2x Tesla K40 @ 875 MHz: 13.29
  2x E5-2697 v2: 1.38 | + 1x K20X: 7.60 | + 1x K40: 8.95 | + 2x K20X: 10.83 | + 2x K40: 13.29

Page 17: Replace 8 Nodes with 1 K20 GPU

Cut simulation costs to roughly one quarter and gain higher performance. Running AMBER 12, GPU support revision 12.1, SPFP, with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU). The green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred hardware vendor for actual pricing.

DHFR benchmark:
  8 CPU-only nodes: 65.00 ns/day, about $32,000
  1 CPU node + 1x K20: 81.09 ns/day, about $6,500
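A quick check of the claim against the chart's own numbers: $6,500 is roughly one fifth of $32,000, while throughput rises from 65.00 to 81.09 ns/day (about 1.25x), so the cost per ns/day of throughput falls from roughly $490 to roughly $80.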

Page 18: Extra CPUs Can Decrease Performance

When used with GPUs, dual CPU sockets can perform worse than a single CPU socket. Running AMBER 12, GPU support revision 12.1. The orange bars use one E5-2687W CPU (8 cores per CPU); the blue bars use dual E5-2687W CPUs (8 cores per CPU).

[Chart: Cellulose NVE (ns/day), comparing CPU only against CPU with dual K20s, for 1x and 2x E5-2687W.]

Page 19: Kepler: Greener Science

The GPU-accelerated systems use 65-75% less energy. Energy expended = power x time; lower is better.

[Chart: energy expended (kJ) in simulating 1 ns of DHFR (JAC), running AMBER 12 GPU support revision 12.1, for CPU only, CPU + K10, CPU + K20, and CPU + K20X. The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235 W each).]
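The arithmetic behind the energy metric is simple. With illustrative round numbers (not the measured values behind this chart): a 600 W CPU-only node that needs one hour per nanosecond expends 600 W x 3,600 s = 2,160 kJ/ns, while an 835 W CPU+GPU node that finishes the same nanosecond in 15 minutes expends 835 W x 900 s = 751.5 kJ/ns, about 65% less despite the higher power draw.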

Page 20: Recommended GPU Node Configuration for AMBER Computational Chemistry

Workstation or single-node configuration:
  # of CPU sockets: 2
  Cores per CPU socket: 6+ (1 CPU core drives 1 GPU)
  CPU speed (GHz): 2.66+
  System memory per node (GB): 16
  GPUs: Kepler K20, K40, K80
  # of GPUs per CPU socket: 1-4
  GPU memory preference (GB): 6
  GPU-to-CPU connection: PCIe 3.0 x16 or higher
  Server storage: 2 TB
  Network configuration: InfiniBand QDR or better

Scale to multiple nodes with the same single-node configuration.
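As a hedged helper sketch (not part of AMBER), the CUDA runtime can report how many GPUs a node exposes and how much memory each has, which is an easy way to check a node against the recommendation above:

    // Sketch: list the GPUs in a node and their memory, to sanity-check a configuration.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        printf("GPUs in this node: %d\n", ndev);
        for (int d = 0; d < ndev; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("  GPU %d: %s, %.1f GB, %d SMs, PCI bus %d\n",
                   d, prop.name,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
                   prop.multiProcessorCount, prop.pciBusID);
        }
        return 0;
    }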

Page 21: Benefits of GPU-Accelerated AMBER Computing

Faster than CPU-only systems in all tests
Most major compute-intensive aspects of classical MD ported
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
The K20 GPU is our fastest and lowest-power high-performance GPU yet
Try GPU-accelerated AMBER for free: www.nvidia.com/GPUTestDrive

Page 22: NAMD 2.10

Page 23: Kepler: Universally Faster

The Kepler GPUs accelerate all simulations, up to 5.1x; the average acceleration is printed in the bars. Running NAMD 2.9. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU). The Kepler nodes contain dual E5-2687W CPUs (8 cores per CPU) and one or two NVIDIA K10, K20, or K20X GPUs.

[Chart: speedup compared to CPU only for F1-ATPase, ApoA1, and STMV on 1x K10, 1x K20, 1x K20X, 2x K10, 2x K20, and 2x K20X; the average accelerations printed in the bars are 2.4x, 4.7x, 2.9x, 2.6x, 4.3x, and 5.1x.]

Page 24: NAMD 2.10 (streaming patch)

Page 25: NAMD 2.10; ApoA1 on Intel Phi, Tesla K40s and K80s, and IVB CPUs

Simulation time (ns/day), 1 node:
  2x Xeon E5-2697 v2 (1 Ivy Bridge node): 2.00
  2x Xeon (1 Haswell node): 2.67
  2x Xeon E5-2697 v2 + 1x Tesla K40 @ 875 MHz: 5.48
  2x Xeon E5-2697 v2 + 0.5x Tesla K80 (autoboost): 4.60
  2x Xeon E5-2697 v2 + 1x Tesla K80 (autoboost): 8.32
  2x Xeon E5-2697 v2 + 2x Tesla K40 @ 875 MHz: 9.89
  2x Xeon E5-2697 v2 + 2x Tesla K80 (autoboost): 12.37

Page 26: NAMD 2.10; ApoA1 on Tesla K40s and IVB CPUs

Simulation time (ns/day), 2 to 8 nodes; CPUs are Xeon E5-2697 v2, GPUs are Tesla K40 @ 875 MHz:
  2 CPUs (2 nodes): 1.78 | + 2x K40: 5.31 | + 4x K40: 9.44
  4 CPUs (2 nodes): 3.21 | + 2x K40: 9.85 | + 4x K40: 14.38
  4 CPUs (4 nodes): 3.30 | + 4x K40: 15.72 | + 8x K40: 17.89
  8 CPUs (4 nodes): 5.65 | + 4x K40: 13.53 | + 8x K40: 14.99
  8 CPUs (8 nodes): 5.94 | + 8x K40: 16.60 | + 16x K40: 16.81
  16 CPUs (8 nodes): 9.99 | + 8x K40: 12.66 | + 16x K40: 12.33

Page 27: NAMD 2.10; ApoA1 on Tesla K80s and IVB CPUs

Simulation time (ns/day), 2 to 4 nodes; CPUs are Xeon E5-2697 v2, GPUs are Tesla K80 (autoboost):
  2 CPUs (2 nodes): 1.78 | + 0.5x K80: 4.66 | + 1x K80: 8.33 | + 2x K80: 10.80
  4 CPUs (2 nodes): 3.21 | + 0.5x K80: 8.55 | + 1x K80: 12.77 | + 2x K80: 15.21
  4 CPUs (4 nodes): 3.30 | + 0.5x K80: 14.27 | + 1x K80: 16.89 | + 2x K80: 17.75
  8 CPUs (4 nodes): 5.65 | + 0.5x K80: 12.53 | + 1x K80: 14.02 | + 2x K80: 14.90

Page 28: NAMD 2.10; F1-ATPase on Intel Phi, Tesla K40s and K80s, and IVB CPUs

Simulation time (ns/day), 1 node:
  2x Xeon E5-2697 v2 (1 Ivy Bridge node): 0.668
  2x Xeon (1 Haswell node): 0.981
  2x Xeon E5-2697 v2 + 1x Tesla K40 @ 875 MHz: 1.751
  2x Xeon E5-2697 v2 + 0.5x Tesla K80 (autoboost): 1.489
  2x Xeon E5-2697 v2 + 1x Tesla K80 (autoboost): 2.884
  2x Xeon E5-2697 v2 + 2x Tesla K40 @ 875 MHz: 3.364
  2x Xeon E5-2697 v2 + 2x Tesla K80 (autoboost): 4.265

Page 29: NAMD 2.10; F1-ATPase on Tesla K40s and IVB CPUs

Simulation time (ns/day), 2 to 8 nodes; CPUs are Xeon E5-2697 v2, GPUs are Tesla K40 @ 875 MHz:
  2 CPUs (2 nodes): 0.598 | + 2x K40: 3.247 | + 4x K40: 4.186
  4 CPUs (2 nodes): 1.124 | + 2x K40: 3.353 | + 4x K40: 5.673
  4 CPUs (4 nodes): 1.171 | + 4x K40: 5.731 | + 8x K40: 6.832
  8 CPUs (4 nodes): 2.083 | + 4x K40: 5.486 | + 8x K40: 7.246
  8 CPUs (8 nodes): 2.199 | + 8x K40: 7.480 | + 16x K40: 8.160
  16 CPUs (8 nodes): 3.810 | + 8x K40: 7.818 | + 16x K40: 8.864

Page 30: NAMD 2.10; F1-ATPase on Tesla K80s and IVB CPUs

Simulation time (ns/day), 2 to 4 nodes; CPUs are Xeon E5-2697 v2, GPUs are Tesla K80 (autoboost):
  2 CPUs (2 nodes): 0.598 | + 0.5x K80: 2.922 | + 1x K80: 4.229 | + 2x K80: 4.290
  4 CPUs (2 nodes): 1.124 | + 0.5x K80: 2.944 | + 1x K80: 5.199 | + 2x K80: 6.702
  4 CPUs (4 nodes): 1.171 | + 0.5x K80: 5.191 | + 1x K80: 6.703 | + 2x K80: 6.962
  8 CPUs (4 nodes): 2.083 | + 0.5x K80: 5.061 | + 1x K80: 6.965 | + 2x K80: 7.450

Page 31: NAMD 2.10; STMV on Intel Phi, Tesla K40s and K80s, and IVB CPUs

Simulation time (ns/day), 1 node:
  2x Xeon E5-2697 v2 (1 Ivy Bridge node): 0.167
  2x Xeon (1 Haswell node): 0.239
  2x Xeon E5-2697 v2 + 1x Tesla K40 @ 875 MHz: 0.488
  2x Xeon E5-2697 v2 + 0.5x Tesla K80 (autoboost): 0.415
  2x Xeon E5-2697 v2 + 1x Tesla K80 (autoboost): 0.794
  2x Xeon E5-2697 v2 + 2x Tesla K40 @ 875 MHz: 0.920
  2x Xeon E5-2697 v2 + 2x Tesla K80 (autoboost): 1.226

Page 32: NAMD 2.10; STMV on Tesla K40s and IVB CPUs

Simulation time (ns/day), 2 to 8 nodes; CPUs are Xeon E5-2697 v2, GPUs are Tesla K40 @ 875 MHz:
  2 CPUs (2 nodes): 0.150 | + 2x K40: 0.477 | + 4x K40: 0.673
  4 CPUs (2 nodes): 0.283 | + 2x K40: 0.484 | + 4x K40: 0.887
  4 CPUs (4 nodes): 0.298 | + 4x K40: 1.697 | + 8x K40: 2.220
  8 CPUs (4 nodes): 0.542 | + 4x K40: 1.694 | + 8x K40: 2.541
  8 CPUs (8 nodes): 0.573 | + 8x K40: 2.899 | + 16x K40: 3.520
  16 CPUs (8 nodes): 1.016 | + 8x K40: 2.773 | + 16x K40: 3.445

Page 33: NAMD 2.10; STMV on Tesla K80s and IVB CPUs

Simulation time (ns/day), 2 to 4 nodes; CPUs are Xeon E5-2697 v2, GPUs are Tesla K80 (autoboost):
  2 CPUs (2 nodes): 0.150 | + 0.5x K80: 0.421 | + 1x K80: 0.630 | + 2x K80: 0.690
  4 CPUs (2 nodes): 0.283 | + 0.5x K80: 0.425 | + 1x K80: 0.778 | + 2x K80: 1.110
  4 CPUs (4 nodes): 0.298 | + 0.5x K80: 1.555 | + 1x K80: 2.154 | + 2x K80: 2.253
  8 CPUs (4 nodes): 0.542 | + 0.5x K80: 1.518 | + 1x K80: 2.388 | + 2x K80: 2.837

Page 34: Replace 3 Nodes with 1 M2090 GPU

Speedup of about 1.2x for 50% of the cost. Running NAMD 2.9. Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU). The green node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU) and 1x NVIDIA M2090 GPU. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred hardware vendor for actual pricing.

F1-ATPase:
  4 CPU nodes: 0.63 ns/day, about $8,000
  1 CPU node + 1x M2090: 0.74 ns/day, about $4,000

Page 35: K20: Greener: Twice the Science per Watt

Running NAMD 2.9. Each blue node contains dual E5-2687W CPUs (95 W, 4 cores per CPU). Each green node contains 2x Intel Xeon X5550 CPUs (95 W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225 W per GPU). Energy expended = power x time; lower is better. GPUs cut energy usage by about half.

[Chart: energy expended (kJ) in simulating 1 ns of ApoA1, for 1 node vs. 1 node + 2x K20.]

Page 36: Kepler: Greener: Twice the Science per Joule

Running NAMD 2.9. The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU). The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235 W each). Energy expended = power x time; lower is better. GPUs cut energy usage by about half.

[Chart: energy expended (kJ) in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), for CPU only, CPU + 2 K10s, CPU + 2 K20s, and CPU + 2 K20Xs.]

Page 37: NAMD CPU and GPU Scaling

The next eight slides are courtesy of Jim Phillips at the Beckman Institute, UIUC.

Page 38: New Streaming Kernel Performance (2 fs timestep)

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, University of Illinois at Urbana-Champaign (www.ks.uiuc.edu), GTC 2015.

[Chart: performance (ns/day) vs. number of nodes (256 to 4096) for a 21M-atom system on Blue Waters XK7, with and without the new streaming kernels; the new streaming path is roughly 10-30% faster.]

Page 39: Parallelize PME Within the Node (2 fs timestep)

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, University of Illinois at Urbana-Champaign (www.ks.uiuc.edu), GTC 2015.

[Chart: performance (ns/day) vs. number of nodes (256 to 4096) for a 21M-atom system on Blue Waters XK7: no streaming, new streaming (+10-30%), and new streaming with tuned PME (a further +5%); the best point reaches about 2.9 ms/step, or 60 ns/day.]
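The ms/step and ns/day annotations are consistent with the 2 fs timestep: at 2.9 ms per step, a day holds 86,400 s / 0.0029 s, or roughly 29.8 million steps, and 29.8 million steps x 2 fs is about 60 ns of simulated time per day; the 7 ms/step figure on a later slide works out the same way to about 25 ns/day.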

Page 40: Blue Waters vs. Titan (2 fs timestep)

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, University of Illinois at Urbana-Champaign (www.ks.uiuc.edu), GTC 2015.

[Chart: performance (ns/day) vs. number of nodes (256 to 4096), 21M atoms: Blue Waters XK7 (new streaming, tuned PME), Titan XK7 (new streaming, tuned PME), Titan XK7 (SC14, old streaming), and Blue Waters XK7 (no streaming); Titan trails Blue Waters by roughly 5-25%, and the best point is about 2.9 ms/step, or 60 ns/day.]

Page 41: Topology-Aware Scheduling on Blue Waters

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, University of Illinois at Urbana-Champaign (www.ks.uiuc.edu), GTC 2015.

• Map jobs to convex sets to avoid network interference
• NCSA, Cray, Adaptive
• Just enabled January 13
• Most likely explanation for the Blue Waters performance advantage over Titan
• See Enos et al., CUG 2014

Page 42: Blue Waters vs. Titan (2 fs timestep)

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, University of Illinois at Urbana-Champaign (www.ks.uiuc.edu), GTC 2015.

[Chart: performance (ns/day) vs. number of nodes (256 to 16384) for 21M-atom and 224M-atom systems: Blue Waters XK7 (GTC15), Titan XK7 (GTC15), and Titan XK7 (SC14); the best annotated point is about 7 ms/step, or 25 ns/day.]

Page 43: Comparison with CPU-only Machines (2 fs timestep)

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, Beckman Institute, University of Illinois at Urbana-Champaign (www.ks.uiuc.edu), GTC 2015.

[Chart: performance (ns/day) vs. number of nodes (256 to 16384), 21M and 224M atoms: Blue Waters XK7 (GTC15) and Titan XK7 (GTC15) against the CPU-only Edison XC30 (SC14) and Blue Waters XE6 (SC14); the GPU-accelerated XK7 systems lead the CPU-only machines by roughly 50-200%.]

Page 44: Recommended GPU Node Configuration for NAMD Computational Chemistry

Workstation or single-node configuration:
  # of CPU sockets: 2
  Cores per CPU socket: 6+
  CPU speed (GHz): 2.66+
  System memory per socket (GB): 32
  GPUs: Kepler K20, K40, K80
  # of GPUs per CPU socket: 1-2
  GPU memory preference (GB): 6-12
  GPU-to-CPU connection: PCIe 3.0 or higher
  Server storage: 500 GB or higher
  Network configuration: Gemini, InfiniBand

Scale to multiple nodes with the same single-node configuration.

Page 45: Summary and Conclusions: Benefits of GPU-Accelerated Computing

Faster than CPU-only systems in all tests
Large performance boost with a small marginal price increase
Energy usage cut in half
GPUs scale very well within a node and over multiple nodes
The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date
Try GPU-accelerated NAMD for free: www.nvidia.com/GPUTestDrive

Page 46: GROMACS 5.0

Page 47: GROMACS 5.0: Phi vs. Kepler; K40 Is the Fastest GPU Yet

GROMACS 5.0 RC1 (ns/day) on K40 with boost clocks and on Intel Phi; 192K waters benchmark (CUDA 6.0):
  1x Xeon: 4.96
  1x Intel Phi 3120P (native mode): 6.02
  1x Intel Phi 5110P (native mode): 5.9
  1x Xeon + 1x Tesla K40 @ 875 MHz: 18.19
  1x Xeon + 2x Tesla K40 @ 875 MHz: 18.55
  2x Xeon: 7.9
  2x Xeon + 1x Tesla K40 @ 875 MHz: 19.29
  2x Xeon + 2x Tesla K40 @ 875 MHz: 25.84

Page 48: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_ion_channel, single node with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  1 CPU: 7.92 | + 1x K20X: 20.01 | + 1x K40: 21.79 | + 1x K80 (autoboost): 18.63
  2 CPUs (1 node): 11.60 | + 1x K20X: 25.49 | + 1x K40: 26.00

Page 49: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_ion_channel_vsites, single node with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  1 CPU: 13.66 | + 1x K20X: 35.27 | + 1x K40: 37.00 | + 1x K80 board: 31.86
  2 CPUs (1 node): 17.98 | + 1x K20X: 41.94 | + 1x K40: 45.29 | + 2x K20X: 42.57 | + 2x K40: 45.37

Page 50: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_methanol, single node with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  1 CPU: 0.10 | + 1x K20X: 0.23 | + 1x K40: 0.23 | + 1x K80 board (autoboost): 0.19
  2 CPUs (1 node): 0.17 | + 1x K20X: 0.28 | + 1x K40: 0.30 | + 2x K20X: 0.36 | + 2x K40: 0.39

Page 51: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_methanol_rf, single node with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  1 CPU: 0.12 | + 1x K20X: 0.27 | + 1x K40: 0.31 | + 1x K80 board: 0.34
  2 CPUs (1 node): 0.19 | + 1x K20X: 0.30 | + 1x K40: 0.36 | + 2x K20X: 0.46 | + 2x K40: 0.52

Page 52: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_virus_capsid, single node with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  1 CPU: 0.92 | + 1x K20X: 2.79 | + 1x K40: 2.99 | + 1x K80 board: 3.30
  2 CPUs (1 node): 1.54 | + 1x K20X: 3.24 | + 1x K40: 3.83 | + 2x K20X: 4.58 | + 2x K40: 5.18

Page 53: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_ion_channel, 2 to 8 nodes with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  4 CPUs (2 nodes): 21.32 | + 2x K20X: 31.80 | + 2x K40: 33.76 | + 4x K20X: 44.49 | + 4x K40: 45.92
  8 CPUs (4 nodes): 35.99 | + 4x K20X: 48.85 | + 4x K40: 52.25 | + 8x K20X: 59.28 | + 8x K40: 61.16
  16 CPUs (8 nodes): 54.72 | + 8x K20X: 62.95 | + 8x K40: 68.11 | + 16x K20X: 72.18 | + 16x K40: 78.48

Page 54: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_ion_channel_vsites, 2 to 8 nodes with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  4 CPUs (2 nodes): 32.81 | + 2x K20X: 47.92 | + 2x K40: 53.98 | + 4x K20X: 70.02 | + 4x K40: 76.48
  8 CPUs (4 nodes): 55.66 | + 4x K20X: 75.50 | + 4x K40: 81.26 | + 8x K20X: 99.26 | + 8x K40: 98.37
  16 CPUs (8 nodes): 82.31 | + 8x K20X: 102.47 | + 8x K40: 105.78 | + 16x K20X: 131.88 | + 16x K40: 140.66

Page 55: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_methanol, 2 to 8 nodes with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  4 CPUs (2 nodes): 0.33 | + 2x K20X: 0.44 | + 2x K40: 0.47 | + 4x K20X: 0.63 | + 4x K40: 0.80
  8 CPUs (4 nodes): 0.60 | + 4x K20X: 0.84 | + 4x K40: 0.97 | + 8x K20X: 1.38 | + 8x K40: 1.53
  16 CPUs (8 nodes): 1.25 | + 8x K20X: 1.73 | + 8x K40: 1.83 | + 16x K20X: 2.73 | + 16x K40: 2.85

Page 56: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_methanol_rf, 2 to 8 nodes with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  4 CPUs (2 nodes): 0.38 | + 2x K20X: 0.49 | + 2x K40: 0.57 | + 4x K20X: 0.89 | + 4x K40: 1.05
  8 CPUs (4 nodes): 0.75 | + 4x K20X: 0.91 | + 4x K40: 1.17 | + 8x K20X: 1.73 | + 8x K40: 2.12
  16 CPUs (8 nodes): 1.48 | + 8x K20X: 1.86 | + 8x K40: 2.23 | + 16x K20X: 3.65 | + 16x K40: 4.16

Page 57: GROMACS 5.0 and the Fastest Kepler GPUs Yet

GROMACS 5.0, cresta_virus_capsid, 2 to 8 nodes with and without Kepler GPUs (ns/day); CPUs are Xeon E5-2697, K40s run at 875 MHz:
  4 CPUs (2 nodes): 2.93 | + 2x K20X: 5.44 | + 2x K40: 5.71 | + 4x K20X: 8.36 | + 4x K40: 8.63
  8 CPUs (4 nodes): 5.53 | + 4x K20X: 8.99 | + 4x K40: 9.81 | + 8x K20X: 14.20 | + 8x K40: 12.93
  16 CPUs (8 nodes): 9.18 | + 8x K20X: 15.24 | + 8x K40: 15.57 | + 16x K20X: 20.30 | + 16x K40: 22.01

Page 58: Slides courtesy of the GROMACS development team.

Page 59: Greener Science

Running GROMACS 4.6 with CUDA 4.1. The blue nodes contain 2x Intel X5550 CPUs (95 W TDP, 4 cores per CPU). The green node contains 2x Intel X5550 CPUs (4 cores per CPU) and 2x NVIDIA M2090 GPUs (225 W TDP per GPU). In simulating each nanosecond, the GPU-accelerated system uses 33% less energy. Energy expended = power x time; lower is better.

[Chart: energy expended (kJ) for ADH in water (134K atoms): 4 nodes (760 W) vs. 1 node + 2x M2090 (640 W).]

Page 60: Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or single-node configuration:
  # of CPU sockets: 2
  Cores per CPU socket: 6+
  CPU speed (GHz): 2.66+
  System memory per socket (GB): 32
  GPUs: Kepler K20, K40, K80
  # of GPUs per CPU socket: 1 (Kepler GPUs need a fast Sandy Bridge or Ivy Bridge CPU, or high-end AMD Opterons)
  GPU memory preference (GB): 6
  GPU-to-CPU connection: PCIe 3.0 or higher
  Server storage: 500 GB or higher
  Network configuration: Gemini, InfiniBand

Page 61
Page 62: GAUSSIAN

Page 63: Gaussian

ACS Fall 2011 press release: a joint collaboration between Gaussian, NVIDIA, and PGI for GPU acceleration: http://www.gaussian.com/g_press/nvidia_press.htm. No such press release exists for Intel MIC or AMD GPUs. Mike Frisch quote from the press release:

"Calculations using Gaussian are limited primarily by the available computing resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian's customers need to address."

Page 64: Select Slides from "Enabling Gaussian 09 on GPGPUs" at GTC March 2014

In 2011, Gaussian, Inc., NVIDIA Corp., and PGI started a long-term project to enable all the performance-critical paths of Gaussian on GPGPUs. The ultimate goal is to show significant performance improvement by using accelerators in conjunction with CPUs. Initial efforts are directed towards creating an infrastructure that will leverage the current CPU code base and, at the same time, minimize the additional maintenance effort associated with running on GPUs.

The current status of this work covers Direct Hartree-Fock and triples-correction calculations, as applied for example in coupled-cluster calculations, and uses mostly the directives-based OpenACC framework.

Slides & audio: http://on-demand.gputechconf.com/gtc/2014/video/S4613-enabling-gaussian-09-gpgpus.mp4
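Gaussian's sources are not public, so as an illustrative sketch only (generic code, not Gaussian's), this is the shape of the directives-based OpenACC approach named above: annotate an existing CPU loop nest and let the compiler (PGI in this project) generate the GPU code, while the same source still builds for CPU-only runs.

    // Illustrative OpenACC sketch (a generic matrix-vector product, not Gaussian code).
    // The directives are ignored by a non-OpenACC compiler, preserving the CPU path.
    #include <stdio.h>
    #define N 2048

    static float a[N][N], x[N], y[N];

    int main(void)
    {
        for (int i = 0; i < N; ++i) {
            x[i] = 1.0f;
            for (int j = 0; j < N; ++j)
                a[i][j] = 0.001f;
        }

        #pragma acc parallel loop copyin(a, x) copyout(y)
        for (int i = 0; i < N; ++i) {
            float sum = 0.0f;
            #pragma acc loop reduction(+:sum)
            for (int j = 0; j < N; ++j)
                sum += a[i][j] * x[j];
            y[i] = sum;
        }

        printf("y[0] = %f\n", y[0]);   // expect about 2.048
        return 0;
    }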

Page 65: Current Status of the Project to Enable Gaussian 09 on GPGPUs

Roberto Gomperts (NVIDIA Corp.), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (PGI)

Page 66: Early Performance Results (Direct SCF)

System: 2 sockets E5-2690 v2 (2x 10 cores @ 3.0 GHz); 128 GB RAM (DDR3-1600); 108 GB used. GPUs: 6x Tesla K40m (15 SMs @ 875 MHz), 12 GB global memory each.

Valinomycin force calculation: method rB3LYP, 168 atoms, 6-31G(3df,3p) basis, 3,642 basis functions, 17 cycles.

[Chart: speedups relative to the CPU-only full node for Total, l502 ("HF"), XC-Quad, Rest, and l703 ("HF", XC-Quad), comparing 20 threads + 0 GPUs against 20 threads + 6 GPUs. Bar labels give times in minutes: 160.3, 124.6, 105.3, 9.1, 10.3, 33.5, 32.3, 0.7 with 0 GPUs, and 111.7, 86.6, 66.8, 9.3, 10.5, 23.0, 21.7, 0.7 with 6 GPUs.]

Page 67: Early Performance Results (CCSD(T))

System: 2 sockets E5-2690 v2 (2x 10 cores @ 3.0 GHz); 128 GB RAM (DDR3-1600); 108 GB used. GPUs: 6x Tesla K40m (15 SMs @ 875 MHz), 12 GB global memory each.

Ibuprofen CCSD(T) calculation: 33 atoms, 6-31G(d,p) basis, 315 basis functions, 41 occupied orbitals, 259 virtual orbitals, 15 cycles, 16 CCSD iterations.

[Chart: speedups relative to the CPU-only full node for Total, l913 ("triples"), and iterative CCSD, comparing 20 threads + 0 GPUs against 6 threads + 6 GPUs; bar labels (times in hours): 24.8, 24.8, 22.7, 2.1, 3.7, 3.6, 1.3, 2.3.]

Page 68: Closing Remarks

Significant progress has been made in creating a framework that keeps a unified code structure for GPU-enabled Gaussian. There is room for performance improvement in the Direct SCF work. The (t) correction performance looks promising.

Further work: continue towards a "product"-quality version of Gaussian to be released to customers:
- Continue unification of the code base
- Tackle non-default paths of the currently enabled code
- Expand enabling of other Gaussian functionality (2nd derivatives, XC quadrature, TDDFT, MP2, etc.)
- Performance tuning

Page 69: GAMESS

Page 70: GAMESS Partnership Overview

"We like to push the envelope as much as we can in the direction of highly scalable efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE Laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs."
Prof. Mark Gordon, Distinguished Professor, Department of Chemistry, Iowa State University, and Director, Applied Mathematical Sciences Program, Ames Laboratory

Mark Gordon and Andrey Asadchev, key developers of GAMESS, work in collaboration with NVIDIA; Mark Gordon is a recipient of an NVIDIA Professor Partnership Award. Quantum chemistry is one of the major consumers of CPU cycles at national supercomputer centers, and NVIDIA developer resources are fully allocated to the GAMESS code.

Page 71: GAMESS August 2011 GPU Performance

First GPU-supported GAMESS release via "libqc", a library for fast quantum chemistry on multiple NVIDIA GPUs across multiple nodes, built with CUDA. It covers the two-electron AO integrals and their assembly into a closed-shell Fock matrix.

[Chart: relative performance of the GAMESS Aug. 2011 release for two small molecules, Ginkgolide (53 atoms) and Vancomycin (176 atoms), comparing 4x E5640 CPUs against 4x E5640 CPUs + 4x Tesla C2070s.]
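For context (this is the standard closed-shell Hartree-Fock expression, not something specific to libqc), the Fock matrix that these two-electron AO integrals are assembled into has the form, with density matrix P and core Hamiltonian H:

    F_{\mu\nu} = H^{core}_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma} \left[ (\mu\nu|\lambda\sigma) - \tfrac{1}{2} (\mu\lambda|\nu\sigma) \right]

Each (\mu\nu|\lambda\sigma) is a two-electron repulsion integral over atomic-orbital basis functions, which is why accelerating their evaluation and contraction dominates the GPU work described on this slide.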

Page 72: Upcoming GAMESS Q4 2013 Release

Multiple nodes with multiple GPUs supported
Rys quadrature
Hartree-Fock: 8 CPU cores + 1x M2070 yields a 2.3-2.9x speedup over 8 CPU cores alone; see the 2012 publication
Møller-Plesset perturbation theory (MP2): paper published
Coupled cluster SD(T): CCSD code completed, (T) in progress

Page 73: GAMESS: New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock

Adding 1x C2070 GPU speeds up computations by 2.3x to 2.9x.

Hartree-Fock GPU speedups*:
  Taxol 6-31G: 2.3x
  Taxol 6-31G(d): 2.5x
  Taxol 6-31G(2d,2p): 2.5x
  Taxol 6-31++G(d,p): 2.3x
  Valinomycin 6-31G: 2.4x
  Valinomycin 6-31G(d): 2.3x
  Valinomycin 6-31G(2d,2p): 2.9x

* A. Asadchev, M.S. Gordon, "New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock," Journal of Chemical Theory and Computation (2012).

Page 74: TeraChem

Page 75: TeraChem 1.5K; TrpCage on Tesla K40s and IVB CPUs

[Chart: total processing time in seconds (axis 0 to 180 s) for 2x Xeon E5-2697 v2 plus 1, 2, 4, or 8x Tesla K40 @ 875 MHz, 1 node.]

Page 76: TeraChem 1.5K; TrpCage on Tesla K40s and Haswell CPUs

[Chart: total processing time in seconds (axis 0 to 180 s) for 2x Xeon E5-2698 v3 plus 1, 2, or 4x Tesla K40 @ 875 MHz, 1 node.]

Page 77: TeraChem 1.5K; TrpCage on Tesla K80s and IVB CPUs

[Chart: total processing time in seconds (axis 0 to 120 s) for 2x Xeon E5-2697 v2 plus 1, 2, or 4x Tesla K80 boards, 1 node.]

Page 78: TeraChem 1.5K; TrpCage on Tesla K80s and Haswell CPUs

[Chart: total processing time in seconds (axis 0 to 120 s) for 2x Xeon E5-2698 v3 plus 1, 2, or 4x Tesla K80 boards, 1 node.]

Page 79: TeraChem 1.5K; BPTI on Tesla K40s and IVB CPUs

[Chart: total processing time in seconds (axis 0 to 12,000 s) for 2x Xeon E5-2697 v2 plus 1, 2, 4, or 8x Tesla K40 @ 875 MHz, 1 node.]

Page 80: TeraChem 1.5K; BPTI on Tesla K80s and IVB CPUs

[Chart: total processing time in seconds (axis 0 to 8,000 s) for 2x Xeon E5-2697 v2 plus 1, 2, or 4x Tesla K80 boards, 1 node.]

Page 81: TeraChem 1.5K; BPTI on Tesla K40s and Haswell CPUs

[Chart: total processing time in seconds (axis 0 to 12,000 s) for 2x Xeon E5-2698 v3 plus 1, 2, or 4x Tesla K40 @ 875 MHz, 1 node.]

Page 82: TeraChem 1.5K; BPTI on Tesla K80s and Haswell CPUs

[Chart: total processing time in seconds (axis 0 to 6,000 s) for 2x Xeon E5-2698 v3 plus 1, 2, or 4x Tesla K80 boards, 1 node.]

Page 83: Benefits of GPU-Accelerated Computing

Faster than CPU-only systems in all tests
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
The K20 GPU is our fastest and lowest-power high-performance GPU yet
Try GPU-accelerated TeraChem for free: www.nvidia.com/GPUTestDrive

Page 84: Test Drive K80 GPUs! Experience the Acceleration

Sign up for a free GPU test drive on remotely hosted clusters and run computational chemistry apps on Tesla K80 GPUs today: www.nvidia.com/GPUTestDrive

Try preconfigured apps (AMBER, NAMD, GROMACS, LAMMPS, Quantum Espresso, TeraChem) or load your own.