molecular models, threads and you

Molecular Models, Threads and You Jiahao Chen Martínez Group Dept. Chemistry, CATMS, MRL and Beckman CS 498 MG presentation: 2007-12-07 Optimizing the TINKER classical molecular dynamics code while maintaining code readability

Upload: jiahao-chen

Post on 27-Jan-2015

Page 1: Molecular models, threads and you

Molecular Models, Threads and You

Jiahao Chen

Martínez Group

Dept. Chemistry, CATMS, MRL and Beckman

CS 498 MG presentation: 2007-12-07

Optimizing the TINKER classical molecular dynamics code while maintaining code readability

Page 2: Molecular models, threads and you

Molecular models/force fields

Typical energy function

E = (covalent bond effects) + (noncovalent interactions)

Page 3: Molecular models, threads and you

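Molecular models/force fields

Typical energy function

$$
E = \sum_{b \in \text{bonds}} k_b (r_b - r_{\text{eq},b})^2
  + \sum_{a \in \text{angles}} \kappa_a (\theta_a - \theta_{\text{eq},a})^2
  + \sum_{d \in \text{dihedrals}} \sum_n l_{nd} \cos(n\phi)
  + \sum_{i<j \in \text{atoms}} \frac{q_i q_j}{r_{ij}}
  + \sum_{i<j \in \text{atoms}} \epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]
$$

Terms, in order: bond stretch, angle torsion, dihedrals, electrostatics, dispersion

computation cost = O(N^2)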
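The O(N^2) cost comes from the two sums over atom pairs i < j. A minimal Python sketch of those two sums follows. It assumes a single `epsilon` and `sigma` shared by all pairs and omits the Coulomb constant, as the slide's formula does; the function name and argument layout are hypothetical and are not taken from TINKER.

```python
import math

def nonbonded_energy(positions, charges, epsilon, sigma):
    """Electrostatic and dispersion sums over all atom pairs i < j.

    Assumption for illustration: one epsilon and one sigma for every pair.
    """
    n = len(positions)
    coulomb = 0.0
    dispersion = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):  # N(N-1)/2 pairs, hence O(N^2) work
            r = math.dist(positions[i], positions[j])
            coulomb += charges[i] * charges[j] / r
            s6 = (sigma / r) ** 6
            dispersion += epsilon * (s6 * s6 - s6)
    return coulomb, dispersion

# Usage: three atoms on a line (made-up coordinates and charges).
atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(nonbonded_energy(atoms, [1.0, -1.0, 0.5], epsilon=0.5, sigma=2.0))
```

Doubling the number of atoms roughly quadruples the number of loop iterations, which is why these two terms dominate for large systems.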

Page 4: Molecular models, threads and you

Problem description

• The state of the system is given by the position $x_i$ and momentum $p_i$ of every atom $i$ (of mass $m_i$)

$$(x_1, p_1, x_2, p_2, \cdots, x_N, p_N) \in \mathbb{R}^{3 \times 2 \times N}$$

• Solve the system of partial differential equations

$$\frac{\partial x_i}{\partial t} = \frac{p_i}{m_i}, \qquad \frac{\partial p_i}{\partial t} = -\frac{\partial E}{\partial x_i}, \qquad i = 1, \cdots, N$$

• with user-specified initial conditions (e.g. with constant temperature and pressure)

• Subject to (user-specified) constraints, e.g. fixed bond angles
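One common way to step these equations of motion forward in time is the velocity Verlet scheme. The sketch below is an assumption for illustration, since this slide does not name an integrator, and it ignores constraints and temperature or pressure control. The names `velocity_verlet` and `harmonic_force` are hypothetical, and coordinates are stored as one flat list.

```python
import math

def velocity_verlet(x, p, m, force, dt, n_steps):
    """Integrate dx/dt = p/m, dp/dt = force(x) with velocity Verlet."""
    f = force(x)
    for _ in range(n_steps):
        p = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]        # half step in momentum
        x = [xi + dt * pi / mi for xi, pi, mi in zip(x, p, m)]  # full step in position
        f = force(x)                                            # forces at the new positions
        p = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]        # second half step in momentum
    return x, p

def harmonic_force(x):
    """Force -dE/dx for the toy energy E = 0.5 * sum(x_i^2)."""
    return [-xi for xi in x]

# Usage: one unit-mass oscillator; the exact solution is x(t) = cos(t).
x, p = velocity_verlet([1.0], [0.0], [1.0], harmonic_force, dt=0.01, n_steps=1000)
print(x[0], math.cos(10.0))
```

The force evaluation inside the loop is where the energy function is differentiated, so it is the expensive step for a real molecular system.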

Page 5: Molecular models, threads and you

Many parallel and serial implementations

Package name Threads MPI GlobalArrays

NAMD CHARM++

GROMACS ✓ ✓

TINKER

AMBER partly ✓ ✓

CHARMM ✓

LAMMPS ✓

NWChem ✓ ✓

Page 6: Molecular models, threads and you

Things I tried

• Compiler flags optimization

• Cache miss reduction

• Lookup tables

• Parallelization with OpenMP

Page 7: Molecular models, threads and you

Compiler flag optimization

flags        gfortran 4.1.2                   ifort 10.0.023
             time          speedup            time          speedup
-O0          29.95(2) s    -                  36.30(2) s    -
-Os          29.92(3) s    +0.77(3) %         32.59(4) s    +10.22(2) %
-O1          30.22(1) s    -0.90(4) %         32.12(3) s    +11.51(1) %
-O2          29.66(3) s    +0.96(1) %         30.30(2) s    +16.54(2) %
-O3          29.84(2) s    +0.38(2) %         30.83(2) s    +15.06(2) %
CE search    28.77(2) s    +3.62(3) % [1]     28.96(2) s    +20.22(1) % [2]

1. FFLAGS="-falign-functions -falign-jumps -falign-labels -falign-loops -fvpt -fcse-skip-blocks -fdelete-null-pointer-checks -ffast-math -fforce-addr -fgcse -fgcse-lm -fgcse-sm -floop-optimize -fkeep-static-consts -fmerge-constants -fno-defer-pop -fno-guess-branch-probability -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -foptimize-register-move -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fno-sched-spec -fsched-spec-load -fsched-stalled-insns -fsignaling-nans -fsingle-precision-constant -fstrength-reduce -fthread-jumps -funroll-all-loops"

2. FFLAGS="-xN -no-prec-div -static -inline-level=1 -ip -fno-alias -fno-fnalias -fno-omit-frame-pointer -fkeep-static-consts -nolib-inline -heap-arrays 1 -pad -O3 -scalar-rep -funroll-loops -complex-limited-range"

Page 8: Molecular models, threads and you

Algorithm and time profile

[Flowchart, profiled with N = 6 and gfortran 4.1.2]

Initialize model and parameters

For each time step: move one time step

    Enforce temp. & pressure
    Flush I/O
    Update state by Δt/2
    Calculate potential energy and forces
        Calculate charge interactions
        Calculate dispersion interactions
        Calculate bond interactions
        Calculate angle interactions
        Calculate dihedral interactions
        ...
        Add up all components
    Calculate & record kinetic energy and properties
    Update state by Δt/2
    Enforce temp. & pressure
    Remove unphysical motions

End

Time-profile labels on the chart: >98%, >59%, <31%; 37%, 12%, 8%, 9%, 26%

Complexity labels on the chart: O(N^2), O(N), O(N), O(N^2)

Page 9: Molecular models, threads and you

An unexpected cost

[Flowchart from Page 8 repeated, with the "Add up all components" step called out]

Q: Why is 15% of total execution time spent adding numbers!?

Page 10: Molecular models, threads and you

A: many L2 cache misses

    L2 misses   code
                c     zero out each of the first derivative components
        7             do i = 1, n
                         do j = 1, 3
       42                   deb(j,i) = 0.0d0
                            ...                          (22 other terms)
                         end do
                      end do
                      ...
                c     sum up to get the total energy and first derivatives
                      energy = eb + ...
                      do i = 1, n
                         do j = 1, 3
       19                   desum(j,i) = deb(j,i) + ...   (22 other terms)
        2                   derivs(j,i) = desum(j,i)
                         end do
                      end do

70 of 91 cache misses per time step (n = 6) shown

Page 11: Molecular models, threads and you

A simple solution

    L2 misses
    after  before   code
                    c     zero out each of the first derivative components
      7       7           do i = 1, n
                             do j = 1, 3
     26      42                 deb(j,i) = 0.0d0
                                ...
                             end do
                          end do
                          ...
                    c     sum up to get the total energy and first derivatives
                          energy = eb + ...
                          do i = 1, n
                             do j = 1, 3
      6                         temp = deb(j,i) + ...
      1      19                 desum(j,i) = temp
      1       2                 derivs(j,i) = temp
                             end do
                          end do

reduced cache misses from 92 to 41 per time step
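The change above is a scalar replacement: the sum is held in a scalar and stored twice, instead of being written to `desum` and then read back from it. The Python sketch below only illustrates the shape of the transformation and checks that both forms give the same result. It does not reproduce the cache behaviour, which depends on the compiled Fortran code, and the function names and the list-of-lists layout (indexed `[i][j]` rather than Fortran's `(j,i)`) are assumptions.

```python
def sum_derivs_original(components, n):
    """Write the sum to desum, then read desum back to fill derivs."""
    desum = [[0.0] * 3 for _ in range(n)]
    derivs = [[0.0] * 3 for _ in range(n)]
    for i in range(n):
        for j in range(3):
            desum[i][j] = sum(c[i][j] for c in components)
            derivs[i][j] = desum[i][j]  # re-reads the element just written
    return desum, derivs

def sum_derivs_scalar(components, n):
    """Hold the sum in a scalar and store it twice."""
    desum = [[0.0] * 3 for _ in range(n)]
    derivs = [[0.0] * 3 for _ in range(n)]
    for i in range(n):
        for j in range(3):
            temp = sum(c[i][j] for c in components)
            desum[i][j] = temp
            derivs[i][j] = temp
    return desum, derivs

# Usage: two made-up derivative components for n = 2 atoms.
deb = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
dea = [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]]
print(sum_derivs_scalar([deb, dea], 2) == sum_derivs_original([deb, dea], 2))
```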

Page 12: Molecular models, threads and you

Speedup from reducing L2 cache misses

                           gfortran 4.1.2   ifort 10.0.023
original                   29.95(2) s       28.96(2) s
with scalar replacement    27.43(3) s       28.95(1) s
speedup                    +8.44(1) %       +0.03(2) %

ifort already called with scalar replacement flag

Page 13: Molecular models, threads and you

Lookup tables (LUTs)

• Calculations of sqrt() and exp() take up 23.8% of execution time

• Idea: pre-compute values of sqrt() and exp() in an array and recall them from memory when needed

• Caution: LUT should not displace too much data from L2 cache

Page 14: Molecular models, threads and you

sqrt() with LUT

[Plot legend: LUT with linear interpolation; direct LUT]
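The two variants compared on this slide can be sketched as follows: a direct lookup returns the nearest tabulated value, while linear interpolation blends the two neighbouring entries. This is a minimal Python illustration under assumed parameters. The class name `SqrtTable`, the uniform grid and the range [1, 100] are hypothetical and are not taken from the TINKER modifications, and arguments are assumed to lie inside the tabulated range.

```python
import math

class SqrtTable:
    """Uniformly spaced table of sqrt(x) on [x_min, x_max]."""

    def __init__(self, x_min, x_max, size):
        self.x_min = x_min
        self.step = (x_max - x_min) / (size - 1)
        self.values = [math.sqrt(x_min + k * self.step) for k in range(size)]

    def direct(self, x):
        """Direct LUT: return the nearest tabulated value."""
        k = round((x - self.x_min) / self.step)
        return self.values[k]

    def interpolated(self, x):
        """LUT with linear interpolation between the two neighbouring entries."""
        t = (x - self.x_min) / self.step
        k = min(int(t), len(self.values) - 2)  # keep k + 1 inside the table
        frac = t - k
        return self.values[k] + frac * (self.values[k + 1] - self.values[k])

# Usage: compare both lookups against math.sqrt at one point.
table = SqrtTable(1.0, 100.0, 1000)
x = 2.03
print(abs(table.direct(x) - math.sqrt(x)), abs(table.interpolated(x) - math.sqrt(x)))
```

Interpolation buys accuracy for a given table size at the price of one extra table read and a few arithmetic operations per call.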

Page 15: Molecular models, threads and you

exp() with LUT

[Plot legend: LUT with first-order Taylor series refinement*; direct LUT]

* $$e^x = e^{x_0} + (x - x_0) e^{x_0} + O\left((x - x_0)^2\right)$$

Page 16: Molecular models, threads and you

Choice of implementation

function   desired precision   table size (doubles)   refinement   expected speedup
sqrt()     10^-4               10,764                 none         +118%
exp()      10^-8               6,836                  Taylor       +151%

LUT aligned to 128 bits

L2 cache = 4 MB = 512K doubles
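The Taylor refinement chosen for exp() can be sketched as below: look up the nearest tabulated point x0 and add the first-order correction (x - x0) e^{x0}. This is a minimal Python illustration, not the implementation behind the numbers in the table. The class name `ExpTable`, the range [-1, 0] and the table size of 1001 are assumptions, and arguments are assumed to lie inside the tabulated range.

```python
import math

class ExpTable:
    """Uniformly spaced table of exp(x) on [x_min, x_max]."""

    def __init__(self, x_min, x_max, size):
        self.x_min = x_min
        self.step = (x_max - x_min) / (size - 1)
        self.values = [math.exp(x_min + k * self.step) for k in range(size)]

    def direct(self, x):
        """Direct LUT: value at the nearest tabulated point x0."""
        k = round((x - self.x_min) / self.step)
        return self.values[k]

    def taylor(self, x):
        """LUT with first-order Taylor refinement: e^x0 + (x - x0) * e^x0."""
        k = round((x - self.x_min) / self.step)
        x0 = self.x_min + k * self.step
        e0 = self.values[k]
        return e0 + (x - x0) * e0

# Usage: relative errors of both lookups at one point.
table = ExpTable(-1.0, 0.0, 1001)
x = -0.3337
exact = math.exp(x)
print(abs(table.direct(x) - exact) / exact, abs(table.taylor(x) - exact) / exact)
```

Because the derivative of exp() is exp() itself, the refinement needs no second table: the relative error drops from roughly |x - x0| to roughly (x - x0)^2 / 2 for the cost of one multiply and one add.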

Page 17: Molecular models, threads and you

Speedup from LUT use

                      gfortran 4.1.2   ifort 10.0.023
original              29.95(2) s       28.96(2) s
with lookup tables    26.89(1) s       25.87(2) s
speedup               +10.23(2) %      +7.22(3) %

Page 18: Molecular models, threads and you

Summary of serial improvements

Improvement                gfortran 4.1.2             ifort 10.0.023
Best compiler flags        +3.62(3) %                 +20.22(1) %
L2 cache miss reduction    +8.44(2) %                 +0.03(1) %
Lookup tables              +10.23(1) %                +7.22(2) %
Total                      23.91(3) s, +20.17(4) %    26.86(2) s, +26.00(2) %
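The slides do not define "speedup" explicitly. Most of the reported percentages appear consistent with the fractional reduction in run time relative to a baseline, which the short sketch below computes. Both the definition and the function name `speedup_percent` are assumptions for illustration; the times are the gfortran -O0 baseline and the gfortran total from the tables.

```python
def speedup_percent(t_baseline, t_optimized):
    """Reduction in run time as a percentage of the baseline time."""
    return 100.0 * (t_baseline - t_optimized) / t_baseline

# Usage: gfortran, -O0 baseline 29.95 s versus the final 23.91 s.
print(f"{speedup_percent(29.95, 23.91):.2f} %")  # about 20.17 %
```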

Page 19: Molecular models, threads and you

Parallelization targets

[Flowchart from Page 8 repeated]

Page 20: Molecular models, threads and you

Parallelization strategy

[Flowchart of "Calculate potential energy and forces": the charge, dispersion, bond, angle and dihedral interaction calculations, ..., followed by "Add up all components"]

OpenMP annotations on the chart: omp sections (once), omp section (twice), omp parallel do (five times)

Percentage labels on the chart: 12%, 16%, 11%, 50%; 2%; 50%, 50%, 100%
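The two OpenMP constructs named on this slide split work in different ways: `omp sections` runs independent blocks of code side by side, while `omp parallel do` divides the iterations of one loop among threads. The Python sketch below is only a structural analogue using a thread pool, not OpenMP and not the TINKER code. All function names and the toy energy terms are hypothetical, and in CPython the global interpreter lock means this version shows the decomposition rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_sum(term, items):
    """Sum one energy term over its list of items."""
    return sum(term(x) for x in items)

def sections_energy(terms, n_workers=2):
    """Analogue of 'omp sections': each independent energy term is one task."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(serial_sum, term, items) for term, items in terms]
        return sum(f.result() for f in futures)

def parallel_do_energy(term, items, n_workers=2):
    """Analogue of 'omp parallel do' with a sum reduction: split one loop into chunks."""
    chunks = [items[k::n_workers] for k in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = pool.map(lambda chunk: serial_sum(term, chunk), chunks)
        return sum(partial_sums)

# Toy terms for illustration only.
def bond_term(r):
    return (r - 1.0) ** 2

def angle_term(theta):
    return (theta - 2.0) ** 2

# Usage: two independent terms as sections, then one loop split in two.
print(sections_energy([(bond_term, [1.0, 2.0, 3.0]), (angle_term, [2.0, 4.0])]))
print(parallel_do_energy(bond_term, [1.0, 2.0, 3.0]))
```

Sections can use at most as many threads as there are independent blocks, whereas a split loop scales with the number of iterations, so the two are often combined.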

Page 21: Molecular models, threads and you

Parallelization results

[Plot: execution time (s) versus number of cores, gfortran 4.1.2. Series: N = 6, N = 1000, Ideal]

Page 22: Molecular models, threads and you

Summary

• Free software can sometimes be better than non-free software

• L2 cache misses can significantly degrade performance

• Lookup tables are an effective tradeoff between speed and memory vs. precision

• Simple OpenMP parallelization is effective for small numbers of processors