Accelerating Knowledge-based Energy Evaluation in Protein
Structure Modeling with Graphics Processing Units
A. Yaseen, Yaohang Li, “Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units,” Journal of Parallel and Distributed Computing, 72(2): 297-307, 2012
March 19th 2013
Abstract
Evaluating protein energy is computationally costly. We present an efficient implementation of a knowledge-based energy function on the GPU that achieves perfect or near-perfect load balancing and improves cache efficiency by reorganizing the problem.
Speedup compared to the serial implementation: ~150 on an NVIDIA Quadro FX3800M, ~250 on an NVIDIA Tesla M2050.
Protein structure modeling applications, including a Monte Carlo sampling program and a local optimization program, can benefit from the GPU implementation.
Contents
Introduction
Knowledge-based Energy
Implementation Details: DFIRE Potential; Symmetric N-Body Problem; Cache Efficiency; Other GPU-Specific Implementations
Computational Results: Overall Speedup; Comparison with the Cell Subdivision and Neighbor Lists Methods; Applications in Protein Structure Modeling; Local Structure Optimization
Summary
1. Introduction
In many protein modeling applications, the most costly operations are usually the energy evaluations of protein molecules.
The major part of most protein energy evaluations involves estimating pair-wise atomic interactions, which is typically an N-body problem.
For a protein system with N atoms, conventional evaluation of the energy function requires O(N^2) operations, which leads to substantial computational time for large proteins.
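The O(N^2) cost is easy to see in a direct pairwise loop. A minimal sketch; the `pair_energy` callback is a placeholder for an energy term, not the actual DFIRE code:

```python
def total_energy(atoms, pair_energy):
    """Naive O(N^2) evaluation: sum pair_energy over all unordered atom pairs."""
    n = len(atoms)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each pair (i, j) visited exactly once
            total += pair_energy(atoms[i], atoms[j])
    return total
```

For N atoms the inner loop body runs N(N-1)/2 times, hence the quadratic growth for large proteins.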
1. Introduction-cont.
Key contributions:
A workload distribution scheme that achieves perfect or nearly perfect load balancing among GPU threads in the symmetric N-body problem.
Reordering the protein atom sequence by atom types to improve cache efficiency on the latest NVIDIA Fermi GPU architecture.
"GPU-DFIRE": a GPU implementation of the DFIRE energy function in the CUDA programming environment. "CPU-DFIRE": the serial CPU version.
A performance comparison of GPU-DFIRE using the pair-wise interaction method against the cell subdivision method and the neighbor lists method.
A Monte Carlo sampling program and a local optimization program are used to demonstrate the efficiency of GPU-DFIRE in all-atom protein structure modeling.
2. Knowledge-based Energy
Physics-based energy functions intend to evaluate the energy of a protein molecule by describing atomic interactions with physics-based energy terms.
Knowledge-based (statistical) energy functions are developed to estimate the feasibility of protein conformations rather than the true physical energy. Knowledge-based approaches derive rules, by statistical means, from the increasing number of experimentally determined protein conformations.
3. GPU-DFIRE Implementation Details
The DFIRE energy function is used as an example to illustrate our GPU implementation of memory-intensive knowledge-based energy functions.
DFIRE is an all-atom potential energy function, and the computation of the DFIRE energy is nearly an N-body calculation.
3.1 DFIRE Potential
Starting from the first atom in the ATOMS array, for every atom in the protein, the DFIRE program retrieves the energy terms between the current atom and the remaining atoms not in the same residue.
Each energy term is obtained by calculating the pair-wise atom distance, converting it into a distance bin, and then looking up the appropriate energy term value in the large DFIRE array. The DFIRE calculation has a distance cutoff: if the distance between two atoms is greater than the cutoff, the interaction between them is deemed small enough to be ignored.
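The lookup pipeline (distance, then bin, then table entry, with cutoff) can be sketched as follows. The cutoff value, bin width, and table layout here are illustrative assumptions, not the actual DFIRE parameters:

```python
import math

CUTOFF = 15.0      # assumed cutoff distance (illustrative, not the DFIRE value)
BIN_WIDTH = 0.5    # assumed distance-bin width (illustrative)

def pair_term(a, b, table):
    """One pair-wise energy term: distance -> bin -> table lookup, with cutoff.

    a, b: tuples (atom_type, x, y, z);
    table[type_a][type_b][bin] holds a statistical energy value.
    """
    d = math.dist(a[1:], b[1:])        # pair-wise atom distance
    if d >= CUTOFF:
        return 0.0                     # beyond the cutoff: interaction ignored
    k = int(d / BIN_WIDTH)             # convert distance into a bin index
    return table[a[0]][b[0]][k]        # look up the energy term value
```

On the GPU, `table` corresponds to the large DFIRE array held in global memory, which is what makes the function memory-intensive.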
3.1 DFIRE Potential-cont.
Symmetric DFIRE: a near N-body calculation,
DFIRE(ATOMS[i], ATOMS[j], d) = DFIRE(ATOMS[j], ATOMS[i], d)
[Figure: naive assignment of workload to threads. Thread 1 (atom 1) computes N-1 interactions, thread 2 (atom 2) computes N-2, ..., thread N (atom N) computes 0, so the workload is severely unbalanced.]
[Pseudocode of the symmetric N-body calculation in serial DFIRE shown on slide.]
3.2 Symmetric N-Body Problem
[Figure: three workload distribution schemes: perfect balancing, near-perfect balancing, and unbalanced.]
3.2 Symmetric N-Body Problem-cont.
[Figure: cyclic workload assignment for odd N. Thread i evaluates atoms i+1 through i+(N-1)/2, with indices wrapping around modulo N.]
Perfect balancing when N is an odd number: each thread carries out (N-1)/2 atom-atom interactions.
3.2 Symmetric N-Body Problem-cont.
[Figure: cyclic workload assignment for even N. Each of the first N/2 threads evaluates the next N/2 atoms; each of the second N/2 threads evaluates the next N/2-1 atoms, with indices wrapping around modulo N.]
Nearly perfect balancing when N is an even number: the first N/2 threads carry out N/2 atom-atom interactions, and the second N/2 threads carry out N/2-1 interactions.
3.2 Symmetric N-Body Problem-cont.
Pseudocode of assigning workload to a GPU thread in GPU–DFIRE
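The cyclic workload assignment described above can be sketched on the CPU as follows (0-based indices; this illustrates only the assignment logic, not the CUDA kernel itself):

```python
def thread_workload(i, n):
    """Cyclic atom indices that thread i evaluates against atom i.

    Odd n: every thread handles the next (n-1)//2 atoms -> perfect balance.
    Even n: threads 0..n/2-1 handle n/2 atoms, the rest n/2-1 -> near-perfect.
    Over all threads, every unordered atom pair is covered exactly once.
    """
    if n % 2 == 1:
        span = (n - 1) // 2
    else:
        span = n // 2 if i < n // 2 else n // 2 - 1
    return [(i + k) % n for k in range(1, span + 1)]
```

Because the symmetric term DFIRE(i, j, d) equals DFIRE(j, i, d), covering each unordered pair once is sufficient, and the per-thread work differs by at most one interaction.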
3.3 Cache Efficiency
Reorder the atom sequence in a protein according to atom types.
[Figure: sorting of a 6-residue protein fragment; atoms of the same type are clustered after reordering.]
Clustering atoms of the same type together can potentially improve the cache hit rate. Another advantage of reordering the atom sequence by atom type is that, in case of cache misses, the requested global memory addresses have a good chance of falling within fewer cache lines than with an unsorted atom sequence, which can lead to better bus utilization.
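The reordering itself is a stable sort of the ATOMS array by atom type. A minimal sketch, assuming a simple (atom_type, atom_name) tuple layout for illustration:

```python
def reorder_by_type(atoms):
    """Stable sort of the ATOMS array by atom type, clustering same-type atoms
    so that consecutive threads touch nearby regions of the DFIRE table."""
    return sorted(atoms, key=lambda a: a[0])  # a[0] is the atom type
```

The sort is stable, so atoms of the same type keep their original relative order.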
3.4 Other GPU-Specific Implementations
Techniques used: coalesced global memory access, fine-grained threads, parallel sum reduction, exploiting the GPU memory hierarchy, and loop unrolling.

Data layout:
DFIRE array: indexed by (1st atom kind, 2nd atom kind, distance bin); stores the statistical potential energy values of each possible atom pair based on their distance.
ATOMS array: atom type, residue number, coordinates (X, Y, Z).

Work balancing: each thread carries an equal number of interactions.
Fine-grained threads: create a large number of threads, near the maximum the GPU hardware can launch; the pair-wise atom interaction calculations are evenly distributed among these threads.
GPU memory hierarchy: load ATOMS into shared memory using a tiling technique. Break the ATOMS array into tiles (a tile represents a fixed dimension of the atom data); load one tile into shared memory; once all threads are done with that tile, load the next one, overwriting the previous tile; repeat until all computations in the thread block are completed. Sort ATOMS and make use of the L1 cache.
Coalesced memory access: restructure ATOMS from an "array of structures" to a "structure of arrays"; the number of threads per block is chosen to be a multiple of the warp size.
Parallel sum reduction: a tree-based approach to compute the overall DFIRE energy from the partial energy sums generated by the GPU threads.
Loop unrolling: replace the body of a loop with multiple copies of itself.
Parallel sum reduction
Parallel reduction is a tree-based approach used within each thread block: each block reduces a portion of the array. Partial results are communicated among thread blocks through multiple launches of the reduction kernel.
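A sequential sketch of the tree-based summation pattern (this uses interleaved addressing, one common indexing variant; on the GPU, all additions at a given stride level run in parallel across threads):

```python
def tree_reduce(values):
    """Tree-based sum reduction: at each level, element i accumulates
    element i + stride, with the stride doubling each level, so the
    total is gathered into element 0 in O(log N) levels."""
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):  # parallel on the GPU
            vals[i] += vals[i + stride]
        stride *= 2
    return vals[0] if vals else 0.0
```

Each `while` iteration corresponds to one synchronized step within a thread block; reducing the per-block partial sums then takes further kernel launches.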
4. Computational Results
[Figure: overall speedup (up to ~300x) of GPU-DFIRE versus number of atoms (up to 100,000) on the Tesla M2050 and Quadro FX3800M.]
Overall speedup of GPU-DFIRE on the Tesla M2050 and Quadro FX3800M with respect to CPU-DFIRE, on a set of proteins of various sizes. CPU-DFIRE runs on a server with an Intel i7 920 CPU. The computation time measures are based on the average of 10 runs for each protein.
4. Computational Results – cont.
The Tesla M2050 GPU (Fermi architecture) has 14 multiprocessors with 32 cores each, 3 GB of global memory, 64 KB of on-chip RAM per multiprocessor (configurable between shared memory and L1 cache), and 32 K registers per multiprocessor.
The Quadro FX3800M GPU (non-Fermi architecture) has 16 multiprocessors with 8 cores each, 1 GB of global memory, and 16 KB of shared memory per multiprocessor.
The CPU version of DFIRE (CPU-DFIRE) runs on a server with an Intel i7 920 CPU @ 2.67 GHz, 8 MB cache, and 6 GB of memory.
We first benchmark GPU-DFIRE on a set of proteins of various sizes. The GPU time we measure includes the time of transferring the protein information (ATOMS) arrays to GPU device memory, the GPU execution time, and the time of retrieving the calculated overall DFIRE energy from the GPU.
Then, we apply GPU-DFIRE to a Monte Carlo program for protein conformation sampling and a program for protein energy local minimization.
For CPU-DFIRE we use the gcc compiler with the default "-O3" optimization flag specified in the DFIRE package. For GPU-DFIRE, the nvcc compiler in CUDA 2.0 is used with the "-O3" flag.
Cell-linked list method
The cell-linked list method starts by subdividing the simulation space into cells of equal size, with an edge length no smaller than the cut-off radius of the interaction to be computed.
The particles are sorted into these cells, and interactions are computed only between particles in the same or neighboring cells.
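A minimal sketch of the cell-linked list idea, assuming 3-D points and uniform cells whose edge length equals the cutoff:

```python
from collections import defaultdict
from itertools import product

def cell_list_pairs(points, cutoff):
    """Cell subdivision: bin points into cells of edge length >= cutoff,
    then pair each point only with points in its own and neighboring cells.
    Returns candidate pairs (i, j) with i < j; every pair actually within
    the cutoff is guaranteed to be among them."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(idx)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):  # 27 neighbor cells
            for i in members:
                for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs
```

For roughly uniform particle densities, the candidate set grows linearly with N instead of quadratically, which is the point of the method.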
Neighboring list method
The neighbor list (NBL) method builds on the cell-linked list method: instead of computing the interactions of neighboring particles directly, as in the cell-linked list method, the NBL method first stores the neighbors of each particle in a matrix.
In a third step, it iterates over the entries of this matrix and computes the interactions among them.
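A sketch of the neighbor-list step; for brevity, a direct double loop stands in for gathering candidate pairs from the cell-linked list:

```python
import math

def build_neighbor_list(points, cutoff):
    """Store each particle's neighbors (within cutoff) once, so that later
    energy passes iterate the stored lists instead of all O(N^2) pairs.
    (The real method gathers candidates from the cell-linked list first;
    here a direct double loop stands in for that step.)"""
    nbl = [[] for _ in points]
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < cutoff:
                nbl[i].append(j)
                nbl[j].append(i)
    return nbl
```

The stored list can be reused over several simulation steps as long as particles have not moved far enough to change their neighborhoods.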
4.2 Cell Subdivision and Neighbor Lists Methods
[Figure: computation time (0-24 msec) versus number of atoms (up to 20,000) for the three GPU-DFIRE variants.]
Performance comparison of GPU-DFIRE implementations using the pair-wise interaction method, the cell subdivision method, and the neighbor lists method.
4.3 Applications in Protein Structure Modeling
Monte Carlo sampling: 3GDG, with 7,992 atoms. The acceleration of the energy function evaluation significantly improves the performance of the Monte Carlo computation, with an average speedup of 82.67; the computation time for 10^5 iterations is reduced from more than an hour to less than one minute.
Local structure optimization (BFGS): using the MINIROT program provided by the TINKER package, an average speedup of ~4.5 is observed.
5. Summary
GPU-DFIRE: a GPU implementation of the all-atom, knowledge-based DFIRE potential energy function.
A workload distribution scheme is designed to achieve perfect or near-perfect load balancing in the symmetric N-body problem in DFIRE.
Reordering the atoms in a protein according to atom types also improves GPU cache efficiency.
Other standard GPU programming techniques are adopted to take full advantage of the GPU architecture.
Speedups of up to ~250 and ~150 are observed for GPU-DFIRE on the Tesla M2050 and the Quadro FX3800M, respectively, relative to the DFIRE computation on the CPU.
Because evaluating the energy function is the most costly operation in many protein structure modeling applications, significant performance improvements are also found in a Monte Carlo program for protein conformation sampling and a protein structure local optimization program when GPU-DFIRE replaces CPU-DFIRE.
5. Summary-cont.
The techniques developed in GPU-DFIRE can also be applied to other knowledge-based energy functions.
Future research will investigate the development of a general GPU-accelerated framework that can easily be employed to accelerate other knowledge-based energy functions.
Questions?
Thank You
GPU & CUDA
GPU-DFIRE in Monte Carlo Optimization
Step 1: Initialize the system temperature to T and start from an initial protein configuration R.
Step 2: Calculate the DFIRE energy E(R).
Repeat:
  Step 3: Propose a new configuration R' by applying a small random displacement to a set of randomly selected atoms in R.
  Step 4: Calculate the DFIRE energy E(R').
  Step 5: Accept the new configuration R' according to the Metropolis acceptance probability e^(-(E(R')-E(R))/T).
Until convergence is reached.
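The Metropolis acceptance rule in Step 5 can be sketched as follows (the function name and `rng` parameter are illustrative, not part of the original program):

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng=random.random):
    """Step 5: accept a proposed configuration R' with probability
    min(1, exp(-(E(R') - E(R)) / T)), where delta_e = E(R') - E(R)."""
    if delta_e <= 0.0:
        return True  # downhill or neutral moves are always accepted
    return rng() < math.exp(-delta_e / temperature)
```

Injecting `rng` makes the rule deterministic for testing; in the sampling loop the default `random.random` is used.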