
Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units

A. Yaseen, Yaohang Li, “Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units,” Journal of Parallel and Distributed Computing, 72(2): 297-307, 2012

March 19th 2013


Abstract

• Evaluating protein energy is computationally very costly.
• This work presents an efficient GPU implementation of a knowledge-based energy function.
• It achieves perfect or near-perfect load balancing among GPU threads.
• It improves cache efficiency by reorganizing the problem.
• Speedup compared to the serial implementation:
  • ~150 on NVIDIA Quadro FX3800M
  • ~250 on NVIDIA Tesla M2050
• Protein structure modeling applications, including a Monte Carlo sampling program and a local optimization program, can benefit from the GPU implementation.


Contents

• Introduction
• Knowledge-based Energy
• Implementation Details
  • DFIRE Potential
  • Symmetric N-Body Problem
  • Cache Efficiency
  • Other GPU-Specific Implementations
• Computational Results
  • Overall Speedup
  • Comparison with Cell Subdivision Method and Neighbor Lists Method
  • Applications in Protein Structure Modeling
  • Local Structure Optimization
• Summary


1. Introduction

• The most costly operations in many protein modeling applications are usually energy evaluations of protein molecules.
• The major part of most protein energy evaluations involves estimating interactions between pairs of atoms, which is typically an N-body problem.
• For a protein system with N atoms, conventional evaluation of the energy function requires O(N^2) operations.
• This takes substantial computational time for large proteins.


1. Introduction-cont.

Key contributions
• A workload distribution scheme that achieves perfect or nearly perfect load balancing among GPU threads in the symmetric N-body problem.
• Reordering the protein atom sequence by atom type to improve cache efficiency on the latest NVIDIA Fermi GPU architecture.
• “GPU-DFIRE”: a GPU implementation of the DFIRE energy function using the CUDA programming environment. “CPU-DFIRE”: the serial CPU version.
• A performance comparison of GPU-DFIRE using the pair-wise interaction method with the cell subdivision method and the neighbor lists method.
• A Monte Carlo sampling program and a local optimization program are used to demonstrate the efficiency of GPU-DFIRE in all-atom protein structure modeling.


2. Knowledge-based Energy

• Physics-based energy functions aim to evaluate the energy of a protein molecule by describing atomic interactions with physics-based energy terms.
• Knowledge-based (statistical) energy functions are developed to estimate the feasibility of protein conformations instead of the true physical energy.
• Knowledge-based approaches use statistical methods to derive rules from the increasing number of experimentally determined protein conformations.


3. GPU-DFIRE Implementation Details

• The DFIRE energy function is used as an example to illustrate our GPU implementation of memory-intensive knowledge-based energy functions.
• DFIRE is an all-atom potential energy function.
• The computation of the DFIRE energy is close to an N-body calculation.


3.1 DFIRE Potential

• Starting from the first atom in the ATOMS array, for every atom in the protein, the DFIRE program retrieves the energy terms between the current atom and the rest of the atoms not in the same residue.
• The energy term is obtained by calculating the pair-wise atom distance, converting it into a distance bin, and then looking up the appropriate energy term value in the large DFIRE array.
• The DFIRE calculation has a distance cutoff: if the distance between two atoms is bigger than the cutoff, the interaction between these two atoms is deemed small enough to be ignored.
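The per-pair lookup described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from the DFIRE package: the cutoff `CUTOFF`, the bin width `BIN_WIDTH`, the tuple layout of an atom, and the table contents are all hypothetical placeholders.

```python
import math

# Hypothetical parameters, chosen only for illustration.
CUTOFF = 15.0       # distance cutoff
BIN_WIDTH = 0.5     # width of one distance bin
NUM_BINS = int(CUTOFF / BIN_WIDTH)

def pair_energy(table, atom_a, atom_b):
    """Return the energy term for one atom pair, or 0.0 if it is skipped.

    Each atom is a tuple (atom_type, residue_number, (x, y, z)).
    table[type_a][type_b][distance_bin] holds the statistical potential.
    """
    type_a, res_a, xyz_a = atom_a
    type_b, res_b, xyz_b = atom_b
    if res_a == res_b:              # atoms in the same residue are skipped
        return 0.0
    d = math.dist(xyz_a, xyz_b)     # pair-wise atom distance
    if d > CUTOFF:                  # beyond the cutoff: ignored
        return 0.0
    # Convert the distance into a bin; clamp so d == CUTOFF stays in range.
    distance_bin = min(int(d / BIN_WIDTH), NUM_BINS - 1)
    return table[type_a][type_b][distance_bin]
```

Every pair costs one distance computation and one table read, which is why the slides describe the function as memory-intensive.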


3.1 DFIRE Potential-cont.

Symmetric DFIRE - near N-body calculations

DFIRE(ATOMS[i], ATOMS[j], d) = DFIRE(ATOMS[j], ATOMS[i], d)

[Figure: Pseudocode of symmetric N-body calculation in serial DFIRE]

[Figure: Naïve assignment of workload to the threads]
Thread 1 (Atom 1): N-1 interactions
Thread 2 (Atom 2): N-2 interactions
...
Thread N (Atom N): 0 interactions
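The pseudocode figure itself did not survive in this transcript. As a rough stand-in, a serial symmetric loop and the resulting naïve per-thread workload might look like the sketch below. The function names and the `pair_energy` callback are assumptions made for illustration, not the authors' code.

```python
def serial_symmetric_energy(atoms, pair_energy):
    """Sum pair_energy over every unordered pair (i, j) with i < j."""
    total = 0.0
    n = len(atoms)
    for i in range(n):
        for j in range(i + 1, n):   # symmetry: (j, i) is never revisited
            total += pair_energy(atoms[i], atoms[j])
    return total

def naive_workload(n):
    """Interactions per thread when thread i simply runs atom i's inner loop.

    Indices are 0-based here, so thread 0 corresponds to "Thread 1" on the
    slide and performs n - 1 interactions, while the last thread performs 0.
    """
    return [n - 1 - i for i in range(n)]
```

For example, `naive_workload(5)` gives `[4, 3, 2, 1, 0]`: the total work is correct, but the first thread does all the waiting threads' share, which motivates the balanced scheme in Section 3.2.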


3.2 Symmetric N-Body Problem

[Figure: three workload distribution cases: perfect balancing, near-perfect balancing, and unbalanced]


3.2 Symmetric N-Body Problem-cont.

[Figure: workload assignment for odd N; each row lists a thread, its own atom, and the atoms it interacts with]
Thread 1: Atom 1 with Atoms 2, 3, 4, ..., (N-1)/2, (N-1)/2+1
Thread 2: Atom 2 with Atoms 3, 4, 5, ..., (N-1)/2+1, (N-1)/2+2
Thread 3: Atom 3 with Atoms 4, 5, 6, ..., (N-1)/2+2, (N-1)/2+3
...
Thread N-1: Atom N-1 with Atoms N, 1, 2, ..., (N-1)/2-2, (N-1)/2-1
Thread N: Atom N with Atoms 1, 2, 3, ..., (N-1)/2-1, (N-1)/2
Every thread: (N-1)/2 interactions

Perfect balancing when N is an odd number. Each thread carries out (N-1)/2 atom-atom interactions.


3.2 Symmetric N-Body Problem-cont.

[Figure: workload assignment for even N; each row lists a thread, its own atom, and the atoms it interacts with]
First N/2 threads (N/2 interactions each):
Thread 1: Atom 1 with Atoms 2, 3, 4, ..., N/2, N/2+1
Thread 2: Atom 2 with Atoms 3, 4, 5, ..., N/2+1, N/2+2
Thread 3: Atom 3 with Atoms 4, 5, 6, ..., N/2+2, N/2+3
...
Thread N/2: Atom N/2 with Atoms N/2+1, N/2+2, N/2+3, ..., N-1, N
Second N/2 threads (N/2-1 interactions each):
Thread N/2+1: Atom N/2+1 with Atoms N/2+2, N/2+3, N/2+4, ..., N
...
Thread N-1: Atom N-1 with Atoms N, 1, 2, ..., N/2-2
Thread N: Atom N with Atoms 1, 2, 3, ..., N/2-1

Nearly perfect balancing when N is an even number. The first N/2 threads carry out N/2 atom-atom interactions. The second N/2 threads carry out N/2-1 interactions


3.2 Symmetric N-Body Problem-cont.

Pseudocode of assigning workload to a GPU thread in GPU–DFIRE
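The pseudocode figure is missing from this transcript. The sketch below is a reconstruction from the odd-N and even-N diagrams on the preceding slides, not the authors' code; the names `partners` and `all_pairs` are hypothetical, and atoms are numbered from 0 rather than 1.

```python
def partners(i, n):
    """Atoms that thread i pairs with atom i under the cyclic assignment."""
    if n % 2 == 1:
        count = (n - 1) // 2                  # odd N: same load everywhere
    else:
        # Even N: the first n/2 threads take n/2 partners, the rest n/2 - 1,
        # so the pair (i, i + n/2) is handled by exactly one of its atoms.
        count = n // 2 if i < n // 2 else n // 2 - 1
    return [(i + k) % n for k in range(1, count + 1)]   # wrap around

def all_pairs(n):
    """Collect the unordered pairs produced by all n threads."""
    pairs = []
    for i in range(n):
        for j in partners(i, n):
            pairs.append((min(i, j), max(i, j)))
    return pairs
```

Under these assumptions, `all_pairs(n)` contains each of the N(N-1)/2 unordered pairs exactly once, and the per-thread loads differ by at most one interaction.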


3.3 Cache Efficiency

Reorder the atom sequence in a protein according to atom type.

[Figure: The sorting of a 6-residue protein fragment, where atoms of the same type are clustered after reordering.]

Clustering atoms of the same type together can potentially improve the cache hit rate. Another advantage of reordering the atom sequence by atom type is that, in case of cache misses, the requested global memory addresses have a good chance of falling within fewer cache lines than with an unsorted atom sequence, which can lead to better bus utilization.
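A minimal sketch of the reordering, assuming each atom is stored as a tuple `(atom_type, residue_number, (x, y, z))`. The two-residue fragment and its coordinates are invented for illustration and are not taken from the slides.

```python
def reorder_by_type(atoms):
    """Return atoms sorted by atom type.

    sorted() is stable, so atoms of the same type keep their original
    relative order, and each atom carries its residue number with it.
    """
    return sorted(atoms, key=lambda atom: atom[0])

# Hypothetical fragment: (atom_type, residue_number, (x, y, z)).
fragment = [
    ("N", 1, (0.0, 0.0, 0.0)),
    ("CA", 1, (1.5, 0.0, 0.0)),
    ("C", 1, (2.0, 1.4, 0.0)),
    ("N", 2, (3.3, 1.5, 0.0)),
    ("CA", 2, (4.0, 2.8, 0.0)),
    ("C", 2, (5.5, 2.6, 0.0)),
]
reordered = reorder_by_type(fragment)
types = [atom[0] for atom in reordered]
```

Because the total energy is a sum over all atom pairs, the order of the atoms does not change the result; it only changes which table entries are touched close together in time.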


3.4 Other GPU-Specific Implementations

• Coalesced global memory access
• Fine-grained threads
• Parallel sum reduction
• Take advantage of GPU memory hierarchy
• Loop unrolling


GPU-DFIRE

Input data
• DFIRE array, indexed by (1st atom kind, 2nd atom kind, distance bin): statistical potential energy values of each possible atom pair based on their distance
• ATOMS array: atom type, residue number, coordinates (XYZ)

Techniques
• Work balancing: each thread carries out an equal number of interactions.
• Fine-grained threads: create a large number of threads, near the maximum the GPU hardware can launch. The pair-wise atom interaction calculations are evenly distributed among these threads.
• Shared memory: load ATOMS into shared memory using the tiling technique.
  • Break the ATOMS array into tiles (a tile represents a fixed dimension of the atom data).
  • Load one tile into shared memory.
  • Once all threads are done with that tile, load the next one (overwriting the previous tile).
  • Repeat until all computations in the thread block are completed.
• Sort ATOMS: make use of the L1 cache.
• Coalesced memory access: restructure ATOMS from “array of structures” to “structure of arrays”. Threads per block are chosen to be a multiple of the warp size.
• Parallel sum reduction: a tree-based approach to compute the overall DFIRE energy from the partial energy sums generated by the GPU threads.
• Loop unrolling: replacing the body of a loop with multiple copies of itself.
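Two of the layout ideas above, the “structure of arrays” form of ATOMS and the splitting of an array into fixed-size tiles, can be illustrated on the host side as follows. Python lists only mimic the data layout; the field names and the helper names are assumptions for this sketch, and the real implementation would do this in CUDA device memory.

```python
def to_structure_of_arrays(atoms):
    """Convert [(type, residue, (x, y, z)), ...] into one list per field.

    With one array per field, consecutive threads reading the same field
    touch consecutive elements, which is what coalesced access needs.
    """
    return {
        "type": [a[0] for a in atoms],
        "residue": [a[1] for a in atoms],
        "x": [a[2][0] for a in atoms],
        "y": [a[2][1] for a in atoms],
        "z": [a[2][2] for a in atoms],
    }

def tiles(values, tile_size):
    """Yield consecutive fixed-size slices; the last one may be shorter."""
    for start in range(0, len(values), tile_size):
        yield values[start:start + tile_size]
```

On the GPU, each slice produced by a loop like `tiles` would be staged in shared memory, used by all threads of the block, and then replaced by the next one.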


Parallel sum reduction

• A tree-based approach is used within each thread block.
• Each block reduces a portion of the array.
• Partial results are communicated among thread blocks by multiple launches of the reduction kernel.
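The reduction pattern can be simulated sequentially to show its structure. This is a sketch under assumptions, not the CUDA kernel: `block_reduce` stands in for the work of one thread block, and each pass of the outer loop in `reduce_all` stands in for one kernel launch.

```python
def block_reduce(values):
    """Tree-based sum of one block: halve the active range at every step."""
    data = list(values)
    n = len(data)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):       # on a GPU these additions run in parallel
            data[i] += data[i + half]
        n = half
    return data[0] if data else 0.0

def reduce_all(values, block_size):
    """Reduce block by block, repeating until one value is left.

    Returns (total, number_of_passes); each pass models one kernel launch.
    """
    if block_size < 2:
        raise ValueError("block_size must be at least 2")
    data = list(values)
    launches = 0
    while len(data) > 1:
        data = [block_reduce(data[s:s + block_size])
                for s in range(0, len(data), block_size)]
        launches += 1
    return (data[0] if data else 0.0), launches
```

With 100 partial sums and a block size of 8, the array shrinks from 100 to 13 to 2 to 1 values, so three passes are needed.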


4. Computational Results

[Figure: Speedup versus number of atoms (0 to 100,000) for Tesla M2050 and Quadro FX3800M]

• Overall speedup of GPU-DFIRE on Tesla M2050 and Quadro FX3800M, with respect to CPU-DFIRE, on a set of proteins of various sizes.
• CPU-DFIRE runs on a server with an Intel i7 CPU 920.
• The computation times are averaged over 10 runs for each protein.


4. Computational Results-cont.

• The Tesla M2050 GPU (Fermi architecture) has 14 multiprocessors with 32 cores each, 3 GB of global memory, 64 kB of RAM that can be configured between shared memory and L1 cache, and 32 kB of registers per multiprocessor.
• The Quadro FX3800M GPU (non-Fermi architecture) has 16 multiprocessors with 8 cores each, 1 GB of global memory, and 16 kB of RAM.
• The CPU version of DFIRE (CPU-DFIRE) runs on a server with an Intel i7 CPU 920 @ 2.67 GHz, 8 MB cache, and 6 GB of memory.
• We first benchmark GPU-DFIRE on a set of proteins of various sizes.
• The GPU time we measured includes the time to transfer the protein information (ATOMS) arrays to GPU device memory, the GPU execution time, and the time to retrieve the calculated overall DFIRE energy from the GPU.
• Then, we apply GPU-DFIRE to a Monte Carlo program for protein conformation sampling and a program for protein energy local minimization.
• For CPU-DFIRE, we use the gcc compiler with the default “-O3” optimization flag specified in the DFIRE package. For GPU-DFIRE, the nvcc compiler in CUDA 2.0 is used with the “-O3” flag.


Cell-linked list method

The cell-linked list method starts by subdividing the simulation space into cells of equal size, with an edge length greater than or equal to the cut-off radius of the interaction to be computed.

The particles are sorted into these cells, and the interactions are computed between particles in the same and neighboring cells.
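A compact sketch of the cell-linked list idea, assuming 3D points given as tuples and a cell edge equal to the cut-off radius. The function name and the dictionary-based grid are choices made for this illustration; they are not taken from the GPU implementation compared in the slides.

```python
import math
from collections import defaultdict
from itertools import product

def cell_list_pairs(points, cutoff):
    """Return all pairs (i, j), i < j, whose distance is <= cutoff.

    With a cell edge equal to the cutoff, any such pair lies in the same
    cell or in two adjacent cells, so only 27 cells are checked per cell.
    """
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / cutoff) for c in p)
        cells[key].append(idx)          # sort particles into cells

    pairs = set()
    for key, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbor = tuple(k + o for k, o in zip(key, offset))
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs
```

The result matches a brute-force scan over all pairs, while distant pairs are never examined.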


Neighboring list method

The neighbor list (NBL) method builds on the cell-linked list method. The first step is the same. In the second step, instead of computing the interactions of neighboring particles directly, the NBL method stores the neighbors of each particle in a matrix.

In a third step, we go over the entries in the matrix and calculate the interactions among them.
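The three steps can be sketched as below. This is an illustration under assumptions: a ragged list of lists stands in for the neighbor matrix, the cell edge equals the cut-off radius, and `pair_term` is a hypothetical callback for the per-pair energy.

```python
import math
from collections import defaultdict
from itertools import product

def build_neighbor_lists(points, cutoff):
    """Steps 1 and 2: bin points into cells, then record each point's neighbors.

    neighbors[i] lists every j > i within the cutoff, so each pair is
    stored exactly once.
    """
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(c / cutoff) for c in p)].append(idx)

    neighbors = [[] for _ in points]
    for key, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            other = cells.get(tuple(k + o for k, o in zip(key, offset)), ())
            for i in members:
                for j in other:
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        neighbors[i].append(j)
    return neighbors

def energy_from_neighbors(points, neighbors, pair_term):
    """Step 3: walk the stored entries and accumulate the interactions."""
    return sum(pair_term(points[i], points[j])
               for i, row in enumerate(neighbors) for j in row)
```

The stored lists can be reused across several energy evaluations as long as the atoms move only slightly, which is the usual motivation for keeping them.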


4.2 Cell Subdivision and Neighbor Lists Methods

[Figure: Time (msec) versus # of atoms (0 to 20,000) for Cell Subdivision, Neighbor Lists, and Pair-wise Interaction]

Performance comparison of GPU-DFIRE implementations using the pair-wise interaction method, the cell subdivision method, and the neighbor lists method


4.3 Applications in Protein Structure Modeling

Monte Carlo Sampling
• Protein 3GDG with 7,992 atoms.
• The acceleration in the energy function evaluation significantly improves the performance of the Monte Carlo computation, with an average speedup of 82.67. The original computation time for 10^5 iterations is reduced from more than an hour to less than one minute.

Local Structure Optimization (BFGS)
• MINIROT program provided by the TINKER package
• An average speedup of ~4.5


5. Summary

• GPU-DFIRE: a GPU implementation of the all-atom, knowledge-based DFIRE potential energy function.
• A workload distribution scheme is designed to achieve perfect or near-perfect load balancing in the symmetric N-body problem in DFIRE.
• Reordering atoms in the protein according to atom type also improves the GPU cache efficiency.
• Other standard GPU programming techniques are adopted to take full advantage of the GPU architecture.
• Speedups of up to 150 and 250 are observed in GPU-DFIRE on Quadro FX3800M and Tesla M2050, respectively, relative to the DFIRE computation on the CPU.
• Because evaluating the energy function is the most costly operation in many protein structure modeling applications, significant performance improvements are also found in a Monte Carlo program for protein conformation sampling and a protein structure local optimization program when GPU-DFIRE is used to replace CPU-DFIRE.


5. Summary-cont.

The techniques developed in GPU-DFIRE can also be applied to other knowledge-based energy functions.

Future research directions will include investigating the development of a general GPU-accelerated framework that can be easily employed to accelerate other knowledge-based energy functions.


Questions?

Thank You


GPU & CUDA


GPU-DFIRE in Monte Carlo Optimization

Step 1: Initialize the system temperature to T and start from an initial protein configuration, R
Step 2: Calculate DFIRE energy: E(R)
Repeat
  Step 3: Propose a new configuration R’ by applying a small random displacement to a set of randomly selected atoms in R
  Step 4: Calculate DFIRE energy: E(R’)
  Step 5: Accept the new configuration R’ according to the Metropolis acceptance probability: exp(-(E(R’) - E(R)) / T)
Until convergence is reached
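The loop above can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than details from the slides: only one atom is displaced per step, a fixed step count replaces the unspecified convergence test, and `energy` is any callable (in the real program it would be the GPU-DFIRE evaluation).

```python
import math
import random

def metropolis_sample(energy, config, temperature, steps, step_size=0.1, seed=0):
    """Run a fixed number of Metropolis steps on a list of (x, y, z) atoms.

    Returns the final configuration and its energy.
    """
    rng = random.Random(seed)
    current = [tuple(p) for p in config]              # Step 1: initial R
    e_current = energy(current)                       # Step 2: E(R)
    for _ in range(steps):
        proposal = list(current)
        i = rng.randrange(len(proposal))              # Step 3: displace one atom
        proposal[i] = tuple(c + rng.uniform(-step_size, step_size)
                            for c in proposal[i])
        e_new = energy(proposal)                      # Step 4: E(R')
        delta = e_new - e_current
        # Step 5: accept with probability min(1, exp(-delta / T)).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = proposal, e_new
    return current, e_current
```

Each iteration calls `energy` once on the full configuration, which is why speeding up the energy evaluation translates almost directly into a faster sampling run.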
