Accelerating Knowledge-based Energy Evaluation in Protein
Structure Modeling with Graphics Processing Units
A. Yaseen, Yaohang Li, “Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units,” Journal of Parallel and Distributed Computing, 72(2): 297-307, 2012
March 19th 2013
Abstract
Evaluating protein energy is computationally costly. We present an efficient implementation of a knowledge-based energy function on the GPU that achieves perfect or near-perfect load balancing and improves cache efficiency by reorganizing the problem.
Speedup compared to the serial implementation: ~150 on an NVIDIA Quadro FX3800M, ~250 on an NVIDIA Tesla M2050.
Protein structure modeling applications, including a Monte Carlo sampling program and a local optimization program, can benefit from the GPU implementation.
Contents
Introduction
Knowledge-based Energy
Implementation Details: DFIRE Potential; Symmetric N-Body Problem; Cache Efficiency; Other GPU-Specific Implementations
Computational Results: Overall Speedup; Comparison with the Cell Subdivision and Neighbor Lists Methods; Applications in Protein Structure Modeling; Local Structure Optimization
Summary
1. Introduction
In many protein modeling applications, the most costly operations are usually the energy evaluations of protein molecules.
The major part of most protein energy evaluations involves estimating pair-wise atomic interactions, which is typically an N-body problem.
For a protein system with N atoms, conventional evaluation of the energy function requires O(N^2) operations, which leads to substantial computational time for large proteins.
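The O(N^2) cost is easy to see in a direct pairwise loop. A minimal sketch; the `pair_energy` callback is a placeholder for an energy term, not the actual DFIRE code:

```python
def total_energy(atoms, pair_energy):
    """Naive O(N^2) evaluation: sum pair_energy over all unordered atom pairs."""
    n = len(atoms)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each pair (i, j) visited exactly once
            total += pair_energy(atoms[i], atoms[j])
    return total
```

For N atoms the inner loop body runs N(N-1)/2 times, hence the quadratic growth for large proteins.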
1. Introduction-cont.
Key contributions:
A workload distribution scheme that achieves perfect or nearly perfect load balancing among GPU threads in the symmetric N-body problem.
Reordering the protein atom sequence by atom types to improve cache efficiency on the latest NVIDIA Fermi GPU architecture.
"GPU-DFIRE": a GPU implementation of the DFIRE energy function in the CUDA programming environment. "CPU-DFIRE": the serial CPU version.
A performance comparison of GPU-DFIRE using the pair-wise interaction method against the cell subdivision method and the neighbor lists method.
A Monte Carlo sampling program and a local optimization program are used to demonstrate the efficiency of GPU-DFIRE in all-atom protein structure modeling.
2. Knowledge-based Energy
Physics-based energy functions intend to evaluate the energy of a protein molecule by describing atomic interactions with physics-based energy terms.
Knowledge-based (statistical) energy functions are developed to estimate the feasibility of protein conformations rather than the true physical energy. Knowledge-based approaches derive rules, by statistical means, from the increasing number of experimentally determined protein conformations.
3. GPU-DFIRE Implementation Details
The DFIRE energy function is used as an example to illustrate our GPU implementation of memory-intensive knowledge-based energy functions.
DFIRE is an all-atom potential energy function, and the computation of the DFIRE energy is nearly an N-body calculation.
3.1 DFIRE Potential
Starting from the first atom in the ATOMS array, for every atom in the protein, the DFIRE program retrieves the energy terms between the current atom and the remaining atoms not in the same residue.
Each energy term is obtained by calculating the pair-wise atom distance, converting it into a distance bin, and then looking up the appropriate energy term value in the large DFIRE array. The DFIRE calculation has a distance cutoff: if the distance between two atoms is greater than the cutoff, the interaction between them is deemed small enough to be ignored.
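The lookup pipeline (distance, then bin, then table entry, with cutoff) can be sketched as follows. The cutoff value, bin width, and table layout here are illustrative assumptions, not the actual DFIRE parameters:

```python
import math

CUTOFF = 15.0      # assumed cutoff distance (illustrative, not the DFIRE value)
BIN_WIDTH = 0.5    # assumed distance-bin width (illustrative)

def pair_term(a, b, table):
    """One pair-wise energy term: distance -> bin -> table lookup, with cutoff.

    a, b: tuples (atom_type, x, y, z);
    table[type_a][type_b][bin] holds a statistical energy value.
    """
    d = math.dist(a[1:], b[1:])        # pair-wise atom distance
    if d >= CUTOFF:
        return 0.0                     # beyond the cutoff: interaction ignored
    k = int(d / BIN_WIDTH)             # convert distance into a bin index
    return table[a[0]][b[0]][k]        # look up the energy term value
```

On the GPU, `table` corresponds to the large DFIRE array held in global memory, which is what makes the function memory-intensive.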
3.1 DFIRE Potential-cont.
Symmetric DFIRE: a near N-body calculation,
DFIRE(ATOMS[i], ATOMS[j], d) = DFIRE(ATOMS[j], ATOMS[i], d)
[Figure: naive assignment of workload to threads. Thread 1 (atom 1) computes N-1 interactions, thread 2 (atom 2) computes N-2, ..., thread N (atom N) computes 0, so the workload is severely unbalanced.]
[Pseudocode of the symmetric N-body calculation in serial DFIRE shown on slide.]
3.2 Symmetric N-Body Problem
[Figure: three workload distribution schemes: perfect balancing, near-perfect balancing, and unbalanced.]
3.2 Symmetric N-Body Problem-cont.
[Figure: cyclic workload assignment for odd N. Thread i evaluates atoms i+1 through i+(N-1)/2, with indices wrapping around modulo N.]
Perfect balancing when N is an odd number: each thread carries out (N-1)/2 atom-atom interactions.
3.2 Symmetric N-Body Problem-cont.
[Figure: cyclic workload assignment for even N. Each of the first N/2 threads evaluates the next N/2 atoms; each of the second N/2 threads evaluates the next N/2-1 atoms, with indices wrapping around modulo N.]
Nearly perfect balancing when N is an even number: the first N/2 threads carry out N/2 atom-atom interactions, and the second N/2 threads carry out N/2-1 interactions.
3.2 Symmetric N-Body Problem-cont.
Pseudocode of assigning workload to a GPU thread in GPU–DFIRE
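The cyclic workload assignment described above can be sketched on the CPU as follows (0-based indices; this illustrates only the assignment logic, not the CUDA kernel itself):

```python
def thread_workload(i, n):
    """Cyclic atom indices that thread i evaluates against atom i.

    Odd n: every thread handles the next (n-1)//2 atoms -> perfect balance.
    Even n: threads 0..n/2-1 handle n/2 atoms, the rest n/2-1 -> near-perfect.
    Over all threads, every unordered atom pair is covered exactly once.
    """
    if n % 2 == 1:
        span = (n - 1) // 2
    else:
        span = n // 2 if i < n // 2 else n // 2 - 1
    return [(i + k) % n for k in range(1, span + 1)]
```

Because the symmetric term DFIRE(i, j, d) equals DFIRE(j, i, d), covering each unordered pair once is sufficient, and the per-thread work differs by at most one interaction.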
3.3 Cache Efficiency
Reorder the atom sequence in a protein according to atom types.
[Figure: sorting of a 6-residue protein fragment; atoms of the same type are clustered after reordering.]
Clustering atoms of the same type together can potentially improve the cache hit rate. Another advantage of reordering the atom sequence by atom type is that, in case of cache misses, the requested global memory addresses have a good chance of falling within fewer cache lines than with an unsorted atom sequence, which can lead to better bus utilization.
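The reordering itself is a stable sort of the ATOMS array by atom type. A minimal sketch, assuming a simple (atom_type, atom_name) tuple layout for illustration:

```python
def reorder_by_type(atoms):
    """Stable sort of the ATOMS array by atom type, clustering same-type atoms
    so that consecutive threads touch nearby regions of the DFIRE table."""
    return sorted(atoms, key=lambda a: a[0])  # a[0] is the atom type
```

The sort is stable, so atoms of the same type keep their original relative order.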
3.4 Other GPU-Specific Implementations
Techniques used: coalesced global memory access, fine-grained threads, parallel sum reduction, exploiting the GPU memory hierarchy, and loop unrolling.

Data layout:
DFIRE array: indexed by (1st atom kind, 2nd atom kind, distance bin); stores the statistical potential energy values of each possible atom pair based on their distance.
ATOMS array: atom type, residue number, coordinates (X, Y, Z).

Work balancing: each thread carries an equal number of interactions.
Fine-grained threads: create a large number of threads, near the maximum the GPU hardware can launch; the pair-wise atom interaction calculations are evenly distributed among these threads.
GPU memory hierarchy: load ATOMS into shared memory using a tiling technique. Break the ATOMS array into tiles (a tile represents a fixed dimension of the atom data); load one tile into shared memory; once all threads are done with that tile, load the next one, overwriting the previous tile; repeat until all computations in the thread block are completed. Sort ATOMS and make use of the L1 cache.
Coalesced memory access: restructure ATOMS from an "array of structures" to a "structure of arrays"; the number of threads per block is chosen to be a multiple of the warp size.
Parallel sum reduction: a tree-based approach to compute the overall DFIRE energy from the partial energy sums generated by the GPU threads.
Loop unrolling: replace the body of a loop with multiple copies of itself.
Parallel sum reduction
Parallel reduction is a tree-based approach used within each thread block: each block reduces a portion of the array. Partial results are communicated among thread blocks through multiple launches of the reduction kernel.
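A sequential sketch of the tree-based summation pattern (this uses interleaved addressing, one common indexing variant; on the GPU, all additions at a given stride level run in parallel across threads):

```python
def tree_reduce(values):
    """Tree-based sum reduction: at each level, element i accumulates
    element i + stride, with the stride doubling each level, so the
    total is gathered into element 0 in O(log N) levels."""
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):  # parallel on the GPU
            vals[i] += vals[i + stride]
        stride *= 2
    return vals[0] if vals else 0.0
```

Each `while` iteration corresponds to one synchronized step within a thread block; reducing the per-block partial sums then takes further kernel launches.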
4. Computational Results
[Figure: overall speedup (up to ~300x) of GPU-DFIRE versus number of atoms (up to 100,000) on the Tesla M2050 and Quadro FX3800M.]
Overall speedup of GPU-DFIRE on the Tesla M2050 and Quadro FX3800M with respect to CPU-DFIRE, on a set of proteins of various sizes. CPU-DFIRE runs on a server with an Intel i7 920 CPU. The computation time measures are based on the average of 10 runs for each protein.
4. Computational Results – cont.
The Tesla M2050 GPU (Fermi architecture) has 14 multiprocessors with 32 cores each, 3 GB of global memory, 64 KB of on-chip RAM per multiprocessor (configurable between shared memory and L1 cache), and 32 K registers per multiprocessor.
The Quadro FX3800M GPU (non-Fermi architecture) has 16 multiprocessors with 8 cores each, 1 GB of global memory, and 16 KB of shared memory per multiprocessor.
The CPU version of DFIRE (CPU-DFIRE) runs on a server with an Intel i7 920 CPU @ 2.67 GHz, 8 MB cache, and 6 GB of memory.
We first benchmark GPU-DFIRE on a set of proteins of various sizes. The GPU time we measure includes the time of transferring the protein information (ATOMS) arrays to GPU device memory, the GPU execution time, and the time of retrieving the calculated overall DFIRE energy from the GPU.
Then, we apply GPU-DFIRE to a Monte Carlo program for protein conformation sampling and a program for protein energy local minimization.
For CPU-DFIRE we use the gcc compiler with the default "-O3" optimization flag specified in the DFIRE package. For GPU-DFIRE, the nvcc compiler in CUDA 2.0 is used with the "-O3" flag.
Cell-linked list method
The cell-linked list method starts by subdividing the simulation space into cells of equal size, with an edge length no smaller than the cut-off radius of the interaction to be computed.
The particles are sorted into these cells, and interactions are computed only between particles in the same or neighboring cells.
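A minimal sketch of the cell-linked list idea, assuming 3-D points and uniform cells whose edge length equals the cutoff:

```python
from collections import defaultdict
from itertools import product

def cell_list_pairs(points, cutoff):
    """Cell subdivision: bin points into cells of edge length >= cutoff,
    then pair each point only with points in its own and neighboring cells.
    Returns candidate pairs (i, j) with i < j; every pair actually within
    the cutoff is guaranteed to be among them."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(idx)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):  # 27 neighbor cells
            for i in members:
                for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs
```

For roughly uniform particle densities, the candidate set grows linearly with N instead of quadratically, which is the point of the method.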
Neighboring list method
The neighbor list (NBL) method builds on the cell-linked list method: instead of computing the interactions of neighboring particles directly, as in the cell-linked list method, the NBL method first stores the neighbors of each particle in a matrix.
In a third step, it iterates over the entries of this matrix and computes the interactions among them.
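A sketch of the neighbor-list step; for brevity, a direct double loop stands in for gathering candidate pairs from the cell-linked list:

```python
import math

def build_neighbor_list(points, cutoff):
    """Store each particle's neighbors (within cutoff) once, so that later
    energy passes iterate the stored lists instead of all O(N^2) pairs.
    (The real method gathers candidates from the cell-linked list first;
    here a direct double loop stands in for that step.)"""
    nbl = [[] for _ in points]
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < cutoff:
                nbl[i].append(j)
                nbl[j].append(i)
    return nbl
```

The stored list can be reused over several simulation steps as long as particles have not moved far enough to change their neighborhoods.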
4.2 Cell Subdivision and Neighbor Lists Methods
[Figure: computation time (0-24 msec) versus number of atoms (up to 20,000) for the three GPU-DFIRE variants.]
Performance comparison of GPU-DFIRE implementations using the pair-wise interaction method, the cell subdivision method, and the neighbor lists method.
4.3 Applications in Protein Structure Modeling
Monte Carlo sampling: 3GDG, with 7,992 atoms. The acceleration of the energy function evaluation significantly improves the performance of the Monte Carlo computation, with an average speedup of 82.67; the computation time for 10^5 iterations is reduced from more than an hour to less than one minute.
Local structure optimization (BFGS): using the MINIROT program provided by the TINKER package, an average speedup of ~4.5 is observed.
5. Summary
GPU-DFIRE: a GPU implementation of the all-atom, knowledge-based DFIRE potential energy function.
A workload distribution scheme is designed to achieve perfect or near-perfect load balancing in the symmetric N-body problem in DFIRE.
Reordering the atoms in a protein according to atom types also improves GPU cache efficiency.
Other standard GPU programming techniques are adopted to take full advantage of the GPU architecture.
Speedups of up to ~250 and ~150 are observed for GPU-DFIRE on the Tesla M2050 and the Quadro FX3800M, respectively, relative to the DFIRE computation on the CPU.
Because evaluating the energy function is the most costly operation in many protein structure modeling applications, significant performance improvements are also found in a Monte Carlo program for protein conformation sampling and a protein structure local optimization program when GPU-DFIRE replaces CPU-DFIRE.
5. Summary-cont.
The techniques developed in GPU-DFIRE can also be applied to other knowledge-based energy functions.
Future research will investigate the development of a general GPU-accelerated framework that can easily be employed to accelerate other knowledge-based energy functions.
Questions?
Thank You
GPU & CUDA
GPU-DFIRE in Monte Carlo Optimization
Step 1: Initialize the system temperature to T and start from an initial protein configuration R.
Step 2: Calculate the DFIRE energy E(R).
Repeat:
  Step 3: Propose a new configuration R' by applying a small random displacement to a set of randomly selected atoms in R.
  Step 4: Calculate the DFIRE energy E(R').
  Step 5: Accept the new configuration R' according to the Metropolis acceptance probability e^(-(E(R')-E(R))/T).
Until convergence is reached.
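The Metropolis acceptance rule in Step 5 can be sketched as follows (the function name and `rng` parameter are illustrative, not part of the original program):

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng=random.random):
    """Step 5: accept a proposed configuration R' with probability
    min(1, exp(-(E(R') - E(R)) / T)), where delta_e = E(R') - E(R)."""
    if delta_e <= 0.0:
        return True  # downhill or neutral moves are always accepted
    return rng() < math.exp(-delta_e / temperature)
```

Injecting `rng` makes the rule deterministic for testing; in the sampling loop the default `random.random` is used.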