ACCELERATING SANJEEVINI: A DRUG DISCOVERY SOFTWARE SUITE
Abhilash Jayaraj, IIT Delhi | Bharatkumar Sharma, Nvidia
Shashank Shekhar, IIT Delhi | Nagavijayalakshmi, Nvidia
2
AGENDA
• Quick introduction to the computer-aided drug discovery software Sanjeevini
• Challenges
  • Code documentation is still in the process of being improved
  • Code is maintained by non-computer-science domain experts
  • Code was designed to suit distributed programming
• Constraints
  • Code modification should be minimal (ease of maintenance)
  • The current cluster has a mix of CPUs and GPUs; the code should run on both (portability)
• Learnings
  • What to expect and what not to
3
COMPUTER AIDED DRUG DISCOVERY
Introduction

Target Discovery -> Lead Generation -> Lead Optimization -> Preclinical Development -> Phase I, II & III Clinical Trials -> FDA Review & Approval -> Drug to the Market

[Figure: pipeline timeline, roughly 14 years and $1.4 billion overall; per-stage durations of 2.5, 3.0, 1.0, 6.0, and 1.5 years with cost shares of 4%, 15%, 10%, 68%, and 3%]
4
SANJEEVINI FOR COMPUTER AIDED DRUG DESIGN
Overview

[Figure: Sanjeevini workflow]
Inputs: protein-ligand complex / protein / DNA sequence; a self-drawn ligand molecule; NRDBSM / million-molecule library / natural products database
• Check Lipinski compliance
• Generate rapid binding energy estimates by the RASPD protocol
• Predict all possible binding sites and store the top ten sites
• Dock and score
• Optimize geometry / assign TPACM4 or derive quantum mechanical charges
• Assign force field parameters
• Perform molecular dynamics simulations and post facto free energy component analyses (optional)
• Generate canonical A/B DNA or an MD-averaged structure of B DNA
5
SANJEEVINI
GPU acceleration

▪ OpenACC acceleration of the ParDOCK module
▪ All-atom energy based Monte Carlo docking for protein-ligand complexes
6
PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize (an iterative cycle)
7
PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize (this step: Analyze)
8
SANJEEVINI: PARDOCK
Flat profile: hotspots

% time | cumulative s | self s | calls       | self s/call | total s/call | name
 69.78 |       557.90 | 557.90 | 1188000     |        0.00 |         0.00 | PDB::EnergyCalculator()
 12.92 |       661.19 | 103.29 | 8           |       12.91 |        20.26 | PDB::clashCombination()
  7.35 |       719.96 |  58.77 | 26051422500 |        0.00 |         0.00 | getRadius1()
  5.49 |       763.85 |  43.89 | 885075      |        0.00 |         0.00 | PDB::energyAtom()
9
PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize (this step: Parallelize)
10
SANJEEVINI: PARDOCK
CPU code: EnergyCalculator

double PDB::EnergyCalculator( float **&EnergyGrid, const vector<points> &vDrugGrid,
                              points coords[], const unsigned &totalDockAtoms, … ){
    double ene = 0.0;
    for( int atomcount = 0; atomcount < totalDockAtoms; atomcount++ ){
        double minDis;     // minimum 'distance' seen so far
        int    minCounter; // grid index corresponding to minDis
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute ‘distance’ between coords[atomcount] and vDrugGrid[counter]
            // minDis = minimum of ‘distance’, minCounter = counter corresponding to minDis
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}
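The elided inner search is a nearest-grid-point lookup. As a minimal standalone sketch of what the two comments stand for (the 'points' layout with x/y/z fields is an assumption for illustration, not taken from the Sanjeevini source):

#include <cfloat>
#include <cmath>
#include <vector>

struct points { float x, y, z; };   // assumed layout, for illustration only

// Returns the index of the grid point nearest to 'atom'.
static int nearestGridPoint( const points &atom, const std::vector<points> &grid ){
    float minDis = FLT_MAX;
    int   minCounter = 0;
    for( int counter = 0; counter < (int)grid.size(); counter++ ){
        const float dx = atom.x - grid[counter].x;
        const float dy = atom.y - grid[counter].y;
        const float dz = atom.z - grid[counter].z;
        const float distance = std::sqrt( dx*dx + dy*dy + dz*dz );
        if( distance < minDis ){ minDis = distance; minCounter = counter; }
    }
    return minCounter;
}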
11
OPENACC: Simple | Powerful | Portable

Fueling the next wave of scientific discoveries in HPC

University of Illinois: PowerGrid (MRI reconstruction)
70x speed-up, 2 days of effort

RIKEN Japan: NICAM (climate modeling)
7-8x speed-up, 5% of code modified

main() {
    <serial code>
    #pragma acc kernels   // automatically runs on GPU
    {
        <parallel code>
    }
}

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
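As a self-contained illustration of the kernels directive (a toy vector add, not Sanjeevini code; the pgc++ invocation is an assumption about the build environment):

// Compile with e.g. pgc++ -acc (assumed toolchain)
#include <vector>

int main(){
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float *xp = x.data(), *yp = y.data();     // raw pointers for the device region

    #pragma acc kernels copyin(xp[0:n]) copy(yp[0:n])   // loop below runs on the GPU
    for( int i = 0; i < n; ++i )
        yp[i] += xp[i];

    return ( yp[0] == 3.0f ) ? 0 : 1;         // sanity check on the host
}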
12
OPENACC DIRECTIVES

Manage data movement | Initiate parallel execution | Optimize loop mappings

#pragma acc data copyin(x,y) copyout(z)
{
    ...
    #pragma acc parallel
    {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
            ...
        }
    }
    ...
}

Performance portable | Interoperable | Single source | Incremental
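A compilable version of the snippet above, exercising all three directive families (illustrative sizes and values):

#include <cstdlib>

int main(){
    const int n = 4096;
    float *x = (float *)malloc( n * sizeof(float) );
    float *y = (float *)malloc( n * sizeof(float) );
    float *z = (float *)malloc( n * sizeof(float) );
    for( int i = 0; i < n; ++i ){ x[i] = 1.0f; y[i] = 2.0f; }

    #pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])   // manage data movement
    {
        #pragma acc parallel                                  // initiate parallel execution
        {
            #pragma acc loop gang vector                      // optimize loop mapping
            for( int i = 0; i < n; ++i )
                z[i] = x[i] + y[i];
        }
    }

    int rc = ( z[0] == 3.0f ) ? 0 : 1;
    free(x); free(y); free(z);
    return rc;
}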
13
SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (1)

double PDB::EnergyCalculator( float **&EnergyGrid, const vector<points> &vDrugGrid,
                              points coords[], const unsigned &totalDockAtoms, … ){
    double ene = 0.0;
    #pragma acc parallel loop reduction(+:ene) private(minDis,minCounter) \
                              present(…) copyin(…) firstprivate(…)
    for( int atomcount = 0; atomcount < totalDockAtoms; atomcount++ ){
        #pragma acc loop reduction(min:minDis)
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute ‘distance’ between coords[atomcount] and vDrugGrid[counter]
            minDis = (minDis > distance) ? distance : minDis;
        }
14
SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (2)

        #pragma acc loop reduction(min:minCounter)
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute ‘distance’ between coords[atomcount] and vDrugGrid[counter]
            if ( distance == minDis ){
                minCounter = (minCounter > counter) ? counter : minCounter;
            }
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}
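OpenACC reductions return a single scalar and offer no argmin, which is why the slide uses two passes: a min reduction over distances, then a min reduction over the indices whose distance matches. A condensed standalone sketch of that pattern (hypothetical function, not the ParDOCK source):

#include <cfloat>
#include <climits>

// Two-pass argmin: pass 1 finds the minimum value, pass 2 finds the
// smallest index holding that value (deterministic tie-breaking).
int argminOpenACC( const float *dist, int n ){
    float minDis = FLT_MAX;
    #pragma acc parallel loop reduction(min:minDis) copyin(dist[0:n])
    for( int i = 0; i < n; ++i )
        minDis = ( minDis > dist[i] ) ? dist[i] : minDis;

    int minIdx = INT_MAX;
    #pragma acc parallel loop reduction(min:minIdx) copyin(dist[0:n])
    for( int i = 0; i < n; ++i )
        if( dist[i] == minDis )
            minIdx = ( minIdx > i ) ? i : minIdx;

    return minIdx;
}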
15
SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (3)

▪ Use a ‘raw data pointer’ to access vectors

const points *vDrugGridData = vDrugGrid.data();
// compute ‘distance’ between coords[atomcount] and vDrugGridData[counter]
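The compiler cannot map std::vector's internals onto the device, so the raw pointer (and the size) are captured on the host before the compute region. A minimal sketch of that placement (illustrative names, not ParDOCK code):

#include <vector>

float sumVector( const std::vector<float> &v ){
    const float *vData = v.data();   // raw pointer extracted on the host
    const int n = (int)v.size();     // size captured as a plain value
    float s = 0.0f;
    #pragma acc parallel loop reduction(+:s) copyin(vData[0:n])
    for( int i = 0; i < n; ++i )
        s += vData[i];               // device code touches only the raw pointer
    return s;
}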
16
SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (4)

unsigned totDockAtoms = totalDockAtoms;   // plain value copied from the reference
float **eneGrid = EnergyGrid;             // plain pointer copied from the reference

#pragma acc parallel loop reduction(+:ene) … \
    copyin(coords[0:totDockAtoms]) present(eneGrid)
...
ene += eneGrid[minCounter][atomcount];

▪ Use a ‘raw data pointer’ to access vectors
▪ Avoid using C++ references in OpenACC pragmas
17
SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (4, contd.)

▪ Use a ‘raw data pointer’ to access vectors
▪ Avoid using C++ references in OpenACC pragmas

Compiler output when the C++ references were still used directly in the pragmas:

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &,
                      const std::vector<points, std::allocator<points>> &, points *,
                      const unsigned int &, energy &, int):
    22, Generating present(vDrugGridData[:])
        Generating copyin(coords[:totalDockAtoms->])
        Generating present(EnergyGrid[:][:][:])

-> Runtime memory access violation
18
OPENACC: 3 LEVELS OF PARALLELISM

• Vector threads work in lockstep (SIMD/SIMT parallelism)
• Workers compute a vector
• Gangs have one or more workers and share resources (such as cache, the streaming multiprocessor, etc.)
• Multiple gangs work independently of each other

[Figure: two gangs, each containing workers whose vector lanes execute in lockstep]
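As a hedged sketch of how the levels map onto nested loops (a matrix-vector product with illustrative clauses and sizes; the worker level is left to the compiler here):

// y = A*x, row-major A: gangs take rows, vector lanes sweep columns.
void matvec( const float *A, const float *x, float *y, int rows, int cols ){
    #pragma acc parallel loop gang copyin(A[0:rows*cols], x[0:cols]) copyout(y[0:rows])
    for( int i = 0; i < rows; ++i ){          // gangs: independent rows
        float sum = 0.0f;
        #pragma acc loop vector reduction(+:sum)
        for( int j = 0; j < cols; ++j )       // vector lanes: lockstep over columns
            sum += A[i*cols + j] * x[j];
        y[i] = sum;
    }
}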
19
SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyCalculator

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &,
                      const std::vector<points, std::allocator<points>> &, points *,
                      const unsigned int &, energy &, int):
    22, Generating present(vDrugGridData[:],eneGrid[:][:])
        Generating copyin(coords[:totDockAtoms])
    22, Accelerator kernel generated
        Generating Tesla code
        22, Generating reduction(+:ene)
        24, #pragma acc loop gang /* blockIdx.x */
        31, #pragma acc loop vector(256) /* threadIdx.x */
            Generating reduction(min:minDis)
        45, #pragma acc loop vector(256) /* threadIdx.x */
            Generating reduction(min:minIdx)
    31, Loop is parallelizable
    45, Loop is parallelizable
20
MANAGE DATA HIGHER IN THE PROGRAM

Currently, data is moved at the beginning and end of each function, in case the data is needed on the CPU.
We know that the data is only needed on the CPU after convergence.
We should tell the compiler when data movement is really needed, to improve performance.
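As an illustration of the idea (hypothetical names, not the ParDOCK source), the data region is hoisted around the convergence loop so the array stays resident across iterations:

// One transfer in, one transfer out, regardless of iteration count.
void iterateToConvergence( float *state, int n, int maxIter ){
    #pragma acc data copy(state[0:n])
    {
        for( int iter = 0; iter < maxIter; ++iter ){
            #pragma acc parallel loop present(state[0:n])
            for( int i = 0; i < n; ++i )
                state[i] *= 0.5f;    // stand-in for the real update
            // no host access here, so no copies between iterations
        }
    }   // 'state' copied back once, after convergence
}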
21
STRUCTURED DATA REGIONS

The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region. Arrays used within the data region will remain on the GPU until the end of the data region.

#pragma acc data
{
    #pragma acc parallel loop
    ...
    #pragma acc parallel loop
    ...
}
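A small concrete instance of the skeleton above (illustrative only): two kernels share one array through a single data region, with a single copy-out at the end:

void twoKernels( float *a, int n ){
    #pragma acc data copyout(a[0:n])    // device copy lives for the whole region
    {
        #pragma acc parallel loop
        for( int i = 0; i < n; ++i ) a[i] = (float)i;   // kernel 1 fills 'a'

        #pragma acc parallel loop
        for( int i = 0; i < n; ++i ) a[i] *= 2.0f;      // kernel 2 reuses it in place
    }
}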
22
UNSTRUCTURED DATA DIRECTIVES
Used to define data regions when scoping doesn’t allow the use of normal data regions (e.g. the constructor/destructor of a class).
enter data Defines the start of an unstructured data lifetime
• clauses: copyin(list), create(list)
exit data Defines the end of an unstructured data lifetime
• clauses: copyout(list), delete(list), finalize
#pragma acc enter data copyin(a)
...
#pragma acc exit data delete(a)
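A minimal sketch of the constructor/destructor case mentioned above (hypothetical class, not from the talk):

class DeviceBuffer {
    float *data_;
    int    n_;
public:
    DeviceBuffer( int n ) : n_(n) {
        data_ = new float[n_]();
        #pragma acc enter data copyin(data_[0:n_])   // device lifetime starts here
    }
    ~DeviceBuffer(){
        #pragma acc exit data delete(data_[0:n_])    // device lifetime ends here
        delete[] data_;
    }
    void scale( float s ){
        float *d = data_; int n = n_;                // plain locals for the pragma
        #pragma acc parallel loop present(d[0:n])
        for( int i = 0; i < n; ++i ) d[i] *= s;
    }
};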
23
SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyAtom (3)

▪ Use a ‘raw data pointer’ to access vectors
▪ How do you access a ‘vector of vectors’ (jagged array)?

Creation and copy of jagged arrays:

int **vProteinListData = new int *[vProteinList.size()];
n = vProteinList.size();
#pragma acc enter data create(vProteinListData[0:n][0:1])
for( int count = 0; count < n; count++ ){
    int numPro = vProteinList[count].size();
    vProteinListData[count] = vProteinList[count].data();
    #pragma acc enter data copyin(vProteinListData[count:1][0:numPro])
}
24
SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyAtom (4)

▪ Use a ‘raw data pointer’ to access vectors
▪ How do you access a ‘vector of vectors’ (jagged array)?

Deletion of jagged arrays:

for( int count = 0; count < n; count++ ){
    int numPro = vProteinList[count].size();
    #pragma acc exit data delete(vProteinListData[count:1][0:numPro])
    vProteinListData[count] = NULL;
}
#pragma acc exit data delete(vProteinListData[0:n][0:1])
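Putting the two halves together, a condensed round-trip sketch for a vector of vectors (illustrative; it mirrors the create/copyin/delete sequence on these slides):

#include <vector>

void jaggedRoundTrip( std::vector< std::vector<int> > &vProteinList ){
    int n = (int)vProteinList.size();
    int **rows = new int *[n];
    #pragma acc enter data create(rows[0:n][0:1])        // pointer table on the device
    for( int i = 0; i < n; ++i ){
        int m = (int)vProteinList[i].size();
        rows[i] = vProteinList[i].data();
        #pragma acc enter data copyin(rows[i:1][0:m])    // row payload, pointer attached
    }

    // ... kernels using present(rows[:][:]) go here ...

    for( int i = 0; i < n; ++i ){
        int m = (int)vProteinList[i].size();
        #pragma acc exit data delete(rows[i:1][0:m])
        rows[i] = NULL;
    }
    #pragma acc exit data delete(rows[0:n][0:1])
    delete[] rows;
}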
25
SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyAtom (1)

PDB::energyAtom(const std::vector<PDB, std::allocator<PDB>> &, PDB, points,
                const std::vector<Box, std::allocator<Box>> &,
                const std::vector<int, std::allocator<int>> &,
                const std::vector<std::vector<int, std::allocator<int>>,
                std::allocator<std::vector<int, std::allocator<int>>>> &, int **):
     79, Generating enter data copyin(boxListData[:boxListNumElements],rec,coord)
     85, Generating present(coord,boxListData[:],rec,vProteinListData[:][:],vProData[:])
         Accelerator kernel generated
         Generating Tesla code
         85, Generating reduction(+:electro,vandw,ehyd)
         87, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    129, Generating exit data delete(boxListData[:boxListNumElements],rec,coord)
26
SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyAtom (2)

main:
    266, Generating enter data copyin(vProData[:vProNumElements])
         Generating enter data create(vProteinListData[:vProteinListNumElements][:1])
    275, Generating enter data copyin(vProteinListData[proList][:numElements])
    321, Generating exit data delete(vProteinListData[proList][:numElements])
    322, Generating exit data delete(vProteinListData[:vProteinListNumElements][:1],vProData[:vProNumElements])
27
CUDA UNIFIED MEMORY: Simplified Developer Effort

[Figure: without Unified Memory, the developer manages separate system and GPU memories; with Unified Memory, both appear as a single address space]

Sometimes referred to as “managed memory.”
New “Pascal” GPUs handle Unified Memory in hardware.
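A hedged sketch of what managed memory buys: assuming a PGI build with -ta=tesla:managed (under which heap allocations become CUDA managed memory), the earlier data clauses can often be dropped:

// Compile (assumption): pgc++ -acc -ta=tesla:managed unified.cpp
#include <vector>

int main(){
    const int n = 1 << 20;
    std::vector<float> v( n, 1.0f );
    float *p = v.data();

    #pragma acc parallel loop     // no data clauses: pages migrate on demand
    for( int i = 0; i < n; ++i )
        p[i] *= 2.0f;

    return ( p[0] == 2.0f ) ? 0 : 1;
}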
28
PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize (this step: Optimize)
30
SANJEEVINI: PARDOCK
Performance: CPU and GPU (1)

▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.8x/3.3x faster than CPU at 8 MPI processes, ROTATE=1000/100
16 MPI processes on a single GPU -> the GPU is the bottleneck!
31
SANJEEVINI: PARDOCK
Performance: CPU and GPU (2)

▪ Average ‘time to predict’ over 160 datasets
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.3x/3.2x faster than CPU at 8 MPI processes, ROTATE=1000/100
32
TESLA V100: The Fastest and Most Productive GPU for AI and HPC

• Volta architecture: most productive GPU
• Tensor Core: 125 programmable TFLOPS for deep learning
• Improved SIMT model: new algorithms
• Volta MPS: inference utilization
• Improved NVLink & HBM2: efficient bandwidth
33
MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS
34
GPU ACCELERATION OF LEGACY MPI APPS

Typical legacy application:
• MPI parallel
• Single or few threads per MPI rank (e.g. OpenMP)
• Running with multiple MPI ranks per node

GPU acceleration in phases:
• Proof-of-concept prototype, …
• Great speedup at the kernel level
• Application performance misses expectations
35
MULTI PROCESS SERVICE (MPS)
For Legacy MPI Applications

[Figure: execution timelines for N = 1, 2, 4, 8 MPI ranks, multicore CPU only vs GPU-accelerated with Hyper-Q/MPS; bars split into serial, CPU-parallel, and GPU-parallelizable parts. Hyper-Q/MPS is available on Tesla/Quadro GPUs with compute capability 3.5+ (e.g. K20, K40, K80, M40, …)]
36
PROCESSES SHARING GPU WITHOUT MPS: No Overlap

[Figure: processes A and B each hold their own context; their kernels execute in separate, non-overlapping windows on the GPU]
37
PROCESSES SHARING GPU WITHOUT MPS: Context Switch Overhead

[Figure: time-sliced use of the GPU, with a context switch between each process's window]
38
PROCESSES SHARING GPU WITH MPS: Maximum Overlap

[Figure: processes A and B submit through a single MPS process; kernels from both run concurrently on the GPU]
39
PROCESSES SHARING GPU WITH MPS: No Context Switch Overhead
40
SANJEEVINI: PARDOCK
Pascal vs Volta

▪ Average ‘time to predict’ over 160 datasets, ROTATE=1000
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal
43
SANJEEVINI: PARDOCK
OpenACC parallelization: learnings

▪ Use a ‘raw data pointer’ to access vectors
▪ Avoid using C++ references in OpenACC pragmas
▪ Standard classes called from an OpenACC region may result in compilation/linking errors; use math.h instead of cmath ☺ (see the sketch below)
▪ Unified memory has improved over time, but sometimes explicit data clauses are still needed to minimize data copies
▪ Volta works very well for programs that need MPS functionality
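As a hedged illustration of the math.h tip (the exact failure mode varies by compiler version):

#include <math.h>   // C functions such as sqrtf resolve for device code

void rootAll( float *a, int n ){
    #pragma acc parallel loop copy(a[0:n])
    for( int i = 0; i < n; ++i )
        a[i] = sqrtf( a[i] );          // OK on the device
        // a[i] = std::sqrt( a[i] );   // <cmath> wrapper: may fail to compile/link
}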
44
ONGOING WORK
45
SANJEEVINI: PARDOCK
Pascal vs Volta

▪ Average ‘time to predict’ over 160 datasets, ROTATE=1000
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal due to hardware-accelerated MPS
46
SANJEEVINI: PARDOCK
Multi-GPU scalability (2)

▪ ‘1qbt’ dataset, ROTATE=1000, 8 MPI processes
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access
▪ Higher concurrency is possible with more devices -> lower GPU time
▪ Lower latency with more devices/MPS servers -> lower CPU time
47
SANJEEVINI: PARDOCK
Multi-GPU scalability (3)

▪ ‘5cna’ dataset, ROTATE=100, 8 MPI processes, Tesla P100 GPUs, MPS
48
SANJEEVINI: PARDOCK
Pascal vs Volta (2)

▪ ‘1a4w’ dataset, ROTATE=100, 8 MPI processes, Tesla P100/V100 GPUs, MPS
49
REFERENCES: PARDOCK

• Gupta, A., et al. "ParDOCK: An all atom energy based Monte Carlo docking protocol for protein-ligand complexes." Protein and Peptide Letters 14.7 (2007): 632-646.
• Nishikawa, Joy L., et al. "Inhibiting fungal multidrug resistance by disrupting an activator–Mediator interaction." Nature 530.7591 (2016): 485.
• Singh, Tanya, D. Biswas, and Bhyravabhotla Jayaram. "AADS: An automated active site identification, docking, and scoring protocol for protein targets based on physicochemical descriptors." Journal of Chemical Information and Modeling 51.10 (2011): 2515-2527.
• Singh, Tanya, Olayiwola Adedotun Adekoya, and B. Jayaram. "Understanding the binding of inhibitors of matrix metalloproteinases by molecular docking, quantum mechanical calculations, molecular dynamics simulations, and a MMGBSA/MMBappl study." Molecular BioSystems 11.4 (2015): 1041-1051.
• Jayaram, Bhyravabhotla, et al. "Sanjeevini: a freely accessible web-server for target directed lead molecule discovery." BMC Bioinformatics 13, Suppl. 17 (2012).
50