
ACCELERATING SANJEEVINI: A DRUG DISCOVERY SOFTWARE SUITE

Abhilash Jayaraj, IIT Delhi | Bharatkumar Sharma, NVIDIA
Shashank Shekhar, IIT Delhi | Nagavijayalakshmi, NVIDIA

2

AGENDA

• Quick introduction to the computer-aided drug discovery software Sanjeevini

• Challenges
  • Code documentation is still in the process of being improved
  • Code is maintained by non-computer-science domain experts
  • Designed to suit distributed programming

• Constraints
  • Code modifications should be minimal, for ease of maintenance
  • The current cluster has a mix of CPUs and GPUs; the code should run on both (portability)

• Learnings
  • What to expect and what not to expect

3

COMPUTER AIDED DRUG DISCOVERY
Introduction

Target Discovery -> Lead Generation -> Lead Optimization -> Preclinical Development -> Phase I, II & III Clinical Trials -> FDA Review & Approval -> Drug to the Market

[Figure: typical pipeline of about 14 years and $1.4 billion per approved drug, with stage durations of 2.5, 3.0, 1.0, 6.0, and 1.5 years and cost shares of 4%, 15%, 10%, 68%, and 3%.]

4

SANJEEVINI FOR COMPUTER AIDED DRUG DESIGN
Overview

Inputs: protein-ligand complex / protein / DNA sequence; NRDBSM / million-molecule library / natural products database; self-drawn ligand molecule

• Check Lipinski compliance
• Generate rapid binding energy estimates with the RASPD protocol
• Predict all possible binding sites and store the top ten sites
• Generate canonical A/B DNA or an MD-averaged structure of B-DNA
• Optimize geometry / assign TPACM4 or derive quantum mechanical charges
• Assign force field parameters
• Dock and score
• Perform molecular dynamics simulations and post facto free energy component analyses (optional)

5

SANJEEVINI
GPU acceleration

▪ OpenACC acceleration of the ParDOCK module

▪ All-atom, energy-based Monte Carlo docking for protein-ligand complexes

6

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

7

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

8

SANJEEVINI: PARDOCK
Flat profile: hotspots

 % time   cumulative s   self s         calls   self s/call   total s/call   name
  69.78         557.90   557.90       1188000          0.00           0.00   PDB::EnergyCalculator()
  12.92         661.19   103.29             8         12.91          20.26   PDB::clashCombination()
   7.35         719.96    58.77   26051422500          0.00           0.00   getRadius1()
   5.49         763.85    43.89        885075          0.00           0.00   PDB::energyAtom()

9

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

10

SANJEEVINI: PARDOCK
CPU code: EnergyCalculator

double PDB::EnergyCalculator(float **&energyGrid, const vector<points> &vDrugGrid,
                             points coords[], const unsigned &totalDockAtoms, ... ) {
    double ene = 0.0;
    for (int atomcount = 0; atomcount < totalDockAtoms; atomcount++) {
        for (int counter = 0; counter < vDrugGrid.size(); counter++) {
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            // minDis = minimum of 'distance'; minCounter = counter corresponding to minDis
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}
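For context, a minimal self-contained sketch of the hotspot pattern above, with a hypothetical points struct and distance code filled in; the real ParDOCK member names and grid layout may differ:

#include <cmath>
#include <limits>
#include <vector>

struct points { float x, y, z; };  // hypothetical layout; the real struct may differ

// Nearest-grid-point energy lookup: for every docked atom, find the closest drug-grid
// point and accumulate the precomputed energy stored for that (grid point, atom) pair.
double energyCalculatorSketch(float **energyGrid, const std::vector<points> &vDrugGrid,
                              const points coords[], unsigned totalDockAtoms) {
    double ene = 0.0;
    for (unsigned atomcount = 0; atomcount < totalDockAtoms; ++atomcount) {
        float minDis = std::numeric_limits<float>::max();
        int minCounter = 0;
        for (int counter = 0; counter < static_cast<int>(vDrugGrid.size()); ++counter) {
            const float dx = coords[atomcount].x - vDrugGrid[counter].x;
            const float dy = coords[atomcount].y - vDrugGrid[counter].y;
            const float dz = coords[atomcount].z - vDrugGrid[counter].z;
            const float distance = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (distance < minDis) { minDis = distance; minCounter = counter; }
        }
        ene += energyGrid[minCounter][atomcount];  // energy of the atom at its nearest grid point
    }
    return ene;
}

The outer loop runs over a few thousand docked atoms at most, while the inner loop scans the whole drug grid, which is why this routine dominates the flat profile.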

11

OPENACC
Simple | Powerful | Portable

Fueling the Next Wave of Scientific Discoveries in HPC

University of Illinois, PowerGrid (MRI reconstruction): 70x speed-up, 2 days of effort
RIKEN Japan, NICAM (climate modeling): 7-8x speed-up, 5% of code modified

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

main()
{
  <serial code>

  #pragma acc kernels  // automatically runs on GPU
  {
    <parallel code>
  }
}

12

OPENACC DIRECTIVES

Manage data movement | Initiate parallel execution | Optimize loop mappings

#pragma acc data copyin(x,y) copyout(z)
{
  ...
  #pragma acc parallel
  {
    #pragma acc loop gang vector
    for (i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
      ...
    }
  }
  ...
}

Performance portable | Interoperable | Single source | Incremental
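Spelled out as a complete translation unit (an illustrative vector add, not code from the deck), the same directive stack looks like this:

// Complete, compilable version of the vector-add pattern above.
void vec_add(const float *x, const float *y, float *z, int n) {
    // Manage data movement around the whole region ...
    #pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])
    {
        // ... initiate parallel execution ...
        #pragma acc parallel
        {
            // ... and control how the loop maps onto gangs and vector lanes.
            #pragma acc loop gang vector
            for (int i = 0; i < n; ++i) {
                z[i] = x[i] + y[i];
            }
        }
    }
}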

13

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (1)

double PDB::EnergyCalculator(float **&energyGrid, const vector<points> &vDrugGrid,
                             points coords[], const unsigned &totalDockAtoms, ... ) {
    #pragma acc parallel loop reduction(+:ene) private(minDis,minCounter) present(...) copyin(...) firstprivate(...)  // data clauses filled in on later slides
    for (int atomcount = 0; atomcount < totalDockAtoms; atomcount++) {
        #pragma acc loop reduction(min:minDis)
        for (int counter = 0; counter < vDrugGrid.size(); counter++) {
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            minDis = (minDis > distance) ? distance : minDis;
        }

14

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (2)

        #pragma acc loop reduction(min:minCounter)
        for (int counter = 0; counter < vDrugGrid.size(); counter++) {
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            if (distance == minDis) {
                minCounter = (minCounter > counter) ? counter : minCounter;
            }
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}

15

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (3)

const points *vDrugGridData = vDrugGrid.data();

// compute 'distance' between coords[atomcount] and vDrugGridData[counter]

▪ Use a 'raw data pointer' to access vectors (see the sketch below)
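A minimal sketch of the raw-pointer idea (illustrative code, not the ParDOCK source): the OpenACC region works only on a plain C array obtained via data(), which the compiler can copy and index, rather than on the std::vector object itself.

#include <vector>

// Access a std::vector's storage through a raw pointer so the OpenACC region
// sees only a contiguous array with an explicit shape.
float sumSquares(const std::vector<float> &v) {
    const float *data = v.data();                 // raw pointer to contiguous storage
    const int n = static_cast<int>(v.size());
    float total = 0.0f;
    #pragma acc parallel loop reduction(+:total) copyin(data[0:n])
    for (int i = 0; i < n; ++i) {
        total += data[i] * data[i];               // device code indexes the raw array only
    }
    return total;
}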

16

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (4)

unsigned totDockAtoms = totalDockAtoms;
float **eneGrid = EnergyGrid;

#pragma acc parallel loop reduction(+:ene) ... copyin(coords[0:totDockAtoms]) present(eneGrid)

ene += eneGrid[minCounter][atomcount];

▪ Use a 'raw data pointer' to access vectors
▪ Avoid using C++ references in OpenACC pragmas

17

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (4)

unsigned totDockAtoms = totalDockAtoms;
float **eneGrid = EnergyGrid;

#pragma acc parallel loop reduction(+:ene) ... copyin(coords[0:totDockAtoms]) present(eneGrid)

ene += eneGrid[minCounter][atomcount];

▪ Use a 'raw data pointer' to access vectors
▪ Avoid using C++ references in OpenACC pragmas

Compiler output and result when the references are used directly in the pragmas:

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:])
         Generating copyin(coords[:totalDockAtoms->])
         Generating present(EnergyGrid[:][:][:])

Runtime memory access violation

18

OPENACC: 3 LEVELS OF PARALLELISM

• Vector threads work in lockstep (SIMD/SIMT parallelism)
• Workers compute a vector
• Gangs have 1 or more workers and share resources (such as cache, the streaming multiprocessor, etc.)
• Multiple gangs work independently of each other

[Figure: two gangs, each containing workers that operate on vectors.]
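As an illustration (not ParDOCK code), the gang and vector levels can be mapped explicitly onto a loop nest; this mirrors the compiler output shown later, where the outer loop becomes gangs (blockIdx.x) and the inner loop becomes vector lanes (threadIdx.x). Loop bounds and names here are arbitrary.

// Illustrative mapping of gang/vector parallelism onto nested loops.
// Workers are left implicit, as they usually are with only two loop levels.
void scale2d(float *a, int rows, int cols, float s) {
    #pragma acc parallel loop gang copy(a[0:rows*cols])   // one gang per row
    for (int i = 0; i < rows; ++i) {
        #pragma acc loop vector                           // vector lanes across the row
        for (int j = 0; j < cols; ++j) {
            a[i * cols + j] *= s;
        }
    }
}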

19

SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyCalculator

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:],eneGrid[:][:])
         Generating copyin(coords[:totDockAtoms])
     22, Accelerator kernel generated
         Generating Tesla code
         22, Generating reduction(+:ene)
         24, #pragma acc loop gang /* blockIdx.x */
         31, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minDis)
         45, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minIdx)
     31, Loop is parallelizable
     45, Loop is parallelizable

20

MANAGE DATA HIGHER IN THE PROGRAM

Currently, data is moved at the beginning and end of each function, in case the data is needed on the CPU.

We know that the data is only needed on the CPU after convergence.

We should tell the compiler when data movement is really needed, to improve performance; a minimal before/after sketch follows.
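A minimal sketch of this idea, assuming a generic iterate-until-convergence driver (the names and update kernel are illustrative, not ParDOCK's):

// Before: each call moves the array to the GPU and back, even though the host
// only needs the result after the convergence loop finishes.
//
//   while (!converged) { step(a, n); }   // step() contains its own copy(a[0:n])
//
// After: one data region around the whole loop; kernels inside reuse the device
// copy, and the host copy is refreshed once when the region ends.
void solve(float *a, int n) {
    bool converged = false;
    #pragma acc data copy(a[0:n])           // data stays resident for all iterations
    {
        while (!converged) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i) {
                a[i] = 0.5f * a[i];          // stand-in for the real update kernel
            }
            converged = true;                // stand-in convergence test
        }
    }                                        // single copy back to the host here
}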

21

STRUCTURED DATA REGIONS

The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region.

#pragma acc data
{
  #pragma acc parallel loop
  ...
  #pragma acc parallel loop
  ...
}

Arrays used within the data region will remain on the GPU until the end of the data region.

22

UNSTRUCTURED DATA DIRECTIVES

Used to define data regions when scoping doesn't allow the use of normal data regions (e.g. the constructor/destructor of a class; see the sketch below).

enter data: defines the start of an unstructured data lifetime
  • clauses: copyin(list), create(list)
exit data: defines the end of an unstructured data lifetime
  • clauses: copyout(list), delete(list), finalize

#pragma acc enter data copyin(a)
...
#pragma acc exit data delete(a)
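A minimal sketch of the constructor/destructor case mentioned above (an illustrative class, not from Sanjeevini): the device lifetime of the member array is tied to the object's lifetime.

#include <vector>

class DeviceBuffer {
public:
    explicit DeviceBuffer(int n) : data_(n, 0.0f) {
        float *p = data_.data();
        #pragma acc enter data copyin(p[0:n])    // device copy created with the object
    }
    ~DeviceBuffer() {
        float *p = data_.data();
        int n = static_cast<int>(data_.size());
        #pragma acc exit data delete(p[0:n])     // device copy released with the object
    }
private:
    std::vector<float> data_;
};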

23

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyAtom (3)

▪ Use a 'raw data pointer' to access vectors
▪ How do you access a 'vector of vectors' (a jagged array)?

Creation and copy of jagged arrays:

int **vProteinListData = new int *[vProteinList.size()];
n = vProteinList.size();
#pragma acc enter data create(vProteinListData[0:n][0:1])
for (int count = 0; count < n; count++) {
    int numPro = vProteinList[count].size();
    vProteinListData[count] = vProteinList[count].data();
    #pragma acc enter data copyin(vProteinListData[count:1][0:numPro])
}

24

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyAtom (4)

▪ Use a 'raw data pointer' to access vectors
▪ How do you access a 'vector of vectors' (a jagged array)?

Deletion of jagged arrays:

for (int count = 0; count < n; count++) {
    int numPro = vProteinList[count].size();
    #pragma acc exit data delete(vProteinListData[count:1][0:numPro])
    vProteinListData[count] = NULL;
}
#pragma acc exit data delete(vProteinListData[0:n][0:1])

25

SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyAtom (1)

PDB::energyAtom(const std::vector<PDB, std::allocator<PDB>> &, PDB, points, const std::vector<Box, std::allocator<Box>> &, const std::vector<int, std::allocator<int>> &, const std::vector<std::vector<int, std::allocator<int>>, std::allocator<std::vector<int, std::allocator<int>>>> &, int **):
     79, Generating enter data copyin(boxListData[:boxListNumElements],rec,coord)
     85, Generating present(coord,boxListData[:],rec,vProteinListData[:][:],vProData[:])
         Accelerator kernel generated
         Generating Tesla code
         85, Generating reduction(+:electro,vandw,ehyd)
         87, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    129, Generating exit data delete(boxListData[:boxListNumElements],rec,coord)

26

SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyAtom (2)

main:
    266, Generating enter data copyin(vProData[:vProNumElements])
         Generating enter data create(vProteinListData[:vProteinListNumElements][:1])
    275, Generating enter data copyin(vProteinListData[proList][:numElements])
    321, Generating exit data delete(vProteinListData[proList][:numElements])
    322, Generating exit data delete(vProteinListData[:vProteinListNumElements][:1],vProData[:vProNumElements])

27

CUDA UNIFIED MEMORY
Simplified Developer Effort

Without Unified Memory: separate system memory and GPU memory.
With Unified Memory: a single unified memory, sometimes referred to as "managed memory."

New "Pascal" GPUs handle Unified Memory in hardware.
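In OpenACC terms, unified (managed) memory often means the explicit data clauses above can be dropped and pages migrate on demand. A hedged sketch follows; enabling it with a PGI/NVHPC-era flag such as -ta=tesla:managed is an assumption about the exact toolchain, and heap allocations (including std::vector storage) are then allocated as managed memory by the compiler runtime.

#include <vector>

// With managed memory, dynamically allocated data can be touched from the GPU
// without copyin/copyout clauses; the runtime migrates it between system and
// GPU memory as needed.
float sumManaged(std::vector<float> &v) {
    float *data = v.data();
    int n = static_cast<int>(v.size());
    float total = 0.0f;
    #pragma acc parallel loop reduction(+:total)   // note: no data clauses
    for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}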

28

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

30

SANJEEVINI: PARDOCK
Performance: CPU and GPU (1)

▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.8x/3.3x faster than CPU alone at 8 MPI processes, for ROTATE=1000/100.

16 MPI processes on a single GPU -> the GPU is the bottleneck!

31

SANJEEVINI: PARDOCK
Performance: CPU and GPU (2)

▪ Average 'time to predict' over 160 datasets
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.3x/3.2x faster than CPU alone at 8 MPI processes, for ROTATE=1000/100.

32

TESLA V100
The Fastest and Most Productive GPU for AI and HPC

• Volta architecture: most productive GPU
• Tensor Cores: 125 programmable TFLOPS for deep learning
• Improved SIMT model: new algorithms
• Volta MPS: inference utilization
• Improved NVLink & HBM2: efficient bandwidth

33

MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS

34

GPU ACCELERATION OF LEGACY MPI APPS

Typical legacy application:
• MPI parallel
• Single or few threads per MPI rank (e.g. OpenMP)
• Running with multiple MPI ranks per node

GPU acceleration in phases:
• Proof-of-concept prototype, …
• Great speedup at the kernel level
• Application performance misses expectations

35

MULTI PROCESS SERVICE (MPS)
For Legacy MPI Applications

With Hyper-Q/MPS, available on Tesla/Quadro GPUs with compute capability 3.5+ (e.g. K20, K40, K80, M40, …)

[Figure: serial, CPU-parallel, and GPU-parallelizable portions of the runtime for N = 1, 2, 4, 8 MPI ranks, comparing a multicore CPU-only run with a GPU-accelerated run using Hyper-Q/MPS.]

36

PROCESSES SHARING GPU WITHOUT MPS
No Overlap

[Figure: Process A and Process B each own a separate GPU context; their work on the GPU does not overlap.]

37

PROCESSES SHARING GPU WITHOUT MPS
Context Switch Overhead

[Figure: time-sliced use of the GPU; each switch between the processes' contexts incurs context-switch overhead.]

38

PROCESSES SHARING GPU WITH MPS
Maximum Overlap

[Figure: Process A and Process B submit work through a single MPS process; kernels from Process A and kernels from Process B run on the GPU concurrently.]
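A minimal sketch of the usage pattern MPS targets: several MPI ranks, each offloading its own OpenACC kernel to the same GPU. Whether the kernels actually overlap depends on the MPS daemon running (commonly started with nvidia-cuda-mps-control -d; the exact setup is an assumption and varies by system). The code below is illustrative, not taken from ParDOCK.

#include <mpi.h>

// Each MPI rank runs its own OpenACC kernel. Without MPS the ranks' GPU contexts
// are time-sliced; with MPS their kernels can share the GPU concurrently.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *buf = new float[n];
    for (int i = 0; i < n; ++i) buf[i] = static_cast<float>(rank);

    #pragma acc parallel loop copy(buf[0:n])   // this rank's share of the GPU work
    for (int i = 0; i < n; ++i) {
        buf[i] = buf[i] * 2.0f + 1.0f;
    }

    delete[] buf;
    MPI_Finalize();
    return 0;
}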

39

PROCESSES SHARING GPU WITH MPS
No Context Switch Overhead

40


SANJEEVINI: PARDOCK
Pascal vs Volta

▪ Average 'time to predict' over 160 datasets, ROTATE=1000
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal.

43

SANJEEVINI: PARDOCK
OpenACC parallelization

▪ Use 'raw data pointers' to access vectors
▪ Avoid using C++ references in OpenACC pragmas
▪ Standard classes called from an OpenACC region may result in compilation/linking errors; use math.h instead of cmath (see the sketch below) ☺
▪ Unified memory has improved over time, but sometimes explicit data clauses are still needed to minimize data copies
▪ Volta works very well with programs that need MPS functionality
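For the math-header point, a minimal illustration (assuming a PGI/NVHPC-style OpenACC toolchain, as used elsewhere in this deck; the function below is illustrative):

#include <math.h>   // C header: plain functions the OpenACC compiler can map to device code
// #include <cmath> // may pull in overloads/classes that fail to compile or link for the device

void normalize(float *v, int n) {
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; ++i) {
        v[i] = v[i] / sqrtf(v[i] * v[i] + 1.0f);  // sqrtf from math.h resolves on the device
    }
}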

44

ONGOING WORK

45

SANJEEVINI: PARDOCK
Pascal vs Volta

▪ Average 'time to predict' over 160 datasets, ROTATE=1000
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal, due to hardware-accelerated MPS.

46

SANJEEVINI: PARDOCK
Multi-GPU scalability (2)

▪ '1qbt' dataset, ROTATE=1000, 8 MPI procs
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

▪ Higher concurrency is possible with more devices -> lower GPU time
▪ Lower latency with more devices/MPS servers -> lower CPU time

47

SANJEEVINI: PARDOCK
Multi-GPU scalability (3)

▪ '5cna' dataset, ROTATE=100, 8 MPI procs, Tesla P100 GPUs, MPS

48

SANJEEVINI: PARDOCK
Pascal vs Volta (2)

▪ '1a4w' dataset, ROTATE=100, 8 MPI procs, Tesla P100/V100 GPUs, MPS

49

REFERENCES: PARDOCK

• Gupta, A., et al. "ParDOCK: An all atom energy based Monte Carlo docking protocol for protein-ligand complexes." Protein and Peptide Letters 14.7 (2007): 632-646.
• Nishikawa, Joy L., et al. "Inhibiting fungal multidrug resistance by disrupting an activator–Mediator interaction." Nature 530.7591 (2016): 485.
• Singh, Tanya, D. Biswas, and Bhyravabhotla Jayaram. "AADS - An automated active site identification, docking, and scoring protocol for protein targets based on physicochemical descriptors." Journal of Chemical Information and Modeling 51.10 (2011): 2515-2527.
• Singh, Tanya, Olayiwola Adedotun Adekoya, and B. Jayaram. "Understanding the binding of inhibitors of matrix metalloproteinases by molecular docking, quantum mechanical calculations, molecular dynamics simulations, and a MMGBSA/MMBappl study." Molecular BioSystems 11.4 (2015): 1041-1051.
• Jayaram, Bhyravabhotla, et al. "Sanjeevini: a freely accessible web-server for target directed lead molecule discovery." BMC Bioinformatics. Vol. 13, No. 17. BioMed Central, 2012.

50

SANJEEVINI: PARDOCK
Steps involved
