
ACCELERATING SANJEEVINI: A DRUG DISCOVERY SOFTWARE SUITE

Abhilash Jayaraj, IIT Delhi | Bharatkumar Sharma, NVIDIA
Shashank Shekhar, IIT Delhi | Nagavijayalakshmi, NVIDIA

2

AGENDA

• Quick introduction to the computer-aided drug discovery software Sanjeevini

• Challenges
  • Code documentation is still in the process of being improved
  • Code is maintained by non-computer-science domain experts
  • Designed to suit distributed programming

• Constraints
  • Code modifications should be minimal, for ease of maintenance
  • The current cluster has a mix of CPUs and GPUs; the code should run on both (portability)

• Learnings
  • What to expect and what not to expect

3

COMPUTER AIDED DRUG DISCOVERY
Introduction

Target Discovery -> Lead Generation -> Lead Optimization -> Preclinical Development -> Phase I, II & III Clinical Trials -> FDA Review & Approval -> Drug to the Market

[Figure: typical pipeline of about 14 years and $1.4 billion per approved drug, with stage durations of 2.5, 3.0, 1.0, 6.0, and 1.5 years and cost shares of 4%, 15%, 10%, 68%, and 3%.]

4

SANJEEVINI FOR COMPUTER AIDED DRUG DESIGN
Overview

Inputs: protein-ligand complex / protein / DNA sequence; NRDBSM / million-molecule library / natural products database; self-drawn ligand molecule

• Check Lipinski compliance
• Generate rapid binding energy estimates with the RASPD protocol
• Predict all possible binding sites and store the top ten sites
• Generate canonical A/B DNA or an MD-averaged structure of B-DNA
• Optimize geometry / assign TPACM4 or derive quantum mechanical charges
• Assign force field parameters
• Dock and score
• Perform molecular dynamics simulations and post facto free energy component analyses (optional)

5

SANJEEVINI
GPU acceleration

▪ OpenACC acceleration of the ParDOCK module

▪ All-atom, energy-based Monte Carlo docking for protein-ligand complexes

6

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

7

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

8

SANJEEVINI: PARDOCK
Flat profile: hotspots

 % time   cumulative s   self s         calls   self s/call   total s/call   name
  69.78         557.90   557.90       1188000          0.00           0.00   PDB::EnergyCalculator()
  12.92         661.19   103.29             8         12.91          20.26   PDB::clashCombination()
   7.35         719.96    58.77   26051422500          0.00           0.00   getRadius1()
   5.49         763.85    43.89        885075          0.00           0.00   PDB::energyAtom()

9

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

10

SANJEEVINI: PARDOCK
CPU code: EnergyCalculator

double PDB::EnergyCalculator(float **&energyGrid, const vector<points> &vDrugGrid,
                             points coords[], const unsigned &totalDockAtoms, ... ) {
    double ene = 0.0;
    for (int atomcount = 0; atomcount < totalDockAtoms; atomcount++) {
        for (int counter = 0; counter < vDrugGrid.size(); counter++) {
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            // minDis = minimum of 'distance'; minCounter = counter corresponding to minDis
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}
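For context, a minimal self-contained sketch of the hotspot pattern above, with a hypothetical points struct and distance code filled in; the real ParDOCK member names and grid layout may differ:

#include <cmath>
#include <limits>
#include <vector>

struct points { float x, y, z; };  // hypothetical layout; the real struct may differ

// Nearest-grid-point energy lookup: for every docked atom, find the closest drug-grid
// point and accumulate the precomputed energy stored for that (grid point, atom) pair.
double energyCalculatorSketch(float **energyGrid, const std::vector<points> &vDrugGrid,
                              const points coords[], unsigned totalDockAtoms) {
    double ene = 0.0;
    for (unsigned atomcount = 0; atomcount < totalDockAtoms; ++atomcount) {
        float minDis = std::numeric_limits<float>::max();
        int minCounter = 0;
        for (int counter = 0; counter < static_cast<int>(vDrugGrid.size()); ++counter) {
            const float dx = coords[atomcount].x - vDrugGrid[counter].x;
            const float dy = coords[atomcount].y - vDrugGrid[counter].y;
            const float dz = coords[atomcount].z - vDrugGrid[counter].z;
            const float distance = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (distance < minDis) { minDis = distance; minCounter = counter; }
        }
        ene += energyGrid[minCounter][atomcount];  // energy of the atom at its nearest grid point
    }
    return ene;
}

The outer loop runs over a few thousand docked atoms at most, while the inner loop scans the whole drug grid, which is why this routine dominates the flat profile.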

11

OPENACC
Simple | Powerful | Portable

Fueling the Next Wave of Scientific Discoveries in HPC

University of Illinois, PowerGrid (MRI reconstruction): 70x speed-up, 2 days of effort
RIKEN Japan, NICAM (climate modeling): 7-8x speed-up, 5% of code modified

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

main()
{
  <serial code>

  #pragma acc kernels  // automatically runs on GPU
  {
    <parallel code>
  }
}

12

OPENACC DIRECTIVES

Manage data movement | Initiate parallel execution | Optimize loop mappings

#pragma acc data copyin(x,y) copyout(z)
{
  ...
  #pragma acc parallel
  {
    #pragma acc loop gang vector
    for (i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
      ...
    }
  }
  ...
}

Performance portable | Interoperable | Single source | Incremental
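Spelled out as a complete translation unit (an illustrative vector add, not code from the deck), the same directive stack looks like this:

// Complete, compilable version of the vector-add pattern above.
void vec_add(const float *x, const float *y, float *z, int n) {
    // Manage data movement around the whole region ...
    #pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])
    {
        // ... initiate parallel execution ...
        #pragma acc parallel
        {
            // ... and control how the loop maps onto gangs and vector lanes.
            #pragma acc loop gang vector
            for (int i = 0; i < n; ++i) {
                z[i] = x[i] + y[i];
            }
        }
    }
}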

13

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (1)

double PDB::EnergyCalculator(float **&energyGrid, const vector<points> &vDrugGrid,
                             points coords[], const unsigned &totalDockAtoms, ... ) {
    #pragma acc parallel loop reduction(+:ene) private(minDis,minCounter) present(...) copyin(...) firstprivate(...)  // data clauses filled in on later slides
    for (int atomcount = 0; atomcount < totalDockAtoms; atomcount++) {
        #pragma acc loop reduction(min:minDis)
        for (int counter = 0; counter < vDrugGrid.size(); counter++) {
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            minDis = (minDis > distance) ? distance : minDis;
        }

14

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (2)

        #pragma acc loop reduction(min:minCounter)
        for (int counter = 0; counter < vDrugGrid.size(); counter++) {
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            if (distance == minDis) {
                minCounter = (minCounter > counter) ? counter : minCounter;
            }
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}

15

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (3)

const points *vDrugGridData = vDrugGrid.data();

// compute 'distance' between coords[atomcount] and vDrugGridData[counter]

▪ Use a 'raw data pointer' to access vectors (see the sketch below)
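A minimal sketch of the raw-pointer idea (illustrative code, not the ParDOCK source): the OpenACC region works only on a plain C array obtained via data(), which the compiler can copy and index, rather than on the std::vector object itself.

#include <vector>

// Access a std::vector's storage through a raw pointer so the OpenACC region
// sees only a contiguous array with an explicit shape.
float sumSquares(const std::vector<float> &v) {
    const float *data = v.data();                 // raw pointer to contiguous storage
    const int n = static_cast<int>(v.size());
    float total = 0.0f;
    #pragma acc parallel loop reduction(+:total) copyin(data[0:n])
    for (int i = 0; i < n; ++i) {
        total += data[i] * data[i];               // device code indexes the raw array only
    }
    return total;
}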

16

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (4)

unsigned totDockAtoms = totalDockAtoms;
float **eneGrid = EnergyGrid;

#pragma acc parallel loop reduction(+:ene) ... copyin(coords[0:totDockAtoms]) present(eneGrid)

ene += eneGrid[minCounter][atomcount];

▪ Use a 'raw data pointer' to access vectors
▪ Avoid using C++ references in OpenACC pragmas

17

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyCalculator (4)

unsigned totDockAtoms = totalDockAtoms;
float **eneGrid = EnergyGrid;

#pragma acc parallel loop reduction(+:ene) ... copyin(coords[0:totDockAtoms]) present(eneGrid)

ene += eneGrid[minCounter][atomcount];

▪ Use a 'raw data pointer' to access vectors
▪ Avoid using C++ references in OpenACC pragmas

Compiler output and result when the references are used directly in the pragmas:

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:])
         Generating copyin(coords[:totalDockAtoms->])
         Generating present(EnergyGrid[:][:][:])

Runtime memory access violation

18

OPENACC: 3 LEVELS OF PARALLELISM

• Vector threads work in lockstep (SIMD/SIMT parallelism)
• Workers compute a vector
• Gangs have 1 or more workers and share resources (such as cache, the streaming multiprocessor, etc.)
• Multiple gangs work independently of each other

[Figure: two gangs, each containing workers that operate on vectors.]
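As an illustration (not ParDOCK code), the gang and vector levels can be mapped explicitly onto a loop nest; this mirrors the compiler output shown later, where the outer loop becomes gangs (blockIdx.x) and the inner loop becomes vector lanes (threadIdx.x). Loop bounds and names here are arbitrary.

// Illustrative mapping of gang/vector parallelism onto nested loops.
// Workers are left implicit, as they usually are with only two loop levels.
void scale2d(float *a, int rows, int cols, float s) {
    #pragma acc parallel loop gang copy(a[0:rows*cols])   // one gang per row
    for (int i = 0; i < rows; ++i) {
        #pragma acc loop vector                           // vector lanes across the row
        for (int j = 0; j < cols; ++j) {
            a[i * cols + j] *= s;
        }
    }
}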

19

SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyCalculator

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:],eneGrid[:][:])
         Generating copyin(coords[:totDockAtoms])
     22, Accelerator kernel generated
         Generating Tesla code
         22, Generating reduction(+:ene)
         24, #pragma acc loop gang /* blockIdx.x */
         31, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minDis)
         45, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minIdx)
     31, Loop is parallelizable
     45, Loop is parallelizable

20

MANAGE DATA HIGHER IN THE PROGRAM

Currently, data is moved at the beginning and end of each function, in case the data is needed on the CPU.

We know that the data is only needed on the CPU after convergence.

We should tell the compiler when data movement is really needed, to improve performance; a minimal before/after sketch follows.
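A minimal sketch of this idea, assuming a generic iterate-until-convergence driver (the names and update kernel are illustrative, not ParDOCK's):

// Before: each call moves the array to the GPU and back, even though the host
// only needs the result after the convergence loop finishes.
//
//   while (!converged) { step(a, n); }   // step() contains its own copy(a[0:n])
//
// After: one data region around the whole loop; kernels inside reuse the device
// copy, and the host copy is refreshed once when the region ends.
void solve(float *a, int n) {
    bool converged = false;
    #pragma acc data copy(a[0:n])           // data stays resident for all iterations
    {
        while (!converged) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i) {
                a[i] = 0.5f * a[i];          // stand-in for the real update kernel
            }
            converged = true;                // stand-in convergence test
        }
    }                                        // single copy back to the host here
}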

21

STRUCTURED DATA REGIONS

The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region.

#pragma acc data
{
  #pragma acc parallel loop
  ...
  #pragma acc parallel loop
  ...
}

Arrays used within the data region will remain on the GPU until the end of the data region.

22

UNSTRUCTURED DATA DIRECTIVES

Used to define data regions when scoping doesn't allow the use of normal data regions (e.g. the constructor/destructor of a class; see the sketch below).

enter data: defines the start of an unstructured data lifetime
  • clauses: copyin(list), create(list)
exit data: defines the end of an unstructured data lifetime
  • clauses: copyout(list), delete(list), finalize

#pragma acc enter data copyin(a)
...
#pragma acc exit data delete(a)
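A minimal sketch of the constructor/destructor case mentioned above (an illustrative class, not from Sanjeevini): the device lifetime of the member array is tied to the object's lifetime.

#include <vector>

class DeviceBuffer {
public:
    explicit DeviceBuffer(int n) : data_(n, 0.0f) {
        float *p = data_.data();
        #pragma acc enter data copyin(p[0:n])    // device copy created with the object
    }
    ~DeviceBuffer() {
        float *p = data_.data();
        int n = static_cast<int>(data_.size());
        #pragma acc exit data delete(p[0:n])     // device copy released with the object
    }
private:
    std::vector<float> data_;
};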

23

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyAtom (3)

▪ Use a 'raw data pointer' to access vectors
▪ How do you access a 'vector of vectors' (a jagged array)?

Creation and copy of jagged arrays:

int **vProteinListData = new int *[vProteinList.size()];
n = vProteinList.size();
#pragma acc enter data create(vProteinListData[0:n][0:1])
for (int count = 0; count < n; count++) {
    int numPro = vProteinList[count].size();
    vProteinListData[count] = vProteinList[count].data();
    #pragma acc enter data copyin(vProteinListData[count:1][0:numPro])
}

24

SANJEEVINI: PARDOCK
OpenACC parallelization: EnergyAtom (4)

▪ Use a 'raw data pointer' to access vectors
▪ How do you access a 'vector of vectors' (a jagged array)?

Deletion of jagged arrays:

for (int count = 0; count < n; count++) {
    int numPro = vProteinList[count].size();
    #pragma acc exit data delete(vProteinListData[count:1][0:numPro])
    vProteinListData[count] = NULL;
}
#pragma acc exit data delete(vProteinListData[0:n][0:1])

25

SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyAtom (1)

PDB::energyAtom(const std::vector<PDB, std::allocator<PDB>> &, PDB, points, const std::vector<Box, std::allocator<Box>> &, const std::vector<int, std::allocator<int>> &, const std::vector<std::vector<int, std::allocator<int>>, std::allocator<std::vector<int, std::allocator<int>>>> &, int **):
     79, Generating enter data copyin(boxListData[:boxListNumElements],rec,coord)
     85, Generating present(coord,boxListData[:],rec,vProteinListData[:][:],vProData[:])
         Accelerator kernel generated
         Generating Tesla code
         85, Generating reduction(+:electro,vandw,ehyd)
         87, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    129, Generating exit data delete(boxListData[:boxListNumElements],rec,coord)

26

SANJEEVINI: PARDOCK
OpenACC compiler output: EnergyAtom (2)

main:
    266, Generating enter data copyin(vProData[:vProNumElements])
         Generating enter data create(vProteinListData[:vProteinListNumElements][:1])
    275, Generating enter data copyin(vProteinListData[proList][:numElements])
    321, Generating exit data delete(vProteinListData[proList][:numElements])
    322, Generating exit data delete(vProteinListData[:vProteinListNumElements][:1],vProData[:vProNumElements])

27

CUDA UNIFIED MEMORY
Simplified Developer Effort

Without Unified Memory: separate system memory and GPU memory.
With Unified Memory: a single unified memory, sometimes referred to as "managed memory."

New "Pascal" GPUs handle Unified Memory in hardware.
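In OpenACC terms, unified (managed) memory often means the explicit data clauses above can be dropped and pages migrate on demand. A hedged sketch follows; enabling it with a PGI/NVHPC-era flag such as -ta=tesla:managed is an assumption about the exact toolchain, and heap allocations (including std::vector storage) are then allocated as managed memory by the compiler runtime.

#include <vector>

// With managed memory, dynamically allocated data can be touched from the GPU
// without copyin/copyout clauses; the runtime migrates it between system and
// GPU memory as needed.
float sumManaged(std::vector<float> &v) {
    float *data = v.data();
    int n = static_cast<int>(v.size());
    float total = 0.0f;
    #pragma acc parallel loop reduction(+:total)   // note: no data clauses
    for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}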

28

PERFORMANCE OPTIMIZATION
Strategy

Analyze -> Parallelize -> Optimize

30

SANJEEVINI: PARDOCK
Performance: CPU and GPU (1)

▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.8x/3.3x faster than CPU alone at 8 MPI processes, for ROTATE=1000/100.

16 MPI processes on a single GPU -> the GPU is the bottleneck!

31

SANJEEVINI: PARDOCK
Performance: CPU and GPU (2)

▪ Average 'time to predict' over 160 datasets
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.3x/3.2x faster than CPU alone at 8 MPI processes, for ROTATE=1000/100.

32

TESLA V100
The Fastest and Most Productive GPU for AI and HPC

• Volta architecture: most productive GPU
• Tensor Cores: 125 programmable TFLOPS for deep learning
• Improved SIMT model: new algorithms
• Volta MPS: inference utilization
• Improved NVLink & HBM2: efficient bandwidth

33

MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS

34

GPU ACCELERATION OF LEGACY MPI APPS

Typical legacy application:
• MPI parallel
• Single or few threads per MPI rank (e.g. OpenMP)
• Running with multiple MPI ranks per node

GPU acceleration in phases:
• Proof-of-concept prototype, …
• Great speedup at the kernel level
• Application performance misses expectations

35

MULTI PROCESS SERVICE (MPS)
For Legacy MPI Applications

With Hyper-Q/MPS, available on Tesla/Quadro GPUs with compute capability 3.5+ (e.g. K20, K40, K80, M40, …)

[Figure: serial, CPU-parallel, and GPU-parallelizable portions of the runtime for N = 1, 2, 4, 8 MPI ranks, comparing a multicore CPU-only run with a GPU-accelerated run using Hyper-Q/MPS.]

36

PROCESSES SHARING GPU WITHOUT MPS
No Overlap

[Figure: Process A and Process B each own a separate GPU context; their work on the GPU does not overlap.]

37

PROCESSES SHARING GPU WITHOUT MPS
Context Switch Overhead

[Figure: time-sliced use of the GPU; each switch between the processes' contexts incurs context-switch overhead.]

38

PROCESSES SHARING GPU WITH MPS
Maximum Overlap

[Figure: Process A and Process B submit work through a single MPS process; kernels from Process A and kernels from Process B run on the GPU concurrently.]
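A minimal sketch of the usage pattern MPS targets: several MPI ranks, each offloading its own OpenACC kernel to the same GPU. Whether the kernels actually overlap depends on the MPS daemon running (commonly started with nvidia-cuda-mps-control -d; the exact setup is an assumption and varies by system). The code below is illustrative, not taken from ParDOCK.

#include <mpi.h>

// Each MPI rank runs its own OpenACC kernel. Without MPS the ranks' GPU contexts
// are time-sliced; with MPS their kernels can share the GPU concurrently.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *buf = new float[n];
    for (int i = 0; i < n; ++i) buf[i] = static_cast<float>(rank);

    #pragma acc parallel loop copy(buf[0:n])   // this rank's share of the GPU work
    for (int i = 0; i < n; ++i) {
        buf[i] = buf[i] * 2.0f + 1.0f;
    }

    delete[] buf;
    MPI_Finalize();
    return 0;
}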

39

PROCESSES SHARING GPU WITH MPS
No Context Switch Overhead

40


SANJEEVINI: PARDOCK
Pascal vs Volta

▪ Average 'time to predict' over 160 datasets, ROTATE=1000
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal.

43

SANJEEVINI: PARDOCK
OpenACC parallelization

▪ Use 'raw data pointers' to access vectors
▪ Avoid using C++ references in OpenACC pragmas
▪ Standard classes called from an OpenACC region may result in compilation/linking errors; use math.h instead of cmath (see the sketch below) ☺
▪ Unified memory has improved over time, but sometimes explicit data clauses are still needed to minimize data copies
▪ Volta works very well with programs that need MPS functionality
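For the math-header point, a minimal illustration (assuming a PGI/NVHPC-style OpenACC toolchain, as used elsewhere in this deck; the function below is illustrative):

#include <math.h>   // C header: plain functions the OpenACC compiler can map to device code
// #include <cmath> // may pull in overloads/classes that fail to compile or link for the device

void normalize(float *v, int n) {
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; ++i) {
        v[i] = v[i] / sqrtf(v[i] * v[i] + 1.0f);  // sqrtf from math.h resolves on the device
    }
}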

44

ONGOING WORK

45

SANJEEVINI: PARDOCK
Pascal vs Volta

▪ Average 'time to predict' over 160 datasets, ROTATE=1000
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal, due to hardware-accelerated MPS.

46

SANJEEVINI: PARDOCK
Multi-GPU scalability (2)

▪ '1qbt' dataset, ROTATE=1000, 8 MPI procs
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

▪ Higher concurrency is possible with more devices -> lower GPU time
▪ Lower latency with more devices/MPS servers -> lower CPU time

47

SANJEEVINI: PARDOCK
Multi-GPU scalability (3)

▪ '5cna' dataset, ROTATE=100, 8 MPI procs, Tesla P100 GPUs, MPS

48

SANJEEVINI: PARDOCK
Pascal vs Volta (2)

▪ '1a4w' dataset, ROTATE=100, 8 MPI procs, Tesla P100/V100 GPUs, MPS

49

REFERENCES: PARDOCK

• Gupta, A., et al. "ParDOCK: An all atom energy based Monte Carlo docking protocol for protein-ligand complexes." Protein and Peptide Letters 14.7 (2007): 632-646.
• Nishikawa, Joy L., et al. "Inhibiting fungal multidrug resistance by disrupting an activator–Mediator interaction." Nature 530.7591 (2016): 485.
• Singh, Tanya, D. Biswas, and Bhyravabhotla Jayaram. "AADS - An automated active site identification, docking, and scoring protocol for protein targets based on physicochemical descriptors." Journal of Chemical Information and Modeling 51.10 (2011): 2515-2527.
• Singh, Tanya, Olayiwola Adedotun Adekoya, and B. Jayaram. "Understanding the binding of inhibitors of matrix metalloproteinases by molecular docking, quantum mechanical calculations, molecular dynamics simulations, and a MMGBSA/MMBappl study." Molecular BioSystems 11.4 (2015): 1041-1051.
• Jayaram, Bhyravabhotla, et al. "Sanjeevini: a freely accessible web-server for target directed lead molecule discovery." BMC Bioinformatics. Vol. 13, No. 17. BioMed Central, 2012.

50

SANJEEVINI: PARDOCK
Steps involved
