Protein Docking and Molecular Shape Recognition Using Polar Fourier Correlations

Dave Ritchie, LORIA, Nancy

What is Protein Docking?

Protein docking = shape recognition in 3D space

However, ... proteins are flexible (more complexity)!

Contents

• Motivation – Importance of Protein-Protein Interactions (PPIs)

• Polar Fourier Protein Shape Representation

• Application to Protein Docking – Hex

• The CAPRI Blind Docking Experiment

• New Developments – Multi-Dimensional FFTs, Using GPUs

• Application to Molecular Shape Recognition – ParaSurf & ParaFit

• Conclusions & Future Prospects

PPI Networks are Fundamental to Biological Mechanisms

• If genomes provide the “blue-print” for life ...

• ... then proteins provide the “machinery”

• Understanding PPIs could lead to immense scientific advances and therapeutic benefits

Yeast network figure from: J Hallinan & G Smith ICCS 2002 article 584

Recent Growth of Protein-Protein Interaction (PPI) Literature

Citations of key yeast functional genomics papers

(per year):

• Red: Ito et al., Uetz et al. (Y2H)

• Blue: Ho et al., Gavin et al. (TAP-MS)

• Black: All protein-protein interaction papers

Figure from: Bork et al., Curr Op. Struct. Biol. (2004) 14 292–299

Docking - Predicting PPIs at the 3D Molecular Level

Ab Initio

• Soft Docking – FFT, Polar Fourier Correlations (∼ hours)

• MC/MD – Flexible side chains + backbones (∼ days)

Re-Scoring

• Knowledge-based potentials

Data-Driven

• Biochemical: mutagenesis hot-spot residues

• Biophysical: NMR CSP/RDC, H/D exchange, 13C labeling, ...

• ET + Correlated mutations

• Structural Databases (docking by homology)

The Basic Goal of Protein-Protein Docking

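Find minimum potential energy of the system as rapidly as possible:

$E = \int \phi(r)\,\rho(r)\,dV$

For two proteins:

$\phi(r) = \phi_A(r) + \phi_B(r)$

$\rho(r) = \rho_A(r) + \rho_B(r)$

and so, dropping the self-terms $\phi_A\rho_A$ and $\phi_B\rho_B$ (constant for rigid proteins, whatever their relative pose):

$E = \int \left(\phi_A(r)\rho_B(r) + \phi_B(r)\rho_A(r)\right) dV$

• With brute-force search, typically need ∼ $10^9$ such integrals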

• Current algorithms often sum several such potential/density terms...

• ... and often use 3D Cartesian FFTs to accelerate the calculation

Real Spherical Harmonic Basis Functions

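Orthogonality: $\int y_{lm}(\theta,\phi)\, y_{l'm'}(\theta,\phi)\, d\Omega = \delta_{ll'}\delta_{mm'}$

Rotation: $y_{lm}(\theta',\phi') = \sum_{m'=-l}^{l} R^{(l)}_{m'm}(\alpha,\beta,\gamma)\, y_{lm'}(\theta,\phi)$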

Spherical Harmonic Surfaces

Example: 2D Radial Expansions (256 Basis Functions)

$r(\theta,\phi) = \sum_{l=0}^{15} \sum_{m=-l}^{l} a_{lm}\, y_{lm}(\theta,\phi)$

• Good for matching similar shapes, not so good for docking...

Radial Basis Functions: Rnl(r)

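HO-type (shape): $R_{nl}(r) = N^{(q)}_{nl}\, e^{-\rho/2} \rho^{l/2} L^{(l+1/2)}_{n-l-1}(\rho)$; $\rho = r^2/q$, $q = 20$.

Coulomb (electro): $R_{nl}(r) = N^{(\Lambda)}_{nl}\, e^{-\rho/2} \rho^{l} L^{(2l+2)}_{n-l-1}(\rho)$; $\rho = 2\Lambda r$, $\Lambda = 1/2$.

Orthogonality: $\int_0^\infty R_{nl}(r)\, R_{n'l}(r)\, r^2 dr = \delta_{nn'}$

[Figure: plots of the radial functions $R_{15,0}(r)$, $R_{20,0}(r)$, $R_{25,0}(r)$ and $R_{30,0}(r)$]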

3D Protein Shape Density Representations (Ritchie & Kemp (2000) Proteins 39 178–194)

• Sample surface skins onto a (0.75 Å)³ grid...

[Figure labels: Molecular Surface, Solvent Accessible Surface, Surface Skin, Protein Interior, Sampling Spheres, Surface Normals]

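Surface Skin: $\sigma(r) = 1$ if $r \in$ surface skin; $0$ otherwise

Interior: $\tau(r) = 1$ if $r \in$ protein atom; $0$ otherwise

Parametrise as: $\sigma(r) = \sum_{nlm}^{N} a^{\sigma}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$, etc.

Estimate as: $a^{\sigma}_{nlm} \simeq \sum_{c} R_{nl}(r_c)\, y_{lm}(\theta_c,\phi_c)\, \Delta V$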

• Only need to do this once for each protein...

Polar Fourier Shape Density Reconstruction - Antibody CDRs

Image   Order       Coefficients
A       Gaussians   -
B       N = 16      1,496
C       N = 25      5,525
D       N = 30      9,455

DW Ritchie (2003) Proteins Struct. Funct. Bioinf. 52 98–106
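The coefficient counts in the table can be reproduced by counting index triples. The sketch below assumes the index ranges n = 1..N, l = 0..n-1, m = -l..l for the 3D expansion, and l = 0..L, m = -l..l for a 2D surface expansion; these ranges are an assumption, chosen because they match the figures quoted in the slides.

```cpp
// Number of (n, l, m) triples in a 3D polar Fourier expansion of order N,
// assuming n = 1..N, l = 0..n-1, m = -l..l.
long polar_fourier_coefficients(long N)
{
    long count = 0;
    for (long n = 1; n <= N; ++n) {
        for (long l = 0; l < n; ++l) {
            count += 2 * l + 1;          // number of m values for this l
        }
    }
    return count;                        // closed form: N(N+1)(2N+1)/6
}

// Number of (l, m) pairs in a 2D spherical harmonic expansion to order L.
long surface_coefficients(long L)
{
    return (L + 1) * (L + 1);
}
```

With these ranges, `polar_fourier_coefficients(16)`, `(25)` and `(30)` give 1,496, 5,525 and 9,455, and `surface_coefficients(15)` gives the 256 basis functions of the 2D radial expansion.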

3D Shape Density Reconstruction – CAPRI T21: Orc1/Sir1

DW Ritchie (2008) Curr. Prot. Pep. Sci. 9(1) 1-15

Docking Using 3D Polar Fourier Density Functions - “Hex”

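[Figure: surface skin density σ(r) and interior density τ(r)]

Densities: $\sigma(r) = \sum_{nlm}^{N} a^{\sigma}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$; $\tau(r) = \sum_{nlm}^{N} a^{\tau}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$

Favourable: $\int \left(\sigma_A(r_A)\tau_B(r_B) + \tau_A(r_A)\sigma_B(r_B)\right) dV$

Unfavourable: $\int \tau_A(r_A)\tau_B(r_B)\, dV$

Score: $S_{AB} = \int \left(\sigma_A\tau_B + \tau_A\sigma_B - Q\,\tau_A\tau_B\right) dV$; Penalty Factor: $Q = 11$

DW Ritchie & GJL Kemp (2000) Proteins Struct. Funct. Bioinf. 39 178–194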

Correlations - Overlap as a Function of Coordinate Operations

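Rotation: $R(\alpha,\beta,\gamma)\,\sigma_A(r) = \sum_{nlm}^{N} a^{\sigma'}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$

Rotated Coefficients: $a^{\sigma'}_{nlm} = \sum_{m'=-l}^{l} R^{(l)}_{mm'}(\alpha,\beta,\gamma)\, a^{\sigma}_{nlm'}$

Translation: $T_z(R)\,\sigma_A(r) = \sum_{nlm}^{N} a^{\sigma''}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$

Translated Coefficients: $a^{\sigma''}_{nlm} = \sum_{n'l'}^{N} T^{(|m|)}_{nl,n'l'}(R)\, a^{\sigma}_{n'l'm}$

Hence: $\int \sigma'_A(r)\, \tau''_B(r)\, dV = \sum_{nlm}^{N} a^{\sigma'}_{nlm}\, b^{\tau''}_{nlm}$, etc.

Search Space: ∼ $10^9$ orientations (∼ $10^6$ orientations/sec)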

DW Ritchie (2005) J. Appl. Cryst. 38 808–818

Translation Matrices From Fourier-Bessel Transform Theory

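Using spherical Bessel transforms:

$\tilde{R}_{nl}(\beta) = \sqrt{\frac{2}{\pi}} \int_0^\infty R_{nl}(r)\, j_l(\beta r)\, r^2 dr$; $R_{nl}(r) = \sqrt{\frac{2}{\pi}} \int_0^\infty \tilde{R}_{nl}(\beta)\, j_l(\beta r)\, \beta^2 d\beta$

it can be shown that

$T^{(|m|)}_{n'l',nl}(R) = \sum_{k=|l-l'|}^{l+l'} A^{(ll'|m|)}_{k} \int_0^\infty \tilde{R}_{nl}(\beta)\, \tilde{R}_{n'l'}(\beta)\, j_k(\beta R)\, \beta^2 d\beta$

where

$A^{(ll'|m|)}_{k} = (-1)^{\frac{k+l'-l}{2}+m}\, (2k+1)\, [(2l+1)(2l'+1)]^{1/2} \begin{pmatrix} l & l' & k \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} l & l' & k \\ m & -m & 0 \end{pmatrix}$

• Can derive analytic formulae for both GTO and ETO radial functions

• Requires high precision math library (GMP)...

• Calculate once for R = 1, 2, 3, ... 50 Å and store on disk (∼ 200 MB)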

6D Docking Search as a Nested Sequence of Transformations

Get 4 rotations from icosahedral tessellations ...

[Figure labels: A, B, z, R, (β1, γ1), (β2, γ2), α2]

Rotate A (×812 @ 7.5°): $A'(r) = R(0, \beta_1, \gamma_1)\, A(r)$

Translate A (×50 @ 0.75 Å): $A''(r) = T_z(-R)\, A'(r)$

Rotate B (×812 @ 7.5°): $B'(r) = R(0, \beta_2, \gamma_2)\, B(r)$

Twist B (×64 @ 5.6°): $B''(r) = R(\alpha_2, 0, 0)\, B'(r)$

1D FFT: $S_{AB}(\alpha_2) = \sum_{m=1-N}^{N-1} P_m \cos m\alpha_2 + Q_m \sin m\alpha_2$

Search Space: 812 × 50 × 812 × 64 ≃ 2 × $10^9$ (∼ $10^6$/s on a 1GHz PIII Xeon)
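To make the innermost step concrete, the twist-angle series can be evaluated by direct summation at equally spaced angles. This sketch is not the FFT used in practice (an FFT produces the same samples faster). The storage convention, with $P_m$ and $Q_m$ held at index m + N - 1, and both function names are assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Size of the nested search: rotations of A x translations x rotations of B x twists.
long long search_space_size(long long rotations_a, long long translations,
                            long long rotations_b, long long twists)
{
    return rotations_a * translations * rotations_b * twists;
}

// Direct evaluation of S(alpha) = sum_{m=1-N}^{N-1} P_m cos(m alpha) + Q_m sin(m alpha)
// at `steps` equally spaced twist angles in [0, 2*pi).
// Assumption: P and Q each hold 2N-1 entries, with P_m stored at index m + N - 1.
std::vector<double> twist_scores(const std::vector<double>& P,
                                 const std::vector<double>& Q,
                                 int N, int steps)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<double> S(static_cast<std::size_t>(steps), 0.0);
    for (int k = 0; k < steps; ++k) {
        const double alpha = two_pi * k / steps;
        for (int m = 1 - N; m <= N - 1; ++m) {
            const std::size_t idx = static_cast<std::size_t>(m + N - 1);
            S[static_cast<std::size_t>(k)] +=
                P[idx] * std::cos(m * alpha) + Q[idx] * std::sin(m * alpha);
        }
    }
    return S;
}
```

For the step counts on this slide, `search_space_size(812, 50, 812, 64)` is 2,109,900,800, consistent with the quoted figure of about 2 × 10^9.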

Shape Correlation Score as a Function of Twist Angle α2 (Antibody HyHel-5/Lysozyme Complex)

[Figure: three panels showing the score S against twist angle α2, for N=16, N=20 and N=25]

Re-Docking Known Protein Complexes

        N = 16           N = 20           N = 25
Case    Top      RMS     Top      RMS     Top      RMS
SIC     3,407    0.00    2        0.22    1        0.82
KAI     17       0.41    3        0.69    7        0.81
PTC     132      0.52    2        0.48    1        0.48
CGI     1        0.38    1        0.38    1        0.38
CHO     1        0.45    1        0.55    1        0.55
BGS     1        0.82    1        0.82    1        0.88
GGI     1        2.47    1        0.90    1        0.90
TET     5        1.48    1        1.16    1        1.03
FPT     102      1.04    1        0.42    1        0.42
IGF     3        0.71    1        0.77    1        0.77
JEL     4,867    0.81    1,060    0.81    2        0.81
BQL     524      1.85    12       0.96    1        0.39
HFL     318      1.01    5        1.00    1        1.00
HFM     7        2.19    27       1.09    10       1.09
VFB     8,344    1.49    216      0.20    9        0.20
MLC     1,401    0.00    116      0.00    187      0.84
MEL     9,898    1.03    27       1.03    3        1.03
JHL     385      0.62    8        0.38    1        1.08
FBI     14       1.09    1        1.09    1        0.38
NCA     68       1.53    1        0.32    1        0.32
NMB     160      2.43    1,630    1.39    1,009    1.39
NSN     19,992   1.11    716      0.75    1,130    2.29
IAI     1,381    1.48    111      0.37    20       1.39
DVF     11,145   0.00    88       1.38    49       0.44
KB5     140      0.34    1        0.34    78       1.38
IGC     1,328    1.74    269      0.81    1        0.34

Show Docking Movie!

CAPRI – Critical Assessment of Predicted Interactions

• Started in 2001/2 following CASP with 19 groups & 7 targets...

• At least one protein presented in its unbound form

• Any predictive approach allowed: homology/literature, etc.

Target  Receptor        Ligand        Type  Complex                Lab
1       HPr Kinase      HPr           U/U   Fieulaine et al.       Janin
2       Rotavirus VP6   MCV           U/B   Vaney et al.           Rey
3       Hemagglutinin   HC63          U/B   Barbey-Martin et al.   Knossow
4       α-Amylase       AMD10         U/B   Desmyter et al.        Cambillau
5       α-Amylase       AMB7          U/B   Desmyter et al.        Cambillau
6       α-Amylase       AMD9          U/B   Desmyter et al.        Cambillau
7       SpeA            TCR 14.3.D    U/U   Sundberg et al.        Mariuzza

• Now > 40 groups; Currently on Targets 28 ...

• 3 Sections - Predictors, Servers, Scorers

J Janin et al. (2003) Proteins Struct. Funct. Bioinf. 52 2–9

CAPRI Target 1 - Lactobacillus HPr / HprK

CAPRI Results: Targets 1–7

Predictor Software Algorithm T1 T2 T3 T4 T5 T6 T7

Abagyan ICM FF ** *** **

Camacho CHARMM FF * *** ***

Eisenstein MolFit FFT * * ***

Sternberg FTDOCK FFT * ** *

Ten Eyck DOT FFT * * **

Gray MC ** ***

Ritchie Hex SPF ** ***

Weng ZDOCK FFT ** **

Wolfson BUDDA/PPD GH * ***

Bates Guided Docking FF - - - ***

Palma BIGGER GF - - ** *

Gardiner GAPDOCK GA * * - - - - -

Olson Surfdock SH * - - - -

Valencia ANN * - - - - - -

Vakser GRAMM FFT * - - - -

* low, ** medium, *** high accuracy prediction; - no prediction

R Mendez et al. (2003) Proteins Struct. Funct. Bioinf. 52 51–67

Docked Orientation (Hex) for Target 3 - Hemagglutinin/HC63

• CAPRI “medium accuracy” (1 Å ≤ Ligand RMSD ≤ 5 Å)

Docked Orientation (Hex) for Target 6 - Amylase/AMD9

• CAPRI “high accuracy” (Ligand RMSD ≤ 1 Å)

Subsequent CAPRI Targets (Rounds 3 – 5)

Target  Description                  Comments
T8      Nidogen-γ3 - Laminin         U/U
T9      LiCT homodimer               build from monomer – 12 Å RMS deviation
T10     TBEV trimer                  build from monomer – 11 Å RMS deviation
T11     Cohesin - dockerin           U/U; model-build dockerin
T12     Cohesin - dockerin           U/B
T13     SAG1 - antibody Fab          SAG1 conformational change: 10 Å RMS
T14     MYPT1 - PP1δ                 U/U; model-build PP1α → PP1δ
T18     TAXI - xylanase              U/B
T19     Ovine prion - antibody Fab   model-build prion

• T15-T17 cancelled: structures released prematurely - Google!!!

• T11, T14, T19 involved homology model-building step...

CAPRI Results: Targets 8–19

Predictor Software T8 T9 T10 T11 T12 T13 T14 T18 T19

Abagyan ICM ** * ** *** * *** ** **

Wolfson PatchDock ** * * * * - ** ** *

Weng ZDOCK/RDOCK ** * *** *** *** ** **

Bates FTDOCK * * ** * ** ** *

Baker RosettaDock - ** *** ** *** ***

Camacho SmoothDock ** *** *** ** ** *

Gray RosettaDock *** - - ** *** **

Bonvin Haddock - - ** ** *** ***

Comeau ClusPro ** *** * *

Sternberg 3D-DOCK ** * * ** *

Eisenstein MolFit *** * *** **

Ritchie Hex ** *** * *

Zhou - - - *** ** * *

Ten Eyck DOT *** *** **

Zacharias ATTRACT ** - - - - *** **

Valencia * * * - -

Vakser GRAMM - - - - - ** **

Umeyama ** *

Kaznessis - - ***

Fano Grid-Hex - - *

R Mendez et al. (2005) Proteins Struct. Funct. Bioinf. 60 150–169

Docked Orientation (Hex) for Target 12 - Cohesin/Dockerin

• Here, we assumed “molecular mimicry”

• First superposed dockerin onto cohesin dimer, then docked...

• CAPRI “high accuracy” (Interface RMSD ≤ 1 Å)

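5D FFT Correlations from Complex Overlap Expressions (Ritchie, Kozakov, Vajda (2008) Bioinformatics 24 1865–1873)

Complex SHs, $Y_{lm}$: $y_{lm}(\theta,\phi) = \sum_{t} U^{(l)}_{mt}\, Y_{lt}(\theta,\phi)$

Complex coefficients: $A_{nlm} = \sum_{t} a_{nlt}\, U^{(l)}_{tm}$

Complex overlap: $S = \sum_{kjsmnlv} D^{(j)*}_{ms}(0,\beta_A,\gamma_A)\, A^{*}_{kjs}\, T^{(|m|)}_{kj,nl}(R)\, D^{(l)}_{mv}(\alpha_B,\beta_B,\gamma_B)\, B_{nlv}$

Collect coefficients: $S^{(|m|)}_{js,lv}(R) = \sum_{kn} A^{*}_{kjs}\, T^{(|m|)}_{kj,nl}(R)\, B_{nlv}$, with $k > j$; $n > l$

To give: $S = \sum_{jsmlv} D^{(j)*}_{ms}(0,\beta_A,\gamma_A)\, S^{(|m|)}_{js,lv}(R)\, D^{(l)}_{mv}(\alpha_B,\beta_B,\gamma_B)$

Expand as exponentials: $D^{(l)}_{mv}(\alpha,\beta,\gamma) = \sum_{t} \Gamma^{tm}_{lv}\, e^{-im\alpha} e^{-it\beta} e^{-iv\gamma}$

Hence: $S = \sum_{jsmlvrt} \Gamma^{rm}_{js}\, S^{(|m|)}_{js,lv}(R)\, \Gamma^{tm}_{lv}\, e^{-i(r\beta_A - s\gamma_A + m\alpha_B + t\beta_B + v\gamma_B)}$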

Comparing FFT Correlation Speeds

N=25 Correlations, 2.6 × $10^9$ Orientations

(Single CPU, 1.8GHz Xeon, 1Gb RAM)

FFT   Set-up (mins)   FFT Rate (10^6/sec)   Total Rate (10^6/sec)   Total Time (mins)
1D    8.0             1.0                   0.8                     43
3D    13.5            17.0                  1.8                     15
5D    9.8             4.5                   2.2                     21

The difference in 5D/3D FFT rates seems to be due to CPU-cache/main-memory thrashing

• For two-property correlations, 5D FFT is ∼ 2x faster and 3D is ∼ 3x faster than 1D

• BUT for multi-property correlations, 5D gives almost NO extra cost per property

Porting Hex to a GPU using CUDA

• Modern GPUs have very high compute performance

• SIMT architecture = single instruction, multiple threads

• NVIDIA GPUs:

• Up to 4Gb memory

• Up to 240 arithmetic “cores”

• Up to Tflop performance

• Easy API with C++ syntax

• Grid of threads SIMT model

• BUT – for best results, need to understand the hardware...

CUDA Device Architecture

• Typically 8–16 multiprocessor blocks, each with 16 thread units

[Figure: each multiprocessor block contains thread processors with thread-local memory and shared memory (16Kb, fast); all blocks access global memory (256Mb − 4Gb, slow), which is connected to the host via PCIe]

• NB. global memory is ∼ 80x slower than shared memory

• Strategy: aim for “high arithmetic intensity” in shared memory

CUDA Example - Matrix Multiplication

• Matrix multiplication C = A * B

• Each thread is responsible for calculating one element: C[i,k]

• Threads cooperate by reading & sharing sub-blocks of A & B

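[Figure: C = A * B, showing element C[i,k], thread-block indices (bx, by) and thread indices (tx, ty)]

• Conventional algorithm

• C[i,k] = Σ_j A[i,j] * B[j,k] (row i of A times column k of B)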

• GPU thread-blocks

• Multiprocessor launches multiple blocks to compute all of C

• Running thread-blocks concurrently hides memory latency

CUDA Programming - Matrix Multiplication Kernel

// Assumes matrix dimensions are multiples of 16 and a 16x16 thread block.
__global__ void matmul(int wA, int wB, float *A, float *B, float *C)
{
  float Cik = 0.0f;                               // thread-local result variable
  int bx = blockIdx.x, tx = threadIdx.x;          // thread subscripts
  int by = blockIdx.y, ty = threadIdx.y;          // ("this" thread is one of a 2-D grid)
  __shared__ float a_sub[16][16], b_sub[16][16];  // declare shared memory

  for (int j=0; j<wA; j+=16) {                    // thread-local loop
    int ij = (16*by+ty)*wA + (j+tx);              // thread-local array subscripts
    int jk = (j+ty)*wB + (16*bx+tx);
    a_sub[ty][tx] = A[ij];                        // copy global data -> shared memory ("I/O")
    b_sub[ty][tx] = B[jk];
    __syncthreads();                              // wait until all memory I/O finished
    for (int jj=0; jj<16; jj++) Cik += a_sub[ty][jj] * b_sub[jj][tx];
    __syncthreads();                              // wait until all threads finished
  }
  int ik = (16*by+ty)*wB + (16*bx+tx);            // array subscript of result element
  C[ik] = Cik;                                    // copy local result -> global memory
}
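When porting a kernel like this, it helps to keep a plain host-side reference to compare results against. The following is a minimal sketch of such a reference, assuming row-major storage with the same meaning of `wA` and `wB` as in the kernel; the name `matmul_reference` and the extra `hA` argument (number of rows of A) are introduced only for this example.

```cpp
#include <cstddef>
#include <vector>

// CPU reference: C (hA x wB) = A (hA x wA) * B (wA x wB), all row-major.
std::vector<float> matmul_reference(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int hA, int wA, int wB)
{
    std::vector<float> C(static_cast<std::size_t>(hA) * wB, 0.0f);
    for (int i = 0; i < hA; ++i) {
        for (int k = 0; k < wB; ++k) {
            float Cik = 0.0f;                       // same role as the kernel's local Cik
            for (int j = 0; j < wA; ++j) {
                Cik += A[i * wA + j] * B[j * wB + k];
            }
            C[i * wB + k] = Cik;
        }
    }
    return C;
}
```

Comparing GPU output against this reference should use a small tolerance, since the tiled kernel accumulates the products in a different order.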

CUDA Porting Strategy

• Only port compute-intensive steps e.g. matrix multiply ...

• Consider using provided CUDA libraries: cuFFT, cuBLAS...

• Perform recursion, random access calculations on CPU first...

• Re-write complex/clever data structures as vectors, arrays...

• ... and round-up array dimensions to multiples of 16

• Re-write loops on 1D vectors as 2D array operations, etc.

• Access array elements in natural order for best memory “I/O”
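The advice about rounding array dimensions up to multiples of 16 can be sketched as a small host-side padding helper. The names `round_up` and `pad_matrix` are hypothetical, and zero-filling the padding is an assumption that suits matrix products, since the extra zeros do not change the result.

```cpp
#include <cstddef>
#include <vector>

// Smallest multiple of `tile` that is >= n (for n >= 0 and tile > 0).
int round_up(int n, int tile = 16)
{
    return ((n + tile - 1) / tile) * tile;
}

// Copy a rows x cols row-major matrix into a zero-padded row-major matrix
// whose dimensions are both multiples of `tile`.
std::vector<float> pad_matrix(const std::vector<float>& M,
                              int rows, int cols, int tile = 16)
{
    const int prows = round_up(rows, tile);
    const int pcols = round_up(cols, tile);
    std::vector<float> P(static_cast<std::size_t>(prows) * pcols, 0.0f);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            P[static_cast<std::size_t>(i) * pcols + j] =
                M[static_cast<std::size_t>(i) * cols + j];
        }
    }
    return P;
}
```

With padded inputs, a 16 × 16 tiled kernel needs no boundary checks inside its inner loop.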

Preliminary CUDA Results for Hex Docking

• Overall speed-up depends on how you measure it!

• Currently, 30x–50x (128-core GTX-9800 vs 1.8GHz Xeon)

• In cuFFT, 3D FFT is slow compared to 1D FFT

• For Hex, best relative improvement is 1D FFTs using N=25

• Key Hex functions implemented using 5 or 6 CUDA kernels

• Total learning + programming effort = 4 weeks

• Modern GPUs are now very powerful and easy to program!

• New FX-5800 (240 core) should give “interactive” docking...

Fast 2D Surface Envelope Matching (Ritchie & Kemp (1999) J Comp Chem 20 383–395)

• 2D surface comparisons are much faster than 3D:

$S_{AB} = \int |r_A(\theta,\phi) - r_B(\theta,\phi)|^2\, d\Omega$

• Expansions to L=7 (64 coeffs) take ∼ 0.05 s per superposition...

ParaSurf – SH Surfaces & Properties from Semi-Empirical QM (Lin & Clark (2005) J Chem Inf Model 45 1010–1016; Clark (2004) J Mol Graph 22 519–525)

• From MOPAC or VAMP calculate:

• Density contours of 2 × $10^{-4}$ e/Å³ (∼ SAS)

• MEP, IEL, EAL, αL as expansions to L=15

• Concise/convenient non-atomistic descriptors for CoMFA/QSAR?

ParaFit - High Throughput SH Surface & Property Matching

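Distance: $D = \int \left(r_A(\theta,\phi) - r_B'(\theta,\phi)\right)^2 d\Omega$

Orthogonality: $D = |a|^2 + |b|^2 - 2\,a \cdot b'$

Rotation: $b'_{lm} = \sum_{m'} R^{(l)}_{mm'}(\alpha,\beta,\gamma)\, b_{lm'}$

Hodgkin: $S = 2\,a \cdot b' / (|a|^2 + |b|^2)$

Carbo: $S = a \cdot b' / (|a|\,|b|)$

Tanimoto: $S = a \cdot b' / (|a|^2 + |b|^2 - a \cdot b')$

Multi-property: $S = p\,S_{shape} + q\,S_{MEP} + r\,S_{IEL} + s\,S_{EAL} + t\,S_{\alpha_L}$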
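Once the coefficients of B have been rotated, all of these measures follow from three dot products. The sketch below illustrates this under the assumption that `a` and `b_rot` are equally ordered vectors of expansion coefficients; the `Similarity` struct and the `compare` function are illustrative names, not ParaFit's interface.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Distance and similarity indices between two coefficient vectors.
struct Similarity { double distance, hodgkin, carbo, tanimoto; };

// `a` holds the coefficients of molecule A; `b_rot` holds the rotated
// coefficients b' of molecule B (a rotation leaves |b| unchanged).
Similarity compare(const std::vector<double>& a, const std::vector<double>& b_rot)
{
    double aa = 0.0, bb = 0.0, ab = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        aa += a[i] * a[i];
        bb += b_rot[i] * b_rot[i];
        ab += a[i] * b_rot[i];
    }
    Similarity s;
    s.distance = aa + bb - 2.0 * ab;          // D = |a|^2 + |b|^2 - 2 a.b'
    s.hodgkin  = 2.0 * ab / (aa + bb);
    s.carbo    = ab / std::sqrt(aa * bb);
    s.tanimoto = ab / (aa + bb - ab);
    return s;
}
```

Identical coefficient vectors give a distance of 0 and a value of 1 for all three indices. A multi-property score would combine several such comparisons with user-chosen weights.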

Fast Brute-Force Superposition Searches

• Euler rotations generated from icosahedral tessellation of sphere

• 22,500 samples (500(β, γ) × 45(α)) of about 8 degree steps

• Refine with 16 × 16 × 16 equatorial grid of 1 degree steps

• Approx 0.05 seconds / superposition on 1.8GHz P-III Xeon CPU...

Canonical Orientations – Aligning Molecules to Principal Axes

• Find principal radii by brute force search to L=6

• similar to finding moments of inertia

• but no ambiguity with respect to 180 degree flips

• Canonical orientations of similar molecules often overlay very well

Clustering the Odour Dataset using 2D Surface Shape (Takane et al. (2004) Org. Biomol. Chem. 2 3250–3255)

• Seven classes: bitter, ambergris, camphoraceous, rose, jasmine, muguet, musk

• Following Takane et al., cluster into 10 groups using ParaSurf & ParaFit:

unix% PS mopac run

unix% PS parasurf run

unix% parafit -matrix -dif odour data.dif * p.sdf

unix% dif2jpg -n 10 -d odour data.dif

unix% eog odour data.jpg

Visualisation of Odour Dataset Clustering Results (Mavridis et al. (2007) J. Chem. Inf. Model. 47(5) 1787–1796)

[Figure panels: Clustering Superposed Pairs; Clustering Canonical Orientations]

Shape-Based Virtual Screening of CXCR4 & CCR5 Antagonists(V. Perez-Nueno et al., (2008) J. Chem. Inf. Model. 48(3) 509-533)

• Assembled 602 known actives (TAK779, AMD3100, etc.) against CXCR4 & CCR5

• Performed virtual screening against 4700 inactives (with TAK779, AMD3100 as queries)

Comparing Ligand-Based & Docking-Based Virtual Screening

• Docking enrichments are better for CXCR4 than CCR5 (better CXCR4 homology model)

• But shape-based scoring generally gives better enrichments overall...

Conclusions & Future Prospects

• Protein Docking (“Hex”):

• Novel, fast, & fairly accurate docking algorithm

• Multi-dimensional FFT gives good speed-up, especially 3D

• Polar Fourier FFT maps v. well to GPU, with v. good speed-up

• Main challenge is now scoring & flexibility, not search...

• Small-Molecule Applications:

• SH shape-matching is at least as good as ROCS and v. fast...

• The Future?

• Extensible to CoMFA/QSAR & ligand docking...?

• High throughput 2D/3D database screening now feasible...?

Acknowledgments

ANR 2009-2010

BBSRC 1996-2000, 2006-07

EPSRC 2002-06

Tim Clark, Brian Hudson & Vishwesh Venkatraman

Sandor Vajda & Dima Kozakov

Lazaros Mavridis & Violeta Perez-Nueno

Software & Preprints: http://www.loria.fr/~ritchied/

PSFB special issue: Third CAPRI Evaluation Meeting Dec 2007 (Google: Proteins Wiley)

Review: DW Ritchie (2008) Curr. Prot. Pep. Sci. 9(1) 1–15.

Extra Slides

Using Low Resolution Docking to Cross-Validate Predicted PPIs?

Low resolution docking of Trypsin + BPTI

a) Crystal structure + low res FFT

• Gold: BPTI location in crystal

• Red: centroid of calculated BPTI solutions

b) Model-built structure (green) + low res FFT

Figure from: Tovchigrechko et al., Prot. Sci. (2002) 11 1888–1896

Multi-Sample Docking for Very Large Molecules - Antibody-VP2
