Protein Docking and Molecular Shape Recognition Using Polar Fourier Correlations
Dave Ritchie, LORIA, Nancy
What is Protein Docking?
Protein docking = shape recognition in 3D space
However, ... proteins are flexible (more complexity)!
Contents
• Motivation – Importance of Protein-Protein Interactions (PPIs)
• Polar Fourier Protein Shape Representation
• Application to Protein Docking – Hex
• The CAPRI Blind Docking Experiment
• New Developments – Multi-Dimensional FFTs, Using GPUs
• Application to Molecular Shape Recognition – ParaSurf & ParaFit
• Conclusions & Future Prospects
PPI Networks are Fundamental to Biological Mechanisms
• If genomes provide the “blue-print” for life ...
• ... then proteins provide the “machinery”
• Understanding PPIs could lead to immense scientific advances and therapeutic benefits
Yeast network figure from: J Hallinan & G Smith ICCS 2002 article 584
Recent Growth of Protein-Protein Interaction (PPI) Literature
Citations of key yeast functional genomics papers
(per year):
• Red: Ito et al., Uetz et al. (Y2H)
• Blue: Ho et al., Gavin et al. (TAP-MS)
• Black: All protein-protein interaction papers
Figure from: Bork et al., Curr Op. Struct. Biol. (2004) 14 292–299
Docking - Predicting PPIs at the 3D Molecular Level
Ab Initio
• Soft Docking – FFT, Polar Fourier Correlations (∼ hours)
• MC/MD – Flexible side chains + backbones (∼ days)
Re-Scoring
• Knowledge-based potentials
Data-Driven
• Biochemical: mutagenesis hot-spot residues
• Biophysical: NMR CSP/RDC, H/D exchange, 13C labeling, ...
• ET + Correlated mutations
• Structural Databases (docking by homology)
The Basic Goal of Protein-Protein Docking
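Find the minimum potential energy of the system as rapidly as possible:
$E = \int \phi(\mathbf{r})\,\rho(\mathbf{r})\,dV$
For two proteins:
$\phi(\mathbf{r}) = \phi_A(\mathbf{r}) + \phi_B(\mathbf{r})$
$\rho(\mathbf{r}) = \rho_A(\mathbf{r}) + \rho_B(\mathbf{r})$
and so, keeping only the intermolecular cross terms,
$E = \int \left( \phi_A(\mathbf{r})\,\rho_B(\mathbf{r}) + \phi_B(\mathbf{r})\,\rho_A(\mathbf{r}) \right) dV$
• With brute-force search, typically need $\sim 10^9$ such integrals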
• Current algorithms often sum several such potential/density terms...
• ... and often use 3D Cartesian FFTs to accelerate the calculation
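A minimal sketch of the integral above, assuming both proteins' potentials and densities have already been sampled on one shared grid (flattened to 1D here). The grid values, the cell volume `dV` and the helper names are hypothetical and chosen only for illustration; a brute-force search repeats this sum for every trial pose, which is the cost that FFT methods reduce.

```python
def interaction_energy(phi_a, rho_a, phi_b, rho_b, dV):
    """Approximate E = integral of (phi_A*rho_B + phi_B*rho_A) dV by a grid sum."""
    return sum(pa * rb + pb * ra
               for pa, ra, pb, rb in zip(phi_a, rho_a, phi_b, rho_b)) * dV

def shift(values, s):
    """Translate a sampled 1D field by s cells, padding with zeros."""
    n = len(values)
    return [values[i - s] if 0 <= i - s < n else 0.0 for i in range(n)]

# Hypothetical sampled fields on four grid cells (not data from the talk).
phi_a = [0.0, -1.0, -0.5, 0.0]   # potential of protein A
rho_a = [1.0, 0.0, 0.0, 0.0]     # density of protein A
phi_b = [-0.2, 0.0, 0.0, 0.0]    # potential of protein B
rho_b = [0.0, 1.0, 1.0, 0.0]     # density of protein B
dV = 0.5                         # volume of one grid cell

# Brute force: recompute the energy for every trial translation of B.
energies = {s: interaction_energy(phi_a, rho_a, shift(phi_b, s), shift(rho_b, s), dV)
            for s in range(-2, 3)}
best_shift = min(energies, key=energies.get)
```

Each trial pose costs one pass over the whole grid, so the cost grows as (number of poses) × (number of cells).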
Real Spherical Harmonic Basis Functions
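Orthogonality: $\int y_{lm}(\theta,\phi)\, y_{l'm'}(\theta,\phi)\, d\Omega = \delta_{ll'}\,\delta_{mm'}$
Rotation: $y_{lm}(\theta',\phi') = \sum_{m'=-l}^{l} R^{(l)}_{m'm}(\alpha,\beta,\gamma)\, y_{lm'}(\theta,\phi)$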
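The orthogonality relation can be checked numerically. The sketch below hard-codes the real spherical harmonics for l = 0 and l = 1 only (in one common sign convention, which is an assumption) and integrates products over the sphere with a simple midpoint rule; the function names and grid sizes are illustrative.

```python
import math

def real_sh(l, m, theta, phi):
    """Real spherical harmonics for l = 0 and l = 1 only."""
    if (l, m) == (0, 0):
        return math.sqrt(1.0 / (4.0 * math.pi))
    c = math.sqrt(3.0 / (4.0 * math.pi))
    if (l, m) == (1, -1):
        return c * math.sin(theta) * math.sin(phi)
    if (l, m) == (1, 0):
        return c * math.cos(theta)
    if (l, m) == (1, 1):
        return c * math.sin(theta) * math.cos(phi)
    raise ValueError("only l <= 1 is implemented in this sketch")

def overlap(lm1, lm2, n_theta=200, n_phi=64):
    """Midpoint-rule estimate of the integral of y_lm1 * y_lm2 over the sphere."""
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            # d(Omega) = sin(theta) d(theta) d(phi)
            total += (real_sh(*lm1, theta, phi) * real_sh(*lm2, theta, phi)
                      * math.sin(theta))
    return total * d_theta * d_phi

basis = [(0, 0), (1, -1), (1, 0), (1, 1)]
gram = [[overlap(p, q) for q in basis] for p in basis]  # close to the identity
```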
Spherical Harmonic Surfaces
Example: 2D Radial Expansions (256 Basis Functions)
$r(\theta,\phi) = \sum_{l=0}^{15} \sum_{m=-l}^{l} a_{lm}\, y_{lm}(\theta,\phi)$
• Good for matching similar shapes, not so good for docking...
Radial Basis Functions: Rnl(r)
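HO-type (shape): $R_{nl}(r) = N^{(q)}_{nl}\, e^{-\rho/2}\, \rho^{l/2}\, L^{(l+1/2)}_{n-l-1}(\rho)$; $\rho = r^2/q$, $q = 20$.
Coulomb (electro): $R_{nl}(r) = N^{(\Lambda)}_{nl}\, e^{-\rho/2}\, \rho^{l}\, L^{(2l+2)}_{n-l-1}(\rho)$; $\rho = 2\Lambda r$, $\Lambda = 1/2$.
Orthogonality: $\int_0^\infty R_{nl}(r)\, R_{n'l}(r)\, r^2\, dr = \delta_{nn'}$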
[Figure: plots of $R_{15,0}(r)$, $R_{20,0}(r)$, $R_{25,0}(r)$ and $R_{30,0}(r)$, each with the r axis marked at 30]
3D Protein Shape Density Representations (Ritchie & Kemp (2000) Proteins 39 178–194)
• Sample surface skins onto a $(0.75\,\text{Å})^3$ grid...
[Figure labels: Molecular Surface, Solvent Accessible Surface, Surface Skin, Protein Interior, Sampling Spheres, Surface Normals]
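Surface Skin: $\sigma(\mathbf{r}) = 1$ if $\mathbf{r} \in$ surface skin; $0$ otherwise
Interior: $\tau(\mathbf{r}) = 1$ if $\mathbf{r} \in$ protein atom; $0$ otherwise
Parametrise as: $\sigma(\mathbf{r}) = \sum_{nlm}^{N} a^{\sigma}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$, etc.
Estimate as: $a^{\sigma}_{nlm} \simeq \sum_{c} R_{nl}(r_c)\, y_{lm}(\theta_c,\phi_c)\, \Delta V$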
• Only need to do this once for each protein...
Polar Fourier Shape Density Reconstruction - Antibody CDRs
Image   Order       Coefficients
A       Gaussians   -
B       N = 16      1,496
C       N = 25      5,525
D       N = 30      9,455
DW Ritchie (2003) Proteins Struct. Funct. Bioinf. 52 98–106
3D Shape Density Reconstruction – CAPRI T21: Orc1/Sir1
DW Ritchie (2008) Curr. Prot. Pep. Sci. 9(1) 1-15
Docking Using 3D Polar Fourier Density Functions - “Hex”
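[Figure: the surface skin density $\sigma(\mathbf{r})$ and interior density $\tau(\mathbf{r})$]
Densities: $\sigma(\mathbf{r}) = \sum_{nlm}^{N} a^{\sigma}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$; $\tau(\mathbf{r}) = \sum_{nlm}^{N} a^{\tau}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$
Favourable: $\int \left( \sigma_A(\mathbf{r}_A)\,\tau_B(\mathbf{r}_B) + \tau_A(\mathbf{r}_A)\,\sigma_B(\mathbf{r}_B) \right) dV$
Unfavourable: $\int \tau_A(\mathbf{r}_A)\,\tau_B(\mathbf{r}_B)\, dV$
Score: $S_{AB} = \int \left( \sigma_A\tau_B + \tau_A\sigma_B - Q\,\tau_A\tau_B \right) dV$
Penalty Factor: $Q = 11$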
DW Ritchie & GJL Kemp (2000) Proteins Struct. Funct. Bioinf. 39 178–194
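A sketch of how the score behaves when the 0/1 densities are stored directly as sets of occupied voxels rather than as polar Fourier expansions (a simplification for illustration, not the Hex implementation). Each integral then reduces to counting shared voxels. The voxel coordinates are hypothetical; only Q = 11 comes from the slide.

```python
Q = 11  # penalty factor quoted on the slide

def shape_score(skin_a, interior_a, skin_b, interior_b, q=Q, dV=1.0):
    """S_AB = integral of (sigma_A tau_B + tau_A sigma_B - Q tau_A tau_B) dV
    for 0/1 densities stored as sets of voxel coordinates."""
    favourable = len(skin_a & interior_b) + len(interior_a & skin_b)
    unfavourable = len(interior_a & interior_b)
    return (favourable - q * unfavourable) * dV

def translate(voxels, dx):
    """Shift a voxel set by dx cells along x."""
    return {(x + dx, y, z) for (x, y, z) in voxels}

# Hypothetical toy proteins laid out along the x axis.
interior_a, skin_a = {(0, 0, 0), (1, 0, 0)}, {(2, 0, 0)}
interior_b, skin_b = {(2, 0, 0), (3, 0, 0)}, {(1, 0, 0)}

# Skins interlock with the partner's interior: two favourable voxels, no clash.
touching = shape_score(skin_a, interior_a, skin_b, interior_b)

# Push B one cell into A: one interior-interior clash costs Q.
clashing = shape_score(skin_a, interior_a,
                       translate(skin_b, -1), translate(interior_b, -1))
```

With these toy sets `touching` is 2 and `clashing` is 2 − 11 = −9, which shows why a large Q steers the search away from interpenetrating poses.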
Correlations - Overlap as a Function of Coordinate Operations
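Rotation: $R(\alpha,\beta,\gamma)\,\sigma_A(\mathbf{r}) = \sum_{nlm}^{N} a^{\sigma\prime}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$
Rotated Coefficients: $a^{\sigma\prime}_{nlm} = \sum_{m'=-l}^{l} R^{(l)}_{mm'}(\alpha,\beta,\gamma)\, a^{\sigma}_{nlm'}$
Translation: $T_z(R)\,\sigma_A(\mathbf{r}) = \sum_{nlm}^{N} a^{\sigma\prime\prime}_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)$
Translated Coefficients: $a^{\sigma\prime\prime}_{nlm} = \sum_{n'l'}^{N} T^{(|m|)}_{nl,n'l'}(R)\, a^{\sigma}_{n'l'm}$
Hence: $\int \sigma'_A(\mathbf{r})\, \tau''_B(\mathbf{r})\, dV = \sum_{nlm}^{N} a^{\sigma\prime}_{nlm}\, b^{\tau\prime\prime}_{nlm}$, etc.
Search Space: $\sim 10^9$ orientations ($\sim 10^6$ orientations/sec)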
DW Ritchie (2005) J. Appl. Cryst. 38 808–818
Translation Matrices From Fourier-Bessel Transform Theory
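Using spherical Bessel transforms:
$\tilde{R}_{nl}(\beta) = \sqrt{\frac{2}{\pi}} \int_0^\infty R_{nl}(r)\, j_l(\beta r)\, r^2\, dr$; $R_{nl}(r) = \sqrt{\frac{2}{\pi}} \int_0^\infty \tilde{R}_{nl}(\beta)\, j_l(\beta r)\, \beta^2\, d\beta$
it can be shown that
$T^{(|m|)}_{n'l',nl}(R) = \sum_{k=|l-l'|}^{l+l'} A^{(ll'|m|)}_{k} \int_0^\infty \tilde{R}_{nl}(\beta)\, \tilde{R}_{n'l'}(\beta)\, j_k(\beta R)\, \beta^2\, d\beta$
where
$A^{(ll'|m|)}_{k} = (-1)^{\frac{k+l'-l}{2}+m}\, (2k+1)\, \left[ (2l+1)(2l'+1) \right]^{1/2} \begin{pmatrix} l & l' & k \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} l & l' & k \\ m & -m & 0 \end{pmatrix}$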
• Can derive analytic formulae for both GTO and ETO radial functions
• Requires high precision math library (GMP)...
• Calculate once for R = 1, 2, 3, ..., 50 Å and store on disk (∼ 200Mb)
6D Docking Search as a Nested Sequence of Transformations
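Get 4 rotations from icosahedral tessellations ...
[Figure: 6D search geometry for proteins A and B; labels R, z, α2, (β1,γ1), (β2,γ2)]
Rotate A (×812 @ 7.5°): $A'(\mathbf{r}) = R(0, \beta_1, \gamma_1)\, A(\mathbf{r})$
Translate A (×50 @ 0.75 Å): $A''(\mathbf{r}) = T_z(-R)\, A'(\mathbf{r})$
Rotate B (×812 @ 7.5°): $B'(\mathbf{r}) = R(0, \beta_2, \gamma_2)\, B(\mathbf{r})$
Twist B (×64 @ 5.6°): $B''(\mathbf{r}) = R(\alpha_2, 0, 0)\, B'(\mathbf{r})$
1D FFT: $S_{AB}(\alpha_2) = \sum_{m=1-N}^{N-1} \left( P_m \cos m\alpha_2 + Q_m \sin m\alpha_2 \right)$
Search Space: 812 × 50 × 812 × 64 ≃ $2 \times 10^9$ (∼ $10^6$/s on a 1GHz PIII Xeon)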
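A sketch of the innermost step: once the coefficients $P_m$, $Q_m$ are known for a given (β1, γ1, R, β2, γ2), the score for all 64 twist angles follows from one Fourier series. Hex uses a 1D FFT for this; the sketch below evaluates the series by direct summation instead, purely for readability. The coefficient values are hypothetical, and the first lines simply re-check the search-space arithmetic.

```python
import math

# Search-space size quoted on the slide: 812 x 50 x 812 x 64 poses.
n_poses = 812 * 50 * 812 * 64
minutes_at_1e6_per_s = n_poses / 1e6 / 60.0   # roughly 35 minutes

def twist_scores(coeffs, n_samples=64):
    """Evaluate S(alpha) = sum_m [P_m cos(m alpha) + Q_m sin(m alpha)] at
    n_samples equally spaced twist angles. coeffs maps m -> (P_m, Q_m)."""
    scores = []
    for k in range(n_samples):
        alpha = 2.0 * math.pi * k / n_samples
        scores.append(sum(p * math.cos(m * alpha) + q * math.sin(m * alpha)
                          for m, (p, q) in coeffs.items()))
    return scores

# Hypothetical coefficients giving S(alpha) = 1 + 2 sin(alpha).
coeffs = {0: (1.0, 0.0), 1: (0.0, 2.0)}
scores = twist_scores(coeffs)
best_k = max(range(len(scores)), key=scores.__getitem__)
best_angle_deg = 360.0 * best_k / len(scores)   # 64 samples -> 5.625 degree steps
```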
Shape Correlation Score as a Function of Twist Angle α2 (Antibody HyHel-5/Lysozyme Complex)
[Figure: three plots of the score S against twist angle α2, for N=16, N=20 and N=25]
Re-Docking Known Protein Complexes
        N = 16          N = 20          N = 25
Case    Top     RMS     Top     RMS     Top     RMS
SIC     3,407   0.00    2       0.22    1       0.82
KAI     17      0.41    3       0.69    7       0.81
PTC     132     0.52    2       0.48    1       0.48
CGI     1       0.38    1       0.38    1       0.38
CHO     1       0.45    1       0.55    1       0.55
BGS     1       0.82    1       0.82    1       0.88
GGI     1       2.47    1       0.90    1       0.90
TET     5       1.48    1       1.16    1       1.03
FPT     102     1.04    1       0.42    1       0.42
IGF     3       0.71    1       0.77    1       0.77
JEL     4,867   0.81    1,060   0.81    2       0.81
BQL     524     1.85    12      0.96    1       0.39
HFL     318     1.01    5       1.00    1       1.00
HFM     7       2.19    27      1.09    10      1.09
VFB     8,344   1.49    216     0.20    9       0.20
MLC     1,401   0.00    116     0.00    187     0.84
MEL     9,898   1.03    27      1.03    3       1.03
JHL     385     0.62    8       0.38    1       1.08
FBI     14      1.09    1       1.09    1       0.38
NCA     68      1.53    1       0.32    1       0.32
NMB     160     2.43    1,630   1.39    1,009   1.39
NSN     19,992  1.11    716     0.75    1,130   2.29
IAI     1,381   1.48    111     0.37    20      1.39
DVF     11,145  0.00    88      1.38    49      0.44
KB5     140     0.34    1       0.34    78      1.38
IGC     1,328   1.74    269     0.81    1       0.34
Show Docking Movie!
CAPRI – Critical Assessment of Predicted Interactions
• Started in 2001/2 following CASP with 19 groups & 7 targets...
• At least one protein presented in its unbound form
• Any predictive approach allowed: homology/literature, etc.
Target  Receptor        Ligand       Type  Complex               Lab
1       HPr Kinase      HPr          U/U   Fieulaine et al.      Janin
2       Rotavirus VP6   MCV          U/B   Vaney et al.          Rey
3       Hemagglutinin   HC63         U/B   Barbey-Martin et al.  Knossow
4       α-Amylase       AMD10        U/B   Desmyter et al.       Cambillau
5       α-Amylase       AMB7         U/B   Desmyter et al.       Cambillau
6       α-Amylase       AMD9         U/B   Desmyter et al.       Cambillau
7       SpeA            TCR 14.3.D   U/U   Sundberg et al.       Mariuzza
• Now > 40 groups; Currently on Targets 28 ...
• 3 Sections - Predictors, Servers, Scorers
J Janin et al. (2003) Proteins Struct. Funct. Bioinf. 52 2–9
CAPRI Target 1 - Lactobacillus HPr / HprK
CAPRI Results: Targets 1–7
Predictor Software Algorithm T1 T2 T3 T4 T5 T6 T7
Abagyan ICM FF ** *** **
Camacho CHARMM FF * *** ***
Eisenstein MolFit FFT * * ***
Sternberg FTDOCK FFT * ** *
Ten Eyck DOT FFT * * **
Gray MC ** ***
Ritchie Hex SPF ** ***
Weng ZDOCK FFT ** **
Wolfson BUDDA/PPD GH * ***
Bates Guided Docking FF - - - ***
Palma BIGGER GF - - ** *
Gardiner GAPDOCK GA * * - - - - -
Olson Surfdock SH * - - - -
Valencia ANN * - - - - - -
Vakser GRAMM FFT * - - - -
* low, ** medium, *** high accuracy prediction; - no prediction
R Mendez et al. (2003) Proteins Struct. Funct. Bioinf. 52 51–67
Docked Orientation (Hex) for Target 3 - Hemagglutinin/HC63
• CAPRI “medium accuracy” (1 Å ≤ Ligand RMSD ≤ 5 Å)
Docked Orientation (Hex) for Target 6 - Amylase/AMD9
• CAPRI “high accuracy” (Ligand RMSD ≤ 1 Å)
Subsequent CAPRI Targets (Rounds 3 – 5)
Target  Description                  Comments
T8      Nidogen-γ3 - Laminin         U/U
T9      LiCT homodimer               build from monomer – 12 Å RMS deviation
T10     TBEV trimer                  build from monomer – 11 Å RMS deviation
T11     Cohesin - dockerin           U/U; model-build dockerin
T12     Cohesin - dockerin           U/B
T13     SAG1 - antibody Fab          SAG1 conformational change: 10 Å RMS
T14     MYPT1 - PP1δ                 U/U; model-build PP1α → PP1δ
T18     TAXI - xylanase              U/B
T19     Ovine prion - antibody Fab   model-build prion
• T15-T17 cancelled: structures released prematurely - Google!!!
• T11, T14, T19 involved homology model-building step...
CAPRI Results: Targets 8–19
Predictor Software T8 T9 T10 T11 T12 T13 T14 T18 T19
Abagyan ICM ** * ** *** * *** ** **
Wolfson PatchDock ** * * * * - ** ** *
Weng ZDOCK/RDOCK ** * *** *** *** ** **
Bates FTDOCK * * ** * ** ** *
Baker RosettaDock - ** *** ** *** ***
Camacho SmoothDock ** *** *** ** ** *
Gray RosettaDock *** - - ** *** **
Bonvin Haddock - - ** ** *** ***
Comeau ClusPro ** *** * *
Sternberg 3D-DOCK ** * * ** *
Eisenstein MolFit *** * *** **
Ritchie Hex ** *** * *
Zhou - - - *** ** * *
Ten Eyck DOT *** *** **
Zacharias ATTRACT ** - - - - *** **
Valencia * * * - -
Vakser GRAMM - - - - - ** **
Umeyama ** *
Kaznessis - - ***
Fano Grid-Hex - - *
R Mendez et al. (2005) Proteins Struct. Funct. Bioinf. 60 150–169
Docked Orientation (Hex) for Target 12 - Cohesin/Dockerin
• Here, we assumed “molecular mimicry”
• First superposed dockerin onto cohesin dimer, then docked...
• CAPRI “high accuracy” (Interface RMSD ≤ 1 Å)
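5D FFT Correlations from Complex Overlap Expressions (Ritchie, Kozakov, Vajda (2008) Bioinformatics 24 1865–1873)
Complex SHs, $Y_{lm}$: $y_{lm}(\theta,\phi) = \sum_{t} U^{(l)}_{mt}\, Y_{lt}(\theta,\phi)$
Complex coefficients: $A_{nlm} = \sum_{t} a_{nlt}\, U^{(l)}_{tm}$
Complex overlap: $S = \sum_{kjsmnlv} D^{(j)*}_{ms}(0,\beta_A,\gamma_A)\, A^{*}_{kjs}\, T^{(|m|)}_{kj,nl}(R)\, D^{(l)}_{mv}(\alpha_B,\beta_B,\gamma_B)\, B_{nlv}$
Collect coefficients: $S^{(|m|)}_{js,lv}(R) = \sum_{kn} A^{*}_{kjs}\, T^{(|m|)}_{kj,nl}(R)\, B_{nlv}$, $k > j$; $n > l$
To give: $S = \sum_{jsmlv} D^{(j)*}_{ms}(0,\beta_A,\gamma_A)\, S^{(|m|)}_{js,lv}(R)\, D^{(l)}_{mv}(\alpha_B,\beta_B,\gamma_B)$
Expand as exponentials: $D^{(l)}_{mv}(\alpha,\beta,\gamma) = \sum_{t} \Gamma^{tm}_{lv}\, e^{-im\alpha}\, e^{-it\beta}\, e^{-iv\gamma}$
Hence: $S = \sum_{jsmlvrt} \Gamma^{rm}_{js}\, S^{(|m|)}_{js,lv}(R)\, \Gamma^{tm}_{lv}\, e^{-i(r\beta_A - s\gamma_A + m\alpha_B + t\beta_B + v\gamma_B)}$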
Comparing FFT Correlation Speeds
N=25 Correlations, $2.6 \times 10^9$ Orientations
(Single CPU, 1.8GHz Xeon, 1Gb RAM)
FFT   Set-up (mins)   FFT Rate ($10^6$/sec)   Total Rate ($10^6$/sec)   Total Time (mins)
1D    8.0             1.0                     0.8                       43
3D    13.5            17.0                    1.8                       15
5D    9.8             4.5                     2.2                       21
The difference in 5D/3D FFT rates seems to be due to CPU-cache/main-memory thrashing
• For two-property correlations, 5D FFT is ∼ 2x faster and 3D is ∼ 3x faster than 1D
• BUT for multi-property correlations, 5D gives almost NO extra cost per property
Porting Hex to a GPU using CUDA
• Modern GPUs have very high compute performance
• SIMT architecture = single instruction, multiple threads
• NVIDIA GPUs:
• Up to 4Gb memory
• Up to 240 arithmetic “cores”
• Up to Tflop performance
• Easy API with C++ syntax
• Grid of threads SIMT model
• BUT – for best results, need to understand the hardware...
CUDA Device Architecture
• Typically 8–16 multiprocessor blocks, each with 16 thread units
[Figure: multiprocessor blocks 0–7, each containing thread processors 0–15 with thread-local memory and shared memory (16Kb, fast), above global memory (256Mb – 4Gb, slow) and the host (PCIe)]
• NB. global memory is ∼ 80x slower than shared memory
• Strategy: aim for “high arithmetic intensity” in shared memory
CUDA Example - Matrix Multiplication
• Matrix multiplication C = A * B
• Each thread is responsible for calculating one element: C[i,k]
• Threads cooperate by reading & sharing sub-blocks of A & B
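[Figure: C = A * B, shown for the conventional algorithm (element C[i,k]) and for GPU thread-blocks (block indices bx, by; thread indices tx, ty)]
• Conventional algorithm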
• C[i,k] = A[i] * B[k]
• GPU thread-blocks
• Multiprocessor launches multiple blocks to compute all of C
• Running thread-blocks concurrently hides memory latency
CUDA Programming - Matrix Multiplication Kernel
// Assumes 16x16 thread-blocks and matrix dimensions that are multiples of 16.
__global__ void matmul(int wA, int wB, float *A, float *B, float *C)
{
    float Cik = 0.0f;                              // thread-local result variable
    int bx = blockIdx.x, tx = threadIdx.x;         // thread subscripts
    int by = blockIdx.y, ty = threadIdx.y;         // ("this" thread is one of a 2-D grid)
    __shared__ float a_sub[16][16], b_sub[16][16]; // declare shared memory
    for (int j=0; j<wA; j+=16)                     // thread-local loop
    {
        int ij = (16*by+ty)*wA + (j+tx);           // thread-local array subscripts
        int jk = (j+ty)*wB + (16*bx+tx);
        a_sub[ty][tx] = A[ij];                     // copy global data -> shared memory ("I/O")
        b_sub[ty][tx] = B[jk];
        __syncthreads();                           // wait until all memory I/O finished
        for (int jj=0; jj<16; jj++) Cik += a_sub[ty][jj] * b_sub[jj][tx];
        __syncthreads();                           // wait until all threads finished
    }
    int ik = (16*by+ty)*wB + (16*bx+tx);           // array subscript of result element
    C[ik] = Cik;                                   // copy local result -> global memory
}
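A CPU reference in Python that mirrors the kernel's tiling can be handy for checking results. This is a sketch under the same assumptions as the kernel (row-major flat arrays, dimensions that are multiples of the tile width 16); the function names and test matrices are hypothetical.

```python
TILE = 16  # tile width used by the kernel above

def tiled_matmul(A, B, hA, wA, wB, tile=TILE):
    """Multiply row-major flat lists A (hA x wA) and B (wA x wB) tile by tile.
    All dimensions are assumed to be multiples of `tile`."""
    C = [0.0] * (hA * wB)
    for by in range(hA // tile):            # block row    (blockIdx.y)
        for bx in range(wB // tile):        # block column (blockIdx.x)
            for j in range(0, wA, tile):    # walk matching sub-blocks of A and B
                for ty in range(tile):      # threadIdx.y
                    for tx in range(tile):  # threadIdx.x
                        i, k = tile * by + ty, tile * bx + tx
                        acc = 0.0
                        for jj in range(tile):
                            acc += A[i * wA + (j + jj)] * B[(j + jj) * wB + k]
                        C[i * wB + k] += acc
    return C

def naive_matmul(A, B, hA, wA, wB):
    """Plain triple loop, used only as a cross-check."""
    return [sum(A[i * wA + j] * B[j * wB + k] for j in range(wA))
            for i in range(hA) for k in range(wB)]

# Hypothetical integer-valued test data, so both orders of summation agree exactly.
hA, wA, wB = 16, 32, 16
A = [float(n % 7) for n in range(hA * wA)]
B = [float(n % 5) for n in range(wA * wB)]
C_tiled = tiled_matmul(A, B, hA, wA, wB)
C_naive = naive_matmul(A, B, hA, wA, wB)
```

On the GPU the two innermost index loops (ty, tx) run as parallel threads, and the (by, bx) loops become the grid of thread-blocks.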
CUDA Porting Strategy
• Only port compute-intensive steps e.g. matrix multiply ...
• Consider using provided CUDA libraries: cuFFT, cuBLAS...
• Perform recursion, random access calculations on CPU first...
• Re-write complex/clever data structures as vectors, arrays...
• ... and round-up array dimensions to multiples of 16
• Re-write loops on 1D vectors as 2D array operations, etc.
• Access array elements in natural order for best memory “I/O”
Preliminary CUDA Results for Hex Docking
• Overall speed-up depends on how you measure it!
• Currently, 30x–50x (128-core GTX-9800 vs 1.8GHz Xeon)
• In cuFFT, 3D FFT is slow compared to 1D FFT
• For Hex, best relative improvement is 1D FFTs using N=25
• Key Hex functions implemented using 5 or 6 CUDA kernels
• Total learning + programming effort = 4 weeks
• Modern GPUs are now very powerful and easy to program!
• New FX-5800 (240 core) should give “interactive” docking...
Fast 2D Surface Envelope Matching (Ritchie & Kemp (1999) J Comp Chem 20 383–395)
• 2D surface comparisons are much faster than 3D:
$S_{AB} = \int \left| r_A(\theta,\phi) - r_B(\theta,\phi) \right|^2 d\Omega$
• Expansions to L=7 (64 coeffs) take ∼ 0.05 s per superposition...
ParaSurf – SH Surfaces & Properties from Semi-Empirical QM (Lin & Clark (2005) J Chem Inf Model 45 1010–1016; Clark (2004) J Mol Graph 22 519–525)
• From MOPAC or VAMP calculate:
• Density contours of $2 \times 10^{-4}\,e/\text{Å}^3$ (∼ SAS)
• MEP, IEL, EAL, $\alpha_L$ as expansions to L=15
• Concise/convenient non-atomistic descriptors for CoMFA/QSAR?
ParaFit - High Throughput SH Surface & Property Matching
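Distance: $D = \int \left( r_A(\theta,\phi) - r_B(\theta,\phi)' \right)^2 d\Omega$
Orthogonality: $D = |a|^2 + |b|^2 - 2\,a \cdot b'$
Rotation: $b'_{lm} = \sum_{m'} R^{(l)}_{mm'}(\alpha,\beta,\gamma)\, b_{lm'}$
Hodgkin: $S = 2\,a \cdot b' / (|a|^2 + |b|^2)$
Carbo: $S = a \cdot b' / (|a|\,|b|)$
Tanimoto: $S = a \cdot b' / (|a|^2 + |b|^2 - a \cdot b')$
Multi-property: $S = p\,S_{shape} + q\,S_{MEP} + r\,S_{IEL} + s\,S_{EAL} + t\,S_{\alpha_L}$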
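Because the harmonics are orthonormal, every score above is a function of three dot products of coefficient vectors. The sketch below computes them for two coefficient vectors, where `b_rot` stands for the already-rotated b′ (rotation preserves its length, so |b′| = |b|). The coefficient values and the property weights are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_scores(a, b_rot):
    """Distance and similarity indices from SH coefficient vectors a and b'."""
    ab, aa, bb = dot(a, b_rot), dot(a, a), dot(b_rot, b_rot)
    return {
        "distance": aa + bb - 2.0 * ab,
        "hodgkin": 2.0 * ab / (aa + bb),
        "carbo": ab / math.sqrt(aa * bb),
        "tanimoto": ab / (aa + bb - ab),
    }

def multi_property(per_property_scores, weights):
    """Weighted sum S = p*S_shape + q*S_MEP + ... over whichever properties are given."""
    return sum(weights[name] * per_property_scores[name] for name in weights)

# Hypothetical coefficient vectors (not taken from any real molecule).
a = [1.0, 0.5, 0.0, 0.25]
same = similarity_scores(a, a)                        # identical shapes
orthogonal = similarity_scores([1.0, 0.0], [0.0, 1.0])

# Hypothetical weights combining a shape score with an MEP score.
combined = multi_property({"shape": same["tanimoto"], "MEP": 0.5},
                          {"shape": 0.75, "MEP": 0.25})
```

Identical vectors give a distance of 0 and a score of 1 for all three indices; orthogonal vectors give a score of 0.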
Fast Brute-Force Superposition Searches
• Euler rotations generated from icosahedral tessellation of sphere
• 22,500 samples (500(β, γ) × 45(α)) of about 8 degree steps
• Refine with 16 × 16 × 16 equatorial grid of 1 degree steps
• Approx 0.05 seconds / superposition on 1.8GHz P-III Xeon CPU...
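The coarse-then-fine idea can be sketched in one dimension. The example below searches only the twist angle about z for an l = 1 expansion, whose three real-harmonic coefficients transform like the components of an ordinary vector, so a z-rotation simply rotates the (x-like, y-like) pair. The full search over (β, γ) from the icosahedral tessellation is not reproduced here, and the coefficient values are hypothetical.

```python
import math

coarse_samples = 500 * 45   # 22,500 coarse trial rotations, as on the slide
fine_samples = 16 ** 3      # 4,096 refinement trials

def rotate_z_l1(coeffs, alpha):
    """Rotate l = 1 coefficients (x-like, y-like, z-like) about z by alpha radians."""
    ax, ay, az = coeffs
    c, s = math.cos(alpha), math.sin(alpha)
    return (c * ax - s * ay, s * ax + c * ay, az)

def best_twist(a, b, coarse_step=8, fine_step=1):
    """Coarse scan over 360 degrees, then a fine scan around the best coarse angle.
    Returns the twist (in whole degrees) that maximises the overlap a . b'."""
    def score(deg):
        b_rot = rotate_z_l1(b, math.radians(deg))
        return sum(x * y for x, y in zip(a, b_rot))
    coarse = max(range(0, 360, coarse_step), key=score)
    fine = max(range(coarse - coarse_step, coarse + coarse_step + 1, fine_step),
               key=score)
    return fine % 360

# Hypothetical query: a is b rotated by 37 degrees, which the search should recover.
b = (1.0, 0.2, 0.5)
a = rotate_z_l1(b, math.radians(37))
found = best_twist(a, b)
```

The coarse pass lands on 40 degrees (the nearest multiple of 8) and the 1 degree refinement then recovers 37.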
Canonical Orientations – Aligning Molecules to Principal Axes
• Find principal radii by brute force search to L=6
• similar to finding moments of inertia
• but no ambiguity with respect to 180 degree flips
[Figure: molecule in canonical orientation, with z and x axes labelled]
• Canonical orientations of similar molecules often overlay very well
Clustering the Odour Dataset using 2D Surface Shape (Takane et al. (2004) Org. Biomol. Chem. 2 3250–3255)
• Seven classes: bitter, ambergris, camphoraceous, rose, jasmine, muguet, musk
• Following Takane et al., cluster into 10 groups using ParaSurf & ParaFit:
unix% PS mopac run
unix% PS parasurf run
unix% parafit -matrix -dif odour data.dif * p.sdf
unix% dif2jpg -n 10 -d odour data.dif
unix% eog odour data.jpg
Visualisation of Odour Dataset Clustering Results (Mavridis et al. (2007) J. Chem. Inf. Model. 47(5) 1787–1796)
[Figure panels: Clustering Superposed Pairs; Clustering Canonical Orientations]
Shape-Based Virtual Screening of CXCR4 & CCR5 Antagonists (V. Perez-Nueno et al. (2008) J. Chem. Inf. Model. 48(3) 509–533)
• Assembled 602 known actives (TAK779, AMD3100, etc.) against CXCR4 & CCR5
• Performed virtual screening against 4700 inactives (with TAK779, AMD3100 as queries)
Comparing Ligand-Based & Docking-Based Virtual Screening
• Docking enrichments are better for CXCR4 than CCR5 (better CXCR4 homology model)
• But shape-based scoring generally gives better enrichments overall...
Conclusions & Future Prospects
• Protein Docking (“Hex”):
• Novel, fast, & fairly accurate docking algorithm
• Multi-dimensional FFT gives good speed-up, especially 3D
• Polar Fourier FFT maps v. well to GPU, with v. good speed-up
• Main challenge is now scoring & flexibility, not search...
• Small-Molecule Applications:
• SH shape-matching is at least as good as ROCS and v. fast...
• The Future?
• Extensible to CoMFA/QSAR & ligand docking...?
• High throughput 2D/3D database screening now feasible...?
Acknowledgments
ANR 2009-2010
BBSRC 1996-2000, 2006-07
EPSRC 2002-06
Tim Clark, Brian Hudson & Vishwesh Venkatraman
Sandor Vajda & Dima Kozakov
Lazaros Mavridis & Violeta Perez-Nueno
Software & Preprints: http://www.loria.fr/~ritchied/
PSFB special issue: Third CAPRI Evaluation Meeting Dec 2007 (Google: Proteins Wiley)
Review: DW Ritchie (2008) Curr. Prot. Pep. Sci. 9(1) 1–15.
Extra Slides
Using Low Resolution Docking to Cross-Validate Predicted PPIs?
Low resolution docking of Trypsin + BPTI
a) Crystal structure + low res FFT
• Gold: BPTI location in crystal
• Red: centroid of calculated BPTI solutions
b) Model-built structure (green) + low res FFT
Figure from: Tovchigrechko et al., Prot. Sci. (2002) 11 1888–1896
Multi-Sample Docking for Very Large Molecules - Antibody-VP2