approximation of protein structure for fast similarity measures

Approximation of Protein Structure for Fast Similarity Measures Itay Lotan Fabian Schwarzer


Post on 04-Jan-2016



TRANSCRIPT

Page 1: Approximation of Protein Structure for Fast Similarity Measures

Approximation of Protein Structure for Fast Similarity Measures

Itay Lotan, Fabian Schwarzer

Page 2: Approximation of Protein Structure for Fast Similarity Measures

Comparing Protein Structures

Same protein: [structure image] vs. [structure image]

Analysis of MDS and MCS trajectories (http://folding.stanford.edu)

Structure prediction applications
• Evaluating decoy sets
• Clustering predictions (Shortle et al, Biophysics ’98)

Graph-based Methods
• Stochastic Roadmap Simulation (Apaydin et al, RECOMB ’02)

Page 3: Approximation of Protein Structure for Fast Similarity Measures

k Nearest-Neighbors Problem

Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c.

Can be done in $O(N(\log k + L))$ time

N – size of S

L – time to compare two conformations
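The brute-force query described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: `k_nearest` and `euclidean` are hypothetical names, and plain Euclidean distance stands in for whichever similarity measure (cost L per comparison) is actually used. Each conformation is compared once with the query, and a bounded heap of size k keeps the best candidates.

```python
import heapq
import math


def euclidean(a, b):
    """Stand-in similarity measure (assumption): Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(conformations, query, k, distance=euclidean):
    """Return the indices of the k conformations most similar to `query`.

    One distance evaluation per conformation, plus heap bookkeeping
    that costs O(log k) per candidate.
    """
    scored = ((distance(c, query), i) for i, c in enumerate(conformations))
    return [i for _, i in heapq.nsmallest(k, scored)]


# Toy 2-D "conformations"; real ones would be vectors of 3n coordinates.
S = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.5), (9.0, 9.0)]
nearest = k_nearest(S, (0.0, 0.1), k=2)
```

Running the same query for every conformation in S multiplies the cost by N, which is what motivates the rest of the talk.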

Page 4: Approximation of Protein Structure for Fast Similarity Measures

k Nearest-Neighbors Problem

What if needed for all c in S?

$O(N^2(\log k + L))$ - too much time

Can be improved by:

1. Reducing L

2. A more efficient algorithm

Page 5: Approximation of Protein Structure for Fast Similarity Measures

Our Solution

• Reduce structure description → approximate but fast similarity measures

• Reduce description further → efficient nearest-neighbor algorithms can be used

Page 6: Approximation of Protein Structure for Fast Similarity Measures

Description of a Protein’s Structure

3n coordinates of Cα atoms (n – Number of residues)

Page 7: Approximation of Protein Structure for Fast Similarity Measures

m-Averaged Approximation

• Cut chain into pieces of length m
• Replace each sequence of m Cα atoms by its centroid

3n coordinates → 3n/m coordinates
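A minimal sketch of m-averaging, assuming a conformation is given as a list of (x, y, z) Cα positions. The function name `m_average` is hypothetical, and averaging a shorter trailing piece on its own is an assumption; the slides do not say how leftovers are handled when m does not divide n.

```python
def m_average(coords, m):
    """Replace each run of m consecutive points by its centroid.

    coords: list of (x, y, z) tuples, one per C-alpha atom.
    A trailing run shorter than m is averaged on its own (assumption).
    """
    averaged = []
    for start in range(0, len(coords), m):
        piece = coords[start:start + m]
        # zip(*piece) groups all x values, all y values, all z values
        averaged.append(tuple(sum(axis) / len(piece) for axis in zip(*piece)))
    return averaged


# A straight 8-atom chain reduced with m = 4 leaves 2 points.
chain = [(float(i), 0.0, 0.0) for i in range(8)]
reduced = m_average(chain, 4)
```

Any similarity measure can then be applied to the reduced chain unchanged, only on n/m points instead of n.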

Page 8: Approximation of Protein Structure for Fast Similarity Measures

Similarity Measures - cRMS

The RMS of the distances between corresponding atoms after the two conformations are optimally aligned

$cRMS(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \| p_i - T q_i \|_2^2}$

(T ranges over rigid-body transformations)

Computed in O(n) time

Page 9: Approximation of Protein Structure for Fast Similarity Measures

Similarity Measures - dRMS

The Euclidean distance between the intra-molecular distances matrices of the two conformations

$dRMS(P, Q) = \sqrt{\frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left( d^{P}_{ij} - d^{Q}_{ij} \right)^2}$

Computed in $O(n^2)$ time
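The definition on this slide translates directly into a double loop over atom pairs. The sketch below assumes conformations are equal-length lists of (x, y, z) points; the name `drms` is illustrative. No alignment step is needed, because intra-molecular distances do not change under rigid motions.

```python
import math


def drms(P, Q):
    """dRMS between two conformations given as equal-length point lists."""
    n = len(P)
    if n != len(Q) or n < 2:
        raise ValueError("need two conformations with the same n >= 2 points")
    total = 0.0
    for i in range(1, n):
        for j in range(i):
            # difference between corresponding intra-molecular distances
            diff = math.dist(P[i], P[j]) - math.dist(Q[i], Q[j])
            total += diff * diff
    return math.sqrt(2.0 * total / (n * (n - 1)))


straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
rotated = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

same_shape = drms(straight, rotated)  # rigid motion only
different = drms(straight, bent)      # one of three distances changes
```

The nested loop visits n(n-1)/2 pairs, which is where the quadratic cost comes from.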

Page 10: Approximation of Protein Structure for Fast Similarity Measures

Evaluation: Test Sets

8 structurally diverse proteins of size 54 - 76 residues:

1. Decoy sets: conformations from the Park-Levitt set (Park et al, JMB ’97), N = 10,000

2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins ’00), N = 5,000

Page 11: Approximation of Protein Structure for Fast Similarity Measures

Decoy Sets Correlation

m     cRMS           dRMS
3     0.99           0.96 – 0.98
4     0.98 – 0.99    0.94 – 0.97
6     0.92 – 0.99    0.78 – 0.93
9     0.81 – 0.98    0.65 – 0.96
12    0.54 – 0.92    0.52 – 0.69

Higher correlation for random sets!

Page 12: Approximation of Protein Structure for Fast Similarity Measures

Speed-up for Decoy Sets

• 9x for cRMS (m = 9)
• 36x for dRMS (m = 6)

with very small error

For random sets the speed-up for dRMS goes up to 81x (m = 9)
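These factors match what the asymptotic costs predict, assuming the running time is dominated by the linear term for cRMS and the quadratic term for dRMS. The symbol T below denotes running time as a function of the number of points and is introduced here only for illustration.

```latex
% cRMS is linear in the number of points: n -> n/m gives a factor of m
\frac{T_{cRMS}(n)}{T_{cRMS}(n/m)} \approx \frac{n}{n/m} = m
\qquad (m = 9 \;\Rightarrow\; 9\times)

% dRMS is quadratic in the number of points: n -> n/m gives a factor of m^2
\frac{T_{dRMS}(n)}{T_{dRMS}(n/m)} \approx \frac{n^2}{(n/m)^2} = m^2
\qquad (m = 6 \;\Rightarrow\; 36\times, \quad m = 9 \;\Rightarrow\; 81\times)
```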

Page 13: Approximation of Protein Structure for Fast Similarity Measures

Efficient Nearest-Neighbor Algorithms

There are efficient nearest-neighbor algorithms, but they are not compatible with these similarity measures:

cRMS is not a Euclidean metric

dRMS uses a space of dimensionality n(n-1)/2

Page 14: Approximation of Protein Structure for Fast Similarity Measures

Further Dimensionality Reduction of dRMS

• kd-trees require dimension ≤ 20
• m-averaging with dRMS is not enough

Reduce further using SVD

SVD: A tool for principal component analysis. Computes directions of greatest variance.

Page 15: Approximation of Protein Structure for Fast Similarity Measures

Reduction Using SVD

1. Stack m-averaged distance matrices as vectors

2. Compute the SVD of entire set

3. Project onto most important singular vectors

dRMS is thus reduced to ≤ 20 dimensions

Without m-averaging SVD can be too costly
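The three steps can be written out as follows. This is a sketch under stated assumptions: n' = n/m is the number of averaged points, the value n = 64 is an example chosen from the 54 to 76 residue range of the test proteins, and whether the rows are mean-centered before the SVD is not stated in the slides.

```latex
% Each m-averaged conformation becomes a vector of D = n'(n'-1)/2 distances.
% Example (assumed): n = 64, m = 4  =>  n' = 16,  D = 16 \cdot 15 / 2 = 120
x_P \in \mathbb{R}^{D}, \qquad
dRMS(P, Q) = \frac{1}{\sqrt{D}} \, \| x_P - x_Q \|_2

% Stack the N vectors as rows, factor, and keep the first d right singular vectors
X = \begin{bmatrix} x_1^{\top} \\ \vdots \\ x_N^{\top} \end{bmatrix}
  = U \Sigma V^{\top}, \qquad
y_i = V_d^{\top} x_i \in \mathbb{R}^{d}

% Projection onto orthonormal directions never increases a distance
\| y_i - y_j \|_2 \le \| x_i - x_j \|_2
```

With D = 120 still far above 20, the example shows why m-averaging alone is not enough for kd-trees, and the last inequality shows that the reduced distance can only underestimate the m-averaged dRMS.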

Page 16: Approximation of Protein Structure for Fast Similarity Measures

Testing the Method

• Use decoy sets (N = 10,000) and random sets (N = 5,000)
• m-averaging with m = 4
• Project onto 16 PCs for decoys, 12 PCs for random sets
• Find k = 10, 25, 100 NNs for 250 conformations in each set

Page 17: Approximation of Protein Structure for Fast Similarity Measures

Results

Decoy sets:
• ~77% correct
• Furthest approximate NN off by 10% - 15%
• ~4k approximate NNs contain all true k NNs

Random sets:
• 71%, 76%, 84% correct respectively
• Furthest approximate NN off by 5% - 10%
• ~3k approximate NNs contain all true k NNs

Page 18: Approximation of Protein Structure for Fast Similarity Measures

More Results: N = 100,000

1CTF decoys:
• ~70% correct
• Furthest approximate NN off by ~20%
• ~6k approximate NNs contain all true k NNs

1CTF random:
• 46%, 48%, 60% correct respectively
• Furthest approximate NN off by ~16%
• ~7k approximate NNs contain all true k NNs

Page 19: Approximation of Protein Structure for Fast Similarity Measures

Running Time

N = 100,000, m = 4, PC = 16

Find k = 100 for each conformation

• Brute-force: ~84 hours
• Brute-force + m-averaging: ~4.8 hours
• Brute-force + m-averaging + SVD: 41 minutes
• kd-tree + m-averaging + SVD: 19 minutes

kd-trees will have more impact for larger sets
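For readers unfamiliar with kd-trees, here is a small pure-Python sketch of k-nearest-neighbor search in a kd-tree over low-dimensional Euclidean vectors, such as the SVD-reduced vectors. It is illustrative only and is not the implementation timed on this slide; `build_kdtree` and `knn` are hypothetical names.

```python
import heapq


def build_kdtree(points, indices=None, depth=0):
    """Build a kd-tree as nested tuples (index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return (
        indices[mid],
        axis,
        build_kdtree(points, indices[:mid], depth + 1),
        build_kdtree(points, indices[mid + 1:], depth + 1),
    )


def knn(tree, points, query, k):
    """Return indices of the k points nearest to `query`, closest first."""
    heap = []  # max-heap via negated squared distances, at most k entries

    def visit(node):
        if node is None:
            return
        index, axis, left, right = node
        d2 = sum((a - b) ** 2 for a, b in zip(points[index], query))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, index))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, index))
        gap = query[axis] - points[index][axis]
        near, far = (left, right) if gap < 0 else (right, left)
        visit(near)
        # Cross the splitting plane only if it is closer than the current
        # k-th best distance, or if fewer than k candidates were found.
        if len(heap) < k or gap * gap < -heap[0][0]:
            visit(far)

    visit(tree)
    return [i for _, i in sorted((-neg, i) for neg, i in heap)]


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
result = knn(tree, points, (9, 2), k=2)
```

The pruning test is what saves work: whole subtrees are skipped when they cannot contain a closer point, and this pruning becomes less effective as the dimension grows.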

Page 20: Approximation of Protein Structure for Fast Similarity Measures

Structural Classification

Computing the similarity between structures of two different proteins is more involved:

The correspondence problem:

Which parts of the two structures should be compared?

1IRD vs. 2MM1

Page 21: Approximation of Protein Structure for Fast Similarity Measures

STRUCTAL (Subbiah et al, ’93)

1. Compute optimal correspondence using dynamic programming

2. Optimally align the corresponding parts in space to minimize cRMS

3. Repeat until convergence

$O(n_1 n_2)$ time
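Step 1 can be illustrated with a generic global-alignment dynamic program. This sketch is not STRUCTAL itself: the reward function `distance_score` and the gap penalty are hypothetical choices made for the example, and steps 2 and 3 (superposition and iteration) are omitted.

```python
import math


def distance_score(p, q):
    """Hypothetical match reward: 1 for coincident points, less when far apart."""
    return 1.0 / (1.0 + math.dist(p, q) ** 2)


def dp_correspondence(chain1, chain2, gap_penalty=0.1):
    """Return matched (i, j) index pairs maximizing total reward minus gaps."""
    n1, n2 = len(chain1), len(chain2)
    score = [[distance_score(p, q) for q in chain2] for p in chain1]
    # best[i][j]: best total for the first i points of chain1 and j of chain2
    best = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        best[i][0] = best[i - 1][0] - gap_penalty
    for j in range(1, n2 + 1):
        best[0][j] = best[0][j - 1] - gap_penalty
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # match i with j
                best[i - 1][j] - gap_penalty,              # skip point i
                best[i][j - 1] - gap_penalty,              # skip point j
            )
    # Trace back from the corner to recover the matched pairs
    pairs = []
    i, j = n1, n2
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


chain1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
chain2 = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
pairs = dp_correspondence(chain1, chain2)
```

Filling the table touches every (i, j) cell once, which gives the quadratic cost in the two chain lengths; m-averaging shrinks both lengths by a factor of m.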

Page 22: Approximation of Protein Structure for Fast Similarity Measures

STRUCTAL + m-averaging

• 256 protein domains (180 – 420 res)
• 3691 good matches (Sandelin’s PROTOFARM)
• 6375 random pairs
• Compute SAS scores (cRMS/length*100)

m    correlation    speed-up
3    0.81           ~9x
4    0.77           ~16x
5    0.70           ~25x

Page 23: Approximation of Protein Structure for Fast Similarity Measures

[Figure: number of pairs vs. SAS score, for OK (P < 0.005) and BAD (P > 0.005) pairs]

Page 24: Approximation of Protein Structure for Fast Similarity Measures
Page 25: Approximation of Protein Structure for Fast Similarity Measures

Misclassification

       P ≤ 0.005                    P ≤ 0.001
m      Total         FP     FN      Total         FP     FN
1      410 (4%)      176    225     457 (4.5%)    215    242
3      802 (8%)      274    528     660 (6.5%)    262    398
4      993 (10%)     248    745     870 (8.5%)    282    588
5      1602 (16%)    186    1416    1129 (11%)    148    981

Page 26: Approximation of Protein Structure for Fast Similarity Measures

Random Chains

[Figure: a random chain with vertices c0, c1, c2, ..., cn-1]

• The dimensions are uncorrelated
• Average behavior can be approximated by normal variables:

$c_{i+1} = c_i + l \cdot N(0,1)$

Page 27: Approximation of Protein Structure for Fast Similarity Measures

1-D Haar Wavelet Transform

Recursive averaging and differencing of the values

Level   Averages               Detail Coefficients
3       [ 9 7 2 6 5 1 4 6 ]
2       [ 8 4 3 5 ]            [ 1 -2 2 -1 ]
1       [ 6 4 ]                [ 2 -1 ]
0       [ 5 ]                  [ 1 ]

[ 9 7 2 6 5 1 4 6 ] → [ 5 1 2 -1 1 -2 2 -1 ]
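The recursive averaging and differencing can be sketched in a few lines. The convention assumed here is average = (a + b) / 2 and detail = (a - b) / 2 for each adjacent pair, which reproduces the level-3 row of the example; other normalizations of the Haar transform exist. The name `haar_transform` is illustrative.

```python
def haar_transform(values):
    """1-D Haar transform by recursive pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    The length of `values` must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = list(values)
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        # coarser levels are prepended so they end up first in the output
        details = [(a - b) / 2 for a, b in pairs] + details
        current = [(a + b) / 2 for a, b in pairs]
    return current + details


coefficients = haar_transform([9, 7, 2, 6, 5, 1, 4, 6])
```

For the example signal this yields the average 5 followed by the details 1, then 2 and -1, then 1, -2, 2 and -1.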

Page 28: Approximation of Protein Structure for Fast Similarity Measures

Transform of Random Chains

pdf of the detail coefficients is:

$d_j \sim N(0, O(4^j))$

Coefficients expected to be ordered!

Discard coefficients starting at lowest level

m-averaging ⇔ discarding the lowest log m levels of detail coefficients
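The link between m-averaging and discarding detail coefficients can be checked on a small signal. The sketch assumes m is a power of two, uses the average/difference convention detail = (a - b) / 2, and takes as input the Haar coefficients of the signal [9 7 2 6 5 1 4 6] under that convention. All function names are hypothetical.

```python
def inverse_haar(coeffs):
    """Rebuild a signal from [average, coarsest detail, ..., finest details]."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(details)
        # each (average, detail) pair expands to (avg + d, avg - d)
        current = [v for avg, d in zip(current, details) for v in (avg + d, avg - d)]
    return current


def drop_finest_levels(coeffs, levels):
    """Zero out the `levels` finest levels of detail coefficients."""
    keep = len(coeffs) >> levels
    return list(coeffs[:keep]) + [0] * (len(coeffs) - keep)


def block_averages(values, m):
    """m-averaging of a 1-D signal whose length is a multiple of m."""
    return [sum(values[i:i + m]) / m for i in range(0, len(values), m)]


signal = [9, 7, 2, 6, 5, 1, 4, 6]
coeffs = [5, 1, 2, -1, 1, -2, 2, -1]  # Haar coefficients of `signal`

# m = 4 corresponds to dropping log2(4) = 2 levels of detail coefficients
smoothed = inverse_haar(drop_finest_levels(coeffs, 2))
averaged = block_averages(signal, 4)
```

Here `smoothed` is piecewise constant and takes, on each block of 4 values, exactly the corresponding entry of `averaged`, so both reductions keep the same information.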

Page 29: Approximation of Protein Structure for Fast Similarity Measures

Random Chains and Proteins

Protein backbones behave on average like random chains
• Chain topology
• Limited compactness

Page 30: Approximation of Protein Structure for Fast Similarity Measures
Page 31: Approximation of Protein Structure for Fast Similarity Measures

Conclusion

• Fast computation of similarity measures
• Trade-off between speed and precision
• Exploits chain topology of proteins and limited compactness
• Allows use of efficient nearest-neighbor algorithms
• Can be used as filter when precision is important