DDPIn Distance and Density Based Protein Indexing
David Hoksza
Charles University in Prague, Department of Software Engineering
Czech Republic
CIBCB 2009 2
Presentation Outline
Biological background
Similarity search in protein structure databases
DDPIn
  feature vector extraction
  metrics
  querying
    one-step approach
    multi-step approach
Experimental results
Conclusion
Biological Background
Proteins
  molecules translated from mRNA in ribosomes
  DNA → RNA → protein
  sequence of amino acids (20 AAs), each coded by a codon (triplet of nucleotides)
Function of a protein is derived from its three-dimensional structure
  → similar proteins have similar functions
  similar proteins have a common ancestor
Identifying protein structure → finding similar proteins → getting a clue to the function
Similarity Search in Protein Databases
Similarity between a pair of proteins
  alignment + similarity score
    RMSD, TM-score, …
  visual inspection
  DALI, CE, SAP, VAST, …
Classification
  SCOP (Structural Classification of Proteins)
  no need for an alignment
  indexing various features
  PSI, PSIST, ProGreSS, CTSS, …, DDPIn
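As an illustration of the similarity scores named on this slide, RMSD can be sketched as below. This is a minimal sketch, not part of DDPIn: it assumes the two structures are already aligned residue by residue and superposed in space, and the function name `rmsd` and the coordinate format (lists of (x, y, z) tuples) are choices made for this example.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the points are already paired (aligned) and superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

The need for a residue-level pairing before such a score can be computed is what the indexing methods in the second group avoid.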
DDPIn - Overview
Distance and Density based Protein Indexing
Classification method
Indexing of protein features
  distances among Cα atoms used
  each AA represents a feature → protein p consists of |p| features
  various semantics used
    based on clustering Cα atoms into rings
  metric indexing employed (M-tree)
kNN querying
  outcomes of several searches are merged to obtain final results
DDPIn - Feature Extraction
Features
  n-dimensional vectors of real numbers
  AA ≈ viewpoint → VPT (viewpoint tag)
sDens
  density of AAs in rings with a predefined width
sDensSSE
  enhanced with SSE information
sRad
  widths of rings containing a predefined percentage of AAs
sRadSSE
  enhanced with SSE information
sDir
  number of AAs in a ring pointing from the viewpoint
  sDens enhanced with direction information
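The sDens and sRad semantics above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the slides do not say how density is normalised, so `s_dens` simply counts Cα atoms per ring and ignores atoms beyond the outermost ring; `s_rad` assumes each ring holds the same fraction of the remaining atoms. All function and parameter names are hypothetical.

```python
import math

def distances_from(viewpoint, ca_coords):
    """Euclidean distances from the viewpoint's C-alpha atom to all other C-alpha atoms."""
    origin = ca_coords[viewpoint]
    return [math.dist(origin, c) for i, c in enumerate(ca_coords) if i != viewpoint]

def s_dens(viewpoint, ca_coords, n_rings, ring_width):
    """Assumed sDens: number of atoms falling into each ring of fixed width."""
    counts = [0] * n_rings
    for d in distances_from(viewpoint, ca_coords):
        ring = int(d // ring_width)
        if ring < n_rings:          # assumption: atoms past the last ring are ignored
            counts[ring] += 1
    return counts

def s_rad(viewpoint, ca_coords, n_rings, fraction):
    """Assumed sRad: widths of consecutive rings, each holding `fraction` of the atoms."""
    dists = sorted(distances_from(viewpoint, ca_coords))
    if not dists:
        return [0.0] * n_rings
    per_ring = max(1, round(fraction * len(dists)))
    widths, inner = [], 0.0
    for ring in range(n_rings):
        # Index of the farthest atom that still belongs to this ring.
        last = min((ring + 1) * per_ring, len(dists)) - 1
        outer = dists[last]
        widths.append(outer - inner)
        inner = outer
    return widths

def protein_vpts(ca_coords, n_rings, ring_width):
    """One viewpoint tag per residue, so a protein p yields |p| feature vectors."""
    return [s_dens(i, ca_coords, n_rings, ring_width) for i in range(len(ca_coords))]
```

With `n_rings` as the vector dimension, each residue becomes one n-dimensional point that can be stored in a metric index.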
DDPIn - Similarity of VPTs
Metrics
  L2
    $d(x, y) = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$
  weighted L2
    $d(x, y) = \sqrt{\sum_{i=1}^{n} w_i \, |x_i - y_i|^2}$
    close neighborhood of VPs is more important
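The two metrics on this slide, L2 and weighted L2, can be written out as a short sketch. The slide only says that the close neighborhood of a viewpoint is more important; the geometric weighting in `decreasing_weights` is a hypothetical choice made for this example, not the weighting used by the authors.

```python
import math

def l2(x, y):
    """Plain Euclidean (L2) distance between two viewpoint tags."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_l2(x, y, w):
    """Weighted L2: coordinate i contributes w[i] * (x[i] - y[i])**2."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

def decreasing_weights(n, decay=0.5):
    """Hypothetical weights: ring i gets decay**i, so inner rings count more."""
    return [decay ** i for i in range(n)]
```

As long as all weights are strictly positive, weighted L2 is still a metric (it equals plain L2 after scaling coordinate i by the square root of its weight), which is what a metric index such as the M-tree requires.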
DDPIn – Indexing Structure
M-tree (Metric tree)
  Dynamic, hierarchical indexing structure
  Data space divided into ball-shaped data regions (hyper-spheres)
    root node represents data region covering all data
    children nodes represent regions covering parts of the space, …
    data regions form balanced hierarchical structure
  inner nodes → routing entries
    $rout_l(O_i) = [O_i, r_{O_i}, d(O_i, par(O_i)), ptr(T(O_i))]$
  leaf nodes → ground entries
    $grnd(O_i) = [O_i, d(O_i, par(O_i))]$
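The routing and ground entries above can be sketched as two small record types, together with a range query that prunes whole subtrees using the covering radius. This is a simplified sketch, not a full M-tree: the tree is built by hand rather than by dynamic insertion, the stored parent distances are not used for further pruning, and all class, field and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroundEntry:
    obj: tuple                       # indexed object O_i (here a feature vector)
    dist_to_parent: Optional[float]  # d(O_i, par(O_i))

@dataclass
class RoutingEntry:
    obj: tuple                       # routing object O_i, centre of the ball region
    radius: float                    # covering radius of the region
    dist_to_parent: Optional[float]  # d(O_i, par(O_i)); None at the root
    subtree: list                    # ptr(T(O_i)): entries of the child node

def range_query(node, query, radius, dist=math.dist):
    """Return all indexed objects within `radius` of `query`."""
    hits = []
    for entry in node:
        d = dist(query, entry.obj)
        if isinstance(entry, RoutingEntry):
            # By the triangle inequality the region can contain a hit only if
            # the query ball and the region ball overlap; otherwise skip it.
            if d <= radius + entry.radius:
                hits.extend(range_query(entry.subtree, query, radius, dist))
        elif d <= radius:
            hits.append(entry.obj)
    return hits

# A hand-built two-level tree with two ball regions.
leaf_a = [GroundEntry((0.0, 0.0), 0.0), GroundEntry((1.0, 0.0), 1.0)]
leaf_b = [GroundEntry((10.0, 0.0), 0.0), GroundEntry((10.0, 2.0), 2.0)]
root = [
    RoutingEntry((0.0, 0.0), 1.0, None, leaf_a),
    RoutingEntry((10.0, 0.0), 2.0, None, leaf_b),
]

print(range_query(root, (0.5, 0.0), 1.0))
```

In this example the second region is never descended into, so the distances to its two ground entries are not computed.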
Querying / Classification
One-step
  extracting VPTs from query → n queries
  ranking scheme
Two-step
  healing
  reclassification with Smith-Waterman algorithm on sequences
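The one-step approach (one kNN query per viewpoint tag of the query protein, then merging the results) might look like the sketch below. The slides do not spell out the ranking scheme, so the rank-weighted vote here is a hypothetical stand-in, and the brute-force `knn` replaces the M-tree search purely to keep the example self-contained. The labels stand for superfamily identifiers.

```python
import math
from collections import Counter

def knn(query_vpt, database, k, dist=math.dist):
    """Brute-force stand-in for an M-tree kNN query.

    `database` is a list of (vpt, label) pairs.
    """
    return sorted(database, key=lambda item: dist(query_vpt, item[0]))[:k]

def classify(query_vpts, database, k):
    """Run one kNN query per query VPT and merge the answers by voting."""
    votes = Counter()
    for vpt in query_vpts:
        for rank, (_, label) in enumerate(knn(vpt, database, k)):
            # Hypothetical ranking scheme: nearer neighbours get more weight.
            votes[label] += k - rank
    return votes.most_common(1)[0][0]

# Toy database of 2-dimensional VPTs from two made-up superfamilies.
database = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((5, 6), "B")]
query_vpts = [(0, 0.2), (0.1, 0.9), (5, 5.4)]
print(classify(query_vpts, database, k=2))
```

In the two-step variant, the label produced this way would then be re-examined by a sequence alignment step rather than accepted directly.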
Experimental Results
SCOP 1.65 dataset
  class → fold → superfamily → family
  1810 proteins
  181 superfamilies
    at least 10 proteins each
  all α, all β, α + β and α/β classes
  query set
    reduced - 181 queries
    full
  used also by PSI, ProGreSS, PSIST methods
Testing of
  superfamily classification accuracy
  fold classification accuracy
Finding Optimal k for kNN Queries
Accuracy of VPT Semantics
Accuracy for Increasing Dimension
Accuracy of Various Metrics
Suitability of Pairs of VPT Semantics for Healing
identical correct classification
identical wrong classification
Comparison of Classification Methods
Conclusion
We have proposed a new representation of protein structures
  distance and density of Cα atoms
  ranking scheme
  two-step classification
We implemented
  M-tree indexing for the proposed representation
  classification against SCOP
Experimental results
  best results among methods using identical classification
  98.9% superfamily classification accuracy
  100% fold classification accuracy
  comparable run time