indexing structures for biomolecular structures
CPM 2006
Geometric Suffix Tree: A New Index Structure for Protein 3-D Structures
Tetsuo Shibuya
Human Genome Center, Institute of Medical Science, University of Tokyo
Today's Talk
Backgrounds
• Protein structures
• Suffix trees
Geometric suffix tree
• Generalization of suffix trees for indexing protein structures
Experiments
Conclusions
Protein Structure
Protein
• A chain molecule consisting of 20 kinds of amino acids
• Folded into some structure
3-D structure
• Coordinates of Cα atoms (backbone)
• Cα atom: the representative atom of an amino acid
[Figure: protein chain; residue labels A, V, L, W, K, E]
Backgrounds
Structurally similar proteins
• tend to have similar functions
• even if they are not similar at the residue level
Structural search on a protein structure database
• Functional analysis for proteins with newly solved structures
• Increasing database size (PDB: over 35,000 entries)
→ A sophisticated index structure is desired!
[Figure: a query protein structure is searched against a protein structure database containing structures A, B, C; a match is reported with "It's similar!"]
Suffix Tree [Weiner '73]
A sophisticated index structure for strings
• Compacted trie of all the suffixes of a string S
• Each leaf corresponds to a suffix of S
• Enables efficient substring search
[Figure: suffix tree of 'mississippi$', annotated with the substring 'issi']
All the suffixes: mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$
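As a rough illustration of how indexing all suffixes enables substring search, here is a minimal Python sketch. It builds an uncompacted suffix trie (quadratic size), not the linear-time compacted suffix tree of the slide, and the names `build_suffix_trie` and `find_all` are hypothetical helpers chosen for this example.

```python
def build_suffix_trie(text):
    """Uncompacted trie of all suffixes of text.

    Each node is a dict mapping a character to a child node. The special
    key None holds the start positions of every suffix passing through
    the node. This takes O(n^2) space, unlike a compacted suffix tree.
    """
    root = {None: []}
    for start in range(len(text)):
        node = root
        node[None].append(start)
        for ch in text[start:]:
            node = node.setdefault(ch, {None: []})
            node[None].append(start)
    return root


def find_all(trie, pattern):
    """Return the 0-based start positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return node[None]


trie = build_suffix_trie("mississippi$")
print(find_all(trie, "issi"))  # [1, 4]
print(find_all(trie, "ssi"))   # [2, 5]
```

A query only walks down one path of length equal to the pattern, which is the property the talk wants to carry over to 3-D structures.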
Suffix Tree Features
Linear-time construction
Various pattern matching applications
• Motif finding
• Repeat finding
• Large-scale alignment
• etc.
Good! ...But they are not for structures...
Today's Topic
Extend suffix trees for protein structures
"Geometric Suffix Tree" based on the RMSD measure
Related Work
Suffix trees for protein 3-D structure
PSIST [Gao et al. '05]
• Converts structures into alphabetical strings
• Does not support RMSD queries
How to compare two proteins?
RMSD: Root Mean Square Deviation
• The most famous measure for protein structure comparison
$$\mathrm{RMSD}(A, B) = \min_{R, v} \sqrt{\sum_{i=1}^{n} \left| a_i - R \cdot (b_i - v) \right|^2 / n}$$
[Figure: two chains with atoms a1, ..., a5 and b1, ..., b5]
The correspondence of atoms is given.
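To make the quantity inside the minimization concrete, the sketch below evaluates the root mean square deviation for one candidate rigid motion. The function name `deviation_for_motion` is hypothetical, and applying the translation v before the rotation R is one reading of the slide's formula, so treat that convention as an assumption; any consistent convention leads to the same minimum.

```python
import math


def deviation_for_motion(A, B, R, v):
    """Root mean square deviation between A and the moved copy of B.

    A, B: equal-length lists of 3-D points with a given correspondence.
    R: 3x3 rotation matrix (list of rows); v: translation vector.
    Assumed convention: each b is moved to R * (b - v).
    """
    total = 0.0
    for a, b in zip(A, B):
        shifted = [b[k] - v[k] for k in range(3)]
        moved = [sum(R[r][c] * shifted[c] for c in range(3)) for r in range(3)]
        total += sum((a[r] - moved[r]) ** 2 for r in range(3))
    return math.sqrt(total / len(A))


A = [(0, 0, 0), (1, 0, 0)]
B = [(1, 1, 0), (1, 2, 0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
quarter_turn = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]  # rotation about the z-axis

print(deviation_for_motion(A, B, identity, (0, 0, 0)))      # no motion applied
print(deviation_for_motion(A, B, quarter_turn, (1, 1, 0)))  # exact superposition
```

The RMSD of the slide is the smallest value this function can take over all rotations R and translations v.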
Problem
Given
• a substructure of a protein as a query
• a structure DB
Find all the similar substructures
• i.e. RMSD ≤ some given bound d
• 'All' means no false negatives/positives
[Figure: a query (a protein substructure) is searched against a protein structure database containing A, B, C; a hit is reported as similar, i.e. RMSD ≤ d]
Geometric Trie
Tree that represents multiple structures
Similar prefix substructures are compacted into an edge
• i.e. if sqrt(l)·RMSD ≤ b (l: length of the prefix, b: given bound)
• RMSD(P_i, Q_i) is not always ≤ RMSD(P_{i+1}, Q_{i+1}), where P_i and Q_i are the prefixes of length i
• But sqrt(i)·RMSD(P_i, Q_i) is always ≤ sqrt(i+1)·RMSD(P_{i+1}, Q_{i+1})
Edge information - O(1) size!
• i, j: start / end indices (+ sequence #)
• R, v: the rotation matrix and the translation vector
[Figure: structures P = p1...p11 and Q = q1...q11 share the edge P[1..7] (R1, v1) because their RMSD ≤ b/sqrt(7); the trie then branches into P[8..11] (R1, v1) and Q[8..11] (R2, v2). P[1..7] followed by Q[8..11] is similar to Q[1..11].]
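One way to see why the scaled value never decreases as the prefix grows: write P_l and Q_l for the prefixes of length l, and let (R*, v*) be a rigid motion that is optimal for the prefixes of length l+1. This is a short derivation under the RMSD definition used in the talk, with the translation applied before the rotation as an assumed convention.

```latex
\begin{aligned}
l \cdot \mathrm{RMSD}(P_l, Q_l)^2
  &= \min_{R, v} \sum_{i=1}^{l} \lvert p_i - R \cdot (q_i - v) \rvert^2 \\
  &\le \sum_{i=1}^{l} \lvert p_i - R^{*} \cdot (q_i - v^{*}) \rvert^2 \\
  &\le \sum_{i=1}^{l+1} \lvert p_i - R^{*} \cdot (q_i - v^{*}) \rvert^2
   = (l+1) \cdot \mathrm{RMSD}(P_{l+1}, Q_{l+1})^2
\end{aligned}
```

Taking square roots gives the inequality. The first step holds because (R*, v*) is only one candidate for the shorter prefixes, and the second because the added term is non-negative. The plain RMSD divides by the length, which is why it can go down when a well-fitting atom is appended.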
Geometric Suffix Tree
Geometric trie over all the suffix substructures
O(n) space (though there are O(n^2) substructures)
But how to compute the RMSD incrementally?
[Figure: structures p1...p11 and q1...q11; given the RMSD, R, and v between p[1..6] and q[1..6], compute the RMSD, R, and v between p[1..7] and q[1..7] incrementally]
RMSD Computation
Translation optimization
• Independent of rotation optimization
• Translate so that the two centroids come to the same position
Rotation optimization
• Rotate after translation
• Minimize over R, where p_i and q_i are the translated vectors:
$$\sum_{i=1}^{n} \left| p_i - R \cdot q_i \right|^2$$
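The translation step can be sketched in a few lines of Python: each structure is shifted so that its centroid sits at the origin, which makes the two centroids coincide. The helper names `centroid` and `center` are hypothetical, chosen only for this illustration.

```python
def centroid(points):
    """Average position of a list of 3-D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def center(points):
    """Translate the points so that their centroid is at the origin."""
    c = centroid(points)
    return [tuple(p[k] - c[k] for k in range(3)) for p in points]


P = [(0, 0, 0), (2, 0, 0), (1, 3, 0)]
print(centroid(P))  # (1.0, 1.0, 0.0)
print(center(P))    # [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 2.0, 0.0)]
```

After both structures are centered this way, only the rotation remains to be optimized.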
Optimal Rotation for Minimizing RMSD [Arun et al. '87, Schwartz et al. '87]
Problem: minimize over R
$$\sum_{i=1}^{n} \left| p_i - R \cdot q_i \right|^2$$
Solution by SVD (Singular Value Decomposition)
$$R = V U^T \quad \text{where } U \Sigma V^T \text{ is the SVD of } H = \sum_{i=1}^{n} q_i p_i^T \text{ (a 3x3 matrix)}$$
• Computation time: O(n)
• Post-processing is required in some rare degenerate cases
Incremental RMSD Computation
Theorem: The RMSD, R, and v can be computed in constant time if we are given the following values, which can be computed incrementally!
$$\sum_{i=1}^{n} q_i^T q_i, \quad \sum_{i=1}^{n} q_i p_i^T, \quad \sum_{i=1}^{n} p_i^T p_i, \quad \sum_{i=1}^{n} q_i, \quad \sum_{i=1}^{n} p_i$$
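The following Python sketch illustrates the idea of the theorem for the RMSD value only (it does not return R or v). It keeps the five running sums, updates them in O(1) per appended atom pair, and derives the centered quantities from them. The class and function names are hypothetical. Instead of a library SVD it uses the singular values of the 3x3 matrix, obtained from a closed-form eigenvalue formula for symmetric 3x3 matrices, together with the standard result that the minimum squared residual is Σ|p'|² + Σ|q'|² − 2(σ1 + σ2 ± σ3), with the minus sign when det(H) < 0. That substitution is a choice made for this example, not necessarily how the author implemented it.

```python
import math


def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def singular_values_3x3(H):
    """Singular values of H in descending order, via eigenvalues of H^T H."""
    K = [[sum(H[k][r] * H[k][c] for k in range(3)) for c in range(3)]
         for r in range(3)]
    off = K[0][1] ** 2 + K[0][2] ** 2 + K[1][2] ** 2
    if off == 0.0:  # K is already diagonal
        eig = sorted((K[0][0], K[1][1], K[2][2]), reverse=True)
    else:  # trigonometric closed form for a symmetric 3x3 matrix
        m = (K[0][0] + K[1][1] + K[2][2]) / 3.0
        p = math.sqrt((sum((K[i][i] - m) ** 2 for i in range(3)) + 2.0 * off) / 6.0)
        B = [[(K[r][c] - (m if r == c else 0.0)) / p for c in range(3)]
             for r in range(3)]
        phi = math.acos(max(-1.0, min(1.0, det3(B) / 2.0))) / 3.0
        e1 = m + 2.0 * p * math.cos(phi)
        e3 = m + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
        eig = [e1, 3.0 * m - e1 - e3, e3]
    return [math.sqrt(max(e, 0.0)) for e in eig]


class RunningSums:
    """Running sums over aligned pairs (p_i, q_i); each update is O(1)."""

    def __init__(self):
        self.n = 0
        self.sp = [0.0] * 3                        # sum of p_i
        self.sq = [0.0] * 3                        # sum of q_i
        self.spp = 0.0                             # sum of p_i^T p_i
        self.sqq = 0.0                             # sum of q_i^T q_i
        self.sqp = [[0.0] * 3 for _ in range(3)]   # sum of q_i p_i^T

    def add(self, p, q):
        self.n += 1
        for r in range(3):
            self.sp[r] += p[r]
            self.sq[r] += q[r]
            for c in range(3):
                self.sqp[r][c] += q[r] * p[c]
        self.spp += sum(x * x for x in p)
        self.sqq += sum(x * x for x in q)

    def rmsd(self):
        """RMSD of the pairs added so far (needs at least one pair)."""
        n = self.n
        cp = [x / n for x in self.sp]
        cq = [x / n for x in self.sq]
        # Centered quantities recovered from the raw sums.
        spp = self.spp - n * sum(x * x for x in cp)
        sqq = self.sqq - n * sum(x * x for x in cq)
        H = [[self.sqp[r][c] - n * cq[r] * cp[c] for c in range(3)]
             for r in range(3)]
        s = singular_values_3x3(H)
        sign = -1.0 if det3(H) < 0 else 1.0
        residual = spp + sqq - 2.0 * (s[0] + s[1] + sign * s[2])
        return math.sqrt(max(residual, 0.0) / n)


# Q is P rotated by 90 degrees about the z-axis and shifted by (5, -3, 2),
# so every prefix pair should give an RMSD close to zero (up to round-off).
P = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 2, 1)]
Q = [(5, -3, 2), (5, -2, 2), (4, -2, 2), (4, -2, 3), (3, -3, 3)]
sums = RunningSums()
for p, q in zip(P, Q):
    sums.add(p, q)
    print(sums.n, round(sums.rmsd(), 6))
```

Extending a prefix by one atom pair costs one `add` call plus one constant-size evaluation, independent of the prefix length. Note that subtracting large raw sums can lose precision for coordinates far from the origin, so a production version would likely need more careful numerics.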
Construction Algorithm
Just add each suffix substructure naively
• O(n^2) time for a structure of size n
• cf. O(n^3) time if we do not use incremental RMSD computation
• O(k·n^2) time for k structures of sizes at most n
[Figure: new nodes added to the tree]
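A rough accounting of where these bounds come from, assuming that inserting the suffix starting at position i (of length n − i + 1) triggers one RMSD evaluation per prefix extension. With the incremental computation each evaluation costs O(1); without it, an evaluation on a prefix of length l costs O(l).

```latex
\sum_{i=1}^{n} (n - i + 1) = \frac{n(n+1)}{2} = O(n^2)
\qquad \text{versus} \qquad
\sum_{i=1}^{n} \sum_{l=1}^{n-i+1} O(l) = O(n^3)
```

For k structures of length at most n, the first sum is simply repeated k times, which matches the O(k·n^2) bound.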
Search Algorithm
Goal: search for substructures with RMSD ≤ d
• Search for all the nodes with (prefix) structures of length l whose RMSD to the query is ≤ b/sqrt(l) + d, where l is the query length and b is the bound used in construction
• Check whether RMSD ≤ d
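One way to read the threshold b/sqrt(l) + d, assuming (as the compaction rule sqrt(l)·RMSD ≤ b suggests) that every length-l substructure X stored under a node lies within b/sqrt(l) of the node's representative prefix S, and using the fact that RMSD under optimal superposition satisfies the triangle inequality:

```latex
\mathrm{RMSD}(\text{query}, S)
  \le \mathrm{RMSD}(\text{query}, X) + \mathrm{RMSD}(X, S)
  \le d + \frac{b}{\sqrt{l}}
```

Under that assumption, any X with RMSD(query, X) ≤ d lives under a node that passes the first test, so no true hit is lost. The final check RMSD ≤ d on the surviving candidates then removes false positives.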
Computation Time
Construction time
• Linear in DB size
• Due to the protein length bound
[Figure: construction time (sec) versus total number of atoms; x-axis 0 to 50,000, y-axis 0 to 60]
CPU: 1.2GHz UltraSPARC III Cu
Search time
Input
• Query: 50
• DB: 317 related structures, 41,719 atoms
• RMSD bound: 1.0Å (the bound most often used in protein analysis)
Results
• Search time: 0.39 sec (about 3 times faster than the naive search)
• Reasonable bound
• 19 hits found
Conclusions
Geometric suffix trees
• Suffix trees extended for protein 3-D structures
Future work
• More flexible similarity search
• Faster algorithms (construction / query)
• Bioinformatics applications: motif finding / functional analysis / protein structure clustering
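URMSD: Unit-Vector Root Mean Square Deviation
Variation of the RMSD
$$\mathrm{URMSD}(A, B) = \min_{R, v} \sqrt{\sum_{i=1}^{n-1} \left| u_i - R \cdot (w_i - v) \right|^2 / n}$$
where
$$u_i = (a_{i+1} - a_i) / |a_{i+1} - a_i|, \qquad w_i = (b_{i+1} - b_i) / |b_{i+1} - b_i|$$
[Figure: unit vectors u1, ..., u7 and w1, ..., w7 along the two chains]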
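As a small illustration of the unit vectors that URMSD compares, the sketch below turns a chain of n atom coordinates into its n − 1 normalized bond directions. The function name `unit_vectors` is hypothetical, and the code assumes consecutive atoms never coincide (otherwise the length would be zero).

```python
import math


def unit_vectors(coords):
    """Unit vectors between consecutive atoms: (a[i+1] - a[i]) / |a[i+1] - a[i]|."""
    units = []
    for a, a_next in zip(coords, coords[1:]):
        d = [y - x for x, y in zip(a, a_next)]
        length = math.sqrt(sum(c * c for c in d))
        units.append(tuple(c / length for c in d))
    return units


chain = [(0, 0, 0), (3, 4, 0), (3, 4, 2)]
print(unit_vectors(chain))  # [(0.6, 0.8, 0.0), (0.0, 0.0, 1.0)]
```

The resulting vectors all have length 1, so the measure depends on the directions along the backbone rather than on absolute positions.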
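Optimizing Rotation
Problem without translation: minimize over rotation R, where p_i and q_i are the translated vectors
$$\sum_{i=1}^{n} \left| p_i - R \cdot q_i \right|^2$$
i.e.
$$\begin{aligned}
\sum_{i=1}^{n} \left| p_i - R \cdot q_i \right|^2
  &= \sum_{i=1}^{n} \left\{ p_i^T p_i + q_i^T q_i - \left( p_i^T R q_i + q_i^T R^T p_i \right) \right\} \\
  &= \sum_{i=1}^{n} \left( p_i^T p_i + q_i^T q_i \right) - 2 \cdot \mathrm{trace}\left( R \cdot \sum_{i=1}^{n} q_i p_i^T \right)
\end{aligned}$$
Let $H = \sum_{i=1}^{n} q_i p_i^T$ (the sum inside the trace).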
Tips for Optimizing Rotation
Theorem
• Given: positive definite matrix M and orthogonal matrix Q
• Property: trace(M) ≥ trace(QM)
Then...
• If RH is positive definite, this R is the one to compute for the given H!
• Note: there's a degenerate case in which R is a reflection (mirror-image) matrix
• Let R = VU^T where H = UΣV^T (singular value decomposition)
• RH = VΣV^T is positive definite!
• The SVD of H can be computed in constant time (as H is a 3x3 matrix)
How to guarantee that R is not a reflection matrix?
• det(R) should be 1 (it is -1 in the case of a reflection matrix)
• If the object is on a 2-D plane, it is easy to compute the actual R by flipping R.
• In other cases, it is difficult to compute the actual matrix... but it's a rare case. (Heuristically, the above flipped R could be used.)
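Singular Value Decomposition
Computation time: O(n^3) (for an n×n matrix)
$$A = U \Sigma V^T, \qquad U = (u_1, u_2, \ldots), \quad V = (v_1, v_2, \ldots), \quad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$$
• v1, v2, ...: orthonormal vectors
• u1, u2, ...: orthonormalized transformed vectors
• Σ: positive, diagonal matrix
[Figure: action of the matrix A on unit-length vectors]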
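A tiny worked example may help fix the notation. The 2x2 matrix below is chosen purely for illustration (it is not from the talk); its decomposition can be checked by multiplying the three factors.

```latex
A = \begin{pmatrix} 3 & 0 \\ 0 & -2 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}}_{U}
    \underbrace{\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}}_{\Sigma}
    \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{V^T}
```

The singular values 3 and 2 are positive even though A has a negative entry; the sign is absorbed into U. In this example V U^T = diag(1, -1) has determinant -1, which is the kind of degenerate case that needs post-processing when the decomposition is used to find a rotation.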