indexing structures for biomolecular structures

CPM 2006

Geometric Suffix Tree: A New Index Structure for Protein 3-D Structures

Tetsuo Shibuya
Human Genome Center, Institute of Medical Science, University of Tokyo

Today's Talk

Backgrounds
  • Protein structures
  • Suffix trees

Geometric suffix tree
  • Generalization of suffix trees for indexing protein structures
  • Experiments

Conclusions

Protein Structure

[Figure: amino acid chain with residues labeled A, V, L, W, K, E]

Protein
  • A chain molecule consisting of 20 kinds of amino acids
  • Folded into some structure

3-D structure
  • Coordinates of Cα atoms (backbone)
  • Cα atom: the representative atom of an amino acid

Backgrounds

Structurally similar proteins tend to have similar functions, even if they are not similar at the residue level

Structural search on a protein structure database
  • Functional analysis for proteins with newly solved structures
  • Increasing database size (PDB: 35,000~ entries)

→ A sophisticated index structure is desired!

[Figure: a query protein structure is searched against a protein structure database with entries A, B, C; one entry is reported as similar]

Suffix Tree [Weiner '73]

A sophisticated index structure for strings
  • Compacted trie of all the suffixes of a string S
  • Each leaf corresponds to a suffix of S
  • Enables efficient substring search

[Figure: suffix tree of 'mississippi$', built from all the suffixes: mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$. Substring label in the figure: issi]
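The substring search enabled by the suffix structure can be sketched as follows. This is a minimal illustration only: it builds an uncompacted suffix trie in quadratic time, not the compacted, linear-time suffix tree of the slide, and the names `build_suffix_trie` and `find_all` are hypothetical.

```python
def build_suffix_trie(s):
    """Build an uncompacted suffix trie of s (assumption: not the
    compacted linear-time suffix tree, just the simplest variant).

    Each node records the 0-based start positions of every suffix
    whose path passes through that node.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []}
            )
            node["starts"].append(start)
    return root


def find_all(root, pattern):
    """Return the start positions of all occurrences of a non-empty pattern."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # pattern does not occur in the indexed string
    return node["starts"]


trie = build_suffix_trie("mississippi$")
print(find_all(trie, "issi"))
```

Every substring is a prefix of some suffix, so walking the pattern from the root finds all of its occurrences in time proportional to the pattern length plus the number of hits.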

Suffix Tree Features

Linear-time construction
Various pattern matching applications
  • Motif finding
  • Repeat finding
  • Large-scale alignment
  • etc.

Good! ...But they are not for structures...

Today's Topic

Extend suffix trees for protein structures
  • "Geometric Suffix Tree", based on the RMSD measure

Related Work

Suffix trees for protein 3-D structure
PSIST [Gao et al. '05]
  • Converts structures into alphabetical strings
  • Does not support RMSD queries

How to compare two proteins?

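RMSD: Root Mean Square Deviation
The most famous measure for protein structure comparison

\[ \mathrm{RMSD}(A, B) = \min_{R, v} \sqrt{ \sum_{i=1}^{n} | a_i - R \cdot (b_i - v) |^2 / n } \]

(R: rotation matrix, v: translation vector)

[Figure: structure A with atoms a1..a5 compared against structure B with atoms b1..b5]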

Correspondence of atoms is given

Problem

Given
  • a substructure of a protein as a query
  • a structure DB

Find all the similar substructures
  • i.e. RMSD ≤ some given bound d
  • 'All' means no false negatives/positives

[Figure: a query (a protein substructure) is searched against the protein structure database with entries A, B, C; a hit is reported as similar, i.e. RMSD ≤ d]

Geometric Trie

Tree that represents multiple structures
Similar prefix substructures are compacted into an edge
  • i.e. if sqrt(l)·RMSD ≤ b (l: length of the prefix, b: given bound)
  • RMSD(P_l, Q_l) is not always ≤ RMSD(P_{l+1}, Q_{l+1}), where P_l and Q_l are the prefixes of length l
  • But sqrt(l)·RMSD(P_l, Q_l) is always ≤ sqrt(l+1)·RMSD(P_{l+1}, Q_{l+1})

Edge information: O(1) size!
  • i, j: start / end indices (+ sequence #)
  • R, v: the rotation matrix and the translation vector

[Figure: structures P = p1..p11 and Q = q1..q11 stored in a geometric trie. The shared edge is labeled P[1..7], R1, v1 (RMSD ≤ b/sqrt(7)); it branches into edges P[8..11], R1, v1 and Q[8..11], R2, v2. P[1..7]Q[8..11] is similar to Q[1..11].]
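The monotonicity of sqrt(l)·RMSD stated on this slide follows from a short argument, sketched here in the notation of the RMSD definition. It assumes P_l = (p_1, ..., p_l) and Q_l = (q_1, ..., q_l) are prefixes of the same two structures; dropping the last non-negative term of the sum and then re-minimizing can only lower the value.

```latex
\begin{align*}
(l+1)\,\mathrm{RMSD}(P_{l+1}, Q_{l+1})^2
  &= \min_{R,v} \sum_{i=1}^{l+1} | p_i - R \cdot (q_i - v) |^2 \\
  &\ge \min_{R,v} \sum_{i=1}^{l} | p_i - R \cdot (q_i - v) |^2
   = l\,\mathrm{RMSD}(P_l, Q_l)^2
\end{align*}
```

Taking square roots gives sqrt(l)·RMSD(P_l, Q_l) ≤ sqrt(l+1)·RMSD(P_{l+1}, Q_{l+1}), which is why the compaction test uses sqrt(l)·RMSD rather than the RMSD itself.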

Geometric Suffix Tree

Geometric trie over all the suffix substructures
  • O(n) space (though there are O(n^2) substructures)

But how to compute the RMSD incrementally?

[Figure: structures p1..p11 and q1..q11; the RMSD, R, and v between p[1..7] and q[1..7] are computed incrementally from the RMSD, R, and v between p[1..6] and q[1..6]]

RMSD Computation

Translation optimization
  • Independent of rotation optimization
  • Translate so that the two centroids come to the same position

Rotation optimization
  • Rotate after translation

\[ \min_{R} \sum_{i=1}^{n} | p_i - R \cdot q_i |^2 \]

(p_i, q_i: translated vectors)

Optimal Rotation for Minimizing RMSD [Arun et al. '87, Schwartz et al. '87]

Problem

\[ \min_{R} \sum_{i=1}^{n} | p_i - R \cdot q_i |^2 \]

Solution by SVD (Singular Value Decomposition)
  • Computation time: O(n)
  • Post-processing is required in some rare degenerate cases

\[ R = V U^T, \quad \text{where } U \Sigma V^T \text{ is the SVD of } H = \sum_{i=1}^{n} q_i p_i^T \ \text{(a 3x3 matrix)} \]
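The SVD solution above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function name `superpose`, the one-row-per-atom array layout, and the returned translation t (chosen so that p_i ≈ R·q_i + t) are illustrative choices, not taken from the talk. The handling of det = -1 is the common last-singular-vector flip, used here as an assumed form of the post-processing step.

```python
import numpy as np


def superpose(P, Q):
    """Return (rmsd, R, t) minimizing sum_i |p_i - (R q_i + t)|^2.

    P and Q are (n, 3) arrays of corresponding atom coordinates.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Translation step: move both centroids to the origin.
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    P0, Q0 = P - cp, Q - cq
    # H = sum_i q_i p_i^T, a 3x3 matrix.
    H = Q0.T @ P0
    U, S, Vt = np.linalg.svd(H)
    # Assumed post-processing: if V U^T is a reflection (det = -1),
    # flip the direction belonging to the smallest singular value.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cp - R @ cq
    diff = P - (Q @ R.T + t)
    rmsd = np.sqrt((diff ** 2).sum() / len(P))
    return rmsd, R, t
```

Building H takes O(n) time, while the SVD itself works on a 3x3 matrix and therefore costs constant time.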

Incremental RMSD Computation

Theorem: The values of the RMSD, R, and v can be computed in constant time if we are given the following values

\[ \sum_{i=1}^{n} q_i^T q_i, \quad \sum_{i=1}^{n} q_i p_i^T, \quad \sum_{i=1}^{n} p_i^T p_i, \quad \sum_{i=1}^{n} q_i, \quad \sum_{i=1}^{n} p_i \]

which can be computed incrementally!
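One way to realize the theorem is to keep the five running sums and solve a 3x3 SVD on demand. The sketch below is an illustration under stated assumptions, not the author's implementation: the class name `IncrementalRMSD`, its methods, and the returned translation t (with p_i ≈ R·q_i + t) are hypothetical, and the reflection case is handled by the common sign flip.

```python
import numpy as np


class IncrementalRMSD:
    """Running sums that allow RMSD, R and t to be computed in O(1)."""

    def __init__(self):
        self.n = 0
        self.sp = np.zeros(3)        # sum of p_i
        self.sq = np.zeros(3)        # sum of q_i
        self.spp = 0.0               # sum of p_i^T p_i
        self.sqq = 0.0               # sum of q_i^T q_i
        self.sqp = np.zeros((3, 3))  # sum of q_i p_i^T

    def add(self, p, q):
        """Append one corresponding atom pair in constant time."""
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        self.n += 1
        self.sp += p
        self.sq += q
        self.spp += p @ p
        self.sqq += q @ q
        self.sqp += np.outer(q, p)

    def solve(self):
        """Return (rmsd, R, t) for all pairs added so far."""
        n = self.n
        cp, cq = self.sp / n, self.sq / n
        # The same sums taken over centroid-translated vectors.
        H = self.sqp - n * np.outer(cq, cp)
        pp = self.spp - n * (cp @ cp)
        qq = self.sqq - n * (cq @ cq)
        U, S, Vt = np.linalg.svd(H)
        d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        # Squared error = pp + qq - 2 trace(R H).
        err = pp + qq - 2.0 * (S[0] + S[1] + d * S[2])
        return np.sqrt(max(err, 0.0) / n), R, cp - R @ cq
```

Calling `add` once per new atom pair and `solve` afterwards extends a prefix from length l to l+1 without revisiting the earlier atoms. The subtraction in `err` can lose precision when the RMSD is very close to zero, hence the clamp at 0.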

Construction Algorithm

Naively add each suffix substructure, one at a time

O(n^2) time for a structure of size n
  • cf. O(n^3) time if we do not use incremental RMSD computation

O(k·n^2) time for k structures of sizes at most n

[Figure: new nodes created when a suffix substructure is added]

Search Algorithm

Goal: search for substructures with RMSD ≤ d

Search for all the nodes with (prefix) structures of length l whose RMSD to the query is ≤ b/sqrt(l) + d, where
  • l: query length
  • b: bound used in construction

Check whether RMSD ≤ d
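The threshold b/sqrt(l) + d can be motivated by a short derivation. It is a sketch resting on two assumptions: that RMSD satisfies the triangle inequality, and that every substructure S compacted into a path of length l lies within b/sqrt(l) of the structure T stored on that path (the compaction rule sqrt(l)·RMSD ≤ b). The symbols S, T and Query are introduced here for illustration.

```latex
\begin{align*}
\mathrm{RMSD}(\mathit{Query}, T)
  &\le \mathrm{RMSD}(\mathit{Query}, S) + \mathrm{RMSD}(S, T) \\
  &\le d + b / \sqrt{l}
\end{align*}
```

So every true hit S with RMSD(Query, S) ≤ d is reached through a node that passes the first test, which rules out false negatives; the final check RMSD ≤ d then removes the false positives.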

Computation Time

Construction Time
  • Linear in the DB size
  • Due to the protein length bound

[Plot: construction time (sec, 0 to 60) against total number of atoms (0 to 50,000)]

Search Time
Input
  • Query: 50
  • DB: 317 related structures, 41,719 atoms
  • RMSD bound: 1.0Å, the bound most often used in protein analysis
Results
  • Search time: 0.39 sec (about 3 times faster than the naive search)
  • Reasonable bound
  • 19 hits found

CPU: 1.2GHz UltraSPARC III Cu

Conclusions

Geometric suffix trees
  • Suffix trees extended for protein 3-D structures

Future work
  • More flexible similarity search
  • Faster algorithms (construction / query)
  • Bioinformatics applications: motif finding / functional analysis / protein structure clustering

Thank you very much.

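URMSD: Unit-Vector Root Mean Square Deviation

Variation of the RMSD

\[ \mathrm{URMSD}(A, B) = \min_{R, v} \sqrt{ \sum_{i=1}^{n-1} | u_i - R \cdot (w_i - v) |^2 / n } \]

where

\[ u_i = (a_{i+1} - a_i) / | a_{i+1} - a_i |, \qquad w_i = (b_{i+1} - b_i) / | b_{i+1} - b_i | \]

[Figure: unit vectors u1..u7 of structure A and w1..w7 of structure B]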
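The unit vectors u_i and w_i are obtained from consecutive atom coordinates. A minimal sketch, assuming coordinates are given as (x, y, z) tuples; the function name `unit_vectors` is hypothetical:

```python
import math


def unit_vectors(coords):
    """Return the n-1 unit vectors between consecutive atoms.

    coords is a sequence of n (x, y, z) tuples, e.g. Cα positions.
    """
    units = []
    for (x0, y0, z0), (x1, y1, z1) in zip(coords, coords[1:]):
        dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
        norm = math.sqrt(dx * dx + dy * dy + dz * dz)
        if norm == 0.0:
            raise ValueError("consecutive atoms coincide")
        units.append((dx / norm, dy / norm, dz / norm))
    return units


chain = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 5.0)]
print(unit_vectors(chain))
```

A chain of n atoms yields n-1 unit vectors, which matches the upper limit of the sum in the URMSD definition.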

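Optimizing Rotation

Problem without translation

\[ \min_{\text{rotation } R} \sum_{i=1}^{n} | p_i - R \cdot q_i |^2 \]

(p_i, q_i: translated vectors)

i.e.

\begin{align*}
\sum_{i=1}^{n} | p_i - R \cdot q_i |^2
  &= \sum_{i=1}^{n} \{ p_i^T p_i + q_i^T q_i - ( p_i^T R q_i + q_i^T R^T p_i ) \} \\
  &= \sum_{i=1}^{n} ( p_i^T p_i + q_i^T q_i ) - 2\,\mathrm{trace}\Big( R \cdot \sum_{i=1}^{n} q_i \cdot p_i^T \Big)
\end{align*}

Let \sum_{i=1}^{n} q_i p_i^T be H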

Tips for Optimizing Rotation

Theorem
  • Given: positive definite matrix M and orthogonal matrix Q
  • Property: trace(M) ≥ trace(QM)

Then...
  • If RH is positive definite, trace(RH) is maximal, so this R is the rotation to compute!
  • Note: there is a degenerate case in which R is a reflection matrix
  • Let R = VU^T where H = UΣV^T (singular value decomposition)
  • RH = VΣV^T is positive definite!
  • SVD of H can be computed in constant time (as H is a 3x3 matrix)

How to guarantee that R is not a reflection matrix?
  • det(R) should be 1 (it is -1 in the case of a reflection matrix)
  • If the object is on a 2-D plane, it is easy to compute the actual R by flipping R.
  • In other cases, it is difficult to compute the actual matrix... but it's a rare case. (Heuristically, the above flipped R could be used.)

Singular Value Decomposition

Computation time: O(n^3) (for an n×n matrix)

\[ A = U \Sigma V^T, \qquad U = (u_1, u_2, \ldots), \quad V = (v_1, v_2, \ldots), \quad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n) \]

  • A: matrix
  • V: orthonormal vectors v1, v2, ...
  • U: orthonormalized translated vectors u1, u2, ...
  • Σ: positive, diagonal matrix
