evolutionary inaccuracy of pairwise structural alignments (slide)

EVOLUTIONARY INACCURACY OF

PAIRWISE STRUCTURAL ALIGNMENTS

Presenter: Nguyen Dinh Chien (阮庭戰)

Authors:

M. I. Sadowski and W. R. Taylor

From Division of Mathematical Biology,

MRC National Institute for Medical Research, London, UK

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and 3D conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules.

In this study, the authors analyzed the self-consistency of seven widely used structural alignment methods, including SAP, TM-align, MAMMOTH, DALI, CE, and FATCAT, on a diverse, non-redundant set of 1863 domains from the SCOP database.

Results:

The degree of inconsistency of the alignments on a residue level is 30%, with DALI, Fr-TM-align, and SAP producing more consistent alignments than the rest.

The ability of the methods to identify good structural alignments is also assessed using geometric measures.

Outline

INTRODUCTION

METHODS

RESULTS

DISCUSSION

INTRODUCTION

The problem of aligning pairs of protein structures has attracted a significant level of research effort.

Kolodny et al., 2005 and Mayr et al., 2007 are important contributions. Kolodny's study tested the ability of methods to find a good solution as judged by geometric criteria, and Mayr's study assessed the agreement of the aligned residues with a set of manually curated 'gold standard' alignments.

They used geometric measures to assess the ability of aligners. They proposed that if A and B are homologous and B and C are homologous, then A and C must also be homologous.

In this study, the authors compared the most widely used methods for pairwise structural alignment and considered alignment accuracy relative to other annotation sources: DSSP structural classes and solvent accessibilities.

They also used SCOP folds, GO annotations, topological distances, and several geometric scores in addition to these external annotations.

The different assessment methods highlight different strengths and weaknesses of each method.

METHODS

Data set

Structural alignment methods

Inconsistency measure

Calibration of data

Other geometric measures

Residue annotations

Assessment of symmetry

Data set

In this study, the authors used a set of 1863 domains derived from the ASTRAL SCOP10 database.

SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html

ASTRAL: http://astral.berkeley.edu/

The set was restricted to high-quality structures by requiring a SPACI (Summary PDB ASTRAL Check Index, http://astral.berkeley.edu/spaci.html) score > 0.5 and by excluding NMR (Nuclear Magnetic Resonance) structures (http://nmr.cit.nih.gov/xplor-nih/xplorMan/node470.html) and those with missing residues.




Structural alignment methods

All-versus-all pairwise structural alignments were generated using 7 methods: SAP,

DALILite, MAMMOTH-MULT, FATCAT, CE, TM-align, and Fr-TM-align.

They selected these methods because many of them are used to compute large sets of alignments for publicly available resources (FATCAT and CE for PDB; DALILite for the DALI FSSP database) or have been used to draw conclusions about fold space (SAP, TM-align, DALILite and MAMMOTH-MULT).

All methods were used with default parameters.

They also used Andrea Prlic’s Java implementations of FATCAT and CE.

Inconsistency measure

Inconsistency was assessed for all positions in any triplet of aligned structures (within a particular threshold). If a gap was found at that position in any of the three alignments, the position was ignored.

For each position, they determined whether the condition E(Ai,Bj)∩E(Bj,Ck)∩E(Ai,Ck) was true, where the predicate E(Xi,Yj) means that position i in sequence X is aligned to position j in sequence Y.

If the condition is false, inconsistency=1; otherwise, inconsistency=0.

The proportion of inconsistent positions was found for all aligned triples for each method at each threshold and expressed as a percentage. Computed over all residues, this is the absolute inconsistency.

Computed over subsets of residues with particular annotations, it is called the relative inconsistency.
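The transitivity check described in this section can be sketched in Python. This is only an illustration under stated assumptions: each pairwise alignment is represented as a dictionary from a residue index in the first structure to its partner in the second (gapped positions are simply absent), and the function name and toy alignments are hypothetical, not taken from the authors' software.

```python
def triplet_inconsistency(ab, bc, ac):
    """Count transitively inconsistent positions in one triplet of alignments.

    Each argument is a dict mapping an aligned position in the first
    structure to its partner in the second (gapped positions are absent).
    Returns the pair (inconsistent, checked).
    """
    inconsistent = 0
    checked = 0
    for i, j in ab.items():
        # Ignore the position if it is gapped in either of the other alignments.
        if j not in bc or i not in ac:
            continue
        checked += 1
        # E(Ai,Bj) holds by construction. The condition fails when the residue
        # of C reached via B differs from the one A is aligned to directly.
        if bc[j] != ac[i]:
            inconsistent += 1
    return inconsistent, checked


# Hypothetical toy alignments (0-based residue indices), for illustration only.
ab = {0: 0, 1: 1, 2: 2, 3: 3}
bc = {0: 0, 1: 1, 2: 3}
ac = {0: 0, 1: 2, 2: 3, 3: 4}

bad, total = triplet_inconsistency(ab, bc, ac)
percent = 100.0 * bad / total if total else 0.0
print(f"{bad} of {total} positions inconsistent ({percent:.1f}%)")
```

Summing the two counts over all triplets for a method at a given threshold would give the absolute inconsistency. Restricting the loop to annotated residues would give the count for an annotated subset.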

Calibration of data

The RMSD (Root-mean-square deviation) and coverage values were used to

approximate TM-scores for the alignments generated by each method.

Approximate TM-score for TM-align – 0.981; real TM-score for them – 0.985.

However, the approximate TM-scores for the other methods were correlated with TM-align as follows: SAP, 0.739; DALILite, 0.643; FATCAT, 0.774; FATCAT (flexible mode), 0.639; CE, 0.837; Fr-TM-align, 0.923.

Next, they compared the fTM score with each method's own summary score to determine which was likely to provide the best ranking.

$$\mathrm{fTM} = \frac{C}{L} \cdot \frac{1}{1 + \left(R / D_0\right)^2}$$

where $D_0 = 1.24\sqrt[3]{L - 15} - 1.8$ (Zhang and Skolnick, 2004), C being the number of aligned residues, L being the mean length of the structures, and R the reported RMSD.
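As a rough illustration, the approximate TM-score can be computed from the three reported quantities. The sketch below assumes the form fTM = (C/L) / (1 + (R/D0)^2) with D0 = 1.24 (L - 15)^(1/3) - 1.8, as reconstructed from this slide; the function name and the example values are hypothetical.

```python
def approx_tm_score(rmsd, n_aligned, mean_length):
    """Approximate TM-score (fTM) from RMSD, aligned residues and mean length.

    Assumes fTM = (C / L) / (1 + (R / D0)**2), with the length-dependent
    scale D0 = 1.24 * (L - 15)**(1/3) - 1.8 (Zhang and Skolnick, 2004).
    """
    if mean_length <= 15:
        raise ValueError("mean_length must exceed 15 for D0 to be defined")
    d0 = 1.24 * (mean_length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("D0 is not positive for such short structures")
    coverage = n_aligned / mean_length
    return coverage / (1.0 + (rmsd / d0) ** 2)


# Hypothetical example: 100 aligned residues, mean length 120, RMSD 2.0.
print(round(approx_tm_score(2.0, 100, 120), 3))
```

With zero RMSD and full coverage the value is 1, and it falls as the RMSD grows relative to D0 or as coverage drops.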

Calibration of data

Method   .985   .986   .987   .988   .989   .990   .991   .992   .993   .994   .995   .996   .997   .998   .999
MMT      5.15   5.28   5.42   5.57   5.75   5.95   6.16   6.43   6.73   7.11   7.61   8.34   9.33  11.03  14.56
TM      0.417  0.421  0.426  0.431  0.438  0.445  0.453  0.461  0.472  0.484  0.499  0.517  0.540  0.570  0.617
FrTM    0.437  0.441  0.446  0.452  0.458  0.465  0.473  0.481  0.492  0.505  0.520  0.539  0.561  0.590  0.636
SAP     0.263  0.270  0.276  0.283  0.292  0.301  0.312  0.325  0.339  0.355  0.376  0.401  0.433  0.476  0.548
FTCT    0.397  0.403  0.408  0.415  0.422  0.430  0.440  0.450  0.462  0.475  0.492  0.512  0.538  0.572  0.625
DALI    0.364  0.370  0.376  0.382  0.390  0.398  0.407  0.418  0.431  0.446  0.463  0.483  0.510  0.547  0.603
FTCF    0.011   0.01  0.009  0.007  0.006  0.005  0.004  0.003  0.003  0.002  0.001  7e-04  3e-04  6e-05  1e-06
CE      0.398  0.403  0.409  0.415  0.422  0.430  0.439  0.449  0.460  0.474  0.490  0.510  0.535  0.570  0.621

Table S1: Thresholds used for the top 15 increments from 98.5% to 99.9% of alignments

Other geometric measures

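To assess the geometric quality of the reported alignments, they used the following formulae:

$$\mathrm{SI} = \frac{R \cdot \min(L_1, L_2)}{C}$$

$$\mathrm{MI} = 1 - \frac{1 + C}{\left(1 + R / W_0\right)\left(1 + \min(L_1, L_2)\right)}$$

$$\mathrm{SAS} = \frac{100 \cdot R}{C}$$

R - RMSD

C - alignment coverage

L1 and L2 - lengths of the two sequences

W0 - weighting parameter

W0=1.5 as in Kolodny et al., 2005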
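The three geometric measures can be sketched as small Python functions. This assumes the forms SI = R·min(L1, L2)/C, MI = 1 - (1 + C)/((1 + R/W0)(1 + min(L1, L2))) and SAS = 100·R/C, as reconstructed from this slide, with C taken as the number of aligned residues. The function names and example values are hypothetical.

```python
def similarity_index(rmsd, coverage, len1, len2):
    """SI: RMSD scaled by the shorter length over the aligned residues."""
    return rmsd * min(len1, len2) / coverage


def match_index(rmsd, coverage, len1, len2, w0=1.5):
    """MI with weighting parameter w0 (1.5 as in Kolodny et al., 2005)."""
    return 1.0 - (1.0 + coverage) / (
        (1.0 + rmsd / w0) * (1.0 + min(len1, len2))
    )


def structural_alignment_score(rmsd, coverage):
    """SAS: RMSD per 100 aligned residues."""
    return 100.0 * rmsd / coverage


# Hypothetical alignment: RMSD 2.0 over 100 residues, lengths 120 and 150.
r, c, l1, l2 = 2.0, 100, 120, 150
print(similarity_index(r, c, l1, l2))
print(round(match_index(r, c, l1, l2), 4))
print(structural_alignment_score(r, c))
```

All three combine RMSD and coverage into a single number, so that a short alignment with a low RMSD is not automatically ranked above a longer one.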

Residue annotations

The Catalytic Site Atlas annotations (Porter et al., 2004) and annotations from PDB SITE records were used to produce datasets of functional residues (http://www.ebi.ac.uk/thornton-srv/databases/CSA_NEW/).

Secondary structure assignments and accessibility values were taken from DSSP (Define Secondary Structure of Proteins, http://swift.cmbi.ru.nl/gv/dssp/).

The consistency of the annotations was assessed separately using a chi-square test.

Class I (π-helix) almost always aligns with class H (α-helix). Isolated β-bridges (B) align mostly with strands (class E), and the remaining non-coil classes align significantly together, suggesting that at greater distances these regions are interchangeable.
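To make the chi-square step concrete, here is a minimal sketch of the Pearson chi-square statistic for a contingency table of aligned annotation pairs. It is not the authors' procedure in detail: the table layout (rows and columns as DSSP classes of the two aligned residues) is an assumption, and the counts are invented purely for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    `table` is a list of rows of observed counts. Expected counts come from
    the row and column totals under the independence hypothesis.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented counts: rows are the class (H, E) of a residue in one structure,
# columns the class (H, E) of the residue aligned to it in the other.
counts = [[90, 10],
          [20, 80]]
stat, dof = chi_square_statistic(counts)
print(f"chi-square = {stat:.2f} with {dof} degree(s) of freedom")
```

A large statistic relative to the chi-square distribution with the given degrees of freedom indicates that aligned residues share annotations more often than chance would predict.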


Residue annotations

Fold    Description                  Mean inconsistency   SD inconsistency   N
b.80    SS R/H Beta Helix            100.00%              0.00%              5
a.118   Alpha/alpha superhelix       86.70%               7.30%              10
a.24    Four helix up/down bundle    80.00%               8.10%              13
b.69    7-bladed beta propeller      58.70%               5.60%              8
a.102   alpha/alpha toroid           57.60%               28.50%             8
b.55    PH domain-like barrel        8.60%                3.30%              10
d.38    Thioesterase                 8.00%                2.60%              7
d.131   DNA clamp                    7.90%                0.40%              5
b.34    SH3-like barrel              6.90%                6.20%              11
d.37    CBS-domain pair              6.50%                1.80%              5

Table S2: Most and least consistent domains. The SCOP folds and corresponding names are shown for the five most and least consistently aligned domains at the highest threshold across all methods, along with the number of neighbours at that level in the dataset.

Assessment of symmetry

Symmetry values for protein structures were derived using the Fourier transform-based approach described by Taylor et al. (2002).

Inconsistency values per domain were the mean for all methods at the

highest threshold, which had 803 members; domains with fewer than 5

neighbors for TM-align were culled from the set, leaving 207 domains.

RESULTS

Choosing a score for ranking: ROC assessment

Assessment of self-consistency for structural aligners

Determining structural features associated with inconsistencies

Assessment by geometric measures

Choosing a score for ranking: ROC assessment

Mean AUC values for ROC curves derived from each possible score for the methods presented. CE, DALI, FATCAT (rigid mode) and Fr-TM-align all perform excellently when scored using the approximate TM-score. MAMMOTH, FATCAT (flexible mode) and SAP all perform less well regardless of score.
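For readers unfamiliar with the AUC, the sketch below computes the area under a ROC curve from a list of scores and binary labels, using the equivalent rank formulation (the fraction of positive/negative pairs ranked correctly, with ties counted as half). What counts as a positive pair in the study is not stated on this slide, so the labels here are invented, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for positive pairs and 0 for negative pairs; higher
    scores are assumed to indicate a better match.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Invented scores and labels, for illustration only.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
example_labels = [1, 1, 0, 1, 0, 0]
print(round(roc_auc(example_scores, example_labels), 3))
```

A score with an AUC near 1 ranks almost every positive pair above every negative pair, while 0.5 is no better than random ordering. For scores where lower is better, such as RMSD, the sign would need to be flipped first.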

Assessment of self-consistency for structural aligners

Fig 2: Inconsistency of pairwise structural alignments. The proportion of positions failing transitive consistency is shown for all alignment pairs in the relevant fraction of the set. The methods appear in the order FATCAT-flexible, MAMMOTH, CE, FATCAT, TM-align, DALI, Fr-TM-align, SAP from top to bottom on the left-hand edge of the graph.

Determining structural features associated with inconsistencies

Fig 3: Improved consistency at residues marked functional. Absolute rates of inconsistency are shown for functional residues (solid lines) and all residues (dashed lines) for the three most consistent methods. These appear in the order DALI, Fr-TM-align, SAP from the top downwards along the left-hand edge.

Fig 4. Relative inconsistencies for DSSP residue classes. Inconsistencies are shown as a percentage of the absolute value for each method. The upper panel shows results for the top 0.01% of alignments, the bottom the top 1.5%.

Fig 5. Relative inconsistency for three methods in relation to solvent accessibility. Solvent accessibility was split into classes in bins of 20% with 0 being the lowest. Panels are arranged as in Figure 4.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3338010/figure/F4/

Determining structural features associated with inconsistencies

Figure S1: symmetry and inconsistency. Mean inconsistency (X-axis) for 233 domains with more than 5 neighbours at the highest level of structural similarity is plotted against the power of the Fourier series as a measure of the internal symmetry of the structure (Y-axis, arbitrary units).

Fig 6. Relative inconsistency as a function of gap distance. Panels are arranged as in Figure 4.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3338010/figure/F4/

Assessment by geometric measures

TM-align, FATCAT (flexible), and Fr-TM-align are the best three methods in all cases, regardless of the metric used. SAP and MAMMOTH both rank as worst by all metrics.


Discussion

Even for the most consistent methods the level of inconsistency is very high.

The most significant contributory factor to inconsistent structural alignments is the treatment of gaps.

Another important issue is that optimization of structural similarity is not in all cases the ideal strategy for identifying homology.

Flexible alignment is correctly seen as an important innovation in aligning protein structures; however, our results demonstrate that it is not a panacea.

The least consistently aligned domains are the repeats such as beta-helices and the least consistently aligned elements are generally helices.

Another possibility for improving the results of large-scale pairwise alignments (e.g. in database searches or when using large datasets) is to realign significantly similar structures using a consistency criterion.

Thanks for your attention!
