evolutionary inaccuracy of pairwise structural alignments (slide)
EVOLUTIONARY INACCURACY OF
PAIRWISE STRUCTURAL ALIGNMENTS
Presenter: Nguyen Dinh Chien (阮庭戰)
Authors:
M. I. Sadowski and W. R. Taylor
From Division of Mathematical Biology,
MRC National Institute for Medical Research, London, UK
Structural alignment attempts to establish homology between two or more polymer
structures based on their shape and 3D conformation. This process is usually
applied to protein tertiary structures but can also be used for large RNA molecules.
In this study, the authors analyzed the self-consistency of 7 widely used structural
alignment methods (SAP, TM-align, Fr-TM-align, MAMMOTH, DALI, CE, and FATCAT)
on a diverse, non-redundant set of 1863 domains from the SCOP database.
Results:
The degree of inconsistency of the alignments at the residue level is high (30%).
Levels of consistency vary between methods, with some producing more consistent alignments than the rest.
The ability of the methods to identify good structural alignments is also assessed using geometric
measures.
Outline
INTRODUCTION
METHODS
RESULTS
DISCUSSION
INTRODUCTION
The problem of aligning pairs of protein structures has attracted a significant level of
research effort.
Kolodny et al., 2005 and Mayr et al., 2007 are important contributions. Kolodny's study tested
the ability of methods to find a good solution as judged by geometric criteria, and Mayr's study
assessed the agreement of the aligned residues with a set of manually curated 'gold standard' alignments.
Geometric measures were used to assess the ability of the aligners. The consistency principle is that, if A and B are
homologous and B and C are homologous, then A and C must also be homologous.
In this study, the authors compared the most widely used methods for pairwise structural alignment and
considered alignment accuracy relative to other annotation sources: DSSP structural classes and
solvent accessibilities.
They also related SCOP folds, GO annotations, topological distances, and several geometric scores to
external annotations.
The different assessment methods highlight different strengths and weaknesses of each method.
METHODS
Data set
Structural alignment methods
Inconsistency measure
Calibration of data
Other geometric measures
Residue annotations
Assessment of symmetry
Data set
In this study, the authors used a set of 1863 domains, which was
derived from the ASTRAL SCOP10 databases.
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html
ASTRAL: http://astral.berkeley.edu/
The set was restricted to high quality structures by requiring a SPACI
(Summary PDB ASTRAL Check Index, http://astral.berkeley.edu/spaci.html) score >0.5
and excluding NMR (Nuclear Magnetic Resonance) structures
(http://nmr.cit.nih.gov/xplor-nih/xplorMan/node470.html) and those with missing residues.
Structural alignment methods
All-versus-all pairwise structural alignments were generated using 7 methods: SAP,
DALILite, MAMMOTH-MULT, FATCAT, CE, TM-align, and Fr-TM-align.
They selected these methods because, in many cases, they are used to compute large sets of
alignments for publicly available resources (FATCAT and CE for PDB; DALILite for the DALI FSSP
database), or have been used to draw conclusions about fold-space (SAP, TM-align, DALILite and
MAMMOTH-MULT).
All methods were used with default parameters.
They also used Andrea Prlic’s Java implementations of FATCAT and CE.
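To give a feel for the scale of an all-versus-all comparison, the sketch below counts the unordered pairs among 1863 domains. It assumes each pair is aligned once per method, which the slides do not state, and the domain identifiers are placeholders.

```python
from itertools import combinations
from math import comb

N_DOMAINS = 1863  # size of the data set quoted in the slides


def all_pairs(domain_ids):
    """Yield every unordered pair of distinct domains exactly once."""
    return combinations(domain_ids, 2)


# Placeholder identifiers; real SCOP domain names would go here.
toy_ids = ["d1", "d2", "d3", "d4"]
toy_pairs = list(all_pairs(toy_ids))

# Number of unordered pairs for the full set: n * (n - 1) / 2.
n_pairs = comb(N_DOMAINS, 2)
```

Under that assumption each method has to produce roughly 1.7 million alignments, so the per-pair running time of an aligner matters.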
Inconsistency measure
Inconsistency was assessed for all positions in any triplet of aligned structures (within a
particular threshold). If a gap was found at that position in any of the three alignments,
the position was ignored.
For each position, they determined whether the condition E(Ai,Bj)∧E(Bj,Ck)∧E(Ai,Ck) was
true, where the predicate E(Xi,Yj) means that position i in sequence X is aligned to position j in
sequence Y.
If the condition is false, inconsistency=1; otherwise, inconsistency=0.
The proportion of inconsistent positions was found for all aligned triples for each method
at each threshold and expressed as a percentage. Calculated over all residues, this is the
absolute inconsistency.
Calculated over subsets of residues with particular annotations, it is the relative inconsistency.
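The transitive check described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each pairwise alignment is stored as a dictionary mapping a residue index in the first structure to the index it is aligned to in the second, with gapped positions simply absent. The function names and toy alignments are made up.

```python
def triplet_inconsistency(ab, bc, ac):
    """Count inconsistent positions for one triplet (A, B, C).

    Each argument maps a residue index in the first structure to the
    residue index it is aligned to in the second; gapped positions are
    absent from the mapping (an assumed representation).
    Returns (inconsistent, compared).
    """
    inconsistent = 0
    compared = 0
    for i, j in ab.items():
        # Skip positions that hit a gap in either of the other alignments.
        if j not in bc or i not in ac:
            continue
        compared += 1
        # Transitivity: following A -> B -> C must land on the same
        # residue of C as the direct A -> C alignment.
        if bc[j] != ac[i]:
            inconsistent += 1
    return inconsistent, compared


def inconsistency_percent(triplets):
    """Pool counts over many (ab, bc, ac) triplets and return a percentage."""
    bad = 0
    total = 0
    for ab, bc, ac in triplets:
        b, t = triplet_inconsistency(ab, bc, ac)
        bad += b
        total += t
    return 100.0 * bad / total if total else 0.0


# Toy example with made-up alignments.
ab = {0: 0, 1: 1, 2: 2, 3: 3}
bc = {0: 0, 1: 1, 2: 3}
ac = {0: 0, 1: 2, 2: 3, 3: 4}
counts = triplet_inconsistency(ab, bc, ac)
```

In the toy triplet, position 3 of A is skipped because its partner in B is gapped against C, and one of the three remaining positions breaks transitivity. Restricting the loop to residues carrying a given annotation would give the relative inconsistency for that annotation.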
Calibration of data
The RMSD (Root-mean-square deviation) and coverage values were used to
approximate TM-scores for the alignments generated by each method.
Approximate TM-score for TM-align – 0.981; real TM-score for them – 0.985.
However, the approximate TM-scores for the other methods were correlated with TM-align as
follows: SAP, 0.739; DALILite, 0.643; FATCAT, 0.774; FATCAT (flexible mode), 0.639;
CE, 0.837; Fr-TM-align, 0.923.
Next, they compared the fTM score with each method's own summary score to determine
which was likely to provide the best ranking.
$$f_{TM} = \frac{C}{L\left(1 + (R/D_0)^2\right)}, \quad \text{where } D_0 = 1.24\,\sqrt[3]{L - 15} - 1.8 \text{ (Zhang and Skolnick, 2004)}$$
C - the number of aligned residues
L - the mean length of the structures
R - the reported RMSD
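As a worked illustration of the approximate TM-score, the sketch below evaluates fTM = C / (L (1 + (R/D0)^2)) with D0 = 1.24 (L - 15)^(1/3) - 1.8 (Zhang and Skolnick, 2004). The function names and the example numbers are assumptions for illustration only, and the length check is a guard added here rather than part of the published definition.

```python
def d0(mean_length):
    """Length-dependent distance scale D0 = 1.24 * (L - 15)^(1/3) - 1.8."""
    if mean_length <= 15:
        # The cube root of a non-positive number is not meaningful here.
        raise ValueError("mean_length must be greater than 15")
    return 1.24 * (mean_length - 15) ** (1.0 / 3.0) - 1.8


def approx_tm_score(aligned, mean_length, rmsd):
    """Approximate TM-score fTM = C / (L * (1 + (R / D0)^2)).

    aligned     -- C, number of aligned residues
    mean_length -- L, mean length of the two structures
    rmsd        -- R, RMSD reported by the alignment method
    """
    scale = d0(mean_length)
    return aligned / (mean_length * (1.0 + (rmsd / scale) ** 2))


# Made-up example: 80 aligned residues, mean length 100, RMSD 2.5.
example = approx_tm_score(aligned=80, mean_length=100, rmsd=2.5)
```

Two sanity checks follow from the formula: a full-length alignment with zero RMSD scores exactly 1, and an RMSD equal to D0 halves the score.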
Calibration of data
Method .985 .986 .987 .988 .989 .990 .991 .992 .993 .994 .995 .996 .997 .998 .999
MMT 5.15 5.28 5.42 5.57 5.75 5.95 6.16 6.43 6.73 7.11 7.61 8.34 9.33 11.03 14.56
TM 0.417 0.421 0.426 0.431 0.438 0.445 0.453 0.461 0.472 0.484 0.499 0.517 0.540 0.570 0.617
FrTM 0.437 0.441 0.446 0.452 0.458 0.465 0.473 0.481 0.492 0.505 0.520 0.539 0.561 0.590 0.636
SAP 0.263 0.270 0.276 0.283 0.292 0.301 0.312 0.325 0.339 0.355 0.376 0.401 0.433 0.476 0.548
FTCT 0.397 0.403 0.408 0.415 0.422 0.430 0.440 0.450 0.462 0.475 0.492 0.512 0.538 0.572 0.625
DALI 0.364 0.370 0.376 0.382 0.390 0.398 0.407 0.418 0.431 0.446 0.463 0.483 0.510 0.547 0.603
FTCF 0.011 0.01 0.009 0.007 0.006 0.005 0.004 0.003 0.003 0.002 0.001 7e-04 3e-04 6e-05 1e-06
CE 0.398 0.403 0.409 0.415 0.422 0.430 0.439 0.449 0.460 0.474 0.490 0.510 0.535 0.570 0.621
Table S1: Thresholds used for the top 15 increments from 98.5% to 99.9% of alignments
Other geometric measures
To assess the geometric quality of the reported alignments, they used the following formulae:
$$SI = \frac{R \cdot \min(L_1, L_2)}{C}$$
$$MI = 1 - \frac{1 + C}{\left(1 + R/W_0\right)\left(1 + \min(L_1, L_2)\right)}$$
$$SAS = \frac{100\,R}{C}$$
R - RMSD
C - alignment coverage
L1 and L2 - lengths of the two sequences
W0 - weighting parameter
W0=1.5 as in Kolodny et al., 2005
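The three geometric measures can be sketched in a few lines. This assumes the Kolodny et al. (2005) forms SI = R min(L1, L2) / C, MI = 1 - (1 + C) / ((1 + R/W0)(1 + min(L1, L2))) and SAS = 100 R / C, with C taken as the number of aligned residues; the function names and example values are illustrative only.

```python
def similarity_index(rmsd, coverage, len1, len2):
    """SI = R * min(L1, L2) / C (lower is better)."""
    return rmsd * min(len1, len2) / coverage


def match_index(rmsd, coverage, len1, len2, w0=1.5):
    """MI = 1 - (1 + C) / ((1 + R / W0) * (1 + min(L1, L2)))."""
    shorter = min(len1, len2)
    return 1.0 - (1.0 + coverage) / ((1.0 + rmsd / w0) * (1.0 + shorter))


def structural_alignment_score(rmsd, coverage):
    """SAS = 100 * R / C (lower is better)."""
    return 100.0 * rmsd / coverage


# Made-up example: RMSD 3.0 over 100 aligned residues, lengths 120 and 150.
si = similarity_index(3.0, 100, 120, 150)
mi = match_index(3.0, 100, 120, 150)
sas = structural_alignment_score(3.0, 100)
```

All three combine RMSD with coverage, so an aligner cannot improve its score just by aligning fewer residues more tightly.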
Residue annotations
The Catalytic Site Atlas annotations (Porter et al., 2004) and annotations
from PDB SITE records were used to produce datasets of functional residues (http://www.ebi.ac.uk/thornton-srv/databases/CSA_NEW/).
Secondary structure assignments and accessibility values were taken from DSSP
(Define Secondary Structure of Proteins) http://swift.cmbi.ru.nl/gv/dssp/
The consistency of the annotations was assessed separately using a chi-square test.
Class I (π-helix) almost always aligns with class H (α-helix). Isolated β-bridges (B)
align mostly with strands (class E) and the remaining non-coil classes align
significantly together, suggesting that at greater distances these regions are
interchangeable.
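A chi-square test of this kind compares observed counts with those expected under independence. The sketch below computes the Pearson statistic for a contingency table. The slides do not say how the table was built, so treat the layout (rows and columns as the DSSP classes of the two aligned residues) and the toy counts as assumptions.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom.

    table is a list of rows of observed counts; every row and column
    total is assumed to be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Expected count if row and column classes were independent.
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Made-up 2x2 counts, not data from the study.
toy_counts = [[10, 20], [20, 10]]
stat, dof = chi_square_statistic(toy_counts)
```

The statistic is then compared with the chi-square distribution for the given degrees of freedom to obtain a p-value.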
Residue annotations
Fold    Description                Mean inconsistency   SD inconsistency   N
b.80    SS R/H Beta Helix          100.00%              0.00%              5
a.118   Alpha/alpha superhelix     86.70%               7.30%              10
a.24    Four helix up/down bundle  80.00%               8.10%              13
b.69    7-bladed beta propellor    58.70%               5.60%              8
a.102   alpha/alpha toroid         57.60%               28.50%             8
b.55    PH domain like-barrel      8.60%                3.30%              10
d.38    Thioesterase               8.00%                2.60%              7
d.131   DNA clamp                  7.90%                0.40%              5
b.34    SH3-like barrel            6.90%                6.20%              11
d.37    CBS-domain pair            6.50%                1.80%              5
Table S2: Most and least consistent domains. The SCOP folds and concomitant names
for the five most and least consistently aligned domains at the highest threshold across all
methods are shown along with the number of neighbours at that level in the dataset.
Assessment of symmetry
Symmetry values for protein structures were derived using the Fourier
transform-based approach described by Taylor et al. (2002)
Inconsistency values per domain were the mean for all methods at the
highest threshold, which had 803 members; domains with fewer than 5
neighbors for TM-align were culled from the set, leaving 207 domains.
RESULTS
Choosing a score for ranking: ROC assessment
Assessment of self-consistency for structural aligners
Determining structural features associated with inconsistencies
Assessment by geometric measures
Choosing a score for ranking: ROC assessment
Mean AUC values for ROC curves derived from each possible score for the methods presented.
CE, DALI, FATCAT (rigid mode) and Fr-TM-align all perform excellently when scored using the approximate TM-score.
MAMMOTH, FATCAT (flexible mode) and SAP all perform less well regardless of score.
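For readers unfamiliar with AUC, the sketch below computes it directly as the probability that a randomly chosen positive pair outscores a randomly chosen negative pair, with ties counting one half. What counts as a positive is an assumption here (for example, a pair of domains sharing a SCOP fold); the slides do not spell it out, and the scores are made up.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- higher means the aligner ranks the pair as more similar
    labels -- truthy for positive pairs, falsy for negative pairs
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores: one positive pair is ranked below one negative pair.
auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

For scores where lower is better, such as RMSD, the values would be negated before calling the function. The quadratic loop is fine for a sketch, but a rank-based formulation would be preferable at the scale of this data set.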
Assessment of self-consistency for structural aligners
Fig 2: Inconsistency of pairwise structural alignments. The proportion of positions failing transitive consistency is shown for all alignment pairs in the relevant fraction of the set.
The methods appear in the order FATCAT-flexible, MAMMOTH, CE, FATCAT, TM-align, DALI, Fr-TM-align, SAP from top to bottom on the left-hand edge of the graph.
Determining structural features associated with inconsistencies
Fig 3: Improved consistency at residues marked functional. Absolute rates of inconsistency are shown for functional residues (solid lines) and all residues (dashed lines) for the three most consistent methods. These appear in the order DALI, Fr-TM-align, SAP from the top downwards along the left-hand edge.
Determining structural features associated with inconsistencies
Fig 4. Relative inconsistencies for DSSP residue classes. Inconsistencies are shown as a percentage of the absolute value for each method. The upper panel shows results for the top 0.01% of alignments, the bottom the top 1.5%.
Determining structural features associated with inconsistencies
Fig 5. Relative inconsistency for three methods in relation to solvent accessibility. Solvent accessibility was split into classes in bins of 20% with 0 being the lowest. Panels are arranged as in Figure 4.
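The 20% binning of solvent accessibility can be sketched as below. It assumes accessibility is already expressed as a relative percentage between 0 and 100 (DSSP itself reports absolute values, so a normalisation step would come first) and that a value of exactly 100% falls into the top class; both choices are assumptions rather than details given in the slides.

```python
def accessibility_class(relative_accessibility, bin_width=20):
    """Map a relative accessibility (0-100 %) to a class index, 0 = most buried."""
    if not 0 <= relative_accessibility <= 100:
        raise ValueError("relative accessibility must lie between 0 and 100")
    n_bins = 100 // bin_width
    # Clamp so that exactly 100 % joins the highest class instead of
    # opening an extra one.
    return min(int(relative_accessibility // bin_width), n_bins - 1)


# Made-up accessibilities for five residues.
classes = [accessibility_class(x) for x in (0, 19.9, 20, 55, 100)]
```

Grouping residues by this class index before counting inconsistent positions gives one relative inconsistency value per accessibility class.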
Determining structural features associated with inconsistencies
Figure S1: symmetry and inconsistency. Mean inconsistency (X-axis) for 233 domains with more than 5 neighbours at the highest level of structural similarity is plotted against the power of the Fourier series as a measure of the internal symmetry of the structure (Y-axis, arbitrary units).
Determining structural features associated with inconsistencies
Fig 6. Relative inconsistency as a function of gap distance. Panels are arranged as in Figure 4.
Assessment by geometric measures
TM-align, FATCAT (flexible), and Fr-TM-align are the best three methods in all cases regardless of the metric used.
SAP and MAMMOTH both rank as worst by all metrics.
Discussion
Even for the most consistent methods the level of inconsistency is very high.
The most significant contributory factor to inconsistent structural alignments is the treatment of gaps.
Another important issue is that optimization of structural similarity is not in all cases the ideal strategy for identifying homology.
Flexible alignment is correctly seen as an important innovation in aligning protein structures; however, our results demonstrate that it is not a panacea.
The least consistently aligned domains are the repeats such as beta-helices and the least consistently aligned elements are generally helices.
Another possibility for improving the results of large-scale pairwise alignments (e.g. in database search or when using large datasets) is to realign significantly similar structures using a consistency criterion.
Thanks for your attention!