large-scale multiple sequence alignment - tandy...
TRANSCRIPT
Large-Scale Multiple Sequence Alignment
Tandy WarnowFounder Professor of Computer Science
The University of Illinois at Urbana-Champaignhttp://tandy.cs.illinois.edu
1kp: Thousand Transcriptome Project
l Plant Tree of Life based on transcriptomes of ~1200 speciesl More than 13,000 gene families (most not single copy)Gene Tree Incongruence
G. Ka-Shu WongU Alberta
N. WickettNorthwestern
J. Leebens-MackU Georgia
N. MatasciiPlant
T. Warnow, S. Mirarab, N. NguyenUIUC UCSD UCSD
Challenge: Alignments and trees on > 100,000 sequences
Plus many many other people…
1000-taxon models, ordered by difficulty (Liu et al., Science 324(5934):1561-1564, 2009)
Re-aligning on a treeA
B D
C
Merge sub-alignments
Estimate ML tree on merged
alignment
Decompose dataset
A B
C D
Align subsets
A B
C D
ABCD
SATé and PASTA Algorithms
Estimate ML tree on new alignment
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
Repeat until termination condition, and
return the alignment/tree pair with the best ML score
1000 taxon models, ordered by difficulty, Liu et al., Science 324(5934):1561-1564, 2009
24 hour SATé-I analysis, on desktop machines
(Similar improvements for biological datasets)
1000-taxon models ranked by difficulty
SATé-2 better than SATé-1
SATé-1 (Liu et al., Science 2009): can analyze up to 8K sequencesSATé-2 (Liu et al., Systematic Biology 2012): can analyze up to ~50K sequences
RNASim
0.00
0.05
0.10
0.15
0.20
10000 50000 100000 200000
Tree
Erro
r (FN
Rat
e) Clustal−OmegaMuscleMafftStarting TreeSATe2PASTAReference Alignment
PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015– Simulated RNASim datasets from 10K to 200K taxa– Limited to 24 hours using 12 CPUs– Not all methods could run (missing bars could not finish)
PASTA: even better than SATé-2
1kp: Thousand Transcriptome Project
l Plant Tree of Life based on transcriptomes of ~1200 speciesl More than 13,000 gene families (most not single copy)Gene Tree Incongruence
G. Ka-Shu WongU Alberta
N. WickettNorthwestern
J. Leebens-MackU Georgia
N. MatasciiPlant
T. Warnow, S. Mirarab, N. NguyenUIUC UCSD UCSD
Challenge: Alignments and trees on > 100,000 sequences
Plus many many other people…
Length
Counts
0
2000
4000
6000
8000
10000
12000 Mean:317Median:266
0 500 1000 1500 2000
1KP dataset: more than 100,000 p450 amino-acidsequences, many fragmentary
Length
Counts
0
2000
4000
6000
8000
10000
12000 Mean:317Median:266
0 500 1000 1500 2000
1KP dataset: more than 100,000 p450 amino-acidsequences, many fragmentary
All standard multiplesequence alignmentmethods we tested performed poorly ondatasets with fragments.
UPPUPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”
Nguyen, Mirarab, and Warnow. Genome Biology, 2014.
Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.
UPPUPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”
Nguyen, Mirarab, and Warnow. Genome Biology, 2014.
Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.
Uses an ensemble of HMMs
Simple idea (not UPP)
• Select random subset of sequences, and build “backbone alignment”
• Construct a Hidden Markov Model (HMM) on the backbone alignment
• Add all remaining sequences to the backbone alignment using the HMM
One Hidden Markov Model for the entire alignment?
One Hidden Markov Model for the backbone alignment?
HMM 1
Or 2 HMMs?
HMM 1
HMM 2
HMM 1
HMM 3 HMM 4
HMM 2
Or 4 HMMs?
m
HMM 2
HMM 3
HMM 1
HMM 4
HMM 5 HMM 6
HMM 7
Or all 7 HMMs?
UPP Algorithmic Approach
1. Select random subset of full-length sequences, and build “backbone alignment” with PASTA
2. Construct an “Ensemble of Hidden Markov Models” on the backbone alignment
3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
Evaluation• Simulated datasets (some have fragmentary sequences):– 10K to 1,000,000 sequences in RNASim – complex RNA
sequence evolution simulation– 1000-sequence nucleotide datasets from SATé papers– 5000-sequence AA datasets (from FastTree paper)– 10,000-sequence Indelible nucleotide simulation
• Biological datasets:– Proteins: largest BaliBASE and HomFam– RNA: 3 CRW datasets up to 28,000 sequences
RNASim Million Sequences: alignmenterror
Notes:• We show alignment error
using average of SP-FN and SP-FP.
• UPP variants havebetter alignment scores than PASTA.
• (Not shown: Total Column Scores – PASTA more accurate than UPP)
• No other methods tested could complete on these data
RNASim Million Sequences: tree error
Using 12 processors:
• UPP(Fast,NoDecomp) took 2.2 days,
• UPP(Fast) took 11.9 days, and
• PASTA took 10.3 days
0.0
0.2
0.4
0.6
0 12.5 25 50% Fragmentary
Mea
n al
ignm
ent e
rror
PASTA UPP(Default)
(a) Average alignment error
0.0
0.2
0.4
0 12.5 25 50% Fragmentary
Del
ta F
N tr
ee e
rror
PASTA UPP(Default)
(b) Average tree error
Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2datasets.
80
Performance on fragmentary datasets of the 1000M2 model condition
UPP is very robust to fragmentary sequencesUnder high rates of evolution,PASTA is badly impactedby fragmentary sequences (the same is true for other methods).
UPP continues to have goodaccuracy even on datasetswith many fragments underall rates of evolution.
●
●
●
●
0
5
10
15
50000 100000 150000 200000Number of sequences
Wal
l clo
ck a
lign
time
(hr)
● UPP(Fast)
UPP Running Time
Wall-clock time used (in hours) given 12 processors
What about BAli-Phy?
•BAli-Phy (Redelings and Suchard): leading method forstatistical co-estimation of alignments and trees:
•Like Bayesian phylogeny estimation, it is expectedto be the most rigorous and accurate techniquefor estimating trees and alignments!
BAli-Phy: Better than PASTA!
200 200
Simulator
100
Indelible (DNA)
100
RNAsim(RNA)
40%
30%
20%
10%
0%
#Taxa:
Tota
l-Col
umn
Scor
e
Alignment Accuracy (TC score)MAFFT
PASTA
BAli-Phy
*Averages over 10replicates
Simulated datasets with 100 or 200sequences.
But: BAli-Phy is limited to smalldatasets
From www.bali-phy.org/README.html, 5.2.1. Too manytaxa?
“BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the *me required to accumulate enough MCMC samples. (Despite this recommendation, data sets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa aspossible from your data set, then adding some back if the MCMC is not tooslow.”
Re-aligning on a treeA
B D
C
Merge sub-alignments
Estimate ML tree on merged
alignment
Decompose dataset
A B
C D
Align subsets: MAFFT
A B
C D
ABCD
Re-aligning on a treeA
B D
C
Merge sub-alignments
Estimate ML tree on merged
alignment
Decompose dataset
A B
C D
Align subsets: BAli-Phy??
A B
C D
ABCD
Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy
Comparing default PASTA to PASTA+BAli-Phyon simulated datasets with 1000 sequences
PASTA+BAli−Phy Better
PASTA Better
0.0
0.1
0.2
0.3
0.4
0.0 0.1 0.2 0.3 0.4PASTA
PAST
A+BA
li−Ph
y
Total Column Score
dataIndelible M2RNAsimRose L1Rose M1Rose S1
PASTA+BAli−Phy Better
PASTA Better
0.6
0.7
0.8
0.9
1.0
0.6 0.7 0.8 0.9 1.0PASTA
PAST
A+BA
li−Ph
y
Recall (SP−Score)
PASTA+BAli−Phy Better
PASTA Better
0.00
0.05
0.10
0.15
0.00 0.05 0.10 0.15PASTA+BAli−Phy
PAST
A
Tree Error: Delta RF (RAxML)
�
�
��
�
��� �
�
PASTA+BAli−Phy Better
PASTABetter
0.000
0.025
0.050
0.075
0.100
0.000 0.025 0.075 0.1000.050PASTA
PAST
A+BA
li−Ph
y
Total Column Score
data� Indelible M2
RNAsim
� �
��
�
�
�
�
� �
PASTA+BAli−Phy Better
PASTABetter
0.900
0.925
0.950
0.975
1.000
0.900 0.925 0.975 1.0000.950PASTA
PAST
A+BA
li−Ph
y
Recall (SP−Score)
�
�
��
�
�
�
��
�rPASTA+BAli−Phy Bette
PASTABetter
0.000
0.005
0.010
0.015
0.000 0.0150.005 0.010PASTA+BAli−Phy
PAST
A
Tree Error: Delta RF (FastTree−2)
Results on 10,000-sequence datasets, backbone size1000
Comparing UPP variants where the backbone alignment is computed using either default PASTA orPASTA+BAli-Phy
Summary• Large-scale multiple sequence alignment (MSA) is achievable
with good accuracy using divide-and-conquer plus iteration, and enable preferred MSA methods to be used on very large datasets
• PASTA and UPP can provide good accuracy on large (million-sequence) datasets with high heterogeneity
• Fragmentary sequences present additional challenges that UPP can address
• PASTA and UPP at https://github.com/smirarab/
• PASTA+BAli-Phy at http://github.com/MGNute/pasta
Acknowledgments
PASTA and UPP: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD), undergrad: Keerthana Kumar (at UT-Austin)PASTA+BAli-Phy: Mike Nute (PhD student at UIUC)
Current NSF grants: ABI-1458652 (multiple sequence alignment)Grainger Foundation (at UIUC), and UIUCTACC, UTCS, Blue Waters, and UIUC campus cluster
PASTA, UPP, SEPP, and TIPP are available on github at https://github.com/smirarab/; see also PASTA+BAli-Phy at http://github.com/MGNute/pasta