SuperFine, Enabling Large-Scale Phylogenetic Estimation
![Page 1: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/1.jpg)
SuperFine, Enabling Large-Scale Phylogenetic Estimation
Shel Swenson
University of Southern California and Georgia Institute of Technology
![Page 2: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/2.jpg)
Phylogeny (evolutionary tree)
Leaves: Orangutan, Gorilla, Chimpanzee, Human
(1-3) From the Tree of Life Website, University of Arizona
“Nothing in Biology makes sense except in the light of evolution” – Dobzhansky
![Page 3: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/3.jpg)
Tree of Life, Importance to Biology
• Biomedical applications
• Mechanisms of evolution
• Tracking ancient migrations
• Protein structure and function
• Drug design
Image credits: 1) Nature Reviews (Genetics); 2) Howard Hughes Medical Institute (BioInteractive); 3) 1000 Genomes Project
(Figure label: “We are here”)
![Page 4: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/4.jpg)
DNA sequence evolution (idealized)
-3 million yrs: AAGACTT
-2 million yrs: AAGGCCT, TGGACTT
-1 million yrs: AGGGCAT, TAGCCCT, AGCACTT
today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
![Page 5: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/5.jpg)
Phylogeny Problem
Input: five sequences (AGATTA, AGACTA, TGGACA, TGCGACT, AGGTCA), one for each of the taxa U, V, W, X, Y
Output: a tree with leaves U, V, W, X, Y
![Page 6: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/6.jpg)
Two basic approaches for tree estimation on multi-gene datasets
• Apply phylogeny estimation methods to concatenated (“combined”) sequence alignments for different genes
• Compute trees on individual genes and apply a supertree method
This Talk: SuperFine boosts supertree methods, enabling faster, more accurate estimation for large-scale problems
![Page 7: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/7.jpg)
Using multiple genes
gene 1
S1 TCTAATGGAA
S2 GCTAAGGGAA
S3 TCTAAGGGAA
S4 TCTAACGGAA
S7 TCTAATGGAC
S8 TATAACGGAA
gene 2
S4 GGTAACCCTC
S5 GCTAAACCTC
S6 GGTGACCATC
S7 GCTAAACCTC
gene 3
S1 TATTGATACA
S3 TCTTGATACC
S4 TAGTGATGCA
S7 CATTCATACC
S8 TAGTGATGCA
![Page 8: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/8.jpg)
Concatenation
Taxon | gene 1 | gene 2 | gene 3
S1 | TCTAATGGAA | ?????????? | TATTGATACA
S2 | GCTAAGGGAA | ?????????? | ??????????
S3 | TCTAAGGGAA | ?????????? | TCTTGATACC
S4 | TCTAACGGAA | GGTAACCCTC | TAGTGATGCA
S5 | ?????????? | GCTAAACCTC | ??????????
S6 | ?????????? | GGTGACCATC | ??????????
S7 | TCTAATGGAC | GCTAAACCTC | CATTCATACC
S8 | TATAACGGAA | ?????????? | TAGTGATGCA
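The concatenation step can be sketched in a few lines of Python. This is an illustrative sketch rather than the speaker's code: the function name `concatenate` is hypothetical, and the pairing of taxa with sequences assumes the order in which they are listed on the "Using multiple genes" slide. Taxa absent from a gene are padded with `?`.

```python
# Hypothetical helper: build a concatenated "supermatrix" from per-gene alignments.
def concatenate(genes, taxa, missing="?"):
    """Return {taxon: concatenated sequence}, padding absent genes with `missing`."""
    rows = {}
    for taxon in taxa:
        parts = []
        for alignment in genes:
            # All sequences within one gene alignment share the same length.
            length = len(next(iter(alignment.values())))
            parts.append(alignment.get(taxon, missing * length))
        rows[taxon] = "".join(parts)
    return rows

# Taxon-to-sequence pairing assumed from the listing order on the slides.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC",
         "S6": "GGTGACCATC", "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "CATTCATACC", "S8": "TAGTGATGCA"}

taxa = [f"S{i}" for i in range(1, 9)]
supermatrix = concatenate([gene1, gene2, gene3], taxa)
for taxon in taxa:
    print(taxon, supermatrix[taxon])
```

Only S4 and S7 have data for all three genes; every other row carries at least ten `?` characters, which is the missing-data problem that motivates supertree methods.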
![Page 9: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/9.jpg)
Two competing approaches
Input: gene 1, gene 2, . . ., gene k (rows: species)
• Analyze separately, then apply a Supertree Method
• Concatenation
![Page 10: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/10.jpg)
Why use supertree methods?
• Missing data
• Large dataset sizes
• Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry)
• Unavailable sequence data (only trees)
![Page 11: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/11.jpg)
Many Supertree Methods
• MRP
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...
Matrix Representation with Parsimony (MRP): most commonly used and among the most accurate
![Page 12: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/12.jpg)
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
(Figure: example tree comparison with one FN edge and one FP edge marked; 50% error rate)
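As a rough illustration of how these rates can be computed, the sketch below represents each tree by its set of non-trivial bipartitions. The function names and the five-taxon example trees are hypothetical, and the normalization (FN over the number of true edges, FP over the number of estimated edges) is one common convention, assumed here rather than taken from the slides.

```python
def bipartition(side, taxa):
    """Orientation-free encoding of the split `side | taxa - side`."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})

def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate); both inputs are assumed to be non-empty sets."""
    fn = len(true_splits - est_splits) / len(true_splits)  # true edges that are missing
    fp = len(est_splits - true_splits) / len(est_splits)   # estimated edges that are wrong
    return fn, fp

# Hypothetical example on five taxa: each binary tree has two internal edges.
taxa = "abcde"
true_tree = {bipartition("ab", taxa), bipartition("abc", taxa)}
estimated = {bipartition("ab", taxa), bipartition("abd", taxa)}

fn_rate, fp_rate = error_rates(true_tree, estimated)
print(fn_rate, fp_rate)  # 0.5 0.5
```

The estimated tree recovers ab|cde but replaces abc|de with abd|ce, so one of two true edges is missing and one of two estimated edges is incorrect.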
![Page 13: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/13.jpg)
FN rate: MRP vs. Concatenation
(Plot: FN Rate (%) vs. Scaffold Density (%); series: MRP, Concatenation)
Concatenation is not always an option. We need better supertree methods.
![Page 14: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/14.jpg)
FN Rate: SuperFine vs. MRP and Concatenation
(Plot: FN Rate (%) vs. Scaffold Density (%); series: MRP, SuperFine, Concatenation)
![Page 15: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/15.jpg)
Running Time: SuperFine vs. MRP
(Concatenation is much slower)
MRP 8-12 sec.; SuperFine 2-3 sec.
(Plots, three panels: running time in minutes vs. Scaffold Density (%); series: MRP, SuperFine)
![Page 16: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/16.jpg)
Idea behind SuperFine
1. Construct a supertree with low false positive rate
2. Reduce false negatives by resolving areas of uncertainty using a supertree method (e.g., Quartet Max Cut)
(Swenson et al., Systematic Biology, 2011)
![Page 17: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/17.jpg)
Bipartitions and refinement
Let B(T) denote the set of (non-trivial) bipartitions induced by the edges of T.
T refines T’ (T’ ≤ T) if B(T) ⊇ B(T’)
(Figure: trees T and T’ on leaves a, b, c, d, e, f; T’ contains a polytomy, and T is a refinement of T’)
T: B(T) = {ab|cdef, abc|def, abcd|ef}
T’: B(T’) = {ab|cdef, abc|def}
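The refinement relation amounts to a subset test on bipartition sets. The sketch below checks it on the bipartition sets given for T and T’ on this slide. The helper names are hypothetical, and the parser assumes single-letter leaf names.

```python
def parse_split(text):
    """Parse 'ab|cdef' into an orientation-free pair of leaf sets (single-letter leaves)."""
    left, right = text.split("|")
    return frozenset({frozenset(left), frozenset(right)})

def refines(b_t, b_t_prime):
    """T refines T' exactly when every bipartition of T' also appears in T."""
    return b_t_prime <= b_t

B_T = {parse_split(s) for s in ["ab|cdef", "abc|def", "abcd|ef"]}
B_T_prime = {parse_split(s) for s in ["ab|cdef", "abc|def"]}

print(refines(B_T, B_T_prime))   # True: T refines T'
print(refines(B_T_prime, B_T))   # False: T' lacks abcd|ef
extra = B_T - B_T_prime          # the edge that resolves the polytomy in T'
```

Here `extra` holds the single bipartition abcd|ef, which is the edge that T adds to resolve the polytomy of T’.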
![Page 18: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/18.jpg)
Idea behind SuperFine
1. Construct a supertree with low FP using the Strict Consensus Merger (SCM) (Huson et al. 1999)
2. Reduce FN by resolving each polytomy using a supertree method (e.g., Quartet Max Cut)
![Page 19: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/19.jpg)
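Strict Consensus Merger (SCM)
(Figure: two source trees, on leaf sets {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, are merged through their shared leaves a, b, c, d into a single tree on all ten leaves)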
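Only the first stage of a pairwise merge is sketched here: restrict both source trees to their shared leaves and keep the bipartitions they agree on (the strict consensus on the overlap). The later stages of SCM, which reattach the remaining leaves, are not shown. The tree topologies below are hypothetical, since the figure's exact shapes are not recoverable from the transcript, and all helper names are illustrative.

```python
def parse_split(text):
    """Parse 'ab|cdefg' into an orientation-free pair of leaf sets (single-letter leaves)."""
    left, right = text.split("|")
    return frozenset({frozenset(left), frozenset(right)})

def leaves(splits):
    """Leaf set of a tree, read off any one of its bipartitions."""
    return frozenset().union(*next(iter(splits)))

def restrict(splits, keep):
    """Restrict each bipartition to `keep`, dropping those that become trivial."""
    out = set()
    for split in splits:
        sides = [side & keep for side in split]
        if all(len(side) >= 2 for side in sides):
            out.add(frozenset(sides))
    return out

def strict_consensus_on_shared(splits_1, splits_2):
    """Return the shared leaves and the bipartitions both trees induce on them."""
    shared = leaves(splits_1) & leaves(splits_2)
    return shared, restrict(splits_1, shared) & restrict(splits_2, shared)

# Hypothetical caterpillar topologies on the leaf sets used in the figure.
tree_1 = {parse_split(s) for s in ["ab|cdefg", "abc|defg", "abcd|efg", "abcde|fg"]}
tree_2 = {parse_split(s) for s in ["ab|cdhij", "abc|dhij", "abcd|hij", "abcdh|ij"]}
tree_3 = {parse_split(s) for s in ["ac|bdhij", "abc|dhij", "abcd|hij", "abcdh|ij"]}

shared, agree = strict_consensus_on_shared(tree_1, tree_2)
_, conflict = strict_consensus_on_shared(tree_1, tree_3)
print(sorted(shared), len(agree), len(conflict))
```

When the two trees agree on the overlap (`tree_1` and `tree_2`), the edge ab|cd survives; when they disagree (`tree_1` and `tree_3`), nothing survives and the overlap collapses to a polytomy. That conservative behavior is the source of SCM's low FP and high FN rates.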
![Page 20: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/20.jpg)
Property of SCM: Bipartitions in SCM tree correspond to bipartitions in the source trees
(Figure: the two source trees on {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, and their SCM tree)
Swenson, Ph.D. Thesis, 2009
![Page 21: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/21.jpg)
Performance of SCM
• Low false positive (FP) rate (Estimated supertree has few false edges)
• High false negative (FN) rate (Estimated supertree is missing many true edges)
• Runs in polynomial time (in the number of source trees and total number of species)
![Page 22: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/22.jpg)
Idea behind SuperFine
1. Construct a supertree with low FP using SCM
2. Refine the tree to reduce FN by resolving each polytomy using a supertree method (e.g., MRP or Quartet Max Cut)
![Page 23: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/23.jpg)
Resolving a single polytomy, v
• Step 1: Reduce each source tree to a tree on {1,2,...,d}, where d=degree(v)
• Step 2: Apply MRP to the collection of reduced trees, to produce a tree t on leafset {1,2,...,d}
• Step 3: Replace the star tree at v by tree t
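The relabeling part of Step 1 can be sketched as follows: removing the polytomy vertex v from the SCM tree leaves d components, and every leaf is given the number of its component. The edge list below is a hypothetical SCM tree whose component labels follow one reading of the example figure in this talk. The internal structure of component 1 is invented for illustration, and the subsequent collapse of each relabeled source tree to one leaf per label is not shown.

```python
from collections import defaultdict

def component_labels(adj, v):
    """Map each leaf to the index (1..d) of its component in the tree minus v."""
    labels = {}
    for index, start in enumerate(adj[v], start=1):
        stack, seen = [start], {v, start}
        while stack:
            node = stack.pop()
            if len(adj[node]) == 1:          # degree-1 vertices are leaves
                labels[node] = index
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return labels

# Hypothetical SCM tree: v is a polytomy of degree 6; u1, w, x, u3 are internal nodes.
edges = [("v", "u1"), ("u1", "w"), ("w", "a"), ("w", "b"),
         ("u1", "x"), ("x", "c"), ("x", "e"),
         ("v", "h"), ("v", "u3"), ("u3", "i"), ("u3", "j"),
         ("v", "d"), ("v", "g"), ("v", "f")]
adj = defaultdict(list)
for p, q in edges:
    adj[p].append(q)
    adj[q].append(p)

labels = component_labels(adj, "v")
source_1 = [labels[leaf] for leaf in "abcdefg"]   # [1, 1, 1, 4, 1, 6, 5]
source_2 = [labels[leaf] for leaf in "abcdhij"]   # [1, 1, 1, 4, 2, 3, 3]
print(sorted(set(source_1)), sorted(set(source_2)))
```

After relabeling, the first source tree uses only labels {1, 4, 5, 6} and the second only {1, 2, 3, 4}, so MRP in Step 2 works on trees with four leaves each instead of seven.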
![Page 24: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/24.jpg)
Back to Our Example
(Figure: the SCM tree with the six subtrees around its polytomy numbered 1-6, and the two source trees with each leaf relabeled by the number of the subtree that contains it)
![Page 25: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/25.jpg)
Where We Use the Property
(Figure: the SCM tree, the two source trees, and the reduced trees on labels {1, 4, 5, 6} and {1, 2, 3, 4})
![Page 26: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/26.jpg)
Step 1: Reduce each source tree to a tree on the set {1,2,...,d}
(Figure: the two source trees and their reduced trees on labels {1, 4, 5, 6} and {1, 2, 3, 4})
![Page 27: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/27.jpg)
Step 2: Apply MRP to the collection of reduced trees
(Figure: MRP applied to the reduced trees on {1, 2, 3, 4} and {1, 4, 5, 6} yields a tree on {1, 2, 3, 4, 5, 6})
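The matrix that MRP analyzes can be sketched as below: each bipartition of each source tree becomes one column, with `1` for taxa on one side, `0` for taxa on the other side, and `?` for taxa absent from that tree. The function name and the two reduced-tree topologies (12|34 and 14|56) are assumptions for illustration. The parsimony search itself, normally delegated to external software, is not shown.

```python
def mrp_matrix(source_trees, taxa):
    """Encode source-tree bipartitions as columns of a 0/1/? character matrix."""
    rows = {taxon: "" for taxon in taxa}
    for splits in source_trees:
        for one_side, other_side in splits:
            for taxon in taxa:
                if taxon in one_side:
                    rows[taxon] += "1"
                elif taxon in other_side:
                    rows[taxon] += "0"
                else:
                    rows[taxon] += "?"      # taxon does not occur in this source tree
    return rows

# Hypothetical reduced trees: a four-leaf tree has a single non-trivial bipartition.
reduced_1 = [({1, 2}, {3, 4})]
reduced_2 = [({1, 4}, {5, 6})]

matrix = mrp_matrix([reduced_1, reduced_2], taxa=range(1, 7))
for taxon, row in matrix.items():
    print(taxon, row)
```

With only six taxa and two columns, this matrix is far smaller than the one built from the full source trees, which is where the speed-up of SuperFine+MRP over plain MRP comes from.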
![Page 28: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/28.jpg)
Replace polytomy using tree from MRP
(Figure: the MRP tree on {1, 2, 3, 4, 5, 6} replaces the star at the polytomy of the SCM tree, giving a more resolved supertree on leaves a through j)
![Page 29: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/29.jpg)
FN Rate: SuperFine vs. MRP and Concatenation
(Plot: FN Rate (%) vs. Scaffold Density (%); series: MRP, SuperFine, Concatenation)
![Page 30: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/30.jpg)
Running Time: SuperFine vs. MRP
(Concatenation is much slower)
MRP 8-12 sec.; SuperFine 2-3 sec.
(Plots, three panels: running time in minutes vs. Scaffold Density (%); series: MRP, SuperFine)
![Page 31: SuperFine , Enabling Large -Scale Phylogenetic Estimation](https://reader036.vdocuments.site/reader036/viewer/2022062520/56816385550346895dd46fa8/html5/thumbnails/31.jpg)
SuperFine: Boosting supertree methods
• SuperFine+MRP vs. MRP (Swenson et al. 2011)
– SuperFine combines the features of the SCM method (polynomial time, low false positive rates) with the lower false negative rate of MRP, to achieve greater accuracy in less time.
– Speed-up results from the re-encoding of source trees as smaller trees.
• SuperFine+QMC vs. QMC (quartet-based)
– QMC (Snir 2008), polynomial time, but infeasible for 500+ taxa
– SuperFine+QMC, runs where QMC cannot (Swenson et al. 2010)
• SuperFine+MRL vs. MRL (likelihood) (Nguyen et al. 2012)
– SuperFine+MRL, faster and more accurate, similar likelihood scores
DACTAL (Nelesen et al. 2012): boosting concatenation methods; uses SuperFine in its divide-and-conquer strategy