bootstrap methods in phylogenetics
TRANSCRIPT
![Page 1: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/1.jpg)
Bootstrap Methods in Phylogenetics
Emmanuel Paradis
Institut Pertanian BogorDecember 12, 2011
![Page 2: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/2.jpg)
Uncertainty in data analyses is crucial.
Simple examples:
1. Flipping a coin 10 times: 4 heads; 10 other times: 7 headsEstimates: p̂ = 0.4; p̂ = 0.7
![Page 3: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/3.jpg)
Uncertainty in data analyses is crucial.
Simple examples:
1. Flipping a coin 10 times: 4 heads; 10 other times: 7 headsEstimates: p̂ = 0.4; p̂ = 0.7
2. Simulating random normal variates with R:
> mean(rnorm(100))
[1] 0.01393119
> mean(rnorm(100))
[1] 0.1267855
> mean(rnorm(100))
[1] -0.09425586
![Page 4: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/4.jpg)
Some more vital examples:
1. Assessing the effects of a new medicine from a limited number of clinical trials.
2. Assessing the effects of mining, logging, pesticides, etc, on natural popula-tions of animals and plants.
3. Predicting the outcome of an epidemic (H1N1 in Europe in 2009).
![Page 5: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/5.jpg)
Two theories of uncertainties:
ä The Theorem of the Central Limit (TCL): the estimator θ̂ of a parameter θfollows a normal distribution.Example of the flipping coin: p̂ ∼ N (0.5, σ2
n)
ä Maximum Likelihood (ML): more general.
![Page 6: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/6.jpg)
0.5 1.0 1.5 2.0
−26
−24
−22
−20
−18
−16lo
gL1.92
λ̂
Confidence interval with profile likelihood
95% CI
![Page 7: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/7.jpg)
Principle of the bootstrap:
Resampling with BootstrapData replacement Bootstrap samples statistic
2 4 7 8 1 5 −→ 7 5 1 2 2 2 x̄1∗1 5 7 4 4 1 x̄2∗4 2 7 8 4 7 x̄3∗
. . .
. . . ...x̄1000∗
![Page 8: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/8.jpg)
Principle of the bootstrap:
Resampling with BootstrapData replacement Bootstrap samples statistic
2 4 7 8 1 5 −→ 7 5 1 2 2 2 x̄1∗1 5 7 4 4 1 x̄2∗4 2 7 8 4 7 x̄3∗
. . .
. . . ...x̄1000∗
µ̂ = 1n
∑ni=1 xi = 2+4+7+8+1+5
6 = 4.5
SEBOOT(µ̂) =√
var(x̄∗)
![Page 9: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/9.jpg)
“Manual” bootstrap:
> sd(replicate(99999, mean(sample(x, replace = TRUE))))
[1] 1.020867
![Page 10: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/10.jpg)
“Manual” bootstrap:
> sd(replicate(99999, mean(sample(x, replace = TRUE))))
[1] 1.020867
With the package boot in R:
> library(boot)
> f <- function(x, w) mean(x * w)
> boot(x, f, R = 99999, stype = "f")
ORDINARY NONPARAMETRIC BOOTSTRAP
![Page 11: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/11.jpg)
Call:
boot(data = x, statistic = f, R = 99999, stype = "f")
Bootstrap Statistics :
original bias std. error
t1* 4.5 0.0006200062 1.021678
![Page 12: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/12.jpg)
Call:
boot(data = x, statistic = f, R = 99999, stype = "f")
Bootstrap Statistics :
original bias std. error
t1* 4.5 0.0006200062 1.021678
Under the TCL: SE(µ̂) =√
var(x)/n
> sqrt(var(x)/length(x))
[1] 1.118034
![Page 13: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/13.jpg)
Why the bootstrap?
ä Small samples: TCL or ML are not accurate.
ä Situations with non-continuous parameters: tree/clustering/phylogeny.
![Page 14: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/14.jpg)
C
B
A
C
B
A
body size
metabolism
etc.
History Data
Evolution
Phylogenetic inference
Methods of phylogenetic inference:
ä Parsimony: based on the principle of minimum evolution.
ä Distance: minimise the discrepancy between the observed distances andthose inferred from a tree.
![Page 15: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/15.jpg)
ä Maximum Likelihood: Assuming a model of evolution of the characters, max-imise the likelihood of the character evolution along the tree.
ä Bayesian: id. but maximise the “posterior” probabilities.
These methods help to answer the first fundamental question: which tree?
C
B
A
B
C
A
A
B
C
The second fundamental question is: how uncertain is this result?
![Page 16: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/16.jpg)
1 2 3 4 5 6
C
B
A
A G C T
C
B
A
1 2 2 1 1 4
Bootstrap samples
Resampling of columns WITH replacement
C
B
A
6 5 6 2 5 4
C
B
A
Bootstrap trees
A
B
C
![Page 17: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/17.jpg)
What statistic to compute with the bootstrap trees?
Each branch of an unrooted tree defines a split.
A
BC
D This tree has n = 4 tips (or leaves) and thus defines 6 splits:
ä The internal branche (edge) defines one non-trivial split: AB|CD
ä The terminal branches define four trivial splits: A|BCD, B|ACD, C|ABD, D|ABC
ä One additional trivial split is defined by the empty set: ∅|ABCD
![Page 18: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/18.jpg)
An unrooted tree with n tips has n−3 internal branches, and thus defines 2n−2
splits: n− 3 non-trivial and n+ 1 trivial.
![Page 19: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/19.jpg)
An unrooted tree with n tips has n−3 internal branches, and thus defines 2n−2
splits: n− 3 non-trivial and n+ 1 trivial.
The number of possible splits grows exponentially with n: 2n−1
4 tips → 8 possible splits10 51250 1014
100 1029
![Page 20: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/20.jpg)
An unrooted tree with n tips has n−3 internal branches, and thus defines 2n−2
splits: n− 3 non-trivial and n+ 1 trivial.
The number of possible splits grows exponentially with n: 2n−1
4 tips → 8 possible splits10 51250 1014
100 1029
The bootstrap statistic in phylogenetics considers the n − 3 non-trivial splits andcounts how many times they appear in the bootstrap trees. These are the boot-strap proportions (BP) and should be interpreted as measures of confidence inthe estimated tree (not as probabilities).
![Page 21: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/21.jpg)
The result can be represented graphically with the BP on the internal branch ofthe estimated tree—usually close to the node defining an MRCA after rooting thetree.
A
BC
D
0.99A
B
C
D0.99
![Page 22: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/22.jpg)
The result can be represented graphically with the BP on the internal branch ofthe estimated tree—usually close to the node defining an MRCA after rooting thetree.
A
BC
D
0.99A
B
C
D0.99
> library(ape)
> data(woodmouse)
> d <- dist.dna(woodmouse)
> tw <- nj(d)
> tw
![Page 23: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/23.jpg)
Phylogenetic tree with 15 tips and 13 internal nodes.
Tip labels:
No305, No304, No306, No0906S, No0908S, No0909S, ...
Unrooted; includes branch lengths.
> is.rooted(tw)
[1] FALSE
> f <- function(x) nj(dist.dna(x))
> BP <- boot.phylo(tw, woodmouse, f)
|==================================================| 100%
> BP
[1] 100 51 56 59 67 49 69 74 85 95 86 100 61
![Page 24: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/24.jpg)
An alternative (though less elegant) is to create the function “on-the-fly”:
boot.phylo(tw, woodmouse, function(x) nj(dist.dna(x)))
boot.phylo(phy, x, FUN, B = 100, block = 1, trees = FALSE,
quiet = FALSE, rooted = FALSE)
> plot(tw)
> nodelabels(BP, adj = 1)
![Page 25: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/25.jpg)
No305
No304
No306
No0906S
No0908S
No0909S
No0910S
No0912S
No0913S
No1103S
No1007S
No1114S
No1202S
No1206S
No1208S
100
53
64
60
6448
75
64
89
94
86
9854
See ?nodelabels for all details on how to customise the appearance of the labels(see the examples).
What if the method of phylogenetic inference returns a rooted tree?
![Page 26: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/26.jpg)
> fr <- function(x) root(nj(dist.dna(x)), "No1114S")
> twr <- root(tw, "No1114S")
> boot.phylo(twr, woodmouse, fr, rooted = TRUE)
[1] 100 88 65 93 100 56 77 69 57 43 61 65 81
In this case boot.phylo counts clades instead of splits.
It is crucial that the tree is estimated with the same method than used for thebootstrap.
![Page 27: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/27.jpg)
So far we have used a neighbour–joining (NJ) method with distances calculatedwith Kimura’s two-parameter model (K80). What about other methods? For in-stance, BIONJ method with Tamura’s (1993) distance:
f <- function(x) bionj(dist.dna(x, "T93"))
Euclidean distance and UPGMA (requires the package phangorn):
f <- function(x) upgma(dist(x))
Maximum likelihood with molecular sequences:
> library(phangorn)
> o <- optim.pml(pml(tw, as.phyDat(woodmouse)))
> ctr <- pml.control(trace = 0)
![Page 28: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/28.jpg)
> TR <- bootstrap.pml(o, control = ctr)
> TR
100 phylogenetic trees
The bootstrap proportions can then be calculated with prop.clades.
![Page 29: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/29.jpg)
The bootstrap proportions refer only to the splits observed in the estimated tree.What about the other splits?
> res <- boot.phylo(tw, woodmouse, f, trees = TRUE, quiet = TRUE)
> res
$BP
[1] 100 49 50 52 63 46 73 61 86 93 90 100 62
$trees
100 phylogenetic trees
> pp <- prop.part(res$trees)
> plot(pp)
![Page 30: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/30.jpg)
040
80
Fre
quen
cy
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
No305
No304
No306
No0906S
No0908S
No0909S
No0910S
No0912S
No0913S
No1103S
No1007S
No1114S
No1202S
No1206S
No1208S
![Page 31: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/31.jpg)
ä support: the observed frequency of the split;
ä conflict: the sum of the support values of the splits that are not compatiblewith the considered one.
Two splits are not compatible if they cannot be observed in the same tree (e.g.,AB|CD and A|BCD are compatible, while AB|CD and AC|BD are not).
![Page 32: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/32.jpg)
Lento plot
−60
0−
400
−20
00
No305No304No306No0906SNo0908SNo0909SNo0910SNo0912SNo0913SNo1103SNo1007SNo1114SNo1202SNo1206SNo1208S
![Page 33: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/33.jpg)
1
2
3
4
1
3
2
4
1
4
2
3
![Page 34: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/34.jpg)
1
2
3
4
1
3
2
4
1
4
2
3
1
2
3
4
![Page 35: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/35.jpg)
1
2
3
4
1
3
2
4
1
4
2
3
1
4
3
2
![Page 36: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/36.jpg)
1
2
3
4
1
3
2
4
1
4
2
3
1
4
3
2
1
2
3
4 1
2
3
4
![Page 37: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/37.jpg)
Consensus network:
> CN <- consensusNet(res$trees)
> plot(CN, "2")
![Page 38: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/38.jpg)
No305
No304No306
No0906S No0908S
No0909S
No0910S
No0912S
No0913S
No1103SNo1007S
No1114S
No1202S
No1206S
No1208S
![Page 39: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/39.jpg)
The bootstrap is used to assess uncertainty (or confidence) of a single phylogeny.
How to compare different phylogenies?
![Page 40: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/40.jpg)
The bootstrap is used to assess uncertainty (or confidence) of a single phylogeny.
How to compare different phylogenies?
A special test based on the bootstrap, the Shimodaira–Hasegawa test, must beused. The principle is to perform a statistical comparison based on resampling thelikelihood of the trees.
![Page 41: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/41.jpg)
The bootstrap is used to assess uncertainty (or confidence) of a single phylogeny.
How to compare different phylogenies?
A special test based on the bootstrap, the Shimodaira–Hasegawa test, must beused. The principle is to perform a statistical comparison based on resampling thelikelihood of the trees.
H0: the differences in likelihood are not statistically differentH1: the tree with the largest likelihood is statistically better
tr2 <- bionj(d)
X <- as.phyDat(woodmouse)
fit0 <- optim.pml(pml(tw, X))
![Page 42: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/42.jpg)
fit2 <- optim.pml(pml(tr2, X))
SH.test(fit0, fit2)
See ?SH.test for another example.
![Page 43: Bootstrap Methods in Phylogenetics](https://reader030.vdocuments.site/reader030/viewer/2022013001/61cc02d44c044a4d4a79d925/html5/thumbnails/43.jpg)
Summary:
ä The bootstrap is a very general method to assess confidence in statisticalestimation.
ä Bootstrap proportions assess confidence in phylogeny estimation by countingsplits.
ä Complementary methods (Lento plot, consensus networks) help to explorealternative splits not observed in the estimated tree.
ä The Shimodaira–Hasegawa test must be used to test statistically differenttrees with the bootstrap.