Articles https://doi.org/10.1038/s41559-020-1239-x

Phylogenetic analyses with systematic taxon sampling show that mitochondria branch within Alphaproteobacteria

Lu Fan¹,²,³,⁷ ✉, Dingfeng Wu⁴,⁷, Vadim Goremykin⁵,⁷, Jing Xiao⁴, Yanbing Xu⁴, Sriram Garg⁶, Chuanlun Zhang¹,³, William F. Martin⁶ ✉ and Ruixin Zhu⁴ ✉

¹Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Department of Ocean Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China. ²Academy for Advanced Interdisciplinary Studies, Southern University of Science and Technology (SUSTech), Shenzhen, China. ³Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China. ⁴Department of Bioinformatics, Putuo People’s Hospital, Tongji University, Shanghai, China. ⁵Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy. ⁶Institute of Molecular Evolution, Heinrich Heine University, Düsseldorf, Germany. ⁷These authors contributed equally: Lu Fan, Dingfeng Wu, Vadim Goremykin. ✉e-mail: [email protected]; [email protected]; [email protected]

SUPPLEMENTARY INFORMATION

In the format provided by the authors and unedited.

Nature Ecology & Evolution | www.nature.com/natecolevol

Supplementary Information

This file contains Supplementary Notes 1–9, Supplementary Methods, Supplementary Figures 1–60, and Supplementary References.

Table of Contents

Supplementary Notes

Supplementary Methods

Supplementary Figures

Supplementary References

Supplementary Notes

Supplementary Note 1

Difficulties in resolving the phylogenetic relationship of mitochondria with extant alphaproteobacterial lineages (see a detailed review in reference 5). (1) Considerable phylogenetic divergence and metabolic variety within Alphaproteobacteria¹; (2) faint historical signals left behind by the very ancient event of mitochondrial origin²; (3) a limited number of marker genes shared between mitochondria and Alphaproteobacteria due to extensive gene loss in the former³; (4) taxonomic bias in datasets towards clinically or agriculturally important alphaproteobacterial members⁴; and (5) strong phylogenetic artefacts such as long-branch attraction (LBA) and compositional heterogeneity associating mitochondria with fast-evolving alphaproteobacterial lineages.

Supplementary Note 2

Heterogeneity mitigation methods. Datasets used to study the phylogeny of mitochondria and Alphaproteobacteria suffer heavily from compositional heterogeneity and LBA. Various approaches to mitigate non-historical signals have been adopted, but the drawbacks of these methods are rarely examined. Among them, protein recoding causes signal loss and artificial mutation saturation⁶. Nucleus-encoded mitochondrial genes have had to adapt to new rules of expression and regulation in the nuclear system and may therefore have undergone more intensive site substitution than mitochondrion-encoded genes. Thus, the reliability of using nucleus-encoded mitochondrial genes in phylogenetic analyses of mitochondria needs further justification⁷,⁸. The idea of excluding potentially model-violating sites to improve phylogenetic prediction was introduced over two decades ago⁹,¹⁰ but has been opposed by some researchers (see the review by Shepherd and Klaere¹¹). The concern is that, in spite of non-historical signals, these sites may contain useful information. Nonetheless, various versions of site exclusion have been applied in phylogenetic studies of mitochondria and Alphaproteobacteria, based either on evolutionary rate¹²,¹³ or on amino acid composition¹⁴,¹⁵,¹⁶. However, conflicting results have been reported when different site-exclusion metrics were used¹⁷.

Supplementary Note 3

Explanations and verifications of the topological shift of trees observed in Martijn et al. (2018). In the study of Martijn et al.¹⁶, it was observed that when a certain number of sites in the alignment matrix were excluded by using either a Stuart-score-based stationary trimming method or a χ²-score-based method, the tree topology shifted from supporting Rickettsiales-sister to Alphaproteobacteria-sister for mitochondria. At least two mechanisms may explain this observation: (1) Mitochondria originated independently from extant Alphaproteobacteria. The affinity between mitochondria and Rickettsiales in phylogenetic trees is the result of convergent evolution in amino acid composition instead of common evolutionary history. Site-exclusion approaches filter out this non-historical signal by removing the most heterogeneous sites in datasets and restore the true phylogenetic position of mitochondria; and (2) historical signals between mitochondria and some alphaproteobacterial groups exist but are weak as the result of divergent evolution over billions of years. Site-exclusion approaches employed by Martijn et al. removed most of the sites containing these signals and broke the true evolutionary connection between mitochondria and Alphaproteobacteria. The long branch of mitochondria in the phylogenetic tree is then attracted by the long branch leading towards the outgroup taxa containing other proteobacteria, resulting in an Alphaproteobacteria-sister topology. However, since there is a lack of known close relatives to the entire Alphaproteobacteria clade, it is impossible to validate the LBA effect in this tree by replacing the distant outgroup with a much closer alternative (but see the placement of mitochondria in Alpha IIb with Alpha IIa as a short-branched outgroup in this study).

Martijn et al. tested the potential LBA effect of outgroup taxa by conducting three analyses¹⁶. First, as they acknowledged, the outgroup removal analysis neither proved nor disproved the hypothesis of LBA by the outgroup. Second, the parametric simulation approach they conducted generated ten trees, of which eight recovered a Rickettsiales-sister topology with weak node support (below 80%) (Supplementary Figure 20 in their publication). As suggested by the IQ-TREE manuscript¹⁷, ultrafast bootstrap values below 95% are unreliable. Therefore, their result may actually suggest that a trustworthy Rickettsiales-sister topology could not be recovered and that an LBA effect of the outgroup may have contributed to this. Lastly, the authors stated that random sequence replacements showed that the long branch of the outgroup did not attract any of the random sequences. However, this is not the case. In the ten trees they provided as Supplementary Data, 8, 0, 3, 8, 5, 0, 0, 6, 0 and 3 of the 10 random sequences, respectively, are attracted by the outgroup. Therefore, the authors failed to exclude the possible artificial attraction of the outgroup to mitochondria after site exclusion.

Supplementary Note 4

Notable inconsistency in data reproducibility in the study of Martijn et al.¹⁶. In addition to the misinterpretation of results by Martijn et al. mentioned in Supplementary Note 3, we here report two cases of data irreproducibility in Martijn et al. (2018). First, the Bayesian consensus tree we obtained from the untreated ‘24-alphamitoCOGs’ dataset (Supplementary Figure 1) is fundamentally different from the one shown in Supplementary Figure 9 in Martijn et al. (2018), although we used the same dataset and tree construction parameters as they did. Specifically, in our tree, fast-evolving taxonomic groups do not form monophyletic groups. Instead, some taxa such as Pelagibacter were placed with slow-evolving alphaproteobacteria. Mitochondria and Rickettsiales formed a monophyletic clade, which is adjacent to other alphaproteobacterial groups. Thus, our result shows the notable capacity of the CAT model to deal with phylogenetic artefacts.

Second, the results of posterior predictive tests on the Bayesian inference of the site-excluded dataset based on Stuart’s test (this tree is the cornerstone supporting the Alphaproteobacteria-sister conclusion for mitochondria by Martijn et al.; Supplementary Table 2) are not comparable to the ones reported by Martijn et al. (Supplementary Figure 11 and Supplementary Table 3 in reference 16). Specifically, they reported a much better model fit according to the maximum squared heterogeneity (Z-score 0.5–0.6 in their study, but around 3.09 in this study; P-value 0.25–0.27 in their study, but around 0.01 in this study) and mean squared heterogeneity (Z-score −2.5 to −2.3 in their study, but around 2.67 in this study; P-value 0.98–0.99 in their study, but around 0.004 in this study). We are not sure why such inconsistent results were obtained in these two studies, since (1) we used the same dataset (provided directly by Martijn) and tree reconstruction protocol; and (2) both studies obtained almost the same consensus tree topology (Supplementary Figure 2 in this study and Supplementary Figure 11 in their study). The only difference that may explain these two cases of inconsistency is that different versions of PhyloBayes MPI were used in the two studies (version 1.8 here and 1.7a in their study).

Supplementary Note 5

Sequence heterogeneity and taxa selection of the ‘Modified18’ dataset. In the ‘Modified18’ dataset, GC-rich mitochondrial sequences were selected to markedly reduce the heterogeneity in the FYMINK/GARP ratio between mitochondria and slowly evolving alphaproteobacteria (Supplementary Figure 24). Rickettsiales taxa with higher genomic G+C content than those in the ‘24-alphamitoCOGs’ dataset were also selected. The rationale behind this selection is that the phylogenetic position of Rickettsiales relative to the backbone taxa of alphaproteobacteria is difficult to resolve (see Supplementary Note 7). Rickettsiales with low G+C content may be artificially attracted by other fast-evolving species in the tree, leading to the loss of their weak phylogenetic connection to the backbone taxa.
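The FYMINK/GARP ratio mentioned above can be illustrated with a short sketch. It assumes the usual definition of the ratio: the number of F, Y, M, I, N and K residues (amino acids encoded by AT-rich codons) divided by the number of G, A, R and P residues (encoded by GC-rich codons). The function name and the toy sequences are hypothetical and are not taken from the study's own scripts.

```python
def fymink_garp_ratio(protein):
    """Return the FYMINK/GARP residue ratio of one protein sequence.

    Assumption: the ratio is the count of F, Y, M, I, N, K residues divided
    by the count of G, A, R, P residues; gaps and other residues are ignored.
    """
    seq = protein.upper()
    fymink = sum(seq.count(aa) for aa in "FYMINK")
    garp = sum(seq.count(aa) for aa in "GARP")
    if garp == 0:
        raise ValueError("sequence contains no G, A, R or P residues")
    return fymink / garp


# Hypothetical toy sequences: an AT-biased and a GC-biased composition.
print(fymink_garp_ratio("MKNIFYKKGA"))   # FYMINK-rich
print(fymink_garp_ratio("GGAARRPPFK"))   # GARP-rich
```

A lower ratio indicates a composition closer to that expected for GC-rich genomes, which is the property used above to select mitochondrial and Rickettsiales sequences.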

It should be noted that the GC-rich mitochondria introduced were all from higher plants. While this may compromise the representativeness of the data, it was noticed as early as the 1980s by Carl Woese et al. that the mitochondrial sequences of higher plants may have diverged from bacterial sequences to only a small extent¹⁸⁻²⁰, possibly as a result of a low mutation rate in genes maintained by DNA repair mechanisms²¹. With this in mind, plant mitochondria are as suitable as those of jakobids for phylogenetic analysis with bacterial sequences in terms of branch length. Indeed, branch lengths of plant mitochondria in trees based on the ‘Modified18’ dataset are comparable to those of single-celled eukaryotes based on the ‘Modified24’ dataset, as shown in this study (Supplementary Figures 31–36).

Supplementary Note 6

The four backbone clades. Group GT (Supplementary Table 3) is equivalent to Geminicoccaceae in Muñoz-Gómez et al. (2019)¹⁵. Alpha I comprises core alphaproteobacterial orders including Kordiimonadales, Sphingomonadales, Rhizobiales, Caulobacterales, Parvularculales and Rhodobacterales. The grouping of these lineages is consistent with the findings of Muñoz-Gómez et al. and others¹,¹⁵,²². Alpha II comprises three isolates belonging to Rhodospirillaceae and several marine alphaproteobacterial metagenome-assembled genomes (MAGs). The grouping of these lineages was observed by Williams et al. and others¹,¹⁵,²². Alpha III comprises Kiloniellaceae, SAR116, Acetobacteraceae, Azospirillaceae and some taxa classified in the polyphyletic Rhodospirillaceae. This result is similar to the finding of Muñoz-Gómez et al.¹⁵.

Notably, the separation of these four groups was exactly recovered by Martijn et al. in their untreated ‘24-alphamitoCOGs’ dataset (Supplementary Figures 9 and 10 in Martijn et al. (2018)), but not in their stationary-trimmed dataset (Fig. 4a and Supplementary Figures 11 and 12 in Martijn et al. (2018)), which they claimed to support their ‘mito-out’ result. This again suggests that site exclusion may result in abnormal tree topology even for slow-evolving species.

Supplementary Note 7

Phylogenetic positions of fast-evolving alphaproteobacterial lineages. Holosporales was previously considered a subclade of Rickettsiales based on phylogenetic analysis and the fact that members of both groups are obligate endosymbionts²³⁻²⁵. However, other studies suggested that the phylogenetic affinity between Holosporales and other Rickettsiales families is the result of an artefact¹⁶,²³,²⁶⁻²⁸. In the very recent study by Muñoz-Gómez et al., where amino acid bias was corrected by site exclusion, Holosporales was suggested to have a derived position within the Rhodospirillales, possibly close to Azospirillaceae¹⁵. Here, Holosporales are in a sister relationship with the entire Alpha III (Fig. 2cd). Our results support that Holosporales are independent of Rickettsiales but may be close to taxa in Alpha III. Recent studies suggested that the grouping of Pelagibacterales, alpha proteobacterium HIMB59 and Rickettsiales, as reported by many earlier studies, is the result of a compositional bias artefact¹⁵,¹⁶. Using site-excluded datasets, it was suggested that Pelagibacterales should be placed after the common ancestor of Sphingomonadales (belonging to Alpha Ia here) but before the divergence of Rhodobacterales, Caulobacterales and Rhizobiales (belonging to Alpha Ib here)¹⁵. Our result is consistent with this (Fig. 2ef). Moreover, the exact phylogenetic position of alpha proteobacterium HIMB59 could not be resolved by itself (Fig. 2gh). When mitochondria were present, it was placed in the Alpha IIb clade based on the ‘Modified18’ dataset (Fig. 3f). Rickettsiales appearing as sister to all other alphaproteobacteria has been reported in some artefact-attenuated studies¹⁵, while conflicting results were recovered in others¹⁶, suggesting the current difficulty in resolving its relationship with slow-evolving alphaproteobacteria. We found that Rickettsiales were placed within Alpha II, as the sister of MarineAlpha9 Bin5, based on the ‘Modified18’ dataset (Fig. 2i).

Supplementary Note 8

Evidence that the tree topology in Fig. 4 is not the result of artefacts. First, taxon-exclusion analyses clearly demonstrate the phylogenetic connections of fast-evolving alphaproteobacterial lineages, including Rickettsiales and fast-evolving MAGs (FEMAGs), to the slow-evolving taxa MarineAlpha9 Bin5 and MarineAlpha11 in the absence of possible influence from non-historical signals (Fig. 2ij). Second, mitochondria and these fast-evolving taxa do not form a single clade falling apart from the backbone clades as a result of LBA (a pattern seen in Supplementary Figures 9 and 10 in Martijn et al. (2018)). Instead, they were placed with slow-evolving taxa within Alpha IIb (Figs. 3kl and 4). Lastly, taxon-reduction and site-exclusion examinations of the datasets ‘Modified24-AlphaII’, ‘Modified24-AlphaII-MoreTaxa’ and ‘Modified18-AlphaII’, as shown in Fig. 1d, suggest that their tree topology is robust.

Supplementary Note 9

Metabolic analysis. Reconstructing the metabolism of the bacterial ancestor of mitochondria is essential to syntrophy-based models of eukaryogenesis²⁹⁻³¹. Proteome studies of the alphaproteobacterial sister of mitochondria might provide hints about the metabolic nature of the last common ancestor of mitochondria and Alphaproteobacteria, particularly for proteins that have ultimately been lost from extant mitochondria or nuclei³¹,³². However, as evidenced by our previous studies³³,³⁴, extensive horizontal gene transfer has taken place during the 1.8-billion-year evolutionary history of extant alphaproteobacteria since eukaryogenesis. Consequently, few proteins in extant alphaproteobacteria other than those used for phylogeny here (the 24 marker proteins, including riboproteins) were acquired through vertical transfer from the last common ancestor of mitochondria and Alphaproteobacteria. We therefore think that approaches inferring the very ancient event of eukaryogenesis from the proteomes of extant Alphaproteobacteria, without robust justification of each protein’s evolutionary pathway, are prone to systematic errors.

Nevertheless, we here test two hallmark metabolic characters of the bacterial ancestor of mitochondria predicted by the ‘hydrogen hypothesis’³⁰, namely aerobic and anaerobic oxidation of pyruvate and hydrogen production, by annotating the proteomes of taxa in the ‘24-alphamitoCOGs’ dataset. Pyruvate dehydrogenase E1 component alpha subunit (PdhA), involved in aerobic pyruvate oxidation, is found in almost all the alphaproteobacteria but in only one species in the outgroup (Supplementary Table 7). In contrast, one of the pyruvate ferredoxin oxidoreductases, PorA, is detected only in Magnetospirillum magneticum, while the other two, OorA and IorA, are found generally in the Alpha II subclade including FEMAG I, some Alpha I and Alpha II taxa and two outgroup taxa. The result suggests ubiquitous aerobic pyruvate oxidation but sporadically distributed anaerobic pyruvate oxidation in extant alphaproteobacteria. The NiFe hydrogenase HoxF is encoded only by M. magneticum and two outgroup taxa, probably because most taxa studied here are better adapted to aerobic environments.

To investigate the possible evolutionary pathways of these enzymes, phylogenetic trees were reconstructed for PdhA, OorA and IorA (Supplementary Figures 59 and 60). In general, all the trees have low resolution, as shown by nearly half of the nodes having unstable support (ultrafast bootstrap values < 95%), likely caused by the short sequence length of each protein. In addition, possible systematic errors such as compositional heterogeneity and LBA may bias the results. Nevertheless, in branches with stable support, we observe evidence of both local vertical transfer, as shown by small monophyletic subclades (e.g. PdhA in Rickettsiales, FEMAG II and Alpha Ia, respectively; Supplementary Figure 59), and lateral transfer, as demonstrated by polyphyletic subclades. Overall, it is likely that the three enzymes representing aerobic/anaerobic pyruvate oxidation were present in the common ancestor of all extant alphaproteobacteria and experienced differential loss, duplication and sporadic lateral exchange with other bacteria. However, given the low resolution of the trees, it is impossible to trace the basal nodes of Alpha II taxa. No evidence of pyruvate oxidation or hydrogen production in the common ancestor of mitochondria and Alpha II taxa can be obtained from these very limited data.

Supplementary Methods

The site-exclusion method based on the score of Bowker’s test. Compared to Stuart’s test, Bowker’s test of symmetry was reported to be more comprehensive and sufficient for assessing compliance with the symmetry, reversibility and homogeneity assumptions of time-reversible models³⁵,³⁶. We used Bowker’s test of symmetry to produce subsets of the ‘24-alphamitoCOGs’ dataset meeting increasingly stringent p-value-based thresholds (>0.01, >0.1, >0.3, >0.4 and >0.5, respectively). Bowker’s test has long been used as an overall test for symmetry³⁶. The test assesses symmetry in an r × r contingency table whose ij-th cell contains the observed frequency $n_{ij}$. The null hypothesis of symmetry is $H_0: n_{ij} = n_{ji}$, $i \neq j$, $i, j = 1, \ldots, r$, and the test value is computed by using equation (1).

$$X^2 = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}} \qquad (1)$$

The test statistic follows a χ² distribution with the number of degrees of freedom equal to the number of comparisons ($n_{ij}$ vs $n_{ji}$) made.
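As an illustration only, the following Python sketch computes Bowker’s statistic for one pair of aligned sequences, using the standard form of the test: the sum of (n_ij − n_ji)² / (n_ij + n_ji) over the off-diagonal cell pairs with a non-zero total, with one degree of freedom per comparison made. The function names, the set of characters treated as gaps and the standard-library χ² tail routine are assumptions made for this example; this is not the Supplementary Software.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from Q(1, y) = exp(-y) (even df) or Q(1/2, y) = erfc(sqrt(y)) (odd df),
    # then climb with Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def bowker_test(seq_a, seq_b, gap_chars="-?X"):
    """Return (statistic, degrees of freedom, p-value) for two aligned sequences.

    Assumption: columns with a gap or unknown character in either sequence
    are skipped before the r x r divergence table is built.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter(
        (a, b) for a, b in zip(seq_a, seq_b)
        if a not in gap_chars and b not in gap_chars
    )
    states = sorted({s for cell in counts for s in cell})
    stat, df = 0.0, 0
    for i, si in enumerate(states):
        for sj in states[i + 1:]:
            n_ij, n_ji = counts[(si, sj)], counts[(sj, si)]
            if n_ij + n_ji > 0:          # one comparison (n_ij vs n_ji) made
                stat += (n_ij - n_ji) ** 2 / (n_ij + n_ji)
                df += 1
    p = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p


# Hypothetical toy pair: ten A->C columns against two C->A columns.
print(bowker_test("A" * 10 + "C" * 2, "C" * 10 + "A" * 2))
```

The χ² tail probability is written out with `math.erfc` and `math.lgamma` only so that the sketch has no third-party dependency; a statistics library would normally be used instead.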

The scoring function (SF) used here for symmetry-based alignment trimming sums the absolute values of the natural logarithms of the Bowker’s test p-values, each raised to a certain power x (15 by default). SF is computed as the mean over the values in the upper (or lower) triangular part of a square matrix whose rows and columns represent taxa and whose cells are populated with $|\ln p|^x$ values for the Bowker’s tests among these taxa, as shown in equation (2).

$$SF = \frac{2}{h(h-1)} \sum_{a=1}^{h-1} \sum_{b=a+1}^{h} |\ln p_{ab}|^{x} \qquad (2)$$

where h is the number of taxa in the multiple sequence alignment (MSA) and $p_{ab}$ is the p-value of Bowker’s test for sequences a and b.
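A minimal sketch of the scoring function follows, assuming it is the mean of |ln p_ab|^x over the upper triangle of an h × h matrix of pairwise p-values, as described in the text. The function name, the nested-list input and the `floor` guard against p-values of exactly zero are assumptions added for this example.

```python
import math


def scoring_function(p_matrix, exponent=15, floor=1e-300):
    """Mean of |ln p_ab|**exponent over the upper triangle of a p-value matrix.

    p_matrix is a square nested list; only cells with a < b are read.
    `floor` (an assumption of this sketch) avoids taking log(0).
    """
    h = len(p_matrix)
    if h < 2:
        raise ValueError("at least two taxa are needed")
    total = 0.0
    for a in range(h - 1):
        for b in range(a + 1, h):
            total += abs(math.log(max(p_matrix[a][b], floor))) ** exponent
    return total / (h * (h - 1) / 2)


# Hypothetical p-values for three taxa; one poor pair (p = 0.005) dominates the score.
p_matrix = [
    [1.0, 0.5, 0.005],
    [0.5, 1.0, 0.5],
    [0.005, 0.5, 1.0],
]
print(scoring_function(p_matrix))
```

Because the matrix is symmetric, reading only the upper triangle counts each pair of taxa once.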

The script that performs symmetry-based trimming (available as Supplementary Software) deletes a site from an alignment, computes the SF value and restores the original alignment. This operation is performed for every alignment site. Then, the site whose removal results in the lowest SF value is deleted irreversibly. The procedure is repeated on each shortened alignment until the lowest p-value of any pairwise Bowker’s test in the trimmed dataset exceeds the chosen p-value threshold.
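The greedy procedure described above can be sketched as follows. This is a simplified re-implementation for illustration, not the Supplementary Software: it assumes a gap-free toy alignment, recomputes every pairwise test from scratch at each trial deletion (so it is far too slow for real datasets), and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df (closed-form recurrence)."""
    if df == 0 or x <= 0:
        return 1.0
    y = x / 2.0
    a, q = (1.0, math.exp(-y)) if df % 2 == 0 else (0.5, math.erfc(math.sqrt(y)))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def bowker_p(seq_a, seq_b, keep):
    """p-value of Bowker's test for two sequences, restricted to the kept sites."""
    counts = Counter((seq_a[k], seq_b[k]) for k in keep)
    pairs = sorted({tuple(sorted(cell)) for cell in counts if cell[0] != cell[1]})
    stat = sum(
        (counts[(i, j)] - counts[(j, i)]) ** 2 / (counts[(i, j)] + counts[(j, i)])
        for i, j in pairs
    )
    return chi2_sf(stat, len(pairs))


def pairwise_p(alignment, keep):
    """Bowker p-values for every pair of sequences in the alignment."""
    return [bowker_p(a, b, keep) for a, b in combinations(alignment, 2)]


def scoring_function(p_values, exponent=15, floor=1e-300):
    """Mean of |ln p|**exponent over all pairwise p-values."""
    return sum(abs(math.log(max(p, floor))) ** exponent for p in p_values) / len(p_values)


def trim(alignment, threshold, exponent=15):
    """Greedily delete sites until the lowest pairwise p-value exceeds threshold."""
    keep = list(range(len(alignment[0])))
    while len(keep) > 1 and min(pairwise_p(alignment, keep)) <= threshold:
        def sf_without(site):
            # Trial deletion: score the alignment with this one site left out.
            trial = [k for k in keep if k != site]
            return scoring_function(pairwise_p(alignment, trial), exponent)
        # Irreversibly delete the site whose removal gives the lowest SF.
        keep.remove(min(keep, key=sf_without))
    return keep


# Hypothetical toy alignment: 12 compositionally skewed columns, then 12 identical ones.
toy = ["A" * 12 + "ACGT" * 3, "C" * 12 + "ACGT" * 3, "A" * 12 + "ACGT" * 3]
kept = trim(toy, threshold=0.05)
print(len(kept), min(pairwise_p(toy, kept)))
```

In the toy alignment only the skewed columns are ever removed, and trimming stops as soon as the worst pairwise p-value rises above the threshold, leaving the identical columns untouched.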

Exponentiation in equation (2) leads to earlier recovery of trimmed subsets. It disproportionately increases the addend values in equation (2) ($|\ln p_{ab}|^x$) for smaller p-values. For instance, with the default exponent the addend for a p-value of 0.5 is 0.004, whereas the addend for a p-value of 0.005 is 72,789,633,288. Thus, when there is a disparity among individual p-values in the data, which is the case when the method is needed, exponentiation increases the relative contribution of the lowest p-values to the SF value. At each trimming step, the heuristic algorithm identifies a site whose removal is likely to improve the worst (lowest) p-values. The script outputs a trimmed subset when the lowest p-value exceeds the threshold value. The suggested exponentiation, by preferentially improving the worst p-values at each site-stripping step, delivers a result with fewer positions removed. The default exponent value (x = 15) was determined experimentally.
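The two addends quoted above can be checked with a few lines of Python. The helper name `addend` is hypothetical; the only quantity taken from the text is the expression |ln p|^x with x = 15.

```python
import math


def addend(p, exponent=15):
    """Contribution of one pairwise p-value to the scoring function."""
    return abs(math.log(p)) ** exponent


# Compare a good and a poor p-value under the default exponent and without it.
for p in (0.5, 0.005):
    print(f"p = {p}: |ln p| = {addend(p, 1):.4f}, |ln p|^15 = {addend(p):.4g}")
```

Without exponentiation the two addends differ by less than one order of magnitude, whereas with x = 15 they differ by about thirteen orders of magnitude, which is why the lowest p-values dominate the SF value.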


Functional annotation and phylogenetic reconstruction of the pyruvate oxidation and hydrogen generation enzymes. A subset of high-quality genomes from taxa in the ‘24-alphamitoCOGs’ dataset was selected based on validation with CheckM (v1.0.11)³⁷, followed by two filtering criteria: (1) for each composite bin, the original bin with the highest genome completeness was selected; and (2) only bins/genomes with completeness > 80% were kept. Proteomes of the 65 selected genomes/bins were searched against the KEGG database by using KofamKOALA (v1.2.0; ‘-E 0.01’)³⁸. Hits to the entries K00161, K00169, K00174, K00179 and K18005 were counted for each taxon (Supplementary Table 7). Proteins annotated as K00161, K00174 and K00179 were each aligned by using MUSCLE (v3.8)³⁹ and then trimmed by using trimAl (v1.4; ‘-gappyout’)⁴⁰. Maximum-likelihood trees for the trimmed alignments were reconstructed by using IQ-TREE (v1.6.12)¹⁷ with the ‘LG+C60+F’ model.
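The two genome-filtering criteria can be sketched in Python as follows. The record layout (`GenomeBin`, its `composite` field) and all example names and completeness values are hypothetical; in practice the completeness estimates would be read from the CheckM output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenomeBin:
    name: str
    completeness: float               # CheckM completeness in percent
    composite: Optional[str] = None   # hypothetical ID of the composite bin, if any


def select_genomes(bins: List[GenomeBin], min_completeness: float = 80.0) -> List[GenomeBin]:
    """Apply the two criteria: best original bin per composite bin, then > 80% complete."""
    best_per_composite = {}
    standalone = []
    for b in bins:
        if b.composite is None:
            standalone.append(b)
        elif (b.composite not in best_per_composite
              or b.completeness > best_per_composite[b.composite].completeness):
            # Criterion 1: keep the most complete original bin of each composite bin.
            best_per_composite[b.composite] = b
    candidates = standalone + list(best_per_composite.values())
    # Criterion 2: keep only genomes/bins strictly above the completeness cut-off.
    return [b for b in candidates if b.completeness > min_completeness]


# Hypothetical records for illustration only.
records = [
    GenomeBin("isolate_1", 99.0),
    GenomeBin("bin_a1", 85.0, composite="comp_a"),
    GenomeBin("bin_a2", 92.5, composite="comp_a"),
    GenomeBin("bin_b1", 70.0, composite="comp_b"),
    GenomeBin("isolate_2", 80.0),
]
print([b.name for b in select_genomes(records)])
```

The cut-off is applied as a strict inequality to mirror the "> 80%" wording, so a genome at exactly 80% completeness is dropped.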

Supplementary Figures

Supplementary Figure 1 | Bayesian phylogenetic tree of the untreated ‘24-alphamitoCOGs’ dataset. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 2 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset treated by site exclusion based on the Stuart’s test score. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 3 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 5% sites excluded based on the χ²-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 4 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 10% sites excluded based on the χ²-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 5 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 20% sites excluded based on the χ²-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 6 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 40% sites excluded based on the χ²-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 7 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 60% sites excluded based on the χ²-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 8 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 5% sites excluded based on the evolving-rate score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 9 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 10% sites excluded based on the evolving-rate score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 10 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 20% sites excluded based on the evolving-rate score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 11 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 40% sites excluded based on the evolving-rate score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 12 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 60% sites excluded based on the evolving-rate score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 13 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 5% of sites excluded based on the ɀ-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 14 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 10% of sites excluded based on the ɀ-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 15 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 20% of sites excluded based on the ɀ-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 16 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 40% of sites excluded based on the ɀ-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 17 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset with 60% of sites excluded based on the ɀ-score method. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 18 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset treated by site exclusion based on the Bowker’s test score method with P-value > 0.01. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 19 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset treated by site exclusion based on the Bowker’s test score method with P-value > 0.1. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 20 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset treated by site exclusion based on the Bowker’s test score method with P-value > 0.3. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 21 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset treated by site exclusion based on the Bowker’s test score method with P-value > 0.4. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 22 | Bayesian phylogenetic tree of the ‘24-alphamitoCOGs’ dataset treated by site exclusion based on the Bowker’s test score method with P-value > 0.5. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 23 | Relationships between alignment sites, the phylogenetic position of mitochondria and model fit (max heterogeneity across taxa test) based on different datasets, site-exclusion and taxon-selection approaches. Bayesian inference with the CAT+GTR model was conducted. The x axis shows the number of sites in each dataset used for phylogenetic inference. The y axis shows the Z-scores of the ‘max heterogeneity across taxa’ posterior predictive test. Numbers beside markers show node support values (posterior probability support values) of the consensus trees. Values ≥95 are in bold. Strings in parentheses show the closest relatives of mitochondria in the tree. R, Rickettsiales. T, Tistrella mobilis. F2, FEMAG II. AII, Alpha II. AIII, Alpha III. GT, Geminicoccus roseus and T. mobilis. MA9, MarineAlpha9. mito-in, mitochondria branch within Alphaproteobacteria. mito-out, mitochondria branch outside Alphaproteobacteria. mito-in AlphaIIb, mitochondria branch within the AlphaIIb clade of Alphaproteobacteria. a, site-exclusion methods applied on the ‘24-alphamitoCOGs’ dataset of Martijn et al. (2018). Trees are shown in Supplementary Figures 1–22. b, χ²-score based site-exclusion and taxon-reduction methods applied on the subsets of the ‘Modified24’ and the ‘Modified18’ datasets, respectively, containing only the backbone, Rickettsiales and mitochondrial sequences. Trees are shown in Supplementary Figures 35 and 37–42. c, χ²-score based site-exclusion and taxon-reduction methods applied on the subsets of the ‘Modified24’ and the ‘Modified18’ datasets, respectively, containing only the backbone, FEMAGs and mitochondrial sequences. Trees are shown in Supplementary Figures 36 and 43–48. d, χ²-score based site exclusion applied on the subsets of the ‘Modified24-AlphaII’, the ‘Modified24-AlphaII-MoreTaxa’ and the ‘Modified18-AlphaII’ datasets, respectively. Trees are shown in Fig. 4 and Supplementary Figures 50–58.

Supplementary Figure 24 | GC content and amino acid compositional heterogeneity among alphaproteobacterial lineages and mitochondria. Dots represent taxa. Lineages are colored according to Fig. 3, except that empty dots represent Betaproteobacteria, Gammaproteobacteria and Magnetococcales. a, taxa in the ‘Modified24’ dataset. b, taxa in the ‘Modified18’ dataset.

Supplementary Figure 25 | Bayesian phylogenetic tree of the slowly-evolving backbone taxa of Alphaproteobacteria. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 26 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria and Holosporales. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 27 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria and Pelagibacterales. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 28 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria and alpha proteobacterium HIMB59. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 29 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria and Rickettsiales. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 30 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria and FEMAGs. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 31 | Bayesian phylogenetic tree of the slowly-evolving backbone taxa of Alphaproteobacteria and mitochondria. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 32 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Holosporales and mitochondria. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 33 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Pelagibacterales and mitochondria. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 34 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, alpha proteobacterium HIMB59 and mitochondria. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 35 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Rickettsiales and mitochondria. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 36 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, FEMAGs and mitochondria. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 37 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Rickettsiales and mitochondria after 5% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 38 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Rickettsiales and mitochondria after 10% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 39 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Rickettsiales and mitochondria after 20% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 40 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Rickettsiales and mitochondria after 40% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 41 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Rickettsiales and mitochondria after 60% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 42 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, Rickettsiales and mitochondria after taxon reduction. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 43 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, FEMAGs and mitochondria after 5% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 44 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, FEMAGs and mitochondria after 10% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 45 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, FEMAGs and mitochondria after 20% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 46 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, FEMAGs and mitochondria after 40% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 47 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, FEMAGs and mitochondria after 60% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 48 | Bayesian phylogenetic tree of the backbone taxa of Alphaproteobacteria, FEMAGs and mitochondria after taxon reduction. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup. Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24’ dataset. b, based on the ‘Modified18’ dataset.

Supplementary Figure 49 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. a, based on the ‘Modified24-AlphaII’ dataset. b, based on the ‘Modified18-AlphaII’ dataset.

Supplementary Figure 50 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24-AlphaII’ dataset. b, based on the ‘Modified18-AlphaII’ dataset.

Supplementary Figure 51 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 5% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24-AlphaII’ dataset. b, based on the ‘Modified18-AlphaII’ dataset.

Supplementary Figure 52 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 10% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24-AlphaII’ dataset. b, based on the ‘Modified18-AlphaII’ dataset.

Supplementary Figure 53 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 20% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24-AlphaII’ dataset. b, based on the ‘Modified18-AlphaII’ dataset.

Supplementary Figure 54 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 40% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24-AlphaII’ dataset. b, based on the ‘Modified18-AlphaII’ dataset.

Supplementary Figure 55 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 60% of sites excluded. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, based on the ‘Modified24-AlphaII’ dataset. b, based on the ‘Modified18-AlphaII’ dataset.

Supplementary Figure 56 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 5–10% of sites excluded based on the ‘Modified24-AlphaII-MoreTaxa’ dataset. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, 5% of sites excluded. b, 10% of sites excluded.

Supplementary Figure 57 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 20–40% of sites excluded based on the ‘Modified24-AlphaII-MoreTaxa’ dataset. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria. a, 20% of sites excluded. b, 40% of sites excluded.

Supplementary Figure 58 | Bayesian phylogenetic tree of the Alpha II subclade of Alphaproteobacteria and mitochondria after 60% of sites excluded based on the ‘Modified24-AlphaII-MoreTaxa’ dataset. The χ²-score method for site exclusion was applied. The tree is rooted to the outgroup (Alpha IIa). Posterior probability support values at nodes are shown. mito, mitochondria.

Supplementary Figure 59 | Maximum-likelihood phylogenetic tree of pyruvate dehydrogenase E1 component alpha subunit (PdhA) found in the taxa of the ‘24-alphamitoCOGs’ dataset. The tree is rooted to the midpoint. Node values show the ultra-fast bootstraps based on 1,000 iterations. Taxa are assigned to subclades named according to Supplementary Table 3.

Supplementary Figure 60 | Maximum-likelihood phylogenetic trees of pyruvate ferredoxin oxidoreductase found in the taxa of the ‘24-alphamitoCOGs’ dataset. The trees are rooted to the midpoints. Node values show the ultra-fast bootstraps based on 1,000 iterations. Taxa are assigned to subclades named according to Supplementary Table 3. a, 2-oxoglutarate/2-oxoacid ferredoxin oxidoreductase subunit alpha (OorA). b, indolepyruvate ferredoxin oxidoreductase subunit alpha (IorA).

Supplementary References

1. Ettema, T. J. & Andersson, S. G. The alpha-proteobacteria: the Darwin finches of the bacterial world. Biol Lett 5, 429-432 (2009).

2. Betts, H. C., Puttick, M. N., Clark, J. W., Williams, T. A., et al. Integrated genomic and fossil evidence illuminates life's early evolution and eukaryote origin. Nat Ecol Evol (2018).

3. Karnkowska, A., Vacek, V., Zubáčová, Z., Treitli, S. C., et al. A eukaryote without a mitochondrial organelle. Curr Biol 26, 1274-1284 (2016).

4. Brindefalk, B., Ettema, T. J., Viklund, J., Thollesson, M. & Andersson, S. G. A phylometagenomic exploration of oceanic alphaproteobacteria reveals mitochondrial relatives unrelated to the SAR11 clade. PLoS One 6, e24457 (2011).

5. Roger, A. J., Muñoz-Gómez, S. A. & Kamikawa, R. The origin and diversification of mitochondria. Curr Biol 27, R1177-R1192 (2017).

6. Philippe, H., Brinkmann, H., Lavrov, D. V., Littlewood, D. T., et al. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 9, e1000602 (2011).

7. Derelle, R. & Lang, B. F. Rooting the eukaryotic tree with mitochondrial and bacterial proteins. Mol Biol Evol 29, 1277-1289 (2012).

8. Adams, K. L., Song, K., Roessler, P. G., Nugent, J. M., et al. Intracellular gene transfer in action: dual transcription and multiple silencings of nuclear and mitochondrial cox2 genes in legumes. Proc Natl Acad Sci U S A 96, 13863-13868 (1999).

9. Hansmann, S. & Martin, W. Phylogeny of 33 ribosomal and six other proteins encoded in an ancient gene cluster that is conserved across prokaryotic genomes: influence of excluding poorly alignable sites from analysis. Int J Syst Evol Microbiol 50 Pt 4, 1655-1663 (2000).

10. Goremykin, V. V., Hansmann, S. & Martin, W. F. Evolutionary analysis of 58 proteins encoded in six completely sequenced chloroplast genomes: revised molecular estimates of two seed plant divergence times. Plant Systematics and Evolution 206, 337-351 (1997).

11. Shepherd, D. A. & Klaere, S. How well does your phylogenetic model fit your data? Syst Biol 68, 157-167 (2019).

12. Esser, C., Ahmadinejad, N., Wiegand, C., Rotte, C., et al. A genome phylogeny for mitochondria among alpha-proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genes. Mol Biol Evol 21, 1643-1660 (2004).

13. Fitzpatrick, D. A., Creevey, C. J. & McInerney, J. O. Genome phylogenies indicate a meaningful alpha-proteobacterial phylogeny and support a grouping of the mitochondria with the Rickettsiales. Mol Biol Evol 23, 74-85 (2006).

14. Viklund, J., Ettema, T. J. & Andersson, S. G. Independent genome reduction and phylogenetic reclassification of the oceanic SAR11 clade. Mol Biol Evol 29, 599-615 (2012).

15. Muñoz-Gómez, S. A., Hess, S., Burger, G., Lang, B. F., et al. An updated phylogeny of the Alphaproteobacteria reveals that the parasitic Rickettsiales and Holosporales have independent origins. Elife 8, (2019).

16. Martijn, J., Vosseberg, J., Guy, L., Offre, P. & Ettema, T. J. G. Deep mitochondrial origin outside the sampled alphaproteobacteria. Nature 557, 101-105 (2018).

17. Nguyen, L. T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32, 268-274 (2015).

18. Woese, C. R. in The Evolution of Prokaryotes, eds. Schleifer, K.-H. & Stackebrandt, E. (Academic, London) (1985).

19. Yang, D., Oyaizu, Y., Oyaizu, H., Olsen, G. J. & Woese, C. R. Mitochondrial origins. Proc Natl Acad Sci U S A 82, 4443-4447 (1985).

20. Palmer, J. D. & Herbon, L. A. Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. J Mol Evol 28, 87-97 (1988).

21. Christensen, A. C. Plant mitochondrial genome evolution can be explained by DNA repair mechanisms. Genome Biol Evol 5, 1079-1086 (2013).

22. Williams, K. P., Sobral, B. W. & Dickerman, A. W. A robust species tree for the alphaproteobacteria. J Bacteriol 189, 4578-4586 (2007).

23. Szokoli, F., Castelli, M., Sabaneyeva, E., Schrallhammer, M., et al. Disentangling the taxonomy of Rickettsiales and description of two novel symbionts ("Candidatus Bealeia paramacronuclearis" and "Candidatus Fokinia cryptica") sharing the cytoplasm of the ciliate protist Paramecium biaurelia. Appl Environ Microbiol 82, 7236-7247 (2016).

24. Vannini, C., Ferrantini, F., Schleifer, K. H., Ludwig, W., et al. "Candidatus Anadelfobacter veles" and "Candidatus Cyrtobacter comes," two new Rickettsiales species hosted by the protist ciliate Euplotes harpa (Ciliophora, Spirotrichea). Appl Environ Microbiol 76, 4047-4054 (2010).

25. Martijn, J., Schulz, F., Zaremba-Niedzwiedzka, K., Viklund, J., et al. Single-cell genomics of a rare environmental alphaproteobacterium provides unique insights into Rickettsiaceae evolution. ISME J 9, 2373-2385 (2015).

26. Wang, Z. & Wu, M. An integrated phylogenomic approach toward pinpointing the origin of mitochondria. Sci Rep 5, 7949 (2015).

27. Georgiades, K., Madoui, M. A., Le, P., Robert, C. & Raoult, D. Phylogenomic analysis of Odyssella thessalonicensis fortifies the common origin of Rickettsiales, Pelagibacter ubique and Reclimonas americana mitochondrion. PLoS One 6, e24857 (2011).

28. Ferla, M. P., Thrash, J. C., Giovannoni, S. J. & Patrick, W. M. New rRNA gene-based phylogenies of the Alphaproteobacteria provide perspective on major groups, mitochondrial ancestry and phylogenetic instability. PLoS One 8, e83383 (2013).

29. Imachi, H., Nobu, M. K., Nakahara, N., Morono, Y., et al. Isolation of an archaeon at the prokaryote-eukaryote interface. Nature 577, 519-525 (2020).

30. Martin, W. & Müller, M. The hydrogen hypothesis for the first eukaryote. Nature 392, 37-41 (1998).

31. Gabaldón, T. & Huynen, M. A. Reconstruction of the proto-mitochondrial metabolism. Science 301, 609 (2003).

32. Wang, Z. & Wu, M. Phylogenomic reconstruction indicates mitochondrial ancestor was an energy parasite. PLoS One 9, e110685 (2014).

33. Martin, W. Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. Bioessays 21, 99-104 (1999).

34. Ku, C., Nelson-Sathi, S., Roettger, M., Sousa, F. L., et al. Endosymbiotic origin and differential loss of eukaryotic genes. Nature 524, 427-432 (2015).

35. Jermiin, L. S., Jayaswal, V., Ababneh, F. M. & Robinson, J. Identifying optimal models of evolution. Methods Mol Biol 1525, 379-420 (2017).

36. Bowker, A. H. A test for symmetry in contingency tables. J Am Stat Assoc 43, 572-574 (1948).

37. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25, 1043-1055 (2015).

38. Aramaki, T., Blanc-Mathieu, R., Endo, H., Ohkubo, K., et al. KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics (2019). DOI: 10.1093/bioinformatics/btz859

39. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797 (2004).

40. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972-1973 (2009).
