
Misled by the mitochondrial genome

A phylogenetic study in Topaza hummingbirds

Tobias Hofmann

Degree project for Master of Science (Two Years)

Biodiversity and Systematics

Degree course in Next-Generation Sequencing (60 hec)

Autumn 2014 and Spring 2015

Examiner: Bengt Oxelman

Supervisors: Urban Olsson, Alexandre Antonelli

Co-Supervisors: Bernard Pfeil, Mats Töpel

Department of Biological and Environmental Sciences

University of Gothenburg

Cover illustration by David Alker, published in Schmitz-Ornés & Haase (2009)


Abstract

Phylogenetic analyses at shallow evolutionary timescales pose several challenges, since many classically used genetic markers provide too few variable sites to reliably reconstruct evolutionary history. In animals, mitochondrial sequences are therefore preferred for inferring recent evolutionary history, owing to their higher variability compared to nuclear gene loci. However, a growing body of evidence points to specific perils of mitochondrial sequences that challenge their utility for phylogenetic analyses. Here we show a case of recent population divergences within the hummingbird genus Topaza, in which the mitochondrial tree deviates from the species tree, leading to different divergence patterns between nuclear and mitochondrial datasets. We apply state-of-the-art phylogenetic and population genetic methods to infer species trees, to define and delimit genetically distinct subspecies, and to compare the admixture pattern among them. The resulting population pattern indicates that the Amazon River acts as a strong dispersal barrier for Topaza hummingbirds. The underlying dataset consists of thousands of Ultraconserved Elements (UCEs) and additional nuclear genes, which provide an extensive nuclear dataset as a counterpart to the complete mitochondrial genome, which was sequenced within this study. Our study provides an exemplary case of a powerful genetic approach aimed at recovering recent phylogenetic history at the subspecies level by taking a novel pathway in extracting genome-wide SNPs (Single Nucleotide Polymorphisms) from UCE data. We further discuss important biological information that can be inferred from the observed discrepancy between the mitochondrial tree and the species tree. Beyond that, this study provides a direct comparison of a variety of datasets and analytical methods, exploring their performance at shallow evolutionary timescales, in order to help future studies make an informed choice of the most suitable genetic dataset.

Keywords: Illumina sequencing, sequence capture, UCE, SNP, species tree, gene tree, mitochondrial genome


Content

Introduction
The Mitochondrion
UCEs as a novel source of genetic data
SNPs
Topaza
Aims of this study
Methods
Taxon sampling
Next-Generation Sequencing
DNA extraction and library preparation
Probe design
Sequence enrichment and sequencing
Data processing
Mitochondrial tree
Species tree
Nuclear dataset
Mixed dataset
DISSECT
UCEs
SNPs
Population structure
Results
Data exploration
Mitochondrial tree
Species tree
Individuals analyzed separately
Species delimitation analysis
Individuals assigned to populations
Population structure
Discussion
Evaluation of phylogenetic relationships
Mitochondrial tree - the odd one out
Rivers as dispersal barriers
Effect of mtDNA on species tree
Evaluation of datasets
Conclusion
Acknowledgements
References
Supplemental Material


Introduction

The Mitochondrion

The mitochondrial genome has been a very popular source of genetic information for animals

since the beginnings of phylogenetic DNA-analyses, because it is easy to access and exists in high

copy numbers in most tissue cells. Mitochondrial DNA (mtDNA) in animals is generally characterized

by high mutation rates in comparison to nuclear DNA [1]; these high mutation rates produce valuable

phylogenetic information, even on relatively shallow evolutionary times. In birds, the average length

of the mitochondrial genome lies at around 17,000 bp (value based on all currently available

mitochondrial genomes (n=403) for birds at NCBI) but varies quite substantially in length between

different bird clades, ranging up to a length of more than 22,000 bp in hornbills (Bucerotidae) [2].

The mitochondrial genome contains multiple protein-coding genes, which encode subunits of enzymes

involved in cellular energy metabolism, namely the generation

of adenosine triphosphate (ATP). These genes are the cytochrome oxidase subunits (cox 1-3), the NADH

dehydrogenase subunits (nad 1-6), the ATP synthase subunits (atp 6+8) and cytochrome b (an integral

membrane protein involved in the respiratory chain). Additionally, the mitochondrial genome

contains its own set of tRNA and rRNA (12S and 16S) coding sequences (see Figure 4).

Combinations of exclusively mitochondrial genes (most frequently cytb, nd2 and cox1) have been

commonly used in bird phylogenetics throughout the last decade to infer phylogenetic trees [3]–[5].

Due to their fast mutation rate (in animals), mitochondrial loci are often more informative, and

therefore are considered more suitable than nuclear loci to explore the more recent phylogenetic

history. However, there are certain caveats to consider when using mitochondrial sequences for

species tree inference, which we address and discuss within this study.

Mitochondrial tree discordance

In vertebrates the mitochondrion is inherited maternally as a functional unit of the egg-cell [6].

Cases of recombination within mitochondrial genomes have been detected in various phylogenetic

studies [2], [7]–[9], but appear to be the exception rather than the rule, even though

Sammler et al. [2] suggest a fairly high rate of mitochondrial recombination among hornbills

(Aves: Bucerotidae). It is therefore recommended to test for recombination when using

mitochondrial sequence data. If no recombination is detected, the mitochondrial genome is to be

treated as one, uniparentally inherited locus. The consequence is that all mitochondrial

genes/sequences from one individual are in complete linkage and share the same evolutionary


history. This is an important point to consider when using mitochondrial genes in phylogenetic

studies. Unlike nuclear loci, which in many cases are unlinked (unless they are located in relatively

close proximity on the same chromosome), different mitochondrial genes do not represent different

genealogies; therefore any tree based on only mitochondrial markers represents a single gene tree

and not the species tree. This is an important distinction because of a variety of factors that can lead

to gene tree/species tree discordance, as broadly discussed in a recent review by Degnan and

Rosenberg [10]. The most common mechanisms possibly causing discordance between the

mitochondrial tree and the species tree in particular are shown in Figure 1.

Figure 1: Main mechanisms potentially causing discordance between the mitochondrial tree (red) and the species

tree. Drawn in blue is an alternative genealogy that is not in discordance with the species tree. a) Incomplete lineage

sorting: Looking backward in time, different, unlinked gene loci are expected to coalesce at different times between taxa

A and B. The coalescence of single genealogies may date back further than the actual speciation event (deep

coalescence), particularly in populations with large effective population sizes (Ne). This can lead to some loci coalescing

with the outgroup taxon C before coalescing between the two sister taxa A+B (see left diagram), causing a discordance of

that particular gene tree with the species tree. This is referred to as incomplete lineage sorting. b) Introgression: In the

case that two species or populations are not completely reproductively isolated, eventual gene-flow may occur. Such

gene-flow, even if brought by only a small group or a single individual, can lead to fixation of the new genes in the gene-

pool of the receiving population, which is referred to as introgression. When sampling a locus that has been subject to

such introgression, the resulting gene tree shows discordance with the species tree (see right diagram). In the example

here the mitochondrial lineage (red) introgresses from taxon C into taxon B.
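The dependence of incomplete lineage sorting on effective population size can be illustrated with a standard result of coalescent theory for three taxa. This is an illustrative addition, not a calculation from this study: for a species tree ((A,B),C) with an internal branch of t generations and a constant effective population size Ne (diploid individuals), the probability that a single unlinked gene tree is discordant is (2/3)·exp(-t / (2·Ne)). The function name and parameter names below are hypothetical, and the sketch assumes neutrality, no gene flow and a constant population size.

```python
import math


def discordance_probability(generations, effective_size):
    """Probability that a gene tree disagrees with a three-taxon species tree.

    generations    -- length of the internal branch (ancestor of A+B) in generations
    effective_size -- effective population size Ne of that ancestral population
                      (diploid individuals; assumed constant)
    """
    # Internal branch length in coalescent units (2 * Ne generations per unit).
    branch_length = generations / (2.0 * effective_size)
    # The two lineages fail to coalesce on the branch with probability
    # exp(-T); two of the three equally likely deeper resolutions are discordant.
    return (2.0 / 3.0) * math.exp(-branch_length)


if __name__ == "__main__":
    # Hypothetical values: the same divergence, a nuclear locus versus a locus
    # with a fourfold smaller effective population size.
    print(discordance_probability(20000, 10000))
    print(discordance_probability(20000, 2500))
```

A locus with a smaller effective population size spans more coalescent units on the same branch and is therefore expected to sort faster, which is why large Ne makes deep coalescence more likely.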

Linkage and non-neutrality

There are various concerns about the neutrality of the mitochondrial locus, which is an important

criterion for all phylogenetic models. One main argument in this context is that mitochondria in birds

are strictly maternally inherited and are therefore in complete genetic linkage with the W

chromosome (the female sex chromosome in birds) [11], [12]. Therefore, indirect selection acting on

mitochondria through linkage with the W chromosome is proposed to be rather common [13] and is

thought to be of major concern when using mitochondrial loci for phylogenetic inference, as it makes


inferences based solely on mtDNA unreliable [14]. Additionally, several studies [14]–[16] have found

indication of direct selection on mitochondrial loci, which further enforces the above-mentioned

concerns.

Mitochondrial bias on tree inference methods

The mitochondrial genome is haploid and in vertebrates exclusively inherited maternally [6]. For

these reasons, mitochondrial genes are generally modeled to have one quarter of the effective population

size of nuclear genes. This has been argued to make mitochondrial genes a more reliable source in terms

of species tree inference, since genes with lower ploidy are expected to more accurately follow the

species tree, due to their lower effective population size [17]. However, the same reasons are

currently causing scientific debate around the question of whether mitochondrial genes have a

disproportionate effect on species-tree inference with mixed (nuclear and mitochondrial), multi-

locus datasets. A recent study conducted by Jockusch et al. [18] on slender salamanders found a bias

particularly in Bayesian tree inference methods toward loci with higher variability. They found that

the addition of mitochondrial sequences to multi-locus nuclear datasets forced the species tree

inference to disproportionally gravitate toward the more informative mitochondrial gene tree when

analyzed in a multispecies coalescent framework. A similar bias of mitochondrial genes has been

shown in other studies [19]. Yet, there are also various studies that explicitly test and find no

evidence for a disproportional influence of mitochondrial genes on species-tree inference [5], [20],

[21].

Setting the mitochondrial effective population size to one quarter that of nuclear genes has been

argued to not be accurate in all cases, since this is based on the assumption of only one

mitochondrial copy being transmitted per egg-cell and it further assumes an equal sex ratio. If

these assumptions are not met, especially if the sex ratio deviates strongly from parity,

the effective population size of mitochondria may even exceed that of nuclear loci [6]. Therefore, the

opinion that the mitochondrial tree is more likely than nuclear loci to follow the species tree

closely is not shared among all scientists [6], [18].
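The one-quarter rule and its dependence on the sex ratio can be made explicit with a short calculation. The sketch below is an illustration under stated assumptions, not the model used in this study: it uses Wright's formula for the autosomal effective size with unequal numbers of breeding males and females, Ne = 4·Nm·Nf / (Nm + Nf), counts two gene copies per diploid individual, and assumes one effective mitochondrial lineage per breeding female. All function names are hypothetical.

```python
def nuclear_gene_copies(n_males, n_females):
    """Effective number of autosomal gene copies (2 * Ne, Wright's formula)."""
    ne = 4.0 * n_males * n_females / (n_males + n_females)
    return 2.0 * ne


def mito_gene_copies(n_females):
    """Effective number of mitochondrial copies.

    Assumes strict maternal inheritance and a single effective
    mitochondrial lineage per breeding female (no heteroplasmy).
    """
    return float(n_females)


def mito_to_nuclear_ratio(n_males, n_females):
    """Ratio of mitochondrial to nuclear effective number of gene copies."""
    return mito_gene_copies(n_females) / nuclear_gene_copies(n_males, n_females)


if __name__ == "__main__":
    # Equal sex ratio: recovers the usual one-quarter rule.
    print(mito_to_nuclear_ratio(500, 500))
    # Hypothetical, strongly female-biased breeding population.
    print(mito_to_nuclear_ratio(100, 900))
```

Under these assumptions the ratio simplifies to (Nm + Nf) / (8·Nm), so it equals one quarter at an equal sex ratio and exceeds one once breeding females outnumber breeding males by more than seven to one.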

UCEs as a novel source of genetic data

The aforementioned concerns about discordance of single genealogies and particular caveats

surrounding mitochondrial sequences point out the importance of sampling a sufficient number of

unlinked genetic markers in order to accurately estimate the species tree. One novel approach is the

generation of Ultraconserved Element (UCE) sequences, which are distributed across the complete

nuclear genome and provide a massive multilocus dataset of unlinked nuclear loci.

UCEs are not classified by their function, but solely by the fact that they are highly conserved

across a wide range of animal taxa. In fact, for many of these sequences the function is unknown but



many of them are thought to be involved in essential processes during ontogenesis [22] and

in gene regulation [23]. Many such highly conserved sequences have been identified,

distributed across the whole genome [22], [24], [25]. The highly conserved nature of these regions

makes them adequate candidates for standard multilocus sequence capture kits, which do not have

to be specifically designed for the taxon group of interest, but are broadly applicable. UCE probe

sets for sequence capture are available for different, broadly defined

organism groups (e.g. amniotes, fish, etc.), which contain several thousands of unlinked, highly

conserved loci (http://ultraconserved.org, last accessed April 20, 2015). The target loci of such UCE

kits have been selected to match a certain profile; they have to consist of a highly conserved core-

region of 100-200 bp length flanked in direct proximity by more variable regions. These flanking

regions have to be located closely enough to the core region to be captured on the same fragment as

the conserved target sequence during the sequence capture process [26]. Several recent studies

have shown the effectiveness of UCE datasets for estimating population divergence times [27] and

phylogenetic tree inference on shallow [28], as well as deep evolutionary times [29].

SNPs

Single Nucleotide Polymorphisms (SNPs) are another type of genetic data that is becoming more popular

in phylogenetic studies. The increasing feasibility of sequencing vast numbers of unlinked

genetic loci across the complete genome (e.g. UCE data) provides a good basis for the

generation of genome-wide SNP data. These data are usually generated by extracting single

polymorphic sites from unlinked genetic loci (one site per locus), then adding these extracted sites

into one concatenated alignment. Most commonly, only biallelic polymorphisms are extracted, meaning

those sites at which exactly two different nucleotides occur. This leads to a dataset with

maximum informativeness, containing only sites that are variable between the targeted taxa. The

recent development of Bayesian methods for inference of species trees from biallelic character sets

such as SNPs [30] has made the use of SNP data more accessible and attractive for phylogenetic

studies. Additionally, SNP data can be used for a range of methods developed in the field of

population genetics in order to examine ancient admixture patterns and genetic introgression [31],

[32] to name only a few applications. Within recent phylogenetic avian studies, SNPs have proven to

be a very useful and powerful tool to examine fine-scale population patterns within bird

communities [28], [33], [34].
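The extraction of one biallelic site per locus described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function names are hypothetical, each locus is assumed to be an alignment stored as a mapping from taxon name to an equal-length sequence, columns containing gaps or ambiguity codes are skipped, and the first qualifying column is taken (a random choice per locus would be an equally valid rule).

```python
def biallelic_columns(alignment):
    """Return indices of columns with exactly two different unambiguous bases.

    alignment -- dict mapping taxon name to an aligned sequence (equal lengths)
    """
    taxa = sorted(alignment)
    length = len(alignment[taxa[0]])
    columns = []
    for i in range(length):
        column = [alignment[taxon][i].upper() for taxon in taxa]
        # Assumption: skip columns with gaps, N or other ambiguity codes.
        if any(base not in "ACGT" for base in column):
            continue
        if len(set(column)) == 2:
            columns.append(i)
    return columns


def extract_snps(loci):
    """Build one SNP alignment with a single biallelic site per locus.

    loci -- dict mapping locus name to an alignment (see biallelic_columns)
    """
    taxa = sorted(next(iter(loci.values())))
    snps = {taxon: [] for taxon in taxa}
    for name in sorted(loci):
        columns = biallelic_columns(loci[name])
        if not columns:
            continue  # invariant locus: contributes no SNP
        site = columns[0]  # assumption: take the first biallelic site
        for taxon in taxa:
            snps[taxon].append(loci[name][taxon][site].upper())
    return {taxon: "".join(bases) for taxon, bases in snps.items()}


if __name__ == "__main__":
    # Toy data with hypothetical locus and taxon names.
    toy = {
        "uce-1": {"A": "ACGT", "B": "ACGA", "C": "ACGA"},
        "uce-2": {"A": "GGGG", "B": "GGGG", "C": "GGGG"},
        "uce-3": {"A": "TTAC", "B": "TCAG", "C": "TGAG"},
    }
    print(extract_snps(toy))
```

Taking only one site per locus keeps the sites in the final alignment unlinked, which is the property that SNP-based species tree methods rely on.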

Topaza

The genus Topaza contains some of the most spectacular and largest hummingbirds worldwide,

measuring up to 23 cm (for adult males, including tail feathers) and weighing up to 12 g [35], [36].


Characteristic of males within this genus are the two long, purplish black tail feathers, which can

reach lengths of up to 12 cm. Adult males and females of this genus are strongly dimorphic. The

plumage of adult male Topaza hummingbirds shows a characteristic metallic shine, particularly on

the yellow-green throat (gorget) and the green undertail coverts. Chest and breast feathers are

colored reddish-grey, whereas the whole bird seen from a distance appears red, gradually

decreasing in intensity from the head toward the tail and turning yellow-green at the rump and tail

coverts. Adult females appear a more uniform green, with a metallically shining orange

throat. These birds are usually found in the canopy along forest edges and clearings, and are often

seen close to river banks [37].

The number of species distinguished within this genus has been the subject of taxonomic

discussion; based on uncertain morphological evidence, some scientists refer to Topaza as a

monotypic genus [37], [38], whereas others distinguish two separate species [39], [40]. The current

consensus among ornithologists commonly distinguishes two separate species within this genus - the

Crimson Topaz (Topaza pella) [36] and the Fiery Topaz (Topaza pyra) [35] - which both occur in

northern South America (Figure 2). The two species do not occur sympatrically except for a very small

range along the Rio Negro. There are multiple conflicting hypotheses concerning subspecies

assignments within these species, which have been discussed in the scientific literature over the last

decades, based on morphological characters [38], [40]. At this point, no genetic study known to the

authors has explored the genetic structure within Topaza species and has addressed these

subspecies assignments on a molecular level.

Aims of this study

Here we use a variety of multilocus datasets to explore the genetic substructure within the

currently recognized Topaza species (T. pella and T. pyra). We discuss the genetic validity of previous

subspecies assignments and the effect of the Amazon River as a possible dispersal barrier for Topaza

hummingbirds. Further, we investigate an apparent discrepancy of the mitochondrial tree with the

species tree, and explore how this discrepancy influences the inference of species trees when

mitochondrial sequences are added to a multilocus nuclear dataset. Finally, we aim to evaluate the

utility of different datasets for the inference of genetic structure below the species level. We are

specifically addressing the following questions:

1. What are the phylogenetic relationships between the sampled individuals,

and which samples belong to the same population?

2. Do we see a genetic separation of populations concordant with the course of the Amazon

River?

3. Do these inferred populations match previously described morphological subspecies?



4. How does the mitochondrial tree match the inferred species tree?

5. How does the addition of the complete mitochondrial genome to a multilocus nuclear

dataset influence species tree inference?

6. How suitable are the applied datasets and methods to explore genetic substructure below

the species level?

In order to address these questions, we generate novel data from thousands of UCE loci and

further sequence a set of previously characterized nuclear gene loci [41]–[44], using a sequence

capture approach and Illumina® Next-Generation sequencing. In order to further improve the

informativeness of this dataset for examining shallow evolutionary timescales, we extract SNP data

from the large number of UCE-loci. Additionally, the sequence capture approach provides sufficient

coverage of mitochondrial sequences which we use in this study to assemble the entire

mitochondrial genome for each sample. We use a variety of phylogenetic methods [30], [45]–[48] in

order to explore the extensive genetic information lying within these mitochondrial and nuclear

sequence data. Each dataset is used separately for species tree inference, and consistent

phylogenetic patterns are evaluated and further explored with Bayesian species delimitation

methods [48]. Furthermore, we use population genetic methods [31] to explore the admixture

patterns among the sampled individuals and relate these results to the inferred phylogenetic

relationships. We find a well-supported discrepancy of the mitochondrial tree with the species tree,

which is consistently present in all inferred species trees. We test how this discrepancy influences

species delimitation and species tree inference when the mitochondrial sequence is added to a

multilocus nuclear dataset.

Methods

Taxon sampling

The individuals for this study were sampled with the goal of covering the maximum extent of the

Topaza distribution. Additionally, in order to test for a possible dispersal barrier effect of the Amazon

River, we sampled individuals from both sides (north and south) of the river. This resulted in a total

of 4 samples for T. pyra (2 north, 2 south) and 5 samples for T. pella (2 north, 3 south, see Figure 2).

The distribution range of Topaza species was modeled with the R package speciesgeocodeR [49],

based on occurrence data available in the eBird database [50]. Further, we included one sample of

the sister genus Florisuga (F. fusca) as an outgroup taxon. All samples were

obtained from museum specimens, with skins being available for further reference (Table 1).


Table 1: Voucher information for sampled taxa. The column ‘Taxon’ refers to the current species assignments.

ID  Taxon            Voucher code  Institution
1   T. pyra          INPAA1106     Instituto Nacional de Pesquisas da Amazônia
2   T. pyra          MPEG62475     Museum Paraense Emílio Goeldi
3   T. pyra          MPEG62474     Museum Paraense Emílio Goeldi
4   T. pyra          MPEG52721     Museum Paraense Emílio Goeldi
5   T. pella         USNM586322    National Museum of Natural History, Smithsonian Institution, Washington DC, USA
6   T. pella         INPAA3319     Instituto Nacional de Pesquisas da Amazônia
7   T. pella         MPEG61688     Museum Paraense Emílio Goeldi
8   T. pella         MPEG65603     Museum Paraense Emílio Goeldi
9   T. pella         INPAA6233     Instituto Nacional de Pesquisas da Amazônia
10  Florisuga fusca  MPEG70697     Museum Paraense Emílio Goeldi

Figure 2: Distribution map of T. pyra (green) and T. pella (red) and collection location (numbered symbols) of the

Topaza samples. Distribution ranges of Topaza species were modeled with the R-package speciesgeocodeR, based on

available eBird occurrence data. All occurrence points used for the distribution range modeling are plotted as

transparent symbols. Shown in light blue are the courses of the main rivers in the Amazon basin, namely the Amazon

River (horizontal axis), joined by the Rio Negro from the north and the Rio Madeira from the South.



Next-Generation Sequencing

DNA extraction and library preparation

DNA of all samples was extracted from muscle tissue using the Qiagen DNeasy Blood and Tissue

Kit according to the manufacturer’s instructions (Qiagen GmbH, Hilden, Germany). Before library

preparation, all samples were sonicated with a Covaris S220 sonication device in order to break the

genomic DNA into shorter fragments. The settings were chosen to break the DNA into fragments of

approx. 800 bp. This fragment length is the maximum recommended for sequence capture and was

chosen in order to capture as much of the variable flanking regions of the UCE loci as possible.

Paired-end, size-selected DNA libraries were prepared for sequencing on an Illumina® platform using the

magnetic-bead-based NEXTflex™ Rapid DNA-Seq Kit (Catalog #: 5144-02, Bioo Scientific Corporation,

Austin, TX, USA) following the enclosed manual (version 14.02), containing the following steps:

End-Repair and Adenylation: In this step, sticky ends are removed from the double-stranded DNA

fragments and an adenine is added to the end of each strand, which is necessary for the adapter

ligation in the next step.

Adapter Ligation: We used barcode adapters from the NEXTflex™ DNA Barcodes 48 kit (Catalog #:

514104) which were ligated to the double-stranded DNA fragments.

Magnetic Bead Size Selection: We selected fragments of 650-800 bp length using magnetic beads

(Agencourt AMPure XP), including several washing steps to purify the DNA in the final sample

solution.

PCR Amplification & Purification: The final size-selected, cleaned DNA samples were amplified by

PCR (15 cycles) using the NEXTflex primer set provided in the NEXTflex™ DNA Barcode kit. The PCR

product was purified with the QIAquick® PCR Purification kit (QIAGEN group), following the

manufacturer's manual, but using only 30 µL of elution buffer (instead of the recommended 50 µL) for

the final elution, in order to retrieve a higher concentration of the final DNA library.

Probe design

Ultraconserved elements:

The sequence capture probe library consisted of a set of 2,560 probes targeting 2,386

Ultraconserved Elements (Tetrapods-UCE-2.5Kv1), as first described by Faircloth et al. [26]. We used

probes of 120 bp length; sequences were downloaded from http://ultraconserved.org (last accessed

April 20, 2015). The majority of UCE loci are targeted with only one single probe per locus. Given the

size-selected, approximately 800 bp long fragments and the probe sequences of 120 bp length, one

can expect to receive up to 680 bp of flanking region on each side of the target sequence. The UCE

probe set used for this project is designed for tetrapods, and can therefore be applied to a broad


range of animals, including amphibians, reptiles, birds and mammals. The selected loci are

distributed across the complete genome and are genetically unlinked.

Nuclear genetic markers:

We further designed probes for capturing 10 nuclear genetic markers, commonly used in bird

phylogenetics, namely the genes coding for:

1. Beta fibrinogen, exons 7+8 (Bfib)

2. Eukaryotic translation elongation factor 2, exons 5-9 (EEF2)

3. Early growth response 1, exon 2 (EGR1)

4. Fibrinogen beta chain, exon 5 (FGB)

5. Myoglobin, exon 2 (MB)

6. Ornithine decarboxylase, exons 6-8 (ODC)

7. Recombination activating protein 1 (RAG1)

8. Transforming growth factor beta 2, exons 5+6 (TGFB2)

9. Zinc finger protein, exon 2 (ZENK2)

10. Zinc finger protein, 3' UTR (ZENK3)

For creating target-specific probes (length 120 bp) covering these loci, we used a 30 bp tiling

design (new probe starting every 30 bp of the target sequence), resulting in 4-fold probe coverage of

each locus. Probes were designed based on available reference sequences for these loci of closely

related taxa, obtained from NCBI GenBank (see Table 2 for accession-no. and locus information).
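The tiling design described above can be illustrated with a short sketch. It is not the software used to design the probes in this study; the function names are hypothetical, and the only inputs taken from the text are the probe length (120 bp) and the tiling step (30 bp). The sketch also shows that the 4-fold coverage applies to the interior of a target, while the first and last positions are covered by fewer probes.

```python
def tile_probes(target, probe_len=120, step=30):
    """Return (start, sequence) for probes tiled along a target sequence.

    A new probe starts every `step` bases; probes that would run past
    the end of the target are not generated.
    """
    starts = range(0, len(target) - probe_len + 1, step)
    return [(start, target[start:start + probe_len]) for start in starts]


def coverage_depth(target_len, probes, probe_len=120):
    """Count how many probes cover each position of the target."""
    depth = [0] * target_len
    for start, _sequence in probes:
        for position in range(start, start + probe_len):
            depth[position] += 1
    return depth


if __name__ == "__main__":
    # Hypothetical 600 bp reference sequence for one locus.
    reference = "ACGT" * 150
    probes = tile_probes(reference)
    depth = coverage_depth(len(reference), probes)
    print(len(probes), max(depth), depth[0])
```

With a probe length of 120 bp and a step of 30 bp, each interior position is covered by 120 / 30 = 4 probes.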

Mitochondrion:

Due to the high copy number of mitochondrial genomes in muscle cells in particular, a very large

number of fragments covering the mitochondrial genome is found in the final target-enriched

sequence mix, even if no probes targeting the mitochondrial genome are used for sequence

capture. In this study we had no probes targeting mitochondrial sequences, yet we were able to

assemble the complete mitochondrial genome for all samples (see Table 2 for information about

mitochondrial coverage).

Sequence enrichment and sequencing

The sequence enrichment was performed using a sequence capture MYbaits kit according to the

enclosed user manual (V. 1.3.8). The target-specific probes were mixed with the hybridization buffers

and the DNA library and incubated for hybridization for 38 hours at 65°C. During this hybridization

period, the biotinylated baits bind to their specific target regions. In the next step, magnetic

streptavidin beads are applied, which have a high affinity to biotin. The biotinylated probes, which

have hybridized to the target DNA regions, bind to the streptavidin beads. These beads are then



immobilized with a magnet stand, and the supernatant, containing the non-target DNA, is discarded. After

several washing steps, the target DNA was eluted from the beads and transferred into a fresh

tube. We then desalted the sequence capture product with a QIAquick® PCR purification column

(QIAGEN group) following the manufacturer's manual, but using only 20 µL elution buffer for the final

elution, in order to retrieve a more concentrated DNA solution. After this we ran another PCR (14

cycles) to amplify the final product for all samples. After this final PCR, all samples were pooled into

an equimolar mix, with a total DNA content of 689 ng double-stranded, barcoded DNA with an

average fragment length of 632 bp.

The final product was sent in for sequencing to the Sahlgrenska Genomics Core Facility in

Gothenburg, Sweden. Sequencing was performed with a one lane, 250 bp paired-end Illumina®

MiSeq run (Illumina Inc. San Diego, CA, USA).

Data processing

UCE assembly:

We used the PHYLUCE software package (https://github.com/faircloth-lab/phyluce last accessed

April 20, 2015) for reviewing and assembling the sequenced UCE-loci. All programs and scripts

mentioned in the following are integrated in the PHYLUCE package. A more precise documentation of

the complete workflow described here can be found at https://github.com/tobiashofmann88/UCE-data-management/wiki.

We used “illumiprocessor” to trim all reads of adapter contamination and sort out reads with low

quality scores or ambiguous bases. The trimmed reads were then assembled into contigs using

“assemblo_abyss.py”. Contigs are clusters of reads that are covering the same region (see Figure 3).

The consensus sequences of all assembled contigs are printed into one fasta-file, resulting in a file

with >100,000 separate contig consensus sequences (in the following, simply referred to as contigs)

with each sequence carrying an individual ID. All of these contigs were mapped against the UCE

sequences from the probe order file with “match_contigs_to_probes.py” in order to find those

sequences which represent UCE-loci that were selected and amplified during the sequence capture

process. This program prints the results of the mapping process into a SQL database; more

specifically, it prints the information containing which UCE loci could be found in which sample, and

the corresponding contig IDs. Given this information, we extracted all those UCE-loci from the contig-

fasta-file that were present in all sampled taxa, using “get_fastas_from_match_counts.py”. The

extracted sequences were aligned among all samples for each locus using MAFFT as implemented in

the PHYLUCE software package (“seqcap_align_2.py”).
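The selection of loci that are present in all sampled taxa amounts to a set intersection. The following minimal sketch illustrates the idea on an invented toy input. It is not the PHYLUCE implementation, and the function name, sample names and locus names are hypothetical.

```python
def loci_shared_by_all(matches):
    """Return the sorted loci recovered in every sample.

    `matches` maps a sample name to the collection of UCE locus names
    that were found among the contigs of that sample.
    """
    if not matches:
        return []
    shared = set.intersection(*(set(loci) for loci in matches.values()))
    return sorted(shared)


if __name__ == "__main__":
    # Hypothetical toy input; sample and locus names are invented.
    toy = {
        "sample1": {"uce-1", "uce-2", "uce-3"},
        "sample2": {"uce-2", "uce-3"},
        "sample3": {"uce-2", "uce-3", "uce-4"},
    }
    print(loci_shared_by_all(toy))  # ['uce-2', 'uce-3']
```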


Figure 3: Assembling of reads into contigs. Reads can be assembled into contigs by either mapping them against a

reference sequence (gene of interest), as in this example, or they can be assembled relative to each other without the

use of a reference sequence. Such algorithms performing the latter find overlapping regions of single reads and use these

matching reads to create a growing consensus sequence, until they reach a minimum threshold of read coverage on

either side of the contig. Contigs can consist of assemblies of only a handful of reads or can span over big genomic

regions (e.g. the complete mitochondrial genome), entailing 100,000s of reads. The vertical extent of a contig is referred

to as read-depth, which is a measure of how reliably certain regions are covered.

Mapping and phasing of nuclear genes

Sequences of nuclear genes were assembled using the CLC Workbench software (CLC-

AssemblyCell, version 4.3.0, CLC Bio-Qiagen, Aarhus, Denmark). The adapter- and quality-trimmed

reads from the illumiprocessor processing (see ‘UCE assembly’) were mapped against the reference

sequence for each gene (same sequences used for probe design, Table 2), using “clc_mapper”. After

converting the resulting cas-assembly-files into bam-format (with the program “clc_cas_to_sam”),

we used samtools, version 0.1.19 [51] to sort the bam-files (“sort”) and create bam-index-files

(“index”) in order to view the assemblies in Tablet [52]. Assemblies were inspected by eye for

contamination with low-quality reads and duplicate reads. The CLC-AssemblyCell software package

contains software options for quality trimming (“clc_quality_trim”) and removal of duplicates

(“clc_remove_duplicates”) which can be applied to improve assemblies if they show the above-

mentioned contamination.

The final bam-assemblies were phased with samtools (“phase”) in order to sort the reads from

the assembly into two separate alleles, if present. The consensus sequence of the resulting phased

assemblies was created with a combination of the samtools “mpileup” command, bcftools and

vcfutils.pl, as suggested in the samtools manual (http://samtools.sourceforge.net/samtools.shtml

last accessed May 17, 2015). The final consensus sequences were checked for the absence of

ambiguous sites and were further controlled for correct phasing by examining the equivalent bam-

assembly-files to each sequence. The mentioned commands are part of the samtools software

package, which is freely available at https://github.com/samtools (last accessed May 17, 2015). An

automated workflow of the above-described steps of assembling and phasing gene loci with


illumiprocessor-trimmed reads is available at https://github.com/tobiashofmann88/Processing-Illumina-reads.

Alignments for each locus were created using the MAFFT multiple alignment builder plugin

(version 1.3) in Geneious, version 6.1.8 [53], using the default settings. If two alleles were present at

a given locus, both were included in the alignment. Final alignments were checked by eye for

alignment errors and were exported into nexus-format.

Mitochondrial genome assembly

For the assembly of the mitochondrial genome we used the same trimmed-read files as in the

previously described assemblies. First we ran “clc_assembler” (part of CLC-AssemblyCell) in order to

assemble the reads into contigs (see Figure 3). The program prints the consensus sequence of each

contig that could be assembled into one joint fasta-file. We then mapped all reads from the trimmed-

read files against these assembled contig consensus sequences (“clc_mapper”) in order to obtain

information about the read coverage of each contig (“clc_sequence_info”). In the next step, we

created a blast-database from the contig-fasta-files, using the command “makeblastdb” from the

blast+ software package [54]. We downloaded the taxonomically closest available mitochondrial

genome sequence (Trochilidae: Amazilia versicolor, see Table 2) from NCBI and blasted this sequence

against the previously created contig database. Blasting was done using the command “tblastx”,

which translates both the query and the database nucleotide sequences into amino acid sequences

before comparing them, which makes the search more sensitive and results in more matches. All hits from

the contig-blast-database were printed into an xml-file, which was reviewed using ngKlast, version

4.5 [55]. The longest match was inspected, checking the extent of coverage of the reference

mitochondrial genome. In all cases, the longest matching contig covered the complete

mitochondrial genome and was therefore extracted from the contig-fasta-file. We provide an

automated workflow of the above described steps for assembling the mitochondrial genome at

https://github.com/tobiashofmann88/assembling-complete-mt-genome.
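The final extraction step, picking the longest contig among those with a blast hit, can be sketched as follows. This is a simplified illustration and not the automated workflow linked above. The function names are hypothetical, and the hit list is assumed to be available as a plain list of contig IDs rather than as the xml output reviewed in ngKlast.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {sequence_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def longest_hit_contig(contigs, hit_ids):
    """Return (contig_id, sequence) of the longest contig with a blast hit."""
    candidates = [cid for cid in hit_ids if cid in contigs]
    if not candidates:
        return None
    best = max(candidates, key=lambda cid: len(contigs[cid]))
    return best, contigs[best]


if __name__ == "__main__":
    # Hypothetical toy contigs; real contig files hold >100,000 records.
    toy = ">c1\nACGT\nAC\n>c2\nACGTACGTAC\n>c3\nA\n"
    print(longest_hit_contig(parse_fasta(toy), ["c1", "c3"]))
```

Whether the selected contig really spans the complete mitochondrial genome still has to be checked against the reference, as described above.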

The extracted longest contigs, representing the mitochondrial genome, were aligned with the

reference mitochondrion of A. versicolor for all samples, using the MAFFT online alignment software

(version 7, http://mafft.cbrc.jp/alignment/server/ last accessed May 17, 2015). All sequences were

oriented in the same direction and edited to start at the same position (according to the reference

sequence). The separated sequences were then annotated using DOGMA [56]. DOGMA blasts

(“tblastx”) the input nucleotide sequence in all six reading frames (both reading directions and each

of the three possible codon positions) against an amino acid sequence database of each

mitochondrial sequence element (mRNA, rRNA and tRNA coding sequences). The database is located

on the DOGMA server and contains a multitude of mitochondrial sequences across all animal groups.

As a result, the user receives a list of the identified coding regions and the respective names and


positions (in bp) of these regions on the input sequence. We plotted the resulting annotations with

GenomeVx [57] to create a circular map of the mitochondrial genome (e.g. Figure 4).
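The six reading frames mentioned above (two reading directions, three codon positions each) can be made explicit with a short sketch. It only splits a nucleotide sequence into codons for each frame and does not translate them or perform any database search. The function names and frame labels are hypothetical and do not correspond to DOGMA's internals.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a sequence into codons for all six reading frames.

    Keys are '+1', '+2', '+3' for the forward strand and '-1', '-2', '-3'
    for the reverse complement; incomplete trailing codons are dropped.
    """
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for name, codons in six_frames("ATGAAACCC").items():
        print(name, codons)
```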

Mitochondrial genome sequences of all taxa were realigned and annotations were checked and,

if necessary, synchronized across the alignment, using the BioEdit alignment editor [58]. The

sequence alignment for each annotated coding element was extracted separately in fasta-format

using Geneious, version 6.1.8 [53]. The amino acid sequences for the extracted sequence alignments

of the 13 mRNA coding genes were examined in BioEdit for alignment errors leading to reading frame

shifts.

SNP-datasets

Each UCE-locus-alignment that could be assembled for all taxa was scanned for sites that were

biallelic polymorphic within the Topaza samples and did not contain missing data. Among these

polymorphic sites, one single nucleotide polymorphism (SNP) was randomly chosen per locus and

coded into binary format (0 or 1) into a joint alignment file. This resulted in a set of 570 SNPs.

Additional SNP datasets were extracted, specifically aiming for variation within the currently

recognized species, containing only sites that were found biallelic polymorphic within T. pella (621

SNPs) and T. pyra (524 SNPs) respectively. All the above steps were performed using a customized

script, provided by Yann Bertrand (Department of Biological and Environmental Sciences,

Gothenburg University, Sweden).
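The logic of the SNP extraction can be sketched as below. This is not the customized script used in this study but a minimal re-implementation of the described steps under stated assumptions: an alignment is a dictionary of equally long sequences, only A, C, G and T count as non-missing data, and the allele carried by the first ingroup sample is arbitrarily coded as 0. All function names are hypothetical.

```python
import random


def biallelic_sites(alignment, ingroup):
    """Return column indices that are biallelic within the ingroup samples
    and contain no missing or ambiguous data."""
    seqs = [alignment[name].upper() for name in ingroup]
    sites = []
    for i, column in enumerate(zip(*seqs)):
        if any(base not in "ACGT" for base in column):
            continue  # skip gaps, Ns and ambiguity codes
        if len(set(column)) == 2:
            sites.append(i)
    return sites


def pick_binary_snp(alignment, ingroup, rng):
    """Pick one biallelic site at random and code it as 0/1 per sample.

    Returns None if the locus has no suitable site.
    """
    sites = biallelic_sites(alignment, ingroup)
    if not sites:
        return None
    site = rng.choice(sites)
    reference = alignment[ingroup[0]][site].upper()  # arbitrary '0' allele
    return {
        name: 0 if alignment[name][site].upper() == reference else 1
        for name in ingroup
    }


if __name__ == "__main__":
    # Hypothetical toy locus with a single biallelic site (column 1).
    toy = {"s1": "ACGTA", "s2": "ACGTA", "s3": "ATGTA"}
    print(pick_binary_snp(toy, ["s1", "s2", "s3"], random.Random(0)))
```

Calling `pick_binary_snp` once per locus and concatenating the results gives one SNP per locus, which keeps the markers approximately unlinked.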

Mitochondrial tree

Former studies have found considerable differences in substitution rates between the different

regions across the mitochondrial genome [18], [59], [60]. In order to apply the most suitable

parameters in terms of both substitution rate and substitution model, we partitioned the data into 15

partitions, including a separate partition for each protein-coding locus (13), one partition (1)

including all concatenated 22 tRNA-coding sequences and another partition (1) containing both rRNA

coding sequences (12S and 16S ribosomal subunit). Substitution models and clock models were

unlinked for all 15 partitions. The most suitable substitution model for each partition according to

the Bayesian Information Criterion (BIC) was determined with jModeltest [61]. We excluded the

control region (misc_feature) of the mitochondrial genome, which is located in between the coding

regions for ND6 and the 12S ribosomal subunit (see Figure 4), since it contains highly variable

stretches that could not be aligned reliably. Since the mitochondrion

is inherited as a single unit, all partitions are expected to follow the same gene tree, given that no

recombination has taken place. We therefore conducted a recombination test using RDP, version

3.44 [62] on the alignment containing the complete mitochondrial genome sequences. The three

methods RDP [63], MaxChi [64] and Chimaera [65] were applied with a p-value threshold of 0.1, in order



to screen the alignment for possible recombinant elements. We found no indications for

recombination events across the alignment, and therefore linked the trees of all partitions.

In order to retrieve a dated phylogeny of Topaza, we used substitution rate priors of the

mitochondrial genes ND2, ND4 and the tRNA-partition, estimated for honeycreepers by Lerner et al.

[66]. These rate-priors were defined as normal distributions scaled in mutations/site/Ma with equal

rates for ND2 and ND4 (mean = 0.0219, SD = 0.0015) and a slower rate for the tRNA partition (mean

= 0.005, SD = 0.00207). We further implemented clade-age priors for the split between Topaza and

its sister genus Florisuga (mean=18.84, SD=1.6 Ma) and the split between T. pyra and T. pella

(mean=3.01, SD=0.4 Ma), which were estimated by McGuire et al. [67] as part of the species-tree of

the complete hummingbird family (Trochilidae) based on the above-mentioned substitution rate

priors and on island-age, as well as fossil calibrations for outgroups of the hummingbird family. We

tested different combinations of the aforementioned dating priors in order to check how these priors

influence each other.
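To get a feeling for how restrictive these normal priors are, the central 95 % interval implied by each mean and standard deviation can be computed. The sketch below uses only the prior values stated above. The dictionary labels and the function name are hypothetical, and the calculation is independent of BEAST.

```python
from statistics import NormalDist

# (mean, SD) of the normal priors described in the text.
PRIORS = {
    "ND2/ND4 rate (subst./site/Ma)": (0.0219, 0.0015),
    "tRNA rate (subst./site/Ma)": (0.005, 0.00207),
    "Topaza-Florisuga split (Ma)": (18.84, 1.6),
    "T. pyra-T. pella split (Ma)": (3.01, 0.4),
}


def central_interval(mean, sd, mass=0.95):
    """Return the lower and upper bound of the central interval that
    contains `mass` of a normal distribution."""
    dist = NormalDist(mu=mean, sigma=sd)
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


if __name__ == "__main__":
    for label, (mean, sd) in PRIORS.items():
        low, high = central_interval(mean, sd)
        print(f"{label}: {low:.5f} to {high:.5f}")
```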

We used BEAUti, version 1.8.0 [45] to set up an xml-file with the above described priors

concerning partitioning and dating. We assigned a log-normal relaxed clock to each partition and

chose a Yule process speciation tree prior [68]. The MCMC chain was set for 100 million generations

and trees and logging information were printed every 10,000 generations, using BEAST, version 1.8 [45].

After initial issues with convergence of the MCMC chain (see Results), we set the base frequencies

for all partitions to ‘empirical’ and restricted the uncorrelated lognormal relaxed clock (ucld) mean

values from the very broad default to a more realistic range (mRNA coding loci: uniform, initial=0.02,

upper=0.2, lower=0.002; tRNAs and rRNAs: initial=0.005, upper=0.05, lower=0.0005). After checking

MCMC runs for proper convergence with Tracer, version 1.6 [69], we summarized the posterior tree

distribution into the maximum clade credibility tree using TreeAnnotator, version 1.8 [45], discarding

the first 1,000 trees (10%) as burn-in.
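The number of trees entering the summary follows directly from the chain length, the logging interval and the burn-in. A minimal sketch of this arithmetic is given below, assuming that the initial state of the chain is not counted as a logged tree. The function name is hypothetical.

```python
def posterior_tree_counts(chain_length, log_every, burnin_percent):
    """Return (logged trees, burn-in trees, retained trees) for an MCMC run."""
    logged = chain_length // log_every
    burnin = logged * burnin_percent // 100
    return logged, burnin, logged - burnin


if __name__ == "__main__":
    # Values from the text: 100 million generations, logged every 10,000,
    # with 10 % burn-in.
    print(posterior_tree_counts(100_000_000, 10_000, 10))  # (10000, 1000, 9000)
```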

Species tree

Nuclear dataset

We estimated the species tree by analyzing the 10 nuclear gene loci in *BEAST [46]. Substitution

models, clock models and trees were unlinked for all loci. In order to avoid over-parameterization of

the xml-file, we kept each gene sequence as one partition, without sub-partitioning it by codon

position. Separate alleles and homozygous sequences within the alignments belonging to the same

sample were given the same trait value, thereby assigning each individual a separate taxon.

Substitution models for each gene were determined with jModeltest according to the Bayesian

Information Criterion (BIC). Initial issues with the convergence of the MCMC led to the exclusion of


EGR1 and ZENK2 from further analyses (see Results for reasoning). We applied the same clade-age

priors as for the mitochondrial gene tree (see above) and set the substitution rates of Bfib and ODC

according to Lerner et al. [66], which were defined as normal distributions scaled in

substitutions/site/Ma. The substitution rate for Bfib was set to 0.0019, 0.0003 (mean, SD) and for

ODC to 0.0015, 0.000237. A lognormal relaxed clock was applied to each locus and a Birth-Death

prior [68] for the species tree. Base frequencies were set to ‘empirical’ and the ucld mean was set to

a more restrictive, yet realistic range for each locus (initial=0.002, upper=0.004, lower=0.00002). The

MCMC was set for 100 million generations, and states and trees were logged every 10,000

generations. After checking the MCMC for convergence, the maximum clade credibility tree was

inferred with 9,000 trees of the posterior tree distribution (burn-in 1,000).

Mixed dataset

Another xml file was set up containing the eight nuclear loci with the exact settings as above,

combined with all mitochondrial loci (mixed dataset). Mitochondrial sequences were loaded into

BEAUti and the same settings were applied as described for the tree inference in BEAST for the

mitochondrion (15 partitions, unlinked substitution models and clock models, linked trees). The

ploidy type was set to ‘mitochondrial’ and the specific substitution rates for ND2, ND4 and the tRNAs

were applied in addition to the nuclear substitution rates for Bfib and ODC and the above-described

clade priors. The MCMC was run with the same settings as in the previous runs and analyzed in the

same manner.

DISSECT

In order to run species delimitation analyses in DISSECT [48], the *BEAST xml-files from the two

analyses described above (nuclear dataset and mixed dataset) were translated into DISSECT xml files.

To this end, the Birth-Death species tree prior was replaced with the Birth-Death-Collapse model, as

described in the DISSECT user manual (last updated February 17, 2014). Parameter values for ε

(collapsing height) and w (collapsing weight) were left at default. All other settings were left identical

to the previous *BEAST runs. The xml-file was executed using the DISSECT-modified BEAST 1.8.0

version (“beast-dissect.sh”). The resulting log-file was checked for convergence, and the maximum

clade credibility tree was calculated from 9,000 trees of the posterior distribution, discarding the first

1,000 trees as burn-in.

We used “SpeciesDelimitationAnalyser”, which is a DISSECT tool that collapses nodes of small

height and exports a data table, listing the clusters that were found. After examining the log-files and

checking for convergence and effective sample size (ESS) values greater than 200, the burn-in was set

to 10 %. The values for collapse height and the similarity threshold for joining two clusters (simcutoff)



were left at default. Similarity matrices were visualized using the R code provided in the DISSECT user

manual.

After examination of the DISSECT results, we grouped the samples according to the found

clusters, possibly representing distinct populations with only limited gene-flow among each other.

We set up another xml file for the nuclear, as well as the mixed dataset, assigning a joint trait value

to each sample sharing the same cluster in order to infer the species tree with *BEAST and examine

the effect of the addition of the mitochondrial genome on the species tree inference. The trait value

assignments for the samples were as follows: samples 1-4 to “T. pyra”, samples 5 and 6 to “T. pella

north”, sample 7 to “T. pella intermediate”, samples 8 and 9 to “T. pella south”. Other settings and

priors were identical, as described before for the nuclear and mixed dataset.
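The trait assignment can be written down as a simple lookup table, which is convenient when the same grouping has to be applied to several input files. The sketch below only restates the assignments listed above. The variable and function names are hypothetical, and the outgroup (sample 10) is left out because its trait value is not specified in the text.

```python
# Sample number -> cluster label, as listed in the text.
CLUSTER_OF_SAMPLE = {
    1: "T. pyra", 2: "T. pyra", 3: "T. pyra", 4: "T. pyra",
    5: "T. pella north", 6: "T. pella north",
    7: "T. pella intermediate",
    8: "T. pella south", 9: "T. pella south",
}


def samples_by_cluster(mapping):
    """Invert a sample -> cluster mapping into cluster -> sorted samples."""
    clusters = {}
    for sample, cluster in sorted(mapping.items()):
        clusters.setdefault(cluster, []).append(sample)
    return clusters


if __name__ == "__main__":
    for cluster, samples in samples_by_cluster(CLUSTER_OF_SAMPLE).items():
        print(cluster, samples)
```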

UCEs

Separate gene trees were created for each UCE-alignment with PhyML [70], using the parallelized

implementation in CloudForest (https://github.com/ngcrawford/CloudForest last accessed May 19,

2015). The resulting, unrooted trees were printed in Newick-format into one cumulative tree file. In

order to receive a measure of node-support in the final species tree, we generated 1000 non-

parametric bootstrap replications of the UCE dataset by resampling nucleotides within the UCE-

alignments, as well as resampling UCE-loci within the data set [71], using CloudForest. All trees were

rooted using the “RerootTree” function on the STRAW server [72] by setting sample 10 (Florisuga

fusca) as outgroup. We used MP-EST [47] to infer the species tree, which estimates the most likely

species tree given a set of gene tree topologies. For the bootstrap dataset, we ran MP-EST separately

for each bootstrap replicate tree-set. The resulting set of 1000 bootstrap species trees was

summarized to one maximum clade credibility tree with TreeAnnotator, version 1.8 [45]. The

resulting node values represent bootstrap support of the respective clade.
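The two-level resampling scheme (loci within the dataset, and nucleotide positions within each drawn locus) can be sketched as follows. This is a simplified illustration of the principle and not the CloudForest implementation. The function name and the toy data are hypothetical, and each alignment is assumed to be a dictionary of equally long sequences.

```python
import random


def bootstrap_replicate(alignments, rng):
    """Create one two-level bootstrap replicate.

    `alignments` maps locus name -> {sample: aligned sequence}.
    Loci are drawn with replacement; within each drawn locus the
    alignment columns are drawn with replacement as well.
    Returns a list of (locus name, resampled alignment) pairs.
    """
    loci = sorted(alignments)
    replicate = []
    for _ in loci:
        locus = rng.choice(loci)                    # resample loci
        aln = alignments[locus]
        length = len(next(iter(aln.values())))
        cols = [rng.randrange(length) for _ in range(length)]  # resample sites
        resampled = {
            sample: "".join(seq[c] for c in cols)
            for sample, seq in aln.items()
        }
        replicate.append((locus, resampled))
    return replicate


if __name__ == "__main__":
    # Hypothetical toy dataset with two short loci.
    toy = {
        "uce-1": {"s1": "ACGT", "s2": "ACGA"},
        "uce-2": {"s1": "TTGC", "s2": "TAGC"},
    }
    for locus, aln in bootstrap_replicate(toy, random.Random(1)):
        print(locus, aln)
```

The same column indices are applied to all samples of a locus, so that homologous positions stay aligned in the resampled data.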

Since many of the UCE loci showed little to no variation among the Topaza samples, we extracted

a subset containing only the most informative loci. Only those loci were selected which contained

more than 20 polymorphic sites across the alignment. We created 1000 bootstrap replicates of this

reduced dataset in the same manner as before for the complete dataset, and analyzed the rooted

gene trees in MP-EST. Two separate MP-EST analyses were conducted, one with every sample being

assigned a separate label in the species tree, and another one with the cluster assignments resulting

from the DISSECT analysis.
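The selection of informative loci reduces to counting polymorphic alignment columns per locus. The following minimal sketch assumes that gaps and ambiguity codes are ignored when deciding whether a column is polymorphic; the text does not state how such characters were treated. The function names are hypothetical.

```python
def count_polymorphic_sites(alignment):
    """Count alignment columns with more than one distinct nucleotide.

    `alignment` maps sample name -> aligned sequence. Characters other
    than A, C, G and T are ignored (an assumption of this sketch).
    """
    seqs = [seq.upper() for seq in alignment.values()]
    count = 0
    for column in zip(*seqs):
        bases = {base for base in column if base in "ACGT"}
        if len(bases) > 1:
            count += 1
    return count


def informative_loci(alignments, min_sites=20):
    """Return the sorted loci with more than `min_sites` polymorphic sites."""
    return sorted(
        locus
        for locus, aln in alignments.items()
        if count_polymorphic_sites(aln) > min_sites
    )


if __name__ == "__main__":
    # Hypothetical toy locus: only the last column is polymorphic.
    toy = {"s1": "ACGT", "s2": "ACGA", "s3": "AC-T"}
    print(count_polymorphic_sites(toy))  # 1
```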

SNPs

The binary SNP alignment, consisting of 570 unlinked polymorphic sites, was formatted for

analysis in SNAPP [30]. SNAPP is a MCMC based species-tree and species-demographics inference

program that uses unlinked biallelic markers (such as SNPs) as input. We used BEAUti 2, version 2.2.1


[73] to set up the xml file for species tree inference. BEAUti 2 contains the option to download

additional packages in order to set up a customized xml file for different implementations in BEAST 2

[73]. Coalescent rate and mutation rates (forward mutation rate “U” (0 to 1) and backward mutation

rate “V” (1 to 0)) were set to be estimated by SNAPP based on the input data. The Yule species-tree

prior parameter λ, which sets the rate at which species diverge, was left at default (0.00765). The

MCMC was set to 10,000,000 generations and trees and other parameters were logged every 1,000

generations. Two separate SNAPP analyses were launched, one in which each sample was assigned

its own clade, and another one with the clade assignments resulting from the DISSECT species

delimitation analysis.

Population structure

In order to explore the genetic structuring within the species boundaries of the two currently

recognized Topaza species (T. pyra and T. pella), we conducted population structure analyses based

on the SNP datasets that were extracted separately for each species (621 SNPs for T. pella and 524

SNPs for T. pyra). We used the program STRUCTURE, version 2.3.4 which was first described by

Pritchard et al. [31]. STRUCTURE is based on a Bayesian MCMC algorithm which explores genetic

clusters (populations) within a given dataset and assigns individuals to these inferred populations.

The number of clusters (k) to be explored is set by the user, and STRUCTURE assigns individuals in

random combinations to these clusters in order to find the best fit of the variation pattern. We

explored k values from 1 to 3 within both Topaza species. The ploidy level of the data was set to 1,

since we were using an effectively haploid SNP dataset, which was extracted from the consensus

sequences of assembled contigs, not containing biallelic information within a sample. Lambda (λ), a

quantitative measure of independence between markers, was chosen to be inferred by STRUCTURE

based on the data. There are two separate ancestry models available in STRUCTURE, the ‘no

admixture’ and the ‘admixture’ model. ‘No admixture’ would imply the assumption that the

ancestors of the inferred populations themselves belonged to completely discrete populations. We

therefore chose the ‘admixture’ model, since we have strong reason to assume admixture in the

ancestral populations of now putatively separate populations. This assumption is based on the

species tree inference results, which show shallow evolutionary times of all splits between samples

assigned to the same species, indicating relatively recent admixture within the species boundaries.

The first 10,000 generations of the MCMC were discarded as burn-in, and the chain was set to run for

an additional 100,000 generations after burn-in. The distribution of posterior likelihood estimates,

and the estimation of the data-probability under the chosen k value were checked for convergence.

A separate STRUCTURE analysis was run for each of the two Topaza species.



Results

Data exploration

Mitochondrion

Despite the fact that no probes were used during sequence-capture that were targeted toward

mitochondrial sequences, we obtained very deep read coverage for the mitochondrial genome. In

fact, in many cases the average coverage per base pair was much higher for the mitochondrion than

for the nuclear loci that we targeted during sequence capture (Table 2). Between the

different samples, 1-12 % of sequenced reads were of mitochondrial origin (Table S3).

The complete mitochondrial genome could be assembled for all 10 samples in this study. We

found no gene duplications or tandem repeats of mitochondrial regions which have been reported to

occur on the mitochondrial genome in other bird taxa [2]. The assembled genomes were of varying

length, ranging from 16,762 to 16,862 bp (Table S3). This variation of length in the mitochondrial

genomes is mainly attributable to the very variable end of the control region (misc. feature), which

presents challenges for assembly due to many tandem repeats of microsatellite elements,

consequently causing difficulties for the alignments of these variable regions, even among closely

related taxa. The control region and the intergenic spaces were discarded from subsequent analyses,

leaving a total alignment of 15,428 bp length for phylogenetic analyses, which was free of missing

data. Figure 4 shows the position of the identified regions on the mitochondrial genome, exemplarily

for sample 2 (T. pyra2). For more information on sequence length and exact positions of all identified

coding regions, see Table S5.

Nuclear loci

All ten nuclear genes that were targeted in the sequence capture enrichment could be recovered

in their entirety for all samples with extensive read coverage (Table 2), adding up in total to 10,201

bp of nuclear DNA sequence for each sample. In general, the recovered nuclear loci showed little

variation within the genus Topaza (see Table S4), due to relatively shallow evolutionary times of the

deepest splits of lineages within this genus (< 3 Ma), according to prior information [67]. We decided

to exclude loci from further phylogenetic analyses that showed less than 1 % variable sites within the

alignment, which led to the exclusion of EGR1 (0 %) and ZENK2 (0.2 %). This left 8,404 bp of nuclear

sequence information for further analyses.
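The sequence totals given above can be reproduced from the per-locus lengths. The sketch below assumes that the length of each recovered locus equals the reference sequence length listed in Table 2, which is consistent with the reported totals. The variable and function names are hypothetical.

```python
# Reference sequence lengths (bp) of the ten nuclear loci, from Table 2.
LOCUS_LENGTH_BP = {
    "Bfib": 1076, "EEF2": 1619, "EGR1": 609, "FGB": 660, "MB": 718,
    "ODC": 618, "RAG1": 2639, "TGFB2": 571, "ZENK2": 1188, "ZENK3": 503,
}

# Loci excluded because they showed less than 1 % variable sites.
EXCLUDED_LOCI = {"EGR1", "ZENK2"}


def total_length(lengths, excluded=()):
    """Sum the lengths of all loci that are not in `excluded`."""
    return sum(bp for locus, bp in lengths.items() if locus not in excluded)


if __name__ == "__main__":
    print(total_length(LOCUS_LENGTH_BP))                 # 10201
    print(total_length(LOCUS_LENGTH_BP, EXCLUDED_LOCI))  # 8404
```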


Figure 4: Circular map of the mitochondrial genome of T. pyra2. The inner ring shows the scale in kb (kilo base pairs).

The section between position 15,558 and the end (position 16,762), here marked as a black box, is commonly referred to

as miscellaneous feature. This region contains sequences which function as control region for replication and

transcription of the circular mitochondrial genome. Protein-coding genes are marked as colored blocks, color-coded to

indicate gene families. Marked in dark brown are rRNA coding sequences (rrnS = small ribosomal subunit (12S), rrnL =

large ribosomal subunit (16S)) and in yellow the tRNA coding sequences.

Ultraconserved Elements (UCEs)

We assembled a set of 824 UCE-loci that were present in all 10 samples. The length of the

assembled UCE alignments ranged from 223 bp to 1130 bp (mean = 870 bp, stdev = 150 bp, see

Figure 5). As expected, the central regions of the UCE alignments showed little to no variation among

the different samples (Figure 6). These regions represent the highly conserved core regions of the

UCE loci that were targeted by the sequence capture probes. The further the distance from the

conserved core region, the more variation could be found within the alignments (Figure 6). A subset



of 73 UCE-loci was extracted where each contained more than 20 variable sites among Topaza

samples, which was used for further analyses, containing in total 68,997 bp of sequence alignment.

Table 2: Locus information and read coverage for 10 nuclear loci and the mitochondrial genome. The first block

contains information about the reference sequences used for probe design (nuclear loci only) and for the assembly of

reads (nuclear loci and mitochondrion). The information displayed for each sample is the total number of reads (# reads)

and the average coverage per base pair (Ø coverage/bp) for each locus, extracted from the bam-assembly-files, viewed

with Tablet [52].

| Locus | Reference organism | NCBI acc# | Length (bp) | T. pyra1 # reads | T. pyra1 Ø coverage/bp | T. pyra2 # reads | T. pyra2 Ø coverage/bp |
|---|---|---|---|---|---|---|---|
| Bfib | Topaza pella | GU167142.1 | 1,076 | 785 | 167 | 4,880 | 1,045 |
| EEF2 | Phaethornis griseogularis | EU738666.1 | 1,619 | 774 | 101 | 5,017 | 667 |
| EGR1 | Phaethornis griseogularis | EU738996.1 | 609 | 450 | 144 | 2,736 | 938 |
| FGB | Phaethornis griseogularis | EU739148.1 | 660 | 241 | 79 | 1,874 | 634 |
| MB | Phaethornis griseogularis | EU740011.1 | 718 | 380 | 119 | 2,143 | 661 |
| ODC | Topaza pella | GU167086.1 | 618 | 412 | 144 | 2,541 | 903 |
| RAG1 | Phaethornis bourcieri | JN558646.1 | 2,639 | 1,557 | 134 | 10,164 | 901 |
| TGFB2 | Phaethornis griseogularis | EU737426.1 | 571 | 207 | 71 | 1,545 | 556 |
| ZENK2 | Eutoxeres aquila | AF492503.1 | 1,188 | 829 | 145 | 5,366 | 1,004 |
| ZENK3 | Eutoxeres aquila | AF492533.1 | 503 | 330 | 138 | 2,136 | 881 |
| Mitochondrion | Amazilia versicolor | NC_024156.1 | 16,861 | 63,816 | 823 | 154,537 | 2,044 |

| Locus | T. pyra3 # reads | T. pyra3 Ø coverage/bp | T. pyra4 # reads | T. pyra4 Ø coverage/bp | T. pella5 # reads | T. pella5 Ø coverage/bp | T. pella6 # reads | T. pella6 Ø coverage/bp |
|---|---|---|---|---|---|---|---|---|
| Bfib | 778 | 164 | 1,279 | 273 | 874 | 184 | 1,914 | 401 |
| EEF2 | 713 | 90 | 1,693 | 224 | 885 | 112 | 2,339 | 293 |
| EGR1 | 383 | 122 | 892 | 297 | 634 | 200 | 1,359 | 433 |
| FGB | 336 | 110 | 566 | 187 | 312 | 102 | 656 | 208 |
| MB | 320 | 95 | 627 | 192 | 386 | 116 | 1,041 | 309 |
| ODC | 406 | 140 | 697 | 245 | 545 | 186 | 1,164 | 402 |
| RAG1 | 1,587 | 135 | 3,168 | 280 | 1,750 | 147 | 4,714 | 395 |
| TGFB2 | 185 | 63 | 383 | 136 | 263 | 94 | 639 | 221 |
| ZENK2 | 795 | 137 | 1,650 | 298 | 1,110 | 192 | 2,538 | 435 |
| ZENK3 | 332 | 130 | 747 | 308 | 354 | 145 | 1,000 | 408 |
| Mitochondrion | 38,572 | 490 | 16,146 | 211 | 72,207 | 899 | 164,804 | 1,964 |

| Locus | T. pella7 # reads | T. pella7 Ø coverage/bp | T. pella8 # reads | T. pella8 Ø coverage/bp | T. pella9 # reads | T. pella9 Ø coverage/bp | Florisuga10 # reads | Florisuga10 Ø coverage/bp |
|---|---|---|---|---|---|---|---|---|
| Bfib | 1,042 | 222 | 1,069 | 226 | 625 | 133 | 521 | 108 |
| EEF2 | 1,571 | 212 | 1,482 | 192 | 344 | 44 | 642 | 86 |
| EGR1 | 814 | 269 | 813 | 270 | 167 | 57 | 423 | 130 |
| FGB | 413 | 139 | 361 | 122 | 219 | 70 | 324 | 108 |
| MB | 526 | 158 | 538 | 163 | 178 | 54 | 357 | 108 |
| ODC | 654 | 232 | 611 | 215 | 296 | 102 | 379 | 132 |
| RAG1 | 2,582 | 227 | 2,377 | 207 | 1,217 | 106 | 1,667 | 145 |
| TGFB2 | 330 | 121 | 347 | 122 | 180 | 64 | 135 | 49 |
| ZENK2 | 1,558 | 286 | 1,583 | 280 | 359 | 64 | 819 | 141 |
| ZENK3 | 639 | 260 | 636 | 259 | 166 | 69 | 357 | 144 |
| Mitochondrion | 59,947 | 762 | 125,537 | 1,481 | 5,979 | 72 | 9,241 | 116 |


Figure 5: The length distribution of assembled UCE loci alignments. In total 824 UCE alignments were assembled for

all samples. Plotted in this graph is the number of alignments that fell into the respective length interval (interval size 23

bp), ranging from 223 bp (min) to 1130 bp (max). The mean length of all UCE alignments lies at 870 bp (stdev = 150 bp).

Figure 6: Plot of variable sites within UCE-alignments. This plot shows the frequency of variable sites for each

position (relative to the total number of sequences that contain that position) across all UCE-alignments plotted in

relation to distance from the center of the conserved region (x=0). Plotting the UCE alignment data in this manner, the

highly conserved core region becomes apparent, flanked by considerably more variable

regions.



Mitochondrial tree

The log-files of all Bayesian analyses were viewed in Tracer, version 1.6 [69] and examined for

convergence and effective sample sizes (ESS values) greater than 200. For the initial runs the MCMC

did not converge properly, because the sampling for various parameters stopped after several million

generations which caused a sudden leap in the inferred posterior likelihood (Figure S15). This issue

seemed to occur when the xml file was over-parameterized and the parameters were allowed to

fluctuate within too wide a range. We therefore applied more restrictive prior settings for ucld mean

values of all partitions (see Methods), and set the base frequencies for all partitions from ‘estimated’

to ‘empirical’, which solved the issue.

Concerning the dating priors, we found that MCMC runs converged well when all age priors

(substitution rates and clade age priors) were applied. When substitution rates were applied without

setting the clade age priors, the MCMC stopped sampling various parameters after approximately 4

million generations, indicating that these dating priors alone are not restrictive enough. The same

issue could be observed after 8-10 million generations when only one of the two clade-age priors was

applied without additional substitution rate information. When examining the samples preceding the

problematic point in the MCMC, the estimated ages of unrestricted clades were concordant with those of

analyses in which all age priors were applied. We therefore decided to apply all age priors

(substitution rates and clade-age priors as described in Methods) for further analyses.

A mitochondrial maximum clade credibility tree was generated from 9,000 trees of the posterior

distribution, with a burn-in of 10% (Figure 7). The split between Topaza and its sister genus Florisuga

(not shown in Figure 7) was inferred at 16.74 Ma (stdev = 1.38 Ma). The deepest split of

mitochondrial lineages within Topaza, the split between T. pyra and T. pella, is estimated to have

occurred 2.36 Ma ago (stdev = 0.21 Ma). Further, the mitochondrial tree suggests a relatively deep

split within T. pyra at 0.68 Ma ago (stdev = 0.09 Ma), leading to two separate mitochondrial lineages,

separating samples collected north of the Amazon River from those collected south of it. Topaza pella shows

a similar pattern, even though the split of mitochondrial lineages appears to have occurred more

recently at 0.39 Ma ago (stdev = 0.05 Ma), and T. pella7, which was sampled at the southern bank of

the Amazon, appears in one clade with T. pella5 and T. pella6, both of which were sampled north of

the Amazon River. The mitochondrial tree in Figure 7 is completely resolved with 100% support for

each node (Bayesian posterior probability).
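The node support values reported here are Bayesian posterior probabilities, which can be read as the fraction of post-burn-in trees that contain a given clade. The following Python sketch illustrates that calculation under simplifying assumptions: each sampled tree is represented only by the set of clades it contains (a hypothetical input format), and the function name `clade_support` is illustrative. It is not the procedure used by the tree annotation software.

```python
from typing import FrozenSet, Iterable, List, Set

Clade = FrozenSet[str]
Tree = Set[Clade]  # a sampled tree reduced to the set of clades it contains


def clade_support(trees: List[Tree], clade: Iterable[str], burn_in: float = 0.10) -> float:
    """Fraction of post-burn-in trees in which `clade` is monophyletic."""
    start = int(len(trees) * burn_in)  # number of trees discarded as burn-in
    kept = trees[start:]
    if not kept:
        raise ValueError("no trees left after burn-in")
    target = frozenset(clade)
    return sum(1 for tree in kept if target in tree) / len(kept)
```

With 10,000 sampled trees and a burn-in of 10%, the first 1,000 trees are discarded and 9,000 remain, which matches the numbers given for the maximum clade credibility tree.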

Figure 7: Time-calibrated phylogeny of Topaza based on the complete mitochondrial genome (BEAST). Taxa are

colored according to minimum clades that were found to be monophyletic throughout all tree inferences conducted in

this study and which are further confirmed through species delimitation analysis. Shown is the maximum clade credibility

tree, generated with 9,000 trees (1,000 burn-in) of the posterior tree distribution. Node support values represent

Bayesian posterior probabilities and the time scale is in millions of years.

Species tree

Individuals analyzed separately

*BEAST - 8 Nuclear Genes

Similar to the convergence issues described above for the mitochondrial dataset, initial *BEAST

MCMC runs for the nuclear dataset stopped sampling certain parameters after several million

generations. These parameters mainly concerned the loci EGR1 and ZENK2, which were the

least informative loci, showing less than 0.5% variable sites within Topaza samples across the

complete alignment (Table S4). After removing these two loci from the *BEAST analysis, the MCMC

showed good convergence. The resulting maximum clade credibility tree is shown in Figure 8a. The

split between Topaza and Florisuga (not shown in Figure 8a) is estimated to have occurred 18.23 Ma

ago (stdev = 1.47 Ma). The divergence between T. pyra and T. pella was estimated at 2.03 Ma ago


(stdev = 0.33 Ma). Concerning the phylogenetic structure within the two recognized species, this

multilocus nuclear dataset shows a different pattern than the mitochondrion (Figure 7). The topology

within T. pyra is distinctively different, and does not show a deep split between two separate

lineages grouping northern and southern samples separately. Within the T. pella complex, the

sample T. pella7 is placed with the two southern samples (T. pella8 and 9), and a split between

northern and southern samples (in relation to the Amazon River) is inferred to have occurred 0.65

Ma ago (stdev = 0.26 Ma). All support values for nodes within the recognized species are rather low.

This could be because the sequence alignments are not informative enough to resolve

phylogenetic patterns on such shallow time scales, or because no phylogenetic structure

is present within the recognized species. The latter would violate the *BEAST

assumption of no admixture between the separate tips, leading to low support values for the

respective nodes. We discuss these two possibilities below.

MP-EST - 824 Ultraconserved Elements

Among the entirety of the 824 assembled UCE alignments, a majority had an insufficient number

of variable sites in order to build informative gene trees for these loci. This resulted in a vast majority

of gene trees inferring polytomies between all samples. In the absence of informative sites, occasional

mutations, which then account for a large fraction of the total variability of a UCE locus,

were weighted disproportionately in the gene tree inference, which therefore reflects stochastic

noise rather than the evolutionary pattern. The only phylogenetic pattern that was consistently seen

among the gene tree topologies was the split between T. pyra and T. pella. MP-EST, which was used to

infer the species tree from the set of 824 gene trees (plus bootstrap replicates; Figure S13),

weights every gene tree equally and evaluates only the topology of the input gene trees, without

considering their branch lengths. As a result of the inconsistent gene tree topologies, the

species tree in Figure S13 shows extremely short internodal branch-lengths, as the few informative

loci that show shallower phylogenetic substructure are diluted among the many uninformative loci.

When selecting only the 73 most informative UCE loci (>20 variable sites within Topaza), the

resulting MP-EST species tree (Figure 8b) shows an improved inference of the internodal structure.

The topology within T. pella is identical to the one inferred by *BEAST based on the nuclear gene loci

(Figure 8a). The inferred substructure within T. pyra has very low bootstrap support values and does

not show congruence with the split between northern and southern samples as inferred by the

mitochondrial tree (Figure 7).
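The selection of informative loci described above (more than 20 variable sites) can be sketched in Python. This is a minimal illustration with hypothetical function names, not the pipeline used in this study. It assumes that each locus is available as a list of aligned, equal-length sequences and that gaps and ambiguity codes are ignored when deciding whether a column is variable.

```python
from typing import Dict, List

UNAMBIGUOUS = set("ACGT")


def count_variable_sites(alignment: List[str]) -> int:
    """Count columns that contain at least two different unambiguous bases."""
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("sequences in an alignment must have equal length")
    variable = 0
    for column in zip(*alignment):
        bases = {base.upper() for base in column} & UNAMBIGUOUS
        if len(bases) > 1:
            variable += 1
    return variable


def select_informative_loci(loci: Dict[str, List[str]], threshold: int = 20) -> List[str]:
    """Return the names of loci with more than `threshold` variable sites."""
    return sorted(
        name for name, alignment in loci.items()
        if count_variable_sites(alignment) > threshold
    )
```

The default threshold of 20 follows the cut-off stated in the text. Restricting the alignments to Topaza samples before counting is left to the caller.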

SNAPP - 570 SNPs

SNAPP estimated both possible types of mutations within the binary SNP alignment (u: 0 -> 1 and

v: 1 -> 0) to occur equally often, as the confidence intervals around both rates overlap (u: mean

= 0.92, stdev = 0.0849; v: mean = 1.119, stdev = 0.1252). The inferred species tree is depicted in Figure

8c and shows the same internal topology within T. pella as the other nuclear species trees (Figure 8a

and b). T. pella5 and 6, which were both sampled north of the Amazon River, form a well-supported

clade (98% posterior probability), and so do T. pella8 and 9 (99% posterior probability), both sampled

from south of the Amazon. Sample T. pella7 forms a monophyletic group with the southern clade

(79% posterior probability). The substructure within T. pyra is not very well supported (posterior

probabilities of 26% for both internal nodes). Figure 9 shows a DensiTree plot of the posterior species

tree distribution of the SNAPP analysis (discarding the first 1,000 trees as burn-in). Here, the lack of

substructure within T. pyra becomes apparent, as no predominant pattern can be seen among the

plotted trees within this clade. In contrast, the plotted trees show a clear separation of two

lineages within T. pella, separating the northern (T. pella5 and 6) from the southern samples (T.

pella7, 8 and 9). Yet, in the case of T. pella7, a smaller fraction of trees groups this sample with the

northern clade, in concordance with the inferred mitochondrial tree (Figure 7).
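The overlap between the two SNAPP rate estimates reported above can be checked with simple arithmetic. The Python sketch below uses the reported means and standard deviations. It rests on one assumption: the intervals are approximated as mean ± k standard deviations, which is not necessarily how the intervals in this study were computed.

```python
from typing import Tuple

Interval = Tuple[float, float]


def interval(mean: float, stdev: float, k: float = 2.0) -> Interval:
    """Approximate interval as mean +/- k standard deviations (assumption)."""
    return (mean - k * stdev, mean + k * stdev)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Values as reported for the SNAPP analysis.
u_interval = interval(0.92, 0.0849)    # roughly (0.750, 1.090)
v_interval = interval(1.119, 0.1252)   # roughly (0.869, 1.369)
rates_overlap = overlaps(u_interval, v_interval)
```

Under this approximation the two intervals overlap for k = 2, and still overlap narrowly for k = 1 (upper bound of u about 1.005, lower bound of v about 0.994).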

*BEAST - 8 Nuclear Genes and Mitochondrial genome

The inferred species tree from the 8 nuclear loci, and the addition of the complete mitochondrial

genome as a 9th partition, is shown in Figure 8d. Dissimilar to all other inferred species trees based

solely on nuclear data (Figure 8a, b and c), northern and southern samples within T. pyra form

separate monophyletic groups, yet are not very well supported (see node-support values). The

inferred split between these two lineages is dated very recently (mean = 0.05 Ma, stdev = 0.08 Ma).

The substructure within T. pella, on the other hand, is well supported, placing T. pella5 and 6 in a

monophyletic group with 78% posterior probability, and 8 and 9 in a separate clade with 95% node

support. The divergence between these two rather well-supported groups is estimated to have

occurred 0.23 Ma ago (stdev = 0.1 Ma), which is considerably more recent than estimated based on the

nuclear gene loci alone (Figure 8a), and on the more recent end of the confidence interval for the

dating in the mitochondrial tree (Figure 7). The sample T. pella7 is positioned more closely to the

northern samples (5 and 6), forming a monophyletic clade with these samples that is not very

strongly supported (69% posterior probability). This positioning of T. pella7 is consistent with the

mitochondrial tree (Figure 7) but is not supported by the other species trees inferred from

multilocus nuclear data (Figure 8a, b and c).


Figure 8: Species trees inferred for Topaza based on multilocus datasets, treating each individual as a separate

population (no species/population assignments). Taxa are colored according to consistently monophyletic clades that

were found across all tree inferences and are consistent with the species delimitation analysis. a) Time-calibrated *BEAST

species tree inference based on 8 nuclear genes (8,404 bp). Shown is the maximum clade credibility (mcc) tree based on

9,000 trees of the posterior distribution (burn-in 1,000), with node labels representing the Bayesian posterior probability

(Bpp) of the respective node. Time is scaled in million years. b) MP-EST generated species tree based on the gene trees of

the 73 most informative UCE-loci, scaled by coalescent units. Node labels represent percentages of bootstrap support

(1,000 bootstrap replicates) of respective nodes. c) SNAPP species tree based on 570 unlinked SNPs, scaled in generations

relative to mutation rate (µ, in mutations/site/generation). Shown is the mcc tree based on 9,000 trees of the posterior

distribution (burn-in 1,000). Node labels show Bpp values. d) Time-calibrated *BEAST species tree based on the complete

mitochondrial genome (approx. 15,500 bp) and 8 nuclear loci (8,404 bp). The shown mcc tree was created from 9,000

trees of the posterior distribution (burn-in 1,000), node labels show Bpp values. Time is scaled in million years.

Figure 9: DensiTree [74] plot of 9,000 trees of the posterior species tree distribution (burn-in 1,000) from the SNAPP

analysis of 570 SNPs. No coherent substructure is apparent among the plotted trees within T. pyra, yet a consistent

substructure within T. pella can be seen in the plot. Note the majority of trees connecting T_pella7 with the southern

clade (8 and 9), while a small fraction of trees connects this individual with the northern clade (5 and 6).

Species delimitation analysis

Initial DISSECT analyses of the nuclear and the mixed dataset, including all samples, show

indications of T. pella7 being of admixed origin, as this sample is grouped with both of the otherwise

distinct clades T. pella5+6 and T. pella8+9 (see Figure S14). As this is also supported by the above-

described results (Figure 8 & Figure 9), we excluded T. pella7 from further analyses since admixture

between separate inferred clades violates the DISSECT assumptions and can lead to the grouping of

distinct clades into one cluster, as these become linked through the admixed individual. This effect of

exclusion of one single problematic individual on the complete similarity matrix within T. pella can be

seen by comparing Figure S14 with Figure 10. The similarity matrices in Figure 10 show a strongly

supported genetic separation within T. pella between northern and southern samples. This split is

inferred more strongly within the exclusively nuclear dataset with posterior probability node support

values of 88% and 94% supporting the monophyly of the two separate clades. The dating of this split

is estimated at 0.85 Ma ago (stdev = 0.25 Ma). The DISSECT analysis of the mixed dataset, including

the mitochondrial genome, shows slightly weaker support for the split within T. pella (78% for both

clades) and dates this split more recently at 0.4 Ma ago (stdev = 0.02 Ma), more similarly to the

mitochondrial tree (Figure 7). Yet, it still suggests clear genetic substructure within T. pella. At the

same time, both analyses (nuclear and mixed dataset) show no support for any genetic substructure

among T. pyra samples. Based on these results, which are consistent with the various species tree

inferences described above (Figure 8) we assigned each sample to one of 4 distinct clades (Figure 11).
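The effect of an admixed individual on the similarity matrix can be illustrated with a small Python sketch. It assumes that each matrix entry is the posterior probability that two individuals are assigned to the same cluster, computed as the summed weight of all sampled clusterings in which they co-occur. The function name and the weights in the usage example are hypothetical and are not results of this study.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Clustering = List[List[str]]  # one partition of the individuals into clusters


def similarity_matrix(
    clusterings: List[Tuple[float, Clustering]]
) -> Dict[FrozenSet[str], float]:
    """Pairwise probability of two individuals sharing a cluster.

    `clusterings` holds (weight, clustering) pairs; weights are normalized
    so that they sum to one.
    """
    total = sum(weight for weight, _ in clusterings)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    similarity: Dict[FrozenSet[str], float] = {}
    for weight, clusters in clusterings:
        for cluster in clusters:
            for a, b in combinations(sorted(cluster), 2):
                pair = frozenset((a, b))
                similarity[pair] = similarity.get(pair, 0.0) + weight / total
    return similarity


# Hypothetical posterior: T_pella7 is grouped sometimes with the southern
# samples, sometimes with the northern samples, and sometimes links all five.
example = [
    (0.6, [["T_pella5", "T_pella6"], ["T_pella7", "T_pella8", "T_pella9"]]),
    (0.3, [["T_pella5", "T_pella6", "T_pella7"], ["T_pella8", "T_pella9"]]),
    (0.1, [["T_pella5", "T_pella6", "T_pella7", "T_pella8", "T_pella9"]]),
]
matrix = similarity_matrix(example)
```

In this toy example T_pella7 shares a cluster with both groups at intermediate probability, which is the kind of pattern that can link otherwise distinct clusters.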


Figure 10: Similarity matrices showing results of SpeciesDelimitationAnalyser processing of the species tree

distribution inferred with DISSECT (burn-in 10%). Placed to the left of each matrix is the maximum clade credibility tree

of the posterior tree distribution (burn-in 1,000 trees). Node support values represent Bayesian posterior probabilities.

Trees are scaled by time in Ma. a) Based on 8 nuclear genes; b) Based on 8 nuclear genes and the mitochondrial genome.

Individuals assigned to populations

Species tree inferences were rerun with samples assigned to putatively separate populations

(Figure 11), as consistently suggested by the results of previous analyses. Sample T. pella7 was

assigned a separate clade in order to explore its position in the tree in relation to the identified

populations. The resulting species trees are shown in Figure 11. The dating of the basal split between

the separate populations within T. pella inferred with *BEAST is consistent with the previous analyses

of the same data without population assignments, yet with a narrower and therefore more precise

confidence interval: 0.69 Ma ago (stdev = 0.14 Ma) for the 8 nuclear gene dataset (Figure 11a)

and 0.25 Ma ago (stdev = 0.08 Ma) for the mixed dataset (Figure 11d). The sample T. pella7 is

consistently placed in a monophyletic group with the southern population (T. pellaS) in the three

multilocus nuclear datasets (Figure 11a-c). When adding the mitochondrial DNA to the 8 nuclear

genes as a 9th partition (Figure 11d), T. pella7 is placed concordantly to the mitochondrial tree (Figure

7) with the northern population (T. pellaN), supported rather confidently with 81% Bayesian

posterior probability.

Figure 11: Species trees inferred for Topaza based on multilocus datasets, with applied population assignments for

all samples. See caption of Figure 8 for more information about the inference of the separate species trees. Note the

position of sample T. pella7 which is placed in a monophyletic clade with the southern population (T. pellaS) in the

species trees inferred from multilocus nuclear data (a-c) but is placed with good node support with the northern clade (T.

pellaN) in the *BEAST inference of the mixed dataset, including the mitochondrial genome. The top part of the figure

shows the new clade assignments based on consistently identified clusters in all previous species tree inferences and the

species delimitation analyses with DISSECT.


Population structure

The bar graphs in Figure 12 depict the results of the STRUCTURE analyses of unlinked

polymorphic positions within the two recognized Topaza species. The plot shows the admixture

of each individual between two putative genetic clusters (k=2). No population structure appears to

be present within T. pyra, as all samples within this species are equally admixed, which further

complements the previous evidence from the species tree inference methods and the DISSECT

analyses, shown in Figure 8, Figure 9 and Figure 10. In the case of T. pella, the inferred admixture

pattern is consistent with our previous results, with T. pella5 and 6 (population T. pellaN) showing a

different admixture pattern than T. pella7, 8 and 9, indicating two separate populations within T.

pella. The results suggest active or recent admixture between T. pella7 and the other samples with

origin south of the Amazon River. At the same time, T. pella7 contains a slightly higher percentage

(approx. 10%) of genetic material from “Cluster1” (dark grey) which is mainly present in the northern

samples (5 and 6) and rare among the other samples from south of the Amazon (8 and 9).

For both datasets (T. pella SNPs and T. pyra SNPs), the estimated probability of the data under

different settings for k favored k=1 as the best fit, meaning that the admixture patterns within

either species are not distinct enough to infer two or more separate

genetic clusters. We nevertheless chose k=2, the second-best fit for the data, because

testing for identifiable differences in the admixture pattern between individuals of the

same species requires at least two separate genetic clusters.

Figure 12: Barplot of STRUCTURE results of admixture between the different individuals, generated separately for

each species. When two clusters are set by the user (k=2), no population structure can be seen within T. pyra (upper

plot), as all individuals appear to be equally admixed between these two clusters. Within T. pella, population structure is

visible when two clusters are inferred (k=2). Individuals are not equally admixed between these two putative clusters,

with T. pella5 and 6 carrying more than 80% of their genetic makeup from Cluster1, which is only present to less than

10% in T. pella7, 8 and 9.

[Figure 12 plots: stacked bars from 0% to 100% showing the shares of Cluster1 and Cluster2 for T_pyra1 to T_pyra4 (upper plot) and for T_pella5 to T_pella9 (lower plot).]

Discussion

Evaluation of phylogenetic relationships

Within this study we explored the genetic structure within the genus Topaza using a variety of

approaches based on different multilocus, nuclear and mitochondrial datasets. Consistently, through

all analyses, we see a clear divergence between the two currently recognized species T. pyra and T.

pella with no indications of gene flow between these species, thereby clearly advocating their rank as

separate species, which has been challenged based on morphological data by some authors [37],

[38].

Examining the genetic structure within these species, we consistently find a separation of two

lineages within T. pella, separating individuals sampled north from those sampled south of the

Amazon River. This split between northern and southern samples becomes apparent in all species

trees (Figure 8, Figure 9, and Figure 11). It further is strongly supported by species delimitation

analyses in DISSECT (Figure 10) and also indicated by the population structure analysis with

STRUCTURE (Figure 12). A recent study by Schmitz-Ornés et al. [38], based on color spectral data,

found evidence for significant variation in coloration measurements between northern and

southern samples of T. pella in relation to the Amazon River, leading to the definition of two distinct

subspecies, namely T. pella pella (north) and T. pella microrhyncha (south), separated by the river.

Our data strongly support these morphological findings and suggest the distinction of these two

separate subspecies within T. pella.

One exception which is not as strongly supported is sample T. pella7, which in all species tree

inferences based on multilocus nuclear data, is placed with the southern samples (T. pella

microrhyncha), but with rather low node support for this placement (Figure 8a-c and Figure 11). The

plot of the complete species tree distribution of the SNAPP analysis (Figure 9) shows that there is

some uncertainty whether to place this sample with the northern clade (T. pella pella, samples 5 and

6) or the southern clade (T. pella microrhyncha, samples 8 and 9), even though the vast majority of

trees suggest a placement closer to the southern subspecies T. pella microrhyncha, which is also

supported by the STRUCTURE results (Figure 12). The sample T. pella7 was collected from the

southern bank of the Amazon River, close to the estuary of the Amazon, lying within the geographic

range that has formerly been assigned to a separate subspecies (based on morphometric and

coloration data) referred to as T. pella smaragdula [39], [40]. The proposed range of this putative

subspecies extends from the southern riverbank close to the estuary of the Amazon across the eastern

part of the Guiana Shield in the area of French Guiana. Sequence data of more individuals from

particularly the area of French Guiana would be required to genetically examine the validity of this


subspecies assignment. The observed uncertainty of the placement of sample T. pella7 in the species

tree could be due to past admixture between the two otherwise distinct subspecies T. pella pella

and T. pella microrhyncha, followed by a period of gene flow between the ancestral

population of T. pella7 and T. pella microrhyncha, causing the majority of gene loci to share a more

recent common history with this subspecies, while a smaller fraction of genes share a more recent

common history with T. pella pella. Such a putative admixture event would further explain the

position of T. pella7 in the mitochondrial tree, where it is placed more closely to T. pella pella with

absolute certainty (100% Bayesian posterior probability).

Mitochondrial tree - the odd one out

One striking pattern that becomes obvious when comparing the mitochondrial tree (Figure 7)

with all inferred species trees (Figure 8, Figure 9 and Figure 10), is the deep split between northern

(T. pyra1 and 2) and southern samples (3 and 4) within T. pyra, which is only present within the

mitochondrial tree. All other trees show alternating topologies within T. pyra, none of which are well

supported. The species delimitation analysis based on the dataset of 8 nuclear genes does not detect

any genetic clusters within T. pyra, suggesting that all four T. pyra samples should be collapsed in the

species tree. Additionally, the conducted STRUCTURE analysis, based exclusively on SNPs that show

polymorphisms within the T. pyra samples, did not recover any differences in the genetic makeup of

the different T. pyra samples, indicating ongoing or recent admixture between all sampled individuals

within this species. This combined evidence of no genetic structure within T. pyra seems to

contradict the inferred history of the mitochondrial lineages. Previous studies based on

morphological data [38], [40] distinguish two separate subspecies within T. pyra, which are separated

by an east-west gradient, one of which occurs in the Peruvian highlands, while the other one occurs

along the Rio Negro. However, no north-south division between individuals of T. pyra as suggested

by the mitochondrial tree has been previously hypothesized based on morphological evidence. There

are various possible explanations for the observed discrepancy between the mitochondrial tree and

the species trees.

One explanation could be that the nuclear loci are simply not informative enough in order to

infer recent splits of lineages, while mitochondrial DNA with a generally higher mutation rate shows

genetic structure on shallower times, therefore being the more sensitive and suitable dataset for

exploring more recent genetic substructure. This, however, seems unlikely; in particular, the SNP

dataset, consisting of only polymorphic positions within Topaza, represents a dataset of maximum

informativeness, with a proportion of variable sites exceeding that of the mitochondrial genome by orders of magnitude. Particularly

those SNPs with polymorphisms within T. pyra, as extracted for the STRUCTURE analysis, would be

expected to show a pattern if population structure were present, yet none was found in this study

(Figure 12). Further, all nuclear datasets consistently recovered genetic substructure within the sister

species T. pella on comparably shallow evolutionary times, demonstrating the suitability of the

nuclear datasets for exploring such substructuring.

A biological mechanism that could cause the inferred deep split between mitochondrial lineages

within T. pyra is selection, acting on two separate mitochondrial haplotypes. Previous studies have

found evidence for such biallelic selection on mitochondrial haplotypes causing deep divergences of

mitochondrial lineages within sympatric bird populations [15], [75]. Possible divergence of two

separate mitochondrial haplotypes within a panmictically admixing population due to selection has

also been demonstrated by simulation studies [76]. Cases of direct selection on mitochondrial

sequences have mainly been related to altitudinal differences [15], where differing aerobic

conditions may act as selection factors for mitochondrial loci coding for enzymes involved in

oxidative phosphorylation. No noteworthy differences in altitude are present between the sample

locations of the T. pyra specimens used for this study. Yet, the possibility of direct or indirect

selection on the mitochondrial genome through other selection factors, maintaining two distinct

haplotypes, remains a plausible explanation.

Another possible biological explanation is that the observed pattern of strong genetic structure

within the mitochondrial tree, which is not present in the nuclear data, could be connected to the

different inheritance mechanism of mitochondria in comparison to nuclear DNA. As a solely

maternally inherited locus, the genetic divergence of two separate mitochondrial lineages could be

restricted to females. A scenario is conceivable in which the Amazon River acts as a dispersal barrier for

female individuals of T. pyra, while male individuals occasionally cross the Amazon, thereby keeping

the population genetically admixed. Gender-specific differences in the average dispersal distance

have been commonly found within avian studies in the last decades [77]–[80]. These studies more

commonly found females to be the further dispersing gender, yet there appear to be family-specific

differences as to which is the further dispersing gender [78]. No information about gender-specific

dispersal rates for hummingbirds (Trochilidae) is known to the authors. However, a possibly higher

dispersal rate of male Topaza hummingbirds could explain a pattern as the one found in this study,

where two distinctly different mitochondrial lineages are present in a population that is genetically

admixed in respect to autosomal DNA. Considering that the nuclear genes, as well as the UCE data

and the extracted SNPs, are of exclusively autosomal origin, this explanation appears to be a likely

scenario to explain the observed discrepancy between the mitochondrial and the nuclear data.

Additional sequence data of the female sex chromosome (W), which is not present in males, could be

used to further test this hypothesis, as these sequences would be expected to show the same genetic

pattern as the mitochondrion. Furthermore, ecological data concerning gender specific dispersal

rates of Topaza hummingbirds could shed more light on this discussion.
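The reasoning behind this hypothesis can be made concrete with a toy model. The Python sketch below is not an analysis from this study. It assumes two demes (north and south of the river), an infinite population without drift, selection or mutation, random mating within demes, females that never cross the river, and a fraction of breeding males in each deme that originates from the other deme in every generation. All names and parameter values are hypothetical.

```python
from typing import Tuple


def simulate_sex_biased_dispersal(
    generations: int,
    male_migration: float,
    autosomal: Tuple[float, float] = (1.0, 0.0),
    mitochondrial: Tuple[float, float] = (1.0, 0.0),
) -> Tuple[float, float]:
    """Return (autosomal, mitochondrial) frequency differences between demes.

    `autosomal` and `mitochondrial` give the starting frequency of one allele
    or haplotype in the northern and southern deme, respectively.
    """
    if not 0.0 <= male_migration <= 1.0:
        raise ValueError("male_migration must lie between 0 and 1")
    p_north, p_south = autosomal
    q_north, q_south = mitochondrial
    m = male_migration
    for _ in range(generations):
        # Allele frequency among the fathers breeding in each deme:
        # a fraction m of them crossed the river from the other deme.
        fathers_north = (1.0 - m) * p_north + m * p_south
        fathers_south = (1.0 - m) * p_south + m * p_north
        # Autosomes: half from resident mothers, half from the fathers.
        p_north, p_south = (
            0.5 * (p_north + fathers_north),
            0.5 * (p_south + fathers_south),
        )
        # mtDNA is transmitted by resident mothers only, so q_north and
        # q_south do not change under these assumptions.
    return abs(p_north - p_south), abs(q_north - q_south)
```

In this model the autosomal difference shrinks by a factor of (1 - m) per generation, while the mitochondrial difference is preserved. With m = 0.05, for example, autosomal frequencies are nearly identical after 200 generations although the two mitochondrial lineages remain fully separated.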


This study adds to a growing number of avian studies that identify cases of mitochondrial

structure found within otherwise admixing populations [15], [81]–[83]. Such discrepancies between

the mitochondrial tree and the species tree should, in general, be seen as informative rather

than problematic. They can provide evidence for important biological factors, such as selection

or gender-specific dispersal patterns, which are important drivers of evolution. At the same time, this

case also points out the problematic nature of the mitochondrion as a phylogenetic marker. As a

locus that can be strongly affected by selection [15], [84]–[87], the mitochondrion may present a

violation to the assumption of neutrality in coalescent methods. Further, a hypothetical case of

female specific lineage divergence, as discussed in this study, could lead to the mistaken definition of

cryptic species or subspecies when only looking at genetic information from the mitochondrion,

thereby overlooking the present admixture of autosomal loci.

Rivers as dispersal barriers

We are not aware of any ecological data regarding the specific dispersal ability of Topaza

hummingbirds. However, the data presented here suggest that the Amazon River in particular

acts as a dispersal barrier for Topaza hummingbirds. In the case of T. pella, the ranges of the

subspecies T. pella pella and T. pella microrhyncha are separated by the Amazon River. For a

rainforest-dwelling genus like Topaza [40], no other obvious dispersal barriers are present between the ranges

of these two subspecies. This makes it a likely conclusion that the Amazon imposes a dispersal barrier

on T. pella strong enough to lead to the separation of two genetically distinct subspecies on opposite

sides of the Amazon River. A dispersal barrier effect of the Amazon River on forest-dwelling bird

species in particular has been confirmed by other studies [4], [88]–[90], yet this effect appears to be specific

to the Amazon River and has not been confirmed consistently across bird species for other

large river systems. Even though the barrier effect of the Amazon appears to be very strong for T. pella,

considering the strict separation of T. pella pella and T. pella microrhyncha in all datasets, it appears

to be a somewhat permeable barrier as indicated by sample T. pella7. This sample, which was

collected at the southern river bank close to the estuary of the Amazon shows indications of possible

admixture with the northern subspecies T. pella pella, which is also indicated by the mitochondrial

tree (Figure 7). It is plausible that the dispersal barrier effect around the estuary area is somewhat

reduced, due to the forking of the Amazon into a wide delta region characterized by a multitude of

small islands.

No consistent substructure is promoted within T. pyra that would indicate a dispersal barrier

effect of the Amazon. This finding is consistent with the results of a previous study of a

large variety of bird taxa (n > 400), which found that the lower and wider section of the Amazon

presents a significantly stronger dispersal barrier than the upper, narrower section [90]. Yet, there is

evidence within the mitochondrial data of T. pyra suggesting divergence of two separate

mitochondrial haplotypes, separated by the Amazon River, which could be attributed to a dispersal

barrier effect of the Amazon on female birds, as discussed above.

Effect of mtDNA on species tree

When adding the genetic information of the complete mitochondrial genome to the multilocus

nuclear dataset (8 genetic markers), the position of sample T. pella7 becomes heavily influenced by

this additional partition. While the nuclear dataset places T. pella7 with considerable uncertainty in a

monophyletic group with the samples belonging to T. pella microrhyncha (Figure 11a), the mixed

dataset, including the mitochondrion, places it rather confidently (81% Bayesian posterior

probability) with the other subspecies T. pella pella (Figure 11d). Considering the previous

uncertainty, the addition of one locus (in this case the mitochondrion) has a substantial influence on

the inferred species tree regarding both the placement of sample T. pella7 and the respective node

support. Within this mixed dataset the mitochondrion is by far the most extensive and informative

locus. Such a difference in informativeness between simultaneously analyzed loci has been projected

to bias multilocus coalescent methods [18] such as *BEAST. Whether the influence of the additional

mitochondrial sequence on the species tree observed here is truly disproportionate would need to

be explored further with simulation studies. Yet, this case points out the importance of sequencing

loci with a sufficient amount of informative sites when using these sequences in a multispecies

coalescent approach. Particularly on shallow evolutionary time scales it seems therefore plausible

that mitochondrial loci within mixed multilocus datasets substantially contribute to the species tree

phylogeny, possibly leading to false certainty of inferred clades.

Evaluation of datasets

The relatively recent time of divergence events addressed within this study (< 1 Ma ago) pushes

the standard nuclear genetic markers to the boundary of their utility. This becomes obvious when

looking at the node support values in the *BEAST inferred species tree in Figure 8a. This tree in

general lacks good node support values for all inferred clades below the species level, which in the

case of the internal nodes within T. pyra is probably due to no genetic structure being present (see

other species tree inferences, Figure 8b-d, Figure 9 and Figure 10 and STRUCTURE results, Figure 12).

Yet, the lack of support for the inferred clades within T. pella appears to be due to a lack of

informativeness within these data, as these clades are well supported in other species trees (Figure

8c & Figure 9) and supported by the STRUCTURE results (Figure 12). The DISSECT results show that

this issue is eliminated when excluding the phylogenetically problematic sample T. pella7 (compare

Figure S14a with Figure 10a). After the exclusion of this sample the nuclear loci provide a reliable



dataset for the species delimitation analyses with DISSECT, as can be judged by the support values in

the DISSECT tree (Figure 10a). Nevertheless, the nuclear gene loci used in this study do not provide

sufficient information to reliably explore genetic patterns below the species level.

We therefore strongly recommend the generation of genome-wide multilocus datasets which

allow for the extraction of highly informative SNP data when working on such shallow evolutionary

time scales. Within this study, SNP data yield the most reliable species tree inference (see node support

values in Figure 8c), and additionally open up a range of further analytical methods. The UCE dataset

used in this study proves to be an excellent source of genetic information for the generation of such

SNP datasets, as UCE sequences are easily generated and universally applicable for a wide range of

organism groups. The use of complete UCE sequences for gene tree and subsequent species tree

analyses, on the other hand, proved to be of limited use within this study, since these loci

remain too conserved among the sampled individuals. This contrasts with the findings of previous studies

examining the performance of UCE loci on shallow evolutionary times [28]. We found that gene trees

inferred from UCE loci were rather uninformative, resulting in many cases in wide polytomies. This

variety of uninformative and in many cases conflicting gene tree topologies causes the species tree

inference in MP-EST, which only considers the topology of the input gene trees, to infer a species

tree with very short internodal distances (Figure 8b & Figure S13). On the other hand, the terminal

branches are outstandingly long, creating an unusual tree shape which has also been recognized by

previous studies [91], [92] and referred to as a “bonsai tree”. This appears to be an MP-EST-specific

issue, particularly with UCE data, which in previous studies has been attributed to the inaccurate

reconstruction of gene trees [92]. The consistently long terminal branch lengths occur due to MP-EST

arbitrarily assigning a branch length value around 9 coalescent units when branch length cannot be

properly estimated [91]. We find that a selection of the most informative UCE loci for species tree

inference in MP-EST reduces the bonsai-effect (compare Figure S13 with Figure 8b). Further we find

that this filtering of the data improves the topology of the inferred species tree; the species tree

inference based on the subset of the 73 most informative UCE loci (Figure 8b) is concordant with the

consistently inferred clades in other species trees (Figure 8a,c,d), whereas the topology of the tree

based on all 824 UCE loci conflicts, with good bootstrap support values, with the otherwise

consistently observed monophyly of the samples T. pella5 and T. pella6. These findings show that

“shortcut” coalescent methods (referring to methods which do not co-estimate gene trees and

species tree) like MP-EST, do not necessarily follow the dogma “the more, the better”, but can be

substantially improved by excluding uninformative loci, particularly when inferring phylogenetic

relationships on shallow evolutionary times.
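The locus filtering described above can be illustrated with a minimal Python sketch. It assumes that informativeness is approximated by the number of variable alignment columns per locus; the criterion actually used to choose the 73 loci is not restated in this section, and the function names and toy alignments below are hypothetical.

```python
from typing import Dict, List


def count_variable_sites(alignment: List[str]) -> int:
    """Count alignment columns that contain more than one unambiguous base.

    Gaps and ambiguity codes are ignored (an assumption of this sketch).
    """
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences in an alignment must have equal length")
    variable = 0
    for column in zip(*alignment):
        bases = {base.upper() for base in column} & set("ACGT")
        if len(bases) > 1:
            variable += 1
    return variable


def top_informative_loci(alignments: Dict[str, List[str]], n: int) -> List[str]:
    """Return the names of the n loci with the most variable sites.

    Ties are broken alphabetically by locus name to keep the result stable.
    """
    ranked = sorted(
        alignments,
        key=lambda name: (-count_variable_sites(alignments[name]), name),
    )
    return ranked[:n]


# Hypothetical toy alignments standing in for UCE loci.
toy = {
    "uce-1": ["ACGTACGT", "ACGTACGT", "ACGTACGT"],
    "uce-2": ["ACGTACGT", "ACGAACGT", "ACGTACCT"],
    "uce-3": ["ACGTACGT", "ACGTACGA", "ACGTACGT"],
}
selected = top_informative_loci(toy, 2)
print(selected)
```

Only the gene trees of the selected loci would then be passed on to the species tree inference; the invariant locus (uce-1 in the toy data) is dropped.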


Conclusion

The inference of genetic substructure within species-limits is located in an area of overlap

between the fields of phylogenetics and population genetics. In this field of overlap, we find that

phylogenetically popular data sources do not perform well when inferring evolutionary history, due

to a lack of informativeness. Additionally, complex admixture patterns among subspecies can limit

the utility of multispecies coalescent methods, such as BEAST and *BEAST, since samples cannot

always be assigned to clearly defined populations without gene-flow among each other. The

mitochondrial tree provides an exceptionally well resolved gene tree, which enables us to explore

the phylogenetic relationship between mitochondrial haplotypes. As shown in this study, these

phylogenetic relationships can be misleading, since the mitochondrial tree represents a single

genealogy, which is in many cases not concordant with the species tree, particularly on shallow

evolutionary times. An appropriate estimate of the species tree in this case can most successfully be

achieved with highly informative SNP data. Additionally, SNP data open up the possibility of applying

population genetic methods such as STRUCTURE, which are of great value for the exploration of the

genetic data. We find that a combination of phylogenetic and population genetic methods is very

useful to identify consistent patterns among the nuclear data which we use to infer subspecies

assignments. We conclude that especially the SNP data are a very useful dataset in order to explore

genetic substructure and phylogenetic relationships between individuals. Our finding that the lower

Amazon River constitutes a rather strict dispersal barrier for Topaza is novel for this genus, and may

inspire future studies to investigate further if limited dispersal across the Amazon River can also be

observed in closely related hummingbird genera. Further we want to highlight in this study that a

discrepancy between the mitochondrial tree and the species tree can give rise to biologically

intriguing hypotheses, such as sex-specific dispersal barriers or selection on mitochondrial genes;

it is our intent to further pursue these postulations.



Acknowledgements

I want to thank Alexandre Antonelli and Urban Olsson as my main supervisors for enabling me to

carry out such an interesting and exciting project for my Master’s thesis and for all the time

and support you were able to give me. It was a lot of fun and a great experience to work with you

over the past year. Special thanks go to Alex Antonelli for giving me the opportunity of being a full

member of his research group, for the great group retreats and of course for free breakfast every

Monday.

I thank Alexandre Fernandes for providing this wonderful dataset for my Master’s project and for

interesting and helpful information about the biology of Topaza hummingbirds.

Big thanks also to Bernard Pfeil for great help with the phylogenetic tree inference, in

particular with Bayesian methods, and for all the time that we spent together philosophizing about the

outcomes.

Further I thank Mats Töpel as my bioinformatic supervisor for the invaluable support with

software issues on the Albiorix cluster and beyond that for helpful advice with other bioinformatic

challenges around the project.

Thanks to Yann Bertrand and Filipe de Sousa for helpful input and the sharing of various scripts in

particular for Illumina read data processing. A big thank you also goes to Alexander Zizka, who was a

great help with his excellent R skills for creating the distribution maps. Further, I want to thank Daniele

Silvestro from the University of Gothenburg and Martin Ryberg from the University of Uppsala for

helpful scripts for specific data processing steps. I also want to thank Britt Anderson for

proofreading this manuscript and giving very helpful input during the writing process. I thank the

managers of the named institutions and museums (Table 1) and the collectors for having provided

the material for DNA extraction that made this study possible. Finally, I want to thank the European

and Swedish Research Councils for funding this project.


References

[1] W. M. Brown, M. George, and A. C. Wilson, “Rapid evolution of animal mitochondrial DNA,” Proc. Natl. Acad. Sci., vol. 76, no. 4, pp. 1967–1971, Apr. 1979.

[2] S. Sammler, C. Bleidorn, and R. Tiedemann, “Full mitochondrial genome sequences of two endemic Philippine hornbill species (Aves: Bucerotidae) provide evidence for pervasive mitochondrial DNA recombination.,” BMC Genomics, vol. 12, no. 1, p. 35, 2011.

[3] G. Voelker, S. Rohwer, R. C. K. Bowie, and D. C. Outlaw, “Molecular systematics of a speciose, cosmopolitan songbird genus: Defining the limits of, and relationships among, the Turdus thrushes,” Mol. Phylogenet. Evol., vol. 42, no. 2, pp. 422–434, 2007.

[4] A. M. Fernandes, M. Wink, and A. Aleixo, “Phylogeography of the chestnut-tailed antbird (Myrmeciza hemimelaena) clarifies the role of rivers in Amazonian biogeography,” J. Biogeogr., vol. 39, no. 8, pp. 1524–1535, 2012.

[5] S. G. DuBay and C. C. Witt, “An improved phylogeny of the Andean tit-tyrants (Aves, Tyrannidae): More characters trump sophisticated analyses,” Mol. Phylogenet. Evol., vol. 64, no. 2, pp. 285–296, 2012.

[6] J. W. O. Ballard and M. C. Whitlock, “The incomplete natural history of mitochondria,” Mol. Ecol., vol. 13, no. 4, pp. 729–744, 2004.

[7] A. Tatarenkov and J. C. Avise, “Rapid concerted evolution in animal mitochondrial DNA.,” Proc. Biol. Sci., vol. 274, no. 1619, pp. 1795–8, Jul. 2007.

[8] K. Ogoh and Y. Ohmiya, “Concerted evolution of duplicated control regions within an ostracod mitochondrial genome.,” Mol. Biol. Evol., vol. 24, no. 1, pp. 74–8, Jan. 2007.

[9] J. R. Eberhard, T. F. Wright, and E. Bermingham, “Duplication and concerted evolution of the mitochondrial control region in the parrot genus Amazona.,” Mol. Biol. Evol., vol. 18, no. 7, pp. 1330–42, Jul. 2001.

[10] J. H. Degnan and N. A. Rosenberg, “Gene tree discordance, phylogenetic inference and the multispecies coalescent,” Trends Ecol. Evol., vol. 24, no. 6, pp. 332–340, 2009.

[11] B. Nabholz, S. Glémin, and N. Galtier, “The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals.,” BMC Evol. Biol., vol. 9, p. 54, 2009.

[12] S. Berlin and H. Ellegren, “Evolutionary genetics. Clonal inheritance of avian mitochondrial DNA.,” Nature, vol. 413, no. 6851, pp. 37–8, Sep. 2001.

[13] S. Berlin, D. Tomaras, and B. Charlesworth, “Low mitochondrial variability in birds may indicate Hill-Robertson effects on the W chromosome.,” Heredity (Edinb)., vol. 99, no. 4, pp. 389–96, Oct. 2007.



[14] G. D. D. Hurst and F. M. Jiggins, “Problems with mitochondrial DNA as a marker in population, phylogeographic and phylogenetic studies: the effects of inherited symbionts.,” Proc. Biol. Sci., vol. 272, no. 1572, pp. 1525–1534, 2005.

[15] Z. A. Cheviron and R. T. Brumfield, “Migration-selection balance and local adaptation of mitochondrial haplotypes in Rufous-collared Sparrows (Zonotrichia capensis) along an elevational gradient,” Evolution (N. Y)., vol. 63, no. 6, pp. 1593–1605, 2009.

[16] J. W. Ballard and M. Kreitman, “Unraveling selection in the mitochondrial genome of Drosophila.,” Genetics, vol. 138, no. 3, pp. 757–72, Nov. 1994.

[17] A. Corl and H. Ellegren, “Sampling strategies for species trees: The effects on phylogenetic inference of the number of genes, number of individuals, and whether loci are mitochondrial, sex-linked, or autosomal,” Mol. Phylogenet. Evol., vol. 67, no. 2, pp. 358–366, 2013.

[18] E. L. Jockusch, I. Martinez-Solano, and E. K. Timpe, “The Effects of Inference Method, Population Sampling, and Gene Sampling on Species Tree Inferences: An Empirical Study in Slender Salamanders (Plethodontidae: Batrachoseps),” Syst. Biol., vol. 64, no. 1, pp. 66–83, 2014.

[19] F. Jacobsen and K. E. Omland, “Species tree inference in a recent radiation of orioles (Genus Icterus): Multiple markers and methods reveal cytonuclear discordance in the northern oriole group,” Mol. Phylogenet. Evol., vol. 61, no. 2, pp. 460–469, 2011.

[20] A. Camargo, L. J. Avila, M. Morando, and J. W. Sites, “Accuracy and precision of species trees: effects of locus, individual, and base pair sampling on inference of species trees in lizards of the Liolaemus darwinii group (Squamata, Liolaemidae).,” Syst. Biol., vol. 61, no. 2, pp. 272–88, Mar. 2012.

[21] J. S. Williams, J. H. Niedzwiecki, and D. W. Weisrock, “Species tree reconstruction of a poorly resolved clade of salamanders (Ambystomatidae) using multiple nuclear loci.,” Mol. Phylogenet. Evol., vol. 68, no. 3, pp. 671–82, Sep. 2013.

[22] G. Bejerano, M. Pheasant, I. Makunin, S. Stephen, W. J. Kent, J. S. Mattick, and D. Haussler, “Ultraconserved elements in the human genome.,” Science, vol. 304, no. 5675, pp. 1321–5, May 2004.

[23] L. A. Pennacchio, N. Ahituv, A. M. Moses, S. Prabhakar, M. A. Nobrega, M. Shoukry, S. Minovitsky, I. Dubchak, A. Holt, K. D. Lewis, I. Plajzer-Frick, J. Akiyama, S. De Val, V. Afzal, B. L. Black, O. Couronne, M. B. Eisen, A. Visel, and E. M. Rubin, “In vivo enhancer analysis of human conserved non-coding sequences.,” Nature, vol. 444, no. 7118, pp. 499–502, Nov. 2006.

[24] A. Siepel, G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou, K. Rosenbloom, H. Clawson, J. Spieth, L. W. Hillier, S. Richards, G. M. Weinstock, R. K. Wilson, R. A. Gibbs, W. J. Kent, W. Miller, and D. Haussler, “Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.,” Genome Res., vol. 15, no. 8, pp. 1034–50, Aug. 2005.

[25] W. Miller, K. Rosenbloom, R. C. Hardison, M. Hou, J. Taylor, B. Raney, R. Burhans, D. C. King, R. Baertsch, D. Blankenberg, S. L. Kosakovsky Pond, A. Nekrutenko, B. Giardine, R. S. Harris, S. Tyekucheva, M. Diekhans, T. H. Pringle, W. J. Murphy, A. Lesk, G. M. Weinstock, K. Lindblad-Toh, R. A. Gibbs, E. S. Lander, A. Siepel, D. Haussler, and W. J. Kent, “28-way vertebrate


alignment and conservation track in the UCSC Genome Browser.,” Genome Res., vol. 17, no. 12, pp. 1797–808, Dec. 2007.

[26] B. C. Faircloth, J. E. McCormack, N. G. Crawford, M. G. Harvey, R. T. Brumfield, and T. C. Glenn, “Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales.,” Syst. Biol., vol. 61, no. 5, pp. 717–26, Oct. 2012.

[27] B. T. Smith, J. E. McCormack, A. M. Cuervo, M. J. Hickerson, A. Aleixo, C. D. Cadena, J. Pérez-Emán, C. W. Burney, X. Xie, M. G. Harvey, B. C. Faircloth, T. C. Glenn, E. P. Derryberry, J. Prejean, S. Fields, and R. T. Brumfield, “The drivers of tropical speciation,” Nature, vol. 515, no. 7527, pp. 406–409, Sep. 2014.

[28] B. T. Smith, M. G. Harvey, B. C. Faircloth, T. C. Glenn, and R. T. Brumfield, “Target capture and massively parallel sequencing of ultraconserved elements for comparative studies at shallow evolutionary time scales,” Syst. Biol., vol. 63, no. 1, pp. 83–95, 2014.

[29] E. D. Jarvis, S. Mirarab, A. J. Aberer, B. Li, P. Houde, C. Li, S. Y. W. Ho, B. C. Faircloth, B. Nabholz, J. T. Howard, A. Suh, C. C. Weber, R. R. da Fonseca, J. Li, F. Zhang, H. Li, L. Zhou, N. Narula, L. Liu, G. Ganapathy, B. Boussau, M. S. Bayzid, V. Zavidovych, S. Subramanian, T. Gabaldon, S. Capella-Gutierrez, J. Huerta-Cepas, B. Rekepalli, K. Munch, M. Schierup, B. Lindow, W. C. Warren, D. Ray, R. E. Green, M. W. Bruford, X. Zhan, A. Dixon, S. Li, N. Li, Y. Huang, E. P. Derryberry, M. F. Bertelsen, F. H. Sheldon, R. T. Brumfield, C. V. Mello, P. V. Lovell, M. Wirthlin, M. P. C. Schneider, F. Prosdocimi, J. A. Samaniego, A. M. V. Velazquez, A. Alfaro-Nunez, P. F. Campos, B. Petersen, T. Sicheritz-Ponten, A. Pas, T. Bailey, P. Scofield, M. Bunce, D. M. Lambert, Q. Zhou, P. Perelman, A. C. Driskell, B. Shapiro, Z. Xiong, Y. Zeng, S. Liu, Z. Li, B. Liu, K. Wu, J. Xiao, X. Yinqi, Q. Zheng, Y. Zhang, H. Yang, J. Wang, L. Smeds, F. E. Rheindt, M. Braun, J. Fjeldsa, L. Orlando, F. K. Barker, K. A. Jonsson, W. Johnson, K.-P. Koepfli, S. O’Brien, D. Haussler, O. A. Ryder, C. Rahbek, E. Willerslev, G. R. Graves, T. C. Glenn, J. McCormack, D. Burt, H. Ellegren, P. Alstrom, S. V. Edwards, A. Stamatakis, D. P. Mindell, J. Cracraft, E. L. Braun, T. Warnow, W. Jun, M. T. P. Gilbert, and G. Zhang, “Whole-genome analyses resolve early branches in the tree of life of modern birds,” Science, vol. 346, no. 6215, pp. 1320–1331, 2014.

[30] D. Bryant, R. Bouckaert, J. Felsenstein, N. A. Rosenberg, and A. RoyChoudhury, “Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis.,” Mol. Biol. Evol., vol. 29, no. 8, pp. 1917–32, Aug. 2012.

[31] J. K. Pritchard, M. Stephens, and P. Donnelly, “Inference of population structure using multilocus genotype data.,” Genetics, vol. 155, no. 2, pp. 945–59, Jun. 2000.

[32] E. Y. Durand, N. Patterson, D. Reich, and M. Slatkin, “Testing for ancient admixture between closely related populations,” Mol. Biol. Evol., vol. 28, no. 8, pp. 2239–2252, 2011.

[33] F. E. Rheindt, M. K. Fujita, P. R. Wilton, and S. V. Edwards, “Introgression and phenotypic assimilation in Zimmerius flycatchers (Tyrannidae): Population genetic and phylogenetic inferences from genome-wide SNPs,” Syst. Biol., vol. 63, no. 2, pp. 134–152, 2014.

[34] P. H. Brito and S. V Edwards, “Multilocus phylogeography and phylogenetics using sequence-based markers.,” Genetica, vol. 135, no. 3, pp. 439–55, Apr. 2009.

[35] J. del Hoyo, N. Collar, G. M. Kirwan, and P. Boesman, “Fiery Topaz (Topaza pyra),” Handbook of the Birds of the World Alive. Lynx Edicions, Barcelona, 2015.



[36] K. L. Schuchmann, G. M. Kirwan, and P. Boesman, “Crimson Topaz (Topaza pella),” Handbook of the Birds of the World Alive. Lynx Edicions, Barcelona, 2015.

[37] K. L. Schuchmann, “Family Trochilidae (hummingbirds),” in Handbook of the birds of the world, Volume 5., J. del Hoyo, A. Elliott, and J. Sargatal, Eds. Barcelona, Spain: Lynx Edicions, 1999, pp. 468–680.

[38] A. Schmitz-Ornés and K. L. Schuchmann, “Taxonomic review and phylogeny of the hummingbird genus Topaza Gray, 1840 using plumage color spectral information,” Ornitol. Neotrop., vol. 22, pp. 25–38, 2011.

[39] J. L. Peters, Check-list of birds of the world, Volume 5. Cambridge, Massachusetts: Harvard Univ. Press, 1945.

[40] D.-S. Hu, L. Joseph, and D. J. Agro, “Distribution, variation, and taxonomy of Topaza Hummingbirds (Aves: Trochilidae),” Ornitol. Neotrop., vol. 11, no. 1982, pp. 123–142, 2000.

[41] S. J. Hackett, R. T. Kimball, S. Reddy, R. C. K. Bowie, E. L. Braun, M. J. Braun, J. L. Chojnowski, W. A. Cox, K.-L. Han, J. Harshman, C. J. Huddleston, B. D. Marks, K. J. Miglia, W. S. Moore, F. H. Sheldon, D. W. Steadman, C. C. Witt, and T. Yuri, “A phylogenomic study of birds reveals their evolutionary history.,” Science, vol. 320, no. 5884, pp. 1763–1768, 2008.

[42] C. H. Graham, J. L. Parra, C. Rahbek, and J. A. McGuire, “Phylogenetic structure in tropical hummingbird communities.,” Proc. Natl. Acad. Sci. U. S. A., vol. 106, Suppl., pp. 19673–19678, 2009.

[43] A. L. Chubb, “Nuclear corroboration of DNA-DNA hybridization in deep phylogenies of hummingbirds, swifts, and passerines: the phylogenetic utility of ZENK (ii).,” Mol. Phylogenet. Evol., vol. 30, no. 1, pp. 128–39, Jan. 2004.

[44] E. Quintero, C. C. Ribas, and J. Cracraft, “The Andean Hapalopsittaca parrots (Psittacidae, Aves): an example of montane-tropical lowland vicariance,” Zool. Scr., vol. 42, no. 1, pp. 28–43, Jan. 2013.

[45] A. J. Drummond, M. A. Suchard, D. Xie, and A. Rambaut, “Bayesian phylogenetics with BEAUti and the BEAST 1.7.,” Mol. Biol. Evol., vol. 29, no. 8, pp. 1969–73, Aug. 2012.

[46] J. Heled and A. J. Drummond, “Bayesian inference of species trees from multilocus data.,” Mol. Biol. Evol., vol. 27, no. 3, pp. 570–80, Mar. 2010.

[47] L. Liu, L. Yu, and S. V Edwards, “A maximum pseudo-likelihood approach for estimating species trees under the coalescent model.,” BMC Evol. Biol., vol. 10, no. 1, p. 302, 2010.

[48] G. Jones, Z. Aydin, and B. Oxelman, “DISSECT: an assignment-free Bayesian discovery method for species delimitation under the multispecies coalescent.,” Bioinformatics, vol. 31, no. 7, pp. 991–998, Nov. 2014.

[49] M. Töpel, M. F. Calio, A. Zizka, R. Scharn, D. Silvestro, and A. Antonelli, “SpeciesGeoCoder: Fast categorisation of species occurrences for analyses of biodiversity, biogeography, ecology and evolution,” Cold Spring Harbor Labs Journals, Sep. 2014.


[50] B. L. Sullivan, C. L. Wood, M. J. Iliff, R. E. Bonney, D. Fink, and S. Kelling, “eBird: An online database of bird distribution and abundance [web application],” Biological Conservation 142, 2009. [Online]. Available: http://www.ebird.org. [Accessed: 11-May-2015].

[51] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools.,” Bioinformatics, vol. 25, no. 16, pp. 2078–9, Aug. 2009.

[52] I. Milne, G. Stephen, M. Bayer, P. J. A. Cock, L. Pritchard, L. Cardle, P. D. Shaw, and D. Marshall, “Using Tablet for visual exploration of second-generation sequencing data.,” Brief. Bioinform., vol. 14, no. 2, pp. 193–202, Mar. 2013.

[53] M. Kearse, R. Moir, A. Wilson, S. Stones-Havas, M. Cheung, S. Sturrock, S. Buxton, A. Cooper, S. Markowitz, C. Duran, T. Thierer, B. Ashton, P. Meintjes, and A. Drummond, “Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data.,” Bioinformatics, vol. 28, no. 12, pp. 1647–9, Jun. 2012.

[54] C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden, “BLAST+: architecture and applications.,” BMC Bioinformatics, vol. 10, no. 1, p. 421, Jan. 2009.

[55] V. H. Nguyen and D. Lavenier, “PLAST: parallel local alignment search tool for database comparison.,” BMC Bioinformatics, vol. 10, no. 1, p. 329, Jan. 2009.

[56] S. K. Wyman, R. K. Jansen, and J. L. Boore, “Automatic annotation of organellar genomes with DOGMA.,” Bioinformatics, vol. 20, no. 17, pp. 3252–5, Nov. 2004.

[57] G. C. Conant and K. H. Wolfe, “GenomeVx: simple web-based creation of editable circular chromosome maps.,” Bioinformatics, vol. 24, no. 6, pp. 861–2, Mar. 2008.

[58] T. A. Hall, “BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT.,” in Nucleic Acids Symposium, vol. 41, Oxford University Press, 1999, pp. 95–98.

[59] M. A. Pacheco, F. U. Battistuzzi, M. Lentino, R. F. Aguilar, S. Kumar, and A. A. Escalante, “Evolution of modern birds revealed by mitogenomics: Timing the radiation and origin of major orders,” Mol. Biol. Evol., vol. 28, no. 6, pp. 1927–1942, 2011.

[60] H. D. Marshall, A. J. Baker, and A. R. Grant, “Complete mitochondrial genomes from four subspecies of common chaffinch (Fringilla coelebs): New inferences about mitochondrial rate heterogeneity, neutral theory, and phylogenetic relationships within the order Passeriformes,” Gene, vol. 517, no. 1, pp. 37–45, 2013.

[61] D. Posada and T. R. Buckley, “Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests.,” Syst. Biol., vol. 53, no. 5, pp. 793–808, Oct. 2004.

[62] D. P. Martin, P. Lemey, M. Lott, V. Moulton, D. Posada, and P. Lefeuvre, “RDP3: a flexible and fast computer program for analyzing recombination.,” Bioinformatics, vol. 26, no. 19, pp. 2462–3, Oct. 2010.

[63] D. Martin and E. Rybicki, “RDP: detection of recombination amongst aligned sequences,” Bioinformatics, vol. 16, no. 6, pp. 562–563, Jun. 2000.



[64] J. M. Smith, “Analyzing the mosaic structure of genes.,” J. Mol. Evol., vol. 34, no. 2, pp. 126–9, Mar. 1992.

[65] D. Posada and K. A. Crandall, “Evaluation of methods for detecting recombination from DNA sequences: computer simulations.,” Proc. Natl. Acad. Sci. U. S. A., vol. 98, no. 24, pp. 13757–62, Nov. 2001.

[66] H. R. L. Lerner, M. Meyer, H. F. James, M. Hofreiter, and R. C. Fleischer, “Multilocus resolution of phylogeny and timescale in the extant adaptive radiation of Hawaiian honeycreepers,” Curr. Biol., vol. 21, no. 21, pp. 1838–1844, 2011.

[67] J. A. McGuire, C. C. Witt, J. V. Remsen, A. Corl, D. L. Rabosky, D. L. Altshuler, and R. Dudley, “Molecular phylogenetics and the diversification of hummingbirds,” Curr. Biol., vol. 24, no. 8, pp. 910–916, 2014.

[68] T. Gernhard, “The conditioned reconstructed process.,” J. Theor. Biol., vol. 253, no. 4, pp. 769–78, Aug. 2008.

[69] A. Rambaut, M. A. Suchard, W. Xie, and A. Drummond, “Tracer v1.6,” 2013.

[70] S. Guindon, J.-F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel, “New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0.,” Syst. Biol., vol. 59, no. 3, pp. 307–21, May 2010.

[71] T.-K. Seo, “Calculating bootstrap probabilities of phylogeny using multilocus sequence data.,” Mol. Biol. Evol., vol. 25, no. 5, pp. 960–71, May 2008.

[72] T. I. Shaw, Z. Ruan, T. C. Glenn, and L. Liu, “STRAW: Species TRee Analysis Web server.,” Nucleic Acids Res., vol. 41, no. Web Server issue, pp. W238–41, Jul. 2013.

[73] R. Bouckaert, J. Heled, D. Kühnert, T. Vaughan, C.-H. Wu, D. Xie, M. A. Suchard, A. Rambaut, and A. J. Drummond, “BEAST 2: a software platform for Bayesian evolutionary analysis.,” PLoS Comput. Biol., vol. 10, no. 4, p. e1003537, Apr. 2014.

[74] R. R. Bouckaert, “DensiTree: making sense of sets of phylogenetic trees.,” Bioinformatics, vol. 26, no. 10, pp. 1372–3, May 2010.

[75] D. P. L. Toews and A. Brelsford, “The biogeography of mitochondrial and nuclear discordance in animals,” Mol. Ecol., vol. 21, no. 16, pp. 3907–3930, 2012.

[76] D. E. Irwin, “Local adaptation along smooth ecological gradients causes phylogeographic breaks and phenotypic clustering.,” Am. Nat., vol. 180, no. 1, pp. 35–49, Jul. 2012.

[77] P. J. Greenwood and P. H. Harvey, “The Natal and Breeding Dispersal of Birds,” Annu. Rev. Ecol. Syst., vol. 13, pp. 1–21, 1982.

[78] P. J. Greenwood, “Mating systems, philopatry and dispersal in birds and mammals,” Anim. Behav., vol. 28, no. 4, pp. 1140–1162, Nov. 1980.

[79] B. Czyz, M. Borowiec, A. Wasiñska, R. Pawliszko, and K. Mazur, “Breeding-season dispersal of male and female Penduline Tits (Remiz pendulinus) in south-western Poland,” Ornis Fenn., Jan. 2012.


[80] M. Szulkin and B. C. Sheldon, “Dispersal as a means of inbreeding avoidance in a wild bird population.,” Proc. Biol. Sci., vol. 275, no. 1635, pp. 703–11, Mar. 2008.

[81] D. E. Irwin, S. Bensch, J. H. Irwin, and T. D. Price, “Speciation by distance in a ring species.,” Science, vol. 307, no. 5708, pp. 414–6, Jan. 2005.

[82] Å. M. Ribeiro, P. Lloyd, and R. C. K. Bowie, “A tight balance between natural selection and gene flow in a southern African arid-zone endemic bird.,” Evolution, vol. 65, no. 12, pp. 3499–514, Dec. 2011.

[83] C. N. Spottiswoode, K. F. Stryjewski, S. Quader, J. F. R. Colebrook-Robjent, and M. D. Sorenson, “Ancient host specificity within a single species of brood parasitic bird.,” Proc. Natl. Acad. Sci. U. S. A., vol. 108, no. 43, pp. 17738–42, Oct. 2011.

[84] M. Ehinger, P. Fontanillas, E. Petit, and N. Perrin, “Mitochondrial DNA variation along an altitudinal gradient in the greater white-toothed shrew, Crocidura russula.,” Mol. Ecol., vol. 11, no. 5, pp. 939–45, May 2002.

[85] D. Mishmar, E. Ruiz-Pesini, P. Golik, V. Macaulay, A. G. Clark, S. Hosseini, M. Brandon, K. Easley, E. Chen, M. D. Brown, R. I. Sukernik, A. Olckers, and D. C. Wallace, “Natural selection shaped regional mtDNA variation in humans.,” Proc. Natl. Acad. Sci. U. S. A., vol. 100, no. 1, pp. 171–6, Jan. 2003.

[86] E. Ruiz-Pesini, D. Mishmar, M. Brandon, V. Procaccio, and D. C. Wallace, “Effects of purifying and adaptive selection on regional variation in human mtDNA.,” Science, vol. 303, no. 5655, pp. 223–6, Jan. 2004.

[87] P. Fontanillas, A. Dépraz, M. S. Giorgi, and N. Perrin, “Nonshivering thermogenesis capacity associated to mitochondrial DNA haplotypes and gender in the greater white-toothed shrew, Crocidura russula.,” Mol. Ecol., vol. 14, no. 2, pp. 661–70, Feb. 2005.

[88] A. M. Fernandes, M. Wink, C. H. Sardelli, and A. Aleixo, “Multiple speciation across the Andes and throughout Amazonia: The case of the spot-backed antbird species complex (Hylophylax naevius/Hylophylax naevioides),” J. Biogeogr., vol. 41, no. 6, pp. 1094–1104, 2014.

[89] C. C. Ribas, A. Aleixo, A. C. R. Nogueira, C. Y. Miyaki, and J. Cracraft, “A palaeobiogeographic model for biotic diversification within Amazonia over the past three million years,” Proc. R. Soc. B Biol. Sci., vol. 279, no. 1729, pp. 681–689, 2012.

[90] F. E. Hayes and J. A. N. Sewlal, “The Amazon River as a dispersal barrier to passerine birds: Effects of river width, habitat and taxonomy,” J. Biogeogr., vol. 31, no. 11, pp. 1809–1818, 2004.

[91] J. Gatesy and M. S. Springer, “Phylogenetic Analysis at Deep Timescales: Unreliable Gene Trees, Bypassed Hidden Support, and the Coalescence/Concatalescence Conundrum.,” Mol. Phylogenet. Evol., vol. 80, pp. 231–266, 2014.

[92] M. S. Springer and J. Gatesy, “Land plant origins and coalescence confusion.,” Trends Plant Sci., vol. 19, no. 5, pp. 267–9, May 2014.



Supplemental Material

Figure S13: Species tree inferred with MP-EST based on 824 gene trees of UCEs. We created 1000 bootstrap

replicates of the 824 gene tree dataset and inferred each set separately in MP-EST. The 1000 resulting trees were

collapsed to the maximum clade credibility tree with TreeAnnotator. The tree is scaled in coalescent units. The node

support values represent bootstrap support.
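The bootstrap step named in the caption of Figure S13 can be sketched in Python as follows. This is a simplified illustration that only resamples whole gene trees with replacement; the procedure actually used in the study may additionally resample sites within loci, and the function name and placeholder Newick strings are hypothetical.

```python
import random
from typing import List


def bootstrap_gene_trees(
    gene_trees: List[str], n_replicates: int, seed: int = 0
) -> List[List[str]]:
    """Draw bootstrap replicates by resampling gene trees with replacement.

    Each replicate has the same number of trees as the input set, so a
    dataset of 824 gene trees yields replicates of 824 trees each.
    """
    if not gene_trees:
        raise ValueError("at least one gene tree is required")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(gene_trees)
    return [
        [gene_trees[rng.randrange(n)] for _ in range(n)]
        for _ in range(n_replicates)
    ]


# Placeholder Newick strings standing in for the UCE gene trees.
trees = ["((A,B),C);", "((A,C),B);", "((B,C),A);"] * 4
replicates = bootstrap_gene_trees(trees, n_replicates=5)
print(len(replicates), len(replicates[0]))
```

Each replicate would then be analysed separately in the species tree program, and the resulting trees summarised into one tree with support values.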



Figure S14: Similarity matrices showing results of SpeciesDelimitationAnalyser processing of the species tree

distribution inferred with DISSECT (burn-in 10%). To the left of each matrix is the maximum clade credibility tree of the

posterior tree distribution (burn-in 1,000 trees). Node support values represent Bayesian posterior probabilities. a) Based

on 8 nuclear genes; b) Based on 8 nuclear genes and the mitochondrial genome. Particularly the nuclear dataset (a)

suggests sample T. pella7 being an admixed individual between the two otherwise distinct populations (5+6 and 8+9).




Table S3: Overview of read coverage of individual samples and the number of mitochondrial reads (total and relative to all reads).

ID   No. of reads (total)   Mitochondrial reads (total)   Mitochondrial reads (relative)   Length of mitochondrial genome (bp)
1    545,396                63,816                        0.117                            16,783
2    5,641,580              154,537                       0.027                            16,762
3    877,918                38,572                        0.044                            16,835
4    1,777,653              16,146                        0.009                            16,862
5    738,337                72,207                        0.098                            16,849
6    2,929,439              164,804                       0.056                            16,828
7    920,515                59,947                        0.065                            16,834
8    1,423,376              125,537                       0.088                            16,844
9    678,092                5,979                         0.009                            16,824
10   586,239                9,241                         0.016                            16,842
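The relative column of Table S3 is the number of mitochondrial reads divided by the total number of reads, rounded to three decimals. The following short sketch recomputes it from the two count columns; the names `READ_COUNTS` and `mito_fraction` are introduced here for illustration only.

```python
# (total reads, mitochondrial reads) per sample ID, copied from Table S3.
READ_COUNTS = {
    1: (545_396, 63_816),
    2: (5_641_580, 154_537),
    3: (877_918, 38_572),
    4: (1_777_653, 16_146),
    5: (738_337, 72_207),
    6: (2_929_439, 164_804),
    7: (920_515, 59_947),
    8: (1_423_376, 125_537),
    9: (678_092, 5_979),
    10: (586_239, 9_241),
}


def mito_fraction(total_reads, mito_reads):
    """Fraction of all reads that are mitochondrial, to three decimals."""
    return round(mito_reads / total_reads, 3)


fractions = {
    sample: mito_fraction(total, mito)
    for sample, (total, mito) in READ_COUNTS.items()
}
for sample, fraction in fractions.items():
    print(sample, fraction)
```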

Table S4: Assembled gene loci, the length of the alignment of each locus across all samples, and the number of variable sites per locus found across the alignment of the 9 Topaza samples, including substitutions, insertions and deletions (each insertion or deletion counts as 1, independently of its length). Loci excluded from phylogenetic analyses because they had too few informative sites (<1%) are marked with an asterisk (*).

Locus    Alignment length (bp)   Variable sites (total)   Variable sites (relative to length)
Bfib     1,079                   28                       0.026
EEF2     1,634                   31                       0.019
EGR1*    609                     0                        0.000
FGB      660                     28                       0.042
MB       724                     14                       0.019
ODC      618                     16                       0.026
RAG1     2,638                   39                       0.015
TGFB2    575                     13                       0.023
ZENK2*   1,187                   2                        0.002
ZENK3    504                     6                        0.012
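The exclusion rule in the caption of Table S4 can be illustrated with a few lines of code. This sketch assumes that the <1% threshold is applied to the relative variable-site column shown in the table (the caption speaks of informative sites); the names `LOCI`, `relative_variability` and `excluded_loci` are hypothetical.

```python
# (alignment length, variable sites) per locus, copied from Table S4.
LOCI = {
    "Bfib": (1079, 28),
    "EEF2": (1634, 31),
    "EGR1": (609, 0),
    "FGB": (660, 28),
    "MB": (724, 14),
    "ODC": (618, 16),
    "RAG1": (2638, 39),
    "TGFB2": (575, 13),
    "ZENK2": (1187, 2),
    "ZENK3": (504, 6),
}


def relative_variability(length, variable_sites):
    """Variable sites as a fraction of alignment length."""
    return variable_sites / length


def excluded_loci(loci, threshold=0.01):
    """Loci whose relative variability falls below the threshold."""
    return [
        name
        for name, (length, variable) in loci.items()
        if relative_variability(length, variable) < threshold
    ]


print(excluded_loci(LOCI))
```

Under this assumption, eight loci remain, which agrees with the 8 nuclear genes used for Figure S14.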



Figure S15: Example of initial problems with MCMC convergence in BEAST. When priors were too unrestrictive and parameters were allowed to fluctuate within wide ranges, the MCMC stopped sampling certain parameters after several million generations, e.g. the Cytb ucld.mean.rate shown in the upper graph. When sampling of these parameters stopped, the Bayesian posterior likelihood estimate jumped by several hundred likelihood units (see lower graph).



Table S5: Mitochondrial genome annotation for all samples. The first column gives the name of the locus; the second column gives the orientation of the reading frame of that locus (+ for the forward strand, - for the reverse strand). The numbers in the start/end columns mark the position of each locus on the mitochondrial genome (in bp).

locus     strand  Florisuga     T. pyra1      T. pyra2      T. pyra3      T. pyra4
                  start  end    start  end    start  end    start  end    start  end
trnF-gaa  +       1      72     1      69     1      69     1      69     1      69
rrnS      +       73     1031   70     1030   70     1030   70     1030   70     1030
trnV-uac  +       1041   1114   1039   1111   1039   1111   1040   1112   1040   1112
rrnL      +       1149   2693   1173   2663   1173   2663   1174   2664   1174   2664
trnL-uaa  +       2705   2778   2705   2778   2705   2778   2706   2779   2706   2779
nad1      +       2790   3764   2807   3763   2807   3763   2808   3764   2808   3764
trnI-gau  +       3766   3839   3765   3837   3765   3837   3766   3838   3766   3838
trnQ-uug  -       3851   3921   3849   3919   3849   3919   3850   3920   3850   3920
trnM-cau  +       3921   3990   3919   3989   3919   3989   3920   3990   3920   3990
nad2      +       3991   5028   3990   5027   3990   5027   3991   5028   3991   5028
trnW-uca  +       5030   5099   5029   5098   5029   5098   5030   5099   5030   5099
trnA-ugc  -       5101   5169   5100   5168   5100   5168   5101   5169   5101   5169
trnN-guu  -       5173   5245   5172   5244   5172   5244   5173   5245   5173   5245
trnC-gca  -       5249   5315   5248   5314   5248   5314   5249   5315   5249   5315
trnY-gua  -       5315   5386   5314   5385   5314   5385   5315   5386   5315   5386
cox1      +       5388   6935   5387   6934   5387   6934   5388   6935   5388   6935
trnP-ugg  -       6930   7003   6929   7002   6929   7002   6930   7003   6930   7003
trnD-guc  +       7006   7074   7005   7073   7005   7073   7006   7074   7006   7074
cox2      +       7076   7756   7075   7755   7075   7755   7076   7756   7076   7756
trnA-agc  +       7760   7828   7760   7829   7760   7829   7761   7830   7761   7830
atp8      +       7830   7994   7831   7995   7831   7995   7832   7996   7832   7996
atp6      +       7988   8668   7989   8669   7989   8669   7990   8670   7990   8670
cox3      +       8671   9453   8672   9454   8672   9454   8673   9455   8673   9455
trnG-ucc  +       9455   9523   9456   9524   9456   9524   9457   9525   9457   9525
nad3      +       9524   9697   9525   9698   9525   9698   9526   9699   9526   9699
nad3      +       9699   9872   9700   9873   9700   9873   9695   9874   9701   9874
trnR-ucg  +       9877   9946   9879   9948   9879   9948   9880   9949   9880   9949
nad4l     +       9948   10241  9950   10243  9950   10243  9951   10244  9951   10244
nad4      +       10238  11605  10240  11607  10240  11607  10241  11608  10241  11608
trnH-gug  +       11617  11685  11619  11687  11619  11687  11620  11688  11620  11688
trnS-gcu  +       11686  11751  11688  11754  11688  11754  11689  11755  11689  11755
trnL-uag  +       11752  11823  11756  11827  11756  11827  11757  11828  11757  11828
nad5      +       11826  13625  11834  13633  11834  13633  11835  13634  11835  13634
cob       +       13650  14789  13659  14798  13659  14798  13660  14799  13660  14799
trnT-ugu  +       14796  14864  14805  14872  14805  14872  14806  14873  14806  14873
trnP-ugg  -       14867  14936  14877  14946  14877  14946  14878  14947  14878  14947
nad6      -       14951  15469  14968  15486  14968  15486  14969  15487  14969  15487
trnE-uuc  -       15470  15540  15487  15557  15487  15557  15488  15558  15488  15558
misc.     +       15541  16842  15558  16783  15558  16762  15559  16835  15559  16862



Table S5 (continued):

locus     strand  T. pella5     T. pella6     T. pella7     T. pella8     T. pella9
                  start  end    start  end    start  end    start  end    start  end
trnF-gaa  +       1      69     1      69     1      69     1      69     1      69
rrnS      +       70     1034   70     1030   70     1033   70     1029   70     1031
trnV-uac  +       1040   1112   1039   1111   1039   1111   1037   1109   1037   1109
rrnL      +       1174   2664   1173   2663   1173   2663   1171   2660   1171   2660
trnL-uaa  +       2707   2780   2706   2779   2706   2779   2704   2777   2704   2777
nad1      +       2809   3765   2829   3764   2808   3764   2806   3762   2806   3762
trnI-gau  +       3767   3839   3766   3838   3766   3838   3764   3836   3764   3836
trnQ-uug  -       3851   3921   3850   3920   3850   3920   3848   3918   3848   3918
trnM-cau  +       3921   3991   3920   3990   3920   3990   3918   3988   3918   3988
nad2      +       3992   5029   3991   5028   3991   5028   3989   5026   3989   5026
trnW-uca  +       5031   5100   5030   5099   5030   5099   5028   5097   5028   5097
trnA-ugc  -       5102   5170   5101   5169   5101   5169   5099   5167   5099   5167
trnR-ucg  +       5156   5228   5155   5227   5155   5227   5153   5225   5153   5225
trnN-guu  -       5174   5246   5173   5245   5173   5245   5171   5243   5171   5243
trnC-gca  -       5250   5316   5249   5315   5249   5315   5247   5313   5247   5313
trnY-gua  -       5316   5387   5315   5386   5315   5386   5313   5384   5313   5384
cox1      +       5389   6936   5388   6935   5388   6935   5386   6933   5386   6933
trnS-uga  -       6931   7004   6930   7003   6930   7003   6928   7001   6928   7001
trnL-aag  -       7005   7075   7004   7074   7004   7074   7002   7072   7002   7072
trnD-guc  +       7007   7075   7006   7074   7006   7074   7004   7072   7004   7072
cox2      +       7077   7757   7076   7756   7076   7756   7074   7754   7074   7754
trnA-agc  +       7762   7831   7761   7830   7761   7830   7759   7828   7759   7828
atp8      +       7833   7997   7832   7996   7832   7996   7830   7994   7830   7994
atp6      +       7991   8671   7990   8670   7990   8670   7988   8668   7988   8668
cox3      +       8674   9456   8673   9455   8673   9455   8671   9453   8671   9453
trnG-ucc  +       9458   9526   9457   9525   9457   9525   9455   9523   9455   9523
nad3      +       9527   9700   9526   9699   9526   9699   9524   9697   9524   9697
nad3      +       9702   9875   9701   9874   9701   9874   9699   9872   9699   9872
trnR-ucg  +       9881   9950   9880   9949   9880   9949   9878   9947   9878   9947
nad4l     +       9952   10245  9951   10244  9951   10244  9949   10242  9949   10242
nad4      +       10242  11609  10241  11608  10241  11608  10239  11606  10239  11606
trnH-gug  +       11621  11689  11620  11688  11620  11688  11618  11686  11618  11686
trnS-gcu  +       11690  11756  11689  11755  11689  11755  11687  11753  11687  11753
trnL-uag  +       11758  11829  11757  11828  11757  11828  11755  11826  11755  11826
nad5      +       11838  13637  11837  13636  11837  13636  11834  13633  11834  13633
cob       +       13663  14802  13662  14801  13662  14801  13659  14798  13659  14798
trnT-ugu  +       14809  14876  14808  14875  14808  14875  14805  14873  14805  14873
trnP-ugg  -       14881  14950  14880  14949  14880  14949  14878  14947  14878  14947
nad6      -       14971  15489  14970  15488  14969  15487  14968  15486  14968  15486
trnE-uuc  -       15490  15560  15489  15559  15488  15558  15487  15557  15487  15557
misc.     +       15561  16849  15560  16828  15559  16834  15558  16844  15558  16824
